I am trying to convert URLs in a piece of text into hyperlinks using regular expressions. I have managed to do this, but it breaks when the text already contains links.
For example,
bla bla blah www.google.com bla blah <a href="www.google.com">www.google.com</a>
should result in
bla bla blah <a href="http://www.google.com">www.google.com</a> bla blah <a href="www.google.com">www.google.com</a>
not
bla bla blah <a href="http://www.google.com">www.google.com</a> bla blah <a href="<a href="http://www.google.com">www.google.com</a></a>"><a href="http://www.google.com">www.google.com</a></a>
4 Solutions
#1
Finally finished it:
function add_url_links($data)
{
    // Hide existing anchors behind {{base64}} tokens so the URL pass cannot touch them.
    $data = preg_replace_callback('/(<a href=.+?<\/a>)/','guard_url',$data);
    // Link bare URLs. \S+ runs to the next whitespace or the end of the input,
    // so one pass covers URLs at the start, middle and end of the text.
    $data = preg_replace_callback('/(http:\/\/\S+)(\s|$)/','link_url',$data);
    // Restore the guarded anchors. Base64 output can contain "/", so allow it here.
    $data = preg_replace_callback('/{{([a-zA-Z0-9+\/=]+?)}}/','unguard_url',$data);
    return $data;
}
function guard_url($arr) { return '{{'.base64_encode($arr[1]).'}}'; }
function unguard_url($arr) { return base64_decode($arr[1]); }
function link_url($arr) { return guard_url(array('','<a href="'.$arr[1].'">'.$arr[1].'</a>')).$arr[2]; }
#2
This is almost impossible to do with a single regular expression. I would instead recommend a state-machine-based approach. Something like this (in pseudocode):
state = OUTSIDE_LINK
for pos (0 .. length input)
    switch state
        case OUTSIDE_LINK
            if substring at pos matches /<a\b/
                state = INSIDE_LINK
            else if substring at pos matches /(www\.\S+|\S+\.com|\S+\.org)/
                substitute link
        case INSIDE_LINK
            if substring at pos matches /<\/a>/
                state = OUTSIDE_LINK
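The pseudocode above could be sketched in PHP roughly as follows. This is only an illustration under some assumptions: the function name link_outside_anchors is made up, a "URL" is taken to be anything starting with http://, https:// or www. (narrower than the .com/.org patterns in the pseudocode), and, like the pseudocode, it only tracks <a> elements, so a URL inside another tag's attribute (an <img src>, say) would still be wrapped.

```php
<?php
// Hypothetical helper, not part of the original answer.
function link_outside_anchors(string $input): string
{
    $output = '';
    $insideLink = false;   // false = OUTSIDE_LINK, true = INSIDE_LINK
    $pos = 0;
    $length = strlen($input);

    while ($pos < $length) {
        // \G anchors each pattern at the current offset ($pos).
        if ($insideLink && preg_match('/\G<\/a>/i', $input, $m, 0, $pos)) {
            // INSIDE_LINK -> OUTSIDE_LINK at the closing tag.
            $insideLink = false;
            $chunk = $m[0];
        } elseif (!$insideLink && preg_match('/\G<a\b/i', $input, $m, 0, $pos)) {
            // OUTSIDE_LINK -> INSIDE_LINK at an opening anchor tag.
            $insideLink = true;
            $chunk = $m[0];
        } elseif (!$insideLink
            && preg_match('/\G\b(?:https?:\/\/|www\.)[^\s<]+/i', $input, $m, 0, $pos)) {
            // Bare URL outside any anchor: wrap it, adding a scheme if it has none.
            $href = preg_match('/^https?:\/\//i', $m[0]) ? $m[0] : 'http://' . $m[0];
            $output .= '<a href="' . $href . '">' . $m[0] . '</a>';
            $pos += strlen($m[0]);
            continue;
        } else {
            // Anything else is copied through one byte at a time.
            $chunk = $input[$pos];
        }
        $output .= $chunk;
        $pos += strlen($chunk);
    }
    return $output;
}
```

Called on the sample text from the question, this should wrap the bare www.google.com and leave the existing anchor as it is.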
#3
Another way of doing it (in PHP):
// Split into tags and text, keeping each tag as its own array element.
$strParts = preg_split( '/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
foreach( $strParts as $key => $part ) {
    // Skip parts that are tags, and text that directly follows an opening <a> tag.
    $isTag = preg_match( '@^<[^>]+>$@', $part );
    // The $key > 0 check avoids reading $strParts[-1] on the first part.
    $insideLink = $key > 0 && preg_match( '@^<a\b[^>]*>$@i', $strParts[$key - 1] );
    if( !$isTag && !$insideLink ) {
        $strParts[$key] = preg_replace( '@((http(s)?://)?(\S+\.[^\s,.!]+))@', '<a href="http$3://$4">$1</a>', $part );
    }
}
$html = implode( '', $strParts );
#4
Another trick is to guard all the existing links by encoding them, then replace the URLs with links, and finally decode the guarded links.
$data = 'test http://foo <a href="http://link">LINK</a> test';
// Guard existing links behind {{base64}} tokens.
$data = preg_replace_callback('/(<a href=".+?<\/a>)/','guard_url',$data);
// \S+ runs to the next whitespace, so dots inside the URL stay part of the link.
$data = preg_replace_callback('/(http:\/\/\S+)(\s|$)/','link_url',$data);
// Base64 output can also contain "/" and "=", so the token pattern has to allow them.
$data = preg_replace_callback('/{{([a-zA-Z0-9+\/=]+?)}}/','unguard_url',$data);
print $data;
function guard_url($arr) { return '{{'.base64_encode($arr[1]).'}}'; }
function unguard_url($arr) { return base64_decode($arr[1]); }
function link_url($arr) { return '<a href="'.$arr[1].'">'.$arr[1].'</a>'.$arr[2]; }
The code above is just a proof of concept, and doesn't handle all situations. Still, you can see that the code is pretty straightforward.
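To give an idea of how a few of those situations might be covered, here is a hedged sketch of the same guard / link / unguard steps wrapped in one function. The name linkify_guarded is made up for illustration, the URL pattern (http://, https:// or a leading www.) is an assumption rather than a complete URL grammar, and it needs PHP 7.4+ for the arrow functions. One known gap remains: text that already contains something shaped like {{...}} would be decoded in the last step.

```php
<?php
// Hypothetical function name; a sketch, not a complete linkifier.
function linkify_guarded(string $data): string
{
    // 1. Guard: swap every existing anchor (with any attributes) for a {{base64}} token.
    $data = preg_replace_callback(
        '/<a\b[^>]*>.*?<\/a>/is',
        fn($m) => '{{' . base64_encode($m[0]) . '}}',
        $data
    );
    // 2. Link: wrap bare URLs, adding a scheme to hosts that start with "www.".
    //    Tokens cannot match here because base64 contains neither ":" nor ".".
    $data = preg_replace_callback(
        '/\b(?:https?:\/\/|www\.)[^\s<{]+/i',
        function ($m) {
            $href = preg_match('/^https?:\/\//i', $m[0]) ? $m[0] : 'http://' . $m[0];
            return '<a href="' . $href . '">' . $m[0] . '</a>';
        },
        $data
    );
    // 3. Unguard: decode the tokens; the class covers the full base64 alphabet.
    return preg_replace_callback(
        '/\{\{([A-Za-z0-9+\/=]+)\}\}/',
        fn($m) => base64_decode($m[1]),
        $data
    );
}

print linkify_guarded('test http://foo <a href="http://link">LINK</a> test');
```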