How to search for some text that is not part of a URL?

Suppose the text to search for is pqr.

"http://abc.zzz/pqr/xyz"      -> Should not match
"/pqr/"                       -> Should Match
"pqr"                         -> Should Match
"http://abc.zzz/pqr/pqr/"     -> Should not match
"http://abc.zzz/pqr/pqr/ pqr" -> Should match the last "pqr"
"www.pqr.zzz"                 -> Should not match

I tried the following regex:

((?:(?:(?:https?|ftp|file|mailto):)|www)[^ ]+?)?(pqr)

I then looked at group 1: if it was empty, I treated the result as a match. But this fails for http://abc.zzz/pqr/pqr/.

Is there a way to detect whether the text being matched is not part of a URL?

The worst case, I think, is to detect all the URLs first and store the start and end indexes of each one, then match pqr and exclude every occurrence that falls inside a URL. I was wondering whether there is a better way.
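
For comparison, the two-pass fallback described above could look roughly like the sketch below. The URL pattern is a deliberately simple, hypothetical one (a scheme or `www.` followed by non-whitespace), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassSearch {
    // Hypothetical URL detector: a known scheme or "www." followed by non-whitespace.
    static final Pattern URL =
        Pattern.compile("(?:(?:https?|ftp|file)://|mailto:|\\bwww\\.)\\S+");
    static final Pattern TARGET = Pattern.compile("pqr");

    // Returns the start index of every "pqr" that lies outside all detected URLs.
    static List<Integer> findOutsideUrls(String text) {
        // Pass 1: record the [start, end) span of each URL.
        List<int[]> urlSpans = new ArrayList<>();
        Matcher url = URL.matcher(text);
        while (url.find()) {
            urlSpans.add(new int[] {url.start(), url.end()});
        }

        // Pass 2: keep only the target matches that no URL span contains.
        List<Integer> starts = new ArrayList<>();
        Matcher target = TARGET.matcher(text);
        while (target.find()) {
            boolean insideUrl = false;
            for (int[] span : urlSpans) {
                if (target.start() >= span[0] && target.end() <= span[1]) {
                    insideUrl = true;
                    break;
                }
            }
            if (!insideUrl) {
                starts.add(target.start());
            }
        }
        return starts;
    }
}
```

This works, but it needs two scans plus a containment check per match, which is the overhead the question hopes to avoid.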

1 Solution

#1

Since you are using Java, you can take advantage of the constrained-width lookbehind that the Java regex engine supports: a lookbehind may contain a {n,m} limiting quantifier, as long as its maximum length is bounded. Right now, Java 8 even accepts the * and + quantifiers inside a lookbehind, but that is unofficial; it is a bug and is likely to be fixed in the next version, so do not rely on it. Instead, use a bounded range, say 0 to 1000 (a link is unlikely to contain more than 1K characters, but you can adjust the limit to your actual data):

 (?<!(?:(?:https?|ftp|file)://|mailto:)(?:www\.)?\S{0,1000})(?<!\bwww\.\S{0,1000})pqr

See the regex demo

The first lookbehind, (?<!(?:(?:https?|ftp|file)://|mailto:)(?:www\.)?\S{0,1000}), checks that pqr is not preceded by a full URL, i.e. a scheme followed by up to 1000 non-whitespace characters. The second lookbehind, (?<!\bwww\.\S{0,1000}), checks that pqr is not preceded by www. followed by up to 1000 non-whitespace characters.
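
As a minimal sketch of how the pattern might be used from Java, the snippet below compiles it once and collects the start index of each match. The class name `UrlAwareSearch` and the method `findOutsideUrls` are hypothetical names chosen for this example; the sample strings are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlAwareSearch {
    // The pattern from the answer; the 1000-character cap is the answer's assumption.
    static final Pattern NOT_IN_URL = Pattern.compile(
        "(?<!(?:(?:https?|ftp|file)://|mailto:)(?:www\\.)?\\S{0,1000})"
        + "(?<!\\bwww\\.\\S{0,1000})"
        + "pqr");

    // Returns the start index of every "pqr" that both lookbehinds allow.
    static List<Integer> findOutsideUrls(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = NOT_IN_URL.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://abc.zzz/pqr/xyz",
            "/pqr/",
            "pqr",
            "http://abc.zzz/pqr/pqr/",
            "http://abc.zzz/pqr/pqr/ pqr",
            "www.pqr.zzz"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + findOutsideUrls(sample));
        }
    }
}
```

Because `\S` cannot cross whitespace, a pqr that follows a space after a URL is no longer "attached" to that URL, which is why only the trailing pqr in the fifth sample is reported.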
