I want to match text in Wikipedia article source code with a regex, following these rules:
- Match only links to internal articles. So don't match links with any namespace such as files, categories, users, etc. (complete list of these namespaces here)
  - Example link to match: [[Without|namespace]]
  - Example links NOT to match: [[Category:Nope]], [[File:Nopeish]], etc.
- Match only links that contain the delimiter "|". Links with this symbol are displayed in the article with text that differs from the title of the article they refer to.
  - Example link to match: [[Something|else]]
  - Example link NOT to match: [[text]]
- Match links in two groups.
  - Example: [[Something|else]] will be matched into two groups with text:
    - group: "Something"
    - group: "else"
I have tested this, and so far I've come up with the following regex:
\[\[(?!.+?:)(.+?)\|(.+?)\]\]
which doesn't work as expected, since it also matches text like this:
[[Problem]] non link text [[Another link|problemAgain]]
  ^------------ group 1 (wrong) -------^ ^-group 2 -^
[[This should be|matched|]]
Thanks
1 solution
#1
3
Just use a negated character class instead of .+? so that neither group can contain a bracket:
\[\[(?!.+?:)([^\]\[]+)\|([^\]\[]+)\]\]
As a Java string, the regex would be:
"\\[\\[(?!.+?:)([^\\]\\[]+)\\|([^\\]\\[]+)\\]\\]"
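To see why the negated character class helps, here is a minimal Java sketch that runs the pattern from the question and the pattern above on the problematic input and prints what each one captures in group 1. The class name `LazyVsNegatedClass` and the helper `firstTarget` are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsNegatedClass {
    // Pattern from the question: the lazy .+? can still run across "]]".
    static final Pattern LAZY =
        Pattern.compile("\\[\\[(?!.+?:)(.+?)\\|(.+?)\\]\\]");
    // First pattern from the answer: the groups cannot contain brackets.
    static final Pattern NEGATED =
        Pattern.compile("\\[\\[(?!.+?:)([^\\]\\[]+)\\|([^\\]\\[]+)\\]\\]");

    /** Returns group 1 of the first match, or null if nothing matches. */
    static String firstTarget(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "[[Problem]] non link text [[Another link|problemAgain]]";
        System.out.println(firstTarget(LAZY, input));    // Problem]] non link text [[Another link
        System.out.println(firstTarget(NEGATED, input)); // Another link
    }
}
```

The lazy `.+?` still has to reach a `|` somewhere, so it expands across the first `]]`. The negated class stops at the first bracket, which forces the match to begin at the second link.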
Or, more simply, exclude the colon inside the character classes instead of using a lookahead:
\[\[([^\]\[:]+)\|([^\]\[:]+)\]\]
As a Java string, the regex would be:
"\\[\\[([^\\]\\[:]+)\\|([^\\]\\[:]+)\\]\\]"
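As a rough sketch of how the second pattern could be used from Java to collect both groups: the class name `WikiLinkMatcher` and the method `findPipedLinks` are hypothetical names chosen for this example, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinkMatcher {
    // Second pattern from the answer: colons are excluded from both groups,
    // so namespaced links such as [[File:x|y]] never match.
    private static final Pattern PIPED_LINK =
        Pattern.compile("\\[\\[([^\\]\\[:]+)\\|([^\\]\\[:]+)\\]\\]");

    /** Returns one {target, label} pair per piped link without a colon. */
    static List<String[]> findPipedLinks(String wikitext) {
        List<String[]> links = new ArrayList<>();
        Matcher m = PIPED_LINK.matcher(wikitext);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "[[Problem]] non link text [[Another link|problemAgain]] "
                    + "[[Category:Nope]] [[File:Nopeish|thumb]] [[Without|namespace]]";
        for (String[] link : findPipedLinks(text)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
        // Another link -> problemAgain
        // Without -> namespace
    }
}
```

One practical difference between the two patterns: the lookahead `(?!.+?:)` in the first one scans the rest of the line, so a colon anywhere after the `[[` (even outside the link) rejects the match, whereas the second pattern only rejects colons inside the link itself. The second pattern also rejects a colon in the display text, which may or may not be what you want.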