If I have some XML containing MediaWiki markup like the following:
" ...collected in the 12th century, of which [[Alexander the Great]] was the hero, and in which he was represented, somewhat like the British [[King Arthur|Arthur]]"
what would be the appropriate arguments to something like:
re.findall([[__?__]], article_entry)
I am stumbling a bit on escaping the double square brackets, and getting the proper link for text like: [[Alexander of Paris|poet named Alexander]]
4 solutions
#1
Here is an example
import re
pattern = re.compile(r"\[\[([\w \|]+)\]\]")
text = "blah blah [[Alexander of Paris|poet named Alexander]] bldfkas"
results = pattern.findall(text)
output = []
for link in results:
    # keep only the link target, i.e. the part before the "|"
    output.append(link.split("|")[0])
print(output)  # ['Alexander of Paris']
Version 2 puts more into the regex, which changes the shape of the output:
import re
pattern = re.compile(r"\[\[([\w ]+)(\|[\w ]+)?\]\]")
text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
results = pattern.findall(text)
# results is [('a', '|b'), ('c', '|d'), ('efg', '')]
print([link[0] for link in results])
# prints ['a', 'c', 'efg']
Version 3, if you only want the link target without the display text:
import re
pattern = re.compile(r"\[\[([\w ]+)(?:\|[\w ]+)?\]\]")
text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
results = pattern.findall(text)
print(results)  # ['a', 'c', 'efg']
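The three versions above rely on `[\w ]`, so a target such as `Stack Overflow (website)` would be skipped because of the parentheses. As a rough sketch only, one way to loosen this is to accept anything except brackets and the pipe. The names `WIKILINK` and `extract_links` are made up for this illustration and are not part of the answer, and the pattern does not try to handle nested links or templates.

```python
import re

# Hypothetical pattern: group 1 is the link target, group 2 the optional display text.
# The target may contain anything except "[", "]" and "|".
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(text):
    """Return (target, label) pairs; the label falls back to the target."""
    pairs = []
    for match in WIKILINK.finditer(text):
        target = match.group(1).strip()
        label = match.group(2)
        if label is None:
            # No "|" in the link, so the page title doubles as display text.
            label = target
        pairs.append((target, label.strip()))
    return pairs


sample = (
    "of which [[Alexander the Great]] was the hero, like the British "
    "[[King Arthur|Arthur]], according to a "
    "[[Alexander of Paris|poet named Alexander]]; "
    "see also [[Stack Overflow (website)]]"
)
for target, label in extract_links(sample):
    print(target, "->", label)
```

Returning both halves of each link leaves it to the caller to decide whether the target or the display text is the one that matters.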
#2
RegExp: \w+( \w+)+(?=]])
input
[[Alexander of Paris|poet named Alexander]]
output
poet named Alexander
input
[[Alexander of Paris]]
output
Alexander of Paris
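For anyone trying this expression from Python, here is a small sketch of how it behaves. The helper name `display_text` is invented for the example. Two things are worth knowing: the pattern returns the display text rather than the link target, and because `( \w+)+` requires at least one space, it finds nothing for a single-word link.

```python
import re

# The expression from this answer; "]" is literal outside a character class.
label_pattern = re.compile(r"\w+( \w+)+(?=]])")


def display_text(link):
    """Return the multi-word text just before "]]", or None if nothing matches."""
    match = label_pattern.search(link)
    # group(0) is the whole match; findall() would return only the last
    # repetition of the capturing group instead.
    return match.group(0) if match else None


print(display_text("[[Alexander of Paris|poet named Alexander]]"))
print(display_text("[[Alexander of Paris]]"))
print(display_text("[[Arthur]]"))
```

Using `search` with `group(0)` rather than `findall` matters here, since `findall` reports the capturing group and would yield just `' Alexander'` for the first input.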
#3
import re
pattern = re.compile(r"\[\[([\w ]+)(?:\||\]\])")
text = "of which [[Alexander the Great]] was somewhat like [[King Arthur|Arthur]]"
results = pattern.findall(text)
print(results)
This gives the output:
['Alexander the Great', 'King Arthur']
#4
If you are trying to get all the links from a page, it is much easier to use the MediaWiki API where possible, e.g. http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Stack_Overflow_(website).
Note that both these methods miss links embedded in templates.
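A minimal sketch of calling that API from Python with only the standard library might look like the following. The function names, the `User-Agent` string, and the choice of `format=json` with `pllimit` are assumptions made for this illustration. The parsing assumes the default JSON layout, where pages are keyed by page id under `query.pages`, and it ignores continuation, so long pages would return only the first batch of links.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the example URL above (https assumed here).
API_URL = "https://en.wikipedia.org/w/api.php"


def build_links_url(title, limit=50):
    """Build a query URL asking for the links on one page, as JSON."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "pllimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def titles_from_response(data):
    """Collect link titles from a decoded response (assumed default layout)."""
    titles = []
    for page in data.get("query", {}).get("pages", {}).values():
        for link in page.get("links", []):
            titles.append(link["title"])
    return titles


def fetch_link_titles(title):
    """Fetch one batch of link titles; continuation is not handled."""
    # Hypothetical User-Agent value; replace it with something identifying you.
    request = Request(build_links_url(title),
                      headers={"User-Agent": "wikilink-example/0.1"})
    with urlopen(request) as response:
        return titles_from_response(json.load(response))
```

Calling `fetch_link_titles("Stack Overflow (website)")` would then perform the same request as the example URL, with the response decoded into a plain list of titles.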