I have scraped several articles concerning terrorist attacks. From these articles I would like to extract a specific paragraph.
This is a sample of the articles scraped:
By DAVID D. KIRKPATRICK MARCH 18, 2015
Scenes from Tunisian state television showed confusion outside an art museum and Parliament on Wednesday after gunmen attacked.
CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a
midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry
that is vital to Tunisia as it struggles to consolidate the only transition to democracy
after the Arab Spring revolts.
Tunisian officials had initially said that the attackers took 10
hostages and killed nine people, including seven foreign visitors and two Tunisians.
What I want to extract for further analysis is the text that runs, in this example, from "CAIRO —" to the first full stop.
This is the regular expression that I came up with:
([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s
With this regular expression I extract only the starting point of the paragraph but I don't extract the rest of it.
2 Solutions
#1
2
Use a non-greedy quantifier:
(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)
The ? after a + (or *) makes it non-greedy, meaning it matches as little as possible instead of the default behaviour, where it matches as much as possible.
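To see the difference, here is a minimal sketch in Python. Python and its `re` module are assumptions, since the question does not say which regex engine is used. It runs the greedy and the non-greedy variant against the sample article.

```python
import re

# The sample article from the question, one string per scraped line.
text = (
    "By DAVID D. KIRKPATRICK MARCH 18, 2015\n"
    "Scenes from Tunisian state television showed confusion outside an art "
    "museum and Parliament on Wednesday after gunmen attacked.\n"
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry\n"
    "that is vital to Tunisia as it struggles to consolidate the only transition to democracy\n"
    "after the Arab Spring revolts.\n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and two Tunisians.\n"
)

# Greedy: [\s\S]+ runs as far as it can, then backs up to the LAST ". ".
GREEDY = r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s"
# Non-greedy: [\s\S]+? stops at the FIRST full stop followed by whitespace.
LAZY = r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s"

greedy_match = re.search(GREEDY, text)
lazy_match = re.search(LAZY, text)

# group(0) is the whole match, regardless of how the groups are arranged.
print(greedy_match.group(0).strip())
print("---")
print(lazy_match.group(0).strip())
```

The greedy match runs through to the end of the second paragraph, while the non-greedy one ends at "after the Arab Spring revolts."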
#2
0
EDIT1:
Try the following regex:
([A-Z]+\w+\s*—\s*.*?\.)
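One caveat, assuming Python's `re` module (the question does not name an engine): by default `.` does not match a newline, so on text that is wrapped across lines, like the sample in the question, this pattern may need the `re.DOTALL` flag. A minimal sketch:

```python
import re

# The target paragraph and the one after it, wrapped across lines as in the question.
text = (
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry\n"
    "that is vital to Tunisia as it struggles to consolidate the only transition to democracy\n"
    "after the Arab Spring revolts.\n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and two Tunisians.\n"
)

PATTERN = r"([A-Z]+\w+\s*—\s*.*?\.)"

# Without DOTALL, ".*?" cannot cross the line break after "in a".
no_flag = re.search(PATTERN, text)

# With DOTALL, "." also matches "\n", so the match reaches the first full stop.
with_flag = re.search(PATTERN, text, re.DOTALL)

print(no_flag)
print(with_flag.group(1))
```

If the scraped text has no line breaks inside the paragraph, the flag makes no difference.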
The problem is the grouping; the regex itself does match the text that you want.
Try the following regex (surround the whole regex with parentheses):
(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)
Group 1 contains the required string/text.
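As an illustration, assuming Python's `re` module (the question does not name an engine), the sketch below applies the parenthesised regex to a one-sentence stand-in for the article, shortened here for illustration, and reads the groups. It also shows that `re.findall` with the original regex returns only the captured group, which may be why only the starting point showed up.

```python
import re

# Hypothetical one-sentence sample; with a single full stop,
# greedy and non-greedy matching give the same result.
sample = "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday. "

ORIGINAL = r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s"
WRAPPED = r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)"

match = re.search(WRAPPED, sample)
whole_text = match.group(1)   # outer parentheses: the full matched text
dateline = match.group(2)     # inner parentheses: only the leading capitals

# With exactly one capture group, findall returns that group, not the full match.
only_groups = re.findall(ORIGINAL, sample)

print(repr(whole_text))
print(repr(dateline))
print(only_groups)
```

On text with several sentences, the greedy `+` in this regex would still run to the last full stop that is followed by whitespace.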