Regular expression to extract a paragraph from an article

Date: 2022-12-10 18:17:28

I have scraped several articles concerning terrorist attacks. From these articles I would like to extract a specific paragraph.

This is a sample of the articles scraped:

By   DAVID D. KIRKPATRICK    MARCH 18, 2015 
Scenes from Tunisian state television showed confusion outside an art museum and Parliament on Wednesday after gunmen attacked.
CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a
midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry 
that is vital to  Tunisia  as it struggles to consolidate the only transition to democracy 
after the Arab Spring revolts. 
Tunisian officials had initially said that the attackers took 10
hostages and killed nine people, including seven foreign visitors and two Tunisians.

What I want to extract for further analysis is the text that runs, in this example, from "CAIRO —" to the first full stop.

This is the regular expression that I came up with:

([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s

With this regular expression I get only the starting point of the paragraph, not the rest of it.

2 answers

#1


2  

Use a non-greedy quantifier:

(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)

The ? after a + (or *) makes the quantifier non-greedy: it matches as little as possible, instead of the default behaviour of matching as much as possible. Here, [\s\S]+? stops at the first full stop that is followed by whitespace rather than at the last one.
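To make the difference concrete, here is a minimal sketch that runs both variants against the sample article from the question. It assumes Python's re module, which the question does not specify; the variable names are illustrative only.

```python
import re

# Sample article text from the question, line breaks preserved.
text = (
    "By   DAVID D. KIRKPATRICK    MARCH 18, 2015 \n"
    "Scenes from Tunisian state television showed confusion outside an art "
    "museum and Parliament on Wednesday after gunmen attacked.\n"
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the "
    "tourist industry \n"
    "that is vital to  Tunisia  as it struggles to consolidate the only "
    "transition to democracy \n"
    "after the Arab Spring revolts. \n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and "
    "two Tunisians.\n"
)

# Greedy: [\s\S]+ runs on to the LAST full stop followed by whitespace.
greedy = re.compile(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)")
# Non-greedy: [\s\S]+? stops at the FIRST full stop followed by whitespace.
lazy = re.compile(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)")

greedy_match = greedy.search(text)
lazy_match = lazy.search(text)

# Show the tail of each match to see where it stopped.
print(repr(greedy_match.group(1)[-25:]))
print(repr(lazy_match.group(1)[-25:]))

# Collapse the line breaks and doubled spaces for further analysis.
paragraph = " ".join(lazy_match.group(1).split())
print(paragraph)
```

One caveat: a full stop followed by whitespace also occurs in abbreviations and initials (such as the "D." in the byline), so on other articles the non-greedy match may stop earlier than the end of the sentence.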

#2


0  

EDIT1:

Try the following regex:

([A-Z]+\w+\s*—\s*.*?\.)
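A point worth checking with this pattern is that . does not match a newline by default in most regex flavours, while the paragraph in the sample spans several lines. The sketch below assumes Python's re module (the question does not name a language) and compares the pattern with and without the re.DOTALL flag.

```python
import re

# Sample article text from the question, line breaks preserved.
text = (
    "By   DAVID D. KIRKPATRICK    MARCH 18, 2015 \n"
    "Scenes from Tunisian state television showed confusion outside an art "
    "museum and Parliament on Wednesday after gunmen attacked.\n"
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the "
    "tourist industry \n"
    "that is vital to  Tunisia  as it struggles to consolidate the only "
    "transition to democracy \n"
    "after the Arab Spring revolts. \n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and "
    "two Tunisians.\n"
)

pattern = r"([A-Z]+\w+\s*—\s*.*?\.)"

# Without DOTALL, "." cannot cross the line break after "in a",
# and that line contains no full stop, so nothing matches.
no_flag = re.search(pattern, text)
print(no_flag)  # None for this sample

# With DOTALL, "." also matches "\n", so the lazy .*? can reach
# the first full stop several lines further down.
with_flag = re.search(pattern, text, flags=re.DOTALL)
print(with_flag.group(1))
```

If the article text has already been joined into a single line, the flag makes no difference.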

The problem is about grouping; the regex itself does match the text that you want.

Try the following regex (the original regex surrounded with parentheses):

(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)

Group 1 contains the required string/text.
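The effect of the extra parentheses on group numbering can be sketched as follows, again assuming Python's re module and using the sample text from the question. The outer parentheses become group 1 and the original capture group shifts to group 2.

```python
import re

# Sample article text from the question, line breaks preserved.
text = (
    "By   DAVID D. KIRKPATRICK    MARCH 18, 2015 \n"
    "Scenes from Tunisian state television showed confusion outside an art "
    "museum and Parliament on Wednesday after gunmen attacked.\n"
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the "
    "tourist industry \n"
    "that is vital to  Tunisia  as it struggles to consolidate the only "
    "transition to democracy \n"
    "after the Arab Spring revolts. \n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and "
    "two Tunisians.\n"
)

m = re.search(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)", text)

# Group 2 is the original inner group: only the dateline.
print(m.group(2))

# Group 1 wraps the whole pattern, so it equals the full match, group 0.
print(m.group(1) == m.group(0))

# The quantifier here is still greedy, so group 1 runs to the last
# full stop followed by whitespace, not the first.
print(repr(m.group(1)[-25:]))
```

Because [\s\S]+ is greedy in this version, group 1 on the sample covers both sentences after "CAIRO —"; using [\s\S]+? as in answer #1 limits it to the first one.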
