Extract surrounding words in Python from a string position

Time: 2021-08-06 06:29:44

Let's assume I have a string:

string="""<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p> <p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>"""

and I have the positions of a word in this string, for example:

>>> import re
>>> pos = [m.start() for m in re.finditer("tells you", string)]
>>> pos
[263, 588]

I need to extract several words before and several words after each position. How can I implement this using Python and regular expressions?

E.g.:

import re
from bs4 import BeautifulSoup

def look_through(d, s):
    r = []
    content = readFile(d["path"])  # or d["content"], as in the example dictionary below
    content = BeautifulSoup(content, "html.parser")
    content = content.getText()
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            # here I need only the words surrounding position p, not the whole text
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r

>>> dict = {
    "content": """<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p>""", 
    "name": "directory", 
    "decendent": [
         {
            "content": """<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>""", 
            "name": "subdirectory", 
            "decendent": None
        }, 
        {
            "content": """It tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)""", 
            "name": "subdirectory_two", 
            "decendent": [
                {
                    "content": "Name 4", 
                    "name": "subsubdirectory", 
                    "decendent": None
                }
            ]
        }
    ]
}

So:

>>> look_through(dict, "tells you")
[
    { "content": "This article tells you how to", "phrase": "tells you", "name": "subdirectory" },
    { "content": "It tells you how to use", "phrase": "tells you", "name": "subdirectory_two" }
]

Thank you!

2 solutions

#1


I first proposed using word boundary meta characters, but that's not quite right, because they don't consume any of the string, and \B doesn't really match what I wanted it to anyway.

Instead, I propose using the underlying definition of a word boundary -- that is, the boundary between \W and \w. Look for one or more word characters (\w) together with one or more non-word characters (\W), in the right order, repeated as many times as you want, on either side of the search substring.

For example: (?:\w+\W+){,3}some string(?:\W+\w+){,3}

This finds up to three words before and up to three words after "some string".

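As a rough illustration, a minimal sketch of this approach could look like the following (the sample text and variable names here are only illustrative, and re.escape is added in case the phrase contains regex metacharacters):

import re

text = "This article tells you how to write HTML where text with different writing directions is mixed"
phrase = "tells you"

# up to three words on each side of the (escaped) phrase
pattern = r"(?:\w+\W+){,3}" + re.escape(phrase) + r"(?:\W+\w+){,3}"
for m in re.finditer(pattern, text):
    print(m.group(0))   # -> This article tells you how to write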

#2


You want a "concordance" of your regexp hits: say, two words before and after the place where your regexp matched. The easiest way to do it is to break your string there and anchor your search to the endpoints of the pieces. For example, to get two words before and after index 263 (your first m.start()), you'd do:

import re

# `text` is the question's string; 263 is the first m.start()
m_left = re.search(r"(?:\s+\S+){,2}\s+\S*$", text[:263])   # up to two words before the position
m_right = re.search(r"^\S*\s+(?:\S+\s+){,2}", text[263:])  # up to two words after the position
print(text[m_left.start():263 + m_right.end()])            # m_right's indices are relative to text[263:]

The first expression should be read from the end of the string backwards: it anchors at the end with $, possibly skips a partial word (\S*) in case position 263 falls in the middle of a word, skips some spaces (\s+), and then matches up to two ({,2}) word-space sequences, \s+\S+. It's not exactly two, because if we reach the beginning of the string we want to return a shorter match.

The second regexp does the same, but in the reverse direction.

For a concordance you'd probably want to start reading right after the end of the regexp match, not the beginning. In that case, use m.end() as the beginning of the second string.

It's pretty obvious how to use this with a list of regexp matches, I think.
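
For completeness, here is one possible way to wrap the idea into a helper. The function name concordance, the n parameter, the re.escape call, and the fallbacks at the string ends are my additions, not part of the answer; the two context regexes are the ones above, applied to m.start() and m.end() of each finditer match:

import re

def concordance(text, phrase, n=2):
    """Return up to n words of context on each side of every occurrence of phrase."""
    results = []
    for m in re.finditer(re.escape(phrase), text):
        left = re.search(r"(?:\s+\S+){,%d}\s+\S*$" % n, text[:m.start()])
        right = re.search(r"^\S*\s+(?:\S+\s+){,%d}" % n, text[m.end():])
        start = left.start() if left else 0           # no whitespace before the match
        end = m.end() + (right.end() if right else 0)  # no whitespace after the match
        results.append(text[start:end].strip())
    return results

Used on plain text (for example the getText() output in the question's look_through), concordance(content, "tells you") would give one context snippet per occurrence of the phrase, which could then replace the full content in the appended dictionaries.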
