Wildcard matching a string in Python regex search

Time: 2022-09-13 09:31:45

I thought I would write some quick code to download the number of "fans" a Facebook page has.

For some reason, despite trying a fair number of variations, I can't get the following code to pick out the number of fans in the HTML. None of the other solutions I found on the web match in this case either. Surely it is possible to have some wildcard between the two matching bits?

The text I'd like to match against is "6 of X fans", where X is an arbitrary number of fans a page has - I would like to get this number.

I was thinking of polling this data intermittently and writing to a file but I haven't gotten around to that yet. I'm also wondering if this is headed in the right direction, as the code seems pretty clunky. :)

import urllib
import re

fbhandle = urllib.urlopen('http://www.facebook.com/Microsoft')
pattern = "6 of(.*)fans" #this wild card doesnt appear to work?
compiled = re.compile(pattern)

for lines in fbhandle.readlines():
        ms = compiled.match(lines)
        print ms #debugging
        if ms: break
#ms.group()
print ms
fbhandle.close()

3 solutions

#1


10  

import urllib
import re

fbhandle = urllib.urlopen('http://www.facebook.com/Microsoft')
pattern = "6 of(.*)fans" #this wild card doesnt appear to work?
compiled = re.compile(pattern)

ms = compiled.search(fbhandle.read())
print ms.group(1).strip()
fbhandle.close()

You needed to use re.search() instead. re.match() only succeeds if the pattern matches at the very beginning of the string it is given (here, the start of each line), but you are trying to match a piece somewhere inside the document. The code above prints: 79,110. Of course, this will probably be a different number by the time it gets run by someone else.
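
The difference between the two calls can be checked without hitting the network. This is a minimal Python 3 sketch (the snippets in this thread are Python 2); the `html` string is a made-up stand-in, since the real page markup is not shown here.

```python
import re

# Hypothetical stand-in for the downloaded page (assumption, not real markup).
html = '<div class="box">\n<span>6 of 79,110 fans</span>\n</div>'

pattern = re.compile(r"6 of(.*)fans")

# match() only succeeds if the pattern matches at position 0 of the string.
print(pattern.match(html))        # None

# search() scans forward and returns the first match found anywhere.
found = pattern.search(html)
print(found.group(1).strip())     # 79,110
```

In real use, search() can also return None, so checking the result before calling .group() avoids an AttributeError.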

#2


10  

Evan Fosmark already gave a good answer. This is just more info.

You have this line:

pattern = "6 of(.*)fans"

In general, this isn't a good regular expression. If the input text was:

"6 of 99 fans in the whole galaxy of fans"

Then the match group (the stuff inside the parentheses) would be:

" 99 fans in the whole galaxy of "

So, we want a pattern that will just grab what you want, even with a silly input text like the above.
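
The greedy capture described above is easy to reproduce. A small Python 3 sketch, using only the sample text from this answer:

```python
import re

text = "6 of 99 fans in the whole galaxy of fans"

# (.*) is greedy: it grabs as much as it can and backs off only until
# the LAST "fans" in the text still allows the pattern to match.
greedy = re.search(r"6 of(.*)fans", text)
print(repr(greedy.group(1)))   # ' 99 fans in the whole galaxy of '
```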

In this case, it doesn't really matter if you match the white space, because int() ignores leading and trailing white space when it converts a string to an integer. But let's write the pattern to ignore white space.

With the * wildcard, it is possible to match a string of length zero. In this case I think you always want a non-empty match, so you want to use + to match one or more characters.
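
To illustrate both points (a sketch in Python 3; the input "6 offans" is a contrived string chosen only to show the empty match):

```python
import re

# "*" allows an empty capture, so this nonsense input still matches.
star = re.search(r"6 of(.*)fans", "6 offans")
print(repr(star.group(1)))   # ''

# "+" requires at least one character between the two literals.
plus = re.search(r"6 of(.+)fans", "6 offans")
print(plus)                  # None

# int() strips leading and trailing white space on its own.
print(int(" 99 "))           # 99
```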

Python has non-greedy matching available, so you could rewrite with that. Older programs with regular expressions may not have non-greedy matching, so I'll also give a pattern that doesn't require non-greedy.

So, the non-greedy pattern:

pattern = r"6 of\s+(.+?)\s+fans"

The other one:

pattern = r"6 of\s+(\S+)\s+fans"

\s means "any white space" and will match a space, a tab, and a few other characters (such as "form feed"). \S means "any non-white-space" and matches anything that \s would not match.

The non-greedy pattern does better than your original pattern with the silly input text:

"6 of 99 fans in the whole galaxy of fans"

It would return a match group of just 99.

But try this other silly input text:

"6 of 99 crazed fans"

It would return a match group of 99 crazed.

The second pattern would not match at all, because \S+ stops after "99" and the next word is "crazed", not "fans".
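
Here is a quick Python 3 check of both patterns against both silly inputs. The names `lazy` and `no_space` are just labels picked for this sketch.

```python
import re

patterns = {
    "lazy": re.compile(r"6 of\s+(.+?)\s+fans"),      # non-greedy version
    "no_space": re.compile(r"6 of\s+(\S+)\s+fans"),  # no non-greedy needed
}

samples = [
    "6 of 99 fans in the whole galaxy of fans",
    "6 of 99 crazed fans",
]

results = {}
for text in samples:
    for name, pat in patterns.items():
        m = pat.search(text)
        results[(name, text)] = m.group(1) if m else None
        print(name, "|", text, "->", results[(name, text)])

# lazy     | 6 of 99 fans in the whole galaxy of fans -> 99
# no_space | 6 of 99 fans in the whole galaxy of fans -> 99
# lazy     | 6 of 99 crazed fans -> 99 crazed
# no_space | 6 of 99 crazed fans -> None
```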

Hmm. Here's one last pattern that should always do the right thing even with silly input texts:

pattern = r"6 of\D*?(\d+)\D*?fans"

\d matches any digit ('0' to '9'). \D matches any non-digit.

This will successfully match anything that is remotely non-ambiguous:

"6 of 99 fans in the whole galaxy of fans"

The match group will be 99.

"6 of 99 crazed fans"

The match group will be 99.

"6 of 99 41 fans"

It will not match, because there was a second number in there.
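
The three cases can be verified in Python 3 as below. One caveat: the count printed in the first answer, 79,110, contains a comma, and a comma is a non-digit, so the digits-only pattern would not match it. The `with_commas` variant is my own assumption about how one might allow thousands separators. It is not part of the pattern above.

```python
import re

digits_only = re.compile(r"6 of\D*?(\d+)\D*?fans")

for text in ("6 of 99 fans in the whole galaxy of fans",
             "6 of 99 crazed fans",
             "6 of 99 41 fans"):
    m = digits_only.search(text)
    print(text, "->", m.group(1) if m else None)

# A comma splits the number into two digit runs, so this fails to match.
print(digits_only.search("6 of 79,110 fans"))    # None

# Assumed variant: a digit followed by any mix of digits and commas.
with_commas = re.compile(r"6 of\D*?(\d[\d,]*)\D*?fans")
m = with_commas.search("6 of 79,110 fans")

# int() rejects commas, so remove them before converting.
fans = int(m.group(1).replace(",", ""))
print(fans)                                      # 79110
```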

To learn more about Python regular expressions, you can read various web pages. For a quick reminder, inside the Python interpreter, do:

>>> import re
>>> help(re)

When you are "scraping" text from a web page, you might sometimes run afoul of HTML codes. In general, regular expressions are not a good tool for disregarding HTML or XML markup (see here); you would probably do better to use Beautiful Soup to parse the HTML and extract the text, followed by a regular expression to grab the text you really wanted.
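
As a rough illustration of the "parse first, then regex" idea, here is a Python 3 sketch. Note that it uses the standard library's html.parser in place of Beautiful Soup, purely so it runs without third-party packages. The markup in `html` and the `TextExtractor` class are invented for the example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a document, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Hypothetical markup where tags sit between the words we care about.
html = '<div><span>6 of <b>79,110</b></span> <a href="/fans">fans</a></div>'

count_pattern = re.compile(r"6 of\s+([\d,]+)\s+fans")

# Against the raw HTML the tags get in the way.
print(count_pattern.search(html))    # None

parser = TextExtractor()
parser.feed(html)
parser.close()

# Join the text nodes and collapse runs of white space.
text = " ".join("".join(parser.chunks).split())
print(text)                          # 6 of 79,110 fans

m = count_pattern.search(text)
print(m.group(1))                    # 79,110
```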

I hope this was interesting and/or educational.

#3


0  

You don't need a regex.

import urllib
fbhandle = urllib.urlopen('http://www.facebook.com/Microsoft')
for line in fbhandle.readlines():
    line=line.rstrip().split("</span>")
    for item in line:
        if ">Fans<" in item:
            rind=item.rindex("<span>")
            print "-->",item[rind:].split()[2]

output

$ ./python.py
--> 79,133
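
The same regex-free idea can be sketched in Python 3 with str.partition. The helper name `fan_count`, its default arguments, and the sample string are assumptions made for illustration. The real page markup may differ.

```python
def fan_count(text, prefix="6 of ", suffix=" fans"):
    """Return the text between prefix and suffix, or None if either is absent."""
    _, found_prefix, rest = text.partition(prefix)
    if not found_prefix:
        return None
    count, found_suffix, _ = rest.partition(suffix)
    return count.strip() if found_suffix else None


print(fan_count("<span>6 of 79,133 fans</span>"))   # 79,133
print(fan_count("<span>no count here</span>"))      # None
```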
