Python re.findall() returns the entire string as a single match

I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().

Here is an example: when I want to find all ... parts in the file, I might write something like this:

re.findall("<div>.*\</div>", result_page)

if result_page is a string "<div> </div> <div> </div>", the result will be

['<div> </div> <div> </div>']

Only the entire string. This is not what I want; I am expecting the two divs separately. What should I do?

2 solutions

#1

Quoting the documentation,

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add the question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
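
To see the difference side by side, here is a small sketch that runs both patterns on the string from the question. The names `greedy`, `lazy`, and `multiline_page` are only illustrative, and the `re.DOTALL` part is an extra assumption about real pages, where a div usually spans several lines.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs to the end of the string, then backtracks to the LAST </div>
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the FIRST </div> that lets the match succeed
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']

# By default "." does not match a newline, so a div spread over several
# lines is missed unless re.DOTALL is passed.
multiline_page = "<div>\nfirst\n</div>\n<div>\nsecond\n</div>"
without_flag = re.findall(r"<div>.*?</div>", multiline_page)
with_flag = re.findall(r"<div>.*?</div>", multiline_page, re.DOTALL)

print(without_flag)    # []
print(len(with_flag))  # 2
```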

Also, you shouldn't use regular expressions to parse HTML, since there are HTML parsers made exactly for that. Example using BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, 'html.parser')('div')]
Out[8]: ['<div> </div>', '<div> </div>']
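
One reason the parser route is safer: even the non-greedy pattern has no notion of nesting. A rough sketch with a made-up `nested_page` string (not from the question) shows the match ending at the first closing tag:

```python
import re

# Hypothetical input: an outer div that contains an inner div
nested_page = "<div>outer <div>inner</div> tail</div>"

# The non-greedy pattern ends the match at the first </div> it meets
matches = re.findall(r"<div>.*?</div>", nested_page)
print(matches)  # ['<div>outer <div>inner</div>']
```

The result is a single unbalanced fragment: the outer div's closing tag is left out, and the inner div is never reported on its own. An HTML parser keeps track of opening and closing tags, so it does not have this problem.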

#2

* is a greedy operator; use *? for a non-greedy match.

re.findall("<div>.*?</div>", result_page)

Or use a parser such as BeautifulSoup instead of a regular expression for this task:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
soup.find_all('div')
