I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().
Here is an example. When I want to find every ... part in the file, I might write something like this:
re.findall("<div>.*\</div>", result_page)
If result_page is the string "<div> </div> <div> </div>", the result will be
['<div> </div> <div> </div>']
That is only the entire string. This is not what I want; I expect the two divs separately. What should I do?
2 Answers
#1
6
Quoting the documentation,
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
Just add the question mark:
In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
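To see the two behaviours side by side, here is a small self-contained sketch. The first string is the one from the question. The multi-line string and the use of re.DOTALL are assumptions added for illustration, since by default '.' does not match a newline and real pages usually span several lines.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs to the last </div> in the string, so there is one match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the first </div> it can reach,
# so each div is matched separately.
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']

# Assumption for illustration: div content that spans several lines.
# Without re.DOTALL, '.' cannot cross '\n', so nothing matches.
multiline_page = "<div>\nfirst\n</div>\n<div>\nsecond\n</div>"
without_flag = re.findall(r"<div>.*?</div>", multiline_page)
with_flag = re.findall(r"<div>.*?</div>", multiline_page, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['<div>\nfirst\n</div>', '<div>\nsecond\n</div>']
```

Raw strings (the r prefix) are used so that backslashes in a pattern are passed to the regex engine unchanged.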
Also, you shouldn't use regular expressions to parse HTML, since there are HTML parsers made exactly for that. Example using BeautifulSoup 4:
In [7]: import bs4
In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, "html.parser")('div')]
Out[8]: ['<div> </div>', '<div> </div>']
#2
4
* is a greedy operator; use *? for a non-greedy match.
re.findall("<div>.*?</div>", result_page)
Or use a parser such as BeautifulSoup instead of a regular expression for this task:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html holds the page source as a string
soup.find_all('div')
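One case where the parser approach pays off is nested divs, where even the non-greedy pattern cuts the outer element short. The sketch below is only an illustration under stated assumptions: the nested sample string is made up, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything. DivTextCollector is a hypothetical helper name, not part of any library.

```python
import re
from html.parser import HTMLParser

# Assumed sample: one div nested inside another.
nested_page = "<div>outer <div>inner</div> tail</div>"

# The non-greedy pattern stops at the first </div>, truncating the outer div.
regex_result = re.findall(r"<div>.*?</div>", nested_page)
print(regex_result)  # ['<div>outer <div>inner</div>']


class DivTextCollector(HTMLParser):
    """Hypothetical helper: collects the text of every div, nested or not."""

    def __init__(self):
        super().__init__()
        self.open_divs = []  # one text buffer per currently open div
        self.div_texts = []  # finished divs, in the order they close

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.open_divs.append([])

    def handle_data(self, data):
        # Text belongs to every div that is currently open.
        for buffer in self.open_divs:
            buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.div_texts.append("".join(self.open_divs.pop()))


collector = DivTextCollector()
collector.feed(nested_page)
collector.close()
print(collector.div_texts)  # ['inner', 'outer inner tail']
```

The parser tracks which divs are open, which is something a single regular expression cannot do reliably for arbitrarily nested markup.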