I'd like Beautiful Soup to match any of a list of tags, like so. I know the attrs argument accepts regular expressions, but is there anything in Beautiful Soup that lets you do the same for tag names?
soup.findAll("(a|div)")
Desired output:
<a> ASDFS
<div> asdfasdf
<a> asdfsdf
My goal is to create a scraper that can grab tables from sites. Tags are sometimes named inconsistently, so I'd like to be able to pass in a list of tag names that identify the 'data' part of a table.
3 Answers
#1
28
find_all() is the most popular method in the Beautiful Soup search API.
You can pass it a variety of filters. To find multiple tags, pass a list:
>>> soup.find_all(['a', 'div'])
Example:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div>asdfasdf</div><p><a>foo</a></p></body></html>', 'html.parser')
>>> soup.find_all(['a', 'div'])
[<div>asdfasdf</div>, <a>foo</a>]
Or you can use a regular expression to find tags whose names contain a or div:
>>> import re
>>> soup.find_all(re.compile("(a|div)"))
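One caveat worth noting: Beautiful Soup matches a compiled pattern against tag names with its search() method, so the unanchored pattern (a|div) also matches any tag name that merely contains a or div, such as table or span. Anchoring the pattern restricts it to exact names. A minimal sketch (the sample HTML is invented for illustration):

```python
import re
from bs4 import BeautifulSoup

html = '<table><span>x</span><a>foo</a><div>bar</div></table>'
soup = BeautifulSoup(html, 'html.parser')

# Unanchored: also matches 'table' and 'span', since both contain 'a'
loose = [t.name for t in soup.find_all(re.compile('(a|div)'))]

# Anchored: only the exact tag names 'a' or 'div'
strict = [t.name for t in soup.find_all(re.compile('^(a|div)$'))]

print(loose)   # includes 'table' and 'span'
print(strict)  # ['a', 'div']
```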
#2
10
Note that you can also use regular expressions to search within tag attributes. For example:
import re
from bs4 import BeautifulSoup
soup.find_all('a', {'href': re.compile(r'crummy\.com/')})
This example finds all <a> tags that link to a URL containing the substring 'crummy.com'.
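The snippet above omits the soup setup; here is a self-contained version (the sample HTML and URLs are invented for illustration):

```python
import re
from bs4 import BeautifulSoup

html = '''
<a href="http://www.crummy.com/software/BeautifulSoup/">BS docs</a>
<a href="http://example.org/">other site</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keep only <a> tags whose href attribute matches the pattern
links = soup.find_all('a', {'href': re.compile(r'crummy\.com/')})
print([a['href'] for a in links])  # only the crummy.com link
```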
(I know this is a very old post, but hopefully someone will find this additional information useful.)
#3
3
Yes, see the docs:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
import re
soup.findAll(re.compile("^a$|(div)"))
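Beautiful Soup also accepts a callable as a name filter, which sidesteps regex subtleties entirely. A minimal sketch using the modern find_all() spelling (findAll is the legacy alias; the sample HTML is invented):

```python
from bs4 import BeautifulSoup

html = '<div>asdfasdf</div><p><a>foo</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# The function receives each tag and returns True to keep it
wanted = {'a', 'div'}
tags = soup.find_all(lambda t: t.name in wanted)
print([t.name for t in tags])  # ['div', 'a']
```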