Parsing HTML with Beautiful Soup: returning the text of a specific tag

I can extract a complete HTML tag, selected by one of its attributes, with a Python script run from a Unix shell like this:

#!/usr/bin/python3

# import the module
from bs4 import BeautifulSoup

# parse the file; naming the parser explicitly avoids a bs4 warning
with open("test.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# get the tag
print(soup(itemprop="name"))

where itemprop="name" uniquely identifies the required tag.

The output is something like:

[<span itemprop="name">
                    Blabla &amp; Bloblo</span>]

Now I would like to return only the text of the tag, i.e. the Blabla & Bloblo part.

My attempt was:

print(soup(itemprop="name").getText())

but I get an error message like AttributeError: 'ResultSet' object has no attribute 'getText'

It did work when I tried it in other contexts, such as:

print(soup.find('span').getText())

So what am I getting wrong?

1 Answer

#1

Using the soup object as a callable returns a list of results, as if you used soup.find_all(). See the documentation:

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object.
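
A minimal sketch of the shortcut described in that passage. The inline HTML string is a hypothetical stand-in for test.html, and "html.parser" is an assumed parser choice; neither comes from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of test.html
html = '<div><span itemprop="name">\n    Blabla &amp; Bloblo</span></div>'
soup = BeautifulSoup(html, "html.parser")

called = soup(itemprop="name")              # calling the soup object directly
searched = soup.find_all(itemprop="name")   # the explicit equivalent

# Both forms give back a ResultSet (a list of Tag objects), not a single Tag
print(type(called).__name__)
print(called == searched)

# A ResultSet has no getText()/get_text(); only the Tag objects inside it do
print(hasattr(called, "getText"))
print(hasattr(called[0], "getText"))
```

This is why the call in the question fails: getText() is being looked up on the list of matches rather than on one of the tags in it.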

Use soup.find() to find just the first match:

soup.find(itemprop="name").get_text()

or index into the resultset:

soup(itemprop="name")[0].get_text()
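
Both variants can be tried side by side with a small self-contained sketch. The inline HTML is a hypothetical stand-in for test.html, and the strip=True argument and the None checks are additions for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for test.html, mirroring the output shown in the question
html = '<span itemprop="name">\n                    Blabla &amp; Bloblo</span>'
soup = BeautifulSoup(html, "html.parser")

# Variant 1: find() returns the first matching Tag, or None if nothing matches
first = soup.find(itemprop="name")
text_find = first.get_text(strip=True) if first is not None else None

# Variant 2: index into the ResultSet; an empty result would raise IndexError,
# so check that the list is non-empty first
matches = soup(itemprop="name")
text_index = matches[0].get_text(strip=True) if matches else None

print(text_find)
print(text_index)
```

Passing strip=True trims the leading newline and indentation inside the span, and Beautiful Soup decodes the &amp; entity to a plain & in the returned text.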
