Python - How do I find the text of all spans with a class of 'value' using Beautiful Soup?

Time: 2022-04-13 08:51:59

I would like to get all of the text of the spans which have the class of 'value'.

I then need to get the online ISSN of the page by using the first 9 characters of the text. I don't need the ones with text ending in "(Print)", but I do need the ones ending in "(Online)".
Example:

<span class="bold">ISSN: </span>
<span class="value">0890-037X (Print)</span>
<span class="value">1550-2740 (Online)</span>


Here I would need to get "1550-2740" as it is the online ISSN. I think I need to find all the spans, check the class and then check the text. If the text ends in "(online)" then I need to get the first 9 characters.
How do I do this? Thank you in advance.

2 answers

#1



Use find_all to extract the elements. Create a generator (or a list, if you prefer) that yields the text attribute of each one. Filter out those that do not end in "(Online)" and slice the rest to extract just the ISSN. I have used a generator and next() to get only the first occurrence, but you could use a list if you wanted all of them (if there are multiple).

Hope this works for the whole file!

from bs4 import BeautifulSoup

with open("p.html") as f:  # close the file once it has been parsed
    soup = BeautifulSoup(f, "lxml")
txt = (t.text for t in soup.find_all("span", class_="value"))
issn = next(t[:9] for t in txt if t.endswith("(Online)"))

which gives issn as '1550-2740'.
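
For the case where every online ISSN is wanted rather than only the first, the list variant mentioned above could look roughly like this. It is a minimal sketch, not the answerer's code: the helper name `online_issns` is made up for illustration, the markup is the sample from the question, and the built-in "html.parser" is swapped in for "lxml" so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<span class="bold">ISSN: </span>
<span class="value">0890-037X (Print)</span>
<span class="value">1550-2740 (Online)</span>
"""

# Assumption: "html.parser" (bundled with Python) instead of the answer's "lxml".
soup = BeautifulSoup(html, "html.parser")


def online_issns(soup):
    """Hypothetical helper: ISSN part of every span.value ending in "(Online)"."""
    texts = (s.get_text(strip=True) for s in soup.find_all("span", class_="value"))
    # An ISSN is written NNNN-NNNC, which is 9 characters including the hyphen.
    return [t[:9] for t in texts if t.endswith("(Online)")]


issns = online_issns(soup)
print(issns)

# Passing a default to next() avoids StopIteration when nothing matches.
first = next(iter(issns), None)
```

With the sample markup this collects a single entry, the online ISSN. The default passed to next() is a small safety measure: the one-liner in the answer raises StopIteration on a page that has no "(Online)" span.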

#2



Another way could be something like below:

from bs4 import BeautifulSoup

# content holds the page's HTML as a string
soup = BeautifulSoup(content, "lxml")
for item in soup.find_all(class_="value"):
    if "Online" in item.text:
        print(item.text.split()[0])

Output:

1550-2740
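
The question writes "(online)" in lower case while the sample markup has "(Online)", so a case-insensitive check may be safer. The following is only a sketch under assumptions: the regular expression and the name `ISSN_ONLINE` are illustrative and not part of either answer, and "html.parser" again stands in for "lxml".

```python
import re

from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<span class="bold">ISSN: </span>
<span class="value">0890-037X (Print)</span>
<span class="value">1550-2740 (Online)</span>
"""

# Illustrative pattern: four digits, a hyphen, three digits, then a digit or X
# as the check character, followed by "(online)" in any letter case.
ISSN_ONLINE = re.compile(r"(\d{4}-\d{3}[\dX])\s*\(online\)", re.IGNORECASE)

soup = BeautifulSoup(html, "html.parser")

found = []
for span in soup.select("span.value"):  # CSS selector for <span class="value">
    match = ISSN_ONLINE.fullmatch(span.get_text(strip=True))
    if match:
        found.append(match.group(1))

print(found)
```

Capturing the ISSN with a group instead of slicing the first 9 characters also means that spans whose text does not have the ISSN shape are skipped rather than truncated.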
