Can we use XPath with BeautifulSoup?

Time: 2022-02-03 16:40:47

I am using BeautifulSoup to scrape a URL, and I have the following code:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})

In the code above, findAll returns the matching tags and their contents, but I want to use XPath instead. Is it possible to use XPath with BeautifulSoup? If so, could anyone provide some example code?

6 answers

#1


117  

Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

import urllib2
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
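
For anyone who wants to try this without hitting the network, here is a minimal sketch of the same idea. It assumes lxml is installed; SAMPLE_HTML and select_cells are made-up names for illustration, and the markup only stands in for the real page. It parses from a file-like object with etree.HTMLParser and then runs an XPath query for the td.empformbody cells:

```python
import io

from lxml import etree

# Hypothetical stand-in for the downloaded page (not the real site's markup).
SAMPLE_HTML = b"""
<html><body><table>
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table></body></html>
"""


def select_cells(raw_html):
    """Parse raw HTML bytes and return the text of every td.empformbody cell."""
    parser = etree.HTMLParser()
    # etree.parse() accepts any file-like object and returns an ElementTree.
    tree = etree.parse(io.BytesIO(raw_html), parser)
    return [td.text for td in tree.xpath('//td[@class="empformbody"]')]


cells = select_cells(SAMPLE_HTML)
print(cells)
```

Note that @class="empformbody" matches the attribute value exactly, so a cell with class="empformbody extra" would not be selected by this expression.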

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells, for example print their text.
    print(elem.text)

Coming full circle: BeautifulSoup itself does have pretty decent CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells, for example print their text.
    print(cell.get_text())
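
A self-contained sketch of the select() approach, assuming the bs4 package (BeautifulSoup 4) is installed. The markup and the table id "foobar" are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the first table has the id used in the selector.
SAMPLE_HTML = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
<table id="other"><tr><td class="empformbody">Carol</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
cells = [cell.get_text() for cell in soup.select("table#foobar td.empformbody")]
print(cells)
```

Unlike an exact XPath attribute comparison, the CSS class selector td.empformbody also matches cells that carry additional classes.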

#2


82  

I can confirm that there is no XPath support within Beautiful Soup.

#3


26  

Martijn's code no longer works properly (it is 4+ years old by now...): the etree.parse() line prints to the console and doesn't assign the value to the tree variable. Referencing this, I was able to figure out that the following works using requests and lxml:

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers:', buyers)
print('Prices:', prices)
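
The same two queries can be tried offline against an inline string. This is only a sketch, assuming lxml is installed; SAMPLE_PAGE imitates the structure the XPath expressions expect, and its buyer names and prices are placeholders, not data from the real page:

```python
from lxml import html

# Hypothetical page with the same element structure the queries look for.
SAMPLE_PAGE = """
<html><body>
  <div title="buyer-name">Buyer A</div><span class="item-price">$1.00</span>
  <div title="buyer-name">Buyer B</div><span class="item-price">$2.50</span>
</body></html>
"""

tree = html.fromstring(SAMPLE_PAGE)

# Each xpath() call returns a list of text nodes in document order.
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')

# Pair each buyer with the price at the same position.
pairs = list(zip(buyers, prices))
print(pairs)
```

Pairing by position only works when every buyer has exactly one price in the same order, which is an assumption about the page layout.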

#4


9  

BeautifulSoup has a method named findNext that searches forward from the current element, which includes its children, so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a') 

The code above roughly imitates the following XPath:

.//div[@class="class_value"]//div[@id="id_value"]//a
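
To check how close the two really are, here is a small sketch that runs a find_next chain (find_next is the bs4 spelling of findNext) and an lxml XPath query over the same made-up markup. It assumes both bs4 and lxml are installed; the id "father" and the links are invented for illustration:

```python
from bs4 import BeautifulSoup
from lxml import html

# Hypothetical markup; "father" is the starting element for both lookups.
SAMPLE_HTML = """
<html><body>
<div id="father">
  <div class="class_value">
    <div id="id_value"><a href="/one">one</a><a href="/two">two</a></div>
  </div>
</div>
</body></html>
"""

# BeautifulSoup: chain find_next calls, then collect the links.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
father = soup.find("div", {"id": "father"})
target = father.find_next("div", {"class": "class_value"}).find_next(
    "div", {"id": "id_value"}
)
soup_links = [a["href"] for a in target.find_all("a")]

# lxml: one XPath expression covering the same descent.
tree = html.fromstring(SAMPLE_HTML)
xpath_links = list(
    tree.xpath(
        '//div[@id="father"]//div[@class="class_value"]'
        '//div[@id="id_value"]//a/@href'
    )
)

print(soup_links, xpath_links)
```

The two agree here, but find_next keeps searching past the end of the current element, so on other markup it can match elements that are not descendants, while the XPath descendant axis cannot.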

#5


1  

I've searched through their docs and it seems there is no XPath option. Also, as you can see here in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.

#6


0  

This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and store its text content in a variable called "rss_text". I then run it through BeautifulSoup, look up the path /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.

from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
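
A runnable sketch of this attribute-style navigation, using an inline feed instead of a download. It assumes bs4 and lxml are installed (the 'xml' parser in BeautifulSoup relies on lxml); the feed contents are made up:

```python
from bs4 import BeautifulSoup

# Hypothetical RSS document standing in for the downloaded feed text.
rss_text = """<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First post</title></item></channel></rss>"""

rss_obj = BeautifulSoup(rss_text, "xml")

# Each attribute access returns the first matching descendant tag,
# so this follows the path /rss/channel/title.
title = rss_obj.rss.channel.title.get_text()
print(title)
```

Because each step returns only the first match, this cannot express something like "every item title"; for that, find_all("item") or a real XPath engine such as lxml is needed.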
