I am using BeautifulSoup to scrape a URL, and I have the following code:
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td', attrs={'class': 'empformbody'})
Now in the above code I can use findAll to get tags and the information related to them, but I want to use XPath instead. Is it possible to use XPath with BeautifulSoup? If so, can anyone please provide example code?
6 Answers
#1
117
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it'll try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
import urllib2
from lxml import etree

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# For example, select the table cells from the question:
results = tree.xpath('//td[@class="empformbody"]')
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells, e.g. read their text.
    print elem.text
Coming full circle: BeautifulSoup itself does have pretty decent CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells, e.g. read their text.
    print cell.text
#2
82
I can confirm that there is no XPath support within Beautiful Soup.
#3
26
Martijn's code no longer works properly (it is 4+ years old by now...); the etree.parse() line prints to the console and doesn't assign the value to the tree variable. Referencing this, I was able to figure out that this works using requests and lxml:
from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)

# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print 'Buyers: ', buyers
print 'Prices: ', prices
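As a side note, if lxml isn't available, the standard library's ElementTree supports a limited XPath subset for well-formed markup. A minimal sketch of the same attribute-based lookups (the sample markup below is invented for illustration, not fetched from the page above):

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet mirroring the structure of the page above.
page = """
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
  <div title="buyer-name">Earl E. Byrd</div>
  <span class="item-price">$8.37</span>
</body></html>
"""

root = ET.fromstring(page)
# ElementTree has no text() step; take .text from each matched element.
buyers = [d.text for d in root.findall('.//div[@title="buyer-name"]')]
prices = [s.text for s in root.findall('.//span[@class="item-price"]')]
```

Note that ElementTree only handles well-formed XML/XHTML, so it's no substitute for lxml's forgiving HTML parser on real-world pages.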
#4
9
BeautifulSoup has a function named findNext that searches the elements that come after the current one in the document, so you can chain lookups:

father.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')
The code above can imitate the following XPath:

//div[@class="class_value"]//div[@id="id_value"]//a
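For comparison, here is roughly what that path matches, sketched with the standard library's ElementTree on a small, invented, well-formed snippet (nested find calls stand in for the // steps; a real page would need an HTML parser):

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup for illustration only.
page = """
<html><body>
  <div class="class_value">
    <div id="id_value">
      <a href="/first">first</a>
      <a href="/second">second</a>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(page)
outer = root.find('.//div[@class="class_value"]')      # //div[@class="class_value"]
inner = outer.find('.//div[@id="id_value"]')           # ...//div[@id="id_value"]
links = [a.get('href') for a in inner.findall('.//a')]  # ...//a
```

Keep in mind findNext walks document order rather than only descendants, so the chain is an approximation of the path, not an exact equivalent.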
#5
1
I've searched through their docs and it seems there is no XPath option. Also, as you can see here in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.
#6
0
This is a pretty old thread, but there is a workaround solution now, which may not have existed in BeautifulSoup at the time.
Here is an example of what I did. I use the requests module to read an RSS feed and get its text content in a variable called rss_text. With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup

# rss_text holds the raw feed text fetched earlier with requests.
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
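The same basic-path lookup can also be done with the standard library's ElementTree, which avoids the lxml dependency that BeautifulSoup's 'xml' mode requires. A minimal sketch (the feed text here is invented, standing in for the rss_text fetched with requests):

```python
import xml.etree.ElementTree as ET

# Invented feed text standing in for the rss_text fetched with requests.
rss_text = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title></item>
  </channel>
</rss>"""

root = ET.fromstring(rss_text)          # root is the <rss> element
title = root.findtext('channel/title')  # relative path, like /rss/channel/title
```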