Passing web data into Beautiful Soup: empty list

Date: 2022-08-22 18:10:37

I've rechecked my code and looked at comparable operations for opening a URL to pass web data into Beautiful Soup. For some reason my code just doesn't return anything, although it appears to be in the correct form:

>>> from bs4 import BeautifulSoup
>>> from urllib3 import poolmanager
>>> connectBuilder = poolmanager.PoolManager()
>>> content = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> content
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> soup = BeautifulSoup(content)
>>> soup.title
>>> soup.title.name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'name'
>>> soup.p
>>> soup.get_text()
''
>>> content.data
a stream of data follows...

As shown, urlopen() returns an HTTP response, which is captured in the variable content. It makes sense that the status of the response can be read, but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (the variable soup). You can see that I've tried to read a few tags and some text; get_text() returns an empty string, which is strange.

Strangely, when I access the web data via content.data, the data shows up, but that's not useful, since I can't use Beautiful Soup to parse it. What is my problem? Thanks.

4 Answers

#1 (8 votes)

If you just want to scrape the page, requests will get the content you need:


from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
soup = BeautifulSoup(r.content, 'html.parser')  # an explicit parser avoids bs4's "no parser specified" warning

In [59]: soup.title
Out[59]: <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

In [60]: soup.title.name
Out[60]: 'title'

#2 (8 votes)

urllib3 returns a Response object, which contains .data, the preloaded body payload.

Per the top quickstart usage example here, I would do something like this:


import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/')

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.data)  # Note the use of the .data property
...

The rest should work as intended.


--

A little about what went wrong in your original code:


You passed the entire response object rather than the body payload. That would normally be fine, because the response object is file-like, except that in this case urllib3 has already consumed the entire response and parsed it for you, so there is nothing left to .read(). It's like passing a file pointer that has already been read to the end. .data, on the other hand, gives you access to the already-read data.
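The "already-read file pointer" analogy can be sketched with a plain in-memory stream from the standard library, with no network involved:

```python
import io

# Simulate what a preloaded urllib3 response looks like to a second reader:
buf = io.BytesIO(b"<html><title>hi</title></html>")
first = buf.read()   # the library "preloads" the whole body here
second = buf.read()  # a later .read() finds nothing left

print(first)   # b'<html><title>hi</title></html>'
print(second)  # b'' -- exactly what BeautifulSoup received in the question
```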

If you want to use urllib3 response objects as file-like objects, you'll need to disable content preloading, like this:


response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/', preload_content=False)
soup = BeautifulSoup(response)  # We can pass the original `response` object now.

Now it should work as you expected.


I understand that this is not very obvious behaviour, and as the author of urllib3 I apologize. :) We plan to make preload_content=False the default someday. Perhaps someday soon (I opened an issue here).


--

A quick note on .urlopen vs .request:


.urlopen assumes that you will take care of encoding any parameters passed to the request. In this case it's fine to use .urlopen because you're not passing any parameters to the request, but in general .request will do all the extra work for you so it's more convenient.

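To see roughly what "encoding any parameters" means, here is a sketch using the stdlib's urllib.parse against a hypothetical search URL. With .urlopen you would build the encoded query string yourself; .request accepts a dict of fields and performs this step internally:

```python
from urllib.parse import urlencode

# What you'd have to do by hand with .urlopen:
params = {"q": "beautiful soup", "page": 2}
query = urlencode(params)            # percent/plus-encodes keys and values
url = "http://example.com/search?" + query
print(url)  # http://example.com/search?q=beautiful+soup&page=2

# With .request you'd instead pass the dict as fields=... and
# urllib3 would take care of this encoding for you.
```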

If anyone would be up for improving our documentation to this effect, that would be greatly appreciated. :) Please send a PR to https://github.com/shazow/urllib3 and add yourself as a contributor!


#3 (2 votes)

As shown, it's clear that urlopen() returns an HTTP response which is captured by the variable content…


What you've called content isn't the content, but a file-like object that you can read the content from. BeautifulSoup is perfectly happy taking such a thing, but it's not very helpful to print it out for debugging purposes. So, let's actually read the content out of it to make this easier to debug:


>>> response = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> response
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> content = response.read()
>>> content
b''

This should make it pretty clear that BeautifulSoup is not the problem here. But continuing on:


… but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (variable soup).


Yes it does. The fact that soup.title gave you None instead of raising an AttributeError is pretty good evidence, but you can test it directly:


>>> type(soup)
bs4.BeautifulSoup

That's definitely a BeautifulSoup object.


When you pass BeautifulSoup an empty string, exactly what you get back will depend on which parser it's using under the covers; if it's relying on the Python 3.x stdlib, what you'll get is an html node with an empty head, an empty body, and nothing else. So, when you look for a title node, there isn't one, and you get None.
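The same behaviour can be reproduced with the stdlib parser that bs4 can wrap: feeding it an empty string produces no parse events at all, so any lookup, such as the title here, comes back empty. (TitleFinder is a throwaway helper written for this illustration, not part of any library.)

```python
from html.parser import HTMLParser

class TitleFinder(HTMLParser):
    """Minimal parser that records the text of the first <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

finder = TitleFinder()
finder.feed("")          # empty input: no tags, no events, no title
print(finder.title)      # None -- just like soup.title on an empty document

finder2 = TitleFinder()
finder2.feed("<html><head><title>hi</title></head></html>")
print(finder2.title)     # 'hi'
```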


So, how do you fix this?


As the documentation says, you're using "the lowest level call for making a request, so you'll need to specify all the raw details." What are those raw details? Honestly, if you don't already know, you shouldn't be using this method. Teaching you how to deal with the under-the-hood details of urllib3 before you even know the basics would not be doing you a service.

In fact, you really don't need urllib3 here at all. Just use the modules that come with Python:


>>> # on Python 2.x, instead do: from urllib2 import urlopen 
>>> from urllib.request import urlopen
>>> r = urlopen('http://www.crummy.com/software/BeautifulSoup/')
>>> soup = BeautifulSoup(r)
>>> soup.title.text
'Beautiful Soup: We called him Tortoise because he taught us.'

#4 (0 votes)

My Beautiful Soup code was working in one environment (my local machine) and returning an empty list in another one (an Ubuntu 14 server).

I resolved the problem by changing the installation. Details in another thread:

Html parsing with Beautiful Soup returns empty list

