I am trying to learn how to use Beautiful Soup and I have a problem when scraping a table from Wikipedia.
from bs4 import BeautifulSoup
import urllib2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')
print soup
It seems like I can't get the full Wikipedia table: the last entry I get with this code is Omnicom Group, and the output stops before the closing </tr> in the source. If you check the original link, the last entry of the table is Zoetis, so it stops about halfway through.
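One way to narrow down whether the download or the parsing step is truncating is to check the raw HTML before handing it to Beautiful Soup; a minimal diagnostic sketch, using the same urllib2 setup as above:

import urllib2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
html = urllib2.urlopen(wiki).read()   # raw bytes, before any parsing
# If this prints True, the full page arrived and the parser is at fault.
print 'Zoetis' in html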
Everything seems fine in the Wikipedia source code... any idea what I might be doing wrong?
2 Answers
#1
Try this. Read the documentation for more: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup
from urllib.request import urlopen   # Python 3 replacement for urllib2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')
# Grab only the first table with class "wikitable" instead of printing the whole page.
result = soup.find("table", class_="wikitable")
print(result)
This should be the last <tr> in your result:
<tr>
<td><a class="external text" href="https://www.nyse.com/quote/XNYS:ZTS" rel="nofollow">ZTS</a></td>
<td><a href="/wiki/Zoetis" title="Zoetis">Zoetis</a></td>
<td><a class="external text" href="http://www.sec.gov/cgi-bin/browse-edgar?CIK=ZTS&action=getcompany" rel="nofollow">reports</a></td>
<td>Health Care</td>
<td>Pharmaceuticals</td>
<td><a href="/wiki/Florham_Park,_New_Jersey" title="Florham Park, New Jersey">Florham Park, New Jersey</a></td>
<td>2013-06-21</td>
<td>0001555280</td>
</tr>
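To confirm programmatically that the parse reached the end of the table, you can inspect the last row; a minimal sketch continuing from the result variable above (the cell index just reflects this table's layout, where the security name is the second column):

rows = result.find_all('tr')
print(len(rows))                  # header row plus one row per company
last_cells = rows[-1].find_all('td')
print(last_cells[1].get_text())   # should print 'Zoetis'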
You will also need to install lxml for the parser (pip install lxml), and I used:
python==3.4.3
beautifulsoup4==4.4.1
#2
This is my working answer. It should work for you without even installing lxml. I used Python 2.7.
from bs4 import BeautifulSoup
import urllib2   # Python 2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urllib2.urlopen(wiki)
# The built-in html.parser needs no extra install.
soup = BeautifulSoup(page, "html.parser")
print soup.table   # the first <table> on the page
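If you want the row data rather than the raw markup, you can walk the same soup object; a minimal sketch continuing from the code above (Python 2 print syntax, and the cell index assumes the company name sits in the second column as in this table):

for row in soup.table.find_all('tr')[1:]:   # skip the header row
    cells = row.find_all('td')
    if cells:
        print cells[1].get_text()           # company name, ending with Zoetis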