I am trying to learn how to use Beautiful Soup and I have a problem when scraping a table from Wikipedia.
from bs4 import BeautifulSoup
import urllib2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')
print soup
It seems like I can't get the full Wikipedia table: the last entry I get with this code is Omnicom Group, and the output stops before the closing </tr> in the source. If you check the original link, the last entry of the table is Zoetis, so it stops about halfway through.
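One way to narrow down whether the download or the parsing step is truncating is to check the raw HTML before handing it to Beautiful Soup; a minimal diagnostic sketch, using the same urllib2 setup as above:

import urllib2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
html = urllib2.urlopen(wiki).read()   # raw bytes, before any parsing
# If this prints True, the full page arrived and the parser is at fault.
print 'Zoetis' in html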
Everything seems fine in the Wikipedia source code... any idea what I might be doing wrong?
2 Answers
#1
Try this. Read the documentation for more: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup
from urllib.request import urlopen   # Python 3 replacement for urllib2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')
# Grab only the first table with class "wikitable" instead of printing the whole page.
result = soup.find("table", class_="wikitable")
print(result)
This should be the last <tr> in your result:
<tr>
<td><a class="external text" href="https://www.nyse.com/quote/XNYS:ZTS" rel="nofollow">ZTS</a></td>
<td><a href="/wiki/Zoetis" title="Zoetis">Zoetis</a></td>
<td><a class="external text" href="http://www.sec.gov/cgi-bin/browse-edgar?CIK=ZTS&action=getcompany" rel="nofollow">reports</a></td>
<td>Health Care</td>
<td>Pharmaceuticals</td>
<td><a href="/wiki/Florham_Park,_New_Jersey" title="Florham Park, New Jersey">Florham Park, New Jersey</a></td>
<td>2013-06-21</td>
<td>0001555280</td>
</tr>
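To confirm programmatically that the parse reached the end of the table, you can inspect the last row; a minimal sketch continuing from the result variable above (the cell index just reflects this table's layout, where the security name is the second column):

rows = result.find_all('tr')
print(len(rows))                  # header row plus one row per company
last_cells = rows[-1].find_all('td')
print(last_cells[1].get_text())   # should print 'Zoetis'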
You will also need to install lxml for the parser (pip install lxml), and I used:
python==3.4.3
beautifulsoup4==4.4.1
#2
This is my working answer. It should work for you without even installing lxml. I used Python 2.7.
from bs4 import BeautifulSoup
import urllib2   # Python 2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urllib2.urlopen(wiki)
# The built-in html.parser needs no extra install.
soup = BeautifulSoup(page, "html.parser")
print soup.table   # the first <table> on the page
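If you want the row data rather than the raw markup, you can walk the same soup object; a minimal sketch continuing from the code above (Python 2 print syntax, and the cell index assumes the company name sits in the second column as in this table):

for row in soup.table.find_all('tr')[1:]:   # skip the header row
    cells = row.find_all('td')
    if cells:
        print cells[1].get_text()           # company name, ending with Zoetis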