Scraping a Yahoo table for earnings dates with BeautifulSoup (bs4). My code works until I try to split the row data into cells. The exact error is:
ticker = cells[1].get_text() IndexError: list index out of range
I thought it was because the table cell contains an a href... but there is text in it too.
Ideally the format should look something like:
{'company': '2U Inc', 'ticker': 'TWOU', 'eps_est': '-0.04', 'time': 'after market close'}
How can I achieve something like the above output? What am I missing?
from urlparse import urljoin
from urllib2 import urlopen
import requests
from bs4 import BeautifulSoup
import MySQLdb
import re

# mysql portion
mydb = MySQLdb.connect(host='localhost',
                       user='####',
                       passwd='#####',
                       db='testdb')
cur = mydb.cursor()

#def store(company, ticker, eps_est, time):
#    cur.execute('INSERT IGNORE INTO EARN (company, ticker, eps_est, time) VALUES ( \"%s\", \"%s\", \"%s\", \"%s\")', (company, ticker, eps_est, time))
#    cur.connection.commit()

base_url = "https://biz.yahoo.com/research/earncal/today.html"
html = urlopen(base_url)
soup = BeautifulSoup(html.read().decode('utf-8'), "lxml")

table = soup.find_all('table')
rows = table[6].find_all('tr')

for row in rows[2:]:
    cells = row.find_all('td')
    company = cells[0].get_text()
    ticker = cells[1].get_text()
    eps_est = cells[2].get_text()
    time = cells[3].get_text()
    # store(company, ticker, eps_est, time)
    data = {
        'company': cells[0].get_text(),
        'ticker': cells[1].get_link('href'),
        'eps_est': cells[2].get_text(),
        'time': cells[3].get_text(),
    }
    print data
    print '\n'
1 Answer
Use the "dot-notation" to find elements inside other elements. Replace:
cells[1].get_link('href')
with:
cells[1].a.get_text()
which should be read as, and is equivalent to, cells[1].find("a").get_text().
And, you need to skip the last "empty" row as well:
for row in rows[2:-1]:
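Putting both fixes together, a minimal sketch of the corrected loop (assuming the same table index and the Python 2 code from the question) might look like:

for row in rows[2:-1]:                     # skip the two header rows and the trailing empty row
    cells = row.find_all('td')
    data = {
        'company': cells[0].get_text(),
        'ticker': cells[1].a.get_text(),   # dot-notation: text of the first <a> inside the cell
        'eps_est': cells[2].get_text(),
        'time': cells[3].get_text(),
    }
    print data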