I'm trying to parse the tables on a sports website into a list of dictionaries to render into a template. This is my first exposure to Selenium; I tried to read the Selenium documentation and wrote this program:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(len(soup.find_all("table")))
print(soup.find("table", {"class": "ratingstable"}))
browser.close()
browser.quit()
I'm getting 0 and None as output. How can I modify this to get all the values of the table and store them in a list of dictionaries? If you have any other questions, feel free to ask.
1 Answer
#1
First of all, avoid using time.sleep(). It goes against best practices; use an Explicit Wait instead.
If you inspect the table, you can see that it is located inside an <iframe> tag with name="testbat". So you'll have to switch to that frame in order to get the contents of the table. It can be done like this:
browser.switch_to.default_content()
browser.switch_to.frame('testbat')
After switching to the frame, use the Explicit Wait as mentioned above.
Complete code:
from bs4 import BeautifulSoup
from selenium import webdriver
# Add the following imports to your program
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)
browser.switch_to.default_content()
browser.switch_to.frame('testbat')
try:
    # Wait up to 10 seconds for the table to appear inside the frame
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'ratingstable'))
    )
except TimeoutException:
    pass  # Handle the timeout, e.g. log it and retry or exit
html = browser.find_element(By.CLASS_NAME, 'ratingstable').get_attribute('innerHTML')
soup = BeautifulSoup(html, "lxml")
You can check whether you've got the table:
>>> print('S.P.D. Smith' in html)
True
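Once you have the table's HTML in soup, you can turn it into the list of dictionaries you asked about by using the header row as keys. Here is a minimal sketch using a sample table standing in for the real one (the actual column names and row layout on the rankings page may differ, so adjust the selectors accordingly):

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the rankings table; the real page's
# columns and class names may differ.
sample = """
<table class="ratingstable">
  <tr><th>Rank</th><th>Player</th><th>Rating</th></tr>
  <tr><td>1</td><td>S.P.D. Smith</td><td>947</td></tr>
  <tr><td>2</td><td>V. Kohli</td><td>912</td></tr>
</table>
"""

def table_to_dicts(table):
    """Convert a <table> into a list of dicts keyed by the header row."""
    rows = table.find_all("tr")
    # First row is assumed to hold the column headers
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) == len(headers):  # skip spacer/colspan rows
            records.append(dict(zip(headers, cells)))
    return records

# Using the stdlib parser here; swap in "lxml" if you have it installed
soup = BeautifulSoup(sample, "html.parser")
data = table_to_dicts(soup.find("table", {"class": "ratingstable"}))
print(data)
```

In your script, pass the soup you built from the iframe's innerHTML instead of the sample string, and the resulting list of dicts can be handed straight to your template.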