How to parse table contents from a website using Selenium?

Date: 2022-11-29 22:48:44

I'm trying to parse the tables on a sports website into a list of dictionaries to render into a template. This is my first exposure to Selenium; I read the Selenium documentation and wrote this program:


from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()

browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

print(len(soup.find_all("table")))
print(soup.find("table", {"class": "ratingstable"}))

browser.close()
browser.quit()

I'm getting 0 and None as the values. How can I modify this to get all the values of the table and store them in a list of dictionaries? If you have any other questions, feel free to ask.


1 solution

#1

First of all, avoid using time.sleep(). It is against all best practices. Use an Explicit Wait.


If you inspect the table, you can see that it is located inside an <iframe> tag with name="testbat". So, you'll have to switch to that frame in order to get the contents of the table. It can be done like this:


browser.switch_to.default_content()
browser.switch_to.frame('testbat')

After switching the frame, use the Explicit Wait as mentioned above.


Complete code:

from bs4 import BeautifulSoup
from selenium import webdriver

# Add the following imports to your program
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)

browser.switch_to.default_content()
browser.switch_to.frame('testbat')

try:
    # Wait up to 10 seconds for the table to appear inside the frame
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'ratingstable')))
except TimeoutException:
    pass  # Handle the timeout (e.g. log it or retry) instead of silently continuing

# find_element_by_class_name was removed in Selenium 4; use find_element with By
html = browser.find_element(By.CLASS_NAME, 'ratingstable').get_attribute('innerHTML')
soup = BeautifulSoup(html, "lxml")

You can check whether you've got the table:


>>> print('S.P.D. Smith' in html)
True
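To get from the soup to the list of dictionaries the question asks for, you can take the header row as the keys and zip each data row against it. Here is a minimal sketch using an inline HTML sample; the real column names and row layout on the ESPN page may differ, so adjust the header/row selection to match what you see when you inspect the table.

```python
from bs4 import BeautifulSoup

# Inline sample mimicking the structure of a ranking table
# (hypothetical columns -- check the actual page for the real ones).
sample = """
<table class="ratingstable">
  <tr><th>Rank</th><th>Player</th><th>Rating</th></tr>
  <tr><td>1</td><td>S.P.D. Smith</td><td>947</td></tr>
  <tr><td>2</td><td>V. Kohli</td><td>912</td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
table = soup.find("table", {"class": "ratingstable"})

# First row supplies the dictionary keys, remaining rows the values
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(dict(zip(headers, cells)))

print(rows)
```

The resulting `rows` list can be passed straight to your template context. With the real page, feed the `innerHTML` string obtained above into `BeautifulSoup` instead of `sample`.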
