Open a page programmatically in Python

Time: 2022-03-31 22:40:17

Can you extract the VIN number from this webpage?

I tried urllib2.build_opener, requests, and mechanize. I also set a User-Agent header, but none of them could see the VIN.

import urllib2  # Python 2 module, as in the original attempt
from bs4 import BeautifulSoup

# link holds the URL of the vehicle details page
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                                    'AppleWebKit/535.1 (KHTML, like Gecko) '
                                    'Chrome/13.0.782.13 Safari/535.1')]
page = opener.open(link)
soup = BeautifulSoup(page)

table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.contents[0]
print vin

3 solutions

#1


5  

You can use browser automation tools for the purpose.

For example, this simple Selenium script can do the job.

from selenium import webdriver
from bs4 import BeautifulSoup

link = "https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0"
browser = webdriver.Firefox()  # launches a real browser, so the page's JavaScript runs
browser.get(link)
page = browser.page_source  # HTML as rendered by the browser

soup = BeautifulSoup(page, "html.parser")

table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.span.contents[0]  # first child of the <span> inside the <dd>
print(vin)

BTW, table.contents[0] prints the entire span, including the span tags.

table.span.contents[0] prints only the VIN.
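
If you want to check the dd/span lookup without starting a browser, here is a minimal sketch that uses only the standard library's html.parser in place of BeautifulSoup. The SAMPLE_HTML markup, the SpanTextInDd class and the extract_vin helper are my own assumptions for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup, shaped like the <dd class="tip_vehicleStats"> element
# the script above searches for. The real page may look different.
SAMPLE_HTML = (
    '<dl><dt>VIN</dt>'
    '<dd class="tip_vehicleStats"><span id="vin">4JGBF7BE9BA648275</span></dd>'
    '</dl>'
)


class SpanTextInDd(HTMLParser):
    """Collect the text of <span> elements nested in <dd class=target_class>."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.in_dd = False
        self.in_span = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "dd" and self.target_class in classes:
            self.in_dd = True
        elif tag == "span" and self.in_dd:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False
        elif tag == "dd":
            self.in_dd = False

    def handle_data(self, data):
        # Keep only text that sits inside a matching <dd><span>...</span></dd>
        if self.in_span and data.strip():
            self.texts.append(data.strip())


def extract_vin(html):
    """Return the first matching span text, or None if nothing matches."""
    parser = SpanTextInDd("tip_vehicleStats")
    parser.feed(html)
    parser.close()
    return parser.texts[0] if parser.texts else None


print(extract_vin(SAMPLE_HTML))
```

With Selenium you would pass browser.page_source to extract_vin instead of the sample string.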

#2


7  

That page loads and displays much of its information with JavaScript (probably through Ajax calls), most likely as a direct protection against scraping. To scrape it you therefore have three options: drive a browser that runs JavaScript and control it remotely; write the scraper itself in JavaScript; or take the site apart, work out exactly what it loads with JavaScript and how, and see whether you can duplicate those calls.

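For the last option, a rough sketch of what "duplicating the calls" could look like with only the standard library. Everything specific here is an assumption: AJAX_URL is a made-up placeholder (take the real one from your browser's network tab), the reply is assumed to be JSON, and build_ajax_request, find_key and fetch_field are hypothetical helper names, not anything the site provides.

```python
import json
import urllib.request

# Hypothetical endpoint: replace it with whatever URL the browser's network
# tab shows the page requesting. It is NOT a real API of the site.
AJAX_URL = "https://example.com/hypothetical/vehicle-data?itemID=15775059"


def build_ajax_request(url, referer):
    """Build a request that resembles a browser's background (Ajax) call."""
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": referer,
        "X-Requested-With": "XMLHttpRequest",  # commonly sent by Ajax libraries
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def find_key(data, key):
    """Depth-first search for `key` anywhere in decoded JSON; None if absent."""
    if isinstance(data, dict):
        if key in data:
            return data[key]
        children = data.values()
    elif isinstance(data, list):
        children = data
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def fetch_field(url, referer, key):
    """Perform the call and pull one field out of the (assumed) JSON reply."""
    request = build_ajax_request(url, referer)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return find_key(json.loads(resp.read().decode("utf-8")), key)


# Offline demonstration on a made-up payload (no network access needed).
sample = json.loads('{"vehicle": {"details": [{"VIN": "4JGBF7BE9BA648275"}]}}')
print(find_key(sample, "VIN"))
```

Whether this works at all depends on the site: the call may need cookies or tokens that only a real browser session has, in which case a browser automation tool is the simpler route.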

#3


2  

You could use Selenium, which drives a real browser. This works for me:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# See: http://*.com/questions/20242794/open-a-page-programatically-in-python
browser = webdriver.Firefox()  # Start a local Firefox session
browser.get("https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0")  # Load page

time.sleep(0.5)  # Let the page load

# Search for a "span" tag whose "id" attribute contains "ctl00_ContentPlaceHolder1_VINc_VINLabel"
# (Selenium 4 syntax; older releases spelled this browser.find_element_by_xpath(...))
e = browser.find_element(By.XPATH, "//span[contains(@id,'ctl00_ContentPlaceHolder1_VINc_VINLabel')]")
print(e.text)
# Works for me: 4JGBF7BE9BA648275

browser.close()
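
One caveat: the fixed time.sleep(0.5) is a guess, and on a slow connection the element may not exist yet. Polling until the value shows up, with a deadline, is usually more robust. Below is a small standard-library sketch of that idea; wait_until and fake_vin_probe are names I made up for illustration. Selenium ships its own version of this pattern as WebDriverWait in selenium.webdriver.support.ui, which is probably what you would use in practice.

```python
import time


def wait_until(probe, timeout=10.0, interval=0.25):
    """Call probe() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value; raises TimeoutError if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a page that fills the field in on the third look. With Selenium
# the probe would instead look the element up and return its text.
attempts = {"count": 0}


def fake_vin_probe():
    attempts["count"] += 1
    return "4JGBF7BE9BA648275" if attempts["count"] >= 3 else ""


print(wait_until(fake_vin_probe, timeout=2.0, interval=0.01))
```

The helper returns as soon as the probe succeeds, so a fast page costs almost nothing and a slow page gets the full timeout.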
