How to extract the ASIN from an Amazon product page

Date: 2022-04-15 18:17:40

I have the following product page and I'm trying to get the ASIN from it (in this case ASIN=B014MHZ90M), but I have no idea how to extract it from the page.

I'm using Python 3.4, Scrapy, and the following code:

hxs = Selector(response)
product_name = "".join(hxs.xpath('//span[contains(@class,"a-text-ellipsis")]/a/text()').extract())
product_model = hxs.xpath('//body//div[@id="buybox_feature_div"]//form[@method="post"]/input[@id="ASIN"/text()').extract()

This way I don't get the required field (the ASIN).
1. What should I do to get the product model (ASIN)?

2. Is there a way to debug such code? I'm using PyCharm, but I could not use the debugger; I can only run the code, without seeing what happens there in "slow motion".

Thanks in advance, everyone.

5 answers

#1


3  

Looking at the Amazon page you linked, the ASIN appears in the "Product Details" section. In the Scrapy shell, the following XPath

response.xpath('//li[contains(.,"ASIN: ")]//text()').extract()

returns

[u'ASIN: ', u'B014MHZ90M']

For debugging XPath expressions I always use the Scrapy shell and Firebug for Firefox.
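The XPath in the question asks for text() of an input element, but an input element has no text node; its data lives in the value attribute, so with Scrapy the expression would end in /@value rather than /text(). Below is a minimal sketch of the same idea using only the standard library. It assumes the page really contains a hidden input with id="ASIN", as the question's XPath implies; the HTML fragment and the class name AsinInputParser are made up for illustration.

```python
from html.parser import HTMLParser


class AsinInputParser(HTMLParser):
    """Collects the value attribute of <input id="ASIN" ...> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.asin = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("id") == "ASIN":
            self.asin = attributes.get("value")


# Made-up fragment standing in for the real product page.
sample_html = """
<div id="buybox_feature_div">
  <form method="post">
    <input type="hidden" id="ASIN" name="ASIN" value="B014MHZ90M">
  </form>
</div>
"""

parser = AsinInputParser()
parser.feed(sample_html)
print(parser.asin)
```

If the real page names or places the input differently, only the condition in handle_starttag would need to change.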

#2


4  

You can extract B014MHZ90M from response.url:

response.url.split("/dp/")[1]

response.url.split("/dp/")[1] = B014MHZ90M

response.url.split("/dp/")[0] = http://www.amazon.com
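Splitting on "/dp/" works for the short URL above, but a product URL may carry extra path segments or a query string after the ASIN, and then index [1] returns those too. A slightly more defensive sketch follows, assuming the ASIN is always the path segment right after dp; the function name asin_from_dp_url and the longer sample URL are invented for illustration.

```python
from urllib.parse import urlsplit


def asin_from_dp_url(url):
    """Return the path segment that follows 'dp', or None if there is none."""
    # urlsplit keeps the query string and fragment out of .path.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if "dp" in segments:
        position = segments.index("dp")
        if position + 1 < len(segments):
            return segments[position + 1]
    return None


print(asin_from_dp_url("http://www.amazon.com/dp/B014MHZ90M"))
print(asin_from_dp_url("http://www.amazon.com/Some-Title/dp/B014MHZ90M/ref=sr_1_1?ie=UTF8"))
```

Inside a Scrapy callback the same function could be called with response.url.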

#3


0  

You can get that from the URL.

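import re
# Dots are escaped; the capture stops at the next "/" or "?", so no trailing slash is needed.
r = re.search(r'www\.amazon\.com/dp/([^/?]+)', response.url)
print(r.group(1))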

#4


0  

I use this:

re.match(r"https?://www\.amazon\.(\w+)(.*)/(dp|gp/product)/(?P<asin>\w+).*", url, flags=re.IGNORECASE)
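To read the result of a pattern like this, the match object's named group can be used. The sketch below reuses the same pattern with the dots escaped; the sample URLs are made up, and because re.match returns None when the URL does not fit, the result is checked before use.

```python
import re

ASIN_URL = re.compile(
    r"https?://www\.amazon\.(\w+)(.*)/(dp|gp/product)/(?P<asin>\w+).*",
    flags=re.IGNORECASE,
)

# Hypothetical URLs covering both the /dp/ and /gp/product/ forms.
sample_urls = [
    "https://www.amazon.com/dp/B014MHZ90M",
    "http://www.amazon.co.uk/gp/product/B014MHZ90M/ref=abc",
    "https://example.com/dp/B014MHZ90M",
]

results = []
for url in sample_urls:
    match = ASIN_URL.match(url)
    # match is None when the URL is not an Amazon product URL.
    results.append(match.group("asin") if match else None)

print(results)
```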

#5


0  

https://www.amazon.com/gp/seller/asin-upc-isbn-info.html


Amazon Standard Identification Numbers (ASINs) are unique blocks of 10 letters and/or numbers that identify items.

Your best option, and probably the easiest one, is to run a regex on the URL that looks for a 10-character string between two "/" characters.

'/\w{10}/'

You can then simply strip the "/" characters from the result.
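As a rough sketch of this approach: \w also matches lowercase letters and underscores, and '/\w{10}/' requires a trailing slash. The version below therefore narrows the character class to uppercase letters and digits and also accepts "?" or the end of the string after the ASIN. Both choices are assumptions about the URL shape, and find_asin and the sample URLs are made up for illustration.

```python
import re

# Ten uppercase letters/digits right after a "/", followed by "/", "?" or the end.
ASIN_SEGMENT = re.compile(r"/([A-Z0-9]{10})(?=[/?]|$)")


def find_asin(url):
    """Return the first 10-character segment that looks like an ASIN, or None."""
    match = ASIN_SEGMENT.search(url)
    # The capture group already excludes the surrounding slashes.
    return match.group(1) if match else None


print(find_asin("http://www.amazon.com/dp/B014MHZ90M"))
print(find_asin("http://www.amazon.com/Some-Title/dp/B014MHZ90M/ref=sr_1_1"))
print(find_asin("http://www.amazon.com/gp/help/customer"))
```

Because the slashes sit outside the capture group, there is nothing left to strip afterwards.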
