I am just starting out with Python/Scrapy.
I have written a spider that crawls a website and fetches information, but I am stuck in two places.
- I am trying to retrieve the telephone numbers from a page, and they are coded like this:
<span class="mrgn_right5">(+001) 44 42676000,</span> <span class="mrgn_right5">(+011) 44 42144100</span>
The code I have is:
getdata = soup.find(attrs={"class":"mrgn_right5"})
if getdata:
    aditem['Phone'] = getdata.get_text().strip()
    #print phone
But it fetches only the first number and not the second one. How can I fix this?
- On the same page there is another set of information.
I am using this code:
getdata = soup.find(attrs={"itemprop":"pricerange"})
if getdata:
    #print getdata
    aditem['Pricerange'] = getdata.get_text().strip()
    #print pricerange
But it is not fetching anything.
Any help on fixing these two would be great.
1 solution
According to the Beautiful Soup documentation, find returns only a single result. If multiple results are expected, use find_all instead. Since there are two matches here, find_all returns a list, so its elements need to be joined together (for example) before being assigned to the Phone field of your AdItem.
getdata = soup.find_all(attrs={"class":"mrgn_right5"})
if getdata:
    aditem['Phone'] = ''.join([x.get_text().strip() for x in getdata])
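To make the difference between find and find_all concrete, here is a minimal, self-contained sketch run against the two spans quoted in the question. The variable names (html, first, all_spans, phones, phone_field) are only illustrative, and stripping the trailing comma and joining with ", " is one possible choice, not something the original answer prescribes.

```python
from bs4 import BeautifulSoup

# The markup quoted in the question.
html = (
    '<span class="mrgn_right5">(+001) 44 42676000,</span> '
    '<span class="mrgn_right5">(+011) 44 42144100</span>'
)
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching element.
first = soup.find(attrs={"class": "mrgn_right5"})

# find_all() returns every matching element as a list-like result.
all_spans = soup.find_all(attrs={"class": "mrgn_right5"})

# Clean each number: drop surrounding whitespace and any trailing comma.
phones = [span.get_text().strip().rstrip(",") for span in all_spans]

# Join the cleaned numbers into a single string for the item field.
phone_field = ", ".join(phones)
print(phone_field)
```

Keeping the numbers in the phones list instead of joining them may be preferable if later pipeline stages need to treat each number separately.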
For the second issue, you need to read an attribute of the returned element rather than its text. Try the following:
getdata = soup.find(attrs={"itemprop":"pricerange"})
if getdata:
    aditem['Pricerange'] = getdata.attrs['content']
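The page markup for this element is not shown in the question, so the following sketch assumes a hypothetical meta tag that carries its value in a content attribute. Under that assumption it shows why get_text() comes back empty while the attribute lookup succeeds. The sample value "10 - 20" is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page is not shown in the question.
html = '<div><meta itemprop="pricerange" content="10 - 20"/></div>'
soup = BeautifulSoup(html, "html.parser")

price_tag = soup.find(attrs={"itemprop": "pricerange"})

# A meta element has no text children, so get_text() yields an empty string.
text_value = price_tag.get_text()

# Tag.get() returns None instead of raising KeyError when the attribute is absent.
price = price_tag.get("content") if price_tag else None
print(repr(text_value), repr(price))
```

Using get("content") rather than attrs['content'] is a small defensive choice: a page that omits the attribute then produces None instead of an exception inside the spider.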
As for the address information, the following code works but is very hacky, and could no doubt be improved by someone who understands Beautiful Soup better than I do.
getdata = soup.find(attrs={"itemprop":"address"})
address = getdata.span.get_text()
addressLocality = getdata.meta.attrs['content']
addressRegion = getdata.find(attrs={"itemprop":"addressRegion"}).attrs['content']
postalCode = getdata.find(attrs={"itemprop":"postalCode"}).attrs['content']
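One way the address lookups might be tidied up is to look every field up by its itemprop value and guard against missing elements. This is only a sketch: the markup below is hypothetical (the real page is not shown), and meta_content is a helper invented here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page, which is not shown.
html = """
<div itemprop="address">
  <span itemprop="streetAddress">12 Example Road</span>
  <meta itemprop="addressLocality" content="Exampleville"/>
  <meta itemprop="addressRegion" content="EX"/>
  <meta itemprop="postalCode" content="00000"/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def meta_content(parent, prop):
    """Return the content attribute of the element with this itemprop, or None."""
    tag = parent.find(attrs={"itemprop": prop}) if parent else None
    return tag.get("content") if tag else None


block = soup.find(attrs={"itemprop": "address"})
street = block.find(attrs={"itemprop": "streetAddress"}) if block else None

address = {
    "streetAddress": street.get_text().strip() if street else None,
    "addressLocality": meta_content(block, "addressLocality"),
    "addressRegion": meta_content(block, "addressRegion"),
    "postalCode": meta_content(block, "postalCode"),
    # A field that is absent from the markup simply comes back as None.
    "addressCountry": meta_content(block, "addressCountry"),
}
print(address)
```

Looking each field up by name avoids relying on element order, which is what getdata.span and getdata.meta implicitly do.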