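useragentstring.com lists nearly every User-Agent string in existence. I have just learned Scrapy, so I decided to practice on this site and scrape its user-agent strings.
This post only scrapes the common browsers: Firefox, Chrome, Opera, Safari, and Internet Explorer.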
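I. Create the project
1. Create the Scrapy project useragent
$ scrapy startproject useragent
2. Enter the project directory
$ cd useragent
3. Generate the spider file ua
This step is optional, but it saves writing the spider skeleton by hand.
$ scrapy genspider ua useragentstring.com
II. Edit the item file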
# useragent\items.py
import scrapy


class UseragentItem(scrapy.Item):
    # define the fields for your item here like:
    ua_name = scrapy.Field()
    ua_string = scrapy.Field()
III. Edit the spider file
# useragent\spiders\ua.py
import scrapy

from useragent.items import UseragentItem


class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = (
        'http://www.useragentstring.com/pages/useragentstring.php?name=Firefox',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Opera',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Safari',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Chrome',
    )

    def parse(self, response):
        # The browser name is whatever follows the last '=' in the URL
        ua_name = response.url.split('=')[-1]
        # Each user-agent string is the text of a link inside a list item
        for ua_string in response.xpath('//li/a/text()').extract():
            item = UseragentItem()
            item['ua_name'] = ua_name
            item['ua_string'] = ua_string.strip()
            yield item
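One detail of taking the text after the last '=': for the Internet Explorer URL the value still contains the URL-encoded plus sign (Internet+Explorer). The following is a minimal sketch of decoding it with the standard library's urllib.parse instead. The helper name browser_name_from_url is hypothetical and is not part of the original spider.

```python
from urllib.parse import urlparse, parse_qs


def browser_name_from_url(url):
    """Return the decoded value of the `name` query parameter, or None."""
    # parse_qs decodes '+' and percent-escapes and maps each key to a list
    query = parse_qs(urlparse(url).query)
    values = query.get("name")
    return values[0] if values else None


if __name__ == "__main__":
    url = ("http://www.useragentstring.com/pages/useragentstring.php"
           "?name=Internet+Explorer")
    print(browser_name_from_url(url))
```

Inside parse, this could replace the split call as browser_name_from_url(response.url), if a decoded name is preferred in the output.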
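IV. Run the spider
The -o option tells Scrapy to write the scraped items to a JSON file:
$ scrapy crawl ua -o item.json
The result is shown in the screenshot:
That is not the result I wanted. Note the robots.txt entry in the output. My guess is that the site blocks crawlers.
Whether or not the guess is right, let's imitate a browser first by adding headers to every request: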
# useragent\spiders\ua.py
import scrapy

from useragent.items import UseragentItem


class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = (
        'http://www.useragentstring.com/pages/useragentstring.php?name=Firefox',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Opera',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Safari',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Chrome',
    )

    # Called by Scrapy to produce the initial requests, before any are sent
    def start_requests(self):
        # Present a browser User-Agent on every request
        headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"}
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers=headers)

    def parse(self, response):
        # The browser name is whatever follows the last '=' in the URL
        ua_name = response.url.split('=')[-1]
        # Each user-agent string is the text of a link inside a list item
        for ua_string in response.xpath('//li/a/text()').extract():
            item = UseragentItem()
            item['ua_name'] = ua_name
            item['ua_string'] = ua_string.strip()
            yield item
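As an alternative to building headers in start_requests, Scrapy also reads a project-wide USER_AGENT setting. The fragment below is a sketch of how that line might look in the generated settings file. It reuses the same browser string as the spider above, and it is an assumption that the setting alone would be enough for this site.

```python
# useragent\settings.py (sketch; only the line relevant here)
# Assumption: the same browser string used in the spider's headers.
USER_AGENT = (
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
    "AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
)
```

With this in place the spider can rely on start_urls alone, since every request the project sends carries the configured User-Agent.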
Run it again, and it works!
The result is shown below:
Now I will never run short of User-Agent strings.
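To actually use the scraped strings, one option is to load item.json and pick one at random. This is a minimal sketch, assuming the file is the JSON array produced by the crawl command above, with the ua_name and ua_string fields defined in the item. The function names load_user_agents and pick_user_agent are hypothetical.

```python
import json
import random


def load_user_agents(path):
    """Read the JSON array written by `scrapy crawl ua -o item.json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pick_user_agent(items, name=None):
    """Pick a random ua_string, optionally restricted to one ua_name."""
    candidates = [
        item["ua_string"]
        for item in items
        if name is None or item["ua_name"] == name
    ]
    if not candidates:
        raise ValueError("no user-agent strings for %r" % name)
    return random.choice(candidates)
```

For example, pick_user_agent(load_user_agents("item.json"), name="Firefox") would return one of the Firefox strings. The name must match the ua_name value exactly as the spider stored it.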