Scraping movie data from 1905.com (1905电影网) with Scrapy
I originally planned to scrape movie data directly from Douban, but my spider kept hitting 403 errors. It turned out to be Douban's anti-crawling measures; even after adding request headers the errors persisted, so I gave up and switched to 1905.com instead.
The data is for building a movie knowledge graph. Note that 1905.com has far less data than Douban, especially when it comes to user reviews, so if you need comprehensive data you are better off sticking with Douban.
Straight to the code. First, items.py:
import scrapy

class Movie1905Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # movie title
    movie_name = scrapy.Field()
    # rating
    rating = scrapy.Field()
    # poster (unused)
    # post = scrapy.Field()
    # release date
    date = scrapy.Field()
    # genre
    genre = scrapy.Field()
    # runtime
    time = scrapy.Field()
    # director
    director = scrapy.Field()
    # plot synopsis
    story = scrapy.Field()
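Every field extracted with `.extract()` comes back as a list of raw strings, often padded with whitespace. A minimal item-pipeline sketch (a hypothetical pipelines.py, not part of the original post) that flattens each field to a single stripped string could look like this:

```python
class CleanMoviePipeline:
    """Flatten each list-valued field to a single stripped string."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, list):
                # join the extracted text nodes and trim surrounding whitespace
                item[key] = ' '.join(v.strip() for v in value).strip()
        return item
```

To enable it, register the class in settings.py, e.g. `ITEM_PIPELINES = {'pymovie.pipelines.CleanMoviePipeline': 300}` (the module path assumes the project is named pymovie, as in the import below).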
Next, create 1908movie.py under the spiders directory:
from scrapy import Request
from scrapy.spiders import Spider
from pymovie.items import Movie1905Item

class movie1908(Spider):
    name = '1908movies_china'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5',
    }

    def start_requests(self):
        basic_url = 'http://www.1905.com/mdb/film/list/country-China/o0d0p%s.html'
        # list pages are numbered from 1 (o0d0p1.html), so start at 1, not 0
        start, end = 1, 220
        for i in range(start, end + 1):
            yield Request(basic_url % i, headers=self.headers)

    def parse(self, response):
        # collect the links to each film's detail page on the list page
        urls = response.xpath('.//ul[@class="inqList pt18"]/li/a/@href').extract()
        for url in urls:
            url = "http://www.1905.com" + url
            yield Request(url, callback=self.parse_movie, headers=self.headers)

    def parse_movie(self, response):
        item = Movie1905Item()
        imovie = response.xpath('//div[@class="body"]')
        item['movie_name'] = imovie.xpath('.//div[@class="container containerTop"]/div[2]/h1/text()').extract()
        item['rating'] = imovie.xpath('.//div[@class="container containerTop"]/div[2]/h1/span[@class="score"]/b/text()').extract()
        item['date'] = imovie.xpath('.//div[@class="container containerTop"]/div[2]/div[1]/span[1]/text()').extract()
        item['genre'] = imovie.xpath('.//div[@class="container containerTop"]/div[2]/div[1]/span[2]/a[1]/text()').extract()
        item['time'] = imovie.xpath('.//div[@class="container containerTop"]/div[2]/div[1]/span[4]/text()').extract()
        item['director'] = imovie.xpath('.//div[@class="container containerTop"]/div[2]/div[2]/a[1]/@title').extract()
        item['story'] = imovie.xpath('.//div[@class="container containerMain"]/div[1]/section/div/p/text()').extract()
        yield item
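The pagination logic in start_requests is just string interpolation over the page number, which can be verified as a standalone sketch:

```python
basic_url = 'http://www.1905.com/mdb/film/list/country-China/o0d0p%s.html'

# build the first three list-page URLs; pages are numbered from 1
urls = [basic_url % i for i in range(1, 4)]
```

Using `basic_url % i` is cleaner than calling `str.replace("%s", str(i))` on the template string, and produces the same URLs.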
Finally, open a terminal in the directory containing scrapy.cfg and run:
scrapy crawl 1908movies_china -o movie.csv
The resulting movie.csv is shown in the screenshot.
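To sanity-check the export, the CSV can be read back with Python's csv module. The column names follow the item fields; the sample row below is made up purely for illustration:

```python
import csv
import io

# a tiny sample in the same shape as Scrapy's CSV feed export
sample = io.StringIO(
    'movie_name,rating,date,genre,time,director,story\r\n'
    'Hero,8.1,2002,Action,99min,Zhang Yimou,A tale of assassins.\r\n'
)

rows = list(csv.DictReader(sample))
```

Reading the real movie.csv works the same way with `open('movie.csv', encoding='utf-8')` in place of the StringIO.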
Reference blog posts:
http://www.2cto.com/kf/201604/501764.html
http://www.cnblogs.com/mrchige/p/6481194.html