Combining the Scrapy crawler framework with BeautifulSoup

Date: 2022-12-26 20:40:36

① Install Scrapy
pip install scrapy
System packages it depends on: python-lxml python-dev libffi-dev
Create a project in the target directory:
$ scrapy startproject weather
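② Define the Item
An Item is the object that holds the attributes you want to save. It is defined in items.py.
An Item is a container for scraped data. It is used much like a Python dictionary, and it adds a safeguard: assigning to a field that was never declared (for example, because of a typo) raises an error instead of silently creating a new key.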

import scrapy


class WeatherItem(scrapy.Item):
    # Declare every field you want to scrape as a scrapy.Field().
    # The class is named WeatherItem so that it matches the import in the spider.
    name = scrapy.Field()

③ Write the spider

import scrapy
from bs4 import BeautifulSoup
from weather.items import WeatherItem


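class LocalSpider(scrapy.Spider):
    name = "myspider"
    # allowed_domains takes bare domain names, without a scheme or trailing slash
    allowed_domains = ["meizitu.com"]
    start_urls = ['http://www.meizitu.com/']

    def parse(self, response):
        html_doc = response.body
        # html_doc = html_doc.decode('utf-8')
        soup = BeautifulSoup(html_doc, 'lxml')
        item = WeatherItem()
        # find() returns a Tag, or None when no element has this id
        tag = soup.find(id='slider_name')
        item['name'] = tag.get_text(strip=True) if tag is not None else None
        return item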
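
The BeautifulSoup lookup used in parse() can be checked without running a crawl. The sketch below is illustrative only: the HTML snippet and the helper text_by_id are made up for this example, and the built-in "html.parser" backend is used in place of 'lxml' so that only bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up page fragment standing in for response.body (bytes, as Scrapy provides)
html_doc = b'<html><body><div id="slider_name"> Sample title </div></body></html>'
soup = BeautifulSoup(html_doc, "html.parser")


def text_by_id(soup, element_id):
    """Return the stripped text of the element with this id, or None if absent."""
    tag = soup.find(id=element_id)
    return tag.get_text(strip=True) if tag is not None else None


found = text_by_id(soup, "slider_name")
missing = text_by_id(soup, "no_such_id")
print(found, missing)
```

Because find() returns None when nothing matches, checking for None before calling get_text() keeps the spider from failing with an AttributeError on pages that lack the element.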