Scrapy Basics, Introduction to Crawlers: Running a Few Crawlers with urllib2 First

1. Scraping Qiushibaike

Overview: Qiushibaike serves plain HTML pages, so the HTML text can be fetched directly and then filtered with regular expressions.

Scraping Qiushibaike requires sending browser identification along with each request, i.e. a User-Agent header.
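As a minimal sketch of what attaching this header looks like, the snippet below builds a request object without sending it, so it runs offline. It assumes Python 3, where urllib2 was merged into urllib.request; the helper name build_request is hypothetical, and the User-Agent string is the one used in this article.

```python
from urllib.request import Request  # Python 3 counterpart of urllib2.Request

# Same User-Agent string as in this article.
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36')


def build_request(page):
    """Hypothetical helper: build (but do not send) a request for one hot page."""
    url = 'http://www.qiushibaike.com/hot/page/' + str(page)
    return Request(url, headers={'User-Agent': USER_AGENT})


req = build_request(1)
print(req.full_url)
# urllib normalizes header names with str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))
```

Inspecting the request object this way is a quick check that the header is set before any traffic goes out.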

import urllib2, re

def pachong(page):    # "pachong" is pinyin for "crawler"
    url = "http://www.qiushibaike.com/hot/page/" + str(page)    # URL of the page to fetch
    # User-Agent string; you can look up your own in the browser developer tools (F12)
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    headers = {'User-Agent': user_agent}    # put the User-Agent into the request headers
    try:
        request = urllib2.Request(url, headers=headers)    # build a request that carries the headers
        response = urllib2.urlopen(request)    # send the request and receive the server's response
        content = response.read().decode('utf-8')    # the page is UTF-8; without decoding, the text comes out garbled
        # Raw string and a non-greedy group, so that each <span> is matched separately
        pattern = re.compile(r'<span>\s*(.*?)\s*</span>')
        items = re.findall(pattern, content)    # findall returns a list of all captured strings
        for i in items:
            haveimg = re.search('img', i)    # skip entries that contain an image
            if not haveimg:
                print i, '\n'
    except Exception as e:
        print e

if __name__ == '__main__':
    for i in range(1, 3):    # fetch pages 1 and 2
        pachong(i)
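The regex filtering step can be tried without any network access. The following is a sketch in Python 3 run against made-up sample markup; SAMPLE_HTML and the helper name extract_text_items are illustrative assumptions, not the site's actual page structure.

```python
import re

# Made-up sample markup, only loosely modeled on a joke listing page.
SAMPLE_HTML = """
<div class="content">
<span>
First joke text
</span>
</div>
<div class="content">
<span><img src="pic.jpg"/></span>
</div>
<div class="content">
<span>Second joke text</span>
</div>
"""

# Non-greedy group; \s* on both sides strips the newlines around the text.
SPAN_PATTERN = re.compile(r'<span>\s*(.*?)\s*</span>')


def extract_text_items(html):
    """Return the text of each <span>, skipping entries that contain an image."""
    items = SPAN_PATTERN.findall(html)
    return [item for item in items if not re.search('img', item)]


print(extract_text_items(SAMPLE_HTML))
```

Because the pattern does not use re.S, the captured text has to sit on a single line; entries whose text spans several lines would need a different pattern.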
