Scraping Maoyan Movies with Requests and Regular Expressions

Workflow

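Fetch a single page
Use requests to request the target site, get the HTML of a single page, and return it.
Parse with regular expressions
Extract each movie's title, lead actors, release date, score, and poster image URL from the HTML.
Save to a file
Write the results to a file, one JSON string per line, one line per movie.
Loop over pages with multiple processes
Iterate over all pages of the board and use a process pool to speed up the crawl.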

1. Fetching the board page source

import requests
from requests.exceptions import RequestException

def get_one_page(url):
    # Send a browser-like User-Agent instead of the requests default.
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    try:
        # A timeout keeps a stalled connection from hanging the crawler.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        # Network errors and timeouts are reported as "no page".
        return None

def main():
    url = 'http://maoyan.com/board/4?'
    html = get_one_page(url)
    print(html)

if __name__ == '__main__':
    main()
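
Because get_one_page returns None on any failure, a caller may want to try again before giving up. The following is a minimal sketch of that idea, not part of the original tutorial: fetch_with_retry, retries, and delay are hypothetical names, and the helper only assumes that the function passed in returns None when a request fails.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times and return the first non-None result.

    `fetch` is assumed to behave like get_one_page: it returns the page
    text on success and None on any failure.
    """
    for attempt in range(retries):
        html = fetch(url)
        if html is not None:
            return html
        # Wait before the next attempt, but not after the last one.
        if attempt < retries - 1:
            time.sleep(delay)
    return None
```

With the function from this section it would be called as fetch_with_retry(get_one_page, url). Passing the fetch function in as an argument keeps the retry logic independent of requests, so it can be exercised without touching the network.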

2. Parsing with regular expressions

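import re

def parse_one_page(html):
    # Raw strings keep \d from being read as a string escape;
    # re.S lets . match newlines, since each <dd> entry spans several lines.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character label before the names
            'time': item[4].strip()[5:],   # drop the 5-character label before the date
            'score': item[5] + item[6]     # integer part + fraction part
        }

def main():
    url = 'http://maoyan.com/board/4?'
    html = get_one_page(url)
    if html is None:
        # The request failed, so there is nothing to parse.
        return
    for item in parse_one_page(html):
        print(item)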

The output looks like this:

{'index': '1', 'image': 'https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'title': '霸王别姬', 'actor': '张国荣,张丰毅,巩俐', 'time': '1993-01-01', 'score': '9.6'}
{'index': '2', 'image': 'https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@160w_220h_1e_1c', 'title': '肖申克的救赎', 'actor': '蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿', 'time': '1994-10-14(美国)', 'score': '9.5'}
{'index': '3', 'image': 'https://p0.meituan.net/movie/54617769d96807e4d81804284ffe2a27239007.jpg@160w_220h_1e_1c', 'title': '罗马假日', 'actor': '格利高里·派克,奥黛丽·赫本,埃迪·艾伯特', 'time': '1953-09-02(美国)', 'score': '9.1'}
{'index': '4', 'image': 'https://p0.meituan.net/movie/e55ec5d18ccc83ba7db68caae54f165f95924.jpg@160w_220h_1e_1c', 'title': '这个杀手不太冷', 'actor': '让·雷诺,加里·奥德曼,娜塔莉·波特曼', 'time': '1994-09-14(法国)', 'score': '9.5'}
{'index': '5', 'image': 'https://p1.meituan.net/movie/f5a924f362f050881f2b8f82e852747c118515.jpg@160w_220h_1e_1c', 'title': '教父', 'actor': '马龙·白兰度,阿尔·帕西诺,詹姆斯·肯恩', 'time': '1972-03-24(美国)', 'score': '9.3'}
{'index': '6', 'image': 'https://p1.meituan.net/movie/0699ac97c82cf01638aa5023562d6134351277.jpg@160w_220h_1e_1c', 'title': '泰坦尼克号', 'actor': '莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩', 'time': '1998-04-03', 'score': '9.5'}
{'index': '7', 'image': 'https://p0.meituan.net/movie/da64660f82b98cdc1b8a3804e69609e041108.jpg@160w_220h_1e_1c', 'title': '唐伯虎点秋香', 'actor': '周星驰,巩俐,郑佩佩', 'time': '1993-07-01(中国香港)', 'score': '9.2'}
{'index': '8', 'image': 'https://p0.meituan.net/movie/b076ce63e9860ecf1ee9839badee5228329384.jpg@160w_220h_1e_1c', 'title': '千与千寻', 'actor': '柊瑠美,入野*,夏木真理', 'time': '2001-07-20(日本)', 'score': '9.3'}
{'index': '9', 'image': 'https://p0.meituan.net/movie/46c29a8b8d8424bdda7715e6fd779c66235684.jpg@160w_220h_1e_1c', 'title': '魂断蓝桥', 'actor': '费雯·丽,罗伯特·泰勒,露塞尔·沃特森', 'time': '1940-05-17(美国)', 'score': '9.2'}
{'index': '10', 'image': 'https://p0.meituan.net/movie/230e71d398e0c54730d58dc4bb6e4cca51662.jpg@160w_220h_1e_1c', 'title': '乱世佳人', 'actor': '费雯·丽,克拉克·盖博,奥利维娅·德哈维兰', 'time': '1939-12-15(美国)', 'score': '9.1'}
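
To see what each capture group picks up, the pattern can be run against a single hand-written entry. The sketch below is illustrative only: SAMPLE_HTML is a hypothetical fragment shaped to fit the pattern, not markup copied from the live site, and parse_sample is a made-up name. It also shows why the slices [3:] and [5:] are used: they cut the labels "主演：" and "上映时间：" off the front of the captured text.

```python
import re

# Hypothetical fragment, written by hand to fit the pattern (an assumption,
# not the site's real markup). The image URL and href are placeholders.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" alt="霸王别姬" class="board-img" />
  <p class="name"><a href="#" title="霸王别姬">霸王别姬</a></p>
  <p class="star">
    主演：张国荣,张丰毅,巩俐
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
'''

# Same pattern as in the tutorial, written with raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)


def parse_sample(html):
    """Return one dict per <dd> entry found in html."""
    movies = []
    for item in PATTERN.findall(html):
        movies.append({
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # removes "主演：" (3 characters)
            'time': item[4].strip()[5:],   # removes "上映时间：" (5 characters)
            'score': item[5] + item[6],    # "9." + "6"
        })
    return movies
```

Every group uses the lazy quantifier .*?, so each one stops at the first closing delimiter after it instead of running on into the next movie's entry.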

3. Saving to a file

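import json

def write_to_file(content):
    # Append mode: each movie becomes one line in result.txt.
    with open('result.txt', 'a', encoding='utf-8') as f:
        # ensure_ascii=False writes Chinese text as-is instead of \uXXXX escapes.
        f.write(json.dumps(content, ensure_ascii=False) + '\n')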
4. Multiprocessing

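from multiprocessing import Pool

if __name__ == '__main__':
    # main(offset) is the version defined in the complete code;
    # each worker process handles one page offset (0, 10, ..., 90).
    with Pool() as pool:
        pool.map(main, [i*10 for i in range(10)])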
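
Fetching pages is mostly waiting on the network, so a thread pool is a possible alternative to the process pool used here. The sketch below swaps in concurrent.futures.ThreadPoolExecutor from the standard library; it is not what the tutorial's code does. build_page_urls and crawl_all are hypothetical names, and the URL format is taken from the complete code.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'http://maoyan.com/board/4?offset='


def build_page_urls(pages=10, page_size=10):
    """Return the board URLs for offsets 0, 10, ..., (pages - 1) * page_size."""
    return [BASE_URL + str(i * page_size) for i in range(pages)]


def crawl_all(fetch, urls, max_workers=4):
    """Apply fetch to every URL using a pool of threads.

    executor.map returns results in the same order as urls,
    regardless of which request finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))
```

It could be used as crawl_all(get_one_page, build_page_urls()), after which each returned page is parsed and saved as before.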

Complete code:

import requests
from multiprocessing import Pool
from requests.exceptions import RequestException
import re
import json

def get_one_page(url):
    # Send a browser-like User-Agent instead of the requests default.
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    try:
        # A timeout keeps a stalled connection from hanging a worker.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def parse_one_page(html):
    # Raw strings keep \d from being read as a string escape;
    # re.S lets . match newlines, since each <dd> entry spans several lines.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character label before the names
            'time': item[4].strip()[5:],   # drop the 5-character label before the date
            'score': item[5] + item[6]     # integer part + fraction part
        }

def write_to_file(content):
    # Append one JSON string per movie; ensure_ascii=False keeps Chinese text readable.
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed, so skip this page.
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

if __name__ == '__main__':
    # Ten pages of ten movies each: offsets 0, 10, ..., 90.
    with Pool() as pool:
        pool.map(main, [i*10 for i in range(10)])
