https://study.163.com/course/courseLearn.htm?courseId=1005913008#/learn/text?lessonId=1053258283&courseId=1005913008
把豆瓣“即将上映”的电影爬出来。
很简陋的代码,现在的知识只能写这样了,做个记录吧。以后有能力再完善。
1 # coding=gbk 2 # 豆瓣即将上映的电影 3 # https://movie.douban.com/cinema/later/changsha/爬取这个页面的信息吧 4 5 import requests 6 from bs4 import BeautifulSoup 7 import json 8 9 10 def get_page(): 11 url = 'https://movie.douban.com/cinema/later/changsha/' 12 headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"} 13 response = requests.get(url, headers, verify=False) 14 text = response.text 15 return text 16 17 18 def get_data(text): 19 movies = [] 20 soup = BeautifulSoup(text, 'lxml') 21 # class="item mod " class="item mod odd"有两个都是的。 22 # 第一种思路,直接分开成两个来处理 23 divList1 = soup.find_all('div', attrs={'class': "item mod "}) 24 divList2 = soup.find_all('div', attrs={'class': "item mod odd"}) 25 # 合并两个list 26 divList = divList1+divList2 27 ''' 28 <div class="item mod "> 29 <a class="thumb" href="https://movie.douban.com/subject/26394152/"> 30 <img class="" src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2541662397.jpg"/> 31 </a> 32 <div class="intro"> 33 <h3> 34 <a class="" href="https://movie.douban.com/subject/26394152/">大黄蜂</a> 35 <span class="icon"></span> 36 </h3> 37 <li class="dt">01月04日</li> 38 <li class="dt">动作 / 科幻 / 冒险</li> 39 <li class="dt">美国</li> 40 </div>''' 41 for div in divList: 42 movie = {} 43 thumb = div.find('img')['src'] 44 # print(thumb) 45 # 获取所有a标签,其中第二个a标签里是电影名 46 title = div.find_all('a')[1].text 47 # print(title) 48 # 有些电影里没有类型这列,所有要考虑下,不能直接赋值类型 49 li_tag = div.find_all('li') 50 if len(li_tag)==4: 51 date = li_tag[0].text 52 type = li_tag[1].text 53 area = li_tag[2].text 54 want = li_tag[3].text 55 else: 56 date = li_tag[0].text 57 type = '' 58 area = li_tag[1].text 59 want = li_tag[2].text 60 61 # print(date) 62 movie['title'] = title 63 movie['thumb'] = thumb 64 movie['date'] = date 65 movie['type'] = type 66 movie['area'] = area 67 movie['want'] = want 68 movies.append(movie) 69 return movies 70 71 72 def save_text(data): 73 with open('douban2.json', 'w', encoding='utf-8') as fp: 74 json.dump(data, fp, ensure_ascii=False) 75 76 77 if __name__ == '__main__': 78 text = get_page() 79 data = get_data(text) 80 save_text(data)
两个li标签的区别
或者以后可以参考https://www.jianshu.com/p/c64fe2a20bc9,这个代码写的可以借鉴。