Python Crawler Practice (3): Collecting Subdomains via Baidu

Not much to introduce here; it is the same old regex matching against the search results.

import re

import requests

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'
}
key = 'jcrcw.com'  # put the main domain here
lst = []  # unique matches, in the order they were found
# Captures the URL text shown under each result; depends on Baidu's page markup
match = 'style="text-decoration:none;">(.*?)</b>'

for i in range(1, 20):  # pages 1-19
    # pn is the result offset (10 results per page), not the page number
    url = "https://www.baidu.com/s?wd=inurl:{}&pn={}&oq={}&ie=utf-8".format(key, (i - 1) * 10, key)
    print(url)
    # Variant with cookies, if needed:
    # response = requests.get(url, headers=head, cookies=cook).content
    response = requests.get(url, headers=head, timeout=10).content
    subdomains = re.findall(match, response.decode())
    for j in subdomains:
        j = j.replace('<b>', '')  # strip the highlight tag Baidu wraps around the keyword
        if key in j and j not in lst:
            lst.append(j)

print(lst)
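The matching step can also be tried offline, without sending any requests to Baidu. The sketch below is only an illustration under stated assumptions: `extract_subdomains` is a hypothetical helper (not part of the script above), the `sample` HTML is made up to resemble the markup the regex expects, and `example.com` stands in for the real domain. It is slightly stricter than the original `key in j` check, since it drops the scheme and keeps only hostnames that end with the main domain.

```python
import re

# Same pattern as the script above; it relies on Baidu's result markup.
MATCH = re.compile(r'style="text-decoration:none;">(.*?)</b>')


def extract_subdomains(html, key):
    """Return unique hostnames ending in `key`, in first-seen order."""
    found = []
    for snippet in MATCH.findall(html):
        text = snippet.replace('<b>', '').lower()
        text = text.split('://', 1)[-1]        # drop "http://" or "https://"
        host = text.split('/', 1)[0].strip()   # drop any path after the host
        is_match = host == key or host.endswith('.' + key)
        if is_match and host not in found:
            found.append(host)
    return found


# Hypothetical HTML fragment, written to resemble what the regex targets.
sample = (
    '<a style="text-decoration:none;">bbs.<b>example.com</b>/thread/1</a>'
    '<a style="text-decoration:none;">https://www.<b>example.com</b></a>'
    '<a style="text-decoration:none;">bbs.<b>example.com</b>/thread/2</a>'
)

print(extract_subdomains(sample, 'example.com'))
# prints ['bbs.example.com', 'www.example.com']
```

Keeping the parsing in its own function makes it easy to check the regex against a saved result page before running the full crawl.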

Running the full script against Baidu prints the deduplicated list of matched subdomains.
