Web Page Parsing
1 Parsing with HTMLParser
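This section introduces a basic way to parse the HTML of a web page, using the html.parser module that ships with Python. The main steps are:
- Create a new parser class that inherits from HTMLParser;
- Override handler methods such as handle_starttag to implement the desired behavior;
- Instantiate the new parser and feed the HTML text to the instance.
Complete code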
from html.parser import HTMLParser

# An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered
# Subclass HTMLParser and override its methods to implement the desired behavior

class MyHTMLParser(HTMLParser):
    # attrs is the attributes set in HTML start tag
    def handle_starttag(self, tag, attrs):
        print('Encountered a start tag:', tag)
        for attr in attrs:
            print(' attr:', attr)

    def handle_endtag(self, tag):
        print('Encountered an end tag :', tag)

    def handle_data(self, data):
        print('Encountered some data :', data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>'
            '<img src="python-logo.png" alt="The Python logo">')
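The code first imports the module and derives a new parser class, then overrides the handler methods. When a start tag is encountered, the parser prints the tag and then prints each attribute defined on it, if any. End tags and text data are printed in the same way.
Note: the attrs argument of handle_starttag() is a list of tuples built from the attributes of the start tag. In each tuple, the first element is the attribute name and the second is the attribute value.
Output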
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Encountered a start tag: img
 attr: ('src', 'python-logo.png')
 attr: ('alt', 'The Python logo')
The output shows that the parser walked through the HTML text and printed the attributes contained in each tag.
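As a small illustration of the attrs structure noted earlier, the following sketch collects the attributes of every img tag into a dictionary. The class name ImgAttrParser and its images attribute are hypothetical names chosen for this example; only HTMLParser and its handle_starttag() hook come from the standard library.

```python
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Hypothetical helper: records the attributes of each <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict per <img> tag, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # attrs is a list of (name, value) tuples, so dict() converts it directly
            self.images.append(dict(attrs))


parser = ImgAttrParser()
parser.feed('<img src="python-logo.png" alt="The Python logo">')
print(parser.images)
# [{'src': 'python-logo.png', 'alt': 'The Python logo'}]
```

Converting to a dict is convenient when attributes are looked up by name, at the cost of keeping only the last value if a tag repeats an attribute.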
2 Parsing with BeautifulSoup
Next comes BeautifulSoup, a third-party package for parsing HTML pages, together with a comparison against HTMLParser.
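BeautifulSoup has to be installed first. The code below also imports html5lib and uses it as the tree builder, so install both:
pip install beautifulsoup4 html5lib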
Complete code
from html.parser import HTMLParser
from io import StringIO
from urllib import request

from bs4 import BeautifulSoup, SoupStrainer
from html5lib import parse, treebuilders


URLs = ('http://python.org',
        'http://www.baidu.com')

def output(x):
    print('\n'.join(sorted(set(x))))

def simple_beau_soup(url, f):
    'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
    # BeautifulSoup returns a BeautifulSoup instance
    # find_all function returns a bs4.element.ResultSet instance,
    # which contains bs4.element.Tag instances,
    # use tag['attr'] to get attribute of tag
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))

def faster_beau_soup(url, f):
    'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
    # Add find_all('a') function
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib', parse_only=SoupStrainer('a')).find_all('a'))

def htmlparser(url, f):
    'htmlparser() - use HTMLParser to parse anchor tags'
    class AnchorParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            if not hasattr(self, 'data'):
                self.data = []
            for attr in attrs:
                if attr[0] == 'href':
                    self.data.append(attr[1])
    parser = AnchorParser()
    parser.feed(f.read())
    output(request.urljoin(url, x) for x in parser.data)
    print('DONE')

def html5libparse(url, f):
    'html5libparse() - use html5lib to parse anchor tags'
    #output(request.urljoin(url, x.attributes['href']) for x in parse(f) if isinstance(x, treebuilders.etree.Element) and x.name == 'a')

def process(url, data):
    print('\n*** simple BeauSoupParser')
    simple_beau_soup(url, data)
    data.seek(0)
    print('\n*** faster BeauSoupParser')
    faster_beau_soup(url, data)
    data.seek(0)
    print('\n*** HTMLParser')
    htmlparser(url, data)
    data.seek(0)
    print('\n*** HTML5lib')
    html5libparse(url, data)
    data.seek(0)

if __name__=='__main__':
    for url in URLs:
        f = request.urlopen(url)
        data = StringIO(f.read().decode())
        f.close()
        process(url, data)
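Step-by-step explanation
First, import the required modules. StringIO, from the io module, provides an in-memory text buffer that can be read like a file.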
from html.parser import HTMLParser
from io import StringIO
from urllib import request

from bs4 import BeautifulSoup, SoupStrainer
from html5lib import parse, treebuilders


URLs = ('http://python.org',
        'http://www.baidu.com')
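Next, define an output function. It uses a set to remove duplicate entries, sorts what is left, and joins the items with newlines so that each one is printed on its own line.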
def output(x):
    print('\n'.join(sorted(set(x))))
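To make the effect of sorted(set(x)) concrete, the sketch below applies the same expression to a short hand-written list. The sample links and the name unique_links are made up for illustration. Note that the result comes out in alphabetical order, not in the order the links appear in the page.

```python
def output(x):
    print('\n'.join(sorted(set(x))))


# Made-up sample data containing one duplicate
links = ['http://python.org/about/',
         'http://python.org',
         'http://python.org/about/']

# The same expression that output() uses internally
unique_links = sorted(set(links))

output(links)
# http://python.org
# http://python.org/about/
```

Because output() only iterates over its argument once, it accepts a generator expression as well as a list, which is how the parsing functions call it.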
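Here a simple BeautifulSoup parsing function is defined. The HTML text and a features argument are passed to the BeautifulSoup class (recent versions warn when no parser is named explicitly; 'html5lib' is used here), which produces a BeautifulSoup instance. find_all('a') then returns every anchor tag as a bs4.element.ResultSet of bs4.element.Tag objects. The href attribute is read from each Tag, and urljoin builds the full link for output. urljoin is defined in urllib.parse; the code reaches it through urllib.request, which imports it.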
def simple_beau_soup(url, f):
    'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
    # BeautifulSoup returns a BeautifulSoup instance
    # find_all function returns a bs4.element.ResultSet instance,
    # which contains bs4.element.Tag instances,
    # use tag['attr'] to get attribute of tag
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))
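The role of urljoin is easier to see in isolation. This sketch resolves a few href values of the kinds that show up in the run output against the base URL http://python.org. It imports urljoin from urllib.parse, where it is documented; the variable names are chosen for this example only.

```python
from urllib.parse import urljoin

base = 'http://python.org'

# One href of each kind: site-relative path, fragment, absolute URL, non-HTTP scheme
hrefs = ('/about/', '#content', 'https://docs.python.org', 'javascript:;')

resolved = [urljoin(base, href) for href in hrefs]
for href, full in zip(hrefs, resolved):
    print(href, '->', full)
# /about/ -> http://python.org/about/
# #content -> http://python.org#content
# https://docs.python.org -> https://docs.python.org
# javascript:; -> javascript:;
```

Relative paths and fragments are attached to the base, while an href that already carries its own scheme is returned unchanged. That is why an entry such as javascript:; appears in the link lists.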
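Next, define a second parsing function. It passes parse_only to restrict parsing to anchor tags, which is meant to speed up parsing.
Note: there is a problem here. The html5lib tree builder does not support the parse_only argument, so the entire document is still parsed. This remains unresolved.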
def faster_beau_soup(url, f):
    'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
    # Add find_all('a') function
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib', parse_only=SoupStrainer('a')).find_all('a'))
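One possible way around the parse_only limitation is to switch tree builders. The BeautifulSoup documentation states that parse_only is not supported by html5lib but is honored by the other builders, such as Python's built-in 'html.parser'. The sketch below assumes that changing the builder is acceptable; the sample HTML is made up, and different builders may repair malformed markup differently, so results on messy real-world pages can differ from what html5lib produces.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample document with two anchors
html = ('<html><body><p>Intro <a href="/about/">About</a></p>'
        '<a href="#top">Top</a></body></html>')

# With the 'html.parser' builder, parse_only is applied while parsing,
# so only <a> tags (and their contents) end up in the tree
soup = BeautifulSoup(html, features='html.parser',
                     parse_only=SoupStrainer('a'))

hrefs = [a['href'] for a in soup.find_all('a')]
print(hrefs)
```

Whether this is actually faster on a given page is something to measure rather than assume.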
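Then define a function that parses with HTMLParser, used as in the previous section. An anchor parser class is created first. On each start tag it checks whether the tag is 'a'. If so, it checks whether the instance already has a data attribute and initializes it to an empty list if not, then iterates over attrs to collect the href value. Finally, an instance is created and fed the data.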
def htmlparser(url, f):
    'htmlparser() - use HTMLParser to parse anchor tags'
    class AnchorParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            if not hasattr(self, 'data'):
                self.data = []
            for attr in attrs:
                if attr[0] == 'href':
                    self.data.append(attr[1])
    parser = AnchorParser()
    parser.feed(f.read())
    output(request.urljoin(url, x) for x in parser.data)
    print('DONE')
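The hasattr() check above creates the data list lazily, which means parser.data does not exist at all if a page contains no anchor tags. A possible variant, shown here only as a sketch and not as the original author's approach, initializes the list in __init__ instead. The sample HTML strings are made up.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = []  # always defined, even if no <a> tag is ever seen

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # Each attribute arrives as a (name, value) tuple
        for name, value in attrs:
            if name == 'href':
                self.data.append(value)


empty = AnchorParser()
empty.feed('<p>No links here</p>')
print(empty.data)   # [] rather than an AttributeError

parser = AnchorParser()
parser.feed('<a href="/about/">About</a><a name="x">no href</a>')
print(parser.data)  # ['/about/']
```

Anchors without an href attribute are simply skipped, since the loop only appends when it finds one.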
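Finally, define a process function. Each parser reads the data buffer to the end, so seek(0) has to be called after every use to move the stream position back to the start.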
def process(url, data):
    print('\n*** simple BeauSoupParser')
    simple_beau_soup(url, data)
    data.seek(0)
    print('\n*** faster BeauSoupParser')
    faster_beau_soup(url, data)
    data.seek(0)
    print('\n*** HTMLParser')
    htmlparser(url, data)
    data.seek(0)
    print('\n*** HTML5lib')
    html5libparse(url, data)
    data.seek(0)
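Why seek(0) is needed can be checked with a few lines. In this sketch (the buffer content and variable names are made up), a second read() without rewinding returns an empty string, because the stream position is already at the end.

```python
from io import StringIO

data = StringIO('<a href="/about/">About</a>')

first = data.read()    # consumes the whole buffer
second = data.read()   # position is at the end, so nothing is left
data.seek(0)           # rewind to the beginning
third = data.read()    # the full content is available again

print(repr(second))    # ''
print(third == first)  # True
```

Without the rewind, every parser after the first one would be handed an empty document and report no links.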
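The main block fetches each URL, decodes the response into a StringIO buffer, and passes it to process(). The final result is every link found in the page.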
if __name__=='__main__':
    for url in URLs:
        f = request.urlopen(url)
        data = StringIO(f.read().decode())
        f.close()
        process(url, data)
Run output
For each site, the faster BeauSoupParser and HTMLParser sections printed exactly the same link list as the simple BeauSoupParser section, so the list is shown once per site and marked [same links as above] afterwards. The HTML5lib section prints nothing because the body of html5libparse() is commented out.

*** simple BeauSoupParser
http://blog.python.org
http://bottlepy.org
http://brochure.getpython.info/
http://buildbot.net/
http://docs.python.org/3/tutorial/
http://docs.python.org/3/tutorial/controlflow.html
http://docs.python.org/3/tutorial/controlflow.html#defining-functions
http://docs.python.org/3/tutorial/introduction.html#lists
http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
http://flask.pocoo.org/
http://ipython.org
http://jobs.python.org
http://pandas.pydata.org/
http://planetpython.org/
http://plus.google.com/+Python
http://pycon.blogspot.com/
http://pyfound.blogspot.com/
http://python.org
http://python.org#content
http://python.org#python-network
http://python.org#site-map
http://python.org#top
http://python.org/
http://python.org/about/
http://python.org/about/apps
http://python.org/about/apps/
http://python.org/about/gettingstarted/
http://python.org/about/help/
http://python.org/about/legal/
http://python.org/about/quotes/
http://python.org/about/success/
http://python.org/about/success/#arts
http://python.org/about/success/#business
http://python.org/about/success/#education
http://python.org/about/success/#engineering
http://python.org/about/success/#government
http://python.org/about/success/#scientific
http://python.org/about/success/#software-development
http://python.org/accounts/login/
http://python.org/accounts/signup/
http://python.org/blogs/
http://python.org/community/
http://python.org/community/awards
http://python.org/community/diversity/
http://python.org/community/forums/
http://python.org/community/irc/
http://python.org/community/lists/
http://python.org/community/logos/
http://python.org/community/merchandise/
http://python.org/community/sigs/
http://python.org/community/workshops/
http://python.org/dev/
http://python.org/dev/core-mentorship/
http://python.org/dev/peps/
http://python.org/dev/peps/peps.rss
http://python.org/doc/
http://python.org/doc/av
http://python.org/doc/essays/
http://python.org/download/alternatives
http://python.org/download/other/
http://python.org/downloads/
http://python.org/downloads/mac-osx/
http://python.org/downloads/release/python-2714/
http://python.org/downloads/release/python-364/
http://python.org/downloads/source/
http://python.org/downloads/windows/
http://python.org/events/
http://python.org/events/calendars/
http://python.org/events/python-events
http://python.org/events/python-events/543/
http://python.org/events/python-events/611/
http://python.org/events/python-events/past/
http://python.org/events/python-user-group/
http://python.org/events/python-user-group/605/
http://python.org/events/python-user-group/619/
http://python.org/events/python-user-group/620/
http://python.org/events/python-user-group/past/
http://python.org/jobs/
http://python.org/privacy/
http://python.org/psf-landing/
http://python.org/psf/
http://python.org/psf/donations/
http://python.org/psf/sponsorship/sponsors/
http://python.org/shell/
http://python.org/success-stories/
http://python.org/success-stories/industrial-light-magic-runs-python/
http://python.org/users/membership/
http://roundup.sourceforge.net/
http://tornadoweb.org
http://trac.edgewall.org/
http://twitter.com/ThePSF
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/TkInter
http://www.ansible.com
http://www.djangoproject.com/
http://www.facebook.com/pythonlang?fref=ts
http://www.pylonsproject.org/
http://www.riverbankcomputing.co.uk/software/pyqt/intro
http://www.saltstack.com
http://www.scipy.org
http://www.web2py.com/
http://www.wxpython.org/
https://bugs.python.org/
https://devguide.python.org/
https://docs.python.org
https://docs.python.org/3/license.html
https://docs.python.org/faq/
https://github.com/python/pythondotorg/issues
https://kivy.org/
https://mail.python.org/mailman/listinfo/python-dev
https://pypi.python.org/
https://status.python.org/
https://wiki.gnome.org/Projects/PyGObject
https://wiki.python.org/moin/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/Python2orPython3
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://wiki.qt.io/PySide
https://www.openstack.org
https://www.python.org/psf/codeofconduct/
javascript:;

*** faster BeauSoupParser
Warning (from warnings module):
  File "C:\Python35\lib\site-packages\bs4\builder\_html5lib.py", line 63
    warnings.warn("You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.")
UserWarning: You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.
[same links as above]

*** HTMLParser
[same links as above]
DONE

*** HTML5lib

*** simple BeauSoupParser
http://e.baidu.com/?refer=888
http://home.baidu.com
http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
http://ir.baidu.com
http://jianyi.baidu.com/
http://map.baidu.com
http://map.baidu.com/m?word=&fr=ps01000
http://music.baidu.com/search?fr=ps&ie=utf-8&key=
http://news.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com
http://tieba.baidu.com/f?kw=&fr=wwwt
http://v.baidu.com
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
http://www.baidu.com/
http://www.baidu.com/cache/sethelp/help.html
http://www.baidu.com/duty/
http://www.baidu.com/gaoji/preferences.html
http://www.baidu.com/more/
http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
http://www.hao123.com
http://xueshu.baidu.com
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
javascript:;

*** faster BeauSoupParser
[same links as above]

*** HTMLParser
[same links as above]
DONE

*** HTML5lib
References
Core Python Applications Programming, 3rd Edition (《Python 核心编程 第3版》)