File name: Goose Python page extraction
File size: 58KB
File format: GZ
Last updated: 2017-12-21 07:27:57
page extraction HTML python
Some users want to use Goose for Chinese content. Chinese is considerably harder to segment into words than Western languages, because written Chinese does not separate words with spaces. Chinese therefore needs a dedicated stop-word analyser, which must be passed to the config object:

    >>> from goose import Goose
    >>> from goose.text import StopWordsChinese
    >>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
    >>> g = Goose({'stopwords_class': StopWordsChinese})
    >>> article = g.extract(url=url)
    >>> print article.cleaned_text[:150]
    香港行政长官梁振英在各方压力下就其大宅的违章建筑(僭建)问题到立法会接受质询,并向香港民众道歉。
    梁振英在星期二(12月10日)的答问大会开始之际在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的意图和动机。
    一些亲北京阵营*欢迎梁振英道歉,且认为应能获得香港民众接受,但这些*也质问梁振英有
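
To illustrate why a dedicated stop-word analyser is needed, here is a minimal sketch in Python 3 (the Goose example above is Python 2). It is not Goose's implementation: the two counting functions and both tiny stop-word sets are hypothetical, and the character scan is only a crude stand-in for real word segmentation. It shows that splitting on whitespace finds stop words in English but none in Chinese.

```python
# Hypothetical, tiny stop-word sets used only for this illustration.
STOP_WORDS_EN = {"the", "in", "of", "and", "to"}
STOP_WORDS_ZH = {"的", "在", "并", "但", "他"}


def count_stopwords_whitespace(text, stop_words):
    """Count stop words after splitting the text on whitespace."""
    tokens = text.lower().split()
    return sum(1 for token in tokens if token in stop_words)


def count_stopwords_scan(text, stop_words):
    """Count stop words with a greedy longest-match scan over the text.

    This is a crude substitute for word segmentation: it ignores word
    boundaries, so it can miscount when a stop word is part of a longer word.
    """
    max_len = max(len(word) for word in stop_words)
    count = 0
    i = 0
    while i < len(text):
        for size in range(max_len, 0, -1):
            if text[i:i + size] in stop_words:
                count += 1
                i += size
                break
        else:
            # No stop word starts at this position; move one character on.
            i += 1
    return count


english = "The cat sat in the garden"
chinese = "梁振英在星期二的答问大会开始之际在其演说中道歉"

en_whitespace = count_stopwords_whitespace(english, STOP_WORDS_EN)
zh_whitespace = count_stopwords_whitespace(chinese, STOP_WORDS_ZH)
zh_scan = count_stopwords_scan(chinese, STOP_WORDS_ZH)

print(en_whitespace, zh_whitespace, zh_scan)
```

The Chinese sentence contains no spaces, so the whitespace splitter sees it as a single token and matches nothing. A real analyser would segment the text into words before counting rather than scanning characters.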
[File preview]:
goose-extractor-1.0.22
----MANIFEST.in(83B)
----PKG-INFO(11KB)
----README.rst(8KB)
----goose()
--------cleaners.py(10KB)
--------images()
--------text.py(6KB)
--------parsers.py(7KB)
--------extractors.py(18KB)
--------configuration.py(4KB)
--------resources()
--------__init__.py(3KB)
--------article.py(3KB)
--------videos()
--------outputformatters.py(5KB)
--------crawler.py(6KB)
--------utils()
--------network.py(2KB)
--------version.py(938B)
----goose_extractor.egg-info()
--------PKG-INFO(11KB)
--------requires.txt(46B)
--------not-zip-safe(1B)
--------SOURCES.txt(2KB)
--------top_level.txt(12B)
--------dependency_links.txt(1B)
----tests()
--------parsers.py(10KB)
--------extractors.py(14KB)
--------configuration.py(1KB)
--------__init__.py(928B)
--------article.py(1KB)
--------images.py(7KB)
--------network.py(2KB)
--------videos.py(3KB)
--------base.py(3KB)
----setup.cfg(59B)
----setup.py(2KB)