MiniCrawler
Github Path :
https://github.com/LixinZhang/miniCrowler
Introduction:
- MiniCrawler is a simple web crawler implemented by Python.
Threadpool tech is used to speed up fetching pages.
One can config the crawler through modify the file
config.py
. And start the crawling job usingpython run.py
.- The webs pages fetched will be stored in
pages
folder. -
check_status.py
helps you check the job's status as following:
Rank Hostname Times
----------------------------------------
1 buaa.edu.cn 40
2 baixing.com 32
3 cnblogs.com 29
4 hao123.com 5
5 xinhuanet.com 2
6 visionplaza.cn 2
7 people.com.cn 2
8 org.cn 2
9 news.cn 2
10 most.gov.cn 2
More Detail
You can find more detail in my Chinese Blog. Python 多线程抓取网页