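Estimated reading time: 15 minutes
Environment: Windows 7 + Selenium 2.53.6 + Python 2.7 + Firefox 45.2 (for setup details see http://www.cnblogs.com/yoyoketang/p/selenium.html)
Firefox 45.2 official download: http://ftp.mozilla.org/pub/firefox/releases/45.2.0esr/win64/en-US/
Pain point: A friend of my dad's recently published more than 20 articles on Jianshu and asked me to add a table of contents. Looking up every link by hand and then adding its title is tedious: 30-odd articles take over half an hour, and the links may change.
Solution: Jianshu supports Markdown, so it is enough to crawl the author's article list and generate a Markdown document from it.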
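First attempt: crawl the article list with urllib2
Steps:
1. Use urllib2 to open the page, sending the request with a browser-like User-Agent header
2. Match the href links with a regular expression, then build the full URLs with a list comprehension
3. Extract the titles with a regular expression
4. Generate the table of contents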
#coding=utf-8
import re
import urllib2


def getHtml(url):
    header = {"User-Agent":'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36'}
    request = urllib2.Request(url, headers=header)  # init user request with url and headers
    response = urllib2.urlopen(request)  # open url
    text = response.read()
    return text


def getTitleLink(html):
    # article ids taken from the href of each title link
    pattern1 = re.compile(r'<a class="title" target="_blank" href="/p/(\w{0,12})"', re.S)
    links = re.findall(pattern1, html)
    # include the scheme so the Markdown links are absolute
    urls = ["http://www.jianshu.com/p/" + str(link) for link in links]
    # title text of the same links
    pattern2 = re.compile(r'<a class="title" target="_blank" href="/p/.*?">(.*?)</a>', re.S)
    titles = re.findall(pattern2, html)
    for title, url in zip(titles, urls):
        if '目录' not in title:  # skip posts that are themselves a table of contents
            print "[" + title + "](" + url + ")"
    #return urls


# sample test menu
url = 'http://www.jianshu.com/u/73632348f37a'
html = getHtml(url)
getTitleLink(html)
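As an aside on steps 2 to 4: matching links and titles with two separate patterns and pairing them with zip relies on both lists lining up. A minimal sketch of one alternative is shown below, assuming the anchors keep the attribute order used in the patterns above. It captures the href and the title in a single pattern and builds absolute URLs with urljoin. The names build_menu and TITLE_LINK and the sample markup are hypothetical and exist only for illustration; nothing here contacts Jianshu.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# One pattern captures the href and the title of the same <a> tag.
TITLE_LINK = re.compile(
    r'<a class="title" target="_blank" href="(/p/\w+)">(.*?)</a>', re.S)


def build_menu(html, base_url="http://www.jianshu.com", skip="目录"):
    """Return Markdown link lines for every title anchor found in html."""
    lines = []
    for href, title in TITLE_LINK.findall(html):
        title = title.strip()
        if skip in title:
            continue  # leave out posts that are themselves a table of contents
        lines.append("[{0}]({1})".format(title, urljoin(base_url, href)))
    return lines


# Hypothetical markup, shaped like the anchors the pattern expects.
sample = (
    '<a class="title" target="_blank" href="/p/aaa111">First post</a>\n'
    '<a class="title" target="_blank" href="/p/bbb222">目录</a>\n'
    '<a class="title" target="_blank" href="/p/ccc333">Second post</a>\n'
)
for line in build_menu(sample):
    print(line)
```

Because each title comes from the same match as its link, a title can no longer be paired with the wrong URL if one of the two patterns misses an anchor.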
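Testing the urllib2 version showed that when the author has only five or six articles, the table of contents is generated correctly.
With 20 or more articles, however, a problem appears:
This approach only picks up the article links present in the page as first loaded. The titles that are loaded dynamically when the scroll bar is dragged down by hand cannot be fetched directly. The usual advice online is to use Selenium for this.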
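Second attempt: open the page with Selenium and call JavaScript to scroll down until the whole page is loaded
Steps:
1. Open the page with Selenium
2. Call JavaScript in a loop to pull the scroll bar down until the whole page is loaded
3. Find the title tags with find_elements_by_xpath
4. Parse the title tags, write them to the table of contents, and print them
Note: step 3 returns WebElement objects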
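Because step 3 yields WebElement objects rather than strings, the title has to be read from the element's text property and the link from get_attribute("href"). The sketch below illustrates that conversion without a browser. FakeElement is a hypothetical stand-in that exposes only those two members, and to_markdown is an illustrative name, not part of Selenium.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


class FakeElement(object):
    """Hypothetical stand-in exposing the two WebElement members used here."""

    def __init__(self, text, href):
        self.text = text
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


def to_markdown(elements, skip="目录"):
    """Turn title elements into Markdown link lines, skipping TOC posts."""
    lines = []
    for element in elements:
        # same substitution as the script in this post: "|" becomes "丨"
        title = element.text.strip().replace("|", "丨")
        if skip in title:
            continue
        lines.append("[{0}]({1})".format(title, element.get_attribute("href")))
    return lines


elements = [
    FakeElement("Intro | part 1", "http://www.jianshu.com/p/aaa111"),
    FakeElement("目录", "http://www.jianshu.com/p/bbb222"),
]
print("\n".join(to_markdown(elements)))
```

With a real driver, the list returned by find_elements_by_xpath could be passed to to_markdown in place of the fake elements.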
#coding=utf-8
#refer to http://www.cnblogs.com/haigege/p/5492177.html
#Step1: scroll and generate Markdown format Menu
from selenium import webdriver
import time


# scroll to the top
def scroll_top(driver):
    if driver.name == "chrome":
        js = "var q=document.body.scrollTop=0"
    else:
        js = "var q=document.documentElement.scrollTop=0"
    return driver.execute_script(js)


# scroll to the bottom
def scroll_foot(driver):
    if driver.name == "chrome":
        js = "var q=document.body.scrollTop=100000"
    else:
        js = "var q=document.documentElement.scrollTop=100000"
    return driver.execute_script(js)


def write_text(filename, info):
    """
    :param filename: path of the text file to append to
    :param info: text to write to the file
    :return: none
    """
    # create/open the file in append mode and write the content,
    # followed by a blank line
    with open(filename, 'a+') as fp:
        fp.write(info.encode('utf-8'))
        fp.write('\n'.encode('utf-8'))
        fp.write('\n'.encode('utf-8'))


def scroll_multi(driver, times=5, loopsleep=2):
    #40 titles about 3 times
    for i in range(times):
        time.sleep(loopsleep)
        print "Scroll foot %s time..." % i
        scroll_foot(driver)
        time.sleep(loopsleep)


#Note: titles is titles_WebElement type object
def write_menu(filename, titles):
    # truncate the output file before appending to it
    with open(filename, 'w') as fp:
        pass
    for title in titles:
        # title.text is already unicode, so compare against a unicode literal
        if u'目录' not in title.text:
            href = title.get_attribute("href")
            print "[" + title.text + "](" + href + ")"
            # chain the replacements on one variable so none of them is lost
            # as printed, the first one maps ":" to itself (presumably a full-width colon originally)
            t = title.text.replace(u":", u":")
            t = t.replace(u"|", u"丨")
            write_text(filename, "[" + t + "](" + href + ")")
            #assert type(title) == "WebElement"
            #print type(title)


def main(url, filename):
    # eg. <a class="title" href="/p/6f543f43aaec" target="_blank"> titleXXX</a>
    driver = webdriver.Firefox()
    driver.implicitly_wait(10)
    # driver.maximize_window()
    driver.get(url)
    scroll_multi(driver)
    # the second XPath branch, a[target="_blank"], lacked the "@" and matched nothing
    titles = driver.find_elements_by_xpath('.//a[@class="title"]')
    write_menu(filename, titles)


if __name__ == '__main__':
    # sample link
    url = 'http://www.jianshu.com/u/73632348f37a'
    filename = r'info.txt'
    main(url, filename)
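The script above scrolls a fixed number of times, which has to be tuned by hand as the article count grows. One possible variation, sketched here under the assumption that the page height grows whenever more titles are loaded, is to keep scrolling until the height stops changing. scroll_until_stable and FakeDriver are hypothetical names. FakeDriver only imitates execute_script so that the loop can be run without a browser.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    height_js = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(height_js)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(pause)  # give the lazy loader time to append more titles
        new_height = driver.execute_script(height_js)
        if new_height == last_height:
            return rounds  # nothing new was loaded
        last_height = new_height
    return max_rounds


class FakeDriver(object):
    """Hypothetical stand-in: the page grows twice, then stays the same."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # a scroll request: the next height reading reflects any new content
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


print(scroll_until_stable(FakeDriver(), pause=0))
```

On a slow connection the height may not have changed yet when it is re-read, so the pause still needs to be generous enough. max_rounds caps the loop in case the page never settles.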
Notes:
1. Reference: http://www.cnblogs.com/haigege/p/5492177.html
2. Environment download: Firefox 45: https://ftp.mozilla.org/pub/firefox/releases/45.0esr/win64/en-US/
3. If an encoding error is reported, add
import sys
reload(sys)
sys.setdefaultencoding('utf8')
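An alternative that avoids changing the interpreter-wide default encoding is to keep titles as unicode text and let the file object do the encoding. The following is a minimal sketch using io.open, which accepts an encoding argument on both Python 2.7 and Python 3. The helper name append_line and the temporary file are assumptions made for the demonstration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import io
import os
import tempfile


def append_line(filename, text):
    """Append text plus a blank line, always encoded as UTF-8."""
    with io.open(filename, "a", encoding="utf-8") as fp:
        fp.write(text + "\n\n")


# Demo on a temporary file; the file name is arbitrary.
path = os.path.join(tempfile.mkdtemp(), "info.txt")
append_line(path, "[目录示例](http://www.jianshu.com/p/aaa111)")
with io.open(path, encoding="utf-8") as fp:
    content = fp.read()
```

With this approach the text passed in must already be unicode, which is what WebElement text is on Python 2.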