Python crawler: using regular expressions to remove unwanted page elements

When writing a crawler, we usually do not want to keep a page's comments or some of its other markup. Is there a good way to get rid of them?

For example, consider the following cases.

Case 1
<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E6%9E%97%E7%BB%8D%E5%91%A8"target="_blank">林绍周（明）</a>辑</td>

Desired result: 林绍周（明）辑

Case 2

<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E9%92%9F%E6%83%BA"target="_blank">钟惺（明）</a><ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E8%B0%AD%E5%85%83%E6%98%A5"target="_blank">谭元春</a>辑</td>

Desired result: 钟惺（明）谭元春辑

Case 3

<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E8%90%A7%E5%A8%B4"target="_blank">萧娴（1902～1997）</a></td>

Desired result: 萧娴（1902～1997）

All three cases can be handled with the regex sub() method, which deletes the unwanted tags and leaves only the text:

import re

newline = """<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E6%96%87%E7%83%BA"target="_blank">文烺</a><ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E6%9D%8E%E9%93%A0"target="_blank">李铠</a>等</td>"""

# Match each opening <ahref=... target="_blank"> tag, whatever URL it contains
re_comment = re.compile('<ahref=[^>]*target="_blank">')

# Delete the opening tags; the text and the closing tags remain
newlines = re_comment.sub('', newline)

# Turn each </a> into a space and drop the trailing </td>
print(newlines.replace('</a>', ' ').replace('</td>', ''))

The output is:

C:\Python27\python.exe C:/Users/xuchunlin/PycharmProjects/A9_25/haiwai__guanwang/0/qq.py

文烺 李铠 等

Process finished with exit code 0
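
The same idea can be wrapped in one helper and checked against the three cases listed earlier. The following is only a minimal sketch, not the code from the run above: the names TAG_RE and strip_links are made up for illustration, and it replaces </a> with nothing rather than a space so that the results match the desired output given for each case.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical pattern (not from the original post): one alternation that
# matches an opening <ahref=...> tag, a closing </a>, or a closing </td>.
TAG_RE = re.compile(r'<ahref=[^>]*>|</a>|</td>')


def strip_links(cell):
    """Return the visible text of a table-cell fragment (hypothetical helper)."""
    return TAG_RE.sub('', cell).strip()


# The three fragments from the cases above, copied as they appear on the page.
cases = [
    '<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E6%9E%97%E7%BB%8D%E5%91%A8"target="_blank">林绍周（明）</a>辑</td>',
    '<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E9%92%9F%E6%83%BA"target="_blank">钟惺（明）</a>'
    '<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E8%B0%AD%E5%85%83%E6%98%A5"target="_blank">谭元春</a>辑</td>',
    '<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E8%90%A7%E5%A8%B4"target="_blank">萧娴（1902～1997）</a></td>',
]

for cell in cases:
    print(strip_links(cell))
```

To keep a separator between names, as in the run above, the closing </a> could be replaced with a space in a separate step instead of being deleted.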
