Sorry if the title was vague. I'm trying to scrape the number of XKCD web comics on a regular basis. http://xkcd.com/ always shows the newest comic on the front page, along with a line further down the page saying:
Permanent link to this comic: http://xkcd.com/1520/
Where 1520 is the number of the newest comic on display. I want to scrape this number, but I can't find a good way to do so. Currently all my attempts look really hackish, like:
soup = BeautifulSoup(urllib.urlopen('http://xkcd.com/').read())
test = soup.find_all('div')[7].get_text().split()[20][-5:-1]
I mean, that technically works, but if anything on the website moves even slightly it could break horribly. I know there has to be a better way to search for http://xkcd.com/####/ within a section of the front page and return just the ####, but I can't seem to find it. The "Permanent link to this comic: http://xkcd.com/1520/" line seems to be floating around without any kind of tag, class, or ID. Can anyone offer any assistance?
1 solution
#1
Usually I insist on using HTML parsers. Here, since we are looking for specific text in the HTML (not checking any tags), it is pretty much okay to apply a regular expression search for:
Permanent link to this comic: http://xkcd\.com/(\d+)/
saving the digits in a capture group.
Demo:
>>> import re
>>> import requests
>>>
>>> data = requests.get("http://xkcd.com/").text
>>> pattern = re.compile(r'Permanent link to this comic: http://xkcd\.com/(\d+)/')
>>> print(pattern.search(data).group(1))
1520
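If you want something reusable, the same idea can be wrapped in a small function. This is only a sketch, not a definitive implementation: the name latest_comic_number and the sample string are made up for illustration, and the optional https scheme, optional <a> tag, and optional trailing slash are guesses at how the markup might vary rather than anything confirmed about the page. It works on a string of HTML you have already downloaded, so it can be tried without hitting the site.

```python
import re

# Assumption: the page contains the sentence quoted in the question.
# The optional <a ...> tag, https scheme, "www." prefix, and trailing
# slash are guesses at possible markup variations, not confirmed facts.
PERMALINK_RE = re.compile(
    r"Permanent link to this comic:\s*(?:<a[^>]*>)?\s*"
    r"https?://(?:www\.)?xkcd\.com/(\d+)/?"
)


def latest_comic_number(html):
    """Return the comic number from the permalink line, or None if absent."""
    match = PERMALINK_RE.search(html)
    return int(match.group(1)) if match else None


# Hypothetical sample built from the line quoted in the question.
sample = "<br />Permanent link to this comic: http://xkcd.com/1520/<br />"
print(latest_comic_number(sample))            # 1520
print(latest_comic_number("<p>no link</p>"))  # None
```

Returning None instead of raising when the line is missing means a layout change shows up as a value you can check for, rather than an AttributeError from calling .group() on a failed search.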