Get the first image from HTML using Python/Django

Time: 2021-10-28 08:59:33

I am grabbing a bunch of html from a service and parsing it slightly. I am looking for a way to grab the link from the first image tag.

Something similar to this jQuery code:

var imagelink = $('img:first', feed.content).attr('src');

But of course using only Python/Django (the server runs on Google App Engine). I'd rather not use any other libraries just to grab a simple link.

3 solutions

#1


7  

You can use BeautifulSoup to do this:

http://www.crummy.com/software/BeautifulSoup/

It's an XML/HTML parser. You pass in the raw HTML, and then you can search it for particular tags, attributes, etc.

Something like this should work:

from bs4 import BeautifulSoup

tree = BeautifulSoup(raw_html, 'html.parser')
img_link = tree.find('img')['src']  # find() returns the first matching tag, not a list

#2


3  

This is exactly what I'm looking for. Actually, the real code is like this:

tree = BeautifulSoup(raw_html)
img_link = tree.find_all('img')[0].get('src')

Works great! Thanks, timmy-omahony.

#3


0  

If I do any more HTML parsing, I will probably look into one of the suggested libraries. But for now I have solved it like this:

    startImgPos = post.find('<img')
    if startImgPos > -1:  # test before adding any offset, otherwise a -1 "not found" is masked
        endImgPos = post.find('>', startImgPos)
        imageTag = post[startImgPos + 4:endImgPos]
        # assumes the tag has a double-quoted src attribute
        startSrcPos = imageTag.find('src="') + 5
        endSrcPos = imageTag.find('"', startSrcPos)
        linkTag = imageTag[startSrcPos:endSrcPos]
        r['linktag'] = linkTag

I'll improve this later, but for now it does the trick. Feel free to suggest any more ideas/improvements to the above code.
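
One possible improvement that still avoids third-party libraries is the HTML parser in the standard library. This is only a minimal sketch, assuming Python 3 (on Python 2 the module is called `HTMLParser`). The names `FirstImageParser` and `first_image_src` are hypothetical, made up for illustration:

```python
from html.parser import HTMLParser


class FirstImageParser(HTMLParser):
    """Records the src of the first <img> tag that carries a src attribute."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing tags such as <img ... /> also
        # end up here via the default handle_startendtag().
        if tag == 'img' and self.src is None:
            self.src = dict(attrs).get('src')


def first_image_src(html):
    """Return the src of the first <img> with a src in html, or None."""
    parser = FirstImageParser()
    parser.feed(html)
    return parser.src
```

With this in place, the snippet above could be reduced to something like `r['linktag'] = first_image_src(post)`, which also copes with single-quoted attributes and with posts that contain no image at all.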
