I have an issue with this. I am not sure how to go about extracting a single image URL from an img tag. For example:
<img srcset="http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s180/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 180w, http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s390/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 390w, http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s458/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 458w" src="http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s615/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg">
As you can see above, the tag lists several alternative images, but I am trying to scrape just one of them to display.
import urllib.request

import bs4 as bs

# Download and parse the news listing page
sauce = urllib.request.urlopen('http://www.manchestereveningnews.co.uk/news/greater-manchester-news').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

link = soup.link

# My attempt at pulling one URL out of srcset. This doesn't work and I'm not sure how to write it:
# image = re.search(img 'srcset=img(.*?),)

title = soup.find('h1', class_='publication-font')
image = soup.find('img')
strong = soup.find('strong')
location = soup.find('em').find('a')
description = soup.find('div', class_='description')

print("H1:", title.text)
print("Article Link:", link)
print("Image Url:\n", image)  # prints the whole <img> tag, not a single URL
print("1st Paragraph:\n", strong.text)
print("2nd Paragraph:\n", description.string)
print("Location:\n", location.text)
My code is above. On my previous attempt, the output was:
Greater Manchester News
<link href="rss.xml" rel="alternate" title="Default home feed" type="application/rss+xml"/>
<img data-src="http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s615/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg" data-srcset="http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s180/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 180w, http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s390/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 390w, http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s458/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 458w"/>
Family of dad stabbed in the neck while defending his fiancée from thugs speak of their heartbreak
Mike Grimshaw, 34, died after being stabbed in the neck outside his home in Trafford last Thursday
Trafford
The output shows multiple image URLs, but I only want to show a single image link. How do I go about doing this?
Any ideas would be much appreciated.
1 solution
#1
You can access the data-src or data-srcset attribute to get the image you want:
image = soup.find('img')
single_img = image.get('data-src')  # returns the main image link
or
import re
image = soup.find('img')
img_string = image.get('data-srcset')  # returns a string that you have to parse
img_set = re.findall(r'(https?://[^\s]+)', img_string)  # regex that matches only the links
Then you can access whichever index you want in img_set (just check the length of the list first).
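As an alternative to indexing into a regex result, the srcset string can be split into (url, width) pairs so that one image is picked by its declared width. The following is only a minimal sketch: the helper names parse_srcset and largest_image and the example.com URLs are made up for illustration, and it assumes the URLs themselves contain no commas (true for the tag in the question, but not guaranteed in general).

```python
def parse_srcset(srcset):
    """Split a srcset string into (url, width) pairs; width is None if not declared."""
    candidates = []
    for entry in srcset.split(','):
        parts = entry.split()
        if not parts:
            continue  # skip empty entries, e.g. from an empty string or trailing comma
        url = parts[0]
        width = None
        # A width descriptor looks like "390w"; anything else is ignored here
        if len(parts) > 1 and parts[1].endswith('w') and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        candidates.append((url, width))
    return candidates


def largest_image(srcset):
    """Return the URL with the largest declared width, or None if there are no entries."""
    candidates = parse_srcset(srcset)
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1] or 0)[0]


# Hypothetical srcset value, shaped like the one in the question
example = ("http://example.com/s180/photo.jpg 180w, "
           "http://example.com/s390/photo.jpg 390w, "
           "http://example.com/s458/photo.jpg 458w")
print(largest_image(example))
```

With the soup from the question, this would presumably be called as largest_image(image.get('data-srcset') or ''), so that a missing attribute yields None instead of raising an error.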