I'm fetching web pages with curl and storing the result in a variable in Python.
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
I just want the links from the string, for example:
"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg"
I tried matching with a regular expression, defining the start of the pattern as "(https|http) and the end as ":
x = re.findall(r'"(https|http)*"$', var)
print(x)
But I'm not getting the expected output; it prints an empty list. Please help me with this, thanks in advance.
>>>[]
3 solutions
#1
1
@Manoj, you can also retrieve the value of the src attribute by calling the split() method multiple times, as follows.
» Using a lambda function (one-line statement)
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
get_url = lambda html: html.split('=')[1].split('\"')[1]
print(get_url(var))
» Output
https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
Let's expand the above approach into multiple statements to see what each step produces.
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
print(var, "\n")
# <body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">
parts1 = var.split("=")
print(parts1, "\n")
# ['<body><img src', '"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style', '"display: none;"/><div class', '"wrapper">']
parts2 = parts1[1].split('\"')
print(parts2, "\n")
# ['', 'https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg', ' style']
print(parts2[1])
# https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
» Output
E:\Users\Rishikesh\Python3\Practice\Temp>python GetUrls.py
<body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">
['<body><img src', '"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style', '"display: none;"/><div class', '"wrapper">']
['', 'https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg', ' style']
https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
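Note that split('=')[1] only works while src is the first attribute containing an equals sign in the string. As a rough sketch of the same split() idea applied to more than one image, the version below splits on the literal text src=" instead. The function name src_values and the example.com URLs are made up for illustration, and it assumes double-quoted attributes with no spaces around the equals sign.

```python
def src_values(html):
    """Return the value of every src="..." attribute, using only str.split()."""
    # Everything before the first 'src="' is discarded; each remaining chunk
    # starts with an attribute value, which ends at the next double quote.
    chunks = html.split('src="')[1:]
    return [chunk.split('"', 1)[0] for chunk in chunks]


# Hypothetical input with two images (URLs are placeholders).
html = '<body><img src="https://example.com/a b.jpg" style="display: none;"/><img src="https://example.com/c.png"/></body>'
print(src_values(html))
# ['https://example.com/a b.jpg', 'https://example.com/c.png']
```

This is still string slicing rather than HTML parsing, so it is only suitable for markup whose shape you already know.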
#2
2
Using re.search
import re
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
# Capture everything between src=" and the next double quote (non-greedy).
m = re.search(r'src="(?P<url>.*?)"', var)
if m:
    print(m.group('url'))
Output:
https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
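As for why the pattern in the question returns an empty list: (https|http)* only repeats the literal words "https" and "http", so it cannot consume the rest of the URL, and $ requires the closing quote to sit at the very end of the string. If you want every quoted link rather than just the first src, a sketch along these lines with re.findall may help. The example.com URLs are placeholders, and the pattern assumes double-quoted values that start with http:// or https://.

```python
import re

# Hypothetical input: one image and one anchor (URLs are placeholders).
html = '<body><img src="https://example.com/a b.jpg" style="display: none;"/><a href="http://example.com/page">x</a></body>'

# Match a double quote, capture http:// or https:// followed by any run of
# non-quote characters, then require the closing double quote.
links = re.findall(r'"(https?://[^"]*)"', html)
print(links)
# ['https://example.com/a b.jpg', 'http://example.com/page']
```

Because the pattern has a single capture group, re.findall returns just the captured URLs without the surrounding quotes.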
#3
0
Using BeautifulSoup, you could search for a or img tags and check their attributes:
For example:
from bs4 import BeautifulSoup as soup
var = '<body><a href=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\"><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/></a><div class=\"wrapper\">'
page_soup = soup(var, "html.parser")
links = []
# Collect href from <a> tags and src from <img> tags, in document order.
for elm in page_soup.find_all(['a', 'img']):
    if elm.has_attr('href'):
        links.append(elm.get('href'))
    if elm.has_attr('src'):
        links.append(elm.get('src'))
print(links)
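If installing bs4 is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not part of BeautifulSoup; the class name LinkCollector and the example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img .../> also end up here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag in ("a", "img"):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)


# Hypothetical input mirroring the structure used above (URLs are placeholders).
html = '<body><a href="https://example.com/page"><img src="https://example.com/a b.jpg" style="display: none;"/></a></body>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
# ['https://example.com/page', 'https://example.com/a b.jpg']
```

Unlike the split() and regex approaches, a parser does not depend on attribute order or on which quote character the page uses.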