Saving an image from a URL with Python requests: URL type error

Date: 2021-07-11 23:25:41

Using the following code:

    with open('newim','wb') as f:
        f.write(requests.get(repr(url)))

where the url is:

    url = ''

I get the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python33\lib\site-packages\requests\api.py", line 69, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Python33\lib\site-packages\requests\api.py", line 50, in request
        response = session.request(method=method, url=url, **kwargs)
      File "C:\Python33\lib\site-packages\requests\sessions.py", line 465, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Python33\lib\site-packages\requests\sessions.py", line 567, in send
        adapter = self.get_adapter(url=request.url)
      File "C:\Python33\lib\site-packages\requests\sessions.py", line 641, in get_adapter
        raise InvalidSchema("No connection adapters were found for '%s'" % url)

I have seen other posts describing what, at first glance, looks like a similar problem, but simply adding 'https://' or anything like that has not helped. I really want to avoid doing this with webdriver+AutoIt or something similar, because I have to repeat the exercise for thousands of images.

2 solutions

#1


This is an image encoded in base64. Quoting the URL below: "base64 equals to text (string) representation of the image itself".

Read this for a detailed explanation: http://www.stoimen.com/blog/2009/04/23/when-you-should-use-base64-for-images/

To use it you have to decode the base64 payload. Luckily, Stack Overflow already has an answer on how to do that:

Python base64 data decode
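
As a minimal sketch of that decoding step, assuming the string in url has the usual data:<mime>;base64,<payload> shape: everything after the first comma can be passed to base64.b64decode from the standard library and written straight to a file. The helper name data_uri_to_bytes and the sample payload are made up for illustration; the data URI from the question is not reproduced here.

```python
import base64


def data_uri_to_bytes(uri):
    """Return the raw bytes carried by a base64 data URI (hypothetical helper)."""
    header, sep, payload = uri.partition(',')
    if not header.startswith('data:') or not header.endswith(';base64') or not sep:
        raise ValueError('not a base64 data URI')
    return base64.b64decode(payload)


# Illustrative input: a data URI built from a few known bytes, not a real image.
sample_bytes = b'\x89PNG\r\n\x1a\n'
url = 'data:image/png;base64,' + base64.b64encode(sample_bytes).decode('ascii')

# Same file name and mode as in the question, but writing decoded bytes.
with open('newim', 'wb') as f:
    f.write(data_uri_to_bytes(url))
```

No network request is involved at any point: the image data is already inside the string.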

#2


There seems to be a misunderstanding about embedded images. The URL you posted is what your browser returns when you select 'View Image' or 'Copy Image Location' (or something similar, depending on the browser) from the context menu. Formally it is called a data URI.

It is not an HTTP URL pointing to an image, and you cannot use it to retrieve an image from any server. This is exactly what requests points out in the error message.
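
One way to guard against this, sketched here with only the standard library, is to look at the URL scheme before handing the string to requests. The function name is_data_uri and the sample strings are assumptions for illustration and are not part of requests.

```python
from urllib.parse import urlsplit


def is_data_uri(url):
    """True when the string uses the data: scheme instead of http(s)."""
    return urlsplit(url).scheme == 'data'


# Made-up examples: one embedded image, one ordinary page address.
for candidate in ('data:image/png;base64,iVBORw0KGgo=',
                  'http://server/dir/images.html'):
    kind = 'embedded data' if is_data_uri(candidate) else 'fetch over the network'
    print(candidate[:30], '->', kind)
```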


So, how do we get these images? The following script will handle this task:

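    import binascii as ba

    import requests
    from lxml import html

    url = "<Page URL goes here>"  # e.g. http://server/dir/images.html
    page = requests.get(url)
    struct = html.fromstring(page.text)
    images = struct.xpath('//img/@src')  # src attribute of every <img>

    i = 0
    for img in images:
        # Skip ordinary linked images; only base64 data URIs are handled here.
        if not img.startswith('data:image/') or 'base64,' not in img:
            continue
        i += 1
        # The image type (e.g. png) sits between 'data:image/' and ';'.
        ext = img.partition('data:image/')[2].split(';')[0]
        with open('newim' + str(i) + '.' + ext, 'wb') as f:
            # Decode the base64 payload that follows 'base64,' into raw bytes.
            f.write(ba.a2b_base64(img.partition('base64,')[2]))

    print("Done")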

To run it you will need to install the lxml library in addition to requests.


Here follows a short description of how the script functions:

First it requests the url from the server and, after it gets the server's response, it stores it in a Response object (page).

Then it utilizes html.fromstring() from lxml to transform the "textified" content of page into a tree-structure which can be processed by commands utilizing XPath syntax, like this one: images = struct.xpath('//img/@src').

The result is a list containing the contents of the src attribute of every image in the page. In this case (embedded images) these are the data URIs.
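
For readers without lxml, a similar list can be collected with the standard library's html.parser. This is a stand-in for the XPath query, not the method used in the script above; the class name ImgSrcCollector and the sample markup are assumptions.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src is not None:
                self.sources.append(src)


# Made-up page: one embedded image and one ordinary linked image.
sample_html = '''
<h1>Images</h1>
<img src="data:image/png;base64,AAAA" />
<br />
<img src="pictures/photo.jpg">
'''

collector = ImgSrcCollector()
collector.feed(sample_html)
images = collector.sources
print(images)
```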

Then, for every image in the list, it first gets the image type (which will be used as the newim's extension), using partition() and split() and stores it in ext. Then it converts the base64 encoded data to binary (using a2b_base64() from binascii module) and writes the output to the file.
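
The two string operations are easier to follow on a concrete value. The sketch below applies them to a made-up data URI (the payload is just the base64 encoding of the bytes b'hello', not a real image) and names each intermediate result.

```python
import binascii as ba

# Made-up data URI; the payload decodes to b'hello', not to a real image.
img = 'data:image/png;base64,aGVsbG8='

# partition() splits around the first occurrence of the separator and returns
# (before, separator, after); index 2 is everything after 'data:image/'.
after_prefix = img.partition('data:image/')[2]

# split(';')[0] keeps the part before the first ';', i.e. the image type.
ext = after_prefix.split(';')[0]

# Everything after 'base64,' is the encoded payload.
payload = img.partition('base64,')[2]

# a2b_base64() turns the base64 text back into the original bytes.
data = ba.a2b_base64(payload)

print(ext, payload, data)
```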


As a small demo, save this HTML code (as, e.g., images.html) somewhere on your server:

    <h1>Images</h1>
    <img src="" />
    <br />
    <img src=""></img>
    <br />
    <img src=""/>

and point to it in the script: requests.get("http://yourserver/somedir/images.html").

When you run the script you will get 3 images, named newim1.png, newim2.png and newim3.jpg respectively.


Note that this script, in its current form, handles only embedded images. To process ordinary linked images as well, you have to modify it accordingly (this is not difficult).
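
One possible shape for that modification, as a sketch rather than the answer's own code: decide for each src value whether it is a data URI or an ordinary link, decode the former, and turn the latter into an absolute URL that can then be downloaded. The function name classify_src and the sample values are assumptions; urljoin comes from the standard library.

```python
import binascii as ba
from urllib.parse import urljoin


def classify_src(src, page_url):
    """Return ('embedded', ext, bytes) or ('linked', absolute_url)."""
    if src.startswith('data:image/') and 'base64,' in src:
        ext = src.partition('data:image/')[2].split(';')[0]
        return ('embedded', ext, ba.a2b_base64(src.partition('base64,')[2]))
    # Relative links are resolved against the address of the page itself.
    return ('linked', urljoin(page_url, src))


# Made-up inputs: one embedded image, one relative link, one root-relative link.
page_url = 'http://yourserver/somedir/images.html'
for src in ('data:image/png;base64,aGVsbG8=', 'pics/photo.jpg', '/logo.png'):
    print(classify_src(src, page_url))
```

For a linked image, the bytes would then come from something like requests.get(absolute_url).content, written to a file opened in 'wb' mode as before.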
