proxies must be a mapping

Date: 2022-02-13 20:24:21

I get this error:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    proxy = urllib2.ProxyHandler(line)
  File "/usr/lib/python2.7/urllib2.py", line 713, in __init__
    assert hasattr(proxies, 'has_key'), "proxies must be a mapping"
AssertionError: proxies must be a mapping

when I run the following script:

import urllib2  
u=open('urls.txt')
p=open('proxies.txt')
for line in p:
    proxy = urllib2.ProxyHandler(line)
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)
    for url in u:
        urllib.urlopen(url).read()

u.close()
p.close()

my urls.txt file has this:

'www.google.com'
'www.facebook.com'
'www.reddit.com'

and my proxies.txt has this:

{'https': 'https://94.142.27.4:3128'}
{'http': 'http://118.97.95.174:8080'}
{'http':'http://66.62.236.15:8080'}

I found them at hidemyass.com

From the googling I have done, most people who have had this problem had their proxies formatted wrong. Is that the case here?

1 Answer

#1

As the documentation says:

If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies.

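For example (a minimal sketch using one of the proxies from the question), a correct call passes a dict, not a string:

import urllib2

# a dict mapping protocol name -> proxy URL, as the docs describe
proxy = urllib2.ProxyHandler({'http': 'http://118.97.95.174:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)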

But in your code, it's just a string. In particular, it's one line out of your proxies.txt file:

p=open('proxies.txt')
for line in p:
    proxy = urllib2.ProxyHandler(line)

Looking at the file, it looks like the lines are intended to be something like the repr of a Python dictionary. And, given that all of the keys and values are string literals, that means you could use ast.literal_eval on it to recover the original dicts:

import ast

p = open('proxies.txt')
for line in p:
    d = ast.literal_eval(line)   # parse the repr-style line back into a dict
    proxy = urllib2.ProxyHandler(d)

Of course that won't work for your sample data, because one of the lines is missing a ' character. But if you fix that, it will…

However, it would probably be better to use a format that's actually intended for data interchange. For example, JSON is just as human-readable as what you've got, and not all that different:

{"https": "https://94.142.27.4:3128"}
{"http": "http://118.97.95.174:8080"}
{"http": "http://66.62.236.15:8080"}

The advantage of using JSON is that there are plenty of tools to validate, edit, etc. JSON, and none for your custom format; the rules for what is and isn't valid are obvious, rather than something you have to guess at; and the error messages for invalid data will likely be more helpful (like "Expecting property name at line 1 column 10 (char 10)" as opposed to "unexpected EOF while parsing").

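As a rough illustration (a sketch, not from the original answer; the exact messages vary by Python version), here is how a malformed line fails under each parser:

import ast
import json

# hypothetical bad line with a missing closing quote
bad = "{'http': 'http://66.62.236.15:8080}"

try:
    json.loads(bad)
except ValueError as e:
    print 'json:', e    # complains about a specific line and column

try:
    ast.literal_eval(bad)
except SyntaxError as e:
    print 'ast:', e     # a generic parse error, harder to act on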


Note that once you solve this problem, you're going to run into another one with the URLs. After all, 'www.google.com'\n is not what you want, it's www.google.com. So you're going to have to strip off the newline and the quotes. Again, you could use ast.literal_eval here. Or you could use JSON as an interchange format.

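For instance, a sketch of the ast.literal_eval route, assuming urls.txt keeps its current quoted format:

import ast

with open('urls.txt') as f:
    for line in f:
        # "'www.google.com'\n" -> 'www.google.com'
        url = ast.literal_eval(line.strip())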

But really, if you're just trying to store one string per line, why not just store the strings as-is, instead of trying to store a string representation of those strings (with the extra quotes on)?


There are still more problems beyond that.

Even after you get rid of the excess quotes, www.google.com isn't a URL, it's just a hostname. http://www.google.com is what you want here. Unless you want https://www.google.com, or some other scheme.

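One way to normalize that (a sketch, assuming http is an acceptable default scheme):

def ensure_scheme(url, default='http'):
    # turn a bare hostname like 'www.google.com' into a usable URL
    if '://' not in url:
        return '%s://%s' % (default, url)
    return url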

You're trying to loop through 'urls.txt' once for each proxy. That's going to process all of the URLs with just the first proxy installed, and then the remainder (which is nothing, since you already did all of them) with the first two installed, and then the remainder (which is still nothing) with all three installed. Move the url loop outside of the proxy loop.

Finally, these aren't really a problem, but while we're at it… Using a with statement makes it much easier to write more robust code than using manual close calls, and it makes your code shorter and more readable to boot. Also, it's usually better to wait until you need a file before you try to open it. And variable names like u and p are just going to cause more confusion in the long run than they'll save typing in the short run.

Oh, and just calling urllib.urlopen(url).read() and not doing anything with the result won't have any effect except to waste a few seconds and a bit of network bandwidth, but I assume you already knew that, and just left out the details for the sake of simplicity.

Putting it all together, and assuming you fix the two files as described above:

import json
import urllib2  

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
with open('urls.txt') as urls:
    for line in urls:
        url = line.rstrip()
        data = urllib2.urlopen(url).read()
        # do something with data

As it turns out, you want to try all of the URLs through each proxy, not try all of them through all the proxies, or through the first and then the first two and so on.

You could do this by indenting the second with and for under the first for. But it's probably simpler to just read them all at once (and probably more efficient, although I doubt that matters):

with open('urls.txt') as f:
    urls = [line.rstrip() for line in f]
with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
        for url in urls:
            data = urllib2.urlopen(url).read()
            # do something with data

Of course this means reading the whole list of URLs before doing any work. I doubt that will matter, but if it does, you can use the tee trick to avoid it.

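For what it's worth, here is one possible reading of that trick (a sketch using itertools.tee; tee still buffers lines internally, it just lets you start fetching before the whole file has been read):

import itertools
import json
import urllib2

with open('proxies.txt') as p:
    proxy_lines = list(p)    # proxies.txt is small, so read it eagerly

with open('urls.txt') as f:
    # one lazy copy of the URL iterator per proxy
    url_iters = itertools.tee(f, len(proxy_lines))
    for line, urls in zip(proxy_lines, url_iters):
        proxy_handler = urllib2.ProxyHandler(json.loads(line))
        urllib2.install_opener(urllib2.build_opener(proxy_handler))
        for url in urls:
            data = urllib2.urlopen(url.rstrip()).read()
            # do something with data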