How can I speed up fetching pages with urllib2 in Python?

Posted: 2021-10-29 18:09:45

I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

I ran cProfile on it, and as I assumed, urlopen takes up a lot of time. Is there a way to fetch the pages faster? Or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to Python and web development.

Thanks in advance! :)

UPDATE: I have a function called fetchURLs(), which I use to make an array of the URLs I need, so something like urls = fetchURLS(). The URLs are all XML files from Amazon and eBay APIs (which confuses me as to why it takes so long to load; maybe my webhost is slow?)

What I need to do is load each URL, read each page, and send that data to another part of the script which will parse and display the data.

Note that I can't do the latter part until ALL of the pages have been fetched; that's what my issue is.

Also, my host limits me to 25 processes at a time, I believe, so whatever is easiest on the server would be nice :)

Here is the cProfile output:

Sun Aug 15 20:51:22 2010    prof

         211352 function calls (209292 primitive calls) in 22.254 CPU seconds

   Ordered by: internal time
   List reduced from 404 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10   18.056    1.806   18.056    1.806 {_socket.getaddrinfo}
     4991    2.730    0.001    2.730    0.001 {method 'recv' of '_socket.socket' objects}
       10    0.490    0.049    0.490    0.049 {method 'connect' of '_socket.socket' objects}
     2415    0.079    0.000    0.079    0.000 {method 'translate' of 'unicode' objects}
       12    0.061    0.005    0.745    0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
     3428    0.060    0.000    0.202    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
     1698    0.055    0.000    0.068    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
     4125    0.053    0.000    0.056    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
     1698    0.042    0.000    0.358    0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
     1698    0.042    0.000    0.275    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)

10 Answers

#1


24  

EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading vs. async I/O. Therefore I am also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.

This is a dup of a question from 3 days ago:

Python urllib2.urlopen() is slow, need a better way to read several urls - Stack Overflow

I'm polishing the code to show how to fetch multiple webpages in parallel using threads.

import time
import threading
import Queue
import urllib2

# utility - spawn a thread to execute target for each args
def run_parallel_in_threads(target, args_list):
    result = Queue.Queue()
    # wrapper to collect return value in a Queue
    def task_wrapper(*args):
        result.put(target(*args))
    threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def dummy_task(n):
    for i in xrange(n):
        time.sleep(0.1)
    return n

# below is the application code
urls = [
    ('http://www.google.com/',),
    ('http://www.lycos.com/',),
    ('http://www.bing.com/',),
    ('http://www.altavista.com/',),
    ('http://achewood.com/',),
]

def fetch(url):
    return urllib2.urlopen(url).read()

run_parallel_in_threads(fetch, urls)

As you can see, the application-specific code has only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.

Unfortunately, most of the other threading code posted here has some flaws. Many of the examples do active polling to wait for the code to finish. join() is a better way to synchronize the code. I think this code has improved upon all the threading examples so far.

keep-alive connection

WoLpH's suggestion about using a keep-alive connection could be very useful if all your URLs are pointing to the same server.

twisted

Aaron Gallagher is a fan of the twisted framework, and he is hostile to anyone who suggests threads. Unfortunately a lot of his claims are misinformation. For example he said "-1 for suggesting threads. This is IO-bound; threads are useless here." This is contrary to evidence, as both Nick T and I have demonstrated a speed gain from using threads. In fact, I/O-bound applications have the most to gain from using Python's threads (versus no gain for CPU-bound applications). Aaron's misguided criticism of threads shows he is rather confused about parallel programming in general.

Right tool for the right job

I'm well aware of the issues that pertain to parallel programming using threads, Python, async I/O and so on. Each tool has its pros and cons. For each situation there is an appropriate tool. I'm not against twisted (though I have not deployed it myself). But I don't believe we can flat out say that threads are BAD and twisted is GOOD in all situations.

For example, if the OP's requirement is to fetch 10,000 websites in parallel, async I/O will be preferable. Threading won't be appropriate (unless maybe with stackless Python).

Aaron's opposition to threads is mostly generalizations. He fails to recognize that this is a trivial parallelization task. Each task is independent and does not share resources. So most of his attacks do not apply.

Given that my code has no external dependency, I'll call it the right tool for the right job.

Performance

I think most people would agree that the performance of this task largely depends on the networking code and the external server, where the performance of the platform code should have a negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.

In Nick's code, there is an obvious flaw that caused the inefficiency. But how do you explain the 233ms speed gain over my code? I think even twisted fans will refrain from jumping to the conclusion that this is due to the efficiency of twisted. There are, after all, a huge number of variables outside of the system code, such as the remote server's performance, the network, caching, the different implementations of urllib2 and the twisted web client, and so on.

Just to make sure Python's threading does not incur a huge amount of inefficiency, I did a quick benchmark spawning 5 threads and then 500 threads. I am quite comfortable saying the overhead of spawning 5 threads is negligible and cannot explain the 233ms speed difference.

In [274]: %time run_parallel_in_threads(dummy_task, [(0,)]*5)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[275]: <Queue.Queue instance at 0x038B2878>

In [276]: %time run_parallel_in_threads(dummy_task, [(0,)]*500)
CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
Wall time: 0.16 s

In [278]: %time run_parallel_in_threads(dummy_task, [(10,)]*500)
CPU times: user 1.13 s, sys: 0.00 s, total: 1.13 s
Wall time: 1.13 s       <<<<<<<< This means 0.13s of overhead

Further testing on my parallel fetching shows a huge variability in the response time over 17 runs. (Unfortunately I don't have twisted available to verify Aaron's code.)

0.75 s
0.38 s
0.59 s
0.38 s
0.62 s
1.50 s
0.49 s
0.36 s
0.95 s
0.43 s
0.61 s
0.81 s
0.46 s
1.21 s
2.87 s
1.04 s
1.72 s

My testing does not support Aaron's conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test to measure the systematic performance difference between async I/O and threading.

#2


18  

Use twisted! It makes this kind of thing absurdly easy compared to, say, using threads.

from twisted.internet import defer, reactor
from twisted.web.client import getPage
import time

def processPage(page, url):
    # do something here.
    return url, len(page)

def printResults(result):
    for success, value in result:
        if success:
            print 'Success:', value
        else:
            print 'Failure:', value.getErrorMessage()

def printDelta(_, start):
    delta = time.time() - start
    print 'ran in %0.3fs' % (delta,)
    return delta

urls = [
    'http://www.google.com/',
    'http://www.lycos.com/',
    'http://www.bing.com/',
    'http://www.altavista.com/',
    'http://achewood.com/',
]

def fetchURLs():
    callbacks = []
    for url in urls:
        d = getPage(url)
        d.addCallback(processPage, url)
        callbacks.append(d)

    callbacks = defer.DeferredList(callbacks)
    callbacks.addCallback(printResults)
    return callbacks

@defer.inlineCallbacks
def main():
    times = []
    for x in xrange(5):
        d = fetchURLs()
        d.addCallback(printDelta, time.time())
        times.append((yield d))
    print 'avg time: %0.3fs' % (sum(times) / len(times),)

reactor.callWhenRunning(main)
reactor.run()

This code also performs better than any of the other solutions posted (edited after I closed some things that were using a lot of bandwidth):

Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 29996)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.518s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.461s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30033)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.435s
Success: ('http://www.google.com/', 8117)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.449s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.547s
avg time: 0.482s

And using Nick T's code, rigged up to also give the average of five and show the output better:

Starting threaded reads:
...took 1.921520 seconds ([8117, 30070, 15043, 8386, 28611])
Starting threaded reads:
...took 1.779461 seconds ([8135, 15043, 8386, 30349, 28611])
Starting threaded reads:
...took 1.756968 seconds ([8135, 8386, 15043, 30349, 28611])
Starting threaded reads:
...took 1.762956 seconds ([8386, 8135, 15043, 29996, 28611])
Starting threaded reads:
...took 1.654377 seconds ([8117, 30349, 15043, 8386, 28611])
avg time: 1.775s

Starting sequential reads:
...took 1.389803 seconds ([8135, 30147, 28611, 8386, 15043])
Starting sequential reads:
...took 1.457451 seconds ([8135, 30051, 28611, 8386, 15043])
Starting sequential reads:
...took 1.432214 seconds ([8135, 29996, 28611, 8386, 15043])
Starting sequential reads:
...took 1.447866 seconds ([8117, 30028, 28611, 8386, 15043])
Starting sequential reads:
...took 1.468946 seconds ([8153, 30051, 28611, 8386, 15043])
avg time: 1.439s

And using Wai Yip Tung's code:

Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30051 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.704s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.845s
Fetched 8153 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30070 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.689s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.647s
Fetched 8135 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30349 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.693s
avg time: 0.715s

I've gotta say, I do like that the sequential fetches performed better for me.

#3


5  

Here is an example using Python threads. The other threaded examples here launch a thread per URL, which is not very friendly behaviour if it causes too many hits for the server to handle (for example, it is common for spiders to have many URLs on the same host).

from threading import Thread
from urllib2 import urlopen
from time import time, sleep

WORKERS=1
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = []

class Worker(Thread):
    def run(self):
        while urls:
            url = urls.pop()
            results.append((url, urlopen(url).read()))

start = time()
threads = [Worker() for i in range(WORKERS)]
for t in threads:
    t.start()

while len(results) < 40:   # 40 == len(urls); crude wait until all fetches finish
    sleep(0.1)
print time()-start

Note: The times given here are for 40 urls and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300ms

With WORKERS=1 it took 86 seconds to run
With WORKERS=4 it took 23 seconds to run
With WORKERS=10 it took 10 seconds to run

so having 10 threads downloading is 8.6 times as fast as a single thread.

Here is an upgraded version that uses a Queue. There are at least a few advantages:
1. The URLs are requested in the order that they appear in the list
2. You can use q.join() to detect when the requests have all completed
3. The results are kept in the same order as the URL list

from threading import Thread
from urllib2 import urlopen
from time import time, sleep
from Queue import Queue

WORKERS=10
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = [None]*len(urls)

def worker():
    while True:
        i, url = q.get()
        # print "requesting ", i, url       # if you want to see what's going on
        results[i]=urlopen(url).read()
        q.task_done()

start = time()
q = Queue()
for i in range(WORKERS):
    t=Thread(target=worker)
    t.daemon = True
    t.start()

for i,url in enumerate(urls):
    q.put((i,url))
q.join()
print time()-start

#4


2  

The actual wait is probably not in urllib2 but in the server and/or your network connection to the server.

There are 2 ways of speeding this up.

  1. Keep the connection alive (see this question on how to do that: Python urllib2 with keep alive)
  2. Use multiple connections; you can use threads or an async approach as Aaron Gallagher suggested. For that, simply use any threading example and you should do fine :) You can also use the multiprocessing lib to make things pretty easy; a minimal sketch follows below.
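
Below is a rough sketch of the multiprocessing route mentioned in point 2. It is only an illustration: it uses multiprocessing.dummy, which provides a thread-backed Pool with the same API (a good fit for this I/O-bound job, and it stays well under the OP's 25-process limit), and the URL list is a placeholder.

from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API
import urllib2

urls = ['http://www.google.com/', 'http://www.bing.com/']  # placeholder URLs

def fetch(url):
    return urllib2.urlopen(url).read()

pool = Pool(10)                 # number of worker threads
pages = pool.map(fetch, urls)   # blocks until every page has been fetched
pool.close()
pool.join()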

#5


2  

Most of the answers focused on fetching multiple pages from different servers at the same time (threading) but not on reusing an already open HTTP connection. This matters if the OP is making multiple requests to the same server/site.

In urllib2 a separate connection is created for each request, which impacts performance and, as a result, slows down the rate of fetching pages. urllib3 solves this problem by using a connection pool. You can read more here: urllib3 [also thread-safe].

There is also Requests, an HTTP library that uses urllib3.

This, combined with threading, should increase the speed of fetching pages.
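
As a hedged sketch of that combination (assuming the urllib3 package is installed; the URLs below are placeholders): a single PoolManager is shared by all worker threads, so connections to the same host are reused instead of being re-opened for every request.

import urllib3
from concurrent.futures import ThreadPoolExecutor  # stdlib in Python 3, 'futures' backport on Python 2

urls = ['http://example.com/a.xml', 'http://example.com/b.xml']  # placeholder URLs

http = urllib3.PoolManager(maxsize=10)  # thread-safe connection pool

def fetch(url):
    return http.request('GET', url).data

with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, urls))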

#6


1  

Nowadays there is an excellent Python lib that does this for you, called requests.

Use the standard API of requests if you want a solution based on threads, or its async API (using gevent under the hood) if you want a solution based on non-blocking IO.
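
A small sketch of the non-blocking route, assuming both gevent and requests are installed (the URLs are placeholders): monkey patching makes the blocking socket calls cooperative, so the greenlets fetch concurrently.

from gevent import monkey
monkey.patch_all()  # must run before any sockets are created

import gevent
import requests

urls = ['http://www.google.com/', 'http://www.bing.com/']  # placeholder URLs

def fetch(url):
    return requests.get(url, timeout=10).text

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=30)
pages = [job.value for job in jobs]  # value is None for a job that failed or timed out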

#7


1  

Since this question was posted, it looks like there's a higher-level abstraction available, ThreadPoolExecutor:

https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example

The example from there is pasted here for convenience:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

There's also map, which I think makes the code easier: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
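
A short sketch of that map variant (a slight rework of the load_url helper above, with placeholder URLs): results come back in input order, and any exception is re-raised when its result is reached during iteration.

import concurrent.futures
import urllib.request

URLS = ['http://www.python.org/', 'http://www.bbc.co.uk/']  # placeholder URLs

def load_url(url, timeout=60):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map preserves the order of URLS, unlike as_completed
    for url, data in zip(URLS, executor.map(load_url, URLS)):
        print('%r page is %d bytes' % (url, len(data)))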

#8


0  

Fetching webpages obviously will take a while as you're not accessing anything local. If you have several to access, you could use the threading module to run a couple at once.

Here's a very crude example:

import threading
import urllib2
import time

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']
data1 = []
data2 = []

class PageFetch(threading.Thread):
    def __init__(self, url, datadump):
        self.url = url
        self.datadump = datadump
        threading.Thread.__init__(self)
    def run(self):
        page = urllib2.urlopen(self.url)
        self.datadump.append(page.read()) # don't do it like this.

print "Starting threaded reads:"
start = time.clock()
for url in urls:
    PageFetch(url, data2).start()
while len(data2) < len(urls): pass # don't do this either.
print "...took %f seconds" % (time.clock() - start)

print "Starting sequential reads:"
start = time.clock()
for url in urls:
    page = urllib2.urlopen(url)
    data1.append(page.read())
print "...took %f seconds" % (time.clock() - start)

for i,x in enumerate(data1):
    print len(data1[i]), len(data2[i])

This was the output when I ran it:

Starting threaded reads:
...took 2.035579 seconds
Starting sequential reads:
...took 4.307102 seconds
73127 19923
19923 59366
361483 73127
59366 361483

Grabbing the data from the thread by appending to a list is probably ill-advised (Queue would be better) but it illustrates that there is a difference.

#9


0  

Here's a standard library solution. It's not quite as fast, but it uses less memory than the threaded solutions.

try:
    from http.client import HTTPConnection, HTTPSConnection
except ImportError:
    from httplib import HTTPConnection, HTTPSConnection
connections = []
results = []

for url in urls:
    scheme, _, host, path = url.split('/', 3)
    h = (HTTPConnection if scheme == 'http:' else HTTPSConnection)(host)
    h.request('GET', '/' + path)
    connections.append(h)
for h in connections:
    results.append(h.getresponse().read())

Also, if most of your requests are to the same host, then reusing the same http connection would probably help more than doing things in parallel.

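For completeness, a hedged sketch of that idea using the same http.client/httplib module: one keep-alive connection to a single host serves several sequential requests (assuming the server honours keep-alive; the host and paths below are placeholders).

try:
    from http.client import HTTPConnection   # Python 3
except ImportError:
    from httplib import HTTPConnection       # Python 2

host = 'docs.python.org'                                     # placeholder host
paths = ['/library/threading.html', '/howto/urllib2.html']   # placeholder paths

conn = HTTPConnection(host)  # HTTP/1.1, so keep-alive by default
pages = []
for path in paths:
    conn.request('GET', path)
    resp = conn.getresponse()
    pages.append(resp.read())  # read fully before issuing the next request
conn.close()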

#10


0  

Here is a Python network benchmark script for identifying single-connection slowness:

"""Python network test."""
from socket import create_connection
from time import time

try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen

TIC = time()
create_connection(('216.58.194.174', 80))
print('Duration socket IP connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
create_connection(('google.com', 80))
print('Duration socket DNS connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
urlopen('http://216.58.194.174')
print('Duration urlopen IP connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
urlopen('http://google.com')
print('Duration urlopen DNS connection (s): {:.2f}'.format(time() - TIC))

And an example of the results with Python 3.6:

Duration socket IP connection (s): 0.02
Duration socket DNS connection (s): 75.51
Duration urlopen IP connection (s): 75.88
Duration urlopen DNS connection (s): 151.42

Python 2.7.13 has very similar results.

In this case, DNS and urlopen slowness are easily identified.

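If getaddrinfo is indeed the bottleneck (as in the OP's profile, where it accounts for roughly 18 of the 22 seconds), one hedged workaround is to cache DNS lookups for the whole process. This is only a sketch: it monkey-patches socket.getaddrinfo, which urlopen uses indirectly, so repeated requests to the same hosts skip the slow lookups.

import socket

_orig_getaddrinfo = socket.getaddrinfo
_dns_cache = {}

def _cached_getaddrinfo(*args):
    # Cache results keyed by the full positional argument tuple.
    if args not in _dns_cache:
        _dns_cache[args] = _orig_getaddrinfo(*args)
    return _dns_cache[args]

socket.getaddrinfo = _cached_getaddrinfo  # patch before calling urlopen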
