Python - Trying out urllib3: putting an HTTP connection pool to use

The library and related material can be downloaded from http://code.google.com/p/urllib3/ .

First, here is how to use it:

# coding=utf8
from __future__ import print_function

import time

import urllib3

# Create a connection pool bound to one specific host
http_pool = urllib3.HTTPConnectionPool('ent.qq.com')
# Record the start time
start_time = time.strftime('%X %x %Z')
for i in range(100):
    print(i)
    # Build the URL string
    url = 'http://ent.qq.com/a/20111216/%06d.htm' % i
    print(url)
    # Fetch the content synchronously, without following redirects
    r = http_pool.urlopen('GET', url, redirect=False)
    print(r.status, r.headers, len(r.data))
# Print the start and end times
print('start time :', start_time)
print('end time :', time.strftime('%X %x %Z'))

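This is fairly simple: create the connection pool http_pool, then fetch a series of URLs that all live on the same host ('ent.qq.com').
Capturing the traffic with Wireshark shows:

Every request for http://ent.qq.com/a/20111216/******.htm uses source port 13136, so the same TCP connection is being reused.
This matches the urllib3 documentation, which describes the use of HTTP keep-alive, and the Connection header of every response is indeed keep-alive.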
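The same port-reuse effect can be reproduced without Wireshark. The following is a minimal sketch, assuming Python 3 and only the standard library: a throwaway local server records the client source port of each request while the client sends several requests over one `http.client.HTTPConnection`. The names `PortRecorder` and `ports_seen` are made up for this illustration and have nothing to do with urllib3.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PortRecorder(BaseHTTPRequestHandler):
    """Hypothetical handler: remembers the client's source port per request."""
    protocol_version = "HTTP/1.1"   # needed so the server keeps connections alive
    seen_ports = []

    def do_GET(self):
        PortRecorder.seen_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def ports_seen(num_requests=5):
    """Send num_requests GETs over one connection; return the ports the server saw."""
    PortRecorder.seen_ports = []
    server = ThreadingHTTPServer(("127.0.0.1", 0), PortRecorder)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
        for i in range(num_requests):
            conn.request("GET", "/%06d.htm" % i)
            conn.getresponse().read()   # drain the body so the socket can be reused
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return list(PortRecorder.seen_ports)
```

If keep-alive is working, every entry in the list returned by `ports_seen()` is the same port number, which mirrors the single source port seen in the capture.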

So how is this connection pool implemented?


    def urlopen(self, method, url, body=None, headers=None, retries=3,
                redirect=True, assert_same_host=True):
        # Many of the conditional checks are omitted here
        try:
            # Get a connection from the pool
            conn = self._get_conn()

            # Send the request
            self.num_requests += 1
            conn.request(method, url, body=body, headers=headers)
            # Set the socket timeout
            conn.sock.settimeout(self.timeout)
            httplib_response = conn.getresponse()
            # ...

            # Wrap the httplib response in an HTTPResponse
            response = HTTPResponse.from_httplib(httplib_response)

            # Put the connection back in the queue so it can be reused
            self._put_conn(conn)

        except ...:
            # Error handling (omitted)
            ...

        # Redirect handling; note that this is done recursively
        if (redirect and
            response.status in [301, 302, 303, 307] and
            'location' in response.headers):  # Redirect, retry
            log.info("Redirecting %s -> %s" %
                     (url, response.headers.get('location')))
            return self.urlopen(method, response.headers.get('location'), body,
                                headers, retries - 1, redirect,
                                assert_same_host)
        # Return the result
        return response


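As the simplified code above shows, urlopen first gets a connection, then builds and sends the request, and finally reads the response.
Note that every connection is obtained by calling _get_conn, and once the response has been read the connection is handed back to the pool via _put_conn. The relevant code is: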

    def _new_conn(self):
        # Create a new connection
        return HTTPConnection(host=self.host, port=self.port)

    def _get_conn(self, timeout=None):
        # Try to take a connection from the pool
        conn = None
        try:
            conn = self.pool.get(block=self.block, timeout=timeout)

            # Check whether the pooled connection has been dropped:
            # an idle socket that is readable is no longer usable
            if conn and conn.sock and select([conn.sock], [], [], 0.0)[0]:
                # Either data is buffered (bad), or the connection is dropped.
                log.warning("Connection pool detected dropped "
                            "connection, resetting: %s" % self.host)
                conn.close()

        except Empty:
            pass  # Oh well, we'll create a new connection then
        # If the queue was empty, create a new connection. A dropped
        # connection was closed above and is returned as-is; httplib
        # reconnects it to the same host and port on the next request.
        return conn or self._new_conn()

    def _put_conn(self, conn):
        # Put the connection back into the queue. The queue's default
        # maximum size is 1; a connection that does not fit is discarded.
        try:
            self.pool.put(conn, block=False)
        except Full:
            # This should never happen if self.block == True
            log.warning("HttpConnectionPool is full, discarding connection: %s"
                        % self.host)
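
The get/put cycle and the dropped-connection check can be condensed into a small standalone sketch. It assumes Python 3 and the standard library only; `MiniPool`, `looks_dropped`, and the `factory` parameter are names invented for this illustration and are not part of urllib3's API.

```python
import queue
import select
import socket


def looks_dropped(sock):
    """True if an idle socket is readable: closed by the peer or holding stray data."""
    readable, _, _ = select.select([sock], [], [], 0.0)
    return bool(readable)


class MiniPool:
    """Hypothetical fixed-size pool; `factory` builds a new connection object."""

    def __init__(self, factory, maxsize=1):
        self.factory = factory
        self.pool = queue.Queue(maxsize)

    def get(self):
        try:
            return self.pool.get(block=False)
        except queue.Empty:
            return self.factory()      # nothing idle: create a fresh connection

    def put(self, conn):
        try:
            self.pool.put(conn, block=False)
            return True
        except queue.Full:
            return False               # pool is full: the caller should discard conn
```

With `maxsize=1`, a strictly sequential caller keeps getting the same connection back, which is consistent with the single source port observed earlier. A second connection returned while the slot is occupied is simply refused.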

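I compared this pool against the plain urllib library by fetching different pages from the same domain one after another, and saw no obvious speedup. The likely reason is that the server is close to my machine: the pool's main gain comes from avoiding repeated TCP handshakes and slow-start phases, and those savings barely show when round-trip latency is low.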
Does anyone have suggestions for a good way to benchmark this?
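
One possible starting point, offered as a rough sketch rather than a rigorous benchmark: time the same batch of requests once with a fresh connection per request and once over a single reused connection. The helper names `time_fresh` and `time_reused` and the `host`, `port`, and `paths` parameters are assumptions for this illustration. It uses Python 3's `http.client` instead of urllib3 so that nothing beyond the standard library is needed.

```python
import http.client
import time


def time_fresh(host, port, paths):
    """Seconds taken when every request opens and closes its own connection."""
    start = time.perf_counter()
    for path in paths:
        conn = http.client.HTTPConnection(host, port, timeout=10)
        conn.request("GET", path)
        conn.getresponse().read()
        conn.close()
    return time.perf_counter() - start


def time_reused(host, port, paths):
    """Seconds taken when all requests share one keep-alive connection."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    start = time.perf_counter()
    for path in paths:
        conn.request("GET", path)
        conn.getresponse().read()   # drain the body before the next request
    conn.close()
    return time.perf_counter() - start
```

Against a nearby server the two numbers may be almost identical, so repeating the runs several times and measuring against a host with higher round-trip latency is likely to be more telling.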
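Someone also raised the question of whether urllib3 should provide a pool of pools, one that automatically creates a separate pool for each host when different sites are accessed, i.e. an HTTPOcean :)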
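
The idea amounts to a dictionary of pools keyed by host. Here is a minimal sketch, assuming Python 3: `PoolOfPools`, `pool_for`, and `make_pool` are hypothetical names for this illustration, and `make_pool` stands in for whatever builds a per-host pool (for example a function that returns a `urllib3.HTTPConnectionPool`). Later urllib3 releases ship a `PoolManager` class that fills this role; the code below is not its implementation.

```python
from urllib.parse import urlsplit


class PoolOfPools:
    """Hypothetical registry that lazily creates one pool per (scheme, host, port)."""

    def __init__(self, make_pool):
        self.make_pool = make_pool   # callable(host, port) -> pool object
        self.pools = {}

    def pool_for(self, url):
        parts = urlsplit(url)
        # Fall back to the scheme's default port when the URL has none
        port = parts.port or (443 if parts.scheme == "https" else 80)
        key = (parts.scheme, parts.hostname, port)
        if key not in self.pools:
            self.pools[key] = self.make_pool(parts.hostname, port)
        return self.pools[key]
```

Two URLs on ent.qq.com would share one pool, while a URL on another host would get its own. The sketch is not thread-safe and never evicts idle pools, both of which a real implementation would need to handle.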
