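0. Contents
1. Approach
2. Installation on Windows
3. Command-line reference
4. Basic configuration and first use
5. Question: does Squid support HTTPS?
6. Question: multiple proxy entries with the same IP but different ports cause an error
7. Question: distinguishing HTTP/HTTPS proxy requests and selecting the matching proxy entry
8. Question: proxy IP types (elite / anonymous / transparent)
9. Question: forward / reverse / transparent proxies
10. Updating the configuration with a Python script
11. Logging
12. References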
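1. Approach
- Periodically poll the proxy source websites (every 30 minutes or every hour is fine), parse out all proxy IPs, and store them in a database
- Take all proxies from the database, use each one to visit a fixed website, find the ones that succeed, and update the availability flag and response time in the database
- Load all usable proxies from the database and, using some algorithm, compute a usage weight and a maximum use count from the response time
- Write them to the configuration file in Squid's cache_peer format
- Reload the Squid configuration file to refresh Squid's proxy list
- Point the crawler at Squid's service IP and port and let it do nothing but crawl
This gives a complete proxy service that regularly outputs high-quality proxies. The crawler does not need to care about collecting or testing proxies; it just crawls through Squid's single service entry point.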
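The weighting step in the list above is left open ("some algorithm"). One possible sketch, assuming a simple inverse-latency rule: the function names, the 1..max_weight range and the way max-conn scales with the weight are all illustrative assumptions, not part of the original setup.

```python
def weight_and_max_conn(latency_s, max_weight=5, base_conn=5):
    """Map a measured response time (seconds) to (weight, max_conn).

    Hypothetical rule: faster proxies get a larger weight, clamped to
    the range 1..max_weight; max_conn scales with the weight.
    """
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    weight = int(max_weight / latency_s)      # floor of the inverse latency
    weight = max(1, min(max_weight, weight))  # clamp to 1..max_weight
    return weight, base_conn * weight


def peer_options(latency_s):
    """Render the weight-related cache_peer options for one proxy."""
    weight, max_conn = weight_and_max_conn(latency_s)
    return "weighted-round-robin weight={} max-conn={}".format(weight, max_conn)
```

Any monotonic mapping from response time to weight would do here; the clamp only keeps one very fast proxy from absorbing nearly all of the traffic.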
2. Installation on Windows
http://www.squid-cache.org/Versions/
In some cases, you may want (or be forced) to download a binary package of Squid. They are available for a variety of platforms, including Windows.
https://wiki.squid-cache.org/SquidFaq/BinaryPackages
https://wiki.squid-cache.org/KnowledgeBase/Windows
MSI installer packages for Windows are at:
64-bit: http://squid.diladele.com/
Download the MSI directly; suggested install directory: C:\Squid\
Installation on CentOS:
https://wiki.squid-cache.org/SquidFaq/BinaryPackages
CentOS
Squid bundles with CentOS. However there is apparently no publicly available information about where to find the packages or who is bundling them. EPEL, DAG and RPMforge repositories appear to no longer contain any files. Other sources imply that CentOS is an alias for RHEL (we know otherwise). Although, yes, the RHEL packages should work on CentOS.
Maintainer: unknown
Bug Reporting: http://bugs.centos.org/search.php?category=squid&sortby=last_updated&hide_status_id=-2
Eliezer: 25/Apr/2017 - I have tested CentOS 7 RPMs for squid 3.5.25 on a small scale and it seems to be stable enough for 200-300 users as a forward proxy and basic features.
Stable Repository Package (like epel-release)
To install run the command: yum install http://ngtech.co.il/repo/centos/7/squid-repo-1-1.el7.centos.noarch.rpm -y
or rpm -i http://ngtech.co.il/repo/centos/7/squid-repo-1-1.el7.centos.noarch.rpm
and then install squid using the command: yum install squid
3. Command-line reference
Help output:
C:\Squid\bin>squid -h
Usage: squid [-cdhvzCFNRVYX] [-n name] [-s | -l facility] [-f config-file] [-[au] port] [-k signal]
-a port Specify HTTP port number (default: 3128).
-d level Write debugging to stderr also.
-f file Use given config-file instead of
/etc/squid/squid.conf
-h Print help message.
-k reconfigure|rotate|shutdown|restart|interrupt|kill|debug|check|parse
Parse configuration file, then send signal to
running copy (except -k parse) and exit.
-n name Specify service name to use for service operations
default is: squid.
-s | -l facility
Enable logging to syslog.
-u port Specify ICP port number (default: 3130), disable with 0.
-v Print version.
-z Create missing swap directories and then exit.
-C Do not catch fatal signals.
-D OBSOLETE. Scheduled for removal.
-F Don't serve any requests until store is rebuilt.
-N No daemon mode.
-R Do not set REUSEADDR on port.
-S Double-check swap during rebuild.
-X Force full debugging.
-Y Only return UDP_HIT or UDP_MISS_NOFETCH during fast reload.
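Starting/stopping the service: in Computer Management, find "Squid for Windows"; right-click, Properties shows that the service name is squidsrv
C:\Squid\bin>net start squidsrv
The requested service has already been started.
More help is available by typing NET HELPMSG 2182.
C:\Squid\bin>net stop squidsrv
The Squid for Windows service is stopping.
The Squid for Windows service was stopped successfully.
C:\Squid\bin>net start squidsrv
The Squid for Windows service is starting..
The Squid for Windows service was started successfully.
Reload the configuration: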
C:\Squid\bin>squid -k reconfigure
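4. Basic configuration and first use
Copy C:\Squid\etc\squid\squid.conf and save it as C:\Squid\etc\squid\squid_backup.conf as a backup
Confirm the default listening port:
# Squid normally listens to port 3128
http_access allow all
http_port 3128
Leave the existing configuration untouched and only append the following two lines at the end; see section 12. References (1):
Search for free proxy IPs yourself
cache_peer 58.22.61.211 parent 3128 0 no-query
never_direct allow all
Use requests to confirm that the proxy works: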
In [7]: os.system('c:/Squid/bin/squid -k reconfigure')
Out[7]: 0
In [8]: import requests
In [9]: s = requests.Session()
In [10]: s.proxies = {'http': 'http://127.0.0.1:3128', 'https': 'https://127.0.0.1:3128'}
In [11]: s.get('http://httpbin.org/ip', timeout=10).text
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1
DEBUG:urllib3.connectionpool:http://127.0.0.1:3128 "GET http://httpbin.org/ip HTTP/1.1" 200 58
Out[11]: u'{\n "origin": "127.0.0.1, 163.125.31.126, 58.22.61.211"\n}\n'
In [12]: s.get('https://httpbin.org/ip', timeout=10).text
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): httpbin.org
DEBUG:urllib3.connectionpool:https://httpbin.org:443 "GET /ip HTTP/1.1" 200 31
Out[12]: u'{\n "origin": "58.22.61.211"\n}\n'
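The same check can be written with only the standard library. This is a minimal sketch, assuming Squid listens locally on the default port 3128; `SQUID`, `proxy_map`, `make_opener` and `origin_ip` are illustrative names. Both schemes are mapped to an `http://` proxy URL here because http_port speaks plain HTTP to the client, and newer client libraries may read an `https://` proxy URL as "use TLS to reach the proxy", which this setup does not provide.

```python
import json
import urllib.request

SQUID = "http://127.0.0.1:3128"  # assumption: local Squid on the default port


def proxy_map(proxy_url=SQUID):
    # Both schemes point at the same plain-HTTP proxy endpoint; HTTPS
    # targets are reached through a CONNECT tunnel opened on that endpoint.
    return {"http": proxy_url, "https": proxy_url}


def make_opener(proxy_url=SQUID):
    """Build a urllib opener that sends every request through the proxy."""
    handler = urllib.request.ProxyHandler(proxy_map(proxy_url))
    return urllib.request.build_opener(handler)


def origin_ip(url="http://httpbin.org/ip", timeout=10):
    """Return the 'origin' field that httpbin reports for this request."""
    with make_opener().open(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["origin"]
```

Calling `origin_ip()` and `origin_ip("https://httpbin.org/ip")` should then report the parent proxy's address, as in the session above, provided Squid is running and the parent is reachable.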
Official documentation:
http://www.squid-cache.org/Doc/config/never_direct/
Default Value: Allow DNS results to be used for this request.
Usage: never_direct allow|deny [!]aclname ...
never_direct is the opposite of always_direct. Please read
the description for always_direct if you have not already.
With 'never_direct' you can use ACL elements to specify
requests which should NEVER be forwarded directly to origin
servers. For example, to force the use of a proxy for all
requests, except those in your local domain use something like:
acl local-servers dstdomain .foo.net
never_direct deny local-servers
never_direct allow all
or if Squid is inside a firewall and there are local intranet
servers inside the firewall use something like:
acl local-intranet dstdomain .foo.net
acl local-external dstdomain external.foo.net
always_direct deny local-external
always_direct allow local-intranet
never_direct allow all
This clause supports both fast and slow acl types.
See http://wiki.squid-cache.org/SquidFaq/SquidAcl for details.
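5. Question: does Squid support HTTPS?
Note that when Squid acts as a forward proxy (its default configuration), the http_port 3128 port can also handle HTTPS proxy requests. As a forward proxy, Squid takes no part in SSL encryption or decryption; it only helps set up a TCP connection from the user to port 443 of the website and then blindly relays the encrypted data between the two. The https_port directive, together with a certificate for Squid, is only needed when Squid is used as a reverse proxy.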
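To make the tunnelling concrete: for an HTTPS URL the client first sends the proxy a plain-text CONNECT request, and everything after the proxy's success reply is opaque TLS traffic. The sketch below only builds that first request; `connect_request` is an illustrative helper name, not something Squid or requests provides.

```python
def connect_request(host, port=443):
    """Build the plain-text request a client sends to open a tunnel."""
    target = "{}:{}".format(host, port)
    lines = [
        "CONNECT {} HTTP/1.1".format(target),
        "Host: {}".format(target),
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(connect_request("httpbin.org").decode("ascii"))
```

This is the request that shows up as `CONNECT httpbin.org:443` with TCP_TUNNEL in the access log in section 11.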
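6. Question: multiple proxy entries with the same IP but different ports cause an error
Proxies can share the same IP while using different ports, which triggers this error:
FATAL: ERROR: cache_peer 42.227.87.205 specified twice
So add a name for the proxy at the end of the line: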
cache_peer 120.xx.xx.32 parent 80 0 no-query weighted-round-robin weight=2 connect-fail-limit=2 allow-miss max-conn=5 name=proxy-90
http://www.squid-cache.org/Doc/config/cache_peer/
name=xxx Unique name for the peer.
Required if you have multiple peers on the same host
but different ports.
This name can be used in cache_peer_access and similar
directives to identify the peer.
Can be used by outgoing access controls through the
peername ACL type.
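When the peer list is generated by a script, the simplest way to respect this rule is to give every entry a name. A minimal sketch, assuming names derived from IP and port (the `proxy-<ip>-<port>` scheme and both helper names are illustrative choices):

```python
from collections import Counter


def peer_lines(proxies):
    """Render one cache_peer line per (ip, port) pair.

    A name= option is always added, derived from ip and port, so two
    peers on the same host never collide.
    """
    lines = []
    for ip, port in sorted(set(proxies)):
        name = "proxy-{}-{}".format(ip.replace(".", "-"), port)
        lines.append(
            "cache_peer {} parent {} 0 no-query name={}".format(ip, port, name)
        )
    return lines


def duplicated_hosts(proxies):
    """Return the hosts that appear with more than one port."""
    counts = Counter(ip for ip, _ in set(proxies))
    return sorted(ip for ip, n in counts.items() if n > 1)
```

`duplicated_hosts` is only a diagnostic: it lists the hosts that would have triggered the "specified twice" error without name=.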
7. Question: distinguishing HTTP/HTTPS proxy requests and selecting the matching proxy entry
http://www.squid-cache.org/Doc/config/cache_peer/
http://www.squid-cache.org/Doc/config/cache_peer_access/
http://www.squid-cache.org/Doc/config/acl/
http://www.squid-cache.org/Doc/config/access_log/
Implemented with ACLs; whether this really takes effect has not been fully confirmed yet!
acl acl_deny_http port 80
acl acl_deny_https port 443
cache_peer 219.156.151.20 parent 53281 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5 name=0_HTTP
cache_peer_access 0_HTTP deny acl_deny_https
cache_peer 175.155.248.190 parent 808 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5 name=83_HTTPS
cache_peer_access 83_HTTPS deny acl_deny_http
8. Question: proxy IP types (elite / anonymous / transparent)
Official documentation:
http://www.squid-cache.org/Doc/config/forwarded_for/
Default Value: forwarded_for on
If set to "on", Squid will append your client's IP address
in the HTTP requests it forwards. By default it looks like:
X-Forwarded-For: 192.1.2.3
If set to "off", it will appear as
X-Forwarded-For: unknown
If set to "transparent", Squid will not alter the
X-Forwarded-For header in any way.
If set to "delete", Squid will delete the entire
X-Forwarded-For header.
If set to "truncate", Squid will remove all existing
X-Forwarded-For entries, and place the client IP as the sole entry.
http://www.squid-cache.org/Doc/config/via/
Default Value: via on
If set (default), Squid will include a Via header in requests and replies as required by RFC2616.
http://www.squid-cache.org/Doc/config/request_header_access/
Default Value: No limits.
For example, to achieve the same behavior as the old
'http_anonymizer standard' option, you should use:
request_header_access From deny all
request_header_access Referer deny all
request_header_access User-Agent deny all
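Combining the references, append the following to the end of squid.conf (forwarded_for is set twice here; presumably the later value, transparent, is the one that takes effect):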
forwarded_for off
via off
forwarded_for transparent
request_header_access Via deny all
request_header_access X-Forwarded-For deny all
request_header_access From deny all
Before/after comparison:
In [104]: from bs4 import BeautifulSoup as BS
In [106]: os.system('c:/Squid/bin/squid -k reconfigure')
...: r=s.get('http://www.iprivacytools.com/proxy-checker-anonymity-test/', timeout=10)
...: soup=BS(r.text, 'lxml')
...: print soup.select('div.content')[1].text
...:
DEBUG:urllib3.connectionpool:Starting new HTTP connection (16): 127.0.0.1
DEBUG:urllib3.connectionpool:http://127.0.0.1:3128 "GET http://www.iprivacytools.com/proxy-checker-anonymity-test/ HTTP/1.1" 200 2777
Your IP address and hostname: 58.22.61.211 (58.22.61.211)
Here are your headers that could reveal a proxy:
HTTP_VIA: 1.1 win7-PC (squid/3.5.26), 1.1 RD2:3128 (squid/2.7.STABLE7)
HTTP_X_FORWARDED_FOR: 127.0.0.1, 163.125.31.83
HTTP_FORWARDED_FOR: anonymous / none
HTTP_X_FORWARDED: anonymous / none
HTTP_FORWARDED: anonymous / none
HTTP_CLIENT_IP: anonymous / none
HTTP_FORWARDED_FOR_IP: anonymous / none
VIA: anonymous / none
X_FORWARDED_FOR: anonymous / none
FORWARDED_FOR: anonymous / none
X_FORWARDED: anonymous / none
FORWARDED: anonymous / none
CLIENT_IP: anonymous / none
FORWARDED_FOR_IP: anonymous / none
HTTP_PROXY_CONNECTION: anonymous / none
Proxy detected? YES
Here's how we know:
Your HTTP_VIA header shows: 1.1 win7-PC (squid/3.5.26), 1.1 RD2:3128 (squid/2.7.STABLE7)
Your HTTP_X_FORWARDED_FOR header shows: 127.0.0.1, 163.125.31.83
Again, please remember that this should not be considered a fullproof
test of your anonymous surfing level, as it is only analyzing your browser
headers. To surf via proxies with greater confidence, we highly suggest
using a firewall and disabling all browser plugins and script support.
# Append the following to the end of squid.conf:
# forwarded_for off
# via off
# forwarded_for transparent
# request_header_access Via deny all
# request_header_access X-Forwarded-For deny all
# request_header_access From deny all
# Result for comparison:
In [107]: os.system('c:/Squid/bin/squid -k reconfigure')
...: r=s.get('http://www.iprivacytools.com/proxy-checker-anonymity-test/', timeout=10)
...: soup=BS(r.text, 'lxml')
...: print soup.select('div.content')[1].text
...:
DEBUG:urllib3.connectionpool:Resetting dropped connection: 127.0.0.1
DEBUG:urllib3.connectionpool:http://127.0.0.1:3128 "GET http://www.iprivacytools.com/proxy-checker-anonymity-test/ HTTP/1.1" 200 2749
Your IP address and hostname: 58.22.61.211 (58.22.61.211)
Here are your headers that could reveal a proxy:
HTTP_VIA: 1.1 RD2:3128 (squid/2.7.STABLE7)
HTTP_X_FORWARDED_FOR: 163.125.31.93
HTTP_FORWARDED_FOR: anonymous / none
HTTP_X_FORWARDED: anonymous / none
HTTP_FORWARDED: anonymous / none
HTTP_CLIENT_IP: anonymous / none
HTTP_FORWARDED_FOR_IP: anonymous / none
VIA: anonymous / none
X_FORWARDED_FOR: anonymous / none
FORWARDED_FOR: anonymous / none
X_FORWARDED: anonymous / none
FORWARDED: anonymous / none
CLIENT_IP: anonymous / none
FORWARDED_FOR_IP: anonymous / none
HTTP_PROXY_CONNECTION: anonymous / none
Proxy detected? YES
Here's how we know:
Your HTTP_VIA header shows: 1.1 RD2:3128 (squid/2.7.STABLE7)
Your HTTP_X_FORWARDED_FOR header shows: 163.125.31.93
Again, please remember that this should not be considered a fullproof
test of your anonymous surfing level, as it is only analyzing your browser
headers. To surf via proxies with greater confidence, we highly suggest
using a firewall and disabling all browser plugins and script support.
The local machine's information is now hidden. For comparison, the headers before the change were:
Your HTTP_VIA header shows: 1.1 win7-PC (squid/3.5.26), 1.1 RD2:3128 (squid/2.7.STABLE7)
Your HTTP_X_FORWARDED_FOR header shows: 127.0.0.1, 163.125.31.83
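The checker output keys on two headers, Via and X-Forwarded-For. A rough classifier along the same lines can be sketched as follows; the three labels follow this section's title, while `anonymity_level` and its arguments are illustrative assumptions rather than an established API.

```python
def anonymity_level(headers, client_ip):
    """Classify a proxy from the request headers the target site received.

    headers: dict of header name -> value as seen by the server.
    client_ip: the real IP of the client, used to spot leaks.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    via = lowered.get("via")
    xff = lowered.get("x-forwarded-for", "")
    forwarded_ips = [p.strip() for p in xff.split(",") if p.strip()]

    if client_ip in forwarded_ips:
        return "transparent"  # the real IP is disclosed
    if via or forwarded_ips:
        return "anonymous"    # a proxy is visible, the real IP is not
    return "elite"            # no proxy headers at all
```

With the headers from the first run and the client's public IP 163.125.31.83, this returns "transparent". A test site sees only what the last proxy in the chain forwards, so the result describes the whole chain, not just the local Squid.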
9. Question: forward / reverse / transparent proxies
xxx
10. Updating the configuration with a Python script
First obtain the list of usable proxy IPs, in this format:
ip_port_type_tuple_list = [('1.1.1.1', '80', 'HTTP'), ('1.1.1.2', '1080', 'HTTPS'), ('1.1.1.3', '3128', 'both')]
import os
import time

def update_squid_conf():
    # Regenerate squid.conf from the untouched backup on every run.
    bk_path = 'C:/Squid/etc/squid/squid_backup.conf'
    conf_path = 'C:/Squid/etc/squid/squid.conf'
    fmt = ('cache_peer {ip} parent {port} 0 no-query weighted-round-robin weight=1 '
           'connect-fail-limit=2 allow-miss max-conn=5 name={name}')
    # Directives written before the cache_peer entries.
    pre_lines = ['\n#\n#\n#\nhttp_access allow all',
                 'read_timeout 30 seconds',
                 'request_timeout 30 seconds',
                 'acl acl_deny_http port 80',
                 'acl acl_deny_https port 443']
    # Directives written after the cache_peer entries.
    post_lines = ['never_direct allow all',
                  'forwarded_for off',
                  'via off',
                  'forwarded_for transparent',
                  'request_header_access Via deny all',
                  'request_header_access X-Forwarded-For deny all',
                  'request_header_access From deny all']
    # De-duplicate the proxies and group them by type.
    merge = sorted(set(ip_port_type_tuple_list), key=lambda x: x[-1])
    count = 0
    with open(bk_path, 'r') as bk_file, open(conf_path, 'w') as conf_file:
        conf_file.write(bk_file.read() + '\n')
        conf_file.write('\n'.join(pre_lines) + '\n')
        for index, (ip, port, _type) in enumerate(merge):
            # The index keeps names unique even for identical IPs.
            name = '{}_{}'.format(index, _type)
            item = fmt.format(ip=ip, port=port, name=name)
            if _type == 'HTTP':
                item += '\ncache_peer_access %s deny acl_deny_https' % name
            elif _type == 'HTTPS':
                item += '\ncache_peer_access %s deny acl_deny_http' % name
            conf_file.write(item + '\n')
            count += 1
        conf_file.write('\n'.join(post_lines) + '\n')
    # Ask the running Squid to reload the new configuration.
    assert os.system('c:/Squid/bin/squid -k reconfigure') == 0, 'update fail'
    print('{} {}/{}'.format(time.ctime(), count, len(merge)))
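11. Logging
# access_log configures the access log; daemon means the log is written to /var/log/squid/access.log in the background,
# combined is a predefined logformat; a custom logformat can be used as well
access_log daemon:/var/log/squid/access.log combined
# debug_options sets the log level of cache.log
# ALL means all modules, at log level 1; 28 is the ACL module, at log level 5; 29 is the user authentication module, at log level 9
debug_options ALL,1 28,5 29,9
Alternatively, just add: access_log daemon:c:/Squid/var/log/squid/temp.log squid
Check the log to confirm which parent proxy was used; HTTPS requests show up as TCP_TUNNEL: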
1503895567.104 5567 127.0.0.1 TCP_MISS/200 510 GET http://httpbin.org/ip - FIRSTUP_PARENT/58.22.61.211 application/json
1503895643.345 67037 127.0.0.1 TCP_TUNNEL/200 3377 CONNECT httpbin.org:443 - FIRSTUP_PARENT/58.22.61.211 -
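To see how requests are spread across the parents, log lines like the two above can be tallied with a short script. This sketch assumes the whitespace-separated `squid` logformat shown here (not `combined`); `parse_access_line` and `peer_usage` are illustrative names.

```python
from collections import Counter


def parse_access_line(line):
    """Split one line of Squid's native access.log format into a dict."""
    fields = line.split()
    if len(fields) < 10:
        raise ValueError("unexpected access.log line: %r" % line)
    result_code, _, status = fields[3].partition("/")  # e.g. TCP_MISS/200
    hierarchy, _, peer = fields[8].partition("/")      # e.g. FIRSTUP_PARENT/ip
    return {
        "timestamp": float(fields[0]),
        "elapsed_ms": int(fields[1]),
        "client": fields[2],
        "result": result_code,
        "status": int(status),
        "bytes": int(fields[4]),
        "method": fields[5],
        "url": fields[6],
        "hierarchy": hierarchy,
        "peer": peer,
    }


def peer_usage(lines):
    """Count requests per parent peer, skipping blank lines."""
    return Counter(
        parse_access_line(line)["peer"] for line in lines if line.strip()
    )
```

Passing an open log file to `peer_usage` gives the request count per parent, which can then be compared against the configured weights.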
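12. References
Official site: http://www.squid-cache.org/Doc/config/cache_peer/
Chinese documentation: http://zyan.cc/book/squid/index.html
Search for cache_peer:
cache_peer hostname type proxy-port icp-port
Enter the parent proxy here (if you want to use your ISP's proxy). For hostname, enter the name or IP address of the proxy to use; for type, enter parent. For proxy-port, enter the port number that the parent proxy's operator has also specified for use in browsers (usually 8080). If the parent's ICP port is unknown and its use is irrelevant to the provider, set icp-port to 7 or 0. In addition, specify default and no-query after the port numbers to disable use of the ICP protocol. With the provider's proxy, Squid can then behave like an ordinary browser.
never_direct allow acl_name
To prevent Squid from taking requests directly from the Internet, use the above command to force connection to another proxy. That proxy must have been entered in cache_peer beforehand. If acl_name is specified as all, all requests are forced to be forwarded directly to the parent proxy. This may be necessary, for example, when your provider strictly requires the use of its proxy or denies direct Internet access through its firewall.
forwarded_for on
If this is set to off, Squid removes the client's IP address and system name from HTTP requests. Otherwise it adds the following line to the header
X-Forwarded-For: 192.168.0.1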
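(2) Setting up a proxy server with Squid
Note that when Squid acts as a forward proxy (its default configuration), the http_port 3128 port can also handle HTTPS proxy requests. As a forward proxy, Squid takes no part in SSL encryption or decryption; it only helps set up a TCP connection from the user to port 443 of the website and then blindly relays the encrypted data between the two. The https_port directive, together with a certificate for Squid, is only needed when Squid is used as a reverse proxy.
# Deny all requests; the final catch-all rule
http_access deny all
Note: Squid evaluates http_access rules in the order in which they are defined in the configuration file. As soon as the first matching http_access rule (allow or deny) is found, the result is returned immediately and no later http_access rules are evaluated.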
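The first-match rule can be illustrated with a few lines of Python. This is only a model of the ordering semantics described above, with rules as hypothetical (action, predicate) pairs; it is not how Squid itself is implemented.

```python
def first_match(rules, request):
    """Return the action of the first rule whose predicate matches."""
    for action, predicate in rules:
        if predicate(request):
            return action  # later rules are never consulted
    return None  # no rule matched; Squid's own default is not modelled here


# Illustrative rule set: allow the local network, then deny everything.
RULES = [
    ("allow", lambda req: req["src"].startswith("192.168.")),
    ("deny", lambda req: True),
]
```

Reversing the two entries makes every request hit the catch-all deny first, which is why `http_access deny all` belongs at the end of the list.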
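Visit http://www.hawu.me through the proxy and open the Network panel of the developer tools to check the request's status: Remote Address is the proxy we configured, and the Response Headers also carry the proxy server name I defined, "funway.aliyun.proxy", which shows that the response came back through our proxy server.
Squid makes it easy to set up an HTTP proxy server, but as the blocked case above shows, a Squid proxy outside the firewall cannot do * on its own. What is needed is an stunnel between the user inside the firewall and the Squid outside it, to encrypt the requests sent to Squid. For details, see the next article: http://www.hawu.me/operation/886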
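Anonymous proxy:
Three pieces of information in the HTTP headers let a server identify a user: remote_addr, http_via, and http_x_forwarded_for.
When the user accesses a site directly, without a proxy, the headers contain:
remote_addr = the user's real IP
http_via = absent
http_x_forwarded_for = absent
When the user goes through an ordinary proxy, the server knows that a proxy is in use and knows the user's real IP. The headers contain:
remote_addr = the proxy server's IP
http_via = the proxy server's hostname (Squid's visible_hostname)
http_x_forwarded_for = the user's real IP (with several proxy layers, this should be the whole IP chain except the last hop)
When the user goes through an anonymous proxy, the server knows that a proxy is in use but does not know the user's real IP. The headers contain:
remote_addr = the proxy server's IP
http_via = the proxy server's hostname
http_x_forwarded_for = the proxy server's IP
When the user goes through a high-anonymity proxy, the server knows neither that a proxy is in use nor the user's real IP. The headers contain:
remote_addr = the proxy server's IP
http_via = absent
http_x_forwarded_for = absent
By default Squid acts as an ordinary proxy: via is on and http_forwarded_for is written. To make it an anonymous proxy, just change these two settings:
# Turn off via
via off
# Do not modify http_forwarded_for
forwarded_for transparent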
http_access allow all
http_port 64441
read_timeout 10 seconds
request_timeout 10 seconds
cache_peer ec2-52-197-85-24.ap-northeast-1.compute.amazonaws.com parent 64441 0 no-query round-robin
never_direct allow all
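(4) Building your own crawler IP proxy pool at the hundred-million scale
cache_peer IP parent PORT 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5
# 3. Reload the configuration file
os.system('squid -k reconfigure')
Usage
- Set up and run a high-anonymity Squid server following the methods described in "Setting up a forward proxy server with Squid" and "Configuring Squid as a high-anonymity proxy"
Reference from the documentation:
Simply append the following configuration to the end of the configuration file /etc/squid/squid.conf.
request_header_access Via deny all
request_header_access X-Forwarded-For deny all
request_header_access From deny all
Visit http://httpbin.org/ip; if only the Squid server's IP is returned, high anonymity is in effect.
Alternatively, visit Proxy Checker, which shows detailed proxy detection information. If the top of the page shows NO PROXY DETECTED, the high-anonymity proxy has been set up successfully.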
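(5) Squid: The Definitive Guide (Chinese edition)
10.11 How Do I...?
Squid newcomers often ask the same or similar questions about how to make Squid forward requests correctly. Here I show how to configure Squid for the common cases.
10.11.1 Send all requests through another proxy?
Simply define a parent cache and tell Squid that it is not allowed to connect directly to origin servers. For example:
cache_peer parent.host.name parent 3128 0
acl All src 0/0
never_direct allow All
The drawback of this configuration is that if the parent cache goes down, Squid cannot forward cache misses. When that happens, users receive a "cannot forward" error message.
10.11.2 Send all requests through another proxy unless it is down?
Try this configuration:
nonhierarchical_direct off
prefer_direct off
cache_peer parent.host.name parent 3128 0 default no-query
Or, if you prefer to use ICP with the other proxy:
nonhierarchical_direct off
prefer_direct off
cache_peer parent.host.name parent 3128 3130 default
With this configuration, Squid forwards all cache misses to the parent as long as it is alive. Using ICP lets Squid detect a dead parent quickly, but in some situations it may also wrongly declare the parent dead.
10.11.3 Make sure Squid does not use neighbor caches for certain requests?
Define an ACL that matches the special requests:
cache_peer parent.host.name parent 3128 0
acl Special dstdomain special.server.name
always_direct allow Special
In this case, cache misses for requests to the special.server.name domain are always sent to the origin server. Other requests may or may not go through the parent cache.
10.11.4 Send some requests through a parent cache to bypass local filters?
Some ISPs (or other organizations) have upstream providers that force HTTP traffic through a filtering proxy (perhaps using HTTP interception). If you can use a different proxy outside their network, you can bypass their filter. Here is how to send only the special requests to the far-away proxy:
cache_peer far-away-parent.host.name parent 3128 0
acl BlockedSites dstdomain www.censored.com
cache_peer_access far-away-parent.host.name allow BlockedSites
never_direct allow BlockedSites
(6) Squid configuration: cache_peer and cache_peer_domain explained
Reboot the machine, or run "net start squid" on the command line to start the service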