Scope:
I am currently trying to write a web scraper for a specific page. I have a fairly strong web crawling background in C#, but httplib is beating me.
Problem:
When I make an HTTP GET request for the page specified above, I get a "Moved Permanently" response that points to the very same URL. I can make the request using the requests library, but I want to make it work with httplib so I can understand what I am doing wrong.
Code Sample:
I am completely new to Python, so any broken language convention or wrong syntax is C#'s fault.
    import httplib

    # Wrapper for an HTTP GET request
    class HttpClient(object):
        def HttpGet(self, url, host):
            connection = httplib.HTTPConnection(host)
            connection.request('GET', url)
            return connection.getresponse().read()

    # Using the "HttpClient" class
    httpclient = HttpClient()

    # This is the full URL I need to make a GET request for: https://420101.com/strain-database
    httpResponseText = httpclient.HttpGet('/strain-database', 'www.420101.com')
    print httpResponseText
I really want to make this work with the httplib library rather than requests or any other fancy one, because I feel like I am missing something really small here.
1 Solution
#1
The problem is that I've had too little or too much caffeine in my system.
To GET an https URL, I needed the HTTPSConnection class.
Also, there is no 'www' in the address I wanted to GET, so it shouldn't be included in the host.
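A minimal sketch of those two fixes, assuming you start from a full URL: let the scheme decide between HTTPConnection and HTTPSConnection, and pass the host exactly as it appears in the URL. The helper name connection_for is made up for illustration, and the try/except import is only there so the snippet also runs on Python 3, where httplib was renamed http.client.

```python
try:
    import httplib                      # Python 2, as used in this thread
    from urlparse import urlsplit
except ImportError:
    import http.client as httplib       # Python 3 name for the same module
    from urllib.parse import urlsplit


def connection_for(url):
    """Return (connection, path) for a full URL (hypothetical helper).

    The connection class is picked from the URL scheme, and the host is
    taken from the URL as written, so no 'www' is added or removed.
    Nothing is sent over the network until request() is called.
    """
    parts = urlsplit(url)
    if parts.scheme == 'https':
        conn_class = httplib.HTTPSConnection
    else:
        conn_class = httplib.HTTPConnection
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    return conn_class(parts.netloc), path


# Example use (performs a real request, so it is left commented out):
# conn, path = connection_for('https://420101.com/strain-database')
# conn.request('GET', path)
# response = conn.getresponse()
```

Working from the full URL means the scheme and the host cannot drift apart the way they did in my original call.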
Both of the wrong addresses redirect me to the correct one with a 301 status code. If I were using requests or a more fully featured module, it would have followed the redirect automatically.
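Since httplib does not follow redirects for you, here is a rough sketch of doing it by hand, assuming the server sends a Location header with each redirect. The names fetch_once and get_following_redirects, the max_redirects limit, and the set of status codes treated as redirects are my own choices for illustration. They are not part of httplib, and this is not how requests implements it.

```python
try:
    import httplib                      # Python 2, as used in this thread
    from urlparse import urlsplit, urljoin
except ImportError:
    import http.client as httplib       # Python 3 name for the same module
    from urllib.parse import urlsplit, urljoin

# Status codes this sketch treats as redirects (an assumption, GET only).
REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url):
    """Issue a single GET and return (status, location_header, body)."""
    parts = urlsplit(url)
    if parts.scheme == 'https':
        conn = httplib.HTTPSConnection(parts.netloc)
    else:
        conn = httplib.HTTPConnection(parts.netloc)
    try:
        path = parts.path or '/'
        if parts.query:
            path += '?' + parts.query
        conn.request('GET', path)
        response = conn.getresponse()
        return response.status, response.getheader('Location'), response.read()
    finally:
        conn.close()


def get_following_redirects(url, fetch=fetch_once, max_redirects=5):
    """Follow redirects manually; return (final_url, status, body)."""
    for _ in range(max_redirects + 1):
        status, location, body = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, status, body
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError('too many redirects')


# Example use (performs real requests, so it is left commented out):
# final_url, status, body = get_following_redirects(
#     'http://www.420101.com/strain-database')
```

The max_redirects cap matters here: a server that keeps answering 301 would otherwise keep the loop going forever.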
My Validation:
    import httplib

    c = httplib.HTTPSConnection('420101.com')
    c.request("GET", "/strain-database")
    r = c.getresponse()
    print r.status, r.reason
    # Output: 200 OK