I have an application with many users, some of these users have an account on an external website with data I want to scrape.
This external site has a members area protected with a email/password form. This sets some cookies when submitted (a couple of ASP ones). You can then pull up the needed page and grab the data the external site holds for the user that just logged in.
The external site has no API.
I envisage my application asking users for their credentials to the external site, logging in on their behalf and grabbing the data we want.
How would I go about this in Python, i.e. do I need to run a GUI web browser on the server that Python prods to handle the cookies (I'd rather not)?
2 Answers
#1
- Find the call the page makes to the backend by inspecting the format of the login request in your browser's inspector.
- Make the same request, using getpass to read the user's credentials from the terminal (or collect them via a GUI). You can use urllib2 to make the requests.
- Save all the cookies from the response in a cookiejar.
- Reuse the cookies in subsequent requests and fetch data.
Then, profit.
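The steps above can be sketched as follows. Note that urllib2 is the Python 2 name; in Python 3 the same pieces live in urllib.request and http.cookiejar. The URLs and form field names here are placeholders — substitute whatever you see in your browser's network inspector.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def fetch_member_data(email, password,
                      login_url="https://example.com/members/login",
                      data_url="https://example.com/members/data"):
    """Log in with the user's credentials, then fetch the protected page.

    ``login_url``, ``data_url``, and the form field names below are
    hypothetical -- replace them with the ones the real site uses.
    """
    # An opener backed by a CookieJar keeps the session cookies set by
    # the login response and sends them back on every later request.
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

    # POST the login form; the ASP cookies land in `jar`.
    form = urllib.parse.urlencode(
        {"email": email, "password": password}).encode()
    opener.open(login_url, data=form)

    # Same opener, same cookies: this request is authenticated.
    return opener.open(data_url).read()
```

In practice you would call `fetch_member_data` with credentials read via `getpass.getpass()` rather than hard-coding them.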
#2
Usually, this is done with a session.
I recommend using the requests library (http://docs.python-requests.org/en/latest/) for this.
You can use the Session feature (http://docs.python-requests.org/en/latest/user/advanced/#session-objects). Simply perform an authentication HTTP request (the URL and parameters depend on the website you want to query), and then perform a request against the resource you want to scrape.
Without further information, we cannot help you more.
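A minimal sketch of the Session approach, assuming a hypothetical login endpoint and form field names (use the real ones from the site's login form):

```python
import requests


def scrape_member_data(email, password,
                       login_url="https://example.com/members/login",
                       data_url="https://example.com/members/data"):
    """Authenticate once, then fetch the protected resource.

    The URLs and the ``email``/``password`` field names are placeholders
    for whatever the target site actually expects.
    """
    with requests.Session() as session:
        # Cookies from the login response are stored on the session and
        # sent automatically with every subsequent request.
        session.post(login_url, data={"email": email, "password": password})

        response = session.get(data_url)
        response.raise_for_status()
        return response.text
```

Compared with the cookiejar approach, the Session object handles cookie persistence for you, so no manual cookie handling (and no GUI browser on the server) is needed.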