I'm trying to parse a bunch of webpages from an adult website using Ruby:
require 'hpricot'
require 'open-uri'
doc = Hpricot(open('random page on an adult website'))
However, what I get back instead is the initial 'Site Agreement' page, the one that asks you to confirm you're 18+, etc.
How do I get past the Site Agreement and pull the webpages I want? (If there's a way to do it, any language is fine.)
3 Solutions
#1
3
You're going to have to figure out how the site detects that a visitor has accepted the agreement.
The most obvious choice would be cookies. Likely when a visitor accepts the agreement, a cookie is sent to their browser, which is then passed back to the site on every subsequent request.
You'll have to get your script to act like a visitor by accepting the cookie, and sending it with every subsequent request. This will require programming on your part to request the "accept agreement" page first, find the cookie, and store it for use. It's likely that they don't use a specific cookie for the agreement, but rather store it in a session, in which case you just need to find the session cookie.
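The steps above could be sketched roughly like this with Ruby's standard Net::HTTP. This is only a sketch under assumptions: the method names `cookie_header_from` and `fetch_with_agreement` are made up for illustration, the two URLs are whatever the real site uses for its "accept" step and the page you want, and it assumes a plain GET of the accept URL is enough to receive the cookie.

```ruby
require 'net/http'
require 'uri'

# Build a Cookie request header from the Set-Cookie lines of a response.
# Only the leading "name=value" part of each line is kept; attributes such
# as Path, Expires or HttpOnly are dropped.
def cookie_header_from(set_cookie_lines)
  set_cookie_lines.map { |line| line.split(';', 2).first.strip }.join('; ')
end

# Hypothetical helper: request the "accept agreement" URL first, then send
# whatever cookies it returned along with the request for the real page.
def fetch_with_agreement(accept_url, page_url)
  accept_res = Net::HTTP.get_response(URI(accept_url))
  cookies = cookie_header_from(accept_res.get_fields('Set-Cookie') || [])

  page_uri = URI(page_url)
  request = Net::HTTP::Get.new(page_uri)
  request['Cookie'] = cookies unless cookies.empty?

  Net::HTTP.start(page_uri.host, page_uri.port,
                  use_ssl: page_uri.scheme == 'https') do |http|
    http.request(request).body
  end
end
```

The returned body can then be handed to the HTML parser in place of the `open(...)` call. If the site stores the acceptance in a session, the session cookie comes back through the same Set-Cookie headers, so nothing else should need to change.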
#2
0
The 'Site Agreement' page probably has a link you have to click, or a form you have to submit, that sends a request back to the server before you can proceed. Read the source of that page to be sure. You could send that same request from your application. I don't know how to do that in Ruby, but I've seen similar tasks done using cURL and libcurl, which can probably be used from Ruby.
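If the agreement turns out to be a form, submitting it from Ruby might look something like the following, using only the standard library. Treat it as a sketch: the action URL and the field names must be read from the real page's source, and `build_agreement_post` and `submit_agreement` are illustrative names, not anything the site or a library provides.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a form-encoded POST for the agreement form.
# action_url and fields are assumptions taken from the form's HTML.
def build_agreement_post(action_url, fields, cookie = nil)
  request = Net::HTTP::Post.new(URI(action_url))
  request.set_form_data(fields)           # encodes fields as key=value&key=value
  request['Cookie'] = cookie if cookie    # optional cookie from an earlier response
  request
end

# Send the POST and return the response, so its Set-Cookie headers can be
# inspected and reused on later requests.
def submit_agreement(action_url, fields, cookie = nil)
  uri = URI(action_url)
  request = build_agreement_post(action_url, fields, cookie)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end
```

A call such as `submit_agreement('http://example.com/agree', 'agree' => 'yes')` would stand in for clicking the button; both the URL and the field here are placeholders.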
#3
0
Install the LiveHTTPHeaders plugin for Firefox and visit the site. Watch the headers and see what happens when you accept the agreement. You'll probably see that the browser sends some request (possibly a POST) and accepts some cookies. Then you'll have to repeat whatever the browser does in your Ruby script.
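One way to repeat what the browser did is to turn the captured request into a Net::HTTP request object. The sketch below assumes the capture has been saved as plain HTTP request text (a request line followed by "Name: value" header lines); the exact layout your header tool produces may differ, and `request_from_capture` is a hypothetical helper name.

```ruby
require 'net/http'

# Convert captured request text into a Net::HTTP request.
# Only GET and POST are handled. Content-Length is skipped because
# Net::HTTP works it out from the body when the request is sent.
def request_from_capture(capture, body = nil)
  lines = capture.lines.map(&:strip).reject(&:empty?)
  verb, path = lines.shift.split(' ')
  klass = verb.upcase == 'POST' ? Net::HTTP::Post : Net::HTTP::Get
  request = klass.new(path)

  lines.each do |line|
    name, value = line.split(':', 2)
    next if value.nil?                          # not a header line
    next if name.strip.casecmp('content-length').zero?
    request[name.strip] = value.strip
  end

  request.body = body if body
  request
end
```

The resulting object can be sent with `Net::HTTP.start(host, port) { |http| http.request(request) }`, where the host and port come from the captured Host header.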