I would like to create a program that will enter a string into the text box on a site like Google (without using their public API) and then submit the form and grab the results. Is this possible? Grabbing the results will require the use of HTML scraping I would assume, but how would I enter data into the text field and submit the form? Would I be forced to use a public API? Is something like this just not feasible? Would I have to figure out query strings/parameters?
Thanks
5 Solutions
#1
Theory
What I would do is create a little program that can automatically submit any form data to any place and come back with the results. This is easy to do in Java with HTTPUnit. The task goes like this:
- Connect to the web server.
- Parse the page.
- Get the first form on the page.
- Fill in the form data.
- Submit the form.
- Read (and parse) the results.
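The "parse the page" and "get the first form" steps can be sketched without any library. This is a deliberately naive, regex-based illustration and not how HTTPUnit does it; the class name `FirstFormSketch` and its methods are hypothetical, and a real HTML parser is a better choice for anything beyond a quick experiment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the first <form> in a page and lists its input names.
public class FirstFormSketch {
    // Body of the first <form>...</form> element.
    private static final Pattern FORM = Pattern.compile(
        "<form\\b[^>]*>(.*?)</form>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Value of the action attribute, quoted or unquoted.
    private static final Pattern ACTION = Pattern.compile(
        "<form\\b[^>]*?\\baction\\s*=\\s*[\"']?([^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);
    // Value of the name attribute of an <input> element, quoted or unquoted.
    private static final Pattern INPUT_NAME = Pattern.compile(
        "<input\\b[^>]*?\\bname\\s*=\\s*[\"']?([^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);

    // Returns the action of the first form, or null if the page has no form action.
    static String firstFormAction(String html) {
        Matcher m = ACTION.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Returns the names of the inputs inside the first form, in document order.
    static List<String> firstFormInputNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher form = FORM.matcher(html);
        if (!form.find()) {
            return names;
        }
        Matcher input = INPUT_NAME.matcher(form.group(1));
        while (input.find()) {
            names.add(input.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sample page, kept small on purpose.
        String html = "<form action=\"/search\" name=f>"
            + "<input name=hl type=hidden value=en>"
            + "<input name=q size=55>"
            + "<input name=btnG type=submit value=\"Google Search\">"
            + "</form>";
        System.out.println(firstFormAction(html));
        System.out.println(firstFormInputNames(html));
    }
}
```

Once the action and field names are known, filling in and submitting the form comes down to sending those names with your values to the action URL.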
The solution you pick will depend on a variety of factors, including:
- Whether you need to emulate JavaScript
- What you need to do with the data afterwards
- Which languages you are proficient in
- Application speed (is this for one query or 100,000?)
- How soon the application needs to be working
- Is it a one off, or will it have to be maintained?
For example, you could try the following applications to submit the data for you:
Then grep (awk, or sed) the resulting web page(s).
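If you would rather stay in Java than pipe through grep, the same kind of pattern matching can be done with `java.util.regex`. The sketch below pulls `href` values out of a page; the class name `LinkGrepSketch` and the sample HTML are made up for illustration, and like any regex scrape it will break when the markup changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a grep-like pass over HTML that collects link targets.
public class LinkGrepSketch {
    // Value of the href attribute of an <a> element, quoted or unquoted.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']?([^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);

    // Returns every href found in the page, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a>"
            + "<a href=/preferences?hl=en>Preferences</a>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```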
Another trick when screen scraping is to download a sample HTML file and parse it manually in vi (or VIM). Save the keystrokes to a file and then whenever you run the query, apply those keystrokes to the resulting web page(s) to extract the data. This solution is not maintainable, nor 100% reliable (but screen scraping from a website seldom is). It works and is fast.
Example
A semi-generic Java class to submit website forms (specifically dealing with logging into a website) is below, in the hopes that it might be useful. Do not use it for evil.
import java.io.FileInputStream;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Properties;

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebResponse;

public class FormElements extends Properties
{
  private static final String FORM_URL = "form.url";
  private static final String FORM_ACTION = "form.action";

  /** These are properly provided property parameters. */
  private static final String FORM_PARAM = "form.param.";

  /** These are property parameters that are required; must have values. */
  private static final String FORM_REQUIRED = "form.required.";

  /** Values supplied by the caller, keyed by logical field name. */
  private Hashtable<String, String> fields = new Hashtable<String, String>( 10 );
  private WebConversation webConversation;

  public FormElements()
  {
  }

  /**
   * Retrieves the HTML page, populates the form data, then sends the
   * information to the server.
   */
  public void run()
    throws Exception
  {
    WebResponse response = receive();
    WebForm form = getWebForm( response );
    populate( form );
    form.submit();
  }

  protected WebResponse receive()
    throws Exception
  {
    WebConversation webConversation = getWebConversation();
    GetMethodWebRequest request = getGetMethodWebRequest();

    // Fake the User-Agent so the site thinks that encryption is supported.
    //
    request.setHeaderField( "User-Agent",
      "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040913" );

    return webConversation.getResponse( request );
  }

  protected void populate( WebForm form )
    throws Exception
  {
    // First set all the .param variables.
    //
    setParamVariables( form );

    // Next, set the required variables.
    //
    setRequiredVariables( form );
  }

  protected void setParamVariables( WebForm form )
    throws Exception
  {
    for( Enumeration<?> e = propertyNames(); e.hasMoreElements(); )
    {
      String property = (String)(e.nextElement());

      if( property.startsWith( FORM_PARAM ) )
      {
        String fieldName = getProperty( property );
        String propertyName = property.substring( FORM_PARAM.length() );
        String fieldValue = getField( propertyName );

        // Skip blank fields (most likely, this is a blank last name, which
        // means the form wants a full name).
        //
        if( "".equals( fieldName ) )
          continue;

        // If this is the first name, and the last name parameter is blank,
        // then append the last name field to the first name field.
        //
        if( "first_name".equals( propertyName ) &&
            "".equals( getProperty( FORM_PARAM + "last_name" ) ) )
          fieldValue += " " + getField( "last_name" );

        showSet( fieldName, fieldValue );
        form.setParameter( fieldName, fieldValue );
      }
    }
  }

  protected void setRequiredVariables( WebForm form )
    throws Exception
  {
    for( Enumeration<?> e = propertyNames(); e.hasMoreElements(); )
    {
      String property = (String)(e.nextElement());

      if( property.startsWith( FORM_REQUIRED ) )
      {
        String fieldValue = getProperty( property );
        String fieldName = property.substring( FORM_REQUIRED.length() );

        // If the field starts with a ~, then copy the field.
        //
        if( fieldValue.startsWith( "~" ) )
        {
          String copyProp = fieldValue.substring( 1, fieldValue.length() );
          copyProp = getProperty( copyProp );

          // Since the parameters have been copied into the form, we can
          // eke out the duplicate values.
          //
          fieldValue = form.getParameterValue( copyProp );
        }

        showSet( fieldName, fieldValue );
        form.setParameter( fieldName, fieldValue );
      }
    }
  }

  private void showSet( String fieldName, String fieldValue )
  {
    System.out.print( "<p class='setting'>" );
    System.out.print( fieldName );
    System.out.print( " = " );
    System.out.print( fieldValue );
    System.out.println( "</p>" );
  }

  private WebForm getWebForm( WebResponse response )
    throws Exception
  {
    WebForm[] forms = response.getForms();
    String action = getProperty( FORM_ACTION );

    // Not supposed to break out of a for-loop, but it makes the code easy ...
    //
    for( int i = forms.length - 1; i >= 0; i-- )
      if( forms[ i ].getAction().equalsIgnoreCase( action ) )
        return forms[ i ];

    // Sadly, no form was found.
    //
    throw new Exception( "No form found with action: " + action );
  }

  private GetMethodWebRequest getGetMethodWebRequest()
  {
    return new GetMethodWebRequest( getProperty( FORM_URL ) );
  }

  private WebConversation getWebConversation()
  {
    if( this.webConversation == null )
      this.webConversation = new WebConversation();

    return this.webConversation;
  }

  public void setField( String field, String value )
  {
    getFields().put( field, value );
  }

  private String getField( String field )
  {
    String result = getFields().get( field );
    return result == null ? "" : result;
  }

  private Hashtable<String, String> getFields()
  {
    return this.fields;
  }

  public static void main( String args[] )
    throws Exception
  {
    if( args.length < 5 )
    {
      System.err.println(
        "Usage: FormElements <properties> <first> <last> <email> <comments>" );
      return;
    }

    FormElements formElements = new FormElements();

    formElements.setField( "first_name", args[1] );
    formElements.setField( "last_name", args[2] );
    formElements.setField( "email", args[3] );
    formElements.setField( "comments", args[4] );

    // Load the form description; the stream is closed even if load() fails.
    try( FileInputStream fis = new FileInputStream( args[0] ) )
    {
      formElements.load( fis );
    }

    formElements.run();
  }
}
An example properties file would look like:
$ cat com.mellon.properties
form.url=https://www.mellon.com/contact/index.cfm
form.action=index.cfm
form.param.first_name=name
form.param.last_name=
form.param.email=emailhome
form.param.comments=comments
# Submit Button
#form.submit=submit
# Required Fields
#
form.required.to=zzwebmaster
form.required.phone=555-555-1212
form.required.besttime=5 to 7pm
Run it similar to the following (substitute the path to HTTPUnit and the FormElements class for $CLASSPATH):
java -cp $CLASSPATH FormElements com.mellon.properties "John" "Doe" "John.Doe@gmail.com" "To whom it may concern ..."
Legality
Another answer mentioned that it might violate terms of use. Check into that first, before you spend any time looking into a technical solution. Extremely good advice.
#2
Most of the time, you can just send a simple HTTP POST request.
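As a rough illustration of such a POST, here is a sketch using the `java.net.http` classes that ship with Java 11 and later. It only builds the request so it can be inspected without touching the network; the class name `FormPostSketch`, the URL `https://example.com/contact`, and the field names are placeholders, not anything a real site is known to accept.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds a form-encoded POST request.
public class FormPostSketch {
    // Encodes fields as application/x-www-form-urlencoded, e.g. "a=1&b=x+y".
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Builds (but does not send) the POST request.
    static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
            .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "John Doe");
        fields.put("emailhome", "John.Doe@gmail.com");

        HttpRequest request = buildPost("https://example.com/contact", fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(encodeForm(fields));
        // To actually send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

A tool like Fiddler shows you the field names and the target URL a browser really sends, which is what you would substitute for the placeholders here.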
I'd suggest you try playing around with Fiddler to understand how the web works.
Nearly all the programming languages and frameworks out there have methods for sending raw requests.
And you can always program against the Internet Explorer ActiveX control. I believe many programming languages support it.
#3
I believe this would put you in violation of the terms of use (consult a lawyer about that: programmers are not good at giving legal advice!), but, technically, you could search for foobar by just visiting the URL http://www.google.com/search?q=foobar and, as you say, scraping the resulting HTML. You'll probably also need to fake the User-Agent HTTP header and maybe some others.
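For what setting that header looks like in code, here is a minimal sketch with Java 11's `java.net.http.HttpRequest`. The class name `UserAgentSketch` and the header value are illustrative assumptions; the request is only built, not sent, and whether sending it is permitted is exactly the terms-of-use question raised above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds a GET request with a custom User-Agent header.
public class UserAgentSketch {
    static HttpRequest buildGet(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
            .header("User-Agent", userAgent)
            .GET()
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildGet(
            "http://www.google.com/search?q=foobar",
            "Mozilla/5.0 (X11; Linux x86_64)");  // example value only
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("User-Agent").orElse(""));
    }
}
```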
Maybe there are search engines whose terms of use do not forbid this; you and your lawyer might be well advised to look around to see if this is indeed the case.
#4
Well, here's the html from the Google page:
<form action="/search" name=f><table cellpadding=0 cellspacing=0><tr valign=top>
<td width=25%> </td><td align=center nowrap>
<input name=hl type=hidden value=en>
<input type=hidden name=ie value="ISO-8859-1">
<input autocomplete="off" maxlength=2048 name=q size=55 title="Google Search" value="">
<br>
<input name=btnG type=submit value="Google Search">
<input name=btnI type=submit value="I'm Feeling Lucky">
</td><td nowrap width=25% align=left>
<font size=-2> <a href=/advanced_search?hl=en>
Advanced Search</a><br>
<a href=/preferences?hl=en>Preferences</a><br>
<a href=/language_tools?hl=en>Language Tools</a></font></td></tr></table>
</form>
If you know how to make an HTTP request from your favorite programming language, just give it a try and see what you get back. Try this for instance:
http://www.google.com/search?hl=en&q=Stack+Overflow
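A URL like the one above can be assembled from the form's field names rather than typed by hand. The sketch below does this with `URLEncoder`; the class name `QueryUrlSketch` is made up, and the parameter names `hl` and `q` are simply the input names from the form shown earlier.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns form fields into a GET URL with a query string.
public class QueryUrlSketch {
    static String buildUrl(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // URLEncoder turns spaces into '+', matching form submission rules.
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("hl", "en");
        params.put("q", "Stack Overflow");
        System.out.println(buildUrl("http://www.google.com/search", params));
    }
}
```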
#5
If you download Cygwin and add Cygwin\bin to your path, you can use curl to retrieve a page and grep/sed/whatever to parse the results. Why fill out the form when, with Google, you can just use the query string parameters? With curl you can also post data, set headers, etc. I use it to call web services from the command line.