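Scraping and Parsing Web Page Data
Copyright notice: this is the author's original article and may not be reproduced without the author's permission.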
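One day I was asked: can the data on a page like this be scraped?
I was then given an account, a password, and the site address:
Account: kytj1
Password: ******************
Login URL: http://student.tiaoji.kaoyan.com/tjadm
Main approach:
1. Use Fiddler4 to analyze the HTTP exchange, including the request method (POST or GET) and the parameters sent, and capture the returned data
2. Simulate the HTTP requests in an Android app
3. Parse the HTML in Java and extract the name, target school, target major, scores, phone number, posting date, and other fields
4. Write the extracted records to a txt file and import it into Excel for further processing.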
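Inspecting the packets with Fiddler
1. Open Fiddler
2. Open the site, enter the username and password, and click Log in
Login URL: http://student.tiaoji.kaoyan.com/tjadm
3. Examine the packets Fiddler captured
The capture shows the host, the URL, the POST method, and the password in plain text
4. Examine the page data
After a successful login, the page displays the applicant data table.
The corresponding Fiddler capture shows the request host and URL. The method is GET, and the returned data can be read from the response body.
5. HTML code
The returned HTML page is as follows (excerpt)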
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
- <html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
- <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=3.0,user-scalable=no ">
- <title>考研调剂中心_考研调剂意向发布系统_考研调剂_考研网(kaoyan.com)</title>
- <meta name="description" content="" />
- <link rel="stylesheet" type="text/css" href="http://img.kaoyan.com/tiaoji/css/tiaoji-h5.css" />
- <link href="http://img.kaoyan.com/global/style/header.css" rel="stylesheet">
- <link href="http://img.kaoyan.com/yz/style/yz.index.css" rel="stylesheet">
- <script type='text/javascript' src='http://cbjs.baidu.com/js/m.js'></script>
- </head>
- <body>
- <div class="kyHd">
- <div class="kyTop">
- <script src="http://img.kaoyan.com/www/header-tiaoji.js" type="text/javascript"></script>
- <script src="http://img.kaoyan.com/www/headera.js" type="text/javascript"></script>
- </div>
- </div>
- <div style="height:10px;"></div>
- <div class="w1000ad tc">
- <script type="text/javascript">/*考研网-大通栏-通用*/var cpro_id = "u1773335";</script>
- <script src="http://cpro.baidustatic.com/cpro/ui/c.js" type="text/javascript"></script>
- </div>
- <ul class="nav" id="tjNav">
- <li><a href="http://tiaoji.kaoyan.com/" title="考研调剂首页">调剂首页</a></li>
- <li><a href="http://www.kaoyan.com/kaoyan/27/474572/" title="考研调剂流程" target="_blank">调剂流程</a></li>
- <li><a href="http://www.kaoyan.com/tiaoji/xinxi/" title="考研调剂信息">调剂信息</a></li>
- <li><a href="http://tiaoji.kaoyan.com/xinwen/" title="考研调剂新闻">调剂新闻</a></li>
- <li><a href="http://tiaoji.kaoyan.com/jingyan/" title="考研调剂经验">调剂经验</a></li>
- <li><a href="http://tiaoji.bbs.kaoyan.com/" title="考研调剂论坛" target="_blank">调剂论坛</a></li>
- </ul>
- <div class="courseArea">
- <ul class="tjPicAd mt10 clear">
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("850729");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("850747");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("850763");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("850766");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869710");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869712");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869713");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869714");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869898");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869899");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869901");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869902");</script></li>
- </ul>
- <ul class="tjPicAd clear">
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("859514");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("859516");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("859517");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("859518");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869981");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869982");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869983");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("869984");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("1033455");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("1033457");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("1033458");</script></li>
- <li><script type="text/javascript">BAIDU_CLB_fillSlot("1033459");</script></li>
- </ul>
- </div>
- <div class="box pc-index">
- <div class="tiaoji-content-nav">
- <ul>
- <li><a href="http://www.kaoyan.com">考研网>></a></li>
- <li><a href="http://tiaoji.kaoyan.com">考研调剂中心>></a></li>
- <li><a href="http://student.tiaoji.kaoyan.com">考生调剂意向</a></li>
- </ul>
- </div>
- <form action="" method="GET">
- <select name="course">
- <option value="">专业门类</option>
- <option value="哲学">哲学</option>
- <option value="经济学">经济学</option>
- <option value="法学">法学</option>
- <option value="教育学">教育学</option>
- <option value="文学">文学</option>
- <option value="历史学">历史学</option>
- <option value="理学">理学</option>
- <option value="工学">工学</option>
- <option value="农学">农学</option>
- <option value="医学">医学</option>
- <option value="军事学">军事学</option>
- <option value="管理学">管理学</option>
- <option value="艺术学">艺术学</option>
- </select>
- 报考专业: <input name="major" value=""></input>
- <input type="submit" value="搜索" />
- </form>
- <div class="tiaoji-content">
- <div class="tiaoji-cont-top">
- <h5>考生调剂信息</h5><span><a href="/tjadm/logout">退出</a></span>
- </div>
- <table class="tiaoji-tab" cellpadding="0" cellspacing="0">
- <tr>
- <th width="3%">姓名</th>
- <th width="5%">报考学校</th>
- <th width="5%">报考专业</th>
- <th width="5%">专业门类</th>
- <th width="2%">总分</th>
- <th width="2%">政治</th>
- <th width="2%">外语</th>
- <th width="2%">专一</th>
- <th width="2%">专二</th>
- <th width="5%">电话</th>
- <th width="5%">邮箱</th>
- <th width="10%">调剂意向</th>
- <th width="5%">发布时间</th>
- </tr>
- <tr>
- <td>李***</td>
- <td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">天津大学</td>
- <td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">应用化学</td>
- <td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">工学</td>
- <td>244</td>
- <td>58</td>
- <td>53</td>
- <td>133</td>
- <td>0</td>
- <td>15********15</td>
- <td></td>
- <td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">希望能调剂到211或者985院校,只要是与化学相关的都服从调剂</td>
- <td>2016-03-01</td>
- </tr>
- <tr>
- <td>何***</td>
- <td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">安徽大学</td>
- <td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">中国现当代文学</td>
- <td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;"></td>
- <td>137</td>
- <td>71</td>
- <td>66</td>
- <td>0</td>
- <td>0</td>
- <td>18********74</td>
- <td></td>
- <td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">服从调剂</td>
- <td>2016-03-01</td>
- </tr>
- </table>
- <table width="100%" align="center" border="0" cellpadding="2" cellspacing="1" class="tiaoji-fy">
- <tr>
- <td colspan="2"> <span>[每页显示:20条/总共:161659条]</span> <a>上一页</a> <b>1</b> <a href="/tjadm/2.html" >2</a> <a href="/tjadm/3.html" >3</a> <a href="/tjadm/4.html" >4</a> <a href="/tjadm/5.html" >5</a> <a href="/tjadm/6.html" >6</a> <a href="/tjadm/7.html" >7</a> <a href="/tjadm/8.html" >8</a> <a href="/tjadm/9.html" >9</a> <a href="/tjadm/10.html" >10</a> <span>...</span> <a href="/tjadm/8082.html" >8082</a> <a href="/tjadm/8083.html" >8083</a> <a href="/tjadm/2.html">下一页</a></td>
- </tr>
- </table>
- </div>
- </div>
- <p class="clearFooter"></p>
- <div class="footer">
- <script src="http://img.kaoyan.com/www/footera.js" type="text/javascript"></script>
- - <a href="http://www.kaoyan.com/sitemap/">网站地图</a>
- - <a href="http://www.kaoyan.com/yzsitemap/">院校地图</a>
- - <a href="http://www.kaoyan.com/update/">最新更新</a>
- <script src="http://img.kaoyan.com/www/footerb.js" type="text/javascript"></script>
- </div>
- <script src='http://img.kaoyan.com/global/js/gcc.js' type='text/javascript'></script>
- <script src="http://img.kaoyan.com/global/js/backtopnew.js?ver=2014092901" type="text/javascript"></script>
- <script type="text/javascript">/*考研网-全站对联*/var cpro_id = "u1773154";</script>
- </body>
- </html>
The task is to parse the required data out of HTML in this format, namely the rows of the table with class tiaoji-tab.
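Simulating the HTTP requests in an Android app
The analysis above established the request URLs, the request methods, and the parameter format, so the next step is to reproduce the exchange in an Android app. (Android is not required; implementing this directly on a PC should also work. I am simply more familiar with an HTTP library for Android, so I chose that platform.)
1. Open Eclipse, create a new project named TestGet, and copy the source of the HTTP library into it. The library used is android-async-http.
Source code: https://github.com/loopj/android-async-http
Tutorial: http://loopj.com/android-async-http/
android-async-http is an asynchronous networking library built on top of Apache HttpClient. All network work runs off Android's UI thread, and results are handled through callback methods.
In the project tree, the com.loopj.android.http package holds the android-async-http source
2. Create XcAsyncHttpClientUtil.java, define the request URLs, and wrap AsyncHttpClient's GET and POST requests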
- package com.example.testget;
- import org.apache.http.HttpEntity;
- import android.content.Context;
- import com.loopj.android.http.AsyncHttpClient;
- import com.loopj.android.http.AsyncHttpResponseHandler;
- import com.loopj.android.http.RequestParams;
- public class XcAsyncHttpClientUtil {
- public static final String BASE_URL = "http://ntiaoji.kaoyan.com";
- public static final String LOGIN_URL = "/tjadm/login";
- public static final String INDEX1 = "/tjadm/1.html";
- private static AsyncHttpClient client = new AsyncHttpClient();
- public static void get(String url, RequestParams params,
- AsyncHttpResponseHandler responseHandler) {
- client.get(getAbsoluteUrl(url), params, responseHandler);
- }
- public static void post(String url, RequestParams params,
- AsyncHttpResponseHandler responseHandler) {
- client.post(getAbsoluteUrl(url), params, responseHandler);
- }
- public static void post(Context context, String url, HttpEntity entity,
- AsyncHttpResponseHandler responseHandler) {
- client.post(context, getAbsoluteUrl(url), entity, "", responseHandler);
- }
- public static String getAbsoluteUrl(String relativeUrl) {
- return BASE_URL + relativeUrl;
- }
- }
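Since the same exchange could also be reproduced on a PC, here is a minimal sketch of that alternative using only the JDK's HttpURLConnection. It is not the code used in this post. The class name PlainHttpSketch and its method names are hypothetical, and it assumes the site tracks the login with a session cookie, which the capture described above does not confirm. The host, the paths, and the form field names username and password are the ones this post's Android code uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class PlainHttpSketch {
    // Same host and paths as XcAsyncHttpClientUtil.
    static final String BASE_URL = "http://ntiaoji.kaoyan.com";
    static final String LOGIN_URL = "/tjadm/login";

    /** Builds an application/x-www-form-urlencoded body from the parameters. */
    public static String formEncode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) sb.append('&');
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
        return sb.toString();
    }

    /** URL of the n-th result page, e.g. .../tjadm/2.html. */
    public static String pageUrl(int page) {
        return BASE_URL + "/tjadm/" + page + ".html";
    }

    /** Reads the whole response body as UTF-8 text. */
    static String readBody(HttpURLConnection conn) throws IOException {
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    static String post(String url, Map<String, String> params) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(formEncode(params).getBytes(StandardCharsets.UTF_8));
        }
        return readBody(conn);
    }

    static String get(String url) throws IOException {
        return readBody((HttpURLConnection) new URL(url).openConnection());
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: PlainHttpSketch <username> <password>");
            return;
        }
        // Assumption: the login is remembered through cookies, so keep them between requests.
        CookieHandler.setDefault(new CookieManager());
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", args[0]);
        params.put("password", args[1]);
        post(BASE_URL + LOGIN_URL, params);
        System.out.println(get(pageUrl(1)));
    }
}
```

Run without arguments it only prints the usage line, so no request is sent by accident.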
3. Edit activity_main.xml and add two buttons, one to log in and one to fetch the table data
- <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
- xmlns:tools="http://schemas.android.com/tools"
- android:layout_width="match_parent"
- android:layout_height="match_parent"
- android:paddingBottom="@dimen/activity_vertical_margin"
- android:paddingLeft="@dimen/activity_horizontal_margin"
- android:paddingRight="@dimen/activity_horizontal_margin"
- android:paddingTop="@dimen/activity_vertical_margin"
- tools:context="com.example.testget.MainActivity"
- >
- <TextView
- android:layout_width="wrap_content"
- android:layout_height="wrap_content"
- android:text="@string/hello_world" />
- <Button
- android:id="@+id/btn"
- android:layout_width="wrap_content"
- android:layout_height="wrap_content"
- android:text="Log in" >
- </Button>
- <Button
- android:id="@+id/btn1"
- android:layout_width="wrap_content"
- android:layout_height="wrap_content"
- android:text="Fetch table data" >
- </Button>
- </LinearLayout>
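4. Edit MainActivity.java and wire up the button click handlers. dologin() performs the login and doGetData() fetches the table data. The field page is used to build the request URL; it starts at 1 and is incremented to fetch the other pages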
- @Override
- protected void onCreate(Bundle savedInstanceState) {
- super.onCreate(savedInstanceState);
- setContentView(R.layout.activity_main);
- btn = (Button) findViewById(R.id.btn);
- btn.setOnClickListener(new OnClickListener() {
- @Override
- public void onClick(View v) {
- dologin();
- }
- });
- btn1 = (Button) findViewById(R.id.btn1);
- btn1.setOnClickListener(new OnClickListener() {
- @Override
- public void onClick(View v) {
- doGetData();
- }
- });
- }
- private void dologin() {
- RequestParams params = new RequestParams();
- params.put("username", "kytj1");
- params.put("password", "***********");
- XcAsyncHttpClientUtil.post(XcAsyncHttpClientUtil.LOGIN_URL, params,
- new AsyncHttpResponseHandler() {
- @Override
- public void onSuccess(int statusCode, Header[] headers,
- byte[] responseBody) {
- try {
- String jsonString = new String(responseBody,
- "UTF-8");
- Log.e("TAG", jsonString);
- } catch (UnsupportedEncodingException e) {
- e.printStackTrace();
- }
- }
- @Override
- public void onFailure(int statusCode, Header[] headers,
- byte[] responseBody, Throwable error) {
- Log.e("Login", "onFailure");
- }
- });
- }
- protected void doGetData() {
- RequestParams params = new RequestParams();
- XcAsyncHttpClientUtil.get("/tjadm/" + page + ".html", params,
- new AsyncHttpResponseHandler() {
- @Override
- public void onSuccess(int statusCode, Header[] headers,
- byte[] responseBody) {
- try {
- String jsonString = new String(responseBody,
- "UTF-8");
- parse(jsonString);
- } catch (UnsupportedEncodingException e) {
- e.printStackTrace();
- }
- }
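- @Override
- public void onFailure(int statusCode, Header[] headers,
- byte[] responseBody, Throwable error) {
- // Log the failure; otherwise the paging chain stops silently
- Log.e("GetData", "request for page " + page + " failed", error);
- }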
- });
- }
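Parsing the HTML data in Java
I had never parsed HTML before and was at a loss at first, so I searched online and found the jsoup library. jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Official site: http://jsoup.org/
Download the jsoup library from the official site
Add the downloaded jsoup-1.8.3.jar to the libs folder of the Android project
The HTML to parse is the result table with class tiaoji-tab
In the real data the table contains 21 rows: row 1 is the header (name, target school, target major, and so on) and rows 2 to 21 are the actual student records. The Java code is as follows: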
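- protected void parse(String html) {
- Document doc = Jsoup.parse(html);
- // The result table; first() returns null when the page has no such table
- Element tiaojiTab = doc.select("table.tiaoji-tab").first();
- if (tiaojiTab == null) {
- Log.e("tag", "table.tiaoji-tab not found on page " + page);
- return;
- }
- Elements lists = tiaojiTab.getElementsByTag("tr");
- int size = lists.size();
- // Start at 1 to skip the header row
- for (int i = 1; i < size; i++) {
- Element item = lists.get(i);
- Elements els = item.getElementsByTag("td");
- // Join the cell texts of one row, each followed by '#'
- StringBuilder all = new StringBuilder();
- for (int j = 0; j < els.size(); j++) {
- all.append(els.get(j).text()).append("#");
- }
- initData(all.toString());
- Log.e("tag", all.toString());
- }
- // Move on to the next page until totalsize pages have been fetched
- page++;
- if (page < totalsize + 1) {
- doGetData();
- } else {
- page = 1;
- }
- }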
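parse() joins the cell texts with "#". A free-text cell that itself contains "#" or a line break would shift the columns when the file is later split on that delimiter. The following is a minimal sketch of one way to guard against that, using plain Java only. The class RecordLineSketch and the policy of replacing such characters with a space are assumptions for illustration, not part of the code in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordLineSketch {
    static final char DELIMITER = '#';

    /** Joins cell texts with '#', neutralising characters that would break the line format. */
    public static String joinCells(List<String> cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) sb.append(DELIMITER);
            String cell = cells.get(i)
                    .replace('#', ' ')             // assumed policy: blank out the delimiter
                    .replaceAll("[\\r\\n]+", " "); // keep one record per line
            sb.append(cell.trim());
        }
        return sb.toString();
    }

    /** Splits a line back into cells; the limit -1 keeps trailing empty cells. */
    public static List<String> splitLine(String line) {
        return new ArrayList<>(Arrays.asList(line.split("#", -1)));
    }

    public static void main(String[] args) {
        String line = joinCells(Arrays.asList("A", "", "x#y", "2016-03-01"));
        System.out.println(line);
        System.out.println(splitLine(line).size());
    }
}
```

Unlike parse(), this sketch puts the delimiter between cells rather than after each one, so the line has no trailing "#".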
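The paging loop stops at a fixed totalsize, while the pager in the sample HTML reports 161659 records at 20 per page and links to /tjadm/8083.html as the last page. As a rough sketch, the last page number could instead be read from the pager links. The class PagerSketch and its methods are hypothetical, and the sketch assumes the pager always links to the final page the way the excerpt does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PagerSketch {
    // Pager links in the sample HTML look like /tjadm/8083.html
    private static final Pattern PAGE_LINK = Pattern.compile("/tjadm/(\\d+)\\.html");

    /** Largest page number among the pager links, or 1 if none are found. */
    public static int lastPage(String html) {
        int max = 1;
        Matcher m = PAGE_LINK.matcher(html);
        while (m.find()) {
            max = Math.max(max, Integer.parseInt(m.group(1)));
        }
        return max;
    }

    /** Pages needed for total records at perPage per page (ceiling division). */
    public static int pagesFor(int total, int perPage) {
        return (total + perPage - 1) / perPage;
    }

    public static void main(String[] args) {
        String pager = "<a href=\"/tjadm/2.html\" >2</a> <a href=\"/tjadm/8083.html\" >8083</a>";
        System.out.println(lastPage(pager));
        System.out.println(pagesFor(161659, 20));
    }
}
```

Both approaches agree for the sample page: ceiling division of 161659 by 20 gives 8083, the same number as the last pager link.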
Code for writing the data to a local txt file:
- private void initData(String msg) {
- String filePath = "/sdcard/Test/";
- String fileName = "tiaoji.txt";
- makeFilePath(filePath, fileName);
- writeTxtToFile(msg, filePath, fileName);
- }
- // Append a string to the text file
- public void writeTxtToFile(String strcontent, String filePath,
- String fileName) {
- // Create the folder first, then the file; otherwise an error occurs
- String strFilePath = filePath + fileName;
- // End every record with a line break
- String strContent = strcontent + "\r\n";
- try {
- File file = new File(strFilePath);
- if (!file.exists()) {
- Log.d("TestFile", "Create the file:" + strFilePath);
- file.getParentFile().mkdirs();
- file.createNewFile();
- }
- RandomAccessFile raf = new RandomAccessFile(file, "rwd");
- raf.seek(file.length());
- raf.write(strContent.getBytes());
- raf.close();
- } catch (Exception e) {
- Log.e("TestFile", "Error on write File:" + e);
- }
- }
- // Create the file
- public File makeFilePath(String filePath, String fileName) {
- File file = null;
- makeRootDirectory(filePath);
- try {
- file = new File(filePath + fileName);
- if (!file.exists()) {
- file.createNewFile();
- }
- } catch (Exception e) {
- e.printStackTrace();
- }
- return file;
- }
- // Create the folder
- public static void makeRootDirectory(String filePath) {
- File file = null;
- try {
- file = new File(filePath);
- if (!file.exists()) {
- file.mkdir();
- }
- } catch (Exception e) {
- Log.i("error:", e + "");
- }
- }
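writeTxtToFile() leaves the RandomAccessFile open if a write throws, and it relies on the platform default charset. Below is a minimal sketch of an append helper that closes the stream in every case and writes UTF-8 explicitly. It is written against the desktop JDK, and some of these classes may be unavailable on older Android versions. The class AppendLineSketch and its methods are hypothetical and not part of the project in this post.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class AppendLineSketch {
    /** Appends one record plus CRLF to the file, creating parent folders if needed. */
    public static void appendLine(File file, String record) {
        File parent = file.getParentFile();
        if (parent != null) {
            parent.mkdirs();
        }
        // The second constructor argument (true) opens the file in append mode;
        // try-with-resources closes the writer even if write() throws.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(file, true), StandardCharsets.UTF_8)) {
            w.write(record);
            w.write("\r\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back as UTF-8 text, for checking the result. */
    public static String readAll(File file) {
        try {
            return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = new File(System.getProperty("java.io.tmpdir"), "tiaoji-sketch.txt");
        appendLine(f, "a#b#");
        System.out.print(readAll(f));
    }
}
```

Opening the file once per record is simple but slow for thousands of rows; keeping one writer open for a whole page would be a natural refinement.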
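Importing the txt file into Excel
1. Connect the phone to the computer, open Yingyongbao (应用宝), choose File Management in its toolbox, and copy the txt file to the computer
2. Open Excel and choose Data > From Text. Follow the wizard: select the exported txt file, and in step 2 choose Other as the delimiter and enter "#", then finish
The data is now displayed in Excel.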
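As an alternative to entering the delimiter in the import wizard, the "#"-delimited lines could be converted to CSV first. The following is a minimal sketch under that assumption; the class CsvSketch is hypothetical, and the quoting follows the common CSV convention of doubling embedded quotes.

```java
public class CsvSketch {
    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    public static String csvField(String s) {
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    /** Converts one record line, with '#' after every cell, to a CSV row. */
    public static String toCsvRow(String line) {
        // Every cell is followed by '#', so drop the final delimiter first.
        if (line.endsWith("#")) {
            line = line.substring(0, line.length() - 1);
        }
        String[] cells = line.split("#", -1); // -1 keeps empty cells
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(csvField(cells[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow("A#B##x,y#2016-03-01#"));
    }
}
```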
The complete MainActivity.java is as follows:
- package com.example.testget;
- import java.io.File;
- import java.io.RandomAccessFile;
- import java.io.UnsupportedEncodingException;
- import org.apache.http.Header;
- import org.jsoup.Jsoup;
- import org.jsoup.nodes.Document;
- import org.jsoup.nodes.Element;
- import org.jsoup.select.Elements;
- import android.app.Activity;
- import android.os.Bundle;
- import android.util.Log;
- import android.view.View;
- import android.view.View.OnClickListener;
- import android.widget.Button;
- import com.loopj.android.http.AsyncHttpResponseHandler;
- import com.loopj.android.http.RequestParams;
- public class MainActivity extends Activity {
- private Button btn, btn1;
- private int page = 1;
- private static final int totalsize = 200;
- @Override
- protected void onCreate(Bundle savedInstanceState) {
- super.onCreate(savedInstanceState);
- setContentView(R.layout.activity_main);
- btn = (Button) findViewById(R.id.btn);
- btn.setOnClickListener(new OnClickListener() {
- @Override
- public void onClick(View v) {
- dologin();
- }
- });
- btn1 = (Button) findViewById(R.id.btn1);
- btn1.setOnClickListener(new OnClickListener() {
- @Override
- public void onClick(View v) {
- doGetData();
- }
- });
- }
- private void dologin() {
- RequestParams params = new RequestParams();
- params.put("username", "kytj1");
- params.put("password", "************");
- XcAsyncHttpClientUtil.post(XcAsyncHttpClientUtil.LOGIN_URL, params,
- new AsyncHttpResponseHandler() {
- @Override
- public void onSuccess(int statusCode, Header[] headers,
- byte[] responseBody) {
- try {
- String jsonString = new String(responseBody,
- "UTF-8");
- Log.e("TAG", jsonString);
- } catch (UnsupportedEncodingException e) {
- e.printStackTrace();
- }
- }
- @Override
- public void onFailure(int statusCode, Header[] headers,
- byte[] responseBody, Throwable error) {
- Log.e("Login", "onFailure");
- }
- });
- }
- protected void doGetData() {
- RequestParams params = new RequestParams();
- XcAsyncHttpClientUtil.get("/tjadm/" + page + ".html", params,
- new AsyncHttpResponseHandler() {
- @Override
- public void onSuccess(int statusCode, Header[] headers,
- byte[] responseBody) {
- try {
- String jsonString = new String(responseBody,
- "UTF-8");
- parse(jsonString);
- } catch (UnsupportedEncodingException e) {
- e.printStackTrace();
- }
- }
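- @Override
- public void onFailure(int statusCode, Header[] headers,
- byte[] responseBody, Throwable error) {
- // Log the failure; otherwise the paging chain stops silently
- Log.e("GetData", "request for page " + page + " failed", error);
- }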
- });
- }
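- protected void parse(String html) {
- Document doc = Jsoup.parse(html);
- // The result table; first() returns null when the page has no such table
- Element tiaojiTab = doc.select("table.tiaoji-tab").first();
- if (tiaojiTab == null) {
- Log.e("tag", "table.tiaoji-tab not found on page " + page);
- return;
- }
- Elements lists = tiaojiTab.getElementsByTag("tr");
- int size = lists.size();
- // Start at 1 to skip the header row
- for (int i = 1; i < size; i++) {
- Element item = lists.get(i);
- Elements els = item.getElementsByTag("td");
- // Join the cell texts of one row, each followed by '#'
- StringBuilder all = new StringBuilder();
- for (int j = 0; j < els.size(); j++) {
- all.append(els.get(j).text()).append("#");
- }
- initData(all.toString());
- Log.e("tag", all.toString());
- }
- // Move on to the next page until totalsize pages have been fetched
- page++;
- if (page < totalsize + 1) {
- doGetData();
- } else {
- page = 1;
- }
- }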
- private void initData(String msg) {
- String filePath = "/sdcard/Test/";
- String fileName = "tiaoji.txt";
- makeFilePath(filePath, fileName);
- writeTxtToFile(msg, filePath, fileName);
- }
- // Append a string to the text file
- public void writeTxtToFile(String strcontent, String filePath,
- String fileName) {
- // Create the folder first, then the file; otherwise an error occurs
- String strFilePath = filePath + fileName;
- // End every record with a line break
- String strContent = strcontent + "\r\n";
- try {
- File file = new File(strFilePath);
- if (!file.exists()) {
- Log.d("TestFile", "Create the file:" + strFilePath);
- file.getParentFile().mkdirs();
- file.createNewFile();
- }
- RandomAccessFile raf = new RandomAccessFile(file, "rwd");
- raf.seek(file.length());
- raf.write(strContent.getBytes());
- raf.close();
- } catch (Exception e) {
- Log.e("TestFile", "Error on write File:" + e);
- }
- }
- // Create the file
- public File makeFilePath(String filePath, String fileName) {
- File file = null;
- makeRootDirectory(filePath);
- try {
- file = new File(filePath + fileName);
- if (!file.exists()) {
- file.createNewFile();
- }
- } catch (Exception e) {
- e.printStackTrace();
- }
- return file;
- }
- // Create the folder
- public static void makeRootDirectory(String filePath) {
- File file = null;
- try {
- file = new File(filePath);
- if (!file.exists()) {
- file.mkdir();
- }
- } catch (Exception e) {
- Log.i("error:", e + "");
- }
- }
- }