Question
www.vdisk.cn is a website where you can store your files for free. My space is http://www.vdisk.cn/garyyding. The limitation is that unregistered users cannot keep files for more than one month.
A user's URL has a standard format: http://www.vdisk.cn/[username]. I want to collect statistics on which user has which kinds of files. For example, make up a username, then search that user's page.
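The URL format above can be sketched in a few lines of Java. This is only an illustration: the class name `VdiskUrls` and the helper `profileUrl` are hypothetical, and the two usernames are the ones that appear in this post.

```java
import java.net.URI;
import java.util.List;

public class VdiskUrls {
    // Base address taken from the URL format described above.
    static final String BASE = "http://www.vdisk.cn/";

    // Build the profile URL for one (possibly guessed) username.
    static URI profileUrl(String username) {
        return URI.create(BASE + username);
    }

    public static void main(String[] args) {
        // Candidate usernames to probe; both appear in this post.
        for (String user : List.of("garyyding", "msdiaoxian")) {
            System.out.println(profileUrl(user));
        }
    }
}
```

Each generated URL is then a starting page for the analysis described in the Solution section.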
Solution
Analyze the web page and find the URLs it uses.
This is the content we care about: file links, pagination links, and tag links with per-tag file counts:
<a href="/down/index/7376710" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第05章_数组2of2.7z</a>
<a href='?tag=ALLFILES&p=2' title='查看该页'>2</a>
<a href='?tag=ALLFILES&p=3' title='查看该页'>3</a>
<div class='tag2'>所有文件(339)</div>
<div class='tag'><a href='?tag=%E9%98%85%E8%AF%BB&p=1' title='阅读(2)'>阅读(2)</a></div>
<div class='tag'><a href='?tag=%E8%BE%93%E5%85%A5%E5%B7%A5%E5%85%B7&p=1' title='输入工具(5)'>输入工具(5)</a></div>
<div class='tag'><a href='?tag=%E8%BD%AF%E4%BB%B6%E5%B7%A5%E7%A8%8B&p=1' title='软件工程(4)'>软件工程(4)</a></div>
<div class='tag'><a href='?tag=%E8%AF%BE%E7%A8%8B%E4%BD%9C%E4%B8%9A&p=1' title='课程作业(17)'>课程作业(17)</a></div>
<div class='tag'><a href='?tag=%E8%AE%A1%E7%AE%97%E6%9C%BA%E7%B1%BB%E7%94%B5%E5%AD%90%E6%9D%82%E5%BF%97&p=1' title='计算机类电子杂志(2)'>计算机类电子杂志(2)</a></div>
<div class='tag'><a href='?tag=%E8%8B%B1%E8%AF%AD%E5%AD%A6%E4%B9%A0&p=1' title='英语学习(12)'>英语学习(12)</a></div>
<div class='tag'><a href='?tag=%E8%81%8A%E5%A4%A9%E9%80%9A%E8%AE%AF&p=1' title='聊天通讯(8)'>聊天通讯(8)</a></div>
<div class='tag'><a href='?tag=%E7%BC%96%E8%AF%91%E7%BC%96%E8%BE%91%E5%B7%A5%E5%85%B7&p=1' title='编译编辑工具(9)'>编译编辑工具(9)</a></div>
<div class='tag'><a href='?tag=%E7%BC%96%E7%A8%8B%E7%9B%B8%E5%85%B3%E6%96%87%E7%AB%A0&p=1' title='编程相关文章(4)'>编程相关文章(4)</a></div>
<div class='tag'><a href='?tag=%E7%BC%96%E7%A8%8B%E4%B9%A6%E7%B1%8D&p=1' title='编程书籍(7)'>编程书籍(7)</a></div>
<div class='tag'><a href='?tag=%E7%B3%BB%E7%BB%9F%E5%A2%9E%E5%BC%BA&p=1' title='系统增强(10)'>系统增强(10)</a></div>
<div class='tag'><a href='?tag=%E7%B3%BB%E7%BB%9F&p=1' title='系统(1)'>系统(1)</a></div>
<div class='tag'><a href='?tag=%E7%97%85%E6%AF%92%E6%A0%B7%E6%9C%AC&p=1' title='病毒样本(2)'>病毒样本(2)</a></div>
<div class='tag'><a href='?tag=%E7%94%B5%E5%AD%90%E6%9D%82%E5%BF%97%E4%B8%8E%E7%9B%B8%E5%86%8C&p=1' title='电子杂志与相册(7)'>电子杂志与相册(7)</a></div>
<div class='tag'><a href='?tag=%E6%B3%A8%E5%86%8C%E6%9C%BA&p=1' title='注册机(16)'>注册机(16)</a></div>
<div class='tag'><a href='?tag=%E6%A3%8B%E7%B1%BB%E6%B8%B8%E6%88%8F%E5%8F%8A%E8%B5%84%E6%96%99&p=1' title='棋类游戏及资料(6)'>棋类游戏及资料(6)</a></div>
<div class='tag'><a href='?tag=%E6%9D%80%E6%AF%92%E5%AE%89%E5%85%A8&p=1' title='杀毒安全(29)'>杀毒安全(29)</a></div>
<div class='tag'><a href='?tag=%E6%99%BA%E8%83%BD%E8%BD%A6&p=1' title='智能车(4)'>智能车(4)</a></div>
<div class='tag'><a href='?tag=%E6%99%BA%E5%8A%9B&p=1' title='智力(5)'>智力(5)</a></div>
<div class='tag'><a href='?tag=%E6%96%87%E5%AD%A6%E5%B0%8F%E8%AF%B4&p=1' title='文学小说(7)'>文学小说(7)</a></div>
<div class='tag'><a href='?tag=%E6%88%91%E7%9A%84%E6%94%B6%E8%97%8F&p=1' title='我的收藏(15)'>我的收藏(15)</a></div>
<div class='tag'><a href='?tag=%E5%B5%8C%E5%85%A5%E5%BC%8F&p=1' title='嵌入式(4)'>嵌入式(4)</a></div>
<div class='tag'><a href='?tag=%E5%A5%BD%E7%94%A8%E5%B7%A5%E5%85%B7&p=1' title='好用工具(19)'>好用工具(19)</a></div>
<div class='tag'><a href='?tag=%E4%B8%AA%E4%BA%BA%E7%BB%83%E4%B9%A0%E4%BB%A3%E7%A0%81&p=1' title='个人练习代码(13)'>个人练习代码(13)</a></div>
<div class='tag'><a href='?tag=%E4%B8%8B%E8%BD%BD%E4%B8%8A%E4%BC%A0%E5%B7%A5%E5%85%B7&p=1' title='下载上传工具(9)'>下载上传工具(9)</a></div>
<div class='tag'><a href='?tag=Windows%E5%B0%8F%E5%B7%A5%E5%85%B7&p=1' title='Windows小工具(5)'>Windows小工具(5)</a></div>
<div class='tag'><a href='?tag=VOA%E8%8B%B1%E8%AF%AD&p=1' title='VOA英语(26)'>VOA英语(26)</a></div>
<div class='tag'><a href='?tag=JAVA2&p=1' title='JAVA2(30)'>JAVA2(30)</a></div>
<div class='tag'><a href='?tag=Java&p=1' title='Java(1)'>Java(1)</a></div>
<div class='tag'><a href='?tag=C%E8%AF%AD%E8%A8%80%E4%BB%A3%E7%A0%81&p=1' title='C语言代码(13)'>C语言代码(13)</a></div>
<div class='tag'><a href='?tag=C%E8%AF%AD%E8%A8%80%E4%B9%A6%E7%B1%8D&p=1' title='C语言书籍(18)'>C语言书籍(18)</a></div>
<div class='tag'><a href='?tag=BT%E7%A7%8D%E5%AD%90%E6%96%87%E4%BB%B6&p=1' title='BT种子文件(8)'>BT种子文件(8)</a></div>
<div class='tag'><a href='?tag=bccn%E5%8D%A7%E9%BE%99%E5%AD%94%E6%98%8E%E7%9A%84%E8%B5%84%E6%96%99&p=1' title='bccn卧龙孔明的资料(2)'>bccn卧龙孔明的资料(2)</a></div>
<div class='tag'><a href='?tag=Android%E5%BA%94%E7%94%A8&p=1' title='Android应用(3)'>Android应用(3)</a></div>
<div class='tag'><a href='?tag=&p=1' title='未分类(16)'>未分类(16)</a></div>
</div>
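Each tag link in the markup above carries the tag name and its file count in the `title` attribute, in the form `name(count)`. As a rough sketch of how those pairs could be pulled out, here is a plain-Java regular expression. This is not what Web-Harvest does internally; the class name `TagCounts` and the method `parse` are hypothetical, and the pattern assumes the markup looks exactly like the sample.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCounts {
    // Matches <div class='tag'><a href='?tag=...' title='NAME(COUNT)'>
    // and captures NAME and COUNT. The greedy [^']* backtracks to the
    // last "(digits)" so names that contain parentheses still work.
    private static final Pattern TAG = Pattern.compile(
        "<div class='tag'>\\s*<a\\s+href='\\?tag=[^']*'\\s+title='([^']*)\\((\\d+)\\)'>");

    // Returns tag name -> file count, in page order.
    static Map<String, Integer> parse(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            counts.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return counts;
    }

    public static void main(String[] args) {
        String html =
            "<div class='tag'><a href='?tag=JAVA2&p=1' title='JAVA2(30)'>JAVA2(30)</a></div> "
          + "<div class='tag'><a href='?tag=Java&p=1' title='Java(1)'>Java(1)</a></div>";
        System.out.println(parse(html)); // prints {JAVA2=30, Java=1}
    }
}
```

A regular expression is fragile against markup changes. It is used here only to make the structure of the tag list explicit.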
Tools
I needed a tool to analyze web pages. Searching www.open-open.com, I found Web-Harvest:
http://web-harvest.sourceforge.net/index.php
It uses XPath to analyze web pages. Instead of having you code against an API, it is driven by an XML configuration file in which each element is a command.
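To show what an XPath query over a page looks like, here is a small sketch that uses the JDK's built-in `javax.xml.xpath` package rather than Web-Harvest itself. It assumes the input is already well-formed XML (Web-Harvest cleans real HTML first). The class name `XPathSketch`, the method `fileNames`, and the sample markup are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Returns the text of every <a> whose href contains "/down/index/".
    static List<String> fileNames(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            "//a[contains(@href, '/down/index/')]",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Well-formed stand-in for a cleaned page: one file link, one paging link.
        String xml = "<div>"
            + "<a href=\"/down/index/7156780\" target=\"_blank\">msxml6.msi</a>"
            + "<a href=\"?tag=ALLFILES&amp;p=2\">2</a>"
            + "</div>";
        System.out.println(fileNames(xml)); // prints [msxml6.msi]
    }
}
```

The paging link is skipped because its `href` does not contain `/down/index/`.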
Coding
The Web-Harvest configuration (loaded as test1.xml by the Java class that follows):
<?xml version="1.0" encoding="UTF-8"?>
<!-- Expects following initial variable: search - search expression -->
<config charset="UTF-8">
    <include path="functions.xml" />

    <!-- defines search keyword and start URL -->
    <var-def name="search" overwrite="false">platon</var-def>
    <var-def name="currentUser" overwrite="false">msdiaoxian</var-def>
    <var-def name="targetWebsite" overwrite="false">http://www.vdisk.cn/msdiaoxian</var-def>

    <!-- collect the paging links (href contains ?tag=ALLFILES) from the start page -->
    <var-def name="url">
        <xpath expression="//a[@href[contains(., '?tag=ALLFILES')]]/@href[1]">
            <html-to-xml>
                <http url="${targetWebsite}" />
            </html-to-xml>
        </xpath>
    </var-def>

    <!-- for each unique paging link, fetch the page and write its /down/index/ file links -->
    <file action="write" path="output.xml" charset="UTF-8">
        <template>
            <![CDATA[ <catalog name="${currentUser}"> ]]>
        </template>
        <loop item="link" index="i" filter="unique">
            <list><var name="url" /></list>
            <body>
                <var-def name="links">
                    <xpath expression="//a[@href[contains(., '/down/index/')]]">
                        <html-to-xml>
                            <http url="${targetWebsite}/${link}" />
                        </html-to-xml>
                    </xpath>
                </var-def>
                <var name="links" />
            </body>
        </loop>
        <![CDATA[ </catalog> ]]>
    </file>
</config>
The Java class that runs the configuration:
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class WebHarvestTest {

    public static void main(String[] args) throws Exception {
        // register external plugins if there are any
        // DefinitionResolver.registerPlugin("com.my.MyPlugin1");
        // DefinitionResolver.registerPlugin("com.my.MyPlugin2");
        // DefinitionResolver.registerPlugin("com.my.MyPlugin3");

        String path = WebHarvestTest.class.getResource("/").getFile();

        System.setProperty("http.proxyHost", "192.168.0.59");
        System.setProperty("http.proxyPort", "8080");
        //System.setProperty("java.net.useSystemProxies", "true");

        ScraperConfiguration config = new ScraperConfiguration(path + "/test1.xml");
        Scraper scraper = new Scraper(config, path);

        scraper.addVariableToContext("username", "web-harvest");
        scraper.addVariableToContext("password", "web-harvest");
        // scraper.addVariableToContext("myXmlLib", new MyXmlLibrary());

        scraper.setDebug(true);
        scraper.getHttpClientManager().setHttpProxy("192.168.0.59", 8080);

        scraper.execute();

        // takes variable created during execution
        Variable articles = (Variable) scraper.getContext().get("url");
        //System.out.println("articles=" + articles);

        // do something with articles...
    }
}
The resulting output.xml begins:
<catalog name="msdiaoxian">
<a href="/down/index/7367058" target="_blank">voa2011年2月1-7含同步原文字幕.7z</a>
<a href="/down/index/7356344" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第05章_数组1of2.7z</a>
<a href="/down/index/7343831" target="_blank">j2se2010-4-4.7z</a>
<a href="/down/index/7328717" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第04章_异常处理.7z</a>
<a href="/down/index/7302993" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象5of5.7z</a>
<a href="/down/index/7287297" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象4of5.7z</a>
<a href="/down/index/7276709" target="_blank">voa2011年1月29-31含同步原文字幕.7z</a>
<a href="/down/index/7276132" target="_blank">硬件高手——电脑爱好者_高清.pdf</a>
<a href="/down/index/7258654" target="_blank">迷你迅雷MiniThunderInstaller3.1.1.58.exe</a>
<a href="/down/index/7230575" target="_blank">计算机硬件信息检测工具EVEREST.7z</a>
<a href="/down/index/7223397" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象3of5.7z</a>
<a href="/down/index/7219071" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象2of5.7z</a>
<a href="/down/index/7172651" target="_blank">尚学堂科技_马士兵_JAVA视频教程_J2SE_5.0_第02章_基础语法.7z</a>
<a href="/down/index/7171283" target="_blank">200行的简单Snake1.0.1.13C语言源码(含头文件).7z</a>
<a href="/down/index/7162335" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象1of5.7z</a>
<a href="/down/index/7157219" target="_blank">SQLServer2005_SSMSEE(SQLServer Management Studio Express).msi</a>
<a href="/down/index/7156915" target="_blank">SQLEXPR_CHS(SQL Server Express Edition2005).EXE</a>
<a href="/down/index/7156780" target="_blank">msxml6.msi</a>
<a href="/down/index/7154801" target="_blank">net framework cleanup_tool.exe</a>
<a href="/down/index/7154018" target="_blank">Microsoft_DotNetFXCHS2.0(.net Framework 2.0).exe</a>
<a href="/down/index/7151990" target="_blank">iis 5.1 简体中文完整安装包 适用XP.7z</a>
<a href="/down/index/7135401" target="_blank">尚学堂科技_马士兵_JAVA视频教程_JDK5.0_下载-安装-配置.7z</a>
<a href="/down/index/7123298" target="_blank">尚学堂科技_马士兵_JAVA视频教程_J2SE_5.0_第02章_递归补充.7z</a>
<a href="/down/index/7114719" target="_blank">Lrc歌词文件制作程序CreateLrc0.1.3.6.7z</a>
<a href="/down/index/7099147" target="_blank">voa2011年1月22-28含同步原文歌词.7z</a>
<a href="/down/index/7003630" target="_blank">使用C语言开发图形界面.zip</a>
<a href="/down/index/6993787" target="_blank">voa2011年1月15-21含同步原文歌词.7z</a>
<a href="/down/index/6968986" target="_blank">voa2011年1月8-14含同步原文歌词.7z</a>
<a href="/down/index/6952433" target="_blank">操作系统设计与实现(第2版+中文).pdf</a>
<a href="/down/index/6903829" target="_blank">voa2011年1月1-7含同步原文歌词.7z</a>
<a href="/down/index/6843412" target="_blank">voa2010年12月29-31含同步原文歌词.7z</a>
<a href="/down/index/6836912" target="_blank">Snake贪吃蛇啰嗦版500行代码0.1.0.5.7z</a>
<a href="/down/index/6836512" target="_blank">24小时反恐部队-第二季.torrent</a>
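The original goal was to find out which user has which kinds of files. One possible next step, sketched here as an assumption and not as part of the original code, is to count the scraped file names by extension. The class name `ExtensionStats` and the method `countByExtension` are hypothetical, and the pattern assumes links shaped like the ones in the output above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtensionStats {
    // Captures the file name inside each /down/index/ link.
    private static final Pattern LINK = Pattern.compile(
        "<a href=\"/down/index/\\d+\"\\s+target=\"_blank\">([^<]*)</a>");

    // Returns lower-cased extension -> number of files, sorted by extension.
    static Map<String, Integer> countByExtension(String catalog) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = LINK.matcher(catalog);
        while (m.find()) {
            String name = m.group(1);
            int dot = name.lastIndexOf('.');
            String ext = dot < 0
                ? "(none)"
                : name.substring(dot + 1).toLowerCase(Locale.ROOT);
            counts.merge(ext, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String catalog =
            "<a href=\"/down/index/7156780\" target=\"_blank\">msxml6.msi</a> "
          + "<a href=\"/down/index/7343831\" target=\"_blank\">j2se2010-4-4.7z</a> "
          + "<a href=\"/down/index/7156915\" target=\"_blank\">SQLEXPR_CHS(SQL Server Express Edition2005).EXE</a>";
        System.out.println(countByExtension(catalog)); // prints {7z=1, exe=1, msi=1}
    }
}
```

Running this over the output for several usernames would give a per-user breakdown by file type.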