Using Web-Harvest to extract data from www.vdisk.cn

Time: 2022-06-15 13:37:14


Question

www.vdisk.cn is a website where you can store your files for free. My space is http://www.vdisk.cn/garyyding. The limitation is that unregistered users cannot keep their files for more than one month.

Each user's URL follows a standard format: http://www.vdisk.cn/[username]. I want to gather statistics on which user has which kinds of files, for example by making up a username and then looking up that user's space.
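As a minimal sketch of the URL format just described, the snippet below turns candidate usernames into space URLs. The class name VdiskUrls and the username "someuser" are made up for illustration; only garyyding comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;

class VdiskUrls {
    static final String BASE = "http://www.vdisk.cn/";

    // Space URL for one user, following http://www.vdisk.cn/[username].
    static String userUrl(String username) {
        return BASE + username;
    }

    // Space URLs for a list of candidate usernames, in the same order.
    static List<String> userUrls(List<String> usernames) {
        List<String> urls = new ArrayList<>();
        for (String name : usernames) {
            urls.add(userUrl(name));
        }
        return urls;
    }

    public static void main(String[] args) {
        // "someuser" is a placeholder, not a known account.
        for (String url : userUrls(List.of("garyyding", "someuser"))) {
            System.out.println(url);
        }
    }
}
```

Whether a guessed username actually exists can only be found out by requesting the page, so a list like this is just the input to the scraping step.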


Solution

Analyze the web page and find the URLs it uses.

This is the content we care about: the download links, the pagination links, and the tag list.

<a href="/down/index/7376710" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第05章_数组2of2.7z</a>


<a href='?tag=ALLFILES&p=2' title='查看该页'>2</a>  <a href='?tag=ALLFILES&p=3' title='查看该页'>3</a>

<div class='tag2'>所有文件(339)</div>  <div class='tag'><a href='?tag=%E9%98%85%E8%AF%BB&p=1' title='阅读(2)'>阅读(2)</a></div>  <div class='tag'><a href='?tag=%E8%BE%93%E5%85%A5%E5%B7%A5%E5%85%B7&p=1' title='输入工具(5)'>输入工具(5)</a></div>  <div class='tag'><a href='?tag=%E8%BD%AF%E4%BB%B6%E5%B7%A5%E7%A8%8B&p=1' title='软件工程(4)'>软件工程(4)</a></div>  <div class='tag'><a href='?tag=%E8%AF%BE%E7%A8%8B%E4%BD%9C%E4%B8%9A&p=1' title='课程作业(17)'>课程作业(17)</a></div>  <div class='tag'><a href='?tag=%E8%AE%A1%E7%AE%97%E6%9C%BA%E7%B1%BB%E7%94%B5%E5%AD%90%E6%9D%82%E5%BF%97&p=1' title='计算机类电子杂志(2)'>计算机类电子杂志(2)</a></div>  <div class='tag'><a href='?tag=%E8%8B%B1%E8%AF%AD%E5%AD%A6%E4%B9%A0&p=1' title='英语学习(12)'>英语学习(12)</a></div>  <div class='tag'><a href='?tag=%E8%81%8A%E5%A4%A9%E9%80%9A%E8%AE%AF&p=1' title='聊天通讯(8)'>聊天通讯(8)</a></div>  <div class='tag'><a href='?tag=%E7%BC%96%E8%AF%91%E7%BC%96%E8%BE%91%E5%B7%A5%E5%85%B7&p=1' title='编译编辑工具(9)'>编译编辑工具(9)</a></div>  <div class='tag'><a href='?tag=%E7%BC%96%E7%A8%8B%E7%9B%B8%E5%85%B3%E6%96%87%E7%AB%A0&p=1' title='编程相关文章(4)'>编程相关文章(4)</a></div>  <div class='tag'><a href='?tag=%E7%BC%96%E7%A8%8B%E4%B9%A6%E7%B1%8D&p=1' title='编程书籍(7)'>编程书籍(7)</a></div>  <div class='tag'><a href='?tag=%E7%B3%BB%E7%BB%9F%E5%A2%9E%E5%BC%BA&p=1' title='系统增强(10)'>系统增强(10)</a></div>  <div class='tag'><a href='?tag=%E7%B3%BB%E7%BB%9F&p=1' title='系统(1)'>系统(1)</a></div>  <div class='tag'><a href='?tag=%E7%97%85%E6%AF%92%E6%A0%B7%E6%9C%AC&p=1' title='病毒样本(2)'>病毒样本(2)</a></div>  <div class='tag'><a href='?tag=%E7%94%B5%E5%AD%90%E6%9D%82%E5%BF%97%E4%B8%8E%E7%9B%B8%E5%86%8C&p=1' title='电子杂志与相册(7)'>电子杂志与相册(7)</a></div>  <div class='tag'><a href='?tag=%E6%B3%A8%E5%86%8C%E6%9C%BA&p=1' title='注册机(16)'>注册机(16)</a></div>  <div class='tag'><a href='?tag=%E6%A3%8B%E7%B1%BB%E6%B8%B8%E6%88%8F%E5%8F%8A%E8%B5%84%E6%96%99&p=1' title='棋类游戏及资料(6)'>棋类游戏及资料(6)</a></div>  <div class='tag'><a href='?tag=%E6%9D%80%E6%AF%92%E5%AE%89%E5%85%A8&p=1' title='杀毒安全(29)'>杀毒安全(29)</a></div>  <div class='tag'><a 
href='?tag=%E6%99%BA%E8%83%BD%E8%BD%A6&p=1' title='智能车(4)'>智能车(4)</a></div>  <div class='tag'><a href='?tag=%E6%99%BA%E5%8A%9B&p=1' title='智力(5)'>智力(5)</a></div>  <div class='tag'><a href='?tag=%E6%96%87%E5%AD%A6%E5%B0%8F%E8%AF%B4&p=1' title='文学小说(7)'>文学小说(7)</a></div>  <div class='tag'><a href='?tag=%E6%88%91%E7%9A%84%E6%94%B6%E8%97%8F&p=1' title='我的收藏(15)'>我的收藏(15)</a></div>  <div class='tag'><a href='?tag=%E5%B5%8C%E5%85%A5%E5%BC%8F&p=1' title='嵌入式(4)'>嵌入式(4)</a></div>  <div class='tag'><a href='?tag=%E5%A5%BD%E7%94%A8%E5%B7%A5%E5%85%B7&p=1' title='好用工具(19)'>好用工具(19)</a></div>  <div class='tag'><a href='?tag=%E4%B8%AA%E4%BA%BA%E7%BB%83%E4%B9%A0%E4%BB%A3%E7%A0%81&p=1' title='个人练习代码(13)'>个人练习代码(13)</a></div>  <div class='tag'><a href='?tag=%E4%B8%8B%E8%BD%BD%E4%B8%8A%E4%BC%A0%E5%B7%A5%E5%85%B7&p=1' title='下载上传工具(9)'>下载上传工具(9)</a></div>  <div class='tag'><a href='?tag=Windows%E5%B0%8F%E5%B7%A5%E5%85%B7&p=1' title='Windows小工具(5)'>Windows小工具(5)</a></div>  <div class='tag'><a href='?tag=VOA%E8%8B%B1%E8%AF%AD&p=1' title='VOA英语(26)'>VOA英语(26)</a></div>  <div class='tag'><a href='?tag=JAVA2&p=1' title='JAVA2(30)'>JAVA2(30)</a></div>  <div class='tag'><a href='?tag=Java&p=1' title='Java(1)'>Java(1)</a></div>  <div class='tag'><a href='?tag=C%E8%AF%AD%E8%A8%80%E4%BB%A3%E7%A0%81&p=1' title='C语言代码(13)'>C语言代码(13)</a></div>  <div class='tag'><a href='?tag=C%E8%AF%AD%E8%A8%80%E4%B9%A6%E7%B1%8D&p=1' title='C语言书籍(18)'>C语言书籍(18)</a></div>  <div class='tag'><a href='?tag=BT%E7%A7%8D%E5%AD%90%E6%96%87%E4%BB%B6&p=1' title='BT种子文件(8)'>BT种子文件(8)</a></div>  <div class='tag'><a href='?tag=bccn%E5%8D%A7%E9%BE%99%E5%AD%94%E6%98%8E%E7%9A%84%E8%B5%84%E6%96%99&p=1' title='bccn卧龙孔明的资料(2)'>bccn卧龙孔明的资料(2)</a></div>  <div class='tag'><a href='?tag=Android%E5%BA%94%E7%94%A8&p=1' title='Android应用(3)'>Android应用(3)</a></div>  <div class='tag'><a href='?tag=&p=1' title='未分类(16)'>未分类(16)</a></div> 			</div>
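The tag links above already carry a per-tag file count in their title attribute, in the form name(count). As a rough sketch, assuming the markup keeps that shape, a regular expression is enough to pull those pairs out. The class name TagCounts is hypothetical, and this is a side illustration, not what the Web-Harvest configuration further down does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCounts {
    // Matches title='name(count)' as it appears on the tag links.
    // Pagination links (title='查看该页') carry no count and are skipped.
    private static final Pattern TAG_TITLE =
            Pattern.compile("title='([^']*)\\((\\d+)\\)'");

    // Returns tag name -> file count, in page order.
    static Map<String, Integer> parse(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = TAG_TITLE.matcher(html);
        while (m.find()) {
            counts.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return counts;
    }

    public static void main(String[] args) {
        String html =
                "<div class='tag'><a href='?tag=JAVA2&p=1' title='JAVA2(30)'>JAVA2(30)</a></div>"
              + "<div class='tag'><a href='?tag=Java&p=1' title='Java(1)'>Java(1)</a></div>"
              + "<a href='?tag=ALLFILES&p=2' title='查看该页'>2</a>";
        System.out.println(parse(html)); // prints {JAVA2=30, Java=1}
    }
}
```

A regular expression is brittle against markup changes; the XPath route taken below is the sturdier choice for the download links.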




Tools

I needed a tool to analyze web pages. Searching www.open-open.com turned up Web-Harvest:

http://web-harvest.sourceforge.net/index.php


It uses XPath to analyze web pages. Instead of an extraction API, you describe the job in an XML configuration file in which each element is a command.
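To give a feel for the XPath part on its own, here is a small sketch that uses only the JDK's built-in XPath support, not Web-Harvest. It evaluates the same expression the configuration below uses for download links against a hand-written, already well-formed snippet (in Web-Harvest, html-to-xml does that cleanup first). The class name XPathSketch and the sample markup are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Selects every <a> whose href contains /down/index/.
    static final String DOWNLOAD_LINKS = "//a[@href[contains(., '/down/index/')]]";

    // Returns the text of each matching anchor; the input must be well-formed XML.
    static List<String> downloadLinkTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(DOWNLOAD_LINKS, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>"
                + "<a href=\"/down/index/7156780\" target=\"_blank\">msxml6.msi</a>"
                + "<a href=\"?tag=ALLFILES&amp;p=2\">2</a>"
                + "</div>";
        System.out.println(downloadLinkTexts(xml)); // prints [msxml6.msi]
    }
}
```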


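Coding

The Web-Harvest configuration, saved as test1.xml (the file name the Java driver further down loads):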

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">

    <include path="functions.xml" />

    <!-- "search" is not referenced anywhere in this file; it appears to be left over from a sample configuration -->
    <var-def name="search" overwrite="false">platon</var-def>
    <!-- the user to inspect and the URL of that user's space -->
    <var-def name="currentUser" overwrite="false">msdiaoxian</var-def>
    <var-def name="targetWebsite" overwrite="false">http://www.vdisk.cn/msdiaoxian</var-def>

    <!-- collect the hrefs of the pagination links (?tag=ALLFILES) from the user's page -->
    <var-def name="url">
        <xpath expression="//a[@href[contains(., '?tag=ALLFILES')]]/@href[1]">
            <html-to-xml>
                <http url="${targetWebsite}" />
            </html-to-xml>
        </xpath>
    </var-def>

    <!-- fetch each distinct page and write its download anchors to output.xml -->
    <file action="write" path="output.xml" charset="UTF-8">
        <template>
            <![CDATA[ <catalog name="${currentUser}"> ]]>
        </template>
        <loop item="link" index="i" filter="unique">
            <list><var name="url" /></list>
            <body>
                <var-def name="links">
                    <xpath expression="//a[@href[contains(., '/down/index/')]]">
                        <html-to-xml>
                            <http url="${targetWebsite}/${link}" />
                        </html-to-xml>
                    </xpath>
                </var-def>
                <var name="links" />
            </body>
        </loop>
        <![CDATA[ </catalog> ]]>
    </file>

</config>


import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class WebHarvestTest {
    public static void main(String[] args) throws Exception {
        // register external plugins if there are any
        // DefinitionResolver.registerPlugin("com.my.MyPlugin1");
        // DefinitionResolver.registerPlugin("com.my.MyPlugin2");
        // DefinitionResolver.registerPlugin("com.my.MyPlugin3");
        // Classpath root that holds test1.xml; also passed to the Scraper as its working directory
        String path = WebHarvestTest.class.getResource("/").getFile();

        // Proxy of the author's network; change or remove these lines for your environment
        System.setProperty("http.proxyHost", "192.168.0.59");
        System.setProperty("http.proxyPort", "8080");
        // System.setProperty("java.net.useSystemProxies", "true");

        ScraperConfiguration config = new ScraperConfiguration(path
                + "/test1.xml");

        Scraper scraper = new Scraper(config, path);

        // These two variables are not referenced by test1.xml as shown
        scraper.addVariableToContext("username", "web-harvest");
        scraper.addVariableToContext("password", "web-harvest");
        // scraper.addVariableToContext("myXmlLib", new MyXmlLibrary());

        scraper.setDebug(true);
        // Set the same proxy on Web-Harvest's own HTTP client
        scraper.getHttpClientManager().setHttpProxy("192.168.0.59", 8080);
        scraper.execute();

        // Takes the "url" variable created during execution (the pagination links)
        Variable urls = (Variable) scraper.getContext().get("url");

        // System.out.println("urls=" + urls);
        // do something with urls...
    }

}
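The original question was which user has which kinds of files. Once the configuration has run, output.xml holds one anchor per file, so a rough answer is to count file extensions. The following is only a sketch under that assumption: the class name ExtensionStats is made up, and it reads the anchors with a regular expression because a file name is not guaranteed to be XML-safe.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtensionStats {
    // One download anchor: <a href="/down/index/ID" ...>file name</a>
    private static final Pattern ANCHOR =
            Pattern.compile("<a href=\"/down/index/\\d+\"[^>]*>([^<]*)</a>");

    // Lower-cased text after the last dot, or "(none)" if there is no usable extension.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "(none)";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Counts anchors per extension, sorted by extension name.
    static Map<String, Integer> countByExtension(String catalog) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = ANCHOR.matcher(catalog);
        while (m.find()) {
            counts.merge(extensionOf(m.group(1)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String catalog =
                "<a href=\"/down/index/7156780\" target=\"_blank\">msxml6.msi</a>\n"
              + "<a href=\"/down/index/7154801\" target=\"_blank\">net framework cleanup_tool.exe</a>\n"
              + "<a href=\"/down/index/7156915\" target=\"_blank\">SQLEXPR_CHS(SQL Server Express Edition2005).EXE</a>\n";
        System.out.println(countByExtension(catalog)); // prints {exe=2, msi=1}
    }
}
```

In practice the catalog string would be the contents of output.xml read from disk; the extension is a crude stand-in for "kind of file", and the tag names on the user's page may be a better signal.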

 
The resulting output.xml begins as follows:

<catalog name="msdiaoxian">
<a href="/down/index/7367058" target="_blank">voa2011年2月1-7含同步原文字幕.7z</a>
<a href="/down/index/7356344" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第05章_数组1of2.7z</a>
<a href="/down/index/7343831" target="_blank">j2se2010-4-4.7z</a>
<a href="/down/index/7328717" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第04章_异常处理.7z</a>
<a href="/down/index/7302993" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象5of5.7z</a>
<a href="/down/index/7287297" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象4of5.7z</a>
<a href="/down/index/7276709" target="_blank">voa2011年1月29-31含同步原文字幕.7z</a>
<a href="/down/index/7276132" target="_blank">硬件高手——电脑爱好者_高清.pdf</a>
<a href="/down/index/7258654" target="_blank">迷你迅雷MiniThunderInstaller3.1.1.58.exe</a>
<a href="/down/index/7230575" target="_blank">计算机硬件信息检测工具EVEREST.7z</a>
<a href="/down/index/7223397" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象3of5.7z</a>
<a href="/down/index/7219071" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象2of5.7z</a>
<a href="/down/index/7172651" target="_blank">尚学堂科技_马士兵_JAVA视频教程_J2SE_5.0_第02章_基础语法.7z</a>
<a href="/down/index/7171283" target="_blank">200行的简单Snake1.0.1.13C语言源码(含头文件).7z</a>
<a href="/down/index/7162335" target="_blank">尚学堂科技_马士兵_J2SE_5.0_第03章_面向对象1of5.7z</a>
<a href="/down/index/7157219" target="_blank">SQLServer2005_SSMSEE(SQLServer Management Studio Express).msi</a>
<a href="/down/index/7156915" target="_blank">SQLEXPR_CHS(SQL Server Express Edition2005).EXE</a>
<a href="/down/index/7156780" target="_blank">msxml6.msi</a>
<a href="/down/index/7154801" target="_blank">net framework cleanup_tool.exe</a>
<a href="/down/index/7154018" target="_blank">Microsoft_DotNetFXCHS2.0(.net Framework 2.0).exe</a>
<a href="/down/index/7151990" target="_blank">iis 5.1 简体中文完整安装包 适用XP.7z</a>
<a href="/down/index/7135401" target="_blank">尚学堂科技_马士兵_JAVA视频教程_JDK5.0_下载-安装-配置.7z</a>
<a href="/down/index/7123298" target="_blank">尚学堂科技_马士兵_JAVA视频教程_J2SE_5.0_第02章_递归补充.7z</a>
<a href="/down/index/7114719" target="_blank">Lrc歌词文件制作程序CreateLrc0.1.3.6.7z</a>
<a href="/down/index/7099147" target="_blank">voa2011年1月22-28含同步原文歌词.7z</a>
<a href="/down/index/7003630" target="_blank">使用C语言开发图形界面.zip</a>
<a href="/down/index/6993787" target="_blank">voa2011年1月15-21含同步原文歌词.7z</a>
<a href="/down/index/6968986" target="_blank">voa2011年1月8-14含同步原文歌词.7z</a>
<a href="/down/index/6952433" target="_blank">操作系统设计与实现(第2版+中文).pdf</a>
<a href="/down/index/6903829" target="_blank">voa2011年1月1-7含同步原文歌词.7z</a>
<a href="/down/index/6843412" target="_blank">voa2010年12月29-31含同步原文歌词.7z</a>
<a href="/down/index/6836912" target="_blank">Snake贪吃蛇啰嗦版500行代码0.1.0.5.7z</a>
<a href="/down/index/6836512" target="_blank">24小时反恐部队-第二季.torrent</a>