Extracting Structured Data from Web Pages: an excellent article on web page data extraction

Time: 2013-04-07 10:53:17
[File properties]:

File name: Extracting Structured Data from Web Pages-网页数据提取的优秀文章

File size: 546 KB

File format: PDF

Last updated: 2013-04-07 10:53:17

Web pages, data extraction

Abstract of the original paper: Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
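The input/output shape described in the abstract (a set of pages in, a deduced template and the encoded values out) can be illustrated with a toy sketch. This is not the paper's algorithm: it simply assumes that tokens shared, in order, by every page belong to the template and that everything in between is a value. The function names (tokenize, infer_template, extract_values) and the sample pages are hypothetical, chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(page):
    # Split a page into HTML tags and whitespace-separated words.
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def common_tokens(a, b):
    # Tokens of `a` that also occur, in the same order, in `b`
    # (a common subsequence found by difflib, not necessarily the longest).
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def infer_template(pages):
    # Toy assumption: the template is whatever survives in every page.
    template = tokenize(pages[0])
    for page in pages[1:]:
        template = common_tokens(template, tokenize(page))
    return template


def extract_values(template, page):
    # Walk the page; tokens that do not match the next template token
    # are collected as values, one value per gap in the template.
    values, current, i = [], [], 0
    for tok in tokenize(page):
        if i < len(template) and tok == template[i]:
            if current:
                values.append(" ".join(current))
                current = []
            i += 1
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical sample pages, made up for this sketch.
pages = [
    "<html><h1>Book</h1><b>Title:</b> Dune <b>Author:</b> Frank Herbert</html>",
    "<html><h1>Book</h1><b>Title:</b> Emma <b>Author:</b> Jane Austen</html>",
]
template = infer_template(pages)
print(extract_values(template, pages[0]))  # ['Dune', 'Frank Herbert']
print(extract_values(template, pages[1]))  # ['Emma', 'Jane Austen']
```

A limitation of this sketch is that any value token shared by all input pages is mistaken for part of the template, so it needs pages whose values actually differ.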

