How can I programmatically fetch content from a website at regular intervals?

Let me preface this by saying I don't care what language the solution is written in, as long as it runs on Windows. Here is my problem: there is a site with frequently updated data that I would like to fetch at regular intervals for later reporting. The site requires JavaScript to work properly, so just using wget doesn't work. What is a good way to either embed a browser in a program or drive a stand-alone browser to routinely scrape the screen for this data? Ideally I'd like to grab certain tables on the page, but I can resort to regular expressions if necessary.

10 solutions

#1

You could probably use web app testing tools like Watir, WatiN, or Selenium to automate the browser and read the values from the page. I've done this for scraping data before, and it works quite well.
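
As a rough illustration of the browser-automation approach, here is a minimal sketch using Selenium's Python bindings. It assumes Selenium 4 and Chrome are installed; the function names, the "table tr" selector, and the timeout are placeholders chosen for this example, not something taken from the site in question.

```python
def scrape_table_rows(url, row_selector="table tr", timeout=15):
    """Load `url` in a real browser so its JavaScript runs, then return
    the visible text of every cell, one list per matching row."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until the page's scripts have rendered at least one matching row.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, row_selector)
        )
        rows = []
        for tr in driver.find_elements(By.CSS_SELECTOR, row_selector):
            cells = tr.find_elements(By.CSS_SELECTOR, "th, td")
            rows.append([cell.text.strip() for cell in cells])
        return rows
    finally:
        driver.quit()  # always close the browser, even if the wait times out


def drop_empty_rows(rows):
    """Discard rows that contain no text at all (spacer or layout rows)."""
    return [row for row in rows if any(cell for cell in row)]
```

Something like drop_empty_rows(scrape_table_rows(url)) would then give the table contents as plain lists, ready to be written to a file or database.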

#2

If JavaScript is a must, you can try instantiating Internet Explorer via ActiveX (CreateObject("InternetExplorer.Application")) and using its Navigate2() method to open your web page.

' Start Internet Explorer through COM automation and load the page.
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.Navigate2 "http://*.com"

After the page has finished loading (check document.readyState), you have full access to the DOM and can use whatever methods you like to extract the content you need.

#3

You can look at Beautiful Soup. It is an open-source Python library, so it is easy to script. Quoting the site:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
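
To give a feel for what pulling tables out of a fetched page involves, here is a small sketch. It deliberately uses Python's built-in html.parser rather than Beautiful Soup, so it runs without extra installs; with Beautiful Soup the same job would be shorter. The class and function names are made up for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a document as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def extract_tables(html):
    """Return all tables in `html` as nested lists: tables -> rows -> cells."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Note that this only helps once the HTML has already been rendered; for a page that builds its tables with JavaScript, the markup would have to come from a browser-driven tool first.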

#4

I would recommend Yahoo Pipes; that's exactly what it was built to do. You can then get the Yahoo Pipes output as an RSS feed and do what you want with it.
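
Once the data arrives as an RSS feed, consuming it is straightforward. The sketch below assumes a plain RSS 2.0 document; the sample feed, its URLs, and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up feed used only to demonstrate the parsing.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Reading 1</title>
      <link>http://example.com/1</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Reading 2</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item>, with empty strings for missing fields."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items


items = parse_rss_items(SAMPLE_FEED)
for entry in items:
    print(entry["title"], entry["link"])
```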

#5

If you are familiar with Java (or another language that runs on the JVM, such as JRuby or Jython), you can use HtmlUnit. HtmlUnit simulates a complete browser: it makes HTTP requests, creates a DOM for each page, and runs JavaScript (using Mozilla's Rhino).

Additionally, you can run XPath queries on documents loaded in the simulated browser, simulate events, etc.

http://htmlunit.sourceforge.net

#6

Give Badboy a try. It's meant to automate the system testing of your websites, but you may find its regular expression rules handy enough to do what you want.
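
Badboy's own rule syntax is not shown here, but the general idea of pulling values out of markup with regular expressions can be sketched in plain Python. The patterns and the function name are assumptions for this example, and regexes on HTML are brittle, so this only suits simple, stable pages.

```python
import re

# Content of each <td ...>...</td>, across line breaks, case-insensitively.
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)
# Any remaining tag inside a cell, e.g. <b> or <span ...>.
TAG_RE = re.compile(r"<[^>]+>")


def cells_by_regex(html):
    """Return the text of every table cell, with inner tags stripped."""
    return [TAG_RE.sub("", inner).strip() for inner in CELL_RE.findall(html)]
```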

#7

If you have Excel then you should be able to import the data from the webpage into Excel.

From the Data menu select Import External Data and then New Web Query.

Once the data is in Excel, you can either manipulate it within Excel or output it in a format (e.g. CSV) that you can use elsewhere.
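
If the CSV route is taken, the exported file can be read back with a few lines of Python. This is a minimal sketch that assumes the export has a header row; the column names Name and Price and the sample data are made up.

```python
import csv
import io


def load_rows(csv_file):
    """Read an exported CSV (first line = column headers) into a list of dicts."""
    return list(csv.DictReader(csv_file))


# Stand-in for open("export.csv", newline=""); the contents are invented.
SAMPLE = "Name,Price\nWidget,9.50\nGadget,12.00\n"

rows = load_rows(io.StringIO(SAMPLE))
total = sum(float(row["Price"]) for row in rows)
print(len(rows), "rows, total price", total)
```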

#8

To complement Whaledawg's suggestion, I was going to suggest using an RSS scraper application (do a Google search); you then get nice raw XML to consume programmatically instead of a response stream. There may even be a few open-source implementations, which would give you more of an idea if you wanted to implement one yourself.

#9

You could use the Perl module LWP together with the JavaScript module. While this may not be the quickest to set up, it should work reliably. I would definitely not make this your first foray into Perl, though.

#10

I recently did some research on this topic. The best resource I found is this Wikipedia article, which gives links to many screen scraping engines.

I needed something that I could use as a server and run in batch. From my initial investigation, I think Web Harvest is quite good as an open-source solution. I have also been impressed by Screen Scraper, which seems to be very feature-rich and can be used with different languages.

There is also a new project called Scrapy. I haven't checked it out yet, but it's a Python framework.
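
Whichever tool does the actual scraping, the "at regular intervals" part can be a simple loop around it. The following is only a sketch: poll, fetch, and store are hypothetical names, and the fetch callable stands in for whatever scraper is chosen.

```python
import time
from datetime import datetime, timezone


def poll(fetch, store, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call fetch() every `interval_seconds` and pass each result to store().

    A failed fetch is reported and skipped so that one bad run does not
    stop the schedule. With max_runs=None the loop runs until interrupted.
    Returns the number of attempts made.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = datetime.now(timezone.utc)
        try:
            store(started, fetch())
        except Exception as exc:
            print(f"{started.isoformat()} fetch failed: {exc}")
        runs += 1
        # Do not sleep after the final run.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

For example, poll(my_scraper, save_snapshot, 15 * 60) would take a snapshot every fifteen minutes. On Windows, an alternative is to let the script run once and have Task Scheduler start it on a timer.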
