How to scrape a website, client side or server side?

I am creating a bookmarklet that, when the user clicks it in their browser, scrapes the current page and extracts some values from it, such as the price, item name, and item image.

These fields will vary, meaning that the logic for getting these values will be different for each domain (Amazon and eBay, for example).

My questions are:

Should I use JavaScript to scrape the data and then send it to the server?

Or should I just send the URL to my server and then use .NET code to scrape the values?

Which way is best, and why? What are the advantages and disadvantages?

Watch this video and you will understand exactly what I want to do: http://www.vimeo.com/1626505

5 solutions

#1

If you want to pull information from another site for use in your own site (written in ASP.NET, for example), you'll typically do this on the server side so that you have a rich language such as C# for processing the results. In .NET you'll do this via a WebRequest object.

The primary use of client-side processing is to use JavaScript to pull in information to display on your site, or to perform very simple actions such as adding a page to favorites. An example would be the scripts provided by the Weather Channel to show a little weather box on your site.

UPDATE: Amr writes that he is attempting to recreate the functionality of some popular screen-scraping software, which would require quite sophisticated processing. Amr, I'd consider creating an application that uses the IE browser object to display web pages; it is quite simple. You could then just pull the InnerHTML (I think; it has been a few years since I implemented an IE-object-based program) to retrieve the contents of the page and do your magic. You could, of course, use a WebRequest object (just handing it the URL used in the browser object), but that wouldn't be very efficient, as it would download the page a second time.

Is this what you are after?

#2

If you want to use only JavaScript to do this, you are liable to have a fairly large bookmarklet unless you know the exact layout of every site it will be used on (and even then it will be big).

A common way I have seen this done is to run a web service on your own server. The bookmarklet (which is JavaScript) redirects to that service and passes along some parameters, such as the URL of the page you are viewing. Your server then fetches the page and does the work of parsing the HTML for the things you are interested in.

A good example is the "Import to Mendeley" bookmarklet, which passes the URL of the page you are visiting to its server. The server then extracts information about the scientific papers listed on the page and imports them into your collection.
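
The redirect approach described in this answer might look roughly like the sketch below. It is only an illustration: the endpoint `https://example.com/scrape`, the `url` query parameter, and the function name `buildScrapeUrl` are all assumptions made for this example, not anything from the original posts or from Mendeley's bookmarklet.

```javascript
// Hypothetical endpoint on your own server (an assumption for this sketch).
const SCRAPE_ENDPOINT = "https://example.com/scrape";

// Build the address the bookmarklet redirects to. The current page's URL is
// passed as a query parameter so the server can fetch and parse the page.
// encodeURIComponent keeps characters such as "?" and "&" in the page URL
// from being mistaken for part of the endpoint's own query string.
function buildScrapeUrl(pageUrl, endpoint = SCRAPE_ENDPOINT) {
  return endpoint + "?url=" + encodeURIComponent(pageUrl);
}

// Inside a browser, the bookmarklet would then do something like:
//   location.href = buildScrapeUrl(location.href);
// with buildScrapeUrl inlined, because a bookmarklet has to carry all of
// its own code inside the javascript: URL.
```

With this split the bookmarklet stays tiny, and all site-specific parsing lives on the server.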

#3

I would scrape it on the server side because (I'm a Java guy) I like static languages more than dynamic scripting languages, so maintaining the logic in the backend would be more comfortable for me. On the other hand, it depends on how many items you want to scrape and how complex the logic for this would be. If the values can be extracted with a single id selector in JavaScript, server-side processing could be overkill.
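
For the simple case mentioned above, where each value sits behind one selector, a client-side version could be sketched as follows. The hostnames come from the question, but every selector here (`#item-name` and so on) is made up for illustration; real sites use different markup and change it often. `scrapePage` and `EXTRACTORS` are likewise hypothetical names.

```javascript
// Per-domain selector table. All selectors are hypothetical placeholders.
const EXTRACTORS = new Map([
  ["www.amazon.com", { name: "#item-name", price: "#item-price", image: "#item-image" }],
  ["www.ebay.com", { name: "#title", price: "#price", image: "#picture" }],
]);

// Extract name, price, and image URL from a document-like object.
// Returns null when no selectors are registered for the hostname.
function scrapePage(doc, hostname) {
  const selectors = EXTRACTORS.get(hostname);
  if (!selectors) return null;

  // Read the trimmed text of the first matching element, or null if absent.
  const text = (selector) => {
    const el = doc.querySelector(selector);
    return el ? el.textContent.trim() : null;
  };

  const img = doc.querySelector(selectors.image);
  return {
    name: text(selectors.name),
    price: text(selectors.price),
    image: img ? img.src : null,
  };
}

// In a bookmarklet this would be called as:
//   scrapePage(document, location.hostname)
// and the resulting object sent to the server.
```

Once the per-domain logic grows beyond a lookup table like this, the maintenance argument for moving it to the server gets stronger.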

#4

Bookmarklets are client-side by definition, although you could have the client depend on a server. Your example doesn't provide enough information, though. What do you want to do with the scraped info?

#5

If you include the scraping code in the bookmarklet, your users will have to update their bookmark whenever you add new functionality or bug fixes. Do it server-side and all your users get the new stuff instantly :)
