Preventing website data from being scraped and ripped

Time: 2022-11-11 22:16:27

I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search.

What are the measures I can take to prevent malicious crawlers from ripping off all the data from my site? I'm less worried about SEO, although I wouldn't want to block legitimate crawlers altogether.

For example, I thought about randomly changing small bits of the HTML structure used to display my data, but I guess it wouldn't really be effective.

12 Answers

#1


13  

Any site that is visible to human eyes is, in theory, potentially rippable. If you're going to even try to be accessible then this, by definition, must be the case (how else will speaking browsers be able to deliver your content if it isn't machine readable?).

Your best bet is to look into watermarking your content, so that at least if it does get ripped you can point to the watermarks and claim ownership.

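Not something from the answer itself, but as a rough sketch of what watermarking text content could look like in practice (assuming you serve HTML or plain text and a few invisible characters per page are acceptable; the helper names below are made up, not an existing library): encode a per-visitor identifier as zero-width characters and tuck it into the served text.

    # Hypothetical sketch: hide a per-visitor identifier in the served text as
    # zero-width characters, so a copy found elsewhere can be traced back.
    ZERO = "\u200b"   # zero-width space      -> bit 0
    ONE = "\u200c"    # zero-width non-joiner -> bit 1

    def make_marker(visitor_id: str) -> str:
        """Encode an ASCII identifier as an invisible run of characters."""
        bits = "".join(f"{byte:08b}" for byte in visitor_id.encode("ascii"))
        return "".join(ONE if b == "1" else ZERO for b in bits)

    def embed_marker(text: str, visitor_id: str) -> str:
        """Drop the invisible marker after the first sentence of the text."""
        head, sep, tail = text.partition(". ")
        marker = make_marker(visitor_id)
        return head + sep + marker + tail if sep else text + marker

    def extract_marker(text: str) -> str:
        """Recover the identifier from a suspected copy, if the marker survived."""
        bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
        return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits) - 7, 8))

    if __name__ == "__main__":
        marked = embed_marker("Widgets are great. They come in many sizes.", "user-4711")
        print(extract_marker(marked))   # -> user-4711

Determined rippers can normalise the text and strip these characters, so this only gives you evidence for a take-down request, not protection.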

#2


10  

Between this:

What are the measures I can take to prevent malicious crawlers from ripping

and this:

I wouldn't want to block legitimate crawlers altogether.

you're asking for a lot. Fact is, if you're going to try and block malicious scrapers, you're going to end up blocking all the "good" crawlers too.

You have to remember that if people want to scrape your content, they're going to put in a lot more manual effort than a search engine bot will... So get your priorities right. You've two choices:

  1. Let the peasants of the internet steal your content. Keep an eye out for it (search Google for some of your more unique phrases) and send take-down requests to ISPs. This choice has barely any impact on you apart from the time it takes.

  2. Use AJAX and rolling encryption to request all your content from the server. You'll need to keep the method changing, or even make it random, so each page load carries a different encryption scheme. But even this will be cracked if somebody really wants to crack it. You'll also drop off the face of the search engines and therefore take a hit in traffic from real users. (A rough sketch of the idea follows this list.)

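Purely to make option 2 concrete (this is an illustration of the general idea, not a recommendation or anyone's actual scheme): the server could mask each response with a fresh one-off key and let client-side script undo it before rendering. Anyone who reads that script can undo it too.

    # Sketch of the "rolling" idea in Python: XOR the payload with a random
    # per-request key and send both pieces; the client-side JavaScript would
    # reverse the same steps before injecting the text into the page.
    import base64
    import os

    def encode_payload(content: str) -> dict:
        data = content.encode("utf-8")
        key = os.urandom(len(data))                       # one-off key per request
        masked = bytes(b ^ k for b, k in zip(data, key))
        return {"key": base64.b64encode(key).decode("ascii"),
                "data": base64.b64encode(masked).decode("ascii")}

    def decode_payload(payload: dict) -> str:
        """What the client-side script would do after the AJAX call."""
        key = base64.b64decode(payload["key"])
        masked = base64.b64decode(payload["data"])
        return bytes(b ^ k for b, k in zip(masked, key)).decode("utf-8")

    if __name__ == "__main__":
        wire = encode_payload("Entry #42: full article text here...")
        print(decode_payload(wire))

Rotating the scheme itself, not just the key, is what creates the maintenance burden, and it is exactly what drops you out of the search engine indexes.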
#3


4  

Good crawlers will follow the rules you specify in your robots.txt; malicious ones will not. You can set up a "trap" for bad robots, as explained here: http://www.fleiner.com/bots/. But then again, if you put your content on the internet, I think it's better for everyone if it's as painless as possible to find (in fact, you're posting here and not at some lame forum where experts exchange their opinions).

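As a rough illustration of the honeypot idea (details assumed here rather than taken from that page): list a path in robots.txt that no human will ever visit, link to it invisibly, and ban whatever requests it anyway.

    # Sketch of a robots.txt honeypot in Python. robots.txt would contain:
    #     User-agent: *
    #     Disallow: /private/trap/
    # Well-behaved crawlers never fetch that path; anything that does has
    # ignored robots.txt and gets its IP banned for a while.
    import time

    TRAP_PATH = "/private/trap/"     # also linked from pages, hidden via CSS
    BAN_SECONDS = 24 * 3600
    banned = {}                      # ip -> time the ban expires

    def handle_request(path: str, ip: str) -> int:
        """Return an HTTP status; plug this into whatever framework you use."""
        if banned.get(ip, 0) > time.time():
            return 403
        if path.startswith(TRAP_PATH):
            banned[ip] = time.time() + BAN_SECONDS
            return 403
        return 200                   # serve the page normally

    if __name__ == "__main__":
        print(handle_request("/articles/1", "203.0.113.9"))      # 200
        print(handle_request("/private/trap/x", "203.0.113.9"))  # 403, now banned
        print(handle_request("/articles/1", "203.0.113.9"))      # 403 while banned

Be aware this also catches sloppy but harmless bots, not only malicious ones.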

#4


4  

Don't even try to erect limits on the web!

It really is as simple as this.

Every potential measure to discourage ripping (aside from a very strict robots.txt) will harm your users. Captchas are more pain than gain. Checking the user agent shuts out unexpected browsers. The same is true for "clever" tricks with javascript.

Please keep the web open. If you don't want anything to be taken from your website, then do not publish it there. Watermarks can help you claim ownership, but that only helps when you want to sue after the harm is done.

#5


4  

Realistically you can't stop malicious crawlers - and any measures that you put in place to prevent them are likely to harm your legitimate users (aside from perhaps adding entries to robots.txt to allow detection)

So what you have to do is to plan on the content being stolen - it's more than likely to happen in one form or another - and understand how you will deal with unauthorized copying.

Prevention isn't possible - and trying to make it so will be a waste of your time.

The only sure way of making sure that the content on a website isn't vulnerable to copying is to unplug the network cable...

To detect copying, something like http://www.copyscape.com/ may help.

#6


3  

The only way to stop a site being machine ripped is to make the user prove that they are human.

You could make users perform a task that is easy for humans and hard for machines, e.g. a CAPTCHA. When a user first gets to your site, present a CAPTCHA and only allow them to proceed once it has been completed. If the user starts moving from page to page too quickly, re-verify.

This is not 100% effective and hackers are always trying to break them.

Alternatively, you could serve slow responses. You don't need to make them crawl, but pick a speed that is reasonable for humans (it would be very slow for a machine). This just makes it take longer to scrape your site; it doesn't make it impossible.

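A rough sketch of the "too fast to be human" check described above, with made-up thresholds; in practice you would key it on a session cookie and send flagged visitors to a CAPTCHA rather than hard-coding a delay.

    # Sketch: count page views per session in a sliding window. Humans rarely
    # open more than a handful of pages in ten seconds; anything faster gets
    # re-verified (or, for the slow-response variant, simply delayed).
    # The thresholds here are arbitrary examples.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_VIEWS_IN_WINDOW = 5
    recent_views = defaultdict(deque)    # session_id -> recent view timestamps

    def too_fast(session_id: str) -> bool:
        now = time.time()
        views = recent_views[session_id]
        views.append(now)
        while views and views[0] < now - WINDOW_SECONDS:
            views.popleft()
        return len(views) > MAX_VIEWS_IN_WINDOW

    def on_page_view(session_id: str) -> str:
        if too_fast(session_id):
            return "redirect-to-captcha"   # or time.sleep(2) to just slow them down
        return "serve-page"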

OK. Out of ideas.

#7


2  

If you're making a public site, then it's very difficult. There are methods that involve server-side scripting to generate content or the use of non-text (Flash, etc) to minimize the likelihood of ripping.

But to be honest, if you consider your content to be so good, just password-protect it and remove it from the public arena.

My opinion is that the whole point of the web is to propagate useful content to as many people as possible.

#8


1  

In short: you cannot prevent ripping. Malicious bots commonly use IE user agents and are fairly intelligent nowadays. If you want your site to be accessible to the maximum number of users (i.e. screen readers, etc.), you cannot use JavaScript or one of the popular plugins (Flash), simply because they can inhibit a legitimate user's access.

Perhaps you could have a cron job that picks a random snippet out of your database and googles it to check for matches. You could then try and get hold of the offending site and demand they take the content down.

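A sketch of what that cron job could look like, assuming the entries live in a SQLite table called entries with a body column (made up for the example) and that you have some search API available; search_web below is a placeholder, not a real library call.

    # Sketch of the plagiarism-check cron job: pull a distinctive sentence from
    # a random entry and search for it as an exact phrase. The table/column
    # names and search_web() are placeholders for whatever you actually use.
    import random
    import sqlite3

    def search_web(query: str) -> list:
        """Placeholder: call your search API of choice, return result URLs."""
        raise NotImplementedError

    def random_snippet(db_path: str = "content.db") -> str:
        with sqlite3.connect(db_path) as conn:
            (body,) = conn.execute(
                "SELECT body FROM entries ORDER BY RANDOM() LIMIT 1"
            ).fetchone()
        sentences = [s.strip() for s in body.split(".") if len(s.strip()) > 60]
        return random.choice(sentences)

    def check_for_copies(own_domain: str = "example.com") -> list:
        snippet = random_snippet()
        hits = search_web('"' + snippet + '"')       # exact-phrase query
        return [url for url in hits if own_domain not in url]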

You could also monitor the number of requests from a given IP and block it if it passes a threshold, although you may have to whitelist legitimate bots, and it would be of no use against a botnet (but if you are up against a botnet, perhaps ripping is not your biggest problem).

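And a sketch of the per-IP counter, with arbitrary limits and a whitelist stub; as noted, a botnet that spreads requests over many addresses sails straight past this.

    # Sketch: count requests per IP per hour and refuse service past a limit.
    # The limit and the whitelist entries are arbitrary examples; verify real
    # crawler ranges (e.g. via reverse DNS) before whitelisting them.
    import time
    from collections import defaultdict

    REQUESTS_PER_HOUR_LIMIT = 1000
    WHITELISTED_PREFIXES = ("66.249.",)     # example: a range verified as a known crawler
    hourly_counts = defaultdict(int)        # (ip, hour bucket) -> request count

    def allow_request(ip: str) -> bool:
        if ip.startswith(WHITELISTED_PREFIXES):
            return True
        bucket = (ip, int(time.time() // 3600))
        hourly_counts[bucket] += 1
        return hourly_counts[bucket] <= REQUESTS_PER_HOUR_LIMIT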

#9


1  

If the content is public and freely available, even with page view throttling or whatever, there is nothing you can do. If you require registration and/or payment to access the data, you might restrict it a bit, and at least you can see who reads what and identify the users that seem to be scraping your entire database.

However, I think you should rather face the fact that this is how the net works; there are not many ways to prevent a machine from reading what a human can. Outputting all your content as images would of course discourage most, but then the site is not accessible anymore, let alone the fact that even non-disabled users will not be able to copy-paste anything - which can be really annoying.

All in all this sounds like DRM/game protection systems - pissing off your legit users only to prevent some bad behavior that you can't really prevent anyway.

#10


0  

Use human validators wherever possible and try using some framework (MVC). Site-ripping software is sometimes unable to rip this kind of page. Also detect the user agent; at the very least it will reduce the number of possible rippers.

#11


0  

You could try using Flash / Silverlight / Java to display all your page contents. That would probably stop most crawlers in their tracks.

#12


0  

I used to have a system that would block or allow based on the User-Agent header. It relies on the crawler setting its User-Agent, but it seems most of them do.

It won't work if they use a fake header to emulate a popular browser, of course.

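For what it's worth, a sketch of that kind of User-Agent filter; the substrings are just examples of common scraping tools, and as said above, anything that spoofs a browser User-Agent walks straight through.

    # Sketch of a User-Agent block list. The tokens below are examples, not an
    # authoritative list, and spoofed browser User-Agents are not caught.
    BLOCKED_AGENT_SUBSTRINGS = ("wget", "curl", "httrack", "python-requests", "scrapy", "libwww")

    def is_blocked(user_agent: str) -> bool:
        if not user_agent:                  # many rippers send no User-Agent at all
            return True
        ua = user_agent.lower()
        return any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS)

    if __name__ == "__main__":
        print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
        print(is_blocked("python-requests/2.31"))                       # True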
