How can I protect my database of links from being scraped?

Posted: 2022-07-06 19:30:24

I have a large database of links, all sorted in specific ways and attached to other information that is valuable (to some people).

Currently my setup (which seems to work) simply calls a PHP file like link.php?id=123, which logs the request with a timestamp into the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If it's greater than x, it redirects you to a captcha page.

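For reference, a minimal sketch of that kind of flow (the table and column names used here, such as link_log, links and requested_at, are invented; the question does not show the real schema):

```php
<?php
// link.php?id=123: hypothetical sketch of the log / count / captcha flow
// described above. Table and column names are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$ip = $_SERVER['REMOTE_ADDR'];
$id = (int) ($_GET['id'] ?? 0);

// Log this request with a timestamp.
$pdo->prepare('INSERT INTO link_log (ip, requested_at) VALUES (?, NOW())')
    ->execute([$ip]);

// Count requests from this IP in the last 5 minutes.
$stmt = $pdo->prepare(
    'SELECT COUNT(*) FROM link_log
     WHERE ip = ? AND requested_at > NOW() - INTERVAL 5 MINUTE'
);
$stmt->execute([$ip]);

if ($stmt->fetchColumn() > 30) {   // 30 stands in for "x"
    header('Location: /captcha.php');
    exit;
}

// Otherwise look up the real URL and send the visitor there.
$stmt = $pdo->prepare('SELECT url FROM links WHERE id = ?');
$stmt->execute([$id]);
header('Location: ' . $stmt->fetchColumn());
```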

That all works fine and dandy, but the site has been getting really popular (and has been getting DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize how often I have to hit PHP to do something. I want to show links in plain text instead of through link.php?id= and have an onclick function simply add 1 to the view count. I'm still hitting PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away.

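A sketch of the background counter that idea implies (count.php and the views column are made-up names): the page prints the real URL directly, and the click handler just fires a request at this script without waiting for the response.

```php
<?php
// count.php?id=123: hypothetical fire-and-forget endpoint. The page shows the
// plain-text link, and an onclick handler pings this script in the background
// (for example via navigator.sendBeacon('/count.php?id=123')), so a slow
// PHP or DB response never delays the user.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->prepare('UPDATE links SET views = views + 1 WHERE id = ?')
    ->execute([(int) ($_GET['id'] ?? 0)]);
http_response_code(204);   // nothing to return; the click already went through
```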

Problem is, that makes the site REALLY scrapable. Is there anything I can do to prevent that while still not relying on PHP to do the check before spitting out the link?

5 solutions

#1


2  

It seems that the bottleneck is at the database. Each request performs an insert (logs the request), then a select (determine the number of requests from the IP in the last 5 minutes), and then whatever database operations are necessary to perform the core function of the application.

Consider maintaining the request throttling data (IP, request time) in server memory rather than burdening the database. Two solutions are memcache (http://www.php.net/manual/en/book.memcache.php) and memcached (http://php.net/manual/en/book.memcached.php).

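A rough sketch of that with the memcached extension (the key prefix and the limit of 30 are placeholders, and this counts hits in fixed 5-minute windows rather than the rolling window the SQL version gives you):

```php
<?php
// Hypothetical per-IP counter kept in memcached instead of the database.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key = 'hits:' . $_SERVER['REMOTE_ADDR'];

// Create the counter with a 5-minute TTL if it does not exist yet, then bump
// it. No SQL insert/select on the hot path.
$mc->add($key, 0, 300);
$hits = $mc->increment($key);

if ($hits !== false && $hits > 30) {   // 30 stands in for "x"
    header('Location: /captcha.php');
    exit;
}
// ...continue with the normal link lookup...
```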

As others have noted, ensure that indexes exist for whatever keys are queried (fields such as the link id). If indexes are in place and the database still suffers from the load, try an HTTP accelerator such as Varnish (http://varnish-cache.org/).

#2


1  

You could do the IP throttling at the web server level. Maybe a module exists for your web server; as an example, with Apache you can write your own RewriteMap and have it consult a daemon program, so you can do more complex things. Have the daemon program query an in-memory database. It will be fast.

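As an illustration of that idea on Apache, a RewriteMap of type prg: hands each lookup key (here the client IP) to a long-running program and reads back one answer per line. The PHP daemon below and the config lines in its comments are a sketch under those assumptions, not a drop-in setup:

```php
#!/usr/bin/env php
<?php
// Hypothetical RewriteMap "prg:" daemon. Apache starts it once, writes one
// client IP per line to its stdin, and expects one answer line per request.
// Matching httpd.conf lines (illustrative only):
//
//   RewriteMap throttle "prg:/usr/local/bin/throttle.php"
//   RewriteCond ${throttle:%{REMOTE_ADDR}} =deny
//   RewriteRule ^/?link\.php$ /captcha.php [R,L]
//
$mc = new Memcached();             // in-memory counters, as suggested above
$mc->addServer('127.0.0.1', 11211);

while (($ip = fgets(STDIN)) !== false) {
    $key = 'hits:' . trim($ip);
    $mc->add($key, 0, 300);        // fixed 5-minute window
    $hits = $mc->increment($key);
    echo ($hits !== false && $hits > 30) ? "deny\n" : "ok\n";
    flush();                       // RewriteMap needs an unbuffered answer
}
```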

#3


0  

Check your database. Are you indexing everything properly? A table with this many entries will get big very fast and slow things down. You might also want to run a nightly process that deletes entries older than 1 hour etc.

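A sketch of such a nightly job (same invented table names as in the earlier snippets), together with the kind of index that keeps the 5-minute count query cheap:

```php
<?php
// Hypothetical nightly cron job: keep the request-log table small by dropping
// old rows. The index that makes the per-IP count fast would be something like:
//   CREATE INDEX idx_ip_time ON link_log (ip, requested_at);
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$deleted = $pdo->exec(
    'DELETE FROM link_log WHERE requested_at < NOW() - INTERVAL 1 HOUR'
);
echo "purged $deleted old log rows\n";
```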

If none of this works, you are looking at upgrading/load balancing your server. Linking directly to the pages will only buy you so much time before you have to upgrade anyway.

#4


0  

Nothing you do on the client side can be protected, so why not just use AJAX?

Have an onClick event that calls an AJAX function, which returns just the link and fills it into a DIV on your page; because the request and the answer are small, it will work fast enough for what you need. Just make sure the function you call checks the timestamp, because it is easy to make a script that calls that function many times to steal your links.

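Sketching the server side of that suggestion (get_link.php is a made-up name): the AJAX call returns only the URL, but the rate check still runs in PHP, so a script hammering the endpoint still trips the throttle.

```php
<?php
// get_link.php?id=123: hypothetical endpoint the onClick AJAX call would hit.
// Same memcached-style counter as sketched above; values are placeholders.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);
$key = 'hits:' . $_SERVER['REMOTE_ADDR'];
$mc->add($key, 0, 300);

if ($mc->increment($key) > 30) {
    http_response_code(429);       // let the page show a captcha instead
    exit;
}

$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare('SELECT url FROM links WHERE id = ?');
$stmt->execute([(int) ($_GET['id'] ?? 0)]);

header('Content-Type: text/plain');
echo $stmt->fetchColumn();         // the AJAX success handler drops this into the DIV
```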

You can check out jQuery or other AJAX libraries (I use jQuery and sAjax). I have lots of pages that dynamically change content very fast, and the client doesn't even know it is not pure JS.

#5


0  

Most scrapers just analyze static HTML, so encode your links and then decode them dynamically in the client's web browser with JavaScript.

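One simple version of that, sketched with made-up markup: the listing page emits only a base64-encoded copy of each URL, and a few lines of inline JavaScript rebuild the real href once the page has loaded.

```php
<?php
// Hypothetical listing template: the real URL never appears in the static
// HTML, only a base64-encoded copy in a data attribute.
$url = 'http://example.com/some/target';   // would come from the links table

printf(
    '<a href="#" class="enc" data-href="%s">%s</a>',
    htmlspecialchars(base64_encode($url)),
    'Download link'
);
?>
<script>
// Decode every placeholder once the DOM is ready. Trivial to defeat, but it
// filters out scrapers that never execute JavaScript.
document.querySelectorAll('a.enc').forEach(function (a) {
    a.href = atob(a.dataset.href);
});
</script>
```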

Determined scrapers can still get around this, but they can get around any technique if the data is valuable enough.
