Attributes to tell bots apart from human visitors?

Time: 2021-08-22 15:50:03

I am looking to roll my own simple web stats script.

The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).

Is there any open service that does that, like Akismet does for spam? Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?

To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.

Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.

14 solutions

#1


68  

Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.

Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I, as a user, can't see. If that gets followed, we've got a bot.

Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build (probably an in-memory) table of loads by IP and do a "not contained in" match, but that should be a really solid tell.

So, to use all this: maintain a database table of bots by ip address, possibly with timestamp limitations. Add anything that follows your invisible link, add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using this to do a quick stats analysis and see how strongly those methods appear to be working for identifying things we know are bots.

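A rough sketch of how the invisible-link / dummy-CSS trap could feed such a table (the file name trap.php, the SQLite storage and the column layout below are my own assumptions, not part of the answer):

<?php
// trap.php - hypothetical honeypot endpoint. Link to it "invisibly" from your
// pages and/or disallow it in robots.txt; anything that requests it anyway
// gets recorded as a bot by IP address.
$pdo = new PDO('sqlite:' . __DIR__ . '/stats.sqlite');
$pdo->exec('CREATE TABLE IF NOT EXISTS bots (ip TEXT PRIMARY KEY, last_seen TEXT, reason TEXT)');

$stmt = $pdo->prepare('INSERT OR REPLACE INTO bots (ip, last_seen, reason) VALUES (?, ?, ?)');
$stmt->execute([$_SERVER['REMOTE_ADDR'], date('c'), 'hit honeypot ' . $_SERVER['REQUEST_URI']]);

// The stats script would then skip any hit whose IP is present in the bots table.
http_response_code(204);   // respond with "No Content"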

#2


21  

The easiest way is to check if their user-agent includes 'bot' or 'spider'. Most do.

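A minimal PHP version of that check might look like this (treating a missing User-Agent as suspicious is my own assumption):

<?php
function looks_like_bot($ua) {
    // Case-insensitive substring match on 'bot' or 'spider'.
    return $ua === '' || stripos($ua, 'bot') !== false || stripos($ua, 'spider') !== false;
}

// Example: only count the hit when it does not look like a bot.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!looks_like_bot($ua)) {
    // record the page view here
}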

#3


12  

To start with, if your software is going to be Javascript-based, the majority of bots will be automatically stripped out, as bots generally don't run Javascript.

Nevertheless, the straight answer to your question is to follow a bot list and add their user-agent to the filtering list.

Take a look at this bot list.

This user-agent list is also pretty good. Just strip out all the B's and you're set.

EDIT: Amazing work done by eSniff has the above list here "in a form that can be queried and parsed more easily: robotstxt.org/db/all.txt. Each new bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use", as you can read in his comment.

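A sketch of that weekly download-and-parse step (the cache path and the reliance on the file's robot-useragent: fields are assumptions about the format, so verify against the actual file):

<?php
// Refresh a local copy of robotstxt.org/db/all.txt at most once a week.
$cache = __DIR__ . '/all.txt';
if (!is_file($cache) || filemtime($cache) < time() - 7 * 86400) {
    $data = @file_get_contents('http://www.robotstxt.org/db/all.txt');
    if ($data !== false) {
        file_put_contents($cache, $data);
    }
}

// Collect the advertised User-Agent of every robot-id record.
$bot_agents = [];
if (is_file($cache)) {
    foreach (file($cache, FILE_IGNORE_NEW_LINES) as $raw) {
        $line = ltrim($raw);
        if (stripos($line, 'robot-useragent:') === 0) {
            $ua = trim(substr($line, strlen('robot-useragent:')));
            if ($ua !== '') {
                $bot_agents[] = $ua;
            }
        }
    }
}
// $bot_agents can now be matched (e.g. with stripos) against incoming User-Agent strings.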

Hope it helps!

#4


11  

Consider a PHP stats script which is camouflaged as a CSS background image (give the right response headers, at least the content type and cache control, but write an empty image out).

Some bots parse JS, but certainly no one loads CSS images. One pitfall (as with JS) is that you will exclude text-based browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly fewer CSS-disabled clients than JS-disabled clients (mobiles!).

To make it more solid for the (unexceptional) case that the more advanced bots (Google, Yahoo, etc) may crawl them in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).

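A bare-bones sketch of such a script (the file name pixel.php, the flat-file log and the exact headers are illustrative; a real version would write to whatever storage the stats script uses):

<?php
// pixel.php - referenced from the stylesheet, e.g.
//   body { background: url(/pixel.php) no-repeat -9999px -9999px; }
// and disallowed in robots.txt:  Disallow: /pixel.php

// Record the hit.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-';
$ua      = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
$line    = date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . ' "' . $referer . '" "' . $ua . '"' . "\n";
file_put_contents(__DIR__ . '/css_hits.log', $line, FILE_APPEND | LOCK_EX);

// Write an empty (1x1 transparent) GIF out with the right headers.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
header('Content-Type: image/gif');
header('Cache-Control: no-store, must-revalidate');
header('Content-Length: ' . strlen($gif));
echo $gif;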

#5


5  

I use the following for my stats/counter app:

<?php
    function is_bot($user_agent) {
        return preg_match('/(abot|dbot|ebot|hbot|kbot|lbot|mbot|nbot|obot|pbot|rbot|sbot|tbot|vbot|ybot|zbot|bot\.|bot\/|_bot|\.bot|\/bot|\-bot|\:bot|\(bot|crawl|slurp|spider|seek|accoona|acoon|adressendeutschland|ah\-ha\.com|ahoy|altavista|ananzi|anthill|appie|arachnophilia|arale|araneo|aranha|architext|aretha|arks|asterias|atlocal|atn|atomz|augurfind|backrub|bannana_bot|baypup|bdfetch|big brother|biglotron|bjaaland|blackwidow|blaiz|blog|blo\.|bloodhound|boitho|booch|bradley|butterfly|calif|cassandra|ccubee|cfetch|charlotte|churl|cienciaficcion|cmc|collective|comagent|combine|computingsite|csci|curl|cusco|daumoa|deepindex|delorie|depspid|deweb|die blinde kuh|digger|ditto|dmoz|docomo|download express|dtaagent|dwcp|ebiness|ebingbong|e\-collector|ejupiter|emacs\-w3 search engine|esther|evliya celebi|ezresult|falcon|felix ide|ferret|fetchrover|fido|findlinks|fireball|fish search|fouineur|funnelweb|gazz|gcreep|genieknows|getterroboplus|geturl|glx|goforit|golem|grabber|grapnel|gralon|griffon|gromit|grub|gulliver|hamahakki|harvest|havindex|helix|heritrix|hku www octopus|homerweb|htdig|html index|html_analyzer|htmlgobble|hubater|hyper\-decontextualizer|ia_archiver|ibm_planetwide|ichiro|iconsurf|iltrovatore|image\.kapsi\.net|imagelock|incywincy|indexer|infobee|informant|ingrid|inktomisearch\.com|inspector web|intelliagent|internet shinchakubin|ip3000|iron33|israeli\-search|ivia|jack|jakarta|javabee|jetbot|jumpstation|katipo|kdd\-explorer|kilroy|knowledge|kototoi|kretrieve|labelgrabber|lachesis|larbin|legs|libwww|linkalarm|link validator|linkscan|lockon|lwp|lycos|magpie|mantraagent|mapoftheinternet|marvin\/|mattie|mediafox|mediapartners|mercator|merzscope|microsoft url control|minirank|miva|mj12|mnogosearch|moget|monster|moose|motor|multitext|muncher|muscatferret|mwd\.search|myweb|najdi|nameprotect|nationaldirectory|nazilla|ncsa beta|nec\-meshexplorer|nederland\.zoek|netcarta webmap engine|netmechanic|netresearchserver|netscoop|newscan\-online|nhse|nokia6682\/|nomad|noyona|nutch|nzexplorer|objectssearch|occam|omni|open text|openfind|openintelligencedata|orb search|osis\-project|pack rat|pageboy|pagebull|page_verifier|panscient|parasite|partnersite|patric|pear\.|pegasus|peregrinator|pgp key agent|phantom|phpdig|picosearch|piltdownman|pimptrain|pinpoint|pioneer|piranha|plumtreewebaccessor|pogodak|poirot|pompos|poppelsdorf|poppi|popular iconoclast|psycheclone|publisher|python|rambler|raven search|roach|road runner|roadhouse|robbie|robofox|robozilla|rules|salty|sbider|scooter|scoutjet|scrubby|search\.|searchprocess|semanticdiscovery|senrigan|sg\-scout|shai\'hulud|shark|shopwiki|sidewinder|sift|silk|simmany|site searcher|site valet|sitetech\-rover|skymob\.com|sleek|smartwit|sna\-|snappy|snooper|sohu|speedfind|sphere|sphider|spinner|spyder|steeler\/|suke|suntek|supersnooper|surfnomore|sven|sygol|szukacz|tach black widow|tarantula|templeton|\/teoma|t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|theophrastus|titan|titin|tkwww|toutatis|t\-rex|tutorgig|twiceler|twisted|ucsd|udmsearch|url check|updated|vagabondo|valkyrie|verticrawl|victoria|vision\-search|volcano|voyager\/|voyager\-hc|w3c_validator|w3m2|w3mir|walker|wallpaper|wanderer|wauuu|wavefire|web core|web hopper|web wombat|webbandit|webcatcher|webcopy|webfoot|weblayers|weblinker|weblog monitor|webmirror|webmonkey|webquest|webreaper|websitepulse|websnarf|webstolperer|webvac|webwalk|webwatch|webwombat|webzinger|wget|whizbang|whowhere|wild ferret|worldlight|wwwc|wwwster|xenu|xget|xift|xirq|yandex|yanga|yeti|yodao|zao\/|zippp|zyborg|\.\.\.\.)/i', $user_agent);
    }

    //example usage
    if (! is_bot($_SERVER["HTTP_USER_AGENT"])) echo "it's a human hit!";
?>

I removed a link to the original code source, because it now redirects to a food app.

#6


4  

I currently use AWStats and Webalizer to monitor my log files for Apache2 and so far they have been doing a pretty good job of it. If you would like, you can have a look at their source code as it is an open source project.

You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html

Hope that helps, RayQuang

#7


4  

Checking the user-agent will alert you to the honest bots, but not the spammers.

To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a Javascript focus event.

If the focus event fires, the page was almost certainly loaded by a human being.

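One possible sketch of that idea as a single PHP file (the beacon URL, the log file and the exact client-side snippet are my assumptions; the answer itself only names the focus-event trick):

<?php
// focus_beacon.php - when requested with ?focus=1 it records a "human" hit;
// otherwise it emits the snippet to embed in tracked pages.
if (isset($_GET['focus'])) {
    $line = date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . ' ' .
            (isset($_GET['u']) ? $_GET['u'] : '-') . "\n";
    file_put_contents(__DIR__ . '/human_hits.log', $line, FILE_APPEND | LOCK_EX);
    http_response_code(204);
    exit;
}
?>
<script>
// Report back once, the first time the window receives focus.
// Note: if the window already has focus when the page loads, the event only
// fires after the user switches away and back.
window.addEventListener('focus', function report() {
    window.removeEventListener('focus', report);
    new Image().src = '/focus_beacon.php?focus=1&u=' + encodeURIComponent(location.pathname);
});
</script>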

  • Edit: it's true, people with Javascript turned off will not show up as humans, but that's not a large percentage of web users.
  • Edit2: Current bots can also execute Javascript, at least Google can.

#8


3  

Rather than trying to maintain an impossibly long list of spider User Agents, we look for things that suggest human behaviour. The principle of these is that we split our Session Count into two figures: the number of single-page sessions, and the number of multi-page sessions. We drop a session cookie, and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor - referrer is Google, for example (although I believe that the MS Search bot masquerades as a standard UserAgent referred with a realistic keyword to check that the site doesn't show different content [from that given to their bot], and that behaviour looks a lot like a human!)

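A small sketch of the two-cookie idea in PHP (cookie names, lifetimes and the final classification flag are illustrative, not taken verbatim from the answer):

<?php
// Run on every tracked page view.
session_start();                                   // drops the session cookie

$returning = isset($_COOKIE['machine_id']);
if (!$returning) {
    // Persistent "Machine ID" cookie, here kept for roughly two years.
    setcookie('machine_id', bin2hex(random_bytes(16)), time() + 2 * 365 * 86400, '/');
}

$_SESSION['pages'] = isset($_SESSION['pages']) ? $_SESSION['pages'] + 1 : 1;

// Count the visit towards the multi-page (probably human) bucket if this
// session has viewed more than one page, or the Machine ID cookie was found.
$multi_page_session = ($_SESSION['pages'] > 1) || $returning;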

Of course this is not infallible, and in particular if you have lots of people who arrive and "click off" it's not going to be a good statistic for you, nor if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).

Taking the data from one of our clients, we find that the daily single-session count is all over the place - an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page sessions per day we then have a damn-near-linear rate of 4 multi-page sessions per order placed / two sessions per basket. I have no real idea what the other 1,000 multi-page sessions per day are!

#9


2  

Record mouse movement and scrolling using Javascript. You can tell from the recorded data whether it's a human or a bot, unless the bot is really, really sophisticated and mimics human mouse movements.

#10


1  

Prerequisite - referrer is set

Apache level:

LogFormat "%U %{Referer}i %{%Y-%m-%d %H:%M:%S}t" human_log
RewriteRule ^/human/(.*)   /b.gif [L]

# default the flag to 0 on every request (SetEnvIf runs early enough to be
# overridden by the referrer check below; a plain SetEnv would run too late)
SetEnvIf Request_URI ".*" human_log_session=0

# using referrer
SetEnvIf Referer "^http://yoursite.com/" human_log_session=1

SetEnvIf Request_URI "^/human/(.*)\.gif$" human_dolog=1
SetEnvIf human_log_session 0 !human_dolog
CustomLog logs/human-access_log human_log env=human_dolog

In the web page, embed a /human/$hashkey_of_current_url.gif.
A bot is unlikely to have the referrer set (this is a grey area).
A hit made directly via the browser address bar will not be included.

At the end of each day, /human-access_log should contain all the referrers that correspond to actual human page views.

To play it safe, the hash of the referrer from the Apache log should tally with the image name.

#11


0  

Have a 1x1 gif in your pages that you keep track of. If it's loaded, then it's likely to be a browser. If it's not loaded, it's likely to be a script.

#12


0  

Sorry, misunderstood. You may try another option I have set up on my site: create a non-linked webpage with a hard/strange name and log visits to this page separately. Most if not all of the visitors to this page will be bots; that way you'll be able to create your bot list dynamically.

Original answer follows (getting negative ratings!)

The only reliable way to tell bots from humans is [CAPTCHAs][1]. You can use [reCAPTCHA][2] if it suits you.

[1]: http://en.wikipedia.org/wiki/Captcha
[2]: http://recaptcha.net/

#13


0  

I'm surprised no one has recommended implementing a Turing test. Just have a chat box with human on the other end.

A programmatic solution just won't do: see what happens when PARRY Encounters the DOCTOR.

These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s: to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and THE DOCTOR as a stereotypical psychotherapist.

Here's some more background

#14


0  

You could exclude all requests that come from a User Agent that also requests robots.txt. All well behaved bots will make such a request, but the bad bots will escape detection.

You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents, and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.

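A log-post-processing sketch of that heuristic (the Apache "combined" log format, the file path and the browser whitelist are assumptions; it also keys on the client IP rather than the raw User-Agent string, since whole browser populations share the same UA):

<?php
// Pass 1: remember every client IP that requested robots.txt.
$log = '/var/log/apache2/access.log';
$robots_ips = [];
foreach (file($log, FILE_IGNORE_NEW_LINES) as $line) {
    if (strpos($line, 'robots.txt') !== false) {
        $robots_ips[strtok($line, ' ')] = true;     // first field is the client IP
    }
}

// Pass 2: count page views, skipping robots.txt requesters unless their
// User-Agent is on a small browser whitelist (the false-positive caveat above).
$browser_whitelist = '/(firefox|chrome|safari|msie|opera)/i';
$human_hits = 0;
foreach (file($log, FILE_IGNORE_NEW_LINES) as $line) {
    $ip = strtok($line, ' ');
    preg_match('/"([^"]*)"$/', $line, $m);          // last quoted field = User-Agent
    $ua = isset($m[1]) ? $m[1] : '';
    if (isset($robots_ips[$ip]) && !preg_match($browser_whitelist, $ua)) {
        continue;                                   // treat as a bot
    }
    $human_hits++;
}
echo "human-ish page views: $human_hits\n";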

So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.
