Time goes by, but still no perfect solution... Does anyone have a bright idea for differentiating a bot from a human-loaded web page? Is the state of the art still loading a long list of well-known SE bots and parsing the USER AGENT?
Testing has to be done before the page is loaded! No GIFs or CAPTCHAs!
9 Answers
#1
4
If possible, I would try a honeypot approach to this one. It will be invisible to most users, and it will discourage many bots, though none that are determined to get through, since they could implement special code for your site that simply skips the honeypot field once they figure out your game. But it would take far more attention from the bot's owners than it is probably worth for most; there will be tons of other sites accepting spam without any additional effort on their part.
One thing that gets skipped over from time to time: it is important to let the bot think that everything went fine. No error messages or denial pages; just reload the page as you would for any other user, except skip adding the bot's content to the site. This way there are no red flags that can be picked up in the bot's logs and acted upon by the owner, and it will take much more scrutiny to figure out that you are disallowing the comments.
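As a rough illustration of the honeypot idea, here is a minimal sketch assuming a Flask backend; the field name `website` and the `save_comment` helper are invented for the example. The field is hidden from human visitors, so anything that fills it in is almost certainly a bot, and the page reloads as if the submission succeeded either way.

    # Hypothetical honeypot sketch (Flask); the "website" field name and
    # save_comment helper are illustrative, not from the original answer.
    from flask import Flask, request, render_template_string

    app = Flask(__name__)

    FORM = """
    <form method="post" action="/comment">
      <textarea name="comment"></textarea>
      <!-- Hidden from humans via CSS; bots that auto-fill every field reveal themselves -->
      <input type="text" name="website" style="display:none" autocomplete="off">
      <button type="submit">Post</button>
    </form>
    """

    @app.route("/comment", methods=["GET", "POST"])
    def comment():
        if request.method == "POST":
            if request.form.get("website"):            # honeypot filled in -> likely a bot
                pass                                   # silently drop it, no error page
            else:
                save_comment(request.form["comment"])  # hypothetical persistence helper
        return render_template_string(FORM)            # reload the page either way

    def save_comment(text):
        pass  # store the comment somewhere (omitted)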
#2
3
Without a challenge (like a CAPTCHA), you're just shooting in the dark. The user agent can trivially be set to any arbitrary string.
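To see how little the user agent proves, here is a quick sketch using Python's standard library; the browser string and URL are just placeholders.

    # Any client can claim to be a mainstream browser; by User-Agent alone
    # this automated request is indistinguishable from Chrome.
    import urllib.request

    req = urllib.request.Request(
        "https://example.com/",   # placeholder URL
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                               "AppleWebKit/537.36 (KHTML, like Gecko) "
                               "Chrome/120.0 Safari/537.36"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()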
#3
1
The user agent is set by the client and thus can be manipulated. A malicious bot therefore certainly would not send you an I-Am-MalBot user agent, but would call itself some version of IE. Using the user agent to prevent spam or anything similar is therefore pointless.
So, what do you want to do? What's your final goal? If we knew that, we could be of more help.
#4
1
The creators of SO should know why they are using a CAPTCHA to prevent bots from editing content. The reason is that there is actually no way to be sure a client is not a bot, and I think there never will be.
#5
1
What the others have said is true to an extent... if a bot-maker wants you to think a bot is a genuine user, there's no way to avoid that. But many of the popular search engines do identify themselves. There's a list here (http://www.jafsoft.com/searchengines/webbots.html) among other places. You could load these into a database and search for them there. I seem to remember that it's against Google's user agreement to make custom pages for their bots though.
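A rough sketch of that lookup: assuming the published bot names have been loaded somewhere queryable (a plain set here stands in for the database), an incoming user agent can be matched against them. The signatures shown are common examples, not the full list from the link.

    # Match a request's User-Agent against a set of known crawler signatures.
    # In practice the set would be loaded from a database built from a
    # published bot list; the entries here are just common examples.
    KNOWN_BOT_SIGNATURES = {"googlebot", "bingbot", "slurp", "duckduckbot", "baiduspider"}

    def is_known_search_bot(user_agent: str) -> bool:
        ua = user_agent.lower()
        return any(sig in ua for sig in KNOWN_BOT_SIGNATURES)

    print(is_known_search_bot(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    ))  # True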
#6
1
I myself am coding web crawlers for different purposes, and I use a web browser UserAgent.
As far as I know, you cannot distinguish bots from humans if a bot is using a legit UserAgent. Like:
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.11 (KHTML, like Gecko) Chrome/9.0.570.1 Safari/534.11
The only thing I can think of is JavaScript. Most custom web bots (like those I code) can't execute JavaScript because that's a browser's job. But if the bot is hooked into or driving a web browser (like Firefox), it will go undetected.
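One way to act on that observation is a small JavaScript challenge: the page sets a token via script and reloads, and the server only serves the real content once the token comes back. Below is a rough sketch assuming Flask, with the cookie name `js_ok` made up for illustration; a bot driving a real browser would still pass it.

    # Rough JavaScript-challenge sketch (Flask). The cookie name "js_ok" is
    # an illustrative assumption; a bot running a real browser still passes.
    from flask import Flask, request

    app = Flask(__name__)

    CHALLENGE_PAGE = """
    <html><body>
      <p>Loading...</p>
      <script>
        // Only a JavaScript-capable client sets this cookie and reloads.
        document.cookie = "js_ok=1; path=/";
        location.reload();
      </script>
    </body></html>
    """

    @app.route("/")
    def index():
        if request.cookies.get("js_ok") == "1":
            return "Real content for JavaScript-capable clients."
        return CHALLENGE_PAGE  # clients that never return with the cookie are likely bots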
#7
0
I'm sure I'm going to take a votedown on this, but I had to post it: Constructive
In any case, captchas are the best way right now to protect against bots, short of approving all user-submitted content.
-- Edit --
I just noticed your P.S., and I'm not sure of any way to diagnose a bot without interacting with it. Your best bet in this case might be to catch the bots as early as possible and implement a one-month IP restriction, after which time the bot should give up if you constantly return HTTP 404 to it. Bots are often run from a server and don't change their IP, so this should work as a mediocre approach.
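A minimal sketch of that IP restriction, again assuming Flask; the 30-day window and the in-memory dictionary are illustrative stand-ins for a persisted blocklist.

    # Hypothetical IP-restriction sketch: once an IP is flagged as a bot,
    # answer it with 404 for roughly a month. In-memory store for illustration.
    import time
    from flask import Flask, request, abort

    app = Flask(__name__)
    BLOCK_SECONDS = 30 * 24 * 3600      # the "one month" from the answer
    blocked = {}                        # ip -> time at which the block expires

    def flag_as_bot(ip: str) -> None:
        blocked[ip] = time.time() + BLOCK_SECONDS

    @app.before_request
    def reject_blocked_ips():
        expiry = blocked.get(request.remote_addr)
        if expiry and expiry > time.time():
            abort(404)                  # look as if the page simply doesn't exist

    @app.route("/")
    def index():
        return "Normal content."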
#8
0
I would suggest using Akismet, a spam prevention plugin, rather than any sort of CAPTCHA or CSS trick, because it is excellent at catching spam without ruining the user experience.
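Akismet is usually consumed through a plugin, but it also exposes a plain REST endpoint; below is a hedged sketch of a comment check. The API key, blog URL, and exact field set are placeholders, so check the current Akismet documentation before relying on it.

    # Rough sketch of an Akismet comment-check call. The key, blog URL and
    # field names are placeholders; verify them against the Akismet docs.
    import urllib.parse
    import urllib.request

    API_KEY = "your-akismet-key"        # placeholder
    ENDPOINT = f"https://{API_KEY}.rest.akismet.com/1.1/comment-check"

    def looks_like_spam(user_ip: str, user_agent: str, comment: str) -> bool:
        data = urllib.parse.urlencode({
            "blog": "https://example.com/",   # placeholder site URL
            "user_ip": user_ip,
            "user_agent": user_agent,
            "comment_type": "comment",
            "comment_content": comment,
        }).encode()
        with urllib.request.urlopen(ENDPOINT, data=data) as resp:
            return resp.read().decode().strip() == "true"   # "true" means spam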
#9
0
Honest bots, such as search engines, will typically access your robots.txt. From that you can learn their useragent string and add it to your bot list.
Clearly this doesn't help with malicious bots which are pretending to be human, but for some applications this could be good enough if all you want to do is filter search engine bots out of your logs (for example).
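A small sketch of that idea, assuming a combined-format access log at a made-up path: collect the user agent of every client that fetched robots.txt, then drop those agents from the rest of the log.

    # Learn bot user-agents from robots.txt hits in an access log, then
    # filter those agents out of the remaining lines. The log path and
    # combined log format are assumptions for the example.
    import re

    LOG_LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

    with open("access.log") as log:     # hypothetical log file
        lines = log.readlines()

    bot_agents = set()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m.group("path").startswith("/robots.txt"):
            bot_agents.add(m.group("ua"))          # honest bots announce themselves here

    human_lines = [
        line for line in lines
        if not (m := LOG_LINE.search(line)) or m.group("ua") not in bot_agents
    ]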