I run a small webserver, and lately it's been getting creamed by a search engine spider. What's the proper way to cool it down? Should I send it 5xx responses periodically? Is there a robots.txt setting I should be using? Or something else?
6 Answers
#1
Assuming that the spider is kind enough to respect robots.txt, you could restrict it from accessing your site with the following:
User-agent: *
Disallow: /
This will affect all spiders. Narrow it down by specifying the correct user-agent for the spider.
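For example, if your access logs show the spider identifying itself as ExampleBot (a placeholder name; substitute the real user-agent token from your logs), you could single it out. Some crawlers, notably Bing's and Yandex's, also honor a nonstandard Crawl-delay directive that throttles rather than blocks; Googlebot ignores it:

# Throttle the offending spider to one fetch every 10 seconds
User-agent: ExampleBot
Crawl-delay: 10
Disallow:

# Or shut it out entirely
User-agent: ExampleBot
Disallow: /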
If the crawler doesn't respect your robots.txt, you might want to restrict it from accessing your site by blocking its IP in your firewall.
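On a Linux server, a minimal sketch of the firewall approach with iptables, using 203.0.113.42 as a stand-in for the spider's address:

iptables -A INPUT -s 203.0.113.42 -j DROP

If the spider crawls from a whole range, you can drop the CIDR block instead (e.g. -s 203.0.113.0/24).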
EDIT: You can read more about robots.txt at robotstxt.org.
#2
Robots.txt should be your first port of call. The search bot should take note of these settings and stop hitting the pages that you deny access to. This is easily done by creating a file named robots.txt in the root of your website with the following syntax:
User-agent: *
Disallow: /
That syntax essentially says: all search bots (the wildcard *) are not allowed to index anything under /. More information is available at robotstxt.org.
If this doesn't work, the next step is to ban the IP address if possible.
#3
You can also build a sitemap and register it with the offending bot's search engine. The search engines will use the sitemap to determine which pages to hit, and how often. If your site is fully dynamic, it might not help much, but if you have a lot of static pages, it's a good way to tell the spiders that nothing changes from day to day.
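A minimal sitemap.xml sketch per the sitemaps.org protocol, with example.com, the date, and the changefreq value all placeholders; lastmod and changefreq are the hints that tell a crawler a page rarely changes:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2009-01-01</lastmod>
    <changefreq>yearly</changefreq>
  </url>
</urlset>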
#4
If it's ignoring robots.txt, the next best thing is to ban it by its user-agent string. Banning just the IP won't be of much use, as 99% of spiders these days are distributed across a bunch of servers.
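For Apache, a sketch of a user-agent ban in .htaccess, assuming mod_setenvif is enabled and Apache 2.2-style access control, with BadBot standing in for the offending agent string:

SetEnvIfNoCase User-Agent "BadBot" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot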
#5
User-agent: *
Disallow: /
#6
robots.txt should be your first choice. However, if the bot misbehaves and you don't have control of the firewall, you could set up an .htaccess restriction to ban it by IP.
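A sketch of that restriction, again in Apache 2.2-style syntax with a placeholder address:

Order Allow,Deny
Allow from all
Deny from 203.0.113.42

(On Apache 2.4 the equivalent is Require all granted combined with Require not ip inside a RequireAll block.)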