Can I block search crawlers for every site on an Apache web server?

Time: 2021-06-11 16:52:05

I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really not like it if the staging sites get indexed.


Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?


Changing the robots.txt wouldn't really work since I use scripts to copy the same code base to both servers. Also, I would rather not change the virtual host conf files either as there is a bunch of sites and I don't want to have to remember to copy over a certain setting if I make a new site.


6 Answers



Create a robots.txt file with the following contents:


User-agent: *
Disallow: /

Put that file somewhere on your staging server; your directory root is a great place for it (e.g. /var/www/html/robots.txt).


Add the following to your httpd.conf file:


# Exclude all robots
<Location "/robots.txt">
    SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt

The SetHandler directive is probably not required, but it might be needed if you're using a handler like mod_python, for example.


That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.


(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)




You can use Apache's mod_rewrite to do it. Let's assume that your real host is www.example.com and your staging host is staging.example.com. Create a file called 'robots-staging.txt' and conditionally rewrite the request to go to that.


This example would be suitable for protecting a single staging site, a bit of a simpler use case than what you are asking for, but this has worked reliably for me:


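<IfModule mod_rewrite.c>
  RewriteEngine on

  # Dissuade web spiders from crawling the staging site
  # (pattern as written matches in .htaccess context; in server or
  # virtual host context the path starts with a leading slash)
  RewriteCond %{HTTP_HOST}  ^staging\.example\.com$
  RewriteRule ^robots\.txt$ robots-staging.txt [L]
</IfModule>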

You could try to redirect the spiders to a master robots.txt on a different server, but some of the spiders may balk after they get anything other than a "200 OK" or "404 not found" return code from the HTTP request, and they may not read the redirected URL.


Here's how you would do that:


<IfModule mod_rewrite.c>
  RewriteEngine on

  # Redirect web spiders to a robots.txt file elsewhere (possibly unreliable)
  RewriteRule ^robots\.txt$ http://www.example.com/robots-staging.txt [R,L]
</IfModule>



Could you alias robots.txt on the staging virtualhosts to a restrictive robots.txt hosted in a different location?




To truly stop pages from being indexed, you'll need to hide the sites behind HTTP auth. You can do this in your global Apache config and use a simple .htpasswd file.


The only downside is that you now have to type in a username and password the first time you browse to any page on the staging server.
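As a rough illustration of the HTTP auth approach, here is a minimal sketch for the global part of httpd.conf (outside any VirtualHost block), assuming Apache 2.4 with mod_auth_basic, mod_authn_file and mod_authz_user loaded. The realm name "Staging", the user name and the .htpasswd path are placeholders, not anything from the original setup.

```apache
# Server-level section: merged into every virtual host unless a host
# defines its own conflicting access rules for the same paths.
<Location "/">
    AuthType Basic
    AuthName "Staging"
    AuthBasicProvider file
    # Hypothetical path; the file can be created with:
    #   htpasswd -c /etc/apache2/.htpasswd someuser
    AuthUserFile "/etc/apache2/.htpasswd"
    Require valid-user
</Location>
```

Because this lives in the server-level config rather than in each site's files, it should not need to be copied when a new virtual host is added, though it is worth checking on a new site that a 401 is actually returned.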




Depending on your deployment scenario, you should look for ways to deploy different robots.txt files to dev/stage/test/prod (or whatever combination you have). Assuming you have different database config files (or whatever's analogous) on the different servers, this should follow a similar process (you do have different passwords for your databases, right?).


If you don't have a one-step deployment process in place, this is probably good motivation to get one... there are tons of tools out there for different environments - Capistrano is a pretty good one, and favored in the Rails/Django world, but is by no means the only one.


Failing all that, you could probably set up a global Alias directive in your Apache config that applies to all virtual hosts and points to a restrictive robots.txt.




Try using Apache to stop bad robots. You can find lists of crawler user agents online, or allow only browsers rather than trying to block every bot.
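One way this user-agent approach might look, as a sketch only: it assumes Apache 2.4 with mod_setenvif and mod_authz_core, and the crawler names in the pattern are an illustrative, incomplete list chosen for this example rather than an authoritative one. The variable name is_crawler is also made up.

```apache
# Flag requests whose User-Agent matches a known crawler (case-insensitive).
SetEnvIfNoCase User-Agent "(googlebot|bingbot|slurp|duckduckbot|baiduspider|yandex)" is_crawler

# Server-level section, so it applies across virtual hosts.
<Location "/">
    <RequireAll>
        Require all granted
        # Flagged requests receive 403 Forbidden
        Require not env is_crawler
    </RequireAll>
</Location>
```

Keep in mind that the User-Agent header is supplied by the client, so this only deters crawlers that identify themselves honestly.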



