How to Defeat Bad Web Robots With Apache

By Lee Killough
Modified 1 March 2010 by gypsy



Download gypsy's package that uses a mysql database rather than Lee's flat files.

NOTE: This page has become somewhat outdated :-( I first wrote it in 2001. If you use newer versions of Linux, you may need to use iptables instead of ipchains, and the syntax is slightly different. Also, virtual hosting (virtual domains) is much more common than when this page was first written. If you run a webserver on one or more virtual hosts, then these instructions must be applied to each <VirtualHost> separately, and not just to the global rules in the httpd.conf file, although include files may help eliminate duplication. Finally, the rules in this document were written for Apache 1.3.x, and have not been tested with Apache 2.0.x.

Introduction

This article describes how to configure Apache to defeat bad robots, programs which automatically fetch and scan web pages, sometimes for bad purposes.

I'll spare you an introduction to web robots, spiders, etc., because you probably already know about them -- that's why you're reading this. But for background information, see the robots.txt page.

The term "robots" is used generically throughout this document to refer to any automata which perform web requests. It includes "spiders", a term which is sometimes used to refer to agents which check the validity of links or syntax on pages, but which don't index or analyze web pages' content.

Throughout this document, the Apache configuration file httpd.conf will be referred to often. On many Linux installations of Apache, it resides in /etc/httpd/conf/httpd.conf. But its exact location depends on how Apache was installed. For more information, refer to the documentation which came with your version of Apache, or refer to the Apache docs.

The /www directory will be referred to throughout this document. It should be replaced with the DocumentRoot path in your setup, or /www can simply be made a symlink to the DocumentRoot, e.g.:

ln -s /home/www /www

mod_rewrite

To use the Apache features described here, you must enable the mod_rewrite module. Before any mod_rewrite directives that follow, add this line to your httpd.conf file:

RewriteEngine On

robots.txt

robots.txt refers to a convention adopted to control which files on a web server are allowed to be accessed by robots, based on User-agents. A User-agent is a means of identification for an http client such as a robot. For example, Google uses Googlebot/2.1 as its User-agent identifier. For more information on robots.txt, see the robots.txt page.

However, robots.txt depends on robots obeying it, and many do not. In order for robots.txt to be effective in preventing a robot from accessing certain files on a web server, the robot must fetch /robots.txt first, and then observe it. Many robots do not even fetch /robots.txt, and a sizeable proportion of those that do, don't observe it.


Redirect / to /index.html

You might have noticed that on some web sites, an access to / causes an immediate redirection to /index.html.

Why would you do this, if / and /index.html refer to the same file, and / is shorter?

The reason is that robots.txt cannot exclude / without excluding everything underneath it.

So to prevent robots from visiting /, while still allowing them to visit pages underneath it, add a rule such as this to httpd.conf:

RewriteRule ^/$ /index.html [R,L]

This causes accesses to / to be redirected to /index.html, which can be specifically excluded in robots.txt:

Disallow: /index.html
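The reason this trick works is that robots.txt exclusions are prefix matches. Here is a minimal shell sketch of that matching rule. The function name is_disallowed is made up for illustration, and the sketch ignores empty Disallow values and any wildcard extensions:

```shell
#!/bin/sh
# Sketch: robots.txt Disallow values are treated as path prefixes.
# is_disallowed is a hypothetical helper, not part of any robot or of Apache.
is_disallowed() {
    # $1 = Disallow value, $2 = requested URL path
    case "$2" in
        "$1"*) echo yes ;;
        *)     echo no ;;
    esac
}

is_disallowed /            /docs/page.html   # yes: "/" is a prefix of every path
is_disallowed /index.html  /docs/page.html   # no
is_disallowed /index.html  /index.html       # yes
```

Excluding / would therefore exclude the whole site, while excluding /index.html excludes only the front page.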

Use Dynamic robots.txt

One way to better control robots.txt is to make its content dynamic, for example depending on the User-agent and/or domain of the robot.

  1. Add a line similar to this to httpd.conf:

    RewriteRule /robots\.txt$ /cgi-bin/robots.pl [L,T=application/x-httpd-cgi]
    

    This tells the server to silently rewrite any robots.txt request to a CGI script, which generates the content dynamically. The rewrite is internal, so the client never sees a redirect.

    Note that no ^ appears before /robots\.txt in the rule, because some robots might check for a robots.txt file in the same directory as the file(s) they are interested in fetching (this was proposed at one time as a way of letting individual users control robots in their own subdirectories if they did not have privileges to modify the root /robots.txt, but it never really caught on). Another reason is that many sites use web domain redirection services, in which /robots.txt, as just another file under a web page domain, gets translated to a path other than /robots.txt on the destination server, because paths in the URL after the web domain are appended to a subdirectory on the destination server. Therefore any path ending in /robots.txt should be considered a request for the /robots.txt file.

  2. The CGI script would look something like this (robots.pl):

    #!/usr/bin/perl
    
    $| = 1;
    
    $host  = $ENV{'REMOTE_HOST'};
    $addr  = $ENV{'REMOTE_ADDR'};
    $agent = $ENV{'HTTP_USER_AGENT'};
    
    print "Content-type: text/plain\n\n";
    
    if ($host =~ /\.googlebot\.com$/i && $agent =~ /^Googlebot/) {
        print <<'EOF';
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    EOF
    } else {
        print <<'EOF';
    User-agent: *
    Disallow: /
    EOF
    }
    

    This script returns two different answers, depending on which robot is accessing it.

    If it sees that Googlebot is the robot, and that the request is really coming from googlebot.com, then it sends Googlebot a reasonable robots.txt file that Googlebot will respect.

    If, however, it is not Googlebot that is accessing robots.txt, then a totally restrictive robots.txt will be sent, which should prevent any further requests, if the robot respects it.

This allows you to select which robots are allowed to crawl your site, without disclosing your policy to all of them.
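The reason given in step 1 for leaving out the ^ anchor can be checked against a few sample paths. This is only a sketch: grep -E stands in for Apache's regular expression matching (an assumption), and the helper name and sample paths are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare the unanchored pattern from the RewriteRule with an
# anchored variant. grep -E is used as a stand-in for Apache's regex engine.
matches() {
    # $1 = pattern, $2 = request path; prints "yes" or "no"
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then
        echo yes
    else
        echo no
    fi
}

for path in /robots.txt /~user/robots.txt /robots.txt.bak; do
    printf '%-20s unanchored=%s anchored=%s\n' "$path" \
        "$(matches '/robots\.txt$' "$path")" \
        "$(matches '^/robots\.txt$' "$path")"
done
```

Only the unanchored pattern matches /~user/robots.txt, and the trailing $ keeps both patterns from matching /robots.txt.bak.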

But how do you enforce it?


Track and Enforce robots.txt Accesses

By recording which hosts access robots.txt, you can enforce its rules.

  1. At the end of the robots.pl script above, append these lines:

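    $mapfile = "/www/.robots";
    
    # Load the addresses already recorded (lines of the form "ADDR -"),
    # skipping comment lines.
    if (open(MAP,$mapfile)) {
      while (<MAP>) {
       $robot{$1}=1 if !/^\s*#/ && /^(\S+) -$/;
      }
      close (MAP);
    }
    
    # Append this client if it is new: host and agent as comments,
    # then the map entry itself.
    if (!defined($robot{$addr}) && open(MAP,">>$mapfile")) {
       flock(MAP, 2);     # exclusive lock
       seek(MAP, 0, 2);   # move to end of file
       print MAP "\n# $host" if $host ne $addr;
       print MAP "\n# $agent\n$addr -\n";
       close(MAP);
    }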
    

    This code records accesses to robots.txt in the file $mapfile. That should be a file which is world-writable but not readable by http accesses. A filename beginning with a . (period) is customary in these instances. (To prevent http accesses of files beginning with periods, see below.)

  2. Be sure to set the $mapfile variable correctly, and to initially create an empty file with the appropriate permissions so that the script can append to it while running as apache or another non-privileged user. The file should be somewhere underneath DocumentRoot, to avoid security problems. For example:

    touch /www/.robots
    chmod 666 /www/.robots
    
  3. Configure Apache to recognize hosts which have accessed robots.txt by adding these lines to httpd.conf:

    RewriteMap  robots txt:/www/.robots
    RewriteCond ${robots:%{REMOTE_ADDR}|NOT-FOUND} !=NOT-FOUND
    RewriteRule .* - [E=IS_ROBOT:true]
    

    This sets the IS_ROBOT environment variable to true if a requestor's IP address matches one which was recorded earlier as having accessed robots.txt. This environment variable is available to any later httpd.conf directives, and to any CGI scripts that are executed.

    The filename after the : on the RewriteMap line should match $mapfile in the script.

  4. Make Apache reject accesses if a host has accessed robots.txt before, unless it is from a host known to be one of the "good" robots:

    RewriteCond %{ENV:IS_ROBOT} true
    RewriteCond %{REMOTE_HOST} !\.googlebot\.com$    [NC]
    RewriteRule .* - [F,L]
    

    This rule should come after the rule above for /robots.txt, so that /robots.txt is always allowed to be read. It should be customized to match whatever policy you want robots.txt to enforce.
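To make the map file format from steps 1 and 3 concrete, here is a rough shell imitation of the lookup that ${robots:%{REMOTE_ADDR}|NOT-FOUND} performs. It is a sketch only: lookup_map is a made-up helper, the addresses and agent names are documentation examples, and Apache's own parsing of txt: maps may differ in details.

```shell
#!/bin/sh
# Hypothetical helper: look up a key in a "txt:" style map file, printing
# the value or NOT-FOUND, roughly like ${robots:%{REMOTE_ADDR}|NOT-FOUND}.
lookup_map() {
    # $1 = map file, $2 = key (an IP address here)
    awk -v key="$2" '
        /^[ \t]*#/ { next }                          # skip comment lines
        $1 == key  { print $2; found = 1; exit }     # first match wins
        END        { if (!found) print "NOT-FOUND" }
    ' "$1"
}

# Sample map in the format written by robots.pl (all entries are made up).
map=$(mktemp)
cat > "$map" <<'EOF'

# crawler.example.net
# ExampleBot/1.0
192.0.2.10 -

# Mozilla/4.0 (compatible)
198.51.100.7 -
EOF

lookup_map "$map" 192.0.2.10     # prints "-"
lookup_map "$map" 203.0.113.5    # prints "NOT-FOUND"
rm -f "$map"
```

Any value other than NOT-FOUND means the address has fetched robots.txt before, which is exactly what the RewriteCond in step 3 tests.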

But this does nothing to stop robots which do not access robots.txt.


Lay Spider Traps

Create a file in your document root named bad.html containing:

<!doctype html public "-//w3c//dtd HTML 4.0 transitional//en"> 
<HTML><HEAD> 
   <META http-equiv="Content-Type" CONTENT="text/html; charset=iso-8859-1">
   <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
   <TITLE>Enforcer</TITLE></HEAD>
<BODY>
<!-- spider trap -->
<A HREF="/bad.html" onclick="return false;" onmouseover="window.status='Do not follow this link, or you will be blocked from this site. This is a spider trap.'; return true;">
<IMG height="1" width="1" align="right" border="0" src="/images/onepixel.gif" alt="Do not follow this link, or you will be blocked from this site. This is a spider trap."></A>
<!-- end spider trap -->
</BODY></HTML>

Invisible links (I call them "spider traps") can be added to pages, so that if they are traversed, they catch the spider/robot in the act.

  1. Lay traps at the beginning and ending of frequently-visited pages:

    <!-- spider trap -->
    <a href="/bad.html" onclick="return false;" onmouseover="window.status='Do not follow this link, or your host will be blocked from this site. This is a spider trap.'; return true;">
    <img height="1" width="1" align="right" border="0" src="/images/spider_trap_do_not_visit.gif" alt="Do not follow this link, or your host will be blocked from this site. This is a spider trap.">
    </a>
    <!-- end spider trap -->
    

    This spider trap is made up of a floating (align=right) image inside a hyperlink. The image file is named spider_trap_do_not_visit.gif, and a warning message is put in the ALT attribute, so that Lynx users can tell what it means. Furthermore, on JavaScript-enabled browsers, the link does not respond when clicked on, and it prints a warning message on the status bar.

    There are two such spider traps on this page. They are very hard to find, however, if images are automatically loaded. They are located in the upper right and lower right corners of this page. The idea is to make the traps the first links on the page that an automatic program will scan, so that it is trapped as early as possible.

    The image can be any image, preferably nothing but a blank image of the same color as the text background.

    A simpler method is to use a . (period) for the hyperlink, and to make its color match the background color of the page, but then it will be less friendly to Lynx users:

    <!-- spider trap -->
    <p align=right>        <!-- optional -->
    <a href="/bad.html" onclick="return false" onmouseover="window.status='Do not follow this link, or your host will be blocked from this site. This is a spider trap.'">
    <font size="-2" color="white">.</font></a>
    <!-- end spider trap -->
    

    Note that using an empty <a> </a> container is not correct HTML, and that many robots, just like many browsers, completely ignore such hyperlinks.

  2. Create a script which records bad hosts (bad.pl):

    #!/usr/bin/perl
    
    $| = 1;
    
    $host  = $ENV{'REMOTE_HOST'};
    $addr  = $ENV{'REMOTE_ADDR'};
    $agent = $ENV{'HTTP_USER_AGENT'};
    $url   = $ENV{'REQUEST_URI'};
    
    $mapfile = "/www/.bad";
    
    if (open(MAP,$mapfile)) {
      while (<MAP>) {
       $bad{$1}=1 if !/^\s*#/ && /^(\S+) -$/;
      }
      close (MAP);
    }
    
    if (!defined($bad{$addr}) && open(MAP,">>$mapfile")) {
       flock(MAP, 2);
       seek(MAP, 0, 2);
       print MAP "\n# $host" if $host ne $addr;
       print MAP "\n# $agent\n$addr -\n";
       close(MAP);
    }
    
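    # Escape HTML metacharacters before echoing the URL back to the client
    $url =~ s/&/&amp;/g;
    $url =~ s/</&lt;/g;
    $url =~ s/>/&gt;/g;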
    
    print <<"EOF";
    Status: 403 Forbidden
    Content-type: text/html
    
    <html><head><title>403 Forbidden</title></head><body>
    <h1>403 Forbidden</h1>
    You don't have permission to access $url on this server.
    </body></html>
    EOF
    
  3. Just like the file used to record robots, this script's $mapfile must be set to the name of a file to append entries to, and that file must be initialized with correct permissions. For example:

    touch /www/.bad
    chmod 666 /www/.bad
    
  4. Detect and reject accesses from hosts marked bad:

    RewriteMap  bad txt:/www/.bad
    RewriteCond ${bad:%{REMOTE_ADDR}|NOT-FOUND} !=NOT-FOUND
    RewriteRule .* - [F,L]
    
  5. Detect accesses to spider traps, and record them as bad hosts by running the script:

    RewriteRule ^/bad\.html$ /cgi-bin/bad.pl [L,T=application/x-httpd-cgi]
    

    Note: This RewriteRule, and the one right before it, only need to appear once in the httpd.conf file. I say this because some might interpret the instructions here and below to mean that this RewriteRule should appear in multiple places in the httpd.conf file. But it is only needed once, because the state of the /www/.bad file does not change in the middle of an http access (only at the end), and because the rule is final (the 'L' in the RewriteRule's options), hence making any later rules of the same pattern redundant.

  6. Put the spider trap into the robots.txt exclusion list for good robots by modifying the robots.pl script shown above, e.g.:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    

    Becomes:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /bad.html
    

    This way, good robots won't be punished.

Now, whenever /bad.html is accessed by a client, regardless of whether it accessed robots.txt first, it will be flagged as a bad host, and will not be allowed to access any more pages. Normal users should not run into it, because the links are invisible, or carry a clear warning message.

This method is unique, because it immediately records information about bad robots, immediately restricting future accesses from the same hosts.


Permanently Exclude Hosts Which Violate robots.txt

With the rules above, hosts which access robots.txt are forbidden from accessing files which are excluded by robots.txt, but they are not forbidden from accessing other files. With the bad.pl script, rules can be added to permanently reject hosts which violate robots.txt. This way, if a host identifying itself as a robot by accessing robots.txt then accesses a file it shouldn't, it will be permanently forbidden from accessing any more files (except robots.txt).

For example:

# Mark Googlebot bad if it violates robots.txt
RewriteCond %{ENV:IS_ROBOT} true
RewriteCond %{REMOTE_HOST} \.googlebot\.com$    [NC]
RewriteRule ^(/cgi-bin/|/images/) /cgi-bin/bad.pl [L,T=application/x-httpd-cgi]

# Mark all other hosts bad if they violate robots.txt
RewriteCond %{ENV:IS_ROBOT} true
RewriteCond %{REMOTE_HOST} !\.googlebot\.com$    [NC]
RewriteRule .* /cgi-bin/bad.pl [L,T=application/x-httpd-cgi]

(Be sure to remove any earlier rules which might prevent these rules from working correctly.)


Catch Sloppy SGML/HTML Parsing

Many robots do not handle SGML or HTML correctly. These rules catch them and punish them:

RewriteRule &amp; /cgi-bin/bad.pl [NC,L,T=application/x-httpd-cgi]

If &amp; appears in a URL, it means that a robot traversed a hyperlink which was encoded with &amp; in the source. &amp; is an SGML entity for &, and must be used in hyperlinks which contain ampersands, in order for them to be 100% correct HTML. But many robots don't parse SGML entities correctly, and this gives them away immediately. Also, if an &amp; appears in a hyperlink, it probably means that the URL points to a CGI script, which you usually don't want robots to visit anyway. Since no decent human browsers are this sloppy about SGML, an &amp; appearing in a URL is almost certainly a sign of a bad robot. Note that a RewriteRule pattern is matched only against the URL path, not the query string, so an &amp; which appears after the ? has to be tested with a RewriteCond on %{QUERY_STRING} or %{THE_REQUEST} instead.

RewriteRule \# /cgi-bin/bad.pl [L,T=application/x-httpd-cgi]

If the requested URL contains a # in it, then it means that an even worse robot is not correctly stripping off #'s in hyperlinks, or is not handling escaping correctly. Such a robot is usually malicious.


Prevent and Record Accesses to Paths Containing /.

Paths containing /., such as /.htaccess, /.bad, or /.robots, should not be allowed to be accessible, and any attempts to access them should mark a host as suspicious, and prevent any further accesses from that host:

# Reject accesses to paths containing /. but not /.. before ?
RewriteRule ^[^?]*/\.[^\.] /cgi-bin/bad.pl [L,T=application/x-httpd-cgi]

Catch Bad Robots by HTTP_USER_AGENT or HTTP_REFERER

This one is controversial (see this Slashdot discussion), and it is not foolproof, since robots can easily change their HTTP_USER_AGENT and HTTP_REFERER strings. However, if they reveal themselves, then they can and should be stopped -- permanently. Here is a huge list of rules for stopping suspicious agents:

# Robots known or highly suspected of collecting email addresses for spam
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|Bullseye|CherryPicker|Crescent|ecollector|EmailCollector|Email.Extractor|EmailSiphon|EmailWolf|ExtractorPro|fastlwspider|.*LWP|Digger|.*hhjhj@yahoo|Microsoft.URL|Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|NICErsPRO|SurfWalker|Telesoft|WebBandit|WebEMailExtrac|Zeus.*Webster) [NC,OR]

# Robots (sometimes called spiders) which regularly violate robots.txt
RewriteCond %{HTTP_USER_AGENT} ^(ADSARobot|.*almaden\.ibm|ASSORT|big.brother|bumblebee|Digimarc|FavOrg|FAST|.*fluffy|.*Girafabot|HomePageSearch|IncyWincy|Ingelin|NPBot|Openfind|OpenTextSiteCrawler|OrangeBot|Robozilla|ScoutAbout|.*searchhippo|searchterms\.it|sitecheck|UIowaCrawler|.*webcraft@bea\.com|WEBMASTERS|WhosTalking|WISEbot|Yandex) [NC,OR]

# Agents used for both good and bad purposes, such as sucking up bandwidth
# by downloading entire sites, or probing servers for security exploits.
RewriteCond %{HTTP_USER_AGENT} ^(ASPSeek|Deweb|Fetch|FlashGet|Getleft|GetURL|GetWebPage|.*HTTrack|KWebGet|libwww-perl|Mirror|NetAnts|NetCarta|netprospector|Net.Vampire|pavuk|PSurf|PushSite|reget|Rsync|Shai|SpiderBot|SuperBot|tarspider|Templeton|w3mir|web.by.mail|WebCopier|WebCopy|WebMiner|WebReaper|WebSnake|WebStripper|webvac|webwalk|WebZIP|Wget|XGET) [NC,OR]

# Miscellaneous (suspicious -- more information would be appreciated)
RewriteCond %{HTTP_USER_AGENT} ^(ah-ha|aktuelles|amzn_assoc|ATHENS|attache|bew|disco|.*DTS.Agent|Favorites.Sweeper|FEZhead|Generic|GetRight|go-ahead-got-it|.*Harvest|IBM_Planetwide|leech|MCspider|NetResearchServer|nost\.info|OpaL|PackRat|RepoMonkey|.*Rover|Spegla|SqWorm|.*TrueRobot|UtilMind|vspider|.*WUMPUS) [NC,OR]

# Blank or 10-letter user agent
RewriteCond %{HTTP_USER_AGENT} ^(-?|[A-Z]{10})$                    [OR]

# A host which tries to hide itself in reverse DNS lookup
RewriteCond %{REMOTE_HOST} ^private$                               [NC,OR]

# Web surveying sites (may require using ipchains)
RewriteCond %{HTTP_REFERER} (traffixer|netfactual|netcraft)\.com   [NC,OR]
RewriteCond %{REMOTE_HOST} \.netcraft\.com$                        [NC,OR]

# A fake referrer that's often used -- use this unless your pages are related
# in some way to atomic energy and could really be linked to from www.iaea.org
RewriteCond %{HTTP_REFERER} ^[^?]*iaea\.org                        [NC,OR]

# "addresses.com" is a referer used by an email address extractor
RewriteCond %{HTTP_REFERER} ^[^?]*addresses\.com                   [NC,OR]

# A fake referrer that's used in conjunction with formmail exploits
RewriteCond %{HTTP_REFERER} ^[^?]*\.ideography\.co\.uk             [NC]

# The rule which blocks out further access from the host
RewriteRule .* /cgi-bin/bad.pl [L,T=application/x-httpd-cgi]

(Note that the RewriteCond's are connected by [OR]'s and that the last condition does not have [OR]. If only a subset of these rules is used, then the [OR] disjunction must be used correctly.)

These rules match robots known not to observe robots.txt, robots used to harvest email addresses for spam, robots used to suck up entire sites, robots used by script kiddies, etc.

Some of the entries are controversial (e.g. libwww-perl, Wget), because they can be put to good uses as well as bad ones. However, in my experience, they are usually used to attack or rip entire sites, disregarding robots.txt entirely. But there are good uses as well, and so if you do not consider these to be a problem, you can leave out the rules.

Entries such as Fast-WebCrawler, IncyWincy, and ASSORT are listed because I've seen them violate robots.txt -- ignoring it completely, or reading it and then immediately violating it -- not because they are inherently evil.

Please do not send me email saying that one of these engines should not be listed. All have been witnessed violating /robots.txt repeatedly at one time or another. If you think they are safe for you, you can remove them from the rules.

The bad.pl script is used, so that even if the client changes its HTTP_USER_AGENT and/or HTTP_REFERER strings, it will still be blocked. So if someone uses Wget, finds out there's a problem accessing the site with it, and changes Wget's HTTP_USER_AGENT string to Mozilla, they will still be blocked, until you manually remove their host from the /www/.bad file.
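Removing a host by hand can be done in any editor. As a convenience, here is a small shell sketch which deletes one address from the map file. The function name unblock is made up, the addresses are documentation examples, and the sketch assumes that nothing appends to the file while it runs (bad.pl could do so on a busy server).

```shell
#!/bin/sh
# Hypothetical helper: remove one address from a map file such as /www/.bad.
unblock() {
    # $1 = map file, $2 = IP address to remove
    [ -f "$1" ] || return 1
    tmp=$(mktemp) || return 1
    # Keep every line except the exact entry "ADDR -".
    # grep exits 1 when no lines are left, which is not an error here.
    grep -v -x -F -- "$2 -" "$1" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
    cat "$tmp" > "$1"      # rewrite in place so ownership and mode are kept
    rm -f "$tmp"
}

# Demonstration on a scratch file rather than the real map.
demo=$(mktemp)
printf '# Wget/1.8\n192.0.2.10 -\n\n# ExampleBot\n198.51.100.7 -\n' > "$demo"
unblock "$demo" 192.0.2.10
cat "$demo"
rm -f "$demo"
```

The comment lines recorded for the host are left behind, which is harmless, since comments are skipped when the map is read.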

To really avoid getting spam from email harvesters (e.g. EmailSiphon, CherryPicker, Crescent Internet ToolPak), don't put email addresses in web pages. Use forms for email (but don't use formmail.pl, because it has security problems).

The ^(-?|[A-Z]{10})$ entry deserves some explanation: It traps clients which hide their HTTP_USER_AGENT string entirely, and clients which use random 10-letter strings for their HTTP_USER_AGENT -- I've seen some malicious scripts generate 10-letter pseudo-random alphabetic strings for HTTP_USER_AGENT.
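To see which strings that entry catches, the pattern can be tried against a few sample user agents. This is a sketch: grep -E stands in for Apache's regular expression engine (an assumption), the helper name is made up, and LC_ALL=C is set so that [A-Z] does not match lowercase letters in some locales.

```shell
#!/bin/sh
# Sketch: test sample User-Agent strings against ^(-?|[A-Z]{10})$.
# The rule has no [NC] flag, so only uppercase letters count.
ua_is_suspicious() {
    if printf '%s\n' "$1" | LC_ALL=C grep -Eq '^(-?|[A-Z]{10})$'; then
        echo blocked
    else
        echo allowed
    fi
}

for ua in "" "-" "QWERTYUIOP" "qwertyuiop" "Mozilla/4.0 (compatible)"; do
    printf '[%s] -> %s\n' "$ua" "$(ua_is_suspicious "$ua")"
done
```

The empty string, a lone -, and QWERTYUIOP are blocked. The lowercase string and the ordinary browser string are allowed.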

netcraft.com and netfactual.com are often used to probe ("survey") web sites for web server and web page information, without regard to robots.txt. Since they are usually only interested in server response codes, it might be necessary to block them completely with ipchains. Unless, of course, you don't mind your site being probed for web "surveys."

iaea.org is often used as a bogus referrer (see this page). It often comes from telus.net. www.ideography.co.uk is similar, and usually is associated with formmail.pl exploits.

http://www.irs.ustreas.gov/auditors/whistleblowers/index.html is often used as a fake referrer, but I've never seen it associated with other suspicious activity, so it's probably just a prank.


Strange 404 Requests

Altavista and DIIbot use suspicious request methods to test 404 errors. These robots, and perhaps others, probably use this method in order to figure out what a server's 404 response is (a kind of "profile"), and assume that it's the same for all 404 pages.

But as those who have messed with ErrorDocument and mod_rewrite know, error documents can be made just as dynamic as normal pages, so the robot is making a risky assumption if it assumes that all 404 responses will be the same. Why doesn't it simply examine the Status field in the response header when it requests real pages? Why does it need to make up a fake URL it "knows" will be missing, and request it?

DIIbot attempts to access the file /test404response.

Altavista did something strange in 2002. They started requesting the page: /kjhgdkjhf1goifj2lktjelj34knfhjguih8bbj/index.htm.

However, these requests are completely innocuous to Apache, and probably do not need to be blocked.

I contacted AltaVista about these strange requests, and I got this response:

Thanks for contacting us about the machine scooter1.sv.av.com. This machine is
running a web crawler on behalf of AltaVista, and is probably following links
that it found on some other page. If your machine is not really a web server,
or if you have blocked access to it from the outside, our crawler will try only
a few times before giving up.

Currently, our crawler is scrubbing some URLs which do not exist. During the
work, the crawler leaves the strange path as stated in your claim. This
procedure is under our control. If the path left is different, or any our work
bothers you, please contact us with your URL, the IP of the suspicious machine,
and the suspicious path string. We will work on the issue then.

Here is some general information about our crawler and how you can prevent it
from accessing your site. If you don't want AltaVista crawlers to access your
pages, but you still want the pages available to the Internet at large, the
best course of action is to use a robots.txt file to prevent the crawler from
accessing your pages. Two excellent resources about robots.txt and the Robots
Exclusion Protocol are:

        http://help.altavista.com/adv_search/ast_haw_avoiding 
        http://info.webcrawler.com/mak/projects/robots/robots.html 

If your pages are on a server that does not permit you to access the ServerRoot
directory, you are not able to implement a robots.txt file without the
assistance of your ISP system administrator. If your ISP declines this
assistance, you can include METAtags on each of your pages to tell our crawler
not to index the page. The use of METAtags is described in the documents linked
above.

(One thing to keep in mind is that AltaVista knows nothing about the design of
your site, and therefore cannot evaluate, test, or verify your robots.txt file.
In the most general terms, if your site contains pages you do not want us to
access, you know where they are, but we don't.)

A note about passwords and privacy: creating a robots.txt file is a good way to
control crawlers (not just ours, but all search engines). However, users will
still be able to see the pages. If the information is really private or
confidential, you should protect it with a password.

If your web server contains private information, you should use either a
firewall or a password to protect the data from unauthorized access. If you
have already blocked access, and your server or firewall is notifying you of
unauthorized access attempts, there is no need to report the activity. In most
cases, both the crawler and the firewall are behaving correctly and the
activity is not really suspicious.

Again, if you feel this crawl activity is causing actual damage to your systems
or denial of service, please provide us with the requested info and we will
investigate further. Thanks.

Regards, Ron
AltaVista Crawl Support

When I replied that that wasn't a good enough explanation, I got this response:

Thanks for contacting us again for your concerns.

We apologize for any inconvenience potentially caused by this work.

You do not need to change your robots.txt.

As the information provided by our crawl engineers, this is for the soft 404
scrub that we are doing on hosts now. It is under our control. We know that
the works leave the strange path as "/kjhgdkjhf1goif ...."

If the path left is different, or any our work bothers you, please contact
us with your URL, the IP of the suspicious machine, and the suspicious path
string. We will work on the issue then.

Please let us know if you have further questions.

Thanks.

Regards,
Kate
AltaVista Crawl Support

Perhaps the best response is to throw confusion at them:

# Altavista
RewriteRule kjhgdkjhf1goifj2lktjelj34knfhjguih8bbj http://www.altavista.com/ [L,R]

# DIIbot
RewriteRule test404response http://www.findsame.com/ [L,R]

On the other hand, you could do something really far out, like this trick or treat:

http://www.odinsrealm.com/kjhgdkjhf1goifj2lktjelj34knfhjguih8bbj/

Warning: This page emits noise known to kill robots. So put your pet AIBO in a safe place :) K-9, did you hear that? Yes, Master


webcollage

webcollage, an X11 screensaver, uses AltaVista's image search engine to collect random images off the internet to display as a screensaver. AltaVista does not observe robots.txt for images, leading to many unwanted accesses from webcollage.

AltaVista does not crawl sites downloading images excluded by robots.txt -- it simply crawls HTML pages not excluded by robots.txt, looking for image references, and then indexes the images regardless of whether they are excluded by robots.txt, making them available for webcollage and others to perform keyword searches on the image URLs.

If you get an HTTP_USER_AGENT of webcollage and your images should have been excluded, you should return a 404 or 403 error, e.g.:

RewriteCond %{HTTP_USER_AGENT} ^webcollage
RewriteRule .* - [L,F]

But don't permanently block out the host, who was simply misdirected by AltaVista, unless you want to be extra harsh and punish webcollage users. After all, webcollage sucks up bandwidth by going around randomly fetching images which are only used for a screensaver. (Is that what landed Bert in bed with Osama bin Laden? :) And here is another reason to avoid webcollage :)


Catch Known Security Exploit Attempts

This one must be updated regularly, but here's a good place to start:

# Bad requests which look like attacks (these have all been seen in real attacks)
RewriteRule ^[^?]*/(owssvr|strmver|orders|Auth_data|redirect\.adp|MSOffice|DCShop|msadc|winnt|system32|script|autoexec|formmail\.pl|_mem_bin|NULL\.) /cgi-bin/bad.pl [NC,L,T=application/x-httpd-cgi]

This catches many attempted exploits, like the formmail.pl exploit, which allows someone to anonymously send spam from the web server to anywhere they want, and the DCShop exploit, which allows collection of online shoppers' credit card data.

# Filter out bad requests (may need to be adjusted to your needs)
RewriteCond %{THE_REQUEST} "^((GET|POST|HEAD) [^/]|CONNECT)" [NC]
RewriteRule .* /cgi-bin/bad.pl [L,T=application/x-httpd-cgi]

The last RewriteRule catches requests such as GET x, which are used to obtain information about a server (to exploit its vulnerabilities). All GET, POST and HEAD requests should begin with / in the URL. If they do not, they are most likely an attempted exploit. It also rejects CONNECT requests attempting to use the server as a proxy.
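The effect of the THE_REQUEST condition can be previewed on sample request lines. Again this is only a sketch, with grep -Ei standing in for Apache's case-insensitive matching (an assumption), and with made-up request lines and helper name:

```shell
#!/bin/sh
# Sketch: apply the THE_REQUEST pattern to sample request lines.
request_is_rejected() {
    if printf '%s\n' "$1" | grep -Eiq '^((GET|POST|HEAD) [^/]|CONNECT)'; then
        echo rejected
    else
        echo accepted
    fi
}

while IFS= read -r line; do
    printf '%-45s %s\n' "$line" "$(request_is_rejected "$line")"
done <<'EOF'
GET / HTTP/1.0
POST /cgi-bin/form.pl HTTP/1.0
GET x HTTP/1.0
GET http://www.example.com/ HTTP/1.0
CONNECT mail.example.com:25 HTTP/1.0
EOF
```

Note that a request with an absolute URL (the fourth sample) is rejected too, since its URL does not begin with /. That is the reason the rule may need adjusting to your needs.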


Use ipchains to Totally Block Certain Hosts

Some hosts, like those infected with the Code Red and Nimda worms, continue to access sites even when the server gives them back error codes. For these hosts, you can (at least on Linux) run a script in the background as root to block out the hosts completely, such as this:

#!/usr/bin/perl

$log = "/www/.blocklist";

open(LOG, $log) || die "Cannot open $log: $!";

seek(LOG, 0, 2);             # go to end of file

while (1) {
   while (<LOG>) {
      if (/^(\d+\.\d+\.\d+\.\d+)/ && !$hit{$1}++) {
         system("/sbin/ipchains -I input 1 -p tcp --dport 80 -s $1 -j DENY");
      }
   }
   seek(LOG, 0, 1);         # clear EOF
   sleep(1);
}

This script watches the file /www/.blocklist for new entries, and whenever a new entry is added, calls ipchains to add a DENY entry to the front of the input chain.

REJECT could be used instead of DENY, but DENY was chosen to slow down the attackers as much as possible.
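On newer Linux systems which use iptables instead of ipchains, the closest equivalent of DENY is the DROP target on the INPUT chain. The sketch below only prints the command rather than running it, so it can be inspected first. The function name block_cmd and the /sbin/iptables path are assumptions, and your firewall layout may call for a different chain or rule position.

```shell
#!/bin/sh
# Sketch: build the iptables counterpart of the ipchains command above.
# Nothing is executed; the command is only printed.
block_cmd() {
    # Accept only digits and dots, then require dotted-quad form,
    # mirroring the check in the Perl script.
    case "$1" in
        ''|*[!0-9.]*) return 1 ;;
    esac
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' || return 1
    echo "/sbin/iptables -I INPUT 1 -p tcp --dport 80 -s $1 -j DROP"
}

block_cmd 192.0.2.10
```

In the watcher script, the ipchains command line would be replaced with the printed one. As with ipchains, REJECT is available as an alternative target.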

Another CGI script titled block.pl, similar to bad.pl above, would then be created to log very bad hosts to /www/.blocklist. The only difference between bad.pl and block.pl needs to be the value of $mapfile:

$mapfile = "/www/.blocklist";

You might also want to put a sleep(5) call in the block.pl script after it appends to /www/.blocklist, to give the other script above and ipchains time to finish before the client makes another request.

block.pl would be invoked like this:

RewriteRule ^(/(scripts|msadc|MSADC|./winnt)|.*(default\.ida|[NX]{30}|c\+dir)) /cgi-bin/block.pl [L,T=application/x-httpd-cgi]

This catches the Code Red and Nimda worms. A shorter pattern could be used, but the pattern is flexible so that future variants will be caught as well.
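To get a feel for what the pattern catches, it can be tried on a few paths of the kind these worms request. This is a sketch: grep -E stands in for Apache's regular expression engine (an assumption), and the helper name and sample paths are illustrative rather than taken from real logs.

```shell
#!/bin/sh
# Sketch: apply the block.pl pattern to sample request paths.
worm_match() {
    if printf '%s\n' "$1" | LC_ALL=C grep -Eq \
        '^(/(scripts|msadc|MSADC|./winnt)|.*(default\.ida|[NX]{30}|c\+dir))'; then
        echo block
    else
        echo pass
    fi
}

for path in /default.ida /scripts/root.exe /c/winnt/system32/cmd.exe \
            /MSADC/root.exe /index.html /docs/scripts.html; do
    printf '%-30s %s\n' "$path" "$(worm_match "$path")"
done
```

The first four paths are blocked and the last two pass. Because /scripts is matched as a prefix, a legitimate top-level path such as /scripts.html would be blocked as well, so check the pattern against your own site layout.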

Put this rule before any of the other rules, so that if it matches, it will cause the host to be completely blocked. If this rule isn't put before the other rules, then if the host had been marked "bad" before, it might get stuck being marked "bad", but without being completely blocked with ipchains by this rule.

After this ipchains blocking mechanism has been set up, you can use block.pl anywhere bad.pl was used above, to completely block connections. However, it should probably be used only for the most blatant security exploit attempts, like those in the last section.

Note: Sometimes this is not enough to catch the worms, because either they send bad request headers which get caught before mod_rewrite is even called, or the requests are intercepted and partially blocked by ISPs. In these cases, you will need to use the error log monitor script below.


Completely Block Hosts Which Send Bad Requests

Some ISPs have installed annoying filters in their routers which block the Code Red and Nimda worms, but only halfway: An HTTP connection must be established to a server first, before any request data can be sent. The ISP's filter can only stop the transmission when it sees the worm's signature, but by then, it's too late, because the HTTP connection to the server was already made. The result is a dropped connection which leaves an httpd process hanging around waiting until timeout for the client to send it data. For Apache, this is worse than simply getting the worm requests. In essence, it is an ISP-created Denial of Service (DoS) attack!!!

Also, some worm variants produce bad http requests, which prevents mod_rewrite from being effective in catching them. In these cases, error log monitoring is probably necessary.

Solution:

  1. Substantially decrease Apache's Timeout parameter in the httpd.conf file:

    Timeout 10
    

    This causes Apache to time out much earlier than the default of 300 seconds. Depending on your server, this might need to be increased, but 300 seconds is probably too high.

    If you use a larger timeout than 10 seconds or so, be sure to look at the MaxClients parameter, and consider increasing it too. Every time an ISP filters out an already-started http connection, an Apache server process sits around doing nothing for the duration of Timeout, and it counts towards the MaxClients limit.

  2. Write a log monitoring script which calls ipchains every time a timeout is detected or a bad request is made:

    #!/usr/bin/perl
    
    $log = "/var/log/httpd/error_log";
    
    while (open(LOG, $log)) {
        seek(LOG, 0, 2) if !defined($inode);     # go to end first time only
        $inode = (stat(LOG))[1];
        do {
            while (<LOG>) {
                if (/client ([^\]]+)\] (read request line timed out|Client sent malformed Host header)/ && !$hit{$1}++) {
                   system("/sbin/ipchains -I input 1 -p tcp --dport 80 -s $1 -j DENY");
                }
            }
            sleep(1);
            $pos = tell(LOG);                  # save position
            seek(LOG, 0, 2);                   # find end of file
            $pos = 0 if $pos > tell(LOG);      # reset $pos to 0 if $pos > EOF
            seek(LOG, $pos, 0);                # go to position $pos
        } while ((stat($log))[1] == $inode);   # in case of rotatelogs
        close(LOG);
        while (! -e $log) {
            sleep(1);
        }
    }
    
    die "Cannot read $log: $!";
    

    Run this script in the background as root. It may need minor changes, if your error log lists errors in a different format. If the error log changes because of log file rotation, this script detects it, and stays with the latest error log.
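Before running the monitor as root, it may help to check which addresses its pattern would pick out of an existing log. The sketch below does only the extraction step, using sed -E (supported by GNU and BSD sed) in place of the Perl regex. The function name offenders and the sample log lines are made up, and a real error log may be formatted differently.

```shell
#!/bin/sh
# Sketch: list the unique client addresses the monitor script would block.
# Reads an error log on standard input; blocks nothing.
offenders() {
    sed -n -E 's/.*client ([^]]+)\] (read request line timed out|Client sent malformed Host header).*/\1/p' |
        sort -u
}

offenders <<'EOF'
[Tue Sep 18 10:00:01 2001] [error] [client 192.0.2.10] read request line timed out
[Tue Sep 18 10:00:02 2001] [error] [client 192.0.2.10] read request line timed out
[Tue Sep 18 10:00:03 2001] [error] [client 198.51.100.7] Client sent malformed Host header
[Tue Sep 18 10:00:04 2001] [error] [client 203.0.113.5] File does not exist: /www/missing.html
EOF
```

On a live server, something like offenders < /var/log/httpd/error_log gives a preview of the hosts that would be firewalled.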


Further Enhancements

Enhancements beyond what's presented above, include:

These enhancements are left as exercises for the reader. (Note: I've already implemented some of these partially or fully on my server, such as the expiration date and email notification. I've simply left out how to do them either for brevity, because I'm lazy, or because they are good exercises.)


For Further Reading


© Lee Killough