Update – I have a more complete tutorial on how to block bots with Squid over on my wiki which you can view here.
I’ve written before about my reverse proxy and how it allows me to accelerate content delivery and to run multiple webservers using a single IP address. However, it is capable of so much more.
Squid uses access control lists (acls) to govern who can do what with the proxy server. For example, you can set acls to only allow certain computers to access the internet, or to allow access via the cache only at certain times of day. There are myriad options that you could configure, but one in particular struck me as being exceptionally useful: you can use acls to block certain useragents.
In a conventional scenario you would use .htaccess on the server to block access to various bad bots. If you are the administrator of several or maybe even a few dozen sites, it becomes a chore to keep the list of bots and nefarious useragents up to date in every .htaccess file. In my case, however, all traffic passes through the reverse proxy, so denying access to those bots and useragents is trivial: you create a single acl and it applies to every site that the proxy is fronting for.
Setting it up couldn’t be easier.
In my case my squid.conf is almost identical to the one used in my reverse proxy tutorial. One of the key things to consider when adding an acl to block certain useragents is ordering: squid checks its http_access rules from top to bottom and acts on the first one that matches, so the rule for our new acl needs to come before the rules that allow access to the proxied sites.
First up we need to define our acl. As per my tutorial, I need to add this acl, which I will be calling ‘badbrowsers’, just above the first ‘cache_peer’ entry in squid.conf. I will be storing all the bad bot entries in a separate text file to avoid a messy squid.conf. In order to get squid to reference a separate file, the location of the file must be enclosed in double quotes. So now we define our acl exactly as follows:
acl badbrowsers browser "/etc/squid/badbrowsers.conf"
Now that the acl has been defined, we must decide on the action that will occur when it is triggered. For this we need to scroll down through our squid.conf and, on a new line just above the http_access entries for our proxied sites, deny http access for our acl as follows:
http_access deny badbrowsers
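Because the position of that line matters, here is a minimal Python sketch that checks whether the deny rule sits above the first allow rule in a squid.conf. It is only an illustration: the function name deny_precedes_allow, the acl name our_sites and the domain .example.com are made up for this example and are not taken from my actual configuration, and the parsing is deliberately naive (it splits on whitespace and treats everything after a # as a comment).

```python
def deny_precedes_allow(conf_text, acl_name="badbrowsers"):
    """Return True if 'http_access deny <acl_name>' comes before the first 'http_access allow'."""
    deny_line = None
    first_allow = None
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        # Naive parse: drop anything after '#', then split on whitespace.
        tokens = raw.split("#", 1)[0].split()
        if len(tokens) < 3 or tokens[0] != "http_access":
            continue
        if tokens[1] == "deny" and acl_name in tokens[2:] and deny_line is None:
            deny_line = number
        elif tokens[1] == "allow" and first_allow is None:
            first_allow = number
    if deny_line is None:
        return False  # the deny rule is missing altogether
    return first_allow is None or deny_line < first_allow


# Hypothetical squid.conf fragment; 'our_sites' and '.example.com' are placeholders.
SAMPLE = """\
acl badbrowsers browser "/etc/squid/badbrowsers.conf"
acl our_sites dstdomain .example.com
http_access deny badbrowsers
http_access allow our_sites
http_access deny all
"""

print(deny_precedes_allow(SAMPLE))
```

To try it against a real file you could pass in open("/etc/squid/squid.conf").read() instead of SAMPLE.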
That’s all the configuration needed in our squid.conf, so save your changes. Next we will create and edit the file that will contain our bad bots and useragents.
When defining our acl I chose to locate this file in /etc/squid. So change to this directory and, using your favourite editor, create a file called badbrowsers.conf. On each line in this file we can add a banned useragent using a regular expression. I’ve noticed that most of the comment spam I have been receiving lately has been coming from a useragent calling itself "Jakarta Commons-HttpClient/3.0.1". To banish this useragent add a line to your badbrowsers.conf file with the following:
^Jakarta
That’s it. That’s all you need. The ^ anchors the pattern to the start of the useragent string, so once the first word is matched you don’t need anything else. You can elaborate on this with regular expressions to catch whatever other useragents you like.
Once you are happy with your configuration, save your changes and restart squid. No more bad bots.
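If you want to see what a pattern will and won’t catch before putting it in badbrowsers.conf, a rough Python sketch like the one below can help. Treat it as an approximation: Python’s re module is not the same regex engine that squid uses, so only simple patterns such as ^Jakarta should be assumed to behave identically, and the helper names compile_patterns and is_blocked are made up for this example. Matching here is case-sensitive, which as far as I know is also squid’s default unless the acl is defined with -i.

```python
import re

# Contents of a badbrowsers.conf-style file: one regular expression per line.
PATTERNS_TEXT = "^Jakarta\n"


def compile_patterns(text):
    """Compile one regex per non-empty line."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]


def is_blocked(user_agent, patterns):
    """Return True if any pattern matches somewhere in the useragent string."""
    # search() scans the whole string, so only ^ or $ pin a pattern to an end.
    return any(p.search(user_agent) for p in patterns)


patterns = compile_patterns(PATTERNS_TEXT)

for agent in [
    "Jakarta Commons-HttpClient/3.0.1",   # starts with Jakarta
    "Mozilla/5.0 (X11; Linux x86_64)",    # ordinary browser
    "SomeBot/1.0 Jakarta",                # Jakarta present, but not at the start
]:
    print(agent, "->", "blocked" if is_blocked(agent, patterns) else "allowed")
```

Only the first useragent is reported as blocked, because the ^ ties the match to the beginning of the string.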