This post is in response to Graywolf’s post about Google Analytics and bot tracking.
Google Analytics and other JavaScript-based analytics programs cannot capture data about robot activity on a site, since crawlers generally do not run the tracking script. But there is an easy way to get that data: grep (search) the raw log files for crawler hits and analyse the results in a spreadsheet.

This technique works best in Linux or Mac OS X. I’m only familiar with Linux, so the example below assumes that you are running Ubuntu or another Debian-based GNU/Linux distro. If you don’t have Linux and can’t install it somewhere, you could also run the commands directly on your remote Linux server over SSH and then download the resulting CSV file.
Open up a terminal and navigate to the directory where your raw logfile is located. In this case the raw logfile is called access_log.
Then type the following line:
egrep '(Googlebot|Yahoo!|msnbot)' access_log | cut -d' ' -f1-30 --output-delimiter=, >search_engines.csv
You will then have a CSV file of search engine activity that you can open in a spreadsheet for analysis.
This is what the script does:
- egrep (equivalent to grep -E) extracts every line from the logfile that contains Googlebot, Yahoo!, or msnbot. You can add other bots to the pattern if you want their data too. access_log is the name of the logfile; change it to whatever your logfile is called.
- cut splits each line into columns, using a space as the column delimiter (-d' '). It keeps fields 1 to 30 (-f1-30); a typical log line has only about 20 to 25 space-separated fields, so the upper bound of 30 is deliberately generous. It then reassembles the columns with a comma as the output delimiter (--output-delimiter=, which is a GNU cut option).
- The script then saves the output as a CSV file called search_engines.csv.
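To see what the steps above produce, here is a small sketch that runs the same pipeline on a made-up three-line log. The file names sample_access_log and sample_search_engines.csv and the log entries themselves are invented for illustration, and I'm assuming the Apache combined log format and GNU cut. grep -E stands in for egrep here; the two behave the same.

```shell
# Build a tiny sample log (hypothetical entries, Apache combined format).
cat > sample_access_log <<'EOF'
66.249.66.1 - - [10/Oct/2008:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.10 - - [10/Oct/2008:13:56:01 -0700] "GET /about.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux i686)"
65.55.0.1 - - [10/Oct/2008:13:57:12 -0700] "GET /old-page HTTP/1.1" 404 512 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
EOF

# Same pipeline as in the post, pointed at the sample file.
# The browser hit from 192.0.2.10 is dropped; the two bot hits are kept.
grep -E '(Googlebot|Yahoo!|msnbot)' sample_access_log \
  | cut -d' ' -f1-30 --output-delimiter=, > sample_search_engines.csv

# Show the resulting CSV rows.
cat sample_search_engines.csv
```

In this log format the response code is the 9th space-separated field (200 and 404 in the sample), so that is the value to look for when you filter. If your server uses a different log format, the position will differ.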
Once the CSV file is open in a spreadsheet, delete the columns that you don’t want. Then go to the menu item Data -> Filter -> Auto Filter. You will then be able to filter the rows by search engine and by response code. For example, you could show only the 302 response codes that Googlebot encountered, or only the 404 errors that msnbot encountered.
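If you just want a quick tally without opening a spreadsheet, the same kind of breakdown can be roughed out on the command line. This is only a sketch: count_bot_status and bot_sample_log are names I made up, the log entries are invented, and it assumes the Apache combined log format, where the response code is the 9th space-separated field.

```shell
# Hypothetical helper: count hits per bot and response code in a logfile.
# Assumes the status code is field 9; adjust $9 if your format differs.
count_bot_status() {
    grep -E '(Googlebot|Yahoo!|msnbot)' "$1" |
    awk '{
        bot = "other"
        if ($0 ~ /Googlebot/)   bot = "Googlebot"
        else if ($0 ~ /Yahoo!/) bot = "Yahoo!"
        else if ($0 ~ /msnbot/) bot = "msnbot"
        count[bot " " $9]++
    }
    END { for (key in count) print count[key], key }' |
    sort -k2,2 -k3,3   # order by bot name, then by response code
}

# Made-up sample log to try it on.
cat > bot_sample_log <<'EOF'
66.249.66.1 - - [10/Oct/2008:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [10/Oct/2008:14:01:10 -0700] "GET /moved HTTP/1.1" 302 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [10/Oct/2008:14:02:44 -0700] "GET /contact.html HTTP/1.1" 200 880 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
65.55.0.1 - - [10/Oct/2008:14:05:12 -0700] "GET /old-page HTTP/1.1" 404 512 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
192.0.2.10 - - [10/Oct/2008:14:06:01 -0700] "GET /about.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux i686)"
EOF

count_bot_status bot_sample_log
# prints:
# 2 Googlebot 200
# 1 Googlebot 302
# 1 msnbot 404
```

Each output line is a hit count followed by the bot and the response code, so the 302s and 404s stand out at a glance. The spreadsheet is still the better tool once you want to see which URLs were involved.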
This is just a crude solution that should work for general log files. You can customize it if necessary. Many IIS logs are not constructed well and need some other manipulation, but that is a post for later.
