Tuesday, March 3, 2009

De-googling my website logs

One of the reasons I have a website is my fascination with the logs that IIS generates: seeing which browsers and operating systems visitors use, how many visit each day, and what they type into Google and Live Search (and other search engines of course, but to be fair two thirds of my traffic is from Google, and Live is almost everything else). Yes, you are being watched!

Every few days I analyse the logs with WebLog Expert, which makes lovely HTML reports with graphs and stuff. For instance, here are the operating systems you're all using!

I've always thought, though, that the impressive list of visitors each day was being inflated by search engine bots, by sites I use to check the site is up by "visiting" it and, frankly, by attempts to hack the site. There was a spike in traffic in August 2008, for instance, that turned out to be a 3-day-long SQL injection attack! Of course, I never knew how many of the visitors were real people.

Anyway, I decided to make the logs "pure": just real people, no automated traffic. I discarded logs from before 2008 but still had over a year of log files - about 450 text files, most of them hundreds of lines long, so I clearly needed something to process them. IIS 7 log files are quite simple - each line is a visitor doing something, downloading an image for instance, with lots of information on the same line, like the previous site visited, IP address, OS, etc. You coming to this page will have generated quite a few lines of text. The good thing, though, is that once you can find the lines to remove, deleting a line cleanly removes all the information about that request.

So, I started looking at what to remove from the logs. Most of the rubbish lines were from search engine bots, which I started to identify one at a time - Googlebot, MSNbot, etc - until the penny dropped that they all had "bot" in their name. I found that the SQL injection attacks all start with DECLARE. I excluded basicstate, hosttracker and blogger, since any line containing those is basically me. In the end, the list of strings to look for to identify unwanted lines is (so far):
- bot
- ysearch/slurp
- basicstate
- HostTracker
- blogger.com/blogpreview
- DECLARE
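
As a rough illustration of how these markers pick out unwanted lines, here is a minimal Python sketch. It is not what I actually ran (that was a batch file); the names UNWANTED_MARKERS, is_unwanted and keep_real_visitors are made up for the example, and the matching is assumed to be case-sensitive, the same as `find /v` without the /i switch.

```python
# Substrings that mark a log line as automated traffic. Matching is
# case-sensitive, like `find /v` without the /i switch.
UNWANTED_MARKERS = (
    "bot",                       # Googlebot, msnbot and other crawlers
    "ysearch/slurp",
    "basicstate",
    "HostTracker",
    "blogger.com/blog-preview",  # spelled as in the batch file
    "DECLARE",                   # start of the SQL injection requests
)


def is_unwanted(line: str) -> bool:
    """Return True if the log line contains any unwanted marker."""
    return any(marker in line for marker in UNWANTED_MARKERS)


def keep_real_visitors(lines):
    """Return only the lines that contain none of the markers."""
    return [line for line in lines if not is_unwanted(line)]
```

One thing to watch: a plain substring like "bot" would also throw away a real visitor whose request happens to contain those letters, for example a page called robots.html.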

I will find a better way to do this soon, but the first attempt is:
1. Make a text file of all the log file names (Open CMD, "dir *.log /b > list.txt")
2. Rename list.txt to a batch file, add "call processlog " before each log file name
3. Create a batch file called processlog.bat to do the processing.
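
The "list the files, then call something on each one" part of those steps could also be done in one go. The Python sketch below is only an assumption about how that might look: process_all_logs is a hypothetical helper, and process_one stands in for whatever does the per-file work (the job processlog.bat does here).

```python
from pathlib import Path


def process_all_logs(log_dir, process_one):
    """Call process_one(path) for every .log file in log_dir, in name order."""
    processed = []
    for path in sorted(Path(log_dir).glob("*.log")):
        process_one(path)
        processed.append(path.name)
    return processed
```

Calling process_all_logs(".", print) would simply print each log file's path, which is a handy dry run before letting anything modify the files.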

The batch file is below - basically it copies the contents of the file, minus any lines containing the first string, to a temporary file, then filters that into another temporary file against the next string, and so on until it's done, then replaces the original file. Simple! Okay, this is a very inefficient method but it worked just fine. Incidentally, it reduced my stats by about 75%!!! But I know they are real now....
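rem processlog.bat - strips unwanted lines from the log file passed as %1

rem Keep a copy of the original (backup is assumed to be an existing folder)
copy %1 backup /Y

rem Each pass drops the lines containing one string, alternating temp files
type %1 | find /v "bot" > tmp1.txt
type tmp1.txt | find /v "basicstate" > tmp2.txt
type tmp2.txt | find /v "ysearch/slurp" > tmp1.txt
type tmp1.txt | find /v "DECLARE" > tmp2.txt
type tmp2.txt | find /v "www.blogger.com/blog-preview" > tmp1.txt
type tmp1.txt | find /v "HostTracker" > tmp2.txt

rem Replace the original log with the filtered result
del %1 /Q
del tmp1.txt
ren tmp2.txt %1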
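
For comparison, the same clean-up can be done in a single pass over each file rather than six. This Python sketch is not the batch file above but a different technique with the same intent; filter_log and MARKERS are hypothetical names, the marker strings are copied from the batch file, and the backup folder name is assumed from the copy command.

```python
import shutil
from pathlib import Path

# Marker strings copied from the batch file; bytes are used so the log
# lines are written back exactly as they were read.
MARKERS = (b"bot", b"basicstate", b"ysearch/slurp", b"DECLARE",
           b"www.blogger.com/blog-preview", b"HostTracker")


def filter_log(path, backup_dir="backup"):
    """Strip marker lines from one log file, keeping a copy of the original.

    Returns a (kept, removed) tuple of line counts.
    """
    path = Path(path)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, backup / path.name)

    tmp = path.with_name(path.name + ".tmp")
    kept = removed = 0
    with open(path, "rb") as src, open(tmp, "wb") as dst:
        for line in src:
            if any(marker in line for marker in MARKERS):
                removed += 1
            else:
                dst.write(line)
                kept += 1
    tmp.replace(path)  # swap the filtered file in over the original
    return kept, removed
```

The returned counts make it easy to see how much of each file was automated traffic.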
