I've always suspected, though, that the impressive daily visitor counts were being inflated by search engine bots, by the monitoring sites I use to check the site is up (they do that by "visiting" it) and, frankly, by attempts to hack the site. There was a spike in traffic in August 2008, for instance, that turned out to be a three-day-long SQL injection attack! So I never knew how many of the visitors were real people.
Anyway, I decided to make the stats "pure": just real people, no automated traffic. I discarded logs from before 2008 but still had over a year of log files, about 450 text files, most of them hundreds of lines long, so I clearly needed something to process them. IIS 7 log files are quite simple: each line is a visitor doing something (downloading an image, for instance), with lots of information on the same line, like the previous site visited, IP address, OS, etc. Your visit to this page will have generated quite a few lines of text. The good thing, though, is that once you can find the lines to remove, deleting them takes all of that information out cleanly.
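To make the "one line per request" idea concrete, here is a minimal sketch in Python (not the tool used in this post). It assumes the logs are in the W3C extended format that IIS writes by default, where a `#Fields:` header names the space-separated columns. The function name `parse_w3c_log` and the sample lines are made up for illustration; the real logs will have more columns.

```python
def parse_w3c_log(lines):
    """Yield one dict per request line, keyed by the names in #Fields."""
    fields = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("#Fields:"):
            # The header lists the column names for the lines that follow.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            # Values are space-separated; spaces inside a value are logged as '+'.
            yield dict(zip(fields, line.split()))


# Made-up sample, not taken from the real logs.
sample = [
    "#Fields: date time cs-method cs-uri-stem c-ip cs(User-Agent) sc-status",
    "2008-08-14 10:15:02 GET /default.aspx 192.0.2.10 Mozilla/4.0+(compatible;+MSIE+7.0) 200",
]
for entry in parse_w3c_log(sample):
    print(entry["c-ip"], entry["cs-uri-stem"], entry["sc-status"])
```

Because every request is a self-contained line, dropping a line drops everything known about that request.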
So, I started looking at what to remove from the logs. Most of the rubbish lines were from search engine bots, which I started to identify one at a time (Googlebot, MSNbot, etc.) until the penny dropped: they all had "bot" in their name. I found that the SQL injection attacks all start with DECLARE. I excluded basicstate, hosttracker and blogger, since any line containing those is basically me. In the end, the list of strings that identify unwanted lines is (so far):
- bot
- ysearch/slurp
- DECLARE
- basicstate
- HostTracker
- www.blogger.com/blog-preview
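The list above boils down to a substring check on each log line. A minimal sketch in Python (again, not the batch approach used here), assuming a case-sensitive match, which is what `find` does unless told otherwise. The names `UNWANTED` and `is_unwanted` are made up for illustration, and the test strings are invented fragments rather than real log lines.

```python
# Substrings that mark a log line as automated traffic (hypothetical name).
UNWANTED = (
    "bot",
    "ysearch/slurp",
    "DECLARE",
    "basicstate",
    "HostTracker",
    "www.blogger.com/blog-preview",  # as spelled in the batch file
)


def is_unwanted(line, patterns=UNWANTED):
    """Return True if the line contains any of the unwanted substrings."""
    return any(p in line for p in patterns)


# Invented fragments for illustration only.
print(is_unwanted("GET /index.html Mozilla/5.0+(compatible;+Googlebot/2.1)"))
print(is_unwanted("GET /index.html Mozilla/4.0+(compatible;+MSIE+7.0)"))
```

One thing to bear in mind with a plain substring match: a string as short as "bot" will also throw away a real visitor's request if "bot" happens to appear anywhere else on the line, such as in a page name or referrer.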
I will find a better way to do this soon, but the first attempt is:
1. Make a text file of all the log file names (Open CMD, "dir *.log /b > list.txt")
2. Rename list.txt to a batch file and add "call processlog " before each log file name
3. Create a batch file called processlog.bat to do the processing.
The batch file is below. Basically it copies the contents of the log file, minus any lines containing the first string, to a temporary file, then filters that into another temporary file against the next string, and so on until it's done, then replaces the original file. Simple! Okay, this is a very inefficient method, but it worked just fine. Incidentally, it reduced my stats by about 75%! But I know they are real now.
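rem processlog.bat - %1 is the name of the log file to clean
rem Keep a copy first (assumes a folder named "backup" already exists here)
copy %1 backup /Y
rem Each pass keeps only the lines that do NOT contain the string (find /v)
type %1 | find /v "bot" > tmp1.txt
type tmp1.txt | find /v "basicstate" > tmp2.txt
type tmp2.txt | find /v "ysearch/slurp" > tmp1.txt
type tmp1.txt | find /v "DECLARE" > tmp2.txt
type tmp2.txt | find /v "www.blogger.com/blog-preview" > tmp1.txt
type tmp1.txt | find /v "HostTracker" > tmp2.txt
rem Swap the filtered result in for the original
del %1 /Q
del tmp1.txt
ren tmp2.txt %1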
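For comparison, the same clean-up can be done in a single pass per file, without the chain of temporary files. This is only a sketch in Python, not what I actually ran: the names `PATTERNS`, `clean_log` and `clean_all` are made up, the match is case-sensitive like `find`, and it assumes the backups should go in a "backup" folder inside the log folder.

```python
import shutil
from pathlib import Path

# Same strings as the batch file above.
PATTERNS = ("bot", "basicstate", "ysearch/slurp", "DECLARE",
            "www.blogger.com/blog-preview", "HostTracker")


def clean_log(path, backup_dir, patterns=PATTERNS):
    """Back up one log file, then rewrite it without the unwanted lines.

    Returns the number of lines removed.
    """
    path = Path(path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, backup_dir / path.name)

    # latin-1 maps every byte to a character, so odd bytes in a request
    # cannot make decoding fail and are written back unchanged.
    # newline="" keeps the original line endings as they are.
    with open(path, encoding="latin-1", newline="") as f:
        lines = f.readlines()
    kept = [ln for ln in lines if not any(p in ln for p in patterns)]
    with open(path, "w", encoding="latin-1", newline="") as f:
        f.writelines(kept)
    return len(lines) - len(kept)


def clean_all(log_dir, backup_dir=None):
    """Clean every *.log file in log_dir; returns {file name: lines removed}."""
    log_dir = Path(log_dir)
    if backup_dir is None:
        backup_dir = log_dir / "backup"  # assumed location for the copies
    return {p.name: clean_log(p, backup_dir)
            for p in sorted(log_dir.glob("*.log"))}
```

Calling `clean_all` with the folder that holds the logs would take the place of both list.txt and processlog.bat, and the returned counts show how much each file shrank.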