At Kiwix we suffered from a long-standing issue: we were unable to produce practical and accurate download statistics. Part of our traffic goes to our web site and the rest goes to a storage server where all the big files (ZIM, ZIP, ...) are published. Of course, we wanted as much detail as possible about both of them. After a little bit of hacking, this problem now seems to be fixed. Here is how I did it.
To measure Web audience, there are two approaches:
- Configure your web server to log everything and parse the logs afterwards. There is free software to do that: AWstats. AWstats is pretty old and not practical at all. In addition, I failed to configure it so that it merges multiple requests from one IP for one file into a single record (big files are downloaded in chunks).
- Insert a small piece of tracking code in your web pages that reports each visit to an analytics tool. Piwik works this way.
We have been using Piwik for many years for our web site and we are really satisfied with it: it's practical, quick enough and maintained. Our problem is with http://download.kiwix.org. Our case is even a little bit worse because we do not really host our files ourselves: download.kiwix.org acts as a master that redirects requests to mirrors. It uses a solution called Mirrorbrain. In addition to its many weaknesses, AWstats is not able to deal correctly with HTTP 301 and 302 redirections.
We decided to put everything in Piwik, because it is better to have everything in one tool and also because it is the best one for visualising logs. Both sites are now tracked there.
We installed Piwik and inserted the mandatory piece of code on http://www.kiwix.org to track visitors to our web site. This is easy and straightforward. Then we created a second site in Piwik for http://download.kiwix.org (one instance of Piwik can handle the statistics of many sites).
Later, we configured the Apache virtual host with Mirrorbrain to save logs on the hard disk and let it run for a few days.
Finally, a script uploads these logs to Piwik. Its features:
- Parse all log files for a web site (also all .gz files of logrotate)
- Able to start from scratch to upload all logs
- Able to "follow" the log file and update Piwik in real time
- Count HTTP redirections (HTTP 301 and 302) as valid downloads (mandatory for Mirrorbrain)
- Merge multiple similar requests to one (only one download for one content per IP in one month)
- Avoid counting requests for directories, MD5 checksums, favicon, ...
- Upload only the new log entries when run twice in a row
- Works with both Apache and Nginx
- Configurable on the command line with arguments
- Warning: it cannot report about the number of downloads completed.
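To make the filtering and merging rules above more concrete, here is a minimal Python sketch. It is not the script described in this post: the function names, the regular expressions, the ignore rules and the choice to also accept HTTP 200 and 206 are all assumptions made for illustration, and it only counts downloads locally instead of uploading anything to Piwik.

```python
import gzip
import re

# Apache/Nginx "combined" log format; only the fields needed here are captured.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[\d{2}/(?P<month>\w{3})/(?P<year>\d{4})[^\]]*\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

# Assumption: 301/302 are the mirror redirects, 200/206 are direct hits.
VALID_STATUSES = {"200", "206", "301", "302"}

# Assumption: illustrative ignore rules for directories, MD5 files and favicon.
IGNORED = re.compile(r"(/$|\.md5$|/favicon\.ico$)")


def open_log(path):
    """Open a plain or logrotate-compressed (.gz) log file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def unique_downloads(lines):
    """Return the set of (ip, path, year, month) tuples counted as downloads."""
    seen = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None or m["method"] != "GET":
            continue
        if m["status"] not in VALID_STATUSES:
            continue
        path = m["path"].split("?", 1)[0]  # drop any query string
        if IGNORED.search(path):
            continue
        # One download per IP, per file, per month: chunked requests collapse.
        seen.add((m["ip"], path, m["year"], m["month"]))
    return seen


SAMPLE = [
    '1.2.3.4 - - [10/Oct/2014:13:55:36 +0000] "GET /zim/a.zim HTTP/1.1" 302 512 "-" "curl"',
    '1.2.3.4 - - [10/Oct/2014:13:55:40 +0000] "GET /zim/a.zim HTTP/1.1" 302 512 "-" "curl"',
    '1.2.3.4 - - [02/Nov/2014:08:00:00 +0000] "GET /zim/a.zim HTTP/1.1" 302 512 "-" "curl"',
    '5.6.7.8 - - [10/Oct/2014:14:00:00 +0000] "GET /zim/a.zim.md5 HTTP/1.1" 302 512 "-" "curl"',
    '5.6.7.8 - - [10/Oct/2014:14:00:01 +0000] "GET /zim/ HTTP/1.1" 200 2048 "-" "curl"',
    '5.6.7.8 - - [10/Oct/2014:14:00:02 +0000] "GET /favicon.ico HTTP/1.1" 200 300 "-" "curl"',
    '5.6.7.8 - - [10/Oct/2014:14:00:03 +0000] "GET /zim/missing.zim HTTP/1.1" 404 0 "-" "curl"',
    '5.6.7.8 - - [10/Oct/2014:14:00:04 +0000] "GET /zim/a.zim HTTP/1.1" 301 512 "-" "curl"',
]

if __name__ == "__main__":
    print(len(unique_downloads(SAMPLE)))
```

A set keyed on IP, path, year and month is the simplest way to express "only one download for one content per IP in one month". As the warning in the list says, a redirect in the log only shows that a download was started, so a count like this says nothing about whether it completed.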
This script was written for our own purposes, but I think it's pretty simple to reuse and you should not need to adapt it too much. Hope this was helpful!