At Kiwix we had a long-standing problem: we were unable to produce practical and accurate download statistics. Part of our traffic goes to our web site, and the rest goes to a storage server where all the big files (ZIM, ZIP, ...) are published. Of course, we wanted as much detail as possible about both. After a little bit of hacking, this problem now seems to be fixed; here is how I did it.

Audience measurement

To measure Web audience, there are two approaches:

  • Insert a "plugin" (dynamic image or JavaScript code) on each web page. Every time a web page is loaded, it stores the page name, the date and details about the visitor's browser in a database. This works very well, and there are many free offers on the Web to do it, but the method has two disadvantages. First, if you use free commercial tools, like Google Analytics or Xiti, you actually sell the privacy of your users. Second, this approach cannot measure file downloads, because no web page is involved. To avoid the first problem, you can install a free solution like Piwik, which is pretty neat, on your own server; this avoids informing a third party about your visitors. Unfortunately, you cannot overcome the second problem.
  • Configure your web server to log everything and parse the logs afterwards. There is a free software tool for that: AWstats. AWstats is pretty old and not practical at all. In addition, I failed to configure it to merge multiple requests from one IP for one file into a single record (big files are downloaded in chunks).

Our case

We have been using Piwik for many years for our web site and we are really satisfied with it: it's practical, quick enough and maintained. Our problem is with http://download.kiwix.org. Our case is even a little bit worse because, in fact, we do not really host our files ourselves: download.kiwix.org acts as a master that redirects requests to mirrors, using a solution called Mirrorbrain. In addition to its many other weaknesses, AWstats is not able to deal correctly with HTTP 301 and 302 redirections.

We decided to put everything in Piwik, because it is better to have everything in one tool and also because it is the best one for visualising logs. We now have statistics for both sites there.

Our solution

We installed Piwik and inserted the mandatory piece of code on http://www.kiwix.org to track visitors to our web site. This is easy and straightforward. Then we created a second site in Piwik for http://download.kiwix.org (one instance of Piwik can handle statistics for many sites).

Later, we configured the Apache virtual host with Mirrorbrain to save logs on the hard disk and let it run for a few days.
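
To give an idea of what such a log contains, here is a minimal Python sketch that parses one line in the usual "combined" access log format. This is not our PHP script, only an illustration: the regular expression, the parse_line function and the sample line (IP address, file name, user agent) are all assumptions made for the example.

```python
import re
from datetime import datetime

# Assumed layout: the "combined" access log format.
# Only the fields needed for download statistics are captured.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict for one access log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

# Hypothetical log line: a request answered with a redirect to a mirror.
sample = ('203.0.113.7 - - [12/Mar/2013:10:15:32 +0100] '
          '"GET /zim/example.zim HTTP/1.1" 302 312 "-" "Wget/1.14"')
record = parse_line(sample)
print(record["ip"], record["path"], record["status"])
```

Each parsed record carries the four pieces of information the statistics need: who asked (IP), what was requested (path), when, and how the server answered (status).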

Finally, we wrote a custom PHP script to upload the logs to Piwik, using a Piwik PHP class called PiwikTracker.php. Here are its features:

  • Parse all log files for a web site (also all .gz files of logrotate)
  • Able to start from scratch to upload all logs
  • Able to "follow" the log file and update Piwik in real time
  • Count HTTP redirections (HTTP 301 and 302) as valid downloads (mandatory for Mirrorbrain)
  • Merge multiple similar requests into one (only one download of a given file per IP per month)
  • Avoid counting requests for directories, MD5 checksums, favicon, ...
  • Upload only the new log entries when run twice in a row
  • Works with both Apache and Nginx
  • Configurable on the command line with arguments
  • Warning: it cannot report about the number of downloads completed.
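
The filtering and merging rules from the list above can be sketched in a few lines of Python. Again, this is not the PHP script itself, only a rough illustration under assumptions: the function names are made up, checksum files are assumed to end in ".md5", and 200 and 206 are assumed to be accepted next to the 301 and 302 redirects. The sample records are invented.

```python
from datetime import datetime

def is_download(path, status):
    """Keep plausible file downloads, including redirects to mirrors."""
    # Assumption: 200/206 are accepted in addition to the 301/302 redirects.
    if status not in (200, 206, 301, 302):
        return False
    if path.endswith("/"):                      # directory listing
        return False
    if path.endswith((".md5", "favicon.ico")):  # checksum files, favicon
        return False
    return True

def count_downloads(records):
    """Count at most one download per (IP, path, month)."""
    seen = set()
    counts = {}
    for ip, path, status, when in records:
        if not is_download(path, status):
            continue
        key = (ip, path, when.year, when.month)
        if key in seen:
            continue                            # same IP, same file, same month
        seen.add(key)
        counts[path] = counts.get(path, 0) + 1
    return counts

# Invented records: (IP, path, HTTP status, time of the request).
records = [
    ("203.0.113.7", "/zim/example.zim", 302, datetime(2013, 3, 12)),
    ("203.0.113.7", "/zim/example.zim", 302, datetime(2013, 3, 14)),  # merged
    ("203.0.113.7", "/zim/example.zim", 302, datetime(2013, 4, 2)),   # new month
    ("198.51.100.9", "/zim/example.zim", 301, datetime(2013, 3, 12)),
    ("198.51.100.9", "/zim/", 200, datetime(2013, 3, 12)),
    ("198.51.100.9", "/zim/example.zim.md5", 302, datetime(2013, 3, 12)),
    ("198.51.100.9", "/favicon.ico", 200, datetime(2013, 3, 12)),
    ("198.51.100.9", "/zim/missing.zim", 404, datetime(2013, 3, 12)),
]
counts = count_downloads(records)
print(counts)
```

Note that such a count only says that a download was started, which is exactly the limitation mentioned in the warning above.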

This script was written for our own purposes, but I think it is pretty simple to reuse, and you should not need to adapt it much. Hope this was helpful!