Kiwix au jour le jour


Sunday, December 9, 2012

Nginx or Apache, Mirrorbrain and Piwik

At Kiwix we were affected by a long-term issue: we were unable to provide practical and accurate download statistics. Part of our traffic goes to our web site and the other part goes to a storage area where all the big files (ZIM, ZIP, ...) are published. Of course, we wanted as many details as possible about both of them. After a little bit of hacking, this problem now seems to be fixed; here is how I did it.

Audience measurement

To measure Web audience, there are two approaches:

  • Insert a "plugin" (dynamic image or javascript code) on each web page. Every time a web page is loaded, it stores the page name, date and details about the visitors browser in a database. This is extremely efficient and there exist many free offers on the Web to do that; but this method has two disadvantages. First, if you use free commercial tools, like Google analytics or Xiti, you actually sell the privacy of your users. Second, this approach is inefficient to measure file downloads because no web page is involved. To avoid the first one, you can install a free solution like Piwik, which is pretty neat, on your own server; this will avoid informing a third part about your visitors. Unfortunately, you cannot overcome the second problem.
  • Configure your web server to log everything and parse afterwards the log. There is a free software to do that: AWstats. AWstats is pretty old and not practical at all. In addition, I failed to configure it so that it merges multiple requests from an IP on one file in only one record (big files are downloaded in chunks).

Our case

We have been using Piwik for many years for our web site and we are really satisfied with it: it's practical, fast enough and maintained. Our problem is focused on http://download.kiwix.org. Our case is even a little bit worse because, in fact, we do not really host our files: download.kiwix.org acts as a master that redirects requests to mirrors, using a solution called Mirrorbrain. On top of its many weaknesses, AWStats is not able to deal correctly with HTTP 301 and 302 redirections.

We decided to put everything in Piwik, because it is better to have everything in one tool and also because it is the best one to visualise logs. We now have statistics for both sites in the same place.

Our solution

We installed Piwik and inserted the mandatory piece of code on http://www.kiwix.org to track the visitors of our web site. This is easy and straightforward. Then we created a second site in Piwik for http://download.kiwix.org (one instance of Piwik can handle the statistics of many sites).

Then, we configured the Apache virtual host running Mirrorbrain to write its logs to disk, and let it run for a few days.

Finally, we wrote a custom PHP script to upload the logs to Piwik, using a Piwik PHP class called PiwikTracker.php. Here are its features (a minimal sketch of the filtering logic follows the list):

  • Parses all log files for a web site (including the .gz files produced by logrotate)
  • Can start from scratch and upload all the logs
  • Can "follow" the log file and update Piwik in real time
  • Counts HTTP redirections (301 and 302) as valid downloads (mandatory for Mirrorbrain)
  • Merges multiple similar requests into one (only one download per content, per IP and per month)
  • Avoids counting requests for directories, md5 checksums, favicon, ...
  • Uploads only the new log entries between two consecutive runs
  • Works with both Apache and Nginx
  • Configurable on the command line with arguments
  • Warning: it cannot report how many downloads were actually completed.
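
For illustration, here is a minimal Python sketch of the filtering and deduplication logic described above. The real script is written in PHP and pushes each hit to Piwik through PiwikTracker.php; the log pattern, skip list and function names below are assumptions made for the sketch, not the actual code.

    #!/usr/bin/env python
    # Minimal sketch of the log filtering/deduplication described above.
    # The real script is PHP and feeds Piwik via PiwikTracker.php;
    # names and patterns here are hypothetical.
    import gzip
    import re
    import sys

    # Apache/Nginx "combined" log format
    LOG_RE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+')

    # requests we do not want to count: directories, checksums, favicon, ...
    SKIP = re.compile(r'(/$|\.md5$|\.sha1$|favicon\.ico$)')

    def records(path):
        """Yield parsed hits from a plain or logrotated (.gz) log file."""
        opener = gzip.open if path.endswith('.gz') else open
        with opener(path, 'rt', errors='replace') as f:
            for line in f:
                m = LOG_RE.match(line)
                if m:
                    yield m.groupdict()

    def downloads(paths):
        """Keep 200/301/302 hits on real files, one per (IP, file, month)."""
        seen = set()
        for path in paths:
            for r in records(path):
                if r['status'] not in ('200', '301', '302'):
                    continue
                if SKIP.search(r['path']):
                    continue
                month = r['date'][3:11]            # e.g. "Dec/2012"
                key = (r['ip'], r['path'], month)
                if key in seen:                    # chunked/repeated requests
                    continue
                seen.add(key)
                yield r                            # would be pushed to Piwik here

    if __name__ == '__main__':
        for hit in downloads(sys.argv[1:]):
            print(hit['ip'], hit['path'])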

This script was written for our own purposes, but it is pretty simple and you should not need to adapt it much to reuse it. Hope this was helpful!

Thursday, June 14, 2012

Kiwix Compile Farm

Kiwix is a special piece of software. Special because it's difficult to define:

  • Desktop software for browsing offline ZIM content on Mac, Windows and Linux.
  • A server for serving ZIM content on those platforms, plus ARM Linux.
  • A library of ZIM files for popular content: Wikipedias, Wikileaks, etc.
  • Very few developers (most of the time it's 2).
  • Very large (and growing) number of users.
  • “Small” code base (about 50,000 lines of code).

So far, all this has been maintained by hand: the ZIM files are created through a complex procedure of scripts and mirror setups, the releases are built manually on all platforms, etc.

You got it: it's difficult to keep up with improving the software, fixing bugs, generating new content ZIM files, and building and testing the software on all platforms.

Our first step in the right direction was getting the translations done on TranslateWiki, and Kiwix now ships with 80+ languages.

Thanks to sponsorship from Wikimedia CH, we decided to tackle the build problem first, as it is the most annoying one.

The Problem

Kiwix is released for the following targets:

  • Mac OSX 10.6+ Intel Universal (Intel 32b, Intel64b)
  • Linux 32b “static” (no dependencies)
  • Linux 64b static
  • Sugar .xo for OLPC
  • Windows 32b.
  • Armel5 (kiwix-serve only)
  • Source code.
  • Debian wheezy package 32b.
  • Debian wheezy package 64b.
  • Linux 32/64b with dependencies (used to be a PPA for Ubuntu until they removed xulrunner).

Knowing that only reg has a Mac, that the Windows setup for building Kiwix is complicated and that both Kelson and reg are using Linux, testing and distributing new versions of the code is very difficult.

This is not a unique problem; most large multi-platform software projects face the same issue, and we did nothing but imitate them: we deployed a build farm.

The Solution

The solution looks like the following schema: a bunch of VirtualBox VMs, a QEMU one for ARM, and Buildbot on all of them.

[Figure: Buildbot diagram]

As you can see, the Buildbot master controls all the slaves, which create their own builds and send them to the Web server's repository.

Buildbot

A compile farm is a set of servers, each building the software for one platform or target. To manage those, a large number of tools exist (see http://en.wikipedia.org/wiki/Comparison_of_continuous_integration_software).

After some research, we chose Buildbot because:

  • It seemed easy to install
  • It looked very powerful
  • Its documentation was clear.
  • It's written in Python (including the configuration file).

The deal-maker was really the tutorial on the website, which allowed us to picture the required steps without actually getting our hands dirty.

The Python configuration file is a great feature as it allows a very flexible configuration without a dedicated syntax.

Buildbot is divided into two pieces of software:

  • the master, which holds the whole configuration (the only file you care about).
  • the slaves, which only need to run the slave software (Python). Those are logic-less.

Kiwix already rents a very powerful server in a data center for serving downloads. We used it to host everything.

VirtualBox

All the build slaves (except for the ARM target, which is not supported by VirtualBox) are VirtualBox virtual machines (VMs):

  • 512MB RAM
  • 20GB HD
  • 2 NICs: NAT for Internet access (DHCP); host-only for Buildbot (fixed local IP).

The OS X VM has 1GB of RAM and a 40GB HDD.

All the VMs were installed through VRDP (a VNC-like protocol) until the network was configured and SSH access enabled.

See also: VM Setup

QEmu

QEMU was required to get an armel VM. We used aurel32's Debian images.

Note: in order to ease the SSH connection to the VMs and their start and halt, we wrote a wrapper script around VirtualBox; a hypothetical sketch of the idea follows.
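
Here is a hypothetical Python sketch of that wrapper idea; the VM names, IP addresses and the QEMU SSH port are invented for the example, only the VBoxManage sub-commands are real.

    #!/usr/bin/env python
    # Hypothetical sketch of the VM wrapper: one entry point to start, halt
    # and SSH into the build slaves. Names, IPs and ports are examples.
    import subprocess
    import sys

    VMS = {
        'win32': {'ip': '192.168.56.10', 'port': 22},    # VirtualBox host-only IP
        'osx':   {'ip': '192.168.56.11', 'port': 22},
        'armel': {'ip': '127.0.0.1',     'port': 2222},  # QEMU user-mode SSH forward
    }

    def start(name):
        if name == 'armel':
            raise NotImplementedError('the QEMU VM is started by its own script')
        subprocess.check_call(['VBoxManage', 'startvm', name, '--type', 'headless'])

    def halt(name):
        subprocess.check_call(['VBoxManage', 'controlvm', name, 'acpipowerbutton'])

    def ssh(name):
        vm = VMS[name]
        subprocess.call(['ssh', '-p', str(vm['port']), 'buildbot@%s' % vm['ip']])

    if __name__ == '__main__':
        action, name = sys.argv[1], sys.argv[2]   # e.g. "./vm.py start win32"
        {'start': start, 'halt': halt, 'ssh': ssh}[action](name)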

Configuring and running

Configuring Buildbot is pretty straightforward once you know what you want to do. The configuration is composed of the following components:

  • Slave definitions (name, login, password)
  • Builders: targets composed of steps (commands) executed on a slave.
  • Schedulers: Triggers for when to start builders.
  • Status: What to do with output of builders.

The hard part is defining the builders, as this is where you specify how to retrieve the source code, launch your configure script, compile, and transfer your build somewhere else.

Take a look at ours as an example: master.cfg
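
For readers who do not want to open the full file, here is a stripped-down sketch of what a master.cfg of that Buildbot generation (0.8.x era) looks like; the slave name, SVN URL, commands and paths below are placeholders, not our real configuration.

    # -*- python -*-
    # Stripped-down master.cfg sketch (Buildbot 0.8.x era): one slave, one
    # nightly builder, a web status page. All names/URLs/paths are placeholders.
    from buildbot.buildslave import BuildSlave
    from buildbot.config import BuilderConfig
    from buildbot.process.factory import BuildFactory
    from buildbot.schedulers.timed import Nightly
    from buildbot.status import html
    from buildbot.steps.shell import Compile, Configure, ShellCommand
    from buildbot.steps.source import SVN
    from buildbot.steps.transfer import FileUpload

    c = BuildmasterConfig = {}

    # Slave definitions: name + password, matching the slave's buildbot.tac
    c['slaves'] = [BuildSlave('linux64', 'secret')]
    c['slavePortnum'] = 9989

    # A builder is a factory of steps executed on a slave
    f = BuildFactory()
    f.addStep(SVN(svnurl='http://example.org/svn/kiwix/trunk', mode='update'))
    f.addStep(ShellCommand(command=['./autogen.sh'], description='autogen'))
    f.addStep(Configure())
    f.addStep(Compile())
    f.addStep(ShellCommand(command=['make', 'dist'], description='tarball'))
    f.addStep(FileUpload(slavesrc='kiwix.tar.gz',
                         masterdest='/var/www/nightly/kiwix-linux64.tar.gz'))

    c['builders'] = [BuilderConfig(name='linux64', slavename='linux64', factory=f)]

    # Run every night at 03:00 (server time); builds can also be forced from the UI
    c['schedulers'] = [Nightly(name='nightly', builderNames=['linux64'],
                               branch=None, hour=3, minute=0)]

    # Status targets: web UI (mail and IRC are added the same way, see below)
    c['status'] = [html.WebStatus(http_port=8010, allowForce=True)]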

We don't use any advanced features, so it's easy to understand. We chose:

  • a fixed daily time to run our builds (at night, server time)
  • builds (tarballs, etc.) uploaded to the server's /var/www/ for direct web access.

Buildbot handles the transfer of files between master and slave.

  • We can trigger builds at any time from the web interface.
  • We list build results on the web page and send them by mail to a dedicated mailing-list.
  • Builds are announced on IRC and can be controlled via the IRC bot (a sketch of these status targets follows the list).
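
As a rough illustration, the mail and IRC status targets are declared in the same master.cfg as the sketch above; the addresses and channel below are placeholders.

    # Appended to the master.cfg sketch above; addresses/channel are placeholders.
    from buildbot.status.mail import MailNotifier
    from buildbot.status.words import IRC

    c['status'].append(MailNotifier(fromaddr='buildbot@example.org',
                                    sendToInterestedUsers=False,
                                    extraRecipients=['builds@example.org']))
    c['status'].append(IRC(host='irc.freenode.net', nick='kiwixbuild',
                           channels=['#kiwix']))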

Although it's simple, it takes a lot of tweaking and testing (fortunately that part is easy) to get the configuration exactly as you want it. You also need a proper and documented build mechanism for all your targets, otherwise you'll probably go crazy: we completely rewrote our autotools Makefiles for all platforms before setting up Buildbot. It sounds obvious, but it's worth noting.

Outcomes:

  • Every day, a new release for all targets, available for download and properly named with the SVN revision and the date.
  • Ability to fire a build at any time from the Web UI.
  • Kiwix to be integrated into Debian sid in the coming days (and thus in the next stable release).

Things you should know:

If you intend to reproduce this, here are a few things we've learned and want to share.

  • Installing OS X on non-Mac hardware is tricky: you need a recent Intel CPU (with VT-x support), but not too recent, otherwise your OS X install DVD won't know about it (and will refuse to install).
  • On OS X, Apple packages (Mac OS X updates, Xcode) have an expiration period. If you install an Xcode version, say, two years after it was released, the installer will fail with no useful feedback. This is due to the package's signature being too old. You can still install it by unpacking and repacking the packages.
  • SSH to your QEMU VM goes through a QEMU proxy, so you SSH to localhost on a different port.
  • VRDP requires a good connection if you intend to do a lot of configuration inside Windows (384k clearly is a pain!).
  • Buildbot slaves freeze frequently. We are not sure why, but sometimes a slave fails to answer build requests and stays attached doing nothing. As a workaround, we delete and recreate the buildbot slave folder daily in a cron job.
  • Buildbot slaves sometimes have network issues. We're not sure if it's related to Buildbot, VirtualBox or something else, but it happens frequently that a slave can't check out the source tree or can't download our dependencies from the web.
  • The Windows slave frequently loses its connection to the master. It might just be a Windows configuration issue.

What's Next?

  • Improve our wrapper script to handle VRDP access to VMs by controlling iptables.
  • Add SSH to the Windows slave so we can do basic tests in console.
  • Investigate the network/slave problems so that the farm works 24/7.
  • Automate & build a similar platform for the creation of ZIM files so we can focus only on code thereafter.

Friday, April 1, 2011

Content manager: challenges and solutions

This is the translation of my previous post in French. Thanks to Rupert for it.

Kiwix 0.9 has now reached beta status and it's time to think seriously about version 1.0. Integrated content management, namely of ZIM files and search indexes, is the core functionality of Kiwix 1.0. Kiwix will provide a new user experience: downloading, sharing, organizing and deleting the different contents will be possible without using or installing other tools. Content distribution will also benefit if we make Kiwix users' lives even simpler than today. To provide such functionality, we are currently addressing the following challenges:

  • new content can be downloaded directly from within Kiwix
  • the software needs to be robust and fast
  • even if a content server fails or is disconnected, the user can continue to download and share content
  • the download cost needs to stay low even if the volume skyrockets
  • downloading and sharing content must be easy, even on a LAN cut off from the internet

The architectural solution we envision to fulfill the above requirements is to combine the specific advantages of centralized downloads via FTP/HTTP (for efficiency) and of decentralized P2P (for robustness and low cost). The standard which seems to best match these requirements is Metalink. Metalink is an XML standard for describing a piece of content using a checksum, its sources (HTTP, FTP, magnet, BitTorrent) and priority rules on these sources. Examples of such rules are the geographic location, or simply a rating system for the mirrors. The format is still fairly young, but it is being standardized by the IETF. Compared to more traditional solutions relying solely on HTTP, FTP or P2P, this technology combines the strengths of each while eliminating the disadvantages.
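
To make this more concrete, here is a small Python sketch that reads an illustrative Metalink (RFC 5854) document and lists its sources by priority; the file name, hash and mirror URLs are invented for the example.

    # Illustration of the Metalink idea: one file described by a checksum plus
    # several prioritised sources. The embedded XML follows the IETF Metalink
    # (RFC 5854) structure; the hash and URLs are made up.
    import xml.etree.ElementTree as ET

    EXAMPLE = """<metalink xmlns="urn:ietf:params:xml:ns:metalink">
      <file name="wikipedia_fr.zim">
        <size>9876543210</size>
        <hash type="sha-256">0000000000000000000000000000000000000000000000000000000000000000</hash>
        <url location="ch" priority="1">http://mirror1.example.org/wikipedia_fr.zim</url>
        <url location="de" priority="2">ftp://mirror2.example.org/wikipedia_fr.zim</url>
        <metaurl mediatype="torrent">http://example.org/wikipedia_fr.zim.torrent</metaurl>
      </file>
    </metalink>"""

    NS = {'m': 'urn:ietf:params:xml:ns:metalink'}

    root = ET.fromstring(EXAMPLE)
    for f in root.findall('m:file', NS):
        print(f.get('name'))
        # best (lowest) priority first: the client tries it and falls back
        urls = sorted(f.findall('m:url', NS),
                      key=lambda u: int(u.get('priority', '999')))
        for u in urls:
            print(' ', u.get('priority'), u.get('location'), u.text)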

To use Metalink, the following is necessary:

  • a server capable of generating the .metalink and .torrent files, of handling the different HTTP and FTP mirrors, etc.
  • a client that can interpret the .metalink files. It will need to manage all available sources, download from the best ones and ensure sharing.

Metalink already has existing implementations: Mirrorbrain on the server side and Aria2 on the client side. By using them we hope to keep the implementation effort low. To validate the architecture and tool choices, we will build a prototype.

Let's elaborate a little bit on the server side. Mirrorbrain is software originally developed for openSUSE, but now used by many other large projects. It is an Apache module that, for a given file, can serve a whole set of checksums, a .metalink file, a .torrent file, a magnet link and, of course, the list of mirrors that have the file, taking into account the geographic location of the client's IP address. It also comes with a set of tools to know which mirror has which file, which one is currently available, etc. When using Mirrorbrain, we only have to take care of:

  • the synchronization of the mirrors with rsync
  • the "super-seeding" for BitTorrent
  • possibly a BitTorrent tracker and a DHT node if we do not want to (only) use the public trackers/nodes.

On the client side, Aria2 is a command-line client which "understands" .metalink files and manages the rest. Aria2 is therefore able to download, and share when possible, the same file in parallel from different sources and over various protocols. It has been actively developed for several years and is very lightweight. Via its XML-RPC interface it can be controlled from within Kiwix. This interface is most likely where the largest part of the implementation work lies.
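
As a rough sketch of that integration, assuming a local aria2c started with its RPC interface enabled (--enable-rpc, or --enable-xml-rpc on older releases), the calls from Python would look roughly like this; the file names are examples.

    # Sketch of driving aria2 through XML-RPC, the way Kiwix would delegate
    # downloads. Assumes aria2c runs locally with RPC enabled on port 6800.
    import base64
    import xmlrpc.client

    s = xmlrpc.client.ServerProxy('http://localhost:6800/rpc')

    # Hand aria2 a .metalink file: it then picks mirrors/BitTorrent by itself
    with open('wikipedia_fr.zim.metalink', 'rb') as f:
        gids = s.aria2.addMetalink(base64.b64encode(f.read()).decode('ascii'))

    # Poll the download; aria2 keeps sharing once the file is complete
    status = s.aria2.tellStatus(gids[0], ['status', 'completedLength', 'totalLength'])
    print(status)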

Finally, this leads us to the last challenge we need to address: where do we get the .metalink files, and therefore the list of available content, from? Especially if no central server is available (e.g. in some mesh network scenarios), the .metalink files need to be available on the client device. The plan is to integrate the .metalink data into the XML files managing the content index of Kiwix. Such an XML file would then list, at the same time, content available on local storage as well as content available for download. In the beginning, we want to simply ship a default index with Kiwix. Later on, index files can be shared via the same technology developed for the content itself.

Friday, February 11, 2011

Kiwix 1.0: what are the challenges? What are the solutions?

Kiwix 0.9 has barely started its beta cycle and, in parallel, we are already starting to think seriously about version 1.0. There are two reasons for this: the pace is picking up for Kiwix, and we really do not want to be caught with our heads down.

Kiwix 1.0's main new feature will be the integrated management of content (put simply: ZIM files and search indexes). This means that downloading, sharing, organizing and deleting the different contents will be done within Kiwix and not with external tools. We think this is necessary to simplify the user's life, but also to encourage the distribution of content.

We could obviously implement and integrate a simple HTTP download, but in our opinion that would not be fully satisfactory. Here are the challenges we think we need to take up:

  • Downloading new content must be simple and possible directly from within Kiwix.
  • The software solution must be robust and fast.
  • Even if the download platform and its mirrors go down (or are censored), the user must still be able to download and share content.
  • The download platform must remain cheap to run, even with a very large number of downloads.
  • Content must be easy to download and share, even on a local network cut off from the internet.

To succeed, we want to combine the specific advantages of centralized downloads via FTP/HTTP (for efficiency) and of P2P, in particular decentralized P2P (for robustness and low cost). For all these reasons, our technological choice is Metalink.

Metalink is an XML standard that describes a piece of content (notably with a checksum), its sources (HTTP, FTP, magnet, BitTorrent) and priority rules over these sources (based on geographic location, for example, or simply on a rating system for the mirrors). The format is still fairly young, but it is being standardized by the IETF. Metalink's advantages over more traditional, single-protocol solutions such as HTTP, FTP or P2P are clear: this technology combines the strengths of each of them while eliminating the disadvantages.

To use Metalink, we need:

  • a server-side solution capable of generating the .metalink and .torrent files, of handling the different HTTP and FTP mirrors, etc., and
  • a client-side solution capable of interpreting the .metalink files, managing all available sources, downloading from the best ones and ensuring sharing.

The server solution is called Mirrorbrain. Mirrorbrain is software originally developed for openSUSE, but now used by many other large projects. Mirrorbrain is several things:

  • an Apache module which, for a given file, can serve a whole set of checksums, a .metalink file, a .torrent file, a magnet link and, of course, the list of mirrors that have the file, taking into account the client's geographic location (by IP),
  • a set of tools to know which mirror has which file, which one is currently available, etc.

By using Mirrorbrain, all that is left for us to manage is:

  • the synchronization of the mirrors with rsync,
  • the "super-seeder" for BitTorrent,
  • possibly a BitTorrent tracker and a DHT node if we do not want to (only) use the public trackers/nodes.

Introducing Mirrorbrain should therefore be done fairly quickly; in fact, it is already well under way.

On the client side, the best solution available to us is called Aria2. Aria2 is a command-line download tool that understands .metalink files and handles everything else. Aria2 is therefore able to download, and share when possible, the same file in parallel from different sources and over different protocols. It has been actively developed for several years and is very lightweight. It offers an XML-RPC interface that allows it to be controlled from Kiwix. This is probably where the biggest part of the work lies; we will need to build a prototype quickly to validate this choice.

The last open point then seems to be the availability of the .metalink files themselves: what do we do if the .metalink files are essential and the central server is unavailable? Our idea is to integrate the .metalink data into the XML files that manage the Kiwix library. Such a file could then list, indifferently, content available on local storage as well as content available for download online.

The final question would then be how to make these library files available. A start would be to ship them by default with Kiwix. In any case, distributing a file of a few thousand kilobytes is far less problematic than distributing the content itself, which weighs several gigabytes.