Content manager: challenges and solutions
This is the translation of my previous post in French. Thx Rupert for it.
Kiwix 0.9 has now beta status and it's time to think seriously about version 1.0. The integrated management of content is Kiwix 1.0 core functionality, namely ZIM files and search indexes. Kiwix will provide a new usage experience when downloading, sharing, organizing, deleting of different content can be done without using or installing other tools. Distribution of content will benefit if we make the Kiwix users life even simpler than today. To provide such functionalities we are currently addressing the following challenges:
- new content can be downloaded out of Kiwix
- the software needs to be robust and fast
- even if a content server fails or is disconnected, the user can continue to download and share content
- the download cost need to stay low even if the volume skyrockets
- downloading and sharing content must be easy, even if the LAN is cut off the internet
The architectural solution we envision to fullfill the above requirements is to combine the specific advantages of a centralized download via FTP/HTTP (for efficiency) and decentralized P2P (for ruggedness and low cost). The standard which seems to best match the requirements is Metalink. Metalink is an XML standard for defining content by using a checksum, its sources (HTTP, FTP, magnet, Bittorrent) and priority rules on these sources. Examples for such rules are the geographic location, or simply a rating system for the mirrors. The format is still fairly young, but it is being standardized by the IETF. Compared to more traditional solutions of uniquely using HTTP, FTP or P2P, the technology combines the strengths of each while eliminate the disadvantages.
To use Metalink, the following is necessary:
- a server capable of generating metalink files, torrent, different HTTP and FTP mirrors, etc.
- a client that can interpret the metalink files. It will need to manage all available sources, download the best and ensure sharing.
Metalink has also some existing implementations for the server called Mirrorbrain, and for the client, called Aria2. By using them we hope to keep the implementation effort low. To verify the architectural and tool choices we will make a prototype.
Lets elaborate a little bit on the server side. Mirrorbrain is a software originally developed for openSUSE, but now used by many other large projects. It is an Apache module that allows for a given file, released a bunch of checksums, one. metalink, one. torrent, a link magnet and of course the list of mirrors that have the file, taking into account the geographic location of the clients IP address. A list of tools to know what the mirror has what file is currently available, etc. When using Mirrorbrain, we can concentrate on:
- synchronization of mirrors with rsync
- the "Superseed" for Bittorrent
- possibly a Bittorrent tracker and DHT node if you do not want (only) use the trackers / public nodes.
On the client side, Aria2 is a command line client which "understands" .metalink files and manages the rest. Aria2 is therefore able to download files, and can also share files coming originally from different sources via various protocols, at the same time. It is actively developed for several years and is very light. Via its XML-RPC interface it can be controlled out of Kiwix. Exactly this interface is most likely the largest part of the implementation work.
Finally, this leads us the the last challenge we need to address: where to get the .metalink files, and therefore the available content, from? Especially if no central server is available (e.g. in some mesh network scenarios) .metalink files need to be available on the client device. The plan is to integrate .metalink files into the XML files managing the content index of Kiwix. Such an XML file then would list at the same time contents available on local storage, as well as contents available for download. In the beginning, we want to just include some default index into Kiwix. Later on index files can be shared via the same technology developped for the contents itself.