How To Build ZimDump To Process ZIM Files

There is a wonderful set of dumps of Project Gutenberg & Wiki* content available here, across many languages.  These make wonderful corpora for Natural Language Processing (NLP), but they come in the custom ZIM format.  ZIM is intended for graphical viewing of Wiki content, and contains XZ-compressed content plus indices and metadata.  To get this into a more natural format for NLP work, we’re going to need to do some digging:

The openZIM project has a tool, zimdump, intended to dump content.  Unfortunately there are no prebuilt binaries, so it has to be built from source.

Today I’m going to show you how to do this in a Docker container (to avoid polluting the operating system with one-use build tooling):

First we navigate to the folder containing the ZIM file of interest, and start a temporary docker container that mounts it as a volume (to write output):

docker run -it --rm -v $(pwd):/workdir -w /workdir ubuntu:14.04 bash
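
As an optional sanity check (runnable on the host or inside the container), it can help to confirm that the file really is a ZIM archive before building anything. The sketch below relies on the header described in the openZIM file format, which begins with the magic number 72173914 (0x044D495A) stored little-endian, i.e. the bytes 5A 49 4D 04. The function name is_zim is my own and not part of any ZIM tooling.

```shell
# is_zim FILE: succeed if FILE starts with the ZIM magic bytes 5A 49 4D 04.
# (is_zim is a hypothetical helper name, not part of zimdump.)
is_zim() {
    # Dump the first four bytes as hex and strip od's padding, e.g. "5a494d04"
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    [ "$magic" = "5a494d04" ]
}

# Example use, with the file name used later in this post:
#   is_zim gutenberg_de_all_10_2014.zim && echo "looks like a ZIM file"
```

A failed check only tells you the header is not what was expected; it says nothing about whether the rest of the archive is intact.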

Next we install dependencies & libraries in the container (this will take a few minutes):

apt-get update && apt-get install -y g++ git autoconf make automake libtool liblzma-dev

Next we grab the openzim source code:

git clone https://gerrit.wikimedia.org/r/p/openzim.git

And we navigate to the folder:

cd openzim/zimlib

Now we do some setup for the build:

./autogen.sh && ./configure

And we build the darned thing:

make -j4
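
The -j4 above assumes roughly four cores. A small variation, assuming the coreutils nproc command is available (with a fallback to a single job if it is not), sizes the job count to the machine instead. The JOBS variable name is just illustrative.

```shell
# One parallel job per available CPU; fall back to 1 if nproc is missing.
JOBS=$(nproc 2>/dev/null || echo 1)

# Then, from inside openzim/zimlib:
#   make -j"$JOBS"
```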

Now we can run zimdump, but first let’s go back to the top-level folder:

cd ../../

Now, I want to do a dump from the German (code: de) Project Gutenberg collection, so we do this:

./openzim/zimlib/src/tools/zimdump -D gutenberg_de_dump  gutenberg_de_all_10_2014.zim

Unfortunately the full dump will run into issues (due to invalid filenames), but you can now extract individual articles if desired.
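
If you extract articles one at a time and pick the output filenames yourself, one way to sidestep awkward names is to map each article title or URL to a conservative filename first. This is only a sketch: safe_name is a hypothetical helper, not part of zimdump, and exactly which characters trip up the dump is an assumption on my part (slashes and punctuation seem the likely suspects).

```shell
# safe_name STRING: print a filesystem-friendly version of STRING.
# (Hypothetical helper; it does not hook into zimdump itself.)
safe_name() {
    # Replace every byte outside A-Z, a-z, 0-9, '.', '_' and '-' with '_',
    # then cap the result at 200 characters to stay under common name limits.
    printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9._-' '_' | cut -c1-200
}

# Example:
#   safe_name 'Faust/Teil 1?.html'    # prints Faust_Teil_1_.html
```

Note that this is lossy: non-ASCII characters such as German umlauts become underscores too, and two different titles can map to the same name, so a real script would want to handle collisions.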

3 thoughts on “How To Build ZimDump To Process ZIM Files”

  1. Hi, I need to process a ZIM file and found your blog, but I don’t have enough knowledge. I can’t understand where and how to enter the commands below. Please help me.

    First we navigate to the folder containing the ZIM file of interest, and start a temporary docker container that mounts it as a volume (to write output):

    docker run -it --rm -v $(pwd):/workdir -w /workdir ubuntu:14.04 bash
    Next we install dependencies & libraries in the container (this will take a few minutes):

    apt-get update && apt-get install -y g++ git autoconf make automake libtool liblzma-dev

    Now we do some setup for the build:

    ./autogen.sh && ./configure
    And we build the darned thing:

    make -j4

    ./openzim/zimlib/src/tools/zimdump -D gutenberg_de_dump gutenberg_de_all_10_2014.zim

  2. The first command didn’t work for me; this did. (Looks like an autoformatting error.)

    docker run --rm -it -v $(pwd):/workdir -w /workdir ubuntu:14.04 bash
