There is a wonderful set of dumps of Project Gutenberg & Wiki* content available here, across many languages. These make wonderful corpora for Natural Language Processing (NLP), but they come in the custom ZIM format. ZIM is intended for graphical viewing of wiki content, and contains XZ-compressed content plus indices and metadata. To get this into a more natural format for NLP work, we’re going to need to do some digging:
ZIM has a tool, zimdump, intended to dump content. Unfortunately there are no prebuilt binaries, so it has to be built from source.
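Before going to the trouble of building anything, it can be worth confirming that a download really is a ZIM file. This is only a minimal sketch, assuming the header layout from the openZIM format description, where a file starts with the little-endian magic number 72173914 (the bytes 5a 49 4d 04). The helper name is_zim is made up for this example and is not part of openzim.

```shell
# is_zim is a hypothetical helper, not part of openzim.
# Assumption: a ZIM file starts with the 4-byte magic number 72173914
# stored little-endian, i.e. the bytes 5a 49 4d 04 ("ZIM" followed by 0x04).
is_zim() {
    # Read the first four bytes and render them as a plain hex string.
    magic="$(head -c 4 "$1" | od -An -tx1 | tr -d '[:space:]')"
    [ "$magic" = "5a494d04" ]
}

# Usage:
#   is_zim gutenberg_de_all_10_2014.zim && echo "looks like a ZIM file"
```

A truncated or mislabelled download fails this check right away, which is cheaper than finding out after the build.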
Today I’m going to show you how to do this in a Docker container (to avoid polluting the operating system with one-use build tooling):
First we navigate to the folder containing the ZIM file of interest, and start a temporary docker container that mounts it as a volume (to write output):
docker run -it --rm -v "$(pwd)":/workdir -w /workdir ubuntu:14.04 bash
Next we install dependencies & libraries in the container (this will take a few minutes):
apt-get update && apt-get install -y g++ git autoconf make automake libtool liblzma-dev
Next we grab the source code containing openzim:
And we navigate to the folder:
cd openzim/zimlib
Now we do some setup for the build:
./autogen.sh && ./configure
And we build the darned thing:
make -j4
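The 4 in the command above is just a job count. As a small sketch, assuming GNU coreutils is present in the container (it is in the stock Ubuntu image), the job count can be taken from the number of available cores instead. The helper name parallel_jobs is made up for this example.

```shell
# parallel_jobs is a hypothetical helper: it prints the number of CPU cores
# reported by nproc (GNU coreutils), or 1 if nproc is unavailable.
parallel_jobs() {
    nproc 2>/dev/null || echo 1
}

# Usage, from openzim/zimlib after ./configure has finished:
#   make -j"$(parallel_jobs)"
```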
Now we can run zimdump, but first let’s go back to the top-level folder:
cd ../../
Now, I want to do a dump from the German (code: de) Project Gutenberg collection, so we do this:
./openzim/zimlib/src/tools/zimdump -D gutenberg_de_dump gutenberg_de_all_10_2014.zim
Unfortunately the full dump will run into issues (due to invalid filenames), but you can now extract individual articles if desired.
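If you extract articles one at a time, you get to choose the output filenames yourself. The sketch below shows one way to derive a safe name from an article name. It is a guess at the problem, not a confirmed fix: I am assuming the invalid names contain "/" or other characters outside a conservative ASCII set. The helper name safe_name is made up for this example and is not part of zimdump.

```shell
# safe_name is a hypothetical helper, not part of zimdump.
# Assumption: the problematic names contain "/" or other characters outside
# a conservative ASCII set; every such byte is replaced with "_".
safe_name() {
    printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9._-' '_'
}

# Usage:
#   safe_name 'Faust/Erster Teil.html'
```

Two different article names can map to the same safe name, so check for collisions before writing files in bulk. Non-ASCII characters such as umlauts become one underscore per byte.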
Hi, I need to process a ZIM file and I found your blog, but I don’t have enough knowledge. I can’t understand where and how to write the commands below. Please help me.
First we navigate to the folder containing the ZIM file of interest, and start a temporary docker container that mounts it as a volume (to write output):
docker run -it --rm -v $(pwd):/workdir -w /workdir ubuntu:14.04 bash
Next we install dependencies & libraries in the container (this will take a few minutes):
apt-get update && apt-get install -y g++ git autoconf make automake libtool liblzma-dev
Now we do some setup for the build:
./autogen.sh && ./configure
And we build the darned thing:
make -j4
./openzim/zimlib/src/tools/zimdump -D gutenberg_de_dump gutenberg_de_all_10_2014.zim
The first command didn’t work for me, but this did. (Looks like an autoformatting error.)
docker run --rm -it -v $(pwd):/workdir -w /workdir ubuntu:14.04 bash
Autoformatting tends to mangle this command by turning the double dash into a single long dash. It should be -- (two hyphens) in front of rm.