One of the most common and powerful Linux patterns is piping together a series of commands in bash to transform data. Today I’ll show you one way to make these pipelines run much faster:
Combine sed commands (but not too much!):
Often we’ll need to do multiple search-and-replace operations via sed to clean up data in different ways, but did you know it makes a difference how you do this? For example, let’s see how long it takes to use a quick-and-dirty way to remove entity-encoding from a large (3.8 GB) XML file:
[user@host:~/Documents/wordcount]$ time cat data/huwiki-latest-pages-meta-current.xml | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&quot;/"/g' | sed 's/&amp;/\&/g' | sed "s/&#39;/'/g" | wc -m
Not bad for such a big file, especially given that just reading and counting the characters takes 1m 3s on its own (for 3673719498 characters). Taking roughly twice as long to also transform the data isn’t bad.
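If you want to reproduce that baseline measurement yourself, here is a minimal sketch, using a tiny generated `sample.xml` as a stand-in for the 3.8 GB dump (the file name and contents are made up for illustration):

```shell
# sample.xml is a hypothetical stand-in for the real 3.8 GB Wikipedia dump.
printf '&lt;page&gt; &quot;title&quot; &amp; text\n' > sample.xml

# Baseline: just read the data and count characters, with no transformation.
time cat sample.xml | wc -m
```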
One odd thing: we use more CPU time (‘user’ time) than the actual elapsed (‘real’) time. This is because each command in a pipe runs as a separate process, and they can run in parallel here. Since this is a dual-core computer with hyperthreading, up to 4 threads can run in parallel, and we are using roughly 2.5 virtual cores.
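You can see the same effect with any multi-stage, CPU-bound pipeline; here gzip stands in for our sed stages (the input size is arbitrary, and this assumes gzip is installed):

```shell
# Two CPU-bound stages connected by a pipe run concurrently, so the summed
# 'user' time can exceed the wall-clock 'real' time on a multi-core machine.
time sh -c 'head -c 50000000 /dev/zero | gzip -1 | gzip -1 > /dev/null'
```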
But wait, sed can take multiple transformation commands, either separated by semicolons or given via multiple -e arguments. Is it faster to give multiple commands at once, or to pipe them together?
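For reference, the two syntaxes behave identically; a toy demonstration (the `&lt;b&gt;` input is just for illustration):

```shell
# Multiple expressions via repeated -e options...
echo '&lt;b&gt;' | sed -e 's/&lt;/</g' -e 's/&gt;/>/g'
# ...or one semicolon-separated script; both print: <b>
echo '&lt;b&gt;' | sed 's/&lt;/</g; s/&gt;/>/g'
```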
[user@host:~/Documents/wordcount]$ time cat data/huwiki-latest-pages-meta-current.xml | sed "s/&gt;/>/g; s/&lt;/</g; s/&quot;/\"/g; s/&amp;/\&/g; s/&#39;/'/g" | wc -m
Wow, it uses NEARLY HALF as much total CPU time for the same operation! Even though the sed command only uses a single CPU now, it’s still 25% faster overall (‘real’ time). This is because each transformation is very simple, so most of the time is spent reading and writing data to buffers, and it is far more efficient to do multiple search-and-replace operations in a single pass. Also, pipes aren’t completely free.
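It’s worth convincing yourself that the combined form really is equivalent to the chain of pipes; a quick check on a toy sample (all the file names here are made up):

```shell
printf '&amp; &lt;tag&gt; &quot;x&quot;\n' > sample.xml

# Four piped seds vs. one sed with four commands:
cat sample.xml | sed 's/&gt;/>/g' | sed 's/&lt;/</g' \
  | sed 's/&quot;/"/g' | sed 's/&amp;/\&/g' > piped.out
cat sample.xml \
  | sed 's/&gt;/>/g; s/&lt;/</g; s/&quot;/"/g; s/&amp;/\&/g' > combined.out

# cmp is silent when the files are byte-identical.
cmp piped.out combined.out && echo 'identical'
```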
Since processes run concurrently, that extra CPU power can be used by another processing step, as long as it can run on partial input. This becomes even more obvious if we remove the expensive character count, because CPU time goes from 3m 31.208s to 1m 23.008s (!).
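To time only the transformation, send the output to /dev/null instead of counting it (again using a hypothetical `sample.xml`):

```shell
printf '&gt;&lt;\n' > sample.xml

# Same pipeline, but without the expensive character count at the end.
time cat sample.xml | sed 's/&gt;/>/g; s/&lt;/</g' > /dev/null
```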
The only catch: if you have one complex, expensive sed command, it is better to run it as its own process, because a pipeline can run no faster than its slowest stage — keeping the expensive command separate lets the cheap steps run on different CPUs instead of being merged into the bottleneck.
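As a sketch: keep the cheap substitutions combined in one sed, and give a heavier command its own process (the wiki-link-stripping regex below is a made-up example of an "expensive" step):

```shell
printf '[[Foo]] &lt;b&gt;\n' > sample.xml

# Cheap substitutions combined into one process...
cat sample.xml | sed 's/&gt;/>/g; s/&lt;/</g' \
  | sed -E 's/\[\[([^]|]*)\]\]/\1/g'   # ...the expensive regex kept separate
```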
There you have it: combine simple sed commands into one operation, and you can make your bash pipelines twice as efficient (or more!).