Making Linux Pipelines 2x as Efficient by Combining Transformations

One of the most common and powerful Linux patterns is piping together a series of commands in bash to transform data. Today I'll show you one way to make these pipelines run much faster:


Combine sed commands (but not too much!):

Often we'll need several search-and-replace operations via sed to clean up data in different ways, but did you know that how you do this makes a difference? For example, let's see how long a quick-and-dirty approach takes to remove entity encoding from a large (3.8 GB) XML file:

[user@host:~/Documents/wordcount]$ time cat data/huwiki-latest-pages-meta-current.xml | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&quot;/"/g' | sed 's/&amp;/&/g' | sed "s/&apos;/'/g" | wc -m

3564069990

real 1m57.601s
user 5m1.484s
sys 0m24.032s

Not bad for such a big file, especially given that just reading and counting the characters takes 1m 3s on its own (and reports 3673719498 characters; the raw count is higher because multi-character entities like &gt; haven't been collapsed to single characters yet). Roughly twice as long to also transform the data seems reasonable.
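For reference, that baseline is just the read-and-count with no sed stages in between, something like:

[user@host:~/Documents/wordcount]$ time cat data/huwiki-latest-pages-meta-current.xml | wc -m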

One odd thing: we use more CPU time (‘user’ time) than the actual elapsed time.  This is because each command in a pipe runs as a separate process, and they can run in parallel here.  Since this is a dual-core computer with hyperthreading, we can schedule up to 4 processes in parallel, and are using roughly 2.5 virtual cores.
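You can see this concurrency with a toy pipeline (a made-up example, nothing to do with sed):

[user@host:~/Documents/wordcount]$ time ( sleep 5 | sleep 5 )

It finishes in about five seconds, not ten: bash launches both stages as separate processes that run side by side, and the pipeline only ends once the last one exits.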

But wait: sed can take multiple transformation commands in a single invocation, either separated by semicolons or passed as separate -e arguments. Is it faster to give sed all the commands at once, or to pipe separate sed processes together?

[user@host:~/Documents/wordcount]$ time cat data/huwiki-latest-pages-meta-current.xml | sed "s/&gt;/>/g; s/&lt;/</g; s/&quot;/\"/g; s/&amp;/&/g; s/&apos;/'/g" | wc -m

3564069990

real 1m37.599s
user 2m42.236s
sys 0m12.584s

Wow, it uses NEARLY HALF as much total CPU time for the same operation! And even though sed is now a single process on a single CPU, it's still about 17% faster overall ('real' time). Each transformation here is very simple, so most of the time is spent reading and writing data through pipe buffers, and it is far more efficient to apply all the substitutions in one pass over each line. Pipes aren't free, either: every extra stage adds another round of copying and another process to schedule.
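Incidentally, the semicolon form and the -e form are interchangeable; the same combined invocation can be spelled with one -e per expression, which some people find easier to read:

[user@host:~/Documents/wordcount]$ time cat data/huwiki-latest-pages-meta-current.xml | sed -e 's/&gt;/>/g' -e 's/&lt;/</g' -e 's/&quot;/"/g' -e 's/&amp;/&/g' -e "s/&apos;/'/g" | wc -m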

Since the processes in a pipeline run concurrently, the CPU time we've freed up can be used by another processing step, as long as that step can work on partial input as it streams through. This becomes even more obvious if we drop the expensive character count (the final wc -m) from both pipelines: total CPU time falls from 3m 31.208s to 1m 23.008s (!).
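To reproduce that measurement, replace the final wc -m with a redirect to /dev/null in both pipelines; for the combined version it looks roughly like this:

[user@host:~/Documents/wordcount]$ time cat data/huwiki-latest-pages-meta-current.xml | sed "s/&gt;/>/g; s/&lt;/</g; s/&quot;/\"/g; s/&amp;/&/g; s/&apos;/'/g" > /dev/null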

The only catch: if one sed expression is genuinely expensive (a complex regex over every line, say), it can be better to leave it as its own pipeline stage. A pipeline can run no faster than its slowest stage, so giving the heavy step a core to itself while the cheap steps run on other CPUs keeps everything moving.
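As a sketch (the date-masking regex here is purely hypothetical, just standing in for an expensive step), you'd keep the heavy expression as its own stage and fold only the cheap substitutions together:

[user@host:~/Documents/wordcount]$ cat data/huwiki-latest-pages-meta-current.xml | sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}/DATE/g' | sed 's/&gt;/>/g; s/&lt;/</g' | wc -m

The expensive first sed gets a core to itself, while the trivial substitutions run in a second process on another core.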

There you have it: combine simple sed commands into one operation, and you can make your bash pipelines twice as efficient (or more!).
