2013.0814

Bash is irrecoverably broken

Everybody knows that bash is a little weird. The quoting shenanigans with $@ and $*, the horked local scoping rules, the required whitespace around [[ and ]], and a handful of other oddities that make shell scripting feel like an adventure. And these were things I was happy to put up with in exchange for its convenient ubiquity and highly usable default settings.

That all changed this morning.

I was up early to work on bake, a build tool that integrates with the shell environment. Bake maintains a table of variable bindings and uses these repeatedly when solving the dependency graph, so I wrote up some quick benchmarks to test various strategies for storing and accessing it.

No problem, right? Bash has sparse arrays, global variables, and bash 4 introduces associative arrays. One way or another, we've got this covered. Ok, so let's do a quick check to make sure the sparse-array support doesn't do anything too crazy with access times:

# set up xs with 65536 elements, each one character long
benchmarking local x=${xs[1024]}...
user    0m0.234s
user    0m0.237s
user    0m0.234s
user    0m0.241s
benchmarking local x=${xs[16384]}...
user    0m0.233s
user    0m0.240s
user    0m0.242s
user    0m0.240s

Looks good! It doesn't seem to matter what the index is. So however bash is doing arrays, it isn't doing anything crazy like a linear scan when we retrieve stuff. Right?
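For reference, a harness like the following reproduces this kind of benchmark (a hypothetical reconstruction; the names, iteration counts, and number of runs are my own, and timings will vary by machine):

```shell
#!/usr/bin/env bash
# Set up xs with 65536 elements, each one character long.
xs=()
for ((i = 0; i < 65536; i++)); do xs[i]=x; done

# Wrap the lookup in a function so 'local' applies, and repeat it
# enough times for the cost to be measurable.
bench() {
  local i x
  for ((i = 0; i < 10000; i++)); do
    x=${xs[1024]}
  done
}

echo 'benchmarking local x=${xs[1024]}...'
time bench
time bench
```

The interesting variable is the array index; everything else stays fixed across runs.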

Arrays are linked lists

Wrong. Sparse arrays are implemented as linked lists in bash. To optimize the sequential access case, the array structure contains a stateful pointer to the last element accessed. We can see this behavior in a benchmark by requesting element 0 after element N:

benchmarking local x=${xs[1024]} y=${xs[0]}...
user  0m0.347s
user  0m0.353s
user  0m0.349s
user  0m0.350s
benchmarking local x=${xs[16384]} y=${xs[0]}...
user  0m1.175s
user  0m1.183s
user  0m1.189s
user  0m1.182s

A tragic lack of forethought on the part of the array implementer. But no matter; bash gives us at least two ways to access hash tables, so we can work around the problem that way.
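A sketch of a harness that shows the effect (again hypothetical: names and counts are my own, and timings will vary). Alternating between a high index and index 0 keeps moving the cached pointer away, so each high-index lookup walks the list over again:

```shell
#!/usr/bin/env bash
# Build the 65536-element array, each element one character long.
xs=()
for ((i = 0; i < 65536; i++)); do xs[i]=x; done

# Alternate between index a and index 0. Jumping back to 0 relocates
# the cached last-access pointer, so the next lookup of a has to walk
# the list again rather than stepping from a nearby position.
bench() {
  local a=$1 i x y
  for ((i = 0; i < 1000; i++)); do
    x=${xs[a]} y=${xs[0]}
  done
}

time bench 1024    # a short walk each iteration
time bench 16384   # a much longer walk each iteration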

Hash tables are linked lists

As of version 4.2, bash does not use hash tables in any meaningful sense, and you should not use a hash table written by anyone who tells you otherwise.

Yes, you read that correctly. Every variable lookup, every associative array lookup, every alias lookup -- all of these are linear-time scans through a linked list.

In the bash source, there's a set of functions called hash_create, hash_insert, etc, that are unsurprisingly called whenever bash maintains a mapping from strings to other stuff like functions, variables, aliases, and such. And also unsurprisingly, there's a function that hashes strings by multiplying each character by a big prime number. So one might reasonably conclude that all this talk of hashes is about setting up a hash table for averaged constant-time access.
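For intuition, here's a toy hash of that general shape written in bash itself. This is an illustration only -- the multiplier 31 and the bucket count 64 are made up, and this is not bash's actual hash function:

```shell
#!/usr/bin/env bash
# Toy string hash: fold each character into an accumulator multiplied
# by a prime. Illustrative only; not bash's hash_string.
toy_hash() {
  local s=$1 i c h=0
  for ((i = 0; i < ${#s}; i++)); do
    printf -v c '%d' "'${s:i:1}"          # character -> ASCII code
    h=$(( (h * 31 + c) & 0xffffffff ))    # 31: a classic small prime multiplier
  done
  echo "$h"
}

# A bucket index is then just the hash modulo the bucket count
# (64 here is an arbitrary stand-in):
echo $(( $(toy_hash PATH) % 64 ))
```

With enough buckets relative to entries, each chain stays short and lookups are effectively constant-time.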

That was probably the idea, anyway. But if you benchmark variable lookups, you'll observe a noticeable decline in performance as more of them are defined:

setting up 64 vars...
real  0m0.004s
user  0m0.004s
sys   0m0.000s
benchmarking local x=${xs_0} y=${xs_0}...
user  0m0.260s
user  0m0.259s
user  0m0.258s
user  0m0.259s
setting up 65536 vars...
real  0m7.484s
user  0m7.415s
sys   0m0.071s
benchmarking local x=${xs_0} y=${xs_0}...
user  0m2.905s
user  0m2.914s
user  0m2.927s
user  0m2.909s

Let's give bash the benefit of the doubt here and say that maybe some of the later variables created some hash collisions (despite all being named xs_i for densely-distributed integers i). It isn't quite fair to compare a nearly-empty hashtable with a full one.

setting up 524288 vars...
real  12m4.956s
user  12m2.650s
sys   0m2.090s
benchmarking local x=${xs_0} y=${xs_0}...
user  1m20.455s
user  1m19.054s
user  1m17.574s
user  1m12.624s

ZOMGWTF. Eight times as many variables, a whopping 100x as long to set them up, and nearly 30x as long to access. And we're not swapping to disk or anything; this is all happening well within memory.
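Here's a sketch of how such a setup benchmark might look (a hypothetical harness; printf -v is just one way to define a variable by computed name, and the counts are scaled down):

```shell
#!/usr/bin/env bash
# Define n variables named xs_0 .. xs_(n-1); each definition goes
# through the shell's variable-table insert path.
setup_vars() {
  local n=$1 i
  for ((i = 0; i < n; i++)); do
    printf -v "xs_$i" '%s' x
  done
}

echo "setting up 64 vars..."
time setup_vars 64
echo "setting up 4096 vars..."
time setup_vars 4096   # 64x the variables; far more than 64x the time
                       # if setup cost is quadratic
```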

Clearly this is no ordinary hashtable implementation. And indeed, here is the hash_insert function from the source (reformatted for brevity):

BUCKET_CONTENTS *
hash_insert (string, table, flags)
     char *string;
     HASH_TABLE *table;
     int flags;
{
  BUCKET_CONTENTS *item;
  int bucket;
  unsigned int hv;

  if (table == 0)
    table = hash_create (0);

  item = (flags & HASH_NOSRCH) ? (BUCKET_CONTENTS *)NULL
                               : hash_search (string, table, 0);
  if (item == 0)
    {
      bucket = HASH_BUCKET (string, table, hv);

      item = (BUCKET_CONTENTS *)xmalloc (sizeof (BUCKET_CONTENTS));
      item->next = table->bucket_array[bucket];
      table->bucket_array[bucket] = item;

      item->data = NULL;
      item->key = string;
      item->khash = hv;
      item->times_found = 0;
      table->nentries++;
    }
  return (item);
}

This all looks quite innocuous at first glance, but if you've ever written a hashtable you will quickly realize that something crucial is missing: the bucket array is never resized. Bash does us the service of maintaining an exact count of the number of items stored in the table, but at no point does it use that count to grow the bucket array and bound the average chain depth.

Which, of course, becomes a problem as we add more entries. What it really means is that bash isn't using a proper hashtable at all; it's just a fixed number of linked lists over which it ultimately does linear scans. With the bucket count fixed, the average chain length grows linearly with the number of entries, so inserting the i-th variable costs O(i) and creating n variables takes O(n²) time in total -- exactly the quadratic blowup the setup timings show.

Why this matters

You might be thinking, "look, it's a shell, who cares." I was tempted to think this too, and for most people it probably really doesn't matter. But the shell isn't just this thing you type commands into. It's your window into the system. It's a programming language that you're using to tell the computer what to do. And being able to extend your navigational context is enormously helpful. Bash, through its sloppy implementation, reduces the shell to just a command prompt and not much more.

It's possible to overcome most kinds of limitations in one way or another. And I'm generally willing to go to some absurd lengths to get something to work if it's sufficiently interesting. There are two kinds of problems, however, that force a hard stop. One is when you don't have enough computational expressiveness to do something, like trying to write Lisp in the C preprocessor. (You can do it with C++ template metaprogramming, but the C preprocessor isn't Turing-complete.)

The other is when you hit a performance barrier with no workaround. If you can allocate memory and get to it in constant time, you can probably come up with some creative solution to a performance problem; but this all stops when your language has no constant-time data access.

I guess it's time to switch to zsh, whose hash table implementation expands the table once the entry count reaches twice the bucket count:

if (++ht->ct >= ht->hsize * 2 && !ht->scan)

Or maybe this will be enough motivation to write a shell. But either way, today is my last day using bash.