Raster

Using GOTO in your code is awesome and you should do it more

There. I said it. It needs to be said. Given the plethora of "GOTO is evil. Never use it!" advice that goes around, from professors teaching you to code through to colleagues and coding guidelines, someone has to take the side of the poor, maligned GOTO.

Poor little GOTO. Show some love, people!

Now let's get back to the real world. I was recently hammering away at a hot path in our code, trying to lower the overhead. Suffice it to say, the code looked something like this (pseudocode written specially for illustration - not the exact original code):

if (shared) lock();
if (data == INVALID) {
  log_error("blah %s:%i %s() -> %p\n", __FILE__, __LINE__, __func__, data);
  if (shared) unlock(); // don't leak the lock on the error path
  return NULL;
}
// smallish body of code hunting through some nested tables using data
if (shared) unlock();
return realdata;

Now some more background. lock() and unlock() are static inline to avoid function call overhead, as they are used very often in hot paths. They exist to provide a level of portability across OSes, so we really want the compiler to inline the real OS function without the developer needing to know what it is; imposing the cost of pushing things onto the stack, doing a call and returning, on top of the real function they wrap, would not be a great idea. They also provide some basic error checking and logging so you don't have to do it all yourself every time, so they are not that trivial. The point is that neither the locking nor the error-handling if with its logging were trivial pieces of code, so effectively if (shared) lock(); becomes something more like:

if (shared) {
  if (lock == VALID) {
    if (!do_lock(lock)) {
      log_error("lock fail %s:%i %s() -> %p\n", __FILE__, __LINE__, __func__, lock);
    }
  }
}

And similarly for unlock().
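
As a rough illustration, such a wrapper might look like the sketch below. This is hypothetical (assuming pthreads underneath, with os_lock, LOCK_MAGIC and the parameter shape all invented for illustration), not the real code:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

// stand-in for the log_error() helper used throughout this post
#define log_error(...) fprintf(stderr, __VA_ARGS__)

#define LOCK_MAGIC 0x10cced

typedef struct {
  int magic;              // LOCK_MAGIC when the lock was properly set up
  pthread_mutex_t mutex;  // the real OS primitive hiding behind the wrapper
} os_lock;

// the thin shim around the native call - the compiler inlines it all away
static inline bool do_lock(os_lock *l) {
  return pthread_mutex_lock(&(l->mutex)) == 0; // 0 means success
}

// validity check + native lock + logging, all inlined at the call site
static inline void lock(os_lock *l) {
  if (l->magic == LOCK_MAGIC) {
    if (!do_lock(l)) {
      log_error("lock fail %s:%i %s() -> %p\n",
                __FILE__, __LINE__, __func__, (void *)l);
    }
  }
}
// unlock() would mirror this with pthread_mutex_unlock()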

So the code above may need to lock a shared resource, go hunt down some data, possibly unlock, and return it, as well as handle errors along the way. I was doing some profiling, and this little segment of code was eating up 6-7% of our CPU time (it is a really hot path), and yes, there are caches to avoid full lookups (in the "smallish body of code") etc., but still this was way too much overhead.
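
(As an aside, this is the kind of perf workflow that produces numbers like that - a typical Linux invocation, with the_app and the_function as placeholder names, not the exact commands used here:)

perf record -g ./the_app     # sample where CPU time goes while the app runs
perf report                  # rank functions by how much time they ate
perf annotate the_function   # per-instruction samples, ASM next to the C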

So I spent time reorganizing it like this:

if (shared) {
  lock();
  if (data == INVALID) {
    log_error("blah %s:%i %s() -> %p\n", __FILE__, __LINE__, __func__, data);
    unlock(); // still have to release the lock on the error path
    return NULL;
  }
  // smallish body of code hunting through some nested tables using data
  unlock();
} else {
  if (data == INVALID) {
    log_error("blah %s:%i %s() -> %p\n", __FILE__, __LINE__, __func__, data);
    return NULL;
  }
  // smallish body of code hunting through some nested tables using data
}
return realdata;

And yet it still used 6-7%. I stared at the data from perf, reading the assembly matched up against the C and looking at how often each code path was actually taken, and then something stood out. First, the shared code path is rare right now, which is exactly why we avoid locking and unlocking where we can. And yes, locking is EXPENSIVE. A lock + unlock can take anywhere from about 6ns to 160ns depending on CPU generation and architecture (I've checked the numbers across various x86 and ARM CPUs). The same work without locking is almost 0ns. Locking is not free. Every lock+unlock cycle is a cost, and it is not cheap. You have to balance the number of lock+unlocks against the time something stays locked, as well as trying to design lockless algorithms or implementations.

I noticed the lock() and unlock() inlines expanded to maybe 15-20 assembly instructions each. The code for if (data == INVALID) { ... } was maybe 20 or so instructions too, as it pushed various things on the stack and did a call. Then it dawned on me...
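
If you want to put rough numbers on the locking cost yourself, a trivial micro-benchmark along these lines will do it (a sketch assuming pthreads; this measures the cheapest, uncontended case, and real numbers depend heavily on CPU and contention):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N 10000000

int main(void) {
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  struct timespec t0, t1;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < N; i++) {
    // uncontended lock + unlock: the best possible case for a lock
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf("%.1f ns per lock+unlock\n", ns / N);
  return 0;
}

Build it with something like cc -O2 bench.c -lpthread (bench.c being whatever you call it); where it lands in the range above depends on your machine.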

There is a separate instruction cache, and the CPU fetches code one cacheline at a time (or more in advance, depending), and a cacheline is generally 64 bytes. Averaging maybe 4 bytes per instruction, only about 16 instructions fit in a cacheline. So if we are doing a compare + branch and SKIPPING almost a whole cacheline worth of instructions to jump to another set of instructions that is not cached yet (we execute a lot of code outside of this hot function, polluting at least the L1 cache if not more), we are most likely getting a cache miss here every time, because compilers STILL like to order code as written. At least that's what perf was displaying, with the ASM and the matching C code next to it and how often it saw each line being hit. Of course this is obvious... we all know this, but we always think of our DATA, not our CODE, when it comes to cache hits. Code can get a miss too, and when it does, the CPU stalls. We just never keep this in mind when writing code. Continually focusing on whether our data will be nicely cached, we forget that the poor instructions are also data our CPU must consume, and they need to avoid cache misses too.
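
You can check the cacheline size on your own machine with a sketch like this (the _SC_LEVEL1_ICACHE_LINESIZE query is a glibc extension and may report 0 on some systems, so treat it as illustrative; perf stat with icache-miss events, whose names vary per CPU, will show the stalls directly):

#include <stdio.h>
#include <unistd.h>

int main(void) {
  // glibc extension: ask for the L1 instruction cache line size (commonly 64)
  long line = sysconf(_SC_LEVEL1_ICACHE_LINESIZE);
  printf("L1 icache line: %ld bytes\n", line);
  return 0;
}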

So it was time to bring out the GOTO to handle exceptional cases that are rare, moving all the "rare case" code to the end of the function, out of the way of the most common path of execution and helping reduce cache misses. So the code became something like this:

if (!shared) {
  if (data != INITIALIZED) goto doinit;
  doinit_back:
  if (data == INVALID) goto err_invalid;
  // smallish body of code hunting through some nested tables using data
} else {
  lock();
  if (data != INITIALIZED) goto doinit_shared;
  doinit_shared_back:
  if (data == INVALID) goto err_invalid_shared;
  // smallish body of code hunting through some nested tables using data
  unlock();
  return realdata;
}
return realdata;
doinit:
  // a few lines of initting data here
  goto doinit_back;
doinit_shared:
  // a few lines of initting data here
  goto doinit_shared_back;
err_invalid_shared:
  unlock(); // release the lock, then fall through to the common error exit
err_invalid:
  log_error("blah %s:%i %s() -> %p\n", __FILE__, __LINE__, __func__, data);
  return NULL;
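
Note the ordering of the cold blocks at the bottom: the init and error handling code only exists at the end of the function, and the shared-path error label just does the unlock and then falls through into the common error label, so even the cleanup on the failure path stays out of the hot cachelines.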

And presto. We dropped down to about 2.5% CPU usage. Moving the "common case" of non-shared data to the early code helped. Moving the big fat pile of instructions involved in taking and releasing a lock helped. Moving the "rare case" of invalid data and its error handling helped. All of this moved rarely executed code out of the hot path and out of the L1 cachelines being fetched, so the CPU could stall much less.

Remember this: GOTO is not the pure evil everyone keeps telling you it is. It has its use cases. It's an excellent way of isolating the code that handles rare, exceptional cases out of the hot path. Compilers still are not that bright, and even if you use things like __builtin_expect() to hint at which result is likely or unlikely, it's good to make your code flow more easily by having the reader run through the common cases nicely without having to skip over large sections of unlikely code. They can look into the exceptional cases later when needed, as there is a goto telling them where to find the code handling them. This also helps the CPU simply perform better, and in this case it led to a 2-3x speedup. The lesson here is that nothing is totally evil. Everything should be taken with a grain of salt. Of course you should not use GOTO to replace for or while loops, and you shouldn't use it for general flow control. But for moving code out of the hot path and handling exceptional cases, it makes your code more readable AND provides major performance improvements that are worth any perceived ugliness.
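
(For reference, the __builtin_expect() hints mentioned above usually get wrapped in macros like these. This is a GCC/Clang-ism, sketched here, not code from this function:)

// classic GCC/Clang branch hint macros built on __builtin_expect()
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// e.g. telling the compiler the error branch is the cold one:
//   if (unlikely(data == INVALID)) goto err_invalid;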

GOTO IS AWESOME

P.S. The actual function in question is _eo_obj_pointer_get() here, and you'd need a lot of extra background to really know what it is and why it's there, so I wasn't going to explain all of that background too.