Faster Fibers/Coroutines

Windows provides a quite efficient API for fiber/coroutine management: a fiber switch costs just a handful of instructions, because it basically saves all registers to the current fiber context and restores them from the new fiber context. Unix provides almost the same API (ucontext), with one difference: each ucontext switch causes 2 syscalls, one to save the current signal mask and another to restore it. That makes a switch orders of magnitude slower.
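To see why the kernel gets involved at all, here is a minimal sketch showing that the signal mask really is part of a ucontext. It assumes a platform that still ships <ucontext.h> (glibc, for instance); the names observe_masks, mask_report and the globals are made up for the illustration. The fiber context is captured while SIGUSR1 is unblocked, the main context then blocks it, and each side sees its own mask after the switch.

```cpp
#include <cassert>
#include <signal.h>
#include <stdlib.h>
#include <ucontext.h>

// Hypothetical names, used only for this illustration.
static ucontext_t g_main_ctx, g_fiber_ctx;
static bool g_blocked_in_fiber = false;

static bool usr1_blocked()
{
    sigset_t cur;
    sigprocmask(SIG_BLOCK, nullptr, &cur);   // null set: only query the mask
    return sigismember(&cur, SIGUSR1) == 1;
}

static void fiber_body()
{
    g_blocked_in_fiber = usr1_blocked();
    // Returning resumes uc_link, i.e. g_main_ctx.
}

struct mask_report { bool in_fiber; bool back_in_main; };

mask_report observe_masks()
{
    sigset_t usr1, original;
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &usr1, &original);  // known state: unblocked

    getcontext(&g_fiber_ctx);                    // captures the "unblocked" mask
    size_t const stack_size = 64 * 1024;
    void* stack = malloc(stack_size);
    assert(stack != nullptr);
    g_fiber_ctx.uc_stack.ss_sp = stack;
    g_fiber_ctx.uc_stack.ss_size = stack_size;
    g_fiber_ctx.uc_link = &g_main_ctx;
    makecontext(&g_fiber_ctx, fiber_body, 0);

    sigprocmask(SIG_BLOCK, &usr1, nullptr);      // main context now blocks SIGUSR1
    swapcontext(&g_main_ctx, &g_fiber_ctx);      // run the fiber, then come back

    mask_report r = { g_blocked_in_fiber, usr1_blocked() };
    sigprocmask(SIG_SETMASK, &original, nullptr);
    free(stack);
    return r;
}
```

Calling observe_masks() should report SIGUSR1 as unblocked inside the fiber and blocked again back in the main context. Changing the mask on every switch is exactly the work that cannot be done in user space.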

C has long provided another API that does basically the same thing: setjmp/longjmp. It turns out that ucontext and setjmp/longjmp can be combined to get almost portable fast fibers without platform-specific assembly code. setjmp/longjmp cannot create a new stack on their own, so the idea is to create each new context with makecontext() and then use setjmp/longjmp to switch between contexts.

Relacy Race Detector heavily uses fibers to emulate threads. I've implemented the trick in version 2.4, and it basically boils down to replacing the following code:

#include <stdlib.h>
#include <ucontext.h>

typedef ucontext_t fiber_t;

void create_fiber(fiber_t& fib, void(*ufnc)(void*), void* uctx)
{
    getcontext(&fib);
    size_t const stack_size = 64*1024;
    fib.uc_stack.ss_sp = (::malloc)(stack_size);
    fib.uc_stack.ss_size = stack_size;
    fib.uc_link = 0;
    // makecontext() expects a void(*)() entry point, so C++ needs the cast
    makecontext(&fib, (void(*)())ufnc, 1, uctx);
}

inline void switch_to_fiber(fiber_t& fib, fiber_t& prev)
{
    // saves the current context into prev and resumes fib
    swapcontext(&prev, &fib);
}

with the following code:

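#include <setjmp.h>
#include <stdlib.h>
#include <ucontext.h>

struct fiber_t
{
    ucontext_t fib; // used only to create the fiber and enter it the first time
    jmp_buf jmp;    // used for every later switch
};

// start-up parameters handed to a new fiber
struct fiber_ctx_t
{
    void(* fnc)(void*); // user function
    void* ctx;          // user argument
    jmp_buf* cur;       // jmp_buf of the fiber being created
    ucontext_t* prv;    // context of the creator, to return to
};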

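static void fiber_start_fnc(void* p)
{
    fiber_ctx_t* ctx = (fiber_ctx_t*)p;
    // copy the parameters out: *ctx lives on the creator's stack
    void (*ufnc)(void*) = ctx->fnc;
    void* uctx = ctx->ctx;
    if (_setjmp(*ctx->cur) == 0)
    {
        // first pass: the jump target is recorded, return to create_fiber()
        ucontext_t tmp;
        swapcontext(&tmp, ctx->prv);
    }
    // reached later, via _longjmp() from switch_to_fiber()
    ufnc(uctx);
}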

inline void create_fiber(fiber_t& fib, void(*ufnc)(void*), void* uctx)
{
    getcontext(&fib.fib);
    size_t const stack_size = 64*1024;
    fib.fib.uc_stack.ss_sp = (::malloc)(stack_size);
    fib.fib.uc_stack.ss_size = stack_size;
    fib.fib.uc_link = 0;
    ucontext_t tmp;
    fiber_ctx_t ctx = {ufnc, uctx, &fib.jmp, &tmp};
    makecontext(&fib.fib, (void(*)())fiber_start_fnc, 1, &ctx);
    // run the fiber once so that it can fill in fib.jmp, then come back here
    swapcontext(&tmp, &fib.fib);
}

inline void switch_to_fiber(fiber_t& fib, fiber_t& prv)
{
    if (_setjmp(prv.jmp) == 0)
        _longjmp(fib.jmp, 1);
}

Note that I am using _setjmp/_longjmp instead of setjmp/longjmp (which usually also save and restore the signal mask). On Linux it gave me a 2.5x speedup instantly, while on Darwin that single change gave me an astonishing 7x speedup.
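The difference between the two flavours can be made visible with the POSIX pair sigsetjmp/siglongjmp, where saving the mask is an explicit argument. This is only a sketch: it swaps in sigsetjmp for _setjmp so that both behaviours can be selected from one function, and the name mask_survives_jump is made up. A savemask of 0 corresponds to what _setjmp/_longjmp do.

```cpp
#include <cassert>
#include <setjmp.h>
#include <signal.h>

static bool usr1_blocked()
{
    sigset_t cur;
    sigprocmask(SIG_BLOCK, nullptr, &cur);   // null set: only query the mask
    return sigismember(&cur, SIGUSR1) == 1;
}

// Hypothetical helper: saves a jump target while SIGUSR1 is unblocked,
// blocks the signal, jumps back, and reports whether the block survived.
bool mask_survives_jump(int save_mask)
{
    sigset_t usr1, original;
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    int rc = sigprocmask(SIG_UNBLOCK, &usr1, &original);  // known state: unblocked
    assert(rc == 0);

    sigjmp_buf env;
    if (sigsetjmp(env, save_mask) == 0)
    {
        sigprocmask(SIG_BLOCK, &usr1, nullptr);  // change the mask after the save
        siglongjmp(env, 1);                      // jump back to the saved point
    }

    bool blocked = usr1_blocked();
    sigprocmask(SIG_SETMASK, &original, nullptr);  // leave the process as we found it
    return blocked;
}
```

With save_mask set to 1 the jump restores the saved mask, so the block is undone; with 0 the mask is left alone, and that skipped restore is where the savings come from.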

A fly in the ointment is that some builds on some platforms (namely, release builds on Linux) start crashing with the "longjmp causes uninitialized stack frame" error message. That is quite reasonable: a trivial debug check is able to detect that longjmp tries to jump upwards, which is definitely incorrect (of course, in reality the code does not try to jump upwards; it jumps to a completely unrelated stack). The good news is that the check can be suppressed with #undef _FORTIFY_SOURCE, placed before any system header is included.
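Putting the pieces together, here is a self-contained sketch of a ping-pong between the main context and one fiber. The fiber functions are the ones shown above, with ufnc typed as void(*)(void*). demo_t, worker_fnc and run_ping_pong are hypothetical names added for the demonstration. It assumes a platform with <ucontext.h> and _setjmp/_longjmp, and, like the original code, that makecontext() passes a pointer-sized argument through intact (true for glibc on x86-64, but not guaranteed everywhere).

```cpp
#undef _FORTIFY_SOURCE  // must precede every #include to switch off the longjmp check
#include <cassert>
#include <setjmp.h>
#include <stdlib.h>
#include <ucontext.h>
#include <vector>

struct fiber_t { ucontext_t fib; jmp_buf jmp; };
struct fiber_ctx_t { void (*fnc)(void*); void* ctx; jmp_buf* cur; ucontext_t* prv; };

static void fiber_start_fnc(void* p)
{
    fiber_ctx_t* ctx = (fiber_ctx_t*)p;
    void (*ufnc)(void*) = ctx->fnc;   // copy out: *ctx dies when create_fiber returns
    void* uctx = ctx->ctx;
    if (_setjmp(*ctx->cur) == 0)
    {
        ucontext_t tmp;
        swapcontext(&tmp, ctx->prv);  // first pass only: back to create_fiber
    }
    ufnc(uctx);                       // reached via _longjmp from switch_to_fiber
}

static void create_fiber(fiber_t& fib, void (*ufnc)(void*), void* uctx)
{
    getcontext(&fib.fib);
    size_t const stack_size = 64 * 1024;
    fib.fib.uc_stack.ss_sp = malloc(stack_size);
    assert(fib.fib.uc_stack.ss_sp != nullptr);
    fib.fib.uc_stack.ss_size = stack_size;
    fib.fib.uc_link = 0;
    ucontext_t tmp;
    fiber_ctx_t ctx = {ufnc, uctx, &fib.jmp, &tmp};
    // Assumption: the pointer survives makecontext()'s int-typed varargs.
    makecontext(&fib.fib, (void (*)())fiber_start_fnc, 1, &ctx);
    swapcontext(&tmp, &fib.fib);
}

static void switch_to_fiber(fiber_t& fib, fiber_t& prv)
{
    if (_setjmp(prv.jmp) == 0)
        _longjmp(fib.jmp, 1);
}

// Hypothetical demo state shared by both sides.
struct demo_t { fiber_t main_fib; fiber_t worker; std::vector<int> log; };

static void worker_fnc(void* p)
{
    demo_t* d = (demo_t*)p;
    // Never returns: with uc_link == 0, returning would end the thread.
    for (int i = 1; ; i++)
    {
        d->log.push_back(i * 10);
        switch_to_fiber(d->main_fib, d->worker);
    }
}

std::vector<int> run_ping_pong()
{
    demo_t d;
    create_fiber(d.worker, worker_fnc, &d);
    for (int i = 1; i <= 3; i++)
    {
        d.log.push_back(i);
        switch_to_fiber(d.worker, d.main_fib);
    }
    free(d.worker.fib.uc_stack.ss_sp);  // the worker is never resumed again
    return d.log;
}
```

The returned log interleaves the two sides: 1, 10, 2, 20, 3, 30. The main context needs a fiber_t as well, but only its jmp member is ever used, because main is never entered through makecontext().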