Lazy Concurrent Initialization

Components are sometimes designed to perform initialization tasks when they are first called, rather than when they are created. Lazy concurrent initialization ensures that this initialization occurs only once even when multiple threads may attempt the initialization. Perhaps, the most prevailing use case is Singletons, I am not going to discuss here as to whether singletons are good or bad, it's up to you. However, note that there are other legitimate use cases. For example, consider that we have a large graph of objects, and each object consists of a lightweight "header" and a heavyweight "body", and that body initialization requires heavy computations or a disk/database access. We may want to defer body initialization until a first access, it helps us to reduce initial latency (which is important in interactive applications). Or, perhaps, significant amount of objects won't be initialized at all, and we will determine which objects to initialize only in a concurrent phase of an application, so the only choice is the lazy concurrent initialization.

There are 2 types of initialization: blocking and non-blocking. Blocking is actually not that bad in this context and is usually preferable, because blocking potentially happens only once - during initialization. That is, the first thread announces that it is in process of initialization, and other threads wait (block) for the initialization to complete. And non-blocking initialization refers to the following scenario. If a thread detects that an object is not completely initialized, it starts it's own "shadow" initialization process. When a thread completes the process, it tries to commit it. A first thread succeeds with the commit, while all other threads fail and rollback their initialization.

Here is a straightforward mutex-based implementation of blocking initialization:

T* g_object; // = 0;

std::mutex g_guard;

T* get_or_create()

{

std::mutex::scoped_loсk loсk (g_guard);

if (g_object == 0)

g_object = new T;

return g_object;

}

And here is a non-blocking one:

T* get_or_create()

{

{

std::mutex::scoped_loсk loсk (g_guard);

if (g_object != 0)

return g_object;

}

T* local = new T;

T* instance = 0;

{

std::mutex::scoped_loсk loсk (g_guard);

if (g_object == 0)

{

g_object = local;

return local;

}

instance = g_object;

}

delete local;

return instance;

}

I need to note that lazy concurrent initialization is available in both Windows API, POSIX and C1x/C++0x.

Windows API is almost perfectly designed - it supports both blocking and non-blocking mode, and it supports inversion of control with callbacks and direct control flow. POSIX does not support non-blocking mode nor direct non-inverted control, and more importantly it does not allow to pass a context into an initialization callback (so you need to guess what you are initializing basing only on function address, that is, you can't initialize 2 objects with a single function). C++0x provides blocking initialization with inverted control, but you are able to pass a context to an initialization callback. C1x mimics POSIX.

So why would you potentially want to create your own lazy concurrent initialization facility?

Windows API is appeared only in Windows Vista. POSIX/C1x API has significant drawbacks. C++0x is not readily available now (however g++ quickly catchs up and there is just::thread). Perhaps, you are working in an environment without Windows API/POSIX. Plus separate initialization object inevitably eats up some additional memory, which can be avoided by merging it into the object pointer. And in the end it's just a good exercise in concurrent programming.

The mutex-based implementations won't scale (neither rw-mutex), so let's create a canonical scalable implementation. We are going to employ the so called double-checked locking pattern, where the first check (fast-path) is a read access. Here is a blocking version:

T* g_object; // = 0

std::atomic<bool> g_is_initialized; // = false

std::mutex g_guard;

T* get_or_create()

{

// read-only fast-path

if (g_is_initialized.load(std::memory_order_acquire))

return g_object;

// slow-path

std::mutex::scoped_loсk loсk (g_guard);

if (g_is_initialized.load(std::memory_order_acquire))

return g_object;

g_object = new T;

g_is_initialized.store(true, std::memory_order_release);

}

Note that the slow-path is executed only during initialization, while the object is initialized all threads will go fast-path. And the fast-path is physically read-only, so the algorithm will scale linearly.

And here is a non-blocking one:

std::atomic<T*> g_object; // = 0

T* get_or_create()

{

// read-only fast-path

T* local = g_object.load(std::memory_order_acquire);

if (local != 0)

return local;

// slow-path

T* shadow = new T;

if (g_object.compare_exchange_strong(local, shadow,

std::memory_order_acq_rel))

return shadow;

delete shadow;

// compare_exchange_strong() updates comperand on failure,

// so we get committed version of the object in 'local'

return local;

}

Move on to Elimination of Memory Fences