Elimination of Memory Fences

Now the only remaining problem is that nasty acquire fence on the fast-path of both algorithms. Acquire memory fences do not hinder scalability, but they still have some associated cost (on the order of a few tens of cycles). Here I need to make a proviso - on some hardware platforms (most notably x86 and SPARC TSO) acquire fences are implicit, implied by each load, that is, they are costless no-ops. So if you are targeting only such platforms, you are fine with the version that issues an acquire fence on the fast-path. However, do not fall into the fallacy of thinking that you may remove them completely - you still need to ensure proper code generation by the compiler.
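To illustrate the code-generation point, a minimal sketch (the names are mine, not from any library): on x86 the acquire load below compiles to an ordinary MOV, yet declaring the variable as std::atomic and loading it with memory_order_acquire still forbids the compiler from hoisting dependent loads above it or caching the value in a register - which a plain non-atomic global would not:

```cpp
#include <atomic>

// Illustrative global; on x86 the acquire load is a free MOV at the
// hardware level, but the atomic type still constrains the compiler.
std::atomic<int*> g_ptr{nullptr};

int read_via_acquire()
{
    // correct even on TSO hardware: the compiler may not reorder
    // the dereference before this load, nor cache g_ptr in a register
    int* p = g_ptr.load(std::memory_order_acquire);
    return p ? *p : -1;
}
```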

There are two ways to eliminate the fence: weaken it to a consume fence, or eliminate it completely via a thread-local cache trick.

If there is a data dependency between the synchronization load (the load of a pointer or a flag) and the associated data, then we can weaken the acquire fence to a consume fence. A consume fence is a costless no-op on most modern architectures (to the best of my knowledge, the only architecture that requires an actual fence for consume ordering is the now almost dead DEC Alpha). Here is an example with a data dependency (it's a fusion of an initialization function and the surrounding code, as if the compiler had inlined the initialization function):
T* local = g_object.load(std::memory_order_consume);
if (local != 0)
    printf("%d", local->data);
Here is an example without a data dependency:
if (g_is_initialized.load(std::memory_order_acquire))
    printf("%d", g_object->data);
And in this case we need a genuine acquire fence - with anything weaker (say, consume, which imposes no ordering here because there is no dependency), the compiler or hardware can reorder it as:
tmp = g_object->data;
// (*)
if (g_is_initialized.load(std::memory_order_consume))
    printf("%d", tmp);
Which will clearly lead to problems if a thread is preempted at (*) while another thread initializes the object and sets the g_is_initialized flag.
So, here is a blocking initialization that uses a data dependency to weaken the acquire fence to a consume fence. It can be considered virtually zero-overhead on the fast-path on most modern hardware:
std::atomic<T*> g_object; // = 0
std::mutex g_guard;

T* get_or_create()
{
    // read-only fast-path
    T* local = g_object.load(std::memory_order_consume);
    if (local != 0)
        return local;

    // slow-path
    std::lock_guard<std::mutex> lock(g_guard);
    local = g_object.load(std::memory_order_relaxed);
    if (local == 0)
    {
        local = new T;
        g_object.store(local, std::memory_order_release);
    }
    return local;
}
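To sanity-check the pattern, here is a self-contained harness (the payload type T, its construction counter, and the thread count are my own illustrative choices) that calls the primitive from several threads and verifies that all of them observe the same, exactly-once-constructed object:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative payload type; counts constructions so we can
// verify that initialization happened exactly once.
struct T
{
    static std::atomic<int> ctor_calls;
    int data = 42;
    T() { ctor_calls.fetch_add(1, std::memory_order_relaxed); }
};
std::atomic<int> T::ctor_calls{0};

std::atomic<T*> g_object{nullptr};
std::mutex g_guard;

T* get_or_create()
{
    // read-only fast-path: consume suffices because all accesses
    // go through the returned pointer (a data dependency)
    T* local = g_object.load(std::memory_order_consume);
    if (local != nullptr)
        return local;

    // slow-path
    std::lock_guard<std::mutex> lock(g_guard);
    local = g_object.load(std::memory_order_relaxed);
    if (local == nullptr)
    {
        local = new T;
        g_object.store(local, std::memory_order_release);
    }
    return local;
}

T* hammer()
{
    std::vector<std::thread> threads;
    std::vector<T*> results(8);
    for (int i = 0; i != 8; ++i)
        threads.emplace_back([&results, i] { results[i] = get_or_create(); });
    for (auto& t : threads)
        t.join();
    for (T* p : results)
        assert(p == results[0]); // every thread sees the same object
    return results[0];
}
```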

But don't hurry - the problem is that sometimes you don't actually have a data dependency where you may think you have one (do not fall into this fallacy like some people do). Consider the following code:
ILog* g_log; // = 0

class Singleton
{
    static int s_var; // = 0

    Singleton()
    {
        g_log = new Log;
        s_var = rand();
    }

    void foo()
    {
        g_log->log(s_var);
    }
};


Singleton* instance = get_or_create();
instance->foo();

Is there a data-dependency or not? Consider what we have here in essence:
// it's inlined get_or_create()
Singleton* instance = g_instance.load(std::memory_order_consume);
if (instance)
    // and it's inlined Singleton::foo()
    g_log->log(s_var);

See? There are no data dependencies between the load of g_instance and the loads of g_log and s_var. They are completely independent loads of global variables. So be careful with memory_order_consume; if you are creating a general-purpose library, you need to either reject memory_order_consume (luckily there is another way to eliminate the acquire fence) or distinctly warn your users about the restrictions.
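One way to repair this (an illustrative sketch, not the only option) is to move such state into the singleton itself, so that every access is a load through the consumed pointer and the dependency chain is genuine:

```cpp
#include <atomic>
#include <cstdio>

struct Log
{
    void log(int v) { std::printf("%d\n", v); }
};

class Singleton
{
public:
    Log* m_log; // was the global g_log
    int m_var;  // was the static s_var

    Singleton() : m_log(new Log), m_var(42) {}

    // both loads now go *through* the singleton pointer:
    // this->m_log and this->m_var are dependent on it
    void foo() { m_log->log(m_var); }
};

std::atomic<Singleton*> g_instance{nullptr};

void use()
{
    Singleton* instance = g_instance.load(std::memory_order_consume);
    if (instance)
        instance->foo(); // every access depends on `instance`
}
```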


There is an interesting trick based on thread-local storage that allows us to eliminate the acquire fence completely. Basically, each thread just caches a pointer to the object in thread-local storage:
T* g_object; // = 0
thread_local T* t_object; // = 0
std::mutex g_guard;

T* get_or_create()
{
    T* obj = t_object;
    if (obj != 0)
        return obj;

    // slow-path
    std::lock_guard<std::mutex> lock(g_guard);
    if (g_object == 0)
        g_object = new T;
    t_object = g_object;
    return g_object;
}

Access to thread-local storage costs on the order of 2 indirect loads, that is, it's very fast. However, on x86/SPARC TSO the version with the acquire fence is a bit faster.
This version requires a thread-local storage slot and a mutex per lazily initialized object, which may be expensive in some contexts. There is a way to eliminate this, or more precisely to amortize it: keep a single slot and a single mutex system-wide (for all lazily initialized objects):
#define UNINITIALIZED UINT64_MAX
#define INITIALIZING (UINT64_MAX - 1)

// per-object data
T* g_object; // = 0
std::atomic<uint64_t> g_once; // = UNINITIALIZED

// global data
std::atomic<uint64_t> g_counter; // = 0
thread_local uint64_t t_counter; // = 0
std::mutex g_guard;

T* get_or_create()
{
    uint64_t counter = g_once.load(std::memory_order_relaxed);
    if (counter > t_counter)
    {
        // slow-path
        g_guard.lock();
        counter = g_once.load(std::memory_order_relaxed);
        if (counter == UNINITIALIZED)
        {
            g_once.store(INITIALIZING, std::memory_order_relaxed);
            g_guard.unlock();
            g_object = new T;
            g_guard.lock();
            counter = g_counter.fetch_add(1, std::memory_order_relaxed) + 1;
            g_once.store(counter, std::memory_order_release);
        }
        else
        {
            // drop the mutex while waiting, otherwise the initializing
            // thread can never re-acquire it to publish the new counter
            while (g_once.load(std::memory_order_relaxed) == INITIALIZING)
            {
                g_guard.unlock();
                std::this_thread::yield();
                g_guard.lock();
            }
        }
        t_counter = g_counter.load(std::memory_order_relaxed);
        g_guard.unlock();
    }
    return g_object;
}
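Here is a self-contained, runnable sketch of the counter-based scheme (the payload type, construction counter, and thread count are illustrative). One detail worth calling out: a waiting thread must release the mutex while it spins, otherwise the initializing thread could never re-acquire the mutex to publish the new counter:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

#define UNINITIALIZED UINT64_MAX
#define INITIALIZING (UINT64_MAX - 1)

// Illustrative payload; counts constructions so we can
// check that initialization runs exactly once.
struct T
{
    static std::atomic<int> ctor_calls;
    int data = 42;
    T() { ctor_calls.fetch_add(1, std::memory_order_relaxed); }
};
std::atomic<int> T::ctor_calls{0};

// per-object data
T* g_object = nullptr;
std::atomic<uint64_t> g_once(UNINITIALIZED);

// global data
std::atomic<uint64_t> g_counter(0);
thread_local uint64_t t_counter = 0;
std::mutex g_guard;

T* get_or_create()
{
    uint64_t counter = g_once.load(std::memory_order_relaxed);
    if (counter > t_counter)
    {
        g_guard.lock();
        counter = g_once.load(std::memory_order_relaxed);
        if (counter == UNINITIALIZED)
        {
            g_once.store(INITIALIZING, std::memory_order_relaxed);
            g_guard.unlock();
            g_object = new T; // constructed outside the lock
            g_guard.lock();
            counter = g_counter.fetch_add(1, std::memory_order_relaxed) + 1;
            g_once.store(counter, std::memory_order_release);
        }
        else
        {
            // release the mutex while spinning so the initializer can finish
            while (g_once.load(std::memory_order_relaxed) == INITIALIZING)
            {
                g_guard.unlock();
                std::this_thread::yield();
                g_guard.lock();
            }
        }
        t_counter = g_counter.load(std::memory_order_relaxed);
        g_guard.unlock();
    }
    return g_object;
}

T* hammer()
{
    std::vector<std::thread> threads;
    std::vector<T*> results(8);
    for (int i = 0; i != 8; ++i)
        threads.emplace_back([&results, i] { results[i] = get_or_create(); });
    for (auto& t : threads)
        t.join();
    for (T* p : results)
        assert(p == results[0]); // all threads get the same object
    return results[0];
}
```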

A single mutex can hinder scalability if there are a lot of lazily initialized objects, so this solution can be further improved by using a hash table of mutexes (and associated counters).
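As a sketch of that improvement (the stripe count and all names here are my own assumptions), the guard mutex and counter can be selected by hashing the address of the per-object g_once word:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical "hash table of mutexes": instead of one global
// g_guard/g_counter pair, pick a stripe by hashing the address
// of the per-object g_once word.
const size_t kStripes = 16; // must be a power of 2 for the mask below

struct Stripe
{
    std::mutex guard;
    std::atomic<uint64_t> counter{0};
};
Stripe g_stripes[kStripes];

Stripe& stripe_for(const void* once_addr)
{
    // shift off low bits that are identical due to alignment
    size_t h = reinterpret_cast<uintptr_t>(once_addr) >> 4;
    return g_stripes[h & (kStripes - 1)];
}
```

Each thread would then also need a per-stripe t_counter (for example a thread-local array indexed by the same hash), since a single cached counter can no longer summarize all stripes.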

The last thing regarding lazy initialization: you need to consider whether static initialization of the primitive is necessary. If you are creating a reusable solution, it's possible that it will be used "before main()", that is, when threads are started from constructors of global objects or from initialization routines of dynamic libraries. Then you need to provide either an initializing macro or a C++0x constexpr constructor.

For example, pthread_once() provides it in the form of:
pthread_once_t g_once = PTHREAD_ONCE_INIT;
A potential problem is that all the primitives you use must support static initialization as well. That is, you can't use, for example, CRITICAL_SECTION or std::mutex, because they do not support static initialization (strictly speaking, C++11 declares std::mutex's default constructor constexpr, but not all implementations honor that in practice). However, you can use pthread_mutex_t, because it provides the static initializer PTHREAD_MUTEX_INITIALIZER.
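For a C++ flavor of the same idea, here is a sketch (an illustrative type, not a real library API) of a once-flag with a constexpr constructor; constant initialization guarantees the flag is usable even from constructors of other global objects:

```cpp
#include <atomic>

// Hypothetical once-flag: the constexpr constructor forces constant
// (static) initialization, so no dynamic initializer can race with
// threads started "before main()".
class OnceFlag
{
public:
    constexpr OnceFlag() : m_done(false) {}
    bool done() const { return m_done.load(std::memory_order_acquire); }
    void set_done() { m_done.store(true, std::memory_order_release); }
private:
    std::atomic<bool> m_done;
};

// statically initialized: usable from any global constructor
OnceFlag g_flag;
```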

Well, that's about all I can tell about lazy initialization.
