False-sharing

The entry "Most Common Performance Issues in Parallel Programs" in the Parallel Programming with .NET blog and the recent MSDN article ".NET Matters: False Sharing" have attracted my attention. Basically, they both suggest eliminating false sharing. Wrong! Wrong! Wrong! It's not the whole truth, so to speak. So if the authors were under oath in a virtual IT court, they would have to be arrested. Fortunately, they weren't under oath :)

The first thing one has to say in that context is:

1. Eliminate sharing. Period. Not false sharing, just sharing. It's sharing that carries huge performance penalties. It's sharing that turns the linear scalability of your application into super-linear degradation. And believe me, the hardware has no means to distinguish false sharing from true sharing: cache-coherence protocols operate on whole cache lines. It can't penalize only false sharing and handle true sharing without any performance penalty.

The second thing one has to say in that context is:

2. Put things that must be close to each other... close to each other. Assume the following situation: in order to complete some operation, a thread has to update variable X and variable Y. If the variables are situated far from each other (on different cache lines), the thread has to load (from main memory, or from another processor's cache) two cache lines instead of one (as it would if the variables were situated close to each other). Effectively this situation can be considered the same as false sharing, because the thread places unnecessary work on the interconnects, thus degrading performance and scalability. A sketch of both layouts follows below.
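For illustration, here is a hypothetical sketch of the two layouts (the names and the 1000-byte gap are made up for the example):

// Bad: X and Y are always updated together by the same thread,
// but the gap between them pushes them onto different cache lines,
// so the thread pulls in two lines per operation.
long X;
char unrelated_data [1000];
long Y;

// Good: X and Y are adjacent, so they (most likely) share a single
// cache line, and the thread pulls in one line per operation.
struct operation_state_t
{
    long X;
    long Y;
};
operation_state_t g_state;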

Points 1 and 2 can be aggregated as:

1+2. Do pay attention to data layout. This was important in the '60s. It's even more important in the multicore era.

Only after that one can also add:

3. Sometimes sharing can show up where you are not expecting it, i.e., false sharing. It's important to eliminate false sharing too... et cetera... [insert here contents of False Sharing article]

Stating only point 3 is basically senseless. And sometimes it can even hurt.

Let's consider a simple example:

#include <windows.h>

long volatile g_operation_count = 0;

void collect_statistics()
{
    InterlockedIncrement(&g_operation_count);
}

What does a naive programmer think about it? Hmmm... Let's see... I use "fast" non-blocking interlocked operations. Good!... Hmmm... False sharing. Let's see... Hmmm... There is no false sharing here. Good! So my program fully conforms to the experts' recommendations.

Rubbish! It's a dead-slow, completely non-scalable program. All threads hammer on a single shared counter, so the cache line holding it constantly ping-pongs between processors' caches.

Now let's apply the consistent rules to the example. First of all, we have to do something like this:

long volatile g_operation_count [MAX_THREAD_COUNT] = {};

void collect_statistics()
{
    // get_current_thread_id() is assumed to return a dense
    // per-thread index in [0, MAX_THREAD_COUNT)
    InterlockedIncrement(&g_operation_count[get_current_thread_id()]);
}

It's a good distributed design. When we need the aggregate number of operations, we just sum up all the thread-local counters, as sketched below.
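A minimal sketch of the aggregation (assuming all running threads have been assigned indices below MAX_THREAD_COUNT):

long get_total_operation_count()
{
    long total = 0;
    // plain loads are fine here: the result is approximate anyway,
    // since other threads keep incrementing concurrently
    for (unsigned i = 0; i != MAX_THREAD_COUNT; i += 1)
        total += g_operation_count[i];
    return total;
}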

Only at this point do we remember about false sharing and put the final touches on the code:

unsigned const CACHE_LINE_SIZE = 64; // typical for modern x86 processors

struct counter_t
{
    long volatile count;
    char pad [CACHE_LINE_SIZE - sizeof(long)]; // keep each counter on its own line
};

counter_t g_operation_count [MAX_THREAD_COUNT] = {};

void collect_statistics()
{
    InterlockedIncrement(&g_operation_count[get_current_thread_id()].count);
}

OK, now this distributed version is both fast and scalable. It has linear scalability and can be up to 100x faster than the original version on modern multi-core hardware.
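The get_current_thread_id() function is not spelled out here; a minimal sketch of one possible implementation (hypothetical, using MSVC thread-local storage; real code would also need to cope with running out of indices) could be:

unsigned get_current_thread_id()
{
    static long volatile g_thread_seq = -1; // last index handed out
    static __declspec(thread) unsigned t_my_id = (unsigned)-1;
    if (t_my_id == (unsigned)-1)
        // first call on this thread: grab the next free index
        t_my_id = (unsigned)InterlockedIncrement(&g_thread_seq);
    return t_my_id;
}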

So, points 1+2 form a kind of general rule, while point 3 is just a refinement of it.

Why don't people tell the whole truth? I don't know. I don't believe the authors aren't aware of the problem. Maybe they think it's obvious. Practice shows that it's not...

Now you may move to the following applied sections:

Reader-Writer Problem

Producer-consumer Queues