Distributed Reader-Writer Mutex

Now that we know that traditional reader-writer mutexes do not scale, that write sharing is our foe, and that the way to go is state distribution, let's try to create a scalable distributed reader-writer mutex. The mutex is going to be very simple; I'm not going to dive deep into advanced lock-free algorithms. Let's just create the simplest possible distributed design and see what performance and scalability we can achieve.

The mutex is based on per-processor data, and that leads to a very simple implementation. If it were based on per-thread data instead, we would need to cope with dynamic thread registration/deregistration and properly synchronize arriving/terminating readers with writers.

The idea is very simple. We merely create a traditional reader-writer mutex per CPU; a reader acquires, in shared mode, the mutex it believes corresponds to the current CPU, while a writer acquires all of the mutexes in exclusive mode.

Note that it's OK if a reader acquires a "wrong" mutex: each of them is a plain reader-writer mutex in its own right, so it supports several concurrent readers. No additional synchronization between writers is required; writers acquire the mutexes in the same order (from 0 to P-1), so ownership of mutex 0 basically determines who the "current" writer is (all other potential writers are parked on mutex 0).
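
Before diving into the code: the snippets below assume roughly the following headers and a cache-line-size constant; the concrete value of 64 bytes is an assumption that matches typical x86 processors, not something required by the algorithm.

#define _GNU_SOURCE      // for sched_getcpu()
#include <pthread.h>     // pthread_rwlock_t
#include <sched.h>       // sched_getcpu()
#include <stdlib.h>      // posix_memalign(), free()
#include <unistd.h>      // sysconf()

// Assumed cache line size; adjust for your hardware if needed.
#define CACHE_LINE_SIZE 64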

As the underlying reader-writer mutex type I use plain pthread_rwlock_t, and sched_getcpu() to obtain the current processor number. Let's move on to the implementation. First, let's define the data structures:

typedef struct distr_rw_mutex_cell_t
{
    pthread_rwlock_t mtx;
    // Pad the cell to a full cache line so that per-processor mutexes
    // do not share cache lines (no write sharing between readers on
    // different processors).
    char pad [CACHE_LINE_SIZE - sizeof(pthread_rwlock_t)];
} distr_rw_mutex_cell_t;

typedef struct distr_rw_mutex_t
{
    int proc_count;
    char pad [CACHE_LINE_SIZE - sizeof(int)];
    // One cell per processor, allocated at the end of the object.
    distr_rw_mutex_cell_t cell [0];
} distr_rw_mutex_t;
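
As a sanity check (not part of the original code), one can verify the layout at compile time, assuming pthread_rwlock_t fits into a single cache line as it does on typical glibc/x86-64 builds:

// C11/gcc static assertion: each cell occupies exactly one cache line.
_Static_assert(sizeof(distr_rw_mutex_cell_t) == CACHE_LINE_SIZE,
    "distr_rw_mutex_cell_t must occupy exactly one cache line");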

The constructor merely determines the total number of processors in the system, memorizes it, and initializes the per-processor mutexes, while the destructor destroys the mutexes and frees the memory:

int distr_rw_mutex_create (distr_rw_mutex_t** mtx_p)
{
    distr_rw_mutex_t* mtx;
    int proc_count;
    int i;
    // Determine the number of processors configured in the system.
    proc_count = (int)sysconf(_SC_NPROCESSORS_CONF);
    // Allocate the header plus one cache-line-aligned cell per processor.
    if (posix_memalign((void**)&mtx, CACHE_LINE_SIZE,
            sizeof(distr_rw_mutex_t)
            + proc_count * sizeof(distr_rw_mutex_cell_t)))
        return 1;
    mtx->proc_count = proc_count;
    for (i = 0; i != proc_count; i += 1)
    {
        if (pthread_rwlock_init(&mtx->cell[i].mtx, 0))
        {
            // Roll back the mutexes initialized so far and fail.
            while (i --> 0)
                pthread_rwlock_destroy(&mtx->cell[i].mtx);
            free(mtx);
            return 1;
        }
    }
    *mtx_p = mtx;
    return 0;
}

int distr_rw_mutex_destroy (distr_rw_mutex_t* mtx)
{
    int i;
    for (i = 0; i != mtx->proc_count; i += 1)
        pthread_rwlock_destroy(&mtx->cell[i].mtx);
    free(mtx);
    return 0;
}

The write lock/unlock functions merely lock/unlock all the mutexes; not much to comment on here:

int distr_rw_mutex_wrlock (distr_rw_mutex_t* mtx)
{
    int i;
    for (i = 0; i != mtx->proc_count; i += 1)
        pthread_rwlock_wrlock(&mtx->cell[i].mtx);
    return 0;
}

int distr_rw_mutex_wrunlock (distr_rw_mutex_t* mtx)
{
    int i;
    for (i = 0; i != mtx->proc_count; i += 1)
        pthread_rwlock_unlock(&mtx->cell[i].mtx);
    return 0;
}

The read lock function obtains an [approximation of the] current processor, memorizes it, and locks the respective mutex in shared mode. The read unlock function just unlocks the same mutex. Note that the unlock function can't re-obtain the current processor number and use it; it must use the processor number obtained by the lock function (because the thread may have migrated to another processor in between):

int distr_rw_mutex_rdlock (distr_rw_mutex_t* mtx, int* proc)
{
    // Memorize the (approximate) current processor for the matching unlock.
    *proc = sched_getcpu();
    pthread_rwlock_rdlock(&mtx->cell[*proc].mtx);
    return 0;
}

int distr_rw_mutex_rdunlock (distr_rw_mutex_t* mtx, int proc)
{
    // Unlock the cell chosen at lock time, even if the thread has migrated.
    pthread_rwlock_unlock(&mtx->cell[proc].mtx);
    return 0;
}
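
To make the contract concrete, here is a hypothetical usage sketch (the protected data, the thread functions and their names are illustrative, not part of the mutex API):

distr_rw_mutex_t* mtx;
int shared_data [4]; // some data protected by the mutex (illustrative)

void* reader (void* arg)
{
    int proc;
    distr_rw_mutex_rdlock(mtx, &proc);
    // ... read shared_data ...
    // Pass back the processor index memorized by the lock call.
    distr_rw_mutex_rdunlock(mtx, proc);
    return arg;
}

void* writer (void* arg)
{
    distr_rw_mutex_wrlock(mtx);
    // ... mutate shared_data ...
    distr_rw_mutex_wrunlock(mtx);
    return arg;
}

int main (void)
{
    if (distr_rw_mutex_create(&mtx))
        return 1;
    // ... start reader/writer threads and join them ...
    return distr_rw_mutex_destroy(mtx);
}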

Performance

In order to verify performance and scalability, I benchmarked the mutex against pthread_rwlock_t. The benchmark is very simple: 1 reader-writer mutex, an array of N ints (the data), and P worker threads. Each worker thread constantly acquires the mutex in shared mode and verifies the data's consistency. Periodically, each worker thread acquires the mutex in exclusive mode and mutates the data. The benchmark was executed on a 4-processor x 4-core AMD machine (16 hardware threads in total) running Linux 2.6.29.
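
The worker loop itself is not shown here, but based on the description above it is roughly of the following shape (a sketch only; the parameter names, the iteration count and the concrete consistency check are my assumptions, the actual benchmark is available for download below):

#include <assert.h>

// Assumed benchmark parameters and shared state (illustrative):
static int N;                     // data size: 4 or 256 in the runs below
static int write_period;          // 10, 50, 100, 500, 1000 or 10000
static int iter_count;            // iterations per worker thread
static int* shared_data;          // the array of N ints protected by the mutex
static distr_rw_mutex_t* g_mtx;   // the mutex under test

void* bench_worker (void* arg)
{
    int iter, i, proc;
    for (iter = 1; iter <= iter_count; iter += 1)
    {
        if (iter % write_period == 0)
        {
            // Exclusive section: mutate the data.
            distr_rw_mutex_wrlock(g_mtx);
            for (i = 0; i != N; i += 1)
                shared_data[i] = iter;
            distr_rw_mutex_wrunlock(g_mtx);
        }
        else
        {
            // Shared section: verify that the data is consistent,
            // i.e. that all N ints hold the same value.
            distr_rw_mutex_rdlock(g_mtx, &proc);
            for (i = 1; i != N; i += 1)
                assert(shared_data[i] == shared_data[0]);
            distr_rw_mutex_rdunlock(g_mtx, proc);
        }
    }
    return arg;
}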

In the first run I set N=4 and vary the write period as 10, 50, 100, 500, 1000 and 10000:

And below is the same graph, but without the lines for distributed (500, 1000 and 10000):

In the second run I set N=256; the same two graphs are below:

So, what do we see on the graphs? Our distributed mutex is somewhat (10-60%) slower in the uncontended case (note that the 60% slowdown refers to the extreme case of a 10% write rate plus basically no useful work). pthread_rwlock_t is completely non-scalable under load, even on read-mostly workloads (however, we see a slight attempt to scale with N=256 on 2 threads). Our distributed mutex scales much better; with a 1/10000 write rate it exhibits perfect linear scaling.

Note that the fact that we use the current processor number for read acquisition is crucial, because performance-wise per-processor data is basically equal to per-thread data (a processor runs one thread at a time). I've also benchmarked a randomized variant of the distributed mutex (it uses per-thread random number generators to choose a mutex for read acquisition), and I tried to create the best possible conditions for it: I set the data size N to 256 and increased the number of underlying reader-writer mutexes 4-fold. The benchmark showed that it scales better than a centralized mutex, but still far worse than the per-processor mutex (the write rate is shown in brackets):
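
For reference, the read-lock path of such a randomized variant might look roughly like this (a sketch; the rand_r-based per-thread seeding and the function name are my assumptions, and here mtx->proc_count is taken to hold the enlarged cell count rather than the processor count):

#include <stdint.h>

// Per-thread RNG state for choosing a cell (gcc __thread extension).
static __thread unsigned rng_seed;

int distr_rw_mutex_rdlock_rand (distr_rw_mutex_t* mtx, int* cell)
{
    if (rng_seed == 0)
        rng_seed = (unsigned)(uintptr_t)&rng_seed | 1; // crude per-thread seed
    // Pick a random cell instead of the current processor's cell.
    *cell = rand_r(&rng_seed) % mtx->proc_count;
    pthread_rwlock_rdlock(&mtx->cell[*cell].mtx);
    return 0;
}

The matching unlock is identical to distr_rw_mutex_rdunlock: it simply unlocks the memorized cell.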

The bottom line is that the implementation is very simple and comprehensible, performance is somewhat worse than pthread_rwlock_t, while scalability is significantly improved. The mutex can be used whenever you have a high read load and a low write-to-read ratio (roughly below 1-5%).

You can download the implementation along with the benchmark below (gcc/Linux).