Creation of efficient scalable concurrent data structures is a kind of black magic. There is no recipes for that. What we should do is carefully analyze usage patterns for a data structure, and then try to satisfy all user requirements and scalability prerequisites with all possible means.
So what are the scalability prerequisites?
First, no mutexes on fast-paths ever (for slow-paths they are Ok, and even recommended because of the usage simplicity). There are several problems with mutexes. Mutexes limit and sacrifice concurrency to provide simplicity (anti-threads, sort of). Then, they provoke write-sharing (cache-coherence traffic) on every operation (even otherwise read-only). So, they just does not scale, forget about them.
Second, logically read-only operations must be implemented as a physically read-only operations. So, what does it mean? During logically read-only operation one should not do any single write to a shared memory location. Note that writes may be hidden inside of some component, in particular most of the reader-writer mutexes do writes to internal state in read_lock()/read_unlock(), such writes are no less harmful. Writes to a shared state provoke cache-coherence traffic, large amounts of which quickly brings any concurrent system to it's knees.
Due to specifics of implementation of cache-coherence in modern concurrent hardware (see MOESI protocol), reads to a shared state have 100% scalability (i.e. any number of threads can read from a memory location simultaneously); while writes to a shared state have zero scalability (i.e. at most 1 thread can write to a memory location at any given moment in time).
Third, no writes to a centralized shared state on fast-paths. Writes to a shared state are generally unavoidable for most concurrent data structures. However we can distinguish 4 kinds of a shared state for our needs:
Fourth, be aware of false sharing. Due to performance reasons cache-coherence protocols work with whole cache lines, rather than with separate bytes, words or C-language variables. I.e if two variables are contained within a single cache line, for the hardware they look like a single variable with all implications on scalability. So everything said above must be actually extended from distinct memory locations to cache lines. Size of a cache-line is architecture dependent, there are/was architectures with cache line sizes from 16 bytes to 4 kilobytes. However for modern Intel x86 processors (IA-32, Intel 64) cache-line size is fixed to 64 bytes, i.e. 16 consecutive words for IA-32 and 8 consecutive words for Intel 64.
Fifth, atomic RMW operations have some fixed associated costs. For modern Intel x86 processors cost of a single atomic RMW operation (LOCK prefixed instruction) is some 40 cycles (depends on a particular model, and steadily decreases). The cost comes mostly from frequently-unneeded embed full memory fence (so when Intel will separate atomicity and memory ordering, you may cross out this rule). However, the cost is fixed and does not affect scalability, so is far less important than above-outlined scalability-affecting points.
Of course, all well-known single-threaded optimization rules are still applicable too. So we may include into the fifth point also algorithmic optimality and general implementation quality.
If we summarize we get the following scalability mantra:
The most important aggregate metric for a concurrent data structure is a mean number of cache line transfers between threads per operation. All possible means must be employed to reduce the value as much as possible.
That's “why” and “what” need to be achieved. Compliance with the above guidelines ensures linear scalability if possible, and just maximum possible scalability otherwise. It's worth noting that some data structures inherently can't be implemented with the linear scalability, for example producer-consumer queue with the strict FIFO ordering requirement (it's 'strict FIFO ordering' part that is problematic, because it inherently requires communication between threads on every enqueue operation). Fortunately, a concurrent skip list is not one of them.
Move on to Skip List Design