So what primitives are in your arsenal for implementation of advanced synchronization algorithms?
Perhaps, it's the most famous primitive, it's other names are CAS, compare-and-exchange, compare-and-set, std::atomic_compare_exchange, InterlockedCompareExchange, __sync_val_compare_and_swap, LOСK CMPXCHG and other. It's an instance of so-called atomic RMW (read-modify-write) operation. It's pseudo-code is:
That is, it stores a new value (xchg) into a memory location only if it contains an expected value (cmp), in either case it returns a value that was stored in the location when the operation begins. And all that is done atomically on hardware level.
Also atomic RMW operation, and also conducted atomically in hardware. Aka atomic_fetch_add, InterlockedExchangeAdd, LOСK XADD. Below is the pseudo-code:
There are also variations like fetch-and-sub, fetch-and-and, fetch-and-or, fetch-and-xor.
Atomic RMW. Aka atomic_exchange, XCHG. Dead simple, but not less useful:
Atomic loads and stores
They are not RMW (read-modify-write) operations, they are just independent atomic loads and stores. They are frequently unfairly underestimated. However, they play fundamental role in synchronization algorithms, and they are what you should generally strive for - atomic loads and stores are better/cheaper/faster than atomic RMW operations.
Mutexes and the company
Why not? The most stupid thing one can do is try to implement everything in a non-blocking style (of course, if you are not writing infantile research paper, and not betting a money). Generally it's perfectly Ok to use mutexes/condition variables/semaphores/etc on cold-paths. For example, during process or thread startup/shutdown mutexes and condition variables is the way to go.
Now, let's move from the abstract world closer to the reality.
On Windows platform the primitives are available in the form of InterlockedXXX functions, and generally you should prefer Microsoft Visual C++ intrinsics starting from underscore _InterlockedXXX.
In GNU world they available in the form of gcc __sync intrinsics.
On Solaris they are available in the form of atomic_ops.
There is also quite portable libatomic_ops.
And hopefully all that zoo will be unified under umbrella of C1x/C++0x in the form of language-provided atomic primitives.
And in Java world there is java.util.concurrent.atomic.And in .NET world there is System.Threading.Interlocked.