For performance measurements and scalability estimation I used the Intel Parallel Universe service, which features two Intel Xeon X5570 packages (4 HT cores each, 16 hardware threads total). Here is summary report produced by the service for 6x5 game board:
As it can be seen, the program shows very good scalability of 13.38 on 16 hardware threads (8 physical cores). Speedup obtained due to HT technology (difference between 8 threads and 16 threads) is 1.86. That's quite high speedup for HT technology (reference numbers are 1.2-1.4), and it may be explained by a high amount of cache misses (linked list traversal inherently produces high amount of cache misses), which are efficiently neutralized by HT technology (pipeline stalls of two HT sibling threads are interleaved, so that execution units are busy most of the time).
The service also produces a more detailed report for The Intel Parallel Amplifier:
As it can be seen, concurrency is perfect – all bars are completely green (i.e. 16 threads are runnable all the time). ~50% of the execution time is consumed by the concurrent_skiplist::insert() function, which is completely expected. ~30% of the execution time is consumed by local state caching and filtration (remember_state()). And remaining 20% is consumed by game state exploration logic.
In general, I consider the concurrent skip list implementation and parallelization strategy as successful.
[You can download full source code on the first page]