It's not surprisingly that after all optimizations the program hit The Memory Bandwidth Wall, i.e. became memory bandwidth bound. A program is memory bandwidth bound if it consumes all available memory bandwidth and the bandwidth becomes scalability limiting factor. (for details see Detecting Memory Bandwidth Saturation in Threaded Applications chapter from Intel Guide for Developing Multithreaded Applications)
The memory bandwidth problem is not so of a problem in the single-threaded world, because the program still processes gigabytes of data per second and is blazingly fast. But it's a serious problem in the multi-threaded world, because the program does not run any faster when additional threads added.
There is not much one can do when a program is memory bandwidth bound (except upgrading hardware of course). The only thing software developer can do about it is to reduce amount of input and/or output data. In this problem we are concerned only with output, and fortunately it's format is not fixed. So I tried to reduce size of output data as much as possible.
The first natural thing to do is to encode each move as 2 bits (instead of a whole byte). This reduces output 4 times.
The next thing I done is elimination of delimiters between paths. Since I generate paths based on path class, and length of paths in each class is constant and known, delimiters are needless.
Those measures did not help with fighting the memory bandwidth problem. It's easy to observe that 2 bits per move is not optimal packing (2 bits can represent 4 states, while we use only 3), and that it's possible to pack 5 moves into a byte (3^5 <= 2^8). So I switched to “5 moves per byte” encoding:
byte = move0*3^0 + move1*3^1 + move2*3^2 + move3*3^3 + move4*3^4
Which is quite optimal because it uses 243 states out of 256 possible states (95% efficiency). For the problem of size (11,11) there are 45'046'719 possible paths, and my program uses 184'257'251 bytes for output, which is 4.09 bytes per path on average.
The program does speedup proportionally to decrease of output data size. However it's still memory bandwidth bound, i.e. execution time depends on total system's memory bandwidth rather than on processor/core count.
Next page: Performance Results