Wide Finder 2
This is a write-up of my Wide Finder 2 entry. Here is the official problem statement for the Wide Finder 2 benchmark, here is the infrastructure description, and here is the results table.
As of now, my parallel entry is #1 performance-wise. I've also submitted a single-threaded entry, which is of less interest; however, it provides a distinctive point of reference, so to speak. You can download both entries at the bottom of the page.
First of all, you need to read the Parallel Disk IO article, which describes the parallel IO subsystem that I used in the entry. I've split it out into a separate article because I think it's interesting in its own right.
So, the parallel IO subsystem feeds the rest of the system, in parallel, with fixed-size data blocks (in the final run I used 8MB blocks). The blocks are not yet split into lines, and, more importantly, they contain line fragments at the borders.
I've tried several solutions for combining line fragments, and finally came up with the following one. Since the file size and the block size are known, the number of line fragments is also known in advance. So I create a global combining array of pointers, with one cell for each pair of line fragments. The array is initialized with NULLs. When a thread wants to combine a line fragment, it checks the appropriate cell in the array. If the cell is NULL, the thread tries to CAS it from NULL to its own line fragment. If the CAS succeeds, the thread has successfully offloaded the obligation to combine the line. If the cell is not NULL, or the CAS fails, the thread takes the line fragment from the cell and combines it with its own fragment:
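Here is a minimal sketch of the scheme in C++. The type and function names are mine, not the actual entry's code, and std::string stands in for the real fragment buffers:

    #include <atomic>
    #include <cstddef>
    #include <string>
    #include <vector>

    // One partial line at a block border. `first_half` is true if the
    // fragment is the part that comes before the border in the file.
    struct fragment_t {
        std::string bytes;
        bool first_half;
    };

    // One cell per block border; for a file of N blocks there are N-1
    // borders. Allocated at startup with all cells zeroed (NULL), e.g. as
    // std::vector<std::atomic<fragment_t*>>(border_count).
    std::vector<std::atomic<fragment_t*>> g_cells;

    // Called by a worker holding fragment `mine`, adjacent to border
    // `border`. Returns true and fills `line` if this thread arrived
    // second and must process the combined line itself; returns false
    // if the combining obligation was offloaded to the peer thread.
    bool combine_fragment(size_t border, fragment_t* mine, std::string& line) {
        fragment_t* expected = nullptr;
        if (g_cells[border].compare_exchange_strong(expected, mine))
            return false;  // we published first: the peer combines the line
        // The peer's fragment is already in the cell; `expected` now
        // points to it. Concatenate the two halves in file order.
        fragment_t* peer = expected;
        line = mine->first_half ? mine->bytes + peer->bytes
                                : peer->bytes + mine->bytes;
        return true;
    }

Whichever thread reaches a border second completes the line on the spot, which is exactly what makes the scheme synchronous.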
For a 45GB file and 8MB blocks there are ~5.6K cells; on 64-bit platforms the array occupies only ~45KB of memory. What I like about this solution is that it is synchronous, that is, the second thread always finishes the line instantly. So there is no excessive buffering and no deferred asynchronous work. By the way, buffers for line fragments are cached and reused on a per-thread basis.
After line fragment combining, a thread scans through the data block, detects line boundaries, parses lines, checks the regular expressions, and finally collects the statistics. For statistics collection I use per-thread hash maps, which, along with per-thread memory allocators, allow worker threads to work completely independently (which is a very good thing).
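In outline, the per-block loop looks something like the sketch below. match_line() is a hypothetical stand-in for the actual parsing and regex-matching code, std::unordered_map stands in for the custom per-thread hash map, and only one of the benchmark's statistics maps is shown:

    #include <cstring>
    #include <optional>
    #include <string>
    #include <string_view>
    #include <unordered_map>

    // Per-thread statistics. The benchmark collects several maps;
    // a single one is shown here for brevity.
    struct thread_stats_t {
        std::unordered_map<std::string, long> hits;
    };

    // Hypothetical: parses a log line, applies the benchmark's patterns,
    // and returns the key to count (or nothing if the line doesn't match).
    std::optional<std::string_view> match_line(std::string_view line);

    void process_block(const char* data, size_t size, thread_stats_t& stats) {
        const char* pos = data;
        const char* end = data + size;
        while (pos < end) {
            const char* nl =
                static_cast<const char*>(std::memchr(pos, '\n', end - pos));
            if (nl == nullptr)
                break;  // trailing fragment: handled by the combining array
            std::string_view line(pos, nl - pos);
            if (auto key = match_line(line))
                stats.hits[std::string(*key)] += 1;
            pos = nl + 1;
        }
    }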
That's basically everything interesting about the processing. When the file is completely processed, worker threads merge their private hash maps with statistics into the global hash maps (which are mutex-protected). When all worker threads are done, the main thread finds the top 10 entries in each global hash map (yes, this part is not parallelized); it's done in a single pass by bubbling each entry up in a separate 10-entry array. Parallelizing this partial sort could potentially win us a few more seconds.
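A sketch of the final aggregation, with the same stand-in map type as above and hypothetical names:

    #include <mutex>
    #include <string>
    #include <unordered_map>

    std::mutex g_stats_mutex;
    std::unordered_map<std::string, long> g_stats;

    // Each worker folds its private map into the global one under a mutex;
    // this happens once per thread, so contention is negligible.
    void merge_stats(const std::unordered_map<std::string, long>& local) {
        std::lock_guard<std::mutex> lock(g_stats_mutex);
        for (const auto& kv : local)
            g_stats[kv.first] += kv.second;
    }

    struct top_entry_t {
        const std::string* key;
        long count;
    };

    // Single pass over the map: each entry is bubbled up into a small
    // sorted 10-entry array (descending by count), so the partial sort
    // costs one map traversal plus at most 10 shifts per entry.
    void find_top10(const std::unordered_map<std::string, long>& stats,
                    top_entry_t top[10]) {
        for (int i = 0; i != 10; ++i)
            top[i] = {nullptr, -1};
        for (const auto& kv : stats) {
            if (kv.second <= top[9].count)
                continue;  // can't make it into the top 10
            int pos = 9;
            while (pos > 0 && top[pos - 1].count < kv.second) {
                top[pos] = top[pos - 1];
                --pos;
            }
            top[pos] = {&kv.first, kv.second};
        }
    }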
The results are as follows. The program handles a 45GB, 218-million-line logfile in 3 minutes 11 seconds; the total consumed CPU time is 16:54, that is, ~5.5 hardware threads are busy on average (however, in reality there is a crucial difference between the processing of cached and uncached parts of the file). The average processing speed is 235MB/s (while the disks are able to provide only 150MB/s; once again, this is due to file caching). The entry contains ~1345 LOC, of which 926 are reusable code (the parallel IO subsystem and the hash map) and 419 are application-specific. The program contains 17 times more LOC than the initial Ruby program, and is 479 times faster.