The Solution

The idea behind the solution is that we have a set of interchangeable worker threads, and each thread decides for itself what is the optimal thing to do next. The possible actions/states are: read, process and block. Threads can also unblock other threads when required, and put data blocks to / get data blocks from a shared queue. The thread state machine looks as follows:
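A minimal sketch of the states and the allowed transitions (the state names here are mine, for illustration only, not from the actual code):

```cpp
// Hypothetical state names - an illustrative sketch, not the real code.
enum thread_state_t {
    state_read,     // reading the next data block from the file
    state_process,  // processing a block (just read or taken from the queue)
    state_block     // parked on the LIFO stack until another thread wakes us
};
// Transitions: READ    -> PROCESS | READ | BLOCK
//              PROCESS -> PROCESS | READ | BLOCK
//              BLOCK   -> READ
```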

The data block queue has high and low watermarks. If its size grows over the high watermark, the system enters an "overload" state. Then, when the size drops back to the low watermark, the system returns to the normal state.
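The hysteresis could look roughly like this (the watermark values are illustrative assumptions, not taken from the actual code):

```cpp
#include <cstddef>

const std::size_t HIGH_WATERMARK = 64;  // assumed value
const std::size_t LOW_WATERMARK  = 16;  // assumed value

// Called under the context mutex after every queue put/get.
// The flag is sticky: it turns on above the high mark and turns
// back off only once the queue has drained to the low mark.
void update_overload(std::size_t queue_size, bool& overload) {
    if (queue_size > HIGH_WATERMARK)
        overload = true;
    else if (queue_size <= LOW_WATERMARK)
        overload = false;
}
```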

After BLOCK a thread always transitions to READ.

READ is the initial state, because we want a "fast start". After READ, a thread transitions to PROCESS if the number of currently processing threads is less than the number of processors (and of course, in this case the thread processes the block it has just read). Otherwise, if (the number of reading threads < MIN_IO_THREADS) or (the number of reading threads < the number of processing threads), the thread returns to READ (and puts the just-read block into the queue). Otherwise, the thread transitions to BLOCK (and puts the just-read block into the queue).

After PROCESS, a thread returns back to PROCESS (taking a block from the queue) if (the queue is not empty) and (the number of currently processing threads is less than the number of processors). Otherwise, if (the number of reading threads < MIN_IO_THREADS) or (the number of reading threads < the number of processing threads), the thread transitions to READ. Otherwise, the thread transitions to BLOCK.

During the READ->PROCESS and PROCESS->PROCESS transitions, a thread decides to unblock another thread if ((the number of reading threads < MIN_IO_THREADS) or (the number of reading threads < the number of processing threads)) and (the system is not in the overload state) and (there are blocked threads).
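Putting the three transition rules together, the decision logic might look roughly like this (a sketch: the constant values are assumptions, and the actual queue put/get and park/unpark calls are elided):

```cpp
#include <cstddef>

const std::size_t MIN_IO_THREADS  = 2;  // assumed value
const std::size_t PROCESSOR_COUNT = 8;  // assumed value

enum action_t { action_read, action_process, action_block };

struct counters_t {
    std::size_t reading;     // threads currently in READ
    std::size_t processing;  // threads currently in PROCESS
    std::size_t blocked;     // threads parked on the LIFO stack
    std::size_t queue_size;  // data blocks in the FIFO queue
    bool        overload;    // watermark flag, see above
};

// True if the system is starved for readers.
bool need_more_readers(const counters_t& c) {
    return c.reading < MIN_IO_THREADS || c.reading < c.processing;
}

// Transition after READ; 'enqueue' tells the caller whether the
// just-read block goes to the queue or is processed by this thread.
action_t after_read(counters_t& c, bool& enqueue) {
    c.reading -= 1;
    if (c.processing < PROCESSOR_COUNT) {
        c.processing += 1;
        enqueue = false;                 // process the block we just read
        return action_process;
    }
    enqueue = true;                      // hand the block over to the queue
    if (need_more_readers(c)) {
        c.reading += 1;
        return action_read;
    }
    c.blocked += 1;
    return action_block;
}

// Transition after PROCESS; 'dequeue' tells the caller whether to
// take the next block from the queue.
action_t after_process(counters_t& c, bool& dequeue) {
    c.processing -= 1;
    if (c.queue_size != 0 && c.processing < PROCESSOR_COUNT) {
        c.processing += 1;
        dequeue = true;                  // continue with a queued block
        return action_process;
    }
    dequeue = false;
    if (need_more_readers(c)) {
        c.reading += 1;
        return action_read;
    }
    c.blocked += 1;
    return action_block;
}

// Checked on the READ->PROCESS and PROCESS->PROCESS transitions.
bool should_unblock_peer(const counters_t& c) {
    return need_more_readers(c) && !c.overload && c.blocked != 0;
}
```

Note that a thread adjusts the counters while still holding the mutex, so a concurrent decision by another thread always sees a consistent picture.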

On the implementation level there is a global context object that holds the current position in the file, counters of reading and processing threads, a LIFO stack of blocked threads, a FIFO data block queue and other required state. The object is protected with a spin mutex. When a thread needs to decide on its next action, it acquires the mutex, adjusts and analyzes the state, releases the mutex, and then blocks itself or unblocks other threads if required. Time spent under the mutex is negligible: just a few conditional branches, a few counter adjustments and a few pointer manipulations.
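A sketch of what such a context object and the locking pattern may look like (the field names and types are my guesses at the shape, not the actual implementation):

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <vector>

struct data_block_t;   // opaque here: a block of file data
struct thread_park_t;  // opaque here: a per-thread park/unpark primitive

// Global context, protected by a spin mutex.
struct global_context_t {
    std::atomic_flag            mutex = ATOMIC_FLAG_INIT; // spin mutex
    std::size_t                 file_pos   = 0; // next read offset in the file
    std::size_t                 reading    = 0; // counter of reading threads
    std::size_t                 processing = 0; // counter of processing threads
    std::vector<thread_park_t*> blocked;        // LIFO stack of parked threads
    std::deque<data_block_t*>   queue;          // FIFO queue of data blocks
    bool                        overload = false;
};

void lock(global_context_t& c) {
    // A plain spin is acceptable here: the critical section is only
    // a handful of branches and counter updates.
    while (c.mutex.test_and_set(std::memory_order_acquire)) {}
}

void unlock(global_context_t& c) {
    c.mutex.clear(std::memory_order_release);
}

// Usage pattern: decide under the mutex, park/unpark outside of it.
thread_park_t* pick_thread_to_wake(global_context_t& c) {
    thread_park_t* victim = nullptr;
    lock(c);
    if (!c.blocked.empty()) {        // LIFO: wake the most recently parked
        victim = c.blocked.back();
        c.blocked.pop_back();
    }
    unlock(c);
    return victim;                   // the caller unparks it after the lock
}
```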

Perhaps there are some minor details that I've missed in the description, but by and large that's how it works (check out the code as the best documentation). You may verify how the algorithm accounts for all the external factors, and how it encourages the good things and prevents the bad ones. The algorithm is not ideally optimal, because it's difficult to determine whether the file is cached or not, and to account for contention for the CPU/disk from other processes. However, it automatically tunes itself to near-optimal behavior.

To verify the behavior, I collected a profile during processing of a partially cached 45GB file - the first half is cached, and the second half is on disk:

where: 'active' is the sum of the reading and processing threads, 'block' is the number of blocked threads, 'read' is the number of reading threads, 'proc' is the number of processing threads, and 'cached' is the number of cached blocks in the queue.

As you may see, the behavior is quite steady and reasonable.

Here is the enlarged profile for the cached part. The system tries to maintain the following invariant: PROCESSOR_COUNT processing threads and at least MIN_IO_THREADS reading threads (the system has to maintain the reading threads because it doesn't actually know that the file is cached - it goes only by the relative processing/read speeds; this causes slight CPU over-subscription). Almost no data blocks are cached in the queue, which is good (what would we need them for?). No threads are blocked, which is expected - when the file is cached we can go at full speed.

And here is the enlarged profile for the uncached part. What do we see here? Most threads are parked, and since we park/unpark in LIFO order, they are "deeply" parked - good. The number of reading threads is maintained at the MIN_IO_THREADS level. There are just enough processing threads to handle the data at disk read speed. No cached blocks at all.

The bottom line. Do not create threads to perform specific functions; threads and functions are orthogonal - any thread can perform any function. Do not leave critical aspects of behavior to chance; instead, proactively force the desired behavior. Let each thread decide what is currently the most important/optimal thing to do next.

You may also want to check out the description of my Wide Finder 2 entry, which uses the described IO subsystem.

External links:

Multithreaded File I/O - the article shows the performance of multithreaded access to SATA, SCSI and RAID5 disks under various workloads.