Parallel Disk IO

Below is a description of the parallel disk IO subsystem that I used in my Wide-Finder 2 entry; the solution is fairly general, though, and highlights the main principles. The task is to read and process a 45 GB webserver logfile (218 million lines). The machine is a Sun Fire T2000 with an UltraSPARC T1 processor (32 hardware threads) and 32 GB RAM, running Solaris 10U4. The IO subsystem is 4 "pretty fast disks" with a single ZFS filesystem.

First of all, we need to determine the most efficient OS mechanism for disk IO; it will significantly affect the programming model as well. I've considered plain blocking read(), mmap() and aio. In single-threaded mode aio gives the best results (not surprisingly, because it allows interleaving IO and processing). However, in multithreaded mode plain old blocking read() (pread64()) turned out to be the fastest. Memory-mapped IO is slower in both cases (other people have come to the same conclusion: here and here).
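The reason blocking positional reads compose so well with threads can be shown with a small sketch. This is an illustration in Python rather than the original entry's code: `os.pread` plays the role of pread64(), and the file layout, block size and thread count are made-up values. Each thread reads at its own offset on a shared descriptor, with no shared file position to seek and no coordination beyond splitting the offset range:

```python
# Sketch: positional reads on a shared fd. Each thread issues blocking
# pread-style reads at independent offsets; the kernel can service them
# in parallel. BLOCK and the thread count are illustrative, not tuned.
import os
import threading

BLOCK = 4096  # illustrative read granularity

def read_range(fd, offset, length, out, idx):
    """Read [offset, offset+length) into out[idx] using positional reads."""
    chunks = []
    while length > 0:
        # os.pread() does not move the shared file position, so any number
        # of threads can read the same fd concurrently without locking.
        data = os.pread(fd, min(BLOCK, length), offset)
        if not data:  # EOF
            break
        chunks.append(data)
        offset += len(data)
        length -= len(data)
    out[idx] = b"".join(chunks)

def parallel_read(path, nthreads=4):
    """Read a whole file with nthreads concurrent positional readers."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        per = (size + nthreads - 1) // nthreads  # bytes per thread
        out = [b""] * nthreads
        threads = [
            threading.Thread(
                target=read_range,
                args=(fd, i * per, min(per, max(0, size - i * per)), out, i))
            for i in range(nthreads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return b"".join(out)
    finally:
        os.close(fd)
```

The same structure carries over to C with pread64() on Solaris: one descriptor, many threads, each with a private offset.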

So, how can we approach it? The first naive approach is to organize a pipeline with 2 stages: reading and processing:

In order to reliably prevent resource under-subscription, we need PROCESSOR_COUNT threads for each stage. This model has the following drawbacks:

- It will lead to CPU over-subscription if the file is cached

- It may lead to disk over-subscription

- All data blocks are passed via FIFO queue, which is a perfect way to cool data in cache

- If processing can't keep up with reading, the queue will almost always be full and will consume a substantial amount of memory

- We need to create more threads than are ever required, and excessive reading threads won't be parked
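A minimal sketch of this naive two-stage pipeline, again in Python with made-up names and sizes: PROCESSOR_COUNT reader threads push blocks into a bounded FIFO queue, and PROCESSOR_COUNT processing threads pop them. The structure makes the drawbacks visible: up to 2x CPU over-subscription when the file is cached, and a queue that sits full (cooling the data it holds) whenever processing lags:

```python
# Sketch of the naive two-stage pipeline: readers -> FIFO queue -> processors.
# PROCESSOR_COUNT threads per stage, as described above; queue bound is a
# made-up tuning knob.
import os
import queue
import threading

PROCESSOR_COUNT = os.cpu_count() or 1
BLOCK = 1 << 16  # illustrative block size

def run_pipeline(path, process):
    size = os.path.getsize(path)
    offsets = queue.Queue()
    for off in range(0, size, BLOCK):
        offsets.put(off)
    # The FIFO block queue: every data block passes through here, which is
    # exactly the cache-cooling, memory-hungry buffer criticized above.
    blocks = queue.Queue(maxsize=4 * PROCESSOR_COUNT)

    fd = os.open(path, os.O_RDONLY)
    try:
        def reader():
            while True:
                try:
                    off = offsets.get_nowait()
                except queue.Empty:
                    return
                blocks.put((off, os.pread(fd, BLOCK, off)))

        def processor():
            while True:
                item = blocks.get()
                if item is None:  # sentinel: no more blocks
                    return
                process(*item)

        readers = [threading.Thread(target=reader) for _ in range(PROCESSOR_COUNT)]
        procs = [threading.Thread(target=processor) for _ in range(PROCESSOR_COUNT)]
        for t in readers + procs:
            t.start()
        for t in readers:
            t.join()
        for _ in procs:
            blocks.put(None)  # one sentinel per processor
        for t in procs:
            t.join()
    finally:
        os.close(fd)
```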

In order to mitigate some of these drawbacks, some contestants create a small fixed number of IO threads (8-12, which requires manual tuning); however, this leads to significantly suboptimal processing if the file is cached.

Another possible approach is to let each thread read and process blocks independently, below is a thread state machine:

In order to prevent resource under-subscription, this model requires PROCESSOR_COUNT + MIN_IO_THREADS threads in total. It mitigates over-subscription somewhat, and the system no longer consumes excessive amounts of memory for buffering. However, it still has some drawbacks:

- It over-subscribes either CPU or disk

- Excessive threads are not parked

and the most serious:

- Without buffering it will lead to temporary disk underutilization (no pending read requests) followed by CPU underutilization (all threads are blocked on IO)
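The independent-thread model can be sketched as follows (Python illustration; MIN_IO_THREADS and the block size are made-up values). Every thread runs the same state machine: claim the next unread block under a lock, issue a blocking read for it, then process the block it just read itself. There is no inter-thread block queue, so data stays with the thread that read it, but nothing prevents all threads from being blocked in the read step at once:

```python
# Sketch of the independent-thread model: PROCESSOR_COUNT + MIN_IO_THREADS
# identical threads, each looping through "claim offset -> read -> process".
import os
import threading

MIN_IO_THREADS = 4  # assumed minimum of in-flight reads (made-up value)
BLOCK = 1 << 16

def run_independent(path, process, nthreads=None):
    nthreads = nthreads or ((os.cpu_count() or 1) + MIN_IO_THREADS)
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    cursor = [0]            # next unclaimed file offset
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                        # state 1: claim the next block
                off = cursor[0]
                if off >= size:
                    return
                cursor[0] += BLOCK
            data = os.pread(fd, BLOCK, off)   # state 2: blocking read
            process(off, data)                # state 3: process our own block

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.close(fd)
```

Note how the data block never changes hands between threads, which is precisely what keeps it hot in cache; the price is the under/over-subscription oscillation described above.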

So, I am going to combine both approaches and take everything under manual control to prevent all kinds of undesirable behavior. But first things first: we need to determine the external factors we want to account for, and the good/bad behaviors we want to encourage/prevent.

External factors are:

- The file can be cached or not (or partially cached), if the file is cached then disk IO effectively becomes lightweight CPU processing

- Processing can be faster or slower than reading

- There can be other processes that contend for CPU and/or disk

- The hardware can be modestly parallel (1-2 cores), or highly parallel (dozens of cores)

- The disks can be HDD, SSD, RAID

What we want to encourage/prevent:

- Read the file sequentially to the extent possible - it's crucial for HDDs

- Always keep at least MIN_IO_THREADS pending reads. If the application is IO bound, we can't afford disk idling

- If the file is cached, occupy all otherwise idle processors with reading

- Do not over-subscribe processors

- Prefer processing over IO - and final stages of processing should be preferred over initial ones, because final stages produce useful work and relieve the system, while initial stages merely feed the system without producing any useful work. That is, initial stages are only the means to feed final stages.

- Prefer processing of a just-read block - do not let the data cool

- Prefer a processing stage right after completion of another processing stage - most likely it has some data hot in cache

- Do control overload - do not let an excessive amount of data blocks senselessly hang in memory

- Park/unpark threads in LIFO order, they may have various associated data hot in cache

- Buffer excess data blocks in FIFO order - otherwise we can stall the application's forward progress

- Tune automatically and dynamically to changing conditions (CPUs, disks and other processes come and go; parts of the file are cached or uncached), with hysteresis/sluggishness and a stable equilibrium
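The LIFO park/unpark point deserves a concrete sketch, since it is the opposite of what a plain FIFO wait queue gives you. Below is an illustrative Python class (names made up, not the actual entry's code): idle workers push themselves onto a stack and block on a per-thread event, and unpark wakes the most recently parked thread - the one whose stack and per-thread data are most likely still hot in cache:

```python
# Sketch of LIFO thread parking: the last thread to go idle is the first
# one woken, keeping its cached state as warm as possible.
import threading

class LifoParker:
    def __init__(self):
        self._lock = threading.Lock()
        self._stack = []  # most recently parked thread's event on top

    def park(self):
        """Block the calling thread until some thread unparks it."""
        ev = threading.Event()
        with self._lock:
            self._stack.append(ev)
        ev.wait()

    def unpark_one(self):
        """Wake the most recently parked thread; returns False if none."""
        with self._lock:
            if not self._stack:
                return False
            ev = self._stack.pop()  # LIFO: hottest thread first
        ev.set()
        return True
```

A FIFO queue of blocks plus a LIFO stack of threads is the combination argued for above: oldest data drains first, while the warmest thread does the draining.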

Move on to The Solution