[This is my write-up for Intel Threading Challenge 2009. You may download the code for Microsoft Visual C++ at the bottom of the page]
Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers representing the two (x,y,z) endpoints.
As a base algorithm I choose simple exhaustion of segment pairs: void solve_segment_intersection(std::vector<segment_t> const& segments, std::vector<intersection_t>& results) { for (size_t i = 0; i != segments.size(); i += 1) { for (size_t j = i + 1; j != segments.size(); j += 1) { if (is_intersect(segments[i], segments[j])) results.push_back(intersection_t(segments[i], segments[j])); } } } Computational complexity of such algorithm is O(N^2 / 2). It's quite high complexity, and there are known algorithms with complexity O((N+K)*logN) [1] and even O(N*logN + K) [2] (K - number of intersections). However I decide to not implement these theoretically more efficient algorithms because of the significantly higher implementation complexity and the fact that they are much harder to parallelize efficiently and less amenable to low-level optimizations such as vectorization, branch elimination, cache-conscious memory access patterns, etc. In order to reduce computational complexity I apply sorting of segments by X coordinate with following bounding of exhaustion. If segments are sorted by X1 coordinate (it's assumed that X1 <= X2), then once we encounter segment with X1 greater then X2 coordinate of original segment we may not consider all subsequent segments (they are not overlapping with original segment by X coordinate): void solve_segment_intersection(std::vector<segment_t>& segments, std::vector<intersection_t>& results) Now computational complexity is reduced to O(N*M), where M is some constant dependent on input data ("how many segments are overlapping by X coordinate"). So in best case complexity is O(N) now (if no segments are overlapping). In worst case complexity is still O(N^2 / 2).
Precise calculation of is_intersect predicate is computationally hard (contains considerable amount of multiplication operations, conditional branching, etc). Bounding box is a simple optimization which determines evidently non intersecting segments. Bounding box optimization is based on the following observation - if 2 segments are intersecting in 3D space then their projections to coordinate axises are intersecting by pairs too. Thus, if projections to at least one axis are not intersecting then segments itself are not intersecting in 3D space: If ((s1.x2 < s2.x1) or (s2.x2 < s1.x1) or (s1.y2 < s2.y1) or (s2.y2 < s1.y1) or (s1.z2 < s2.z1) or (s2.z2 < s1.z1)) Experimentation shows that for randomly generated data bounding box technique detects quite significant amount of evidently non intersecting segments. Here is the code: void solve_segment_intersection(std::vector<segment_t>& segments, std::vector<intersection_t>& results) { std::sort(segments.begin(), segments.end(), sort_by_x1); for (size_t i = 0; i != segments.size(); i += 1) { for (size_t j = i + 1; j != segments.size(); j += 4) // note that increment is 4 { if (segments[j].x1 > segments[i].x2) break;
// only not calculate precise predicate if (is_intersect(segments[i], segments[j])) results.push_back(intersection_t(segments[i], segments[j])); } } }
Bounding box verification can be further optimized with SSE vector operations [3]. First gain comes from the fact that SSE vector operations can process up to 4 pairs of 32-bit integers at a time. Second gain comes from then fact that SSE extensions contain powerful operations that can find minimum/maximum of 2 values in streamlined fashion (w/o conditional branching, just single machine instruction). Note that in order to apply vector operations data structures have to be converted from AoS (array of structures) representation to SoA (structure of arrays) representation, i.e. following straightforward representation of array of segments: struct segment_t { int x1, x2, y1, y2, z1, z2; }; typedef std::vector<segment_t> segments_t; have to be converted to following SoA representation: struct segments_t;
After such conversion we may use SSE vector operations. Here is a bit simplified code (it uses Intel C++ compiler intrinsics): void solve_segment_intersection(std::vector<segment_t>& segments, std::vector<intersection_t>& results) Note that there is only a handful of lightweight instructions (no multiplications, divisions, branching) before final _mm_test_all_ones() test which throws away 4 segments at a time (assumed to be common case for randomly generated data). And only if the test fails we consider each pair of segments individually. Is_intersecting predicate will be computed only for segment pairs for which bounding box says that segments are possibly intersecting.
As a tool for parallelization I choose Intel Threading Building Blocks (TBB) library which provides handy and flexible abstraction of lightweight tasks. Here is the high-level scheme of parallelization (each rectangle represents a task, rectangles situated on the same horizontal level may be executed in parallel):
On the start phase I analyze command line parameters, read number of segments, decide on number of worker threads and start input tasks. Number of worker threads is crucial aspect for performance. Too high number of threads will introduce unjustified latency for small inputs, because of the overheads related to worker thread starting/stopping, work distribution, aggregation of results and synchronization. I've made a number of tests and set up minimum number of segments per worker thread to 1000. I.e. if there is less than 1000 segments all work will be performed by single thread, if there is 1000-2000 segments all work will be performed by 2 worker threads, and so on. I choose number 1000 based on the following equation. Execution time for 1000 segments for 1 thread is roughly equal to that for 2 threads, i.e. at this point parallelization related speedup starts outweighing parallelization related overheads. Maximum number of worker threads is bounded by number of available execution units (processors, cores, HT threads). Higher number of threads is senseless for CPU-bound tasks (as opposed to IO-bound tasks), because will only cause additional overheads related to context switching.
Each input task is supplied with it's own piece of input file (input file is equally partitioned between tasks). First of all input task finds nearest to the piece begin boundary of segment description (using '\r' and '\n' symbols as markers). Then it parses input file until piece end, stores segment descriptions to array and collects some statistics.
On sorting phase I just use TBB's standard tbb::parallel_sort() algorithm. As will be seen in the Performance section it achieves linear scalability and provides single-threaded performance similar to that of std::sort() algorithm. So no need to re-invent the wheel.
In order to parallelize intersection testing segment pairs must be somehow partitioned to independent groups. Since we have 2 nested loops - outer "i" loop and inner "j" loop - and all iterations of both loops are independent (no data dependencies between them) we may choose either of them as a source of partitioning. However golden rule of parallelization says:
Parallelization on highest possible level tends to "distribute" threads from each other and give each thread bigger piece of independent work, thus reducing work distribution and synchronization overheads. Parallelization of outer loop may be infeasible/impossible for some reasons. For example, if there is just too small number of iterations (less than number of threads), or if there are data dependencies between iterations (calculations on i+1 iteration depend on results of i-th iteration). However it's not the case for our problem, so I choose partitioning of outer "i" loop. Initially I created one task per worker thread and equally divided iteration space among them, however it turns out that this way amount of work per task (thread) may be quite unbalanced. If segments are sorted by X coordinate ascending then first segment must be verified for intersection potentially with all other segments, while last segment must not be verified at all. In order to overcome this I create larger number of tasks which is determined by the following formula:
min_number_of_segments_per_task is set to 1000 (see section 3.2. above). surplus_factor is set to 64 and determines number of tasks per worker thread required to achieve sufficiently good load balancing.
I decide to not parallelize output phase because number of intersections is expected to be small. However if number of intersections is expected to be very large following parallelization strategy can be applied. Memory mapping of sufficient size is opened for output file. Private buffer able to hold, let's say, 1000 intersections is allocated for every worker thread. Worker threads collect results in these private buffers. When a thread's buffer becomes full, thread reserves a range of output file with single atomic increment operation on file_end_marker variable, and then flushes private buffer to file. When thread finishes work it has to flush it's partially full private buffer to file too. This way each thread will output it's own hot-in-cache results, and output will be evenly distributed and parallelized.
I used 2 different data generation algorithms for performance testing. First algorithm generates large segments with large coordinates domain. Here is the results I got on Intel Core2 Duo P9500 (2.5GHz, 6MB L2 cache, 1GHz memory bus) for 30'000 segments with coordinates randomly generated in the range 0 ... 10^6 (number of intersections is 0):
Second algorithm generates small segments with small coordinates domain. Here is the results I got for 400'000 segments of maximum length 40 (in each direction), all coordinates are in range 0...400 (~13K intersections):
As on can see, parallelized phases achieve nearly linear speedup. There is some fixed deviation from the linear scaling probably caused by thread management, work distribution and synchronization overheads. Execution time of non-parallelized output phase is negligible. So I consider parallelization as successful in whole.
[1] Bentley-Ottmann algorithm. http://en.wikipedia.org/wiki/Bentley%E2%80%93Ottmann_algorithm [2] Linear-Time Algorithms for Geometric Graphs with Sublinearly Many Crossings. http://www.siam.org/proceedings/soda/2009/SODA09_018_eppsteind.pdf |