Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu

Background

In a Single Instruction, Multiple Data (SIMD) architecture, all threads in the same warp execute the same instruction at any given time. However, these threads may take different control-flow paths depending on their data. The SIMD hardware copes with such divergent paths by executing the same instruction stream in multiple passes, one per path, masking off the threads that did not take that path. For example, an if-then-else construct can send the threads of a warp down two different paths. The same extra cost is incurred by other constructs, such as for-loops or decision conditions on threadIdx values.

An Example of a Thread Divergence Problem: Reduction Algorithm

A reduction algorithm computes a single value from an array of values. You will encounter this kind of algorithm when computing a summary, such as the average or the sum of the given values, or when performing tournament selection. Although a sequential reduction runs efficiently, in time linear in the number of elements, parallel execution is still preferable when the number of elements is large.

Consider the following kernel fragment for sum reduction, which wastes at least half of the threads in a block.

Algorithm 1

extern __shared__ float partialSum[];

unsigned int t = threadIdx.x;

for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();                             // wait for the previous iteration's writes
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride]; // absorb the neighbour on the right
}

The trace below shows partialSum over the iterations, assuming the 12 initial values are padded with zeros up to a block size of 16. Iteration 0 shows the initial values; each later row shows only the values updated in that iteration.

Iter  stride | Thread:  0   1   2   3   4   5   6   7   8   9  10  11
  0      -   |          1   4   2   2   4   2   5   2   3   1   4   6
  1      1   |          5       4       6       7       4      10
  2      2   |          9              13              14
  3      4   |         22                              14
  4      8   |         36

Given N elements, the total number of additions is N/2 + N/4 + N/8 + … + 1 = N − 1, i.e. O(N) work, and the loop completes in log2(N) iterations.

The above algorithm suffers from thread divergence. From the first iteration of the loop onward, threads in the same warp take different paths through the if statement because of the condition on threadIdx.x. In the worst case, each iteration requires two passes for the threads in the same warp.

Solution

Algorithm 2

extern __shared__ float partialSum[];

unsigned int t = threadIdx.x;

for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    __syncthreads();                             // wait for the previous iteration's writes
    if (t < stride)
        partialSum[t] += partialSum[t + stride]; // absorb the neighbour on the right
}

The trace below uses the same 12 initial values, again zero-padded to a block size of 16. Iteration 0 shows the initial values; each later row shows only the values updated in that iteration.

Iter  stride | Thread:  0   1   2   3   4   5   6   7   8   9  10  11
  0      -   |          1   4   2   2   4   2   5   2   3   1   4   6
  1      8   |          4   5   6   8   4   2   5   2
  2      4   |          8   7  11  10
  3      2   |         19  17
  4      1   |         36

An if-then statement still remains in the loop, yet we claim there is a performance difference between Algorithm 1 and Algorithm 2. The hint lies in the positions of the threads that execute different control paths.

Suppose the warp size is 4 (it is 32 in reality). In iteration 1, all the threads in warps 0-1 (threads 0-7) take the branch and all the threads in warps 2-3 (threads 8-15) skip it, so there is no thread divergence within any warp. The same holds in iteration 2. However, once the number of active threads drops below the warp size, thread divergence reappears, in iterations 3 and 4. When the problem is large, the iterations with thread divergence account for only a small constant number, namely log2(warp size): ideally, divergence occurs only after the first log2(blockDim.x) - log2(warp size) iterations.

Conclusion

In this article, we studied how thread divergence affects the design of algorithms running on a GPGPU. The design of an algorithm needs to match the underlying hardware architecture in order to fully utilise the computational resources.