Posts from the “CUDA” Category

Efficient Shared Memory Use

Thread Hierarchy

Programming Model

SM Implementation

Bank Conflict

Race Condition

Atomic Operation

A clarification of bank conflict in the slide: when two or more addresses of memory request from threads within the same warp are in the same bank, bank conflict occurs.

Radix Sorting in CUDA

1. Sequential

For the sequential radix sorting, deem the integers as binary bits may be simpler than decimal digits. It can be done by partitioning the 0s and 1s for every bit, so that each integer in the i^th-bit sorting is relocated based on the counts of 0 and 1. For example, to compute the new position of an integer in the i^th-bit sorting, if the i^th bit is 0, it should be placed after the previous integers with 0 for the i^th bit; otherwise if the i^th bit is 1, it should be placed after all the integers that have 0 for the i^th bit and the previous integers valued 1 in the i^th bit. The peudocode is shown below:

FOR the i^th bit (32 bits in total){

FOR the j^th integer I[j] (N in total) {

count the 1s before I[j] (exclude I[j]): T_before[j];


compute the count of 0s: F_total = N – T_before[N-1];

FOR the j^th integer I[j] {

IF the i^th bit is 1: new position p[j] = F_total + T_before[j];

ELSE: new position p[j] = j – T_before[j];



2. Coalesced

To coalesce the radix sorting in CUDA, every block should take a sequence of integers and sorts them in each bit loop. Within every loop, a global reordering is executed after the blocks finishing the sortings. The peudocode in the host function is shown below:

FOR the i^th bit (32 bits in total){

call the radix sorting kernel function;

device synchronize;

call the reordering kernel function;

device synchronize;


Memory Hierarchy Optimization in CUDA: Sobel Edge Detection

It is well known that reading data from Shared Memory/Registers is far more faster than from Global/Device Memory. The following figure is an illumination of Nvidia GPU Execution Model: Nv-GPU-exe-model

Figure 1. Nvidia GPU Execution Model.

In Sobel Edge Detection (using CUDA), the image buffer is first copied into Global Memory from Host Memory, then the processors compute the magnitude of each pixel that is used to determine an edge depending on a magnitude threshold. Sobel Edge Detection usually goes as the follow:


Figure 2. Sobel Edge Detection.

Use Global Memory

1. Description

For each thread

Compute the corresponding pixel’s coordinate

Read the eight neighbors’ values from Global Memory

Compute the magnitude of current pixel

Determine whether the current pixel is in an edge


2. Limitation

For any two pixels next to each other, there are six shared neighbors. This method will lead to redundant data transfers, because shared/reused neighbors are read multiple times from Global Memory. A better way to do this is creating tiled matrix in Shared Memory to reduce data transfers between Device and Global Memory.

Use Tiled Matrix in Shared Memory

1. Description

Image (here we use a matrix to represent): img

Figure 3. Image: black numbers are x and y indexes; green numbers are pixel values. 

Define a M*N tiled matrix, there will be (M-2)*(N-2) pixels’ magnitudes be calculated in each block:


 Figure 4.  A 5*3 Tiled Matrix: black numbers are x and y indexes; green numbers are pixel values. In this case, magnitudes of pixel ‘9’ ’10’ ’11’ are calculated . 

Block 1:


Block 2:


Block 3:


Block 4:


Block 5:


Block 6: