It is well known that reading data from Shared Memory or registers is much faster than reading it from Global (Device) Memory. The following figure illustrates the Nvidia GPU execution model:

Figure 1. Nvidia GPU Execution Model.

In Sobel Edge Detection (using CUDA), the image buffer is first copied from Host Memory into Global Memory. The processors then compute the gradient magnitude of each pixel, and a pixel is classified as an edge when its magnitude exceeds a threshold. Sobel Edge Detection usually goes as follows:


Figure 2. Sobel Edge Detection.

Use Global Memory

1. Description

For each thread

Compute the corresponding pixel’s coordinate

Read the eight neighbors’ values from Global Memory

Compute the magnitude of current pixel

Determine whether the current pixel is on an edge by comparing its magnitude with the threshold
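The four steps above can be sketched as a CUDA kernel. This is a minimal sketch under stated assumptions, not the original implementation: the names `sobelGlobal` and `sobelGlobalHost`, the 8-bit grayscale row-major layout, the 16*16 thread block, the Euclidean magnitude, and the choice to mark border pixels as non-edges are all picked here for illustration. Error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// One thread per pixel; every neighbor is read directly from global memory.
__global__ void sobelGlobal(const unsigned char* in, unsigned char* out,
                            int width, int height, float threshold)
{
    // Step 1: compute the coordinate of this thread's pixel.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Assumption: border pixels lack a full neighborhood and are marked non-edge.
    if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
        out[y * width + x] = 0;
        return;
    }

    // Step 2: read the eight neighbors from global memory.
    float tl = in[(y - 1) * width + (x - 1)];
    float t  = in[(y - 1) * width +  x     ];
    float tr = in[(y - 1) * width + (x + 1)];
    float l  = in[ y      * width + (x - 1)];
    float r  = in[ y      * width + (x + 1)];
    float bl = in[(y + 1) * width + (x - 1)];
    float b  = in[(y + 1) * width +  x     ];
    float br = in[(y + 1) * width + (x + 1)];

    // Step 3: horizontal and vertical Sobel responses, then the magnitude.
    float gx = (tr + 2.0f * r + br) - (tl + 2.0f * l + bl);
    float gy = (bl + 2.0f * b + br) - (tl + 2.0f * t + tr);
    float mag = sqrtf(gx * gx + gy * gy);

    // Step 4: threshold the magnitude (255 = edge, 0 = no edge).
    out[y * width + x] = (mag > threshold) ? 255 : 0;
}

// Hypothetical host wrapper: copy the image to global memory, run the kernel, copy back.
void sobelGlobalHost(const unsigned char* hostIn, unsigned char* hostOut,
                     int width, int height, float threshold)
{
    size_t bytes = static_cast<size_t>(width) * height;
    unsigned char *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    sobelGlobal<<<grid, block>>>(dIn, dOut, width, height, threshold);

    cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dIn);
    cudaFree(dOut);
}
```

Note that each interior thread issues eight separate loads from global memory, which is the cost the rest of this post tries to reduce.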


2. Limitation

For any two horizontally or vertically adjacent pixels, the 3*3 neighborhoods overlap in six pixels. This method therefore leads to redundant data transfers, because the shared neighbors are read from Global Memory multiple times. A better way is to create a tiled matrix in Shared Memory, so that each pixel is fetched from Global Memory once per block and then reused by the neighboring threads.
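To get a rough feel for the redundancy, the sketch below counts pixel loads from Global Memory for both approaches. The helper names `naiveReads` and `tiledReads` are hypothetical, and the tiled count assumes that each block loads one full tile and computes only the pixels in its interior, as described in the next section. The count ignores the hardware caches, which in practice absorb part of the redundant traffic, so the numbers are an estimate of load requests rather than a measurement.

```cuda
// Hypothetical host-only helpers that count pixel loads from global memory.

// Global-memory approach: the thread of each interior pixel reads eight neighbors.
long long naiveReads(int width, int height)
{
    return 8LL * (width - 2) * (height - 2);
}

// Tiled approach: each block loads a full tileW x tileH tile once and computes
// the (tileW-2) x (tileH-2) pixels in its interior.
// Assumes width, height, tileW and tileH are all at least 3.
long long tiledReads(int width, int height, int tileW, int tileH)
{
    long long tilesX = (width  - 2 + tileW - 3) / (tileW - 2);  // ceil((width-2)/(tileW-2))
    long long tilesY = (height - 2 + tileH - 3) / (tileH - 2);  // ceil((height-2)/(tileH-2))
    return tilesX * tilesY * tileW * tileH;
}
```

For a hypothetical 1920*1080 image, `naiveReads` gives 16,540,832 loads, while `tiledReads` with an 18*18 tile gives 2,643,840, roughly a sixfold reduction under these assumptions.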

Use Tiled Matrix in Shared Memory

1. Description

The image, represented here as a matrix:

Figure 3. Image: black numbers are x and y indexes; green numbers are pixel values. 

Define an M*N tiled matrix. Each block then calculates the magnitudes of (M-2)*(N-2) pixels, because the outermost rows and columns of the tile serve only as neighbors:


Figure 4. A 5*3 Tiled Matrix: black numbers are x and y indexes; green numbers are pixel values. In this case, the magnitudes of pixels ‘9’, ‘10’, and ‘11’ are calculated.
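The tiling scheme can be sketched as a kernel in which each block first stages its M*N tile in Shared Memory and then computes only the interior pixels. This is a sketch under assumptions, not the original code: the names `sobelTiled` and `sobelTiledHost`, the template parameters `TILE_W` and `TILE_H` (standing in for M and N), the 8-bit grayscale row-major layout, the Euclidean magnitude, and zeroing the image border are all illustrative choices. The image is assumed to be at least 3*3, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Each block loads a TILE_W x TILE_H tile into shared memory and computes
// the (TILE_W-2) x (TILE_H-2) pixels in the interior of that tile.
template <int TILE_W, int TILE_H>
__global__ void sobelTiled(const unsigned char* in, unsigned char* out,
                           int width, int height, float threshold)
{
    __shared__ unsigned char tile[TILE_H][TILE_W];

    int tx = threadIdx.x, ty = threadIdx.y;
    // Adjacent tiles overlap by two columns/rows, so the tile origin advances
    // by the interior size rather than by the full tile size.
    int x = blockIdx.x * (TILE_W - 2) + tx;
    int y = blockIdx.y * (TILE_H - 2) + ty;

    // Each thread loads one pixel (zero where the tile extends past the image).
    tile[ty][tx] = (x < width && y < height) ? in[y * width + x] : 0;
    __syncthreads();  // the whole tile must be loaded before it is read

    // Only interior threads whose pixel has a full neighborhood compute.
    bool interior = tx >= 1 && tx <= TILE_W - 2 && ty >= 1 && ty <= TILE_H - 2;
    if (!interior || x >= width - 1 || y >= height - 1) return;

    // All eight neighbors now come from shared memory.
    float tl = tile[ty - 1][tx - 1], t = tile[ty - 1][tx], tr = tile[ty - 1][tx + 1];
    float l  = tile[ty][tx - 1],     r = tile[ty][tx + 1];
    float bl = tile[ty + 1][tx - 1], b = tile[ty + 1][tx], br = tile[ty + 1][tx + 1];

    float gx = (tr + 2.0f * r + br) - (tl + 2.0f * l + bl);
    float gy = (bl + 2.0f * b + br) - (tl + 2.0f * t + tr);
    float mag = sqrtf(gx * gx + gy * gy);

    out[y * width + x] = (mag > threshold) ? 255 : 0;
}

// Hypothetical host wrapper; one thread block per tile.
template <int TILE_W, int TILE_H>
void sobelTiledHost(const unsigned char* hostIn, unsigned char* hostOut,
                    int width, int height, float threshold)
{
    size_t bytes = static_cast<size_t>(width) * height;
    unsigned char *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);  // border pixels are never written by the kernel

    const int innerW = TILE_W - 2, innerH = TILE_H - 2;
    dim3 block(TILE_W, TILE_H);
    dim3 grid((width - 2 + innerW - 1) / innerW, (height - 2 + innerH - 1) / innerH);
    sobelTiled<TILE_W, TILE_H><<<grid, block>>>(dIn, dOut, width, height, threshold);

    cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dIn);
    cudaFree(dOut);
}
```

A call such as `sobelTiledHost<5, 3>(img, out, width, height, 100.0f)` uses the 5*3 tile of Figure 4. A 5*3 tile is only useful for illustration; a larger tile such as 18*18, which computes 16*16 pixels per block, keeps more threads busy and wastes a smaller share of loads on the tile border.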

Block 1:


Block 2:


Block 3:


Block 4:


Block 5:


Block 6: