It is well known that reading data from Shared Memory or Registers is far faster than reading from Global (Device) Memory. The following figure illustrates the Nvidia GPU Execution Model:
Figure 1. Nvidia GPU Execution Model.
In Sobel Edge Detection with CUDA, the image buffer is first copied from Host Memory into Global Memory; the processors then compute a gradient magnitude for each pixel and classify the pixel as an edge when that magnitude exceeds a threshold. Sobel Edge Detection usually proceeds as follows:

Figure 2. Sobel Edge Detection.
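The per-pixel computation applies the standard 3*3 Sobel masks Gx and Gy and thresholds the gradient magnitude sqrt(Gx^2 + Gy^2). A minimal sketch of that step is given below; the function name sobelMagnitude and the row-major 8-bit grayscale layout are illustrative assumptions, not taken from the original.

#include <math.h>

// Standard 3x3 Sobel masks applied to the neighborhood of pixel (x, y).
// 'in' is a row-major grayscale image of width 'width'; (x, y) must be an
// interior pixel, i.e. 1 <= x <= width-2 and 1 <= y <= height-2.
__host__ __device__ inline float sobelMagnitude(const unsigned char* in,
                                                int width, int x, int y)
{
    // Horizontal gradient Gx: mask {{-1,0,1},{-2,0,2},{-1,0,1}}.
    float gx = -1.0f * in[(y - 1) * width + (x - 1)] + 1.0f * in[(y - 1) * width + (x + 1)]
             - 2.0f * in[ y      * width + (x - 1)] + 2.0f * in[ y      * width + (x + 1)]
             - 1.0f * in[(y + 1) * width + (x - 1)] + 1.0f * in[(y + 1) * width + (x + 1)];
    // Vertical gradient Gy: mask {{-1,-2,-1},{0,0,0},{1,2,1}}.
    float gy = -1.0f * in[(y - 1) * width + (x - 1)] - 2.0f * in[(y - 1) * width + x]
             - 1.0f * in[(y - 1) * width + (x + 1)]
             + 1.0f * in[(y + 1) * width + (x - 1)] + 2.0f * in[(y + 1) * width + x]
             + 1.0f * in[(y + 1) * width + (x + 1)];
    // Gradient magnitude; an edge is reported when this exceeds the threshold.
    return sqrtf(gx * gx + gy * gy);
}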
Use Global Memory
1. Description
For each thread
Compute the corresponding pixel's coordinate
Read the values of the eight neighbors from Global Memory
Compute the magnitude of the current pixel
Determine whether the current pixel is on an edge
End
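Putting these steps together, a minimal global-memory kernel might look like the sketch below. The kernel name, the threshold parameter, and the convention of writing 255 for edge pixels and 0 otherwise are illustrative assumptions; sobelMagnitude is the helper sketched earlier, so every neighborhood read here goes to Global Memory.

// Naive Sobel kernel: one thread per pixel, all neighbor reads hit Global Memory.
__global__ void sobelGlobal(const unsigned char* in, unsigned char* out,
                            int width, int height, float threshold)
{
    // Compute the corresponding pixel's coordinate for this thread.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Skip the one-pixel image border and any out-of-range threads.
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

    // Read the eight neighbors from Global Memory and compute the magnitude.
    float mag = sobelMagnitude(in, width, x, y);

    // Determine whether the current pixel is on an edge.
    out[y * width + x] = (mag > threshold) ? 255 : 0;
}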
2. Limitation
For any two pixels next to each other, their 3*3 neighborhoods share six pixels. This method therefore leads to redundant data transfers, because those shared neighbors are read from Global Memory multiple times. A better way is to load a tile of the image into Shared Memory, so that each pixel is fetched from Global Memory only once per block and the redundant Global Memory reads are avoided.
Use a Tiled Matrix in Shared Memory
1. Description
Image (represented here as a matrix):
Figure 3. Image: black numbers are x and y indices; green numbers are pixel values.
Define an M*N tiled matrix; the magnitudes of (M-2)*(N-2) pixels are then calculated in each block:
Figure 4. A 5*3 Tiled Matrix: black numbers are x and y indices; green numbers are pixel values. In this case, the magnitudes of pixels '9', '10', and '11' are calculated.
Block 1 through Block 6: the 5*3 tile loaded by each of the six thread blocks (tile figures omitted).
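Following this description, a possible shared-memory kernel is sketched below. TILE_W and TILE_H play the roles of M and N; the macro names, the 16*16 tile size, and the edge-writing convention are illustrative assumptions rather than values from the original.

#define TILE_W 16   // M: tile width in pixels
#define TILE_H 16   // N: tile height in pixels

// Tiled Sobel kernel: each block copies a TILE_W x TILE_H tile into Shared Memory
// once, then computes magnitudes for its (TILE_W-2) x (TILE_H-2) interior pixels.
__global__ void sobelShared(const unsigned char* in, unsigned char* out,
                            int width, int height, float threshold)
{
    __shared__ unsigned char tile[TILE_H][TILE_W];

    // Interior regions of neighboring tiles must abut, so blocks are spaced
    // (TILE_W-2) x (TILE_H-2) pixels apart in the image.
    int x = blockIdx.x * (TILE_W - 2) + threadIdx.x;
    int y = blockIdx.y * (TILE_H - 2) + threadIdx.y;

    // Each thread loads one pixel of the tile from Global Memory, exactly once.
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Only threads mapped to interior tile positions compute a magnitude;
    // all eight neighbor reads now come from Shared Memory.
    if (threadIdx.x >= 1 && threadIdx.x < TILE_W - 1 &&
        threadIdx.y >= 1 && threadIdx.y < TILE_H - 1 &&
        x < width - 1 && y < height - 1)
    {
        int tx = threadIdx.x, ty = threadIdx.y;
        float gx = -1.0f * tile[ty - 1][tx - 1] + 1.0f * tile[ty - 1][tx + 1]
                 - 2.0f * tile[ty    ][tx - 1] + 2.0f * tile[ty    ][tx + 1]
                 - 1.0f * tile[ty + 1][tx - 1] + 1.0f * tile[ty + 1][tx + 1];
        float gy = -1.0f * tile[ty - 1][tx - 1] - 2.0f * tile[ty - 1][tx]
                 - 1.0f * tile[ty - 1][tx + 1]
                 + 1.0f * tile[ty + 1][tx - 1] + 2.0f * tile[ty + 1][tx]
                 + 1.0f * tile[ty + 1][tx + 1];
        float mag = sqrtf(gx * gx + gy * gy);
        out[y * width + x] = (mag > threshold) ? 255 : 0;
    }
}

With this spacing, a launch for a width*height image uses a grid of roughly ceil((width-2)/(TILE_W-2)) by ceil((height-2)/(TILE_H-2)) blocks of TILE_W*TILE_H threads. Each pixel is then fetched from Global Memory only once per tile that contains it, instead of being re-read by up to eight neighboring threads as in the global-memory version.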