Gaussian blur with OpenCL

We’ve come a long way since the days where scientists hackishly encoded data in 3D primitives and textures to trick their highly parallel dedicated graphics hardware into performing non graphics-related computations. General purpose GPU computing has experienced fast grow in the last few years and nowadays a number of frameworls, and developing tools for GPU parallel are easily (and freely) available. One of them being OpenCL.

However, despite this widespread availability of tools and learning materials, diving into the world of parallel computing can be an overwhelming task for most enthusiasts (unless you’re doing CUDA, of course).

As a self-taught GPGPU aficionado, I found it a bit hard to move from the canonical example programs (i.e. vector addition and multiplication) to something more useful. In this post I’ll share my experience coding a non-trivial but still very simple inherently parallel application: a Gaussian blur.

I’ll be using Visual Studio Community 2015 and the Intel OpenCL libraries (which by the time I wrote this came with the Integrated Native Developer Experience toolbox) under Windows 8.1. Other targets might require additional changes. Knowledge of c/c++ and some basic notions of OpenCL and Image Processing are assumed.

Setting up the data

In mathematical terms, a Gaussian blur is the convolution of an image with, you guessed it, a Gaussian function. This operation reduces the high frequency components of the image resulting in a smoother version of it.

Such transformation requires the use of a matrix G (often referred to as the convolution kernel) in which each element is:

G(x,y)={\frac {1}{2\pi \sigma ^{2}}}e^{-{\frac {x^{2}+y^{2}}{2\sigma ^{2}}}}

where x and y are the distances from the origin (the center of the matrix) in the horizontal and vertical axes respectively and \sigma is the standard deviation parameter of the gaussian function.

By using this discrete representation of a Gaussain function, some accuracy is lost, causing a slight darkening of the final image. This effect can be attenuated if we normalize the convolution kernel by dividing each element by the sum of all the elements.

This can be implemented using the following code:

float sigma = 1.0;
 int kernelDimension = (int)ceilf(6 * sigma);
 if (kernelDimension % 2 == 0)
int cKernelSize = pow(kernelDimension, 2);
float *ckernel = new float[kernelDimension*kernelDimension];

float acc = 0;
 for (int j = 0; j < kernelDimension; j++)
      int y = j - (kernelDimension / 2);
      for (int i = 0; i < kernelDimension; i++)
          int x = i - (kernelDimension / 2);

          ckernel[j*kernelDimension+i] = (1 / (2 * 3.14159*pow(sigma, 2)))*exp(-((pow(x, 2) + pow(y, 2)) / (2 * pow(sigma, 2))));

          acc += ckernel[j*i+i];
 for (int j = 0; < kernelDimension; j++)
 for (int i = 0; i < kernelDimension; i++)
      ckernel[j*kernelDimension + i] = ckernel[j*kernelDimension + i] / acc;

The next thing we need to do is to initialize the image we want to blur. For this example I’m using a B8G8R8A8 format which means that each pixel is represented by 4 bytes, each of them corresponding to an 8-bit intensity value (0 – 255) of blue, green, red and an alpha channel (for transparency) respectively.

In this code snippet, a 640 x 480 image is initialized to random numbers, resulting in a noise pattern. The RenderingWindow class is a custom made D2D-based window to render a bitmap (Download the code at the end of this post for more details).

int width = 640;
int height = 480;
int inputSize = width * height * 4;
BYTE* pixels = new BYTE[width * height * 4];
BYTE* out = new BYTE[width * height * 4];
for (int i = 0; i< width * height; i++ )
BYTE* pixels = new BYTE[width * height * 4];
BYTE* out = new BYTE[width * height * 4];
for (int i = 0; i < width * height; i++ )
pixels[i * 4] = rand() % 255;
pixels[i * 4 + 1] = rand() % 255;
pixels[i * 4 + 2] = rand() % 255;
pixels[i * 4 + 3] = 255;

RenderingWindow window(L"OpenCL gaussian blur", width, height, nCmdShow);

The OpenCL kernel

Before going deeper into the code let’s take a minute to analyze what we’re trying to achieve. As previously discussed, a Gaussian blur is a convolution operation, meaning that each pixel of the image must be multiplied by a corresponding element in the convolution kernel and then accumulated and stored in the output buffer. This operation has to be repeated for each channel of the image, four in this case. That’s a lot of computations.

Gladly, since every pixel is independent from the others, these computations can be performed in parallel and without the need for synchronization. Sounds like a job for the GPU.

The strategy is, thus, the following:

  • We’ll dispatch as many threads as pixel channels are in the image (that is width x height x 4)
  • Each thread will compute the convolution of a single channel of a single pixel and write the result in the output buffer.

Note: there are better ways to do this, it’s up to you to try other more efficient strategies

An OpenCL kernel (not to be confused with the convolution kernel) is a function that will be executed in parallel by each thread. The OpenCL kernel to perform the 2D convolution described above looks as follows:

void blur(__global unsigned char *pixels, __global unsigned char *out,__global float* ckernel, __constant int *rows, __constant int *cols, __constant int *cKernelDimension)
 int idx = get_global_id(0);
 int currentRow = (idx/4)/ (*cols);
 int currentCol = (idx/4) % (*cols);
 int colorOffset = idx%4;
float acc=0;

if(colorOffset != 3)
     int i, j;
     int y =currentRow + (j-(*cKernelDimension/2));
     if(y < 0 || y >= *rows) y = currentRow; 

     int x = currentCol +(i-(*cKernelDimension/2));
     if(x < 0 || x > *cols) x = currentCol;
          acc += (float) ((float)(pixels[((y*(*cols)+x))*4 + colorOffset])* ckernel[(j*(*cKernelDimension))+i]);


 if(acc &gt;= 255)
     acc = 255;
 out[idx] =(unsigned char)acc;
 out[idx] =255;

Let’s break  down the code a little bit. These are the parameters of the kernel:

__global unsigned char *pixels: a pointer to a byte array containing the input image.

__global unsigned char *out: also a byte array that we’ll use as the output buffer.

__global float* ckernel: the convolution kernel itself.

__constant int *rows: the number of rows in the image.

__constant int *cols: the number of columns in the image.

__constant int *cKernelDimension: the dimension of the convolution kernel, which is a square matrix of size cKernelDimension x cKernelDimension.

The get_global_id function returns the index of the thread executing the kernel. We will use this index to address the convolution kernel, the input image and the output buffer. To simplify our lives we can extract the 2D coordinates and channel offset of the current pixel in the image using integer division and modulo operations.

When the channel offset equals 3, it means that the current thread is working on the alpha channel of a pixel. Since this represents the transparency of the pixel we’ll just set it to 255 (completely opaque).

Note: In the provided code, the alpha channel is actually ignored by the rendering window so the value doesn’t really matter.

For all the other channels, we sequentially (THERE’S ROOM FOR IMPROVEMENT HERE) multiply the convolution kernel by the pixels around the current pixel and accumulate the partial results that we finally write in the output buffer. The conditional statements such as  if(y < 0 || y >= *rows) y = currentRow are there to prevent the thread from trying to multiply by a pixel out of the bounds of the image.

Now, let’s make this thing run.

The host code

In order to execute our OpenCL kernel we need to make a little bit of setup.

The code for the OpenCL kernel can be stored as a string in the source code for the host, but I find it easier to save it as a separate file. Assuming the file is named OpenCLKernel.txt  the OpenCL code can be read as shown in the following code snippet:

FILE* inputFile= fopen("OpenCLKernel.txt", "r");
 fseek(inputFile, 0L, SEEK_END);
 int fileSize = ftell(inputFile);
 char* programSource = (char*)malloc(fileSize + 1);
 fileSize = fread(programSource, 1, fileSize, inputFile);
 programSource[fileSize] = '\0';

Note: If you’re using Visual Studio you might experience problems with the fopen function. You can change it openf_s or add the _CRT_SECURE_NO_WARNINGS pre-processor definition as explained here.

First we need to discover the OpenCL platforms and devices in our system, select one and then create a context and command queue to work with. You’ll notice that we’re running the clGetPlatformIDs and clGetDeviceIDs functions twice, one to discover the number of IDs in the system and the second one to actually obtain the IDs. If you want to use a specific platform or device, this is where it should be done. For this example we’re only considering GPU devices. For other options you can replace CL_DEVICE_TYPE_GPU by any of the types described in the documentation.

cl_int status;cl_uint numPlatforms = 0;
cl_platform_id *platforms = NULL;
status = clGetPlatformIDs(0, NULL, &amp;numPlatforms);
platforms = (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id));
status = clGetPlatformIDs(numPlatforms, platforms, NULL);

cl_uint numDevices = 0;
cl_device_id* devices = NULL;

status = clGetDeviceIDs(platforms[0],CL_DEVICE_TYPE_GPU , 0, NULL, &amp;numDevices);
devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);

cl_context context = NULL;
context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &amp;status);

cl_command_queue cmdQueue;
cmdQueue = clCreateCommandQueue(context, devices[0], 0, &amp;status);

The next thing to do is to create the GPU buffers that will be associated to the  data in the host code. To do so we use the clCreateBuffer function which will return a handle to the created buffer by specifying, among other things, its size in bytes and a usage flag. You can get more details in the documentation. After this we can copy the host data to the recently created GPU buffer by using the clEnqueueWriteBuffer function. Again you can get more details by reading the function’s documentation.

cl_mem bufferPixels;
cl_mem bufferOut;
cl_mem bufferCKernel;
cl_mem bufferRows;
cl_mem bufferColumns;
cl_mem bufferCKernelSize;
bufferPixels = clCreateBuffer(context, CL_MEM_READ_ONLY, inputSize, NULL, &amp;status);
bufferOut = clCreateBuffer(context, CL_MEM_WRITE_ONLY, inputSize, NULL, &amp;status);
bufferCKernel = clCreateBuffer(context, CL_MEM_READ_ONLY, cKernelSize *sizeof(float), NULL, &amp;status);
bufferRows = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(int), NULL, &amp;status);
bufferColumns = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(int), NULL, &amp;status);
bufferCKernelSize = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(int), NULL, &amp;status);

status = clEnqueueWriteBuffer(cmdQueue, bufferPixels, CL_TRUE, 0, inputSize, pixels, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdQueue, bufferCKernel, CL_TRUE, 0, cKernelSize * sizeof(float), ckernel, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdQueue, bufferRows, CL_TRUE, 0, sizeof(int), &amp;height, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdQueue, bufferColumns, CL_TRUE, 0, sizeof(int), &amp;width, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdQueue, bufferCKernelSize, CL_TRUE, 0, sizeof(int), &amp;kernelDimension, 0, NULL, NULL);


As with many freameworks for GPGPU programming, OpenCL kernels are compiled in runtime. It is very useful to get the error information as well in case something goes wrong. The following c++ code compiles the provided kernel and, if necessary, displays  a message box with the compilation errors:

cl_program program = clCreateProgramWithSource(context, 1, (const char **)&amp;programSource, NULL, &amp;status);
status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
if (status != CL_SUCCESS)
clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_STATUS,
sizeof(cl_build_status), &amp;status, NULL);

size_t logSize;
wchar_t errorbuffer[2048];
clGetProgramBuildInfo(program, devices[0],
CL_PROGRAM_BUILD_LOG, 0, NULL, &amp;logSize);
char* programLog = (char*)calloc(logSize + 1, sizeof(char));

clGetProgramBuildInfo(program, devices[0],
CL_PROGRAM_BUILD_LOG, logSize + 1, programLog, NULL);

wchar_t* wProgramLog = new wchar_t(strlen(programLog) + 1);
mbstowcs(wProgramLog, programLog, strlen(programLog) + 1);

swprintf_s(errorbuffer,2048, L"error compiling openCL kernel\n Build failed; error=%d, status=%d, programLog:nn%s",
status, status, wProgramLog);
MessageBox(NULL, errorbuffer , L"ERROR", 0);


Note: if you’re using a newer version of OpenCL (higher than 2.0) you’ll notice that the clBuildProgram function has been deprecated which might be treated as an error by your IDE/compiler. You can fix this by adding #define CL_USE_DEPRECATED_OPENCL_2_0_APIS to your code.

The last setup step is to create a handle to the compiled OpenCL kernel and associate the GPU buffers to the parameters of the kernel. The string parameter of the clCreateKernel function is the entry point of the function in the OpenCL kernel file, so make sure they match.

cl_kernel kernel = NULL;
kernel = clCreateKernel(program, "blur", &amp;status);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &amp;bufferPixels);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &amp;bufferOut);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &amp;bufferCKernel);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &amp;bufferRows);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &amp;bufferColumns);
clSetKernelArg(kernel, 5, sizeof(cl_mem), &amp;bufferCKernelSize);

Now we can run our parallel program. We need to specify the number of threads to execute. For this example we’re dispatching only one group of inputSize (width x height x 4) threads. We use the clEnqueueNDRangeKernel to execute the OpenCL kernel. After this, the results are stored in the GPU memory, so of course it is possible to use the GPU to draw it to the screen using, for instance, OpenGL… but that will be the subject of a future post. For the moment, we can copy the results back to the host using the clEnqueueReadBuffer function.

size_t globalWorkSize[1];
globalWorkSize[0] = inputSize;

clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

clEnqueueReadBuffer(cmdQueue, bufferOut, CL_TRUE, 0, inputSize, out, 0, NULL, NULL);

The last thing to do is to draw the resulting image to the screen. You can use your favorite method to do so. In this example, we use an instance of the aforementioned RenderingWindow class. The drawing part and main application loop look something like this:

MSG msg{ 0 };
while (msg.message != WM_QUIT)
window.Draw(out, width, height);

while (PeekMessage(&msg, 0, 0, 0, PM_REMOVE))
if (window.windowHandle && IsDialogMessage(window.windowHandle, &msg))

Ahh, and don’t forget to clean up the resources you used.

free( pixels);


And that’s it. I hope that after this post, parallel computing  with OpenCL appears a little bit less scary.

Here’s a [completely unnecessary] video of the program in action.


Finally, you can download the Visual Studio Solution with the code from this post from this link. NOTE: The Visual Studio  solution is configured to use the Intel OpenCL libraries so you might to make some modifications if you’re using other platforms, including changing the include and link directories to those on your local machine.

Happy coding!










Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s