Back in my high school days, I was into distributed computing. It was the era when computers consumed pretty much the same amount of power even when they were idle, so I chose to dedicate my spare CPU cycles to SETI@home (yeah, I was a big fan of Contact), first through the dedicated installer and later through BOINC. My contribution was pretty average, but around 2007 I noticed that some users spiked and started producing pretty awesome numbers.
That was the year NVIDIA launched their first CUDA-enabled GPU, which meant you could use your card for more than gaming. The CPU is a general-purpose processing unit and it's really good at doing anything; as you know, when you are good at a lot of things you cannot be great at any one of them. The GPU, on the other hand, does the same type of mathematical computation over and over again: practically every pixel in a game is rendered the same way. In short, the GPU is architecturally optimized to carry out a lot of parallel tasks.
In my Mandelbrot set generator from last time you saw that we do the same mathematical calculation for every pixel in the image. Since you know I love performance optimizations, let's try to use CUDA to offload these intensive computations onto the GPU.
There are two main ways to do this from managed code. You can use a NuGet package that translates your C# code into CUDA C, which works great and which I have used in the past. Or you can write your own CUDA C code and call it from your managed project. I chose the latter because lately I have been working on a project where we are modernizing a very old and ugly C++ code base and gradually moving things into a WPF application, so I had the chance to refresh and deepen my C++ knowledge. It is also 15-20 percent faster this way.
The first thing you need to do is install the CUDA Toolkit and create a CUDA project from the template it provides. Run the template project to verify that everything works and that you have a GPU which can execute your code. Then let's look at the .cu file which contains the kernel.
__global__ void kernel(int *result,
                       const double *xcoord, const double *ycoord,
                       const int *col, int limit, int n)
{
    // Global thread index: which pixel this thread works on
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n)
    {
        double x, y, x0, y0;
        x0 = x = xcoord[t];
        y0 = y = ycoord[t];
        for (int i = 0; i < limit; i++)
        {
            // The point escaped: color it by the iteration at which it escaped
            if (x * x + y * y >= 4)
            {
                result[t] = col[i];
                return;
            }
            double zx = x * x - y * y + x0;
            y = 2 * x * y + y0;
            x = zx;
        }
        // The point never escaped within the iteration limit: it is inside the set
        result[t] = 0;
    }
}
The kernel is the piece of code that will be executed in parallel on every CUDA core we request. It is marked with the __global__ keyword and behaves pretty much like a static method: it can only access other global variables and functions. This is how we calculate the fractal from the previous post.
The way we compute t needs a bit of explanation. The kernel runs in many threads, and those threads are grouped into blocks which form a grid on the GPU, so we calculate our global thread id to know which slice of the input and output arrays belongs to this thread. The rest is the same.
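For example, with 256 threads per block, the thread with threadIdx.x == 3 inside block blockIdx.x == 2 computes t = 2 * 256 + 3 = 515, so it works on element 515 of the coordinate and result arrays.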
In addition to the kernel function we have what is called the host function.
void MandelbrotKernel(int *result,
                      const double *xcoord, const double *ycoord,
                      const int *col, const int limit, const int n)
{
    size_t dsize = n * sizeof(double);
    size_t isize = n * sizeof(int);
    size_t csize = limit * sizeof(int);

    // Allocate device memory for the inputs and the result
    double* d_xcoord;
    cudaMalloc(&d_xcoord, dsize);
    double* d_ycoord;
    cudaMalloc(&d_ycoord, dsize);
    int* d_result;
    cudaMalloc(&d_result, isize);
    int* d_colors;
    cudaMalloc(&d_colors, csize);

    // Copy the host arrays to the device
    cudaMemcpy(d_xcoord, xcoord, dsize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ycoord, ycoord, dsize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, result, isize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_colors, col, csize, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n pixels
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    kernel<<<blocksPerGrid, threadsPerBlock>>>
        (d_result, d_xcoord, d_ycoord, d_colors, limit, n);

    // Wait for the kernel to finish, then copy the result back to the host
    cudaDeviceSynchronize();
    cudaMemcpy(result, d_result, isize, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_xcoord);
    cudaFree(d_ycoord);
    cudaFree(d_result);
    cudaFree(d_colors);
}
The host function runs on the CPU and sets up the memory and parameters with which our kernel will run. In our example it allocates memory on the GPU and is responsible for transferring data to the device and back to the CPU. The <<< >>> in the kernel call tells the GPU how many threads we want per block and how many blocks we need in the grid. This is also why the kernel checks if (t < n): the last block will contain some threads that fall outside of our input arrays.
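I left error handling out for brevity, but every CUDA runtime call returns a cudaError_t that is worth checking while you experiment. A minimal sketch of how one might do that (the CUDA_CHECK macro is my own addition, not part of the project above):
#include <cstdio>
#include <cuda_runtime.h>
// Hypothetical helper: print a message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess)                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
    } while (0)
// Usage inside the host function:
// CUDA_CHECK(cudaMalloc(&d_xcoord, dsize));
// CUDA_CHECK(cudaMemcpy(d_xcoord, xcoord, dsize, cudaMemcpyHostToDevice));
// kernel<<<blocksPerGrid, threadsPerBlock>>>(d_result, d_xcoord, d_ycoord, d_colors, limit, n);
// CUDA_CHECK(cudaGetLastError());
// CUDA_CHECK(cudaDeviceSynchronize());
Checking cudaGetLastError right after the launch catches launch configuration problems, while cudaDeviceSynchronize surfaces errors that happen while the kernel itself is running.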
OK, so how do we expose this to C#? We need to set the CUDA project to compile as a DLL, and in our .cuh file we need to mark the methods that we want to be accessible from outside with extern "C". It looks like this:
#pragma once

#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void kernel(int *result,
                       const double *xcoord, const double *ycoord,
                       const int *col, int limit, int n);

extern "C" __declspec(dllexport) void MandelbrotKernel(int *result,
    const double *xcoord, const double *ycoord,
    const int *col, const int limit, const int n);
Next, in the WPF project from last time, add a class that imports this function from our DLL through the DllImport attribute.
public class CudaWrapper
{
    [DllImport("GpuMandelbrot.CUDA.dll",
        EntryPoint = "MandelbrotKernel",
        CallingConvention = CallingConvention.Cdecl)]
    public static extern void MandelbrotKernel(
        [In, Out] int[] result,
        double[] xcoords,
        double[] ycoords,
        int[] colors,
        int limit,
        int n);
}
In our GenerateImage method, add this code before the unsafe part; it prepares the input parameters for our kernel and calls the imported method.
var xcoords = new double[width * height];
var ycoords = new double[width * height];
var results = new int[width * height];

Parallel.For(0, height, j =>
{
    Parallel.For(0, width, i =>
    {
        var ind = j * width + i;
        xcoords[ind] = startX + i * step;
        ycoords[ind] = startY + j * step;
    });
});

CudaWrapper.MandelbrotKernel(results,
    xcoords, ycoords,
    pallette, pallette.Length, results.Length);
When MandelbrotKernel is executed, the host function from the CUDA project transfers these arrays into device memory, executes the kernel and reads back the results array. We then copy the results into our WriteableBitmap in the following fashion; this replaces the old calculation call in the unsafe block.
Parallel.For(0, height, j =>
{
    Parallel.For(0, width, i =>
    {
        var ind = j * width + i;
        // Each pixel is 3 bytes wide in the bitmap buffer
        var pixel = buffer + j * stride + i * 3;
        *(int*)pixel = results[ind];
    });
});
OK, this was a long journey. I feel exhausted, but was it worth it? Well, let's check the following image.
You can see the input parameters in the caption. On my i5-4690K this renders in around 3.5 seconds, while on my GTX 980 it renders in 0.5 seconds. Kewl, ha? I will try to make more optimizations later, but I really hope you enjoyed it. I encourage you to check out other CUDA examples before you jump into it here.