GPU Programming at PPPL
Written by Eliot Feibush & Michael Knyszek
Introduction
Programming on Graphics Processing Units (GPUs) is incredibly
powerful because it exploits parallelism on a scale much larger than
most of today's supercomputers. Rather than relying on large numbers
of machines communicating over high-latency connections, a GPU packs
a massive number of tightly-knit, lean cores that are heavily
optimized for floating point operations. This means that millions of
points on a grid can potentially be processed hundreds of times
faster than if they were processed serially. A broader introduction
to the power of GPUs can be found in the following NVIDIA
presentation for their Tesla GPUs:
Tesla GPU Computing: Accelerating High Performance Computing
Although GPUs are fairly specialized in their computation, NVIDIA
has conveniently provided a list of just some of the many
applications of hardware-accelerated computing, ranging from finance
to science and research:
Popular GPU Accelerated Applications
GPU Devices at PPPL
cppg-6 Linux computer
An NVIDIA Tesla K20c GPU accelerator is installed in cppg-6.
The computer's original 525 watt power supply was replaced with a
1000 watt power supply to power the 233 watt board. The K20c
receives 80 watts from the PCI bus and 2 six pin power cables
deliver the rest of the power. There are 2496 CUDA cores in
this GPU. It contains 5 GB of on-board memory.
The computer is a Dell Precision T3500n with a 64-bit Intel Quad
Core Xeon processor. It contains 4 GB of host memory.
The graphics board is an NVIDIA Quadro FX 580 that drives 2 monitors
at 1600 x 1200 resolution each. The graphics board is also a
CUDA device with 32 CUDA cores.
Note: cppg-6 is currently running a 64-bit Red Hat 5 kernel. It
mounts the PPPL cluster file system when you log in, so you can load
PPPL modules just as on portal.
ssh into cppg-6 to run the GPU software.
gpusrv GPU Cluster
gpusrv01
Two NVIDIA Tesla M2070 GPU accelerators are installed in gpusrv01.
These GPUs have 448 CUDA cores each and 5 GB of on-board memory.
gpusrv02
Two NVIDIA Tesla K20m GPU accelerators are installed in
gpusrv02. There are 2496 CUDA cores in each GPU and each GPU
contains 4.8 GB of on-board memory.
Run the "useg" or "use" command to be queued for access onto gpusrv.
NVIDIA software
The NVIDIA CUDA 5.0.35 SDK (64-bit) is installed in
/usr/pppl/cuda/cudatoolkit/5.0.35. The package includes nvcc
for cross-compiling CUDA programs to run on the K20.
At the Linux command line: module load cuda/5.0.35
/usr/bin/nvidia-smi -q
This program prints information about the NVIDIA
devices installed in the computer. There is a man page for
nvidia-smi or you can run nvidia-smi -h to see a list of
command line options.
Another sample program is in
/p/vis/cuda/samples/devicequerydrv. The deviceQueryDrv program there
prints information about the installed NVIDIA devices. The following
output is an example of running deviceQueryDrv
on cppg-6:
CUDA Device Query (Driver API) statically linked version
Detected 2 CUDA Capable device(s)

Device 0: "Tesla K20c"
  CUDA Driver Version:                            5.0
  CUDA Capability Major/Minor version number:     3.5
  Total amount of global memory:                  4096 MBytes (4294967295 bytes)
  (13) Multiprocessors x (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Clock rate:                                 706 MHz (0.71 GHz)
  Memory Clock rate:                              2600 Mhz
  Memory Bus Width:                               320-bit
  L2 Cache Size:                                  1310720 bytes
  Max Texture Dimension Sizes:                    1D=(65536) 2D=(65536,65536) 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers:        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        49152 bytes
  Total number of registers available per block:  65536
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Maximum sizes of each dimension of a block:     1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:      2147483647 x 65535 x 65535
  Texture alignment:                              512 bytes
  Maximum memory pitch:                           2147483647 bytes
  Concurrent copy and kernel execution:           Yes with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Concurrent kernel execution:                    Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Enabled
  Device supports Unified Addressing (UVA):       No
  Device PCI Bus ID / PCI location ID:            3 / 0
  Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro FX 580"
  CUDA Driver Version:                            5.0
  CUDA Capability Major/Minor version number:     1.1
  Total amount of global memory:                  511 MBytes (536150016 bytes)
  ( 4) Multiprocessors x (  8) CUDA Cores/MP:     32 CUDA Cores
  GPU Clock rate:                                 1125 MHz (1.12 GHz)
  Memory Clock rate:                              800 Mhz
  Memory Bus Width:                               128-bit
  Max Texture Dimension Sizes:                    1D=(8192) 2D=(65536,32768) 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers:        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        16384 bytes
  Total number of registers available per block:  8192
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   768
  Maximum number of threads per block:            512
  Maximum sizes of each dimension of a block:     512 x 512 x 64
  Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
  Texture alignment:                              256 bytes
  Maximum memory pitch:                           2147483647 bytes
  Concurrent copy and kernel execution:           Yes with 1 copy engine(s)
  Run time limit on kernels:                      Yes
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Concurrent kernel execution:                    No
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       No
  Device PCI Bus ID / PCI location ID:            2 / 0
  Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
It identifies the K20 as the first device. This is convenient
for the CUDA examples that appear later on in this document.
For more information about the GPU driver and NVIDIA software
included with the CUDA Toolkit, click on the course link below:
GPU driver/CUDA
Basic GPU Programming
Programming Course
More information on all three approaches below can be found in a
longer course presentation that covers all three types of CUDA
programming.
Click the link below to see the full 107-slide course:
Full 3-part Introduction
Introduction to CUDA Libraries
In many cases, using libraries built for common mathematical
operations and transformations is the easiest way to take advantage
of GPU parallelism.
CUDA Libraries such as:
- cuFFT - Fast Fourier Transforms Library
- cuBLAS - Complete BLAS Library
- cuSPARSE - Sparse Matrix Library
- cuRAND - Random Number Generation (RNG) Library
- NPP - Performance Primitives for Image & Video Processing
- Thrust - Templated C++ Parallel Algorithms & Data
Structures
- math.h - C99 floating-point Library (on the GPU)
can make parallelizing codes for the GPU trivial: simply "drop in"
library calls where you would otherwise perform those mathematical
operations yourself.
For a more detailed overview and some tips on how to get started
with the CUDA libraries, click on the course link below:
NVIDIA
CUDA Library Approach (First ~30 Pages of the Large Course Above)
For more information about the Thrust library specifically, check
out:
An Introduction to
the Thrust Parallel Algorithms Library
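As a concrete illustration of the "drop-in" idea, here is a minimal
Thrust sketch (our own example, not taken from the slides above) that
sorts and sums a vector entirely on the GPU, with no hand-written
kernels:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main(void) {
    thrust::host_vector<float> h(1 << 20, 1.0f);   // one million values on the host
    thrust::device_vector<float> d = h;            // copy the data to the GPU
    thrust::sort(d.begin(), d.end());              // parallel sort on the device
    float sum = thrust::reduce(d.begin(), d.end(), 0.0f);  // parallel reduction
    printf("sum = %f\n", sum);
    return 0;
}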
Introduction to OpenACC
For anyone familiar with OpenMP, working with OpenACC should be very
easy.
OpenACC is built on similar principles to OpenMP in that parallelism
is as simple as adding directives to your C/Fortran codes.
For a more detailed overview and some tips on how to get started
with OpenACC, click on the course link below:
GPU
Computing with OpenACC Directives
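As a small illustration (our own sketch, not from the course), the
directive-based style looks like this for a simple saxpy loop; an
OpenACC-capable compiler (for example pgcc with the -acc flag)
offloads the loop to the GPU:
// The pragma asks the compiler to parallelize the loop on the accelerator.
void saxpy(int n, float a, float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}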
Introduction to CUDA C/C++
CUDA C/C++ is the most involved of the methods above.
Built as an extension to C/C++, the language is easy to get started
with, but the runtime can be tricky to master. In exchange, it is
extremely powerful: it offers the greatest degree of customization
and the biggest opportunity for a speed boost in your code.
For a more detailed overview and some tips on how to get started
with CUDA C/C++, click on the course link below:
CUDA C/C++
Basics
Below is also a cheatsheet with some of the basic functions, syntax,
and ideas of CUDA C/C++:
CUDA C/C++ Cheatsheet
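To give a flavor of what the basics look like in practice, below is a
minimal vector-addition sketch (our own example, in the same spirit
as the course slides); error handling is omitted for brevity:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;                 // one million elements
    size_t bytes = n * sizeof(float);

    // Allocate and fill host arrays
    float *h_a = (float *) malloc(bytes);
    float *h_b = (float *) malloc(bytes);
    float *h_c = (float *) malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device arrays and copy the inputs over
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **) &d_a, bytes);
    cudaMalloc((void **) &d_b, bytes);
    cudaMalloc((void **) &d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add <<< blocks, threads >>> (d_a, d_b, d_c, n);

    // Copy the result back and check one value (expect 3.0)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}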
Simple CUDA Examples
The NVIDIA training course provided several programming
examples.
Loading the cuda module should set up your include path and library
path for compiling and linking the example programs.
There are some sample CUDA programs in
/p/vis/cuda/samples/test.
The source code is in these files:
- addcuda.cu
- first.cu
- kernel.cu
- simple_gpu.cu
The Makefile will compile and link the codes so you can run the
examples:
- addcuda.out
- x.first
- x.hello_world
- x.simple_gpu
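If you would rather compile one of the samples by hand, the
invocation should look something like the following (the exact flags
used by the Makefile may differ; sm_35 matches the K20's compute
capability 3.5 reported above):
module load cuda/5.0.35
nvcc -arch=sm_35 -o addcuda.out addcuda.cu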
More Sophisticated CUDA Examples
Generating Fractals
GPUs excel at performing many simultaneous computations that are mutually
independent. Perhaps one of the best examples of this is in
generating images of fractals since each point is calculated
independently, sharing only the same base equation. Two particularly
interesting fractals are those in the Mandelbrot set and the Julia
set, which are defined by the recurrence

    z_(n+1) = (z_n)^k + c

where z and c are complex numbers (k = 2 gives the classic Mandelbrot
and Julia sets, and is what the code below assumes).
(This short tutorial does not run through coloring methods; to
achieve a smooth coloring effect, take a rainbow palette and use the
output of the program to interpolate between colors on a log2
scaling.)
To generate images of these fractals, we iteratively compute the
equation at discrete points and count the iterations it takes for
the absolute value of z to exceed a certain threshold. This is a
measure of how quickly the equation diverges and is interesting
because it produces fractal patterns. From this measure, we can then
color and draw the fractal. This is the simplest way to generate
fractals and is known as the Escape Time Algorithm, since it measures
how quickly the recursive equation "escapes" a certain bailout
radius (the threshold). If you are more interested in the math behind
it, Wikipedia has a wealth of information on the subject. This
example will instead focus on the GPU implementation.
In this particular example, we need to define a few things. First of
all, we represent our image as a simple array of floating point
numbers in memory. Next, we need a way to pass the dimensions of the
image and our "viewing window" (the maximum and minimum coordinates
in the Argand plane to view). The solution to passing all these
arguments is to use a special part of the GPU called
constant memory. It consists of a small amount of read-only
memory set aside for constants which will be accessed frequently by
all threads running in a kernel. We do this by first defining a
C struct with all our information, like so:
typedef struct {
int width;
int height;
float x_max;
float x_min;
float y_max;
float y_min;
float c_re;
float c_im;
} Dimensions;
Now we want to place an instance of our struct onto the GPU for
access. First, we need to declare some space in constant memory:
__constant__ Dimensions dev_d;
Second, we have to create our own instance, and then simply copy it
over! Supposing the variable "d" is a pointer to our host-side
Dimensions instance, we simply run the following code:
cudaMemcpyToSymbol(dev_d, d, sizeof(Dimensions), 0, cudaMemcpyHostToDevice);
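For completeness, the host-side instance might be filled in like this
before the copy (a sketch only; the field values here are arbitrary):
// Hypothetical host-side setup for "d"; pick whatever view window you like.
Dimensions *d = (Dimensions *) malloc(sizeof(Dimensions));
d->width = 1920;   d->height = 1080;
d->x_min = -2.5f;  d->x_max = 1.0f;
d->y_min = -1.0f;  d->y_max = 1.0f;
d->c_re = 0.0f;    d->c_im = 0.0f;   // presumably used by the Julia set variant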
One thing to keep in mind about constant memory is that it is served
by a small per-multiprocessor cache, and a read of the same address
by the threads of a warp (a group of 32 threads executing together)
is broadcast, which can make reads of constant memory extremely
fast. The limitation is that the space is tiny: only 64 KB of
constant memory in total on these devices. For a few constants,
however, this can really improve performance over reading from
global memory. Use constant memory with care, since if used
improperly it can become a serious bottleneck in computation. A few
constants is fine, but an array of hundreds of values would defeat
the purpose of constant memory.
An example of device code for a GPU implementation of a Mandelbrot
set fractal generator would look something like the following. Note:
we assume our equation here has k=2 which makes exponentiating the
complex number simple with just floats.
__global__ void mandelbrot(float *img) {
// Get the current index based on thread and block number
int index = threadIdx.x + blockIdx.x * blockDim.x;
// Derive x and y points
int Px = index % dev_d.width;
int Py = index / dev_d.width;
// Scale x-y coordinates into predefined scaling options
// x0 and y0 are part of our constant c, in the mandelbrot case
float x0 = (float)Px*(dev_d.x_max - dev_d.x_min)/((float)dev_d.width) + (float)dev_d.x_min;
float y0 = (float)Py*(dev_d.y_max - dev_d.y_min)/((float)dev_d.height) + (float)dev_d.y_min;
// Set some other variables, x and y are parts of a complex number, z0 = x + yi
float x = 0;
float y = 0;
float tmp;
int i = 0;
// Perform the escape-time iteration scheme
// BAILOUT_RADIUS is our threshold, MAX_ITER is how many iterations to
// perform before giving up and assuming the equation converges.
while(x*x + y*y < BAILOUT_RADIUS && i < MAX_ITER) {
// Square z (x + iy) and add c (x0 + iy0)
tmp = x*x - y*y + x0;
y = 2*x*y + y0;
x = tmp;
++i;
}
// Part of the "Normalized Iteration Count Algorithm", converts an integer iteration count into
// a real number. Not particularly important, but useful for making coloring smoother.
if(i == MAX_ITER) img[index] = (float) i;
else img[index] = (float) i + 1.5 - log2(log2(x*x+y*y)/2);
}
All that is left to do is to follow the steps of setting up our
environment and running the kernel. If you have gone through the
course above, the following methods for setting up our image on the
GPU and copying it back should seem very familiar.
float *d_pixels; // pointer to image on device
float *h_pixels; // pointer to image on host
// size is the number of pixels in the image
h_pixels = (float*) malloc(size * sizeof(float)); // Allocate memory on host
cudaMalloc((void **) &d_pixels, size * sizeof(float)); // Allocate memory on device
// Start kernel, size is assumed divisible by THREADS_PER_BLOCK
mandelbrot <<< size/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>> (d_pixels);
// Copy our processed image onto the host
cudaMemcpy(h_pixels, d_pixels, size * sizeof(float), cudaMemcpyDeviceToHost);
// cudaMemcpy on the default stream already waits for the kernel to finish;
// this explicit synchronize is just an extra safeguard before we use the data
cudaDeviceSynchronize();
// ... Insert Image Writing Code Here ...
With this simple setup, you can implement a whole host of
variations. For example, you could generate Julia set fractals, or
make fractal movies by slightly altering parameters between frames,
such as the constant for the Julia set or the exponent "k" in the
equation.
All these features and more are implemented and can be examined by
going into /u/mknyszek/cuda/fractals and having a look at the files.
2D Fluid Simulation
Another useful application of GPU programming is fluid simulations.
In this particular example, we will be working with the
Navier-Stokes equations for an incompressible, constant-density fluid:

    ∂u/∂t = -(u · ∇)u - (1/ρ)∇p + ν∇²u + F
    ∇ · u = 0

This tutorial will also introduce an "ink" whose density is not
constant and is carried around by the fluid. The density of the
"ink" can be described by the following advection equation:

    ∂d/∂t = -(u · ∇)d + κ∇²d + S
Like the previous fractals example, this tutorial will not dive too
deeply into the mathematics. However, it is important to understand
what numerical methods we are using here. The methods described here
are based on the paper Stable Fluids
(Stam, 1999) (ACM Entry) for
some calculations, and we simply use finite difference forms for the
others. To summarize, the idea is that we perform advection,
diffusion, and the addition of body forces, which yields a velocity
field with non-zero divergence. Afterwards, we calculate the
pressure (which can be represented as a Poisson equation) at each
point on the grid using a Jacobi solver, and subtract its gradient
at each point from our velocity to project the velocity onto a space
where it has a divergence of zero, satisfying the second equation
above. We also use the stable method for calculating advection
described in Stable Fluids (Stam, 1999), which involves tracing
backwards in time rather than forwards; this is explained further
down.
Some more important details about this implementation are:
- Numerical Details
- We will use a grid size that is one-to-one with pixels.
- Pixel edges are considered boundaries, whereas data
(velocity, substance) will be at the center of each pixel.
- For simplicity's sake, we will assume a non-viscous fluid
(so, no diffusion of velocity) and we will not allow the
"ink" to diffuse either.
- Treatment of Boundary Conditions
- We will solve the Navier-Stokes equations enforcing solid
wall or "no-slip" boundary conditions for velocity, and pure
Neumann boundary conditions for pressure.
- Because boundary values are handled separately from the
primary operations, we will only perform those operations on
values inside the boundary cells. The boundary cells are the
outermost cells (indices 0 and width-1 on the x axis, and 0
and height-1 on the y axis).
- Technical Details
- Similarly to the fractals example, we will be using a
Dimensions structure loaded into constant memory (see
above), but with slightly different parameters.
- We shall split the components of velocity, u and v, into
separate arrays.
- As with the fractals example, we will "flatten" our image
into a 1D array and use 1D arrays for all quantities.
- Due to the nature of some of the operations performed, we
will have to keep a buffer array of previous values for most
quantities. These previous array values will be stored in
arrays with the suffix "old".
- As is common in most computer graphics applications, we
assume that (0,0) is the top-left corner, with x increasing
to the right and y increasing downward.
Our overall approach will look something like this:
- Add Body Forces to Velocity and Add "ink" from Sources
- Advect Velocity and Density ("ink")
- Calculate Divergence of Velocity
- Solve Pressure Poisson Equation using the Jacobi Method: Ax=b
where x is pressure and b is velocity divergence.
- Use Pressure to Correct/Project Velocity to be Divergence-Free
- Write Density ("ink") Values to Image
First, we define our Dimensions struct and constant memory
allocation:
typedef struct {
int width;
int height;
int steps;
float timestep; // Timestep in Seconds
float dissipation; // Mass Dissipation Constant (how our "ink" dissipates over time). Set to 1 for no dissipation, 0.995 tends to look nice though.
} Dimensions;
...
__constant__ Dimensions ddim;
Next, we will define all our kernels and methods. By design, and
unlike the fractals example, we will have many kernels, each
performing an operation over a large number of points in parallel.
Think of these as "slab" operations, performing calculations on
large areas of memory simultaneously.
Organizing the code into many small kernels keeps things tidy with
only minor launch overhead. Potentially, you could gain a speed
boost by fusing several of these operations into a single kernel,
but you would have to be careful about synchronization: __syncthreads()
only synchronizes the threads within a block, so steps that depend
on results from other blocks still need separate kernel launches. We
use the multiple-kernel technique for simplicity and clarity.
Add Forces/Sources
This is a very simple procedure. We simply add the force or source
amount, multiplied by the size of our timestep, to each point in our
grid.
__global__ void add_source(float *d, float *s) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int width = ddim.width;
int height = ddim.height;
int px = i % width;
int py = i / width;
// Skip Boundary values
if(px > 0 && py > 0 && px < width-1 && py < height-1)
{
// Add source each timestep
d[i] += ddim.timestep * s[i];
}
}
The first argument is left as just "d" because we will use this same
kernel to add our body forces to the horizontal and vertical
components of velocity, in addition to "ink" density.
Advection
The advection kernel is significantly more complicated. As described
in Stable
Fluids (Stam 1999), trying to advect density or velocity
forward in time (by taking the previous velocity at each point and
following it to a new point and putting that velocity there) is
unstable and leads to oscillations in the calculation over time.
Instead, here we take the velocity at each point, follow it in its
opposite direction, and find the velocity at that point using
bilinear interpolation. We then update the velocity at our current
point with that interpolated velocity.
Conveniently, it turns out that the unstable method is also
unsuitable for GPU programming because we can end up with multiple
threads trying to write to the same memory location, which can cause
serious problems! With this method, each thread safely updates only
its own dedicated grid point while only reading other, old grid
points.
__global__ void advect(float *dold, float *d, float *u, float *v, float md) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int width = ddim.width;
int height = ddim.height;
int px = i % width;
int py = i / width;
int px0, py0, px1, py1;
float x, y, dx0, dx1, dy0, dy1;
// Skip Boundary values
if(px != 0 && py != 0 && px != width-1 && py != height-1)
{
// Move "backwards" in time
x = px - ddim.timestep * (width-2) * u[i]; // Multiply by the width of the grid not including the boundary
y = py - ddim.timestep * (height-2) * v[i]; // Multiply by the height of the grid not including the boundary
// Clamp to Edges, that is, if the velocity goes over the edge, clamp it to the boundary value
if (x < 0.5) x = 0.5; if (x > width - 1.5) x = width - 1.5;
if (y < 0.5) y = 0.5; if (y > height - 1.5) y = height - 1.5;
// Setup bilinear interpolation "corner points"
px0 = (int) x; px1 = px0 + 1; dx1 = x - px0; dx0 = 1 - dx1;
py0 = (int) y; py1 = py0 + 1; dy1 = y - py0; dy0 = 1 - dy1;
// Perform a bilinear interpolation
d[i] = dx0 * (dy0 * dold[px0 + width*py0] + dy1 * dold[px0 + width*py1]) +
dx1 * (dy0 * dold[px1 + width*py0] + dy1 * dold[px1 + width*py1]);
// Multiply by the mass dissipation constant
d[i] *= md;
}
return;
}
Here again, "d" is a placeholder variable since we will use both
velocities and density in its place. "dold" refers to the array of
previous values and "d" refers to the array storing new values.
Divergence
For calculating the divergence, we use the finite difference method
for the divergence of a vector field. In our case, delta x and delta
y are both 1 (since we define the grid size to be one-to-one with
pixels). Thus, our kernel is simply:
__global__ void divergence(float *u, float *v, float *div) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int width = ddim.width;
int height = ddim.height;
int px = i % width;
int py = i / width;
// Skip Boundary values
if(px > 0 && py > 0 && px < width-1 && py < height-1)
{
float u_l = u[i-1];
float u_r = u[i+1];
float v_t = v[px + (py-1)*width];
float v_b = v[px + (py+1)*width];
// Calculate divergence using finite difference method
// We multiply by -1 here to reduce the number of negative multiplications in the pressure calculation
div[i] = -0.5 * ((u_r - u_l)/(width-2) + (v_b - v_t)/(height-2));
}
}
Pressure
Here we implement a kernel for one iteration of the Jacobi solver
for pressure, which can be represented as a Poisson equation. Since
each iteration of the solver depends on the last, we have to perform
some pointer swaps and boundary updates in between iterations, which
is why we implement it this way.
__global__ void pressure(float *u, float *v, float *p, float *pold, float *div) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int width = ddim.width;
int height = ddim.height;
int px = i % width;
int py = i / width;
// Skip Boundary values
if(px > 0 && py > 0 && px < width-1 && py < height-1)
{
// left, right, top, bottom neighbors
float x_l = p[i-1];
float x_r = p[i+1];
float x_t = p[px + (py-1)*width];
float x_b = p[px + (py+1)*width];
float b = div[i];
// Jacobi method for solving the pressure Poisson equation
pold[i] = (x_l + x_r + x_t + x_b + b) * 0.25; // Here b is positive because of the extra negative sign in the divergence calculation
}
}
The Jacobi method is normally one of the slowest-converging
iterative methods. However, on the GPU each Jacobi iteration is
incredibly cheap, since the entire pressure array is updated nearly
simultaneously. It is possible to implement the red-black
Gauss-Seidel relaxation algorithm to solve for pressure, but that is
unnecessarily complex for the purposes of this tutorial: from a GPU
programming perspective each step is very similar, and no real
benefit is gained from that implementation here. Regardless, for an
interesting look into the method, see this publication
from the Universidade da Beira Interior.
Projection/Correction
Nothing very interesting occurs in this kernel. Essentially, the
finite difference method for taking the gradient of a scalar field
is used (with dx and dy being 1) to get the gradient of pressure.
Each component of the gradient is then subtracted from the
respective velocity component.
__global__ void correction(float *u, float *v, float *p) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int width = ddim.width;
int height = ddim.height;
int px = i % width;
int py = i / width;
// Skip Boundary values
if(px > 0 && py > 0 && px < width-1 && py < height-1) {
// left, right, top, bottom neighbors
float p_l = p[i-1];
float p_r = p[i+1];
float p_t = p[px + (py-1)*width];
float p_b = p[px + (py+1)*width];
// Find the gradient and perform the correction/projection
u[i] -= 0.5*(p_r - p_l)*(width-2);
v[i] -= 0.5*(p_b - p_t)*(height-2);
}
}
Boundary Conditions
In order to enforce our defined boundary conditions (no-slip and
pure Neumann), we have to have kernels to occasionally update the
boundary conditions of both velocity components and pressure.
For velocity, we want the edge between the boundary grid points and
the closest inner grid points to have a horizontal velocity of zero
along vertical edges, and a vertical velocity of zero along
horizontal edges. Looking at the finite difference approximation, we
can achieve this by setting the normal component of each boundary
grid point's velocity to the negative of the value directly inside
(and simply copying the tangential component).
__global__ void velocity_bc(float *u, float *v) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int width = ddim.width;
int height = ddim.height;
int px = i % width;
int py = i / width;
// Skip Inner Values
if(px == 0)
{
u[i] = -u[i+1];
v[i] = v[i+1];
}
else if(py == 0)
{
u[i] = u[px + (py+1)*width];
v[i] = -v[px + (py+1)*width];
}
else if(px == width-1)
{
u[i] = -u[i-1];
v[i] = v[i-1];
}
else if(py == height-1)
{
u[i] = u[px + (py-1)*width];
v[i] = -v[px + (py-1)*width];
}
}
For pressure, we want the edge between the boundary grid points and
the closest inner grid point to have a derivative of zero with
respect to the normal. By using the finite difference method for the
derivative of pressure in the direction of the normal, we see that
our pressure on the boundary grid point needs to be set equal to the
pressure value of the closest inner grid point.
__global__ void pressure_bc(float *p) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int width = ddim.width;
int height = ddim.height;
int px = i % width;
int py = i / width;
// Skip Inner Values
if(px == 0)
{
p[i] = p[i+1];
}
else if(py == 0)
{
p[i] = p[px + (py+1)*width];
}
else if(px == width-1)
{
p[i] = p[i-1];
}
else if(py == height-1)
{
p[i] = p[px + (py-1)*width];
}
}
Putting It All Together
Finally, we just have to put this all together in the step order we
described above (with a few extra pointer swaps, boundary value
updates, and memory updates in between).
For reference, we set up all our variables first, like we did for
the fractals example:
// Full size of arrays, including boundaries
int size = dim->width * dim->height;
// size is necessarily divisible by a constant THREADS_PER_BLOCK defined elsewhere
int blocks = size/THREADS_PER_BLOCK;
//
// All Device Pointers
//
//velocity and pressure
float *u, *v, *p;
float *uold, *vold, *pold;
float *utemp, *vtemp, *ptemp; // temp pointers, DO NOT INITIALIZE
//divergence of velocity
float *div;
//density
float *d, *dold, *dtemp;
float *map;
//sources
float *sd, *su, *sv;
...
//
// Allocate memory for most arrays
//
...
// Initialize our "previous" values of density and velocity to be all zero
cudaMemset(uold, 0, size * sizeof(float));
cudaMemset(vold, 0, size * sizeof(float));
cudaMemset(dold, 0, size * sizeof(float));
// Host density copy, you can initialize this if you'd like, then copy it to dold
map = (float *) malloc(size * sizeof(float));
// Create the sources using some arbitrary source creation function
// that you wish to use
sd = create_source();
su = create_source();
sv = create_source();
Now, we can begin our main loop, putting all our kernels to good
use:
for(t = 0; t < steps; ++t) {
//add sources to velocity (body force)
add_source <<< blocks, THREADS_PER_BLOCK >>> (uold, su);
add_source <<< blocks, THREADS_PER_BLOCK >>> (vold, sv);
//add sources to density
add_source <<< blocks, THREADS_PER_BLOCK >>> (dold, sd);
//advect horizontal and vertical velocity components
advect <<< blocks, THREADS_PER_BLOCK >>> (uold, u, uold, vold, 1);
advect <<< blocks, THREADS_PER_BLOCK >>> (vold, v, uold, vold, 1);
//advect density
advect <<< blocks, THREADS_PER_BLOCK >>> (dold, d, u, v, 0.995);
//call kernel to compute boundary values and enforce boundary conditions
velocity_bc <<< blocks, THREADS_PER_BLOCK >>> (u, v);
//call kernel to compute the divergence (div)
divergence <<< blocks, THREADS_PER_BLOCK >>> (u, v, div);
//reset pressure to 0 <- This is our initial guess to the Jacobi solver
cudaMemset(p, 0, size * sizeof(float));
//for each Jacobi solver iteration, 80 tends to be enough iterations to get convergence or close enough
for (i = 0; i < 80; ++i)
{
//call kernel to compute pressure
pressure <<< blocks, THREADS_PER_BLOCK >>> (u, v, p, pold, div);
//swap old and new arrays for next iteration
ptemp = pold; pold = p; p = ptemp;
//call kernel to compute boundary values and enforce boundary conditions
pressure_bc <<< blocks, THREADS_PER_BLOCK >>> (p);
}
//call kernel to correct velocity
correction <<< blocks, THREADS_PER_BLOCK >>> (u, v, p);
//call kernel to compute boundary values and enforce boundary conditions
velocity_bc <<< blocks, THREADS_PER_BLOCK >>> (u, v);
//copy density image back to host and synchronize
cudaMemcpy(map, d, size * sizeof(float), cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();
/*************************************************
* *
* Write to Image Here, Or Anywhere Really *
* *
************************************************/
//swap old and new arrays for next iteration
utemp = uold; uold = u; u = utemp;
vtemp = vold; vold = v; v = vtemp;
dtemp = dold; dold = d; d = dtemp;
}
And that's a 2D fluid simulation based on the Navier-Stokes
equations on the GPU! As mentioned above, you can add diffusive
effects to both the density and velocity, giving the fluid a sense
of viscosity and allowing the "ink" to disperse more realistically.
One drawback of this method is that numerical error tends to damp
out vorticity in the fluid's motion, but this can be restored using
vorticity confinement methods.
Overall, compared to traditional serial methods, this tends to run
much faster, and the bottleneck still tends to be the writing of
images as opposed to the actual calculations. If you would like to
view the full source code, go to /u/mknyszek/cuda/nsflow and
all the files should be there.
Debugging CUDA Code
Debugging CUDA code can seem daunting at first because it may seem
almost impossible to look at values in GPU memory without first
copying them back to the CPU. Luckily, NVIDIA provides several tools
for debugging CUDA code and handling errors.
Before you get started, for the most detailed results with the
NVIDIA CUDA debugging tools, make sure to compile your code with the
-G flag (which adds device-side debugging information) in addition
to the -g flag (the standard gcc flag for host debugging
information).
printf Usage
On newer GPUs (such as the ones we have at the lab) you can actually
use printf statements in device code to find out values at a certain
time. The only issue is that a running kernel cannot print values in
real time: anything printed will only appear after the kernel
terminates. During kernel execution the string is placed into a
buffer, which is then flushed at kernel termination. This can
sometimes make debugging very tedious, as you get inundated with
statements printed to the command line all at once.
Of course, you can always isolate a certain thread and only print
values from that thread, but in the end that can be a lot of wasted
effort because the range of values you can see is now very narrow
and it won't bring you to the root of the problem.
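For reference, device-side printf looks something like this (our own
sketch; the kernel and variable names are hypothetical):
// Isolate one thread so the output stays manageable. Nothing appears
// until the kernel finishes and the print buffer is flushed.
__global__ void inspect(float *data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i == 0) {
        printf("data[0] = %f\n", data[0]);
    }
}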
cuda-gdb
NVIDIA has developed an extension to gdb to help diagnose and
eliminate bugs in CUDA programs. Simply run cuda-gdb in the same way
you would run gdb, and cuda-gdb will give you a plethora of
information about your program as it's running, including kernel
launches and any errors or warnings related to CUDA runtime
functions.
cuda-gdb works identically to gdb and lets you view values in global
memory as well as set breakpoints in your device code just as you
would with any other C/C++ program.
While cuda-gdb can save you hours of debugging your device code, it
can sometimes act strangely with host code. When you have problems
with your host code, it is recommended that you stick to plain gdb,
which is much more reliable there.
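A typical session might start like this ("myprog" is a hypothetical
program name):
nvcc -g -G -o myprog myprog.cu
cuda-gdb ./myprog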
cuda-memcheck
If you suspect that your program is doing strange things with GPU
memory, or that memory errors are occurring which cuda-gdb does not
make obvious, NVIDIA also provides a program called cuda-memcheck.
Run it by specifying your program as an argument to cuda-memcheck,
and it will generate a report of any fatal memory errors it finds.
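For example (again with a hypothetical program name):
cuda-memcheck ./myprog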
HANDLE_ERROR Macro
Although NVIDIA includes a hefty error handling system for CUDA
runtime subroutines, it can get very tedious to use for every single
instance of a CUDA runtime function you wish to check.
This macro is not a tool created by NVIDIA, but rather a small
utility commonly defined by developers to handle errors easily while
debugging. Simply define the following function and macro at the top
of your file:
static void HandleError( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
This macro now makes handling CUDA runtime errors very easy. Simply
wrap your CUDA runtime functions with the macro like so:
HANDLE_ERROR( cudaMalloc((void **) &d, size * sizeof(float)) );
Profiling CUDA Code
When your code is complete and working well, you may wish to
identify bottlenecks in running time and optimize the code to
squeeze out as much GPU performance as possible.
NVIDIA also has a pair of tools for profiling: one for the command
line and one with an Eclipse-based GUI. Both work on an ordinary
CUDA build; compile with the -pg flag only if you also want to
profile your host code with gprof, as described below.
nvprof
This is NVIDIA's counterpart to the well-known C program profiler
gprof. nvprof is excellent for performing quick profiling checks on
your code to ensure that time is being spent where you expect it to
be. It's very lightweight and handy, and it can be run on any
program that launches CUDA kernels, including programs that use
OpenACC or CUDA Python. That makes it extremely helpful for
determining where exactly a CUDA kernel is being launched and for
what.
One particularly helpful flag for nvprof is --print-gpu-trace which
prints a detailed GPU stack trace with all function calls, what GPU
they are running on, and their dimensions.
Another helpful flag is --output-profile, which writes the profile
to a file that can later be imported back into nvprof or opened in
nvvp.
If you find that your program uses significant CPU time as well,
using a combination of nvprof and gprof can give you a nice look
into where the most time is being spent.
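For example, assuming a hypothetical executable named myprog:
nvprof ./myprog                                 # summary of GPU time per kernel and memcpy
nvprof --print-gpu-trace ./myprog               # per-launch trace with grid/block dimensions
nvprof --output-profile myprog.prof ./myprog    # save a profile to import into nvvp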
nvvp
nvvp is the NVIDIA Visual Profiler. Simply create a new session,
specify the command line arguments and working directory of your
properly compiled program, and it will generate very detailed
reports regarding the performance of your program. There are also
options to profile only certain portions of your program, among
other things. Those accustomed to Eclipse should find this tool very
easy to use and incredibly useful.
Tips & Tricks
Below is a collection of tips & tricks that have been mostly
found through experience and can come in handy if/when CUDA C/C++
does not behave as expected.
- Take special care not to read from uninitialized device
memory. While this is good practice for any memory read, it is
especially important on the GPU because stale values can
accidentally persist between runs and corrupt your data.
- When attempting to port code over to the GPU, first try to
have it parallelized and working using some other method such as
OpenMP or MPI. When the operations are already parallelized, the
port is much easier because the code is better structured for it.
- The saying "premature optimization is the root of all evil"
also applies to GPU programming. It is better to have something
working than to try to over-optimize it the first time around.
Even unoptimized GPU code tends to be significantly faster than
serial code.
- Do not forget that CUDA C/C++ comes with its own math
library, including things such as hypot() (which will compute
the sqrt(x^2 + y^2)). Built-in functions tend to be much faster
and hypot(), for example, is guaranteed to only overflow if the
result would overflow, meaning that the built-in library tends
to be much safer.
- Too many math operations slowing you down? Try compiling with
nvcc's --use_fast_math flag. Your program will lose some numerical
accuracy, but will get a significant performance boost.
- CUDA C/C++ comes with vector types (float2, float3, int4,
double3, etc.) with their own constructors (make_float3() and so
on) that guarantee memory alignment. Use them, but use them with
caution: not much documentation exists, so you may find yourself
wondering if an array or matrix of these vector types is worth
the headache. A short sketch follows this list.
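As a quick illustration of the vector types (our own sketch, not
from any of the courses above):
// float3 and friends are plain structs with guaranteed alignment.
float3 a = make_float3(1.0f, 2.0f, 3.0f);
float3 b = make_float3(4.0f, 5.0f, 6.0f);
// Arithmetic operators are not defined for these types by default; write
// them yourself or include a helper header such as helper_math.h from the
// CUDA samples.
float3 c = make_float3(a.x + b.x, a.y + b.y, a.z + b.z);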