Why GPUs?

performance

Exploit coarse-grained and fine-grained parallelism.

GPUs can be faster than CPUs, but they can be slower too. Results depend on the code you write.

 

Outline

  1. Definitions

  2. GPU hardware: streaming multiprocessors (SM) and memory model

  3. Grids, blocks, threads, warps

  4. CUDA programming concepts

  5. Example programs

 

CUDA programming model basics

cpu gpu

 

Key CUDA abstractions

  1. Hierarchy of thread groups

  2. Shared memories

  3. Barrier synchronization

These are exposed to the programmer through a minimal set of language abstractions.

Partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads. (Blocks are independent and parallel.)

Partition sub-problems into finer pieces that can be solved cooperatively in parallel by all threads within a block. (Threads work together and in parallel.)

 

GPU hardware examples

fermi

kepler

 

The Nvidia Jetson TX2 we will be using in our lab is based on the "Pascal" architecture. It has 2 SMs with 128 cores/SM. A stand-alone GPU based on Pascal:

 

Modern GPUs include hardware for special instructions (sine, cosine, etc.), as well as single and double precision floating-point units.

GPU software/hardware

gpu hw sw

warps

Typically the number of threads per block should be a multiple of 32 (128 and 256 are typical choices).

An SM can concurrently execute up to 8 thread blocks.

 

GPU memory model

There are different types of memory in the memory hierarchy, with different accessibility. The following is the hierarchy as seen by the programmer.

memory-heirarchy

 

The physical memory is designed as follows.

memory model

Memory accesses attempt to hit in the L1 cache, then the L2 cache, then global memory.

Load granularity is 128-byte line.

Memory operations are issued per warp (32 threads) just like instructions (SIMT).

Strive for coalesced memory accesses (the threads of a warp should access a contiguous region of memory).

Heterogeneous Programming

A CUDA program mixes ordinary C code that runs on the host with kernel code that runs on the device. This is called heterogeneous programming. The sequence of operations can be illustrated by:

Heterogeneous Programming

Typical sequence of operations in a CUDA C program (a code sketch follows the list)

  1. Declare pointers to host and device memory

  2. Allocate memory on host and device

  3. Initialize memory on host

  4. Copy data from host to device memory

  5. Execute one or more kernels that run on the device and operate on device memory

  6. Copy results from device to host memory
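The six steps can be sketched in CUDA C as follows (a minimal sketch assuming the usual headers; mykernel, N, gridSize, and blockSize are placeholder names):

    float *h_x, *d_x;                                        // 1. pointers to host and device memory
    h_x = (float *)malloc(N * sizeof(float));                // 2. allocate on the host ...
    cudaMalloc(&d_x, N * sizeof(float));                     //    ... and on the device
    for (int i = 0; i < N; i++) h_x[i] = 1.0f;               // 3. initialize on the host
    cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);   // 4. copy host -> device
    mykernel<<<gridSize, blockSize>>>(d_x, N);               // 5. kernel runs on device memory
    cudaMemcpy(h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);   // 6. copy device -> host
    cudaFree(d_x);  free(h_x);                               // clean up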

 

Example program

Single-precision A*X plus Y = SAXPY

$$\mathbf{y} = a\,\mathbf{x} + \mathbf{y}, \qquad \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = a \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad y_i = a x_i + y_i, \quad i = 1, 2, \ldots, n \tag{1}$$

Example C program (host only)
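The full listing is not reproduced in this extract; the core of a host-only SAXPY in plain C is a simple loop (function and variable names are assumptions):

    void saxpy_cpu(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];    // each iteration is independent of the others
    }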

 

Example CUDA C program (host + device)
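The full listing is not reproduced here either; the device side is the conventional SAXPY kernel (identifier names are assumptions):

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per vector element
        if (i < n)                                       // guard: the last block may have extra threads
            y[i] = a * x[i] + y[i];
    }

The host side follows the six-step pattern above: allocate x and y on the host and device, copy them to the device, launch saxpy, and copy y back.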

Compile and run the code:
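Assuming the source file is named saxpy.cu, compilation and execution look like:

    nvcc -o saxpy saxpy.cu
    ./saxpy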

 

Function execution space specifiers

Specifier         Use
__device__        Execute on device, call from device only
__global__        Execute on device, call from host, must be called with an execution configuration <<< gridSize, blockSize >>>
__host__          Execute on host, call from host only; this is the default if no specifier is given
__noinline__      Avoid inlining
__forceinline__   Force inlining

 

Variable memory space specifiers

Specifier        Use
(none)           An automatic variable declared in device code without memory space specifiers generally resides in a register
__device__       Declares a device variable in global memory, accessible from all threads, with lifetime of the application
__constant__     Declares a device variable in constant memory, accessible from all threads, with lifetime of the application
__shared__       Declares a device variable in the block's shared memory, accessible from all threads within the block, with lifetime of the block
__restrict__     Standard C definition that pointers are not aliased (so that compiler optimizations may be performed)
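A small illustrative sketch of these specifiers (all names are made up; the host would set d_scale and d_coeff with cudaMemcpyToSymbol()):

    __device__   float d_scale;                // global memory, accessible from all threads, application lifetime
    __constant__ float d_coeff[4];             // constant memory, read-only in kernels

    __global__ void scale(float * __restrict__ out, const float * __restrict__ in, int n)
    {
        __shared__ float tile[256];            // shared memory, one copy per block, block lifetime
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic variable i resides in a register
        if (i < n) {
            tile[threadIdx.x] = in[i];         // assumes blockDim.x <= 256
            out[i] = d_scale * tile[threadIdx.x] + d_coeff[0];
        }
    }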

memory model

variable declaration

variable performance

 

Kernel built-in variables

Type    Variable     Function
dim3    gridDim      Grid size in blocks: user specified <<< gridDim, ... >>>
dim3    blockDim     Block size in threads: user specified <<< ..., blockDim >>>
uint3   blockIdx     Block index within the grid: 0, 1, 2, ..., gridDim-1
uint3   threadIdx    Thread index within the block: 0, 1, 2, ..., blockDim-1
int     warpSize     Warp size in threads

The built-in variables let threads figure out what work they are supposed to do.

 

Grid, Blocks, Threads

grids, blocks, threads

The programmer controls the size and dimensions of grids and blocks when kernels are launched through the execution configuration parameters.

thread blocks

The configuration of the grid, blocks, and threads can be different on each kernel call.
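For example, a two-dimensional configuration for an image-sized problem might look like this (myImageKernel, d_img, width, and height are placeholders):

    dim3 blockSize(16, 16);                                    // 256 threads per block, arranged 16 x 16
    dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,    // enough blocks to cover the image
                  (height + blockSize.y - 1) / blockSize.y);
    myImageKernel<<<gridSize, blockSize>>>(d_img, width, height);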

 

Execution configuration

In the SAXPY example, the saxpy kernel was launched using the execution configuration shown below.
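The launch line itself is not reproduced in this extract; for vectors of N = 1 << 20 elements it would look something like:

    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);   // 256 threads per block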

This launches $2^{12} = 4096$ blocks with $2^8 = 256$ threads each for a total of $2^{20}$ threads altogether. That's one thread for each element in the vectors.

Let T be the number of threads per block and N be the total number of units of work to be done. Then the number of blocks needed to complete all the work is:

$$B = \left\lfloor \frac{N + T - 1}{T} \right\rfloor \tag{2}$$

where $\lfloor\cdot\rfloor$ floors the argument, which is what happens when doing the integer divide (N+255)/256 in C.

If N is an exact multiple of T, then $B = N/T$; otherwise $B = \lfloor N/T \rfloor + 1$. Let's look at some examples.

Suppose $N = 20$ and $T = 5$. Then $B = \lfloor 24/5 \rfloor = 4 = N/T$. This is correct, because exactly $B = 4$ blocks with $T = 5$ threads each can do $N = 20$ units of work.

On the other hand, if $N = 21$ and $T = 5$, then $B = \lfloor 25/5 \rfloor = 5 = \lfloor N/T \rfloor + 1$. This is correct, because $B = 5$ blocks of $T = 5$ threads each can do 25 units of work. That's more than the $N = 21$ that are needed, and in the last block four threads sit idle while the fifth does its work.

 

Splitting up the work

When you have a computational problem to solve, recognize opportunities for parallelization and look for independence in calculations. Map out ways to divide the work among blocks and threads. In the SAXPY example, we decided to create one thread for every element in the arrays: that's 1M threads. We created $2^{20} \approx 10^6$ threads with $4096 = 2^{12}$ blocks in the grid and $256 = 2^8$ threads in each block. Each thread operated on independent data and stored results to independent memory locations.

Let's look again at our SAXPY kernel and the built-in variables.

The saxpy kernel launches $2^{20}$ threads: 4096 blocks of 256 threads each. Each thread knows which block it's in through the blockIdx.x built-in variable. Each thread knows which thread it is through the threadIdx.x built-in variable. It also knows the number of blocks through the gridDim.x variable and the number of threads per block through the blockDim.x built-in variable.

cuda indexing

The fourth thread in the third block can compute its global index (515) and then do work based on that index. Every thread does its own work; no other thread will do the work for index 515.
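In code, with blockDim.x = 256 and zero-based indices (so this thread has blockIdx.x = 2 and threadIdx.x = 3):

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // 2 * 256 + 3 = 515
    if (i < n)
        y[i] = a * x[i] + y[i];                      // only this thread touches element 515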

 

Synchronization concepts

Data transfers using cudaMemcpy() are synchronous (blocking): they do not begin until all previously issued CUDA calls have completed, and subsequent CUDA calls can not begin until the synchronous copy has completed.

Kernel launches are asynchronous: control returns immediately to the CPU and does not wait for the kernel to complete.
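A small sketch of the resulting behavior (mykernel and the buffers are placeholders):

    mykernel<<<B, T>>>(d_x, N);               // asynchronous: control returns to the CPU immediately
    // ... independent host work can execute here while the kernel runs ...
    cudaMemcpy(h_x, d_x, N * sizeof(float),
               cudaMemcpyDeviceToHost);       // synchronous: waits for the kernel, then performs the copy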

 

Timing using CUDA events

cudaEventRecord(…) records the state of the stream at the time of the call but is asynchronous. cudaEventSynchronize(…) waits until the completion of all device work preceding the most recent call to cudaEventRecord(…). cudaEventElapsedTime(…) computes the elapsed time between two events. This will be more accurate than using CPU timers.
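A typical timing sketch (the saxpy launch stands in for whatever device work is being timed):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                      // mark the beginning of the timed region
    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);
    cudaEventRecord(stop, 0);                       // mark the end

    cudaEventSynchronize(stop);                     // wait until the device reaches the stop event
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);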

 

Querying device properties

The cudaDeviceProp structure contains lots of information.
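A hedged sketch that prints a few of its fields (assumes stdio.h is included):

    cudaDeviceProp prop;
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  max threads/block:  %d\n", prop.maxThreadsPerBlock);
        printf("  shared mem/block:   %zu bytes\n", prop.sharedMemPerBlock);
    }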

 

Compute capabilities

Different GPUs have different compute capabilities. You should know the compute capability of your device, its capabilities and limitations.

See Nvidia's Compute Capability Table

 

Handling CUDA errors

Errors in CUDA C code can arise from:

  1. Errors when CUDA runtime API calls fail

  2. Errors in CUDA kernel calls (e.g. invalid memory access inside a kernel)

All CUDA runtime API calls return an error value. Including explicit error checking code interferes with readability, but it can save hours of debugging. Many people define macros for error checking.

This code catches the return values from CUDA runtime API calls. If any error occurs, cudaGetErrorString(…) returns a string describing the error.
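That code is not reproduced in this extract; a typical macro, consistent with the HANDLE_ERROR(…) calls that appear later in these notes, looks like this (assumes stdio.h and stdlib.h are included):

    static void HandleError(cudaError_t err, const char *file, int line)
    {
        if (err != cudaSuccess) {
            printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
            exit(EXIT_FAILURE);
        }
    }
    #define HANDLE_ERROR(err) (HandleError((err), __FILE__, __LINE__))

    HANDLE_ERROR(cudaMalloc(&d_x, N * sizeof(float)));    // example use: wrap every runtime API call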

Errors in kernels can be synchronous or asynchronous. Synchronous errors (such as exceeding the thread count) can be caught by a call to cudaGetLastError(). Asynchronous errors (such as out-of-bounds memory accesses) require a synchronization mechanism. cudaDeviceSynchronize() blocks the host thread until all previously issued commands have completed. Generally this is only used for debugging because it destroys any potential for concurrency.
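For example, debug-style checking after a kernel launch might look like:

    mykernel<<<gridSize, blockSize>>>(d_x, N);
    HANDLE_ERROR(cudaGetLastError());            // catches synchronous launch errors
    HANDLE_ERROR(cudaDeviceSynchronize());       // catches asynchronous errors from the kernel itself (debugging only)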

 

Profiler nvprof

Nvidia's visual profiler shows the execution timeline.

visual profiler

 

Other Tools in the CUDA SDK

cuda-memcheck detects out-of-bounds and misaligned memory access errors in CUDA apps. It also reports GPU hardware exceptions and memory leaks.

The racecheck tool reports shared memory data access hazards.

The initcheck tool reports cases where the GPU accesses uninitialized global memory.

The synccheck tool reports cases where the application attempts invalid use of synchronization primitives.

When developing CUDA applications, do the following often (example commands follow the list):

  1. compile using nvcc

  2. run your app through cuda-memcheck
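For example (myapp.cu is a placeholder name):

    nvcc -lineinfo -o myapp myapp.cu       # -lineinfo lets cuda-memcheck report source line numbers
    cuda-memcheck ./myapp                  # default memcheck tool
    cuda-memcheck --tool racecheck ./myapp
    cuda-memcheck --tool initcheck ./myapp
    cuda-memcheck --tool synccheck ./myapp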

 

Parallel Reduction

Consider the problem of summing the elements of a vector. This is needed when computing the norm/length of a vector or the inner/dot product of two vectors. Suppose we have a vector of $N = 2^{20}$ elements: $\mathbf{x} = [x_0, x_1, x_2, \ldots, x_{N-1}]$. We want to compute $s = \sum_{i=0}^{N-1} x_i$. In a conventional C program, this can be done on the CPU as shown below.
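The host code is not reproduced in this extract; a minimal sketch of the sequential sum is:

    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += x[i];              // every iteration depends on the previous one through s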

 

Synchronization functions

void __syncthreads();

Coordinates communication between threads in the same thread block.

Avoid placing __syncthreads(); inside conditional code unless the condition evaluates the same way for every thread in the block; otherwise some threads wait at a barrier that other threads never reach.

 

How can parallelism be exploited to efficiently compute the sum? Consider the graph below.

parallel sum

In the first stage, N/2 threads perform one add each.

In the second stage, N/4 threads perform one add each while N/2 threads sit idle.

In the third stage, N/8 threads perform one add each, while 3N/8 threads sit idle.

And so on.

If N is large, we cannot launch N/2 threads. In this case, have each thread pre-sum several elements before the tree reduction begins.

Write a kernel function to perform parallel dot product.

This example shows how shared memory can be allocated statically and dynamically. It shows how shared memory enables threads in the same block to work together in parallel. It illustrates how thread synchronization may be needed.
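The kernel itself is not reproduced in this extract; a sketch using dynamically allocated shared memory (N is assumed to be a compile-time constant for the vector length; a static alternative would declare __shared__ float cache[256]; for a fixed block size) is:

    __global__ void dot(float *a, float *b, float *partial_c)
    {
        extern __shared__ float cache[];                 // size supplied at launch (third <<< >>> parameter)
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int cacheIndex = threadIdx.x;

        float temp = 0.0f;                               // pre-sum over a grid-stride loop
        while (tid < N) {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;
        }
        cache[cacheIndex] = temp;
        __syncthreads();                                 // all products stored before reducing

        for (int i = blockDim.x / 2; i > 0; i /= 2) {    // tree reduction (blockDim.x must be a power of 2)
            if (cacheIndex < i)
                cache[cacheIndex] += cache[cacheIndex + i];
            __syncthreads();                             // outside the if, so every thread reaches it
        }

        if (cacheIndex == 0)
            partial_c[blockIdx.x] = cache[0];            // one partial sum per block
    }

The host (or a second kernel) then adds up the one-per-block partial sums.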

Example: Parallel Histogram

How would you create, in parallel, a 256-bin histogram of a very long list of bytes?
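A sketch of a histogram kernel that uses per-block bins in shared memory and atomicAdd() (buffer, size, and histo are assumed names; the launch is assumed to use 256 threads per block):

    __global__ void histo_kernel(const unsigned char *buffer, long size, unsigned int *histo)
    {
        __shared__ unsigned int temp[256];       // one private histogram per block
        temp[threadIdx.x] = 0;                   // each of the 256 threads clears one bin
        __syncthreads();

        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {                       // grid-stride loop over the input bytes
            atomicAdd(&temp[buffer[i]], 1);      // atomic add in shared memory
            i += stride;
        }
        __syncthreads();

        atomicAdd(&histo[threadIdx.x], temp[threadIdx.x]);   // merge into the global histogram
    }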

This example introduces a new concept: atomic adds in global and shared memory. There are many other atomic operations. An atomic operation performs three steps:

  1. read a value from memory

  2. operate on the value

  3. write the result back to memory

Because the execution of these steps cannot be interrupted partway through without the possibility of errors, the operation is called atomic. Atomic operations are available for various data types.

Name           Function
atomicAdd()    old = old + val, return old
atomicSub()    old = old - val, return old
atomicExch()   old = val, return old
atomicMin()    old = min(old, val), return old
atomicMax()    old = max(old, val), return old
atomicInc()    old = ((old >= val) ? 0 : (old+1)), return old
atomicDec()    old = (((old == 0) | (old > val)) ? val : (old-1)), return old
atomicCAS()    old = (old == compare ? val : old), return old (Compare And Swap)
atomicAnd()    old = old & val, return old
atomicOr()     old = old | val, return old
atomicXor()    old = old ^ val, return old

Streams

It is possible to execute two or more tasks at once. This is a form of coarse-grained parallelism, in which unrelated operations execute in parallel. It can be thought of as similar to threads on a multi-core CPU, although with less flexibility. Each task is executed in a stream. The copy engine and the kernel engine can each execute one operation at the same time, so a data transfer in one stream can overlap a kernel running in another.

Streams

The mapping of the streams onto GPU engines can be shown as:

We can view the call dependency of the different tasks as:

Call Dependency

where the arrows indicate that the kernel execution for stream 0 must complete before the memcpy C can start. The resulting execution order will be:

Execution Order

Notice that there are "bubbles" in the execution where the engines are idle. By reordering the operations in the streams it is possible to remove the bubbles.

Overlapped Calls

Important concepts:

  1. Memory must be page-locked (pinned). This keeps the operating system from paging the memory out to disk.

  2. Memory transfers are asynchronous (issued with cudaMemcpyAsync()).

  3. The stream objects must be allocated and initialized:

    cudaStream_t stream0;

    HANDLE_ERROR(cudaStreamCreate( &stream0 ));

  4. Kernel launches must include a stream object:

    dot<<< B , T , B*sizeof(float), stream0 >>>( dev_a, dev_b, dev_partial_c );

  5. Overlap calls to each stream, as sketched below.
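Putting the pieces together, a hedged sketch of two overlapped streams processing a long array in chunks (kernel, CHUNK, FULL_SIZE, and the dev_* buffers are placeholders; the host arrays a, b, c are assumed to be pinned):

    for (int i = 0; i < FULL_SIZE; i += 2 * CHUNK) {
        // stream0 works on one chunk ...
        cudaMemcpyAsync(dev_a0, a + i, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream0);
        cudaMemcpyAsync(dev_b0, b + i, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream0);
        kernel<<<B, T, 0, stream0>>>(dev_a0, dev_b0, dev_c0);
        cudaMemcpyAsync(c + i, dev_c0, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream0);

        // ... while stream1 works on the next chunk
        cudaMemcpyAsync(dev_a1, a + i + CHUNK, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream1);
        cudaMemcpyAsync(dev_b1, b + i + CHUNK, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream1);
        kernel<<<B, T, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
        cudaMemcpyAsync(c + i + CHUNK, dev_c1, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream1);
    }
    cudaStreamSynchronize(stream0);              // wait for everything queued in each stream to finish
    cudaStreamSynchronize(stream1);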

 

GPU Accelerated Libraries

NVidia Web Site

 

Challenge problems

  1. RGB to gray conversion

  2. Sobel edge detection