Chapter 5: Samples 04–06 — Real Algorithms and Performance

These three samples move from toy examples to real parallel algorithms. They introduce 2D problems, a non-trivial parallel pattern (reduction), and the most important GPU optimisation technique (local memory tiling).


Sample 04 — Naive Matrix Multiplication

Repository: OpenCL-Samples on Github File: 04_MatrixMultiply/Program.cs
Goal: Implement C = A × B for 512×512 matrices using a 2D NDRange. Measure GPU vs CPU performance.

New concepts: 2D NDRange, one work item per output element, GFLOP/s, row-major memory layout.

Mathematical Background

Matrix multiplication C = A × B means:

C[row, col] = Σ (A[row, k] * B[k, col])   for k = 0..N-1

Each output element requires N multiply-add operations. For N=512: 512³ = 134 million operations per matrix multiply.

Row-Major Layout

2D matrices are stored as 1D arrays in row-major order:

Matrix (3×3):          Array (9 elements):
A[0,0]  A[0,1]  A[0,2]    → index: row*N + col
A[1,0]  A[1,1]  A[1,2]    A[0,0]=a[0]  A[0,1]=a[1]  A[0,2]=a[2]
A[2,0]  A[2,1]  A[2,2]       A[1,0]=a[3]  A[1,1]=a[4] ...

The Kernel

__kernel void matmul_naive(
    __global const float* A,
    __global const float* B,
    __global       float* C,
    int N)
{
    int row = get_global_id(0);   // which output row this work item owns
    int col = get_global_id(1);   // which output column
 
    if (row >= N || col >= N) return;
 
    // Dot product of A[row, *] and B[*, col]
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
        sum += A[row * N + k] * B[k * N + col];
 
    C[row * N + col] = sum;
}

Each work item computes exactly one output element. With a 512×512 matrix, we launch 512×512 = 262,144 work items.

                Work Item Grid (N×N = 512×512)
                ┌─────────────────────────────────────────┐
row=0 │ WI(0,0) WI(0,1) WI(0,2) ... WI(0,511) │ → computes C[0,0..511]
row=1 │ WI(1,0) WI(1,1) WI(1,2) ... WI(1,511) │ → computes C[1,0..511]
  ... │                                         │
row=511│ WI(511,0)               ... WI(511,511)│ → computes C[511,0..511]
       └─────────────────────────────────────────┘

The 2D Launch

unsafe
{
    nuint* gs = stackalloc nuint[] { (nuint)N, (nuint)N };       // {512, 512}
    nuint* ls = stackalloc nuint[] { (nuint)LocalN, (nuint)LocalN }; // {16, 16}
    cl.EnqueueNdrangeKernel(queue, kernel, 2u, (nuint*)null, gs, ls,
                            0u, (nint*)null, out nint _);
}

LocalN = 16 means each work group is a 16×16 tile of 256 work items.

Performance Measurement

// GFLOP/s: each output element requires N multiply-adds = 2*N³ total operations
double gflops = 2.0 * N * N * N / 1e9;
 
Console.WriteLine($"  CPU: {swCpu.ElapsedMilliseconds} ms  ({gflops / swCpu.Elapsed.TotalSeconds:F1} GFLOP/s)");
Console.WriteLine($"  GPU: {swGpu.ElapsedMilliseconds} ms  ({gflops / swGpu.Elapsed.TotalSeconds:F1} GFLOP/s)");
Console.WriteLine($"  Speedup: {(double)swCpu.ElapsedMilliseconds / swGpu.ElapsedMilliseconds:F1}×");

Typical results (N=512):

TimeGFLOP/s
CPU~300 ms~0.45
GPU (naive)~10 ms~13
Speedup~30×

The Memory Access Problem

The naive kernel has a hidden problem: strided memory access.

Reading A[row, *]:  A[row*N+0], A[row*N+1], ...   ← consecutive → FAST
Reading B[*, col]:  B[0*N+col], B[1*N+col], ...   ← stride N   → SLOW

When consecutive work items read B[k*N+col] for different values of col:

  • Work item 0 reads B[k*N+0]
  • Work item 1 reads B[k*N+1]
  • Work item 2 reads B[k*N+2]

These addresses ARE consecutive in memory → GPU memory controller can coalesce them into one wide transaction. But within a single work item, reading successive values of B[k*N+col] for k=0,1,2,… accesses addresses N*4 bytes apart — causing cache misses on every access.

Sample 06 fixes this with tiling.


Sample 05 — Parallel Reduction

Repository: OpenCL-Samples on Github File: 05_Reduction/Program.cs
Goal: Compute the sum of 4 million floats using a two-phase GPU reduction.

New concepts: Local memory (__local), barrier(), tree reduction pattern.

Why Not Just Loop?

A naive serial sum on the GPU:

// Bad: only one work item does work, zero parallelism
if (gid == 0) {
    float sum = 0;
    for (int i = 0; i < n; i++) sum += input[i];
    output[0] = sum;
}

This gives zero speedup. We need all work items to participate simultaneously.

The Tree Reduction Idea

With N items and a tree-shaped computation, we can sum N values in log₂(N) steps:

In 3 steps we reduced 8 values to 1. For N=256 (local work group size), this takes only 8 barrier steps.

The Kernel

__kernel void reduce_sum(
    __global const float* input,      // large input array
    __global       float* partials,   // one result per work group
    __local        float* scratch,    // local memory scratch buffer
    int n)
{
    int gid       = get_global_id(0);
    int lid       = get_local_id(0);
    int groupSize = get_local_size(0);
 
    // Step 1: each work item loads one value into local memory
    scratch[lid] = (gid < n) ? input[gid] : 0.0f;
 
    // All items must finish writing before we start reading
    barrier(CLK_LOCAL_MEM_FENCE);
 
    // Step 2: tree reduction
    // stride = groupSize/2, groupSize/4, ..., 1
    for (int stride = groupSize / 2; stride > 0; stride >>= 1)
    {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
 
        // Wait for this round before starting the next
        barrier(CLK_LOCAL_MEM_FENCE);
    }
 
    // Step 3: work item 0 writes the group's sum
    if (lid == 0)
        partials[get_group_id(0)] = scratch[0];
}

The __local Memory Argument

Local memory requires a special SetKernelArg call — you pass the size (not data):

// Argument 2 is __local memory: pass null and the SIZE in bytes
// OpenCL allocates this many bytes of local memory per work group
nuint localMemBytes = (nuint)(LocalSize * sizeof(float));
unsafe { cl.SetKernelArg(kernel, 2u, localMemBytes, (void*)null); }

The void*null signals “allocate local memory”, not “pass a pointer”.

Two-Phase Strategy

// Phase 1: GPU reduces N elements → numGroups partial sums
int numGroups  = (N + LocalSize - 1) / LocalSize;   // = ceil(N / LocalSize)
int globalSize = numGroups * LocalSize;              // rounded up
 
unsafe
{
    nuint gs = (nuint)globalSize;
    nuint ls = (nuint)LocalSize;
    cl.EnqueueNdrangeKernel(queue, kernel, 1u, (nuint*)null, &gs, &ls,
                            0u, (nint*)null, out nint _);
    cl.Finish(queue);
}
 
// Phase 2: CPU sums the small partials array (numGroups is small)
float[] partials = new float[numGroups];
// ... download partials ...
double gpuSum = 0;
for (int i = 0; i < numGroups; i++) gpuSum += partials[i];

For N=4,194,304 and LocalSize=256: numGroups = 16,384 partial sums on GPU, then 16,384 additions on CPU (trivial).

barrier() — The Synchronisation Wall

graph LR
    subgraph "Without barrier (WRONG)"
        A1["Item 0 writes scratch[0]"] --> C1["Item 0 reads scratch[4]\n(might be old!)"]
        B1["Item 4 writes scratch[4]"] --> C1
    end

    subgraph "With barrier (CORRECT)"
        A2["Item 0 writes scratch[0]"]
        B2["Item 4 writes scratch[4]"]
        A2 --> BAR["barrier(CLK_LOCAL_MEM_FENCE)\nAll items stop here until all writes complete"]
        B2 --> BAR
        BAR --> C2["Item 0 reads scratch[4]\n(guaranteed fresh)"]
    end

barrier ensures memory consistency within a work group. It does not work across work groups — that’s why we need the CPU phase.


Sample 06 — Tiled Matrix Multiplication

Repository: OpenCL-Samples on Github File: 06_MatrixMultiply_Tiled/Program.cs
Goal: Dramatically accelerate matrix multiplication by loading tiles into local memory.

New concepts: Tiling strategy, collaborative data loading, memory access coalescing.

The Problem with Sample 04

In the naive kernel, each work item reads a full column of B from global memory:

B column access (N=512):
  B[0*512+col], B[1*512+col], ... B[511*512+col]
  ← stride of 2048 bytes between consecutive reads
  ← 512 global memory transactions PER OUTPUT ELEMENT
  ← N² output elements → N³ = 134M global reads of B

This is extremely bandwidth-hungry. We can do much better.

The Tiling Idea

The key insight: Instead of each work item independently fetching its own row/column from global memory, the whole work group collaboratively loads a tile of A and a tile of B into local memory once, then all work items compute their partial results from local memory.

The Tiled Kernel

#define TILE 16
 
__kernel void matmul_tiled(
    __global const float* A,
    __global const float* B,
    __global       float* C,
    int N)
{
    // Two TILE×TILE tiles in fast local memory
    __local float tileA[TILE][TILE];
    __local float tileB[TILE][TILE];
 
    int row      = get_global_id(0);
    int col      = get_global_id(1);
    int localRow = get_local_id(0);
    int localCol = get_local_id(1);
 
    float sum = 0.0f;
 
    int numTiles = (N + TILE - 1) / TILE;
    for (int t = 0; t < numTiles; t++)
    {
        // --- Collaborative load: each work item loads ONE element of each tile ---
        int aCol = t * TILE + localCol;
        int bRow = t * TILE + localRow;
 
        tileA[localRow][localCol] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        tileB[localRow][localCol] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
 
        // Wait: ALL items must finish loading before any starts computing
        barrier(CLK_LOCAL_MEM_FENCE);
 
        // --- Partial dot product using ONLY local memory ---
        for (int k = 0; k < TILE; k++)
            sum += tileA[localRow][k] * tileB[k][localCol];
 
        // Wait: ALL items must finish computing before next tile is loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }
 
    if (row < N && col < N)
        C[row * N + col] = sum;
}

Why Two Barriers?

The kernel has two barrier() calls per tile iteration:

1st barrier: after loading tiles
   → guarantees all 16×16 elements of tileA and tileB are in local memory
     before any item reads from them

2nd barrier: after computing partial sums
   → guarantees all items finish reading the tiles
     before any item overwrites them with the next tile's data

Without the 2nd barrier, a fast work item might start loading tile t+1 while a slow item is still reading tile t.

Performance Comparison

The sample runs both the naive (Sample 04) and tiled kernels on the same data:

// Run naive kernel
SetArgs(cl, kernNaive, bufA, bufB, bufCn, N);
Console.Write("Naive kernel...  ");
var swNaive = Stopwatch.StartNew();
RunKernel(cl, queue, kernNaive, N, TileSize);
swNaive.Stop();
 
// Run tiled kernel
SetArgs(cl, kernTiled, bufA, bufB, bufCt, N);
Console.Write("Tiled kernel...  ");
var swTiled = Stopwatch.StartNew();
RunKernel(cl, queue, kernTiled, N, TileSize);
swTiled.Stop();
 
Console.WriteLine($"  Speedup: {(double)swNaive.ElapsedMilliseconds / swTiled.ElapsedMilliseconds:F2}×");

Typical output (N=512, TILE=16):

Naive kernel...   85 ms
Tiled kernel...    8 ms
Speedup: 10.6×

Why the Speedup?

MetricNaiveTiled
Global reads of B per output elementN (= 512)N / TILE (= 32)
Local memory reads per output element0N (= 512)
Global memory bandwidth used~100%~6%

For each of the N² output elements, the naive kernel reads N values of B from global memory (N³ total). With tiling, global reads of B are reduced by a factor of TILE — because each tile of B is read once from global memory and then reused TILE times from local memory.

Global reads of B:
  Naive:  N²  × N    = N³      reads    (each output element reads its column)
  Tiled:  N²  × N/TILE = N³/TILE reads   (each tile is shared by TILE items)

With TILE=16: a 16× reduction in global memory traffic for B.

Checking the TileSize Constant

Notice how the C# constant TileSize is injected into the kernel source using string interpolation:

const int TileSize = 16;
 
string KernelSource = $$"""
    #define TILE {{TileSize}}    // ← C# interpolation inserts "16"
    ...
    """;

The $$"""...""" syntax uses {{expr}} for interpolation (double braces because the kernel code itself uses {}). This ensures the C# value and the C #define are always in sync.