Chapter 5: Samples 04–06 — Real Algorithms and Performance

These three samples move from toy examples to real parallel algorithms. They introduce 2D problems, a non-trivial parallel pattern (reduction), and the most important GPU optimisation technique (local memory tiling).

Sample 04 — Naive Matrix Multiplication

Repository: OpenCL-Samples on Github File: 04_MatrixMultiply/Program.cs
Goal: Implement C = A × B for 512×512 matrices using a 2D NDRange. Measure GPU vs CPU performance.

New concepts: 2D NDRange, one work item per output element, GFLOP/s, row-major memory layout.

Mathematical Background

Matrix multiplication C = A × B means:

C[row, col] = Σ (A[row, k] * B[k, col])   for k = 0..N-1

Each output element requires N multiply-add operations. For N=512: 512³ = 134 million operations per matrix multiply.

Row-Major Layout

2D matrices are stored as 1D arrays in row-major order:

Matrix (3×3):          Array (9 elements):
A[0,0]  A[0,1]  A[0,2]    → index: row*N + col
A[1,0]  A[1,1]  A[1,2]    A[0,0]=a[0]  A[0,1]=a[1]  A[0,2]=a[2]
A[2,0]  A[2,1]  A[2,2]       A[1,0]=a[3]  A[1,1]=a[4] ...

The Kernel

__kernel void matmul_naive(
    __global const float* A,
    __global const float* B,
    __global       float* C,
    int N)
{
    int row = get_global_id(0);   // which output row this work item owns
    int col = get_global_id(1);   // which output column
 
    if (row >= N || col >= N) return;
 
    // Dot product of A[row, *] and B[*, col]
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
        sum += A[row * N + k] * B[k * N + col];
 
    C[row * N + col] = sum;
}

Each work item computes exactly one output element. With a 512×512 matrix, we launch 512×512 = 262,144 work items.

                Work Item Grid (N×N = 512×512)
                ┌─────────────────────────────────────────┐
row=0 │ WI(0,0) WI(0,1) WI(0,2) ... WI(0,511) │ → computes C[0,0..511]
row=1 │ WI(1,0) WI(1,1) WI(1,2) ... WI(1,511) │ → computes C[1,0..511]
  ... │                                         │
row=511│ WI(511,0)               ... WI(511,511)│ → computes C[511,0..511]
       └─────────────────────────────────────────┘

The 2D Launch

unsafe
{
    nuint* gs = stackalloc nuint[] { (nuint)N, (nuint)N };       // {512, 512}
    nuint* ls = stackalloc nuint[] { (nuint)LocalN, (nuint)LocalN }; // {16, 16}
    cl.EnqueueNdrangeKernel(queue, kernel, 2u, (nuint*)null, gs, ls,
                            0u, (nint*)null, out nint _);
}

LocalN = 16 means each work group is a 16×16 tile of 256 work items.

Performance Measurement

// GFLOP/s: each output element requires N multiply-adds = 2*N³ total operations
double gflops = 2.0 * N * N * N / 1e9;
 
Console.WriteLine($"  CPU: {swCpu.ElapsedMilliseconds} ms  ({gflops / swCpu.Elapsed.TotalSeconds:F1} GFLOP/s)");
Console.WriteLine($"  GPU: {swGpu.ElapsedMilliseconds} ms  ({gflops / swGpu.Elapsed.TotalSeconds:F1} GFLOP/s)");
Console.WriteLine($"  Speedup: {(double)swCpu.ElapsedMilliseconds / swGpu.ElapsedMilliseconds:F1}×");

Typical results (N=512):

	Time	GFLOP/s
CPU	~300 ms	~0.45
GPU (naive)	~10 ms	~13
Speedup	~30×	—

The Memory Access Problem

The naive kernel has a hidden problem: strided memory access.

Reading A[row, *]:  A[row*N+0], A[row*N+1], ...   ← consecutive → FAST
Reading B[*, col]:  B[0*N+col], B[1*N+col], ...   ← stride N   → SLOW

When consecutive work items read B[k*N+col] for different values of col:

Work item 0 reads B[k*N+0]
Work item 1 reads B[k*N+1]
Work item 2 reads B[k*N+2]

These addresses ARE consecutive in memory → GPU memory controller can coalesce them into one wide transaction. But within a single work item, reading successive values of B[k*N+col] for k=0,1,2,… accesses addresses N*4 bytes apart — causing cache misses on every access.

Sample 06 fixes this with tiling.

Sample 05 — Parallel Reduction

Repository: OpenCL-Samples on Github File: 05_Reduction/Program.cs
Goal: Compute the sum of 4 million floats using a two-phase GPU reduction.

New concepts: Local memory (__local), barrier(), tree reduction pattern.

Why Not Just Loop?

A naive serial sum on the GPU:

// Bad: only one work item does work, zero parallelism
if (gid == 0) {
    float sum = 0;
    for (int i = 0; i < n; i++) sum += input[i];
    output[0] = sum;
}

This gives zero speedup. We need all work items to participate simultaneously.

The Tree Reduction Idea

With N items and a tree-shaped computation, we can sum N values in log₂(N) steps:

In 3 steps we reduced 8 values to 1. For N=256 (local work group size), this takes only 8 barrier steps.

The Kernel

__kernel void reduce_sum(
    __global const float* input,      // large input array
    __global       float* partials,   // one result per work group
    __local        float* scratch,    // local memory scratch buffer
    int n)
{
    int gid       = get_global_id(0);
    int lid       = get_local_id(0);
    int groupSize = get_local_size(0);
 
    // Step 1: each work item loads one value into local memory
    scratch[lid] = (gid < n) ? input[gid] : 0.0f;
 
    // All items must finish writing before we start reading
    barrier(CLK_LOCAL_MEM_FENCE);
 
    // Step 2: tree reduction
    // stride = groupSize/2, groupSize/4, ..., 1
    for (int stride = groupSize / 2; stride > 0; stride >>= 1)
    {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
 
        // Wait for this round before starting the next
        barrier(CLK_LOCAL_MEM_FENCE);
    }
 
    // Step 3: work item 0 writes the group's sum
    if (lid == 0)
        partials[get_group_id(0)] = scratch[0];
}

The `__local` Memory Argument

Local memory requires a special SetKernelArg call — you pass the size (not data):

// Argument 2 is __local memory: pass null and the SIZE in bytes
// OpenCL allocates this many bytes of local memory per work group
nuint localMemBytes = (nuint)(LocalSize * sizeof(float));
unsafe { cl.SetKernelArg(kernel, 2u, localMemBytes, (void*)null); }

The void*null signals “allocate local memory”, not “pass a pointer”.

Two-Phase Strategy

// Phase 1: GPU reduces N elements → numGroups partial sums
int numGroups  = (N + LocalSize - 1) / LocalSize;   // = ceil(N / LocalSize)
int globalSize = numGroups * LocalSize;              // rounded up
 
unsafe
{
    nuint gs = (nuint)globalSize;
    nuint ls = (nuint)LocalSize;
    cl.EnqueueNdrangeKernel(queue, kernel, 1u, (nuint*)null, &gs, &ls,
                            0u, (nint*)null, out nint _);
    cl.Finish(queue);
}
 
// Phase 2: CPU sums the small partials array (numGroups is small)
float[] partials = new float[numGroups];
// ... download partials ...
double gpuSum = 0;
for (int i = 0; i < numGroups; i++) gpuSum += partials[i];

For N=4,194,304 and LocalSize=256: numGroups = 16,384 partial sums on GPU, then 16,384 additions on CPU (trivial).

`barrier()` — The Synchronisation Wall

graph LR
    subgraph "Without barrier (WRONG)"
        A1["Item 0 writes scratch[0]"] --> C1["Item 0 reads scratch[4]\n(might be old!)"]
        B1["Item 4 writes scratch[4]"] --> C1
    end

    subgraph "With barrier (CORRECT)"
        A2["Item 0 writes scratch[0]"]
        B2["Item 4 writes scratch[4]"]
        A2 --> BAR["barrier(CLK_LOCAL_MEM_FENCE)\nAll items stop here until all writes complete"]
        B2 --> BAR
        BAR --> C2["Item 0 reads scratch[4]\n(guaranteed fresh)"]
    end

barrier ensures memory consistency within a work group. It does not work across work groups — that’s why we need the CPU phase.

Sample 06 — Tiled Matrix Multiplication

Repository: OpenCL-Samples on Github File: 06_MatrixMultiply_Tiled/Program.cs
Goal: Dramatically accelerate matrix multiplication by loading tiles into local memory.

New concepts: Tiling strategy, collaborative data loading, memory access coalescing.

The Problem with Sample 04

In the naive kernel, each work item reads a full column of B from global memory:

B column access (N=512):
  B[0*512+col], B[1*512+col], ... B[511*512+col]
  ← stride of 2048 bytes between consecutive reads
  ← 512 global memory transactions PER OUTPUT ELEMENT
  ← N² output elements → N³ = 134M global reads of B

This is extremely bandwidth-hungry. We can do much better.

The Tiling Idea

The key insight: Instead of each work item independently fetching its own row/column from global memory, the whole work group collaboratively loads a tile of A and a tile of B into local memory once, then all work items compute their partial results from local memory.

The Tiled Kernel

#define TILE 16
 
__kernel void matmul_tiled(
    __global const float* A,
    __global const float* B,
    __global       float* C,
    int N)
{
    // Two TILE×TILE tiles in fast local memory
    __local float tileA[TILE][TILE];
    __local float tileB[TILE][TILE];
 
    int row      = get_global_id(0);
    int col      = get_global_id(1);
    int localRow = get_local_id(0);
    int localCol = get_local_id(1);
 
    float sum = 0.0f;
 
    int numTiles = (N + TILE - 1) / TILE;
    for (int t = 0; t < numTiles; t++)
    {
        // --- Collaborative load: each work item loads ONE element of each tile ---
        int aCol = t * TILE + localCol;
        int bRow = t * TILE + localRow;
 
        tileA[localRow][localCol] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        tileB[localRow][localCol] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
 
        // Wait: ALL items must finish loading before any starts computing
        barrier(CLK_LOCAL_MEM_FENCE);
 
        // --- Partial dot product using ONLY local memory ---
        for (int k = 0; k < TILE; k++)
            sum += tileA[localRow][k] * tileB[k][localCol];
 
        // Wait: ALL items must finish computing before next tile is loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }
 
    if (row < N && col < N)
        C[row * N + col] = sum;
}

Why Two Barriers?

The kernel has two barrier() calls per tile iteration:

1st barrier: after loading tiles
   → guarantees all 16×16 elements of tileA and tileB are in local memory
     before any item reads from them

2nd barrier: after computing partial sums
   → guarantees all items finish reading the tiles
     before any item overwrites them with the next tile's data

Without the 2nd barrier, a fast work item might start loading tile t+1 while a slow item is still reading tile t.

Performance Comparison

The sample runs both the naive (Sample 04) and tiled kernels on the same data:

// Run naive kernel
SetArgs(cl, kernNaive, bufA, bufB, bufCn, N);
Console.Write("Naive kernel...  ");
var swNaive = Stopwatch.StartNew();
RunKernel(cl, queue, kernNaive, N, TileSize);
swNaive.Stop();
 
// Run tiled kernel
SetArgs(cl, kernTiled, bufA, bufB, bufCt, N);
Console.Write("Tiled kernel...  ");
var swTiled = Stopwatch.StartNew();
RunKernel(cl, queue, kernTiled, N, TileSize);
swTiled.Stop();
 
Console.WriteLine($"  Speedup: {(double)swNaive.ElapsedMilliseconds / swTiled.ElapsedMilliseconds:F2}×");

Typical output (N=512, TILE=16):

Naive kernel...   85 ms
Tiled kernel...    8 ms
Speedup: 10.6×

Why the Speedup?

Metric	Naive	Tiled
Global reads of B per output element	N (= 512)	N / TILE (= 32)
Local memory reads per output element	0	N (= 512)
Global memory bandwidth used	~100%	~6%

For each of the N² output elements, the naive kernel reads N values of B from global memory (N³ total). With tiling, global reads of B are reduced by a factor of TILE — because each tile of B is read once from global memory and then reused TILE times from local memory.

Global reads of B:
  Naive:  N²  × N    = N³      reads    (each output element reads its column)
  Tiled:  N²  × N/TILE = N³/TILE reads   (each tile is shared by TILE items)

With TILE=16: a 16× reduction in global memory traffic for B.

Checking the `TileSize` Constant

Notice how the C# constant TileSize is injected into the kernel source using string interpolation:

const int TileSize = 16;
 
string KernelSource = $$"""
    #define TILE {{TileSize}}    // ← C# interpolation inserts "16"
    ...
    """;

The $$"""...""" syntax uses {{expr}} for interpolation (double braces because the kernel code itself uses {}). This ensures the C# value and the C #define are always in sync.

Deep Thought

Explorer

05_Samples_04_to_06

Chapter 5: Samples 04–06 — Real Algorithms and Performance

Sample 04 — Naive Matrix Multiplication

Mathematical Background

Row-Major Layout

The Kernel

The 2D Launch

Performance Measurement

The Memory Access Problem

Sample 05 — Parallel Reduction

Why Not Just Loop?

The Tree Reduction Idea

The Kernel

The `__local` Memory Argument

Two-Phase Strategy

`barrier()` — The Synchronisation Wall

Sample 06 — Tiled Matrix Multiplication

The Problem with Sample 04

The Tiling Idea

The Tiled Kernel

Why Two Barriers?

Performance Comparison

Why the Speedup?

Checking the `TileSize` Constant

Graph View

Table of Contents

Deep Thought

Explorer

05_Samples_04_to_06

Chapter 5: Samples 04–06 — Real Algorithms and Performance

Sample 04 — Naive Matrix Multiplication

Mathematical Background

Row-Major Layout

The Kernel

The 2D Launch

Performance Measurement

The Memory Access Problem

Sample 05 — Parallel Reduction

Why Not Just Loop?

The Tree Reduction Idea

The Kernel

The __local Memory Argument

Two-Phase Strategy

barrier() — The Synchronisation Wall

Sample 06 — Tiled Matrix Multiplication

The Problem with Sample 04

The Tiling Idea

The Tiled Kernel

Why Two Barriers?

Performance Comparison

Why the Speedup?

Checking the TileSize Constant

Graph View

Table of Contents

The `__local` Memory Argument

`barrier()` — The Synchronisation Wall

Checking the `TileSize` Constant