Chapter 5: Samples 04–06 — Real Algorithms and Performance
These three samples move from toy examples to real parallel algorithms. They introduce 2D problems, a non-trivial parallel pattern (reduction), and the most important GPU optimisation technique (local memory tiling).
Sample 04 — Naive Matrix Multiplication
Repository: OpenCL-Samples on Github
File: 04_MatrixMultiply/Program.cs
Goal: Implement C = A × B for 512×512 matrices using a 2D NDRange. Measure GPU vs CPU performance.
New concepts: 2D NDRange, one work item per output element, GFLOP/s, row-major memory layout.
Mathematical Background
Matrix multiplication C = A × B means:
C[row, col] = Σ (A[row, k] * B[k, col]) for k = 0..N-1
Each output element requires N multiply-add operations. For N=512: 512³ = 134 million operations per matrix multiply.
Row-Major Layout
2D matrices are stored as 1D arrays in row-major order:
Matrix (3×3): Array (9 elements):
A[0,0] A[0,1] A[0,2] → index: row*N + col
A[1,0] A[1,1] A[1,2] A[0,0]=a[0] A[0,1]=a[1] A[0,2]=a[2]
A[2,0] A[2,1] A[2,2] A[1,0]=a[3] A[1,1]=a[4] ...
The Kernel
__kernel void matmul_naive(
__global const float* A,
__global const float* B,
__global float* C,
int N)
{
int row = get_global_id(0); // which output row this work item owns
int col = get_global_id(1); // which output column
if (row >= N || col >= N) return;
// Dot product of A[row, *] and B[*, col]
float sum = 0.0f;
for (int k = 0; k < N; k++)
sum += A[row * N + k] * B[k * N + col];
C[row * N + col] = sum;
}Each work item computes exactly one output element. With a 512×512 matrix, we launch 512×512 = 262,144 work items.
Work Item Grid (N×N = 512×512)
┌─────────────────────────────────────────┐
row=0 │ WI(0,0) WI(0,1) WI(0,2) ... WI(0,511) │ → computes C[0,0..511]
row=1 │ WI(1,0) WI(1,1) WI(1,2) ... WI(1,511) │ → computes C[1,0..511]
... │ │
row=511│ WI(511,0) ... WI(511,511)│ → computes C[511,0..511]
└─────────────────────────────────────────┘
The 2D Launch
unsafe
{
nuint* gs = stackalloc nuint[] { (nuint)N, (nuint)N }; // {512, 512}
nuint* ls = stackalloc nuint[] { (nuint)LocalN, (nuint)LocalN }; // {16, 16}
cl.EnqueueNdrangeKernel(queue, kernel, 2u, (nuint*)null, gs, ls,
0u, (nint*)null, out nint _);
}LocalN = 16 means each work group is a 16×16 tile of 256 work items.
Performance Measurement
// GFLOP/s: each output element requires N multiply-adds = 2*N³ total operations
double gflops = 2.0 * N * N * N / 1e9;
Console.WriteLine($" CPU: {swCpu.ElapsedMilliseconds} ms ({gflops / swCpu.Elapsed.TotalSeconds:F1} GFLOP/s)");
Console.WriteLine($" GPU: {swGpu.ElapsedMilliseconds} ms ({gflops / swGpu.Elapsed.TotalSeconds:F1} GFLOP/s)");
Console.WriteLine($" Speedup: {(double)swCpu.ElapsedMilliseconds / swGpu.ElapsedMilliseconds:F1}×");Typical results (N=512):
| Time | GFLOP/s | |
|---|---|---|
| CPU | ~300 ms | ~0.45 |
| GPU (naive) | ~10 ms | ~13 |
| Speedup | ~30× | — |
The Memory Access Problem
The naive kernel has a hidden problem: strided memory access.
Reading A[row, *]: A[row*N+0], A[row*N+1], ... ← consecutive → FAST
Reading B[*, col]: B[0*N+col], B[1*N+col], ... ← stride N → SLOW
When consecutive work items read B[k*N+col] for different values of col:
- Work item 0 reads
B[k*N+0] - Work item 1 reads
B[k*N+1] - Work item 2 reads
B[k*N+2]
These addresses ARE consecutive in memory → GPU memory controller can coalesce them into one wide transaction. But within a single work item, reading successive values of B[k*N+col] for k=0,1,2,… accesses addresses N*4 bytes apart — causing cache misses on every access.
Sample 06 fixes this with tiling.
Sample 05 — Parallel Reduction
Repository: OpenCL-Samples on Github
File: 05_Reduction/Program.cs
Goal: Compute the sum of 4 million floats using a two-phase GPU reduction.
New concepts: Local memory (__local), barrier(), tree reduction pattern.
Why Not Just Loop?
A naive serial sum on the GPU:
// Bad: only one work item does work, zero parallelism
if (gid == 0) {
float sum = 0;
for (int i = 0; i < n; i++) sum += input[i];
output[0] = sum;
}This gives zero speedup. We need all work items to participate simultaneously.
The Tree Reduction Idea
With N items and a tree-shaped computation, we can sum N values in log₂(N) steps:
In 3 steps we reduced 8 values to 1. For N=256 (local work group size), this takes only 8 barrier steps.
The Kernel
__kernel void reduce_sum(
__global const float* input, // large input array
__global float* partials, // one result per work group
__local float* scratch, // local memory scratch buffer
int n)
{
int gid = get_global_id(0);
int lid = get_local_id(0);
int groupSize = get_local_size(0);
// Step 1: each work item loads one value into local memory
scratch[lid] = (gid < n) ? input[gid] : 0.0f;
// All items must finish writing before we start reading
barrier(CLK_LOCAL_MEM_FENCE);
// Step 2: tree reduction
// stride = groupSize/2, groupSize/4, ..., 1
for (int stride = groupSize / 2; stride > 0; stride >>= 1)
{
if (lid < stride)
scratch[lid] += scratch[lid + stride];
// Wait for this round before starting the next
barrier(CLK_LOCAL_MEM_FENCE);
}
// Step 3: work item 0 writes the group's sum
if (lid == 0)
partials[get_group_id(0)] = scratch[0];
}The __local Memory Argument
Local memory requires a special SetKernelArg call — you pass the size (not data):
// Argument 2 is __local memory: pass null and the SIZE in bytes
// OpenCL allocates this many bytes of local memory per work group
nuint localMemBytes = (nuint)(LocalSize * sizeof(float));
unsafe { cl.SetKernelArg(kernel, 2u, localMemBytes, (void*)null); }The void*null signals “allocate local memory”, not “pass a pointer”.
Two-Phase Strategy
// Phase 1: GPU reduces N elements → numGroups partial sums
int numGroups = (N + LocalSize - 1) / LocalSize; // = ceil(N / LocalSize)
int globalSize = numGroups * LocalSize; // rounded up
unsafe
{
nuint gs = (nuint)globalSize;
nuint ls = (nuint)LocalSize;
cl.EnqueueNdrangeKernel(queue, kernel, 1u, (nuint*)null, &gs, &ls,
0u, (nint*)null, out nint _);
cl.Finish(queue);
}
// Phase 2: CPU sums the small partials array (numGroups is small)
float[] partials = new float[numGroups];
// ... download partials ...
double gpuSum = 0;
for (int i = 0; i < numGroups; i++) gpuSum += partials[i];For N=4,194,304 and LocalSize=256: numGroups = 16,384 partial sums on GPU, then 16,384 additions on CPU (trivial).
barrier() — The Synchronisation Wall
graph LR subgraph "Without barrier (WRONG)" A1["Item 0 writes scratch[0]"] --> C1["Item 0 reads scratch[4]\n(might be old!)"] B1["Item 4 writes scratch[4]"] --> C1 end subgraph "With barrier (CORRECT)" A2["Item 0 writes scratch[0]"] B2["Item 4 writes scratch[4]"] A2 --> BAR["barrier(CLK_LOCAL_MEM_FENCE)\nAll items stop here until all writes complete"] B2 --> BAR BAR --> C2["Item 0 reads scratch[4]\n(guaranteed fresh)"] end
barrier ensures memory consistency within a work group. It does not work across work groups — that’s why we need the CPU phase.
Sample 06 — Tiled Matrix Multiplication
Repository: OpenCL-Samples on Github
File: 06_MatrixMultiply_Tiled/Program.cs
Goal: Dramatically accelerate matrix multiplication by loading tiles into local memory.
New concepts: Tiling strategy, collaborative data loading, memory access coalescing.
The Problem with Sample 04
In the naive kernel, each work item reads a full column of B from global memory:
B column access (N=512):
B[0*512+col], B[1*512+col], ... B[511*512+col]
← stride of 2048 bytes between consecutive reads
← 512 global memory transactions PER OUTPUT ELEMENT
← N² output elements → N³ = 134M global reads of B
This is extremely bandwidth-hungry. We can do much better.
The Tiling Idea
The key insight: Instead of each work item independently fetching its own row/column from global memory, the whole work group collaboratively loads a tile of A and a tile of B into local memory once, then all work items compute their partial results from local memory.
The Tiled Kernel
#define TILE 16
__kernel void matmul_tiled(
__global const float* A,
__global const float* B,
__global float* C,
int N)
{
// Two TILE×TILE tiles in fast local memory
__local float tileA[TILE][TILE];
__local float tileB[TILE][TILE];
int row = get_global_id(0);
int col = get_global_id(1);
int localRow = get_local_id(0);
int localCol = get_local_id(1);
float sum = 0.0f;
int numTiles = (N + TILE - 1) / TILE;
for (int t = 0; t < numTiles; t++)
{
// --- Collaborative load: each work item loads ONE element of each tile ---
int aCol = t * TILE + localCol;
int bRow = t * TILE + localRow;
tileA[localRow][localCol] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
tileB[localRow][localCol] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
// Wait: ALL items must finish loading before any starts computing
barrier(CLK_LOCAL_MEM_FENCE);
// --- Partial dot product using ONLY local memory ---
for (int k = 0; k < TILE; k++)
sum += tileA[localRow][k] * tileB[k][localCol];
// Wait: ALL items must finish computing before next tile is loaded
barrier(CLK_LOCAL_MEM_FENCE);
}
if (row < N && col < N)
C[row * N + col] = sum;
}Why Two Barriers?
The kernel has two barrier() calls per tile iteration:
1st barrier: after loading tiles
→ guarantees all 16×16 elements of tileA and tileB are in local memory
before any item reads from them
2nd barrier: after computing partial sums
→ guarantees all items finish reading the tiles
before any item overwrites them with the next tile's data
Without the 2nd barrier, a fast work item might start loading tile t+1 while a slow item is still reading tile t.
Performance Comparison
The sample runs both the naive (Sample 04) and tiled kernels on the same data:
// Run naive kernel
SetArgs(cl, kernNaive, bufA, bufB, bufCn, N);
Console.Write("Naive kernel... ");
var swNaive = Stopwatch.StartNew();
RunKernel(cl, queue, kernNaive, N, TileSize);
swNaive.Stop();
// Run tiled kernel
SetArgs(cl, kernTiled, bufA, bufB, bufCt, N);
Console.Write("Tiled kernel... ");
var swTiled = Stopwatch.StartNew();
RunKernel(cl, queue, kernTiled, N, TileSize);
swTiled.Stop();
Console.WriteLine($" Speedup: {(double)swNaive.ElapsedMilliseconds / swTiled.ElapsedMilliseconds:F2}×");Typical output (N=512, TILE=16):
Naive kernel... 85 ms
Tiled kernel... 8 ms
Speedup: 10.6×
Why the Speedup?
| Metric | Naive | Tiled |
|---|---|---|
| Global reads of B per output element | N (= 512) | N / TILE (= 32) |
| Local memory reads per output element | 0 | N (= 512) |
| Global memory bandwidth used | ~100% | ~6% |
For each of the N² output elements, the naive kernel reads N values of B from global memory (N³ total). With tiling, global reads of B are reduced by a factor of TILE — because each tile of B is read once from global memory and then reused TILE times from local memory.
Global reads of B:
Naive: N² × N = N³ reads (each output element reads its column)
Tiled: N² × N/TILE = N³/TILE reads (each tile is shared by TILE items)
With TILE=16: a 16× reduction in global memory traffic for B.
Checking the TileSize Constant
Notice how the C# constant TileSize is injected into the kernel source using string interpolation:
const int TileSize = 16;
string KernelSource = $$"""
#define TILE {{TileSize}} // ← C# interpolation inserts "16"
...
""";The $$"""...""" syntax uses {{expr}} for interpolation (double braces because the kernel code itself uses {}). This ensures the C# value and the C #define are always in sync.