Chapter 6: Samples 07–08 — Real-World Application
The Mandelbrot set is a visually beautiful problem that maps perfectly to GPGPU: every pixel is independent of every other pixel, and each requires substantial computation. These two samples show GPU computing in action on a problem you can see and interact with.
The Mandelbrot Set
The Mandelbrot set is defined in the complex plane (real + imaginary numbers). For each point c = (cr + i·ci), we repeatedly apply the map:
z₀ = 0
zₙ₊₁ = zₙ² + c
If |z| (the magnitude) stays ≤ 2 forever, the point belongs to the set (coloured black).
If |z| exceeds 2 within MAX_ITER steps, the point escapes — coloured by how quickly it escaped.
In practice, z² = (zr + i·zi)² expands to:
new_zr = zr² - zi² + cr
new_zi = 2·zr·zi + ci
And |z| > 2 is equivalent to zr² + zi² > 4 (avoids computing a square root).
for (int iter = 0; iter < maxIter; iter++) {
if (zr*zr + zi*zi >= 4.0f) break;
float newZr = zr*zr - zi*zi + cr;
zi = 2.0f * zr * zi + ci;
zr = newZr;
}
This is embarrassingly parallel: every pixel computation is completely independent. No communication between pixels is needed.
Sample 07 — Mandelbrot Image
Repository: OpenCL-Samples on Github
File: 07_Mandelbrot/Program.cs
Goal: Compute a 1920×1080 Mandelbrot image on the GPU, compare with CPU, save as PPM.
New concepts: Embarrassingly parallel problem, 2D image output, PPM file format, GPU vs CPU timing on a real workload.
The Kernel
__kernel void mandelbrot(
__global int* output, // iteration count per pixel
int width, int height,
float x0, // real part of top-left corner
float y0, // imaginary part of top-left corner
float scale, // size of one pixel in the complex plane
int maxIter)
{
int px = get_global_id(0); // pixel column (x)
int py = get_global_id(1); // pixel row (y)
if (px >= width || py >= height) return;
// Map pixel (px, py) → complex number c = cr + i·ci
float cr = x0 + px * scale;
float ci = y0 + py * scale;
// Iterate z = z² + c, starting from z = 0
float zr = 0.0f, zi = 0.0f;
int iter = 0;
while (zr * zr + zi * zi < 4.0f && iter < maxIter)
{
float newZr = zr * zr - zi * zi + cr;
zi = 2.0f * zr * zi + ci;
zr = newZr;
iter++;
}
// Store escape count (or maxIter if point is in the set)
output[py * width + px] = iter;
}Each work item is one pixel. It maps its (px, py) pixel coordinate to a complex number using the view parameters x0, y0, scale, then runs the Mandelbrot iteration and writes the escape count.
The 2D Launch
const int Width = 1920;
const int Height = 1080;
const int LocalTile = 16;
// Round up to multiple of LocalTile in both dimensions
int gsW = ((Width + LocalTile - 1) / LocalTile) * LocalTile; // = 1920
int gsH = ((Height + LocalTile - 1) / LocalTile) * LocalTile; // = 1088
unsafe
{
nuint* gs = stackalloc nuint[] { (nuint)gsW, (nuint)gsH };
nuint* ls = stackalloc nuint[] { (nuint)LocalTile, (nuint)LocalTile }; // 16×16
cl.EnqueueNdrangeKernel(queue, kernel, 2u, (nuint*)null, gs, ls,
0u, (nint*)null, out nint _);
cl.Finish(queue);
}Total work items: 1920 × 1088 = 2,088,960. Each computes one pixel with up to 1000 iterations.
Colouring
After the kernel runs, each pixel has an iteration count. We convert this to RGB:
for (int py = 0; py < Height; py++)
for (int px = 0; px < Width; px++)
{
int iter = gpuOutput[py * Width + px];
if (iter == MaxIter)
{
// Point is IN the set → black
bw.Write((byte)0); bw.Write((byte)0); bw.Write((byte)0);
}
else
{
// Smooth colouring: polynomial colour mapping
float t = (float)iter / MaxIter;
bw.Write((byte)(9 * (1 - t) * t * t * t * 255)); // Red
bw.Write((byte)(15 * (1 - t) * (1 - t) * t * t * 255)); // Green
bw.Write((byte)(8.5f * (1 - t) * (1 - t) * (1 - t) * t * 255)); // Blue
}
}The colour formula is a polynomial that creates a blue-purple gradient for points just outside the set, fading to black at the boundary.
The PPM File Format
PPM (Portable PixMap) is the simplest possible image format — no compression, no library needed:
Header (plain text):
P6\n1920 1080\n255\n
Body (binary):
R G B R G B R G B ... (3 bytes per pixel, row by row)
Open with Preview (macOS), GIMP, IrfanView, or any image viewer.
Typical Performance (1920×1080, MaxIter=1000)
| Time | |
|---|---|
| CPU (sequential) | ~8,000 ms |
| GPU | ~150 ms |
| Speedup | ~53× |
The speedup is larger than in the matrix multiply samples because:
- All 2M pixels run completely in parallel (no shared data, no synchronisation needed)
- The computation per pixel (up to 1000 iterations) is substantial — the GPU can hide memory latency by switching between work items
Sample 08 — Interactive Mandelbrot Explorer
Repository: OpenCL-Samples on Github
File: 08_Mandelbrot_Interactive/Program.cs
Goal: GPU-powered real-time Mandelbrot viewer with pan, zoom, and iteration control.
New concepts: Dynamic kernel arguments (per-frame), dirty flag optimisation, Raylib for display, double→float precision limits.
Controls:
| Input | Action |
|---|---|
| Mouse wheel | Zoom toward cursor |
| Left button drag | Pan |
R | Reset view |
+ / - | Increase / decrease max iterations |
Esc | Quit |
Architecture
flowchart LR A["Raylib\nPoll input"] --> B{dirty?} B -- yes --> C["Update\nkernel args\n(cx, cy, scale, maxIter)"] C --> D["EnqueueNDRange\n1280×720 items"] D --> E["Finish\n(wait for GPU)"] E --> F["EnqueueReadBuffer\n(download result)"] F --> G["Color mapping\n(CPU, palette lookup)"] G --> H["UpdateTexture\n(upload RGBA to GPU)"] H --> I["DrawTexture\n+ HUD text"] I --> A B -- no --> I
The dirty flag is key: the GPU only re-computes when the view actually changes. At 60 FPS, most frames are identical (user is not moving the mouse) — those frames just draw the cached texture.
Dynamic Kernel Arguments
In the interactive version, the kernel parameters cx, cy, scale, and maxIter change every frame the view moves. We update them cheaply without re-allocating any buffers:
// Per-frame: update 4 scalar arguments only when view changed
if (dirty)
{
float cx = (float)viewX; // ← host tracks position in double precision
float cy = (float)viewY; // ← but kernel receives float
float sc = (float)scale;
int mi = maxIter;
cl.SetKernelArg(kernel, 3u, (nuint)sizeof(float), ref cx);
cl.SetKernelArg(kernel, 4u, (nuint)sizeof(float), ref cy);
cl.SetKernelArg(kernel, 5u, (nuint)sizeof(float), ref sc);
cl.SetKernelArg(kernel, 6u, (nuint)sizeof(int), ref mi);
// Launch kernel and download result
// ...
dirty = false;
}SetKernelArg is cheap — it just writes 4 bytes into OpenCL’s argument table. No memory allocation, no kernel recompile.
Static vs Dynamic Arguments
// Set ONCE at startup (never change)
cl.SetKernelArg(kernel, 0u, (nuint)IntPtr.Size, ref bufOut); // output buffer
int kw = WinW, kh = WinH;
cl.SetKernelArg(kernel, 1u, (nuint)sizeof(int), ref kw); // width
cl.SetKernelArg(kernel, 2u, (nuint)sizeof(int), ref kh); // height
// Set EACH FRAME (change with every zoom/pan)
// cl.SetKernelArg(kernel, 3, cx) // complex centre X
// cl.SetKernelArg(kernel, 4, cy) // complex centre Y
// cl.SetKernelArg(kernel, 5, scale) // units per pixel
// cl.SetKernelArg(kernel, 6, mi) // max iterationsZoom-toward-cursor Mathematics
When the user scrolls, we want the point under the cursor to stay fixed on screen:
float wheel = Raylib.GetMouseWheelMove();
if (wheel != 0)
{
Vector2 mouse = Raylib.GetMousePosition();
// Find the complex coordinate of the mouse position
double mouseRe = viewX + (mouse.X - WinW * 0.5) * scale;
double mouseIm = viewY - (mouse.Y - WinH * 0.5) * scale;
// Zoom
double factor = wheel > 0 ? 0.8 : 1.25; // scroll up = zoom in (smaller scale)
scale *= factor;
// Shift the centre so mouseRe/mouseIm maps back to the same screen position
viewX = mouseRe - (mouse.X - WinW * 0.5) * scale;
viewY = mouseIm + (mouse.Y - WinH * 0.5) * scale;
dirty = true;
}Colour Palette
Instead of computing RGB from the iteration count per-pixel (as in Sample 07), Sample 08 pre-computes a 1024-entry palette at startup:
static byte[] BuildPalette(int size)
{
byte[] pal = new byte[size * 4];
pal[3] = 255; // index 0 = black (in-set), alpha = 255
for (int i = 1; i < size; i++)
{
double t = (double)i / size;
// Three sine waves, 120° apart → smooth cycling RGB colours
pal[i * 4 ] = (byte)(Math.Sin(Math.PI * 2 * t ) * 127 + 128);
pal[i * 4 + 1] = (byte)(Math.Sin(Math.PI * 2 * t + 2.094395) * 127 + 128);
pal[i * 4 + 2] = (byte)(Math.Sin(Math.PI * 2 * t + 4.188790) * 127 + 128);
pal[i * 4 + 3] = 255;
}
return pal;
}During rendering, a palette lookup replaces the per-pixel polynomial calculation:
for (int i = 0; i < total; i++)
{
int iter = gpuOut[i];
int pi = (iter == maxIter) ? 0 : (iter % (PaletteSize - 1)) + 1;
// palette[pi] → RGBA colour for this pixel
pixels[i * 4 ] = palette[pi * 4 ];
pixels[i * 4 + 1] = palette[pi * 4 + 1];
pixels[i * 4 + 2] = palette[pi * 4 + 2];
pixels[i * 4 + 3] = 255;
}Floating-Point Precision Limit
The kernel uses float (32-bit, ~7 significant decimal digits). When zooming deeply:
Zoom ~10³: fine, plenty of precision
Zoom ~10⁶: starts to look blocky (hit the float precision limit)
Zoom ~10⁹: would need double precision
The host tracks the view in double (64-bit), but casts to float before passing to the kernel:
float cx = (float)viewX; // possible precision loss here
float cy = (float)viewY;
float sc = (float)scale;The program detects the limit and warns the user:
double zoomLevel = 3.5 / (scale * WinW);
if (zoomLevel > 5e5)
Raylib.DrawText("Float precision limit – image may appear blocky",
10, 32, 16, Color.Yellow);True deep zoom would require double precision throughout the kernel — possible with the cl_khr_fp64 extension on supporting hardware.
Course Summary
You have now seen the complete arc of GPU programming with OpenCL:
graph TD S01["Sample 01<br>Discover hardware<br>Platforms, devices, properties"] --> S02 S02["Sample 02<br>First GPU computation<br>Context, queue, buffer, kernel, NDRange"] --> S03 S03["Sample 03<br>Understand the thread model<br>Work items, work groups, IDs"] --> S04 S04["Sample 04<br>Real algorithm: matrix multiply<br>2D NDRange, performance measurement"] --> S05 S05["Sample 05<br>Non-trivial pattern: reduction<br>Local memory, barriers, tree algorithm"] --> S06 S06["Sample 06<br>Memory optimisation: tiling<br>Local memory reuse, bandwidth reduction"] --> S07 S07["Sample 07<br>Real-world application<br>Embarrassingly parallel, image output"] --> S08 S08["Sample 08<br>Interactive real-time<br>Dynamic arguments, game loop, Raylib"] style S01 fill:#E3F2FD style S02 fill:#BBDEFB style S03 fill:#90CAF9 style S04 fill:#64B5F6 style S05 fill:#42A5F5 style S06 fill:#2196F3,color:#fff style S07 fill:#1E88E5,color:#fff style S08 fill:#1565C0,color:#fff
Key Takeaways
| Concept | Learned in |
|---|---|
| OpenCL platform / device hierarchy | Sample 01 |
| Full host-device pipeline | Sample 02 |
| Work items, work groups, NDRange | Sample 03 |
| 2D NDRange, parallel algorithms | Sample 04 |
| Local memory, barriers, reduction | Sample 05 |
| Memory access optimisation, tiling | Sample 06 |
| Embarrassingly parallel problems | Sample 07 |
| Interactive / dynamic GPU usage | Sample 08 |
What Comes Next (beyond this course)
- Double precision:
cl_khr_fp64extension for high-precision computation - Images and samplers: OpenCL’s built-in image type with hardware interpolation
- Atomic operations: thread-safe accumulation without barriers
- OpenCL 2.0 and 3.0: unified memory (SVM), pipes, sub-groups
- CUDA: NVIDIA-specific alternative with better tooling and libraries
- SYCL / HIP: modern C++ abstractions over GPU hardware
- Vulkan Compute / Metal Compute: low-level compute APIs in graphics engines