Chapter 6: Samples 07–08 — Real-World Application

The Mandelbrot set is a visually beautiful problem that maps perfectly to GPGPU: every pixel is independent of every other pixel, and each requires substantial computation. These two samples show GPU computing in action on a problem you can see and interact with.


The Mandelbrot Set

The Mandelbrot set is defined in the complex plane (real + imaginary numbers). For each point c = (cr + i·ci), we repeatedly apply the map:

z₀ = 0
zₙ₊₁ = zₙ² + c

If |z| (the magnitude) stays ≤ 2 forever, the point belongs to the set (coloured black).
If |z| exceeds 2 within MAX_ITER steps, the point escapes — coloured by how quickly it escaped.

In practice, z² = (zr + i·zi)² expands to:

new_zr = zr² - zi² + cr
new_zi = 2·zr·zi + ci

And |z| > 2 is equivalent to zr² + zi² > 4 (avoids computing a square root).

for (int iter = 0; iter < maxIter; iter++) {
    if (zr*zr + zi*zi >= 4.0f) break;
    float newZr = zr*zr - zi*zi + cr;
    zi = 2.0f * zr * zi + ci;
    zr = newZr;
}

This is embarrassingly parallel: every pixel computation is completely independent. No communication between pixels is needed.


Sample 07 — Mandelbrot Image

Repository: OpenCL-Samples on Github File: 07_Mandelbrot/Program.cs
Goal: Compute a 1920×1080 Mandelbrot image on the GPU, compare with CPU, save as PPM.

New concepts: Embarrassingly parallel problem, 2D image output, PPM file format, GPU vs CPU timing on a real workload.

The Kernel

__kernel void mandelbrot(
    __global int* output,   // iteration count per pixel
    int width, int height,
    float x0,               // real part of top-left corner
    float y0,               // imaginary part of top-left corner
    float scale,            // size of one pixel in the complex plane
    int maxIter)
{
    int px = get_global_id(0);   // pixel column (x)
    int py = get_global_id(1);   // pixel row    (y)
 
    if (px >= width || py >= height) return;
 
    // Map pixel (px, py) → complex number c = cr + i·ci
    float cr = x0 + px * scale;
    float ci = y0 + py * scale;
 
    // Iterate z = z² + c, starting from z = 0
    float zr = 0.0f, zi = 0.0f;
    int iter = 0;
 
    while (zr * zr + zi * zi < 4.0f && iter < maxIter)
    {
        float newZr = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = newZr;
        iter++;
    }
 
    // Store escape count (or maxIter if point is in the set)
    output[py * width + px] = iter;
}

Each work item is one pixel. It maps its (px, py) pixel coordinate to a complex number using the view parameters x0, y0, scale, then runs the Mandelbrot iteration and writes the escape count.

The 2D Launch

const int Width  = 1920;
const int Height = 1080;
const int LocalTile = 16;
 
// Round up to multiple of LocalTile in both dimensions
int gsW = ((Width  + LocalTile - 1) / LocalTile) * LocalTile;  // = 1920
int gsH = ((Height + LocalTile - 1) / LocalTile) * LocalTile;  // = 1088
 
unsafe
{
    nuint* gs = stackalloc nuint[] { (nuint)gsW, (nuint)gsH };
    nuint* ls = stackalloc nuint[] { (nuint)LocalTile, (nuint)LocalTile }; // 16×16
    cl.EnqueueNdrangeKernel(queue, kernel, 2u, (nuint*)null, gs, ls,
                            0u, (nint*)null, out nint _);
    cl.Finish(queue);
}

Total work items: 1920 × 1088 = 2,088,960. Each computes one pixel with up to 1000 iterations.

Colouring

After the kernel runs, each pixel has an iteration count. We convert this to RGB:

for (int py = 0; py < Height; py++)
    for (int px = 0; px < Width; px++)
    {
        int iter = gpuOutput[py * Width + px];
 
        if (iter == MaxIter)
        {
            // Point is IN the set → black
            bw.Write((byte)0); bw.Write((byte)0); bw.Write((byte)0);
        }
        else
        {
            // Smooth colouring: polynomial colour mapping
            float t = (float)iter / MaxIter;
            bw.Write((byte)(9    * (1 - t) * t * t * t * 255));  // Red
            bw.Write((byte)(15   * (1 - t) * (1 - t) * t * t * 255));  // Green
            bw.Write((byte)(8.5f * (1 - t) * (1 - t) * (1 - t) * t * 255));  // Blue
        }
    }

The colour formula is a polynomial that creates a blue-purple gradient for points just outside the set, fading to black at the boundary.

The PPM File Format

PPM (Portable PixMap) is the simplest possible image format — no compression, no library needed:

Header (plain text):
  P6\n1920 1080\n255\n
Body (binary):
  R G B  R G B  R G B  ...  (3 bytes per pixel, row by row)

Open with Preview (macOS), GIMP, IrfanView, or any image viewer.

Typical Performance (1920×1080, MaxIter=1000)

Time
CPU (sequential)~8,000 ms
GPU~150 ms
Speedup~53×

The speedup is larger than in the matrix multiply samples because:

  • All 2M pixels run completely in parallel (no shared data, no synchronisation needed)
  • The computation per pixel (up to 1000 iterations) is substantial — the GPU can hide memory latency by switching between work items

Sample 08 — Interactive Mandelbrot Explorer

Repository: OpenCL-Samples on Github File: 08_Mandelbrot_Interactive/Program.cs
Goal: GPU-powered real-time Mandelbrot viewer with pan, zoom, and iteration control.

New concepts: Dynamic kernel arguments (per-frame), dirty flag optimisation, Raylib for display, double→float precision limits.

Controls:

InputAction
Mouse wheelZoom toward cursor
Left button dragPan
RReset view
+ / -Increase / decrease max iterations
EscQuit

Architecture

flowchart LR
    A["Raylib\nPoll input"] --> B{dirty?}
    B -- yes --> C["Update\nkernel args\n(cx, cy, scale, maxIter)"]
    C --> D["EnqueueNDRange\n1280×720 items"]
    D --> E["Finish\n(wait for GPU)"]
    E --> F["EnqueueReadBuffer\n(download result)"]
    F --> G["Color mapping\n(CPU, palette lookup)"]
    G --> H["UpdateTexture\n(upload RGBA to GPU)"]
    H --> I["DrawTexture\n+ HUD text"]
    I --> A
    B -- no --> I

The dirty flag is key: the GPU only re-computes when the view actually changes. At 60 FPS, most frames are identical (user is not moving the mouse) — those frames just draw the cached texture.

Dynamic Kernel Arguments

In the interactive version, the kernel parameters cx, cy, scale, and maxIter change every frame the view moves. We update them cheaply without re-allocating any buffers:

// Per-frame: update 4 scalar arguments only when view changed
if (dirty)
{
    float cx = (float)viewX;   // ← host tracks position in double precision
    float cy = (float)viewY;   // ← but kernel receives float
    float sc = (float)scale;
    int   mi = maxIter;
 
    cl.SetKernelArg(kernel, 3u, (nuint)sizeof(float), ref cx);
    cl.SetKernelArg(kernel, 4u, (nuint)sizeof(float), ref cy);
    cl.SetKernelArg(kernel, 5u, (nuint)sizeof(float), ref sc);
    cl.SetKernelArg(kernel, 6u, (nuint)sizeof(int),   ref mi);
 
    // Launch kernel and download result
    // ...
 
    dirty = false;
}

SetKernelArg is cheap — it just writes 4 bytes into OpenCL’s argument table. No memory allocation, no kernel recompile.

Static vs Dynamic Arguments

// Set ONCE at startup (never change)
cl.SetKernelArg(kernel, 0u, (nuint)IntPtr.Size, ref bufOut);  // output buffer
int kw = WinW, kh = WinH;
cl.SetKernelArg(kernel, 1u, (nuint)sizeof(int), ref kw);      // width
cl.SetKernelArg(kernel, 2u, (nuint)sizeof(int), ref kh);      // height
 
// Set EACH FRAME (change with every zoom/pan)
// cl.SetKernelArg(kernel, 3, cx)    // complex centre X
// cl.SetKernelArg(kernel, 4, cy)    // complex centre Y
// cl.SetKernelArg(kernel, 5, scale) // units per pixel
// cl.SetKernelArg(kernel, 6, mi)    // max iterations

Zoom-toward-cursor Mathematics

When the user scrolls, we want the point under the cursor to stay fixed on screen:

float wheel = Raylib.GetMouseWheelMove();
if (wheel != 0)
{
    Vector2 mouse = Raylib.GetMousePosition();
 
    // Find the complex coordinate of the mouse position
    double mouseRe = viewX + (mouse.X - WinW * 0.5) * scale;
    double mouseIm = viewY - (mouse.Y - WinH * 0.5) * scale;
 
    // Zoom
    double factor = wheel > 0 ? 0.8 : 1.25;  // scroll up = zoom in (smaller scale)
    scale *= factor;
 
    // Shift the centre so mouseRe/mouseIm maps back to the same screen position
    viewX = mouseRe - (mouse.X - WinW * 0.5) * scale;
    viewY = mouseIm + (mouse.Y - WinH * 0.5) * scale;
 
    dirty = true;
}

Colour Palette

Instead of computing RGB from the iteration count per-pixel (as in Sample 07), Sample 08 pre-computes a 1024-entry palette at startup:

static byte[] BuildPalette(int size)
{
    byte[] pal = new byte[size * 4];
    pal[3] = 255;  // index 0 = black (in-set), alpha = 255
 
    for (int i = 1; i < size; i++)
    {
        double t = (double)i / size;
        // Three sine waves, 120° apart → smooth cycling RGB colours
        pal[i * 4    ] = (byte)(Math.Sin(Math.PI * 2 * t            ) * 127 + 128);
        pal[i * 4 + 1] = (byte)(Math.Sin(Math.PI * 2 * t + 2.094395) * 127 + 128);
        pal[i * 4 + 2] = (byte)(Math.Sin(Math.PI * 2 * t + 4.188790) * 127 + 128);
        pal[i * 4 + 3] = 255;
    }
    return pal;
}

During rendering, a palette lookup replaces the per-pixel polynomial calculation:

for (int i = 0; i < total; i++)
{
    int iter = gpuOut[i];
    int pi   = (iter == maxIter) ? 0 : (iter % (PaletteSize - 1)) + 1;
    // palette[pi] → RGBA colour for this pixel
    pixels[i * 4    ] = palette[pi * 4    ];
    pixels[i * 4 + 1] = palette[pi * 4 + 1];
    pixels[i * 4 + 2] = palette[pi * 4 + 2];
    pixels[i * 4 + 3] = 255;
}

Floating-Point Precision Limit

The kernel uses float (32-bit, ~7 significant decimal digits). When zooming deeply:

Zoom ~10³:   fine, plenty of precision
Zoom ~10⁶:   starts to look blocky (hit the float precision limit)
Zoom ~10⁹:   would need double precision

The host tracks the view in double (64-bit), but casts to float before passing to the kernel:

float cx = (float)viewX;   // possible precision loss here
float cy = (float)viewY;
float sc = (float)scale;

The program detects the limit and warns the user:

double zoomLevel = 3.5 / (scale * WinW);
if (zoomLevel > 5e5)
    Raylib.DrawText("Float precision limit – image may appear blocky",
                    10, 32, 16, Color.Yellow);

True deep zoom would require double precision throughout the kernel — possible with the cl_khr_fp64 extension on supporting hardware.


Course Summary

You have now seen the complete arc of GPU programming with OpenCL:

graph TD
    S01["Sample 01<br>Discover hardware<br>Platforms, devices, properties"] --> S02
    S02["Sample 02<br>First GPU computation<br>Context, queue, buffer, kernel, NDRange"] --> S03
    S03["Sample 03<br>Understand the thread model<br>Work items, work groups, IDs"] --> S04
    S04["Sample 04<br>Real algorithm: matrix multiply<br>2D NDRange, performance measurement"] --> S05
    S05["Sample 05<br>Non-trivial pattern: reduction<br>Local memory, barriers, tree algorithm"] --> S06
    S06["Sample 06<br>Memory optimisation: tiling<br>Local memory reuse, bandwidth reduction"] --> S07
    S07["Sample 07<br>Real-world application<br>Embarrassingly parallel, image output"] --> S08
    S08["Sample 08<br>Interactive real-time<br>Dynamic arguments, game loop, Raylib"]

    style S01 fill:#E3F2FD
    style S02 fill:#BBDEFB
    style S03 fill:#90CAF9
    style S04 fill:#64B5F6
    style S05 fill:#42A5F5
    style S06 fill:#2196F3,color:#fff
    style S07 fill:#1E88E5,color:#fff
    style S08 fill:#1565C0,color:#fff

Key Takeaways

ConceptLearned in
OpenCL platform / device hierarchySample 01
Full host-device pipelineSample 02
Work items, work groups, NDRangeSample 03
2D NDRange, parallel algorithmsSample 04
Local memory, barriers, reductionSample 05
Memory access optimisation, tilingSample 06
Embarrassingly parallel problemsSample 07
Interactive / dynamic GPU usageSample 08

What Comes Next (beyond this course)

  • Double precision: cl_khr_fp64 extension for high-precision computation
  • Images and samplers: OpenCL’s built-in image type with hardware interpolation
  • Atomic operations: thread-safe accumulation without barriers
  • OpenCL 2.0 and 3.0: unified memory (SVM), pipes, sub-groups
  • CUDA: NVIDIA-specific alternative with better tooling and libraries
  • SYCL / HIP: modern C++ abstractions over GPU hardware
  • Vulkan Compute / Metal Compute: low-level compute APIs in graphics engines