Chapter 3: The OpenCL Programming Model

This chapter explains the abstract model that OpenCL uses. Once you understand the model, all the API calls in the samples will make sense.


3.1 The Platform Model

OpenCL organises hardware into a hierarchy:

graph TD
    Host["Host (your C# program on the CPU)"]
    Host --> P1["Platform<br>(e.g. NVIDIA OpenCL Runtime)"]
    Host --> P2["Platform<br>(e.g. Intel OpenCL Runtime)"]
    P1 --> D1["Device: NVIDIA RTX 4070<br>(GPU)"]
    P2 --> D2["Device: Intel Iris Xe<br>(GPU)"]
    P2 --> D3["Device: Intel Core i9<br>(CPU)"]

    style Host fill:#1565C0,color:#fff
    style P1 fill:#2196F3,color:#fff
    style P2 fill:#2196F3,color:#fff
    style D1 fill:#43A047,color:#fff
    style D2 fill:#43A047,color:#fff
    style D3 fill:#E65100,color:#fff
ConceptMeaningC# type
HostYour C# program running on the CPU
PlatformA vendor’s OpenCL runtime (driver)nint (handle)
DeviceA physical processor (GPU, CPU, accelerator)nint (handle)

Discovering platforms and devices is what Sample 01 does — it walks this hierarchy and prints every device it finds.


3.2 The Runtime Objects

Before you can run a kernel you need to create several runtime objects:

graph LR
    Platform["Platform"] --> Context
    Device["Device"] --> Context
    Context["Context"] --> Queue["Command Queue"]
    Context --> Buffer["Buffer (GPU memory)"]
    Context --> Program["Program (compiled kernel)"]
    Program --> Kernel["Kernel (callable function)"]

    style Context fill:#7B1FA2,color:#fff
    style Queue fill:#1565C0,color:#fff
    style Buffer fill:#E65100,color:#fff
    style Program fill:#2E7D32,color:#fff
    style Kernel fill:#2E7D32,color:#fff
ObjectCreated byPurpose
ContextCreateContextShared environment linking host and one or more devices. All buffers and programs live inside a context.
Command QueueCreateCommandQueueAn ordered channel for sending commands (kernel launches, buffer copies) to one device.
BufferCreateBufferA region of memory on the device (GPU VRAM). Like a GPU-side array.
ProgramCreateProgramWithSource + BuildProgramThe compiled GPU code (can contain multiple kernels).
KernelCreateKernelA single entry-point function from a program, with bound arguments.

3.3 The Execution Model

Work Items

When you launch a kernel, OpenCL creates a large number of work items — one for each element of data you want to process. Think of a work item as one thread.

Each work item:

  • Runs the same kernel function
  • Has a unique global ID — its index in the launch grid
  • Has access to its own private registers
  • Runs independently from all other work items (unless you use barriers)

Work Groups

Work items are grouped into work groups. Work items inside the same group can:

  • Share local memory (fast on-chip SRAM)
  • Synchronise with each other using barriers

Work items in different groups cannot communicate during kernel execution.

NDRange

The NDRange (N-Dimensional Range) defines the shape of the launch grid. It can be 1D, 2D, or 3D.

Built-in ID functions (available inside every kernel):

FunctionReturnsExample (2D, row=2, col=5, group 4×3)
get_global_id(dim)Unique index across all work itemsget_global_id(0) → 2
get_local_id(dim)Index within the current work groupget_local_id(0) → 2 % 4 = 2
get_group_id(dim)Index of the current work groupget_group_id(0) → 2 / 4 = 0
get_global_size(dim)Total work items in this dimension
get_local_size(dim)Work items per group in this dimension
get_num_groups(dim)Number of work groups

3.4 The Memory Model

OpenCL defines four memory spaces, each with different speed and visibility:

MemoryKeywordScopeSpeedSize
Global__globalAll work itemsSlow (hundreds of cycles)GBs
Constant__constantAll work items (read-only)Fast (cached)~64 KB
Local__localOne work groupVery fast (~10 TB/s)~32–64 KB
Private__private (default)One work itemFastest (registers)~few KB

Key insight: Reading from global memory is the main performance bottleneck in GPU programming. Moving data into local memory and reusing it is the primary optimisation technique (demonstrated in Sample 06).


3.5 OpenCL C — The Kernel Language

Kernels are written in OpenCL C, which is a subset of C99 with GPU-specific extensions.

What you get in OpenCL C

// Address space qualifiers on pointers
__kernel void my_kernel(
    __global const float* input,   // reads from global memory
    __global       float* output,  // writes to global memory
    __local        float* scratch, // shared within work group
    int n)                         // scalar passed by value
{
    int gid = get_global_id(0);    // built-in: this item's global index
    int lid = get_local_id(0);     // built-in: this item's local index
    int grp = get_group_id(0);     // built-in: this item's group index
 
    // Load into local memory
    scratch[lid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);  // wait: all items must finish writing
 
    // Use local memory (fast)
    output[gid] = scratch[lid] * 2.0f;
}

What’s missing compared to C#

C# / C++ featureAvailable in OpenCL C?
Classes, structs❌ No structs with methods, no classes
Dynamic memory (new, malloc)❌ No heap allocation in kernels
Recursion❌ Not supported
Function pointers❌ Not supported
STL / standard library❌ No
printf for debugging✅ Available as an extension
Vector types (float4, int2)✅ Yes, built-in
Math functions (sqrt, sin, …)✅ Yes, built-in
Bit-shift operators✅ Yes

The __kernel qualifier

Only functions marked __kernel can be launched from the host. They must return void. All other functions are device-side helper functions.

Barriers

barrier(CLK_LOCAL_MEM_FENCE) is a synchronisation point within a work group:

  • All work items in the group must reach the barrier before any can continue
  • Ensures writes to local memory by some items are visible to all items
  • Required whenever one item reads data written by another item in the same group

3.6 The Host-Side Pipeline

Every OpenCL program follows this sequence:

flowchart TD
    A["GetPlatformIDs<br>Find available OpenCL runtimes"] --> B
    B["GetDeviceIDs<br>Find GPU or CPU devices"] --> C
    C["CreateContext<br>Link host and device"] --> D
    D["CreateCommandQueue<br>Create channel to device"] --> E
    E["CreateProgramWithSource<br>Compile kernel from string"] --> F
    F["BuildProgram<br>JIT-compile for this device's hardware"] --> G
    G["CreateKernel<br>Get a callable handle to the kernel function"] --> H
    H["CreateBuffer<br>Allocate GPU-side memory"] --> I
    I["CopyHostPtr / EnqueueWriteBuffer<br>Upload data to GPU"] --> J
    J["SetKernelArg<br>Bind buffers and scalars as kernel parameters"] --> K
    K["EnqueueNDRangeKernel<br>Launch N work items"] --> L
    L["Finish<br>Wait for GPU to complete"] --> M
    M["EnqueueReadBuffer<br>Download result to CPU"] --> N
    N["Release*<br>Free all OpenCL objects"]

    style A fill:#E3F2FD
    style B fill:#E3F2FD
    style C fill:#BBDEFB
    style D fill:#BBDEFB
    style E fill:#C8E6C9
    style F fill:#C8E6C9
    style G fill:#C8E6C9
    style H fill:#FFE0B2
    style I fill:#FFE0B2
    style J fill:#FFF9C4
    style K fill:#EF9A9A,color:#fff
    style L fill:#EF9A9A,color:#fff
    style M fill:#FFE0B2
    style N fill:#E0E0E0

This pipeline is implemented explicitly in every sample. Once you have seen it in Sample 02, the later samples only show the parts that change.


3.7 Key Rules to Remember

  1. Global size must be a multiple of local size. Round up if needed:

    int globalSize = ((N + localSize - 1) / localSize) * localSize;

    And guard against out-of-bounds in the kernel:

    if (gid >= n) return;
  2. Kernels are compiled at runtime from C strings. Compilation errors appear as runtime exceptions, not build errors.

  3. Buffer handles (nint) are 8 bytes on 64-bit systems. Pass IntPtr.Size as the argument size to SetKernelArg.

  4. cl.Finish(queue) blocks the host until all commands in the queue complete. This is how we ensure the GPU is done before we read results back.

  5. Every object has a Release* function. OpenCL uses reference counting. Always release objects you create (context, queue, buffers, kernel, program).