Chapter 4: Samples 01–03 — First Steps

These three samples build the foundation. By the end you will be able to discover OpenCL hardware, execute your first GPU computation, and understand the work item model.


Sample 01 — Device Info

Repository: OpenCL-Samples on Github File: 01_DeviceInfo/Program.cs
Goal: Discover and print all OpenCL hardware on the machine. No computation.

New concepts: Platform, Device, opaque handles (nint), two-call query pattern, GetPlatformInfo, GetDeviceInfo.

What it does

The program walks the platform → device hierarchy and prints properties of every OpenCL device it finds. Output looks like:

Found 1 OpenCL platform(s):

==============================================================
Platform : Apple
  Vendor : Apple
  Version: OpenCL 1.2 (Dec 18 2024 19:56:59)
  Devices: 2
--------------------------------------------------------------
  Name             : Apple M3 Pro
  Type             : GPU
  Compute Units    : 18
  Max Clock        : 1000 MHz
  Global Memory    : 36864 MB
  Local Memory     : 32 KB
  Max Work-Group   : 256 work-items
  Driver Version   : 1.2 1.0
  Device Version   : OpenCL 1.2

Step 1: Loading the API

// CL.GetApi() loads the OpenCL shared library (OpenCL.dll / libOpenCL.so)
// and returns a managed wrapper object with one method per OpenCL function.
CL cl = CL.GetApi();

cl is your gateway to all OpenCL functions. Every sample starts with this line.

Step 2: Discovering Platforms (Two-Call Pattern)

// First call: count only (pass null for the array)
uint numPlatforms;
unsafe { cl.GetPlatformIDs(0u, (nint*)null, out numPlatforms); }
 
// Second call: fill the array
nint[] platforms = new nint[numPlatforms];
unsafe
{
    fixed (nint* p = platforms)
        cl.GetPlatformIDs(numPlatforms, p, (uint*)null);
}

This two-call pattern (first count, then fill) is used for every list-returning OpenCL function. The fixed block pins the C# array so the GC cannot move it while OpenCL writes into it.

Step 3: Querying Device Properties

uint  computeUnits = DeviceInfoU32(cl, device, DeviceInfo.MaxComputeUnits);
ulong globalMemB   = DeviceInfoU64(cl, device, DeviceInfo.GlobalMemSize);
nuint maxWGSize    = DeviceInfoNuint(cl, device, DeviceInfo.MaxWorkGroupSize);
 
// Type is a bit-field: CPU=2, GPU=4, Accelerator=8
ulong typeBits = DeviceInfoU64(cl, device, DeviceInfo.Type);
string typeStr = typeBits switch
{
    2 => "CPU",
    4 => "GPU",
    8 => "Accelerator",
    _ => $"Other (0x{typeBits:X})"
};

GetDeviceInfo returns raw bytes. For scalar values we use unsafe code to write directly into a typed local variable:

static uint DeviceInfoU32(CL cl, nint device, DeviceInfo info)
{
    uint v = 0;
    unsafe { cl.GetDeviceInfo(device, info, (nuint)sizeof(uint), &v, (nuint*)null); }
    return v;
}

&v takes the address of the local variable v. OpenCL writes 4 bytes to that address — exactly the right number of bytes for a uint.

Step 4: Reading Strings

For string properties (name, driver version), the two-call pattern is used again to find the size, then read:

static string DeviceString(CL cl, nint device, DeviceInfo info)
{
    nuint size;
    unsafe { cl.GetDeviceInfo(device, info, 0u, (void*)null, out size); }
    byte[] buf = new byte[(int)size];
    unsafe { fixed (byte* p = buf) cl.GetDeviceInfo(device, info, size, p, out _); }
    return Encoding.UTF8.GetString(buf).TrimEnd('\0');
}

OpenCL strings are null-terminated C strings (UTF-8). We read them as a byte array and strip the trailing \0.

Exercise

Run Sample 01 and note the values for your device:

  • How many compute units?
  • How much local memory?
  • What is the maximum work group size?

These numbers will matter in later samples.


Sample 02 — Vector Addition

Repository: OpenCL-Samples on Github File: 02_VectorAdd/Program.cs
Goal: Implement the simplest possible parallel computation: C[i] = A[i] + B[i] for 1 million elements.

New concepts: Kernel source, CreateContext, CreateCommandQueue, CreateBuffer, SetKernelArg, EnqueueNDRangeKernel, Finish, EnqueueReadBuffer, ReleaseMemObject.

The Kernel

The kernel is a string of OpenCL C code embedded in the C# file:

__kernel void vector_add(
    __global const float* a,   // input buffer A (read-only)
    __global const float* b,   // input buffer B (read-only)
    __global       float* c,   // output buffer C (write-only)
    int n)                     // total number of elements
{
    int i = get_global_id(0);  // unique index of THIS work item
 
    // Safety check: global size is rounded up, so last group may go out of bounds
    if (i < n)
        c[i] = a[i] + b[i];
}

Key observations:

  • __kernel marks it as a GPU entry point
  • __global means the pointer lives in GPU global memory
  • get_global_id(0) gives each work item a unique index along dimension 0
  • The if (i < n) guard handles the case where global size > n (rounding)

The Full Pipeline

sequenceDiagram
    participant H as Host (C#)
    participant GPU as GPU

    H->>H: 1. Prepare data: float[] a, b (1M elements)
    H->>GPU: 2. CreateBuffer + CopyHostPtr → upload a, b
    H->>H: 3. Compile kernel string
    H->>H: 4. SetKernelArg (buffers + n)
    H->>GPU: 5. EnqueueNDRangeKernel (1M work items)
    GPU->>GPU: each item: c[i] = a[i] + b[i]
    H->>GPU: 6. Finish() — wait for completion
    H->>H: 7. EnqueueReadBuffer — download result
    H->>H: 8. Verify: compare c[i] vs expected

Creating the Context and Queue

// Context: links host to the device
nint context;
unsafe
{
    nint dev = device;
    context = cl.CreateContext((nint*)null, 1u, &dev, null, (void*)null, (int*)null);
    //                          ^^^^^^^^^^^                                ^^^^^^^^^
    //                          null properties     no callback    no error pointer
}
 
// Command queue: the ordered channel for sending work to this device
nint queue = cl.CreateCommandQueue(context, device, CommandQueueProperties.None, out _);

Compiling the Kernel

static nint Compile(CL cl, nint context, nint device, string source)
{
    nint program;
    unsafe { program = cl.CreateProgramWithSource(context, 1u, new[] { source }, (nuint*)null, out _); }
 
    nint dev = device;
    int err;
    unsafe { err = cl.BuildProgram(program, 1u, &dev, (byte*)null, null, (void*)null); }
 
    if (err != 0)
    {
        // Fetch the compiler log to show what went wrong
        nuint logSize;
        unsafe { cl.GetProgramBuildInfo(program, device, ProgramBuildInfo.BuildLog,
                                        0u, (void*)null, out logSize); }
        byte[] log = new byte[(int)logSize];
        unsafe { fixed (byte* p = log)
                     cl.GetProgramBuildInfo(program, device, ProgramBuildInfo.BuildLog,
                                           logSize, p, (nuint*)null); }
        throw new Exception($"Kernel build failed:\n{Encoding.UTF8.GetString(log).TrimEnd('\0')}");
    }
    return program;
}

BuildProgram is like dotnet build — it compiles the OpenCL C source for the target device’s instruction set at runtime. If there’s a syntax error, the build log shows the error message.

Allocating and Uploading Buffers

nuint bufBytes = (nuint)(N * sizeof(float));
 
nint bufA, bufB, bufC;
unsafe
{
    fixed (float* pa = a)
        bufA = cl.CreateBuffer(context,
                               MemFlags.ReadOnly | MemFlags.CopyHostPtr,
                               bufBytes, pa, out _);
    //                                   ^^^
    //                     tells OpenCL to immediately copy from *pa into GPU memory
 
    fixed (float* pb = b)
        bufB = cl.CreateBuffer(context, MemFlags.ReadOnly | MemFlags.CopyHostPtr,
                               bufBytes, pb, out _);
 
    // Output buffer: just allocate, no initial data
    bufC = cl.CreateBuffer(context, MemFlags.WriteOnly, bufBytes, (void*)null, out _);
}

MemFlags.CopyHostPtr + fixed + pa causes OpenCL to copy the contents of array a into GPU memory as part of CreateBuffer. This is a convenience shortcut for EnqueueWriteBuffer.

Setting Kernel Arguments

// Buffer arguments: pass the handle (8 bytes on 64-bit)
cl.SetKernelArg(kernel, 0u, (nuint)IntPtr.Size, ref bufA);
cl.SetKernelArg(kernel, 1u, (nuint)IntPtr.Size, ref bufB);
cl.SetKernelArg(kernel, 2u, (nuint)IntPtr.Size, ref bufC);
 
// Scalar argument: pass the value (4 bytes)
int n = N;
cl.SetKernelArg(kernel, 3u, (nuint)sizeof(int), ref n);

Arguments are numbered 0, 1, 2, … matching the kernel function parameters.

Launching the Kernel

unsafe
{
    nuint globalSize = (nuint)N;       // 1 048 576 work items
    nuint localSize  = (nuint)64;      // 64 items per work group
    cl.EnqueueNdrangeKernel(queue, kernel, 1u,
                            (nuint*)null,   // global offset (start at 0)
                            &globalSize,    // total work items
                            &localSize,     // work group size
                            0u, (nint*)null, out nint _);
}
 
// Block host until GPU is done
cl.Finish(queue);

With 1,048,576 work items and local size 64, OpenCL creates 16,384 work groups. Each work group of 64 items processes 64 consecutive array elements simultaneously.

Downloading the Result

unsafe
{
    fixed (float* pr = result)
        cl.EnqueueReadBuffer(queue, bufC, true,   // true = blocking read
                             0u, bufBytes, pr,
                             0u, (nint*)null, out nint _);
}

true (blocking) means the call does not return until the transfer is complete — equivalent to calling Finish after a non-blocking read.

Cleanup

cl.ReleaseMemObject(bufA);
cl.ReleaseMemObject(bufB);
cl.ReleaseMemObject(bufC);
cl.ReleaseKernel(kernel);
cl.ReleaseProgram(program);
cl.ReleaseCommandQueue(queue);
cl.ReleaseContext(context);

OpenCL uses reference counting. Every Create* call increments a counter; every Release* call decrements it. When the count reaches zero, the object is freed. Failing to release objects causes memory leaks.

Exercise

Modify N and LocalSize and observe the timing. Try:

  • N = 1 << 24 (16 million elements)
  • LocalSize = 256

Sample 03 — Work Items and NDRange

Repository: OpenCL-Samples on Github File: 03_WorkItems/Program.cs
Goal: Make the NDRange, work item, and work group concepts tangible by having each work item record its own IDs.

New concepts: get_local_id, get_group_id, 2D NDRange, stackalloc.

Part A: 1D NDRange

The kernel for Part A simply writes each item’s three IDs into output arrays:

__kernel void show_ids_1d(
    __global int* global_ids,
    __global int* local_ids,
    __global int* group_ids)
{
    int gid = get_global_id(0);   // position in the entire NDRange
    int lid = get_local_id(0);    // position within the work group
    int grp = get_group_id(0);    // which work group we belong to
 
    global_ids[gid] = gid;
    local_ids[gid]  = lid;
    group_ids[gid]  = grp;
}

Launch with 16 work items, local size 4 (→ 4 work groups):

const int GlobalSize1D = 16;
const int LocalSize1D  = 4;
// ...
unsafe
{
    nuint gs = (nuint)GlobalSize1D;
    nuint ls = (nuint)LocalSize1D;
    cl.EnqueueNdrangeKernel(queue, kernel1D, 1u, (nuint*)null, &gs, &ls,
                            0u, (nint*)null, out nint _);
}

Output:

WorkItem    GlobalID     LocalID     GroupID
--------------------------------------------
         0          0           0           0
         1          1           1           0
         2          2           2           0
         3          3           3           0
         4          4           0           1   ← local ID resets at group boundary
         5          5           1           1
         6          6           2           1
         7          7           3           1
         8          8           0           2
         9          9           1           2
        10         10           2           2
        11         11           3           2
        12         12           0           3
        13         13           1           3
        14         14           2           3
        15         15           3           3

Observation: LocalID resets to 0 at every group boundary. Work items 0–3 form group 0, items 4–7 form group 1, etc.

Part B: 2D NDRange

For a 4×6 grid with 2×3 local work groups:

__kernel void show_ids_2d(
    __global int* global_row, __global int* global_col,
    __global int* local_row,  __global int* local_col,
    __global int* group_row,  __global int* group_col,
    int width)
{
    int gr = get_global_id(0);  // row in full grid
    int gc = get_global_id(1);  // column in full grid
    int idx = gr * width + gc;  // linear index
 
    global_row[idx] = gr;       global_col[idx] = gc;
    local_row[idx]  = (int)get_local_id(0);
    local_col[idx]  = (int)get_local_id(1);
    group_row[idx]  = (int)get_group_id(0);
    group_col[idx]  = (int)get_group_id(1);
}

The 2D launch uses stackalloc for the size arrays:

unsafe
{
    nuint* gs = stackalloc nuint[] { (nuint)Rows, (nuint)Cols };          // {4, 6}
    nuint* ls = stackalloc nuint[] { (nuint)LocalRows, (nuint)LocalCols }; // {2, 3}
    cl.EnqueueNdrangeKernel(queue, kernel2D, 2u, (nuint*)null, gs, ls,
                            0u, (nint*)null, out nint _);
}

Output snippet:

 Idx  gRow  gCol  lRow  lCol  grpRow  grpCol
--------------------------------------------
   0     0     0     0     0       0       0
   1     0     1     0     1       0       0
   2     0     2     0     2       0       0
   3     0     3     0     0       0       1   ← new column group
   4     0     4     0     1       0       1
   5     0     5     0     2       0       1
   6     1     0     1     0       0       0
   ...

Observation: The 4×6 grid is divided into 2×2 work groups, each of size 2×3. Local IDs restart at (0,0) at the top-left corner of each work group.

Key Insight: Why Do Work Groups Exist?

graph LR
    A["Work Group\n= team of threads"] --> B["Share Local Memory\n(fast, ~32 KB)"]
    A --> C["Can barrier()\n= synchronise internally"]
    A --> D["Scheduled as a unit\non one Compute Unit"]

    B --> E["Enables: tree reduction\n(Sample 05)"]
    B --> F["Enables: matrix tiling\n(Sample 06)"]
    C --> E

Work groups matter because their members can communicate via shared local memory. Work items in different groups are completely isolated — they can only communicate through slow global memory.

Summary Table

ID functionWhat it gives youRange
get_global_id(d)Unique index across entire NDRange0 .. globalSize-1
get_local_id(d)Index within work group0 .. localSize-1
get_group_id(d)Which work group0 .. numGroups-1
get_global_size(d)Total items in dimension d
get_local_size(d)Items per group in dimension d
get_num_groups(d)Number of groups in dimension d

Relationship: get_global_id(d) == get_group_id(d) * get_local_size(d) + get_local_id(d)