Chapter 4: Samples 01–03 — First Steps
These three samples build the foundation. By the end you will be able to discover OpenCL hardware, execute your first GPU computation, and understand the work item model.
Sample 01 — Device Info
Repository: OpenCL-Samples on Github
File: 01_DeviceInfo/Program.cs
Goal: Discover and print all OpenCL hardware on the machine. No computation.
New concepts: Platform, Device, opaque handles (nint), two-call query pattern, GetPlatformInfo, GetDeviceInfo.
What it does
The program walks the platform → device hierarchy and prints properties of every OpenCL device it finds. Output looks like:
Found 1 OpenCL platform(s):
==============================================================
Platform : Apple
Vendor : Apple
Version: OpenCL 1.2 (Dec 18 2024 19:56:59)
Devices: 2
--------------------------------------------------------------
Name : Apple M3 Pro
Type : GPU
Compute Units : 18
Max Clock : 1000 MHz
Global Memory : 36864 MB
Local Memory : 32 KB
Max Work-Group : 256 work-items
Driver Version : 1.2 1.0
Device Version : OpenCL 1.2
Step 1: Loading the API
// CL.GetApi() loads the OpenCL shared library (OpenCL.dll / libOpenCL.so)
// and returns a managed wrapper object with one method per OpenCL function.
CL cl = CL.GetApi();cl is your gateway to all OpenCL functions. Every sample starts with this line.
Step 2: Discovering Platforms (Two-Call Pattern)
// First call: count only (pass null for the array)
uint numPlatforms;
unsafe { cl.GetPlatformIDs(0u, (nint*)null, out numPlatforms); }
// Second call: fill the array
nint[] platforms = new nint[numPlatforms];
unsafe
{
fixed (nint* p = platforms)
cl.GetPlatformIDs(numPlatforms, p, (uint*)null);
}This two-call pattern (first count, then fill) is used for every list-returning OpenCL function. The fixed block pins the C# array so the GC cannot move it while OpenCL writes into it.
Step 3: Querying Device Properties
uint computeUnits = DeviceInfoU32(cl, device, DeviceInfo.MaxComputeUnits);
ulong globalMemB = DeviceInfoU64(cl, device, DeviceInfo.GlobalMemSize);
nuint maxWGSize = DeviceInfoNuint(cl, device, DeviceInfo.MaxWorkGroupSize);
// Type is a bit-field: CPU=2, GPU=4, Accelerator=8
ulong typeBits = DeviceInfoU64(cl, device, DeviceInfo.Type);
string typeStr = typeBits switch
{
2 => "CPU",
4 => "GPU",
8 => "Accelerator",
_ => $"Other (0x{typeBits:X})"
};GetDeviceInfo returns raw bytes. For scalar values we use unsafe code to write directly into a typed local variable:
static uint DeviceInfoU32(CL cl, nint device, DeviceInfo info)
{
uint v = 0;
unsafe { cl.GetDeviceInfo(device, info, (nuint)sizeof(uint), &v, (nuint*)null); }
return v;
}&v takes the address of the local variable v. OpenCL writes 4 bytes to that address — exactly the right number of bytes for a uint.
Step 4: Reading Strings
For string properties (name, driver version), the two-call pattern is used again to find the size, then read:
static string DeviceString(CL cl, nint device, DeviceInfo info)
{
nuint size;
unsafe { cl.GetDeviceInfo(device, info, 0u, (void*)null, out size); }
byte[] buf = new byte[(int)size];
unsafe { fixed (byte* p = buf) cl.GetDeviceInfo(device, info, size, p, out _); }
return Encoding.UTF8.GetString(buf).TrimEnd('\0');
}OpenCL strings are null-terminated C strings (UTF-8). We read them as a byte array and strip the trailing \0.
Exercise
Run Sample 01 and note the values for your device:
- How many compute units?
- How much local memory?
- What is the maximum work group size?
These numbers will matter in later samples.
Sample 02 — Vector Addition
Repository: OpenCL-Samples on Github
File: 02_VectorAdd/Program.cs
Goal: Implement the simplest possible parallel computation: C[i] = A[i] + B[i] for 1 million elements.
New concepts: Kernel source, CreateContext, CreateCommandQueue, CreateBuffer, SetKernelArg, EnqueueNDRangeKernel, Finish, EnqueueReadBuffer, ReleaseMemObject.
The Kernel
The kernel is a string of OpenCL C code embedded in the C# file:
__kernel void vector_add(
__global const float* a, // input buffer A (read-only)
__global const float* b, // input buffer B (read-only)
__global float* c, // output buffer C (write-only)
int n) // total number of elements
{
int i = get_global_id(0); // unique index of THIS work item
// Safety check: global size is rounded up, so last group may go out of bounds
if (i < n)
c[i] = a[i] + b[i];
}Key observations:
__kernelmarks it as a GPU entry point__globalmeans the pointer lives in GPU global memoryget_global_id(0)gives each work item a unique index along dimension 0- The
if (i < n)guard handles the case where global size > n (rounding)
The Full Pipeline
sequenceDiagram participant H as Host (C#) participant GPU as GPU H->>H: 1. Prepare data: float[] a, b (1M elements) H->>GPU: 2. CreateBuffer + CopyHostPtr → upload a, b H->>H: 3. Compile kernel string H->>H: 4. SetKernelArg (buffers + n) H->>GPU: 5. EnqueueNDRangeKernel (1M work items) GPU->>GPU: each item: c[i] = a[i] + b[i] H->>GPU: 6. Finish() — wait for completion H->>H: 7. EnqueueReadBuffer — download result H->>H: 8. Verify: compare c[i] vs expected
Creating the Context and Queue
// Context: links host to the device
nint context;
unsafe
{
nint dev = device;
context = cl.CreateContext((nint*)null, 1u, &dev, null, (void*)null, (int*)null);
// ^^^^^^^^^^^ ^^^^^^^^^
// null properties no callback no error pointer
}
// Command queue: the ordered channel for sending work to this device
nint queue = cl.CreateCommandQueue(context, device, CommandQueueProperties.None, out _);Compiling the Kernel
static nint Compile(CL cl, nint context, nint device, string source)
{
nint program;
unsafe { program = cl.CreateProgramWithSource(context, 1u, new[] { source }, (nuint*)null, out _); }
nint dev = device;
int err;
unsafe { err = cl.BuildProgram(program, 1u, &dev, (byte*)null, null, (void*)null); }
if (err != 0)
{
// Fetch the compiler log to show what went wrong
nuint logSize;
unsafe { cl.GetProgramBuildInfo(program, device, ProgramBuildInfo.BuildLog,
0u, (void*)null, out logSize); }
byte[] log = new byte[(int)logSize];
unsafe { fixed (byte* p = log)
cl.GetProgramBuildInfo(program, device, ProgramBuildInfo.BuildLog,
logSize, p, (nuint*)null); }
throw new Exception($"Kernel build failed:\n{Encoding.UTF8.GetString(log).TrimEnd('\0')}");
}
return program;
}BuildProgram is like dotnet build — it compiles the OpenCL C source for the target device’s instruction set at runtime. If there’s a syntax error, the build log shows the error message.
Allocating and Uploading Buffers
nuint bufBytes = (nuint)(N * sizeof(float));
nint bufA, bufB, bufC;
unsafe
{
fixed (float* pa = a)
bufA = cl.CreateBuffer(context,
MemFlags.ReadOnly | MemFlags.CopyHostPtr,
bufBytes, pa, out _);
// ^^^
// tells OpenCL to immediately copy from *pa into GPU memory
fixed (float* pb = b)
bufB = cl.CreateBuffer(context, MemFlags.ReadOnly | MemFlags.CopyHostPtr,
bufBytes, pb, out _);
// Output buffer: just allocate, no initial data
bufC = cl.CreateBuffer(context, MemFlags.WriteOnly, bufBytes, (void*)null, out _);
}MemFlags.CopyHostPtr + fixed + pa causes OpenCL to copy the contents of array a into GPU memory as part of CreateBuffer. This is a convenience shortcut for EnqueueWriteBuffer.
Setting Kernel Arguments
// Buffer arguments: pass the handle (8 bytes on 64-bit)
cl.SetKernelArg(kernel, 0u, (nuint)IntPtr.Size, ref bufA);
cl.SetKernelArg(kernel, 1u, (nuint)IntPtr.Size, ref bufB);
cl.SetKernelArg(kernel, 2u, (nuint)IntPtr.Size, ref bufC);
// Scalar argument: pass the value (4 bytes)
int n = N;
cl.SetKernelArg(kernel, 3u, (nuint)sizeof(int), ref n);Arguments are numbered 0, 1, 2, … matching the kernel function parameters.
Launching the Kernel
unsafe
{
nuint globalSize = (nuint)N; // 1 048 576 work items
nuint localSize = (nuint)64; // 64 items per work group
cl.EnqueueNdrangeKernel(queue, kernel, 1u,
(nuint*)null, // global offset (start at 0)
&globalSize, // total work items
&localSize, // work group size
0u, (nint*)null, out nint _);
}
// Block host until GPU is done
cl.Finish(queue);With 1,048,576 work items and local size 64, OpenCL creates 16,384 work groups. Each work group of 64 items processes 64 consecutive array elements simultaneously.
Downloading the Result
unsafe
{
fixed (float* pr = result)
cl.EnqueueReadBuffer(queue, bufC, true, // true = blocking read
0u, bufBytes, pr,
0u, (nint*)null, out nint _);
}true (blocking) means the call does not return until the transfer is complete — equivalent to calling Finish after a non-blocking read.
Cleanup
cl.ReleaseMemObject(bufA);
cl.ReleaseMemObject(bufB);
cl.ReleaseMemObject(bufC);
cl.ReleaseKernel(kernel);
cl.ReleaseProgram(program);
cl.ReleaseCommandQueue(queue);
cl.ReleaseContext(context);OpenCL uses reference counting. Every Create* call increments a counter; every Release* call decrements it. When the count reaches zero, the object is freed. Failing to release objects causes memory leaks.
Exercise
Modify N and LocalSize and observe the timing. Try:
N = 1 << 24(16 million elements)LocalSize = 256
Sample 03 — Work Items and NDRange
Repository: OpenCL-Samples on Github
File: 03_WorkItems/Program.cs
Goal: Make the NDRange, work item, and work group concepts tangible by having each work item record its own IDs.
New concepts: get_local_id, get_group_id, 2D NDRange, stackalloc.
Part A: 1D NDRange
The kernel for Part A simply writes each item’s three IDs into output arrays:
__kernel void show_ids_1d(
__global int* global_ids,
__global int* local_ids,
__global int* group_ids)
{
int gid = get_global_id(0); // position in the entire NDRange
int lid = get_local_id(0); // position within the work group
int grp = get_group_id(0); // which work group we belong to
global_ids[gid] = gid;
local_ids[gid] = lid;
group_ids[gid] = grp;
}Launch with 16 work items, local size 4 (→ 4 work groups):
const int GlobalSize1D = 16;
const int LocalSize1D = 4;
// ...
unsafe
{
nuint gs = (nuint)GlobalSize1D;
nuint ls = (nuint)LocalSize1D;
cl.EnqueueNdrangeKernel(queue, kernel1D, 1u, (nuint*)null, &gs, &ls,
0u, (nint*)null, out nint _);
}Output:
WorkItem GlobalID LocalID GroupID
--------------------------------------------
0 0 0 0
1 1 1 0
2 2 2 0
3 3 3 0
4 4 0 1 ← local ID resets at group boundary
5 5 1 1
6 6 2 1
7 7 3 1
8 8 0 2
9 9 1 2
10 10 2 2
11 11 3 2
12 12 0 3
13 13 1 3
14 14 2 3
15 15 3 3
Observation: LocalID resets to 0 at every group boundary. Work items 0–3 form group 0, items 4–7 form group 1, etc.
Part B: 2D NDRange
For a 4×6 grid with 2×3 local work groups:
__kernel void show_ids_2d(
__global int* global_row, __global int* global_col,
__global int* local_row, __global int* local_col,
__global int* group_row, __global int* group_col,
int width)
{
int gr = get_global_id(0); // row in full grid
int gc = get_global_id(1); // column in full grid
int idx = gr * width + gc; // linear index
global_row[idx] = gr; global_col[idx] = gc;
local_row[idx] = (int)get_local_id(0);
local_col[idx] = (int)get_local_id(1);
group_row[idx] = (int)get_group_id(0);
group_col[idx] = (int)get_group_id(1);
}The 2D launch uses stackalloc for the size arrays:
unsafe
{
nuint* gs = stackalloc nuint[] { (nuint)Rows, (nuint)Cols }; // {4, 6}
nuint* ls = stackalloc nuint[] { (nuint)LocalRows, (nuint)LocalCols }; // {2, 3}
cl.EnqueueNdrangeKernel(queue, kernel2D, 2u, (nuint*)null, gs, ls,
0u, (nint*)null, out nint _);
}Output snippet:
Idx gRow gCol lRow lCol grpRow grpCol
--------------------------------------------
0 0 0 0 0 0 0
1 0 1 0 1 0 0
2 0 2 0 2 0 0
3 0 3 0 0 0 1 ← new column group
4 0 4 0 1 0 1
5 0 5 0 2 0 1
6 1 0 1 0 0 0
...
Observation: The 4×6 grid is divided into 2×2 work groups, each of size 2×3. Local IDs restart at (0,0) at the top-left corner of each work group.
Key Insight: Why Do Work Groups Exist?
graph LR A["Work Group\n= team of threads"] --> B["Share Local Memory\n(fast, ~32 KB)"] A --> C["Can barrier()\n= synchronise internally"] A --> D["Scheduled as a unit\non one Compute Unit"] B --> E["Enables: tree reduction\n(Sample 05)"] B --> F["Enables: matrix tiling\n(Sample 06)"] C --> E
Work groups matter because their members can communicate via shared local memory. Work items in different groups are completely isolated — they can only communicate through slow global memory.
Summary Table
| ID function | What it gives you | Range |
|---|---|---|
get_global_id(d) | Unique index across entire NDRange | 0 .. globalSize-1 |
get_local_id(d) | Index within work group | 0 .. localSize-1 |
get_group_id(d) | Which work group | 0 .. numGroups-1 |
get_global_size(d) | Total items in dimension d | — |
get_local_size(d) | Items per group in dimension d | — |
get_num_groups(d) | Number of groups in dimension d | — |
Relationship: get_global_id(d) == get_group_id(d) * get_local_size(d) + get_local_id(d)