Chapter 1: Heterogeneous Computing & GPGPU

1.1 What is a Heterogeneous System?

A homogeneous computing system has only one type of processor — a multi-core CPU, for example. Every core can run any program at full capability.

A heterogeneous system combines different types of processors on the same machine, each optimised for different kinds of work:

Component   Optimised for                                    Typical count
─────────   ─────────────                                    ─────────────
CPU         Complex sequential logic, branching, OS tasks    4 – 32 cores
GPU         Simple arithmetic on huge data sets              1,000 – 10,000+ shader cores
NPU / TPU   Neural network inference                         1
DSP         Real-time signal processing                      1 – 4

Modern smartphones, laptops, and servers all contain at least a CPU and a GPU. Your laptop’s GPU is not just for games: it can also accelerate scientific computation, image processing, and machine learning.

GPGPU = General-Purpose computing on Graphics Processing Units
Using the GPU for calculations that have nothing to do with rendering graphics.


1.2 CPU Architecture

A modern desktop CPU has a small number of powerful, complex cores.

Key CPU characteristics:

  • Each core can run a completely different program with different data
  • Large caches reduce memory access latency
  • Branch prediction allows speculative execution
  • Cores run at 3–5 GHz
  • Typical total: 4 to 32 cores

1.3 GPU Architecture

A GPU looks completely different. It has thousands of small, simple cores (called shader processors, stream processors, or CUDA cores, depending on the vendor), which are grouped into compute units.

Key GPU characteristics:

  • Cores are simple: little branching logic, small cache
  • Threads execute in lockstep groups (warps on NVIDIA, wavefronts on AMD): every thread in a group runs the same instruction on different data (SIMT)
  • Huge memory bandwidth (often around 10× that of CPU main memory)
  • Runs at 1–3 GHz, but compensates with sheer core count
  • Typical total: 1,000 – 10,000+ shader cores

1.4 CPU vs GPU — The Mental Model

Think of it this way:

CPU = a few expert surgeons who can each handle any complex operation independently.

GPU = a huge army of simple workers who all do the same job at the same time.

The CPU is best when tasks are complex, sequential, and different from each other (e.g., running your operating system, compiling code, handling network packets).

The GPU is best when tasks are simple, independent, and identical, and are applied to massive amounts of data (e.g., computing every pixel of an image, training a neural network, simulating a million particles).


1.5 Data Parallelism

The fundamental idea behind GPGPU is data parallelism: performing the same operation on many data items simultaneously.

Example — adding two arrays element by element:

Sequential (CPU):           Parallel (GPU):
for i in 0..N:              all i at once:
  C[i] = A[i] + B[i]         C[i] = A[i] + B[i]   ← each "thread" handles one i

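On a CPU with 8 cores, you can do 8 additions at once.
On a GPU with 4,096 cores, you can do 4,096 additions at once.
For 1,000,000 elements, that is 4,096 / 8 = 512 times as many additions per step, so the GPU finishes ~500× faster in the ideal case (ignoring clock speed, CPU vector instructions, and the time spent copying data to and from the GPU).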
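
The split between "the work for one index" and "who runs each index" can be sketched in plain C. This is only a CPU-side illustration, not OpenCL code: the function names `add_one` and `add_arrays` are hypothetical, and the loop stands in for what a GPU runtime would do by giving each index to its own thread.

```c
#include <stddef.h>

/* Hypothetical "kernel": the work done for ONE index i.
   On a GPU, each thread would run this body with its own i. */
void add_one(const float *a, const float *b, float *c, size_t i)
{
    c[i] = a[i] + b[i];
}

/* Sequential launcher: visits the indices one after another.
   A GPU runtime would instead hand the n indices to n threads. */
void add_arrays(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        add_one(a, b, c, i);
    }
}
```

Because `add_one` reads and writes only position `i`, the calls do not depend on each other, which is what makes the loop safe to run in any order or all at once.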

Problems well-suited for GPU:

  • Image / video processing (each pixel is independent)
  • Physical simulation (each particle is independent)
  • Matrix multiplication (each output element is independent)
  • Cryptography, compression, sorting

Problems poorly suited for GPU:

  • Sequential algorithms (each step depends on the previous)
  • Heavy branching (threads that execute together must take the same code path; when they diverge, each path runs in turn)
  • Irregular memory access patterns

1.6 Amdahl’s Law

Not all code can be parallelised. Amdahl’s Law gives the theoretical maximum speedup:

S = 1 / ((1 − P) + P / N)

Where:

  • P = fraction of code that can run in parallel (0.0–1.0)
  • N = number of parallel processors
  • S = resulting speedup factor

Example: If 90% of your program is parallelisable and you have a GPU with 1,000 cores:

S = 1 / ((1 − 0.9) + 0.9 / 1,000) = 1 / 0.1009 ≈ 9.9

Even with infinite cores, the maximum speedup is 10× (limited by the sequential 10%).
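
For experimenting with other values of P and N, the law is easy to evaluate in code. The following is a minimal sketch in plain C, using the standard form S = 1 / ((1 − P) + P / N); the function names `amdahl_speedup` and `amdahl_limit` are chosen here for illustration.

```c
/* Amdahl's Law: S = 1 / ((1 - P) + P / N).
   p = parallel fraction in [0, 1], n = number of processors (>= 1). */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Upper bound as n grows without limit: 1 / (1 - p), valid for p < 1. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

With p = 0.9, `amdahl_speedup(0.9, 1000.0)` evaluates to about 9.91 and `amdahl_limit(0.9)` to 10, so going from 1,000 cores to infinitely many gains almost nothing.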

Lesson: Profile first. Only parallelise the bottleneck.


1.7 What is OpenCL?

OpenCL (Open Computing Language) is an open standard for writing programs that run across heterogeneous platforms — CPUs, GPUs, FPGAs, and DSPs — from any vendor.

It is maintained by the Khronos Group (the same consortium behind OpenGL and Vulkan).

graph TD
    A["Your C# Host Program"] --> B["OpenCL API  (Silk.NET.OpenCL)"]
    B --> C["OpenCL Platform – NVIDIA"]
    B --> D["OpenCL Platform – AMD"]
    B --> E["OpenCL Platform – Intel"]
    B --> F["OpenCL Platform – Apple"]
    C --> G["NVIDIA GPU"]
    D --> H["AMD GPU"]
    D --> I["AMD CPU"]
    E --> J["Intel GPU"]
    E --> K["Intel CPU"]
    F --> L["Apple Silicon GPU"]

OpenCL vs CUDA

Feature            OpenCL                              CUDA
───────            ──────                              ────
Vendor             Khronos (open standard)             NVIDIA (proprietary)
Hardware support   Any GPU/CPU with an OpenCL driver   NVIDIA GPUs only
Performance        Slightly lower on NVIDIA            Optimal on NVIDIA
Portability        High                                Low
Language           OpenCL C (C99 subset)               CUDA C++

We use OpenCL because the same code runs on hardware from any vendor that provides an OpenCL driver, whether that’s NVIDIA, AMD, Intel, or Apple Silicon.

How Silk.NET.OpenCL fits in

The OpenCL API is a C library — it was designed for C programs, not C#. Silk.NET.OpenCL is a thin, auto-generated C# wrapper that makes every OpenCL function available as a C# method.

Under the hood it still calls the native C API, which is why we need unsafe code in some places (covered in Chapter 2).

OpenCL C API                  Silk.NET wrapper              Your C# code
────────────                  ────────────────              ────────────
clGetPlatformIDs()        ←→  cl.GetPlatformIDs(...)    ←   Program.cs
clCreateBuffer()          ←→  cl.CreateBuffer(...)      ←   Program.cs
clEnqueueNDRangeKernel()  ←→  cl.EnqueueNdrangeKernel() ←   Program.cs

1.8 The Big Picture

Here is the overall flow of a GPGPU program:

flowchart LR
    A["1\. Write kernel<br>(OpenCL C)"] --> B["2\. Compile<br>kernel<br>at runtime"]
    B --> C["3\. Upload data<br>to GPU memory"]
    C --> D["4\. Launch thousands<br>of threads"]
    D --> E["5\. Download result<br>from GPU memory"]
    E --> F["6\. Use result<br>in C#"]

    style A fill:#E3F2FD
    style B fill:#BBDEFB
    style C fill:#90CAF9
    style D fill:#42A5F5,color:#fff
    style E fill:#1E88E5,color:#fff
    style F fill:#1565C0,color:#fff

Each step maps to specific OpenCL API calls, which we explore in Chapter 3 (03_OpenCL_Model) and the sample walkthroughs.
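
Before meeting the real API, it can help to see the shape of steps 3 to 5 in ordinary C. The sketch below is a CPU-only mock and makes no OpenCL calls: a `malloc`'d buffer plays the role of GPU memory, `memcpy` stands in for upload and download, and a loop stands in for the launch. The names `scale_kernel` and `run_on_mock_device` are hypothetical. In a real program these steps would go through OpenCL functions such as `clEnqueueWriteBuffer`, `clEnqueueNDRangeKernel`, and `clEnqueueReadBuffer`.

```c
#include <stdlib.h>
#include <string.h>

/* Step 1 stand-in: the per-element work a kernel would describe.
   (Step 2, runtime compilation, has no CPU-only equivalent here.) */
void scale_kernel(float *data, float factor, size_t gid)
{
    data[gid] *= factor;
}

/* Runs steps 3 to 5 against a malloc'd buffer that plays the role of
   GPU memory. Assumes n > 0. Returns 0 on success, -1 if the
   allocation fails. */
int run_on_mock_device(const float *input, float *output, size_t n, float factor)
{
    float *device = malloc(n * sizeof *device);
    if (device == NULL) {
        return -1;
    }
    memcpy(device, input, n * sizeof *device);   /* 3. upload data     */
    for (size_t gid = 0; gid < n; ++gid) {       /* 4. "launch": one   */
        scale_kernel(device, factor, gid);       /*    call per index  */
    }
    memcpy(output, device, n * sizeof *device);  /* 5. download result */
    free(device);
    return 0;                                    /* 6. caller uses output */
}
```

The two copies are worth noticing: on real hardware they cross the bus between host and GPU memory, so a kernel has to do enough work per element to justify them.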