SAGA Pattern

A SAGA is a pattern for handling long-running business processes in distributed systems without using a single distributed database transaction.

In a microservices architecture, each service owns its own data store. That means you cannot safely do:

“Update Service A DB, update Service B DB, update Service C DB — all in one ACID transaction.”

Instead, a SAGA coordinates a sequence of local transactions (each service commits its own DB changes) and uses messages to move the process forward.

If something fails, the SAGA triggers compensating actions to “undo” or offset earlier steps.


What a SAGA does (in one sentence)

A SAGA drives a multi-step distributed workflow to an eventually consistent outcome by coordinating local steps and executing compensations when a step fails.


Why SAGA exists

Distributed transactions (2PC / XA) are usually avoided because they:

  • reduce availability
  • are complex to operate
  • couple services tightly
  • don’t play nicely with message brokers and retries

SAGAs are the practical alternative for real-world microservice systems.


Two common SAGA styles

1) Orchestration (central coordinator)

A dedicated Orchestrator service:

  • decides what the next step is
  • sends commands / publishes events
  • tracks saga state
  • triggers compensation on failure

Pros

  • Clear control flow in one place
  • Easier to reason about and test

Cons

  • Orchestrator becomes a critical component (if it is down, no saga makes progress)
  • You must design it carefully (state, idempotency, retries)

2) Choreography (event-driven, no central brain)

Services react to events and emit new events, forming a chain.

Pros

  • No central service; more decentralized

Cons

  • Harder to understand global flow
  • Can become “spaghetti events” without strong discipline
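
The choreography style can be sketched as follows. This is a minimal in-memory illustration, not a production design: InMemoryBus, the handler wiring, and the plain-string event names are hypothetical and only loosely mirror the ticket scenario used in this document. Note that there is no orchestrator; each service only knows which events it reacts to and which events it emits.

```csharp
using System;
using System.Collections.Generic;

// Wire up the services: each one subscribes to the events it cares about.
var bus = new InMemoryBus();

// Access Control reacts to a purchase and (in this demo) always denies entry.
bus.Subscribe("TicketPurchased", processId =>
    bus.Publish("EntryDenied", processId));

// Ticket Service reacts to a denied entry by refunding the ticket.
bus.Subscribe("EntryDenied", processId =>
    bus.Publish("TicketRefunded", processId));

// Notification Service reacts to the refund.
bus.Subscribe("TicketRefunded", processId =>
    Console.WriteLine($"Notify user: refund done for {processId}"));

// One initial event sets the whole chain in motion.
bus.Publish("TicketPurchased", "process-1");
Console.WriteLine(string.Join(" -> ", bus.Log));

// Hypothetical in-memory stand-in for a real message broker.
class InMemoryBus
{
    private readonly Dictionary<string, List<Action<string>>> handlers = new();

    // Every published event type, in publication order (for inspection only).
    public List<string> Log { get; } = new();

    public void Subscribe(string eventType, Action<string> handler)
    {
        if (!handlers.TryGetValue(eventType, out var list))
            handlers[eventType] = list = new List<Action<string>>();
        list.Add(handler);
    }

    public void Publish(string eventType, string processId)
    {
        Log.Add(eventType);
        if (handlers.TryGetValue(eventType, out var list))
            foreach (var handler in list)
                handler(processId);
    }
}
```

The global flow exists only implicitly, spread across the three subscriptions, which is exactly why choreography gets harder to follow as the number of events grows.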

Core SAGA concepts you must implement

1) Steps are local transactions

Each step is a normal local DB transaction inside a single service.

2) Correlation

All events/commands in a saga must include a correlation id (often called processId, sagaId, etc.) so the orchestrator can match replies to the correct saga instance.
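
A minimal sketch of correlation, assuming a hypothetical SagaMessage envelope and an in-memory SagaRegistry (neither name comes from a real library): every message carries the ProcessId, and the receiving side uses it to find the saga instance the message belongs to.

```csharp
using System;
using System.Collections.Generic;

var registry = new SagaRegistry();
registry.Start("process-1");
registry.Start("process-2");

// A reply arrives; its ProcessId tells us which saga instance it belongs to.
var reply = new SagaMessage(
    EventId: Guid.NewGuid().ToString(),
    ProcessId: "process-2",
    Type: "TicketRefunded");

Console.WriteLine(registry.Route(reply));                                 // True: routed to process-2
Console.WriteLine(registry.Route(reply with { ProcessId = "unknown" }));  // False: no matching saga

// Hypothetical envelope: every saga message carries the same ProcessId.
record SagaMessage(string EventId, string ProcessId, string Type);

// Hypothetical in-memory registry of running saga instances.
class SagaRegistry
{
    // processId -> message types received so far for that saga
    private readonly Dictionary<string, List<string>> received = new();

    public void Start(string processId) => received[processId] = new List<string>();

    // Returns false when no saga instance matches the correlation id.
    public bool Route(SagaMessage message)
    {
        if (!received.TryGetValue(message.ProcessId, out var history))
            return false;
        history.Add(message.Type);
        return true;
    }

    public IReadOnlyList<string> HistoryOf(string processId) => received[processId];
}
```

A message without a matching saga is reported rather than silently applied to some other instance; in a real system that case would typically be logged.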

3) Idempotency

Because messages may be delivered more than once, the orchestrator and all participants must be safe to retry:

  • “same command/event again” must not corrupt state
  • repeated messages should be ignored or treated as already done
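
The idea above can be sketched for a participant as follows. TicketRefundHandler and its members are hypothetical, and the processed-id set is kept in memory only for brevity; the sketch assumes each command carries a unique commandId.

```csharp
using System;
using System.Collections.Generic;

var service = new TicketRefundHandler();

// The broker delivers the same command twice (same commandId).
Console.WriteLine(service.HandleRefund("cmd-1", "ticket-42")); // True: refund executed
Console.WriteLine(service.HandleRefund("cmd-1", "ticket-42")); // False: duplicate ignored
Console.WriteLine(service.RefundCount);                         // 1

// Hypothetical participant that refunds at most once per command id.
class TicketRefundHandler
{
    // In a real service this set would live in the same database (and the
    // same local transaction) as the refund itself.
    private readonly HashSet<string> processedCommandIds = new();

    public int RefundCount { get; private set; }

    // Returns true if the refund was executed, false if it was a duplicate.
    public bool HandleRefund(string commandId, string ticketId)
    {
        // HashSet.Add returns false when the id is already present.
        if (!processedCommandIds.Add(commandId))
            return false;

        RefundCount++; // stand-in for the real refund logic for ticketId
        return true;
    }
}
```

Recording the command id and performing the refund should happen atomically; otherwise a crash between the two can still cause a double refund or a lost one.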

4) Compensation

For each step that can’t be “rolled back” automatically, define a compensating action. Example: if you already issued a ticket, compensation might be “refund/cancel ticket”.
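
A minimal sketch of how compensations pair with steps. It is deliberately simplified: everything runs in one process, whereas in a real SAGA each step is a local transaction in a different service, triggered by messages. SagaStep, SagaRunner, and the step names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

var log = new List<string>();

var steps = new List<SagaStep>
{
    new SagaStep("ReserveSeat", () => log.Add("seat reserved"), () => log.Add("seat released")),
    new SagaStep("IssueTicket", () => log.Add("ticket issued"), () => log.Add("ticket cancelled")),
    // This step fails, so the two completed steps before it must be compensated.
    new SagaStep("ChargeCard",
        () => throw new InvalidOperationException("payment declined"),
        () => log.Add("charge reverted")),
};

bool ok = SagaRunner.Run(steps);
Console.WriteLine(ok);                      // False
Console.WriteLine(string.Join(", ", log));  // seat reserved, ticket issued, ticket cancelled, seat released

// A step and the action that offsets it if a later step fails.
record SagaStep(string Name, Action Execute, Action Compensate);

static class SagaRunner
{
    // Runs steps in order; on failure, compensates completed steps in reverse order.
    public static bool Run(IEnumerable<SagaStep> steps)
    {
        var completed = new Stack<SagaStep>();
        foreach (var step in steps)
        {
            try
            {
                step.Execute();
                completed.Push(step);
            }
            catch (Exception)
            {
                // The failed step itself is not compensated: it never completed.
                while (completed.Count > 0)
                    completed.Pop().Compensate();
                return false;
            }
        }
        return true;
    }
}
```

Compensations run newest-first, so later steps are undone before the steps they depended on.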


Simple example

Use case: Auto-refund when entry fails for a system reason

Story

  1. User purchases a ticket.
  2. User scans ticket at gate.
  3. Gate denies entry due to a system reason (e.g., a partial outage of the scanner service or a policy misconfiguration).
  4. Orchestrator triggers a refund in Ticket Service.
  5. User is notified.

Event/command flow (concept)

sequenceDiagram
  participant TS as Ticket Service
  participant ACS as Access Control
  participant ORCH as Orchestrator
  participant MQ as Broker

  TS->>MQ: TicketPurchased (processId)
  MQ->>ORCH: TicketPurchased
  ORCH->>MQ: StartEntrySaga (processId)

  ACS->>MQ: EntryDenied (processId, reason=SYSTEM)
  MQ->>ORCH: EntryDenied
  ORCH->>MQ: RefundTicketCommand (processId, ticketId)

  TS->>MQ: TicketRefunded (processId)
  MQ->>ORCH: TicketRefunded
  ORCH->>MQ: SagaCompleted (processId)

Minimal C# example (illustrative orchestrator logic)

This snippet shows the shape of an orchestrator as a state machine. It is not a full implementation guide: you still have to connect the idea to your own messaging and persistence. Treat it as a sketch, not as production-ready C#.

// =======================================================
// REFUND SAGA (Orchestrated SAGA)
//
// Scenario:
//  - Ticket was purchased
//  - User tries to enter the festival
//  - If entry is denied due to a SYSTEM reason -> refund ticket
//  - Otherwise, saga ends as Failed (or could end with no refund)
// =======================================================

using System.Collections.Generic;
using System.Threading.Tasks;


// 1) The saga is a small state machine.
//    Each saga instance lives across multiple messages/events.
enum RefundSagaState
{
    Started,                // Saga exists but we haven't started waiting for entry yet
    WaitingForEntryResult,  // Ticket bought; waiting to learn if entry was granted/denied
    Refunding,              // We decided to refund and are waiting for confirmation
    Completed,              // Everything finished successfully
    Failed                  // Saga ended in a failure scenario (no compensation or unresolved)
}
 
 
// 2) Saga instance = the "memory" of the orchestrator for ONE processId.
//    It must be stored somewhere durable (DB recommended).
class RefundSagaInstance
{
    public string ProcessId { get; set; } = "";   // Correlation ID: ties all related messages together
    public string TicketId { get; set; } = "";    // Needed so we can tell TicketService what to refund
    public RefundSagaState State { get; set; } = RefundSagaState.Started;  // Current step in the workflow

    // Idempotency helper:
    // In distributed systems, messages can be delivered more than once.
    // If we process duplicates, we might refund twice -> bad.
    // So we record event IDs that we already handled.
    public HashSet<string> ProcessedEventIds { get; } = new HashSet<string>();
}
 
 
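// 3) The orchestrator is just an event handler + a state machine.
//    It reacts to incoming events and publishes commands/events.
//    ISagaStore and IMessagePublisher are minimal assumed abstractions;
//    back them with your own persistence and messaging code.
interface ISagaStore
{
    Task<RefundSagaInstance> Load(string processId);          // Returns null if no saga exists
    Task<RefundSagaInstance> LoadOrCreate(string processId);  // Creates a new instance if none exists
    Task Save(RefundSagaInstance saga);
}

interface IMessagePublisher
{
    Task Publish(string type, object data);
}

class RefundSagaOrchestrator
{
    private readonly ISagaStore store;             // Loads/saves saga instances (DB, Redis, etc.)
    private readonly IMessagePublisher publisher;  // Publishes outgoing messages (commands/events)

    public RefundSagaOrchestrator(ISagaStore store, IMessagePublisher publisher)
    {
        this.store = store;
        this.publisher = publisher;
    }
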
 
    // -------------------------------------------------------
    // Event handler: TicketPurchased
    //
    // Meaning:
    //   The user successfully purchased a ticket.
    //
    // Our job:
    //   Create/initialize saga state and wait for the entry outcome.
    // -------------------------------------------------------
    public async Task OnTicketPurchased(string eventId, string processId, string ticketId)
    {
        var saga = await store.LoadOrCreate(processId);

        // Idempotency:
        // If we already processed this exact event, do nothing.
        if (saga.ProcessedEventIds.Contains(eventId))
            return;

        saga.ProcessedEventIds.Add(eventId);

        // Store data we will need later
        saga.TicketId = ticketId;

        // Move saga forward: we now wait for entry outcome events
        saga.State = RefundSagaState.WaitingForEntryResult;

        await store.Save(saga);

        // Optional: publish a "SagaStarted" event for monitoring/notifications
        // await publisher.Publish(type: "festivo.saga.started.v1", data: new { processId });
    }
 
 
    // -------------------------------------------------------
    // Event handler: EntryDenied
    //
    // Meaning:
    //   The gate system denied entry for some reason.
    //
    // Our job:
    //   If denial reason is SYSTEM -> compensate by refunding the ticket.
    //   If denial reason is USER -> fail (no refund), or choose your own policy.
    // -------------------------------------------------------
    public async Task OnEntryDenied(string eventId, string processId, string reason)
    {
        var saga = await store.Load(processId);

        // If the saga does not exist, we cannot correlate this event.
        // In real systems you would log this and possibly alert.
        if (saga == null)
            return;

        // Idempotency: ignore duplicates
        if (saga.ProcessedEventIds.Contains(eventId))
            return;

        saga.ProcessedEventIds.Add(eventId);

        // Guard against out-of-order events:
        // If we are not waiting for entry results anymore, ignore/record for debugging.
        if (saga.State != RefundSagaState.WaitingForEntryResult)
            return;

        // Business decision:
        // Only compensate (refund) if the denial was due to a SYSTEM issue.
        if (reason == "SYSTEM")
        {
            saga.State = RefundSagaState.Refunding;
            await store.Save(saga);

            // We do NOT call TicketService directly here.
            // We publish a command message so the system stays loosely coupled.
            // Caveat: Save and Publish are two separate writes. If Publish fails
            // after Save succeeded, a redelivery of this event is ignored and the
            // refund is never requested. A real implementation must make the two
            // atomic (for example with a transactional outbox).
            await publisher.Publish(
                type: "festivo.ticket.refund.requested.v1",
                data: new
                {
                    processId,
                    ticketId = saga.TicketId
                });

            // Now we wait for TicketRefunded to come back later.
        }
        else
        {
            // Example: USER reason could be "ticket already used", "invalid ticket", etc.
            // We end the saga without compensation.
            saga.State = RefundSagaState.Failed;
            await store.Save(saga);

            // Optional: publish "SagaFailed" so UI can show a clear outcome
        }
    }
 
 
    // -------------------------------------------------------
    // Event handler: TicketRefunded
    //
    // Meaning:
    //   TicketService confirms it refunded the ticket.
    //
    // Our job:
    //   Mark saga as Completed.
    // -------------------------------------------------------
    public async Task OnTicketRefunded(string eventId, string processId)
    {
        var saga = await store.Load(processId);
        if (saga == null)
            return;

        // Idempotency: ignore duplicates
        if (saga.ProcessedEventIds.Contains(eventId))
            return;

        saga.ProcessedEventIds.Add(eventId);

        // Only accept this event if we're currently waiting for it
        if (saga.State != RefundSagaState.Refunding)
            return;

        saga.State = RefundSagaState.Completed;
        await store.Save(saga);

        // Publish a final status event (useful for NotificationService / UI)
        await publisher.Publish(
            type: "festivo.saga.completed.v1",
            data: new { processId });
    }
}

What this example highlights

  • SAGA instances are state machines
  • Every message must be correlated with processId
  • Idempotency is required (sketched here with ProcessedEventIds)
  • The orchestrator triggers compensating actions (e.g. refund)

Practical checklist (what you should enforce in your implementation)

  • All saga-related messages carry a processId
  • Orchestrator persists saga state durably (a database is recommended)
  • Orchestrator is idempotent (duplicate events are safe)
  • Participants handle duplicate commands safely
  • Compensation is implemented and observable (logs + notifications)

What you should be able to answer after reading this

  • Why can’t we just use a normal transaction across microservices?
  • What’s the difference between orchestration and choreography?
  • What does “compensation” mean in a SAGA?
  • How do correlation ids and idempotency prevent chaos under retries?