Error Handling That Scales in .NET: Railway-Oriented Programming, Result Types & Resilient Architecture

1 Why Error Handling Must Scale

Modern distributed systems rarely fail in clean, predictable ways. In small prototypes, an exception stack trace is often “good enough” to debug issues. But under real-world load—hundreds of concurrent requests, multiple downstream dependencies, retries, and timeouts—errors multiply, compound, and hide. The difference between a graceful degradation and a cascading outage lies in how intentionally your team treats failure.

This section explores why error handling must scale alongside your system, how to classify failure modes, and what goals define a “mature” strategy for .NET applications running in production.

1.1 The cost of “works on my machine”: failure modes that only appear under load

When applications scale beyond a few concurrent users, assumptions that once felt safe start breaking silently. Consider these examples from real .NET services:

Thread starvation in ASP.NET Core: A single blocking call in a request pipeline (.Result or .Wait()) can exhaust the thread pool under load, producing timeout errors that disappear in development.
Connection pool exhaustion: A missing using statement or forgotten await can leak database connections. With ten users, you never notice. With 10,000, the pool is dry, latency spikes, and your API returns HTTP 500s.
Hidden retry storms: Unhandled transient exceptions in HttpClient calls trigger retries at multiple layers (Polly, load balancers, SDKs), multiplying the load on already-failing systems.

These are not “bugs” in the traditional sense; they’re emergent behaviors caused by inadequate failure boundaries.

“Works on my machine” is often shorthand for “I haven’t tested how this fails.” In production, what matters is not just whether a function works, but how it fails—and whether that failure is observable, classified, and recoverable.

At scale, exception stack traces alone become noise. What you need are consistent, structured failure models that let you aggregate, reason about, and act on error data across the entire system.

1.2 Taxonomy of failures at scale

Before you can handle failures effectively, you need a shared language for them. Scalable systems treat errors not as monolithic “something went wrong” states but as typed signals. Classifying them enables consistent decisions about retries, compensation, and user feedback.

1.2.1 Expected vs. exceptional vs. catastrophic

Expected failures: domain-level, business-understood outcomes. Examples:

Invalid input or failed validation
Resource not found (CustomerNotFound)
Business rule violation (CreditLimitExceeded)
Conflict (OrderAlreadyProcessed)

Expected failures are not “bad” per se—they’re part of normal flow. They should be modeled explicitly (e.g., with Result<T> or Either) and surfaced as 4xx responses in HTTP APIs.

Exceptional failures: unexpected but recoverable issues, such as:

A transient SQL or HTTP timeout
Network flakiness
External API quota exceeded

These should trigger retries, fallbacks, or degradation paths. They’re often wrapped in exceptions but mapped to structured results upstream.

Catastrophic failures: system corruption, invalid invariants, or unrecoverable resource exhaustion, such as:

OutOfMemoryException
Inconsistent persisted state
Serialization of invalid data types
Logic that cannot continue safely

These should terminate the operation (and possibly the process). Attempting recovery can worsen corruption.

1.2.2 Transient vs. persistent vs. systemic

A transient failure is short-lived—retrying might succeed. A persistent failure is deterministic and reproducible. A systemic failure is emergent, caused by feedback loops, dependencies, or misconfiguration.

Understanding which one you’re dealing with changes your handling:

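try
{
    var response = await _httpClient.GetAsync(url, cancellationToken);
    response.EnsureSuccessStatusCode(); // turn non-2xx responses into HttpRequestException
    return Result.Ok(await response.Content.ReadAsStringAsync(cancellationToken));
}
catch (HttpRequestException ex) when (IsTransient(ex))
{
    // Transient: a retry or fallback upstream may succeed.
    return Result.Fail<string>(new Error("Transient network issue").CausedBy(ex));
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
    // Persistent or systemic: repeating the same call is unlikely to help.
    return Result.Fail<string>(new Error("Persistent or systemic failure").CausedBy(ex));
}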
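
The IsTransient check in the snippet above is left undefined. One way it could be sketched, assuming .NET 5 or later (where HttpRequestException exposes StatusCode), is shown below. The class name TransientFaults and the choice to treat every 5xx status as retryable are assumptions made for illustration, not a rule taken from any library.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;

public static class TransientFaults
{
    // Hypothetical helper: decides whether repeating the same request might succeed.
    public static bool IsTransient(HttpRequestException ex) =>
        (int?)ex.StatusCode switch
        {
            408 or 429 => true,  // request timeout, throttling
            >= 500 => true,      // server-side errors (assumed retryable in this sketch)
            null => ex.InnerException is SocketException or IOException, // no HTTP response at all
            _ => false           // remaining status codes are treated as deterministic
        };
}
```

A 404 or 400 falls through to the last arm, so the caller reports it as a persistent failure instead of retrying it.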

1.2.3 Local vs. distributed (network, downstream, data)

Local failures stay within process boundaries—exceptions in code, invalid state, failed assertions. Distributed failures involve latency, coordination, and partial success.

Examples:

Local: Parsing a JSON payload fails due to malformed data.
Distributed: The downstream payment API succeeds but your acknowledgment message fails to publish—creating inconsistency.
Data-level: A transaction partially commits, leaving state desynchronized.

In distributed systems, “error handling” is as much about coordination and observability as it is about catching exceptions.

1.3 Goals for a scalable strategy: correctness, debuggability, operability, cost, and user experience

An effective error handling strategy must scale across people, services, and time. It serves five intertwined goals:

1. Correctness – The system must preserve invariants under failure. Validation and domain rules should be explicit, not implicit. Failing fast is better than corrupting state.

2. Debuggability – Developers must be able to reproduce and reason about failures. Structured logs (e.g., Serilog + correlation IDs) and standardized error models (e.g., RFC 7807 ProblemDetails) make errors actionable, not mysterious.

3. Operability – Ops teams need to detect, classify, and respond quickly. Consistent error contracts allow dashboards and alerts to work uniformly across services.

4. Cost – Over-retries and cascading failures can multiply infrastructure costs. Scalable error handling minimizes redundant retries and wasted work.

5. User experience – Users should see helpful, consistent messages. “Something went wrong” is not enough; tell them what went wrong and, where possible, what to do next.

A strategy that optimizes one dimension (say, developer ergonomics) at the expense of another (observability) doesn’t scale.

1.4 .NET 9 vs .NET 8 (LTS) context for production teams: platform support and what that means for error handling stacks

.NET 9 (released November 2024) builds on .NET 8 LTS. The pieces most relevant to error handling and resilience are:

Unified resilience stack: The Microsoft.Extensions.Resilience and Microsoft.Extensions.Http.Resilience packages, first released alongside .NET 8 and built on Polly v8, provide retry, timeout, and circuit breaker pipelines. They ship as NuGet packages rather than as part of the runtime, so they work on both .NET 8 and .NET 9.
ProblemDetails support: ASP.NET Core has offered a ProblemDetails service (AddProblemDetails) since .NET 7. RFC 9457 obsoletes RFC 7807 while keeping the same response shape, so custom error codes and trace IDs travel as extension members.
OpenTelemetry integration: The runtime's diagnostics primitives (Activity, ILogger, Meter) let OpenTelemetry link exceptions to spans and structured logs with little manual correlation.
Discriminated unions (not yet shipped): C# 13 does not include discriminated unions; they remain a language proposal. If they land, they would fit Result/Either patterns well and reduce the need for libraries like OneOf.

For production teams, the LTS status of .NET 8 still makes it the safer baseline for long-lived services. Because the resilience packages target both versions, the patterns in this article apply to either; the choice comes down mostly to support lifecycle rather than error handling features.

2 Exceptions vs. Result/Either Types—Choosing Explicitly

Error handling in .NET historically revolved around exceptions. For decades, try/catch was the canonical way to indicate failure. But in distributed, high-throughput systems, exceptions don’t scale well as a control flow mechanism.

The modern .NET ecosystem provides alternatives—Result, Either, and discriminated unions—that make failures explicit and composable. Choosing the right model per boundary (domain, application, transport) is the key to clarity and performance.

2.1 What exceptions are for—truly exceptional, rare, “can’t reasonably continue”

Exceptions signal problems the current code cannot recover from: unexpected states where the operation cannot proceed safely.

Examples include:

NullReferenceException (invariant violation)
InvalidOperationException (misused API)
OutOfMemoryException
SqlException (system-level, not business rule)

These are “you can’t reason about this state anymore” conditions.

Use exceptions when:

The failure is rare and unrecoverable at the current abstraction level.
Propagation to a higher-level handler (e.g., middleware) makes sense.
The recovery logic is non-local and would complicate normal code flow.

2.2 The problem with exception-driven control flow at scale (hidden edges, performance, readability)

At small scale, throwing and catching exceptions feels harmless. At scale, it introduces three major costs:

1. Hidden control flow – Exceptions bypass normal return paths. This makes code harder to reason about and test. A method signature like Task<User> GetUserAsync() hides the fact that it might throw NotFoundException, SqlException, or TimeoutException.

2. Performance overhead – Throwing exceptions is expensive; creating a stack trace allocates memory and unwinds the stack. Under high load, frequent exceptions can dominate CPU time.

3. Observability blind spots – Exceptions often carry unstructured data (message strings, stack traces). Aggregating and classifying them for metrics is difficult.

Incorrect:

public User GetUser(Guid id)
{
    var user = _repository.Find(id);
    if (user == null)
        throw new NotFoundException($"User {id} not found.");
    return user;
}

Correct (explicit failure):

public Result<User> GetUser(Guid id)
{
    var user = _repository.Find(id);
    return user is null
        ? Result.Fail<User>("User not found.")
        : Result.Ok(user);
}

Explicit modeling makes the function’s possible outcomes visible at compile time.

2.3 Result/Either types—making failures a first-class return value

Result/Either types encode the idea that a computation can either succeed (Ok, Success) or fail (Error, Failure) without using exceptions.

public Result<Order> CreateOrder(OrderRequest request)
{
    if (!IsValid(request))
        return Result.Fail<Order>("Invalid order data");

    var order = new Order(request);
    return Result.Ok(order);
}

The caller must handle both outcomes explicitly:

var result = CreateOrder(request);

result.Match(
    order => Process(order),
    failure => Log.Warning(failure.Reasons.First().Message)
);
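
For readers without a Result library at hand, the idea can be sketched in a few lines. The type below is a deliberately minimal illustration, not the API of FluentResults or any other package: it carries a single error string, and the names Result, Match, and QuantityDemo are assumptions made for this example.

```csharp
#nullable enable
using System;

// A deliberately small Result type: either a value or an error message, never both.
public sealed class Result<T>
{
    private readonly T? _value;
    private readonly string? _error;

    private Result(T? value, string? error)
    {
        _value = value;
        _error = error;
    }

    public bool IsSuccess => _error is null;

    public static Result<T> Ok(T value) => new(value, null);
    public static Result<T> Fail(string error) => new(default, error);

    // Match requires a handler for both outcomes, so neither can be forgotten.
    public TOut Match<TOut>(Func<T, TOut> onSuccess, Func<string, TOut> onFailure) =>
        _error is null ? onSuccess(_value!) : onFailure(_error);
}

public static class QuantityDemo
{
    public static Result<int> ParseQuantity(string input) =>
        int.TryParse(input, out var n) && n > 0
            ? Result<int>.Ok(n)
            : Result<int>.Fail($"'{input}' is not a positive quantity");

    public static string Describe(string input) =>
        ParseQuantity(input).Match(
            qty => $"ordering {qty} item(s)",
            error => $"rejected: {error}");
}
```

The private constructor means the only way to build a Result is through Ok or Fail, which keeps the "either a value or an error" invariant out of callers' hands.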

2.3.1 Libraries and ergonomics

Several mature libraries bring ergonomic, composable Result/Either APIs to .NET:

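FluentResults – Lightweight, fluent API with optional metadata and chaining (Bind, Map). The Ensure and Tap calls below are not part of every Result library or version; treat them as small extension methods you may need to add yourself.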

Result<Order> result = Result.Ok(order)
    .Ensure(o => o.Total > 0, "Order total must be positive")
    .Bind(SaveOrder)
    .Tap(LogSuccess);

ErrorOr – Designed for API modeling; integrates naturally with ASP.NET Core minimal APIs.

public async Task<ErrorOr<Order>> Handle(CreateOrderCommand command)
{
    if (!Validate(command))
        return Errors.Validation.InvalidData;

    var order = new Order(command);
    return await _repository.Save(order);
}

OneOf – Discriminated union for representing multiple possible return types.

public OneOf<Order, NotFound, Conflict> GetOrder(Guid id) =>
    _repository.Find(id) switch
    {
        null => new NotFound(),
        var o when o.IsLocked => new Conflict(),
        var o => o
    };

Each approach lets failures travel through the system predictably, with strong typing.

2.3.2 Domain modeling and exhaustiveness via pattern matching

With C# pattern matching, you can handle every outcome in a single construct:

switch (result)
{
    case SuccessResult<Order> success:
        return Ok(success.Value);
    case ErrorResult error when error.HasErrorCode("NotFound"):
        return NotFound(error.Message);
    default:
        return Problem("Unexpected failure");
}

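This is more maintainable than catching broad Exception types, but the C# compiler does not verify exhaustiveness over an open class hierarchy; the default arm is what guarantees every path is covered. Union-style libraries such as OneOf get closer to enforced exhaustiveness, because their Match method requires one handler per case.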

2.4 LanguageExt and Either/Option—benefits and trade-offs in modern codebases

LanguageExt brings functional patterns from F# and Haskell to C#, including Either<L, R>, Option<T>, and LINQ-based composition.

Example:

Either<Error, User> user =
    from id in ValidateUserId(request.Id)
    from u in _repository.Get(id)
    from validated in Validate(u)
    select validated;

This declarative, monadic style makes error propagation automatic—no need to check if (result.IsSuccess) at every step.

Benefits

Composability for complex domain logic.
Eliminates “pyramid of doom” error checks.
Encourages immutability and pure functions.

Trade-offs

Steeper learning curve for non-F# developers.
Less ergonomic debugging in mixed OO/functional codebases.
May introduce allocation overhead in tight loops.

2.5 Performance and ergonomics: when `try/catch` is clearer, when Result is safer

Use try/catch when:

You’re at an integration boundary (e.g., calling into framework or third-party code).
Failures are truly exceptional (disk I/O, framework-level exceptions).
The handler is centralized (middleware, background worker wrapper).

Use Result or Either when:

Failures are part of expected domain logic (validation, conflicts).
You want deterministic outcomes without stack unwinding.
You need to compose multiple operations cleanly.

Hybrid example:

public async Task<Result<Customer>> GetCustomerAsync(Guid id)
{
    try
    {
        var entity = await _repository.FindAsync(id);
        return entity is null
            ? Result.Fail<Customer>("Customer not found.")
            : Result.Ok(entity);
    }
    catch (SqlException ex)
    {
        _logger.LogError(ex, "Database failure");
        return Result.Fail<Customer>("Database unavailable.");
    }
}

2.6 Design decision matrix: how to decide per boundary (domain, app service, transport, persistence)

Boundary	Preferred Error Model	Why
Domain	Result/Either	Expected business rule failures; enables exhaustive modeling
Application Service / Use Case	Result/Either	Composable operations; clear success/failure contracts
Transport (HTTP, gRPC)	ProblemDetails + Mapped Results	Translate failures into user-facing semantics
Persistence / External Calls	Exceptions internally, mapped to Results	Handle transient or infrastructural errors
Cross-cutting (Middleware)	Exception-handling middleware	Capture unhandled exceptions, log, standardize responses

3 Railway-Oriented Programming (ROP) in .NET

3.1 ROP in a nutshell: compose success/failure rails and stop on first failure

Railway-Oriented Programming (ROP) is a metaphor introduced by Scott Wlaschin in F# for Fun and Profit. Imagine your function pipeline as a railway with two tracks: one for success, one for failure. If any function fails, the train switches to the failure track and never returns to the success rail.

This model fits perfectly with Result/Either types: every step consumes a Result<T> and returns another, composing naturally.

Result<Order> result = Validate(request)
    .Bind(CreateOrder)
    .Bind(SaveOrder)
    .Tap(SendNotification);

No exceptions, no nested if (result.IsFailed) checks. The pipeline flows until a step fails.

3.2 Building blocks: `Bind`, `Map`, `Tap`, `Ensure`, `Match` with Result/Either

Bind – Chains a function that returns another Result.
Map – Transforms a success value without altering failure state.
Tap – Executes side effects (logging, metrics) on success.
Ensure – Validates conditions; fails if predicate false.
Match – Deconstructs final result into explicit success/failure handling.

Result<Order> result = Validate(request)
    .Ensure(r => r.Total > 0, "Total must be positive")
    .Bind(CreateOrder)
    .Bind(PersistOrder)
    .Tap(o => _logger.LogInformation("Created order {Id}", o.Id));

// Match leaves the rail: both branches must produce the same type.
string summary = result.Match(
    order => $"Order {order.Id} created",
    failure => $"Order rejected: {failure.Errors.First().Message}"
);
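
The four chaining operators are small enough to write by hand, which is a useful way to see what each one does. The sketch below assumes a stripped-down Result type with a single error string; it is not the implementation used by any particular library, and the names Railway and PricingDemo, as well as the 20% tax figure, are illustrative assumptions.

```csharp
#nullable enable
using System;
using System.Globalization;

// Minimal Result for this sketch: Error == null means success.
public readonly record struct Result<T>(T? Value, string? Error)
{
    public bool IsSuccess => Error is null;
    public static Result<T> Ok(T value) => new(value, null);
    public static Result<T> Fail(string error) => new(default, error);
}

public static class Railway
{
    // Bind: run the next step only on success; the step itself may fail.
    public static Result<TOut> Bind<TIn, TOut>(this Result<TIn> r, Func<TIn, Result<TOut>> next) =>
        r.IsSuccess ? next(r.Value!) : Result<TOut>.Fail(r.Error!);

    // Map: transform the success value with a function that cannot fail.
    public static Result<TOut> Map<TIn, TOut>(this Result<TIn> r, Func<TIn, TOut> f) =>
        r.IsSuccess ? Result<TOut>.Ok(f(r.Value!)) : Result<TOut>.Fail(r.Error!);

    // Ensure: switch to the failure rail when the predicate does not hold.
    public static Result<T> Ensure<T>(this Result<T> r, Func<T, bool> predicate, string error) =>
        r.IsSuccess && !predicate(r.Value!) ? Result<T>.Fail(error) : r;

    // Tap: run a side effect on success; the result passes through unchanged.
    public static Result<T> Tap<T>(this Result<T> r, Action<T> effect)
    {
        if (r.IsSuccess) effect(r.Value!);
        return r;
    }
}

public static class PricingDemo
{
    public static Result<decimal> TotalWithTax(string raw) =>
        Result<string>.Ok(raw)
            .Bind(Parse)
            .Ensure(x => x > 0, "Total must be positive")
            .Map(x => Math.Round(x * 1.2m, 2)) // illustrative 20% tax
            .Tap(x => Console.WriteLine($"Total incl. tax: {x}"));

    private static Result<decimal> Parse(string raw) =>
        decimal.TryParse(raw, NumberStyles.Number, CultureInfo.InvariantCulture, out var d)
            ? Result<decimal>.Ok(d)
            : Result<decimal>.Fail($"'{raw}' is not a number");
}
```

Once any step fails, every later Bind, Map, Ensure, and Tap becomes a pass-through, which is the "switch to the failure track" behavior described above.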

3.3 Implementing ROP with `FluentResults` / `ErrorOr` / `OneOf`—progressive examples

Example with FluentResults

public Result<Order> ProcessOrder(OrderRequest request) =>
    Validate(request)
        .Bind(CreateOrder)
        .Bind(SaveOrder)
        .Tap(SendConfirmation);

private Result<OrderRequest> Validate(OrderRequest request) =>
    string.IsNullOrWhiteSpace(request.CustomerId)
        ? Result.Fail<OrderRequest>("Customer ID required")
        : Result.Ok(request);

Example with ErrorOr

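public async Task<ErrorOr<Order>> Handle(CreateOrderCommand cmd)
{
    // ErrorOr 2.x names its chaining operators Then/ThenAsync rather than Bind/BindAsync.
    return await Validate(cmd)
        .ThenAsync(validated => CreateOrderAsync(validated))
        .ThenAsync(order => SaveOrderAsync(order));
}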

Example with OneOf

public OneOf<Order, ValidationError, DatabaseError> Execute(Request req)
{
    if (!Validate(req)) return new ValidationError();
    var saved = Save(req);
    return saved.IsSuccess ? saved.Value : new DatabaseError();
}

Each variant expresses the same principle: clear flow, no exceptions for expected outcomes.

3.4 Mapping validation, authorization, and business rules into the rail

In ROP, validations and policies become first-class citizens in the rail rather than preconditions scattered throughout.

return Validate(request)
    .Ensure(HasPermission, "User not authorized")
    .Bind(ProcessDomainRules)
    .Bind(Persist)
    .Tap(PublishEvent);

You can chain policy checks (Ensure), authorization handlers, and domain invariants seamlessly. Each one short-circuits failure cleanly.

3.5 From synchronous to async rails; cancellation and timeouts

Async composition works identically, thanks to extension methods like BindAsync and TapAsync:

return await Validate(request)
    .BindAsync(r => SaveAsync(r, token))
    .TapAsync(_ => PublishAsync(token));

In production, integrate cancellation tokens and resilience policies (Polly or .NET Resilience Pipelines) to make these rails robust against latency and timeouts.
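
As a rough illustration of how a timeout can be folded into an async rail, the sketch below hand-rolls a BindAsync extension and a WithTimeout wrapper on top of a minimal Result type. All three names are assumptions for this example, not library APIs; in a real service a resilience pipeline would normally take the place of WithTimeout.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal Result for this sketch: Error == null means success.
public readonly record struct Result<T>(T? Value, string? Error)
{
    public bool IsSuccess => Error is null;
    public static Result<T> Ok(T value) => new(value, null);
    public static Result<T> Fail(string error) => new(default, error);
}

public static class AsyncRailway
{
    // Awaits the previous step, then runs the next one only on success.
    public static async Task<Result<TOut>> BindAsync<TIn, TOut>(
        this Task<Result<TIn>> previous, Func<TIn, Task<Result<TOut>>> next)
    {
        var r = await previous.ConfigureAwait(false);
        return r.IsSuccess
            ? await next(r.Value!).ConfigureAwait(false)
            : Result<TOut>.Fail(r.Error!);
    }

    // Runs a step under a time budget linked to the caller's token.
    // A timeout becomes a failure on the rail; caller cancellation still surfaces as an exception.
    public static async Task<Result<T>> WithTimeout<T>(
        Func<CancellationToken, Task<Result<T>>> step,
        TimeSpan budget,
        CancellationToken callerToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(budget);
        try
        {
            return await step(cts.Token).ConfigureAwait(false);
        }
        catch (OperationCanceledException) when (!callerToken.IsCancellationRequested)
        {
            return Result<T>.Fail($"Timed out after {budget.TotalMilliseconds} ms");
        }
    }
}
```

The exception filter is the design choice worth noting: a timeout is an expected outcome and joins the failure rail, while cancellation requested by the caller is left to propagate so the request can be abandoned promptly.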

3.6 Anti-patterns: exception tunneling, swallowed errors, “over-monadizing”

Exception tunneling: wrapping exceptions inside Results but never surfacing them. Always log or translate them at a meaningful boundary.

Swallowed errors: returning Result.Ok() despite failures, to “keep the rail green.” This hides real problems and breaks observability.

Over-monadizing: forcing ROP everywhere, including trivial methods. Not every method needs Result<T>—reserve it for boundaries where failure is meaningful.

A balanced approach is pragmatic: use ROP where it clarifies logic and improves safety, not as a dogma.

4 A Hybrid Strategy That Scales

At this point, the key principles are clear: exceptions should signal the exceptional, while Result/Either models encode expected failures as data. Yet large systems are not purely functional or purely imperative—they’re hybrid. A production-grade .NET system combines both philosophies through consistent layering, cross-cutting middleware, and a shared error vocabulary. This section defines a practical rule set for scalable error handling across the stack and illustrates how to translate those principles into maintainable, observable ASP.NET Core implementations.

4.1 The core rule set

Error handling at scale is a social contract between your code, your team, and your users. The rule set below provides an explicit playbook to reduce ambiguity and ensure consistent semantics.

4.1.1 Throw exceptions only for truly exceptional and unrecoverable states

The first rule is deceptively simple: only throw when something is impossible to model as data or recover from gracefully. Examples include:

Framework-level errors: failed configuration binding, corrupted state, I/O failure.
Violations of domain invariants that should never occur if code is correct.
External dependencies misbehaving in unexpected ways (e.g., SDK bugs, deserialization corruption).

Correct usage:

public class InvoiceService
{
    public Invoice Generate(InvoiceRequest request)
    {
        if (request is null)
            throw new ArgumentNullException(nameof(request));

        if (request.Items.Count == 0)
            throw new InvalidOperationException("Invoice must contain at least one item.");

        return new Invoice(request);
    }
}

The idea is to fail fast and visibly when invariant assumptions are broken. These exceptions will bubble up to the middleware layer, where they are converted into a standardized error response for clients.

4.1.2 Use Result/Either for expected failures (validation, not-found, conflicts)

Expected failures are normal business conditions, not exceptional states. Returning a Result clarifies intent, makes the outcome composable, and avoids polluting logs with noise.

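public Result<Customer> GetCustomer(Guid id)
{
    var customer = _repository.FindById(id);
    return customer is null
        // FluentResults carries extra data such as error codes as metadata on the Error.
        ? Result.Fail<Customer>(new Error("Customer not found").WithMetadata("ErrorCode", "NotFound"))
        : Result.Ok(customer);
}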

Upstream callers can then handle success and failure consistently:

var result = _customerService.GetCustomer(id);
return result.Match(
    success => Ok(success),
    failure => NotFound(failure.Errors.First().Message)
);

When applied uniformly, Result-based domain and application services make the entire system predictable—each layer knows which failures are recoverable and which are not.

4.1.3 Use middleware/pipeline for cross-cutting concerns (logging, problem details, correlation)

While Results handle local, expected failures, middleware provides global safety nets and observability. All unhandled exceptions, timeouts, or infrastructure-level errors are intercepted, logged with correlation metadata, and returned in a standardized form.

ASP.NET Core makes this pattern first-class with UseExceptionHandler and ProblemDetails integration. A common pattern:

app.UseExceptionHandler("/error");
app.Map("/error", (HttpContext context, ILogger<Program> logger) =>
{
    var exception = context.Features.Get<IExceptionHandlerFeature>()?.Error;
    logger.LogError(exception, "Unhandled exception");

    // Results.Problem sets the status code and the application/problem+json content type.
    return Results.Problem(
        title: "An unexpected error occurred.",
        statusCode: StatusCodes.Status500InternalServerError,
        type: "https://example.com/errors/unexpected",
        extensions: new Dictionary<string, object?> { ["traceId"] = context.TraceIdentifier });
});

Adding a correlation ID to each request closes the loop between logs, traces, and responses. This becomes invaluable during incident triage.

4.2 Translating domain failures to HTTP semantics (RFC 7807 Problem Details)

Once your domain services produce structured failures (via Result/Either), the next challenge is to translate them into HTTP-meaningful responses. A ValidationError becomes 400, a NotFound becomes 404, and a domain conflict becomes 409. RFC 7807 (Problem Details for HTTP APIs, since obsoleted by RFC 9457 with the same response shape) provides a standard envelope for doing exactly this.

The goal is not just correctness, but consistency—every API in the organization should express failure in the same shape and metadata structure, regardless of which microservice emits it.

4.2.1 Built-in ProblemDetails in ASP.NET Core, and production hardening with Hellang middleware

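ASP.NET Core provides a built-in ProblemDetails class and factory methods (Results.Problem, Results.ValidationProblem). For exception-to-status mapping and environment-aware detail control, some teams layer Hellang.Middleware.ProblemDetails on top. Since .NET 7 the framework ships its own AddProblemDetails extension, so the call below may need to be qualified (for example through Hellang's ProblemDetailsExtensions class) to avoid an ambiguous-call error.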

Configuration example:

builder.Services.AddProblemDetails(opts =>
{
    opts.IncludeExceptionDetails = (ctx, ex) => 
        builder.Environment.IsDevelopment();

    opts.MapToStatusCode<ValidationException>(StatusCodes.Status400BadRequest);
    opts.MapToStatusCode<NotFoundException>(StatusCodes.Status404NotFound);
    opts.MapToStatusCode<UnauthorizedAccessException>(StatusCodes.Status401Unauthorized);
    opts.MapToStatusCode<ConflictException>(StatusCodes.Status409Conflict);
});

Usage:

app.UseProblemDetails();

This middleware automatically converts exceptions to RFC 7807-compliant JSON, attaches trace IDs, and respects environment-based verbosity rules. For example, in production it hides stack traces but keeps correlation identifiers.

{
  "type": "https://example.com/errors/not-found",
  "title": "Customer not found",
  "status": 404,
  "traceId": "00-f6b0322fce2d1d4b8d1e-892e10aa8cf47e9c-00"
}

Such uniformity means clients can handle errors programmatically, without custom parsers for each service.

4.3 Mapping exceptions to standardized responses without leaking internals

Exception leakage is one of the most common production vulnerabilities. Stack traces or detailed messages from SqlException or HttpRequestException should never be visible to API consumers.

A robust handler maps internal exceptions to sanitized ProblemDetails:

app.Map("/error", (HttpContext context, ILogger<Program> logger) =>
{
    var exception = context.Features.Get<IExceptionHandlerFeature>()?.Error;
    // Full details go to the log only; the response carries a sanitized summary.
    logger.LogError(exception, "Unhandled exception for trace {TraceId}", context.TraceIdentifier);

    var (status, title, type) = exception switch
    {
        ValidationException => (400, "Validation error", "https://example.com/errors/validation"),
        NotFoundException => (404, "Resource not found", "https://example.com/errors/not-found"),
        _ => (500, "Unexpected server error", "https://example.com/errors/internal")
    };

    // Results.Problem sets the status code and the application/problem+json content type.
    return Results.Problem(
        title: title,
        statusCode: status,
        type: type,
        extensions: new Dictionary<string, object?> { ["traceId"] = context.TraceIdentifier });
});

Note that internal error messages and stack traces are logged (via Serilog or ILogger), not returned. The client receives only the safe summary and trace ID for support correlation.

4.4 Surfacing actionable context (error codes, user messages, remediation hints)

Error responses that merely say “Something went wrong” create frustration. A scalable system surfaces actionable context—structured codes, clear user messages, and hints for next steps.

Example of a production-ready ProblemDetails:

{
  "type": "https://api.example.com/errors/credit-limit-exceeded",
  "title": "Credit limit exceeded",
  "status": 409,
  "code": "FIN-003",
  "detail": "Customer credit limit of $5,000 has been exceeded by $250.",
  "instance": "/orders/checkout/923d",
  "traceId": "00-abc123def456...",
  "remediation": "Reduce order total or request a limit increase."
}

This structure provides both human readability and machine parsability. Monitoring dashboards can aggregate by code, while users receive contextually meaningful messages.
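
In Problem Details payloads, extension members such as an error code or a remediation hint are written as top-level JSON members alongside type, title, and status. The dependency-free sketch below shows the mechanism with System.Text.Json. The Problem class is a hypothetical stand-in for ASP.NET Core's ProblemDetails, which marks its Extensions dictionary with the same JsonExtensionData attribute; the values are taken from the credit-limit example.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical stand-in for ASP.NET Core's ProblemDetails, kept free of framework references.
public sealed class Problem
{
    [JsonPropertyName("type")] public string? Type { get; set; }
    [JsonPropertyName("title")] public string? Title { get; set; }
    [JsonPropertyName("status")] public int? Status { get; set; }
    [JsonPropertyName("detail")] public string? Detail { get; set; }

    // Entries are serialized as top-level members, not under an "extensions" key.
    [JsonExtensionData]
    public Dictionary<string, object?> Extensions { get; set; } = new();
}

public static class ProblemDemo
{
    public static string Build()
    {
        var problem = new Problem
        {
            Type = "https://api.example.com/errors/credit-limit-exceeded",
            Title = "Credit limit exceeded",
            Status = 409,
            Detail = "Customer credit limit of $5,000 has been exceeded by $250."
        };
        problem.Extensions["code"] = "FIN-003";
        problem.Extensions["remediation"] = "Reduce order total or request a limit increase.";
        return JsonSerializer.Serialize(problem);
    }
}
```

Clients and dashboards can therefore read code directly from the root of the response object, next to the standard members.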

5 Reference Implementation (ASP.NET Core, .NET 8/9)

Let’s bring the theory to life through a reference architecture that combines all these practices—Results, ROP, ProblemDetails, and correlation—into a coherent, production-grade ASP.NET Core application.

5.1 Architecture and layers

5.1.1 Contracts & error vocabulary (domain-centric)

At the foundation, define a shared error vocabulary within the domain layer. Each error represents a meaningful business or technical failure.

public static class Errors
{
    public static class Customer
    {
        public static readonly Error NotFound = 
            Error.NotFound("Customer.NotFound", "Customer not found");
        public static readonly Error CreditLimitExceeded = 
            Error.Conflict("Customer.CreditLimitExceeded", "Credit limit exceeded");
    }

    // Named General rather than System so it does not shadow the System namespace inside Errors.
    public static class General
    {
        public static readonly Error Unexpected =
            Error.Failure("General.Unexpected", "Unexpected error occurred");
    }
}

This vocabulary acts as the single source of truth for mapping Results to ProblemDetails responses.

5.1.2 Application services with ROP (compose use cases)

Use the ROP pattern to compose domain operations cleanly:

public class CheckoutService
{
    private readonly ICustomerRepository _customers;
    private readonly IPaymentGateway _payments;

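    public async Task<ErrorOr<Receipt>> CheckoutAsync(CheckoutRequest request)
    {
        // ErrorOr 2.x operator names: ThenAsync chains a step, FailIf fails when the predicate is true.
        return await Validate(request)
            .ThenAsync(valid => GetCustomerAsync(valid))
            .FailIf(customer => !HasSufficientCredit(customer), Errors.Customer.CreditLimitExceeded)
            .ThenAsync(customer => ProcessPaymentAsync(customer))
            .ThenAsync(payment => SaveReceiptAsync(payment));
    }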
}

Each step returns an ErrorOr<T>; the pipeline stops automatically on the first failure.

5.1.3 Web/API adapters (ProblemDetails + consistent status mapping)

In the API layer, convert these domain Results into standardized responses:

[HttpPost("checkout")]
public async Task<IResult> Checkout([FromBody] CheckoutRequest request)
{
    var result = await _service.CheckoutAsync(request);
    // MatchFirst hands the first error to the handler; Match would pass the whole error list.
    return result.MatchFirst(
        receipt => Results.Ok(receipt),
        error => error.Type switch
        {
            ErrorType.NotFound => Results.Problem(statusCode: 404, title: error.Description),
            ErrorType.Conflict => Results.Problem(statusCode: 409, title: error.Description),
            _ => Results.Problem(statusCode: 500, title: "Unexpected error")
        });
}

This separation ensures domain logic never directly references HTTP semantics.

5.2 Project setup and packages

A minimal yet scalable setup includes:

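dotnet add package ErrorOr
dotnet add package FluentResults
dotnet add package Hellang.Middleware.ProblemDetails
dotnet add package Serilog.AspNetCore
dotnet add package CorrelationId
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol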

Each dependency plays a defined role:

FluentResults or ErrorOr – typed Result models
Hellang ProblemDetails – standardized error responses
CorrelationId – consistent trace propagation
Serilog – structured, contextual logging
OpenTelemetry – distributed tracing and metrics

5.3 Building the error vocabulary (domain + transport)

5.3.1 Error codes, types, and metadata (human/readable + machine/parsable)

Each Error includes:

Code (unique, namespaced)
Type (NotFound, Validation, Conflict)
Description
Optional metadata

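public record Error(string Code, string Description, ErrorType Type,
    IReadOnlyDictionary<string, object>? Metadata = null)
{
    public static Error Validation(string code, string description) =>
        new(code, description, ErrorType.Validation);

    public static Error NotFound(string code, string description) =>
        new(code, description, ErrorType.NotFound);

    public static Error Conflict(string code, string description) =>
        new(code, description, ErrorType.Conflict);

    public static Error Failure(string code, string description) =>
        new(code, description, ErrorType.Failure);
}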

5.3.2 Mapping table: Result → ProblemDetails (status, title, type, detail)

ErrorType	HTTP Status	ProblemDetails Type URL	Example Title
Validation	400	`/errors/validation`	Invalid input
NotFound	404	`/errors/not-found`	Customer not found
Conflict	409	`/errors/conflict`	Credit limit exceeded
Failure	500	`/errors/internal`	Unexpected server error

This table ensures deterministic translation between Result failures and API semantics.
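
Keeping the table in one function, rather than repeating the switch in every endpoint, is one way to make the translation deterministic in practice. The sketch below mirrors the four rows of the table; the ErrorType enum and the ProblemMapping name are assumptions for this example.

```csharp
using System;

public enum ErrorType { Validation, NotFound, Conflict, Failure }

public static class ProblemMapping
{
    // Returns (HTTP status, ProblemDetails type URL) for an error category.
    public static (int Status, string TypeUrl) ToHttp(ErrorType type) => type switch
    {
        ErrorType.Validation => (400, "/errors/validation"),
        ErrorType.NotFound   => (404, "/errors/not-found"),
        ErrorType.Conflict   => (409, "/errors/conflict"),
        ErrorType.Failure    => (500, "/errors/internal"),
        // Categories added later default to a server error until they are mapped explicitly.
        _ => (500, "/errors/internal")
    };
}
```

An endpoint can then call ProblemMapping.ToHttp(error.Type) and pass the two values to Results.Problem, so a new error category only needs one new row here.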

5.4 Global error handling path

5.4.1 `IExceptionHandler` / `UseExceptionHandler()` in .NET 8/9

.NET 8 introduced a new IExceptionHandler abstraction that integrates with minimal APIs and ProblemDetails seamlessly.

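public class GlobalExceptionHandler : IExceptionHandler
{
    private readonly ILogger<GlobalExceptionHandler> _logger;

    public GlobalExceptionHandler(ILogger<GlobalExceptionHandler> logger) => _logger = logger;

    public async ValueTask<bool> TryHandleAsync(HttpContext context, Exception ex, CancellationToken token)
    {
        _logger.LogError(ex, "Unhandled exception");

        var problem = new ProblemDetails
        {
            Title = "Unexpected error",
            Status = StatusCodes.Status500InternalServerError,
            Type = "/errors/internal"
        };
        // The trace ID travels as an extension member; Instance is meant for a URI identifying the occurrence.
        problem.Extensions["traceId"] = context.TraceIdentifier;

        context.Response.StatusCode = StatusCodes.Status500InternalServerError;
        await context.Response.WriteAsJsonAsync(problem, token);
        return true; // handled: no further handlers run
    }
}

builder.Services.AddExceptionHandler<GlobalExceptionHandler>();
builder.Services.AddProblemDetails(); // parameterless UseExceptionHandler() expects this or an explicit handler/path
app.UseExceptionHandler();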

5.4.2 ProblemDetails configuration (include details per environment; correlation id in extensions)

Enrich responses dynamically:

builder.Services.AddProblemDetails(options =>
{
    options.CustomizeProblemDetails = ctx =>
    {
        ctx.ProblemDetails.Extensions["traceId"] = ctx.HttpContext.TraceIdentifier;
        ctx.ProblemDetails.Extensions["correlationId"] =
            ctx.HttpContext.Request.Headers["X-Correlation-ID"].FirstOrDefault();
    };
});

5.5 Correlation and context propagation

5.5.1 Accept incoming header (e.g., `X-Correlation-ID`), generate if missing, push to logging scope

app.UseCorrelationId(new CorrelationIdOptions
{
    Header = "X-Correlation-ID",
    IncludeInResponse = true
});

Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()
    .Enrich.WithCorrelationIdHeader()
    .WriteTo.Console()
    .CreateLogger();

This ensures every log line, trace, and error response can be correlated across services and environments.
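The "accept or generate" rule from the heading can be isolated from any particular middleware package. The following is a minimal sketch, not the behavior of a specific library: the `CorrelationId` helper, its 128-character cap, and the GUID format are all assumptions chosen for illustration.

```csharp
using System;

public static class CorrelationId
{
    public const string HeaderName = "X-Correlation-ID";

    // Assumed cap: keeps oversized or hostile header values out of logs.
    private const int MaxLength = 128;

    // Returns the caller-supplied ID when it is usable, otherwise a freshly generated one.
    public static string Resolve(string? incoming)
    {
        var candidate = incoming?.Trim();
        return string.IsNullOrEmpty(candidate) || candidate.Length > MaxLength
            ? Guid.NewGuid().ToString("N")
            : candidate;
    }
}
```

A middleware would call `CorrelationId.Resolve(...)` with the incoming header value, echo the result on the response, and push it into the logging scope.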

5.6 End-to-end sample feature (“Checkout”)

5.6.1 Request validation → Result rail → domain ops → external calls

public async Task<ErrorOr<Receipt>> CheckoutAsync(CheckoutRequest request)
{
    return await Validate(request)
        .BindAsync(GetCustomerAsync)
        .EnsureAsync(HasSufficientCredit, Errors.Customer.CreditLimitExceeded)
        .BindAsync(ChargePaymentAsync)
        .BindAsync(CreateReceiptAsync);
}

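The chain short-circuits on the first error, so later steps never run once one fails. That alone does not undo side effects: ChargePaymentAsync and CreateReceiptAsync still need to be idempotent, and a failure after the charge needs a defined compensation (see section 6.4).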

5.6.2 Returning `Results.Problem(...)` / `Results.ValidationProblem(...)` consistently

[HttpPost("checkout")]
public async Task<IResult> Checkout(CheckoutRequest request)
{
    var result = await _checkoutService.CheckoutAsync(request);

    // MatchFirst passes a single Error to the failure branch; ErrorOr's Match passes the full List<Error>.
    return result.MatchFirst(
        success => Results.Ok(success),
        error => error.Type switch
        {
            ErrorType.Validation => Results.ValidationProblem(new Dictionary<string, string[]>
                { { "Request", new[] { error.Description } } }),
            ErrorType.NotFound => Results.Problem(statusCode: 404, title: error.Description, type: "/errors/not-found"),
            ErrorType.Conflict => Results.Problem(statusCode: 409, title: error.Description, type: "/errors/conflict"),
            _ => Results.Problem(statusCode: 500, title: "Unexpected error", type: "/errors/internal")
        });
}

5.6.3 OpenTelemetry spans: HTTP → handler → repository → HTTP client

services.AddOpenTelemetry()
    .WithTracing(b => b
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddConsoleExporter());

Traces visualize the full error propagation path—from request to downstream call—linking errors to their originating span and correlation ID.

5.7 CQRS/MediatR integration (optional but common)

5.7.1 Pipeline behaviors for validation/logging and how they complement middleware

CQRS applications using MediatR often wrap handlers in pipeline behaviors for cross-cutting concerns:

public class ValidationBehavior<TRequest, TResponse> : IPipelineBehavior<TRequest, TResponse>
    where TRequest : IRequest<TResponse>
    where TResponse : IErrorOr
{
    public async Task<TResponse> Handle(TRequest request, RequestHandlerDelegate<TResponse> next, CancellationToken ct)
    {
        // Validate is a placeholder for your validator. It must produce the same
        // ErrorOr<T> that TResponse closes over, or the cast below fails at runtime.
        var validationResult = Validate(request);
        if (validationResult.IsError)
            return (TResponse)(object)validationResult;

        return await next();
    }
}

This complements middleware by ensuring invalid commands never reach the core domain logic.

5.7.2 When to throw vs. when to return Result from handlers (and map later)

Within MediatR handlers:

Use Result or ErrorOr for business rule and validation failures.
Throw only when invariants or system assumptions are violated (e.g., repository unavailable).

public async Task<ErrorOr<Order>> Handle(CreateOrderCommand cmd, CancellationToken ct)
{
    try
    {
        return await _service.CreateOrderAsync(cmd);
    }
    catch (SqlException ex)
    {
        _logger.LogError(ex, "Database unavailable");
        return Errors.System.Unexpected;
    }
}

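Here the handler converts a known infrastructure exception into a structured error at its boundary, so callers still receive an ErrorOr value. Anything the handler does not anticipate keeps propagating to the global exception handler. Errors stay structured throughout the pipeline, and exceptions remain reserved for the exceptional.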

6 Resilience Patterns: Retries, Timeouts, Fallbacks & Compensation

Once error handling and observability are in place, the next concern is resilience—ensuring the system degrades gracefully under stress, transient faults, or downstream instability. In .NET 8 and 9, resilience is no longer a patchwork of libraries; it’s a unified, first-class concept integrated with the Microsoft.Extensions ecosystem. The combination of Polly v8+, resilience pipelines, and idempotent API design provides a foundation that balances reliability with cost and predictability.

6.1 Modern .NET resilience stack

6.1.1 Polly v8+ resilience pipelines: retry, timeout, circuit breaker, hedging, rate limiting

Polly v8 redefined resilience in .NET by introducing composable resilience pipelines—declarative, type-safe configurations that wrap code execution with policies like retry, timeout, circuit breaker, and fallback.

A simple retry pipeline:

var retryPipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        // Retry only transient outcomes (5xx, 408, 429); retrying every non-success
        // status would also repeat 4xx client errors that can never succeed.
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .HandleResult(r => (int)r.StatusCode is >= 500 or 408 or 429)
            .Handle<HttpRequestException>(),
        Delay = TimeSpan.FromSeconds(1),
        MaxRetryAttempts = 3,
        OnRetry = args =>
        {
            // AttemptNumber is zero-based.
            Console.WriteLine($"Retrying... attempt {args.AttemptNumber}");
            return default;
        }
    })
    .Build();

You execute the pipeline just as you would a function:

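// ExecuteAsync expects a ValueTask-returning callback, hence the async lambda around GetAsync.
var response = await retryPipeline.ExecuteAsync(
    async token => await _httpClient.GetAsync("https://api.example.com/customers", token),
    cancellationToken);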

To extend resilience, combine additional strategies in a single pipeline:

var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddTimeout(TimeSpan.FromSeconds(5))
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        FailureRatio = 0.5,
        MinimumThroughput = 10,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(15)
    })
    .AddRetry(...)
    .Build();

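This layered composition keeps failures from cascading: timeouts prevent hanging requests, retries recover transient faults, and circuit breakers isolate systemic ones. Order matters, because the first strategy added is the outermost. In the pipeline above, the 5-second timeout bounds the whole operation including every retry, and the circuit breaker only sees the outcome after retries are exhausted. Add the breaker after the retry if it should observe each individual attempt.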

6.1.2 Microsoft.Extensions.* resilience integration for HttpClient (ASP.NET Core 8/9)

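The Microsoft.Extensions.Http.Resilience package (available for .NET 8+) plugs Polly pipelines directly into IHttpClientFactory. Each named client can define its own resilience strategy declaratively through AddResilienceHandler.

builder.Services.AddHttpClient("Payments")
    .AddResilienceHandler("standard-pipeline", pipelineBuilder =>
    {
        // Strategies take options objects. The first one added is the outermost,
        // so the timeout here bounds each individual attempt.
        pipelineBuilder
            .AddRetry(new HttpRetryStrategyOptions { MaxRetryAttempts = 3 })
            .AddTimeout(TimeSpan.FromSeconds(3))
            .AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
            {
                MinimumThroughput = 10,
                BreakDuration = TimeSpan.FromSeconds(30)
            });
    });

This eliminates boilerplate, aligns with dependency injection, and keeps each client's policy in one place. For clients that need no special tuning, AddStandardResilienceHandler() supplies a preconfigured pipeline (rate limiter, total timeout, retry, circuit breaker, attempt timeout). Polly reports retry attempts, timeouts, and breaker state changes through its own meter, which you can export via OpenTelemetry.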

6.2 Idempotency + retries: designing APIs that can be safely retried (keys, stores, response caching)

Retries are essential but dangerous if the operation is not idempotent. For POST endpoints (which create or mutate state), a second execution can create duplicate records or charge customers twice.

An idempotency key mitigates this risk by identifying requests uniquely:

[HttpPost("checkout")]
public async Task<IResult> Checkout(
    [FromBody] CheckoutRequest request,
    [FromHeader(Name = "Idempotency-Key")] string key)
{
    if (await _store.ExistsAsync(key))
        return await _store.GetResponseAsync(key);

    var result = await _checkoutService.CheckoutAsync(request);
    var response = result.Match(
        success => Results.Ok(success),
        error => Results.Problem(title: error.Description, statusCode: 400));

    await _store.SaveAsync(key, response);
    return response;
}

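The store can be an in-memory cache (for short-lived operations) or Redis (for distributed, long-running workflows). Whichever you choose, it should persist a serializable snapshot of the outcome (status code and body) rather than a live IResult, and the check-then-save sequence must be atomic so that two concurrent requests with the same key cannot both execute. The goal is to ensure the same key yields the same result, no matter how many retries occur.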

Design guidelines:

Generate idempotency keys client-side (UUID or hash of payload).
Expire keys after a reasonable window (e.g., 24 hours).
Persist both request and response metadata.
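The "same key, same result" rule is easiest to get wrong when two retries arrive at once. The following in-memory sketch shows one way to make execution atomic per key. `InMemoryIdempotencyStore` is a hypothetical type, not part of any library, and it omits the expiry and persistence that the guidelines above call for.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical single-process store. A distributed deployment would need Redis or a
// database with an atomic "insert if absent" instead of ConcurrentDictionary.
public sealed class InMemoryIdempotencyStore<TResponse>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<TResponse>>> _entries = new();

    // Runs `operation` at most once per key. Concurrent callers with the same key
    // await the same task instead of racing between an "exists" check and a "save".
    // Note: a faulted task is cached as well; remove the key if failures should be retryable.
    public Task<TResponse> GetOrExecuteAsync(string key, Func<Task<TResponse>> operation)
    {
        var entry = _entries.GetOrAdd(key, _ => new Lazy<Task<TResponse>>(operation));
        return entry.Value;
    }
}
```

`GetOrAdd` may construct more than one `Lazy` under contention, but only one is stored, and only the stored one ever runs the operation.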

6.3 Choosing strategies per dependency (DB, cache, queue, HTTP)

Different dependencies exhibit different failure modes; resilience must be tailored accordingly.

Dependency	Typical Failures	Recommended Strategy
Database (SQL, Cosmos, etc.)	Transient connection drops, throttling	Retry (exponential backoff) + circuit breaker
Cache (Redis, MemoryCache)	Network latency, node failover	Timeout + fallback to primary data store
Message Queue (RabbitMQ, Azure Service Bus)	Publish/ack failures	Retry (bounded) + outbox pattern
HTTP (APIs, webhooks)	Timeout, transient 5xx errors	Timeout + retry + circuit breaker + idempotency

Example—resilient database access:

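var dbPipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        // In practice, filter on transient SQL error numbers rather than every SqlException.
        ShouldHandle = new PredicateBuilder().Handle<SqlException>(),
        Delay = TimeSpan.FromSeconds(1),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        MaxRetryAttempts = 5
    })
    .Build();

// ExecuteAsync expects a ValueTask-returning callback, hence the async lambda.
await dbPipeline.ExecuteAsync(async token => await _dbContext.SaveChangesAsync(token));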

Each policy should be observable: expose retry counts and circuit breaker transitions through logs and metrics.

6.4 Compensation and sagas for multi-step workflows

Complex business operations—checkout, booking, or money transfer—span multiple systems. A single rollback can’t unwind all effects. Instead, compensation—logical undo actions—must be orchestrated across boundaries.

6.4.1 When you can’t “just” rollback—business-meaningful compensation

Consider a checkout that reserves inventory, charges a payment, and generates an invoice. If the invoice step fails, you can’t simply “rollback” the payment; you must issue a refund—a different business process. That’s compensation.

public async Task<Result> CheckoutAsync(Order order)
{
    return await ReserveInventory(order)
        .BindAsync(ChargePayment)
        .BindAsync(CreateInvoice)
        .OnFailureCompensateAsync(RefundPayment, ReleaseInventory);
}

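Each compensating action must be idempotent and safe to retry independently, and only the steps that actually completed should be compensated. Recording every step and every compensation gives you an audit trail in case of partial success.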
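The article does not show how the compensation helper is implemented. As a rough sketch of the idea, the runner below tracks which steps completed and undoes only those, in reverse order. `SagaStep` and `CompensatingRunner` are hypothetical names, and the sketch deliberately ignores exceptions, failed compensations, and durability.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical step: a forward action plus the business action that undoes it.
public sealed record SagaStep(string Name, Func<Task<bool>> Execute, Func<Task> Compensate);

public static class CompensatingRunner
{
    // Runs steps in order. On the first failure, compensates only the steps that
    // already succeeded, in reverse order, and reports which step failed.
    public static async Task<(bool Succeeded, string? FailedStep)> RunAsync(IEnumerable<SagaStep> steps)
    {
        var completed = new Stack<SagaStep>();
        foreach (var step in steps)
        {
            if (await step.Execute())
            {
                completed.Push(step);
                continue;
            }

            while (completed.Count > 0)
                await completed.Pop().Compensate();

            return (false, step.Name);
        }
        return (true, null);
    }
}
```

With steps for reserve, charge, and invoice, a failed invoice triggers the charge compensation (the refund) first and the inventory release second. The failed step itself is never compensated.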

6.4.2 MassTransit state-machine sagas (Automatonymous) with compensation examples

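MassTransit models long-running workflows as state-machine sagas with explicit states, events, and transitions. The state-machine engine, Automatonymous, has been part of the MassTransit package itself since version 8.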

public class CheckoutSaga : MassTransitStateMachine<CheckoutState>
{
    public State PaymentCharged { get; private set; }
    public State Completed { get; private set; }

    public Event<OrderSubmitted> OrderSubmitted { get; private set; }
    public Event<InvoiceFailed> InvoiceFailed { get; private set; }

    public CheckoutSaga()
    {
        InstanceState(x => x.CurrentState);

        Initially(
            When(OrderSubmitted)
                .ThenAsync(ctx => ChargePayment(ctx.Instance))
                .TransitionTo(PaymentCharged));

        During(PaymentCharged,
            When(InvoiceFailed)
                .ThenAsync(ctx => RefundPayment(ctx.Instance))
                .TransitionTo(Completed));
    }
}

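Each state transition is persisted when the saga is backed by a durable saga repository, so after a crash the saga resumes from its last saved state once the pending message is redelivered. In this sketch the refund runs inline via ThenAsync. Publishing it as a command instead makes the compensation itself a first-class message: observable, retriable, and durable.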

6.4.3 NServiceBus sagas and timeouts; orchestrating compensations

NServiceBus provides similar saga orchestration with built-in timeout handling.

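public class BookingSaga : Saga<BookingData>,
    IAmStartedByMessages<PaymentCompleted>,
    IHandleTimeouts<BookingTimeout>
{
    // BookingId is an assumed correlation property on both the message and the saga data.
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<BookingData> mapper) =>
        mapper.MapSaga(saga => saga.BookingId)
              .ToMessage<PaymentCompleted>(message => message.BookingId);

    public async Task Handle(PaymentCompleted message, IMessageHandlerContext context)
    {
        Data.PaymentId = message.PaymentId;
        // Timeouts only fire if requested; 30 minutes is an illustrative window.
        await RequestTimeout<BookingTimeout>(context, TimeSpan.FromMinutes(30));
        await ProcessBooking(message, context);
    }

    public async Task Timeout(BookingTimeout state, IMessageHandlerContext context)
    {
        if (!Data.IsCompleted)
            await context.Send(new IssueRefund { PaymentId = Data.PaymentId });
    }
}

Timeouts let the saga schedule delayed compensations (like refunds after expiry) without an external scheduler. A timeout fires only if the saga requested it with RequestTimeout, and the saga needs a mapping in ConfigureHowToFindSaga to locate its data for each incoming message.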

6.4.4 The Outbox pattern for exactly-once effects across boundaries (Kafka example)

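When workflows must publish messages and persist state atomically, the Outbox pattern provides at-least-once delivery without a distributed transaction. Combined with idempotent consumers, that yields "exactly once" effects.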

Write both the local state change and the outbound message to the same database transaction.
A background processor reads pending messages and publishes them.
On success, mark them as dispatched.

using var tx = await _dbContext.Database.BeginTransactionAsync();

_dbContext.Orders.Add(order);
_dbContext.OutboxMessages.Add(new OutboxMessage
{
    EventType = nameof(OrderCreated),
    Payload = JsonSerializer.Serialize(order)
});

await _dbContext.SaveChangesAsync();
await tx.CommitAsync();

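The outbox worker reads from OutboxMessages, publishes to Kafka, and marks the record as processed. If the broker is unavailable, the row simply stays pending and is retried later. If the worker crashes after publishing but before marking, the message is sent again, so consumers must deduplicate (for example, by message ID).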
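The dispatch loop and the consumer-side deduplication it relies on can be simulated in memory. This is only a sketch of the mechanics: `OutboxRow`, `OutboxDispatcher`, and `DeduplicatingConsumer` are hypothetical stand-ins, and a real worker would query the database and publish to a broker.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical outbox row; in the EF Core example this role is played by OutboxMessage.
public sealed class OutboxRow
{
    public Guid Id { get; init; } = Guid.NewGuid();
    public string Payload { get; init; } = "";
    public bool Dispatched { get; set; }
}

public static class OutboxDispatcher
{
    // Publishes every pending row, then marks it dispatched. If the process dies between
    // the publish and the mark, the row is published again on the next pass.
    public static int DispatchPending(IEnumerable<OutboxRow> outbox, Action<OutboxRow> publish)
    {
        var sent = 0;
        foreach (var row in outbox.Where(r => !r.Dispatched).ToList())
        {
            publish(row);
            row.Dispatched = true;
            sent++;
        }
        return sent;
    }
}

// Consumer-side deduplication: a repeated delivery of the same message is applied once.
public sealed class DeduplicatingConsumer
{
    private readonly HashSet<Guid> _seen = new();
    public int Applied { get; private set; }

    public void Handle(OutboxRow message)
    {
        if (!_seen.Add(message.Id)) return; // duplicate delivery, effect already applied
        Applied++;
    }
}
```

The dispatcher alone can deliver a message twice. It is the pairing with a deduplicating consumer that keeps the effect from being applied twice.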

6.5 Putting it together in the reference feature

6.5.1 Define a resilient `HttpClient` for payments: retry + timeout + circuit breaker + hedging

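builder.Services.AddHttpClient("Payments", client =>
{
    client.BaseAddress = new Uri("https://api.payments.example.com");
})
.AddResilienceHandler("payment-pipeline", pipelineBuilder =>
{
    // First added = outermost: the 3-second timeout bounds the whole call, retries included.
    pipelineBuilder
        .AddTimeout(TimeSpan.FromSeconds(3))
        .AddRetry(new HttpRetryStrategyOptions { MaxRetryAttempts = 3, Delay = TimeSpan.FromMilliseconds(200) })
        .AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            MinimumThroughput = 5,
            BreakDuration = TimeSpan.FromSeconds(20)
        });
    // Hedging HTTP calls requires cloning the request, which AddStandardHedgingHandler()
    // handles; prefer it on a separate client that only issues idempotent requests.
});

Hedging sends speculative parallel requests after a configurable delay, reducing tail latency in high-load scenarios. Because it deliberately duplicates requests, reserve it for idempotent operations. A hedged payment POST is only safe when the provider deduplicates on the idempotency key described in section 6.5.2.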

6.5.2 Safeguard POSTs with idempotency keys

The same payment API should enforce idempotency across retries:

public async Task<HttpResponseMessage> ChargePaymentAsync(PaymentRequest request, string key)
{
    var message = new HttpRequestMessage(HttpMethod.Post, "charge")
    {
        Content = JsonContent.Create(request)
    };
    message.Headers.Add("Idempotency-Key", key);
    return await _httpClient.SendAsync(message);
}

Downstream services log and deduplicate by this key, ensuring no double charges.

6.5.3 Surface failures as Result → ProblemDetails, with correlation in logs and spans

try
{
    var response = await _paymentsClient.PostAsync(...);
    if (!response.IsSuccessStatusCode)
        return Result.Fail("Payment gateway failure").WithErrorCode("PAYMENT_DOWN");
}
catch (TimeoutRejectedException ex)
{
    _logger.LogWarning(ex, "Payment timeout");
    return Result.Fail("Payment timeout");
}

These failures map consistently to ProblemDetails responses while also appearing as correlated logs and spans in observability pipelines.

7 Observability: From Logs to Traces and SLOs

Resilience without observability is blind optimism. Mature systems measure their own reliability through structured logs, distributed traces, and error-rate metrics. In .NET 8/9, OpenTelemetry and Serilog make these observability pillars first-class citizens.

7.1 Logging that scales: structure first, message templates, never log stack traces raw in API responses

Structured logging uses named properties instead of interpolated strings:

_logger.LogWarning("Order {OrderId} failed with status {Status}", orderId, status);

This allows filtering and aggregation by OrderId or Status in log analytics tools. Log levels guide triage:

Information – successful paths
Warning – recoverable issues
Error – unexpected but handled exceptions
Fatal (Critical in Microsoft.Extensions.Logging) – unrecoverable, service-impacting errors

Never return stack traces in API responses. Write them to logs only, and give the client a correlation ID that leads back to them.

7.2 Correlation IDs end-to-end (ingress, internal messages, egress) with Serilog enrichers/scopes

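Consistent correlation IDs link all telemetry for a single request. With Serilog as the logging provider and Enrich.FromLogContext() enabled, properties pushed through an ILogger scope are attached to every event written inside it: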

using (_logger.BeginScope(new Dictionary<string, object>
{
    ["CorrelationId"] = correlationId
}))
{
    _logger.LogInformation("Processing checkout {Id}", checkoutId);
}

Use Serilog.Enrichers.CorrelationId to automatically attach correlation IDs from the request header to every log line, ensuring traces, logs, and metrics all align.

7.3 Tracing with OpenTelemetry (.NET SDK and automatic instrumentation): spans, attributes, exemplars, resources

OpenTelemetry instrumentation automatically creates spans for HTTP, gRPC, and EF Core operations.

builder.Services.AddOpenTelemetry()
    .WithTracing(b => b
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("CheckoutAPI"))
        .AddOtlpExporter());

Each trace captures latency, status, and correlation context. Custom spans can capture business semantics:

using var activity = MyTelemetrySource.StartActivity("ProcessOrder");
activity?.SetTag("order.id", orderId);
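Custom spans only appear if their source is registered with the tracer. The sketch below assumes a source named `CheckoutAPI.Orders` (a made-up name) that would also be passed to `AddSource(...)` in the tracing setup. It shows why the null-conditional operator matters: with no listener, `StartActivity` returns null.

```csharp
using System;
using System.Diagnostics;

public static class CheckoutTelemetry
{
    // Hypothetical source name. It must match the name passed to AddSource(...) in the
    // OpenTelemetry tracing setup; otherwise StartActivity returns null and nothing is recorded.
    public static readonly ActivitySource Source = new("CheckoutAPI.Orders");

    // Returns the span ID, or null when no listener is sampling this source.
    public static string? ProcessOrder(string orderId, bool fail)
    {
        using var activity = Source.StartActivity("ProcessOrder");
        activity?.SetTag("order.id", orderId);

        if (fail)
            activity?.SetStatus(ActivityStatusCode.Error, "payment declined");

        return activity?.Id;
    }
}
```

Marking the span with `ActivityStatusCode.Error` is what lets trace backends surface it as a failed operation rather than just a tagged one.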

7.4 Metrics that matter for error handling (error rate, retry rate, circuit state, latency percentiles)

Key metrics to track:

Metric	Description	Target
`http.server.errors`	Total failed API responses	<1% per minute
`resilience.retry.count`	Retries per request	<3 on average
`circuit.open.count`	Active open breakers	<5% of endpoints
`latency.p95`	95th percentile response time	<2s for APIs

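The metric names above are illustrative. The built-in instruments use their own names: ASP.NET Core records http.server.request.duration, and Polly publishes resilience.polly.strategy.events and related duration histograms on its Polly meter. OpenTelemetry exports them once the corresponding meters are registered. Custom counters can augment them:

// The second positional parameter of CreateCounter is the unit, so pass the description by name.
_metrics.CreateCounter<int>("checkout_failures", description: "Number of failed checkout attempts");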
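A slightly fuller sketch of a custom counter, using only System.Diagnostics.Metrics, is shown below. The meter name `CheckoutAPI`, the instrument name, and the `error.code` tag are assumptions for illustration. The meter name would also need to be passed to `AddMeter(...)` in the OpenTelemetry metrics setup.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class CheckoutMetrics
{
    // Hypothetical meter name; register it with AddMeter("CheckoutAPI") to export it.
    private static readonly Meter AppMeter = new("CheckoutAPI");

    private static readonly Counter<long> Failures = AppMeter.CreateCounter<long>(
        "checkout.failures",
        unit: "{failure}",
        description: "Number of failed checkout attempts");

    // Tagging by error code lets dashboards break the failure rate down per error type.
    public static void RecordFailure(string errorCode) =>
        Failures.Add(1, new KeyValuePair<string, object?>("error.code", errorCode));
}
```

Keep the set of tag values small and bounded (error codes, not order IDs), since every distinct value creates a separate time series.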

7.5 ProblemDetails telemetry—capturing type/code and linking to logs and traces

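Every ProblemDetails response should also be recorded on the current span, so the failure is searchable by type and code:

activity?.SetStatus(ActivityStatusCode.Error, problem.Title);
activity?.SetTag("error.type", problem.Type);
// The "code" extension is optional; indexing a missing key would throw.
if (problem.Extensions.TryGetValue("code", out var code))
    activity?.SetTag("error.code", code);
activity?.SetTag("error.title", problem.Title);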

This makes failures searchable by error code in APM tools and allows correlation between logs, metrics, and traces.

7.6 Dashboards and alerts: fast triage, slow-burn detection, and budget-aware alerting

Operational dashboards should visualize:

Error rates by ProblemDetails.code
Retry counts and circuit breaker states
Latency heatmaps by endpoint
SLO adherence (availability, latency)

Alert only on actionable thresholds—e.g., sustained 5% error rate over 5 minutes—to avoid alert fatigue. Slow-burn issues (gradual latency increases) require anomaly detection rather than static thresholds.
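Budget-aware alerting is usually expressed as a burn rate: the observed error ratio divided by the error ratio the SLO allows. The helper below is a minimal sketch of that arithmetic. `ErrorBudget` and its method names are hypothetical, and the paging threshold is left to the caller.

```csharp
using System;

public static class ErrorBudget
{
    // Burn rate = observed error ratio / error ratio allowed by the SLO.
    // 1.0 means the budget runs out exactly at the end of the SLO window;
    // higher values exhaust it proportionally sooner.
    public static double BurnRate(long failed, long total, double sloTarget)
    {
        if (sloTarget <= 0 || sloTarget >= 1)
            throw new ArgumentOutOfRangeException(nameof(sloTarget), "Expected a ratio strictly between 0 and 1.");
        if (total <= 0) return 0;

        double observed = (double)failed / total;
        double allowed = 1 - sloTarget;
        return observed / allowed;
    }

    // Multi-window check: page only when both a long and a short window burn fast,
    // so a brief spike that has already recovered does not page anyone.
    public static bool ShouldPage(double longWindowBurn, double shortWindowBurn, double threshold) =>
        longWindowBurn >= threshold && shortWindowBurn >= threshold;
}
```

For example, a sustained 5% error rate against a 99.9% availability SLO is a burn rate of roughly 50, which would consume a 30-day budget in well under a day.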

8 Testing, Hardening, and Rollout

Error handling isn’t complete until proven resilient under chaos. Testing, fault injection, and progressive rollout ensure your strategy holds up under production pressure.

8.1 Unit and property tests for Result flows (happy, sad, and mixed paths)

Each Result-returning function should have explicit tests for success, expected failure, and cascading failure cases.

[Fact]
public async Task Checkout_ShouldFail_WhenCreditLimitExceeded()
{
    // CheckoutAsync is asynchronous (section 5.6.1), so the test awaits it.
    var result = await _service.CheckoutAsync(_requestWithExceededLimit);
    Assert.True(result.IsError);
    Assert.Equal("Customer.CreditLimitExceeded", result.FirstError.Code);
}

Property-based testing verifies composition logic holds for a range of inputs, ensuring ROP chains don’t short-circuit incorrectly.
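As an illustration of such a property, the sketch below checks that a failed chain never runs later steps and preserves the first error, across many generated inputs. `Res<T>` is a deliberately tiny stand-in for the article's Result type, and the loop with a seeded `Random` is a hand-rolled substitute for a property-testing library.

```csharp
using System;

// Minimal stand-in for a Result type so the property can run on its own.
public readonly record struct Res<T>(bool IsOk, T? Value, string? Error)
{
    public static Res<T> Ok(T value) => new(true, value, null);
    public static Res<T> Fail(string error) => new(false, default, error);

    // Runs `next` only on success; otherwise carries the existing error forward.
    public Res<U> Bind<U>(Func<T, Res<U>> next) =>
        IsOk ? next(Value!) : Res<U>.Fail(Error!);
}

public static class BindProperties
{
    // Property: once a chain has failed, later steps never run and the first error survives.
    // Checked against `samples` pseudo-random inputs; the fixed seed keeps runs reproducible.
    public static bool FailureShortCircuits(int samples, int seed = 12345)
    {
        var rng = new Random(seed);
        for (var i = 0; i < samples; i++)
        {
            var input = rng.Next();
            var laterStepRan = false;

            var result = Res<int>.Ok(input)
                .Bind(_ => Res<int>.Fail("first"))
                .Bind(x => { laterStepRan = true; return Res<int>.Ok(x + 1); });

            if (laterStepRan || result.IsOk || result.Error != "first") return false;
        }
        return true;
    }
}
```

A library such as FsCheck would add input shrinking and richer generators, but the shape of the property stays the same.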

8.2 Contract tests for ProblemDetails (schema + examples)

Use HTTP contract tests to verify that error responses conform to the Problem Details format (RFC 7807, since superseded by RFC 9457):

var response = await _client.PostAsJsonAsync("/checkout", invalidRequest);
response.StatusCode.Should().Be(HttpStatusCode.BadRequest);
var problem = await response.Content.ReadFromJsonAsync<ProblemDetails>();
problem.Type.Should().Contain("/errors/validation");

Schema conformance can be validated automatically via OpenAPI examples.

8.3 Fault injection & chaos experiments with Simmy + Polly (latency, exceptions, result faults)

Simmy, which has shipped inside Polly itself since version 8.3 (the Polly.Simmy namespace), adds fault injection to resilience pipelines. It is a key tool for validating resilience under stress.

var chaos = new ResiliencePipelineBuilder()
    .AddChaosLatency(new ChaosLatencyStrategyOptions
    {
        InjectionRate = 0.1, // 10% of requests delayed
        Latency = TimeSpan.FromSeconds(3)
    })
    .AddChaosFault(new ChaosFaultStrategyOptions
    {
        InjectionRate = 0.05, // 5% of requests fail
        FaultGenerator = new FaultGenerator()
            .AddException(() => new TimeoutException("Injected fault"))
    })
    .Build();

Run chaos experiments in staging or shadow traffic environments to validate recovery and alerting behavior.

8.4 Load and soak tests focused on retry/idempotency behavior

Load tests should measure not only throughput but also retry amplification and idempotency correctness. Monitor metrics:

Retry count per minute under load
Response consistency for repeated idempotent POSTs
Circuit breaker open/close frequency

k6 or Locust can generate realistic workloads that validate how retries and compensations behave under sustained stress.

8.5 Deployment guardrails: feature flags for resilience toggles, progressive rollout

Resilience features—like retries or fallback endpoints—should be toggleable at runtime via feature flags (e.g., LaunchDarkly, Azure App Configuration). This allows:

Gradual rollout of new resilience strategies
Safe disablement of problematic retries
Targeted testing in subset environments

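// IFeatureManager exposes IsEnabledAsync, and the pipeline call must be awaited too.
if (await _featureManager.IsEnabledAsync("NewResiliencePolicy"))
    await _pipeline.ExecuteAsync(...);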

8.6 Production runbook & troubleshooting checklist: where to look first; correlate by ID; common misconfigurations

A solid runbook accelerates triage during incidents.

Start with the correlation ID — trace across logs, metrics, and spans.
Inspect ProblemDetails type/code — determine if it’s domain or infrastructure.
Check circuit breaker metrics — too many open breakers may indicate systemic failure.
Validate retry amplification — excessive retries can mask deeper latency issues.
Review idempotency keys — ensure duplicate keys are handled predictably.
Check environment config drift — mismatched retry policies between services are common.

Common misconfigurations include:

Catching exceptions but not returning Result (losing context)
Logging unstructured messages
Missing cancellation tokens in async flows
Overly aggressive retry loops without backoff
