Mastering the Retry Pattern: Building Resilient Cloud-Native Applications with .NET

In today’s world of distributed and cloud-native applications, architects face an uncomfortable truth: failure is not merely possible—it’s inevitable. The shift from monolithic systems, where stability was a given, to dynamic, distributed cloud architectures means systems must gracefully handle faults rather than desperately avoid them.

Among the most fundamental tools to handle this challenge is the Retry Pattern, particularly in the .NET ecosystem. Let’s explore why retries are critical, how transient faults threaten your application’s resilience, and how you can strategically use retries to build robust, reliable software.


1. The Inevitability of Failure in Modern Cloud Architectures

1.1. Introduction: Why Resilience is Non-Negotiable

1.1.1. The Shift from Monolithic Stability to Distributed Dynamism

Traditionally, software applications were monolithic—tightly integrated, single-unit deployments with fewer points of failure. Today, software architectures embrace cloud-native principles: distributed, microservice-based, and dynamically scalable. While this shift offers flexibility and scalability, it introduces more moving parts, increasing complexity and the likelihood of intermittent failures.

Think of cloud-native architecture as orchestrating a symphony with many musicians: if even one musician falters, the harmony can break down. Thus, resilience is no longer optional—it’s foundational.

1.1.2. Introducing Transient Faults: The Silent Killers of Availability

In cloud environments, transient faults—temporary, short-lived issues—are common. Unlike persistent errors, transient faults typically correct themselves without human intervention. They can be difficult to detect because they vanish quickly, yet their impact can cascade, causing significant downtime or inconsistent application behavior.

1.1.3. Thesis: The Retry Pattern as a Fundamental Building Block

As software architects, embracing the Retry Pattern isn’t simply advisable—it’s essential. Retries form the backbone of self-healing applications in .NET, transforming inevitable faults from destructive disruptions into manageable hiccups.

1.2. Understanding Transient Faults

1.2.1. Definition and Characteristics

Transient faults are temporary, sporadic failures that resolve independently after brief periods. They’re characterized by being unpredictable but brief, typically resolving upon retry.

1.2.2. Common Sources of Transient Faults

  • Network hiccups: Brief connectivity losses.
  • Service throttling: Rate limiting (e.g., HTTP 429).
  • Resource unavailability: Database connection pool exhaustion.
  • Deployment-induced faults: Service restarts during deployments or auto-scaling.
  • Load balancer congestion: Sudden traffic spikes overwhelming load balancers.

Transient faults are subtle disruptors, but ignoring them can have severe repercussions.

1.3. The Business Impact of Ignoring Transient Faults

Ignoring transient faults may seem convenient initially, but the long-term repercussions are significant:

  • Degraded user experience: Frequent service disruptions erode trust.
  • Cascading failures: Unmanaged transient errors can trigger chain reactions, spreading outages.
  • Data integrity issues: Failed operations risk data inconsistencies or corruption.

As architects, championing resilience protects business continuity, customer satisfaction, and data reliability.


2. Deconstructing the Retry Pattern

The Retry Pattern elegantly solves transient faults through a structured approach:

2.1. Core Principles

  • Detection: Identify retriable faults explicitly.
  • Delay: Wait appropriately before retrying.
  • Retry execution: Re-attempt the operation.
  • Boundaries: Define conditions clearly when retries should cease.

2.2. Key Architectural Considerations

2.2.1. Idempotency: The Critical Prerequisite

Idempotency ensures operations produce the same result regardless of how many times they’re performed.

Imagine submitting an order online. If a retry happens due to a network glitch, you certainly don’t want two identical orders being processed. Ensuring idempotency prevents such disasters.

Common strategies to achieve idempotency include:

  • Passing a unique idempotency key with each request (e.g., in an HTTP header) so the server can recognize repeats.
  • Enforcing uniqueness at the database level (unique indexes or constraints) so duplicate inserts are rejected.

Here’s a C# example leveraging idempotency keys in HTTP requests using HttpClient:

using System.Net.Http;
using System.Net.Http.Headers;

var httpClient = new HttpClient();
var request = new HttpRequestMessage(HttpMethod.Post, "https://api.example.com/orders")
{
    Content = new StringContent("{\"item\":\"Laptop\",\"quantity\":1}")
};
request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/json");
// Reuse the same key if this request is retried so the server can detect duplicates
request.Headers.Add("Idempotency-Key", Guid.NewGuid().ToString());

var response = await httpClient.SendAsync(request);

Because retries of this request carry the same key, a server that honors idempotency keys can detect duplicates and process the order only once.
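
The client-side key is only half the story: the server must actually de-duplicate on it. Here is a minimal, hypothetical sketch of that server-side idea; OrderService, PlaceOrder, and the in-memory store are illustrative assumptions, and a real implementation would use a durable store such as a database table with a unique constraint on the key:

using System;
using System.Collections.Concurrent;

public class OrderService
{
    // Illustrative in-memory store; production code would use a durable store
    // (e.g., a table with a unique constraint on the idempotency key).
    private readonly ConcurrentDictionary<string, string> _results = new();

    public string PlaceOrder(string idempotencyKey, string orderPayload)
    {
        // The first request for a given key creates the order; retries with the
        // same key return the stored result instead of creating a duplicate.
        return _results.GetOrAdd(idempotencyKey, _ => CreateOrder(orderPayload));
    }

    private string CreateOrder(string orderPayload)
    {
        // Placeholder for the real order-creation logic.
        return $"order-{Guid.NewGuid()}";
    }
}

Note that under heavy concurrency GetOrAdd may invoke the factory more than once for the same key, which is one more reason a database-level uniqueness guarantee is the safer production choice.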

2.2.2. Retriable vs. Non-Retriable Faults

Clearly distinguishing retriable from non-retriable faults prevents wasteful retries or, worse, harmful outcomes:

  • Retriable faults: Temporary server errors (HTTP 503, HTTP 429, connection timeouts).
  • Non-Retriable faults: Permanent client-side errors (HTTP 400, HTTP 401 Unauthorized).

Retrying non-retriable faults can be harmful—like repeatedly entering the wrong password. Identify and categorize faults carefully.
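
Many teams centralize this classification in a small helper so that every policy shares the same definition of “retriable.” The sketch below is one possible shape; the FaultClassifier name and the exact status-code list are illustrative assumptions:

using System;
using System.Net;
using System.Net.Http;

public static class FaultClassifier
{
    // Typical transient HTTP results; anything else (400, 401, 404, ...) is not retried.
    public static bool IsRetriable(HttpStatusCode statusCode) => statusCode switch
    {
        HttpStatusCode.RequestTimeout => true,       // 408
        HttpStatusCode.TooManyRequests => true,      // 429
        HttpStatusCode.InternalServerError => true,  // 500
        HttpStatusCode.BadGateway => true,           // 502
        HttpStatusCode.ServiceUnavailable => true,   // 503
        HttpStatusCode.GatewayTimeout => true,       // 504
        _ => false
    };

    // Network-level failures and timeouts are usually worth retrying.
    public static bool IsRetriable(Exception ex) =>
        ex is HttpRequestException || ex is TimeoutException;
}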


3. Foundational Retry Strategies and Implementations

Choosing the right retry strategy can mean the difference between resolving an issue gracefully and making it worse. Let’s explore four common strategies:

3.1. Strategy 1: Immediate Retry

Concept: Retry instantly after failure.

Use Cases: Very brief faults, such as momentary network glitches.

Risks: Can create “retry storms,” flooding services with rapid requests, potentially causing self-inflicted denial-of-service (DoS).

3.2. Strategy 2: Fixed Interval

Concept: Wait a constant amount between retries (e.g., 3 seconds each time).

Use Cases: Predictable short downtimes.

Risks: May lead to synchronized retry spikes from multiple clients (“thundering herd”).

3.3. Strategy 3: Incremental Interval

Concept: Gradually increase delay by a fixed amount (e.g., 1s, then 2s, then 3s).

Use Cases: Moderate unpredictability.

Risks: Still somewhat aggressive, potentially overwhelming services if not managed carefully.
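
Each of these strategies boils down to a different delay function for a given attempt number. Here is a minimal sketch of the first three (the method names are illustrative); later sections show how such a function plugs into a retry policy as a sleepDurationProvider:

using System;

static TimeSpan ImmediateRetry(int attempt) => TimeSpan.Zero;                       // no wait at all
static TimeSpan FixedInterval(int attempt) => TimeSpan.FromSeconds(3);              // 3s, 3s, 3s, ...
static TimeSpan IncrementalInterval(int attempt) => TimeSpan.FromSeconds(attempt);  // 1s, 2s, 3s, ...

for (int attempt = 1; attempt <= 3; attempt++)
{
    Console.WriteLine(
        $"Attempt {attempt}: immediate={ImmediateRetry(attempt)}, fixed={FixedInterval(attempt)}, incremental={IncrementalInterval(attempt)}");
}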

3.4. Strategy 4: Exponential Backoff (The Industry Standard)

Concept: Double wait times after each retry—1s, 2s, 4s, 8s, etc.

Why Effective: Gives overloaded services increasingly more breathing room to recover.

The Critical Role of Jitter

To prevent multiple clients from retrying simultaneously (the “thundering herd” problem), jitter—randomizing delays slightly—is essential.

Here’s a practical example in C# demonstrating exponential backoff with jitter using Polly, a popular resilience library in .NET:

using Polly;
using System;
using System.Net.Http;

var jitterer = new Random();

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(5, retryAttempt =>
        TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
        + TimeSpan.FromMilliseconds(jitterer.Next(0, 1000)),
        onRetry: (exception, timeSpan, retryCount, context) =>
        {
            Console.WriteLine($"Retry {retryCount} after {timeSpan.TotalSeconds} seconds due to {exception.Message}");
        });

await retryPolicy.ExecuteAsync(async () =>
{
    using var client = new HttpClient();
    var response = await client.GetAsync("https://api.example.com/data");
    response.EnsureSuccessStatusCode();
});

Explanation:

  • Retries up to 5 times with exponentially growing delays (2^retryAttempt seconds: 2s, 4s, 8s, and so on).
  • Adds jitter up to 1 second to each retry.
  • Logs each retry attempt for observability.

4. Implementing Retry in .NET: From Manual to Managed

Theoretical knowledge is essential, but true resilience emerges in implementation. In .NET, architects have multiple ways to introduce retries—from rudimentary code to sophisticated libraries. Let’s examine the journey from naive approaches to industry-grade patterns.

4.1. The Naive Approach: A Manual for-Loop (Anti-Pattern)

4.1.1. Code Example: The Classic Try-Catch Inside a Loop

It’s tempting, especially in proof-of-concept code, to reach for a simple loop wrapped around a try-catch. It looks straightforward and seems to solve the problem:

int maxRetries = 3;
for (int attempt = 1; attempt <= maxRetries; attempt++)
{
    try
    {
        // Simulate a transient operation, e.g., a network call
        PerformUnstableOperation();
        break; // Success, exit the loop
    }
    catch (Exception)
    {
        if (attempt == maxRetries)
        {
            throw; // All retries failed, escalate the error
        }
        // Optionally, sleep between retries
        Thread.Sleep(1000);
    }
}

4.1.2. Why This Is a Bad Idea for Production Systems

While this pattern might seem serviceable, it falls short for several reasons:

  • Hard to Maintain: Every retry scenario ends up with duplicated, scattered retry logic.
  • Error-Prone: It’s easy to forget to add delays, or to retry on the wrong exceptions.
  • Lacks Flexibility: No built-in support for sophisticated strategies like exponential backoff or jitter.
  • Violates DRY Principle: Each manual implementation risks divergence, leading to inconsistent behavior.

In production, avoid such home-grown approaches. Instead, leverage built-in or well-established resilience frameworks.


4.2. Built-in .NET Mechanisms: Where to Find Them

Modern .NET provides native support for retries in critical frameworks, making resilience part of your infrastructure rather than your application logic.

4.2.1. Entity Framework Core: Connection Resiliency

Cloud-hosted databases—such as Azure SQL or AWS RDS—are particularly susceptible to transient connectivity faults. Entity Framework Core addresses this with execution strategies: automatic, configurable retries for database operations.

EnableRetryOnFailure is the primary flag architects should enable for cloud applications:

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder.UseSqlServer(
        connectionString,
        sqlServerOptions =>
            sqlServerOptions.EnableRetryOnFailure(
                maxRetryCount: 5,
                maxRetryDelay: TimeSpan.FromSeconds(30),
                errorNumbersToAdd: null)
    );
}

  • maxRetryCount: How many attempts before failing.
  • maxRetryDelay: The maximum pause between retries.
  • errorNumbersToAdd: Fine-tune for specific SQL error codes.

Architect’s Note: Handling User-Initiated Transactions

Execution strategies and manual transactions can conflict. When you explicitly manage transactions, always use Database.CreateExecutionStrategy() to wrap your work, ensuring both retry logic and transactional integrity:

var strategy = context.Database.CreateExecutionStrategy();
await strategy.ExecuteAsync(async () =>
{
    using var transaction = await context.Database.BeginTransactionAsync();
    // Your operations here
    await context.SaveChangesAsync();
    await transaction.CommitAsync();
});

This approach encapsulates both retries and transaction management, reducing the risk of partial commits or duplicated logic.


4.3. Introducing Polly: The Swiss Army Knife for .NET Resilience

4.3.1. What Is Polly and Why Should Every .NET Architect Know It?

Polly is an open-source, comprehensive resilience and transient-fault-handling library for .NET. Polly enables you to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback in a fluent, thread-safe manner.

Why Polly?

  • Composability: Policies can be combined for complex resilience strategies.
  • Extensibility: Covers a wide range of scenarios, from HTTP retries to custom logic.
  • Observability: Hooks for logging and metrics with every attempt.

4.3.2. The Concept of Resilience Policies

A policy is a set of rules that defines how to handle certain faults. In Polly, policies are first-class citizens: you can configure, reuse, and compose them.

  • Retry Policy: Specifies how, when, and how many times to retry.
  • Wait and Retry Policy: Introduces delays between attempts.
  • Circuit Breaker Policy: Prevents further retries after repeated failures, allowing time for recovery.

4.3.3. Setting Up Polly in a .NET Project

Install Polly via NuGet:

dotnet add package Polly

For HTTP scenarios, add the official integration:

dotnet add package Microsoft.Extensions.Http.Polly

Once installed, you can begin composing robust, centralized retry strategies that are easy to maintain and test.


5. Mastering Retry Patterns with Polly

Polly’s expressive syntax and integration with .NET’s dependency injection make it a preferred choice for architects aiming for clarity and power. Let’s explore practical policy design, code examples, and architectural tips.

5.1. Defining a Basic Retry Policy

5.1.1. Policy.Handle<TException>(): Specifying Which Exceptions to Handle

You rarely want to retry every exception. For example, you should not retry on authentication failures, but you should retry on timeouts or server errors.

Here’s how to target specific exceptions:

var retryPolicy = Policy
    .Handle<HttpRequestException>() // Only retry these exceptions
    .RetryAsync(3); // Try up to three times

5.1.2. Policy.HandleResult<TResult>() and OrResult(): Retrying Based on the Return Value

Sometimes, failures manifest as error results rather than exceptions, such as an HTTP 500 response. Polly supports these cases too: HandleResult starts a result-based policy, while OrResult (shown after the next snippet) adds result handling to an exception-based one:

var retryPolicy = Policy
    .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .RetryAsync(3);

Now, your policy triggers on failed HTTP responses as well as exceptions.
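
When a single policy needs to react to both exceptions and bad results, OrResult chains result handling onto an exception-based policy, for example:

var retryPolicy = Policy
    .Handle<HttpRequestException>()                               // network-level exceptions
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)   // plus non-success responses
    .RetryAsync(3);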

5.1.3. RetryAsync(): Simple Retry a Fixed Number of Times

The most basic retry policy simply attempts the action a fixed number of times:

await retryPolicy.ExecuteAsync(async () =>
{
    // Some transient operation, e.g., fetching remote data
    var response = await client.GetAsync("https://api.example.com/data");
    response.EnsureSuccessStatusCode();
});

5.1.4. Code Example: Basic Retry Policy for a Transient Database Exception

Let’s tie this all together with a concrete example:

var retryPolicy = Policy
    .Handle<SqlException>(ex => IsTransient(ex))
    .RetryAsync(5);

await retryPolicy.ExecuteAsync(async () =>
{
    using var connection = new SqlConnection(connectionString);
    await connection.OpenAsync();
    // Database operations here
});

Assume IsTransient() is a method that determines whether the exception is likely transient.
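
One hedged sketch of what that helper might look like, keyed off a few SQL Server error numbers commonly treated as transient (the list is illustrative and intentionally incomplete; EF Core’s EnableRetryOnFailure ships with its own, more thorough classification):

using System;
using Microsoft.Data.SqlClient;

static bool IsTransient(SqlException ex)
{
    // A handful of error numbers commonly treated as transient for SQL Server / Azure SQL:
    // 4060 (cannot open database), 40197/40501/40613 (service busy or unavailable),
    // 49918-49920 (not enough resources), 10928/10929 (resource limits), -2 (client timeout).
    int[] transientErrorNumbers = { 4060, 40197, 40501, 40613, 49918, 49919, 49920, 10928, 10929, -2 };

    foreach (SqlError error in ex.Errors)
    {
        if (Array.IndexOf(transientErrorNumbers, error.Number) >= 0)
        {
            return true;
        }
    }
    return false;
}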


5.2. Advanced Policies: Wait and Retry

5.2.1. WaitAndRetryAsync(): The Workhorse for Sophisticated Strategies

Real-world systems need more than brute-force retries. Enter Wait and Retry, which introduces configurable delays between attempts. This is where strategies like exponential backoff and jitter shine.

5.2.2. Code Example: Implementing a Fixed Interval Retry

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(2)
    );

await retryPolicy.ExecuteAsync(async () =>
{
    // HTTP or database call
    await PerformUnstableOperationAsync();
});

5.2.3. Code Example: Exponential Backoff with Jitter Using Polly

Production-Ready Example:

var jitterer = new Random();

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: retryAttempt =>
            TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
            + TimeSpan.FromMilliseconds(jitterer.Next(0, 500)), // Add 0-500ms jitter
        onRetry: (exception, timespan, retryCount, context) =>
        {
            Console.WriteLine($"Retry {retryCount} after {timespan.TotalSeconds:N1}s due to {exception.Message}");
        }
    );

Explanation:

  • Exponential Backoff: Doubles the delay each time.
  • Jitter: Adds random variance to prevent clients from synchronizing their retries.
  • onRetry: Useful for observability; logs retry attempts.

5.3. Seamless Integration with IHttpClientFactory

5.3.1. The Modern Approach to Resilient HTTP Calls in ASP.NET Core

IHttpClientFactory is now the preferred way to manage HttpClient instances. It addresses connection lifecycle issues and, combined with Polly, makes implementing retries clean and maintainable.

5.3.2. Using the Microsoft.Extensions.Http.Polly Package

Install the integration:

dotnet add package Microsoft.Extensions.Http.Polly

5.3.3. Code Example: Configuring a Retry Policy for a Typed HttpClient

Configure Polly policies during service registration—no need to litter your codebase with retry logic.

In Program.cs or Startup.cs:

using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

builder.Services.AddHttpClient<IMyApiClient, MyApiClient>()
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()
        .WaitAndRetryAsync(
            retryCount: 5,
            sleepDurationProvider: retryAttempt =>
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) +
                TimeSpan.FromMilliseconds(new Random().Next(0, 1000))
        )
    );

  • HandleTransientHttpError() is a Polly helper for 5xx, 408, and network errors.
  • AddPolicyHandler wires up the retry policy to every outgoing HTTP request for that client.
  • No retry clutter in your business logic. Retry is handled consistently across all usages.

Result: Separation of concerns. Your HTTP clients are automatically resilient, with retry logic centralized and versioned alongside your DI setup.


5.4. Context is Key: Passing Data to Policies

Sophisticated retry strategies often need context—metadata to make smarter decisions or produce better logs.

5.4.1. Using Policy.ExecuteAsync with a Context Object

Polly allows you to pass a Context dictionary to ExecuteAsync, enabling contextual information for each retry.

5.4.2. Passing Correlation IDs, Logger Instances, or Other Data

For distributed tracing, pass a correlation ID into your retry logic for each request. You might also want to pass a logger or custom telemetry object.

5.4.3. Code Example: Logging Retry Attempts with a Correlation ID

var correlationId = Guid.NewGuid().ToString();

var context = new Context
{
    { "CorrelationId", correlationId }
};

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(2),
        onRetry: (exception, timespan, retryCount, ctx) =>
        {
            var cid = ctx.ContainsKey("CorrelationId") ? ctx["CorrelationId"] : "N/A";
            logger.LogWarning($"Retry {retryCount} for CorrelationId {cid} after {timespan.TotalSeconds}s due to {exception.Message}");
        }
    );

await retryPolicy.ExecuteAsync(async (ctx) =>
{
    // Perform resilient operation here
    await PerformUnstableOperationAsync();
}, context);

  • Benefit: Every retry is associated with a correlation ID, making debugging and tracing across microservices straightforward.
  • Extensibility: Pass any metadata needed for enhanced observability, security, or feature toggles.

6. Beyond Retry: Composing Resilience Patterns

Retries alone are powerful, but in distributed systems, failures aren’t always short-lived or random. Sometimes, they persist—or signal a deeper systemic issue. Resilience is most effective when multiple patterns work in concert, each handling a specific facet of failure.

6.1. When Retry Isn’t Enough: The Circuit Breaker Pattern

6.1.1. Concept: Proactive Failure Protection

While retries tackle intermittent faults, they can backfire when a dependency is experiencing sustained issues. Constant retries in these circumstances can cause resource exhaustion, amplify downstream failures, and lengthen recovery times. Enter the Circuit Breaker Pattern: it “breaks the circuit,” preventing further attempts once a threshold of failures is reached.

6.1.2. The Three States: Closed, Open, and Half-Open

Circuit Breakers have three core states, much like an electrical circuit:

  • Closed: Everything is healthy; calls are allowed through.
  • Open: Failure threshold exceeded; calls are blocked for a defined interval, failing fast.
  • Half-Open: After a cool-down period, a limited number of calls are allowed through to test if the dependency has recovered.

This mechanism not only protects downstream services but also gives them time to heal.
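
Polly exposes these transitions directly through the onBreak, onReset, and onHalfOpen callbacks, which are a convenient place to hang logging or metrics. A small sketch:

var circuitBreaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (exception, breakDelay) =>
            Console.WriteLine($"Circuit opened for {breakDelay.TotalSeconds}s: {exception.Message}"),
        onReset: () => Console.WriteLine("Circuit closed; calls flow normally again."),
        onHalfOpen: () => Console.WriteLine("Circuit half-open; the next call is a trial.")
    );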

6.1.3. Architectural Synergy: How Retry and Circuit Breaker Work Together

A common misunderstanding is that retries and circuit breakers are mutually exclusive. In reality, they’re synergistic:

  • Retry is for transient, unpredictable faults—a short blip in network connectivity, a service restart, or a brief resource exhaustion.
  • Circuit Breaker shields your system from repeatedly hammering a dependency in distress, acting as a fail-fast mechanism.

Combined, they create a self-regulating feedback loop: retries for recoverable faults, circuit breaking for deeper issues.

6.1.4. Code Example: Wrapping a Retry Policy with a Circuit Breaker

Polly makes composing policies straightforward using PolicyWrap or Policy.WrapAsync().

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30)
    );

// Compose them: retry, then break the circuit if retries keep failing
var resiliencePolicy = Policy.WrapAsync(retryPolicy, circuitBreakerPolicy);

await resiliencePolicy.ExecuteAsync(async () =>
{
    await CallRemoteServiceAsync();
});

Here the retry policy is the outer layer, so every attempt (including each retry) flows through the circuit breaker, which counts consecutive failures. If failures persist past the threshold, the breaker trips and subsequent calls fail immediately until the cool-down period expires.


6.2. The Fallback Pattern: Graceful Degradation

6.2.1. Concept: Providing Alternatives, Not Just Errors

What if, despite retries and circuit breaking, an operation still fails? A robust system doesn’t just give up—it degrades gracefully. The Fallback Pattern lets your application offer a substitute: cached data, a default value, or a friendly message, instead of propagating a raw exception to users.

6.2.2. Use Cases

  • Returning the last known good value from a cache when the primary data source is unavailable.
  • Serving a static page when a content service is down.
  • Displaying a “service temporarily unavailable” message that’s helpful, not cryptic.

6.2.3. Code Example: Using FallbackAsync in Polly

Polly’s FallbackAsync policy lets you specify what should happen when all else fails.

var fallbackPolicy = Policy<HttpResponseMessage>
    .Handle<Exception>()
    .FallbackAsync(
        fallbackValue: new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("{\"data\":\"cached\"}")
        },
        onFallbackAsync: async b =>
        {
            logger.LogWarning("Fallback executed. Service unavailable, serving cached data.");
        }
    );

// The generic fallback policy wraps the non-generic retry and circuit breaker via the instance WrapAsync overloads
var resiliencePolicy = fallbackPolicy.WrapAsync(retryPolicy.WrapAsync(circuitBreakerPolicy));

var response = await resiliencePolicy.ExecuteAsync(async () =>
{
    return await httpClient.GetAsync("https://critical-api.com/data");
});

By wrapping fallback, retry, and circuit breaker policies, you ensure your application doesn’t just survive failures but continues to provide value to users even in adverse conditions.


7. Real-World Scenarios for the .NET Architect

Applying these patterns in theory is one thing. Applying them to real systems is another. Let’s walk through practical .NET scenarios, with architecture diagrams, code, and actionable guidance.

7.1. Scenario 1: Resilient API-to-API Communication

7.1.1. Architecture Diagram

[Client App] 
     |
     v
[API Gateway] ---> [Downstream Service (REST API)]
             |         ^
             |         |
      (Polly Retry & Circuit Breaker)

7.1.2. Code Walkthrough Using IHttpClientFactory and Polly

Suppose your application calls an external REST API. To guard against failures, you configure resilience at the HTTP client layer.

In Program.cs:

builder.Services.AddHttpClient<MyWeatherClient>()
    .AddPolicyHandler(HttpPolicyExtensions.HandleTransientHttpError()
        .WaitAndRetryAsync(
            5,
            attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)) + TimeSpan.FromMilliseconds(new Random().Next(0, 500))
        )
    )
    .AddPolicyHandler(Policy<HttpResponseMessage>
        .Handle<Exception>()
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 3,
            durationOfBreak: TimeSpan.FromSeconds(15)
        )
    );

Typed client:

public class MyWeatherClient
{
    private readonly HttpClient _client;
    public MyWeatherClient(HttpClient client) => _client = client;

    public async Task<string> GetWeatherAsync()
    {
        var response = await _client.GetAsync("/weather");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}

No retry logic litters your core code. Resilience is declared centrally, enforced consistently.


7.2. Scenario 2: Resilient Database Operations

7.2.1. Architecture Diagram

[.NET App] --> [EF Core DbContext] 
                  |
                  v
             [Cloud SQL DB]
         (Execution Strategy with Retry)

7.2.2. Code Walkthrough Using EF Core’s Built-in Resiliency

In your DbContext:

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder.UseSqlServer(
        connectionString,
        sqlOptions => sqlOptions.EnableRetryOnFailure(5, TimeSpan.FromSeconds(10), null)
    );
}

This ensures transient database connectivity issues are retried transparently.

7.2.3. Augmenting with Polly for Complex Transaction-Level Retries

For scenarios needing advanced control (e.g., transaction boundaries):

var strategy = context.Database.CreateExecutionStrategy();
await strategy.ExecuteAsync(async () =>
{
    using var transaction = await context.Database.BeginTransactionAsync();
    try
    {
        // Multiple DB operations
        await context.SaveChangesAsync();
        await transaction.CommitAsync();
    }
    catch
    {
        await transaction.RollbackAsync();
        throw;
    }
});

Optionally, you can wrap higher-level orchestration logic with a Polly policy for retrying at the service boundary—useful for saga or outbox patterns.
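
As a hedged sketch of that idea, the unit of work below is retried at the service boundary when a concurrency conflict surfaces. DbUpdateConcurrencyException is just one example of a fault you might choose to treat as retriable at this level, and a real implementation would reload the affected entities before retrying:

var serviceLevelRetry = Policy
    .Handle<DbUpdateConcurrencyException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(200 * attempt));

await serviceLevelRetry.ExecuteAsync(async () =>
{
    // EF Core's execution strategy still handles transient connectivity inside the unit of work.
    var strategy = context.Database.CreateExecutionStrategy();
    await strategy.ExecuteAsync(async () =>
    {
        using var transaction = await context.Database.BeginTransactionAsync();
        // Re-read current state, apply the changes, and commit as one unit.
        await context.SaveChangesAsync();
        await transaction.CommitAsync();
    });
});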


7.3. Scenario 3: Resilient Message Queue Processing

7.3.1. Architecture Diagram

[Producer Service] --> [Queue Broker (e.g., RabbitMQ, Azure Service Bus)] <-- [Consumer Service (.NET)]
                                                               |
                                                      (Polly Retry on Processing)

7.3.2. Code Walkthrough for Message Processing with Polly

Suppose your consumer occasionally fails to process messages due to transient broker issues or external dependencies.

var retryPolicy = Policy
    .Handle<BrokerUnreachableException>()
    .Or<TimeoutException>()
    .WaitAndRetryAsync(5, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

await retryPolicy.ExecuteAsync(async () =>
{
    // Message processing logic
    await ProcessMessageAsync(message);
});

If you need to combine with a circuit breaker or fallback, simply wrap these as shown in earlier examples.


8. Best Practices and Anti-Patterns

Every resilience journey benefits from hard-won lessons and industry wisdom. Here are clear, actionable dos and don’ts for .NET architects.

8.1. Architectural Best Practices

  • DO ensure all retried operations are idempotent. Without this, you risk duplicating actions and corrupting state.
  • DO use exponential backoff with jitter as your default. This protects both your systems and your dependencies.
  • DO configure different retry policies for different services. The right strategy for a payment gateway may differ from a search API.
  • DO centralize policy configuration, typically in Program.cs or via DI containers, for visibility and maintainability (see the sketch after this list).
  • DO implement comprehensive logging and monitoring. Observability on retries and circuit breaking is vital for detecting issues before they become outages.
  • DO set a finite maximum retry limit. Infinite or very high retry counts can degrade system responsiveness and reliability.
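
To make the centralization and per-service tailoring concrete, one possible arrangement is sketched below; SearchApiClient and PaymentGatewayClient are hypothetical placeholders, and the specific counts and delays are illustrative:

using Polly;
using Polly.Extensions.Http;

// In Program.cs: define the policies once, then attach the appropriate one to each client.
var standardHttpRetry = HttpPolicyExtensions
    .HandleTransientHttpError()
    .WaitAndRetryAsync(5, attempt =>
        TimeSpan.FromSeconds(Math.Pow(2, attempt)) +
        TimeSpan.FromMilliseconds(new Random().Next(0, 500)));   // backoff with jitter as the default

var conservativeHttpRetry = HttpPolicyExtensions
    .HandleTransientHttpError()
    .WaitAndRetryAsync(2, attempt => TimeSpan.FromSeconds(2));   // stricter policy for an expensive dependency

builder.Services.AddHttpClient<SearchApiClient>()
    .AddPolicyHandler(standardHttpRetry);

builder.Services.AddHttpClient<PaymentGatewayClient>()
    .AddPolicyHandler(conservativeHttpRetry);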

8.2. Common Anti-Patterns to Avoid

  • DON’T retry without a delay, and be wary of aggressive fixed intervals under heavy concurrent load; both can overwhelm downstream services (“retry storm”).
  • DON’T retry on non-transient faults, like client errors or programming bugs. Validate error types before retrying.
  • DON’T add retry logic at every layer. This can cause “retry amplification,” where failures are multiplied, not mitigated.
  • DON’T let retry durations become excessive. Keep total wait time reasonable so that users aren’t left hanging.
  • DON’T build your own complex retry logic when mature, well-tested libraries like Polly are available. Reinventing the wheel often leads to subtle bugs and technical debt.

9. Conclusion: Building a Culture of Resilience

9.1. Recap

The Retry Pattern isn’t a luxury or a feature—it’s a foundational requirement for any application operating in the unpredictable landscape of the cloud. In distributed systems, failures are expected, but unhandled failures are unacceptable.

9.2. The Architect’s Role

As a software architect, you play a pivotal role in advocating for, standardizing, and educating teams on resilience patterns. Your influence sets the tone: resilience is not an afterthought or a bolt-on, but an architectural pillar woven into every layer of the stack.

  • Champion proven frameworks like Polly.
  • Centralize configuration and logging.
  • Conduct regular reviews of error logs and adjust strategies as systems evolve.

9.3. Final Thoughts

A well-implemented retry strategy is a silent guardian. It works invisibly, smoothing over the cracks and disruptions of distributed computing, delivering on the stability, reliability, and performance that your users—and your business—demand. By thoughtfully integrating retry with circuit breakers, fallback, and other patterns, you create robust systems that don’t just survive failure—they expect it, manage it, and ultimately thrive because of it.
