Building ETL Pipelines That Don't Break: Idempotency, Schema Evolution & Recovery with Azure Data

1 Introduction: The Fragility of Modern Data Workflows

Modern ETL systems move faster and integrate more sources than anything built a decade ago. APIs evolve without notice. SaaS vendors add or rename fields. Internal teams ship releases that subtly change payload formats. At the same time, business stakeholders expect analytics to refresh reliably every morning.

But reliability in data systems is not automatic.

A 2025 Monte Carlo industry survey reported that over 40% of data teams spend more than 30% of their engineering time responding to pipeline incidents—reruns, data corrections, schema mismatches, and production hotfixes. That’s not innovation time. That’s cleanup time.

The core issue is simple: many pipelines are built for the happy path. They assume:

  • files arrive complete and on time
  • schemas don’t change unexpectedly
  • failures are rare
  • retries won’t duplicate data

Those assumptions break quickly in real systems.

Once a file arrives late, a schema changes, or a network timeout interrupts a batch, pipelines often degrade into partial writes, duplicate rows, or silent data loss. When downstream analytics, reporting systems, or ML models depend on that data, the impact spreads far beyond the ingestion job.

A reliable ETL system must commit to one principle:

It should be safe to run the same pipeline step repeatedly.

That principle connects idempotency, schema evolution, replayability, and recovery patterns into a single architectural mindset.

This article covers:

  • Designing idempotent pipelines that tolerate retries and duplicates
  • Managing schema evolution without corrupting downstream datasets
  • Achieving effective exactly-once processing in distributed Azure systems
  • Handling late-arriving data and historical backfills
  • Implementing recovery strategies with Azure services and .NET
  • Building scalable ETL components using modern .NET patterns
  • Adding observability and governance for long-term stability

1.1 The Cost of Fragility and the Shift to Defensive Design

Pipeline fragility is expensive, but the cost is often hidden until something goes wrong. A failed ingestion job might silently drop transactions, duplicate revenue records, misalign dimension keys, or overwrite valid historical data. These issues rarely surface immediately—they show up days later as inconsistent dashboards, reconciliation mismatches, or stakeholder complaints.

The cost appears in three areas:

  1. Operational load — Engineers spend hours replaying jobs, manually deduplicating records, and validating fixes. Without idempotent pipelines, recovery becomes risky and manual.
  2. Business trust — Executives assume data platforms are deterministic. One incorrect financial dashboard can erode confidence quickly.
  3. Cascading breakage — A malformed batch can corrupt curated Delta tables, invalidate ML training sets, and propagate incorrect metrics across multiple domains.

Traditional ETL development assumed structured, predictable inputs—CSV files arrived nightly, schemas were stable, transformation logic rarely changed. That model no longer fits modern Azure-based systems where files arrive incomplete or out of order, upstream APIs add fields without notice, message brokers deliver duplicates, and compute nodes restart mid-execution.

A defensive pipeline is:

  • Replayable — safe to re-run without duplicating data
  • Deterministic — same input produces the same output
  • State-aware — checkpoints and offsets reflect durable writes
  • Observable — failures are visible and traceable

Working once is not success. Working repeatedly under failure conditions is.

That shift changes how you design ingestion, transformation, storage, and orchestration layers.

1.2 The 2025 Landscape: Unified Data in Microsoft Fabric and OneLake

The Azure data ecosystem has consolidated significantly. Microsoft Fabric and OneLake promote a unified lakehouse model instead of stitching together loosely coupled storage and compute services.

Two characteristics directly affect pipeline reliability. First, Delta Lake is now foundational—its transaction log, ACID guarantees, merge semantics, and schema enforcement make idempotent and replayable systems practical. Second, OneLake centralizes storage under a single logical namespace, simplifying lineage, governance, and recovery.

But unification increases responsibility. When multiple workloads—BI, data science, real-time analytics—consume the same curated tables, pipeline correctness becomes critical. The lakehouse model reduces architectural complexity. It does not reduce the need for discipline.

1.3 Why .NET Architects Are Increasingly Responsible for Data Reliability

Data reliability no longer lives only in Spark notebooks or orchestration tools. It often begins in application code.

Consider a common Azure scenario: an ASP.NET Core API publishes domain events to Azure Event Hubs, a .NET Azure Function consumes those events, enriches records and writes them into a Delta Lake table in Fabric, and a downstream Power BI semantic model reads from that curated table.

If the Azure Function produces non-deterministic IDs or writes duplicates, the Delta table becomes inconsistent. If the Event Hub consumer checkpoints offsets before a durable write, events may be lost. If schema changes aren’t validated before publishing, downstream merges fail. A concrete implementation of this flow appears in Section 2.

The reliability of the entire analytics platform depends on deterministic identity generation, idempotent Delta merge logic, correct retry and checkpoint handling, and schema validation before write. These concerns sit directly in .NET code.

The boundary between application architecture and data architecture has dissolved. Reliable pipelines start at the edge—where events are produced and first processed—not just in orchestration tools. And in Azure-based systems, that edge is very often written in C#.


2 Designing for Idempotency: The “Play-It-Again” Principle

Idempotency ensures that re-running a pipeline step produces the same final state as running it once. In distributed Azure data systems, retries are normal. Network timeouts happen. Spark jobs fail halfway through. Event processors restart.

If your pipeline cannot tolerate re-execution, every retry becomes a risk.

Idempotency is not about preventing duplicates at the infrastructure layer. It’s about designing storage and transformation logic so that duplicates don’t change the outcome.

2.1 Understanding Idempotency in Data Pipelines

In practical terms, idempotency means this:

If the same input is processed multiple times, the target dataset ends up in the same state.

The goal is not to avoid performing work twice. The goal is to avoid incorrect side effects.

2.1.1 Definition: Side-effect-free Re-execution

A pipeline operation is idempotent when:

  • repeated execution does not create duplicate rows
  • partial writes can be safely retried
  • checkpointing reflects durable, completed work

The simplest test is:

“If this job crashes after writing half a batch, what happens when it restarts?”

If the answer is “we must truncate the table” or “we’ll get duplicates,” the system is fragile.

2.1.2 Deterministic vs. Non-deterministic Transforms

Many idempotency problems are not caused by storage. They are caused by transformation logic.

Deterministic transforms always produce the same output for the same input. Examples include normalizing strings, mapping codes to reference tables, and computing hashes.

Non-deterministic transforms change across runs: generating random GUIDs, embedding DateTime.UtcNow, or using auto-incrementing counters.

If a retry generates different values than the original run, merges and comparisons break.

The rule is simple: identity-defining values must be deterministic. If timestamps are needed, compute them once at ingestion and persist them, rather than regenerating them during retries.
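The contrast is easy to demonstrate. The sketch below is plain Python, with SHA-256 standing in for any stable hash function; it shows why a random GUID breaks retry identity while a canonicalized key does not:

```python
import hashlib
import uuid

def non_deterministic_id(record: dict) -> str:
    # A fresh random GUID every run: retries mint new identities
    return str(uuid.uuid4())

def deterministic_id(record: dict) -> str:
    # Same input always yields the same identity, so retries are harmless
    canonical = "|".join([
        record["order_number"].strip().upper(),
        record["customer_id"].strip().upper(),
        record["order_date"],  # already a fixed yyyy-MM-dd string
    ])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

record = {"order_number": "ord-001 ", "customer_id": "C42", "order_date": "2025-01-15"}

# Simulate an original run and a retry over the same input
assert deterministic_id(record) == deterministic_id(record)
assert non_deterministic_id(record) != non_deterministic_id(record)
```

If the identity came from `non_deterministic_id`, a retried batch would merge under new keys and duplicate every row.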

2.2 Practical Patterns for Idempotent Writes

Storage-level idempotency is where most recovery logic lives. In Azure-based lakehouse architectures, Delta Lake provides the core building block: transactional MERGE.

But idempotency depends on key design and clean source data.

2.2.1 The Upsert (Merge) Pattern in Delta Lake

An idempotent write typically uses MERGE to upsert records based on a stable key.

from delta.tables import DeltaTable

incoming = spark.read.format("parquet").load("/landing/batch_2025_01_15")

# Important: Deduplicate source before merge
incoming_deduped = (
    incoming
    .dropDuplicates(["IdentityKey"])
)

deltaTable = DeltaTable.forPath(spark, "/curated/orders")

(
    deltaTable.alias("t")
    .merge(
        source=incoming_deduped.alias("s"),
        condition="t.IdentityKey = s.IdentityKey"
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

Two important caveats:

  1. Deduplicate the source before merging. Delta MERGE is idempotent only if the source dataset has a single row per key. Duplicate keys in the source can cause non-deterministic updates or runtime failures.

  2. Ensure the merge condition is stable. If the identity column changes between runs, retries will insert duplicates instead of updating existing rows.

Running this merge multiple times with the same input produces the same final state. This pattern is referenced throughout the article wherever idempotent merges are needed.

2.2.2 Natural Keys vs. Synthetic Hash Keys (using System.IO.Hashing)

Natural keys like email addresses or document numbers are tempting merge keys. But they often contain inconsistencies: case sensitivity differences, trailing spaces, and format changes over time.

Synthetic keys solve this by computing a stable hash from selected identity fields.

In .NET, System.IO.Hashing provides high-performance non-cryptographic hashing:

using System.IO.Hashing;
using System.Text;

public static string ComputeRecordHash(string canonicalIdentity)
{
    var bytes = Encoding.UTF8.GetBytes(canonicalIdentity);
    return Convert.ToHexString(XxHash128.Hash(bytes));
}

The key idea is that canonicalIdentity must be deterministic, normalized, and composed only of stable fields. If any non-repeatable value is included, retries will generate different hashes and break idempotency.
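The same idea in illustrative Python (`hashlib.sha256` standing in for XxHash128, which has no stdlib equivalent): messy natural-key variants normalize into one synthetic key.

```python
import hashlib

def compute_record_hash(order_number: str, customer_id: str, order_date: str) -> str:
    # Normalize before hashing: trim whitespace, fold case, fixed date format
    canonical = "|".join([
        order_number.strip().upper(),
        customer_id.strip().upper(),
        order_date,
    ])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two messy variants of the same logical record collapse to one key
a = compute_record_hash("ORD-001", "c42", "2025-01-15")
b = compute_record_hash("  ord-001  ", "C42 ", "2025-01-15")
assert a == b
```

Without the normalization step, the trailing space and case difference alone would produce two distinct merge keys and, eventually, duplicate rows.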

2.3 Implementation Example: A .NET Service for Consistent Record Hashing

A tempting shortcut for building the hash input is JsonSerializer.Serialize(object). That approach has a subtle flaw: property order is not guaranteed across runtime types or polymorphic objects. That can produce different hashes for logically identical records.

To make hashing reliable, we must explicitly select and order fields.

Here’s a safer implementation using a strongly-typed DTO and explicit canonicalization:

public sealed record OrderIdentityDto(
    string OrderNumber,
    string CustomerId,
    DateOnly OrderDate
);

public interface IRecordIdentityService<T>
{
    string CreateIdentityHash(T record);
}

public sealed class OrderIdentityService : IRecordIdentityService<OrderIdentityDto>
{
    public string CreateIdentityHash(OrderIdentityDto record)
    {
        // Canonical string with explicit ordering
        var canonical = string.Join("|",
            record.OrderNumber.Trim().ToUpperInvariant(),
            record.CustomerId.Trim().ToUpperInvariant(),
            record.OrderDate.ToString("yyyy-MM-dd")
        );

        var bytes = Encoding.UTF8.GetBytes(canonical);
        return Convert.ToHexString(XxHash128.Hash(bytes));
    }
}

This avoids serializer ordering issues entirely. The canonical representation is field-ordered, normalized, and culture-independent.

How the Hash Reaches Delta Lake

The flow must be explicit.

  1. The .NET ingestion service computes IdentityKey.
  2. It writes records (including IdentityKey) to a Parquet landing file or directly to a staging Delta table.
  3. A Fabric notebook or Spark job reads that dataset and uses IdentityKey as the MERGE key.

Example landing model:

public sealed record OrderLandingRecord(
    string IdentityKey,
    string OrderNumber,
    string CustomerId,
    decimal Amount,
    DateTime EventTimeUtc
);

When Spark reads this landing file, IdentityKey becomes the deterministic merge condition.

Without this explicit hand-off, hashing and merging remain disconnected concepts.

2.4 Scenario: Recovering from a Mid-Batch Failure without Duplicates

Let’s make the failure scenario concrete.

Assume:

  • 500,000 records in a batch
  • records written to /landing/batch_2025_01_15
  • Spark merges into /curated/orders
  • failure occurs after 300,000 records are written

Because the merge uses IdentityKey, recovery is straightforward. On restart, the same merge from Section 2.2.1 is re-executed against the full landing dataset.

The 300,000 previously written records already exist in the curated table with the same IdentityKey. They are updated (effectively no-op if data is identical). The remaining 200,000 are inserted.

No truncation. No manual cleanup. No duplicate rows.

If the landing dataset contained duplicate IdentityKey values and we skipped deduplication, the merge could fail or produce unpredictable results. Source deduplication is not optional—it is part of idempotent design.

This is the “play-it-again” principle in practice. When keys are deterministic and merges are stable, retries become safe. And when retries are safe, pipeline recovery stops being a crisis and becomes routine behavior.
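A toy in-memory model makes the recovery property concrete. Here a plain dict stands in for the curated Delta table, and upsert-by-key stands in for MERGE:

```python
def merge_batch(table: dict, batch: list) -> None:
    # Upsert by IdentityKey: existing rows are overwritten, new rows inserted
    for row in batch:
        table[row["identity_key"]] = row

batch = [{"identity_key": f"k{i}", "amount": i} for i in range(10)]

curated: dict = {}

# First run "crashes" after 6 of 10 records were durably written
merge_batch(curated, batch[:6])
assert len(curated) == 6

# Recovery: replay the FULL batch against the same table
merge_batch(curated, batch)

# Final state is one row per key, exactly as if the job had run once
assert len(curated) == 10
assert curated["k3"]["amount"] == 3
```

Append-based writes would leave 16 rows after the replay; upsert-by-key leaves 10.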


3 Schema Evolution and Drift Management

Upstream systems change. They always do. A partner adds a new column. An internal API renames a field. A data type changes from integer to string because a special value appears. These changes are normal in distributed systems.

What determines whether your pipeline survives them is not whether change happens, but how you handle it.

Schema evolution and schema enforcement are not opposites. They are tools applied at different layers of the architecture. The key is knowing where to allow flexibility and where to enforce contracts.

3.1 The Taxonomy of Schema Change

Not all schema changes are equal. Before deciding how to handle a change, classify it correctly.

3.1.1 Additive vs. Breaking Changes

Additive changes — adding a new optional column, introducing a new nested field, expanding an enum domain, adding nullable metadata. These are usually safe and can often be handled automatically.

Breaking changes — renaming or removing columns, changing identity or key fields, changing semantics (e.g., Amount now means net instead of gross), or changing data type incompatibly. These require coordination and impact downstream models, dashboards, and contracts.

Treating all schema changes as equal is a mistake. Additive evolution should be easy. Breaking changes should be controlled.
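One way to make the classification executable is a small schema-diff check. This is an illustrative sketch only: schemas are name-to-type dicts, and any type change is treated as breaking, ignoring the widening nuance covered in the next section.

```python
def classify_schema_change(old: dict, new: dict) -> str:
    """Classify a schema diff as 'additive', 'breaking', or 'unchanged'.
    Schemas are {column_name: type_name} dicts."""
    removed = old.keys() - new.keys()
    retyped = {c for c in old.keys() & new.keys() if old[c] != new[c]}
    if removed or retyped:
        return "breaking"
    if new.keys() - old.keys():
        return "additive"
    return "unchanged"

old = {"OrderNumber": "string", "Amount": "decimal"}

assert classify_schema_change(old, {**old, "Channel": "string"}) == "additive"
assert classify_schema_change(old, {"OrderNumber": "string"}) == "breaking"
assert classify_schema_change(old, {"OrderNumber": "string", "Amount": "string"}) == "breaking"
```

A check like this, run before any write, turns "the merge failed in production" into "the pipeline rejected a breaking change at ingestion."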

3.1.2 Data Type Widening vs. Narrowing (Delta-Specific Behavior)

In Delta Lake, type compatibility follows Spark’s type system rules.

| Direction | Example | Behavior |
| --- | --- | --- |
| Widening | IntegerType → LongType | Allowed with mergeSchema |
| Widening | FloatType → DoubleType | Allowed with mergeSchema |
| Widening | Increasing DecimalType precision | Allowed with mergeSchema |
| Narrowing | LongType → IntegerType | Rejected — data loss risk |
| Incompatible | IntegerType → StringType | Rejected — semantic change |

For narrowing or incompatible conversions, Delta fails the write even when mergeSchema is enabled. This is correct behavior—automatic narrowing would corrupt data.

The key takeaway: widening is manageable. Narrowing requires migration logic.

3.2 Automated Schema Evolution in Delta Lake and Azure Fabric

Delta Lake provides two main mechanisms for schema evolution: mergeSchema and autoMerge. These should be used intentionally.

3.2.1 Using mergeSchema and autoMerge Options

For append operations:

df.write \
  .format("delta") \
  .option("mergeSchema", "true") \
  .mode("append") \
  .save(targetPath)

For merge operations:

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

(
    deltaTable.alias("t")
    .merge(
        source=incoming.alias("s"),
        condition="t.IdentityKey = s.IdentityKey"
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

When enabled, additive columns in incoming are automatically added to the Delta table schema.

But this should not be turned on everywhere.

When to Enforce vs. When to Evolve

A practical decision framework:

| Layer | Strategy | Rationale |
| --- | --- | --- |
| Bronze | Enforce + rescued column | Catch drift early, preserve raw |
| Silver | Auto-merge additive only | Controlled evolution |
| Gold | Strict enforcement | Stable business contracts |

Bronze layer: Enforce schema strictly, but capture unexpected fields in a rescue column. This prevents silent drift.

Silver layer: Allow additive evolution through mergeSchema, but block breaking changes.

Gold layer: Enforce strictly. Gold datasets are business-facing and must behave like contracts.

This layered approach keeps flexibility at ingestion while protecting curated outputs.

3.2.2 Schema Enforcement: Preventing Data Corruption at the Bronze Layer

Schema enforcement ensures that unexpected changes don’t silently alter meaning.

Example: If an upstream API changes amount from numeric to string because it sometimes sends "N/A", blindly accepting that change could poison analytics.

In Bronze:

  • Reject incompatible schema changes.
  • Route problematic records to a quarantine path or DLQ.
  • Preserve the raw payload for investigation.

A failed schema write in Bronze is a signal—not a failure to suppress.

3.3 Handling Drift with Azure Data Factory (ADF) Mapping Data Flows

ADF Mapping Data Flows provide built-in schema drift handling.

3.3.1 Building “Schema-Agnostic” Ingestion Pipelines

In the ADF UI, for a Source transformation:

  1. Enable Allow schema drift.
  2. Enable Infer drifted column types (optional but recommended).
  3. In the Projection tab, select Auto mapping or use * in derived expressions.

The key properties in the underlying Data Flow definition are:

  • allowSchemaDrift: true — passes unexpected columns downstream
  • validateSchema: false — prevents immediate failure on mismatch

This configuration is ideal for Bronze ingestion from semi-structured or partner sources.

3.3.2 The “Rescued Data Column” Pattern for Unexpected Fields

When drift is allowed, you often don’t want unknown columns polluting curated layers.

A common pattern is to consolidate unexpected fields into a JSON column such as _rescued_data. In Mapping Data Flow, use a Derived Column transformation to create _rescued_data and capture drifted columns using rule-based mapping.

This allows:

  • preserving unexpected fields
  • preventing silent data loss
  • deferring schema updates until reviewed

Later, if a new field becomes officially supported, it can be promoted from _rescued_data into structured columns.
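The pattern can be sketched in a few lines of Python: known columns pass through, and anything drifted is parked as JSON in a hypothetical _rescued_data field:

```python
import json

# Hypothetical curated schema for this sketch
KNOWN_COLUMNS = {"order_number", "customer_id", "amount"}

def apply_rescue_pattern(record: dict) -> dict:
    # Keep known columns; park unexpected fields in _rescued_data as JSON
    known = {k: v for k, v in record.items() if k in KNOWN_COLUMNS}
    drifted = {k: v for k, v in record.items() if k not in KNOWN_COLUMNS}
    known["_rescued_data"] = json.dumps(drifted, sort_keys=True) if drifted else None
    return known

row = apply_rescue_pattern({
    "order_number": "ORD-001",
    "customer_id": "C42",
    "amount": 99.5,
    "loyalty_tier": "gold",   # drifted column added upstream
})

assert row["_rescued_data"] == '{"loyalty_tier": "gold"}'
assert "loyalty_tier" not in row
```

Nothing is lost, nothing pollutes the curated schema, and promotion of `loyalty_tier` to a real column becomes a deliberate, reviewed change.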

3.4 .NET Interaction with Delta: Practical and Production-Ready Approaches

As of early 2026, delta-rs .NET bindings are still experimental and not widely adopted in production Azure data platforms. A more proven pattern is:

  1. Use .NET to write Parquet files to OneLake or ADLS Gen2.
  2. Use Fabric or Spark to handle Delta transactions, merges, and schema evolution.

Example using Parquet.Net:

using Parquet;
using Parquet.Data;
using Parquet.Schema;

public async Task WriteParquetAsync(IEnumerable<OrderLandingRecord> records, string path)
{
    // Materialize once to avoid multiple enumeration of the source
    var rows = records.ToList();

    var fields = new DataField[]
    {
        new DataField<string>("IdentityKey"),
        new DataField<string>("OrderNumber"),
        new DataField<decimal>("Amount")
    };

    using var stream = File.Create(path);
    using var writer = await ParquetWriter.CreateAsync(new ParquetSchema(fields), stream);

    using (var rowGroup = writer.CreateRowGroup())
    {
        await rowGroup.WriteColumnAsync(new DataColumn(fields[0], rows.Select(r => r.IdentityKey).ToArray()));
        await rowGroup.WriteColumnAsync(new DataColumn(fields[1], rows.Select(r => r.OrderNumber).ToArray()));
        await rowGroup.WriteColumnAsync(new DataColumn(fields[2], rows.Select(r => r.Amount).ToArray()));
    }
}

Fabric or Spark then reads these Parquet files and performs:

  • schema validation
  • schema evolution via mergeSchema
  • transactional MERGE into Delta tables

This division of responsibility keeps:

  • .NET focused on deterministic ingestion
  • Spark/Fabric focused on transactional consistency

For reliability, use mature tooling for transactional operations. Let .NET produce clean, well-defined landing data. Let Delta handle state transitions.


4 Exactly-Once Processing Semantics

Exactly-once processing sounds straightforward: a message is processed one time and only one time. In distributed Azure systems, that guarantee rarely exists in a literal sense.

Event producers retry. Consumers restart. Network partitions occur. Checkpoints may be written before or after durable storage. Once multiple systems participate—Event Hubs, Azure Functions, Delta Lake—the guarantee weakens.

The practical goal is not literal exactly-once delivery. The goal is exactly-once outcomes. That means duplicates may occur, but the final state remains correct. Achieving that outcome requires combining at-least-once delivery with deterministic and idempotent writes.

4.1 The Fallacy of “Exactly-Once” in Distributed Systems

Messaging systems provide delivery guarantees within defined boundaries. Azure Event Hubs provides at-least-once delivery. Azure Service Bus supports more advanced patterns but still cannot guarantee global exactly-once behavior across independent systems.

The weak point is always coordination:

  • A producer may retry after a timeout.
  • A consumer may crash after writing to storage but before checkpointing.
  • A downstream write may succeed while offset tracking fails.

In practice, reliable data platforms accept at-least-once delivery and push correctness into application logic and storage semantics. That is why idempotent Delta merges and deterministic identity keys matter. They turn repeated delivery into harmless repetition.

4.2 Achieving “Effective Exactly-Once” via At-Least-Once + Idempotency

Effective exactly-once behavior is built from two pieces:

  1. At-least-once delivery from the messaging system
  2. Idempotent writes at the storage layer

When a duplicate event arrives, the system does not panic. It computes the same identity key and performs a merge. If the record already exists, it updates or no-ops.

For example, if an order event arrives three times:

  • The .NET consumer computes the same IdentityKey.
  • The Delta MERGE matches on that key.
  • The final row state remains consistent.

This model supports retries, scaling, and redeployments without introducing corruption. The infrastructure does not promise perfection. The design absorbs imperfection.

4.3 Azure Event Hubs and Service Bus Integration

In Azure-based ETL systems, Event Hubs often acts as the ingestion backbone for telemetry, domain events, and operational signals. Service Bus is commonly used for command workflows and transactional boundaries.

The key design rule is simple:

Only checkpoint after durable, idempotent storage.

If the consumer crashes before checkpointing, the message is replayed. If the write was idempotent, replay causes no harm.
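A toy simulation of the ordering rule, with a dict standing in for durable storage and an integer offset standing in for the blob checkpoint:

```python
class Consumer:
    """Toy consumer: durable idempotent write first, checkpoint second."""

    def __init__(self):
        self.table = {}        # stands in for the Delta table
        self.checkpoint = -1   # last committed offset

    def process(self, offset: int, event: dict, crash_before_checkpoint: bool = False):
        self.table[event["key"]] = event["value"]   # idempotent upsert
        if crash_before_checkpoint:
            raise RuntimeError("crash after write, before checkpoint")
        self.checkpoint = offset

c = Consumer()
try:
    c.process(0, {"key": "k1", "value": 10}, crash_before_checkpoint=True)
except RuntimeError:
    pass

# Restart: delivery resumes from the old checkpoint, so offset 0 replays
assert c.checkpoint == -1
c.process(0, {"key": "k1", "value": 10})

assert c.checkpoint == 0
assert c.table == {"k1": 10}   # replay caused no duplicate state
```

Reversing the order (checkpoint before write) would have advanced the offset past an event that never reached storage, losing it silently.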

4.3.1 Managing Consumer Offsets and Checkpointing in C#

The modern pattern for consuming Event Hubs in .NET uses EventProcessorClient with BlobCheckpointStore:

var blobClient = new BlobContainerClient(storageConnection, "checkpoints");

var processor = new EventProcessorClient(
    blobClient,
    EventHubConsumerClient.DefaultConsumerGroupName,
    eventHubConnectionString,
    eventHubName);

processor.ProcessEventAsync += async args =>
{
    var order = JsonSerializer.Deserialize<OrderEvent>(
        args.Data.EventBody.ToString());

    // Deterministic identity generation
    var identityKey = _identityService.CreateIdentityHash(order);

    var success = await _deltaWriter.MergeAsync(order, identityKey);

    if (success)
    {
        // Only checkpoint after durable write
        await args.UpdateCheckpointAsync(args.CancellationToken);
    }
};

processor.ProcessErrorAsync += args =>
{
    Console.WriteLine($"Error: {args.Exception}");
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();

Important behavior:

  • If _deltaWriter.MergeAsync fails, we do not checkpoint.
  • The event will be replayed.
  • Because the merge is idempotent, replay is safe.

This pattern creates at-least-once delivery with exactly-once outcomes. The full flow—Event Hubs to .NET to hash to Delta Lake—uses the same idempotent merge from Section 2.2.1, with source deduplication ensuring deterministic behavior even when the consumer replays events.

4.3.2 Using the Transactional Outbox Pattern in .NET Data Services

When publishing events from a database-backed service, you must ensure the database update and message publication are consistent. The transactional outbox pattern stores events in an outbox table within the same database transaction.

Producer side:

public async Task UpdateOrderAsync(Order order)
{
    using var tx = await _db.Database.BeginTransactionAsync();

    _db.Orders.Update(order);

    _db.OutboxMessages.Add(new OutboxMessage
    {
        Id = Guid.NewGuid(),
        Payload = JsonSerializer.Serialize(order),
        CreatedUtc = DateTime.UtcNow,
        Dispatched = false
    });

    await _db.SaveChangesAsync();
    await tx.CommitAsync();
}

The publisher:

public sealed class OutboxPublisher : BackgroundService
{
    // BackgroundService is a singleton, but DbContext is scoped,
    // so resolve a fresh context per iteration via IServiceScopeFactory.
    private readonly IServiceScopeFactory _scopeFactory;
    private readonly EventHubProducerClient _producer;

    public OutboxPublisher(IServiceScopeFactory scopeFactory, EventHubProducerClient producer)
    {
        _scopeFactory = scopeFactory;
        _producer = producer;
    }

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            using var scope = _scopeFactory.CreateScope();
            var db = scope.ServiceProvider.GetRequiredService<AppDbContext>();

            var pending = await db.OutboxMessages
                .Where(x => !x.Dispatched)
                .OrderBy(x => x.CreatedUtc)
                .Take(100)
                .ToListAsync(ct);

            foreach (var message in pending)
            {
                var eventData = new EventData(
                    Encoding.UTF8.GetBytes(message.Payload));

                await _producer.SendAsync(new[] { eventData }, ct);

                // Mark and persist per message so a crash mid-batch
                // re-sends at most the in-flight message
                message.Dispatched = true;
                await db.SaveChangesAsync(ct);
            }

            await Task.Delay(TimeSpan.FromSeconds(5), ct);
        }
    }
}

If publishing fails, Dispatched remains false and the message is retried. If the publisher crashes mid-send, idempotent downstream merges absorb duplicates.

Frameworks like MassTransit and Wolverine provide built-in outbox implementations if you prefer not to build this manually.
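The moving parts of the outbox reduce to a small simulation: one local "transaction" writes the business row and the outbox row together, and a sweeper retries undispatched messages. Illustrative Python, with a deliberately flaky send:

```python
def update_order(db: dict, order: dict) -> None:
    # One local "transaction": business row and outbox row change together
    db["orders"][order["id"]] = order
    db["outbox"].append({"payload": order, "dispatched": False})

def publish_pending(db: dict, send) -> None:
    # Sweeper drains the outbox; a failed send leaves dispatched=False
    for msg in db["outbox"]:
        if msg["dispatched"]:
            continue
        try:
            send(msg["payload"])
            msg["dispatched"] = True
        except ConnectionError:
            pass  # retried on the next sweep

db = {"orders": {}, "outbox": []}
update_order(db, {"id": "ORD-1", "amount": 50})

attempts = []
def flaky_send(payload):
    attempts.append(payload)
    if len(attempts) == 1:
        raise ConnectionError("broker unavailable")

publish_pending(db, flaky_send)            # first sweep fails
assert db["outbox"][0]["dispatched"] is False

publish_pending(db, flaky_send)            # retry succeeds
assert db["outbox"][0]["dispatched"] is True
```

Note that a crash between a successful send and the `dispatched = True` flag still produces a duplicate publish; that is exactly why the downstream merge must stay idempotent.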

4.4 Distributed Transactions vs. Saga Patterns for Data Consistency

Two-phase commit across SQL, Event Hubs, and Delta Lake is theoretically possible but operationally expensive and rarely used in modern cloud architectures. Instead, use local transactions and compensating logic.

Consider a two-step pipeline: enrich an order with an external pricing service, then write the enriched order to a Delta table. If enrichment succeeds but the Delta write fails, a saga sends the enriched record to a recovery queue:

public async Task ProcessOrderAsync(OrderEvent evt)
{
    var enriched = await _pricingService.EnrichAsync(evt);

    try
    {
        await _deltaWriter.MergeAsync(enriched);
    }
    catch
    {
        // Compensation: move event to recovery queue
        await _serviceBusSender.SendMessageAsync(
            new ServiceBusMessage(JsonSerializer.Serialize(enriched)));

        throw;
    }
}

A separate worker retries the Delta merge from the recovery queue. This is a saga in practice: forward action + compensating path. Sagas align well with Azure services such as Service Bus DLQs, retry policies, and idempotent Delta merges.
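The forward action plus compensating path can be sketched in Python, with a list standing in for the Service Bus recovery queue and the enrichment result hard-coded for illustration:

```python
def process_order(event: dict, delta_write, recovery_queue: list) -> None:
    # Forward action: enrichment (hard-coded here for illustration)
    enriched = {**event, "unit_price": 9.99}
    try:
        delta_write(enriched)
    except IOError:
        # Compensation: park the enriched record for a retry worker
        recovery_queue.append(enriched)
        raise

table, queue = {}, []

def failing_write(rec):
    raise IOError("storage timeout")

def working_write(rec):
    table[rec["order_number"]] = rec

try:
    process_order({"order_number": "ORD-1"}, failing_write, queue)
except IOError:
    pass

assert len(queue) == 1          # compensation path captured the record

# A separate worker later replays the merge from the recovery queue
working_write(queue.pop())
assert "ORD-1" in table and not queue
```

The enriched payload is preserved, so the retry worker never has to call the pricing service again; the saga only repeats the step that failed.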


5 Managing Late-Arriving Data and Replayability

In real systems, data does not arrive in order. Some events are delayed by minutes. Others appear days later due to upstream retries, offline processing, or integration gaps. If your pipeline assumes that “today’s data contains only today’s events,” it will eventually break.

Late-arriving data is not an edge case. It is a normal operating condition.

The difference between a fragile and a resilient pipeline is how it handles the time gap between when something happened and when the system processes it.

5.1 The Temporal Gap: Event Time vs. Processing Time

Every record has at least two timestamps:

  • Event time — when the event occurred in the source system
  • Processing time — when the pipeline received or processed the record

If you build aggregates based on processing time, your metrics will drift whenever late records arrive. A sale that occurred last month but arrived today will inflate current metrics unless you correct for it.

The solution is straightforward:

  • Persist the original event timestamp.
  • Partition curated tables by event time.
  • Use event time—not processing time—for business logic.

When a late record arrives, it should be written into the partition corresponding to its event date. That partition may already exist. That’s fine. The system must be able to update it safely.

This is where the idempotent merge strategy from Section 2 becomes essential.
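A minimal sketch of event-time routing, with a dict-of-dicts standing in for date-partitioned storage:

```python
from collections import defaultdict

def write_to_partition(partitions: dict, record: dict) -> None:
    # Partition by EVENT date, never by the date we happened to process it
    partitions[record["event_date"]][record["key"]] = record

partitions = defaultdict(dict)

write_to_partition(partitions, {"key": "a", "event_date": "2025-01-15", "amount": 10})
# A late arrival processed today still lands in last month's partition
write_to_partition(partitions, {"key": "b", "event_date": "2024-12-03", "amount": 7})

assert set(partitions) == {"2025-01-15", "2024-12-03"}
assert partitions["2024-12-03"]["b"]["amount"] == 7
```

Keying on processing time instead would have dropped the December sale into January's partition and quietly inflated the current month's metrics.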

5.2 Watermarking Strategies for Real-time and Batch Pipelines

A watermark defines how long the system waits before considering a time window “complete.” It represents the acceptable lateness threshold.

For example:

  • IoT telemetry: 2–5 minutes
  • E-commerce transactions: 1–2 hours
  • Financial settlement systems: 1–2 days

The watermark is not about correctness alone. It’s about trade-offs between timeliness and completeness.

5.2.1 Watermarks with Spark Structured Streaming (Fabric-Aligned Approach)

In modern Azure data architectures using Fabric or Spark, Structured Streaming is the preferred mechanism for handling late data.

from pyspark.sql.functions import window, from_json

stream = (
    spark.readStream
        .format("eventhubs")
        .options(**ehConf)
        .load()
)

parsed = (
    stream
        .selectExpr("cast(body as string) as json")
        .select(from_json("json", schema).alias("data"))
        .select("data.*")
)

aggregated = (
    parsed
        .withWatermark("EventTime", "10 minutes")
        .groupBy(
            window("EventTime", "5 minutes"),
            "DeviceId"
        )
        .count()
)

query = (
    aggregated
        .writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/checkpoints/deviceAgg")
        .start("/curated/deviceAgg")
)

withWatermark("EventTime", "10 minutes") tells Spark to accept events arriving up to 10 minutes late; events later than that are dropped from the aggregation and must be routed through a separate late-data path if they matter. This keeps aggregates consistent while preventing unbounded state growth.

Tracking Watermarks in .NET Batch Pipelines

Batch pipelines also need watermark logic. A simple approach is to persist the last processed event time in a checkpoint table and use it to filter incoming records:

public async Task<DateTime> GetWatermarkAsync(string pipelineName)
{
    var checkpoint = await _db.PipelineCheckpoints
        .FirstOrDefaultAsync(x => x.PipelineName == pipelineName);

    return checkpoint?.LastProcessedEventTime ?? DateTime.MinValue;
}

public async Task RunBatchAsync(string pipelineName, IEnumerable<OrderRecord> incoming)
{
    var watermark = await GetWatermarkAsync(pipelineName);
    var newRecords = incoming.Where(x => x.EventTime > watermark).ToList();

    if (newRecords.Count == 0)
        return; // nothing new; leave the watermark untouched

    // Process newRecords...

    var maxEventTime = newRecords.Max(x => x.EventTime);

    var checkpoint = await _db.PipelineCheckpoints
        .FirstAsync(x => x.PipelineName == pipelineName);

    checkpoint.LastProcessedEventTime = maxEventTime;
    checkpoint.UpdatedUtc = DateTime.UtcNow;
    await _db.SaveChangesAsync();
}

This ensures that re-running the job does not reprocess already committed data. Combined with idempotent merges, even overlapping windows are safe.

5.2.2 Late Arrivals + Idempotent MERGE

When a late record arrives, it carries an older EventDate. The partition for that date may already exist. The record is merged using the deterministic IdentityKey via the same merge pattern from Section 2.2.1.

Because the merge key is deterministic:

  • If the record was already processed, it updates safely.
  • If it was missing, it inserts correctly.
  • No duplicate rows appear.

Late data handling is not a separate mechanism. It relies on the same idempotent merge pattern used for normal processing.
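The effect of a deterministic merge key can be shown with a minimal in-memory sketch (plain Python; the field names are illustrative, not the article's schema). The identity key is a hash of stable business fields, and the merge is an upsert keyed on it, so replaying the same record is a safe update rather than a duplicate insert:

```python
import hashlib
import json

def identity_key(record: dict) -> str:
    """Deterministic key derived from stable business fields
    (field names here are illustrative)."""
    material = json.dumps(
        {k: record[k] for k in ("order_id", "event_time")},
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()

def merge(table: dict, record: dict) -> None:
    """Upsert by identity key: update if present, insert if missing."""
    table[identity_key(record)] = record
```

Merging the same late-arriving record twice leaves exactly one row, and a correction carrying the same identity updates that row in place.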

5.3 Designing for Backfills and Re-runnability

Backfills are inevitable: transformation logic changes, reference data corrections are applied, and upstream systems fix historical errors. A resilient system must support targeted reprocessing without rewriting everything.

The key ingredients are:

  • partitioned storage
  • deterministic transforms
  • version-controlled logic

Backfills should reuse the same code paths as regular runs. Special-case scripts are how inconsistencies enter the system.

5.3.1 Partitioning Strategies (Year/Month/Day) for Targeted Replays

Partitioning curated tables by event date allows selective reprocessing.

Example layout:

/curated/orders/year=2025/month=01/day=15/

If January 15 data needs correction:

(
    df.filter("EventDate = '2025-01-15'")
      .write
      .format("delta")
      .mode("overwrite")
      .option("replaceWhere", "EventDate = '2025-01-15'")
      .save("/curated/orders")
)

replaceWhere ensures only the specified partition is rewritten. Because writes remain idempotent, re-running the same backfill multiple times yields the same result.


6 Recovery Patterns: Resilience by Design

Failures are normal. Transient network errors, throttling, partial writes, and unexpected schema changes will happen. A resilient pipeline assumes this and is built to recover automatically.

Every retry, replay, and resume operation relies on one critical guarantee from earlier sections:

The downstream write is idempotent.

Without idempotent writes (Section 2), retries amplify problems instead of fixing them.

6.1 Modern Retry Strategies for Data Pipelines

Retries are the first line of defense against transient failures. But naïve retry loops cause cascading failures, especially under load. A well-designed retry policy distinguishes between transient faults (timeouts, throttling) and systemic failures (schema mismatch, constraint violation). Retries should only target the first category.
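The transient-versus-systemic distinction can be sketched in plain Python (the exception types stand in for timeouts and throttling; they are illustrative): only faults classified as transient are retried, with exponential backoff, while everything else fails fast.

```python
import time

# Illustrative stand-ins for timeouts and throttling responses.
TRANSIENT = (TimeoutError, ConnectionError)

def run_with_retry(operation, max_attempts=5, base_delay=0.01):
    """Retry transient faults with exponential backoff; fail fast otherwise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TRANSIENT:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
        # Non-transient exceptions (schema mismatch, constraint violation)
        # propagate immediately -- retrying them cannot succeed.
```

A systemic failure such as a constraint violation escapes on the first attempt, exactly the fail-fast behavior the retry policy should preserve.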

In modern .NET (8+), the recommended approach is Microsoft.Extensions.Resilience, which integrates Polly v8’s pipeline-based API into the dependency injection system.

6.1.1 Beyond Simple Loops: Using Microsoft.Extensions.Resilience (Polly v8)

builder.Services.AddResiliencePipeline("data-writer", pipeline =>
{
    pipeline.AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 5,
        BackoffType = DelayBackoffType.Exponential,
        Delay = TimeSpan.FromSeconds(2),
        ShouldHandle = args =>
            ValueTask.FromResult(
                args.Outcome.Exception is TimeoutException ||
                args.Outcome.Exception is HttpRequestException)
    });

    pipeline.AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        SamplingDuration = TimeSpan.FromSeconds(30),
        MinimumThroughput = 10,
        BreakDuration = TimeSpan.FromSeconds(30)
    });
});

Usage:

var pipeline = resilienceProvider.GetPipeline("data-writer");

await pipeline.ExecuteAsync(async token =>
{
    await _deltaWriter.MergeAsync(record);
});

Key points:

  • Exponential backoff prevents overwhelming downstream systems.
  • Circuit breakers stop retries when failure rates spike.
  • The wrapped operation must be idempotent. Otherwise, repeated execution may corrupt data.

6.1.2 Handling Throttling vs. Hard Failures in Azure SQL and Cosmos DB

Azure services distinguish between transient and permanent errors. Cosmos DB returns HTTP 429 when rate limits are exceeded—this should be retried using the server-provided delay:

catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.TooManyRequests)
{
    // RetryAfter is nullable; fall back to a short default delay.
    await Task.Delay(ex.RetryAfter ?? TimeSpan.FromSeconds(1));
    throw; // Let resilience pipeline retry
}

Azure SQL exposes transient error numbers (for example, 40613 or 10928). These are safe to retry. But constraint violations or schema mismatches are not.

The recovery rule: retry transient faults, fail fast on data integrity issues, let idempotent storage absorb duplicates.

6.2 The Dead Letter Queue (DLQ) Triage Pattern

Not all failures are transient. Some records are malformed. Others reference missing data or violate business rules. Retrying those records indefinitely blocks progress.

A DLQ isolates problematic records while allowing the pipeline to continue.

Each DLQ message should include:

  • original payload
  • error reason
  • attempt count
  • correlation identifiers

This preserves enough context for either automated correction or human intervention.
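A DLQ envelope carrying that context might look like the following sketch (plain Python dataclass; the field names are illustrative, not a prescribed schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DlqEnvelope:
    """Context preserved alongside a dead-lettered record."""
    payload: str          # original message body, unmodified
    error_reason: str     # why processing failed
    attempt_count: int    # how many times processing was tried
    correlation_id: str   # ties the record back to its pipeline run

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

Keeping the payload unmodified is what makes later replay possible; the error reason and attempt count drive triage decisions.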

6.2.1 Designing a .NET Self-Healing Worker for DLQ Processing

A self-healing worker reads from the DLQ, attempts corrective transformations, and re-submits valid messages:

await foreach (var message in dlqReceiver.ReceiveMessagesAsync())
{
    var payload = JsonSerializer.Deserialize<OrderEvent>(message.Body.ToString());

    var corrected = await _correctionService.TryFixAsync(payload);

    if (corrected != null)
    {
        await mainSender.SendMessageAsync(
            new ServiceBusMessage(JsonSerializer.Serialize(corrected)));

        await dlqReceiver.CompleteMessageAsync(message);
    }
    else
    {
        // Leave in DLQ for manual handling
        await dlqReceiver.AbandonMessageAsync(message);
    }
}

When corrected records are resubmitted, downstream merges must be idempotent to absorb any duplicates.

For records requiring human review, a minimal API can expose DLQ entries for operational tooling:

app.MapGet("/api/dlq", async (AppDbContext db) =>
{
    return await db.DlqMessages
        .OrderByDescending(x => x.CreatedUtc)
        .Take(100)
        .Select(x => new
        {
            x.Id,
            x.ErrorReason,
            x.Payload,
            x.AttemptCount
        })
        .ToListAsync();
});

This provides visibility into failures and auditability. Operational tooling is part of resilience.

6.3 Point-of-Failure Resumption in Azure Data Factory and Fabric Pipelines

Azure Data Factory supports retry policies and activity-level fault handling. These settings should be explicitly configured.

Example activity policy in pipeline JSON:

{
  "name": "CopyOrders",
  "type": "Copy",
  "policy": {
    "timeout": "1.00:00:00",
    "retry": 2,
    "retryIntervalInSeconds": 60
  }
}

For notebook or Spark activities:

{
  "name": "MergeOrders",
  "type": "DatabricksNotebook",
  "policy": {
    "retry": 1,
    "retryIntervalInSeconds": 120
  }
}

Important behaviors:

  • Retries re-execute the activity.
  • Downstream writes must be idempotent.
  • Partial writes must not corrupt state.

For partition-based copy activities, parameterize partition ranges so each activity handles a discrete slice. If a failure occurs on a specific partition, only that slice is retried. Combined with replaceWhere in Delta, this enables safe partial recovery.
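Generating the discrete slices for such parameterized activities can be sketched like this (plain Python; the date range is illustrative):

```python
from datetime import date, timedelta

def partition_slices(start: date, end: date):
    """Yield one day slice per partition (inclusive range), so a failed
    slice can be retried in isolation without touching its neighbors."""
    current = start
    while current <= end:
        yield {"partition": current.isoformat()}
        current += timedelta(days=1)
```

Each emitted slice becomes one activity invocation; on failure, only that slice's partition is re-copied.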

Fabric pipelines follow similar principles. Notebook activities can be retried, and Delta Lake’s transactional log ensures partial merges do not corrupt tables.

The critical connection:

  • Pipeline-level retry handles orchestration failures.
  • Delta Lake handles transactional consistency.
  • Idempotent keys ensure duplicates do not accumulate.

Recovery is not a separate layer. It is an extension of idempotent design.


7 Implementation: .NET Patterns for Scalable ETL

Data workloads differ from traditional web APIs: throughput matters more than request latency, deterministic behavior matters more than convenience, and retry safety must be guaranteed. The patterns below align .NET components with the idempotency and recovery strategies discussed earlier.

7.1 Decoupling Pipeline Logic with MediatR and Clean Architecture

ETL services often become tightly coupled: ingestion, transformation, validation, and storage all live in one class. Using MediatR with Clean Architecture separates responsibilities. Each pipeline step becomes a request handler:

public record IngestOrderCommand(OrderRecord Record) : IRequest<Result>;

public sealed class IngestOrderHandler
    : IRequestHandler<IngestOrderCommand, Result>
{
    private readonly IRecordIdentityService _identity;
    private readonly IDeltaLandingWriter _landingWriter;

    public IngestOrderHandler(
        IRecordIdentityService identity,
        IDeltaLandingWriter landingWriter)
    {
        _identity = identity;
        _landingWriter = landingWriter;
    }

    public async Task<Result> Handle(
        IngestOrderCommand request,
        CancellationToken ct)
    {
        var key = _identity.CreateIdentityHash(request.Record);

        var landing = request.Record with { IdentityKey = key };

        await _landingWriter.WriteAsync(landing, ct);

        return Result.Success();
    }
}

Benefits:

  • Handlers are unit-testable.
  • Cross-cutting concerns (validation, logging, tracing) are pipeline behaviors.
  • Business logic is isolated from transport concerns (Event Hubs, HTTP, etc.).

7.2 Observability with OpenTelemetry

Observability ties together Azure Functions, ADF pipelines, Spark jobs, and downstream systems. OpenTelemetry propagates distributed traces:

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing.AddAspNetCoreInstrumentation();
        tracing.AddHttpClientInstrumentation();
        tracing.AddSource("OrderPipeline");
    });

When deployed as an Azure Function, traces flow into Application Insights automatically. When invoked from ADF, correlation IDs can be passed via custom headers to maintain cross-service trace continuity.

7.3 FluentValidation for Schema-on-Read Quality Gates

Records that violate schema expectations must not silently pass through. FluentValidation integrates cleanly with MediatR:

public sealed class OrderRecordValidator
    : AbstractValidator<OrderRecord>
{
    public OrderRecordValidator()
    {
        RuleFor(x => x.OrderId).NotEmpty();
        RuleFor(x => x.Amount).GreaterThan(0);
        RuleFor(x => x.CustomerEmail).EmailAddress();
    }
}

Instead of throwing and losing the record, integrate with the DLQ strategy from Section 6:

var validationResult = _validator.Validate(order);

if (!validationResult.IsValid)
{
    await _dlqSender.SendMessageAsync(
        new ServiceBusMessage(
            JsonSerializer.Serialize(new
            {
                order,
                Errors = validationResult.Errors
            })));

    return;
}

Validation, DLQs, and idempotent writes work together. Validation prevents corrupted records from reaching curated layers. DLQs preserve failed records. Idempotent merges ensure replay is safe.


8 Governance and Observability: The Long-Term View

Building a resilient ETL pipeline is only half the job. Keeping it reliable over months and years requires governance and observability.

Governance answers:

  • Who owns this dataset?
  • What schema changes are allowed?
  • Who depends on this table?

Observability answers:

  • Did today’s load succeed?
  • Did row counts change unexpectedly?
  • Are we processing data later than expected?

Together, these form the operating model for a long-lived Azure data platform. Without them, even well-designed idempotent systems drift into uncertainty.

8.1 Data Quality Monitoring (DQM) Frameworks

Earlier (Section 7.3), we discussed FluentValidation for record-level schema gates—a pass/fail mechanism applied per record. Data Quality Monitoring (DQM) is different.

DQM operates at the batch or dataset level. It answers questions like: Is today’s row count within 10% of yesterday’s? Is the null ratio for CustomerId below 2%? Did the average order value suddenly spike?

A structured DQM approach typically includes:

  1. Row-count validation across partitions
  2. Null ratio and completeness thresholds
  3. Freshness checks (max event time vs. current time)
  4. Distribution monitoring for numeric columns

These checks run after ingestion and before promotion to higher layers (for example, Silver → Gold).
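Two of those checks, row-count deviation and freshness, can be sketched as batch-level functions (plain Python; the default thresholds are illustrative):

```python
from datetime import datetime, timedelta

def row_count_ok(today: int, yesterday: int, tolerance: float = 0.10) -> bool:
    """Pass if today's row count is within +/- tolerance of yesterday's."""
    if yesterday == 0:
        return today == 0
    return abs(today - yesterday) / yesterday <= tolerance

def fresh_enough(max_event_time: datetime, now: datetime,
                 max_lag: timedelta = timedelta(hours=2)) -> bool:
    """Pass if the newest event is no older than max_lag behind now."""
    return now - max_event_time <= max_lag
```

Both run over aggregates of the batch, not individual records, which is what distinguishes them from the record-level validation in Section 7.3.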

8.1.1 Implementing Batch-Level Expectations in .NET

Unlike FluentValidation, expectations operate over aggregates:

public sealed class NullRatioExpectation
{
    public ExpectationResult Evaluate(
        IEnumerable<OrderRecord> batch,
        double maxNullRatio)
    {
        var records = batch.ToList(); // Avoid enumerating the source twice

        if (records.Count == 0)
        {
            return ExpectationResult.Fail("Batch is empty");
        }

        var nullCount = records.Count(x => string.IsNullOrEmpty(x.CustomerId));
        var ratio = (double)nullCount / records.Count;

        return ratio <= maxNullRatio
            ? ExpectationResult.Pass()
            : ExpectationResult.Fail(
                $"Null ratio {ratio:P} exceeds {maxNullRatio:P}");
    }
}

If expectations fail: do not promote data to Gold, emit structured telemetry, and optionally route to a governance review workflow.

Record-level validation (FluentValidation) prevents bad rows. Batch-level expectations detect systemic anomalies. Both are required.

8.2 End-to-End Lineage Tracking with Microsoft Purview

Microsoft Purview provides metadata scanning and lineage tracking across Azure services. In the context of this article’s running example, when properly configured, Purview displays lineage as a graph:

Event Hub → Parquet Landing → Delta Table (Silver) → Delta Table (Gold) → Power BI Dataset

Each node is clickable. You can see:

  • Schema definitions
  • Column-level lineage
  • Last refresh times
  • Downstream dependencies

To enable lineage:

  1. Register the ADLS Gen2 or OneLake storage account in Purview.
  2. Create a scan targeting the Delta Lake folder.
  3. Enable ADF and Fabric pipeline scanning.

Once enabled, Purview automatically detects Delta tables and builds lineage when ADF or Fabric pipelines run.

Operational value:

  • Before changing a schema, you can see which Power BI datasets depend on it.
  • When a pipeline fails, you can assess impact immediately.
  • When implementing schema evolution (Section 3), you can validate downstream contracts.

Lineage is not documentation. It is a live dependency graph.

8.3 Conclusion: Managing State in Azure Data Platforms

Reliable ETL pipelines are not built by accident. They are engineered.

Across this article, the pattern is consistent:

  • Idempotent writes make retries safe.
  • Schema discipline prevents silent corruption.
  • Watermarks manage temporal gaps.
  • DLQs isolate irrecoverable records.
  • Expectations detect systemic drift.
  • Purview exposes downstream impact.
  • Azure-native tooling (Fabric, Event Hubs, ADF, Delta Lake) provides the building blocks.

Azure’s convergence around Fabric, Delta Lake, and unified lakehouse storage makes these principles easier to implement than ever. But the platform does not replace architectural discipline.

The mindset shift is this:

You are not building pipelines. You are managing state transitions in a distributed system.

If state transitions are deterministic, replayable, and observable, failures become routine events rather than outages.

Reliability Checklist for Azure ETL Pipelines

Use this as a practical audit guide:

Idempotency

  • Every write uses a deterministic identity key.
  • Delta MERGE is used instead of blind append.
  • Source data is deduplicated before merge.

Schema Management

  • Bronze layer enforces schema and captures drift.
  • Silver allows controlled additive evolution.
  • Gold enforces strict contracts.
  • Type widening vs narrowing is explicitly handled.

Late Data

  • Event time is persisted.
  • Watermarks are tracked (streaming or batch).
  • Late arrivals use the same idempotent merge logic.

Recovery

  • Retry policies wrap only idempotent operations.
  • DLQ strategy exists for irrecoverable records.
  • ADF activity retry and timeout policies are configured.

Governance

  • Batch-level expectations are implemented.
  • Purview lineage is enabled and monitored.
  • Transformation logic is version-controlled.

If you can check every box, your ETL pipeline is not just functional—it is resilient.

And resilient pipelines don’t break when the real world behaves unpredictably.
