ETL · Data Engineering · Fintech · Reconciliation

Idempotent ETL: Why Your Pipeline Should Be Safe to Rerun

February 15, 2026 · 5 min read

Your pipeline will fail at some point. Airflow retries a task. Someone reruns a backfill. The partner network sends you the same file twice. Idempotency means the result stays correct no matter how many times you run the job.

The Scenario

You receive a daily settlement file from a bank. Each row represents one settlement line for a merchant. Your job loads those rows into a raw table in your warehouse.

The raw table looks like this:

settlement_report_raw

settlement_id | merchant_id | settlement_date | net_amount
--------------+-------------+-----------------+-----------
S001          | M100        | 2026-01-31      | 5,000.00
S002          | M101        | 2026-01-31      | 3,200.00
S003          | M102        | 2026-01-31      | 8,750.00

The Problem: Blind Append

A naive pipeline just inserts every row it sees:

insert into settlement_report_raw
select * from settlement_report_stage;

If the same file runs twice, you get duplicate rows:

settlement_report_raw (after running twice)

settlement_id | merchant_id | settlement_date | net_amount
--------------+-------------+-----------------+-----------
S001          | M100        | 2026-01-31      | 5,000.00
S002          | M101        | 2026-01-31      | 3,200.00
S003          | M102        | 2026-01-31      | 8,750.00
S001          | M100        | 2026-01-31      | 5,000.00   <- duplicate
S002          | M101        | 2026-01-31      | 3,200.00   <- duplicate
S003          | M102        | 2026-01-31      | 8,750.00   <- duplicate

Any aggregation on this table is now wrong. Total net for Jan 31st doubles from $16,950 to $33,900. Reconciliation breaks. The accounting team gets a false alert.
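A minimal sqlite3 sketch (an in-memory stand-in for the warehouse; table and values follow the post) makes the doubling concrete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table settlement_report_raw (
        settlement_id text, merchant_id text,
        settlement_date text, net_amount real
    )
""")  # no primary key, so nothing blocks duplicate rows

stage_rows = [
    ("S001", "M100", "2026-01-31", 5000.00),
    ("S002", "M101", "2026-01-31", 3200.00),
    ("S003", "M102", "2026-01-31", 8750.00),
]

for _ in range(2):  # the same file processed twice
    conn.executemany(
        "insert into settlement_report_raw values (?, ?, ?, ?)", stage_rows
    )

total, = conn.execute(
    "select sum(net_amount) from settlement_report_raw"
).fetchone()
print(total)  # 33900.0 -- double the true 16,950.00
```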

The Fix: Merge on a Stable Key

Every settlement row from the bank has a natural unique identifier: settlement_id. Use it.

merge into settlement_report_raw as tgt
using settlement_report_stage as src
  on tgt.settlement_id = src.settlement_id
when not matched then insert (
  settlement_id,
  merchant_id,
  settlement_date,
  net_amount,
  load_ts
) values (
  src.settlement_id,
  src.merchant_id,
  src.settlement_date,
  src.net_amount,
  current_timestamp
);

Now run the same file twice:

settlement_report_raw (after running twice)

settlement_id | merchant_id | settlement_date | net_amount
--------------+-------------+-----------------+-----------
S001          | M100        | 2026-01-31      | 5,000.00
S002          | M101        | 2026-01-31      | 3,200.00
S003          | M102        | 2026-01-31      | 8,750.00

The second run finds that S001, S002, and S003 already exist and does nothing. The table stays clean.
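You can sketch the same behavior with sqlite3. SQLite has no merge statement, so insert or ignore against a primary key on settlement_id stands in for the merge's "when not matched then insert":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table settlement_report_raw (
        settlement_id   text primary key,  -- the stable key
        merchant_id     text,
        settlement_date text,
        net_amount      real
    )
""")

stage_rows = [
    ("S001", "M100", "2026-01-31", 5000.00),
    ("S002", "M101", "2026-01-31", 3200.00),
    ("S003", "M102", "2026-01-31", 8750.00),
]

def load(rows):
    # Rows whose settlement_id already exists are skipped, mirroring
    # the merge's "when not matched then insert".
    conn.executemany(
        "insert or ignore into settlement_report_raw values (?, ?, ?, ?)",
        rows,
    )

load(stage_rows)
load(stage_rows)  # rerun: the second pass inserts nothing

count, total = conn.execute(
    "select count(*), sum(net_amount) from settlement_report_raw"
).fetchone()
print(count, total)  # 3 16950.0 -- totals stay correct
```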

What If There Is No Natural Key?

Sometimes external files do not come with a reliable row identifier. In that case, compute one from the row's stable fields, joined with an explicit delimiter so that different field combinations cannot concatenate to the same string:

md5(settlement_date::text || '|' || merchant_id || '|' || net_amount::text) as settlement_id

This gives you a deterministic key that is the same every time you process the same row. The merge logic stays identical.
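If you prototype that derivation outside the database, the same idea looks like this in Python, with hashlib standing in for the warehouse's md5 (the explicit delimiter guards against two different field combinations concatenating to the same string):

```python
import hashlib

def derive_settlement_id(settlement_date, merchant_id, net_amount):
    # Join with an explicit delimiter so that, for example,
    # ('M10', '05000.00') and ('M100', '5000.00') cannot
    # concatenate to the same string.
    raw = "|".join([settlement_date, merchant_id, net_amount])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

k1 = derive_settlement_id("2026-01-31", "M100", "5000.00")
k2 = derive_settlement_id("2026-01-31", "M100", "5000.00")
k3 = derive_settlement_id("2026-01-31", "M101", "3200.00")

assert k1 == k2   # deterministic: same row, same key, every run
assert k1 != k3   # different rows get different keys
```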

The Broader Pattern

The same idea applies downstream. If you aggregate the raw table into a daily summary, use a delete and reinsert scoped to the exact date you are processing, ideally inside a single transaction, rather than appending:

delete from settlement_summary
where settlement_date = '2026-01-31';

insert into settlement_summary (merchant_id, settlement_date, total_net)
select merchant_id, settlement_date, sum(net_amount)
from settlement_report_raw
where settlement_date = '2026-01-31'
group by 1, 2;
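The same step can be sketched with sqlite3 (table names follow the post; wrapping the delete and the insert in one transaction is an extra safeguard so a failed rerun cannot leave the summary half-built):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table settlement_report_raw (
        settlement_id text primary key, merchant_id text,
        settlement_date text, net_amount real
    );
    create table settlement_summary (
        merchant_id text, settlement_date text, total_net real
    );
    insert into settlement_report_raw values
        ('S001', 'M100', '2026-01-31', 5000.00),
        ('S002', 'M101', '2026-01-31', 3200.00),
        ('S003', 'M102', '2026-01-31', 8750.00);
""")

def summarize(day):
    # Delete and rebuild only the partition being processed.
    with conn:  # one transaction: both statements apply, or neither does
        conn.execute(
            "delete from settlement_summary where settlement_date = ?", (day,)
        )
        conn.execute(
            """
            insert into settlement_summary (merchant_id, settlement_date, total_net)
            select merchant_id, settlement_date, sum(net_amount)
            from settlement_report_raw
            where settlement_date = ?
            group by merchant_id, settlement_date
            """,
            (day,),
        )

summarize("2026-01-31")
summarize("2026-01-31")  # rerun: the summary is rebuilt, not appended to

count, total = conn.execute(
    "select count(*), sum(total_net) from settlement_summary"
).fetchone()
print(count, total)  # 3 16950.0
```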

Same input in, same output out, regardless of how many times it runs.

Why It Matters

When an incident happens and someone reruns a pipeline at 2am, the difference between "just rerun it" and "we need to check what was already loaded before touching anything" is enormous. The first team resolves the problem in 15 minutes. The second team is still investigating two hours later.

Build in the merge and the scoped delete from the start. It is much harder to fix after data has already been corrupted once.