
ADR-007: Idempotent Data Pipeline (Watermarking)

Status: Proposed
Date: 2026-01-24

Context

Costing relies on large Excel/CSV uploads that contain both geometry data and volatile market rates (steel, aluminium). We need a high-performance ingestion layer that avoids duplicate work and maintains strict data lineage.

Decision

Implement an Idempotent Data Pipeline using Event-Time Watermarking.

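  1. Fingerprinting: Compute an MD5 hash of each uploaded file and skip processing when the hash matches a file that has already been ingested.
  2. Watermarking: Stamp every calculation with the specific master rate version it was computed against.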
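
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the specified implementation: the `Pipeline` class, the in-memory result store, the `rate_version` string format, and the placeholder costing (one weight per line multiplied by a rate) are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


def file_fingerprint(data: bytes) -> str:
    """Return the MD5 hex digest of the raw upload bytes."""
    return hashlib.md5(data).hexdigest()


@dataclass(frozen=True)
class StampedResult:
    total_cost: float
    file_fingerprint: str  # which upload produced this result
    rate_version: str      # watermark: master rate version used


class Pipeline:
    """Hypothetical pipeline; a real one would persist results in a database."""

    def __init__(self) -> None:
        # Assumed store keyed by (file fingerprint, rate version).
        self._results = {}
        self.runs = 0  # counts how often the costing actually executed

    def ingest(self, data: bytes, rate_version: str, rate_per_kg: float) -> StampedResult:
        key = (file_fingerprint(data), rate_version)
        cached = self._results.get(key)
        if cached is not None:
            # Same file and same rate version: skip the work (idempotent).
            return cached
        self.runs += 1
        # Placeholder costing: each non-empty line holds a weight in kg.
        weight = sum(float(line) for line in data.decode().splitlines() if line.strip())
        result = StampedResult(weight * rate_per_kg, key[0], rate_version)
        self._results[key] = result
        return result
```

Uploading the same bytes twice under the same rate version returns the stored result without recomputing, while a new rate version triggers a fresh calculation stamped with that version.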

View Implementation Details: Data Pipeline Specification →

Rationale

  • Performance: 90% faster re-quotes, because the pipeline identifies which cost component (geometry or rates) actually changed and recalculates only that part.
  • Reliability: Prevents duplicate ingestion jobs and double-counted metrics.
  • Traceability: Watermarking records exactly which system state (master rate version) produced a given quote version.
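
One way to detect which cost component changed is to fingerprint geometry and rates separately and compare them with the fingerprints stored from the previous run. The sketch below assumes each component is available as a JSON-serializable dictionary; the function names and the sample values are hypothetical.

```python
import hashlib
import json


def component_fingerprint(component: object) -> str:
    """Hash a component via a canonical JSON form (sorted keys, fixed separators)."""
    payload = json.dumps(component, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def changed_components(previous: dict, upload: dict) -> set:
    """Return names of components whose fingerprint differs from the previous run."""
    return {
        name
        for name, value in upload.items()
        if previous.get(name) != component_fingerprint(value)
    }


# Illustrative data only: first upload, then a re-quote where only rates moved.
first = {
    "geometry": {"length_mm": 1200, "width_mm": 300},
    "rates": {"steel": 0.92, "aluminium": 2.41},
}
stored = {name: component_fingerprint(value) for name, value in first.items()}

second = {
    "geometry": {"width_mm": 300, "length_mm": 1200},  # same data, different key order
    "rates": {"steel": 0.95, "aluminium": 2.41},
}
print(changed_components(stored, second))  # {'rates'}
```

Because only `rates` is reported as changed, a re-quote could reuse the stored geometry results and recompute just the rate-dependent part.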

Consequences

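  • Positive: Robust data lineage and fast "thin recalculations" that recompute only the changed component.
  • Negative: Requires identical hashing logic on every distributed background worker, since any divergence would cause unchanged files to be treated as new.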
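
A common way to keep fingerprints identical across workers is to share one hashing function that operates on the raw file bytes, so that parser, locale, or chunk-size differences cannot affect the digest. The following is a sketch under that assumption; `stream_fingerprint` and `CHUNK_SIZE` are hypothetical names, not part of the specification.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # assumed read size; it does not affect the digest


def stream_fingerprint(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """MD5 of a binary stream, read in chunks so large uploads fit in memory."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


# Two "workers" reading the same bytes with different chunk sizes agree.
data = b"part_id,length_mm\nA-1,1200\nA-2,800\n"
worker_a = stream_fingerprint(io.BytesIO(data), chunk_size=7)
worker_b = stream_fingerprint(io.BytesIO(data))
assert worker_a == worker_b
```

Hashing bytes rather than parsed rows means a re-saved spreadsheet with identical values may still produce a new fingerprint; whether that trade-off is acceptable is left to the pipeline specification.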
