ADR-007: Idempotent Data Pipeline (Watermarking)
Status: Proposed
Date: 2026-01-24
Context
Costing relies on large Excel/CSV uploads containing both geometry and volatile market rates (Steel/Aluminium). We need a high-performance ingestion layer that prevents duplicate work and maintains strict data lineage.
Decision
Implement an Idempotent Data Pipeline using Event-Time Watermarking.
- Fingerprinting: Compute an MD5 hash of each uploaded file and skip processing when the hash matches a file that has already been ingested.
- Watermarking: Stamp every calculation with the specific master rate version it used.
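The two mechanisms above can be sketched together as follows. This is a minimal illustration, not the pipeline's actual design: the `Pipeline` and `QuoteResult` names, the in-memory cache, and the one-weight-per-line input format are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(data: bytes) -> str:
    """Return the MD5 hex digest of an upload's raw bytes."""
    return hashlib.md5(data).hexdigest()


@dataclass(frozen=True)
class QuoteResult:
    """Hypothetical calculation result carrying its lineage."""
    total: float
    file_fingerprint: str  # which upload produced this result
    rate_version: str      # watermark: master rate version used


class Pipeline:
    """Hypothetical ingestion layer with an in-memory result cache."""

    def __init__(self) -> None:
        self._results: dict[tuple[str, str], QuoteResult] = {}
        self.runs = 0  # counts real calculations, for demonstration

    def ingest(self, data: bytes, rate_version: str, rate_per_kg: float) -> QuoteResult:
        key = (fingerprint(data), rate_version)
        cached = self._results.get(key)
        if cached is not None:
            # Same file and same rate version: skip the work entirely.
            return cached

        self.runs += 1
        # Assumed input format: one weight in kg per line.
        weight_kg = sum(float(line) for line in data.decode().splitlines() if line.strip())
        result = QuoteResult(
            total=weight_kg * rate_per_kg,
            file_fingerprint=key[0],
            rate_version=rate_version,
        )
        self._results[key] = result
        return result
```

Calling `ingest` twice with the same bytes and the same rate version returns the cached result without recalculating, which is what makes the step idempotent. A new rate version produces a new result stamped with that version.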
Rationale
- Performance: 90% faster re-quotes, achieved by identifying which cost components (geometry or rates) actually changed and recalculating only those.
- Reliability: Prevents duplicate ingestion jobs and double-counted metrics.
- Traceability: Watermarking ensures we know exactly which system state produced a specific quote version.
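One way to detect which cost component changed is to fingerprint geometry and rates separately and compare each against the previous upload. The sketch below assumes each upload has already been parsed into a dictionary with `geometry` and `rates` keys, and that a geometry change forces a full recalculation while a rates-only change allows a thin one. The `Recalc`, `component_hash`, and `plan_recalc` names are hypothetical.

```python
import hashlib
import json
from enum import Enum


class Recalc(Enum):
    NONE = "none"              # nothing changed, reuse the previous quote
    RATES_ONLY = "rates_only"  # thin recalculation: reprice existing geometry
    FULL = "full"              # geometry changed, recompute everything


def component_hash(component: dict) -> str:
    """Hash a parsed component in a key-order-independent way."""
    canonical = json.dumps(component, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def plan_recalc(previous: dict, current: dict) -> Recalc:
    """Decide how much work a re-quote needs (assumed policy)."""
    geometry_changed = component_hash(previous["geometry"]) != component_hash(current["geometry"])
    rates_changed = component_hash(previous["rates"]) != component_hash(current["rates"])
    if geometry_changed:
        return Recalc.FULL
    if rates_changed:
        return Recalc.RATES_ONLY
    return Recalc.NONE


old = {"geometry": {"length_mm": 1200, "width_mm": 300}, "rates": {"steel": 0.82}}
new = {"geometry": {"width_mm": 300, "length_mm": 1200}, "rates": {"steel": 0.91}}
print(plan_recalc(old, new).value)  # rates_only
```

Sorting keys before hashing means a reordered but otherwise identical component does not trigger a recalculation.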
Consequences
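- Positive: Robust data lineage; fast "thin recalculations" that reprocess only the components that changed.
- Negative: Every distributed background worker must apply identical hashing logic; otherwise the same file can produce different fingerprints.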