ADR-010: AI/NLP-Driven Excel Ingestion Strategy
Status: Accepted
Date: 2026-01-27
Deciders: Preetam Balijepalli, Sonakshi Gupta
Context
CostEngine must onboard customers who have been using Excel for costing for 10+ years. Competitors require customers to adapt their logic to a fixed template, creating significant onboarding friction and requiring costly implementation services.
Problem Statement
Customer Excel files exhibit:
- Inconsistent column naming (e.g., "Lathe" vs "Lath", "Material" vs "RM" vs "Raw Material")
- Non-standard layouts and merged cells
- Embedded formulas with implicit business logic
- Visual cues (colors, formatting) carrying meaning
Decision
Implement a two-stage AI/NLP ingestion pipeline:
- Stage 1 - Intelligent Parsing: Use AI/NLP to analyze unstructured customer Excel files and auto-populate a standardized internal template
- Stage 2 - Human Verification: Present parsed data for user review, correction, and confirmation
flowchart LR
A[Customer Excel] --> B[AI/NLP Parser]
B --> C[Standardized Template]
C --> D[User Verification UI]
D --> E{Corrections?}
E -->|Yes| F[Apply Fixes]
F --> G[Learning Feedback]
G --> B
E -->|No| H[Confirmed Data]
H --> I[Quote Creation]
Architecture
Component Design
graph TB
subgraph "Ingestion Pipeline"
Upload[File Upload]
Extract[Cell Extraction]
NLP[NLP Field Mapper]
Validate[Validation Engine]
Preview[Preview Generator]
end
subgraph "AI/NLP Engine"
Embed[Embedding Model]
Classify[Field Classifier]
Fuzzy[Fuzzy Matcher]
Learn[Learning Module]
end
Upload --> Extract
Extract --> NLP
NLP --> Embed
NLP --> Classify
NLP --> Fuzzy
Classify --> Validate
Validate --> Preview
Learn --> Classify
style NLP fill:#fce4ec,stroke:#e91e63,stroke-width:2px
style Learn fill:#c8e6c9
Field Mapping Strategy
| Source Variation | Target Field | Confidence Threshold |
|---|---|---|
| "Material", "RM", "Raw Material", "Mat'l" | material_grade | 0.85 |
| "Lathe", "Lath", "Turning", "CNC Turn" | operation_type: turning | 0.80 |
| "Time/pc", "Cycle", "CT", "Time per piece" | cycle_time | 0.85 |
| "Rate", "Price", "₹/kg", "Rs/kg" | rm_rate | 0.90 |
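
The table above could be held in code roughly as follows. This is a minimal sketch, not the actual CostEngine data model: the names `FIELD_RULES`, `lookup_exact`, and `meets_threshold` are hypothetical, and the ADR does not say what happens on either side of a threshold, so the sketch only reports whether a score reaches it.

```python
from typing import Optional

# Hypothetical in-memory form of the mapping table:
# target field -> (known source variations, confidence threshold).
FIELD_RULES = {
    "material_grade": (("Material", "RM", "Raw Material", "Mat'l"), 0.85),
    "operation_type: turning": (("Lathe", "Lath", "Turning", "CNC Turn"), 0.80),
    "cycle_time": (("Time/pc", "Cycle", "CT", "Time per piece"), 0.85),
    "rm_rate": (("Rate", "Price", "₹/kg", "Rs/kg"), 0.90),
}


def lookup_exact(header: str) -> Optional[str]:
    """Return the target field for a known variation, ignoring case and padding."""
    wanted = header.strip().casefold()
    for target, (variations, _threshold) in FIELD_RULES.items():
        if any(wanted == variation.casefold() for variation in variations):
            return target
    return None


def meets_threshold(target: str, confidence: float) -> bool:
    """Check a scorer's confidence against the per-field threshold."""
    _variations, threshold = FIELD_RULES[target]
    return confidence >= threshold
```

Keeping the threshold next to the variations means a stricter field such as `rm_rate` (0.90) can reject a score that a looser field such as `operation_type: turning` (0.80) would let through.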
Rationale
Why AI/NLP over Fixed Templates?
| Approach | Onboarding Effort | Customer Experience | Scalability |
|---|---|---|---|
| Fixed Template | High (customer adapts) | Poor (forced change) | Limited |
| AI/NLP Ingestion | Low (system adapts) | Excellent (familiar format) | High |
Competitive Advantage
- Zero-friction onboarding: Customers continue using their existing Excel format
- Showcase intelligence: First interaction demonstrates the product's capabilities
- Learning system: Improves with each correction, building customer-specific models
Technical Approach
Phase 1: MVP (Heuristic + Basic NLP)
- Column header matching using fuzzy string matching (Levenshtein distance)
- Pattern recognition for common formats (dates, currencies, weights)
- Cell position heuristics (headers typically in rows 1-3)
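
The header-matching and cell-position bullets can be sketched as below. This is an illustration under stated assumptions, not the MVP implementation: the function names, the `known` dictionary shape, the choice to normalise edit distance by the longer string's length, and the "most text cells wins" rule for locating the header row are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to 0..1, where 1.0 means equal after case-folding."""
    a, b = a.strip().casefold(), b.strip().casefold()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(header: str, known: Dict[str, str]) -> Tuple[str, float]:
    """Return (target_field, score) for the closest known variation.

    `known` maps a source variation to its target field.
    """
    return max(((target, similarity(header, variation))
                for variation, target in known.items()),
               key=lambda pair: pair[1])


def guess_header_row(rows: Sequence[List[object]], search_depth: int = 3) -> int:
    """Pick the index, within the first `search_depth` rows, of the row with
    the most non-empty text cells. Assumes `rows` has at least one row."""
    def text_cells(row: List[object]) -> int:
        return sum(1 for cell in row if isinstance(cell, str) and cell.strip())

    candidates = rows[:search_depth]
    return max(range(len(candidates)), key=lambda i: text_cells(candidates[i]))
```

With this normalisation, "Lath" against "Lathe" is one edit over five characters, which gives a score of 0.80.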
Phase 2: Enhanced (Embedding-based)
- Semantic similarity using embedding models
- Context-aware field classification
- Multi-column relationship detection
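
The ranking step behind embedding-based similarity can be sketched as follows. The ADR does not name an embedding model, so `toy_embed` here is a deliberately simple stand-in built from character trigram counts: it captures spelling overlap, not meaning, and would be swapped for a real model through the `embed` parameter. All names in the block are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: character-trigram counts (NOT a semantic model)."""
    padded = f"  {text.strip().casefold()} "
    return dict(Counter(padded[i:i + 3] for i in range(len(padded) - 2)))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rank_fields(header: str,
                examples: Dict[str, Sequence[str]],
                embed: Callable[[str], Vector] = toy_embed) -> List[Tuple[str, float]]:
    """Score each target field by its best-matching example phrase,
    highest score first. `examples` maps target field -> example phrases."""
    query = embed(header)
    scored = [(target, max(cosine(query, embed(phrase)) for phrase in phrases))
              for target, phrases in examples.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because `rank_fields` only depends on the `embed` callable, the same ranking code would work unchanged once a learned model replaces the trigram stand-in, which is where matches such as "Turning" for a lathe operation would have to come from.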
Phase 3: Adaptive (Customer-Specific Learning)
- Store correction history per customer
- Train lightweight models for repeat imports
- Suggest field mappings based on past behavior
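
One simple way to picture per-customer correction history is a frequency count of confirmed mappings. The sketch below assumes an in-memory store and a hypothetical `CorrectionHistory` class; it covers storing corrections and suggesting mappings, and does not attempt the model training mentioned in the second bullet.

```python
from collections import Counter, defaultdict
from typing import Optional


class CorrectionHistory:
    """Hypothetical in-memory store; a real system would persist this."""

    def __init__(self) -> None:
        # customer_id -> normalised header -> Counter of confirmed target fields
        self._history = defaultdict(lambda: defaultdict(Counter))

    @staticmethod
    def _key(header: str) -> str:
        return header.strip().casefold()

    def record(self, customer_id: str, header: str, target: str) -> None:
        """Remember that this customer confirmed `header` as `target`."""
        self._history[customer_id][self._key(header)][target] += 1

    def suggest(self, customer_id: str, header: str) -> Optional[str]:
        """Return the target this customer confirmed most often, if any."""
        counts = self._history.get(customer_id, {}).get(self._key(header))
        if not counts:
            return None
        return counts.most_common(1)[0][0]
```

Keying the history by customer keeps one customer's shorthand from leaking into another customer's suggestions.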
Data Quality Handling
| Issue | Detection | Resolution |
|---|---|---|
| Inconsistent naming | Fuzzy matching + embeddings | Auto-map with confidence score |
| Missing required fields | Schema validation | Highlight in UI, block import |
| Implausible values | Range checks + history | Warning with suggested fix |
| Merged cells | Cell structure analysis | Unmerge and distribute values |
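
Three of the rows above (merged cells, missing required fields, implausible values) can be sketched in plain Python. Everything specific here is an assumption for illustration: the grid-of-lists representation, the 0-based inclusive range tuples, the `REQUIRED_FIELDS` list, and the numeric bounds in `PLAUSIBLE_RANGES`, including the unit for `cycle_time`, which the ADR does not specify.

```python
from typing import Dict, List, Sequence, Tuple

Grid = List[List[object]]
MergedRange = Tuple[int, int, int, int]  # first_row, first_col, last_row, last_col


def unmerge_and_distribute(grid: Grid, merged_ranges: Sequence[MergedRange]) -> Grid:
    """Copy each merged range's top-left value into every cell it covers.

    Ranges are 0-based and inclusive (an assumption of this sketch).
    """
    filled = [list(row) for row in grid]  # leave the caller's grid untouched
    for first_row, first_col, last_row, last_col in merged_ranges:
        value = grid[first_row][first_col]
        for r in range(first_row, last_row + 1):
            for c in range(first_col, last_col + 1):
                filled[r][c] = value
    return filled


# Hypothetical schema: which fields block the import when absent, and
# which numeric bounds (unit assumed to be seconds) only raise a warning.
REQUIRED_FIELDS = ("material_grade", "cycle_time")
PLAUSIBLE_RANGES: Dict[str, Tuple[float, float]] = {"cycle_time": (0.0, 3600.0)}


def validate(record: Dict[str, object]) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings); any error should block the import."""
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS
              if record.get(name) in (None, "")]
    warnings = []
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(name)
        if isinstance(value, (int, float)) and not low <= value <= high:
            warnings.append(f"{name}={value} outside expected range {low}-{high}")
    return errors, warnings
```

Separating errors from warnings mirrors the table: missing required fields block the import, while implausible values only prompt the user.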
User Verification UI
+-------------------------------------------------------------+
| Excel Import Verification |
+-------------------------------------------------------------+
| [OK] Material Grade: SAE4140 [Source: Cell B7] |
| [OK] Gross Weight: 2.304 kg [Source: Cell D13] |
| [??] Operation: "Lath" -> Turning? [Confidence: 82%] |
| [Accept] [Change] [Skip] |
| [!!] Cycle Time: Missing [Expected: Column F] |
| [Enter Value: _______ ] |
+-------------------------------------------------------------+
| [Save Mapping Template] [Continue] [Cancel] |
+-------------------------------------------------------------+
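
The three markers in the mockup ([OK], [??], [!!]) suggest a per-field status derived from whether a value was found and how confident the parser is. The ADR does not state the exact rule, so the sketch below assumes one: a missing value is always [!!], and a hypothetical `auto_accept_at` cutoff (0.90 here, chosen only for illustration) separates [OK] from [??]. `ParsedField` and `status_marker` are likewise hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedField:
    label: str
    value: Optional[str]      # None when the parser found nothing
    confidence: float = 1.0   # 1.0 for values read without any guessing


def status_marker(item: ParsedField, auto_accept_at: float = 0.90) -> str:
    """Map a parsed field to the marker shown in the verification screen.

    The cutoff is an assumption of this sketch, not a documented rule.
    """
    if item.value is None:
        return "[!!]"  # missing: the user must enter a value
    if item.confidence >= auto_accept_at:
        return "[OK]"  # shown as accepted, still editable
    return "[??]"      # uncertain: ask the user to accept or change
```

Under this assumed rule, the "Lath" row at 82% confidence is shown as [??], matching the mockup.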
Consequences
Positive
- Dramatically reduced onboarding friction
- Differentiating feature vs. competitors
- Builds proprietary dataset of manufacturing terminology
- System improves with usage
Negative
- Higher initial development complexity
- Requires investment in NLP/ML capabilities
- Edge cases may require manual intervention
- Need robust feedback loop for corrections
Mitigations
- Start with heuristic-based MVP, add ML incrementally
- Always provide manual override option
- Track and analyze correction patterns
- Set confidence thresholds to flag uncertain mappings
Related
- ADR-003: Excel Parsing Strategy - Base parsing approach
- UC-103: Import Legacy Excel - Use case coverage
- MVP Scope - Feature prioritization