
ADR-010: AI/NLP-Driven Excel Ingestion Strategy

Status: Accepted
Date: 2026-01-27
Deciders: Preetam Balijepalli, Sonakshi Gupta


Context

CostEngine must onboard customers who have been using Excel for costing for 10+ years. Competitors require customers to adapt their logic to a fixed template, creating significant onboarding friction and requiring costly implementation services.

Problem Statement

Customer Excel files exhibit:

  • Inconsistent column naming (e.g., "Lathe" vs "Lath", "Material" vs "RM" vs "Raw Material")
  • Non-standard layouts and merged cells
  • Embedded formulas with implicit business logic
  • Visual cues (colors, formatting) carrying meaning


Decision

Implement a two-stage AI/NLP ingestion pipeline that:

  1. Stage 1 - Intelligent Parsing: Use AI/NLP to analyze unstructured customer Excel files and auto-populate a standardized internal template
  2. Stage 2 - Human Verification: Present parsed data for user review, correction, and confirmation

flowchart LR
    A[Customer Excel] --> B[AI/NLP Parser]
    B --> C[Standardized Template]
    C --> D[User Verification UI]
    D --> E{Corrections?}
    E -->|Yes| F[Apply Fixes]
    F --> G[Learning Feedback]
    G --> B
    E -->|No| H[Confirmed Data]
    H --> I[Quote Creation]

Architecture

Component Design

graph TB
    subgraph "Ingestion Pipeline"
        Upload[File Upload]
        Extract[Cell Extraction]
        NLP[NLP Field Mapper]
        Validate[Validation Engine]
        Preview[Preview Generator]
    end

    subgraph "AI/NLP Engine"
        Embed[Embedding Model]
        Classify[Field Classifier]
        Fuzzy[Fuzzy Matcher]
        Learn[Learning Module]
    end

    Upload --> Extract
    Extract --> NLP
    NLP --> Embed
    NLP --> Classify
    NLP --> Fuzzy
    Classify --> Validate
    Validate --> Preview

    Learn --> Classify

    style NLP fill:#fce4ec,stroke:#e91e63,stroke-width:2px
    style Learn fill:#c8e6c9

Field Mapping Strategy

| Source Variation | Target Field | Confidence Threshold |
|---|---|---|
| "Material", "RM", "Raw Material", "Mat'l" | material_grade | 0.85 |
| "Lathe", "Lath", "Turning", "CNC Turn" | operation_type: turning | 0.80 |
| "Time/pc", "Cycle", "CT", "Time per piece" | cycle_time | 0.85 |
| "Rate", "Price", "₹/kg", "Rs/kg" | rm_rate | 0.90 |

Rationale

Why AI/NLP over Fixed Templates?

| Approach | Onboarding Effort | Customer Experience | Scalability |
|---|---|---|---|
| Fixed Template | High (customer adapts) | Poor (forced change) | Limited |
| AI/NLP Ingestion | Low (system adapts) | Excellent (familiar format) | High |

Competitive Advantage

  1. Zero-friction onboarding: Customers continue using their existing Excel format
  2. Showcase intelligence: First interaction demonstrates the product's capabilities
  3. Learning system: Improves with each correction, building customer-specific models

Technical Approach

Phase 1: MVP (Heuristic + Basic NLP)

  • Column header matching using fuzzy string matching (Levenshtein distance)
  • Pattern recognition for common formats (dates, currencies, weights)
  • Cell position heuristics (headers typically appear in rows 1-3)
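
As a rough illustration of the Phase 1 heuristics, the sketch below scores a header against known names with Levenshtein distance and picks a likely header row from the top of the sheet. The function names, the length-normalized similarity score, and the "most text cells wins" rule are assumptions made for this example, not a specification of the MVP.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: 1 - distance / longer length, case-insensitive."""
    a, b = a.strip().casefold(), b.strip().casefold()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(header: str, candidates: Sequence[str]) -> Tuple[str, float]:
    """Return the candidate with the highest similarity and its score."""
    return max(((c, similarity(header, c)) for c in candidates),
               key=lambda pair: pair[1])


def find_header_row(rows: List[list], max_rows: int = 3) -> int:
    """Return the 0-based index of the row, among the first max_rows,
    that holds the most non-empty text cells (a hypothetical heuristic)."""
    window = rows[:max_rows]
    if not window:
        raise ValueError("sheet has no rows")

    def text_cells(row: list) -> int:
        return sum(1 for cell in row if isinstance(cell, str) and cell.strip())

    return max(range(len(window)), key=lambda i: text_cells(window[i]))
```

For example, `best_match("Lath", ["Lathe", "Material", "Rate"])` selects "Lathe", since the two strings differ by a single edit. Pattern recognition for dates, currencies, and weights is not covered here.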

Phase 2: Enhanced (Embedding-based)

  • Semantic similarity using embedding models
  • Context-aware field classification
  • Multi-column relationship detection

Phase 3: Adaptive (Customer-Specific Learning)

  • Store correction history per customer
  • Train lightweight models for repeat imports
  • Suggest field mappings based on past behavior
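
A minimal sketch of the first and third items, assuming an in-memory store keyed by customer and normalized header. The class and method names are hypothetical, and persistence and model training are out of scope for this example.

```python
from collections import Counter, defaultdict
from typing import Optional


def _norm(header: str) -> str:
    """Normalize a header so that case and spacing differences collapse."""
    return " ".join(header.casefold().split())


class CorrectionHistory:
    """Counts how often each customer mapped a header to a target field."""

    def __init__(self) -> None:
        # (customer_id, normalized header) -> Counter of chosen target fields
        self._counts = defaultdict(Counter)

    def record(self, customer_id: str, header: str, target_field: str) -> None:
        """Store one user-confirmed or user-corrected mapping."""
        self._counts[(customer_id, _norm(header))][target_field] += 1

    def suggest(self, customer_id: str, header: str) -> Optional[str]:
        """Return the field this customer chose most often, if any."""
        counts = self._counts.get((customer_id, _norm(header)))
        if not counts:
            return None
        return counts.most_common(1)[0][0]
```

Keeping the history per customer means one customer's corrections never change the suggestions shown to another.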

Data Quality Handling

| Issue | Detection | Resolution |
|---|---|---|
| Inconsistent naming | Fuzzy matching + embeddings | Auto-map with confidence score |
| Missing required fields | Schema validation | Highlight in UI, block import |
| Implausible values | Range checks + history | Warning with suggested fix |
| Merged cells | Cell structure analysis | Unmerge and distribute values |
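
Three of these checks can be sketched on plain Python data. The merged ranges are assumed to be supplied by the spreadsheet reader as 0-based (first_row, first_col, last_row, last_col) tuples, and the field names and limits in the usage note are hypothetical. History-based plausibility checks are not shown.

```python
from typing import Dict, List, Sequence, Tuple

MergedRange = Tuple[int, int, int, int]  # first_row, first_col, last_row, last_col


def distribute_merged(grid: List[list], merged: Sequence[MergedRange]) -> List[list]:
    """Copy the top-left value of each merged range into every cell it spans.
    Returns a new grid; the input is left unchanged."""
    result = [list(row) for row in grid]
    for r0, c0, r1, c1 in merged:
        value = result[r0][c0]
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                result[r][c] = value
    return result


def missing_required(record: dict, required: Sequence[str]) -> List[str]:
    """List required fields that are absent, None, or empty strings."""
    return [name for name in required if record.get(name) in (None, "")]


def range_warnings(record: dict, limits: Dict[str, Tuple[float, float]]) -> List[str]:
    """Warn about numeric values outside their (low, high) limits."""
    warnings = []
    for name, (low, high) in limits.items():
        value = record.get(name)
        if value is not None and not (low <= value <= high):
            warnings.append(f"{name}={value} outside [{low}, {high}]")
    return warnings
```

For instance, with an assumed limit of `{"gross_weight_kg": (0.001, 500.0)}`, a value of 2304.0 would produce a warning, which could indicate a weight entered in grams rather than kilograms.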

User Verification UI

+-------------------------------------------------------------+
| Excel Import Verification                                   |
+-------------------------------------------------------------+
| [OK] Material Grade: SAE4140        [Source: Cell B7]       |
| [OK] Gross Weight: 2.304 kg         [Source: Cell D13]      |
| [??] Operation: "Lath" -> Turning?   [Confidence: 82%]      |
|    [Accept] [Change] [Skip]                                 |
| [!!] Cycle Time: Missing            [Expected: Column F]    |
|    [Enter Value: _______ ]                                  |
+-------------------------------------------------------------+
| [Save Mapping Template]  [Continue]  [Cancel]               |
+-------------------------------------------------------------+
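
The state behind this screen could be modeled roughly as follows. The `ReviewItem` type, the meaning given to Accept, Change, and Skip, and the rule that Continue requires every required field to be confirmed are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class ReviewItem:
    name: str                           # target field, e.g. "cycle_time"
    value: Optional[str]                # parsed value, None if missing
    source: Optional[str] = None        # e.g. "Cell B7"
    confidence: Optional[float] = None  # parser confidence, if any
    confirmed: bool = False


def apply_action(item: ReviewItem, action: str,
                 new_value: Optional[str] = None) -> ReviewItem:
    """Return an updated copy of the item for one user action."""
    if action == "accept":
        if item.value is None:
            raise ValueError("nothing to accept")
        return replace(item, confirmed=True)
    if action == "change":
        if not new_value:
            raise ValueError("change requires a value")
        # A user-entered value no longer carries a parser confidence.
        return replace(item, value=new_value, confidence=None, confirmed=True)
    if action == "skip":
        # Assumed meaning: leave the field unmapped.
        return replace(item, value=None, confirmed=False)
    raise ValueError(f"unknown action: {action}")


def can_continue(items: Iterable[ReviewItem], required: Sequence[str]) -> bool:
    """Allow Continue only when every required field has a confirmed value."""
    confirmed = {i.name for i in items if i.confirmed and i.value is not None}
    return all(name in confirmed for name in required)
```

Using immutable items and returning copies keeps the original parser output available, so a correction can later be compared against what the parser suggested.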

Consequences

Positive

  • Dramatically reduced onboarding friction
  • Differentiating feature vs. competitors
  • Builds proprietary dataset of manufacturing terminology
  • System improves with usage

Negative

  • Higher initial development complexity
  • Requires investment in NLP/ML capabilities
  • Edge cases may require manual intervention
  • Needs a robust feedback loop for corrections

Mitigations

  • Start with heuristic-based MVP, add ML incrementally
  • Always provide manual override option
  • Track and analyze correction patterns
  • Set confidence thresholds to flag uncertain mappings
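
One way to express the last mitigation is to keep the thresholds from the Field Mapping Strategy section as data and flag any mapping that scores below its field's threshold. The `FieldRule` type and function names are hypothetical, and treating the threshold as the minimum score for an unflagged mapping is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldRule:
    target: str
    variations: Tuple[str, ...]
    threshold: float


# Values taken from the Field Mapping Strategy table.
FIELD_RULES = (
    FieldRule("material_grade", ("Material", "RM", "Raw Material", "Mat'l"), 0.85),
    FieldRule("operation_type: turning", ("Lathe", "Lath", "Turning", "CNC Turn"), 0.80),
    FieldRule("cycle_time", ("Time/pc", "Cycle", "CT", "Time per piece"), 0.85),
    FieldRule("rm_rate", ("Rate", "Price", "₹/kg", "Rs/kg"), 0.90),
)


def normalize(header: str) -> str:
    """Collapse case and whitespace differences."""
    return " ".join(header.casefold().split())


def find_rule(header: str) -> Optional[FieldRule]:
    """Return the rule whose known variations include this header."""
    key = normalize(header)
    for rule in FIELD_RULES:
        if key in {normalize(v) for v in rule.variations}:
            return rule
    return None


def needs_review(rule: FieldRule, confidence: float) -> bool:
    """Assumed policy: flag mappings scoring below the field's threshold."""
    return confidence < rule.threshold
```

With these values, a score of 0.85 would pass unflagged for material_grade but be flagged for rm_rate, whose threshold is 0.90.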