When I started building INSURELOANSA, I thought the hard part would be the AI. It turned out the hard part was the data.
Insurance documents are some of the most technically dense text you'll encounter: regulatory cross-references, defined terms that override ordinary language, schedules that modify the main policy body. A standard RAG implementation that treats insurance documents like Wikipedia articles will produce confidently wrong answers with plausible-sounding citations.
This is what I learned building a compliant RAG pipeline for insurance processing.
Why Standard Chunking Fails
The typical RAG tutorial tells you to chunk documents at 512 or 1024 tokens with some overlap. This works fine for general knowledge retrieval. For insurance compliance, it's a liability.
The problem: Insurance documents have semantic dependencies that span the entire document.
Consider a clause that says "the insured party is liable for the excess as defined in Schedule B." A 512-token chunk containing that clause has no idea what Schedule B says. When your AI retrieves that chunk and tries to answer a compliance question, it's working with half the information it needs.
My solution: Structure-aware chunking that preserves the document's logical hierarchy.
```python
# Assuming LangChain's Document container; any page_content/metadata pair works
from langchain_core.documents import Document


def chunk_insurance_document(text: str, doc_metadata: dict) -> list[Document]:
    """
    Parse insurance documents into semantically complete chunks
    that preserve cross-references and defined terms.
    """
    # Phase 1: Identify document structure (project-specific parsers, not shown)
    sections = extract_section_hierarchy(text)

    # Phase 2: Resolve internal references
    definitions = extract_defined_terms(sections)

    # Phase 3: Create enriched chunks
    chunks = []
    for section in sections:
        chunk_text = section.text

        # Inline definitions used in this section
        referenced_defs = find_referenced_definitions(section, definitions)
        if referenced_defs:
            chunk_text += "\n\n[Applicable definitions: " + "; ".join(
                f"{k}: {v}" for k, v in referenced_defs.items()
            ) + "]"

        chunks.append(Document(
            page_content=chunk_text,
            metadata={
                **doc_metadata,
                "section_id": section.id,
                "section_title": section.title,
                "doc_type": doc_metadata["type"],
                "cross_refs": section.external_references,
            },
        ))
    return chunks
```
This adds ~30% more tokens per chunk but dramatically improves retrieval accuracy for compliance questions.
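For a sense of what the enriched chunks look like, here's a hypothetical invocation; the file name and metadata fields are illustrative, not from the real pipeline:

```python
# Hypothetical invocation -- file name and metadata fields are illustrative
with open("docs/homeowner_policy.txt") as f:
    chunks = chunk_insurance_document(
        f.read(),
        doc_metadata={"type": "policy", "product": "homeowner", "version": "2024-03"},
    )

# Each chunk carries its section context plus any inlined definitions
print(chunks[0].metadata["section_title"])
print(chunks[0].page_content[-200:])  # trailing [Applicable definitions: ...] block
```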
The Citation Problem
Regulators don't trust "the AI said so." They need to trace every compliance finding back to a specific regulation, section, and clause.
My first implementation used standard RAG: retrieve relevant chunks, ask Claude to synthesise a compliance assessment. The output was readable and often correct — but the citations were wrong. Claude would correctly identify the compliance issue but cite a slightly wrong section number. In a regulatory context, that's as bad as being wrong.
The fix: Forced structured output with verbatim extraction.
Instead of asking Claude to synthesise from retrieved context, I switched to a two-stage approach:
- Retrieval — find the most relevant regulatory chunks
- Structured extraction — force Claude to identify the exact matching text using tool use
```python
compliance_tool = {
    "name": "record_compliance_finding",
    "description": "Record a compliance finding with exact source citation",
    "input_schema": {
        "type": "object",
        "properties": {
            "requirement_id": {"type": "string"},
            "requirement_text": {
                "type": "string",
                "description": "VERBATIM text of the regulatory requirement"
            },
            "source_document": {"type": "string"},
            "source_section": {"type": "string"},
            "application_status": {
                "type": "string",
                "enum": ["compliant", "non-compliant", "requires-review"]
            },
            "finding_detail": {"type": "string"},
        },
        "required": ["requirement_id", "requirement_text", "source_document",
                     "source_section", "application_status", "finding_detail"]
    }
}
```
This forces every finding to include the verbatim regulatory text. No paraphrasing, no synthesis. If the exact text isn't in the retrieved context, the model has to say it doesn't have enough information — which is the correct behaviour.
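Wiring the extraction stage up with the Anthropic SDK looks roughly like this. A minimal sketch, assuming the `compliance_tool` schema above; the model name and prompt wording are placeholders, and `tool_choice` pins the response to the tool so nothing comes back as free text:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def assess_compliance(regulatory_context: str, policy_excerpt: str) -> dict:
    """Stage two: structured extraction against retrieved regulatory context."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use whichever model you've validated
        max_tokens=1024,
        tools=[compliance_tool],
        # Pin the response to the tool so every finding matches the schema
        tool_choice={"type": "tool", "name": "record_compliance_finding"},
        messages=[{
            "role": "user",
            "content": (
                f"Retrieved regulatory context:\n{regulatory_context}\n\n"
                f"Policy excerpt under review:\n{policy_excerpt}\n\n"
                "Record a compliance finding. requirement_text must be copied "
                "verbatim from the context; if no exact match exists, set "
                "application_status to 'requires-review' and explain why."
            ),
        }],
    )
    # With a forced tool choice, the tool call is in the response content blocks
    tool_use = next(block for block in response.content if block.type == "tool_use")
    return tool_use.input
```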
Embedding the Regulatory Corpus
The FSCA regulatory corpus is large — hundreds of documents, millions of tokens. ChromaDB handles this well, but the embedding strategy matters.
I use two separate collections:
- Primary regulations — FSCA frameworks, NCA, core legislation. High-quality embeddings, full structure preservation.
- Product documents — policy schedules, product terms, underwriting guidelines. These change frequently and need a separate update pipeline.
Separating the collections lets me update product documents without re-embedding the stable regulatory corpus.
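In ChromaDB terms this is just two collections on one persistent client. A minimal sketch, with collection names and the storage path of my own invention; note that Chroma metadata values must be scalars, so list-valued fields like `cross_refs` get flattened:

```python
import chromadb

client = chromadb.PersistentClient(path="./vector_store")

# Stable corpus: embedded once, rarely re-indexed
regulations = client.get_or_create_collection(name="primary_regulations")

# Volatile corpus: re-embedded whenever product documents change
products = client.get_or_create_collection(name="product_documents")


def index_chunks(collection, chunks):
    """Add structure-aware chunks to a collection, keyed by section id."""
    collection.add(
        ids=[c.metadata["section_id"] for c in chunks],
        documents=[c.page_content for c in chunks],
        # Chroma metadata values must be str/int/float/bool, so flatten lists
        metadatas=[
            {**c.metadata, "cross_refs": "; ".join(c.metadata.get("cross_refs", []))}
            for c in chunks
        ],
    )
```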
What I'd Do Differently
The structure-aware chunker took three iterations to get right. I'd spend more time on document analysis before writing any RAG code. Understanding the semantic structure of your documents is the most important architectural decision you'll make — everything else follows from it.
The other thing I underestimated: the importance of retrieval evaluation before building the generation layer. I spent days tuning the generation prompt when the real problem was poor retrieval precision. Build an evaluation harness for retrieval first.
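That harness doesn't need to be elaborate: a set of hand-labelled question-to-section pairs and recall@k over them goes a long way. A sketch, reusing the `regulations` collection from above (the labelled examples are invented for illustration):

```python
# Hand-labelled (query, expected section id) pairs -- these examples are invented
EVAL_SET = [
    ("What excess applies to storm damage claims?", "schedule_b_excess"),
    ("Which disclosures must accompany a credit life policy?", "nca_s106_disclosures"),
]


def recall_at_k(collection, eval_set, k: int = 5) -> float:
    """Fraction of queries whose expected section appears in the top-k results."""
    hits = 0
    for query, expected_id in eval_set:
        results = collection.query(query_texts=[query], n_results=k)
        if expected_id in results["ids"][0]:
            hits += 1
    return hits / len(eval_set)


print(f"recall@5: {recall_at_k(regulations, EVAL_SET):.2f}")
```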
If you're building a RAG pipeline for a regulated industry, two things will determine success: how you chunk (structure-awareness over token count) and how you generate (extraction over synthesis). Get those two right and the rest is implementation detail.