— Project 03
Internal AI · RAG

AIVA

Document Intelligence Platform

AIVA is a document-intelligence platform for internal company documents: HR, contracts, financials, and asset records. The hard part was not orchestration; it was making messy PDFs, scans, blueprints, and rotated tables usable. I built a custom Textract parser that turns those files into tagged, searchable content, then a LangGraph RAG pipeline that answers over it without flooding the model context. It now runs as a multi-tenant platform with streaming chat and a live SharePoint crawl.

Role: Solo Founder + Lead Engineer
Period: 2024 to present
Status: Production
AWS Textract · Microsoft Graph · LangGraph · Weaviate · FastMCP · SvelteKit
— Chapter 01
System shape

How the system fits together.

AIVA turns messy enterprise documents into searchable knowledge, then answers questions over them. The ingestion side now crawls a full company SharePoint.
Fig. 01 — AIVA architecture
— Chapter 02
Decisions and outcomes

The calls that shaped it.

  1. 01

    The parser solves the data problem head-on: Textract classifies every region of every page, tables become their own CSV files, figures are pulled by exact bounding box and read by a vision model (diagrams → Mermaid), and each document lands as one clean, normalized folder. Everything else stands on it.

  2. 02

    Tables never get inlined, since that burns context and drowns the search. Each stays a file behind an inline tag the chunker can't split, and an MCP server turns it into a tool the model queries at runtime to filter, aggregate, and join real rows.

  3. 03

    Every question is routed by cost: a simple lookup answers from a template in ~200 ms; a substantive question runs the full path. That path is three-layer discovery (keyword catalog → summaries → filtered hybrid search), RRF fusion, an optional Cohere rerank, composition, and a final check of the answer against its own evidence. Feedback tunes the ranking.

  4. 04

    The recent push was scale: from a 5-document prototype to a real company SharePoint crawl. One orchestrator now handles local files and SharePoint, skips unchanged content, keeps moved documents attached to the same identity, and streams crawl progress into the console.

  5. 05

    I run it like production, not a demo: every sub-project goes spec → adversarial review by a second model → plan → TDD, one commit each, and that reviewer catches a real issue at every gate. The docs say plainly what's proven by tests versus what hasn't had a real end-to-end run.

  6. 06

    The console around it: a Svelte streaming chat (live pipeline · evidence · citations), the live SharePoint corpus tree, an upload → parse → index pipeline, and a per-query inspector. Plus RAGAS + side-by-side eval, LangSmith tracing, and every reported bug locked behind a regression test.
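
The "tag the chunker can't split" rule from item 02 can be sketched roughly as follows. This is a minimal illustration, not AIVA's actual chunker: the source only shows the tag as [CSV_MCP:…], so the regex, the function names `atoms` and `chunk`, the `max_chars` budget, and the tag payload in the usage example are all assumptions.

```python
import re

# Assumed tag shape: the source only shows "[CSV_MCP:…]", so the payload
# pattern here (anything up to the closing bracket) is a guess.
TAG_RE = re.compile(r"\[CSV_MCP:[^\]]*\]")


def atoms(text):
    """Split text into indivisible units: whole tags and whitespace-delimited words."""
    out, pos = [], 0
    for match in TAG_RE.finditer(text):
        out.extend(text[pos:match.start()].split())  # plain words before the tag
        out.append(match.group(0))                   # the tag itself, kept whole
        pos = match.end()
    out.extend(text[pos:].split())                   # plain words after the last tag
    return out


def chunk(text, max_chars=200):
    """Greedily pack atoms into chunks of at most max_chars characters.

    A tag is never cut: if it does not fit, it starts a new chunk, and a tag
    longer than max_chars becomes an oversized chunk of its own.
    """
    chunks, current = [], ""
    for atom in atoms(text):
        candidate = atom if not current else current + " " + atom
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = atom
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Hypothetical tag payload, for illustration only.
    sample = "Revenue by region is in [CSV_MCP:hypothetical table id] and headcount follows."
    for part in chunk(sample, max_chars=30):
        print(repr(part))
```

The sketch normalizes whitespace when it rejoins atoms, which a real chunker might not want. The point is only the invariant: chunk boundaries fall between atoms, so a table pointer always arrives at the retriever intact.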

— Aside
The interesting work isn't the stack. It's the boundaries.
— Chapter 03
How it runs

What it runs on.

  • 01
    Custom AWS Textract parser (async API; LAYOUT / TABLES / FORMS); figures captured by a vision model (diagrams → Mermaid); three output modes (text · text+folders · tag-pointer)
  • 02
    Tag-aware chunker that never splits a tag; tables kept as files, referenced by [CSV_MCP:…] tags
  • 03
    LangGraph three-path query graph (template · suggestion · full RAG) with speculative draft-and-verify
  • 04
    Three-layer document discovery: SQLite FTS catalog → AI summary vectors → filtered chunk hybrid
  • 05
    Weaviate hybrid (BM25F + vector) with native multi-tenant isolation per department; chunk vectors from text-embedding-3-large (3072-d)
  • 06
    Reciprocal-rank fusion across search legs + optional Cohere cross-encoder rerank; per-chunk user feedback adjusts ranking
  • 07
    FastMCP server: an LLM agent that chains schema / filter / aggregate / join / validate tools over pandas DataFrames
  • 08
    Source-agnostic ingestion orchestrator (DocumentSource: local + SharePoint) with durable, single-transaction, restartable jobs
  • 09
    SharePoint connector: app-only Microsoft Graph auth, throttle-aware parallel crawl, live folder tree (since-date filter · scan-lock · delta-link checkpoint)
  • 10
    Content-addressed identity & cache: skip-unchanged, dedup, stable doc-id aliases, idempotent 16-department tenant resolver
  • 11
    OpenAI GPT-4.1-mini / nano across routing, extraction, and compose (structured + streamed); GPT-4o-mini for summaries; LangSmith tracing
  • 12
    Svelte 5 + SvelteKit + Tailwind console: streaming chat · live SharePoint corpus tree · parse → index pipeline · per-query inspector; RAGAS + side-by-side eval
  • 13
    Docker Compose for the app, Weaviate, and Redis / Valkey
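
The reciprocal-rank fusion step in item 06 follows a standard formula: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. A minimal sketch, assuming each search leg returns an ordered list of document IDs; the function name `rrf_fuse`, the sample IDs, and k = 60 (a commonly used constant) are illustrative choices, not values taken from AIVA.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal-rank fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant; 60 is a common default, assumed here.
    Returns document IDs ordered by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_leg = ["a", "b", "c"]  # hypothetical keyword-search results
    vector_leg = ["b", "c", "d"]   # hypothetical vector-search results
    print(rrf_fuse([keyword_leg, vector_leg]))
```

Because the score depends only on rank, legs with incomparable raw scores (BM25 versus cosine similarity, for example) can be merged without normalization. A document that ranks moderately well in several legs can beat one that tops a single leg.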