Docugami is an intelligent document processing (IDP) platform, founded by XML co-creator Jean Paoli, that transforms business documents into XML Knowledge Graphs using open-source LLMs for data sovereignty.

Docugami

Overview

Founded in 2017 by Jean Paoli, co-creator of XML, Docugami converts business documents into XML Knowledge Graphs using exclusively open-source LLMs. Headquartered in Kirkland, Washington, the company has 40 employees and an estimated $9.3M in annual revenue, growing 18% year over year. In late 2025, Docugami established a French subsidiary targeting regulated sectors - insurance and healthcare - where data residency requirements make cloud-dependent competitors a harder sell.

The core differentiation is architectural. Where traditional IDP vendors extract flat text, Docugami's patented Business Document Foundation Model uses Contextual Semantic Labels (CSLs) to produce XML semantic trees that preserve hierarchical relationships across heterogeneous document types. As Paoli has framed it, current AI models "can generate and summarize, but still struggle to extract, structure and cross-reference information contained in long and heterogeneous documents" - the gap Docugami's Knowledge Graph approach is built to close.

In February 2026, Docugami launched an MCP Server at https://api.docugami.com/mcp, exposing its Document AI as standardized tools callable by any MCP-compatible agent runtime. This is a distribution shift, not a capability change: the underlying KG-RAG architecture is unchanged, but the integration cost for embedding Docugami into third-party agent workflows drops significantly. Named integrations at launch include Mistral's Le Chat and GitHub Copilot in VS Code.

Gartner has cited Docugami as an example of Generative AI innovation beyond traditional IDP. The company holds grants from the U.S. National Science Foundation, NASA, and Mitacs, and participates in the NVIDIA Inception program. Total funding stands at $11.22M.

How Docugami processes documents

Docugami's pipeline transforms documents into structured Knowledge Graphs rather than extracting flat text. The Business Document Foundation Model applies Contextual Semantic Labels (CSLs) for hierarchical semantic chunking, producing XML semantic trees where every information element becomes an actionable data node with its structural context preserved.
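To make the contrast with flat text concrete, here is a minimal sketch of what querying a semantic tree can look like. The element names (`Policy`, `Coverage`, `Limit`) and the helper `limit_for` are hypothetical illustrations, not Docugami's actual schema or labels; the point is only that a value such as a limit stays attached to the coverage it belongs to.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic tree; tag names are illustrative, not Docugami's schema.
doc = ET.fromstring("""
<Policy>
  <Insured>Acme Roofing LLC</Insured>
  <Coverage type="GeneralLiability">
    <Limit currency="USD">1000000</Limit>
    <Deductible currency="USD">5000</Deductible>
  </Coverage>
  <Coverage type="Property">
    <Limit currency="USD">250000</Limit>
    <Deductible currency="USD">2500</Deductible>
  </Coverage>
</Policy>
""")


def limit_for(tree, coverage_type):
    """Return the limit for one coverage type, or None if it is absent."""
    node = tree.find(f"./Coverage[@type='{coverage_type}']/Limit")
    return int(node.text) if node is not None else None


# Each limit is reachable through its parent coverage, so the two
# "Limit" values cannot be confused the way they could be in flat text.
property_limit = limit_for(doc, "Property")
liability_limit = limit_for(doc, "GeneralLiability")
```

In flat extracted text, both limits would appear as bare numbers; in the tree, the parent `Coverage` element carries the context needed to tell them apart.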

On top of this graph, the KG-RAG Architecture enhances standard Retrieval-Augmented Generation by querying the Knowledge Graph rather than raw text chunks - enabling cross-document reasoning that standard RAG approaches cannot perform reliably on long, format-varied documents. The platform uses exclusively open-source LLMs throughout, which addresses data sovereignty requirements in regulated sectors and differentiates it from cloud-dependent competitors like Rossum and ABBYY.
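The retrieval idea can be sketched with a toy example. This is an assumption-laden illustration, not Docugami's implementation: the `Node` type, the path strings, and the two lease documents are invented for the example. It shows retrieval keyed on a node's position in the graph, which returns comparable values from several documents in one step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    doc_id: str  # which document the node came from
    path: str    # hypothetical hierarchical label path
    value: str   # extracted value at that position


# Toy graph: two leases with the same structure but different values.
GRAPH = [
    Node("lease_a", "Lease/Term/RenewalNoticeDays", "90"),
    Node("lease_a", "Lease/Rent/MonthlyAmount", "4200"),
    Node("lease_b", "Lease/Term/RenewalNoticeDays", "60"),
    Node("lease_b", "Lease/Rent/MonthlyAmount", "3900"),
]


def retrieve(graph, path):
    """Return {doc_id: value} for every document holding a node at `path`."""
    return {n.doc_id: n.value for n in graph if n.path == path}


def build_context(graph, path):
    """Format retrieved nodes as context lines that could be placed in an LLM prompt."""
    hits = retrieve(graph, path)
    return "\n".join(f"[{doc}] {path} = {val}" for doc, val in sorted(hits.items()))


context = build_context(GRAPH, "Lease/Term/RenewalNoticeDays")
```

A chunk-based retriever would have to hope that the relevant sentence from each lease lands in the top-ranked chunks; here the shared path makes the cross-document comparison a direct lookup.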

The February 2026 MCP Server launch adds a new access layer to this pipeline. Via https://api.docugami.com/mcp with HTTP Bearer token authentication, any MCP-compatible agent can now list documents, docsets, and projects; upload documents for processing; download generated artifacts and reports; and add or remove documents from projects. Full technical documentation is available at api-docs.docugami.com/mcp.html. The practical effect is that Docugami's document engineering becomes embeddable infrastructure rather than a standalone destination - reducing the integration friction that has historically limited IDP adoption in mid-market agentic workflows. Teams evaluating open-source LLM-based extraction alternatives may also want to review Unstract, which takes a no-code approach to LLM-powered document processing with hallucination mitigation built into its pipeline.
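For the MCP access path described above, the following is a minimal sketch of how a client might construct a request, using only the Python standard library. It assumes the server speaks MCP's JSON-RPC 2.0 over HTTP and uses the protocol's generic `tools/list` discovery method; Docugami's actual tool names are not given in the source, so none are assumed here. The helper `build_mcp_request` and the placeholder token are hypothetical, and the sketch builds the request without sending it. A real client would typically perform the MCP initialization handshake first, or use an MCP client library.

```python
import json
import urllib.request

MCP_URL = "https://api.docugami.com/mcp"  # endpoint named in the launch announcement


def build_mcp_request(method, params=None, token="YOUR_API_TOKEN", request_id=1):
    """Build (but do not send) a JSON-RPC 2.0 POST request for an MCP server."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        payload["params"] = params
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # HTTP Bearer token auth
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


# Ask the server which tools it exposes instead of guessing tool names.
req = build_mcp_request("tools/list")
```

Discovering tools at runtime keeps the client independent of any particular tool naming, which the vendor documentation at api-docs.docugami.com/mcp.html would be the authority on.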

No benchmark data or third-party validation accompanies the MCP announcement. Performance claims on the launch page are vendor-asserted. "Patented AI" is referenced without patent numbers.

Use cases

Commercial insurance

Docugami names four concrete automation scenarios for commercial insurance, each with defined inputs and outputs:

  • Loss run extraction - agent extracts key data from loss runs on receipt and generates a structured report
  • Policy comparison - agent compares policies across multiple providers in varied formats, identifying client-specific advantages
  • Submission preparation - agent aggregates loss runs, ACORD forms, claim documents, and emails into underwriting submission format
  • COI drafting - agent detects certificate of insurance request emails, identifies the relevant policy, drafts the certificate, and routes for human review

The depth of these scenarios - not just protocol compliance but defined inputs, outputs, and human-in-the-loop routing - signals genuine vertical focus. Indico Data and SortSpoke compete in overlapping commercial insurance workflows.

Legal

Legal teams use Docugami's document engineering approach to transform contract portfolios into structured data. The system's contextual understanding identifies non-standard clauses and cross-references obligations across multiple agreements, enabling comparative analysis without manual review. The XML semantic tree representation is particularly suited to contracts, where hierarchical clause relationships matter as much as the text itself. Cognaize, which applies neuro-symbolic AI to financial and legal documents, takes a comparable structured-reasoning approach to complex document types.

Regulated sector compliance

European insurance and healthcare organizations deploy Docugami's open-source LLM stack to meet data residency requirements. The French subsidiary established in late 2025 provides a local operational base for sectors where cloud-dependent processing creates regulatory exposure. The XML Knowledge Graphs extract compliance-relevant information while keeping data within jurisdictional boundaries. Taiger, which specializes in behind-firewall generative AI document processing for regulated industries, addresses a similar data sovereignty concern through a different architectural approach.

Real estate and construction

The MCP Server announcement identifies real estate and construction as target verticals alongside insurance and legal - industries defined by high-volume, format-varied long-form documents. Pixydocs, which uses neural networks for construction and property management document workflows, targets overlapping document types in these sectors. Specific automation scenarios for real estate and construction are not detailed in available Docugami sources.

Technical specifications

Feature | Specification
Core Technology | Patented Business Document Foundation Model, XML Knowledge Graphs
AI Architecture | Exclusively open-source LLMs with agentic quality control
Chunking Method | Hierarchical semantic chunking via Contextual Semantic Labels (CSLs)
Output Format | XML semantic trees with actionable data nodes
Retrieval Architecture | KG-RAG (Knowledge Graph-enabled RAG)
MCP Server | https://api.docugami.com/mcp - Bearer token auth; docs at api-docs.docugami.com/mcp.html
MCP Integrations | Mistral Le Chat, GitHub Copilot in VS Code (at launch)
Data Sovereignty | Open-source LLMs throughout; no cloud dependency
Geographic Operations | Kirkland, WA headquarters; French subsidiary (est. late 2025)
Company Size | 40 employees, $9.3M estimated revenue (18% YoY growth)
Funding | $11.22M total; NSF, NASA, and Mitacs grants
Technology Partners | NVIDIA Inception program member

Resources

Company information

Headquarters: Kirkland, Washington, United States

European Operations: French subsidiary launched late 2025, targeting regulated insurance and healthcare sectors

Founded: 2017 by Jean Paoli (XML co-creator) and team

Funding: $11.22M total; grants from NSF, NASA, and Mitacs

Recognition: Gartner (Generative AI innovation example), NVIDIA Inception program member