Docugami: IDP Software Vendor
Docugami is an IDP platform founded by XML co-creator Jean Paoli. It transforms business documents into XML Knowledge Graphs using open-source LLMs for data sovereignty.

Overview
Founded in 2017 by Jean Paoli, co-creator of XML, Docugami converts business documents into XML Knowledge Graphs using exclusively open-source LLMs. The company has 40 employees and $9.3M estimated annual revenue with 18% year-over-year growth, headquartered in Kirkland, Washington. In late 2025, Docugami established a French subsidiary targeting regulated sectors - insurance and healthcare - where data residency requirements make cloud-dependent competitors a harder sell.
The core differentiation is architectural. Where traditional IDP vendors extract flat text, Docugami's patented Business Document Foundation Model uses Contextual Semantic Labels (CSLs) to produce XML semantic trees that preserve hierarchical relationships across heterogeneous document types. As Paoli has framed it, current AI models "can generate and summarize, but still struggle to extract, structure and cross-reference information contained in long and heterogeneous documents" - the gap Docugami's Knowledge Graph approach is built to close.
In February 2026, Docugami launched an MCP Server at https://api.docugami.com/mcp, exposing its Document AI as standardized tools callable by any MCP-compatible agent runtime. This is a distribution shift, not a capability change: the underlying KG-RAG architecture is unchanged, but embedding Docugami into third-party agent workflows should cost less, because agents call it through a standard protocol instead of a custom integration. Named integrations at launch include Mistral's Le Chat and GitHub Copilot in VS Code.
Gartner has cited Docugami as an example of Generative AI innovation beyond traditional IDP. The company holds grants from the U.S. National Science Foundation, NASA, and Mitacs, and participates in the NVIDIA Inception program. Total funding stands at $11.22M.
How Docugami processes documents
Docugami's pipeline transforms documents into structured Knowledge Graphs rather than extracting flat text. The Business Document Foundation Model applies Contextual Semantic Labels (CSLs) for hierarchical semantic chunking, producing XML semantic trees where every information element becomes an actionable data node with its structural context preserved.
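To make "structural context preserved" concrete, the following is a minimal sketch of walking an XML semantic tree with Python's standard library. The sample document and every tag name in it (`Lease`, `RenewalOption`, and so on) are invented for illustration and are not Docugami's actual schema or output; the point is only that each leaf value keeps the path of ancestors that gives it meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic tree. Tag names are illustrative, not Docugami's schema.
SAMPLE = """
<Lease>
  <Parties>
    <Landlord>Acme Properties LLC</Landlord>
    <Tenant>Birch Street Bakery Inc.</Tenant>
  </Parties>
  <Term>
    <StartDate>2024-01-01</StartDate>
    <RenewalOption>
      <NoticePeriod>90 days</NoticePeriod>
    </RenewalOption>
  </Term>
</Lease>
""".strip()


def leaf_nodes(root):
    """Yield (path, text) for every leaf; the path records all ancestor tags."""
    def walk(node, trail):
        here = trail + [node.tag]
        children = list(node)
        if not children:
            yield "/".join(here), (node.text or "").strip()
        for child in children:
            yield from walk(child, here)

    yield from walk(root, [])


root = ET.fromstring(SAMPLE)
nodes = dict(leaf_nodes(root))
for path, text in nodes.items():
    print(path, "=", text)
```

A flat text extraction of the same document would yield "90 days" with no indication that it is the notice period of a renewal option inside the lease term; the path carries that information.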
On top of this graph, the KG-RAG Architecture enhances standard Retrieval-Augmented Generation by querying the Knowledge Graph rather than raw text chunks - enabling cross-document reasoning that standard RAG approaches cannot perform reliably on long, format-varied documents. The platform uses exclusively open-source LLMs throughout, which addresses data sovereignty requirements in regulated sectors and differentiates it from cloud-dependent competitors like Rossum and ABBYY.
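The difference between retrieving raw chunks and querying labelled graph nodes can be illustrated with a toy example. This is a sketch of the general idea only, not Docugami's KG-RAG implementation: the `Node` type, the labels, the file names, and the values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    doc_id: str
    label: str  # hypothetical semantic label attached to the value
    value: str


# Toy graph: the same labels recur across two differently formatted documents.
GRAPH = [
    Node("policy_a.pdf", "Deductible", "$5,000"),
    Node("policy_a.pdf", "PolicyLimit", "$1,000,000"),
    Node("policy_b.pdf", "Deductible", "$2,500"),
    Node("policy_b.pdf", "PolicyLimit", "$2,000,000"),
]


def retrieve(graph, label):
    """Return {doc_id: value} for every node carrying the requested label."""
    return {n.doc_id: n.value for n in graph if n.label == label}


def build_context(graph, labels):
    """Assemble labelled facts, grouped by label then document, for an LLM prompt."""
    lines = []
    for label in labels:
        for doc_id, value in sorted(retrieve(graph, label).items()):
            lines.append(f"{doc_id} | {label} = {value}")
    return "\n".join(lines)


context = build_context(GRAPH, ["Deductible"])
print(context)
```

Because retrieval is keyed on the label rather than on text similarity, the comparable value from each document is returned even when the source documents phrase or lay it out differently.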
The February 2026 MCP Server launch adds a new access layer to this pipeline. Via https://api.docugami.com/mcp with HTTP Bearer token authentication, any MCP-compatible agent can now list documents, docsets, and projects; upload documents for processing; download generated artifacts and reports; and add or remove documents from projects. Full technical documentation is available at api-docs.docugami.com/mcp.html. The practical effect is that Docugami's document engineering becomes embeddable infrastructure rather than a standalone destination - reducing the integration friction that has historically limited IDP adoption in mid-market agentic workflows. Teams evaluating open-source LLM-based extraction alternatives may also want to review Unstract, which takes a no-code approach to LLM-powered document processing with hallucination mitigation built into its pipeline.
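As a rough illustration of what calling the endpoint involves, the sketch below builds (but does not send) a JSON-RPC 2.0 request of the kind MCP uses, with the Bearer token in the `Authorization` header. The `tools/list` method comes from the MCP specification; the helper name `build_mcp_request` and the environment variable `DOCUGAMI_API_TOKEN` are assumptions for this example. A real session would normally begin with MCP's `initialize` handshake, and the tool names Docugami exposes should be discovered from the `tools/list` response or the vendor documentation rather than guessed.

```python
import json
import os
import urllib.request

MCP_URL = "https://api.docugami.com/mcp"  # endpoint named in the text


def build_mcp_request(method, params=None, token="", request_id=1):
    """Build, without sending, a JSON-RPC 2.0 POST request for an MCP server."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        payload["params"] = params
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # MCP's HTTP transport may answer with plain JSON or an event stream.
            "Accept": "application/json, text/event-stream",
        },
    )


# Assumed variable name; falls back to a placeholder so the sketch runs offline.
token = os.environ.get("DOCUGAMI_API_TOKEN", "example-token")
req = build_mcp_request("tools/list", token=token)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

In practice most teams would let an MCP client library or the agent runtime handle the handshake and transport; the sketch only shows how little is needed at the wire level.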
No benchmark data or third-party validation accompanies the MCP announcement. Performance claims on the launch page are vendor-asserted. "Patented AI" is referenced without patent numbers.
Use cases
Commercial insurance
Docugami names four concrete automation scenarios for commercial insurance, each with defined inputs and outputs:
- Loss run extraction - agent extracts key data from loss runs on receipt and generates a structured report
- Policy comparison - agent compares policies across multiple providers in varied formats, identifying client-specific advantages
- Submission preparation - agent aggregates loss runs, ACORD forms, claim documents, and emails into underwriting submission format
- COI drafting - agent detects certificate of insurance request emails, identifies the relevant policy, drafts the certificate, and routes for human review
The depth of these scenarios - not just protocol compliance but defined inputs, outputs, and human-in-the-loop routing - signals genuine vertical focus. Indico Data and SortSpoke compete in overlapping commercial insurance workflows.
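The COI drafting scenario can be sketched as a small pipeline: detect the request, match the policy, draft, and hold for review. Everything here is hypothetical, including the `Draft` type, the `POLICIES` index, and the keyword-based detection, and none of it reflects how Docugami's agents are implemented. It only shows the shape of a flow in which a draft is never issued without human sign-off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    policy_id: str
    holder: str
    status: str = "pending_human_review"  # drafts are never auto-issued


# Hypothetical policy index: policy id -> named insured.
POLICIES = {"ACME-001": "Acme Roofing LLC"}


def is_coi_request(subject: str) -> bool:
    """Naive keyword check standing in for real email classification."""
    text = subject.lower()
    return "certificate of insurance" in text or "coi" in text.split()


def draft_coi(subject: str, insured: str, holder: str) -> Optional[Draft]:
    """Return a draft routed to human review, or None if nothing applies."""
    if not is_coi_request(subject):
        return None
    matches = [pid for pid, name in POLICIES.items() if name == insured]
    if not matches:
        return None  # no policy found; a real system would escalate instead
    return Draft(policy_id=matches[0], holder=holder)


draft = draft_coi("Need COI for job site", "Acme Roofing LLC", "City of Kirkland")
print(draft)
```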
Legal contract analysis
Legal teams use Docugami's document engineering approach to transform contract portfolios into structured data. The system's contextual understanding identifies non-standard clauses and cross-references obligations across multiple agreements, enabling comparative analysis without manual review. The XML semantic tree representation is particularly suited to contracts, where hierarchical clause relationships matter as much as the text itself. Cognaize, which applies neuro-symbolic AI to financial and legal documents, takes a comparable structured-reasoning approach to complex document types.
Regulated sector compliance
European insurance and healthcare organizations deploy Docugami's open-source LLM stack to meet data residency requirements. The French subsidiary established in late 2025 provides a local operational base for sectors where cloud-dependent processing creates regulatory exposure. The XML Knowledge Graphs extract compliance-relevant information while keeping data within jurisdictional boundaries. Taiger, which specializes in behind-firewall generative AI document processing for regulated industries, addresses a similar data sovereignty concern through a different architectural approach.
Real estate and construction
The MCP Server announcement identifies real estate and construction as target verticals alongside insurance and legal - industries defined by high-volume, format-varied long-form documents. Pixydocs, which uses neural networks for construction and property management document workflows, targets overlapping document types in these sectors. Specific automation scenarios for real estate and construction are not detailed in available Docugami sources.
Technical specifications
| Feature | Specification |
|---|---|
| Core Technology | Patented Business Document Foundation Model, XML Knowledge Graphs |
| AI Architecture | Exclusively open-source LLMs with agentic quality control |
| Chunking Method | Hierarchical semantic chunking via Contextual Semantic Labels (CSLs) |
| Output Format | XML semantic trees with actionable data nodes |
| Retrieval Architecture | KG-RAG (Knowledge Graph-enabled RAG) |
| MCP Server | https://api.docugami.com/mcp - Bearer token auth; docs at api-docs.docugami.com/mcp.html |
| MCP Integrations | Mistral Le Chat, GitHub Copilot in VS Code (at launch) |
| Data Sovereignty | Open-source LLMs throughout; no cloud dependency |
| Geographic Operations | Kirkland, WA headquarters; French subsidiary (est. late 2025) |
| Company Size | 40 employees, $9.3M estimated revenue (18% YoY growth) |
| Funding | $11.22M total; NSF, NASA, and Mitacs grants |
| Technology Partners | NVIDIA Inception program member |
Resources
- Website
- MCP Server announcement
- MCP API documentation
- Document AI blog
- Competitive Analysis
Company information
Headquarters: Kirkland, Washington, United States
European Operations: French subsidiary launched late 2025, targeting regulated insurance and healthcare sectors
Founded: 2017 by Jean Paoli (XML co-creator) and team
Funding: $11.22M total; grants from NSF, NASA, and Mitacs
Recognition: Gartner (Generative AI innovation example), NVIDIA Inception program member