Docugami — IDP Software Vendor
Docugami is an IDP platform founded by XML co-creator Jean Paoli that transforms business documents into XML Knowledge Graphs using open-source LLMs for data sovereignty.

Overview
Founded in 2017 by Jean Paoli, co-creator of XML, Docugami converts business documents into XML Knowledge Graphs using exclusively open-source LLMs. Headquartered in Kirkland, Washington, the company has 40 employees and an estimated $9.3M in annual revenue, growing 18% year over year. In late 2025, Docugami established a French subsidiary targeting regulated sectors including insurance and healthcare, where data residency requirements make cloud-dependent competitors a harder sell.
The core differentiation is architectural. Where traditional intelligent document processing (IDP) vendors extract flat text, Docugami's patented Business Document Foundation Model uses Contextual Semantic Labels (CSLs) to produce XML semantic trees that preserve hierarchical relationships across heterogeneous document types. As Paoli has stated, current AI models "can generate and summarize, but still struggle to extract, structure and cross-reference information contained in long and heterogeneous documents." This gap in extraction and structural understanding is what Docugami's Knowledge Graph approach is built to close.
In February 2026, Docugami launched an MCP Server at https://api.docugami.com/mcp, exposing its Document AI as standardized tools callable by any MCP-compatible agent runtime. This is a distribution shift, not a capability change: the underlying KG-RAG architecture is unchanged, but the integration cost for embedding Docugami into third-party agent workflows drops significantly. Named integrations at launch include Mistral's Le Chat and GitHub Copilot in VS Code.
Docugami has also published KG-RAG-datasets on GitHub, an open-source evaluation framework for knowledge graph retrieval-augmented generation (RAG) built on SEC 10-Q filings from Apple, Amazon, Intel, Microsoft, and Nvidia. The release enables third-party benchmarking of Docugami's document AI pipeline on financial documents, a category where accuracy and auditability are non-negotiable.
Gartner has cited Docugami as an example of generative AI innovation beyond traditional IDP. The company holds grants from the U.S. National Science Foundation, NASA, and Mitacs, and participates in the NVIDIA Inception program. Total funding stands at $11.22M.
How Docugami processes documents
Docugami's pipeline transforms documents into structured Knowledge Graphs rather than extracting flat text. The Business Document Foundation Model applies Contextual Semantic Labels for hierarchical semantic chunking, producing XML semantic trees where every information element becomes an actionable data node with its structural context preserved.
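To make structure-preserving output concrete, here is a minimal Python sketch. The element names (`Lease`, `RentClause`, `MonthlyRent`, and so on) and the values are invented for illustration and are not Docugami's actual schema or CSL vocabulary. The point is only that each value in a semantic tree can be read together with the chain of labels that gives it context, which a flat text dump discards.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic tree; tag names and values are illustrative,
# not Docugami's real output format.
SAMPLE_XML = """
<Lease>
  <Parties>
    <Landlord>Acme Properties LLC</Landlord>
    <Tenant>Globex Corp</Tenant>
  </Parties>
  <RentClause>
    <MonthlyRent>4500</MonthlyRent>
    <DueDay>1</DueDay>
  </RentClause>
</Lease>
"""


def leaf_nodes_with_context(root):
    """Return (path, text) for every leaf; path is the chain of enclosing labels."""
    results = []

    def walk(node, path):
        current = path + [node.tag]
        children = list(node)
        if not children:
            results.append(("/".join(current), (node.text or "").strip()))
        for child in children:
            walk(child, current)

    walk(root, [])
    return results


root = ET.fromstring(SAMPLE_XML.strip())
nodes = leaf_nodes_with_context(root)
for path, value in nodes:
    print(f"{path} = {value}")
```

In this toy tree the number `4500` is not just a token in a paragraph: it arrives labeled as `Lease/RentClause/MonthlyRent`, so downstream code can address it directly.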
On top of this graph, the KG-RAG architecture enhances standard retrieval-augmented generation by querying the Knowledge Graph rather than raw text chunks, enabling cross-document reasoning that standard RAG approaches cannot perform reliably on long, format-varied documents. The platform uses exclusively open-source LLMs throughout, which addresses data sovereignty requirements in regulated sectors and differentiates it from cloud-dependent competitors like Rossum and ABBYY.
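The difference between querying a graph and retrieving text chunks can be illustrated with a toy example. The sketch below is not Docugami's KG-RAG implementation: the triples, predicate names, file names, and the `contracts_expiring_before` helper are all hypothetical. It shows the kind of cross-document question (which contracts with a given counterparty expire before a cutoff date) that reduces to a simple join over graph nodes, whereas flat-text RAG would need several independently retrieved chunks to line up.

```python
from datetime import date

# Hypothetical (subject, predicate, object) triples, imagined as extracted
# from three separate documents.
TRIPLES = [
    ("contract_a.pdf", "counterparty", "Globex Corp"),
    ("contract_a.pdf", "expires", date(2026, 3, 31)),
    ("contract_b.pdf", "counterparty", "Globex Corp"),
    ("contract_b.pdf", "expires", date(2027, 1, 15)),
    ("contract_c.pdf", "counterparty", "Initech"),
    ("contract_c.pdf", "expires", date(2026, 2, 1)),
]


def objects(subject, predicate):
    """All objects attached to a subject through the given predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def contracts_expiring_before(counterparty, cutoff):
    """Join two predicates across documents: matching counterparty AND expiry before cutoff."""
    subjects = sorted(
        {s for s, p, o in TRIPLES if p == "counterparty" and o == counterparty}
    )
    return [s for s in subjects if any(d < cutoff for d in objects(s, "expires"))]


print(contracts_expiring_before("Globex Corp", date(2026, 12, 31)))
```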
The February 2026 MCP Server launch adds a new access layer to this pipeline. Via https://api.docugami.com/mcp with HTTP Bearer token authentication, any MCP-compatible agent can now list documents, docsets, and projects; upload documents for processing; download generated artifacts and reports; and add or remove documents from projects. Full technical documentation is available at api-docs.docugami.com/mcp.html. The practical effect is that Docugami's document engineering becomes embeddable infrastructure rather than a standalone destination, reducing the integration friction that has historically limited IDP adoption in mid-market agentic workflows.
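As a rough illustration of what calling the endpoint involves, the sketch below builds (but does not send) a JSON-RPC 2.0 `tools/list` request, the method the Model Context Protocol defines for tool discovery. The token is a placeholder, and the sketch assumes the server accepts MCP's HTTP transport with the Bearer header described above. A real client would first perform the protocol's `initialize` handshake, usually through an MCP SDK, and the tool names it gets back should be taken from Docugami's documentation rather than assumed.

```python
import json
import urllib.request

MCP_URL = "https://api.docugami.com/mcp"  # endpoint named in the launch announcement


def build_tools_list_request(token, request_id=1):
    """Build an unsent HTTP request asking an MCP server to list its tools."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # MCP's HTTP transport allows a JSON reply or an event stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


# Placeholder token, not a real credential. Nothing is sent over the network.
req = build_tools_list_request("YOUR_API_TOKEN")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```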
The KG-RAG-datasets release adds a third layer: independent verifiability. The dataset features manually curated questions, source document annotations, and human-reviewed question-answer pairs derived from SEC 10-Q filings. Draft answers were generated with GPT-4-Turbo and then reviewed manually, at a rate of approximately two days per 20 questions. That labor-intensive validation signals that Docugami treats automated LLM output as a starting point rather than a finished product, a position consistent with its broader emphasis on explainability in regulated industries.
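The stated review rate implies a simple cost model. The sketch below extrapolates it under the assumption that effort scales linearly with the number of questions, which the source does not confirm. The question counts in the loop are arbitrary examples, not the size of the published dataset.

```python
DAYS_PER_BATCH = 2        # stated rate: approximately two days...
QUESTIONS_PER_BATCH = 20  # ...per 20 questions


def estimated_review_days(num_questions):
    """Estimate reviewer-days, assuming effort scales linearly with question count."""
    if num_questions < 0:
        raise ValueError("num_questions must be non-negative")
    return num_questions * DAYS_PER_BATCH / QUESTIONS_PER_BATCH


for n in (20, 100, 500):
    print(n, "questions ->", estimated_review_days(n), "reviewer-days")
```

At this rate each question costs roughly a tenth of a reviewer-day, which is why human-validated benchmarks of this kind tend to stay small.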
No third-party benchmark data accompanies the MCP announcement. Performance claims on the MCP launch page are vendor-asserted. "Patented AI" is referenced without patent numbers.
Teams evaluating open-source LLM-based extraction alternatives may also want to review Unstract, which takes a no-code approach to LLM-powered document processing with hallucination mitigation built into its pipeline.
Use cases
Commercial insurance
Docugami names four concrete automation scenarios for commercial insurance, each with defined inputs and outputs:
- Loss run extraction: an agent pulls key data from loss runs on receipt and generates a structured report.
- Policy comparison: an agent compares policies from multiple providers in varied formats and identifies client-specific advantages.
- Submission preparation: an agent aggregates loss runs, ACORD forms, claim documents, and emails into underwriting submission format.
- Certificate of insurance drafting: an agent detects COI request emails, identifies the relevant policy, drafts the certificate, and routes it for human review.
These scenarios go beyond protocol compliance. Each specifies its inputs and outputs, and the certificate workflow includes human-in-the-loop routing, which signals genuine vertical focus. Indico Data and SortSpoke compete in overlapping commercial insurance workflows.
Financial document analysis
The KG-RAG-datasets release positions Docugami directly in financial services document processing. The dataset covers SEC 10-Q filings from five major technology companies, with question-answer pairs validated by human reviewers. For financial analysts and compliance teams, the open-source benchmark provides a concrete basis for evaluating whether Docugami's knowledge graph retrieval outperforms flat-text RAG on the multi-section, cross-referenced structure typical of regulatory filings. Accuracy and traceability requirements in this segment make the human-validation methodology a feature, not just a quality control step.
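One way such a comparison could be run is sketched below. The scoring rule (normalized exact match), the function names, and the record layout are assumptions for illustration; the actual file format of KG-RAG-datasets and any recommended scoring method should be taken from the repository. The answers in the example are placeholder strings, not questions from the dataset or results from any system.

```python
def normalize(text):
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(text.lower().split()).rstrip(".")


def exact_match_accuracy(reference, predicted):
    """Fraction of reference questions whose normalized prediction matches."""
    if not reference:
        return 0.0
    hits = sum(
        1
        for qid, answer in reference.items()
        if normalize(predicted.get(qid, "")) == normalize(answer)
    )
    return hits / len(reference)


# Placeholder data only: not taken from the dataset or from any real system.
reference = {"q1": "Answer A", "q2": "Answer B", "q3": "Answer C", "q4": "Answer D"}
system_x = {"q1": "answer a", "q2": "Answer B.", "q3": "wrong", "q4": "Answer D"}
system_y = {"q1": "Answer A", "q2": "other", "q3": "wrong"}  # q4 left unanswered

print("system_x:", exact_match_accuracy(reference, system_x))
print("system_y:", exact_match_accuracy(reference, system_y))
```

Exact match is deliberately strict. Free-form answers about financial filings usually need a more tolerant comparison or human grading, which fits the manual review the dataset itself relies on.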
Legal contract analysis
Legal teams use Docugami's document engineering approach to transform contract portfolios into structured data. The system's contextual understanding identifies non-standard clauses and cross-references obligations across multiple agreements, enabling comparative analysis without manual review. The XML semantic tree representation is particularly suited to contracts, where hierarchical clause relationships matter as much as the text itself. Cognaize, which applies neuro-symbolic AI to financial and legal documents, takes a comparable structured-reasoning approach to complex document types.
Regulated sector compliance
European insurance and healthcare organizations deploy Docugami's open-source LLM stack to meet data residency requirements. The French subsidiary established in late 2025 provides a local operational base for sectors where cloud-dependent processing creates regulatory exposure. The XML Knowledge Graphs extract compliance-relevant information while keeping data within jurisdictional boundaries. Taiger, which specializes in behind-firewall generative AI document processing for regulated industries, addresses a similar data sovereignty concern through a different architectural approach.
Real estate and construction
The MCP Server announcement identifies real estate and construction as target verticals alongside insurance and legal, industries defined by high-volume, format-varied long-form documents. Pixydocs, which uses neural networks for construction and property management document workflows, targets overlapping document types in these sectors. Specific automation scenarios for real estate and construction are not detailed in available Docugami sources.
Technical specifications
| Feature | Specification |
|---|---|
| Core technology | Patented Business Document Foundation Model, XML Knowledge Graphs |
| AI architecture | Exclusively open-source LLMs with agentic quality control |
| Chunking method | Hierarchical semantic chunking via Contextual Semantic Labels (CSLs) |
| Output format | XML semantic trees with actionable data nodes |
| Retrieval architecture | KG-RAG (Knowledge Graph-enabled RAG) |
| Evaluation dataset | KG-RAG-datasets (GitHub, open-source); SEC 10-Q filings, 5 companies |
| Validation methodology | GPT-4-Turbo drafts with human review (~2 days per 20 questions) |
| MCP Server | https://api.docugami.com/mcp (Bearer token auth) |
| MCP integrations | Mistral Le Chat, GitHub Copilot in VS Code (at launch) |
| Data sovereignty | Open-source LLMs throughout; no cloud dependency |
| Geographic operations | Kirkland, WA headquarters; French subsidiary (est. late 2025) |
| Company size | 40 employees, $9.3M estimated revenue (18% YoY growth) |
| Funding | $11.22M total; NSF, NASA, and Mitacs grants |
| Technology partners | NVIDIA Inception program member |
Resources
- Website
- KG-RAG-datasets on GitHub
- MCP API documentation
- Document AI blog
- Competitive Analysis
Company information
Headquarters: Kirkland, Washington, United States
European operations: French subsidiary launched late 2025, targeting regulated insurance and healthcare sectors
Founded: 2017 by Jean Paoli (XML co-creator) and team
Funding: $11.22M total; grants from NSF, NASA, and Mitacs
Recognition: Gartner (generative AI innovation example), NVIDIA Inception program member