Docling: IBM Open-Source Document Processing
Open-source document processing library for AI applications, developed by IBM Research Zurich and hosted by the LF AI & Data Foundation.

Overview
Docling began as IBM Research's answer to a specific problem: extracting clean, structured data from unstructured documents for AI pipelines. Peter Staar, Principal Research Staff Member at IBM Research Zurich and Chair of Technical Steering at the Linux Foundation, frames the challenge plainly: "How hard can it be? Well, it can be very hard." Since its open-sourcing in July 2024, the library has logged over 100 releases and crossed 37,000 GitHub stars. Red Hat cited that community momentum when describing Docling as "the number one open source repository for document intelligence."
The strategic arc is deliberate. What launched as a document converter is now positioned as parsing and orchestration infrastructure for enterprise AI pipelines. Three moves in early 2026 confirm the direction: IBM released Granite-Docling-258M, a production-grade vision-language model (VLM) under Apache 2.0; launched the Docling OpenShift Operator with Red Hat, targeting banks as the named deployment segment; and donated the project to the Linux Foundation's Agentic AI Foundation (AAIF) alongside BeeAI and Data Prep Kit. Staar identifies two forward priorities: schema-driven structured extraction and agentic document generation. His framing makes clear this is deliberate repositioning, not feature drift: "It's not just conversion anymore. We're thinking through it. We're generating and manipulating documents."
The Register classifies Docling as a framework and tool alongside Goose and BeeAI, not as a protocol alongside MCP and A2A. This editorial distinction places its competitive frame against other document-processing and agentic orchestration frameworks rather than the protocol layer.
Docling operates under the MIT License, granting unrestricted commercial and personal use rights with no copyleft restrictions. All repositories under the docling-project GitHub organization maintain MIT licensing, enabling integration into proprietary systems without licensing conflicts. The companion Granite-Docling-258M model is separately licensed under Apache 2.0.
Open-source release date: IBM open-sourced Docling in July 2024 under the MIT license. The project was identified as the top trending repository worldwide in November 2024, before its first formal PyPI release in August 2025.
How Docling processes documents
Docling's architecture spans eight public repositories: docling (main package), docling-core (types, transforms, serializers; home of DoclingDocument), docling-parse (PDF backend), docling-serve (FastAPI REST wrapper), docling-ibm-models (AI models), docling-sdg (synthetic data generation for retrieval-augmented generation (RAG) and fine-tuning), docling-mcp (Model Context Protocol tool definitions for document agents), and docling-java (Java API built on docling-serve). The docling-mcp repository is the most forward-looking integration point for agentic workflows.
Document ingestion handles PDF, DOCX, PPTX, XLSX, HTML, audio (WAV, MP3), and video (MP4, AVI, MOV) through a unified pipeline. The TableFormer model, trained on 1M+ tables from scientific, financial, and general datasets, handles complex table structures including partial or missing borderlines, empty cells, cell spans, and hierarchical headers. The Heron layout model, introduced in December 2025, improves PDF parsing speed while maintaining accuracy. All output flows through the DoclingDocument unified representation format, which exports to Markdown, JSON, HTML, or DocTags.
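The unified-pipeline idea above can be sketched as a small format router: every supported extension maps to a backend, and every backend converges on one document representation. The backend names and the `UnifiedDoc` type here are illustrative stand-ins, not Docling's actual classes.

```python
# Toy sketch of format-routed ingestion converging on one document type.
# Backend names are illustrative, not Docling's real class names.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class UnifiedDoc:  # stand-in for the DoclingDocument idea
    source: str
    backend: str
    blocks: list = field(default_factory=list)

BACKENDS = {
    ".pdf": "pdf_parser",
    ".docx": "office_parser", ".pptx": "office_parser", ".xlsx": "office_parser",
    ".html": "html_parser",
    ".wav": "asr_pipeline", ".mp3": "asr_pipeline",
    ".mp4": "av_pipeline", ".avi": "av_pipeline", ".mov": "av_pipeline",
}

def route(path: str) -> UnifiedDoc:
    ext = Path(path).suffix.lower()
    if ext not in BACKENDS:
        raise ValueError(f"unsupported format: {ext}")
    return UnifiedDoc(source=path, backend=BACKENDS[ext])

print(route("report.pdf").backend)   # pdf_parser
print(route("earnings.mp3").backend) # asr_pipeline
```

Whatever the backend, the downstream exporters (Markdown, JSON, HTML, DocTags) only ever see the unified representation, which is what keeps the export surface uniform across formats.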
The underlying Layout Analysis Model uses the RT-DETR architecture trained on the DocLayNet dataset for page element classification. IBM trained the vision model on 81,000 manually labeled pages, including patents, manuals, and 10-K filings, achieving accuracy within five percentage points of human performance on page element identification. Docling integrates with EasyOCR when OCR capabilities are needed, but the primary VLM path avoids traditional OCR entirely. As Peter Staar notes: "Avoiding OCR reduces errors, and it also speeds up the time-to-solution by 30 times."
Granite-Docling-258M is the production VLM anchoring the ecosystem. Released in January 2026 under Apache 2.0, it replaces the experimental SmolDocling-256M-preview (March 2025) with a Granite 3 language backbone and SigLIP2 visual encoder. The model uses IBM's proprietary DocTags markup format, capturing charts, tables, forms, code, equations, footnotes, and captions in a single pass while avoiding the error accumulation that multi-stage ensemble pipelines introduce. As Abraham Daniels of IBM explains: "Whereas conventional OCR models convert documents directly to Markdown and lose connection to the source content, Granite-Docling's unique method of faithfully translating complex structural elements makes its output ideal for downstream RAG applications." SmolDocling had a known instability involving token-repetition loops at certain page positions; Granite-Docling addresses this through dataset filtering and removal of samples with inconsistent annotations. Multilingual support for Arabic, Chinese, and Japanese is included but flagged as experimental and not yet enterprise-validated. IBM's explicit recommendation: deploy Granite-Docling within the Docling framework, not as a standalone replacement. The model string for deployment is ibm-granite/granite-docling-258M.
DocTags, the markup format introduced with Granite-Docling, separates textual content from document structure to minimize token usage and ambiguity. This design choice is consequential for RAG: every element in the output includes provenance, page numbers, and bounding box information, preserving the link between extracted content and its source location in the original document.
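The provenance design can be illustrated with a minimal chunk record that carries its page number and bounding box alongside the text. The field names here are illustrative, not DoclingDocument's actual schema.

```python
# Minimal sketch of a provenance-carrying chunk: extracted text keeps its
# page number and bounding box, so a RAG answer can cite the exact source
# region rather than just a file name. Field names are illustrative.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ProvenancedChunk:
    text: str
    page_no: int
    bbox: Tuple[float, float, float, float]  # (left, top, right, bottom)

    def citation(self) -> str:
        return f"p.{self.page_no} @ {self.bbox}"

chunk = ProvenancedChunk(
    text="Net revenue rose 4% year over year.",
    page_no=12,
    bbox=(72.0, 640.5, 540.0, 655.0),
)
print(chunk.citation())
```

Keeping the record immutable (`frozen=True`) is a deliberate choice in this sketch: provenance metadata should never drift from the text it describes once extraction is done.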
GPU acceleration via NVIDIA RTX delivers up to 6x speedup over CPU-only PDF processing, with automatic GPU detection once CUDA drivers are installed. On Linux, a vLLM server path delivers approximately 4x better VLM performance compared to llama-server on Windows. Hardware-tiered batch size recommendations: RTX 5090 (32GB VRAM) at 64-128, RTX 4090 (24GB) at 32-64, RTX 5070 (12GB) at 16-32. CUDA 12.8 and 13.0 are both supported; RTX 50-series Blackwell uses CUDA 12.8-specific optimizations. Known caveat: table processing batch sizes are fixed at 4 regardless of VRAM tier, and complex documents may require reducing --gpu-memory-utilization from 0.9 to 0.8 to avoid out-of-memory errors.
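The hardware-tiered guidance above can be captured as a lookup. The (low, high) page-batch ranges mirror the published recommendations; the sub-12GB fallback is an assumption added for completeness, and the fixed table batch size of 4 applies at every tier.

```python
# VRAM-tiered batch-size guidance as stated in the text. The sub-12GB
# fallback tier is an assumption, not a published recommendation; table
# batches are fixed at 4 regardless of VRAM.
def recommend_batch_sizes(vram_gb: int) -> dict:
    if vram_gb >= 32:      # e.g. RTX 5090
        pages = (64, 128)
    elif vram_gb >= 24:    # e.g. RTX 4090
        pages = (32, 64)
    elif vram_gb >= 12:    # e.g. RTX 5070
        pages = (16, 32)
    else:                  # assumption: conservative fallback below 12GB
        pages = (4, 8)
    return {"page_batch": pages, "table_batch": 4}

print(recommend_batch_sizes(24))
```

Starting at the low end of a tier's range and raising it while watching memory is the safer path, given the documented need to drop --gpu-memory-utilization from 0.9 to 0.8 on complex documents.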
Distributed processing via Ray Data and Docling runs each worker node with a dedicated Docling instance holding AI models in memory, avoiding repeated model-loading overhead. KubeRay handles autoscaling from 10 to 100 Kubernetes nodes with automatic worker recovery. The full pipeline flows from object storage to Docling parsing and chunking, then GPU embedding generation, then a Milvus vector database, and finally LLM retrieval for RAG. Deployment options include Red Hat OpenShift AI, the Anyscale platform, or the open-source stack directly. No processing-time benchmarks are published; the "weeks to hours" claim in Anyscale's documentation is unquantified.
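The load-once-per-worker pattern described above can be sketched in miniature with a thread pool and a thread-local model cache: each worker pays the (stand-in) model-load cost once, then reuses the resident model across documents. Ray distributes the same idea across nodes rather than threads.

```python
# Sketch of the "model resident per worker" pattern: each worker loads its
# stand-in model once and reuses it for every document it processes.
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()
load_events = []  # one entry per model load, for demonstration

def _get_model():
    if not hasattr(_local, "model"):
        load_events.append(threading.get_ident())   # "expensive" load, once
        _local.model = lambda doc: f"parsed:{doc}"  # stand-in for Docling
    return _local.model

def parse(doc: str) -> str:
    return _get_model()(doc)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(parse, [f"doc{i}.pdf" for i in range(100)]))

print(f"{len(results)} documents, {len(load_events)} model loads")
```

With 100 documents and 4 workers, the load count stays at or below 4, which is the whole point: amortizing model initialization across the batch rather than paying it per document.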
Chunking strategy is a differentiator for RAG quality. As The New Stack documents, Docling respects document structure by maintaining semantic boundaries for paragraphs, sections, tables, and images rather than splitting by character count. This preserves context that character-count chunking destroys, reducing retrieval noise in downstream LLM queries.
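The principle can be shown with a toy chunker that splits Markdown at heading and paragraph boundaries rather than at a fixed character count. This is an illustration of the idea only, not Docling's own chunkers.

```python
# Toy structure-aware chunker: start a new chunk at each heading and keep
# whole paragraphs together, instead of cutting mid-sentence at a fixed
# character offset. Illustrative only; not Docling's actual chunker.
def chunk_by_structure(markdown: str, max_chars: int = 400) -> list:
    chunks, current, size = [], [], 0
    for block in markdown.split("\n\n"):   # paragraph-level blocks
        block = block.strip()
        if not block:
            continue
        starts_section = block.startswith("#")
        if current and (starts_section or size + len(block) > max_chars):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "# Intro\n\nFirst paragraph.\n\n# Results\n\nTable discussion.\n\nMore detail."
for c in chunk_by_structure(doc):
    print(repr(c))
```

Each resulting chunk is a self-contained section, so a retrieved chunk carries its own heading context into the LLM prompt instead of an arbitrary slice of text.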
Structured extraction supports Pydantic or JSON schema definitions with type safety and validation, enabling developers to define target data shapes and have Docling populate them directly from document content. The Model Context Protocol (MCP) server support, via the docling-mcp repository, allows AI agents to interact with documents through natural language commands, positioning Docling as a tool layer within agentic workflows rather than a standalone batch processor.
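A stdlib analogue of schema-driven extraction: define the target shape, then validate and coerce parsed fields into it. Docling's beta extraction API plays this role with Pydantic models or JSON schema; the dataclass-based `populate` helper below is a toy stand-in, not Docling's API.

```python
# Stdlib analogue of schema-driven extraction: a declared target shape is
# validated and populated from parsed document fields. Docling's actual
# API uses Pydantic or JSON schema; this helper is a toy stand-in.
from dataclasses import dataclass, fields

@dataclass
class InvoiceRecord:  # the target data shape
    invoice_id: str
    total: float
    currency: str

def populate(schema, raw: dict):
    kwargs = {}
    for f in fields(schema):
        if f.name not in raw:
            raise KeyError(f"missing field: {f.name}")
        kwargs[f.name] = f.type(raw[f.name])  # coerce and type-check
    return schema(**kwargs)

parsed = {"invoice_id": "INV-0042", "total": "1299.50", "currency": "EUR"}
record = populate(InvoiceRecord, parsed)
print(record)
```

The value of the schema-first approach is that malformed extractions fail loudly at the boundary (a missing field or uncoercible value) instead of propagating bad data into downstream pipelines.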
Chart extraction is listed as a coming-soon feature in the official documentation, covering bar charts, pie charts, and line plots. A community feature request submitted in September 2025 outlines extracting chart metadata and data series in structured JSON. The existing beta extraction API supports custom data schemas through Pydantic models, providing the foundation for chart-specific handling. For PPTX files, charts contain embedded data series extractable directly; PDF charts rely on VLM visual analysis through Granite-Docling-258M.
Native AI framework integration supports LangChain, LlamaIndex, CrewAI, and Haystack. For hands-on implementation, see the Docling guide and the PDF to structured data guide.
Use cases
Enterprise AI and RAG pipelines
Organizations use Docling as the parsing layer for retrieval-augmented generation pipelines, converting PDFs, spreadsheets, and office documents into structured JSON for AI processing. IBM processed 2.1 million PDFs from Common Crawl for AI training data and planned to process 1.8 billion PDFs for integration into the IBM Granite multimodal model, demonstrating the library's throughput at scale. The InstructLab team used Docling to extract information from targeted PDFs for AI model training. Akash Srivastava, IBM researcher and principal AI product advisor at Red Hat, states: "The biggest roadblock we encounter in working with clients is preparing their proprietary data for InstructLab to use. There isn't an open-source tool of Docling's caliber out there."
Red Hat AI 3.3 includes hardened, enterprise-grade versions of Docling available through the Red Hat AI Python Index, positioning the library as part of Red Hat's "metal-to-agent" AI infrastructure stack for hybrid cloud deployments. Red Hat also plans to include Docling as a supported feature in upcoming Red Hat Enterprise Linux AI releases. The Docling OpenShift Operator, built with Red Hat, extends this to OpenShift-managed deployments at scale with banks named as the primary target segment. Cedric Clyburn, Senior Developer Advocate at Red Hat, notes: "It's got a governing organization that helps it be perfect for secure, regulated environments."
Scientific document processing
The paper-qa-docling library targets scientific AI applications. PaperQA Nemotron, which combines RAG capabilities with NVIDIA Nemotron models for scientific document processing, represents a complementary approach in this space. TableFormer's training on 1M+ scientific and financial tables gives it an advantage on documents with complex nested structures that defeat general-purpose OCR. The docling-sdg repository adds synthetic data generation for RAG fine-tuning, enabling domain-specific model adaptation without large labeled datasets.
Developer ecosystem integration
Third-party tools demonstrate Docling's role as infrastructure rather than end-user software. The docling-java repository extends reach to Java-based enterprise stacks via the docling-serve REST wrapper, without requiring Python in the deployment environment. Docling's command-line interface, Python API, and five-line code setup lower adoption friction compared to more complex document processing pipelines. Modular source connectors with built-in adapters for common file formats, plus an extensible pipeline supporting user-defined processors and custom metadata attachment, appeal to developers seeking flexibility without sacrificing structure preservation.
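As a concrete illustration of the low-friction setup, the minimal conversion flow looks like the following. The import is guarded so the sketch degrades gracefully where docling is not installed (it requires `pip install docling`), and the file path is a placeholder.

```python
# Docling's minimal conversion flow (requires `pip install docling`).
# Import is guarded so this sketch runs even where docling is absent;
# "report.pdf" is a placeholder path.
try:
    from docling.document_converter import DocumentConverter
except ImportError:
    DocumentConverter = None  # docling not installed in this environment

if DocumentConverter is not None:
    converter = DocumentConverter()
    result = converter.convert("report.pdf")
    print(result.document.export_to_markdown())
```

The same `result.document` object also exports to JSON, HTML, or DocTags, which is the uniform-export property the DoclingDocument representation exists to guarantee.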
Developers building structured extraction pipelines may also evaluate LangExtract, Google's open-source Python library for LLM-powered extraction with source grounding, as a complementary tool for text-heavy workflows where layout analysis is less critical. For teams preferring a no-code LLM platform with hallucination mitigation built in, Unstract offers a production-grade alternative on the open-source side. Teams evaluating open-source document layout analysis as a standalone component may also consider Deepdoctection, a Python library that orchestrates layout analysis, OCR, and classification using deep learning models with a PyTorch-only architecture since v1.0.
Ming Zhao, IBM Software specialist, frames the broader context: "The real challenge in RAG or agentic AI isn't building the agent but curating the knowledge and the context behind it." Docling's positioning as the curation layer, rather than the agent itself, is the consistent thread across all three use case categories.
Technical specifications
| Feature | Specification |
|---|---|
| Programming Language | Python 3.9-3.14 |
| License | MIT License (library); Apache 2.0 (Granite-Docling-258M model) |
| Container Images | 4.4GB (CPU) to 11.4GB (CUDA) |
| Hardware Support | x86_64, arm64, Apple Silicon MLX, NVIDIA RTX (CUDA 12.8, 13.0), AMD ROCm |
| GPU Speedup | Up to 6x over CPU-only PDF processing (NVIDIA RTX) |
| VLM Speed Advantage | 30x faster than traditional OCR (IBM Research, November 2024) |
| API Status | Stable v1 API (January 2026) |
| Deployment Options | pip, containers, distributed (Kubeflow, Ray), OpenShift Operator |
| Processing Modes | Local, cloud, hybrid, air-gapped |
| VLM Model String | ibm-granite/granite-docling-258M |
| Multilingual Support | Arabic, Chinese, Japanese (experimental) |
| Enterprise Integration | Red Hat AI 3.3, OpenShift AI, Anyscale/KubeRay |
| Repositories | 8 (docling, docling-core, docling-parse, docling-serve, docling-ibm-models, docling-sdg, docling-mcp, docling-java) |
| GitHub Stars | 37,000+ (February 2026) |
| Training Data | Layout model trained on 81,000 manually labeled pages; TableFormer on 1M+ tables |
Resources
- Main Repository
- Docling Documentation
- Granite-Docling-258M on Hugging Face
- RTX GPU Getting-Started Guide
- LF AI & Data Foundation
- Red Hat AI Integration
- Granite-Docling Announcement
- IBM Think: Docling's Rise
- Docling Guide
- PDF to Structured Data Guide
- Document Processing for RAG Guide
- Competitive Analysis: Docling vs Enterprise IDP
Company information
- GitHub: docling-project
- License: MIT License (library); Apache 2.0 (Granite-Docling-258M)
- Developed by: IBM Research Zurich, AI for Knowledge Team
- Hosted by: LF AI & Data Foundation (Agentic AI Foundation, AAIF), formed December 2025
- Technical Lead: Peter Staar, Principal Research Staff Member, IBM Research Zurich; Chair of Technical Steering, Docling at the Linux Foundation
- IBM product integration: Watson Document Understanding, watsonx.ai