Docling: IBM Open-Source Document Processing
Open-source document processing library for AI applications, developed by IBM Research Zurich and hosted by the LF AI & Data Foundation.

Overview
Docling began as IBM Research's answer to a specific problem: extracting clean, structured data from unstructured documents for AI pipelines. Peter Staar, Principal Research Staff Member at IBM Research Zurich and chair of Docling's technical steering committee at the Linux Foundation, frames the challenge plainly: "How hard can it be? Well, it can be very hard." Since its first public release in August 2025, the library has logged over 100 releases and crossed 37,000 GitHub stars - a community health signal Red Hat cited in calling it "the number one open source repository for document intelligence."
The strategic arc is deliberate. What launched as a document converter is now positioning itself as parsing and orchestration infrastructure for enterprise AI pipelines. Three moves in early 2026 confirm the direction: IBM released Granite-Docling-258M, a production-grade vision-language model under Apache 2.0; launched the Docling OpenShift Operator with Red Hat, naming banks as the initial deployment segment; and donated the project to the Linux Foundation's Agentic AI Foundation (AAIF) alongside BeeAI and Data Prep Kit. Staar's two stated forward priorities - schema-driven structured extraction and agentic document generation - underscore that this is deliberate repositioning, not feature drift: "It's not just conversion anymore. We're thinking through it. We're generating and manipulating documents."
The Register classifies Docling as a framework and tool alongside Goose and BeeAI, not as a protocol alongside MCP and A2A - an editorial distinction that places its competitive frame against other document-processing and agentic orchestration frameworks rather than the protocol layer.
Docling operates under the MIT License, granting unrestricted commercial and personal use rights with no copyleft restrictions. All repositories under the docling-project GitHub organization maintain MIT licensing, enabling integration into proprietary systems without licensing conflicts. The companion Granite-Docling-258M model is separately licensed under Apache 2.0.
How Docling Processes Documents
Docling's architecture spans eight public repositories: docling (main package), docling-core (types, transforms, serializers; home of DoclingDocument), docling-parse (PDF backend), docling-serve (FastAPI REST wrapper), docling-ibm-models (AI models), docling-sdg (synthetic data generation for RAG and fine-tuning), docling-mcp (Model Context Protocol tool definitions for document agents), and docling-java (Java API built on docling-serve). The docling-mcp repository is the most forward-looking integration point for agentic workflows.
Document ingestion handles PDF, DOCX, PPTX, XLSX, HTML, audio (WAV, MP3), and video (MP4, AVI, MOV) through a unified pipeline. The TableFormer model, trained on 1M+ tables from scientific, financial, and general datasets, handles complex table structures. The Heron layout model, introduced in December 2025, improves PDF parsing speed while maintaining accuracy. All output flows through the DoclingDocument unified representation format, which exports to Markdown, JSON, HTML, or DocTags.
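As a minimal sketch of that pipeline (assuming `docling` is installed; the file path is a placeholder), conversion and export through the `DocumentConverter` API looks like this:

```python
def convert_document(path: str) -> str:
    """Parse a source file with Docling and export Markdown.

    Assumes `pip install docling`; the import is deferred so this
    sketch can be read without the package installed.
    """
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()     # input format auto-detected (PDF, DOCX, ...)
    result = converter.convert(path)    # runs the layout and TableFormer models
    doc = result.document               # the unified DoclingDocument
    return doc.export_to_markdown()     # export_to_dict() / export_to_html() also exist


# Hypothetical usage:
# markdown = convert_document("reports/annual_report.pdf")
```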
Granite-Docling-258M is the production VLM anchoring the ecosystem. Released in early 2026 under Apache 2.0, it replaces the experimental SmolDocling-256M-preview (March 2025) with a Granite 3 language backbone and SigLIP2 visual encoder. The model uses IBM's proprietary DocTags markup format, capturing charts, tables, forms, code, equations, footnotes, and captions in a single pass - avoiding the error accumulation that multi-stage ensemble pipelines introduce. SmolDocling had a known instability (token-repetition loops at certain page positions); Granite-Docling addresses this through dataset filtering and removal of samples with inconsistent annotations. Multilingual support for Arabic, Chinese, and Japanese is included but flagged as experimental and not yet enterprise-validated. IBM's explicit recommendation: deploy Granite-Docling within the Docling framework, not as a standalone replacement. Quantitative benchmarks are on the Hugging Face model card. The model string for deployment is ibm-granite/granite-docling-258M.
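To run the VLM path explicitly rather than the default ensemble pipeline, docling exposes a `VlmPipeline` that can be wired into the converter. A sketch, assuming a recent docling release in which the default VLM options select Granite-Docling (verify against your installed version):

```python
def make_vlm_converter():
    """Build a DocumentConverter that parses PDFs with the VLM pipeline.

    Assumes `pip install docling`; imports are deferred so the sketch
    reads without the package installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import VlmPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.pipeline.vlm_pipeline import VlmPipeline

    # Default options select the bundled VLM (Granite-Docling in recent
    # releases); pages are emitted as DocTags, then lifted into a
    # DoclingDocument in a single pass.
    opts = VlmPipelineOptions()
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=VlmPipeline,
                pipeline_options=opts,
            )
        }
    )
```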
GPU acceleration via NVIDIA RTX delivers up to 6x speedup over CPU-only PDF processing, with automatic GPU detection once CUDA drivers are installed. On Linux, a vLLM server path delivers approximately 4x better VLM performance compared to llama-server on Windows. Hardware-tiered batch size recommendations: RTX 5090 (32GB VRAM) at 64-128, RTX 4090 (24GB) at 32-64, RTX 5070 (12GB) at 16-32. CUDA 12.8 and 13.0 are both supported; RTX 50-series Blackwell uses CUDA 12.8-specific optimizations. Known caveat: table processing batch sizes are fixed at 4 regardless of VRAM tier, and complex documents may require reducing --gpu-memory-utilization from 0.9 to 0.8 to avoid out-of-memory errors. See the official RTX getting-started guide.
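Encoded as a helper, the tiering above reads as follows (a hypothetical convenience function, not a docling API; the fallback tier for smaller GPUs is an assumption):

```python
def vlm_batch_range(vram_gb: float) -> tuple[int, int]:
    """Return a (low, high) VLM batch-size range for a given VRAM budget."""
    if vram_gb >= 32:        # e.g. RTX 5090
        return (64, 128)
    if vram_gb >= 24:        # e.g. RTX 4090
        return (32, 64)
    if vram_gb >= 12:        # e.g. RTX 5070
        return (16, 32)
    return (1, 8)            # conservative fallback for smaller GPUs (assumption)


TABLE_BATCH_SIZE = 4         # fixed regardless of VRAM tier (per the caveat above)

print(vlm_batch_range(24))   # → (32, 64)
```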
Distributed processing via Ray Data + Docling runs each worker node with a dedicated Docling instance holding AI models in memory, avoiding repeated model-loading overhead. KubeRay handles autoscaling from 10 to 100 Kubernetes nodes with automatic worker recovery. The full pipeline runs: object storage → Docling parsing and chunking → GPU embedding generation → Milvus vector database → LLM retrieval for RAG. Deployment options include Red Hat OpenShift AI, the Anyscale platform, or the open-source stack directly. No processing-time benchmarks are published; the "weeks to hours" claim in Anyscale's documentation is unquantified.
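The per-worker model residency maps onto Ray Data's callable-class pattern, where `__init__` runs once per worker process and `__call__` once per batch. A framework-free sketch of the idea (class and names hypothetical; with Ray you would pass the class to `map_batches`):

```python
class DoclingWorker:
    """Callable-class pattern used with Ray Data's map_batches: the
    expensive model load happens once per worker, not once per batch."""

    load_count = 0  # class-level counter, just to demonstrate single loading

    def __init__(self):
        DoclingWorker.load_count += 1
        # A real deployment would build a docling DocumentConverter here,
        # pulling the layout/TableFormer weights into memory once.
        self.model = object()

    def __call__(self, batch: list) -> list:
        # Parse and chunk each document in the batch with the resident model.
        return [f"parsed:{name}" for name in batch]


worker = DoclingWorker()           # simulates one Ray worker process
out1 = worker(["a.pdf", "b.pdf"])  # first batch loads nothing new
out2 = worker(["c.pdf"])           # second batch reuses the loaded model
print(DoclingWorker.load_count)    # → 1
```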
Chart extraction is listed as a coming-soon feature in the official documentation, covering bar charts, pie charts, and line plots. A community feature request submitted in September 2025 outlines extracting chart metadata and data series in structured JSON. The existing beta extraction API supports custom data schemas through Pydantic models, providing the foundation for chart-specific handling. For PPTX files, charts contain embedded data series extractable directly; PDF charts rely on VLM visual analysis through Granite-Docling-258M.
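A target schema for the proposed chart output might look like the sketch below. Stdlib dataclasses are used so the example stands alone; the actual beta extraction API expects `pydantic.BaseModel` subclasses of the same shape, and all field names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSeries:
    """One named series of numeric values extracted from a chart."""
    label: str
    values: list = field(default_factory=list)


@dataclass
class ChartData:
    """Structured JSON target for a single chart."""
    chart_type: str            # e.g. "bar" | "pie" | "line"
    title: str
    series: list = field(default_factory=list)


chart = ChartData(
    chart_type="bar",
    title="Quarterly revenue",
    series=[DataSeries(label="2025", values=[1.2, 1.4, 1.1, 1.6])],
)
print(chart.series[0].values[3])   # → 1.6
```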
Native AI framework integration supports LangChain, LlamaIndex, Crew AI, and Haystack. For hands-on implementation, see the Docling guide and PDF to structured data guide.
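On the LangChain side, for example, the integration ships as a separate package. A sketch (assuming `pip install langchain-docling`; the path is a placeholder):

```python
def load_with_langchain(path: str):
    """Load a document into LangChain via the docling integration.

    Assumes `pip install langchain-docling`; the import is deferred so
    the sketch reads without the package installed.
    """
    from langchain_docling import DoclingLoader

    loader = DoclingLoader(file_path=path)
    return loader.load()   # list of LangChain Document objects


# Hypothetical usage:
# docs = load_with_langchain("contracts/master_agreement.pdf")
```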
Use Cases
Enterprise AI and RAG Pipelines
Organizations use Docling as the parsing layer for retrieval-augmented generation pipelines, converting PDFs, spreadsheets, and office documents into structured JSON for AI processing. Red Hat's RamaLama project integrates Docling for containerized local inference, keeping enterprise data on-premises. The Docling OpenShift Operator, built with Red Hat, extends this to OpenShift-managed deployments at scale - with banks named as the primary target segment.
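For the chunking step of such a pipeline, docling ships a structure-aware `HybridChunker`. A sketch (assuming `pip install docling`; the path is a placeholder):

```python
def chunk_for_rag(path: str) -> list:
    """Convert a document and split it into RAG-ready text chunks.

    Assumes `pip install docling`; imports are deferred so the sketch
    reads without the package installed.
    """
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(path).document
    chunker = HybridChunker()   # splits along document structure, not raw characters
    return [chunk.text for chunk in chunker.chunk(dl_doc=doc)]
```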
Scientific Document Processing
The paper-qa-docling library targets scientific AI applications. PaperQA Nemotron, which combines RAG capabilities with NVIDIA Nemotron models for scientific document processing, represents a complementary approach in this space. TableFormer's training on 1M+ scientific and financial tables gives it an advantage on documents with complex nested structures that defeat general-purpose OCR. The docling-sdg repository adds synthetic data generation for RAG fine-tuning, enabling domain-specific model adaptation without large labeled datasets.
Developer Ecosystem Integration
Third-party tools demonstrate Docling's role as infrastructure rather than end-user software. The RAG PDF Audit tool benchmarks Docling against standard libraries, and the Knowledge Base Self-Hosting Kit positions it as core infrastructure for self-hosted document intelligence. The docling-java repository extends reach to Java-based enterprise stacks via the docling-serve REST wrapper, without requiring Python in the deployment environment. Developers building structured extraction pipelines may also evaluate LangExtract, Google's open-source Python library for LLM-powered extraction with source grounding, as a complementary tool for text-heavy workflows where layout analysis is less critical. For teams preferring a no-code LLM platform with hallucination mitigation built in, Unstract offers an open-source, production-grade alternative. Teams evaluating open-source document layout analysis as a standalone component may also consider Deepdoctection, a Python library that orchestrates layout analysis, OCR, and classification using deep learning models, with a PyTorch-only architecture since v1.0.
Technical Specifications
| Feature | Specification |
|---|---|
| Programming Language | Python 3.9-3.14 |
| License | MIT License (library); Apache 2.0 (Granite-Docling-258M model) |
| Container Images | 4.4GB (CPU) to 11.4GB (CUDA) |
| Hardware Support | x86_64, arm64, Apple Silicon MLX, NVIDIA RTX (CUDA 12.8, 13.0), AMD ROCm |
| GPU Speedup | Up to 6x over CPU-only PDF processing (NVIDIA RTX) |
| API Status | Stable v1 API (January 2026) |
| Deployment Options | pip, containers, distributed (Kubeflow, Ray), OpenShift Operator |
| Processing Modes | Local, cloud, hybrid, air-gapped |
| VLM Model String | ibm-granite/granite-docling-258M |
| Multilingual Support | Arabic, Chinese, Japanese (experimental) |
| Enterprise Integration | Red Hat AI, OpenShift AI, Anyscale/KubeRay |
| Repositories | 8 (docling, docling-core, docling-parse, docling-serve, docling-ibm-models, docling-sdg, docling-mcp, docling-java) |
| GitHub Stars | 37,000+ (February 2026) |
Resources
- GitHub Organization
- Main Repository
- Documentation
- Granite-Docling-258M on Hugging Face
- RTX GPU Getting-Started Guide
- LF AI & Data Foundation
- Red Hat AI Integration
- Granite-Docling Announcement
- IBM Think: Docling's Rise
- Docling Guide
- PDF to Structured Data Guide
- Document Processing for RAG Guide
- Competitive Analysis: Docling vs Enterprise IDP
Company Information
- GitHub: docling-project
- License: MIT License (library); Apache 2.0 (Granite-Docling-258M)
- Developed by: IBM Research Zurich, AI for Knowledge Team
- Hosted by: LF AI & Data Foundation - Agentic AI Foundation (AAIF), formed December 2025
- Technical Lead: Peter Staar, Principal Research Staff Member, IBM Research Zurich; Chair of Technical Steering, Docling at the Linux Foundation