
Deepdoctection is an open-source Python library that orchestrates document layout analysis, OCR, and document classification using deep learning models. Created by Dr. Janis Meyer and released under Apache License 2.0, the platform enables developers to build custom document extraction pipelines with full traceability, mapping extracted text segments back to their original visual locations in source documents.

Overview

Deepdoctection functions as an orchestration layer for document AI: it wraps Detectron2, DocTr, Tesseract, AWS Textract, and Hugging Face Transformers into a single configurable pipeline rather than competing with those engines directly. Its primary audience is teams that want to swap underlying models without rewriting pipeline logic, which gives Deepdoctection a role analogous to the one LangChain plays for LLMs, but scoped to document AI.
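The "swap models without rewriting pipeline logic" idea can be illustrated with a small, library-independent sketch. Everything here is hypothetical: `OcrEngine`, `FakeTesseract`, `FakeDoctr`, and `Pipeline` are illustrative names, not Deepdoctection's actual classes, and the fake engines only stand in for real OCR calls.

```python
from dataclasses import dataclass
from typing import Protocol


class OcrEngine(Protocol):
    """Hypothetical engine interface (an assumption, not Deepdoctection's API)."""

    def recognize(self, image: bytes) -> str: ...


class FakeTesseract:
    """Stand-in for a Tesseract wrapper; returns a dummy string."""

    def recognize(self, image: bytes) -> str:
        return f"tesseract:{len(image)} bytes"


class FakeDoctr:
    """Stand-in for a DocTr wrapper; returns a dummy string."""

    def recognize(self, image: bytes) -> str:
        return f"doctr:{len(image)} bytes"


@dataclass
class Pipeline:
    ocr: OcrEngine

    def run(self, image: bytes) -> str:
        # The pipeline depends only on the interface, so engines are interchangeable.
        return self.ocr.recognize(image).upper()


if __name__ == "__main__":
    page = b"\x00" * 4
    for engine in (FakeTesseract(), FakeDoctr()):
        print(Pipeline(ocr=engine).run(page))
```

The point of the design is that `Pipeline.run` never names a concrete engine, so replacing one wrapper with another is a one-line change at construction time.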

The v1.0 release marked the project's clearest architectural commitment: TensorFlow support was dropped entirely in favor of PyTorch, and the codebase was decomposed into three sub-packages - dd-core, dd-datasets, and the main deepdoctection package. Version 1.1.0 is now the current release on PyPI, shipping as both a source distribution (148.5 kB) and a built wheel (190.0 kB). No changelog for v1.1.0 has been published, so what changed between v1.0 and v1.1.0 cannot be assessed from release notes.

The PyTorch-only commitment deepened further when the GitHub README began requiring PyTorch >= 2.6, released in late January 2025, as the minimum supported version. This narrows compatibility for teams running TensorFlow-based document pipelines, who cannot adopt Deepdoctection without a framework migration. For teams already on PyTorch, it aligns the project with the dominant research and production framework and enables direct integration with the Hugging Face ecosystem, including the LayoutLM, LiLT, and BERT model families.

Third-party adoption is visible: the RAGChain documentation selects Deepdoctection alongside Meta's Nougat as a supported OCR option for enterprise RAG implementations. No benchmark comparisons, analyst coverage, commercial partnerships, or funding events have been identified. The project lists Dr. Janis Meyer as sole author on PyPI; no organizational affiliation or commercial support tier is stated.

How Deepdoctection Processes Documents

Deepdoctection structures its pipeline around six modules. The analyzer module handles pipeline configuration and factory functions. The configs module manages YAML-based configuration for pipelines and model profiles. The extern module provides wrappers for all external engines - Detectron2, DocTr, Hugging Face Transformers, Tesseract, and pdfplumber - making it the integration surface where model substitution happens. The pipe module assembles pipeline components and services. The eval module provides evaluation metrics, requiring apted==1.0.3, distance==0.1.3, and pycocotools>=2.0.2. The train module supports training utilities for Detectron2 and selected Transformer models.

Within a pipeline, layout analysis and table recognition run through Detectron2 or PyTorch Transformers. OCR is handled by Tesseract, DocTr, or AWS Textract - selectable per deployment. Document and token classification use the LayoutLM family, LiLT-style models, or BERT with sliding window support for long documents. Native PDFs bypass OCR entirely through pdfplumber text mining. Language detection uses the papluca/xlm-roberta-base-language-detection model. Image preprocessing - deskewing and rotation - is handled by jdeskew or Tesseract.
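The sliding-window idea for long documents can be sketched in plain Python. This is a generic illustration of the technique, not Deepdoctection's implementation: the function name `sliding_windows` and its parameters are assumptions, and `stride` here means the number of tokens shared between consecutive windows.

```python
from typing import List


def sliding_windows(token_ids: List[int], max_len: int, stride: int) -> List[List[int]]:
    """Split token_ids into overlapping windows of at most max_len tokens.

    stride is the overlap (in tokens) between consecutive windows.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")

    step = max_len - stride  # how far each new window advances
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


if __name__ == "__main__":
    # Ten dummy token ids, windows of four tokens with an overlap of two.
    for window in sliding_windows(list(range(10)), max_len=4, stride=2):
        print(window)
```

Each window can then be classified separately and the per-window predictions merged; how that merging is done is model- and task-specific and is left out of this sketch.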

Full traceability is a design constraint throughout: extracted text segments are mapped back to their original visual locations, which supports both debugging and downstream applications that require spatial grounding, such as document layout analysis and document segmentation.
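A minimal sketch of what spatial grounding means in practice: each extracted text segment carries its page index and bounding box, so a downstream hit can be traced back to a region on the page. The `TextSegment` type, the `locate` helper, and the sample values are hypothetical and do not reflect Deepdoctection's own data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TextSegment:
    """Hypothetical record tying text to its location on a page."""

    text: str
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def locate(segments: List[TextSegment], query: str) -> Optional[TextSegment]:
    """Return the first segment whose text contains query (case-insensitive)."""
    needle = query.lower()
    for segment in segments:
        if needle in segment.text.lower():
            return segment
    return None


# Made-up sample data for illustration only.
segments = [
    TextSegment("Invoice No. 4711", page=0, bbox=(72.0, 90.0, 240.0, 108.0)),
    TextSegment("Total: 129.00 EUR", page=0, bbox=(72.0, 640.0, 230.0, 658.0)),
]

hit = locate(segments, "total")
if hit is not None:
    print(f"found {hit.text!r} on page {hit.page} at {hit.bbox}")
```

Keeping the box next to the text is what lets a reviewer or a RAG front end highlight the exact source region behind an extracted value.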

The base install requires hardware-specific PyTorch to be configured manually, plus transformers>=4.48.0, timm>=0.9.16, python-doctr>=1.0.0, pdfplumber>=0.11.0, jdeskew>=0.2.2, and boto3==1.34.102. Detectron2 is not available on PyPI and must be installed from Deepdoctection's own fork. A full install including training and evaluation dependencies is available via uv pip install deepdoctection[full].
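Because several of these requirements are minimum versions, a quick environment check can help before installing. The sketch below uses only the standard library; the helper names (`parse`, `meets_minimum`, `check`) are illustrative, the version parser is deliberately simplified (it ignores pre-release markers and local suffixes), and the `torch` entry assumes the PyTorch >= 2.6 floor mentioned earlier in this document. The pinned boto3 version is left out because it is an exact pin rather than a minimum.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions as quoted in this document; adjust if the project changes them.
MINIMUMS: Dict[str, str] = {
    "torch": "2.6",
    "transformers": "4.48.0",
    "timm": "0.9.16",
    "python-doctr": "1.0.0",
    "pdfplumber": "0.11.0",
    "jdeskew": "0.2.2",
}


def parse(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release part, e.g. '2.6.0+cu124' -> (2, 6, 0)."""
    parts = []
    for chunk in version.split("+")[0].split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = parse(installed), parse(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(minimums: Dict[str, str]) -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None when it is not installed."""
    report: Dict[str, Optional[bool]] = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check(MINIMUMS).items():
        status = "missing" if ok is None else ("ok" if ok else "too old")
        print(f"{name}: {status}")
```

For strict PEP 440 comparisons, the third-party `packaging` library is the more robust choice; the hand-rolled parser here only keeps the sketch dependency-free.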

Use Cases

Enterprise RAG Workflows

Document ingestion for LLM-based processing pipelines is the most visible external adoption signal. The RAGChain documentation selects Deepdoctection alongside Meta's Nougat as a supported OCR option for retrieval-augmented generation systems. The transparent, swappable pipeline architecture makes it practical for teams that need to audit or replace individual components without rebuilding the full ingestion stack. For broader context on RAG-oriented data extraction approaches, see the AI Data Extraction guide.

Academic Research Applications

Scientific article processing, historical document archives, and reproducible research workflows benefit from the modular architecture and full traceability. Researchers can substitute detection, OCR, and classification models independently, and the eval module provides standardized metrics for comparing configurations. The dd-datasets sub-package supports dataset management for training and evaluation workflows.

Hybrid Deployment Scenarios

Organizations that need specific document understanding capabilities without committing to a full commercial pipeline can adopt individual sub-packages. The three-package decomposition - dd-core, dd-datasets, deepdoctection - allows selective installation. Docker Hub (deepdoctection/deepdoctection) and Conda/Mamba (environment.yml) provide deployment alternatives to PyPI for teams with constrained environments.

Technical Specifications

| Component | Detail |
| --- | --- |
| Current Version | v1.1.0 |
| Deep Learning Framework | PyTorch >= 2.6 (TensorFlow removed in v1.0) |
| Object Detection | Detectron2 (installed from project fork, not PyPI) |
| NLP Models | Hugging Face Transformers >= 4.48.0; LayoutLM family, LiLT, BERT with sliding window, RoBERTa |
| OCR Engines | Tesseract, DocTr (python-doctr >= 1.0.0), AWS Textract (boto3 == 1.34.102) |
| PDF Processing | pdfplumber >= 0.11.0 for native PDF text mining |
| Image Preprocessing | jdeskew >= 0.2.2 or Tesseract for deskewing and rotation |
| Language Detection | papluca/xlm-roberta-base-language-detection |
| Architecture | Three sub-packages: dd-core, dd-datasets, deepdoctection |
| Six Pipeline Modules | analyzer, configs, extern, pipe, eval, train |
| Distribution | PyPI (source 148.5 kB, wheel 190.0 kB), Conda/Mamba, Docker Hub |
| Licensing | Apache License 2.0 |
| Author | Dr. Janis Meyer (sole listed PyPI author; no organizational affiliation stated) |

Resources

For implementation context, see the Document Processing with Python guide, the Open-Source OCR Tools comparison, and the Document Layout Analysis guide. Teams evaluating Deepdoctection against other open-source orchestration options may also want to review Docling (IBM Research, MIT license) and Unstructured (25+ file types, Apache License 2.0 for the open-source library).

Company Information

Deepdoctection is an open-source project authored by Dr. Janis Meyer. No company, organizational affiliation, commercial support tier, or enterprise roadmap has been identified in available sources. The project is distributed under Apache License 2.0 with no stated funding history.