DeepSeek-OCR: Open-Source Visual OCR Model
DeepSeek-OCR 2, released January 28, 2026, scores 91.09% on OmniDocBench v1.5, a 3.73-point improvement over its predecessor, while processing complex pages in 256-1,120 visual tokens compared to the 1,500-6,000 tokens typical of competing models. The headline architectural change is replacing OpenAI's CLIP encoder with Alibaba's Qwen2-0.5B, completing a shift to a fully China-domestic model stack. The efficiency gains are real: 200,000 pages per day on a single A100 GPU, versus typical large language models (LLMs) processing several thousand pages on 20 or more GPUs. The accuracy ceiling, however, is not the highest in its class: GLM-OCR reaches 94.62% and PaddleOCR-VL-1.5 reaches 94.50% on the same benchmark, though both require substantially more visual tokens. Researchers from Tohoku University and the Chinese Academy of Sciences found that accuracy collapses from approximately 90% to approximately 20% when linguistic support is removed, a limitation that affects its suitability for high-stakes production pipelines where adversarial or degraded inputs are routine.
Overview
DeepSeek AI released the original DeepSeek-OCR in October 2025 as a 3B parameter model combining windowed SAM and CLIP encoders with four processing modes (Tiny through Large). The architectural break came with DeepSeek-OCR 2 in January 2026: the CLIP ViT encoder was replaced by DeepEncoder-V2, initialized from Alibaba's Qwen2-0.5B, and a Visual Causal Flow attention mechanism was introduced that dynamically rearranges visual tokens by semantic context rather than fixed raster order. The result is measurable progress on complex layouts, with reading order edit distance improving from 0.085 to 0.057, alongside a narrow lead over Gemini-3 Pro on element-level edit distance (0.100 vs. 0.115).
The CLIP-to-Qwen2 substitution is a supply-chain decision as much as a technical one. For enterprise buyers with data sovereignty requirements or geopolitical procurement constraints, the shift to a fully China-domestic stack (DeepSeek decoder plus Alibaba encoder) is a material factor independent of benchmark performance. SCMP, which reported this angle most prominently, is owned by Alibaba Group, the developer of Qwen2-0.5B; that conflict of interest is worth weighing when assessing how the substitution is framed.
The model is available under MIT license with weights, code, and research paper published at deepseek-ai/DeepSeek-OCR-2. Cloud deployments followed quickly: Google Cloud Vertex AI as deepseek-ocr-maas (general availability, us-central1) and AWS SageMaker JumpStart with one-click provisioning. The open-source release also spawned Universal DeepSeek OCR 2 for CPU and Apple Metal GPU inference and SinapsisAI's commercial wrapper with Docker deployment and Gradio interface. The model attracted 4,000 GitHub stars within 24 hours of release and remained atop Hugging Face's most popular model list for over a week.
No enterprise adoption figures have been disclosed. No pricing beyond the MIT open-source release applies to the base model.
How DeepSeek-OCR processes documents
DeepSeek-OCR 2's core innovation is an asymmetric attention pattern inside DeepEncoder-V2: visual tokens use bidirectional attention, while appended "causal flow tokens" use causal attention. This produces a 1D sequence already aligned with a learned reading order before the decoder processes it, directly addressing complex layouts like tables, multi-column text, and mixed-format pages where reading order is non-obvious. The decoder is DeepSeek-3B-MoE with 6/64 expert routing, activating approximately 570M of 3B total parameters during inference.
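The asymmetric pattern can be sketched as an attention mask. This is a minimal illustration, assuming the causal flow tokens are appended after the visual tokens and that visual tokens do not attend back to flow tokens (a prefix-LM-style layout); the paper's exact masking may differ.

```python
import numpy as np

def asymmetric_attention_mask(n_visual: int, n_flow: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a sequence of
    n_visual visual tokens followed by n_flow causal flow tokens.

    Visual tokens attend bidirectionally among themselves; each flow
    token attends to every visual token and, causally, to flow tokens
    at or before its own position.
    """
    n = n_visual + n_flow
    mask = np.zeros((n, n), dtype=bool)
    # Visual block: full bidirectional attention among visual tokens.
    mask[:n_visual, :n_visual] = True
    # Flow tokens: attend to all visual tokens...
    mask[n_visual:, :n_visual] = True
    # ...and causally to themselves and earlier flow tokens.
    mask[n_visual:, n_visual:] = np.tril(np.ones((n_flow, n_flow), dtype=bool))
    return mask

m = asymmetric_attention_mask(4, 3)
```

The causal half is what lets the encoder emit a 1D sequence in a learned reading order: each flow token summarizes "what comes next" conditioned only on the full image and the order decided so far.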
The vision tokenizer uses an 80M parameter SAM-style backbone with two convolutional layers, downsampling by a factor of 16 into 896-dimensional embeddings. A global 1024x1024 view yields 256 tokens; up to six local 768x768 crops add 144 tokens each, for a maximum of 1,120 tokens per page. At compression ratios below 10x, the model achieves approximately 97% OCR precision; at 20x compression, accuracy drops to around 60%. This compression ceiling is the key constraint for teams evaluating the model against document types with extreme visual density.
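The token budget arithmetic can be checked directly. A small sketch using only the figures reported here (256 tokens for the global view, 144 per local crop, at most six crops); the helper names are illustrative, not DeepSeek API:

```python
def visual_token_budget(n_crops: int) -> int:
    """Visual tokens per page: one 1024x1024 global view (256 tokens)
    plus n_crops local 768x768 crops at 144 tokens each (0 <= n_crops <= 6)."""
    if not 0 <= n_crops <= 6:
        raise ValueError("n_crops must be between 0 and 6")
    return 256 + 144 * n_crops

def compression_ratio(text_tokens: int, n_crops: int) -> float:
    """Ratio of ground-truth text tokens to visual tokens; the reported
    figures are ~97% precision below 10x and ~60% around 20x."""
    return text_tokens / visual_token_budget(n_crops)
```

A dense page whose transcript would be 2,560 text tokens, rendered with no local crops, sits exactly at the 10x boundary where accuracy starts to degrade.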
Training ran across three stages on OCR-heavy data (80% OCR content; text:formula:table ratio 3:1:1), covering 30 million PDF pages across 100 languages including 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures. The three stages covered encoder pretraining on approximately 160 A100 GPUs for 40,000 iterations; encoder-decoder coupling for 15,000 iterations with 4-stage pipeline parallelism and 40 data-parallel replicas; and decoder fine-tuning for 20,000 iterations with the encoder frozen. Freezing the encoder at Stage 3 more than doubled training throughput, a detail relevant to teams considering domain-specific fine-tuning of the open-source release.
The academic challenge from Tohoku University and the Chinese Academy of Sciences (arXiv: 2601.03714) raises a structural question about this architecture: accuracy drops from approximately 90% to approximately 20% when linguistic support is removed, lower visual token counts correlate with higher hallucination risk, and total model collapse occurs at approximately 10,000 text tokens. Whether these failure modes reflect language-prior shortcuts baked into the architecture, or can be corrected through fine-tuning, is an open question the sources do not resolve. As Li Boji, a PhD and AI startup founder, put it: "For manuscripts that are difficult to recognize, relying on acquired knowledge may help AI understand the text, but it could be a drawback for clearly printed materials."
Dense newspaper layouts remain a known weakness, with text edit distance above 0.13. DeepSeek attributes this to limited training data and extreme text density compression rather than architectural limitations. Repetition rates improved across both user log images (6.25% to 4.17%) and PDF production documents (3.69% to 2.88%), representing a practical quality signal for production deployments where repeated output tokens inflate cost and degrade downstream parsing.
Use cases
Enterprise document digitization
The model processes complex pages in 256-1,120 visual tokens, enabling high-throughput digitization pipelines with reduced downstream LLM compute costs. Community benchmarks report 200,000 pages daily on a single A100 GPU, with potential for 33 million pages per day on a 20-node cluster of 160 A100 GPUs. The MIT license enables on-premises deployment, which is the primary reason enterprise buyers with data sovereignty requirements or concerns about hosted Chinese AI services adopt the model. As Kaoutar El Maghraoui, Principal Research Scientist at IBM, noted: "Using vision token compression to convert document images into compact tokens helps slash context length and cost significantly."
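The cluster-scale figures follow from simple linear scaling of the single-GPU number. A sketch of that estimate (real deployments lose some capacity to I/O and scheduling overhead, so treat it as an upper bound):

```python
def daily_capacity(gpus: int, pages_per_gpu_per_day: int = 200_000) -> int:
    """Linear-scaling estimate of daily page throughput across a cluster,
    using the community-reported 200,000 pages/day per A100 as the default."""
    return gpus * pages_per_gpu_per_day
```

160 A100s give 32 million pages per day under this estimate, in line with the roughly 33 million figure cited for a 20-node cluster.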
Complex layout processing
Reading order edit distance improved from 0.085 to 0.057 compared to the original DeepSeek-OCR, the metric most directly enabled by the semantic architecture. This translates to measurably better handling of multi-column documents, academic papers, forms, and presentations. The model covers 9 document categories on OmniDocBench v1.5 including books, academic papers, forms, presentations, and newspapers in Chinese and English. Newspaper layouts remain the weakest category, above 0.13 text edit distance. PaddleOCR-VL-1.5 may outperform DeepSeek on reading order in irregular layouts, making it the stronger choice for document types where layout irregularity is the primary challenge.
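The edit-distance metrics cited here are normalized Levenshtein distances. A reference implementation for spot-checking model output against ground truth; OmniDocBench's exact normalization and tokenization may differ, so this is a sketch of the metric family, not the benchmark's scoring code:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between prediction and reference, divided by
    the longer length: 0.0 is a perfect match, 1.0 a total mismatch."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, n)
```

On this scale, the model's 0.057 reading order score means roughly one misplaced element per eighteen, and the 0.13+ newspaper figure more than one error per eight characters of ordered text.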
Multilingual document processing
After fine-tuning, the model shows an 86% reduction in Character Error Rate on Persian, with reductions of 57-86% across several other languages. Training on 30 million PDF pages across 100 languages positions the model for global deployment, though the synthetic data strategy means real-world coverage for low-resource languages remains unverified. The three-stage training disclosure gives teams enough detail to attempt domain-specific fine-tuning; the encoder-freezing throughput gain at Stage 3 is the key implementation detail for teams with constrained compute budgets.
Financial and legal document processing
Available through AWS SageMaker JumpStart and Google Cloud Vertex AI for invoice processing, contract analysis, and regulatory document handling with structured output in HTML tables and Markdown. The 10,000-token collapse threshold documented by the academic researchers is a material constraint for long-form legal documents; teams processing contracts or regulatory filings should validate against this limit before production deployment. For regulated industries requiring auditable pipelines, see document processing compliance and on-premise document processing.
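One way to validate against that limit is to budget batch sizes ahead of time. A conservative sketch: the 0.8 safety margin and the `pages_per_batch` helper are illustrative assumptions, not part of any DeepSeek tooling, and per-page token estimates should come from a sample of your own corpus.

```python
COLLAPSE_THRESHOLD = 10_000  # approximate failure point reported by the study
SAFETY_MARGIN = 0.8          # illustrative headroom below the threshold

def pages_per_batch(est_tokens_per_page: int) -> int:
    """How many pages of output can be decoded in one pass while staying
    safely below the reported ~10k-token collapse threshold."""
    if est_tokens_per_page <= 0:
        raise ValueError("token estimate must be positive")
    budget = int(COLLAPSE_THRESHOLD * SAFETY_MARGIN)
    return max(1, budget // est_tokens_per_page)
```

A contract averaging 2,000 text tokens per page would be processed four pages at a time under these assumptions; extremely dense pages fall back to one page per pass.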
RAG and AI pipeline preparation
The token efficiency advantage (256-1,120 tokens per page versus 1,500-6,000 for competing models) directly reduces inference costs in document processing pipelines for RAG. The open-source release integrates with frameworks including Docling and LlamaIndex for chunking and retrieval workflows. IBM's El Maghraoui characterized the relationship between DeepSeek-OCR and Docling as complementary rather than competitive: "Where DeepSeek-OCR shines in raw OCR throughput and token efficiency, Docling excels in end-to-end conversion and structure." DeepSeek's research team has indicated the architecture could evolve into a unified full-modal encoder processing text, images, and audio through different modality query embeddings, though no timeline or release has been announced.
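The per-page savings are simple arithmetic. A sketch comparing the model's worst-case 1,120 visual tokens against the 6,000-token upper bound quoted for competing models; real savings depend on your page mix, so both defaults are assumptions drawn from the ranges above:

```python
def visual_token_savings(tokens_per_page: int = 1_120,
                         baseline_per_page: int = 6_000) -> float:
    """Fraction of visual-token context saved per page versus a baseline
    model. Defaults pit DeepSeek-OCR 2's worst case against the upper
    bound reported for competitors."""
    return 1.0 - tokens_per_page / baseline_per_page
```

Even in the worst case the context shrinks by more than 80%, which compounds across every retrieval call that re-reads the page in a RAG pipeline.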
Production performance
Instavar's March 2026 workflow benchmark across five open-source models reveals the gap between benchmark scores and production fit. DeepSeek ranks second in grounded workflow capability after Hunyuan, extracting 926 total visual anchors versus Hunyuan's 1,517, FireRed's 48, and GLM's 57. Blank-page detection is a standout: DeepSeek achieved 3/3 in the 50-page workflow benchmark, matching Hunyuan and outperforming GLM (0/3) and FireRed (2/3).
The trade-off is inference speed. DeepSeek measured 17.591 seconds per page, the slowest among the models tested. The Instavar analysis team concluded: "DeepSeek is now the second grounded workflow in the measured stack, although it is also the slowest." For real-time document processing, this latency is disqualifying. For batch pipelines where throughput matters more than per-page speed, the 200,000 pages per day figure on a single GPU remains the strongest efficiency argument in the open-source field.
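The latency and throughput figures can be reconciled with back-of-envelope arithmetic: at 17.591 seconds per page, a single sequential stream yields under 5,000 pages per day, so the 200,000 pages/day figure implies on the order of 40 concurrent streams or equivalent batching. This is a derived estimate, not a number reported by either source:

```python
SECONDS_PER_DAY = 86_400

def sequential_pages_per_day(latency_s: float) -> int:
    """Pages/day for a single sequential stream at the measured latency."""
    return int(SECONDS_PER_DAY / latency_s)

def implied_concurrency(target_pages_per_day: int, latency_s: float) -> float:
    """Concurrent streams needed to reconcile per-page latency with a
    daily throughput target, assuming latency holds under batching."""
    return target_pages_per_day * latency_s / SECONDS_PER_DAY
```

The gap between the two modes is the practical dividing line: real-time callers see the full 17.6-second latency, while batch pipelines amortize it across parallel streams.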
By February 2026, Instavar observed that "production fit matters more than tiny score gaps" as open-source OCR models converged on headline benchmarks. That framing accurately describes DeepSeek-OCR 2's position: third on raw OmniDocBench accuracy behind GLM-OCR and PaddleOCR-VL-1.5, but first on token efficiency and competitive on blank-page detection in real workflow conditions.
Technical specifications
| Specification | DeepSeek-OCR 2 | DeepSeek-OCR Legacy |
|---|---|---|
| Parameters | 3B total, ~570M active | 3B |
| Architecture | Visual Causal Flow (DeepEncoder-V2) | Windowed SAM + CLIP |
| Visual encoder | Qwen2-0.5B (Alibaba) | 80M SAM + 300M CLIP (OpenAI) |
| Vision tokenizer | 80M SAM-style, 16x downsample, 896-dim | 80M SAM |
| Resolution support | (0-6)x768x768 + 1x1024x1024 | 512x512 to 1280x1280 |
| Visual tokens per page | 256-1,120 | 64-400 (up to 1,156 in Gundam mode) |
| Max output tokens | 8,192 | Standard |
| OmniDocBench v1.5 score | 91.09% | 87.36% |
| Element-level edit distance | 0.100 (vs. Gemini-3 Pro: 0.115) | 0.129 |
| Reading order edit distance | 0.057 | 0.085 |
| Text edit distance | 0.048 | 0.073 |
| Compression accuracy | ~97% at 10x; ~60% at 20x | Not documented |
| Known failure threshold | ~10,000 text tokens (model collapse) | Not documented |
| Inference speed (workflow) | 17.591 seconds per page | Not documented |
| Throughput (single A100) | 200,000 pages per day | Not documented |
| VRAM requirement | 16 GB minimum; 24 GB recommended; 2 GB (Q4 quantized) | Not documented |
| License | MIT | MIT |
| Model string (Vertex AI) | deepseek-ocr-maas | |
| Fine-tuning support | Unsloth (1.4x faster, 40% less VRAM) | Standard |
Resources
- DeepSeek-OCR 2 GitHub Repository
- DeepSeek-OCR Legacy GitHub Repository
- DeepSeek-OCR Paper on arXiv
- Academic challenge: arXiv 2601.03714, Tohoku University / Chinese Academy of Sciences
- Hugging Face DeepSeek-OCR 2 Model
- AWS SageMaker JumpStart documentation
- Google Cloud Vertex AI DeepSeek-OCR documentation
- Google Cloud self-deployment guide
- Universal DeepSeek OCR 2: CPU/MPS/CUDA implementation
- SinapsisAI commercial package
- Unsloth fine-tuning documentation
- OmniDocBench v1.5 benchmark analysis
- Instavar March 2026 workflow benchmark
- IBM Think: DeepSeek-OCR efficiency analysis
- Regolo.ai: DeepSeek vs GLM-OCR vs PaddleOCR benchmark
- DigitalOcean: optical context compression deep-dive
- Document parsing benchmarks guide
- Vision-language models for OCR
- Open-source OCR tools comparison
Company information
DeepSeek AI is a Chinese AI research organization founded in 2023 as a subsidiary of quantitative hedge fund High-Flyer, headquartered in Hangzhou, China. The company has pursued an MIT licensing strategy across its model releases, enabling on-premises deployment. That choice addresses regulatory concerns enterprises have with hosted Chinese AI services and has driven adoption among buyers with data sovereignty requirements.
The CLIP-to-Qwen2-0.5B substitution in DeepSeek-OCR 2 completes a shift to a fully China-domestic model stack. Whether this reflects a technical preference, a supply-chain decision, or both, the practical effect for enterprise procurement is that the model now depends entirely on components developed within China's domestic AI ecosystem. For organizations with geopolitical procurement constraints, this is a factor independent of benchmark performance.
The benchmark gains in DeepSeek-OCR 2 are self-reported and not independently reproduced across all reviewed sources. The academic challenge from Tohoku University and the Chinese Academy of Sciences is the most rigorous third-party technical evaluation available. It identifies failure modes that are disqualifying for regulated or high-stakes document processing where degraded or adversarial inputs are routine. Teams evaluating the model for production use should treat the 10,000-token collapse threshold and the language-prior dependency as first-order constraints, not edge cases.
For competitive context, DeepSeek-OCR 2 holds a narrow but measurable lead over Gemini-3 Pro on element-level edit distance (0.100 vs. 0.115) but trails GLM-OCR (94.62%) and PaddleOCR-VL-1.5 (94.50%) on raw OmniDocBench accuracy. No head-to-head metrics against Mistral OCR 3 are available in any reviewed source. For broader open-source IDP alternatives, see Docling and Chunkr; for enterprise platforms with auditable pipelines, see ABBYY and Hyperscience.