DeepSeek-OCR: Open-Source Visual OCR Model
DeepSeek-OCR 2, released January 28, 2026, scores 91.09% on OmniDocBench v1.5, a 3.73-point improvement over its predecessor, while processing complex pages in 256-1,120 visual tokens compared to the 1,500-6,000 tokens typical of competing models. The headline architectural change is replacing OpenAI's CLIP encoder with Alibaba's Qwen2-0.5B, completing a shift to a fully China-domestic model stack. The efficiency gains are real: 200,000 pages per day on a single A100 GPU, versus typical large language models (LLMs) processing several thousand pages on 20 or more GPUs. The accuracy ceiling, however, is not the highest in its class: GLM-OCR reaches 94.62% and PaddleOCR-VL-1.5 reaches 94.50% on the same benchmark, though both require substantially more visual tokens. Researchers from Tohoku University and the Chinese Academy of Sciences found that accuracy collapses from approximately 90% to approximately 20% when linguistic support is removed, a limitation that affects its suitability for high-stakes production pipelines where adversarial or degraded inputs are routine.
Overview
DeepSeek AI released the original DeepSeek-OCR in October 2025 as a 3B parameter model combining windowed SAM and CLIP encoders with four processing modes (Tiny through Large). The architectural break came with DeepSeek-OCR 2 in January 2026: the CLIP ViT encoder was replaced by DeepEncoder-V2, initialized from Alibaba's Qwen2-0.5B, and a Visual Causal Flow attention mechanism was introduced that dynamically rearranges visual tokens by semantic context rather than fixed raster order. The result is measurable progress on complex layouts, with reading order edit distance improving from 0.085 to 0.057, alongside a narrow lead over Gemini-3 Pro on element-level edit distance (0.100 vs. 0.115).
The CLIP-to-Qwen2 substitution is a supply-chain decision as much as a technical one. For enterprise buyers with data sovereignty requirements or geopolitical procurement constraints, the shift to a fully China-domestic stack (DeepSeek decoder plus Alibaba encoder) is a material factor independent of benchmark performance. SCMP, which reported this angle most prominently, is owned by Alibaba Group, the developer of Qwen2-0.5B; that conflict of interest is worth weighing when assessing how the substitution is framed.
The model is available under MIT license with weights, code, and research paper published at deepseek-ai/DeepSeek-OCR-2. Cloud deployments followed quickly: Google Cloud Vertex AI as deepseek-ocr-maas (general availability, us-central1) and AWS SageMaker JumpStart with one-click provisioning. The open-source release also spawned Universal DeepSeek OCR 2 for CPU and Apple Metal GPU inference and SinapsisAI's commercial wrapper with Docker deployment and Gradio interface. The model attracted 4,000 GitHub stars within 24 hours of release and remained atop Hugging Face's most popular model list for over a week.
No enterprise adoption figures have been disclosed. No pricing beyond the MIT open-source release applies to the base model.
How DeepSeek-OCR processes documents
DeepSeek-OCR 2's core innovation is an asymmetric attention pattern inside DeepEncoder-V2: visual tokens use bidirectional attention, while appended "causal flow tokens" use causal attention. This produces a 1D sequence already aligned with a learned reading order before the decoder processes it, directly addressing complex layouts like tables, multi-column text, and mixed-format pages where reading order is non-obvious. The decoder is DeepSeek-3B-MoE with 6/64 expert routing, activating approximately 570M of 3B total parameters during inference.
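The asymmetric pattern can be sketched as an attention mask. This is a minimal illustration, assuming the causal flow tokens are appended after the visual tokens and that visual tokens do not attend back to flow tokens (a prefix-LM-style layout); the paper's exact masking may differ.

```python
import numpy as np

def asymmetric_attention_mask(n_visual: int, n_flow: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a sequence of
    n_visual visual tokens followed by n_flow causal flow tokens.

    Visual tokens attend bidirectionally among themselves; each flow
    token attends to every visual token and, causally, to flow tokens
    at or before its own position.
    """
    n = n_visual + n_flow
    mask = np.zeros((n, n), dtype=bool)
    # Visual block: full bidirectional attention among visual tokens.
    mask[:n_visual, :n_visual] = True
    # Flow tokens: attend to all visual tokens...
    mask[n_visual:, :n_visual] = True
    # ...and causally to themselves and earlier flow tokens.
    mask[n_visual:, n_visual:] = np.tril(np.ones((n_flow, n_flow), dtype=bool))
    return mask

m = asymmetric_attention_mask(4, 3)
```

The causal half is what lets the encoder emit a 1D sequence in a learned reading order: each flow token summarizes "what comes next" conditioned only on the full image and the order decided so far.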
The vision tokenizer uses an 80M parameter SAM-style backbone with two convolutional layers, downsampling by a factor of 16 into 896-dimensional embeddings. A global 1024x1024 view yields 256 tokens; up to six local 768x768 crops add 144 tokens each, for a maximum of 1,120 tokens per page. At compression ratios below 10x, the model achieves approximately 97% OCR precision; at 20x compression, accuracy drops to around 60%. This compression ceiling is the key constraint for teams evaluating the model against document types with extreme visual density.
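The token budget arithmetic can be checked directly. A small sketch using only the figures reported here (256 tokens for the global view, 144 per local crop, at most six crops); the helper names are illustrative, not DeepSeek API:

```python
def visual_token_budget(n_crops: int) -> int:
    """Visual tokens per page: one 1024x1024 global view (256 tokens)
    plus n_crops local 768x768 crops at 144 tokens each (0 <= n_crops <= 6)."""
    if not 0 <= n_crops <= 6:
        raise ValueError("n_crops must be between 0 and 6")
    return 256 + 144 * n_crops

def compression_ratio(text_tokens: int, n_crops: int) -> float:
    """Ratio of ground-truth text tokens to visual tokens; the reported
    figures are ~97% precision below 10x and ~60% around 20x."""
    return text_tokens / visual_token_budget(n_crops)
```

A dense page whose transcript would be 2,560 text tokens, rendered with no local crops, sits exactly at the 10x boundary where accuracy starts to degrade.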
Training ran across three stages on OCR-heavy data (80% OCR content; text:formula:table ratio 3:1:1), covering 30 million PDF pages across 100 languages including 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures. The three stages covered encoder pretraining on approximately 160 A100 GPUs for 40,000 iterations; encoder-decoder coupling for 15,000 iterations with 4-stage pipeline parallelism and 40 data-parallel replicas; and decoder fine-tuning for 20,000 iterations with the encoder frozen. Freezing the encoder at Stage 3 more than doubled training throughput, a detail relevant to teams considering domain-specific fine-tuning of the open-source release.
The academic challenge from Tohoku University and the Chinese Academy of Sciences (arXiv: 2601.03714) raises a structural question about this architecture: accuracy drops from approximately 90% to approximately 20% when linguistic support is removed, lower visual token counts correlate with higher hallucination risk, and total model collapse occurs at approximately 10,000 text tokens. Whether these failure modes reflect language-prior shortcuts baked into the architecture, or can be corrected through fine-tuning, is an open question the sources do not resolve. As Li Boji, a PhD and AI startup founder, put it: "For manuscripts that are difficult to recognize, relying on acquired knowledge may help AI understand the text, but it could be a drawback for clearly printed materials."
Dense newspaper layouts remain a known weakness, with text edit distance above 0.13. DeepSeek attributes this to limited training data and extreme text density compression rather than architectural limitations. Repetition rates improved across both user log images (6.25% to 4.17%) and PDF production documents (3.69% to 2.88%), representing a practical quality signal for production deployments where repeated output tokens inflate cost and degrade downstream parsing.
Use cases
Enterprise document digitization
The model processes complex pages in 256-1,120 visual tokens, enabling high-throughput digitization pipelines with reduced downstream LLM compute costs. Community benchmarks report 200,000 pages daily on a single A100 GPU, with potential for 33 million pages per day on a 20-node cluster of 160 A100 GPUs. The MIT license enables on-premises deployment, which is the primary reason enterprise buyers with data sovereignty requirements or concerns about hosted Chinese AI services adopt the model. As Kaoutar El Maghraoui, Principal Research Scientist at IBM, noted: "Using vision token compression to convert document images into compact tokens helps slash context length and cost significantly."
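The cluster-scale figures follow from simple linear scaling of the single-GPU number. A sketch of that estimate (real deployments lose some capacity to I/O and scheduling overhead, so treat it as an upper bound):

```python
def daily_capacity(gpus: int, pages_per_gpu_per_day: int = 200_000) -> int:
    """Linear-scaling estimate of daily page throughput across a cluster,
    using the community-reported 200,000 pages/day per A100 as the default."""
    return gpus * pages_per_gpu_per_day
```

160 A100s give 32 million pages per day under this estimate, in line with the roughly 33 million figure cited for a 20-node cluster.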
Complex layout processing
Reading order edit distance improved from 0.085 to 0.057 compared to the original DeepSeek-OCR, the metric most directly enabled by the semantic architecture. This translates to measurably better handling of multi-column documents, academic papers, forms, and presentations. The model covers 9 document categories on OmniDocBench v1.5 including books, academic papers, forms, presentations, and newspapers in Chinese and English. Newspaper layouts remain the weakest category, above 0.13 text edit distance. PaddleOCR-VL-1.5 may outperform DeepSeek on reading order in irregular layouts, making it the stronger choice for document types where layout irregularity is the primary challenge.
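The edit-distance metrics cited here are normalized Levenshtein distances. A reference implementation for spot-checking model output against ground truth; OmniDocBench's exact normalization and tokenization may differ, so this is a sketch of the metric family, not the benchmark's scoring code:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between prediction and reference, divided by
    the longer length: 0.0 is a perfect match, 1.0 a total mismatch."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, n)
```

On this scale, the model's 0.057 reading order score means roughly one misplaced element per eighteen, and the 0.13+ newspaper figure more than one error per eight characters of ordered text.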
Multilingual document processing
After fine-tuning, the model shows an 86% reduction in Character Error Rate on Persian, with reductions of 57-86% across several other languages. Training on 30 million PDF pages across 100 languages positions the model for global deployment, though the synthetic data strategy means real-world coverage for low-resource languages remains unverified. The three-stage training disclosure gives teams enough detail to attempt domain-specific fine-tuning; the encoder-freezing throughput gain at Stage 3 is the key implementation detail for teams with constrained compute budgets.
Financial and legal document processing
Available through AWS SageMaker JumpStart and Google Cloud Vertex AI for invoice processing, contract analysis, and regulatory document handling with structured output in HTML tables and Markdown. The 10,000-token collapse threshold documented by the academic researchers is a material constraint for long-form legal documents; teams processing contracts or regulatory filings should validate against this limit before production deployment. For regulated industries requiring auditable pipelines, see document processing compliance and on-premise document processing.
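One way to validate against that limit is to budget batch sizes ahead of time. A conservative sketch: the 0.8 safety margin and the `pages_per_batch` helper are illustrative assumptions, not part of any DeepSeek tooling, and per-page token estimates should come from a sample of your own corpus.

```python
COLLAPSE_THRESHOLD = 10_000  # approximate failure point reported by the study
SAFETY_MARGIN = 0.8          # illustrative headroom below the threshold

def pages_per_batch(est_tokens_per_page: int) -> int:
    """How many pages of output can be decoded in one pass while staying
    safely below the reported ~10k-token collapse threshold."""
    if est_tokens_per_page <= 0:
        raise ValueError("token estimate must be positive")
    budget = int(COLLAPSE_THRESHOLD * SAFETY_MARGIN)
    return max(1, budget // est_tokens_per_page)
```

A contract averaging 2,000 text tokens per page would be processed four pages at a time under these assumptions; extremely dense pages fall back to one page per pass.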
RAG and AI pipeline preparation
The token efficiency advantage (256-1,120 tokens per page versus 1,500-6,000 for competing models) directly reduces inference costs in document processing pipelines for RAG. The open-source release integrates with frameworks including Docling and LlamaIndex for chunking and retrieval workflows. IBM's El Maghraoui characterized the relationship between DeepSeek-OCR and Docling as complementary rather than competitive: "Where DeepSeek-OCR shines in raw OCR throughput and token efficiency, Docling excels in end-to-end conversion and structure." DeepSeek's research team has indicated the architecture could evolve into a unified full-modal encoder processing text, images, and audio through different modality query embeddings, though no timeline or release has been announced.
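The per-page savings are simple arithmetic. A sketch comparing the model's worst-case 1,120 visual tokens against the 6,000-token upper bound quoted for competing models; real savings depend on your page mix, so both defaults are assumptions drawn from the ranges above:

```python
def visual_token_savings(tokens_per_page: int = 1_120,
                         baseline_per_page: int = 6_000) -> float:
    """Fraction of visual-token context saved per page versus a baseline
    model. Defaults pit DeepSeek-OCR 2's worst case against the upper
    bound reported for competitors."""
    return 1.0 - tokens_per_page / baseline_per_page
```

Even in the worst case the context shrinks by more than 80%, which compounds across every retrieval call that re-reads the page in a RAG pipeline.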
Production performance
Instavar's March 2026 workflow benchmark across five open-source models reveals the gap between benchmark scores and production fit. DeepSeek ranks second in grounded workflow capability after Hunyuan, extracting 926 total visual anchors versus Hunyuan's 1,517, FireRed's 48, and GLM's 57. Blank-page detection is a standout: DeepSeek achieved 3/3 in the 50-page workflow benchmark, matching Hunyuan and outperforming GLM (0/3) and FireRed (2/3).
The trade-off is inference speed. DeepSeek measured 17.591 seconds per page, the slowest among the models tested. The Instavar analysis team concluded: "DeepSeek is now the second grounded workflow in the measured stack, although it is also the slowest." For real-time document processing, this latency is disqualifying. For batch pipelines where throughput matters more than per-page speed, the 200,000 pages per day figure on a single GPU remains the strongest efficiency argument in the open-source field.
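The latency and throughput figures can be reconciled with back-of-envelope arithmetic: at 17.591 seconds per page, a single sequential stream yields under 5,000 pages per day, so the 200,000 pages/day figure implies on the order of 40 concurrent streams or equivalent batching. This is a derived estimate, not a number reported by either source:

```python
SECONDS_PER_DAY = 86_400

def sequential_pages_per_day(latency_s: float) -> int:
    """Pages/day for a single sequential stream at the measured latency."""
    return int(SECONDS_PER_DAY / latency_s)

def implied_concurrency(target_pages_per_day: int, latency_s: float) -> float:
    """Concurrent streams needed to reconcile per-page latency with a
    daily throughput target, assuming latency holds under batching."""
    return target_pages_per_day * latency_s / SECONDS_PER_DAY
```

The gap between the two modes is the practical dividing line: real-time callers see the full 17.6-second latency, while batch pipelines amortize it across parallel streams.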
By February 2026, Instavar observed that "production fit matters more than tiny score gaps" as open-source OCR models converged on headline benchmarks. That framing accurately describes DeepSeek-OCR 2's position: third on raw OmniDocBench accuracy behind GLM-OCR and PaddleOCR-VL-1.5, but first on token efficiency and competitive on blank-page detection in real workflow conditions.
Technical specifications
| Specification | DeepSeek-OCR 2 | DeepSeek-OCR Legacy |
|---|---|---|
| Parameters | 3B total, ~570M active | 3B |
| Architecture | Visual Causal Flow (DeepEncoder-V2) | Windowed SAM + CLIP |
| Visual encoder | Qwen2-0.5B (Alibaba) | 80M SAM + 300M CLIP (OpenAI) |
| Vision tokenizer | 80M SAM-style, 16x downsample, 896-dim | 80M SAM |
| Resolution support | (0-6)x768x768 + 1x1024x1024 | 512x512 to 1280x1280 |
| Visual tokens per page | 256-1,120 | 64-400 (up to 1,156 in Gundam mode) |
| Max output tokens | 8,192 | Standard |
| OmniDocBench v1.5 score | 91.09% | 87.36% |
| Element-level edit distance | 0.100 (vs. Gemini-3 Pro: 0.115) | 0.129 |
| Reading order edit distance | 0.057 | 0.085 |
| Text edit distance | 0.048 | 0.073 |
| Compression accuracy | ~97% at 10x; ~60% at 20x | Not documented |
| Known failure threshold | ~10,000 text tokens (model collapse) | Not documented |
| Inference speed (workflow) | 17.591 seconds per page | Not documented |
| Throughput (single A100) | 200,000 pages per day | Not documented |
| VRAM requirement | 16 GB minimum; 24 GB recommended; 2 GB (Q4 quantized) | Not documented |
| License | MIT | MIT |
| Model string (Vertex AI) | deepseek-ocr-maas | |
| Fine-tuning support | Unsloth (1.4x faster, 40% less VRAM) | Standard |
Resources
- DeepSeek-OCR 2 GitHub Repository
- DeepSeek-OCR Legacy GitHub Repository
- DeepSeek-OCR Paper on arXiv
- Academic challenge: arXiv 2601.03714, Tohoku University / Chinese Academy of Sciences
- Hugging Face DeepSeek-OCR 2 Model
- AWS SageMaker JumpStart documentation
- Google Cloud Vertex AI DeepSeek-OCR documentation
- Google Cloud self-deployment guide
- Universal DeepSeek OCR 2: CPU/MPS/CUDA implementation
- SinapsisAI commercial package
- Unsloth fine-tuning documentation
- OmniDocBench v1.5 benchmark analysis
- Instavar March 2026 workflow benchmark
- IBM Think: DeepSeek-OCR efficiency analysis
- Regolo.ai: DeepSeek vs GLM-OCR vs PaddleOCR benchmark
- DigitalOcean: optical context compression deep-dive
- Document parsing benchmarks guide
- Vision-language models for OCR
- Open-source OCR tools comparison
Company information
DeepSeek AI is a Chinese AI research organization founded in 2023 as a subsidiary of quantitative hedge fund High-Flyer, headquartered in Hangzhou, China. The company has pursued an MIT licensing strategy across its model releases, enabling on-premises deployment. That choice addresses regulatory concerns enterprises have with hosted Chinese AI services and has driven adoption among buyers with data sovereignty requirements.
The CLIP-to-Qwen2-0.5B substitution in DeepSeek-OCR 2 completes a shift to a fully China-domestic model stack. Whether this reflects a technical preference, a supply-chain decision, or both, the practical effect for enterprise procurement is that the model now depends entirely on components developed within China's domestic AI ecosystem. For organizations with geopolitical procurement constraints, this is a factor independent of benchmark performance.
The benchmark gains in DeepSeek-OCR 2 are self-reported and not independently reproduced across all reviewed sources. The academic challenge from Tohoku University and the Chinese Academy of Sciences is the most rigorous third-party technical evaluation available. It identifies failure modes that are disqualifying for regulated or high-stakes document processing where degraded or adversarial inputs are routine. Teams evaluating the model for production use should treat the 10,000-token collapse threshold and the language-prior dependency as first-order constraints, not edge cases.
For competitive context, DeepSeek-OCR 2 holds a narrow but measurable lead over Gemini-3 Pro on element-level edit distance (0.100 vs. 0.115) but trails GLM-OCR (94.62%) and PaddleOCR-VL-1.5 (94.50%) on raw OmniDocBench accuracy. No head-to-head metrics against Mistral OCR 3 are available in any reviewed source. For broader open-source IDP alternatives, see Docling and Chunkr; for enterprise platforms with auditable pipelines, see ABBYY and Hyperscience.