AI company developing document intelligence foundation models with PDF-to-markdown conversion and OCR capabilities for 90+ languages.

  • Team size: 7
  • Seed funding raised: $3.5M
  • Chandra 2 olmOCR benchmark: 85.9%
  • GitHub stars (Surya + Marker): 40K+

Overview

Datalab is a Brooklyn-based AI company that reached seven-figure annual recurring revenue with a seven-person team serving tier 1 AI laboratories, including Anthropic as a confirmed customer. Founded in June 2024, the company raised $3.5 million in seed funding led by Pebblebed, an early-stage fund founded by OpenAI and FAIR alumni, with participation from Peak XV and angels including Balaji Srinivasan and Hugging Face founding members.

The company was built around Marker, an open-source PDF-to-markdown conversion tool, and has since expanded into a commercial API platform with its own OCR foundation models. Surya and Marker, Datalab's two core open-source tools, have combined for 40,000+ GitHub stars: Marker at 29,000 and Surya at 19,000.

The strategic arc is clear. Marker established Datalab's benchmark position in document conversion. The July 2025 Python SDK launch signaled an API-first pivot. The October 2025 release of the Chandra OCR model moved the company from tooling into foundation model territory. March 2026 brought Chandra OCR 2, a 4-billion-parameter vision-language model (VLM) that halved the parameter count of its predecessor while improving benchmark performance by 2.8 percentage points. April 2026 brought production availability on Replicate, a model hosting platform, making Marker and the Surya OCR model accessible via API without self-hosting.

CEO Vik Paruchuri, the author of Marker and the underlying Surya OCR technology, has described the company's core bet directly: "We train our own very small models that inference quickly, minimize hallucination risk with custom architecture, are very accurate, and do things that LLMs can't, like tell you exactly where in the document a piece of information is. We basically combine the accuracy and flexibility of LLMs, with the speed and limited hallucination-risk of older OCR tools." The OpenRAIL code license is paired with a restricted model-weights license that is free below $2M in revenue and paid above that threshold, mirroring a pattern used by other machine learning infrastructure companies to drive adoption while capturing commercial value as the user base scales.

How Datalab processes documents

Datalab's processing stack has two layers: the Marker conversion pipeline and the Chandra OCR foundation model that powers it.

Marker converts PDFs and other document formats to structured output by combining layout detection, OCR, and optional LLM post-processing. The core pipeline runs locally or via the hosted API. The optional --use_llm flag (defaulting to gemini-2.0-flash) routes selected tasks through one of six LLM backends, primarily for table extraction: Google Gemini, Google Vertex, Ollama (local), Anthropic Claude, OpenAI-compatible endpoints, and Azure OpenAI. This gives enterprise deployments flexibility to route LLM calls through existing infrastructure rather than a single vendor.

Two architectural shifts in early 2026 define the current Marker version. v1.9.0 moved OCR inference from line level to block level. The release notes explicitly name this trade-off: it is a bit slower, but it boosts accuracy. v1.10.0 introduced a new layout model via an upstream Surya upgrade, described as a "major performance boost," and added an --html_tables_in_markdown flag allowing tables to render as HTML tags rather than default markdown syntax. Table extraction has received sustained attention across multiple releases: v1.9.2 added iterative LLM looping and introduced detection and correction of table cells that cut text lines, specifically targeting hallucination reduction. v1.9.3 added metadata storage capability and a Modal deployment example.
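The line-to-block shift can be pictured with a toy grouping pass: lines separated by a small vertical gap merge into one block, so the recognizer decodes a whole paragraph of context per call instead of isolated lines. The heuristic below is purely illustrative and is not Marker's actual algorithm.

```python
# Toy illustration of block-level grouping (not Marker's code): lines whose
# vertical gap is small are merged, giving the OCR model more context per call.

def group_lines_into_blocks(lines, max_gap=6):
    """lines: list of (top_y, bottom_y, text) tuples, sorted top to bottom."""
    blocks, current = [], []
    for top, bottom, text in lines:
        if current and top - current[-1][1] > max_gap:
            blocks.append(" ".join(t for _, _, t in current))
            current = []
        current.append((top, bottom, text))
    if current:
        blocks.append(" ".join(t for _, _, t in current))
    return blocks

lines = [
    (0, 10, "Revenue grew 12%"),
    (12, 22, "year over year,"),   # 2px gap: same paragraph
    (40, 50, "Table 1: Results"),  # 18px gap: new block
]
print(group_lines_into_blocks(lines))
```

Decoding per merged block trades some speed for accuracy, which is the trade-off the v1.9.0 release notes describe.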

On olmOCR-Bench (1,403 PDF files, 7,010 test cases), Marker's balanced mode scored 82.7% overall accuracy, outperforming GPT-4o at 69.9% by 12.8 percentage points, Deepseek OCR at 74.2% by 8.5 points, and Mistral OCR at 72.0% by 10.7 points. Marker fast mode scored 76.5% on the same benchmark. On speed, Marker processes one page in approximately 0.18 seconds and reaches 120 pages per second when batched, compared to LlamaParse's 23.35 seconds per page, roughly a 130x per-page speed advantage at comparable or better accuracy.
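Dividing the quoted per-page latencies gives the speed ratio directly; a quick back-of-envelope check:

```python
# Back-of-envelope check of the per-page latency figures quoted above.
marker_single_page_s = 0.18   # Marker, seconds per page
llamaparse_page_s = 23.35     # LlamaParse, seconds per page
marker_batched_pps = 120      # Marker pages/sec when batched

per_page_speedup = llamaparse_page_s / marker_single_page_s
print(round(per_page_speedup))             # ~130x faster per page
print(round(1 / marker_single_page_s, 1))  # ~5.6 pages/sec unbatched
```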

Chandra OCR 2, released in March 2026, is Datalab's current full-page OCR foundation model. The 4-billion-parameter architecture runs on a single consumer GPU (10-12GB VRAM in BFloat16 format, with 16GB or more recommended), making it deployable without specialized hardware. Unlike pipeline-based systems that process line by line, Chandra decodes entire pages to preserve layout context. It outputs bounding box coordinates for every text block, table, and image, with exports to Markdown, HTML, and JSON.
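Because every block carries bounding box coordinates, downstream code can map extracted text back to its position on the page. The sketch below shows the general pattern; the field names ("blocks", "bbox", "type") are assumptions for illustration, not Chandra's documented JSON schema.

```python
import json

# Hypothetical bounding-box-annotated OCR result; the field names here are
# illustrative only, NOT Chandra's actual output schema.
page = json.loads("""
{
  "blocks": [
    {"type": "heading", "bbox": [72, 40, 540, 70], "text": "Q3 Results"},
    {"type": "table",   "bbox": [72, 90, 540, 300], "text": "..."}
  ]
}
""")

def blocks_of_type(page, block_type):
    """Return (text, bbox) pairs for every block of the requested type."""
    return [(b["text"], b["bbox"]) for b in page["blocks"] if b["type"] == block_type]

print(blocks_of_type(page, "table"))  # locate every table on the page
```

This is the capability Paruchuri highlights: unlike a bare LLM, the model can say exactly where on the page a piece of information came from.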

Chandra 2 supports 15+ layout block types including tables, forms, diagrams, equations, code blocks, chemical structures, and bibliographies. New capabilities over Chandra 1 include Mermaid diagram conversion, structured chart extraction (values, categories, axis labels), image captioning, and chemistry detection from molecular structures. The model delivers 2x throughput improvement over its predecessor, processing 2 pages per second on an NVIDIA H100 GPU with 96 concurrent requests.

The evolution from Chandra 1 to Chandra 2 is measurable: 85.9% accuracy on the olmOCR benchmark (up 2.8 points from 83.1%), and 77.8% accuracy across a 43-language multilingual benchmark (up 8.4 points from 69.4%). On the expanded 90-language benchmark, Chandra 2 scores 72.7% versus Gemini 2.5 Flash at 60.8%. The hosted API scores 86.7% on olmOCR, the highest among tested models, suggesting post-processing or ensemble techniques enhance accuracy beyond the base model alone.

One limitation is worth noting directly: Chandra 2 scores 50.4% on the "Old Scans" category, compared to Marker v1.10.0 at 32.3%. Both figures are low, and Chandra 2 leads, but neither is competitive with legacy OCR vendors that have historically specialized in degraded document handling. Teams processing historical documents, microfilm, or heavily degraded source material should test against their specific document quality before committing.

Surya, the underlying OCR and layout analysis toolkit, achieves 0.97 normalized sentence similarity on OCR benchmarks versus Tesseract's 0.88, using matched CPU resources. Table recognition scores 1.0 row intersection and 0.98625 column intersection on a FinTabNet subset, compared to Table Transformer's 0.84 and 0.86857. The text detection model was trained on 4x A6000 GPUs for 3 days using a modified EfficientViT architecture; text recognition was trained on 4x A6000 GPUs for 2 weeks using a modified Donut model with grouped query attention (GQA), mixture-of-experts (MoE) layers, and UTF-16 decoding.
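A normalized sentence similarity score of the kind quoted for Surya can be illustrated with the standard library's `difflib` ratio; the benchmark's exact metric may differ, but the idea is the same 0-to-1 match score between ground truth and OCR output.

```python
from difflib import SequenceMatcher

def similarity(reference: str, ocr_output: str) -> float:
    """0.0-1.0 match score between ground-truth text and OCR output."""
    return SequenceMatcher(None, reference, ocr_output).ratio()

truth = "Net revenue increased 12% year over year."
good  = "Net revenue increased 12% year over year."
noisy = "Net revenue lncreased 12% year ovcr year."  # two character errors

print(similarity(truth, good))   # perfect match scores 1.0
print(similarity(truth, noisy))  # high, but penalized for each error
```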

Deployment options span three tiers. The hosted API at datalab.to carries a 99.99% uptime SLA and processes a 250-page PDF in approximately 15 seconds. Both Marker and the Surya OCR model are now available via Replicate at $4 per 1,000 pages (fast and balanced modes) and $6 per 1,000 pages (accurate mode and structured JSON extraction). Chandra 2 is installable via pip install chandra-ocr with support for vLLM and HuggingFace inference methods. A self-serve on-premises licensed solution is available for enterprises with data-residency requirements. Sandy Kwon, COO, has confirmed: "We never train on customer data. All of our training data is curated or generated internally."
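At the Replicate rates quoted above, job cost is a straight per-page multiplication. A small estimator, with the rates copied from the text:

```python
# Cost estimator using the Replicate per-1,000-page rates quoted above.
RATE_PER_1K = {
    "fast": 4.00,
    "balanced": 4.00,
    "accurate": 6.00,
    "json": 6.00,  # structured JSON extraction
}

def replicate_cost(pages: int, mode: str = "balanced") -> float:
    """USD cost for processing `pages` pages in the given Marker mode."""
    return pages / 1000 * RATE_PER_1K[mode]

print(replicate_cost(250_000, "balanced"))  # 1000.0 -> $1,000 for 250k pages
print(replicate_cost(250_000, "accurate"))  # 1500.0 at the accurate rate
```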

Benchmark position (from the marker_benchmark dataset on HuggingFace): Marker scores 95.67 heuristic accuracy, outperforming Docling (86.71) by 9.0 points, Mathpix (86.43) by 9.2 points, and LlamaParse (84.24) by 11.4 points. The largest per-category gap is on forms: Marker 88.01 vs. Docling 68.39, a 20-point lead. On FinTabNet table extraction (99 tables, tree-edit-distance metric), Marker with --use_llm scores 0.907 vs. Gemini standalone at 0.829 and base Marker at 0.816.

The Rules API provides a natural language-based correction system for customizing Marker outputs and handling edge cases. Datalab Forge is the interactive playground for visualizing and testing document processing rules. The docext repository, available at github.com/datalab-to, is an OCR-free unstructured data extraction and benchmarking toolkit, pointing toward a product direction beyond Marker and Surya. Datalab maintains 11 public repositories in total.

Use cases

AI laboratory and RAG pipeline infrastructure

Datalab's primary customer base is tier 1 AI laboratories requiring high-accuracy document conversion for model training and research workflows. Anthropic is a confirmed customer, though contractual restrictions prevent disclosure of engagement details. The sustained attention to table extraction, including iterative LLM looping, hallucination detection, and the FinTabNet benchmark, points toward Marker as infrastructure for retrieval-augmented generation pipelines, where table fidelity directly affects downstream answer quality. The 9.1-point lift from --use_llm on FinTabNet over base Marker makes the case for hybrid processing in RAG contexts.

Hugging Face selected Chandra OCR 2 as its primary model for processing 27,000+ arXiv papers lacking HTML versions, achieving a 100% success rate on 16x L40S GPUs at approximately 60 papers per hour. The cost comparison is instructive: running the job via HuggingFace Jobs cost approximately $850, versus $1,841 to $2,762 for the Datalab hosted API at that scale. Teams building similar open-source extraction pipelines may also evaluate Unstract, which offers a no-code LLM platform with hallucination mitigation for production document workflows.
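The reported throughput implies the scale of the job directly; a rough check, assuming the ~60 papers/hour figure is the cluster-wide rate:

```python
# Sanity check on the Hugging Face arXiv job numbers quoted above.
papers = 27_000
papers_per_hour = 60          # reported rate on the 16x L40S cluster (assumed cluster-wide)

wall_clock_hours = papers / papers_per_hour
print(wall_clock_hours)       # 450 hours of processing at that rate
print(round(850 / papers, 3)) # ~$0.031 per paper via HuggingFace Jobs
```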

Financial and compliance document processing

Marker's benchmark lead is widest where document complexity is highest: a 20-point gap over Docling on forms and a 13-point gap over Mathpix on engineering documents. Chandra 2 achieves 89.9% accuracy on tables, and the FinTabNet results (0.907 with LLM augmentation) are directly relevant to financial statement extraction workflows. This positions Datalab for financial analysis, compliance review, and technical documentation where layout fidelity and table accuracy directly affect downstream data quality. Teams processing investment research and SEC filings at scale may also consider Acuity Knowledge Partners, whose Agent Fleet agentic AI serves 800+ financial institutions with document processing and research automation.

Academic and scientific research processing

Chandra 2 achieves 90.2% accuracy on ArXiv papers and 89.3% on handwritten math. The full-page decoding architecture handles mathematical notation, handwriting, and chemical formulas that commonly fail in pipeline-based OCR systems. The v1.8.3 OCR model was specifically described as "better all-around, but particularly at math," and Chandra 2 extends this with explicit support for chemical structure detection from molecular diagrams. Researchers working with scientific literature at scale may also consider PaperQA Nemotron, an open-source platform combining RAG capabilities with NVIDIA Nemotron models for scientific document processing.

Multilingual document processing

South Asian scripts saw the largest gains from Chandra 1 to Chandra 2: Bengali improved 27.2 percentage points, Kannada 42.6, Malayalam 46.2, Tamil 26.9, and Telugu 39.1. On European languages, German achieves 94.8%, Italian 94.1%, French 93.7%, and Portuguese 95.2%. The 77.8% accuracy across 43 common languages and 72.7% across 90 languages positions Datalab for global document processing workflows, particularly in markets where mainstream OCR solutions have historically underperformed on non-Latin scripts.

Developer integration

The Python SDK (version 0.5.0, published April 6, 2026) exposes PDF-to-markdown conversion through a client.convert() method, supports workflow chaining for multi-step document processing, and includes CLI tools. Python 3.10+ is required. At one-quarter the price of leading cloud competitors, the hosted API is positioned as a cost-reduction play for teams already paying for Mathpix or LlamaParse. The self-serve on-premises licensing option extends this to enterprises with data-residency requirements, a segment where hosted-only competitors cannot compete directly. Developers building structured extraction pipelines on top of LLMs may also find LangExtract relevant, as Google's open-source Python library targets structured information extraction from unstructured text with precise source grounding. Teams requiring open-source document layout analysis as a foundation layer may also evaluate Deepdoctection, a PyTorch-based Python library that orchestrates layout analysis, OCR, and classification using deep learning models. See the Marker PDF-to-Markdown guide for implementation details.
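The convert-then-chain pattern the SDK exposes can be sketched with a stub. `DatalabClient` below is a toy stand-in written for illustration only, not the real SDK class, and the real `client.convert()` signature may differ.

```python
# Stub illustrating the convert-then-chain workflow pattern described above.
# DatalabClient here is a hypothetical stand-in, NOT the real datalab SDK.

class DatalabClient:
    def __init__(self, api_key: str):
        self.api_key = api_key

    def convert(self, path: str, output_format: str = "markdown") -> dict:
        # A real client would upload `path` and poll the hosted API;
        # this canned result just shows the calling shape.
        return {"format": output_format, "markdown": f"# Converted: {path}"}

def chunk_markdown(markdown: str, max_chars: int = 500) -> list[str]:
    """Second workflow step: split converted output for a RAG index."""
    return [markdown[i:i + max_chars] for i in range(0, len(markdown), max_chars)]

client = DatalabClient(api_key="dl-...")
result = client.convert("report.pdf")        # step 1: convert the document
chunks = chunk_markdown(result["markdown"])  # step 2: chain into chunking
print(chunks[0])
```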

Technical specifications

  • Deployment options: Cloud API (99.99% SLA), Replicate hosted API, on-premise (self-serve licensing), open-source
  • API: REST API with Python SDK (v0.5.0)
  • Supported languages: 90+ languages including complex scripts
  • Document formats: PDF, DOCX, PPTX, XLSX, HTML, EPUB, images
  • Output formats: Markdown, JSON, HTML, chunks
  • SDK requirements: Python 3.10+
  • OCR architecture: block-level inference (since v1.9.0); full-page decoding in Chandra
  • Chandra 2 parameters: 4 billion
  • Chandra 2 GPU memory: 10-12GB VRAM (BFloat16); 16GB+ recommended
  • LLM backends: Google Gemini, Google Vertex, Ollama, Anthropic Claude, OpenAI-compatible, Azure OpenAI
  • Throughput (H100, Chandra 2): 2 pages/sec at 96 concurrent requests
  • Throughput (Marker batched): 120 pages/sec
  • Hosted API speed: ~15 seconds for a 250-page PDF
  • olmOCR-Bench (Marker balanced): 82.7% (vs. GPT-4o 69.9%, Mistral OCR 72.0%, Deepseek OCR 74.2%)
  • olmOCR benchmark (Chandra 2 base): 85.9%
  • olmOCR benchmark (Chandra 2 hosted API): 86.7%
  • Multilingual benchmark (43 languages): 77.8% (vs. GPT-5 Mini 60.5%, Gemini 2.5 Flash 67.6%)
  • Marker benchmark score: 95.67 heuristic
  • FinTabNet (Marker + LLM): 0.907
  • Surya OCR accuracy: 0.97 normalized sentence similarity (vs. Tesseract 0.88)
  • Layout block types: 15+ (tables, forms, diagrams, equations, code, chemical structures, bibliographies)
  • Replicate pricing: $4/1,000 pages (fast/balanced); $6/1,000 pages (accurate/JSON)
  • License (code): OpenRAIL (as of v1.8.5)
  • License (model weights): modified OpenRAIL-M; free for research, personal use, and startups under $2M; commercial licensing via datalab.to/pricing
  • Current Marker version: v1.10.2
  • Current Chandra version: 2 (released March 2026)
  • Current SDK version: 0.5.0 (released April 6, 2026)

Resources

Company information

  • Website: datalab.to
  • Email: support@datalab.to
  • Founded: June 2024, Brooklyn, New York
  • Funding: $3.5M seed (Pebblebed lead, Peak XV, angels)
  • GitHub: github.com/datalab-to (11 public repositories)
  • Discord: Active community with dedicated #marker channel
  • Social: Twitter/X, LinkedIn