AI company developing document intelligence foundation models with PDF-to-markdown conversion and OCR capabilities for 90+ languages.

Overview

Datalab is a Manhattan-based AI company that reached seven-figure annual recurring revenue with a seven-person team serving tier 1 AI laboratories. The company was founded around Marker, an open-source PDF-to-markdown conversion tool, and has since expanded into a commercial API platform with its own OCR foundation model.

The strategic arc is clear: Marker established Datalab's benchmark position in document conversion, the July 2025 Python SDK launch signaled an API-first pivot, and the October 2025 release of the Chandra OCR model moved the company from tooling into foundation model territory. By February 2026, the Python SDK had shipped 13+ versions since launch and the Marker library had seen 8 releases in that month alone - a development cadence that reflects both the lean team's focus and the competitive pressure from LlamaParse, Docling, and Mathpix.

Vik, the author of Marker and the underlying Surya OCR technology, has described the company's operating philosophy as "stretching out the golden period of startups where high trust and careful, deliberate hiring of senior generalists dominate." The code license (GPL originally, OpenRAIL as of v1.8.5) paired with a restricted model-weights license - free below $2M in revenue, paid above - mirrors a pattern used by other ML infrastructure companies to drive adoption while capturing commercial value as the user base scales.

How Datalab Processes Documents

Datalab's processing stack has two layers: the Marker conversion pipeline and the Chandra OCR foundation model that powers it.

Marker converts PDFs and other document formats to structured output by combining layout detection, OCR, and optional LLM post-processing. The core pipeline runs locally or via the hosted API. The optional --use_llm flag (defaulting to gemini-2.0-flash) routes selected tasks - primarily table extraction - through one of six LLM backends: Google Gemini, Google Vertex, Ollama (local), Anthropic Claude, OpenAI-compatible endpoints, and Azure OpenAI. This gives enterprise deployments flexibility to route LLM calls through existing infrastructure rather than a single vendor.

Two architectural shifts in early 2026 define the current version. v1.9.0 moved OCR inference from line level to block level - the release notes name the trade-off explicitly: "a bit slower, it boosts accuracy." v1.10.0 introduced a new layout model via an upstream Surya upgrade, described as a "major performance boost," and added an --html_tables_in_markdown flag allowing tables to render as HTML tags rather than default markdown syntax. Table extraction has received sustained attention across multiple releases: v1.9.2 added iterative LLM looping and introduced detection and correction of table cells that cut text lines, specifically targeting hallucination reduction.
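The distinction the --html_tables_in_markdown flag controls is easy to show in miniature: the same table emitted as default markdown syntax or as embedded HTML tags. The renderer below is an illustrative sketch, not Marker's actual table writer.

```python
# One table, two serializations - roughly what the --html_tables_in_markdown
# flag toggles between. Illustrative only; not Marker's implementation.
def render_table(rows: list[list[str]], html: bool = False) -> str:
    header, *body = rows
    if html:
        head = "".join(f"<th>{c}</th>" for c in header)
        trs = "".join(
            "<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in body
        )
        return f"<table><tr>{head}</tr>{trs}</table>"
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

table = [["Metric", "Score"], ["Heuristic", "95.67"]]
print(render_table(table))             # markdown pipes
print(render_table(table, html=True))  # HTML tags
```

HTML output matters for tables with merged or spanning cells, which markdown's pipe syntax cannot represent.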

Chandra, released in October 2025, is Datalab's full-page OCR model. Unlike pipeline-based systems that process line by line, Chandra decodes entire pages to preserve layout context. It outputs bounding box coordinates for every text block, table, and image, with exports to Markdown, HTML, and JSON. The December 2025 Chandra 1.1 release added an Eagle3 speculative decoding model that reduced API p99 latency by 3x. Chandra 1.5, released January 22, 2026, brought further improvements to layout recognition, mathematical notation, table parsing, and multilingual performance.
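Downstream consumers typically filter an export like Chandra's JSON by block type. The schema below (keys "blocks", "type", "bbox", "content") is an assumed minimal shape for illustration, not Chandra's documented format.

```python
import json

# Filtering a bounding-box JSON export by block type. The key names here
# are assumptions for illustration, not Chandra's published schema.
sample = json.loads("""
{
  "blocks": [
    {"type": "text",  "bbox": [72, 90, 540, 130], "content": "Introduction"},
    {"type": "table", "bbox": [72, 150, 540, 400], "content": "<table>...</table>"},
    {"type": "image", "bbox": [72, 420, 300, 600], "content": null}
  ]
}
""")

tables = [b for b in sample["blocks"] if b["type"] == "table"]
print(len(tables), tables[0]["bbox"])
```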

Deployment options span three tiers. The hosted API at datalab.to carries a 99.99% uptime SLA and processes a 250-page PDF in approximately 15 seconds. On H100 hardware, throughput reaches 0.18 seconds per page, with a projected 122 pages per second at scale using 22 parallel processes. An on-premises licensed solution is available via self-serve licensing for enterprises with data-residency requirements. The local FastAPI server (marker_server) is documented as suitable for small-scale use only.
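The throughput figures are internally consistent: at 0.18 seconds per page per process, 22 parallel H100 processes yield roughly the projected 122 pages per second.

```python
# Recomputing the projected H100 throughput from the per-page latency.
sec_per_page = 0.18
processes = 22
pages_per_sec = processes / sec_per_page
print(round(pages_per_sec, 1))  # ~122.2 pages/sec
```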

Benchmark position (current, from marker_benchmark on HuggingFace): Marker scores 95.67 heuristic accuracy - 9.24 points ahead of Mathpix (86.43), 8.96 points ahead of Docling (86.71), and 11.43 points ahead of LlamaParse (84.24). The largest per-category gap is on forms: Marker 88.01 vs. Docling 68.39, a 19.6-point lead. On FinTabNet table extraction (99 tables, tree-edit-distance metric), Marker with --use_llm scores 0.907 vs. Gemini standalone at 0.829 and base Marker at 0.816 - the pipeline combination outperforms either component independently. On speed, Marker processes a single page in 2.84 seconds vs. LlamaParse's 23.35 seconds, an 8x advantage at comparable or better accuracy.
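The quoted gaps follow directly from the raw scores:

```python
# Recomputing the benchmark gaps and the speed ratio from the raw numbers.
marker = 95.67
competitors = {"Mathpix": 86.43, "Docling": 86.71, "LlamaParse": 84.24}
gaps = {name: round(marker - score, 2) for name, score in competitors.items()}
print(gaps)  # {'Mathpix': 9.24, 'Docling': 8.96, 'LlamaParse': 11.43}

# Seconds per page: LlamaParse 23.35 vs. Marker 2.84.
speedup = round(23.35 / 2.84, 1)
print(speedup)  # 8.2
```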

The Rules API provides a natural language-based correction system for customizing Marker outputs and handling edge cases. Datalab Forge is the interactive playground for visualizing and testing document processing rules.

Use Cases

AI Laboratory and RAG Pipeline Infrastructure

Datalab's primary customer base is tier 1 AI laboratories requiring high-accuracy document conversion for model training and research workflows. The sustained multi-release attention to table extraction - iterative LLM looping, hallucination detection, the FinTabNet benchmark - points toward Marker as infrastructure for retrieval-augmented generation pipelines, where table fidelity directly affects downstream answer quality. The 0.091 lift from --use_llm on FinTabNet (0.907 vs. 0.816 for base Marker) makes the case for hybrid processing in RAG contexts. Teams building similar open-source extraction pipelines may also evaluate Unstract, which offers a no-code LLM platform with hallucination mitigation for production document workflows.

Financial and Compliance Document Processing

Marker's benchmark lead is widest where document complexity is highest: a 20-point gap over Docling on forms and a 13-point gap over Mathpix on engineering documents. This positions Datalab for financial analysis, compliance review, and technical documentation - document types where layout fidelity and table accuracy translate directly to downstream data quality. The FinTabNet results (0.907 with LLM augmentation) are directly relevant to financial statement extraction workflows. Teams processing investment research and SEC filings at scale may also consider Acuity Knowledge Partners, whose Agent Fleet agentic AI serves 800+ financial institutions with document processing and research automation.

Academic and Scientific Research Processing

Datalab processes research papers and scientific documents, extracting structured information while preserving complex formatting and mathematical notation. Chandra's full-page decoding handles mathematical notation, handwriting, and chemical formulas - traditional failure points for pipeline-based OCR. The v1.8.3 OCR model was specifically described as "better all-around, but particularly at math." Researchers working with scientific literature at scale may also consider PaperQA Nemotron, an open-source platform combining RAG capabilities with NVIDIA Nemotron models for scientific document processing.

Developer Integration

The Python SDK ecosystem enables technical teams to embed document intelligence into custom applications, with MIT licensing and Python 3.10+ support. At one-quarter the price of leading cloud competitors, the hosted API is positioned as a cost-reduction play for teams already paying for Mathpix or LlamaParse. The self-serve on-premises licensing option extends this to enterprises with data-residency requirements - a segment where hosted-only competitors cannot compete directly. Developers building structured extraction pipelines on top of LLMs may also find LangExtract relevant, as Google's open-source Python library targets structured information extraction from unstructured text with precise source grounding. Teams requiring open-source document layout analysis as a foundation layer may also evaluate Deepdoctection, a PyTorch-based Python library that orchestrates layout analysis, OCR, and classification using deep learning models. See the Marker PDF-to-Markdown guide for implementation details.
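Integration against a hosted conversion API of this kind typically follows a submit-and-poll pattern. Everything below - the endpoint path, header name, and payload handling - is a hypothetical placeholder showing the shape of such a client, not Datalab's documented API; the official Python SDK is the real interface.

```python
import urllib.request

# Hypothetical request construction for a hosted document-conversion API.
# Endpoint path and header name are illustrative placeholders only.
API_BASE = "https://www.datalab.to/api/v1"  # assumed base URL, not documented

def build_convert_request(pdf_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Construct (but do not send) a PDF conversion request."""
    return urllib.request.Request(
        f"{API_BASE}/convert",
        data=pdf_bytes,
        headers={"X-Api-Key": api_key, "Content-Type": "application/pdf"},
        method="POST",
    )

req = build_convert_request(b"%PDF-1.7 ...", "sk-example")
print(req.full_url, req.get_method())
```

In practice the SDK would also handle polling for completion and retrieving Markdown/JSON output; this sketch only shows the request-building half of that flow.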

Technical Specifications

  • Deployment Options: Cloud API (99.99% SLA), on-premise (self-serve licensing), open-source
  • API: REST API with Python SDK
  • Supported Languages: 90+ languages including complex scripts
  • Document Formats: PDF, DOCX, PPTX, XLSX, HTML, EPUB, images
  • Output Formats: Markdown, JSON, HTML, chunks
  • SDK Requirements: Python 3.10+
  • OCR Architecture: Block-level inference (since v1.9.0); full-page decoding in Chandra
  • LLM Backends: Google Gemini, Google Vertex, Ollama, Anthropic Claude, OpenAI-compatible, Azure OpenAI
  • Throughput (H100): 0.18 sec/page; ~122 pages/sec at scale (22 parallel processes)
  • Hosted API Speed: ~15 seconds for a 250-page PDF
  • Benchmark Score: 95.67 heuristic (marker_benchmark); 0.907 FinTabNet with --use_llm
  • License (Code): OpenRAIL (as of v1.8.5)
  • License (Model Weights): Modified AI Pubs Open RAIL-M - free for research, personal use, and startups under $2M revenue; commercial licensing via datalab.to/pricing
  • Current Marker Version: v1.10.2
  • Current Chandra Version: 1.5 (released 2026-01-22)

Resources

Sources

  • 2025-07 [release: Python SDK launch | pypi.org] Datalab launches Python SDK v0.1.4 on PyPI, marking API-first pivot (https://pypi.org/project/datalab-python-sdk/0.1.4/)
  • 2025-10 [release: Chandra OCR | datalab.to] Chandra OCR model released, scoring 83.1% on olmOCR benchmark, surpassing GPT-4o and Gemini Flash 2 (https://www.datalab.to/blog)
  • 2025-10 [release: SDK v0.1.11 | pypi.org] 13 SDK versions shipped since July 2025 launch (https://pypi.org/project/datalab-python-sdk/0.1.11/)
  • 2025-12 [release: Chandra 1.1 | datalab.to] Chandra 1.1 released with Eagle3 speculative decoding, reducing API p99 latency by 3x (https://www.datalab.to/blog)
  • 2026-01 [release: Chandra 1.5 | datalab.to] Chandra 1.5 released January 22, 2026 with layout, math, table, and multilingual improvements (https://www.datalab.to/blog/chandra-1-5)
  • 2026-02 [release: Marker v1.8.3 | github.com] New OCR model "better all-around, but particularly at math"; format_lines flag replaced by --force_ocr (https://github.com/datalab-to/marker/releases)
  • 2026-02 [release: Marker v1.8.5 | github.com] License updated to OpenRAIL (https://github.com/datalab-to/marker/releases)
  • 2026-02 [release: Marker v1.9.0 | github.com] OCR inference moved from line-level to block-level, trading speed for accuracy (https://github.com/datalab-to/marker/releases)
  • 2026-02 [release: Marker v1.9.2 | github.com] Iterative LLM looping for table output; hallucination detection for table cells; commercial terms updated (https://github.com/datalab-to/marker/releases)
  • 2026-02 [release: Marker v1.10.0 | github.com] New layout model via Surya upgrade ("major performance boost"); --html_tables_in_markdown flag added (https://github.com/datalab-to/marker/releases)
  • 2026-02 [benchmark: marker_benchmark | huggingface.co] Marker 95.67 vs Mathpix 86.43, Docling 86.71, LlamaParse 84.24; forms gap 20 points over Docling (https://huggingface.co/datasets/datalab-to/marker_benchmark)
  • 2026-02 [benchmark: FinTabNet | github.com] Marker + --use_llm scores 0.907 vs Gemini 0.829 vs base Marker 0.816 on 99-table FinTabNet dataset (https://github.com/datalab-to/marker)
  • 2025-00 [profile: founder interview | latent.space] Seven-figure ARR, seven-person team, tier 1 AI lab customers (https://www.latent.space/p/tiny)

Company Information

  • Website: Datalab.to
  • Email: support@datalab.to
  • Discord: Active community with dedicated #marker channel
  • Social: Twitter/X, LinkedIn