
Chunkr

Open-source document processing API by Lumina AI that transforms complex documents into RAG-ready data through layout analysis and semantic chunking.

Overview

Chunkr emerged from a strategic pivot by San Francisco-based Lumina AI after processing 600 million pages for its original scientific search product revealed significant gaps in existing document parsing solutions. Founded in 2023 by Mehul Chadda (CEO), Ishaan Kapoor, and Akhilesh Sharma, the Y Combinator Winter 2024 company launched its production-ready document intelligence API in October 2024; the launch generated 400K impressions and strong developer adoption.

The platform operates a three-tier strategy: an open-source AGPL version using community models, a Cloud API with proprietary in-house models, and Enterprise offerings for regulated industries. The capability gap between tiers is concrete: Excel support is entirely absent from the open-source tier and available only via a native parser in the Cloud and Enterprise tiers — not a format-conversion workaround. For enterprise RAG pipelines where spreadsheets are a primary data source, that architectural distinction matters for data fidelity. Built in Rust for performance, Chunkr processes documents at 4 pages per second on RTX 4090 hardware with capacity for 11+ million pages monthly when self-hosted.
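A back-of-envelope check on those throughput figures (a sketch assuming sustained single-GPU processing with no downtime; it lands in the same ballpark as the stated monthly capacity):

```python
# Back-of-envelope: monthly page throughput at a sustained 4 pages/second.
PAGES_PER_SECOND = 4
SECONDS_PER_DAY = 60 * 60 * 24  # 86,400

pages_per_month = PAGES_PER_SECOND * SECONDS_PER_DAY * 30
print(f"{pages_per_month:,} pages/month")  # 10,368,000
```

Roughly 10.4 million pages over a 30-day month at the quoted rate, consistent with the order of magnitude of the 11+ million figure (which presumably reflects slightly higher sustained throughput or additional hardware).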

Chadda positions Chunkr as solving the "one-size-doesn't-fit-all problem in document parsing for RAG/LLM applications," offering granular pipeline control as an alternative to proprietary cloud services. The dual-licensing model — AGPL-3.0 for open-source use, commercial license for proprietary SaaS embedding — follows the open-core monetization pattern common among infrastructure-layer AI vendors: AGPL's copyleft requirements are stricter than MIT or Apache 2.0, making the commercial license a likely requirement for organizations building proprietary products on top of Chunkr.

How Chunkr Processes Documents

Chunkr's pipeline combines transformer-based layout analysis with semantic chunking to produce RAG-ready output. The segmentation model identifies 11+ document element types — titles, tables, formulas, captions, and more — before OCR and Vision Language Models handle text extraction and complex element interpretation respectively. Output is available as JSON, HTML, or Markdown with bounding box metadata preserved for spatial context.
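To illustrate what consuming such output might look like, here is a minimal Python sketch. The JSON shape used here (`segments` list with `segment_type`, `content`, and `bbox` fields) is an assumption for illustration, not Chunkr's documented schema:

```python
import json

# Hypothetical Chunkr-style output: segments carrying element type, extracted
# content, and bounding-box metadata. Field names are illustrative only.
raw = json.dumps({
    "segments": [
        {"segment_type": "Title", "content": "Quarterly Report",
         "bbox": {"left": 72, "top": 40, "width": 400, "height": 32}},
        {"segment_type": "Text", "content": "Revenue grew 12% year over year.",
         "bbox": {"left": 72, "top": 90, "width": 400, "height": 18}},
        {"segment_type": "Table", "content": "<table>...</table>",
         "bbox": {"left": 72, "top": 120, "width": 400, "height": 200}},
    ]
})

doc = json.loads(raw)

# Keep text-bearing segments for a RAG index; route tables to a separate path.
chunks = [s["content"] for s in doc["segments"]
          if s["segment_type"] in {"Title", "Text"}]
tables = [s for s in doc["segments"] if s["segment_type"] == "Table"]

print(len(chunks), len(tables))  # 2 1
```

Preserving the bounding boxes alongside each chunk is what makes the spatial context recoverable later, e.g. for citation highlighting in a document viewer.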

Self-hosting runs via Docker Compose across GPU, CPU-only, and Mac ARM (M1/M2/M3) configurations. LLM routing is configurable through any OpenAI-compatible endpoint or a models.yaml file supporting OpenAI, Google Gemini, OpenRouter, vLLM, and Ollama — giving teams control over model selection without vendor lock-in. The CAMEL-AI integration demonstrates compatibility with multi-agent frameworks and Mistral AI models for complex document workflows, positioning Chunkr as infrastructure rather than a finished product.
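As a sketch of what that routing choice might look like in application code (the provider names, endpoints, and schema below are hypothetical; the actual models.yaml format is documented in the repository):

```python
# Hypothetical routing table mirroring a models.yaml-style configuration.
# Providers, URLs, and model names here are illustrative assumptions.
MODELS = {
    "local-vllm": {"base_url": "http://localhost:8000/v1", "model": "llama-3.1-8b"},
    "ollama":     {"base_url": "http://localhost:11434/v1", "model": "qwen2.5:7b"},
    "openai":     {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"},
}
DEFAULT = "local-vllm"

def resolve(name=None):
    """Return the endpoint config for a named provider, falling back to the default."""
    return MODELS.get(name or DEFAULT, MODELS[DEFAULT])

cfg = resolve("ollama")
print(cfg["base_url"])  # http://localhost:11434/v1
```

Because every entry exposes an OpenAI-compatible endpoint, swapping a hosted model for a self-hosted vLLM or Ollama instance is a configuration change rather than a code change — the source of the "no vendor lock-in" claim.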

For teams comparing open-source document processing options, Docling (IBM Research, MIT license) and Unstructured (ETL-focused, 25+ file types) address adjacent use cases with different licensing and deployment trade-offs. Chunkr's AGPL license and Rust-based performance profile distinguish it within that field.

Use Cases

RAG System Development

AI developers building Retrieval-Augmented Generation systems use Chunkr's semantic chunking to prepare document collections for vector databases. The platform's segment-level processing with configurable OCR and VLM options addresses customization gaps in existing solutions. The CAMEL-AI integration tutorial covers multi-agent framework compatibility for teams building complex document workflows. See the Document Processing for RAG guide for pipeline architecture patterns.
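A minimal sketch of that preparation step — turning parsed segments into records a vector database can ingest. The field names are hypothetical, and the embedding function is a deterministic stub standing in for any real embedding model:

```python
import hashlib

def embed(text, dim=8):
    """Stub embedding: hash-based vector standing in for a real embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def to_records(segments):
    """Turn parsed segments into (id, vector, metadata) records for a vector DB."""
    return [
        {
            "id": f"doc-0-seg-{i}",
            "vector": embed(seg["content"]),
            "metadata": {"segment_type": seg["segment_type"],
                         "page": seg.get("page", 1)},
        }
        for i, seg in enumerate(segments)
    ]

segments = [
    {"segment_type": "Title", "content": "Methods", "page": 3},
    {"segment_type": "Text", "content": "We trained for 10 epochs.", "page": 3},
]
records = to_records(segments)
print(len(records), records[0]["metadata"]["segment_type"])  # 2 Title
```

Carrying segment type and page number into the metadata lets retrieval filter by element type (e.g. exclude captions) without re-parsing the source document.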

Scientific Literature Processing

Research institutions leverage Chunkr's architecture — proven across 600 million pages — to extract structured data from academic papers and technical documents. The system handles multi-column formats, mathematical formulas, and scientific diagrams while maintaining section boundaries and citation context.

Enterprise Document Digitization

Organizations in healthcare, finance, and government sectors convert complex documents into structured data while maintaining SOC 2 and HIPAA compliance. The platform's modular design and self-hosting options address enterprise concerns about dependency on proprietary document processing services. Teams evaluating on-premises deployment should note that Excel processing requires the Cloud or Enterprise tier — the open-source self-hosted version does not include it.

Technical Specifications

Core Technology: Transformer models, OCR, Vision-Language Models
Supported Formats: PDF, DOCX, PPTX, XLSX (Cloud/Enterprise only), PNG, JPEG
Output Formats: HTML, Markdown, JSON
Processing Speed: ~4 pages/second (RTX 4090)
Monthly Capacity: 11+ million pages (self-hosted)
Language Support: ~100 languages
Segment Types: 11+ including titles, tables, formulas, captions
Implementation: Rust (43.6%), TypeScript (34.8%)
Deployment: Docker Compose (GPU, CPU-only, Mac ARM M1/M2/M3), Kubernetes, cloud API
LLM Routing: OpenAI-compatible endpoints; models.yaml supporting OpenAI, Gemini, OpenRouter, vLLM, Ollama
API Compatibility: OpenAI-compatible endpoints
Compliance: SOC 2, HIPAA
License: AGPL-3.0 (open-source); commercial license available (mehul@chunkr.ai)
Free Tier: 200 pages

Resources

Company Information

Headquarters: San Francisco, CA, USA

Founded: 2023

Founders: Mehul Chadda (CEO), Ishaan Kapoor, Akhilesh Sharma

Accelerator: Y Combinator Winter 2024