Unstract
Open-source, no-code LLM platform for intelligent document processing, offering production-grade extraction with hallucination mitigation, token efficiency features, and flexible deployment under AGPL 3.0.
Overview
Unstract, developed by Zipstack, is a no-code platform that uses large language models to automate extraction from unstructured documents of any format, type, or design. It is available under the AGPL 3.0 license for self-hosting and as a managed cloud service with a 14-day free trial.
The platform targets the gap between legacy OCR-based IDP systems — which require rigid templates — and raw LLM APIs, which lack the document pre-processing and output validation needed for production reliability. Its architecture separates document ingestion and normalization (LLMWhisperer), extraction prompt engineering (Prompt Studio), output validation (LLMChallenge / LLMEval), and workflow integration into distinct, composable layers.
Unstract claims a 60% increase in document processing speed and a 30% reduction in operational costs for adopters, though these figures originate from the vendor's own materials without independent verification.
Key Features
Hallucination Mitigation: LLMChallenge
The most distinctive accuracy feature is LLMChallenge, which runs two separate LLMs in a maker-checker configuration. The first LLM extracts a field value; the second challenges it. A consensus is required before the value is returned. If the two models disagree, the system returns null rather than a potentially wrong answer — the vendor's stated principle being "NULL is better than wrong." This approach catches hallucinations before they reach downstream systems, at the cost of roughly doubling LLM API calls per extraction.
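The maker-checker flow described above can be illustrated with a minimal sketch. It is not Unstract's implementation: the `Maker` and `Checker` callables, the function `challenge_extract`, and both stubs are hypothetical names introduced here, and the stubs stand in for real LLM calls.

```python
from typing import Callable, Optional

# Hypothetical callables standing in for the two LLMs (not Unstract's API).
Maker = Callable[[str, str], Optional[str]]   # (document, field) -> proposed value
Checker = Callable[[str, str, str], bool]     # (document, field, proposed) -> accepted?


def challenge_extract(document: str, field: str,
                      maker: Maker, checker: Checker) -> Optional[str]:
    """Return the maker's value only when the checker accepts it, else None."""
    proposed = maker(document, field)              # first LLM call
    if proposed is None:
        return None
    accepted = checker(document, field, proposed)  # second LLM call
    return proposed if accepted else None          # no consensus -> null


# Stubs for demonstration only; real makers and checkers would call LLMs.
def stub_maker(document: str, field: str) -> Optional[str]:
    # The "total" value is wrong on purpose to simulate a hallucination.
    canned = {"invoice_number": "INV-1042", "total": "$9,999.00"}
    return canned.get(field)


def stub_checker(document: str, field: str, proposed: str) -> bool:
    # Accept only values that literally appear in the document text.
    return proposed in document


DOCUMENT = "Invoice INV-1042\nTotal due: $1,250.00"
print(challenge_extract(DOCUMENT, "invoice_number", stub_maker, stub_checker))  # INV-1042
print(challenge_extract(DOCUMENT, "total", stub_maker, stub_checker))           # None
```

Each accepted or rejected field costs two model calls in this sketch, which is where the roughly doubled API usage comes from.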
Token Efficiency: SinglePass and SummarizedExtraction
Two features address the cost of running LLMs on long documents:
- SinglePass Extraction consolidates all field-extraction prompts into a single large prompt, reducing LLM round-trips. The GitHub documentation claims up to 8x token reduction.
- SummarizedExtraction constructs a compact version of the input document before sending it to the LLM, claiming up to 6x token savings.
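Separately, the vendor claims up to 7x overall token reduction versus naive per-field extraction. Because this combined figure is lower than the 8x quoted for SinglePass alone, the three numbers presumably rest on different baselines and should be read as best-case estimates rather than additive savings.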
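To see why consolidating prompts saves input tokens, the sketch below compares naive per-field prompting with one combined prompt. It is an illustration under stated assumptions, not Unstract's prompt format: the prompt wording, the helper names, and the whitespace-based `rough_tokens` counter are all hypothetical, and output tokens are ignored.

```python
import json


def per_field_prompts(document: str, fields: dict[str, str]) -> list[str]:
    """Naive approach: one prompt per field, each carrying the whole document."""
    return [f"{instruction}\n\nDocument:\n{document}" for instruction in fields.values()]


def single_pass_prompt(document: str, fields: dict[str, str]) -> str:
    """Consolidated approach: all field instructions plus the document, sent once."""
    schema = json.dumps(fields, indent=2)
    return (
        "Extract every field described below and answer with one JSON object "
        "whose keys match the field names.\n\n"
        f"Fields:\n{schema}\n\nDocument:\n{document}"
    )


def rough_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


FIELDS = {
    "invoice_number": "What is the invoice number?",
    "invoice_date": "What is the invoice date?",
    "vendor_name": "Who issued the invoice?",
    "total_due": "What is the total amount due?",
}
DOCUMENT = "lorem " * 2000  # placeholder for a long document

naive = sum(rough_tokens(p) for p in per_field_prompts(DOCUMENT, FIELDS))
single = rough_tokens(single_pass_prompt(DOCUMENT, FIELDS))
print(f"per-field: {naive} tokens, single pass: {single} tokens, ratio {naive / single:.1f}x")
```

For long documents the input-token ratio in this sketch approaches the number of fields, since the document is sent once instead of once per field. Actual savings depend on the tokenizer, the prompt wording, and document length.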
Document Pre-Processing: LLMWhisperer
LLMWhisperer is Unstract's document normalization layer, designed to convert raw documents into formats LLMs can reliably interpret. Key capabilities include:
- Layout-preserving output for multi-column documents, forms, and tables
- Handwritten text detection, described by the vendor as state of the art
- Checkbox and radio button detection
- High-fidelity processing of scanned PDFs and smartphone-captured images
- Auto PDF repair and rotation/skew compensation
LLMWhisperer is also available as a standalone API with four extraction tiers: Native Text, Low Cost, High Quality, and High Quality with Form Elements. Pricing ranges from $1/1,000 pages (Native Text) to $15/1,000 pages (High Quality with Form Elements), with a free tier of 100 pages/day requiring no credit card.
Prompt Engineering: Prompt Studio
Prompt Studio is a purpose-built environment for defining extraction schemas. Engineers can compare outputs and costs from multiple LLMs side-by-side, test prompt versions against representative document samples, and roll back to previous prompt versions. Once a schema is finalized, it can be deployed as an API with a single click — removing the need to manage prompts in spreadsheets or ad hoc scripts.
Observability: LLMObservability
Given the probabilistic nature of LLMs, Unstract includes an observability layer that surfaces how data and models interact both during development and in production. This addresses a common gap in LLM-based pipelines where failures are silent or difficult to trace.
Integration and Deployment
Unstract supports four integration patterns, targeting different team types:
| Integration Type | Best For |
|---|---|
| API Deployments | Developers building apps or services requiring programmatic document structuring |
| ETL Pipelines | Data engineering teams batch-processing documents into JSON for warehouses |
| n8n Nodes | Low-code and ops teams automating workflows visually |
| MCP Servers | Developers building agentic or LLM-powered tools that speak the Model Context Protocol |
Document sources include Dropbox, S3, data lakes, and other cloud file storage. Output formats include JSON, spreadsheets, and direct database writes.
Self-Hosting Requirements
- Linux or macOS (Intel or M-series)
- Docker and Docker Compose
- 8 GB RAM minimum
- Git
Setup runs via ./run-platform.sh. Once the script completes, the web interface is reachable at http://frontend.unstract.localhost.
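Before running the setup script, the requirements above can be checked with a small preflight script. This is a hedged sketch rather than part of Unstract: the `preflight` function and its messages are hypothetical, Docker Compose is not checked because it may be installed as a Docker plugin rather than a separate binary, and the memory check relies on POSIX `sysconf` values.

```python
import os
import platform
import shutil

MIN_RAM_GB = 8  # minimum stated in the requirements list


def total_ram_gb() -> float:
    """Physical memory in GiB via POSIX sysconf (Linux and macOS)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def preflight(system: str, ram_gb: float, tools_on_path: set[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if system not in {"Linux", "Darwin"}:  # Darwin is what macOS reports
        problems.append(f"unsupported OS: {system}")
    if ram_gb < MIN_RAM_GB:
        # The OS often reports slightly less than the nominal size, so treat
        # a near miss as a warning rather than a hard failure.
        problems.append(f"only {ram_gb:.1f} GiB RAM reported, {MIN_RAM_GB} GB required")
    for tool in ("docker", "git"):
        if tool not in tools_on_path:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    found = {tool for tool in ("docker", "git") if shutil.which(tool)}
    issues = preflight(platform.system(), total_ram_gb(), found)
    print("\n".join(issues) if issues else "all preflight checks passed")
```

Keeping `preflight` free of system calls makes the logic easy to test; the probing happens only in the `__main__` block.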
Vector Database Connectors
Qdrant, Weaviate, Pinecone, PostgreSQL, and Milvus are confirmed working integrations.
Supported File Formats
Unstract handles a broad range of input formats across categories:
| Category | Formats |
|---|---|
| Word Processing | DOCX, DOC, ODT |
| Presentation | PPTX, PPT, ODP |
| Spreadsheet | XLSX, XLS, ODS |
| Document & Text | PDF, TXT, CSV, JSON |
| Image | BMP, GIF, JPEG, JPG, PNG, TIF, TIFF, WEBP |
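A pipeline that feeds documents to the platform may want to filter out unsupported files up front. The sketch below encodes the table above as a lookup; the `SUPPORTED_FORMATS` mapping and `category_for` helper are hypothetical names for illustration, and the check looks only at the file extension, not the file contents.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported-formats table, keyed by category.
SUPPORTED_FORMATS = {
    "Word Processing": {"docx", "doc", "odt"},
    "Presentation": {"pptx", "ppt", "odp"},
    "Spreadsheet": {"xlsx", "xls", "ods"},
    "Document & Text": {"pdf", "txt", "csv", "json"},
    "Image": {"bmp", "gif", "jpeg", "jpg", "png", "tif", "tiff", "webp"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if the extension is not listed."""
    extension = Path(filename).suffix.lstrip(".").lower()
    for category, extensions in SUPPORTED_FORMATS.items():
        if extension in extensions:
            return category
    return None


for name in ("claim_form.PDF", "scan_001.tiff", "notes.md"):
    print(name, "->", category_for(name))
```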
Pricing
Unstract Cloud (Bring Your Own Keys)
Plans include LLMWhisperer but require users to supply their own LLM, vector database, and embedding model API keys.
| Plan | Monthly Billing | Annual Billing | Pages/Year |
|---|---|---|---|
| Starter | $499/month | $416/month | 60,000 |
| Growth | $2,249/month | $1,874/month | 300,000 |
| Enterprise | Custom | Custom | Custom |
Overage is $0.10/page (Starter) or $0.09/page (Growth). A free onboarding credit of $10 on Azure OpenAI GPT-4o is included, along with free access to PostgreSQL, Azure OpenAI Embedding, and LLMWhisperer.
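The plan figures above can be turned into a rough yearly cost estimate. The sketch assumes that the page allowance is counted per year, as the table's column suggests, and that overage is charged only on pages beyond that allowance; the actual billing rules may differ. The `Plan` class and `yearly_cost` function are hypothetical, and LLM, vector database, and embedding costs are excluded because those keys are user-supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_billing: int      # USD per month when billed monthly
    annual_billing: int       # USD per month when billed annually
    pages_per_year: int       # included page allowance
    overage_per_page: float   # USD per page beyond the allowance


# Values copied from the pricing table and overage sentence above.
STARTER = Plan("Starter", 499, 416, 60_000, 0.10)
GROWTH = Plan("Growth", 2_249, 1_874, 300_000, 0.09)


def yearly_cost(plan: Plan, pages: int, annual: bool = True) -> float:
    """Estimated platform cost in USD for one year at the given page volume."""
    base = 12 * (plan.annual_billing if annual else plan.monthly_billing)
    extra_pages = max(0, pages - plan.pages_per_year)
    return round(base + extra_pages * plan.overage_per_page, 2)


for pages in (60_000, 150_000, 300_000):
    print(pages, yearly_cost(STARTER, pages), yearly_cost(GROWTH, pages))
```

Under these assumptions and annual billing, Starter plus overage costs the same as Growth at about 234,960 pages per year; above that volume Growth is the cheaper plan.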
LLMWhisperer API (Standalone)
| Tier | Price |
|---|---|
| Native Text | $1/1,000 pages |
| Low Cost | $5/1,000 pages |
| High Quality | $7/1,000 pages (monthly) / $10/1,000 pages (annual) |
| High Quality with Form Elements | $15/1,000 pages |
Free tier: 100 pages/day, no credit card required.
Enterprise self-hosted plans are available for both Unstract and LLMWhisperer.
Use Cases
Insurance and Financial Services
Unstract targets insurance underwriting and claims processing — ingesting KYC documents, contracts, and claims forms and converting them from unstructured to structured formats. The vendor claims document review times can drop from days to minutes.
Bank Statement Processing
The platform explicitly addresses high-variation document sets, for example bank statements from 200 different banks, or a single form that varies across 50 different states. These are the scenarios in which template-based OCR systems fail.
ETL for Unstructured Data
For data engineering teams, Unstract functions as an ETL pipeline for documents: ingest from cloud storage, extract structured fields, push to data warehouses or databases. This positions it alongside platforms like Unstructured in the LLM-native document ETL space.
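The ingest, extract, and load stages described above can be sketched end to end. This is a simplified illustration, not Unstract's pipeline: `run_etl` and `stub_extract` are hypothetical, the in-memory document list stands in for cloud storage, the stub parser stands in for an LLM extraction call, and SQLite stands in for a warehouse.

```python
import json
import sqlite3
from typing import Callable, Iterable


def stub_extract(text: str) -> dict:
    """Stand-in for an LLM extraction call: parses 'Key: value' lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


def run_etl(documents: Iterable[tuple[str, str]],
            extract: Callable[[str], dict],
            conn: sqlite3.Connection) -> int:
    """Extract fields from each (source, text) pair and upsert them as JSON rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted (source TEXT PRIMARY KEY, fields TEXT)"
    )
    loaded = 0
    for source, text in documents:
        fields = extract(text)
        conn.execute(
            "INSERT OR REPLACE INTO extracted (source, fields) VALUES (?, ?)",
            (source, json.dumps(fields)),
        )
        loaded += 1
    conn.commit()
    return loaded


# Placeholder documents; a real pipeline would read these from cloud storage.
DOCS = [
    ("statements/bank_a.txt", "Account Holder: Ada Lovelace\nClosing Balance: 1250.00"),
    ("statements/bank_b.txt", "Account Holder: Alan Turing\nClosing Balance: 310.75"),
]

conn = sqlite3.connect(":memory:")
loaded = run_etl(DOCS, stub_extract, conn)
for source, fields in conn.execute("SELECT source, fields FROM extracted ORDER BY source"):
    print(source, json.loads(fields))
```

Keying rows on the source path makes reruns idempotent: reprocessing a document replaces its row instead of duplicating it.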
Agentic Workflows
The MCP server integration and Human-in-the-Loop feature (available on cloud) position Unstract for agentic document workflows where autonomous agents need reliable structured data from documents.
Competitive Positioning
Unstract occupies the open-source, LLM-native segment of the IDP market, closer to Unstructured and Chunkr in architecture than to traditional OCR-first platforms like ABBYY or Tungsten Automation. Under its AGPL 3.0 license, self-hosted deployments carry no license fee, but anyone who modifies the software and makes it available to users, including over a network, must offer the modified source code under the same license. Commercial use cases that cannot comply with those terms require the enterprise plan.
The LLMChallenge dual-LLM validation approach is a differentiator versus platforms that rely on single-model extraction with post-hoc confidence scoring. The trade-off is higher LLM API cost per document — partially offset by SinglePass and SummarizedExtraction efficiency features.
Bring-your-own-key pricing means Unstract's cloud cost does not include LLM inference, which can be significant at scale depending on model choice.
Technical Specifications
| Feature | Detail |
|---|---|
| License | AGPL 3.0 (open source); commercial enterprise plans available |
| Deployment | Cloud (managed), self-hosted (Docker), on-premise (enterprise) |
| LLM Support | Flexible — user-supplied keys; multiple LLMs compared in Prompt Studio |
| Vector DB Support | Qdrant, Weaviate, Pinecone, PostgreSQL, Milvus |
| Hallucination Mitigation | LLMChallenge (dual-LLM maker-checker) |
| Token Efficiency | SinglePass (up to 8x), SummarizedExtraction (up to 6x) |
| Document Pre-Processing | LLMWhisperer (layout-preserving, handwriting, forms, scanned PDFs) |
| Output Formats | JSON, spreadsheet, database, API |
| Languages (LLMWhisperer) | 300+ (High Quality tier); 120+ (Low Cost); all Unicode (Native Text) |
| SSO | Available on enterprise/cloud plans |
| Human-in-the-Loop | Available on cloud plans |
| Free Tier | 100 pages/day (LLMWhisperer), 14-day trial (Unstract Cloud) |
Resources
Company Information
Developed by Zipstack. Cloud-hosted instance available at us-central.unstract.com. Self-hosted deployment supported on Linux and macOS via Docker.