Unstract

Open-source, no-code LLM platform for intelligent document processing, offering production-grade extraction with hallucination mitigation, token efficiency features, and flexible deployment under AGPL 3.0.

Overview

Unstract, developed by Zipstack, is a no-code platform that uses large language models to automate extraction from unstructured documents of any format, type, or design. It is available under the AGPL 3.0 license for self-hosting and as a managed cloud service with a 14-day free trial.

The platform targets the gap between legacy OCR-based IDP systems — which require rigid templates — and raw LLM APIs, which lack the document pre-processing and output validation needed for production reliability. Its architecture separates document ingestion and normalization (LLMWhisperer), extraction prompt engineering (Prompt Studio), output validation (LLMChallenge / LLMEval), and workflow integration into distinct, composable layers.

Unstract claims a 60% increase in document processing speed and a 30% reduction in operational costs for adopters, though these figures originate from the vendor's own materials without independent verification.

Key Features

Hallucination Mitigation: LLMChallenge

The most distinctive accuracy feature is LLMChallenge, which runs two separate LLMs in a maker-checker configuration. The first LLM extracts a field value; the second challenges it. A consensus is required before the value is returned. If the two models disagree, the system returns null rather than a potentially wrong answer — the vendor's stated principle being "NULL is better than wrong." This approach catches hallucinations before they reach downstream systems, at the cost of roughly doubling LLM API calls per extraction.
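
The maker-checker idea can be illustrated with a short sketch. This is not Unstract's implementation: the `challenge_extract` function, the two extractor callables, and the rule that "agreement" means equality after whitespace and case normalization are all assumptions made for the example. In particular, the checker here extracts independently, whereas the real checker may instead critique the first model's answer.

```python
from typing import Callable, Optional

# Hypothetical type: any function that takes (document_text, field_name)
# and returns the extracted value, or None if it finds nothing.
Extractor = Callable[[str, str], Optional[str]]


def _normalize(value: Optional[str]) -> Optional[str]:
    """Collapse whitespace and case so trivial formatting differences still agree."""
    if value is None:
        return None
    return " ".join(value.split()).casefold()


def challenge_extract(document: str, field: str,
                      maker: Extractor, checker: Extractor) -> Optional[str]:
    """Return the maker's value only if the checker agrees with it.

    Any disagreement yields None ("NULL is better than wrong").
    Every field costs two model calls.
    """
    proposed = maker(document, field)
    verified = checker(document, field)
    if proposed is None or _normalize(proposed) != _normalize(verified):
        return None
    return proposed


# Stand-in extractors for demonstration; real ones would call two different LLMs.
def fake_maker(doc: str, field: str) -> Optional[str]:
    return {"invoice_no": "INV-001", "total": "1,200.00"}.get(field)


def fake_checker(doc: str, field: str) -> Optional[str]:
    return {"invoice_no": "inv-001", "total": "1,280.00"}.get(field)
```

With these stand-ins, `invoice_no` is returned because both models agree after normalization, while `total` comes back as `None` because the two values differ.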

Token Efficiency: SinglePass and SummarizedExtraction

Two features address the cost of running LLMs on long documents:

  • SinglePass Extraction consolidates all field-extraction prompts into a single large prompt, reducing LLM round-trips. The GitHub documentation claims up to 8x token reduction.
  • SummarizedExtraction constructs a compact version of the input document before sending it to the LLM, claiming up to 6x token savings.

The vendor claims that the two features combined yield up to 7x overall token reduction versus naive per-field extraction.
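
To see why consolidating prompts saves input tokens, consider the sketch below. It is an illustration only: the function names, the prompt wording, and the word-count proxy for tokens are assumptions, not Unstract's actual prompt format or tokenizer. The point is that per-field extraction resends the document once per field, while a single-pass prompt sends it once.

```python
import json
from typing import Dict, List


def per_field_prompts(document: str, fields: Dict[str, str]) -> List[str]:
    """Naive approach: one prompt per field, so the document is resent every time."""
    return [f"{instruction}\n\nDocument:\n{document}" for instruction in fields.values()]


def single_pass_prompt(document: str, fields: Dict[str, str]) -> str:
    """Consolidated approach: all field instructions plus the document, sent once."""
    schema = json.dumps(fields, indent=2)
    return (
        "Extract every field described below and reply with one JSON object "
        "whose keys match the field names. Use null for any value you cannot find.\n\n"
        f"Fields:\n{schema}\n\nDocument:\n{document}"
    )


def rough_size(prompts: List[str]) -> int:
    """Crude proxy for input tokens: total whitespace-separated words."""
    return sum(len(p.split()) for p in prompts)
```

The saving grows with the number of fields and with document length relative to the instructions, so the actual ratio for a given workload has to be measured rather than assumed.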

Document Pre-Processing: LLMWhisperer

LLMWhisperer is Unstract's document normalization layer, designed to convert raw documents into formats LLMs can reliably interpret. Key capabilities include:

  • Layout-preserving output for multi-column documents, forms, and tables
  • Handwritten text detection, which the vendor describes as state-of-the-art
  • Checkbox and radio button detection
  • High-fidelity processing of scanned PDFs and smartphone-captured images
  • Auto PDF repair and rotation/skew compensation

LLMWhisperer is also available as a standalone API with four extraction tiers: Native Text, Low Cost, High Quality, and High Quality with Form Elements. Pricing ranges from $1/1,000 pages (Native Text) to $15/1,000 pages (High Quality with Form Elements), with a free tier of 100 pages/day requiring no credit card.

Prompt Engineering: Prompt Studio

Prompt Studio is a purpose-built environment for defining extraction schemas. Engineers can compare outputs and costs from multiple LLMs side-by-side, test prompt versions against representative document samples, and roll back to previous prompt versions. Once a schema is finalized, it can be deployed as an API with a single click — removing the need to manage prompts in spreadsheets or ad hoc scripts.

Observability: LLMObservability

Given the probabilistic nature of LLMs, Unstract includes an observability layer that surfaces how data and models interact both during development and in production. This addresses a common gap in LLM-based pipelines where failures are silent or difficult to trace.

Integration and Deployment

Unstract supports four integration patterns, targeting different team types:

| Integration Type | Best For |
|---|---|
| API Deployments | Developers building apps or services requiring programmatic document structuring |
| ETL Pipelines | Data engineering teams batch-processing documents into JSON for warehouses |
| n8n Nodes | Low-code and ops teams automating workflows visually |
| MCP Servers | Developers building agentic or LLM-powered tools that speak the Model Context Protocol |

Document sources include Dropbox, S3, data lakes, and other cloud file storage. Output formats include JSON, spreadsheets, and direct database writes.

Self-Hosting Requirements

  • Linux or macOS (Intel or M-series)
  • Docker and Docker Compose
  • 8 GB RAM minimum
  • Git

Setup runs via ./run-platform.sh; once it completes, the web interface is accessible at http://frontend.unstract.localhost.
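
Before running the setup script, the listed requirements can be partly checked with a small preflight script. The sketch below is a convenience written for this article, not part of the Unstract repository; the `preflight` function and its parameters are hypothetical. It checks only the operating system and whether `docker` and `git` are on the PATH.

```python
import platform
import shutil
from typing import Callable, List, Optional


def preflight(system: Optional[str] = None,
              which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return a list of problems; an empty list means the basic checks passed.

    Not checked here: the 8 GB RAM minimum (the standard library has no
    portable way to read total memory) and Docker Compose (it may be
    installed as a Docker plugin rather than a separate executable).
    """
    problems = []
    system = system or platform.system()
    if system not in ("Linux", "Darwin"):  # Darwin is macOS
        problems.append(f"unsupported OS: {system}")
    for tool in ("docker", "git"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("basic checks passed" if not issues else "\n".join(issues))
```

The `system` and `which` parameters exist only so the function can be exercised without depending on the host machine.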

Vector Database Connectors

Qdrant, Weaviate, Pinecone, PostgreSQL, and Milvus are confirmed working integrations.

Supported File Formats

Unstract handles a broad range of input formats across categories:

| Category | Formats |
|---|---|
| Word Processing | DOCX, DOC, ODT |
| Presentation | PPTX, PPT, ODP |
| Spreadsheet | XLSX, XLS, ODS |
| Document & Text | PDF, TXT, CSV, JSON |
| Image | BMP, GIF, JPEG, JPG, PNG, TIF, TIFF, WEBP |
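
A pipeline that feeds documents to the platform may want to reject unsupported files up front. The sketch below transcribes the format list into a lookup; the `format_category` helper is hypothetical and checks the file extension only, not the actual file contents.

```python
from pathlib import Path
from typing import Optional

# Format list transcribed from the table in this section.
SUPPORTED_FORMATS = {
    "Word Processing": {"docx", "doc", "odt"},
    "Presentation": {"pptx", "ppt", "odp"},
    "Spreadsheet": {"xlsx", "xls", "ods"},
    "Document & Text": {"pdf", "txt", "csv", "json"},
    "Image": {"bmp", "gif", "jpeg", "jpg", "png", "tif", "tiff", "webp"},
}


def format_category(filename: str) -> Optional[str]:
    """Return the category for a file name, or None if its extension is not listed."""
    ext = Path(filename).suffix.lstrip(".").lower()
    for category, extensions in SUPPORTED_FORMATS.items():
        if ext in extensions:
            return category
    return None
```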

Pricing

Unstract Cloud (Bring Your Own Keys)

Plans include LLMWhisperer but require users to supply their own LLM, vector database, and embedding model API keys.

| Plan | Monthly Billing | Annual Billing | Pages/Year |
|---|---|---|---|
| Starter | $499/month | $416/month | 60,000 |
| Growth | $2,249/month | $1,874/month | 300,000 |
| Enterprise | Custom | Custom | Custom |

Overage is $0.10/page (Starter) or $0.09/page (Growth). A free onboarding credit of $10 on Azure OpenAI GPT-4o is included, along with free access to PostgreSQL, Azure OpenAI Embedding, and LLMWhisperer.
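
The plan prices and overage rates can be turned into a rough yearly estimate. The sketch below uses the figures quoted in this section and assumes that overage applies to pages beyond the yearly allowance; the vendor's actual metering period is not stated here, so treat the result as an approximation. LLM inference costs are excluded because the plans are bring-your-own-keys.

```python
PLANS = {
    # plan: (price per month on monthly billing, price per month on annual billing,
    #        included pages per year, overage price per page)
    "Starter": (499, 416, 60_000, 0.10),
    "Growth": (2249, 1874, 300_000, 0.09),
}


def annual_cost(plan: str, pages_per_year: int, annual_billing: bool = True) -> float:
    """Estimate yearly platform spend in USD, excluding LLM inference."""
    monthly, annual, included, overage = PLANS[plan]
    base = 12 * (annual if annual_billing else monthly)
    # Assumption: overage is charged on pages beyond the yearly allowance.
    extra_pages = max(0, pages_per_year - included)
    return round(base + extra_pages * overage, 2)
```

For example, `annual_cost("Starter", 100_000)` gives 8992.0 (12 × $416 plus 40,000 overage pages at $0.10), against 22488 for Growth at the same volume. Under these assumptions, Starter with overage stays cheaper than Growth until roughly 235,000 pages per year on annual billing.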

LLMWhisperer API (Standalone)

| Tier | Price |
|---|---|
| Native Text | $1/1,000 pages |
| Low Cost | $5/1,000 pages |
| High Quality | $7/1,000 pages (monthly) / $10/1,000 pages (annual) |
| High Quality with Form Elements | $15/1,000 pages |

Free tier: 100 pages/day, no credit card required.

Enterprise self-hosted plans are available for both Unstract and LLMWhisperer.

Use Cases

Insurance and Financial Services

Unstract targets insurance underwriting and claims processing — ingesting KYC documents, contracts, and claims forms and converting them from unstructured to structured formats. The vendor claims document review times can drop from days to minutes.

Bank Statement Processing

The platform explicitly addresses high-variation document sets, such as bank statements from 200 different banks or the same form with variations across 50 states. These are scenarios where template-based OCR systems fail.

ETL for Unstructured Data

For data engineering teams, Unstract functions as an ETL pipeline for documents: ingest from cloud storage, extract structured fields, push to data warehouses or databases. This positions it alongside platforms like Unstructured in the LLM-native document ETL space.
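
The ingest, extract, and load shape of such a pipeline can be sketched in a few lines. Everything here is illustrative: `extract_fields` is a hypothetical stand-in for the LLM extraction step (it only parses "key: value" lines so the example runs offline), and SQLite stands in for the warehouse. None of this uses Unstract's API.

```python
import json
import sqlite3
from typing import Dict, Iterable, Tuple


def extract_fields(text: str) -> Dict[str, str]:
    """Hypothetical stand-in for the extraction step (an LLM call in practice)."""
    pairs = (line.split(":", 1) for line in text.splitlines() if ":" in line)
    return {key.strip(): value.strip() for key, value in pairs}


def run_pipeline(documents: Iterable[Tuple[str, str]], conn: sqlite3.Connection) -> int:
    """Ingest (name, text) pairs, extract fields, and load them as JSON rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS extractions (source TEXT, fields TEXT)")
    rows = [(name, json.dumps(extract_fields(text))) for name, text in documents]
    conn.executemany("INSERT INTO extractions VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

Calling `run_pipeline` with a list of `(name, text)` pairs and a connection from `sqlite3.connect(":memory:")` loads one JSON row per document into the `extractions` table.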

Agentic Workflows

The MCP server integration and Human-in-the-Loop feature (available on cloud) position Unstract for agentic document workflows where autonomous agents need reliable structured data from documents.

Competitive Positioning

Unstract occupies the open-source, LLM-native segment of the IDP market, closer to Unstructured and Chunkr in architecture than to traditional OCR-first platforms like ABBYY or Tungsten Automation. Its AGPL 3.0 license means self-hosted deployments are free, but modified versions that are distributed or offered to users over a network must make their source code available under the same license; commercial use cases that cannot comply with AGPL require the enterprise plan.

The LLMChallenge dual-LLM validation approach is a differentiator versus platforms that rely on single-model extraction with post-hoc confidence scoring. The trade-off is higher LLM API cost per document — partially offset by SinglePass and SummarizedExtraction efficiency features.

Bring-your-own-key pricing means Unstract's cloud cost does not include LLM inference, which can be significant at scale depending on model choice.

Technical Specifications

| Feature | Detail |
|---|---|
| License | AGPL 3.0 (open source); commercial enterprise plans available |
| Deployment | Cloud (managed), self-hosted (Docker), on-premise (enterprise) |
| LLM Support | Flexible: user-supplied keys; multiple LLMs compared in Prompt Studio |
| Vector DB Support | Qdrant, Weaviate, Pinecone, PostgreSQL, Milvus |
| Hallucination Mitigation | LLMChallenge (dual-LLM maker-checker) |
| Token Efficiency | SinglePass (up to 8x), SummarizedExtraction (up to 6x) |
| Document Pre-Processing | LLMWhisperer (layout-preserving, handwriting, forms, scanned PDFs) |
| Output Formats | JSON, spreadsheet, database, API |
| Languages (LLMWhisperer) | 300+ (High Quality tier); 120+ (Low Cost); all Unicode (Native Text) |
| SSO | Available on enterprise/cloud plans |
| Human-in-the-Loop | Available on cloud plans |
| Free Tier | 100 pages/day (LLMWhisperer), 14-day trial (Unstract Cloud) |

Company Information

Developed by Zipstack. Cloud-hosted instance available at us-central.unstract.com. Self-hosted deployment supported on Linux and macOS via Docker.