Open-source, no-code LLM platform for intelligent document processing, offering production-grade extraction with hallucination mitigation, token efficiency features, and flexible deployment under AGPL 3.0.

Overview

Unstract, developed by Zipstack, is a no-code platform that uses large language models (LLMs) to automate extraction from unstructured documents of any format, type, or design. It is available under the AGPL 3.0 license for self-hosting and as a managed cloud service with a 14-day free trial.

The platform targets the gap between legacy OCR-based IDP systems that require rigid templates and raw LLM APIs, which lack the document pre-processing and output validation needed for production reliability. Its architecture separates document ingestion and normalization (LLMWhisperer), extraction prompt engineering (Prompt Studio), output validation (LLMChallenge and LLMEval), and workflow integration into distinct, composable layers.

Shuveb Hussain of Timescale described Unstract in November 2024 as "an IDP 2.0 platform, powered by LLMs that allows the processing of documents that are way more complex than current IDP 1.0 platforms can handle." The same analysis demonstrated a cost reduction from $1.43 to $0.17 per 30-page 10-Q financial document using vector database integration, an 88% reduction verified through independent testing rather than vendor self-reporting.
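As a quick arithmetic check, the 88% figure follows directly from the two per-document costs reported in that analysis:

```python
# Verifying the reported cost reduction on a 30-page 10-Q filing.
# Both per-document figures come from the Timescale analysis cited above.
cost_naive = 1.43    # USD per document, full-context extraction
cost_vector = 0.17   # USD per document, vector-retrieval extraction

reduction = (cost_naive - cost_vector) / cost_naive
per_page_naive = cost_naive / 30
per_page_vector = cost_vector / 30

print(f"reduction: {reduction:.0%}")  # 88%
print(f"per page: ${per_page_naive:.3f} -> ${per_page_vector:.4f}")
```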

Enterprise feature gating: LLMChallenge (dual-LLM consensus validation), SinglePass extraction, SummarizedExtraction, Human-in-the-Loop review, SSO, and SAML/OIDC RBAC are exclusive to the paid Cloud Edition. The open-source AGPL edition does not include these features.

What users say

Practitioners evaluating Unstract consistently highlight the elimination of manual template configuration as the primary draw. Teams processing high-variation document sets, such as bank statements from hundreds of different institutions or regulatory filings with inconsistent layouts, report that template-free extraction removes the retraining cycles that make legacy IDP systems expensive to maintain.

The cost-performance trade-off surfaces as the most common friction point. Users find that simple vector retrieval strategies achieve the lowest per-document cost but sacrifice accuracy on compound extraction tasks requiring multiple related fields. Teams that need both cost efficiency and high accuracy on complex documents report spending meaningful time on prompt engineering in Prompt Studio before reaching production-ready results.

The AGPL 3.0 license is a deliberate consideration for enterprise teams. Organizations that cannot open-source derivative works must move to the enterprise plan, which some practitioners note makes the "open-source" positioning feel conditional for commercial deployments. Teams with strict data residency requirements, however, cite self-hosted Docker deployment as a genuine differentiator versus cloud-only alternatives.

Key features

Hallucination mitigation: LLMChallenge

The most distinctive accuracy feature is LLMChallenge, which runs two separate LLMs in a maker-checker configuration. The first LLM extracts a field value; the second challenges it. A consensus is required before the value is returned. If the two models disagree, the system returns null rather than a potentially wrong answer. The vendor's stated principle is "NULL is better than wrong." This approach catches hallucinations before they reach downstream systems, at the cost of roughly doubling LLM API calls per extraction. LLMChallenge is exclusive to the paid Cloud Edition.
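The maker-checker loop can be sketched in a few lines; `maker` and `checker` below are toy stand-ins for calls to two independently configured LLMs, and the string-equality consensus rule is an illustrative simplification of whatever comparison Unstract actually performs:

```python
from typing import Callable, Optional

def challenged_extract(
    field: str,
    document: str,
    maker: Callable[[str, str], str],
    checker: Callable[[str, str], str],
) -> Optional[str]:
    """Return a field value only when two independent LLMs agree.

    On disagreement, return None rather than a possibly wrong answer
    ("NULL is better than wrong").
    """
    answer = maker(field, document)
    verdict = checker(field, document)
    # Illustrative normalization: a real system would compare
    # semantically, not just on whitespace and case.
    if answer.strip().lower() == verdict.strip().lower():
        return answer
    return None

# Toy stand-ins: the maker reads the value, the checker re-derives it.
doc = "Invoice total: $1,250.00"
maker = lambda f, d: "$1,250.00"
checker_agree = lambda f, d: "$1,250.00"
checker_disagree = lambda f, d: "$1,500.00"

print(challenged_extract("total", doc, maker, checker_agree))     # agreed value
print(challenged_extract("total", doc, maker, checker_disagree))  # None
```

The doubled API cost mentioned above falls out of the structure: every field requires one maker call plus one checker call.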

Token efficiency: SinglePass and SummarizedExtraction

Two features address the cost of running LLMs on long documents. SinglePass Extraction consolidates all field-extraction prompts into a single large prompt, reducing LLM round-trips; the GitHub documentation claims up to 8x token reduction. SummarizedExtraction constructs a compact version of the input document before sending it to the LLM, claiming up to 6x token savings. Combined, the vendor claims up to 7x overall token reduction versus naive per-field extraction. Both features are enterprise-only.
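The consolidation idea behind SinglePass can be sketched as a prompt builder (the function and field names below are illustrative, not Unstract's API): the document text is sent and tokenized once for N fields instead of N times, which is where the multi-x savings come from.

```python
def build_single_pass_prompt(document: str, fields: dict[str, str]) -> str:
    """Fold per-field extraction prompts into one request.

    One call returning a single JSON object replaces len(fields)
    separate calls, so the (usually long) document text is sent and
    tokenized only once.
    """
    field_specs = "\n".join(f'- "{name}": {desc}' for name, desc in fields.items())
    return (
        "Extract the following fields from the document and reply with "
        "a single JSON object:\n"
        f"{field_specs}\n\n"
        f"Document:\n{document}"
    )

fields = {
    "invoice_number": "the invoice identifier",
    "total": "the grand total including tax",
    "due_date": "payment due date in ISO 8601",
}
prompt = build_single_pass_prompt("Invoice #42 ... Total: $99 ...", fields)
print(prompt)
```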

Independent testing by Timescale on 10-Q filings corroborates the cost reduction direction: vector database retrieval brought per-document cost to $0.17, while sub-question retrieval strategies that decompose complex queries cost $0.23 per document due to additional LLM calls. The trade-off between cost and accuracy on compound extraction tasks is real and requires prompt engineering to optimize.

Document pre-processing: LLMWhisperer

LLMWhisperer is Unstract's document normalization layer, converting raw documents into formats LLMs can reliably interpret. It handles layout-preserving output for multi-column documents, forms, and tables; handwritten text detection; checkbox and radio button detection; high-fidelity processing of scanned PDFs and smartphone-captured images; and auto PDF repair with rotation and skew compensation.

LLMWhisperer is also available as a standalone API with four extraction tiers: Native Text, Low Cost, High Quality, and High Quality with Form Elements. Pricing ranges from $1/1,000 pages (Native Text) to $15/1,000 pages (High Quality with Form Elements), with a free tier of 100 pages/day requiring no credit card.
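A minimal sketch of calling the standalone API might look like the following request builder; the base URL, auth header name, and mode identifiers are assumptions inferred from the tier names above, so check the official API reference before use.

```python
import urllib.parse

# Hypothetical request builder for a standalone document-to-text API.
# The base URL, header name, and mode values below are assumptions
# drawn from the tier names above -- consult the LLMWhisperer API docs
# for the real endpoint and parameters.
BASE_URL = "https://llmwhisperer-api.example.com/api/v2/whisper"
MODES = {"native_text", "low_cost", "high_quality", "form"}

def build_whisper_request(api_key: str, mode: str) -> tuple[str, dict]:
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    url = f"{BASE_URL}?{urllib.parse.urlencode({'mode': mode})}"
    headers = {
        "unstract-key": api_key,  # assumed auth header name
        "Content-Type": "application/octet-stream",
    }
    return url, headers

url, headers = build_whisper_request("sk-demo", "high_quality")
print(url)
```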

Prompt engineering: Prompt Studio

Prompt Studio is a purpose-built environment for defining extraction schemas without pre-built templates. Engineers can compare outputs and costs from multiple LLMs side-by-side, test prompt versions against representative document samples, and roll back to previous prompt versions. Once a schema is finalized, it deploys as an API with a single click, removing the need to manage prompts in spreadsheets or ad hoc scripts.
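The side-by-side cost comparison can be approximated offline; the per-token prices below are placeholders, not current rates for any real model:

```python
# Hedged sketch of the comparison idea: estimate what one extraction
# prompt costs on each candidate model. Prices are illustrative only.
PRICES_PER_1K = {  # (input, output) USD per 1K tokens -- placeholder rates
    "model-a": (0.0025, 0.01),
    "model-b": (0.003, 0.015),
    "model-c": (0.00015, 0.0006),
}

def compare_costs(input_tokens: int, output_tokens: int) -> dict[str, float]:
    return {
        model: round(input_tokens / 1000 * p_in + output_tokens / 1000 * p_out, 4)
        for model, (p_in, p_out) in PRICES_PER_1K.items()
    }

# A 30-page document might consume ~12K input tokens per extraction pass.
costs = compare_costs(input_tokens=12_000, output_tokens=500)
for model, usd in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model}: ${usd}")
```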

Observability: LLMObservability

Given the probabilistic nature of LLMs, Unstract includes an observability layer that surfaces how data and models interact both during development and in production. This addresses a common gap in LLM-based pipelines where failures are silent or difficult to trace.
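A minimal illustration of the idea, not Unstract's implementation: wrap each LLM call so that prompt, response, and latency are recorded for later inspection.

```python
import time
from functools import wraps

# Every observed call appends a structured record, so a silent failure
# in production leaves a trace that can be queried afterwards.
CALL_LOG: list[dict] = []

def observed(llm_call):
    @wraps(llm_call)
    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        response = llm_call(prompt)
        CALL_LOG.append({
            "prompt": prompt,
            "response": response,
            "latency_s": round(time.perf_counter() - start, 4),
        })
        return response
    return wrapper

@observed
def fake_llm(prompt: str) -> str:
    # Stand-in for a real provider call.
    return f"echo: {prompt}"

fake_llm("extract the invoice total")
print(CALL_LOG[-1]["latency_s"])
```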

Integration and deployment

Unstract supports four integration patterns, targeting different team types:

| Integration type | Best for |
| --- | --- |
| API deployments | Developers building apps or services requiring programmatic document structuring |
| ETL pipelines | Data engineering teams batch-processing documents into JSON for warehouses |
| n8n nodes | Low-code and ops teams automating workflows visually |
| MCP servers | Developers building agentic or LLM-powered tools using the Model Context Protocol |

ETL data sources include AWS S3, MinIO, Google Cloud Storage, Azure Blob, Google Drive, Dropbox, and SFTP. Destinations include Snowflake, Amazon Redshift, Google BigQuery, PostgreSQL, MySQL, MariaDB, SQL Server, and Oracle. Output formats include JSON, spreadsheets, and direct database writes.
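The ETL pattern can be sketched end-to-end; the three stage functions below are stand-ins for Unstract's source connector, deployed extraction API, and warehouse destination, not its actual interfaces.

```python
import json
from typing import Iterable

def ingest(paths: Iterable[str]) -> Iterable[str]:
    # Stand-in for an S3/GCS/SFTP source connector.
    for path in paths:
        yield f"contents of {path}"

def extract(document: str) -> dict:
    # Stand-in for a deployed extraction API returning structured fields.
    return {"source_preview": document[:20], "total": 99.0}

def load(rows: Iterable[dict]) -> list[str]:
    # Stand-in for a warehouse destination: newline-delimited JSON rows.
    return [json.dumps(row) for row in rows]

ndjson = load(extract(doc) for doc in ingest(["s3://bucket/a.pdf"]))
print(ndjson[0])
```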

LLM support

The platform supports OpenAI, Azure OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, Mistral AI, Ollama (local), and Anyscale via user-supplied API keys. Prompt Studio enables side-by-side cost and output comparison across these models before committing to a production configuration.

Self-hosting requirements

Self-hosted deployment runs via Docker and Docker Compose on Linux or macOS (Intel or M-series), and requires at least 8 GB of RAM plus Git. Setup runs via ./run-platform.sh; once up, the platform is accessible at http://frontend.unstract.localhost. Enterprise self-hosted plans include SOC 2, HIPAA, ISO 27001, and GDPR compliance certifications.
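In practice the quickstart reduces to a few commands; confirm the exact steps against the repository README, as they may change between releases.

```shell
# Self-hosted quickstart (verify against the Zipstack/unstract README).
git clone https://github.com/Zipstack/unstract.git
cd unstract
./run-platform.sh   # builds and starts the Docker Compose stack
# Then open http://frontend.unstract.localhost in a browser.
```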

Vector database connectors

Qdrant, Weaviate, Pinecone, PostgreSQL, and Milvus are confirmed working integrations, enabling retrieval-augmented extraction strategies that reduce per-document token costs.
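A toy illustration of why retrieval cuts token spend: embed document chunks, rank them against the field being extracted, and send only the top match to the LLM instead of the whole document. Bag-of-words cosine similarity stands in for a real embedding model here.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Net revenue for the quarter was $4.2 billion.",
    "The board approved a new stock repurchase program.",
    "Headcount grew to 12,000 employees worldwide.",
]
context = retrieve("quarterly net revenue", chunks)
print(context)  # only the revenue chunk goes to the LLM
```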

Supported file formats

| Category | Formats |
| --- | --- |
| Word processing | DOCX, DOC, ODT |
| Presentation | PPTX, PPT, ODP |
| Spreadsheet | XLSX, XLS, ODS |
| Document and text | PDF, TXT, CSV, JSON |
| Image | BMP, GIF, JPEG, JPG, PNG, TIF, TIFF, WEBP |

Pricing

Unstract Cloud (bring your own keys)

Plans include LLMWhisperer but require users to supply their own LLM, vector database, and embedding model API keys. A free onboarding credit of $10 on Azure OpenAI GPT-4o is included, along with free access to PostgreSQL, Azure OpenAI Embedding, and LLMWhisperer.

Starter

$416/month (annual billing)

60,000 pages/year. Overage at $0.10/page. Includes LLMWhisperer, PostgreSQL, Azure OpenAI Embedding.

Growth

$1,874/month (annual billing)

300,000 pages/year. Overage at $0.09/page. All Starter features plus higher volume.

Enterprise

Custom pricing

Custom page volume. SSO, SAML/OIDC RBAC, HIPAA, SOC 2, ISO 27001, SLA, priority support.
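As a sanity check on the published plans, the effective per-page cost (before overage, and excluding bring-your-own LLM inference spend) works out as follows:

```python
# Effective per-page cost of the two published cloud plans.
plans = {
    "Starter": {"monthly_usd": 416, "pages_per_year": 60_000},
    "Growth": {"monthly_usd": 1_874, "pages_per_year": 300_000},
}

per_page = {
    name: round(p["monthly_usd"] * 12 / p["pages_per_year"], 4)
    for name, p in plans.items()
}
print(per_page)  # Starter ~ $0.0832/page, Growth ~ $0.0750/page
```

Both figures sit below the respective overage rates ($0.10 and $0.09 per page), so staying within plan volume is the cheaper path.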

LLMWhisperer API (standalone)

| Tier | Price |
| --- | --- |
| Native Text | $1/1,000 pages |
| Low Cost | $5/1,000 pages |
| High Quality | $7/1,000 pages (monthly) / $10/1,000 pages (annual) |
| High Quality with Form Elements | $15/1,000 pages |

Free tier: 100 pages/day, no credit card required.

Use cases

Insurance and financial services

Unstract targets insurance underwriting and claims processing by ingesting KYC documents, contracts, and claims forms and converting them from unstructured to structured formats. The vendor claims document review times can drop from days to minutes, though this figure is self-reported. The platform's handling of high-variation document sets, such as the same form across 50 different states, addresses a documented failure mode of template-based OCR systems.

Financial document extraction

Independent testing on 10-Q SEC filings demonstrated that vector database integration reduced per-document extraction cost from $1.43 to $0.17 on 30-page documents. Hussain noted that "large language models are disrupting IDP platforms and making manual annotations, the greatest pain of IDP, completely redundant." This positions Unstract for finance teams processing high volumes of regulatory filings where annotation overhead makes legacy IDP economically unviable.

ETL for unstructured data

For data engineering teams, Unstract functions as an ETL pipeline for documents: ingest from cloud storage, extract structured fields, push to data warehouses or databases. This positions it alongside platforms like Unstructured in the LLM-native document ETL space, with the addition of open-source self-hosting as a differentiator.

Agentic workflows

The MCP server integration and Human-in-the-Loop feature (available on cloud) position Unstract for agentic document workflows where autonomous agents need reliable structured data from documents. Teams evaluating no-code agentic automation alongside open-source options will find Unstract's AGPL licensing a meaningful differentiator for self-hosted deployments.

Competitive positioning

Unstract occupies the open-source, LLM-native segment of the IDP market, positioned closer to Unstructured and Chunkr in architecture than traditional OCR-first platforms like ABBYY or Tungsten Automation. Its AGPL 3.0 license means self-hosted deployments are free but require open-sourcing derivative works; commercial use cases that cannot comply with AGPL require the enterprise plan.

The LLMChallenge dual-LLM validation approach differentiates Unstract from platforms that rely on single-model extraction with post-hoc confidence scoring. The trade-off is higher LLM API cost per document, partially offset by SinglePass and SummarizedExtraction efficiency features, both of which are enterprise-only. Teams evaluating the open-source edition should account for this gap before assuming production-grade accuracy is available without a paid plan.

Bring-your-own-key pricing means Unstract's cloud cost does not include LLM inference, which can be significant at scale depending on model choice. Vendors like Tiny IDP take a similar LLM-powered extraction approach but without the open-source self-hosting option, making deployment model a key differentiator for teams with data residency requirements.

Technical specifications

| Feature | Detail |
| --- | --- |
| License | AGPL 3.0 (open source); commercial enterprise plans available |
| Deployment | Cloud (managed), self-hosted (Docker), on-premise (enterprise) |
| Backend stack | Django, React, Celery workers, FastAPI |
| LLM support | OpenAI, Azure OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, Mistral AI, Ollama, Anyscale |
| Vector DB support | Qdrant, Weaviate, Pinecone, PostgreSQL, Milvus |
| Hallucination mitigation | LLMChallenge (dual-LLM maker-checker; enterprise only) |
| Token efficiency | SinglePass (up to 8x), SummarizedExtraction (up to 6x); enterprise only |
| Document pre-processing | LLMWhisperer (layout-preserving, handwriting, forms, scanned PDFs) |
| Output formats | JSON, spreadsheet, database, API |
| Languages (LLMWhisperer) | 300+ (High Quality tier); 120+ (Low Cost); all Unicode (Native Text) |
| Compliance certifications | SOC 2, HIPAA, ISO 27001, GDPR (enterprise) |
| SSO | SAML/OIDC; enterprise plans only |
| Human-in-the-Loop | Cloud plans only |
| Free tier | 100 pages/day (LLMWhisperer); 14-day trial (Unstract Cloud) |
| Minimum system requirements | 8 GB RAM, Linux or macOS, Docker, Docker Compose, Git |

Resources

Company information

Unstract is developed by Zipstack. The managed cloud instance is hosted at us-central.unstract.com. Self-hosted deployment is supported on Linux and macOS via Docker. Enterprise plans include SOC 2, HIPAA, ISO 27001, and GDPR compliance certifications, with priority support and SLA guarantees.
