On This Page
- Overview
- What users say
- Key features
- Hallucination mitigation: LLMChallenge
- Token efficiency: SinglePass and SummarizedExtraction
- Document pre-processing: LLMWhisperer
- Prompt engineering: Prompt Studio
- Observability: LLMObservability
- Integration and deployment
- LLM support
- Self-hosting requirements
- Vector database connectors
- Supported file formats
- Pricing
- Unstract Cloud (bring your own keys)
- LLMWhisperer API (standalone)
- Use cases
- Insurance and financial services
- Financial document extraction
- ETL for unstructured data
- Agentic workflows
- Competitive positioning
- Technical specifications
- Resources
- Company information
Open-source, no-code LLM platform for intelligent document processing, offering production-grade extraction with hallucination mitigation, token efficiency features, and flexible deployment under AGPL 3.0.
Overview
Unstract, developed by Zipstack, is a no-code platform that uses large language models (LLMs) to automate extraction from unstructured documents of any format, type, or design. It is available under the AGPL 3.0 license for self-hosting and as a managed cloud service with a 14-day free trial.
The platform targets the gap between legacy OCR-based IDP systems that require rigid templates and raw LLM APIs, which lack the document pre-processing and output validation needed for production reliability. Its architecture separates document ingestion and normalization (LLMWhisperer), extraction prompt engineering (Prompt Studio), output validation (LLMChallenge and LLMEval), and workflow integration into distinct, composable layers.
Shuveb Hussain described Unstract in a November 2024 Timescale analysis as "an IDP 2.0 platform, powered by LLMs that allows the processing of documents that are way more complex than current IDP 1.0 platforms can handle." The same analysis demonstrated cost reduction from $1.43 to $0.17 per 30-page 10-Q financial document using vector database integration, an 88% reduction verified through hands-on testing rather than vendor self-reporting.
Enterprise feature gating: LLMChallenge (dual-LLM consensus validation), SinglePass extraction, SummarizedExtraction, Human-in-the-Loop review, SSO, and SAML/OIDC RBAC are exclusive to the paid Cloud Edition. The open-source AGPL edition does not include these features.
What users say
Practitioners evaluating Unstract consistently highlight the elimination of manual template configuration as the primary draw. Teams processing high-variation document sets, such as bank statements from hundreds of different institutions or regulatory filings with inconsistent layouts, report that template-free extraction removes the retraining cycles that make legacy IDP systems expensive to maintain.
The cost-performance trade-off surfaces as the most common friction point. Users find that simple vector retrieval strategies achieve the lowest per-document cost but sacrifice accuracy on compound extraction tasks requiring multiple related fields. Teams that need both cost efficiency and high accuracy on complex documents report spending meaningful time on prompt engineering in Prompt Studio before reaching production-ready results.
The AGPL 3.0 license is a deliberate consideration for enterprise teams. Organizations that cannot open-source derivative works must move to the enterprise plan, which some practitioners note makes the "open-source" positioning feel conditional for commercial deployments. Teams with strict data residency requirements, however, cite self-hosted Docker deployment as a genuine differentiator versus cloud-only alternatives.
Key features
Hallucination mitigation: LLMChallenge
The most distinctive accuracy feature is LLMChallenge, which runs two separate LLMs in a maker-checker configuration. The first LLM extracts a field value; the second challenges it. A consensus is required before the value is returned. If the two models disagree, the system returns null rather than a potentially wrong answer. The vendor's stated principle is "NULL is better than wrong." This approach catches hallucinations before they reach downstream systems, at the cost of roughly doubling LLM API calls per extraction. LLMChallenge is exclusive to the paid Cloud Edition.
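The maker-checker flow described above can be sketched in a few lines. This is an illustrative mock, not Unstract's implementation: `call_maker` and `call_checker` stand in for two independent LLM API calls, and the field value is hard-coded for demonstration.

```python
def call_maker(document: str, field: str) -> str:
    # Placeholder: in a real pipeline this is the first LLM's extraction call.
    return "2024-03-15"

def call_checker(document: str, field: str, proposed: str) -> bool:
    # Placeholder: the second, independent LLM challenges the proposed value.
    return proposed == "2024-03-15"

def extract_with_challenge(document: str, field: str):
    """Return the extracted value only if both models agree; otherwise None."""
    value = call_maker(document, field)
    if call_checker(document, field, value):
        return value
    return None  # the vendor's stated principle: "NULL is better than wrong"

print(extract_with_challenge("...claims form text...", "incident_date"))
```

Note the cost implication visible in the sketch: every field costs two model calls instead of one, which is the overhead SinglePass and SummarizedExtraction aim to claw back.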
Token efficiency: SinglePass and SummarizedExtraction
Two features address the cost of running LLMs on long documents. SinglePass Extraction consolidates all field-extraction prompts into a single large prompt, reducing LLM round-trips; the GitHub documentation claims up to 8x token reduction. SummarizedExtraction constructs a compact version of the input document before sending it to the LLM, claiming up to 6x token savings. Combined, the vendor claims up to 7x overall token reduction versus naive per-field extraction. Both features are enterprise-only.
Independent testing by Timescale on 10-Q filings corroborates the direction of the claimed cost reductions: vector database retrieval brought per-document cost to $0.17, while sub-question retrieval strategies that decompose complex queries cost $0.23 per document due to additional LLM calls. The trade-off between cost and accuracy on compound extraction tasks is real, and reaching a good balance requires deliberate prompt engineering.
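The idea behind SinglePass extraction, consolidating per-field prompts so the document text is tokenized once rather than once per field, can be illustrated with a minimal sketch. Field names, descriptions, and the mocked response below are hypothetical:

```python
import json

def build_single_pass_prompt(document: str, fields: dict) -> str:
    """Merge per-field extraction prompts into one request so the document
    is sent (and billed as input tokens) once instead of len(fields) times."""
    field_specs = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following fields and answer as a single JSON object:\n"
        f"{field_specs}\n\nDocument:\n{document}"
    )

fields = {
    "total_revenue": "total revenue for the quarter, in USD",
    "net_income": "net income for the quarter, in USD",
}
prompt = build_single_pass_prompt("...full 10-Q text...", fields)

# One LLM round-trip returns every field; response is mocked here.
response = '{"total_revenue": "1.2B", "net_income": "300M"}'
print(json.loads(response)["net_income"])
```

With N fields, per-field prompting pays for the document N times; the consolidated prompt pays once, which is where the vendor's claimed multi-x token savings come from.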
Document pre-processing: LLMWhisperer
LLMWhisperer is Unstract's document normalization layer, converting raw documents into formats LLMs can reliably interpret. It handles layout-preserving output for multi-column documents, forms, and tables; handwritten text detection; checkbox and radio button detection; high-fidelity processing of scanned PDFs and smartphone-captured images; and auto PDF repair with rotation and skew compensation.
LLMWhisperer is also available as a standalone API with four extraction tiers: Native Text, Low Cost, High Quality, and High Quality with Form Elements. Pricing ranges from $1/1,000 pages (Native Text) to $15/1,000 pages (High Quality with Form Elements), with a free tier of 100 pages/day requiring no credit card.
Prompt engineering: Prompt Studio
Prompt Studio is a purpose-built environment for defining extraction schemas without pre-built templates. Engineers can compare outputs and costs from multiple LLMs side-by-side, test prompt versions against representative document samples, and roll back to previous prompt versions. Once a schema is finalized, it deploys as an API with a single click, removing the need to manage prompts in spreadsheets or ad hoc scripts.
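The side-by-side cost comparison Prompt Studio surfaces in its UI amounts to pricing each candidate model's run over the same sample set. The sketch below uses made-up model names, token counts, and per-token prices purely for illustration:

```python
# Hypothetical per-1K-token input prices; substitute real model pricing.
PRICE_PER_1K_TOKENS = {"model-a": 0.01, "model-b": 0.002}

# One test run per candidate model against the same sample document.
runs = [
    {"model": "model-a", "tokens": 4200, "output": {"invoice_total": "812.50"}},
    {"model": "model-b", "tokens": 4200, "output": {"invoice_total": "812.50"}},
]

def run_cost(run: dict) -> float:
    """Dollar cost of a single extraction run."""
    return run["tokens"] / 1000 * PRICE_PER_1K_TOKENS[run["model"]]

# Cheapest-first comparison: same output at 5x lower cost favors model-b.
for run in sorted(runs, key=run_cost):
    print(f'{run["model"]}: ${run_cost(run):.4f} -> {run["output"]}')
```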
Observability: LLMObservability
Given the probabilistic nature of LLMs, Unstract includes an observability layer that surfaces how data and models interact both during development and in production. This addresses a common gap in LLM-based pipelines where failures are silent or difficult to trace.
Integration and deployment
Unstract supports four integration patterns, targeting different team types:
| Integration type | Best for |
|---|---|
| API deployments | Developers building apps or services requiring programmatic document structuring |
| ETL pipelines | Data engineering teams batch-processing documents into JSON for warehouses |
| n8n nodes | Low-code and ops teams automating workflows visually |
| MCP servers | Developers building agentic or LLM-powered tools using the Model Context Protocol |
ETL data sources include AWS S3, MinIO, Google Cloud Storage, Azure Blob, Google Drive, Dropbox, and SFTP. Destinations include Snowflake, Amazon Redshift, Google BigQuery, PostgreSQL, MySQL, MariaDB, SQL Server, and Oracle. Output formats include JSON, spreadsheets, and direct database writes.
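The destination side of the ETL pattern is a straightforward structured-data load. The sketch below uses an in-memory SQLite table as a stand-in for the warehouses listed above (PostgreSQL, Snowflake, etc.); the extraction payload shape is hypothetical:

```python
import json
import sqlite3

# Illustrative extraction output, as returned by an Unstract API deployment.
extracted = json.loads('{"vendor": "Acme Corp", "invoice_total": 812.50}')

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL/Snowflake/etc.
conn.execute("CREATE TABLE invoices (vendor TEXT, invoice_total REAL)")
conn.execute(
    "INSERT INTO invoices (vendor, invoice_total) VALUES (?, ?)",
    (extracted["vendor"], extracted["invoice_total"]),
)
row = conn.execute("SELECT vendor, invoice_total FROM invoices").fetchone()
print(row)
```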
LLM support
The platform supports OpenAI, Azure OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, Mistral AI, Ollama (local), and Anyscale via user-supplied API keys. Prompt Studio enables side-by-side cost and output comparison across these models before committing to a production configuration.
Self-hosting requirements
Self-hosted deployment runs via Docker and Docker Compose on Linux or macOS (Intel or M-series), with 8 GB RAM minimum and Git. Setup runs via `./run-platform.sh`, after which the platform is accessible at `http://frontend.unstract.localhost`. Enterprise self-hosted plans include SOC 2, HIPAA, ISO 27001, and GDPR compliance certifications.
Vector database connectors
Qdrant, Weaviate, Pinecone, PostgreSQL, and Milvus are confirmed working integrations, enabling retrieval-augmented extraction strategies that reduce per-document token costs.
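The cost savings from retrieval-augmented extraction come from sending only the most relevant chunks to the LLM instead of the whole document. A toy nearest-chunk lookup, with hand-written vectors standing in for real embeddings and a vector DB such as Qdrant or pgvector:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy (chunk, embedding) pairs; real embeddings have hundreds of dimensions.
chunks = [
    ("Revenue for the quarter was $1.2B.", [0.9, 0.1, 0.0]),
    ("The board approved a stock buyback.", [0.1, 0.8, 0.2]),
]
query_vec = [0.85, 0.15, 0.05]  # embedding of "What was quarterly revenue?"

best = max(chunks, key=lambda c: cosine(c[1], query_vec))
print(best[0])  # only this chunk is sent to the LLM, not the full document
```

This is the mechanism behind the $1.43-to-$0.17 per-document reduction reported in the Timescale testing: input tokens shrink from the whole filing to a handful of retrieved passages.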
Supported file formats
| Category | Formats |
|---|---|
| Word processing | DOCX, DOC, ODT |
| Presentation | PPTX, PPT, ODP |
| Spreadsheet | XLSX, XLS, ODS |
| Document and text | PDF, TXT, CSV, JSON |
| Image | BMP, GIF, JPEG, JPG, PNG, TIF, TIFF, WEBP |
Pricing
Unstract Cloud (bring your own keys)
Plans include LLMWhisperer but require users to supply their own LLM, vector database, and embedding model API keys. A free onboarding credit of $10 on Azure OpenAI GPT-4o is included, along with free access to PostgreSQL, Azure OpenAI Embedding, and LLMWhisperer.
Starter
60,000 pages/year. Overage at $0.10/page. Includes LLMWhisperer, PostgreSQL, Azure OpenAI Embedding.
Growth {primary}
300,000 pages/year. Overage at $0.09/page. All Starter features plus higher volume.
Enterprise
Custom page volume. SSO, SAML/OIDC RBAC, HIPAA, SOC 2, ISO 27001, SLA, priority support.
LLMWhisperer API (standalone)
| Tier | Price |
|---|---|
| Native Text | $1/1,000 pages |
| Low Cost | $5/1,000 pages |
| High Quality | $7/1,000 pages (monthly) / $10/1,000 pages (annual) |
| High Quality with Form Elements | $15/1,000 pages |
Free tier: 100 pages/day, no credit card required.
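For budgeting, the tiered rates translate directly into a per-volume cost. A back-of-envelope calculator using the monthly-billing rates from the table above (the 50,000-page volume is an arbitrary example):

```python
# Dollar rates per 1,000 pages, taken from the LLMWhisperer pricing table
# (monthly billing; the High Quality annual rate differs).
RATES_PER_1000_PAGES = {
    "native_text": 1.0,
    "low_cost": 5.0,
    "high_quality": 7.0,
    "high_quality_forms": 15.0,
}

def monthly_cost(pages: int, tier: str) -> float:
    """Estimated monthly spend for a given page volume and tier."""
    return pages / 1000 * RATES_PER_1000_PAGES[tier]

print(monthly_cost(50_000, "high_quality"))  # 350.0
```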
Use cases
Insurance and financial services
Unstract targets insurance underwriting and claims processing by ingesting KYC documents, contracts, and claims forms and converting them from unstructured to structured formats. The vendor claims document review times can drop from days to minutes, though this figure is self-reported. The platform's handling of high-variation document sets, such as the same form across 50 different states, addresses a documented failure mode of template-based OCR systems.
Financial document extraction
Independent testing on 10-Q SEC filings demonstrated that vector database integration reduced per-document extraction cost from $1.43 to $0.17 on 30-page documents. Hussain noted that "large language models are disrupting IDP platforms and making manual annotations, the greatest pain of IDP, completely redundant." This positions Unstract for finance teams processing high volumes of regulatory filings where annotation overhead makes legacy IDP economically unviable.
ETL for unstructured data
For data engineering teams, Unstract functions as an ETL pipeline for documents: ingest from cloud storage, extract structured fields, push to data warehouses or databases. This positions it alongside platforms like Unstructured in the LLM-native document ETL space, with the addition of open-source self-hosting as a differentiator.
Agentic workflows
The MCP server integration and Human-in-the-Loop feature (available on cloud) position Unstract for agentic document workflows where autonomous agents need reliable structured data from documents. Teams evaluating no-code agentic automation alongside open-source options will find Unstract's AGPL licensing a meaningful differentiator for self-hosted deployments.
Competitive positioning
Unstract occupies the open-source, LLM-native segment of the IDP market, positioned closer to Unstructured and Chunkr in architecture than traditional OCR-first platforms like ABBYY or Tungsten Automation. Its AGPL 3.0 license means self-hosted deployments are free but require open-sourcing derivative works; commercial use cases that cannot comply with AGPL require the enterprise plan.
The LLMChallenge dual-LLM validation approach differentiates Unstract from platforms that rely on single-model extraction with post-hoc confidence scoring. The trade-off is higher LLM API cost per document, partially offset by SinglePass and SummarizedExtraction efficiency features, both of which are enterprise-only. Teams evaluating the open-source edition should account for this gap before assuming production-grade accuracy is available without a paid plan.
Bring-your-own-key pricing means Unstract's cloud cost does not include LLM inference, which can be significant at scale depending on model choice. Vendors like Tiny IDP take a similar LLM-powered extraction approach but without the open-source self-hosting option, making deployment model a key differentiator for teams with data residency requirements.
Technical specifications
| Feature | Detail |
|---|---|
| License | AGPL 3.0 (open source); commercial enterprise plans available |
| Deployment | Cloud (managed), self-hosted (Docker), on-premise (enterprise) |
| Application stack | Django and FastAPI backend services, Celery workers, React frontend |
| LLM support | OpenAI, Azure OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, Mistral AI, Ollama, Anyscale |
| Vector DB support | Qdrant, Weaviate, Pinecone, PostgreSQL, Milvus |
| Hallucination mitigation | LLMChallenge (dual-LLM maker-checker; enterprise only) |
| Token efficiency | SinglePass (up to 8x), SummarizedExtraction (up to 6x); enterprise only |
| Document pre-processing | LLMWhisperer (layout-preserving, handwriting, forms, scanned PDFs) |
| Output formats | JSON, spreadsheet, database, API |
| Languages (LLMWhisperer) | 300+ (High Quality tier); 120+ (Low Cost); all Unicode (Native Text) |
| Compliance certifications | SOC 2, HIPAA, ISO 27001, GDPR (enterprise) |
| SSO | SAML/OIDC; enterprise plans only |
| Human-in-the-Loop | Cloud plans only |
| Free tier | 100 pages/day (LLMWhisperer); 14-day trial (Unstract Cloud) |
| Minimum system requirements | 8 GB RAM, Linux or macOS, Docker, Docker Compose, Git |
Resources
- Unstract website
- GitHub repository
- Documentation
- Quick start guide
- Cloud instance
Company information
Unstract is developed by Zipstack. The managed cloud instance is hosted at us-central.unstract.com. Self-hosted deployment is supported on Linux and macOS via Docker. Enterprise plans include SOC 2, HIPAA, ISO 27001, and GDPR compliance certifications, with priority support and SLA guarantees.