unstructured

Open-source ETL platform provider transforming unstructured data into LLM-ready formats through libraries and enterprise APIs, supporting 25+ file types with 60+ connectors for RAG workflows.

Overview

Unstructured provides automated ETL solutions for document processing that extract and transform data from PDFs, images, HTML, Word documents, emails, scanned documents, and handwritten notes. Founded in 2022 by Brian Raymond, Matt Robinson, and Crag Wolfe after working together at Primer AI, the company raised $65M from Bain Capital Ventures, Menlo Ventures, Madrona Venture Group, M12, and MongoDB Ventures.

The platform offers three transformation tiers: Basic for text-only documents, Advanced for PDFs and complex files, and Platinum with VLM API integration for scanned and handwritten content. Unstructured released frequently throughout 2025, including version 1.2.5 of unstructured-ingest in August and version 1.1.4 of unstructured-inference in December. In January 2026, the company released version 0.1.7 of its MCP server implementation, which points to a growing emphasis on API-first development and developer community engagement.

The company gained market recognition through its inclusion in open-source PDF parsing comparison tools alongside Azure Document Intelligence and LlamaParse, which positions it as an enterprise alternative in the intelligent document processing space.

Key Features

25+ File Type Support: Processes PDFs, images, HTML, Word, emails, scanned documents, handwritten notes
Three Transformation Tiers: Basic (text-only), Advanced (complex PDFs/images), Platinum (VLM API for scanned/handwritten)
60+ Connectors: Source and destination connectors including S3, Azure, Google Drive, Salesforce, SharePoint, Weaviate, Pinecone, MongoDB
Open-Source Libraries: Python-based tools including unstructured-ingest and unstructured-inference under Apache 2.0 license
MCP Server Implementation: 19 tools for programmatic workflow management and API integration
Workflow Builder: No-code drag-and-drop ETL orchestration
Firecrawl Integration: Web crawling capabilities for LLM-optimized text generation
Auto-Scaling: Horizontal scaling with 300x concurrency per organization

Use Cases

RAG Data Preparation

Organizations preparing data for large language models use Unstructured to transform documents from multiple sources into LLM-ready formats. The platform automatically detects new files from 60+ connectors, applies appropriate transformation tiers based on document complexity, and delivers structured outputs to vector databases for RAG workflows.
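The tier selection described above can be pictured as a simple routing rule. The Python sketch below is illustrative only: the Document class, its is_scanned and is_handwritten flags, the choose_tier function, and the set of text-only extensions are all hypothetical names chosen for this example. It is not Unstructured's API and does not reflect how the platform actually classifies documents.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumption: which extensions count as "text-only" is a guess made for
# illustration; the source only says Basic handles text-only documents.
TEXT_ONLY_EXTENSIONS = {".txt", ".md"}


@dataclass(frozen=True)
class Document:
    """Hypothetical description of an incoming file."""
    filename: str
    is_scanned: bool = False      # assumed to come from an upstream check
    is_handwritten: bool = False  # assumed to come from an upstream check


def choose_tier(doc: Document) -> str:
    """Map a document to one of the three tier names used in the text."""
    # Scanned or handwritten content goes to the VLM-backed tier.
    if doc.is_scanned or doc.is_handwritten:
        return "Platinum"
    # Plain text files need no layout analysis.
    suffix = PurePosixPath(doc.filename).suffix.lower()
    if suffix in TEXT_ONLY_EXTENSIONS:
        return "Basic"
    # Everything else (PDFs and other complex files) falls through.
    return "Advanced"


if __name__ == "__main__":
    for doc in (
        Document("notes.txt"),
        Document("report.pdf"),
        Document("scan.pdf", is_scanned=True),
    ):
        print(doc.filename, choose_tier(doc))
```

The scanned and handwritten checks come first so that a scanned file with a plain-text extension is still routed to the most capable tier.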

Developer Integration Workflows

Developers use Unstructured's open-source libraries and MCP server implementation to manage document processing workflows programmatically. The platform's API-first approach enables integration with existing enterprise systems through standardized interfaces, and the MCP server supports both SSE and stdio transports.

Enterprise Document ETL

Enterprises deploy Unstructured for continuous extraction from diverse document repositories. The Workflow Builder orchestrates multi-step transformations including partitioning, cleaning, chunking, and embedding generation without code, with RBAC controls and in-VPC deployment for sensitive data.
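To make the order of those steps concrete, here is a minimal, dependency-free Python sketch of a partition, clean, chunk, and embed sequence. Every function in it is a simplified stand-in written for this example: it does not call the unstructured library or the Workflow Builder, and the hash-based embedding is a placeholder for a real embedding model.

```python
import hashlib
import re


def partition(raw: str) -> list[str]:
    """Split raw text into elements on blank lines (stand-in for layout-aware partitioning)."""
    return [block for block in re.split(r"\n\s*\n", raw) if block.strip()]


def clean(element: str) -> str:
    """Collapse runs of whitespace, including line wraps, into single spaces."""
    return " ".join(element.split())


def chunk(elements: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack elements into chunks of at most max_chars characters.

    A single element longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for element in elements:
        candidate = f"{current} {element}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = element
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embed(text: str, dims: int = 8) -> list[float]:
    """Toy deterministic vector derived from a hash; not a semantic embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def run_pipeline(raw: str, max_chars: int = 200) -> list[dict]:
    """Run the four steps in order and return records ready for a vector store."""
    elements = [clean(element) for element in partition(raw)]
    return [{"text": c, "embedding": embed(c)} for c in chunk(elements, max_chars)]


if __name__ == "__main__":
    sample = "Title\n\nFirst   paragraph\nwith wraps.\n\nSecond paragraph."
    for record in run_pipeline(sample, max_chars=30):
        print(record["text"])
```

Cleaning runs before chunking so that chunk sizes are measured on normalized text rather than on raw whitespace.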

Technical Specifications

Feature	Specification
Platform Types	Open-source library, Enterprise UI, Enterprise API, MCP server
File Types	25+ including PDFs, images, HTML, Word, emails, scanned/handwritten
Transformation Tiers	Basic (text-only), Advanced (complex), Platinum (VLM API)
Connectors	60+ including S3, Azure, Google Drive, Salesforce, vector databases
Processing	Partitioning, cleaning, extraction, chunking, embeddings
API Tools	19 tools for workflow management via MCP implementation
Python Support	3.7.0+ (inference), 3.10-3.12 (ingest), 3.12+ (MCP)
License	Apache 2.0 (open-source components)
Deployment	Cloud, in-VPC
Compliance	SOC 2 Type 2, HIPAA, GDPR
Scaling	Horizontal auto-scaling, 300x concurrency
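
The Python Support row implies that no single interpreter version suits every component equally. The sketch below encodes the ranges from that row and reports which components a given interpreter version covers. The SUPPORTED mapping, the "mcp-server" label, and the supported_components helper are hypothetical names for this example, and the ranges are transcribed from the table rather than read from package metadata.

```python
from __future__ import annotations

import sys

# (minimum version inclusive, maximum version inclusive or None for open-ended),
# transcribed from the Python Support row of the table above.
SUPPORTED: dict[str, tuple[tuple[int, int], tuple[int, int] | None]] = {
    "unstructured-inference": ((3, 7), None),
    "unstructured-ingest": ((3, 10), (3, 12)),
    "mcp-server": ((3, 12), None),  # hypothetical label for the MCP server package
}


def supported_components(version: tuple[int, int]) -> list[str]:
    """Return the components whose stated range includes the given (major, minor)."""
    matches = []
    for name, (minimum, maximum) in SUPPORTED.items():
        if version < minimum:
            continue
        if maximum is not None and version > maximum:
            continue
        matches.append(name)
    return matches


if __name__ == "__main__":
    current = (sys.version_info.major, sys.version_info.minor)
    print(current, supported_components(current))
```

By these ranges, Python 3.12 is the only minor version that falls inside all three.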
