unstructured
Open-source ETL platform provider transforming unstructured data into LLM-ready formats through libraries and enterprise APIs, supporting 25+ file types with 60+ connectors for RAG workflows.

Overview
Unstructured provides automated ETL solutions for document processing that extract and transform data from PDFs, images, HTML, Word documents, emails, scanned documents, and handwritten notes. Founded in 2022 by Brian Raymond, Matt Robinson, and Crag Wolfe after working together at Primer AI, the company raised $65M from Bain Capital Ventures, Menlo Ventures, Madrona Venture Group, M12, and MongoDB Ventures.
The platform offers three transformation tiers: Basic for text-only documents, Advanced for PDFs and complex files, and Platinum, which adds VLM API integration for scanned and handwritten content. Unstructured shipped frequent releases throughout 2025, including version 1.2.5 of unstructured-ingest in August 2025 and version 1.1.4 of unstructured-inference in December 2025. In January 2026, the company released version 0.1.7 of its MCP server implementation, a move toward API-first development and developer community engagement.
Unstructured is included in open-source PDF parsing comparison tools alongside Azure Document Intelligence and LlamaParse, which positions it as an enterprise alternative in the intelligent document processing space.
Key Features
- 25+ File Type Support: Processes PDFs, images, HTML, Word, emails, scanned documents, handwritten notes
- Three Transformation Tiers: Basic (text-only), Advanced (complex PDFs/images), Platinum (VLM API for scanned/handwritten)
- 60+ Connectors: Source and destination connectors including S3, Azure, Google Drive, Salesforce, SharePoint, Weaviate, Pinecone, MongoDB
- Open-Source Libraries: Python-based tools, including unstructured-ingest and unstructured-inference, released under the Apache 2.0 license
- MCP Server Implementation: 19 tools for programmatic workflow management and API integration
- Workflow Builder: No-code drag-and-drop ETL orchestration
- Firecrawl Integration: Web crawling capabilities for LLM-optimized text generation
- Auto-Scaling: Horizontal scaling with 300x concurrency per organization
Use Cases
RAG Data Preparation
Organizations preparing data for large language models use Unstructured to transform documents from multiple sources into LLM-ready formats. The platform automatically detects new files from 60+ connectors, applies appropriate transformation tiers based on document complexity, and delivers structured outputs to vector databases for RAG workflows.
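As a rough illustration of the tier selection described above, the sketch below maps a file to one of the three tier names. The `choose_tier` function, the `TEXT_ONLY_SUFFIXES` set, and the `scanned_or_handwritten` flag are hypothetical and not part of Unstructured's API; the platform's actual complexity detection is not described in this profile.

```python
from pathlib import Path

# Assumption: which extensions count as "text-only" is illustrative, not documented.
TEXT_ONLY_SUFFIXES = {".txt", ".md", ".eml"}


def choose_tier(filename: str, scanned_or_handwritten: bool = False) -> str:
    """Return a tier name for one file using simple, made-up routing rules."""
    # Scanned or handwritten content goes to the VLM-backed tier.
    if scanned_or_handwritten:
        return "Platinum"
    suffix = Path(filename).suffix.lower()
    # Plain text needs no layout analysis.
    if suffix in TEXT_ONLY_SUFFIXES:
        return "Basic"
    # Everything else (PDFs, images, office files) is treated as complex.
    return "Advanced"


if __name__ == "__main__":
    for name, scanned in [("notes.txt", False), ("report.pdf", False), ("form.pdf", True)]:
        print(name, "->", choose_tier(name, scanned))
```

Unknown extensions fall through to Advanced in this sketch, on the assumption that over-processing a simple file is safer than under-processing a complex one.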
Developer Integration Workflows
Developers use Unstructured's open-source libraries and MCP server implementation to manage document processing workflows programmatically. The platform's API-first approach enables integration with existing enterprise systems through standardized interfaces, and the MCP server supports both SSE and stdio transports.
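To make the stdio option more concrete, here is a minimal sketch of how a client might frame a single MCP tool invocation. MCP messages are JSON-RPC 2.0, and over stdio each message is written as one line of JSON. The tool name `list_workflows` and the helper `tool_call_message` are assumptions for illustration; this profile does not list the server's 19 tool names. A real session also requires an initialization handshake before any tool call, which is omitted here.

```python
import json
from itertools import count

# Monotonic request ids so responses can be matched to requests.
_request_ids = count(1)


def tool_call_message(tool_name: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single newline-terminated JSON line."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, so the trailing "\n" is the only delimiter.
    return json.dumps(request) + "\n"


if __name__ == "__main__":
    # "list_workflows" is a hypothetical tool name used only for this example.
    print(tool_call_message("list_workflows", {}), end="")
```

In practice an MCP client SDK would handle this framing; the sketch only shows the shape of the message on the wire.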
Enterprise Document ETL
Enterprises deploy Unstructured for continuous extraction from diverse document repositories. The Workflow Builder orchestrates multi-step transformations including partitioning, cleaning, chunking, and embedding generation without code, with RBAC controls and in-VPC deployment for sensitive data.
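The multi-step transformation named above (partitioning, cleaning, chunking) can be sketched with the standard library alone. This is a toy stand-in, not Unstructured's implementation: `Element`, `partition`, `clean`, `chunk`, and `run_pipeline` are hypothetical names, partitioning here is a simple split on blank lines, and embedding generation is left out because it depends on an external model.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """One piece of document text produced by partitioning."""
    text: str


def partition(raw: str) -> list[Element]:
    """Split raw text into elements on blank lines (stand-in for real partitioning)."""
    return [Element(block) for block in re.split(r"\n\s*\n", raw) if block.strip()]


def clean(elements: list[Element]) -> list[Element]:
    """Collapse runs of whitespace, including line breaks, inside each element."""
    return [Element(" ".join(e.text.split())) for e in elements]


def chunk(elements: list[Element], max_chars: int = 200) -> list[str]:
    """Pack consecutive elements together, starting a new chunk when adding
    the next element would exceed max_chars. An oversized element is kept whole."""
    chunks: list[str] = []
    current = ""
    for element in elements:
        candidate = f"{current} {element.text}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = element.text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def run_pipeline(raw: str, max_chars: int = 200) -> list[str]:
    """Run partition -> clean -> chunk in order."""
    return chunk(clean(partition(raw)), max_chars)


if __name__ == "__main__":
    sample = "Title\n\nFirst   paragraph\nwith a break.\n\n\nSecond paragraph."
    for piece in run_pipeline(sample, max_chars=40):
        print(repr(piece))
```

Each stage takes and returns plain values, so a stage can be swapped out (for example, replacing the blank-line split with a layout-aware partitioner) without touching the others.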
Technical Specifications
| Feature | Specification |
|---|---|
| Platform Types | Open-source library, Enterprise UI, Enterprise API, MCP server |
| File Types | 25+ including PDFs, images, HTML, Word, emails, scanned/handwritten |
| Transformation Tiers | Basic (text-only), Advanced (complex), Platinum (VLM API) |
| Connectors | 60+ including S3, Azure, Google Drive, Salesforce, vector databases |
| Processing | Partitioning, cleaning, extraction, chunking, embeddings |
| API Tools | 19 tools for workflow management via MCP implementation |
| Python Support | 3.7.0+ (inference), 3.10-3.12 (ingest), 3.12+ (MCP) |
| License | Apache 2.0 (open-source components) |
| Deployment | Cloud, in-VPC |
| Compliance | SOC 2 Type 2, HIPAA, GDPR |
| Scaling | Horizontal auto-scaling, 300x concurrency |
Resources
Company Information
Headquarters: Rocklin, California, United States
Founded: 2022
Founders: Brian Raymond (CEO), Matt Robinson, Crag Wolfe
Employees: ~40
Funding: $65M total, including a $40M Series B in March 2024; investors include Bain Capital Ventures, Menlo Ventures, Madrona Venture Group, M12 (Microsoft's venture fund), Mango Capital, MongoDB Ventures, and Shield Capital
Previous Experience: Founders worked together at Primer AI and the CIA