unstructured

Open-source ETL platform provider transforming unstructured data into LLM-ready formats through libraries and enterprise APIs, supporting 25+ file types with 60+ connectors for RAG workflows.

Overview

Unstructured provides automated ETL solutions for document processing that extract and transform data from PDFs, images, HTML, Word documents, emails, scanned documents, and handwritten notes. Founded in 2022 by Brian Raymond, Matt Robinson, and Crag Wolfe after working together at Primer AI, the company raised $65M from Bain Capital Ventures, Menlo Ventures, Madrona Venture Group, M12, and MongoDB Ventures.

The platform offers three transformation tiers: Basic for text-only documents, Advanced for PDFs and complex files, and Platinum with VLM API integration for scanned and handwritten content. Unstructured shipped frequent releases throughout 2025, including version 1.2.5 of unstructured-ingest in August 2025 and version 1.1.4 of unstructured-inference in December 2025. By January 2026, the company had released version 0.1.7 of its MCP server implementation, which points to a growing emphasis on API-first development and developer community engagement.

The company has gained market recognition through its inclusion in open-source PDF parsing comparison tools alongside Azure Document Intelligence and LlamaParse, which positions it as a viable enterprise alternative in the intelligent document processing space.

Key Features

  • 25+ File Type Support: Processes PDFs, images, HTML, Word, emails, scanned documents, handwritten notes
  • Three Transformation Tiers: Basic (text-only), Advanced (complex PDFs/images), Platinum (VLM API for scanned/handwritten)
  • 60+ Connectors: Source and destination connectors including S3, Azure, Google Drive, Salesforce, SharePoint, Weaviate, Pinecone, MongoDB
  • Open-Source Libraries: Python-based tools including unstructured-ingest and unstructured-inference under Apache 2.0 license
  • MCP Server Implementation: 19 tools for programmatic workflow management and API integration
  • Workflow Builder: No-code drag-and-drop ETL orchestration
  • Firecrawl Integration: Web crawling capabilities for LLM-optimized text generation
  • Auto-Scaling: Horizontal scaling with 300x concurrency per organization

Use Cases

RAG Data Preparation

Organizations preparing data for large language models use Unstructured to transform documents from multiple sources into LLM-ready formats. The platform automatically detects new files from 60+ connectors, applies appropriate transformation tiers based on document complexity, and delivers structured outputs to vector databases for RAG workflows.
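The tier selection described above can be pictured as a small routing function. This is only a sketch: the extension sets, the `is_scanned` and `has_handwriting` flags, and the `choose_tier` name are assumptions made for illustration, not the platform's actual routing rules, which the source does not describe.

```python
from pathlib import Path

# Assumed set of text-only extensions; anything else is treated as "complex".
TEXT_ONLY_EXTENSIONS = {".txt", ".md", ".eml", ".html"}


def choose_tier(filename: str, is_scanned: bool = False,
                has_handwriting: bool = False) -> str:
    """Pick a transformation tier for one file (illustrative only)."""
    # Scanned or handwritten content goes to the VLM-backed tier.
    if is_scanned or has_handwriting:
        return "platinum"
    # Plain text formats only need the basic tier.
    if Path(filename).suffix.lower() in TEXT_ONLY_EXTENSIONS:
        return "basic"
    # PDFs, images, and unknown types default to the advanced tier.
    return "advanced"


for name, scanned in [("notes.txt", False), ("report.pdf", False), ("scan.pdf", True)]:
    print(name, "->", choose_tier(name, is_scanned=scanned))
```

Defaulting unknown file types to the middle tier is a conservative choice in this sketch; a real deployment might instead inspect the file contents before deciding.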

Developer Integration Workflows

Developers use Unstructured's open-source libraries and MCP server implementation to manage document processing workflows programmatically. The platform's API-first approach enables integration with existing enterprise systems through standardized interfaces, and the MCP server supports both SSE and stdio transports.
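For context, MCP clients and servers exchange JSON-RPC 2.0 messages, and a tool is invoked with the `tools/call` method. The sketch below only builds such a request as a single line of JSON, the framing used over stdio. The tool name `list_workflows` and its empty arguments are hypothetical placeholders, not confirmed names from Unstructured's server.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)  # simple incrementing JSON-RPC request ids


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize one JSON-RPC 2.0 request as a single line of JSON."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    # json.dumps without indent emits no raw newlines, so the result
    # can be written to a stdio transport as one line.
    return json.dumps(message)


def tool_call(name: str, arguments: dict) -> str:
    """Build an MCP tools/call request for the named tool."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


# "list_workflows" is a hypothetical tool name used for illustration.
print(tool_call("list_workflows", {}))
```

In practice an MCP client library would handle this framing, along with the initialization handshake that precedes any tool call.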

Enterprise Document ETL

Enterprises deploy Unstructured for continuous extraction from diverse document repositories. The Workflow Builder orchestrates multi-step transformations including partitioning, cleaning, chunking, and embedding generation without code, with RBAC controls and in-VPC deployment for sensitive data.
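The four stages named above (partitioning, cleaning, chunking, embedding) can be illustrated with a toy pipeline built on the standard library. Every function here is a simplified stand-in written for this sketch, not Unstructured's implementation: partitioning just splits on blank lines, and the "embedding" is a hashed bag-of-words vector rather than a model output.

```python
import hashlib
import re


def partition(text):
    """Split raw text into elements on blank lines."""
    return [block for block in re.split(r"\n\s*\n", text) if block.strip()]


def clean(element):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", element).strip()


def chunk(elements, max_chars=200):
    """Greedily pack consecutive elements into chunks of at most max_chars.

    A single element longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for element in elements:
        candidate = f"{current} {element}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = element
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embed(chunk_text, dims=8):
    """Toy embedding: count words per hash bucket (not a real model)."""
    vector = [0.0] * dims
    for word in chunk_text.lower().split():
        bucket = hashlib.sha256(word.encode("utf-8")).digest()[0] % dims
        vector[bucket] += 1.0
    return vector


def run_pipeline(text, max_chars=200):
    """Run partition -> clean -> chunk -> embed and return (chunk, vector) pairs."""
    elements = [clean(e) for e in partition(text)]
    return [(c, embed(c)) for c in chunk(elements, max_chars)]


sample = "Title\n\nFirst   paragraph\nwraps here.\n\n\nSecond paragraph."
for chunk_text, vector in run_pipeline(sample, max_chars=30):
    print(repr(chunk_text), vector)
```

Keeping each stage as a separate function mirrors the way a workflow builder chains steps: any stage can be swapped out without touching the others.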

Technical Specifications

Feature: Specification
Platform Types: Open-source library, Enterprise UI, Enterprise API, MCP server
File Types: 25+ including PDFs, images, HTML, Word, emails, scanned/handwritten
Transformation Tiers: Basic (text-only), Advanced (complex), Platinum (VLM API)
Connectors: 60+ including S3, Azure, Google Drive, Salesforce, vector databases
Processing: Partitioning, cleaning, extraction, chunking, embeddings
API Tools: 19 tools for workflow management via MCP implementation
Python Support: 3.7.0+ (inference), 3.10-3.12 (ingest), 3.12+ (MCP)
License: Apache 2.0 (open-source components)
Deployment: Cloud, in-VPC
Compliance: SOC 2 Type 2, HIPAA, GDPR
Scaling: Horizontal auto-scaling, 300x concurrency
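Because the three components list different Python ranges, it can help to check an interpreter against them before installing. The sketch below encodes the ranges from the specifications above. The dictionary keys are labels chosen for this example, and reading "3.10-3.12" as inclusive of 3.12 is an assumption.

```python
import sys

# (minimum, maximum) supported (major, minor); None means no stated upper bound.
SUPPORTED_PYTHON = {
    "unstructured-inference": ((3, 7), None),
    "unstructured-ingest": ((3, 10), (3, 12)),
    "mcp-server": ((3, 12), None),
}


def supports(component, version=None):
    """Return True if the given (major, minor) version fits the component's range."""
    major_minor = tuple((version or sys.version_info)[:2])
    low, high = SUPPORTED_PYTHON[component]
    return major_minor >= low and (high is None or major_minor <= high)


for component in SUPPORTED_PYTHON:
    print(component, supports(component))
```

Under these ranges, Python 3.12 is the only minor version that satisfies all three components at once.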

Company Information

Headquarters: Rocklin, California, United States

Founded: 2022

Founders: Brian Raymond (CEO), Matt Robinson, Crag Wolfe

Employees: ~40

Funding: $65M total, including a $40M Series B in March 2024. Investors include Bain Capital Ventures, Menlo Ventures, Madrona Venture Group, M12 (Microsoft's venture fund), Mango Capital, MongoDB Ventures, and Shield Capital.

Previous Experience: Founders worked together at Primer AI and the CIA