Scale AI: Data Annotation and AI Training
Scale AI is a data annotation and AI training platform provider that underwent a major transformation in 2025, when Meta acquired a 49% stake for $14.8 billion. The deal triggered the founder's departure, a 14% workforce reduction, and client pullbacks by OpenAI, Google, and xAI. The company now operates at the intersection of commercial AI infrastructure and contested federal contracts.

Overview
Scale AI operates a data annotation platform supporting images, video, text, audio, LiDAR, and point cloud data types. Founded in 2016, the company initially focused on data labeling services for autonomous vehicles before expanding into document processing with Scale Document AI. In May 2024, the company raised $1 billion from Nvidia, Amazon, and Meta, valuing it at nearly $14 billion.
The strategic landscape shifted in June 2025 when Meta acquired a 49% stake for $14.8 billion. Founder and CEO Alexandr Wang left for Meta to serve as Chief AI Officer at the newly formed Meta Superintelligence Labs alongside former GitHub CEO Nat Friedman. The deal immediately triggered client departures: OpenAI cut ties, Google canceled a planned $200 million spend, and xAI began exploring alternatives. Approximately 200 employees, roughly 14% of headcount, were laid off.
The federal revenue picture is now contested. Scale built its government business on a run of DoD contracts stretching back to 2020, including a $99 million Army contract for AI tooling and a $24 million NGA contract in 2024 for Maven-linked data labeling. In January 2026, Scale lost the NGA's largest-ever data-training contract, worth up to $708 million over seven years, to smaller competitor Enabled Intelligence. On January 30, two days after its GAO protest was dismissed, Scale filed suit against the U.S. Department of Defense in the U.S. Court of Federal Claims. A separate federal engagement continues in parallel: Scale announced a collaboration with Anduril and Microsoft to deploy AI agents within the U.S. military under the DoD's "Thunderforge" initiative.
A 49% stake held by a single commercial hyperscaler creates an obvious tension with Scale's positioning as a neutral, government-trusted AI infrastructure provider. The company's public response to the NGA lawsuit - a spokesperson statement pointedly addressed to "Secretary Hegseth and the Department of War" rather than the contracting officer - reads as reputation management aimed at reassuring DoD stakeholders that Meta's ownership has not compromised Scale's federal alignment.
How Scale AI Processes Documents
Scale Document AI handles document extraction through a combination of proprietary OCR, adaptive machine learning, and human-in-the-loop validation:
- Scale Document AI: Template-free document extraction using adaptive machine learning models that generalize across variable document structures without per-document configuration
- Scale AI OCR Engine: Proprietary text recognition combining computer vision and natural language processing for structured and semi-structured documents
- Data Engine: RLHF, synthetic data generation, and model evaluation pipelines for training and fine-tuning large language models - the core capability that originally attracted enterprise AI clients
- Multi-Format Support: Processes images, video, text, audio, LiDAR, and point clouds alongside traditional document formats
- Human-in-the-Loop: Global network of domain expert annotators providing validation and edge-case resolution, particularly for regulated industries requiring audit trails
- API and SDK Integration: Programmatic access via REST API, SDK, and CLI tools with cloud storage connectors for AWS S3, Google Cloud Storage, and Azure Blob Storage
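The human-in-the-loop step listed above can be pictured as confidence-based routing. The following is a minimal Python sketch, not Scale's API: `ExtractedField`, `route_fields`, the 0.9 threshold, and the sample values are all hypothetical. It also assumes the extraction model emits a per-field confidence score, which the source does not state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExtractedField:
    """One extracted value plus a model confidence score (hypothetical shape)."""
    name: str
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def route_fields(
    fields: List[ExtractedField], threshold: float = 0.9
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Split fields into (auto_accepted, needs_human_review).

    Fields at or above the threshold are accepted automatically;
    everything else is queued for a human annotator.
    """
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review


if __name__ == "__main__":
    # Illustrative values only; not real extraction output.
    sample = [
        ExtractedField("applicant_name", "Jane Doe", 0.98),
        ExtractedField("loan_amount", "250,000", 0.72),
        ExtractedField("tax_id", "XX-XXXXXXX", 0.41),
    ]
    accepted, review = route_fields(sample)
    print("auto-accepted:", [f.name for f in accepted])
    print("human review:", [f.name for f in review])
```

In a regulated workflow, the review list would typically feed an annotator queue, with each human decision logged to support the audit-trail requirement mentioned above. The threshold is a tuning choice that trades review cost against error risk.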
The RAGChain documentation lists Scale's OCR engine as a supported model for document extraction pipelines, since the engine performs deep-learning-based extraction before passing structured output downstream. Teams building similar open-source extraction pipelines sometimes evaluate LangExtract, Google's Python library for structured information extraction from unstructured text using LLMs with source grounding.
Use Cases
Autonomous Vehicle Training
Autonomous vehicle training data was Scale AI's original focus area. The platform provides labeled LiDAR point cloud annotation, camera fusion datasets, and computer vision training data for self-driving development programs. This vertical established the human-in-the-loop annotation model that Scale later applied to document processing.
Financial Services Document Processing
Banks and financial institutions use Scale Document AI to process loan applications, compliance documents, and KYC packages. Template-free extraction handles the variability of incoming document formats, while human validation supports regulatory audit requirements. The platform's RLHF pipeline allows financial clients to fine-tune extraction models on proprietary document corpora. Open-source alternatives targeting similar financial document workflows include Unstract, a no-code LLM platform with hallucination mitigation designed for production-grade extraction. Financial analytics providers such as Acuity Knowledge Partners take a complementary approach, combining AI-powered document processing with research automation for institutional clients.
Defense and Military Applications
Scale provides AI training solutions for military simulation programs, autonomous drone systems, and geospatial intelligence analysis. The company held a $24 million NGA contract in 2024 for Maven-linked data labeling and a $99 million Army contract for AI tooling. The loss of the NGA's follow-on $708 million contract to Enabled Intelligence - now under litigation - signals that incumbency no longer insulates Scale from competitive federal procurement. The Thunderforge collaboration with Anduril and Microsoft represents a separate, active DoD engagement running in parallel. Government-focused vendors such as VIDIZMO, which offers evidence management and redaction solutions for defense and law enforcement, occupy adjacent territory in the federal document AI market.
Technical Specifications
| Feature | Specification |
|---|---|
| Core Products | Scale Document AI, Data Engine, Scale Rapid, Scale Studio, Scale GenAI |
| Recognition Technology | In-house OCR, computer vision, NLP, adaptive ML models |
| Data Types | Images, video, text, audio, LiDAR, point clouds, documents |
| Extraction Approach | Template-free, adaptive AI |
| Integration | API, SDK, CLI tools |
| Cloud Storage | AWS S3, Google Cloud Storage, Azure Blob Storage |
| Target Industries | Autonomous vehicles, defense, financial services, healthcare |
| Deployment | Cloud-based platform |
| Annual Revenue | $870M (2024), $1.5B ARR (2025) |
| Valuation | $29B (2025) |
| Employees | ~1,430 (post-14% reduction from ~1,650 in 2024) |
Resources
- Website
- Scale Document AI vs OCR
- Data Engine Platform
- Wikipedia: Scale AI
- Scale AI sues DoD over $708M NGA contract loss - Tekedia, 2026-01-30
- Meta's AI acquisitions in 2025 - Economic Times, 2025
- Alexandr Wang joins Meta Superintelligence Labs - Fortune, 2026-01-30
- Scale AI valuation and client dynamics - SaaStr
- Nvidia's top startup investments including Scale AI - TechCrunch, 2026-01-02
For a competitive positioning analysis, see Scale AI: Competitive Analysis.
Company Information
Headquarters: San Francisco, California, United States
Founded: 2016
Employees: ~1,430 (following 14% reduction post-Meta investment, 2025)
Revenue: $870M (2024), $1.5B ARR (2025)
Valuation: $29B (2025)
Ownership: Meta Platforms acquired 49% stake for $14.8B in June 2025
Leadership: Alexandr Wang (founder) departed as CEO to become Chief AI Officer at Meta Superintelligence Labs following the acquisition; current CEO not publicly confirmed as of February 2026
Key Federal Contracts: $99M Army AI tooling contract; $24M NGA Maven data labeling contract (2024); $708M NGA follow-on contract lost to Enabled Intelligence (under litigation as of January 2026); Thunderforge initiative with Anduril and Microsoft (active)