Scale AI: Data Annotation and AI Training

Scale AI is a data annotation and AI training platform provider. In 2025 it posted the best revenue in its history, reaching a $1 billion annual run rate, despite a year defined by Meta's $14.8 billion purchase of a 49% stake, the founder's departure, a 14% workforce reduction, and the loss of Google and xAI as clients. The company now operates across commercial AI infrastructure, federal defense contracts, and a growing enterprise applications business under new CEO Jason Droege.

2025 annual run rate: $1B
Post-Meta valuation: $29B
Robotics training hours (2025): 150K+
DoD contracts secured: $300M+

Overview

Scale AI operates a data annotation platform supporting images, video, text, audio, LiDAR, and point cloud data types. Founded in 2016, the company initially focused on data labeling services for autonomous vehicles before expanding into document processing with Scale Document AI. In May 2024, the company raised $1 billion from Nvidia, Amazon, and Meta, valuing it at nearly $14 billion.

The strategic landscape shifted in June 2025 when Meta acquired a 49% non-voting stake for $14.8 billion, more than doubling Scale's valuation to approximately $29 billion. Founder Alexandr Wang departed to serve as Chief AI Officer at Meta Superintelligence Labs. Jason Droege, previously Scale's Chief Strategy Officer and a former Uber executive, became CEO. As Sources News reported in February 2026: "When Alexandr Wang left Scale AI for Meta last year in a $14.5 billion deal, the conventional wisdom was that Scale was done. Instead, the company posted its best revenue in history last year, reaching a billion-dollar annual run rate."

The deal immediately triggered client departures. OpenAI cut ties, Google canceled a planned $200 million annual contract citing data security concerns, and xAI began exploring alternatives. Competitors including Labelbox, Mercor, Turing, and Handshake reported significant customer inflows from organizations diversifying away from Scale. Turing's revenue run rate tripled to $300 million as it positioned itself as a neutral vendor. A March 2026 analysis on financialcontent.com notes that the deal grants Meta "privileged access" to Scale's global workforce through its Outlier and Remotasks subsidiaries, plus exclusive rights to Scale's Safety, Evaluation, and Alignment Lab (SEAL) frameworks.

Despite the client losses, Scale doubled annual revenue in 2025 and secured over $1 billion in new bookings, with more than half closed in Q4 2025. The data business became profitable in H2 2025. The applications business more than doubled revenue in the second half of the year and is projected to roughly double again in 2026.

What users say

Practitioners working with Scale's annotation services report a widening gap between the company's enterprise positioning and its operational execution. Amplify Partners documented in 2026 that customers at major AI research labs and startups are "raising concerns about low quality and high turnaround times" with Scale's annotation services, and that "the traditional approach to data labeling is not meeting their needs." Annotators have separately reported wage disputes tied to opaque submission policies, with some workers in the Philippines paid less than one cent per task.

Teams building production document extraction pipelines increasingly evaluate Scale's platform against newer alternatives. The robotics and physical AI training work draws more consistent praise, reflecting Scale's investment in specialized infrastructure. Enterprise applications customers, including those in financial services and government, tend to report stronger outcomes where Scale's human-in-the-loop validation and classified deployment capabilities are the primary requirement, rather than commodity labeling volume.

A data security incident in which thousands of sensitive project documents were exposed via public Google Docs has made procurement teams cautious. For organizations sharing proprietary documents with Scale's platform, the incident remains a live concern that financial performance figures do not address.

How Scale AI processes documents

Scale Document AI handles document extraction through a combination of proprietary optical character recognition (OCR), adaptive machine learning, and human-in-the-loop validation. The platform processes structured and semi-structured documents without requiring per-document template configuration, which matters for enterprise workflows where incoming document formats vary significantly.

The core extraction pipeline starts with Scale's in-house OCR engine, which combines computer vision and natural language processing to recognize text across variable layouts. Adaptive machine learning models then generalize across document structures, reducing the configuration overhead that template-based systems require. For edge cases and regulated industries requiring audit trails, Scale routes documents to domain expert annotators through its global workforce network.
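The last stage of the pipeline described above, routing uncertain extractions to human reviewers while keeping an audit trail, can be sketched as follows. This is a minimal illustration under stated assumptions, not Scale's implementation: the names `ExtractedField`, `ExtractionResult`, and `route`, the per-field confidence score, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedField:
    """One value pulled from a document (hypothetical structure)."""
    name: str
    value: str
    confidence: float  # assumed model confidence in [0, 1]


@dataclass
class ExtractionResult:
    doc_id: str
    fields: list[ExtractedField]
    needs_review: bool = False
    audit_log: list[str] = field(default_factory=list)


def route(doc_id: str, fields: list[ExtractedField],
          threshold: float = 0.9) -> ExtractionResult:
    """Flag a document for human review if any field is below the threshold.

    Every routing decision is appended to the audit log so the outcome
    can be traced later.
    """
    result = ExtractionResult(doc_id=doc_id, fields=fields)
    for f in fields:
        if f.confidence < threshold:
            result.needs_review = True
            result.audit_log.append(
                f"{f.name}: confidence {f.confidence:.2f} "
                f"below {threshold:.2f}, sent to reviewer"
            )
    if not result.needs_review:
        result.audit_log.append("all fields auto-accepted")
    return result


if __name__ == "__main__":
    result = route("loan-001", [
        ExtractedField("applicant_name", "Jane Doe", 0.98),
        ExtractedField("loan_amount", "25,000", 0.71),
    ])
    print(result.needs_review)
    print(result.audit_log)
```

A single threshold keeps the sketch short. A production system would more plausibly tune thresholds per field or per document type, since the cost of an error in a loan amount differs from the cost of an error in a mailing address.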

The Scale GenAI Platform extends this infrastructure to foundation model customization. Enterprises use it to fine-tune models from OpenAI, Anthropic, and Meta on proprietary document corpora while maintaining data security through on-premise hosting options. The Data Engine handles reinforcement learning from human feedback (RLHF), synthetic data generation, and model evaluation pipelines for organizations training or fine-tuning large language models on document-heavy datasets.

The RAGChain documentation lists Scale's OCR engine as a supported model for document extraction pipelines: the engine uses deep learning to orchestrate extraction and passes structured output downstream. Teams building comparable open-source pipelines sometimes evaluate LangExtract, Google's Python library that uses large language models to extract structured information from unstructured text and grounds each extraction in its source.
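Source grounding, in the general sense, means tying each extracted value back to the exact span of the input it came from, so that values with no supporting span can be flagged as possible hallucinations. The sketch below illustrates that idea in plain Python using exact substring matching. It does not use LangExtract's API or Scale's, and `GroundedValue` and `ground` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedValue:
    label: str
    text: str
    start: Optional[int]  # character offset in the source, None if not found
    end: Optional[int]    # exclusive end offset, None if not found


def ground(source: str, extractions: dict[str, str]) -> list[GroundedValue]:
    """Attach source character offsets to each extracted value.

    Uses exact substring search on the first occurrence; values that do
    not appear verbatim in the source are returned with start and end
    set to None.
    """
    grounded = []
    for label, text in extractions.items():
        idx = source.find(text)
        if idx == -1:
            grounded.append(GroundedValue(label, text, None, None))
        else:
            grounded.append(GroundedValue(label, text, idx, idx + len(text)))
    return grounded


if __name__ == "__main__":
    src = "Invoice 4471 issued to Acme Corp for $1,200."
    candidates = {"invoice_number": "4471", "customer": "Acme Corp", "currency": "EUR"}
    for g in ground(src, candidates):
        status = "ungrounded" if g.start is None else f"chars {g.start}-{g.end}"
        print(f"{g.label}: {g.text!r} ({status})")
```

Exact matching is the simplest possible grounding check. Real extraction output often normalizes whitespace, dates, or currency formats, so a practical version would need fuzzy or token-level alignment.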

Use cases

Autonomous vehicle training

Autonomous vehicle training was Scale AI's original focus area. The platform provides LiDAR point cloud annotation, camera fusion datasets, and computer vision training data for self-driving development programs. This vertical established the human-in-the-loop annotation model that Scale later applied to document processing. As autonomous vehicle companies matured their real-world data collection in the early 2020s, demand for Scale's core labeling service in this segment contracted, accelerating the company's pivot toward higher-margin verticals.

Financial services document processing

Banks and financial institutions use Scale Document AI to process loan applications, compliance documents, and know-your-customer (KYC) packages. Template-free extraction handles the variability of incoming document formats, while human validation supports regulatory audit requirements. The RLHF pipeline allows financial clients to fine-tune extraction models on proprietary document corpora. Scale's Q4 2025 enterprise bookings included Allianz and BP as new customers, signaling traction in regulated industries. Open-source alternatives targeting similar financial document workflows include Unstract, a no-code large language model platform with hallucination mitigation designed for production-grade extraction. Financial analytics providers such as Acuity Knowledge Partners take a complementary approach, combining AI-powered document processing with research automation for institutional clients.

Defense and government applications

Scale provides AI training solutions for military simulation programs, autonomous drone systems, and geospatial intelligence analysis. The company holds over $300 million in Department of Defense (DoD) contracts, including the Thunderforge prime contract (March 2025) for integrating AI agents into Pentagon mission planning, a $100 million five-year agreement with the Chief Digital and Artificial Intelligence Office (CDAO), and two additional contracts totaling nearly $200 million awarded in 2025. Scale Donovan, the company's government decision-support platform, deploys AI agents for intelligence analysis and mission operations on Top Secret and Sensitive Compartmented Information (TS/SCI) networks. In May 2023, Scale became the first AI company to deploy a large language model on a classified U.S. Army network.

Scale lost the National Geospatial-Intelligence Agency's (NGA) $708 million follow-on data-training contract to smaller competitor Enabled Intelligence. The award is now the subject of litigation filed January 30, 2026 in the U.S. Court of Federal Claims, and it signals that incumbency no longer insulates Scale from competitive federal procurement. Government-focused vendors such as VIDIZMO, which offers evidence management and redaction for defense and law enforcement, occupy adjacent territory in the federal document AI market.

Robotics and physical AI

Scale delivered more than 150,000 hours of robotics and physical AI training data in 2025, onboarding 10 new robotics customers including Physical Intelligence and Generalist AI. The Physical AI data collection platform, built on 100,000+ production hours at Scale's San Francisco lab, supports autonomous vehicle companies and humanoid robotics developers. In March 2026, Scale partnered with Universal Robots to launch "UR AI Trainer" for physical AI and robotics training. This segment drove the data business to profitability in H2 2025.

Model evaluation and benchmarking

Scale's Safety, Evaluation, and Alignment Lab (SEAL) published 450+ evaluations across 50+ models in 2025 and introduced 15 new benchmarks including Humanity's Last Exam, SWE-Bench Pro, and MCP Atlas. In March 2026, Scale launched Scale Labs, a research division expanding SEAL's focus to agentic and multimodal AI systems, with new benchmarks including SWE-Atlas and Voice Showdown. Scale also became a third-party evaluator for the U.S. AI Safety Institute in February 2025. This evaluation business positions Scale as an independent assessor of AI reliability, a role that commands premium pricing and reduces dependence on commodity annotation revenue.

Technical specifications

Feature	Specification
Core products	Scale Document AI, Data Engine, Scale Rapid, Scale Studio, Scale GenAI, Scale Donovan, Scale Evaluation
Recognition technology	In-house OCR, computer vision, NLP, adaptive ML models
Data types	Images, video, text, audio, LiDAR, point clouds, documents
Extraction approach	Template-free, adaptive AI
Integration	API, SDK, CLI tools
Cloud storage	AWS S3, Google Cloud Storage, Azure Blob Storage
Deployment	Cloud-based; on-premise available for GenAI Platform; TS/SCI for Donovan
Target industries	Autonomous vehicles, defense, financial services, healthcare, government
Revenue	$870M annual revenue (2024); $1B+ annual run rate (2025)
Valuation	$29B (2025)
Employees	~1,430 (post-14% reduction from ~1,650 in 2024)
New offices (2025)	New York, London, Washington D.C., Doha, St. Louis

Resources

Website
Scale Document AI vs OCR
Scale's next era: building for 2026
Wikipedia: Scale AI
Scale AI new CEO interview
Meta's data moat: the $14.3B bet on Scale AI
Annotation for AI doesn't scale

For a competitive positioning analysis, see Scale AI competitive analysis.