LangExtract — Google's Open-Source IDP Python Library
Google's free Python library for extracting structured information from unstructured text using large language models (LLMs), with every extracted entity mapped back to its exact character position in the source document.
Overview
LangExtract was released by Google on July 30, 2025 as an open-source Python library for LLM-powered information extraction. The library is available via PyPI as v1.2.0 (released March 22, 2026) under the Apache 2.0 license, making it free for commercial and research use. Google notes explicitly that LangExtract is not an officially supported Google product, requiring users to acknowledge this when deploying in production or citing it in publications.
The library addresses a specific gap in document processing pipelines: once text has been extracted from a document, how do you reliably convert unstructured narratives into structured data while proving every output is grounded in the source? As Akshay Goel of Google framed the problem at launch: "What if you could programmatically extract the exact information you need, while ensuring the outputs are structured and reliably tied back to its source?"
Unlike full-stack intelligent document processing (IDP) platforms such as ABBYY or Hyperscience, LangExtract operates on text that has already been extracted from documents. It focuses on the semantic intelligence layer: transforming unstructured narratives into structured JSON with schema enforcement, source grounding, and hallucination filtering. Teams that need upstream document parsing and layout recovery can pair LangExtract with an open-source document parser such as Docling, which handles that layer.
Google positioned LangExtract as a free alternative to enterprise extraction solutions that can cost $50K or more, in a data extraction market valued between $1.5 billion and $5 billion in 2024-2025, with forecasts reaching tens of billions by the mid-2030s.
Not an officially supported Google product. Google's documentation states users must acknowledge LangExtract's experimental status when deploying in production or citing it in publications. Organizations requiring vendor SLAs should factor this into procurement decisions.
Technical architecture
LangExtract's design rests on three interlocking capabilities: source grounding, multi-pass processing for long documents, and hallucination filtering. Together they address the core reliability problem with LLM-based extraction: outputs that cannot be verified against their source.
Source grounding maps every extracted entity to its exact character offset in the original text. This enables visual highlighting of the original fragment and direct verification of whether the returned data is actually supported by the document. The official repository describes it as: "Each extraction can be mapped to its exact location in the source text, enabling visual highlighting of the original fragment and review of whether the returned data is truly supported by the document." For regulated industries where audit trails are mandatory, this is the capability that separates LangExtract from generic LLM API calls.
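The grounding check itself is simple to reason about: an extraction is verifiable when the source slice at its recorded offsets reproduces the extracted text exactly. The sketch below illustrates the idea with a minimal stand-in record; the class and field names (`Extraction`, `start_pos`, `end_pos`) are invented for this example and are not LangExtract's actual API.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    """Minimal stand-in for an extraction record with source grounding."""
    extraction_text: str
    start_pos: int  # character offset where the span begins in the source
    end_pos: int    # character offset one past the end of the span

def is_grounded(source: str, ex: Extraction) -> bool:
    """An extraction is grounded if the source slice at its offsets
    reproduces the extracted text exactly."""
    return source[ex.start_pos:ex.end_pos] == ex.extraction_text

doc = "Patient presents with severe headache and nausea."
start = doc.index("severe headache")
ex = Extraction("severe headache", start, start + len("severe headache"))
print(is_grounded(doc, ex))  # True: the offsets point at the extracted span
```

A reviewer (or an automated audit step) can run the same check over every extracted entity to confirm nothing in the output lacks a verbatim anchor in the document.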
Multi-pass processing handles documents that exceed typical LLM context windows. LangExtract uses text fragmentation, parallel processing with up to 20 configurable workers, and multiple extraction passes to improve recall on documents exceeding 100 pages. The library successfully processed a 147,843-character text from Project Gutenberg and a 15,000-character video meeting transcript, extracting 67 distinct entities with exact character positioning from the latter.
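The chunk-and-fan-out strategy can be sketched independently of any LLM. The example below is a minimal illustration under stated assumptions: `extract_entities` is a trivial placeholder standing in for a real model call, and the function names and chunk sizes are invented for this sketch (only the 20-worker figure comes from the description above).

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so entities near a boundary
    are fully contained in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def extract_entities(chunk: str) -> list[str]:
    # Placeholder for a per-chunk LLM call; here: words in ALL CAPS.
    return [w for w in chunk.split() if w.isupper() and len(w) > 1]

def multi_pass_extract(text: str, workers: int = 20) -> set[str]:
    """Fan chunks out to parallel workers and merge (dedupe) the results."""
    chunks = chunk_text(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(extract_entities, chunks)
    return set().union(*results)

print(sorted(multi_pass_extract("ROMEO spoke. " * 40)))  # ['ROMEO']
```

In a real pipeline the per-chunk results would also carry character offsets relative to the full document, which is what lets LangExtract preserve source grounding across chunk boundaries.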
Hallucination filtering automatically detects when an LLM extracts content from few-shot examples rather than the input text. The library checks for null character intervals in extraction results and filters outputs not grounded in the source document, reducing the risk of fabricated data entering downstream systems.
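Conceptually, the filter drops any extraction whose character interval is null, since a null interval means the model produced a span it could not align to the source. A minimal sketch of that rule, using plain dicts rather than LangExtract's own extraction objects:

```python
def filter_ungrounded(extractions: list[dict]) -> list[dict]:
    """Drop extractions with no character interval: spans the model
    emitted but that could not be aligned to the source text."""
    return [e for e in extractions if e.get("char_interval") is not None]

candidates = [
    {"text": "severe headache", "char_interval": (22, 37)},
    # Null interval: likely copied from a few-shot example, not the input.
    {"text": "mild fever", "char_interval": None},
]
kept = filter_ungrounded(candidates)  # keeps only the grounded span
```

The effect is that fabricated or example-copied values are rejected before they can enter downstream systems, at the cost of occasionally dropping a correct extraction the aligner failed to locate.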
Model support and deployment
LangExtract supports four deployment paths, giving teams flexibility to match cost, privacy, and performance requirements.
Gemini 2.5 Flash provides the recommended speed-cost-quality balance for most extraction tasks. Gemini 2.5 Pro handles complex reasoning requirements where deeper inference is needed. Both are accessible via the Gemini API or Vertex AI. For large-scale batch workloads, LangExtract supports the Vertex AI Batch API for cost optimization, which can reduce inference costs significantly on high-volume pipelines.
OpenAI models are available through optional dependencies, and local model support via Ollama enables fully offline processing for sensitive documents requiring air-gapped deployment. A TypeScript port with OpenAI model support, created by developer Kyle Brown following the library's open-source release, extends LangExtract to JavaScript environments. Teams looking for an npm, JavaScript, or TypeScript package should note this community port is not an official Google release.
Custom providers are supported through a plugin interface, allowing organizations to integrate proprietary or fine-tuned models into the same extraction pipeline.
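LangExtract's actual provider interface is defined in its repository; as a rough illustration of the general plugin pattern, a registry keyed by a model-id prefix might look like the following. All names here (`register_provider`, `infer`, the `"mock"` backend) are hypothetical and invented for this sketch.

```python
from typing import Callable, Dict

# Hypothetical registry for illustration; not LangExtract's real interface.
_PROVIDERS: Dict[str, Callable[[str, str], str]] = {}

def register_provider(name: str):
    """Decorator that registers an inference backend under a model-id prefix."""
    def wrap(fn: Callable[[str, str], str]) -> Callable[[str, str], str]:
        _PROVIDERS[name] = fn
        return fn
    return wrap

@register_provider("mock")
def mock_provider(prompt: str, text: str) -> str:
    return '{"entities": []}'  # a stub model that returns empty JSON

def infer(model_id: str, prompt: str, text: str) -> str:
    """Route a request to whichever backend owns the model-id prefix."""
    prefix = model_id.split("-", 1)[0]
    return _PROVIDERS[prefix](prompt, text)
```

The point of the pattern is that extraction logic stays identical while the inference backend (hosted Gemini, a fine-tuned internal model, a local Ollama instance) is swapped by model id.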
Domain applications
LangExtract targets verticals where unstructured text volume is high and structured output is required for downstream systems.
Healthcare is the most developed use case. Google demonstrated clinical capabilities through RadExtract, an interactive demo hosted on Hugging Face Spaces that converts free-text radiology findings into standardized structured formats required for research and clinical care. The library's source grounding supports HIPAA audit trail requirements by maintaining character-level traceability for every extracted entity. Early adoption signals include integration with Microsoft Presidio for PII and PHI detection, suggesting enterprise security tooling is beginning to standardize around LLM-based extraction pipelines. Teams evaluating LangExtract for healthcare document exchange workflows may also consider Concord Technologies, which offers a purpose-built straight-through processing platform with EHR integration.
Legal and financial documents benefit from the relationship extraction capability, which connects related entities (counterparty names, contract dates, obligation clauses) while maintaining source traceability for each data point. The schema enforcement layer ensures outputs conform to predefined structures required for downstream contract management or compliance systems.
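Schema enforcement of this kind can be approximated with a simple conformance check. The sketch below uses an invented contract schema and plain dicts rather than LangExtract's own enforcement machinery; field names are illustrative only.

```python
# Hypothetical contract schema: field name -> expected type.
REQUIRED_FIELDS = {
    "counterparty": str,
    "effective_date": str,
    "obligation": str,
}

def conforms(record: dict) -> bool:
    """A record conforms if every required field is present with the
    expected type and no unexpected fields slipped in."""
    return (set(record) == set(REQUIRED_FIELDS)
            and all(isinstance(record[k], t) for k, t in REQUIRED_FIELDS.items()))

good = {"counterparty": "Acme Corp", "effective_date": "2025-01-15",
        "obligation": "deliver quarterly reports"}
bad = {"counterparty": "Acme Corp", "penalty": 5000}

print(conforms(good), conforms(bad))  # True False
```

A downstream contract-management or compliance system would gate ingestion on a check like this, rejecting any extraction batch whose records fail to conform.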
Engineering and technical documentation represent a third target, where LangExtract extracts specifications, part numbers, and relationships from dense technical text that resists template-based approaches.
Competitive positioning
LangExtract occupies a specific layer in the IDP stack rather than competing across the full pipeline. Its open-source nature and Google backing provide cost advantages over proprietary extraction tools like Docugami or Sensible.so, while its source grounding addresses compliance requirements that generic LLM API calls cannot satisfy.
The Docling vs. LangExtract comparison is the most common evaluation question this page receives. The distinction is architectural: Docling handles document parsing, layout recovery, and format conversion (PDF, DOCX, images). LangExtract handles semantic entity extraction from clean text. They are complementary, not competing. A production pipeline might use Docling to parse a PDF into clean text, then LangExtract to extract structured entities from that text.
Against full-stack IDP platforms, LangExtract trades breadth for depth. It does not handle OCR, document classification, workflow routing, or human-in-the-loop review. Vendors like Adlib take a comparable accuracy-validation approach for regulated enterprises, but through a complete platform rather than a developer library. The trade-off is explicit: LangExtract gives developers precise control over the extraction layer at zero licensing cost, while full-stack platforms provide end-to-end automation with vendor support.
Google acknowledges the library's limitations directly: "Deterministic rules and domain-specific fine-tuned models may offer better guarantees in some scenarios." Organizations requiring deterministic, auditable extraction at scale may still prefer domain-specific fine-tuned models or rule-based systems alongside or instead of LangExtract.
LangExtract strengths
- Free under Apache 2.0
- Source grounding to the character level
- Built-in hallucination filtering
- Supports Gemini, OpenAI, Ollama, and custom models
- Vertex AI Batch API for cost optimization
- Active community, including a TypeScript port
LangExtract limitations
- Not an officially supported Google product
- No OCR, classification, or workflow layer
- Extraction quality depends on instruction clarity and example quality
- Weaker deterministic guarantees than fine-tuned or rule-based systems
- No vendor SLA
Implementation examples
Basic extraction workflow
```python
import langextract as lx

# Define the extraction task with a prompt and few-shot examples
prompt = "Extract characters, emotions, and relationships in order of appearance."
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            )
        ],
    )
]

# Process a document (input_text holds the text to extract from)
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo."
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
```
Interactive visualization
LangExtract generates self-contained HTML files for reviewing extracted entities in their original context. The visualization works in Google Colab or as standalone files, providing immediate feedback on extraction quality without additional tooling.
```python
# Save results and generate an interactive HTML visualization
lx.io.save_annotated_documents([result], output_name="results.jsonl")
html_content = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content)
```
A data scientist reviewing the library noted: "LangExtract is the most production-ready unstructured to structured extraction library I've used. It treats extraction as an engineering problem, not just an AI problem."
Technical specifications
| Feature | Specification |
|---|---|
| License | Apache 2.0 (open source) |
| Current version | v1.2.0 (released March 22, 2026) |
| Primary language | Python |
| Recommended model | Gemini 2.5 Flash (speed/cost balance) |
| Advanced model | Gemini 2.5 Pro (complex reasoning) |
| Local model support | Ollama integration for offline processing |
| Cloud batch support | Vertex AI Batch API for cost optimization |
| Output format | Structured JSON with source mapping |
| Visualization | Interactive HTML with entity highlighting |
| Processing architecture | Chunking, parallel processing, multiple passes |
| Maximum demonstrated doc size | 147,843+ characters |
| Parallel workers | Up to 20 configurable workers |
| Languages supported | 90+ via Gemini models |
| TypeScript port | Community-maintained (not official Google release) |
Frequently asked questions
Is LangExtract free?
Yes. LangExtract is released under the Apache 2.0 license, which permits free commercial and research use. The library itself has no licensing cost. You will incur API costs if you use Gemini or OpenAI models for inference. Local model deployment via Ollama eliminates API costs entirely.
Is LangExtract fully open source?
Yes, the source code is published on GitHub under Apache 2.0. However, Google explicitly states LangExtract is not an officially supported Google product. There is no vendor SLA, and users must acknowledge the experimental status when deploying in production or citing it in publications.
What is the LangExtract release date?
Google announced and released LangExtract on July 30, 2025. The current version is v1.2.0, released on PyPI on March 22, 2026.
Does LangExtract do OCR?
No. LangExtract operates on text that has already been extracted from documents. It does not perform optical character recognition (OCR) or document parsing. For PDF and image processing, pair LangExtract with a document parsing tool such as Docling, which handles layout recovery and format conversion upstream.
Is there a LangExtract npm or JavaScript package?
There is no official Google npm package. A community TypeScript port with OpenAI model support was created by developer Kyle Brown following the open-source release. This port is not maintained by Google.
How does LangExtract compare to Docling?
They solve different problems. Docling parses documents (PDF, DOCX, images) into clean text and handles layout recovery. LangExtract extracts structured entities from clean text. A production pipeline typically uses both: Docling upstream for parsing, LangExtract downstream for semantic extraction.
Resources
- GitHub repository
- Google Developers documentation
- PyPI package (v1.2.0)
- RadExtract demo on Hugging Face
- DataCamp tutorial
Company information
LangExtract is an open-source project maintained by Google, headquartered in Mountain View, CA. It was first released in July 2025 and is published under the Apache 2.0 license. Google does not provide commercial support for the library.