Apache PDFBox — Open-Source Java PDF Library
Apache PDFBox is an open-source Java library for PDF document processing, serving as the foundation for enterprise document workflows and government-scale applications.

Overview
Apache PDFBox is a project of the Apache Software Foundation that enables developers to create, manipulate, and extract content from PDF documents. An Apache project since 2008, when it entered the Apache Incubator, the library has matured into a foundational toolkit embedded in enterprise applications, commercial products, and open-source projects worldwide.
Its defining characteristic is lenient PDF parsing: the ability to process malformed and inconsistent documents that cause stricter libraries to fail. That reputation reached beyond the Java ecosystem in January 2026, when LibPDF, a new TypeScript library launched by Documenso, explicitly adopted PDFBox's parsing methodology as a benchmark for robust document processing. At government scale, the NIH's FAIR-SMART system demonstrated what that reliability means in practice: integrated alongside Apache POI and OpenCSV, PDFBox processed over 5 million biomedical research supplementary materials with a 99.46% conversion success rate, transforming diverse file formats into standardized BioC-compliant XML and JSON.
The project currently maintains two active branches: v3.0.7 (current stable) and v2.0.35 (legacy support). Version 3.0.7 shipped after a turbulent release cycle: the initial release candidate, staged February 2, 2026 by lead committer Andreas Lehmkühler, was reverted two days later after community tester Daniel Persson discovered a critical regression causing blank pages across multi-page PDFs. Once resolved, 3.0.7 shipped with 35 bug fixes, 10 improvements, and one new feature. It also patches CVE-2026-23907, a path traversal vulnerability affecting all prior releases back to 2.0.24. Organizations running any version before 3.0.7 should treat the upgrade as a security requirement, not an optional improvement.
Security advisory: CVE-2026-23907 (disclosed March 10, 2026) affects PDFBox versions 2.0.24 through 2.0.35 and 3.0.0 through 3.0.6. The vulnerability allows a malicious PDF to escape the intended extraction directory via PDComplexFileSpecification.getFilename(). It is patched only in 3.0.7. Organizations that copied the ExtractEmbeddedFiles example code into production should audit their implementations regardless of upgrade status.
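For dependency audits, the affected ranges stated in the advisory can be expressed as a simple version check. The following is a minimal sketch, not part of PDFBox: the class name `Cve202623907Check` and its methods are hypothetical, and it assumes plain three-part version strings without qualifiers such as `-SNAPSHOT`.

```java
import java.util.Arrays;

/** Illustrative helper (hypothetical, not a PDFBox class): tests a version against the advisory's ranges. */
final class Cve202623907Check {

    /** Parses "3.0.6" into {3, 0, 6}. Qualifiers like "-SNAPSHOT" are out of scope for this sketch. */
    static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected major.minor.patch, got: " + version);
        }
        return Arrays.stream(parts).mapToInt(Integer::parseInt).toArray();
    }

    /** Compares two parsed versions component by component. */
    static int compare(int[] a, int[] b) {
        for (int i = 0; i < 3; i++) {
            int c = Integer.compare(a[i], b[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    /** True when low <= v <= high (both bounds inclusive). */
    static boolean inRange(int[] v, String low, String high) {
        return compare(v, parse(low)) >= 0 && compare(v, parse(high)) <= 0;
    }

    /** Ranges taken from the advisory text: 2.0.24 through 2.0.35 and 3.0.0 through 3.0.6. */
    static boolean isAffected(String version) {
        int[] v = parse(version);
        return inRange(v, "2.0.24", "2.0.35") || inRange(v, "3.0.0", "3.0.6");
    }
}
```

A call such as `Cve202623907Check.isAffected("3.0.6")` reports whether a declared dependency falls inside the stated ranges. A `false` result only means the version is outside those ranges; it says nothing about other vulnerabilities, such as those noted for 1.x releases.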
What users say
Practitioners who rely on PDFBox in production consistently cite its lenient parsing as the reason they chose it over alternatives: the library handles documents that other Java PDF tools reject outright. Teams processing high-volume, heterogeneous document sets, particularly in government and research contexts, report that tolerance for malformed input is the feature that matters most at scale.
The 3.0.7 release cycle surfaced a recurring frustration among active contributors. Daniel Persson's benchmark data from February 4, 2026 showed average wall time on load-and-save operations rising from approximately 516 ms in 3.0.6 to 1,851 ms in the 3.0.7 release candidate, roughly a 3.6x slowdown. His assessment of the underlying change was direct: "I still don't like the new implementation of COSWriterObjectStream. The original thread-safe implementation is simpler to read and more correct, but I understand if you want the change for performance reasons. Removing the synchronization seems like the wrong way to do this."
Teams evaluating PDFBox for new integrations note that the migration guide for v3 remains marked as a work in progress in the 3.0.7 release notes. Maven Central data points to lagging upgrades: 3.0.5 shows 76 declared usages while 3.0.6 shows only 26, which suggests many projects are at least one patch release behind current. A long tail of 1.x deployments persists, with PDFBox 1.8.15 recording 12 active usages despite carrying a known vulnerability.
How Apache PDFBox processes documents
PDFBox's lenient parser handles malformed and inconsistent PDF documents that cause other libraries to fail. The library extracts text, images, and metadata from existing PDF files and rasterizes pages, converting vector-based PDF content to pixel-based images at a configurable resolution from 72 to 600 DPI. It also supports document manipulation (merge, split, and modify operations) and form handling: filling in forms, extracting their data, and flattening them.
Digital signature capabilities allow adding and verifying signatures, and PDF/A support covers creating and validating compliant documents for archiving. Version 3.0.7 adds PDF/A-4 conformance level support (PDFBOX-6090, PDFBOX-6088), extending the library's compliance coverage. For security-sensitive output, rasterization removes links, scripts, and macros and makes text non-selectable. Streaming manages memory for large documents without loading entire files into the heap.
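To make the DPI setting concrete, the sketch below estimates the pixel dimensions and raw image size of a rasterized page. It relies on the PDF convention that one default user-space unit is 1/72 inch. The class `RasterSize` is hypothetical and does not call PDFBox; the rounding rule and the 4-bytes-per-pixel figure (a packed layout such as Java's `BufferedImage.TYPE_INT_RGB`) are assumptions, so an actual renderer may differ by a pixel or use a different memory layout.

```java
/** Back-of-the-envelope raster sizing (hypothetical helper, not a PDFBox class). */
final class RasterSize {

    /** PDF default user space: 72 units (points) per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Converts a page dimension in points to pixels at the given DPI (rounded; an assumption). */
    static int pixels(double points, double dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Approximate bytes for an uncompressed image, assuming 4 bytes per pixel. */
    static long rgbBytes(double widthPt, double heightPt, double dpi) {
        return 4L * pixels(widthPt, dpi) * pixels(heightPt, dpi);
    }
}
```

For a US Letter page (612 x 792 points), 300 DPI gives 2550 x 3300 pixels, or about 33.7 MB per page under the 4-bytes-per-pixel assumption. Doubling the DPI quadruples the pixel count, which is why the upper end of the DPI range deserves care in high-volume pipelines.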
The v3 branch introduced compressed object streams and a restructured write pipeline. A December 2025 memory-optimization commit (PDFBOX-5169), intended to reduce footprint by reusing an internal byte array, and a January 2026 crash fix (PDFBOX-6142) interacted to produce the blank-pages regression caught at the RC stage. Fixing it required a developer to run a regression suite of 150+ files per change, plus three pre-release batches covering 270 problematic PDFs rendered across Poppler, PDF.js, Chromium, and SIPS, and large-scale tests across 50,000 pages. That testing is driven entirely by volunteer contributors.
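Reusing an internal byte array saves allocations but introduces a general hazard: any caller still holding the array sees later writes. The sketch below illustrates only that general class of problem. It is not PDFBox code, the class `ReusedBufferDemo` is hypothetical, and the source does not say that this exact pattern caused the regression.

```java
import java.util.Arrays;

/** Generic illustration of buffer aliasing (hypothetical class, not PDFBox code). */
final class ReusedBufferDemo {

    private final byte[] scratch = new byte[4];

    /** Returns the internal array itself: no allocation, but later calls overwrite what the caller holds. */
    byte[] encodeShared(byte value) {
        Arrays.fill(scratch, value);
        return scratch;
    }

    /** Returns a private copy: one extra allocation, but the result stays stable. */
    byte[] encodeCopied(byte value) {
        Arrays.fill(scratch, value);
        return scratch.clone();
    }
}
```

With `encodeShared`, a result obtained for value 1 silently changes to 2 after the next call; with `encodeCopied`, it does not. The trade-off between footprint and safety is the kind of decision that a per-change regression suite is meant to catch.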
Version 3.0.7 also resolves a stack overflow error for documents with 1,000 or more bookmarks or outlines (PDFBOX-6036, PDFBOX-6102, PDFBOX-6153), fixes German umlauts not rendering correctly (PDFBOX-6105), reduces access rights to temp files (PDFBOX-6100), and adds DFLT script support in the GSUB system for OpenType fonts (PDFBOX-6103).
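Stack overflows on large bookmark trees are typically a symptom of recursive traversal. One common remedy, shown here as a sketch, is to walk the tree with an explicit stack on the heap. `OutlineNode` and `OutlineWalker` are hypothetical stand-ins rather than PDFBox classes, and the source does not state that PDFBox's fix takes this form.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a bookmark/outline entry; not a PDFBox class. */
final class OutlineNode {
    final String title;
    final List<OutlineNode> children = new ArrayList<>();

    OutlineNode(String title) {
        this.title = title;
    }

    /** Appends a child and returns it, so deep chains are easy to build. */
    OutlineNode add(OutlineNode child) {
        children.add(child);
        return child;
    }
}

final class OutlineWalker {

    /** Depth-first, pre-order walk using an explicit stack instead of recursion. */
    static List<String> titles(OutlineNode root) {
        List<String> out = new ArrayList<>();
        Deque<OutlineNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            OutlineNode node = stack.pop();
            out.add(node.title);
            // Push children in reverse so the first child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return out;
    }
}
```

Because the pending nodes live in an `ArrayDeque` rather than on the call stack, nesting depth is bounded by heap size rather than thread stack size. This sketch assumes an acyclic tree; a production traversal of untrusted PDFs would also need to guard against cycles.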
Teams evaluating open-source alternatives for structured data extraction from unstructured text may also want to review LangExtract, Google's open-source Python library that applies LLMs with precise source grounding rather than rule-based parsing.
Use cases
Government and research applications
PDFBox serves as the PDF processing component in large-scale government systems. The NIH's FAIR-SMART platform converted over 5 million biomedical research supplementary materials into machine-readable BioC-compliant XML and JSON formats, achieving a 99.46% conversion success rate. PDFBox handled format diversity alongside Apache POI and OpenCSV, at volumes where tolerance for malformed input is essential.
Organizations in the public sector requiring on-premises document processing with strict data sovereignty requirements may also evaluate Captova Technologies, a vendor claiming 100+ pages/second processing speeds with on-premises deployment specifically targeting government and defense markets.
Document security and compliance
Organizations in legal, healthcare, and financial sectors use PDFBox rasterization capabilities to convert sensitive documents into secure image formats, removing interactive elements such as scripts, links, and macros while ensuring consistent display across platforms. Output types (RGB, GRAY, and BINARY) allow organizations to match image fidelity to compliance requirements.
CVE-2026-23907 carries a specific practical risk for enterprise intelligent document processing (IDP) implementations: organizations that lifted the ExtractEmbeddedFiles example code verbatim into production document processing pipelines are exposed even if they understood it as sample code. The affected range covers the entire active release window for both supported branches prior to 3.0.7. Teams with redaction requirements beyond rasterization may find VIDIZMO Redactor relevant; it provides AI-powered redaction across documents, audio, video, and images for regulated industries.
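Teams auditing extraction code can check that any file name taken from a PDF, such as the value returned by `PDComplexFileSpecification.getFilename()`, is confined to the output directory before a file is written. The following is a minimal sketch using only `java.nio.file`. The class `SafeExtractPath` is hypothetical, it is not the upstream patch, and the policy of reducing the name to its last segment is one assumption among several reasonable choices.

```java
import java.nio.file.Path;

/** Hypothetical helper: confines an untrusted embedded-file name to a base directory. */
final class SafeExtractPath {

    /**
     * Resolves untrustedName inside baseDir, keeping only the last path segment.
     * Throws IllegalArgumentException if no usable name remains.
     */
    static Path resolveInside(Path baseDir, String untrustedName) {
        if (untrustedName == null || untrustedName.isEmpty()) {
            throw new IllegalArgumentException("Empty file name");
        }
        // Treat both '/' and '\' as separators, whatever the host platform.
        String normalized = untrustedName.replace('\\', '/');
        String leaf = normalized.substring(normalized.lastIndexOf('/') + 1);
        if (leaf.isEmpty() || leaf.equals(".") || leaf.equals("..")) {
            throw new IllegalArgumentException("Unusable file name: " + untrustedName);
        }
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = base.resolve(leaf).normalize();
        // Defense in depth: the result must be strictly inside the base directory.
        if (!target.startsWith(base) || target.equals(base)) {
            throw new IllegalArgumentException("Path escapes base directory: " + untrustedName);
        }
        return target;
    }
}
```

Under this policy a name like `../../etc/passwd` is written as `passwd` inside the output directory rather than outside it. A stricter variant would reject any name containing a separator instead of trimming it, which is easier to reason about when audit logs must show exactly what was attempted.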
Developer tooling and library foundations
PDFBox functions as a parsing reference for other ecosystems. When Documenso's LibPDF TypeScript library launched in January 2026, it adopted PDFBox's lenient parsing methodology explicitly, showing that the library's approach to malformed-document handling has become a de facto standard against which new implementations benchmark themselves.
Developers building production document pipelines on top of PDFBox who need a no-code layer for LLM-powered extraction workflows may find Unstract a relevant complement; it is an open-source platform designed for production-grade IDP with hallucination mitigation.
Technical specifications
| Feature | Specification |
|---|---|
| Programming language | Java |
| License | Apache License 2.0 |
| PDF specification support | Up to PDF 1.7 |
| Platform compatibility | Cross-platform (Java-based) |
| Current stable version (v3) | 3.0.7 |
| Current legacy version (v2) | 2.0.35 |
| Rasterization DPI range | 72-600 DPI |
| Image output types | RGB, GRAY, BINARY |
| Memory management | Streaming capability for large documents |
| Active branches | v3.x (active development), v2.x (legacy support) |
| PDF/A conformance | PDF/A-4 added in 3.0.7 |
| Security patch | CVE-2026-23907 patched in 3.0.7 |
Version history and migration
v2.0.x legacy branch
Still maintained at 2.0.35, but Maven Central data shows adoption declining. Organizations on this branch are exposed to CVE-2026-23907 and should plan migration to v3.
v3.0.x active branch
Active development target since v3.0.0. Compressed object streams and a restructured write pipeline differentiate it from v2. The migration guide remains a work in progress as of 3.0.7.
v3.0.7 current stable
Released early 2026 after the RC regression was resolved. Includes 35 bug fixes, PDF/A-4 support, and the CVE-2026-23907 security patch. Verify against the PDFBox download page before adopting in production.
1.x long tail
PDFBox 1.8.15 records 12 active Maven usages despite carrying a known vulnerability. Older 1.x versions each carry at least one vulnerability. No migration path exists other than upgrading to v3.
Resources
- Apache PDFBox website
- PDFBox 3.0.7 Javadoc
- PDF rasterization in Java: technical guide
- NIH FAIR-SMART: government-scale PDFBox deployment
- CVE-2026-23907 path traversal vulnerability details
- Maven Central: PDFBox adoption data
- PDFBox 3.0.7 RC regression discussion, February 2026
- Open-source OCR tools: comparative context
- Self-hosted document processing: deployment guide
Company information
- Parent organization: Apache Software Foundation
- Website: pdfbox.apache.org
- Mailing list: users@pdfbox.apache.org
- Bug reporting: Apache JIRA: PDFBOX project
- Source code: GitHub: apache/pdfbox
- CVE history: CVEdetails: Apache PDFBox