Yapit — Open-Source Text-to-Speech for Documents
Yapit is an open-source text-to-speech platform built specifically for reading documents aloud. It handles PDFs, academic papers, and web pages with intelligent processing of math, citations, and figures, elements that generic TTS tools typically mishandle.
Overview
Yapit is a self-hostable TTS application that accepts a URL or PDF and reads the content aloud. Unlike general TTS tools, it is designed for document fidelity: math equations are spoken as alt text rather than raw LaTeX, citation markers and figure labels are naturalized, and page headers and footers are stripped before synthesis.
The project is licensed under AGPL-3.0 and available on GitHub. It supports 170+ voices across 15 languages and can run the TTS engine entirely in the browser via Kokoro-82M on WebGPU. No server is needed for basic usage.
Yapit differs from commercial IDP vendors in scope. Where platforms like ABBYY or Nanonets focus on structured data extraction and workflow automation, Yapit focuses on accessibility and document consumption through audio output. It is relevant to organizations researching open-source document intelligence tooling or building accessible document workflows.
Document Processing Capabilities
Yapit's document pipeline covers:
- PDF ingestion: Layout analysis via DocLayout-YOLO detects figures, tables, and headers for accurate extraction
- Web page extraction: Clean content extraction via the defuddle library strips navigation, ads, and boilerplate
- Academic paper handling: Math rendered visually but spoken as alt text; citation markers and figure labels converted to natural speech patterns
- Markdown export: Append `/md` to any document URL to retrieve clean markdown; `/md-annotated` adds TTS annotations
- Customizable extraction: Extraction behavior driven by a configurable prompt, supporting any OpenAI-compatible vision API (OpenRouter, vLLM, Ollama, Google Gemini)
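The citation and figure-label naturalization described above can be pictured with a few simple substitutions. This is an illustrative sketch only, not Yapit's actual implementation; the function name and the specific rewrite rules are assumptions:

```python
import re

def naturalize_markers(text: str) -> str:
    """Illustrative sketch: rewrite citation markers and figure labels
    into phrases a TTS engine can speak naturally. The patterns below
    are examples of the kind of transformation, not Yapit's real rules."""
    # "[12]" -> "reference 12"
    text = re.sub(r"\[(\d+)\]", r"reference \1", text)
    # "Fig. 3" / "fig. 3" -> "Figure 3"
    text = re.sub(r"\bfig\.\s*(\d+)", r"Figure \1", text, flags=re.IGNORECASE)
    return text

print(naturalize_markers("As shown in Fig. 2, prior work [4] reports gains."))
# prints: As shown in Figure 2, prior work reference 4 reports gains.
```

A real pipeline would need many more rules (author-year citations, equation references, ranges like "[3-5]"), but the shape is the same: targeted rewrites applied before the text reaches the synthesizer.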
TTS Engine Options
| Engine | Mode | Notes |
|---|---|---|
| Kokoro-82M | Browser (WebGPU/CPU) | Runs locally, no server required |
| Kokoro-FastAPI | Self-hosted server | Docker worker, GPU/CPU |
| Inworld TTS | Hosted | Cloud API |
| OpenAI-compatible | Any | vLLM-Omni, AllTalk, Chatterbox TTS |
Voice auto-discovery is supported when the server exposes `GET /v1/audio/voices`.
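A client can use that endpoint to list available voices. A minimal sketch, assuming the server returns a JSON body with a top-level `voices` array (the exact schema may vary between OpenAI-compatible implementations):

```python
import json
from urllib.request import urlopen

def parse_voices(payload: dict) -> list[str]:
    """Extract voice names from a voices response. Assumes a top-level
    "voices" array; the field name may differ between servers."""
    return list(payload.get("voices", []))

def discover_voices(base_url: str) -> list[str]:
    """GET {base_url}/v1/audio/voices and parse the result."""
    with urlopen(f"{base_url.rstrip('/')}/v1/audio/voices") as resp:
        return parse_voices(json.load(resp))
```

Separating the HTTP fetch from the parsing keeps the schema assumption in one small function that is easy to adjust per server.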
Deployment
Yapit is designed for self-hosting via Docker Compose:
```shell
make self-host
```
Default mode is single-user with no login required. Multi-user mode with authentication (Stack Auth + ClickHouse) is available via `AUTH_ENABLED=true`. GPU workers for Kokoro TTS and YOLO figure detection are optional add-ons, with NVIDIA MPS support for multi-worker GPU sharing.
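As a sketch, assuming the self-host target reads `AUTH_ENABLED` from the environment (only the variable name is documented here):

```shell
# Default: single-user mode, no login
make self-host

# Multi-user mode: enables Stack Auth login and ClickHouse analytics
AUTH_ENABLED=true make self-host
```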
The tech stack uses a Python backend (managed via uv), a Node.js/Vite frontend, and Redis for job queuing.
Features
- 170+ voices across 15 languages
- Document outliner and Vim-style keyboard shortcuts
- Media key support and adjustable playback speed
- Dark mode and share-by-link
- MP3 audio export on roadmap
Technical Specifications
| Feature | Specification |
|---|---|
| License | AGPL-3.0 |
| Backend | Python (uv) |
| Frontend | Node.js / Vite |
| TTS Model | Kokoro-82M (WebGPU / CPU / GPU worker) |
| PDF Layout | DocLayout-YOLO |
| Web Extraction | defuddle |
| Auth | Stack Auth (optional) |
| Analytics | ClickHouse (optional, multi-user mode) |
| Deployment | Docker Compose |
Company Information
Yapit is an open-source project available at github.com/yapit-tts/yapit. No commercial entity or funding information is publicly disclosed. The project website is yapit.md.
For commercial IDP platforms with enterprise support, see open-source IDP vendors or compare options in the vendor finder.