Floatingpoint AI builds training datasets for document intelligence models, covering parsing, layout analysis, form understanding, and actuarial modeling across 100+ languages and 21 domains.

Overview

Floatingpoint AI occupies a distinct position in the document AI market: it does not process documents for end users. Instead, it produces the training data that other teams use to build and fine-tune document intelligence models. The company's thesis is that models will only become reliable co-workers in knowledge work if they are trained on data that captures the full complexity of real-world documents, not clean synthetic samples or narrow benchmarks.

The product catalog spans four dataset families: parsing (end-to-end document understanding including layout, reading order, OCR across 50+ languages, table-to-HTML, form extraction, chart-to-Mermaid, and formula recognition), layout analysis (19 element types across 21 domains, complexity-stratified and distribution-matched to real traffic), form understanding (a graph-schema dataset covering text, indicators, inputs, and graphics across 14 data classes), and actuarial modeling (source documents transformed into Excel models with step-by-step expert reasoning, built by practicing actuaries from American insurers). Each dataset ships as a complete data product: core annotated data, synthetic expansion built on top of that core, an interactive delivery platform with sourcing and annotation logic, and a commitment to iterative improvement after delivery.

This positions Floatingpoint AI closer to data infrastructure vendors like Scale AI than to IDP platforms like ABBYY or Rossum that process documents in production. The target customers are teams building or fine-tuning document AI models, whether at IDP vendors, hyperscalers, or enterprise AI groups, not the document processing teams themselves.

How Floatingpoint AI Processes Documents

Floatingpoint AI's pipeline runs in the opposite direction from a typical IDP vendor. Rather than ingesting customer documents and returning structured data, the company ingests source documents and returns annotated training datasets. Domain specialists source, annotate, and review every record. A layered quality assurance and quality control process runs throughout, with distributions built to match real-world document traffic rather than idealized samples.

The parsing dataset sets the broadest scope: it covers the full extraction stack from raw document input to structured output, including OCR across 50+ languages, reading order detection, table-to-HTML conversion, form field extraction, formula recognition, and chart-to-Mermaid conversion. This breadth reflects the company's view that document understanding is not a single capability but a pipeline of interdependent tasks, each of which requires its own training signal.
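
To make the multi-task scope concrete, here is a minimal sketch of what a single parsing record could look like. The field names, values, and structure are illustrative assumptions; Floatingpoint AI has not published its record schema.

```python
# Hypothetical shape of one parsing-dataset record. All field names are
# assumptions for illustration, not Floatingpoint AI's published schema.
record = {
    "doc_id": "invoice-00042",
    "language": "de",                       # one of the 50+ OCR languages
    "reading_order": [0, 2, 1, 3],          # element indices in reading sequence
    "elements": [
        {"type": "text", "bbox": [40, 52, 520, 88],
         "ocr": "Rechnung Nr. 2024-118"},
        {"type": "table", "bbox": [40, 120, 560, 340],
         "html": "<table><tr><td>Pos.</td><td>Betrag</td></tr></table>"},
        {"type": "chart", "bbox": [40, 360, 560, 520],
         "mermaid": "pie\n  \"Q1\" : 38\n  \"Q2\" : 62"},
        {"type": "formula", "bbox": [40, 540, 300, 572],
         "latex": r"P = \sum_i v_i (1 + r)^{-t_i}"},
    ],
}
```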

The layout analysis dataset applies a different design philosophy. Rather than maximizing variety, it stratifies complexity from simple single-column layouts to deeply nested multi-column structures, then matches the distribution of those complexity levels to what models actually encounter in production traffic. The 19 element types span 21 domains, covering the range from financial statements to insurance forms to scientific papers.
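
The distribution-matching idea can be sketched as stratified sampling against a target mix. The function below is a minimal illustration under assumed metadata fields; Floatingpoint AI's actual method is not published.

```python
import random
from collections import defaultdict

def match_distribution(records, target, key="complexity", n=10_000, seed=7):
    """Sample records so the mix of complexity strata matches a target
    distribution, e.g. one estimated from production document traffic.
    A sketch only; assumes each record carries a 'complexity' label."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for r in records:
        by_stratum[r[key]].append(r)
    sample = []
    for stratum, share in target.items():
        pool = by_stratum[stratum]
        k = min(len(pool), round(share * n))
        sample.extend(rng.sample(pool, k))
    return sample

# Target shares as measured from real traffic (numbers are illustrative).
target = {"single_column": 0.45, "multi_column": 0.35, "nested": 0.20}
# matched = match_distribution(records, target)
```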

Form understanding takes a graph-schema approach that is uncommon in publicly available datasets. Each form element, whether a text field, checkbox, dropdown, or graphic, is boxed and linked to related elements through a deterministic graph structure. Typing across 14 data classes supports three distinct downstream tasks: form filling, form parsing, and field extraction. This schema design allows model developers to train on the relational structure of forms rather than treating each field as an independent extraction target.
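
A graph schema of this kind might be represented as typed nodes plus labeled edges, as in the sketch below. The class names, relation labels, and fields are assumptions for illustration; the dataset's real 14-class typing is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class FormNode:
    node_id: str
    bbox: tuple[int, int, int, int]   # pixel box around the form element
    data_class: str                   # e.g. "label", "checkbox", "input"
    text: str = ""

@dataclass
class FormGraph:
    nodes: dict[str, FormNode] = field(default_factory=dict)
    # Edges are (source, destination, relation) triples, so related
    # elements are linked deterministically rather than inferred.
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def link(self, src: str, dst: str, relation: str) -> None:
        self.edges.append((src, dst, relation))

# A label node explicitly linked to the input field it describes.
g = FormGraph()
g.nodes["n1"] = FormNode("n1", (40, 60, 180, 84), "label", "Policy number")
g.nodes["n2"] = FormNode("n2", (190, 60, 420, 84), "input")
g.link("n1", "n2", "describes")
```

Training on the edge set is what distinguishes this from per-field extraction: the model learns which label governs which input, not just where each box sits.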

The actuarial modeling dataset is the most specialized offering. Practicing actuaries from American insurers transform messy source documents into professional Excel models, tracing their reasoning step by step throughout the process. The dataset targets the gap between raw insurance documents and the structured financial models that actuaries produce from them, a workflow that requires domain expertise that general-purpose annotation pipelines cannot replicate.
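
A record pairing a source document with a reasoning trace and a spreadsheet output might look like the following. Everything here, including the formulas, is a hypothetical illustration of the described structure, not actual dataset content.

```python
# Hypothetical shape of one actuarial-modeling record: source document,
# step-by-step expert reasoning, and resulting Excel model cells.
example = {
    "source_doc": "loss_runs_2023.pdf",
    "reasoning_steps": [
        "Identify accident-year loss triangles in the loss-run exhibit.",
        "Compute age-to-age development factors from cumulative paid losses.",
        "Select factors and project ultimate losses per accident year.",
    ],
    "excel_model": {
        "Sheet1!D2": "=C2/B2",              # age-to-age development factor
        "Sheet1!E2": "=B2*PRODUCT(D2:D5)",  # projected ultimate losses
    },
}
```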

Synthetic expansion is available on top of every core dataset. Floatingpoint AI describes this as "rigorously developed" on top of core data, with rich metadata for building custom splits and training recipes. The interactive delivery platform lets buyers explore sourcing decisions, annotation logic, distribution choices, and machine learning findings from the annotation process, providing transparency that is rare in commercial training data products.
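
In practice, "rich metadata for custom splits" means buyers can filter records on attributes like domain, complexity, or synthetic origin. The sketch below assumes per-record metadata fields with those names; the platform's real query interface is not documented.

```python
def build_split(records, *, domains=None, max_complexity=None, synthetic=None):
    """Filter records by metadata to build a custom training split.
    A sketch assuming hypothetical 'domain', 'complexity', and
    'synthetic' metadata fields on each record."""
    out = []
    for r in records:
        meta = r["metadata"]
        if domains is not None and meta["domain"] not in domains:
            continue
        if max_complexity is not None and meta["complexity"] > max_complexity:
            continue
        if synthetic is not None and meta["synthetic"] != synthetic:
            continue
        out.append(r)
    return out

records = [
    {"metadata": {"domain": "insurance", "complexity": 1, "synthetic": False}},
    {"metadata": {"domain": "legal", "complexity": 4, "synthetic": True}},
]

# e.g. core (non-synthetic) insurance and finance records, low complexity only
split = build_split(records, domains={"insurance", "finance"},
                    max_complexity=2, synthetic=False)
```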

Use Cases

Document AI Model Development

Teams building foundation models or fine-tuning existing models for document understanding use Floatingpoint AI's datasets to close accuracy gaps on specific document types. The parsing dataset is the broadest entry point, covering the full extraction pipeline. Teams working on specific sub-tasks, such as table extraction or form parsing, can license the relevant dataset independently. The iterative delivery model means that accuracy, schema coverage, and language support improve continuously after the initial license, which matters for teams running ongoing training cycles rather than one-time fine-tuning runs.

Insurance and Actuarial AI

The actuarial modeling dataset addresses a specific bottleneck: the gap between raw insurance source documents and the structured Excel models that actuaries derive from them. By tracing expert reasoning step by step, the dataset provides training signal not just for the output format but for the intermediate reasoning process. This is relevant for teams building AI systems that need to replicate actuarial judgment, not just extract fields from forms.

Multilingual Document Processing

With coverage across 100+ languages in the parsing dataset and 50+ languages in the OCR component, Floatingpoint AI targets teams building document AI for non-English markets where training data is scarce. The 21-domain coverage means the language data is not limited to a single document type, reducing the risk of models that perform well on English financial documents but fail on non-English equivalents.

Technical Specifications

  • Dataset Types: Parsing, Layout Analysis, Form Understanding, Actuarial Modeling
  • Language Coverage: 100+ languages (parsing); 50+ languages (OCR component)
  • Domain Coverage: 21 domains
  • Layout Element Types: 19 element types
  • Form Data Classes: 14 typed data classes
  • Annotation Method: Human domain specialists with layered QA/QC
  • Synthetic Expansion: Available on top of core datasets
  • Delivery Platform: Interactive platform with sourcing, distributions, annotation logic, ML learnings
  • Delivery Model: Iterative, with continuous accuracy, schema, and coverage updates post-delivery
  • Custom Builds: Bespoke partnerships with collaborative schema design
  • Deployment: Cloud-based delivery platform
  • Pricing: Contact for licensing; off-the-shelf and custom options available

Resources

  • Website

Company Information

New York, NY, USA. Founding year not publicly disclosed. No funding rounds, employee count, or leadership information has been published as of July 2025. The company operates as a data products business serving AI teams building document intelligence models, with both off-the-shelf dataset licensing and bespoke partnership engagements available. No third-party analyst coverage or independent benchmark results have been identified.