Benchmarks for the hardest parts of document processing

Measured on long, real-world documents. We publish results, methodology, and source data so you can verify them yourself.

Benchmarks

Public benchmarks for document parsing, extraction, and splitting, with published methodology and source data.

Parsing

RealDoc-Bench

Measures whether parsers deliver accurate layouts, preserve reading order, and enable agents to correctly answer objective questions against real-world documents.

Q&A accuracy (Extend Parse 2.0): 95.7%

Layout F1 (adjusted F1): 0.847

Q&A set: 1,359 prompts across 581 docs

Blog: Methodology and results
Research: arXiv paper
Source code: GitHub repo
Hugging Face: Layout dataset
Hugging Face: Q&A dataset

Read RealDoc-Bench

Extraction

LongArray-Extract

Tests whether extraction systems preserve cardinality and return complete, schema-faithful arrays when the output grows from a dozen rows to thousands.
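To make "preserve cardinality" and "schema-faithful" concrete, here is a minimal Python sketch of how one might check an extracted array against a ground-truth array. It is not the benchmark's published scoring code. The `score_array` function, the `ArrayScore` type, the positional row comparison, and the exact-match rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ArrayScore:
    expected_rows: int
    returned_rows: int
    cardinality_preserved: bool  # True when row counts are equal
    row_accuracy: float          # share of expected rows reproduced exactly


def score_array(expected: list[dict], returned: list[dict],
                required_keys: set[str]) -> ArrayScore:
    """Hypothetical scorer: compares rows by position with exact matching."""
    matches = 0
    # zip stops at the shorter list, so missing rows simply never match.
    for exp_row, got_row in zip(expected, returned):
        schema_ok = set(got_row) == required_keys  # no missing or extra fields
        if schema_ok and got_row == exp_row:
            matches += 1
    # Assumption: an empty ground truth counts as fully accurate.
    accuracy = matches / len(expected) if expected else 1.0
    return ArrayScore(
        expected_rows=len(expected),
        returned_rows=len(returned),
        cardinality_preserved=len(expected) == len(returned),
        row_accuracy=accuracy,
    )


# Illustrative data only: three expected rows, two returned, one of them wrong.
expected = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
returned = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 25},
]
score = score_array(expected, returned, required_keys={"id", "amount"})
print(score)
```

Dividing by the number of expected rows rather than returned rows means a system that silently drops rows is penalized, which is the failure mode a long-array test is meant to surface.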

Mean accuracy (Extend across 45 PDFs): 99.2%

Run completion: 100% (45 of 45 PDFs completed)

Largest array: 2.2k rows in one PDF

Blog: Methodology and results
Hugging Face: Long-array extraction dataset

Read LongArray-Extract

Splitting