
Benchmarks for the hardest parts of document processing

Measured on long, real-world documents. We publish results, methodology, and source data so you can verify them yourself.


Benchmarks

Public benchmarks for document parsing, extraction, and splitting, with published methodology and source data.

Parsing

RealDoc-Bench

Measures whether parsers produce accurate layouts, preserve reading order, and let agents correctly answer objective questions about real-world documents.

Q&A accuracy

Extend Parse 2.0

95.7%

Layout F1

Adjusted F1

0.847

Q&A set

Prompts across 581 docs

1,359

Read RealDoc-Bench

Extracting

LongArray-Extract

Tests whether extraction systems preserve cardinality and return complete, schema-faithful arrays when the output grows from a dozen rows to thousands.

Mean accuracy

Extend across 45 PDFs

99.2%

Run completion

45 of 45 PDFs completed

100%

Largest array

Rows in one PDF

2.2k

Read LongArray-Extract

Splitting

PoliTax Split

Evaluates document splitting on long, compound tax filings where frontier models miss subtle boundaries across hundreds of pages.

Best harness F1

Claude Opus 4.6

72.48%

Recall lift

Points across models

17-44

Evaluation set

Largest PoliTax docs

30

Read PoliTax Split

Turn your documents into high quality data