LongArray-Extract: A Benchmark for Complete Structured Extraction at Scale

Jing Reyhan, Joseph Bajor

5 min read

Jun 2, 2026

Engineering

At Extend, we are committed to delivering reliable, scalable, and performant document infrastructure that leading AI teams can trust to ship agents in production. This is not a theoretical commitment. We build our products and benchmarks around real-world use cases and systems-of-record documents, because production document processing is judged by how a system performs on the hardest documents, not the cleanest examples. Long array extraction is one such hard problem in production.

Extraction is central to document processing because it captures the critical inputs in unstructured documents and transforms them into structured data. Long array extraction is the ability to capture every repeated item in a document, preserving the cardinality of the underlying record set, the attributes associated with each record, and the relationships between those attributes across pages, sections, tables, and document layouts.

For example, a clinical adverse-event listing might contain 1,283 individual events spread over 234 pages. A Wells Fargo combined statement covers two accounts and 1,100 transactions over 60 pages. A motion for summary judgment contains 705 numbered factual paragraphs over 96 pages.

LongArray-Extract measures whether a system can accurately reconstruct large sets of repeated records from benchmark documents modeled on production systems of record and the long-array use cases we see in customer pipelines. Similar to RealDoc-Bench, it is designed around production-style documents and customer-relevant workloads. Both benchmarks are open-sourced to provide transparency and reproducibility.

TLDR

LongArray-Extract tests whether extraction systems can return complete, schema-faithful arrays when the output grows from dozens of rows to more than 2,000.

Extend delivered 100% completion and 99.2% mean per-document accuracy across all 45 benchmark PDFs. The closest evaluated system reached 97.4%, and the long tail exposed tradeoffs in completion rate, latency, and partial-output behavior.

TLDR benchmark summary

Mean per-document accuracy: 99.2% (failures count as 0%)
Completion: 45/45 (Extend scored every document)
Largest array: 2.2k transactions in one PDF

[Figure: Mean per-document accuracy across all 45 documents. Each PDF contributes one score. Failed or timed-out runs count as 0%, so completion is reflected in the accuracy.]

Leaderboard order (Extend: 99.2%):
01 Extend
02 Reducto Deep Extract
03 Opus 4.7
04 Reducto Standard
05 Pulse Effort
06 Pulse Auto
07 Gemini Pro
08 LlamaParse Standard
09 GPT-5.5
10 Gemini Flash

How we evaluated

Most extraction tests stop at short documents or fixed field sets. This benchmark focuses on the production failure mode we see when the answer is itself large: the model or extraction system has to keep emitting rows, preserve row-level context, avoid duplicates, and return every expected item.

The dataset

LongArray-Extract covers use cases we see in production across customer pipelines: dense financial statements with detailed transaction rows, clinical reports with extensive event rows, and court documents with enumerated paragraphs. Each document has generated ground truth, so missing rows and incorrect fields can be scored directly instead of inferred from spot checks.

Hugging Face dataset

Dataset example (source PDF): wf_08_n2200_combined-0VqBcuDi.pdf

What we measured

The primary metric is per-document extraction accuracy. Each system returns structured JSON for the full document. We compare the extracted output to the expected output field by field, then aggregate those matches into a document score.
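As a rough illustration of field-by-field scoring, the sketch below aligns rows by position and reports the share of fields that match exactly. The name `score_document`, the positional alignment, and the choice to let extra rows enlarge the denominator are assumptions made for illustration; the benchmark's own scorer may align and weight rows differently.

```python
from typing import Any

Row = dict[str, Any]


def score_document(expected: list[Row], extracted: list[Row], fields: list[str]) -> float:
    """Return the fraction of fields reproduced exactly, from 0.0 to 1.0."""
    # Assumption: extra or duplicated rows count against the score via the denominator.
    total = max(len(expected), len(extracted)) * len(fields)
    if total == 0:
        return 1.0
    matched = 0
    for i, want in enumerate(expected):
        # Rows the system never returned contribute zero matches.
        got = extracted[i] if i < len(extracted) else {}
        matched += sum(1 for f in fields if f in got and got[f] == want.get(f))
    return matched / total


expected = [
    {"date": "2026-01-02", "amount": "10.00"},
    {"date": "2026-01-03", "amount": "5.00"},
]
extracted = [{"date": "2026-01-02", "amount": "10.00"}]  # second row was dropped
print(score_document(expected, extracted, ["date", "amount"]))  # 0.5
```

Positional alignment is the simplest choice, but a single skipped row shifts every later comparison, so a real scorer would likely align rows on a key before comparing fields.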

The benchmark also tracks completion and latency. A response that fails, times out, or returns an unscoreable output is not the same as a low-quality completed extraction. For reporting, we show both quality and completion behavior because production systems need both.

Methodology

We evaluated Extend, frontier foundation models, and document-AI platforms on the same benchmark documents. The raw foundation-model baselines used a single-pass harness with chunking disabled, matching the "send this document to a model and ask for JSON" experience. Extend and competitor platforms were evaluated through their extraction systems and API modes.

All outputs are normalized into a common scoring format. Bank statement descriptions use fuzzy matching to avoid over-penalizing punctuation or whitespace differences. Legal pleading court names also use fuzzy matching, and vendor-specific response shape differences are normalized before scoring to ensure fairness. Failed runs are tracked separately; when we report mean per-document accuracy, failed documents contribute zero.
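One way to picture fuzzy matching on description fields is sketched below, using Python's standard `difflib`. The helper names, the normalization rules, and the 0.9 threshold are illustrative assumptions; the post does not say which similarity measure or cutoff the benchmark uses.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def fuzzy_equal(expected: str, extracted: str, threshold: float = 0.9) -> bool:
    """True when the normalized strings are similar enough (threshold is an assumed value)."""
    a, b = normalize(expected), normalize(extracted)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Punctuation and spacing differences are forgiven; different descriptions are not.
print(fuzzy_equal("POS PURCHASE - STARBUCKS #1234", "pos purchase  starbucks 1234"))  # True
print(fuzzy_equal("ACH DEPOSIT PAYROLL", "WIRE TRANSFER OUT"))  # False
```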

Agentic modes that failed on most documents are excluded from the scored leaderboard rather than treated as comparable complete systems.

What we found

Mean per-document accuracy weights every PDF equally, so a small document and a large document each contribute one score. Failed or timed-out runs count as 0%, which makes the metric reflect whether a system can complete the whole extraction, not just how accurate it is on the rows it returns.
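The aggregation described above can be sketched in a few lines. The function name and the use of `None` to mark a failed or timed-out run are assumptions for illustration, and the four-document run is hypothetical.

```python
from typing import Optional


def mean_per_document_accuracy(scores: list[Optional[float]]) -> float:
    """Average one score per PDF; None marks a failed or timed-out run and counts as 0."""
    if not scores:
        raise ValueError("no documents to score")
    return sum(s if s is not None else 0.0 for s in scores) / len(scores)


# Hypothetical run: three documents completed, one timed out.
runs = [1.0, 0.98, 0.96, None]
completed = [s for s in runs if s is not None]

print(mean_per_document_accuracy(runs))  # pulled down by the failed document
print(sum(completed) / len(completed))   # accuracy on completed documents only
```

The two printed values differ only because of the failed document, which is the gap the metric is designed to expose.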

Extend led the aggregate benchmark with 99.2% mean per-document accuracy and 100% completion across all 45 documents.

Reducto Deep Extract was the next closest system at 97.4%, followed by Claude Opus 4.7 at 83.2% and Reducto Standard at 80.9%.

[Figure: Mean per-document accuracy across all 45 documents. Each PDF contributes one score. Failed or timed-out runs count as 0%. Leaderboard order: 01 Extend, 02 Reducto Deep Extract, 03 Opus 4.7, 04 Reducto Standard, 05 Pulse Effort, 06 Pulse Auto, 07 Gemini Pro, 08 LlamaParse Standard, 09 GPT-5.5, 10 Gemini Flash.]

1. The gap opens in the long tail

Below roughly 200 rows, many systems look close to production-ready. Past that point, the benchmark separates systems that keep enumerating from systems that collapse into partial arrays, sparse samples, or terminal failures.

[Figure: Long-tail divergence. The spread between the best-system and worst-system scores widens as document size increases. Each point summarizes that spread at one document size.]

The full benchmark includes small, medium, and long-tail documents because production pipelines are judged by the hardest records.

2. Completion changes how accuracy should be read

Completion rate illustrates why reliability has to sit next to accuracy. Some systems scored well on documents they completed. But across the full benchmark, incomplete or failed runs changed the ranking once every document was counted.

For production pipelines, "accurate when it returns" is a different property from "reliably returns the whole output." The benchmark reports both so teams can see whether a system is failing loudly, failing silently, or completing with degraded quality.

[Figure: Reliability x accuracy. Documents successfully completed (x-axis) versus accuracy on completed documents (y-axis). Extend: 100.0% completed, 99.2% accuracy on completed documents.]

Extend completed all 45 PDFs with 99.2% aggregate accuracy.

LlamaParse Standard was accurate on many completed outputs, but it completed 26 of 45 documents, so its aggregate mean per-document accuracy falls to 47.2%.

X-axis is the fraction of all 45 benchmark PDFs completed. Y-axis is mean accuracy on documents where the system returned a scoreable output. Aggregate mean per-document accuracy still counts failed documents as 0%.
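Because failed documents contribute 0%, the aggregate score is the completion rate multiplied by the mean accuracy on completed documents. The sketch below backs out the second factor from the figures quoted for LlamaParse Standard. The function name is hypothetical, and the resulting value is derived arithmetic, not a number reported by the benchmark.

```python
def implied_completed_accuracy(aggregate: float, completed: int, total: int) -> float:
    """Back out mean accuracy on completed documents from the aggregate score.

    Relies on failed documents scoring 0, so that
    aggregate = (completed / total) * accuracy_on_completed.
    """
    if completed <= 0 or total <= 0 or completed > total:
        raise ValueError("need 0 < completed <= total")
    return aggregate * total / completed


# LlamaParse Standard as quoted above: 26 of 45 documents completed, 47.2% aggregate.
print(round(implied_completed_accuracy(47.2, 26, 45), 1))  # 81.7
```

Under that reading, a system that scores roughly 82% on the documents it finishes drops to 47.2% once the 19 unfinished documents are counted.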

3. Production tradeoffs include speed

Latency matters only after a system can return the right output. LongArray-Extract therefore compares speed against aggregate accuracy instead of treating runtime as a standalone win.

Extend sits on the aggregate Pareto frontier: no evaluated system is both faster and more accurate. Reducto Standard is slightly faster on mean latency but materially lower in accuracy, while Reducto Deep Extract is slower and still trails Extend on aggregate accuracy.

[Figure: Latency x accuracy. Mean wall-clock latency per document (x-axis) versus mean per-document accuracy (y-axis). Extend: 221s, 99.2% accuracy.]

Reducto Standard is slightly faster on mean latency but 18.3 points lower in accuracy. Reducto Deep Extract is 3.8x slower than Extend while trailing by 1.8 points.

Latency is mean per-document wall-clock time, shown on a log scale. Accuracy is aggregate mean per-document accuracy across all 45 PDFs, with failed or timed-out runs counted as 0%.
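A Pareto frontier check can be sketched as follows. The code uses only the six latency and accuracy pairs that appear in this post, so it is an illustration of the idea over a subset of the ten evaluated systems; the type and function names are assumptions.

```python
from typing import NamedTuple


class System(NamedTuple):
    name: str
    latency_s: float  # mean wall-clock seconds per document
    accuracy: float   # aggregate mean per-document accuracy, in percent


def pareto_frontier(systems: list[System]) -> list[System]:
    """Keep systems that no other system matches or beats on both axes."""
    def dominated(s: System) -> bool:
        return any(
            o.latency_s <= s.latency_s
            and o.accuracy >= s.accuracy
            and (o.latency_s < s.latency_s or o.accuracy > s.accuracy)
            for o in systems
        )
    return [s for s in systems if not dominated(s)]


systems = [
    System("Extend", 221, 99.2),
    System("Reducto Deep Extract", 846, 97.4),
    System("Claude Opus 4.7", 532, 83.2),
    System("Reducto Standard", 201, 80.9),
    System("LlamaParse Standard", 88, 47.2),
    System("GPT-5.5", 128, 31.7),
]
print([s.name for s in pareto_frontier(systems)])
# ['Extend', 'Reducto Standard', 'LlamaParse Standard']
```

Being on the frontier only means nothing else is both faster and more accurate. Among these six, the two faster frontier systems trade away 18.3 and 52.0 points of accuracy relative to Extend.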

4. Frontier models can perform extraction but fail to finish

The foundation models were often accurate on the rows they returned early in a document. The failure mode was sustained structured output. As arrays grew, models returned partial arrays, sampled from the document, skipped tail rows, or closed valid JSON before the extraction was complete.

That matters because valid JSON is not success. If a document contains 1,139 legal facts and the system returns a plausible subset, the output can pass schema validation while silently losing data.
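A completeness check has to look past schema validity. The sketch below assumes each extracted row carries a hypothetical integer field `paragraph_number`, that numbering runs contiguously from 1, and that the expected count is known (from ground truth in a benchmark, or from the document itself in production).

```python
def missing_paragraph_numbers(
    rows: list[dict], expected_count: int, key: str = "paragraph_number"
) -> list[int]:
    """List the paragraph numbers in 1..expected_count that never appear in the output."""
    seen = {row[key] for row in rows if isinstance(row.get(key), int)}
    return [n for n in range(1, expected_count + 1) if n not in seen]


# A schema-valid output that silently stops five paragraphs short of 705.
rows = [{"paragraph_number": n, "text": "..."} for n in range(1, 701)]
print(missing_paragraph_numbers(rows, expected_count=705))  # [701, 702, 703, 704, 705]
```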

Completion simulation

[Interactive figure: a simulated extraction of 90 records across 12 pages. Speed, completion, and error rate are derived from aggregate benchmark latency, completion, and accuracy. Fast lanes can still finish with fewer correct rows. Legend: captured; merged / duplicated; dropped / not returned; page boundary.]

Reference values used by the simulation (aggregate accuracy / mean latency):
Extend: 99.2% / 221s
Reducto Deep Extract: 97.4% / 846s
Claude Opus 4.7: 83.2% / 532s
Reducto Standard: 80.9% / 201s
LlamaParse Standard: 47.2% / 88s (completes 57.8%)
GPT-5.5: 31.7% / 128s

5. Production pipelines need more than long context

LongArray-Extract shows that this problem is not solved by context length alone. The document can fit in the model window and still fail because the required answer is too large to emit reliably in one turn.

For production document workflows, that difference shows up as missing transactions, adverse events, facts, or citations that downstream agents treat as complete records. The stronger approach is orchestration around the model: choose document representations, route by expected output load, split work deliberately, preserve global context, repair boundary issues, and verify the merged array before downstream systems receive it.
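The merge-and-verify step of such an orchestration could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not a description of Extend's implementation: the function names are hypothetical, chunks are assumed to overlap at page boundaries, and each row is assumed to have a key that uniquely identifies the record.

```python
from typing import Callable, Hashable

Row = dict[str, object]


def merge_chunks(chunks: list[list[Row]], key: Callable[[Row], Hashable]) -> list[Row]:
    """Concatenate per-chunk rows in order, dropping repeats created by overlapping chunks."""
    merged: list[Row] = []
    seen: set[Hashable] = set()
    for rows in chunks:
        for row in rows:
            k = key(row)
            if k in seen:
                continue  # record already captured from an earlier chunk's overlap
            seen.add(k)
            merged.append(row)
    return merged


def verify_count(merged: list[Row], expected_count: int) -> None:
    """Fail loudly instead of handing a partial array to downstream systems."""
    if len(merged) != expected_count:
        raise ValueError(f"expected {expected_count} rows, got {len(merged)}")


# Two chunks that both captured the record sitting on the page boundary.
chunks = [[{"id": 1}, {"id": 2}], [{"id": 2}, {"id": 3}]]
merged = merge_chunks(chunks, key=lambda r: r["id"])
verify_count(merged, expected_count=3)
print([r["id"] for r in merged])  # [1, 2, 3]
```

The key choice matters: two genuinely identical transactions on the same day would be collapsed if the key were only date and amount, so a key should include something positional, such as page and line.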
