Kushal Byatnal
Co-founder, CEO
One of our most used features today didn’t exist when we launched.
At the time, we thought document automation was all about extraction. But over time, every customer hit the same wall:
They weren’t just uploading neat, single-purpose PDFs. They were dealing with 100-page Frankenstein PDFs — contracts, addendums, and forms, all bundled into one file.
For many teams, this causes a logistical headache to deal with before any document processing can start.
That’s where Extend’s Splitter processor comes in.
How Splitters Power Document Automation
Very few real-world document flows are “extract this field from this one doc.” Most involve identifying and isolating the pieces that actually matter — before extraction begins.
But the value of splitting goes beyond just improving accuracy.
In many domains, splitting isn’t optional — it’s the only way to make things work. You need to:
Apply the correct extraction logic and schema for each type of doc
Validate the presence of document types to trigger actions (like signing, approval, or payment)
De-duplicate and break out repeated subdocuments
Splitters give teams the ability to:
Segment a file into logical subdocuments
Classify what each segment is (e.g. contract, invoice, notice)
Assign a unique identifier to each split for traceability (when dealing with multiple of one type of document)
Route each segment to downstream logic (e.g. further extraction)
Without this kind of preprocessing, any downstream extraction processor is forced to work on noisy, misaligned input — often guessing at context across unrelated pages. That leads to lower accuracy and broken pipelines. Splitting first is what makes structured, reliable document automation possible.
Splitter In Action
Let’s say you’re a real-estate platform that has to process mortgage applications. You receive 160-page customer loan packages. Before doing any extracting, you first need to identify:
10-17 are the loan application you care about
18-40 are noisy addendums that can be discarded
but 40-44 is a critical income statement you need

Once you’ve split out each subdocument, you can access the data via API for use downstream. If using Extend’s Workflows product, you can optionally chain splitters with their respective configured extractors.

Quick Feature Highlights
Classification types: Define all the types of subdocuments your pipeline should expect.
Identifier support: Assign custom logic to detect and split multiple subdocuments of the same category by unique identifiers (e.g. invoice number).
Split methods: Choose between high-precision (best for accuracy) or low-latency (best for speed) strategies.
Excel support: Automatically split multi-tab Excel files
Every Splitter can be configured via UI or API — and like all processors in Extend, they’re versioned, testable, and traceable.
Spitter Identifier:

Splitter Evaluation:

Challenges of Splitting:
At first glance, splitting seems simple. But we’ve seen teams wasting months building brittle logic to keep up with the complexity over time:
Files might include 5 docs of the same type — how do you tell them apart?
OCR noise, scanned pages, or rotated sheets can throw off rule-based systems
Complex domains often requires contextual reasoning via advanced LLMs, not just keywords
Building a robust splitter requires deeply integrated vision + language models, handling of real-world edge cases and exceptions, and extensive evaluation tooling.
Looking Ahead
We’re constantly improving critical steps like this within our customers pipelines. Some ways in which the Splitter will improve in the coming months:
Increasing accuracy by learning from historical examples of mistaken splits
Allowing for more granular splitting, specifically by page sections rather than only pages
If you're dealing with messy files and want a faster, more reliable path to structured data, talk to us!
We’ve seen this problem across hundreds of real-world file types, and we’d love to show you how Extend can help.
WHY EXTEND?