At Extend, we build APIs for document processing: parsing, extraction, classification, and splitting. AI teams from leading companies use our tools to pull structured data from their documents, split and classify different document types to workflows, and parse files at scale.
We have “evaluation sets” that let you measure how well your processor is doing. One of the most time-consuming parts of building a processor is the iteration loop: observe failures on the eval set, update the instructions, re-run, and repeat.

Composer automates this process. Today, hundreds of customers use it to regularly push classification accuracy to ~99% on eval sets that previously plateaued at ~85%.
But getting here required us to throw out our first architecture entirely, develop opinions on problems that most agent frameworks ignore, and discover unexpected benefits customers were deriving from the tool.
This blog post is a technical deep-dive into what we learned building a production-ready, enterprise-grade agent. The domain is document processing, but the patterns are general. If you're building agents and thinking about workflows vs. agents, context window management, or how to handle tool calls that take 30 minutes, we hope our mistakes save you some time.
The Problem We Were Trying to Solve
Building a high-accuracy document processor is fundamentally an encoding problem. Your teams have domain expertise and know how to read these documents because they’ve seen hundreds of them. The hard part is encoding that knowledge in a system, and translating it into field descriptions, classification rules, and extraction logic that an LLM can execute.
Traditionally, users do this in Extend by observing failures and reasoning about how to update the LLM instructions. This is harder than it sounds: it’s not always clear how to prompt an LLM so that performance strictly improves without degrading in other scenarios, confusing the model, or overfitting to a handful of examples. Translating failures into prompt improvements is error-prone and carries high cognitive load. The process can take days.
Labeling examples is much easier. Instead of updating an abstract ruleset, users just add a failure case to an eval set. An agent can then discover and update the rules automatically. The human provides domain knowledge by labeling. Composer figures out the prompt engineering.
First Attempt: The Workflow Approach
Our first proof-of-concept was a workflow. It was heuristic-based and deterministic, with AI involved only at the final step of generating new descriptions.
The logic went like this: an extraction schema has individual fields. Naively, you can assume each field is independent of the others and can be optimized separately. If the invoice_total field has 73% accuracy, we look at the errors for that field, generate a better description, and move on to the next field. One inference call per field. Simple decomposition.
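That per-field loop can be sketched in a few lines. This is an illustrative sketch, not our actual code: `generate_description` stands in for the single LLM call per field, and `optimize_fields` for the workflow driving it.

```python
def generate_description(field: str, old_desc: str, errors: list[str]) -> str:
    """Stand-in for the one LLM call that rewrites a field's description."""
    return f"{old_desc} (revised to address {len(errors)} errors)"

def optimize_fields(schema: dict[str, str], errors: dict[str, list[str]]) -> dict[str, str]:
    """Naive decomposition: treat each field as independent and fix it in isolation."""
    updated = dict(schema)
    for field, desc in schema.items():
        field_errors = errors.get(field, [])
        if field_errors:
            # One inference call per field, blind to every other field.
            updated[field] = generate_description(field, desc, field_errors)
    return updated
```

The independence assumption lives in that inner call: no field ever sees another field's description or errors.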

For simple schemas with a handful of flat fields, this worked. If you have five flat fields that genuinely don't interact with each other, optimizing them independently is fine.
Where It Broke Down
The independence assumption falls apart quickly as schemas get more complex.
Array fields create multi-level descriptions. A line_items array has descriptions at the array level and for each property within it. These levels interact during extraction, but our workflow optimized them separately. The heuristics for ordering and resolving conflicts between levels became increasingly convoluted.
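To make the interaction concrete, here is an illustrative schema fragment (not our exact internal format) where descriptions live at both the array level and the per-property level:

```python
# Illustrative extraction schema for a line_items array.
# The array-level description and the per-property descriptions
# are both consulted during extraction, so optimizing them
# separately risks making them contradict each other.
line_items_schema = {
    "type": "array",
    "description": "One entry per billed line item on the invoice.",
    "items": {
        "type": "object",
        "properties": {
            "unit_price": {
                "type": "number",
                "description": "Per-unit price, excluding tax.",
            },
            "quantity": {
                "type": "number",
                "description": "Number of units billed.",
            },
        },
    },
}
```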
Related fields interfere with each other. Fields like length, width, and height in an item_dimensions object should have similar descriptions since they extract measurements from the same part of a document. But optimizing them independently produces conflicting descriptions that fight each other at extraction time.
Classification killed the workflow entirely. Classification classes are mutually exclusive—to optimize one class, you need to know what else is in the schema so you don't create overlap. A lot of classification rules are about disambiguation: "This is a house bill of lading, NOT a master bill of lading, because of X."
When we tried to build heuristics for all the possible relationships between classification classes, we hit combinatorial explosion. You can't enumerate every pairwise interaction in a 50-class classifier and write deterministic rules for each one. The code paths multiplied faster than we could handle them.
The Pivot to Agents
We pivoted after we hit a breakthrough realization: this is a high-dimensional optimization problem, and navigating high-dimensional spaces is exactly what agents are good at.
Instead of trying to decompose the problem and write code for every case, we could reframe it. Give the agent the full picture—the schema, the errors, the relationships between fields—and let it figure out what to optimize and how, with the right tools for the job.

This also meant that as models get smarter, Composer gets smarter for free. With the old approach, every model upgrade meant rewriting heuristics by hand.
Switching to an agent didn't magically fix everything. It changed the problem from "how do I write heuristics" to "how do I present information to the agent so it can actually do its job." Here's what we learned figuring that out.
Our takeaways
1/ Context Engineering Is The Real Work
We built the classification optimizer first because workflows failed fastest there—the mutual exclusivity problem made heuristics infeasible—and because classification has simpler context requirements. The output is a single label, and evaluation is binary: either the predicted class matches the ground truth or it doesn't.
We threw together a minimal agent with basic tools and almost no prompt engineering. It worked immediately. We ran it against our hardest internal evaluation sets—classifiers with subtle distinctions between visually similar document types—and watched accuracy climb to ~99%. But classification's simplicity hid a landmine we stepped on later: context window limits. We hit it in two ways:
First, the error analysis itself. A classifier's confusion matrix is an N×N structure. For a classifier with 200 classes, that's 40,000 cells. Even with aggressive compression, presenting the full confusion matrix to the agent would consume the entire context window before we'd included any actual document content or error analysis.
We solved this with aggressive filtering: only show classes that were actually confused with each other. If Class A was never misclassified as Class B, there's no reason to present that cell to the agent. This reduced the context from O(N²) to O(errors), which was manageable.
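The filtering itself is simple. A minimal sketch (the function name and input shape are illustrative): instead of materializing the full N×N matrix, keep a count only for (true, predicted) pairs that actually disagree.

```python
from collections import Counter

def confused_cells(predictions: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Collapse an N x N confusion matrix down to only the cells where a
    misclassification actually occurred: O(errors) cells, not O(N^2)."""
    cells: Counter = Counter()
    for true_label, predicted in predictions:
        if predicted != true_label:
            cells[(true_label, predicted)] += 1
    return dict(cells)
```

Only the non-empty off-diagonal cells ever reach the agent's context; correct predictions and never-confused class pairs contribute nothing.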
Second, which documents the agent sees. We found that we got better results by showing the agent only the misclassified examples and telling it "everything else was correct." This focused its attention on the actual problems rather than having it reconsider things that were already working.
The lesson: context is precious. Every token you spend on background information is a token you can't spend on the actual problem.
2/ Small Tool Design Choices Add Up
Batch over parallel
Early on, we tried to leverage parallel tool calling: the model would emit multiple tool calls, and we'd execute them concurrently. This was unreliable. The parallel calls would sometimes be inconsistent with each other, or the model would make calls that only made sense if executed in a specific order.
We switched to batch-style tools. Instead of the model making five separate update_field calls, it makes one update_fields call with a list of updates. The tool handles the batching internally.
# Instead of this:
update_field("name", "desc1")
update_field("address", "desc2")
update_field("total", "desc3")

# We do this:
update_fields([
    {"field": "name", "description": "desc1"},
    {"field": "address", "description": "desc2"},
    {"field": "total", "description": "desc3"},
])
Collapse List & Get
A common agent pattern is: list available items, then get details on specific ones. For example: list_fields() returns field names, then get_field_details(field_id) returns the full schema for that field.
We found it more effective to collapse these. Return paginated results with full objects, not just IDs. The model makes fewer round trips, and you avoid the failure mode where the model lists items and then forgets to fetch the details it actually needs.
# Instead of this:
def list_fields() -> list[str]:
    """Returns field IDs only."""

def get_field_details(field_id: str) -> FieldSchema:
    """Returns full schema for one field."""

# Do this:
def get_fields(page: int, page_size: int) -> list[FieldSchema]:
    """Returns paginated fields with full schema included."""
Schema References with JSON Pointer
We needed a way for the agent to refer to specific locations in nested schemas. If it wants to update the description for line_items.properties.unit_price, it needs a consistent syntax that the agent can generate and our tooling can parse.
We used JSON Pointer (RFC 6901). It's a standard the model has likely seen in training data, so it generates valid pointers without much prompting. We can parse them reliably and apply updates to the right schema locations. Using standards from RFCs means less custom syntax to teach the model and fewer edge cases in our parsing code.
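Resolving a pointer on our side takes only a few lines. A minimal sketch, assuming schemas are plain nested dicts and lists (a production resolver would add error handling for missing keys):

```python
def resolve_pointer(doc, pointer: str):
    """Resolve an RFC 6901 JSON Pointer (e.g. '/line_items/properties/unit_price')
    against a nested dict/list structure."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    node = doc
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: '~1' -> '/', then '~0' -> '~', in that order.
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node
```

Because the syntax is standardized, the same pointer string the model emits in a tool call can be parsed, validated, and applied without any custom grammar.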
3/ Long-Running, Expensive Operations Require State
Most agent frameworks assume tool calls are fast and cheap. Ours are not.
When Composer runs an evaluation set against a processor, it's potentially processing hundreds of documents through extractors and classifiers. This can take a long time: 30 minutes and sometimes an hour for large sets. It also costs money in inference and processing fees.
So we built a simple persistence layer that separates "triggering" an operation from "waiting for" it.
The key behaviors:
- Idempotent triggers: When the agent requests an evaluation, we first check if we've already kicked one off for this configuration. If yes, we return the existing operation ID and wait for it.
- Persistent state: We write the operation ID to storage immediately after triggering. If the agent crashes and restarts, we can look up in-progress operations and resume waiting.
- Polling over blocking: Tools return quickly with an operation ID, then poll for completion in a loop. This lets us update the agent's state with progress information and handle timeouts gracefully.
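The three behaviors above compose into one small loop. This is a hypothetical sketch, not our implementation: `store` stands in for persistent storage keyed by configuration, and `trigger`/`poll` for the calls that start an evaluation and check its status.

```python
import time

def run_evaluation(store: dict, config_key: str, trigger, poll, interval_s: float = 30.0):
    """Idempotently trigger an evaluation, persist its op ID, then poll to completion."""
    op_id = store.get(config_key)
    if op_id is None:
        op_id = trigger()          # kick off the expensive run
        store[config_key] = op_id  # persist *before* waiting, so a crash can resume
    while True:
        status = poll(op_id)
        if status["state"] in ("succeeded", "failed"):
            return status
        time.sleep(interval_s)     # poll in a loop rather than block in the tool call
```

If the same configuration is requested twice, or the agent restarts mid-run, the stored operation ID is reused and nothing is re-triggered.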
It's not elegant. It's a quasi-workflow engine bolted onto an agent framework. But if your domain involves expensive, long-running operations, you'll need something like it.
The lesson: most agent frameworks assume tool calls return in seconds. If your domain involves expensive, long-running operations, idempotent triggers and persistent state tracking aren't optional.
4/ Your Agent Might Solve a Different Problem Than You Intended
The most interesting thing we learned wasn't planned. We built Composer to make "number go up"—to maximize accuracy. What we discovered was that even when Composer couldn't make the number go up, it could still explain to the user why the number couldn't go up.
When Composer runs and accuracy doesn't improve, it reports what it found. And these reports turned out to be remarkably useful. For example:
- "Documents A and B have identical content but different labels"
- "The date field is sometimes labeled with the invoice date and sometimes the due date"
- "This field you're asking me to extract doesn't exist in these documents"
Users run Composer hoping for better accuracy. Composer says "I can't do better because your labels have these problems." Users fix their labels, re-run, and the accuracy jumps. When Composer "fails" to improve accuracy, it's not actually failing. It's doing a different kind of useful work. We only recognized this because we externalized the agent's reasoning. If Composer had been a black box that just output configuration updates, we'd never have seen it.
The lesson: Externalize your agent's thinking. Log the reasoning traces. Surface intermediate conclusions to users. You might find the "side effects" are as valuable as the primary output.
What we’d do differently
Building Composer took longer than it should have. Some of that is inherent to R&D: you don't know what will work until you try it. But some of it was self-inflicted.
Focus on the problem before the system. We built distributed infrastructure before we'd validated the approach. When we pivoted to agents, all of that became baggage we had to refactor or throw away. We should have started with scrappy Jupyter notebooks and single-threaded scripts.
Invest in fast feedback loops early. A single experiment ("does this prompting approach work better?") took hours to run and days to analyze. If we'd built tooling for parallel comparison and automated metrics dashboards from the start, we'd have converged weeks faster. Experimentation velocity dominates everything else in R&D.
Decouple experimentation from production engineering. Good code in service of the wrong approach is worse than throwaway code that finds the right approach faster.
Wrapping up
If you're building your own agents, here's what we hope resonates:
- Let agents do what they're good at
- Starve your context
- Design for expensive operations
- Externalize reasoning
- Productionize last
Composer exists because we were trying to solve a tedious problem for our users. What we learned along the way applies far beyond document processing. Agents are a new paradigm, and we're all still figuring out the patterns.
We hope sharing ours helps you find yours faster.
Composer is live in Extend for classification and extraction optimization. Check out the docs or reach out if you want to see it in action.
