Building Trust in AI: Coupling Extraction and Judging Agents for Insurance Submission Processing

Insurance submissions are remarkably diverse. Each one is a unique package of documents - ACORD forms, loss runs, supplemental applications, statements of values (SOVs), financial statements, policy documents, broker notes, and more. No two submissions look alike, and that’s by design.
A loss run in one submission follows a completely different format from another. An SOV might be a clean Excel spreadsheet in one package and a scanned PDF of a table in another. Both are valid, but they present very different data extraction challenges.
Which information is most critical also depends heavily on the line of business. Management liability underwriting requires complete financial information, employee counts, and directors and officers coverage limits. Workers' comp for contractors requires exposure classifications, payroll broken down by class code, and experience modification factors. Each line of business demands extracting different information from different document types - and getting it right.
Large language models have changed what’s possible here. They can interpret messy, varied documents and extract structured underwriting data with far more flexibility than prior approaches. But they also introduce new challenges - non-determinism, hallucinations, and sensitivity to input quality - and traditional QA (manual review and rules) isn’t built to catch plausible-looking errors.
In this blog, we’ll explain how we build trust in LLM-driven submission extraction using Kalepa’s System of AI Agents: strategic sampling, line-of-business-specific criteria, and coupling independent “Judging” Agents with the Extraction Agents to establish reliable measures of confidence.
Why LLMs change the game (and create new problems)
At Kalepa, we couple a variety of large language models to extract critical underwriting data from insurance submissions. LLMs have enabled extraction quality that was impossible just a few years ago. They can understand context, handle diverse formats, recognize what matters in unstructured data, and extract structured data with remarkable flexibility.
But they introduce a new class of challenges that need new solutions. The error modes are fundamentally different from traditional software approaches. Traditional systems fail loudly - you get an error, a crash, or a null value. LLM-based systems fail silently: they give you an answer that looks plausible but is wrong.
In insurance submission extraction, errors can compound from multiple sources:
- OCR errors misread characters in poor-quality scans, faxes, or documents with complex layouts like financial tables (e.g., 5 → S, missing decimals, misaligned columns).
- LLM hallucinations generate plausible values that aren’t supported by the source document.
- Context confusion pulls the right concept from the wrong place (e.g., prior year vs. current year, subsidiary vs. parent, wrong policy).
- Format misinterpretation misaligns columns, misattributes headers, or merges cells incorrectly.
This creates a paradox: LLMs are the only technology capable of handling insurance data’s complexity at scale, but their probabilistic nature makes traditional quality assurance approaches inadequate.
Why traditional QA approaches fail
The natural response is to add more verification steps, but traditional approaches fail for systematic reasons.
Manual review doesn't scale statistically.
Even sampling 1% of submissions means thousands of documents per month at meaningful volume. Worse, the statistical confidence from small samples is surprisingly weak. If you find 2 errors in 100 samples, the 95% confidence interval on your true error rate runs from roughly 0.5% to 7%. You're spending significant resources to learn very little. Manual review is also inconsistent, slow, and can't provide the detailed error analysis needed to improve the system.
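To make that arithmetic concrete, here is a minimal sketch of the Wilson score interval behind numbers like these - illustrative only, not Kalepa's production code:

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for an error rate of `errors` out of `n` samples."""
    if n == 0:
        return (0.0, 1.0)
    p = errors / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# 2 observed errors in a 100-document sample leaves a wide interval:
low, high = wilson_interval(2, 100)
print(f"95% CI for the true error rate: {low:.1%} to {high:.1%}")  # ~0.6% to 7.0%
```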
Rule-based validation catches only trivial errors.
You can write rules such as “revenue must be positive” or “dates must be valid,” but they miss what matters: the revenue figure that's valid but pulled from the wrong fiscal year, the employee count that conflicts with company size indicated elsewhere, or the loss amount read from the wrong table row. Rules check syntax; they can't check meaning.
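A toy sketch makes the point - the field names and rules below are hypothetical, and every value that is semantically wrong still sails through:

```python
from datetime import date

# Hypothetical extracted fields (illustrative only).
extraction = {
    "annual_revenue": 4_200_000,              # positive and well-formed... but from the prior fiscal year
    "policy_effective_date": date(2024, 7, 1),
    "employee_count": 12,                     # conflicts with "~200 employees" stated elsewhere in the package
}

RULES = [
    ("revenue must be positive",     lambda e: e["annual_revenue"] > 0),
    ("effective date must be valid", lambda e: isinstance(e["policy_effective_date"], date)),
    ("employee count must be >= 1",  lambda e: e["employee_count"] >= 1),
]

failures = [name for name, check in RULES if not check(extraction)]
print(failures or "all rules pass")  # prints "all rules pass" despite two semantically wrong values
```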
These approaches fail because they're trying to solve a statistical quality problem with either brute force (manual review) or deterministic tools (rules). What's needed is a way to build statistical confidence about quality while remaining computationally scalable.
Our Approach: Strategic Sampling with Judging Agents
Instead of trying to verify every extraction, we sample strategically and use an independent quality assessment pipeline. The core insight is that evaluating extraction quality is a different - and in some ways simpler - task than doing the extraction itself.
- Strategic sampling
Kalepa supports clients who write different lines of business, and what matters most from a quality perspective varies accordingly. Each customer has a set of critical fields that they rely on for underwriting decisions, and those fields differ by line of business and submission type.
Our sampling strategy reflects that. We select submissions across the lines of business our customers write, the document types they rely on, and the range of complexity and input quality we see in production. The sample isn’t random - it’s stratified and dynamically recalibrated to ensure we’re covering the important dimensions of variability.
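In code, the core of that can look something like the sketch below - the strata and per-stratum counts are hypothetical, and the dynamic recalibration described above is left out for brevity:

```python
import random
from collections import defaultdict

def stratified_sample(submissions, stratum_key, per_stratum):
    """Sample up to `per_stratum` submissions from each stratum
    (e.g., line of business x document type x input quality)."""
    strata = defaultdict(list)
    for sub in submissions:
        strata[stratum_key(sub)].append(sub)
    sample = []
    for members in strata.values():
        random.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample

# Hypothetical usage: stratify by (line of business, document type).
# qa_sample = stratified_sample(recent_submissions,
#                               lambda s: (s["line_of_business"], s["doc_type"]),
#                               per_stratum=25)
```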
- Line-of-business-specific quality criteria
Each line of business has different critical fields - the data points where errors have the most impact on underwriting decisions. Rather than checking everything equally, we weight our evaluation toward the data points that matter most. This prioritization ensures our QA resources focus on the extractions that drive underwriting decisions.
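One simple way to express that prioritization is a weighted per-field score; the lines of business, fields, and weights below are purely illustrative:

```python
# Hypothetical weights - real critical fields differ by client, line of business, and submission type.
CRITICAL_FIELD_WEIGHTS = {
    "workers_comp": {"payroll_by_class_code": 3.0, "experience_mod": 3.0, "employee_count": 1.5},
    "management_liability": {"annual_revenue": 3.0, "do_coverage_limits": 3.0, "employee_count": 2.0},
}

def weighted_quality_score(line_of_business: str, field_results: dict[str, bool]) -> float:
    """Score an evaluated submission so that errors on critical fields hurt more.
    `field_results` maps field name -> whether the judge found the value supported."""
    weights = CRITICAL_FIELD_WEIGHTS.get(line_of_business, {})
    total = sum(weights.get(field, 1.0) for field in field_results)
    correct = sum(weights.get(field, 1.0) for field, ok in field_results.items() if ok)
    return correct / total if total else 0.0
```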
- Judging Agents: LLM-as-judge evaluation
A separate set of LLM-based judging agents reviews the extracted data against the source documents. Crucially, the judging agents see more context than the extraction agents did.
Extraction is a breadth task. A single submission can require thousands of data points, including many exposures that directly drive underwriting appetite. To handle that scope, the extraction agents are optimized to systematically pull a large number of fields across many document types.
Judging is a depth task. These agents are not trying to re-extract every data point - they are trying to validate a smaller set of critical fields with high confidence. Because the judging agents are focused on a limited number of data points, they have access to a larger context and are able to efficiently cross-reference values across documents, check for internal consistency, and verify that each value is supported by the source.
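A minimal sketch of that judging step is below. It assumes a generic `call_llm` client (a placeholder, not a real API) and is not Kalepa's actual prompt or agent framework - just the shape of an LLM-as-judge check over a handful of critical fields:

```python
import json

JUDGE_PROMPT = """You are auditing data extracted from an insurance submission.
For each extracted field below, decide whether the value is supported by the
source documents. Answer with a JSON list containing one object per field:
{{"field": "...", "verdict": "supported" | "unsupported" | "ambiguous", "evidence": "..."}}

Extracted fields:
{fields}

Source documents (more context than the extraction agent saw):
{documents}
"""

def judge_extraction(critical_fields: dict, documents: list[str], call_llm) -> list[dict]:
    """Independently validate a small set of critical fields against the full source text.
    `call_llm` stands in for whatever model client is in use: it takes a prompt
    string and is assumed to return a JSON string."""
    prompt = JUDGE_PROMPT.format(
        fields=json.dumps(critical_fields, indent=2, default=str),
        documents="\n\n---\n\n".join(documents),
    )
    return json.loads(call_llm(prompt))
```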
- Distribution analysis gives us statistical confidence
Across the sample, we build a distribution of quality scores. Unlike a single accuracy percentage, this lets us compute proper confidence intervals at run time rather than just point estimates - and those intervals tell us how reliable the measurement is given our sample size.
We also track these metrics over time to detect otherwise silent quality regressions. If, for example, contractor payroll extraction accuracy suddenly dips, that could reflect a regression in the pipeline, drift in a critical model, or the need to re-tune the parameters of the coupled AI system. When a new LLM becomes available, it can immediately become part of the AI system: we can route extraction tasks to it, evaluate its accuracy in real time, and determine whether any improvement is real or noise.
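As a sketch, one conservative way to flag such a dip is to compare error-rate confidence intervals between a baseline window and the latest sample; the counts below are hypothetical and reuse the `wilson_interval` helper sketched earlier:

```python
def error_rate_regression(baseline_ci: tuple[float, float], current_ci: tuple[float, float]) -> bool:
    """Flag a regression when the current error-rate interval sits entirely above the baseline's.
    Non-overlapping intervals are a deliberately conservative signal; a two-proportion
    test would be a reasonable alternative."""
    _, baseline_high = baseline_ci
    current_low, _ = current_ci
    return current_low > baseline_high

# Hypothetical counts for one critical field (e.g., contractor payroll by class code):
# baseline = wilson_interval(errors=8, n=400)    # trailing-month QA sample
# current  = wilson_interval(errors=24, n=300)   # latest sample after a pipeline change
# if error_rate_regression(baseline, current):
#     ...  # alert and investigate before underwriters see degraded data
```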
Conclusion
The promise of AI in insurance underwriting isn't just automation - it's reliable automation. Underwriters need to trust the data they're seeing, and that trust can't be assumed. It must be measured, tracked, and continuously earned.
By coupling extraction agents with judging agents, Kalepa's AI systems make that trust quantifiable. We measure confidence in extraction in real time. We detect quality regressions and address their root causes. Most importantly, our coupled agent architecture enables the system to improve systematically - not reactively - with each judging agent evaluation strengthening the extraction pipeline's reliability.
This is what separates Professional Grade AI from demos. The models are powerful, but imperfect - and making them underwriting-reliable requires sophisticated QA technology built with the same rigor as the AI systems themselves. Kalepa has built exactly that, so our underwriting AI can deliver the accuracy commercial insurance demands.

















