November 19, 2025 · 11-minute read · by Guillermo Mirandes et al.

One of the things we spend a lot of time thinking about is how to measure AI performance effectively in the context of real businesses. Over the past two years, we've found that most AI benchmarks (the tests that large AI model developers use to measure progress) rarely reflect the real world. They tend to focus on academic tasks that fail to capture the practical work people do in businesses of all sizes. To illustrate this point, consider these example questions from Humanity's Last Exam, a recent benchmark put together by the Scale AI team:

[Image: sample questions from Humanity's Last Exam]

These are PhD-level questions, and we'll all be impressed once models answer them reliably. But they don't represent the real world, because 99%+ of people don't need to know these things to excel at their jobs (what do Roman inscriptions and hummingbird anatomy have to do with running a healthcare payer or a bank?). Businesses require a distinct set of skills, and those skills are poorly represented in the benchmarks that OpenAI, Anthropic, Google, and xAI rely on.

To change that, we developed a new suite of benchmarks grounded in real operational workflows. These tasks represent work that humans perform easily but that still forces us to build workarounds or spend extra engineering time to get robust results from models. With benchmarks like these, executives can judge whether a model can handle the specific challenges of their field, rather than relying on generic benchmarks and hoping a new model performs as well as its predecessors did.

In this article, we introduce MortgageBench, a new benchmark designed to evaluate the performance of AI models in end-to-end tasks within the mortgage origination process at a bank. MortgageBench serves as a structured testing environment that mirrors real-world loan processing workflows, from verifying income to reconciling documentation. We chose the mortgage loan process because it blends structure with complexity. It contains rules, calculations, and cross-checks that resemble real operational workflows. A model that can accurately calculate Debt-to-Income ratios, reconcile income across documents, and identify missing items demonstrates both technical skill and business relevance. Mortgages are also inherently high stakes; calculation errors can impact both applicants and institutions, so humans tend to verify documentation and application steps multiple times. This benchmark serves as a test of whether today's models are ready for reliable, real-world deployment.
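As a concrete illustration of the arithmetic these tasks involve, here is a minimal sketch of the standard debt-to-income calculation; the function and the figures are our own illustration, not part of the benchmark:

def debt_to_income_ratio(monthly_debt_payments, gross_monthly_income):
    # Standard DTI: total monthly debt obligations divided by gross monthly income.
    if gross_monthly_income <= 0:
        raise ValueError("gross monthly income must be positive")
    return monthly_debt_payments / gross_monthly_income

# Example: $2,150 in monthly obligations against $6,500 in gross monthly income
print(f"DTI: {debt_to_income_ratio(2150, 6500):.1%}")  # DTI: 33.1%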

Benchmark Structure:

MortgageBench evaluates model performance across nine tasks that mirror common steps in the mortgage loan process:

  1. Calculate average income from past W-2 and 1099 tax documents
  2. Estimate annualized pay from current pay stubs
  3. Detect missing documents within the loan application
  4. Reconcile applicant information to ensure consistency across documents
  5. Calculate the Debt-to-Income (DTI) ratio
  6. Select and extract credit scores from a credit report
  7. Calculate total household monthly income
  8. Calculate total household liabilities
  9. Verify employer name consistency across income-related documents

Each task in MortgageBench includes 50 applications, where each application represents a simulated mortgage loan file containing the types of documents a loan officer would typically review, such as pay stubs, W-2s, 1099s, credit reports, and loan applications. These applications test a model's accuracy in analyzing, verifying, and computing information across multiple sources. They vary in complexity: on one end are clean, consistent records with minimal variability, where applicants usually have a single employer, aligned pay frequencies, and complete documentation. On the other end, we introduce realistic variability to better reflect real-world mortgage scenarios, including multiple employers, co-borrowers, and inconsistencies across documents.

  • Example of a straightforward application: an applicant has two W-2s and two recent pay stubs from the same employer, no additional income sources, a complete credit report, and no co-borrower.
  • Example of a complex application: an applicant has two W-2s and two pay stubs from different employers due to a job change during the previous year. Their co-borrower is a salaried employee with additional interest income.

This approach helps quantify how model performance degrades as the data becomes more realistic.

Working directly with PDF documents introduces challenges, such as Optical Character Recognition (OCR) errors and inconsistencies in data extraction. To isolate and evaluate the models' reasoning capabilities, specifically their ability to understand the context of the mortgage domain without domain-specific fine-tuning, we utilized structured, JSON-formatted data that represented information as key–value pairs. For example:

{
  "first_name": "Andrea",
  "last_name": "Cole",
  "full_name": "Andrea Cole",
  "dob": "1987-01-23"
}

Each mortgage application included pay stubs, W-2s, credit reports, 1099 tax documents, and loan applications, allowing us to measure how models handled different mortgage application scenarios. Applications represented either individual borrowers or families, where a primary applicant had a co-borrower.
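For illustration, a single application file can be thought of as a nested structure along these lines; the layout, names, and values below are our own simplification, not the benchmark's exact schema:

# Hypothetical, simplified layout of one application file (not the benchmark's exact schema)
application = {
    "applicant": {
        "loan_application": {"full_name": "Andrea Cole", "dob": "1987-01-23"},
        "w2s": [{"year": 2024, "employer": "Acme Corp", "wages": 82000.00}],
        "pay_stubs": [{"employer": "Acme Corp", "gross_pay": 3150.00, "pay_frequency": "biweekly"}],
        "credit_report": {"summary": {"total_balance": 18500.00, "estimated_monthly_payments": 420.00}},
    },
    "co_borrower": None,  # populated when a family application includes a co-borrower
}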

Benchmark Results:

[Figure: MortgageBench results by model and task]

Key Findings:

  • Strong numerical accuracy: Models demonstrated strong performance on straightforward calculations and cross-checking tasks like Annualized Pay, Credit Score, and Household Liabilities. Even smaller models, such as GPT-5-mini and Grok-3-mini, maintained high accuracy.
  • Reasoning tasks reveal differences: As reasoning task complexity increased, particularly in Household Monthlies, performance declined for all models. A reasoning task requires the model to combine, interpret, and calculate information across multiple documents or data fields rather than simply extracting single facts. For example, the Household Monthlies task requires models to identify all income sources for both the applicant and co-borrower, calculate their combined annual income, and then compute their household monthly income (see the sketch after this list). All models struggled with cases where the applicant had a spouse or a co-borrower with additional income sources. While the models correctly recognized that these spouses' documents corresponded to income sources, many failed to integrate them properly into the overall household calculation, revealing weaknesses in multi-step reasoning under variable data conditions.
  • Data variability exposes performance gaps:  Models continued to extract structured data reliably, but their consistency declined when document information overlapped or conflicted. For instance, in the Employer Consistency task, a model could easily reconcile information for a straightforward applicant, such as a salaried employee with two W-2s and two pay stubs from the same employer. However, when faced with an applicant who recently switched jobs and had three pay stubs (two from a previous employer and one from a new one), the model struggled to confirm employer consistency. The absence of a W-2 for the latest job further complicated reasoning, leading to uncertainty even when a recent pay stub was present.
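To make the Household Monthlies reasoning step concrete, here is a minimal sketch of the aggregation the task calls for, following the steps spelled out in the appendix prompt (total each document type by year, average across the years present, sum across the household, divide by 12); the data layout and amounts are illustrative:

from collections import defaultdict

def average_annual_income(docs):
    # For each document type, total amounts by year, average across the years present, then sum the types.
    by_type = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        by_type[doc["type"]][doc["year"]] += doc["amount"]
    return sum(sum(years.values()) / len(years) for years in by_type.values())

def household_monthly_income(household):
    # Sum each member's averaged annual income, then divide by 12 for the monthly figure.
    return sum(average_annual_income(docs) for docs in household.values()) / 12

# Illustrative household: applicant with a mid-year job change, co-borrower with salary plus interest income
household = {
    "applicant": [
        {"type": "W2", "year": 2024, "amount": 48000.00},   # previous employer
        {"type": "W2", "year": 2024, "amount": 36000.00},   # new employer
    ],
    "co_borrower": [
        {"type": "W2", "year": 2024, "amount": 72000.00},
        {"type": "1099-INT", "year": 2024, "amount": 1200.00},
    ],
}
print(round(household_monthly_income(household), 2))  # 13100.0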

Experienced loan processors would resolve ambiguity within the data by using context and domain knowledge, capabilities that models still lack without additional guidance.

Examples of what models got right vs. where they broke down:

To illustrate how performance translates to real workflows, here are two examples: one where the model performed well and one where reasoning broke down.

Household Liabilities Task:

  • The model accurately computed household liabilities by reading structured credit report data, correctly identifying each applicant's total debt balance and corresponding monthly payment estimates. Because the data in the credit report was clean and the required field names were consistent across both complexity tiers, the results were accurate and repeatable. (The prompt for this task is included in the Appendix; a sketch of the calculation follows.)
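The calculation itself is simple once the right fields are located. Here is a minimal sketch that follows the appendix prompt, summing each person's `summary.total_balance` and `summary.estimated_monthly_payments` into household totals (the names and figures are illustrative):

def household_liabilities(credit_reports):
    # credit_reports maps each individual's name to their credit report.
    individuals = [
        {
            "name": name,
            "total_debt": report["summary"]["total_balance"],
            "estimated_monthly_payments": report["summary"]["estimated_monthly_payments"],
        }
        for name, report in credit_reports.items()
    ]
    return {
        "individuals": individuals,
        "household_total_debt": sum(p["total_debt"] for p in individuals),
        "household_estimated_monthly_payments": sum(p["estimated_monthly_payments"] for p in individuals),
    }

# Illustrative credit-report summaries for an applicant and co-borrower
reports = {
    "Andrea Cole": {"summary": {"total_balance": 18500.00, "estimated_monthly_payments": 420.00}},
    "Jordan Cole": {"summary": {"total_balance": 9300.00, "estimated_monthly_payments": 210.00}},
}
result = household_liabilities(reports)
print(result["household_total_debt"], result["household_estimated_monthly_payments"])  # 27800.0 630.0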

Household Monthlies Task:

  • This task required aggregating income across multiple income documents, such as W-2s, 1099s, and pay stubs, for both the applicant and any co-borrower, a reasoning step beyond simple retrieval. In scenarios where variability and multiple employers were present, the model struggled, even when the prompt clearly listed all documents considered as income sources. The models sometimes double counted or missed income entries. (The prompt for this task is included in the Appendix.)

These examples illustrate both the strengths and the limits of today's models. There are a variety of levers we can pull to improve performance, such as context engineering, breaking the problem down into more specialized agents, or keeping humans in the loop. Here is one example of how task-specific prompt engineering got us the performance we needed in this case:

To improve clarity, we revised the Calculate Household Monthlies task to provide more explicit guidance. We specified that a "household" consists of the primary applicant and any co-borrower, and that their incomes should be combined. (The prompt for this task is included in the Appendix.) After adding this context, accuracy improved significantly across all models except GPT-Nano.

[Figure: Household Monthlies accuracy before and after the revised prompt]

Future Work:

MortgageBench will continue to evolve in two directions:

  • Broader domains: Expanding the benchmark into sectors like healthcare to evaluate domain-specific reasoning.
  • More realistic data formats: Moving beyond structured JSON data toward documents like PDFs and images to simulate how humans process loan applications end-to-end.

Ultimately, our goal is to develop a suite of benchmarks that measure what AI understands and how reliably it performs in the systems that power real business decisions.

Reflections and Next Steps:

After running these experiments, we came away both impressed and more focused. The models performed better than expected, especially on examples with higher data variability. We initially anticipated more performance degradation under complex, real-world conditions, but the results show a meaningful degree of reasoning stability.

At the same time, we have a benchmark that shows where human and engineering intervention still matters. Every issue we observed is engineerable: we can isolate, reformat, or sequence the data in ways that help models succeed. For instance, feeding income documents one by one instead of all at once could resolve many of the failure cases we saw. That's where our role comes in: understanding what models can do independently and knowing exactly when to step in to make them reliable for real-world use.
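As a rough sketch of that "one document at a time" approach: the model is asked about exactly one income document per call, and the aggregation happens deterministically in code. `call_model` is a hypothetical stand-in for whichever LLM API is actually used, not a real client:

import json

def call_model(prompt):
    # Hypothetical stand-in for a real LLM API call; not an actual client.
    raise NotImplementedError

def extract_income_entry(document):
    # A narrow, per-document prompt keeps the context small and reduces double counting.
    prompt = (
        "Return only JSON with keys 'person', 'year', and 'amount' for this income document:\n"
        + json.dumps(document)
    )
    return json.loads(call_model(prompt))

def household_monthly_income(documents):
    # Aggregate in code, rather than asking the model to total everything at once.
    entries = [extract_income_entry(doc) for doc in documents]
    return sum(entry["amount"] for entry in entries) / 12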

Appendix

Calculate Household Liabilities Prompt:

Role:

You are a Loan Processor at a major financial institution with 20 years of expertise processing loan applications.

Your job is to complete these tasks and return a JSON object as formatted in the `Response` section.

ONLY JSON OUTPUTS WILL BE ACCEPTED; ANYTHING ELSE WILL BE MARKED AS A BAD RESPONSE.

Do not wrap JSON in backticks or add any introductory text such as “Here’s your JSON:”.

Task:

Using the information provided in an applicant’s file, calculate the total debt and the estimated monthly payments for each individual and for the household.

Calculate Household Liabilities – Instructions:

  1. A household consists of an applicant and their spouse (if present).
  2. Use only the values found in the `credit_report` document for each person.
  3. Extract `total_balance` from `summary.total_balance` for each individual.
  4. Extract `estimated_monthly_payments` from `summary.estimated_monthly_payments` for each individual.
  5. Compute `household_total_balance` as the sum of total balances from both individuals.
  6. Compute `household_estimated_monthly_payments` as the sum of estimated monthly payments from both individuals.

Response:

Output should be in valid JSON format only, using the following structure:

{
  "individuals": [
    {
      "name": "<individual's name>",
      "total_debt": "<individual's total debt>",
      "estimated_monthly_payments": "<individual's estimated monthly payments>"
    }
  ],
  "household_total_debt": "<household's total debt>",
  "household_estimated_monthly_payments": "<household's estimated monthly payments>"
}

Deliverables:

  • Each individual’s total debt 
  • Each individual’s monthly payment estimates 
  • The household’s combined total debt 
  • The household’s combined estimated monthly payments

Applicant File:

Link to Applicant File

Calculate Household Monthlies Prompt #1:

Role:

You are a Loan Processor at a major financial institution with 20 years of expertise processing loan applications.

Your job is to complete these tasks and return a JSON object as formatted in the `Response` section.

ONLY JSON OUTPUTS WILL BE ACCEPTED; ANYTHING ELSE WILL BE MARKED AS A BAD RESPONSE.

Do not wrap JSON in backticks or add any introductory text such as “Here’s your JSON:” 

Task:

Calculate the household's average monthly gross income for an applicant (and their spouse, if present) using income information from the provided documents.

Calculate Household Monthlies – Instructions:

  1. For each individual, extract their name and group all associated income entries by type.
    • Types of income documents: W2, 1099-NEC, 1099-DIV, 1099-INT.
  2. In the W2, the pay can be found in box 1, labeled `wages, tips, & others`.
  3. For 1099-NEC, use box 1, `non-employee compensation`.
  4. For 1099-DIV, use box 1, `total ordinary dividends` and box 2 `total capital gain distributions`.
  5. For 1099-INT, use box 1, `interest income`.
  6. For each document type, total the amount by year, then divide by the number of years present.
  7. Compute the combined household monthly income by summing the values and dividing by 12.

Deliverables:

  • The total monthly income for the household. 

Response:

Output should be in valid JSON format only, using the following structure:

{"individuals": [{“name": "<individual's name>“ ],

  "household_monthly_income": "<monthly average gross pay>"}

Applicant File:

Link to Applicant File

Calculate Household Monthlies Prompt #2:

Role:

You are a Loan Processor at a major financial institution with 20 years of expertise processing loan applications.

Your job is to complete these tasks and return a JSON object as formatted in the `Response` section.

ONLY JSON OUTPUTS WILL BE ACCEPTED; ANYTHING ELSE WILL BE MARKED AS A BAD RESPONSE.

Do not wrap JSON in backticks or add any introductory text such as “Here’s your JSON:”

Task:

Calculate the household's average monthly gross income for an applicant (and their spouse, if present) using income information from the provided documents.

Calculate Household Monthlies – Instructions:

  1. A household consists of an applicant and their spouse (if present).
  2. For each individual in the household, extract their name and group all associated income entries by type. Types of income documents: W2, 1099-NEC, 1099-DIV, 1099-INT.
  3. In the W2, the pay can be found in box 1, labeled `wages, tips, & others`.
  4. For 1099-NEC, use box 1, `non-employee compensation`.
  5. For 1099-DIV, use box 1, `total ordinary dividends` and box 2 `total capital gain distributions`.
  6. For 1099-INT, use box 1, `interest income`.
  7. For each document type, total the amount by year, then divide by the number of years present.
  8. Compute the combined household monthly income by summing the values and dividing by 12.

Deliverables:

  • The total monthly income for the household.

Response:

Output should be in valid JSON format only, using the following structure:

{"individuals": [{“name": "<individual's name>“],

  "household_monthly_income": "<monthly average gross pay>"}

Applicant File:

Link to Applicant File

Authors

Guillermo Mirandes

Ganesh Morye

Keily Hernandez

Kalvin Garcia
