Introduction
In Benchmarking AI for Real-World Mortgage Tasks, we introduced the MortgageBench benchmark, a business-focused evaluation designed to test how large language models handle practical mortgage origination tasks, such as calculating Debt-to-Income Ratios, reconciling income, and flagging missing documents. The results showed that while models excel with clean, structured data, their accuracy decreases when variability and reasoning complexity are introduced.
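As a refresher, the Debt-to-Income calculation underlying one of these tasks is simple arithmetic: total monthly debt obligations divided by gross monthly income. A minimal Python sketch (the function name and inputs are illustrative, not taken from the benchmark harness):

```python
def dti_ratio(monthly_debt_payments: float, gross_monthly_income: float) -> float:
    """Debt-to-Income ratio: total monthly debt obligations divided by
    gross monthly income, expressed as a percentage."""
    if gross_monthly_income <= 0:
        raise ValueError("gross monthly income must be positive")
    return 100.0 * monthly_debt_payments / gross_monthly_income

# An applicant with $2,150 in monthly obligations and $6,000 gross income:
dti_ratio(2150, 6000)  # ≈ 35.8
```

The arithmetic itself is trivial; what the benchmark stresses is extracting the right figures from messy documents before the division ever happens.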
Real loan processing, however, doesn't consist of a single task per applicant. Loan processors complete multiple subtasks (extraction, computation, and validation) for every mortgage application. In the second phase of MortgageBench, we therefore asked whether a single model can accurately complete a series of tasks for the same applicant. If one agent can reliably handle these multitask workloads, teams can avoid the time and cost of building complex, multi-agent systems.
Why This Matters
Every additional agent or workflow component adds implementation and infrastructure cost. If one model can manage several tasks in sequence without significant performance loss, the result is fewer integration layers and faster deployment of AI-powered automation. This view completes the second half of the ROI equation: it shows how performance efficiency translates into faster deployment and cost savings.
Results
We added an evaluation to our benchmark that measured how model performance changed as the number of tasks per applicant increased. Each "compound task" required models to correctly complete multiple subtasks for one applicant (instead of one task at a time, as in the previous article), and performance was measured on the combined outcomes of those subtasks.

(Figure 1: Whole-Task Accuracy)
This evaluation measured correctness in absolute terms: the metric reflects the percentage of applicants for whom a model completed the entire workload without error. Accuracy was highest for smaller workloads, with all models performing strongly on one or two tasks. As the number of tasks increased, accuracy dropped sharply across models, reflecting the growing difficulty of maintaining consistent reasoning over longer sequences. Interestingly, accuracy occasionally fluctuated between task counts: in some cases, adding a task improved the outcome of a workload, suggesting that which task is added, not just workload size, affects the model's reasoning consistency. Repeating the evaluation across multiple runs would reduce the variability in these curves.
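The whole-task metric described above can be sketched in a few lines of Python; the data structure and field names here are illustrative, not taken from our evaluation harness:

```python
def whole_task_accuracy(results: dict[str, list[bool]]) -> float:
    """Fraction of applicants for whom *every* subtask was answered
    correctly; a single subtask error fails the whole compound task."""
    if not results:
        return 0.0
    passed = sum(all(subtasks) for subtasks in results.values())
    return passed / len(results)

# Three applicants, each with a list of per-subtask outcomes:
outcomes = {
    "applicant_a": [True, True, True],   # all subtasks correct -> counts
    "applicant_b": [True, False, True],  # one error -> whole task fails
    "applicant_c": [True, True, False],  # one error -> whole task fails
}
whole_task_accuracy(outcomes)  # -> 1/3
```

The all-or-nothing scoring is deliberate: a loan file with one wrong figure still needs human rework, so partial credit would overstate practical usefulness.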

(Figure 2: Subtask Sensitivity)
Heatmap analysis revealed that incremental workload size alone didn't cause the performance fluctuations; the difficulty of the individual subtasks did. Reading the rows from left to right shows that average per-task performance doesn't worsen as the number of tasks per applicant increases. Reasoning-heavy subtasks, such as Annualized Pay, Household Monthlies, Reconcile Documents, and Missing Documents, consistently showed lower accuracy regardless of the number of tasks, while straightforward arithmetic or retrieval subtasks, such as Household Liabilities and Credit Score, stayed nearly perfect. In other words, model weaknesses stem from task reasoning demands, not from multitasking. Note that the "Total" row at the bottom represents whole-task accuracy, which, as explained above, measures correctness in absolute terms. Because a single subtask error causes the entire compound task to be marked incorrect, the total accuracy is expectedly lower: individual subtasks may perform well in isolation, but the total score reflects an all-or-nothing standard of correctness.
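The gap between the subtask rows and the "Total" row is simple arithmetic. If subtask errors were independent, whole-task accuracy would be the product of the per-subtask accuracies, so even strong per-task scores compound into a much lower total. A hypothetical illustration (these accuracies are made up, not measured values from our heatmap):

```python
from math import prod

# Hypothetical per-subtask accuracies for a five-task compound workload:
subtask_acc = [0.98, 0.97, 0.90, 0.85, 0.99]

# Under an independence assumption, whole-task accuracy is their product,
# since every subtask must be correct for the compound task to pass:
whole_task = prod(subtask_acc)
print(f"{whole_task:.3f}")  # prints 0.720
```

Five subtasks that each score 85–99% in isolation yield a whole-task accuracy of only about 72%, which is why the "Total" row sits well below every individual row.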
This was a surprising takeaway: per-subtask performance generally remains stable as we add more tasks per applicant. While the nature of our evaluated tasks may influence this result, it challenges the assumption that compound workloads degrade per-task accuracy. These findings suggest that single-agent systems can handle end-to-end reasoning, offering a path to simpler AI workflows and lower implementation complexity.
Let's Connect
Contact Us