August 15, 2025 · 9-minute read · by Ruperto Martinez and Eduardo Piñeiro

Picture this: A field technician is called to inspect a vehicle after an accident. The lighting is poor, the documents are partially handwritten, and one or two have been filled out with a pen that was running out of ink. Now imagine thousands of these documents and images being processed into your system daily, with critical business decisions depending on the accurate extraction of that data. 


Welcome to the world of document intelligence in the auto insurance industry. 


As AI capabilities have evolved dramatically over the past few years, organizations face a multitude of complex decisions. Among them: whether to use the raw power of vision-capable large language models (LLMs) to interpret documents directly, or to apply a hybrid approach combining traditional computer vision techniques, such as OCR, with LLMs. 


This isn't just an academic question or exercise. The choice impacts accuracy, cost, implementation timelines, and ultimately, business outcomes. In this article, we will share insights from our real-world implementation in the auto insurance industry, providing a blueprint and use case to help you make the right choice for your document intelligence needs. 


Before diving deeper, we want to take a moment to define some of the key technologies that will be discussed in this article: 

  • OCR (Optical Character Recognition): An application of computer vision technology that converts text within images into digital text, which can be analyzed and processed by machines. Some well-known providers of this technology are Tesseract, Azure Document Intelligence, AWS Textract, and Google Document AI. 
  • LLMs (Large Language Models): AI systems trained on vast text datasets that can understand and generate human-like text. 
  • VLMs (Vision-capable Language Models): LLMs with the ability to process and interpret images as well as text. 


Now that we have a clearer understanding of the key terms, let’s go ahead and jump in. 

The Strategic Decision: Raw LLM Vision vs. Specialized Pipelines and Solutions

When automating our auto insurance client's document analysis pipeline, an important question was whether to use raw LLM vision capabilities or to design a pipeline combining specialized tools with the intelligence of LLMs. Here is how these approaches differ: 

Approach 1: Vision-capable LLM 

This approach feeds document images directly to vision-capable LLMs like Claude or GPT. In it, all processing occurs through the LLM. This includes detecting text, understanding context, and extracting information. Essentially, in the same way you would expect an LLM to understand the contents of your email, you are now asking it to understand the visual content of an image. 
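As a sketch of this approach, the snippet below assembles a single request message pairing a document image with an extraction prompt. The content-block shape follows Anthropic's Messages API for Claude; treat the exact structure and media type as assumptions to verify against your provider's documentation.

```python
import base64


def build_vision_message(image_bytes: bytes, prompt: str) -> dict:
    """Build one user message pairing a document image with a prompt.

    The content-block shape follows Anthropic's Messages API; other
    vision-capable LLM APIs use similar base64-image structures.
    """
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",  # adjust to the actual image format
                    "data": base64.b64encode(image_bytes).decode("utf-8"),
                },
            },
            {"type": "text", "text": prompt},
        ],
    }
```

Everything (text detection, context, extraction) then happens inside the model that receives this message, which is what makes the approach simple to stand up and hard to make deterministic.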

Advantages: 

  • Single-step process 
  • Simpler implementation 
  • Visual context of textual information 
  • Less prone to common OCR errors 

Challenges: 

  • Potential hallucinations 
  • Context window limitations 
  • May ignore complex or unusual text 
  • Unpredictable costs based on image size and resolution 
  • Processing time varies with image complexity 

Approach 2: OCR + LLM 

This approach splits document processing into distinct steps: use computer vision and OCR to extract text, then feed that digitized text into an LLM for interpretation and classification. 

Advantages: 

  • Higher reliability in text extraction 
  • Lower, more predictable costs 
  • Higher certainty that all text is being extracted 
  • More consistent performance across varying document quality 

Challenges: 

  • More complex implementation 
  • Multi-step pipeline with added potential for error 
  • Risk of OCR errors (e.g., misreading the letter O as the digit 0) 
  • Longer development timeline and testing 
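OCR character confusions can often be caught with domain knowledge. For example, valid VINs never contain the letters I, O, or Q, so any occurrence in OCR output must be a misread digit. A minimal post-correction sketch:

```python
def normalize_vin(raw: str) -> str:
    """Correct common OCR confusions in a VIN.

    Valid VINs exclude the letters I, O, and Q, so any occurrence
    is an OCR misread of a visually similar digit.
    """
    table = str.maketrans({"I": "1", "O": "0", "Q": "0"})
    return raw.strip().upper().translate(table)
```

Rules like this are cheap guardrails that sit between the OCR stage and the LLM, shrinking the error surface the model has to reason about.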

Case Study: Auto Insurance Industry Implementation

The Challenge 

Our client faced a challenge: manually processing hundreds of images daily. These included vehicle photos, car registrations, driver’s licenses, VINs, and so forth. Not only was there a variety of images, but many presented automation challenges as well: 

  • A mix of handwritten and printed text with varying document layouts 
  • Poor lighting conditions 
  • Differing image angles and perspectives 
  • Varying image quality (from high-end phones to low-end cameras) 

With these conditions and with critical business decisions on the line, we had to be careful in choosing and implementing a solution. 

Raw LLM Vision Approach 

Initially, we attempted using vision-capable LLMs directly on the images. After all, this solution can often be tested with a one-shot prompt to these models. The results were mixed: 
 
We found that the LLMs performed well on images with clear printed text and relatively small quantities of information. A perfect example was driver’s licenses: they were usually better cared for than other documents, leading to better image quality, and they contained comparatively little information to extract. 

However, LLMs failed on more complex images: those with dense text, inconsistent formats, and handwriting. The prime example was car registrations. These photos were taken outdoors, with varying lighting conditions and angles. Additionally, because these documents were not standardized, they could be submitted digitally, on printed paper, or handwritten. These factors led the LLM to struggle with extraction, sometimes ignoring entire data fields, confusing text in the document, or making dangerous assumptions. This level of nondeterminism was unacceptable for a business-critical process, even with guardrails and prompt engineering. 

Improved Approach: OCR + LLM 

Recognizing the risks and need for improvement, we iterated and tested various approaches until we landed on the following pipeline: 

  1. Image Augmentation: Preprocessing images to optimize for OCR 
  2. Specialized OCR: Extracting all text from the document 
  3. Text Structuring: Organizing extracted text based on document structure 
  4. LLM Processing: Using an LLM to interpret, classify, and validate the extracted text 
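The text-structuring step (step 3) can be sketched as a pure function. Here we assume the OCR stage (e.g., Tesseract or Azure Document Intelligence) returns raw text with line breaks in reading order; the structured, line-numbered output then goes into the LLM prompt for step 4:

```python
def structure_ocr_text(raw_text: str) -> str:
    """Step 3: organize raw OCR output before handing it to the LLM.

    Drops empty lines, collapses runs of whitespace, and numbers the
    surviving lines so the LLM prompt can reference positions in the
    document.
    """
    lines = [" ".join(ln.split()) for ln in raw_text.splitlines()]
    lines = [ln for ln in lines if ln]
    return "\n".join(f"{i + 1}: {ln}" for i, ln in enumerate(lines))
```

Because the LLM now receives deterministic text rather than pixels, its job reduces to classification and validation, which is where it is most reliable.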

The result was a significant improvement: text-extraction costs dropped dramatically, handwritten notes and fields were captured far more reliably, and the LLM could focus on classification rather than detection, which reduced issues with context windows and prompt engineering. 

To illustrate further, below is an old car registration found online, followed by the JSON Claude Sonnet 3.7 returned when the image was fed in directly with a basic prompt, and the JSON produced by the specialized pipeline: 


Figure 1. An image of an old car registration from the state of Illinois.


Figure 2. A screenshot of the JSON returned by Claude Sonnet 3.7 when asked to analyze the image and search for car registration data fields. 

Figure 3. A screenshot of the JSON returned by Claude Sonnet 3.7 with the advanced pipeline. 

The results illustrate the limitations of direct vision processing. In the raw LLM approach, the model incorrectly extracted the registration number from what appears to be a stamp, misread the license plate as "4X32," and identified the vehicle model as simply "T." All are significant errors that would impact downstream business processes.

The specialized pipeline demonstrates marked improvement in the second example. Fields that couldn't be reliably extracted were conservatively marked as null, and the vehicle model was more accurately identified as "Roadster." This is because the OCR stage produces a structured, more deterministic text output that the LLM can then analyze. This separation of tasks allows for more control when prompt engineering and greater reliability with the vision aspect of the analysis.

This architectural difference proved crucial for our business-critical application, where extraction errors could propagate through subsequent processing stages and impact decision-making, especially since we wanted to be able to flag missing or absent fields. 
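The flagging of missing fields described above can be sketched as a small check on the pipeline's JSON output. The field names here are hypothetical, standing in for whatever schema the extraction stage emits:

```python
# Hypothetical required fields for a car-registration document.
REQUIRED_FIELDS = ["registration_number", "license_plate", "vin", "make", "model"]


def flag_missing_fields(extracted: dict) -> list:
    """Return the required fields the pipeline could not extract.

    Conservative nulls from the OCR+LLM stage surface here for human
    review instead of propagating guesses into downstream decisions.
    """
    return [f for f in REQUIRED_FIELDS
            if extracted.get(f) in (None, "", "null")]
```

A non-empty result routes the document to manual review, which is far safer than acting on a hallucinated value.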

Cost Considerations 

  • Raw LLM Approach: Costs vary significantly based on image resolution, size, and complexity, because LLMs bill by input token and larger images consume more tokens. An image from an iPhone 16 Pro Max could cost substantially more to process than one from a lower-resolution device. In practice, these images required input token counts in the thousands. Processing one image with a powerful model like Claude 3.7 Sonnet would therefore cost: 

Input tokens cost: 1,500 tokens × $3/MTok = 1,500 × ($3/1,000,000) = $0.0045 

Output tokens cost: 600 tokens × $15/MTok = 600 × ($15/1,000,000) = $0.009 

Total Cost = $0.0045 + $0.009 = $0.0135 

  • OCR + LLM Approach: OCR processing, on the other hand, has a fixed cost per image, and the LLM costs were much more predictable when processing just the extracted text, given that we could calculate the average number of words per image and document category. This means an average cost per image would look something like this: 

OCR Text Extraction cost: 1 image × $1/1K images = 1 × ($1/1,000) = $0.001 

Input tokens cost: 600 tokens × $3/MTok = 600 × ($3/1,000,000) = $0.0018 

Output tokens cost: 600 tokens × $15/MTok = 600 × ($15/1,000,000) = $0.009 

Total Cost = $0.001 + $0.0018 + $0.009 = $0.0118 

As we can see, there was a 12.6% decrease in costs for each processed image in just this simple, lower-end cost example. For high-volume operations like our auto insurance client's, this cost predictability and efficiency was crucial for budgeting and scaling.  
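The cost arithmetic above can be reproduced with a short script, using the illustrative rates from this example ($3/MTok input, $15/MTok output, $1 per 1,000 OCR pages):

```python
def llm_cost(input_tokens: int, output_tokens: int,
             in_rate: float = 3.0, out_rate: float = 15.0) -> float:
    """Token cost in dollars at per-million-token rates."""
    return input_tokens * in_rate / 1e6 + output_tokens * out_rate / 1e6


# Raw LLM vision: the whole image is billed as input tokens.
raw_llm = llm_cost(1500, 600)

# OCR + LLM: a fixed OCR fee plus a smaller text-only prompt.
ocr_llm = 1.0 / 1000 + llm_cost(600, 600)

savings = (raw_llm - ocr_llm) / raw_llm
```

Parameterizing the rates this way also makes it easy to re-run the comparison as model pricing changes.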

Key Takeaways: Choosing the Right Approach

When to Use OCR + LLM 

  • High Variability: Documents with handwriting, variable quality, or complex formats 
  • Cost Optimization: When processing costs need to be predictable and optimized 
  • Text-Heavy Documents: Large documents or images with significant text content 
  • Complete Text Extraction: When missing any text would be problematic 
  • High-Volume Processing: When processing thousands of documents daily 

When to Use Raw LLM Vision 

  • Time Constraints: When rapid implementation is critical 
  • Simple Analysis: For straightforward documents with limited text 
  • Visual Context Importance: When understanding the visual layout is crucial 
  • Low Volume: When processing a smaller number of documents 
  • Standardized Formats: For documents with consistent, well-defined formats 

Conclusion

The choice between raw LLM vision capabilities and specialized OCR+LLM approaches ultimately depends on your specific use case, volume requirements, development timelines, cost considerations, and accuracy needs. Document intelligence is no longer just about extracting text; it's about understanding content in context and making that information actionable. By strategically choosing the right approach for your needs, you can build systems that reliably transform documents into business intelligence. 

Authors

Ruperto Martinez

Eduardo Piñeiro

Xtillion - Mastering Document Intelligence: Raw LLM Vision vs. OCR