August 15, 2025 · 9-minute read · by Ruperto Martinez and Eduardo Piñeiro

Picture this: A field technician is called to inspect a vehicle after an accident. The lighting is poor, the documents are partially handwritten, and one or two have been filled out with a pen that was running out of ink. Now imagine thousands of these documents and images being processed into your system daily, with critical business decisions depending on the accurate extraction of that data. 


Welcome to the world of document intelligence in the auto insurance industry. 


As AI capabilities have evolved dramatically over the past few years, organizations face a multitude of complex decisions. Among them: whether to use the raw power of vision-capable large language models (LLMs) to interpret documents directly, or to apply a hybrid approach combining traditional computer vision techniques, such as OCR, with LLMs. 


This isn't just an academic question or exercise. The choice impacts accuracy, cost, implementation timelines, and ultimately, business outcomes. In this article, we will share insights from our real-world implementation in the auto insurance industry, providing a blueprint and use case to help you make the right choice for your document intelligence needs. 


Before diving deeper, we want to take a moment to define some of the key technologies that will be discussed in this article: 

  • OCR (Optical Character Recognition): An application of computer vision technology that converts text within images into digital text, which can be analyzed and processed by machines. Some well-known providers of this technology are Tesseract, Azure Document Intelligence, AWS Textract, and Google Document AI. 
  • LLMs (Large Language Models): AI systems trained on vast text datasets that can understand and generate human-like text. 
  • VLMs (Vision-capable Language Models): LLMs with the ability to process and interpret images as well as text. 


Now that we have a clearer understanding of the key terms, let’s go ahead and jump in. 

The Strategic Decision: Raw LLM Vision vs. Specialized Pipelines and Solutions

When automating our auto insurance client's document analysis pipeline, an important question was whether to use raw LLM vision capabilities or to design a pipeline combining specialized tools with the intelligence of LLMs. Here is how these approaches differ: 

Approach 1: Vision-capable LLM 

This approach feeds document images directly to vision-capable LLMs like Claude or GPT. In it, all processing occurs through the LLM. This includes detecting text, understanding context, and extracting information. Essentially, in the same way you would expect an LLM to understand the contents of your email, you are now asking it to understand the visual content of an image. 
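As a sketch of this approach, the snippet below assembles a single request message pairing a document image with an extraction prompt. The content-block shape follows Anthropic's Messages API for Claude; treat the exact structure and media type as assumptions to verify against your provider's documentation.

```python
import base64


def build_vision_message(image_bytes: bytes, prompt: str) -> dict:
    """Build one user message pairing a document image with a prompt.

    The content-block shape follows Anthropic's Messages API; other
    vision-capable LLM APIs use similar base64-image structures.
    """
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",  # adjust to the actual image format
                    "data": base64.b64encode(image_bytes).decode("utf-8"),
                },
            },
            {"type": "text", "text": prompt},
        ],
    }
```

Everything (text detection, context, extraction) then happens inside the model that receives this message, which is what makes the approach simple to stand up and hard to make deterministic.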

Advantages: 

  • Single-step process 
  • Simpler implementation 
  • Visual context of textual information 
  • Less prone to common OCR errors 

Challenges: 

  • Potential hallucinations 
  • Context window limitations 
  • May ignore complex or unusual text 
  • Unpredictable costs based on image size and resolution 
  • Processing time varies with image complexity 

Approach 2: OCR + LLM 

This approach splits document processing into distinct steps: use computer vision and OCR to extract text, then feed that digitized text into an LLM for interpretation and classification. 

Advantages: 

  • Higher reliability in text extraction 
  • Lower, more predictable costs 
  • Higher certainty that all text is being extracted 
  • More consistent performance across varying document quality 

Challenges: 

  • More complex implementation 
  • Multi-step pipeline with added potential for error 
  • Risk of OCR errors (e.g., misreading the letter O as the digit 0) 
  • Longer development timeline and testing 
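OCR character confusions can often be caught with domain knowledge. For example, valid VINs never contain the letters I, O, or Q, so any occurrence in OCR output must be a misread digit. A minimal post-correction sketch:

```python
def normalize_vin(raw: str) -> str:
    """Correct common OCR confusions in a VIN.

    Valid VINs exclude the letters I, O, and Q, so any occurrence
    is an OCR misread of a visually similar digit.
    """
    table = str.maketrans({"I": "1", "O": "0", "Q": "0"})
    return raw.strip().upper().translate(table)
```

Rules like this are cheap guardrails that sit between the OCR stage and the LLM, shrinking the error surface the model has to reason about.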

Case Study: Auto Insurance Industry Implementation

The Challenge 

Our client faced a challenge: manually processing hundreds of images daily. These included vehicle photos, car registrations, driver’s licenses, VINs, and so forth. Not only was there a variety of images, but many presented automation challenges as well: 

  • A mix of handwritten and printed text with varying document layouts 
  • Poor lighting conditions 
  • Differing image angles and perspectives 
  • Varying image quality (from high-end phones to low-end cameras) 

With these conditions and with critical business decisions on the line, we had to be careful in choosing and implementing a solution. 

Raw LLM Vision Approach 

Initially, we attempted using vision-capable LLMs directly on the images. After all, this solution can often be tested with a one-shot prompt to these models. The results were mixed: 
 
We found that the LLMs performed well on images with clear printed text and relatively small quantities of information. A perfect example was driver’s licenses: they were usually better cared for than other documents, leading to better image quality, and they contained comparatively little information to extract. 

However, LLMs failed on more complex images: those with dense text, inconsistent formats, and handwriting. The prime example was car registrations. These photos were taken outdoors, with varying lighting conditions and angles. Additionally, because these documents were not standardized, they could be submitted digitally, on printed paper, or handwritten. These factors led the LLM to struggle with extraction, sometimes ignoring entire data fields, confusing text in the document, or making dangerous assumptions. This level of nondeterminism was unacceptable for a business-critical process, even with guardrails and prompt engineering. 

Improved Approach: OCR + LLM 

Recognizing the risks and need for improvement, we iterated and tested various approaches until we landed on the following pipeline: 

  1. Image Augmentation: Preprocessing images to optimize for OCR 
  2. Specialized OCR: Extracting all text from the document 
  3. Text Structuring: Organizing extracted text based on document structure 
  4. LLM Processing: Using an LLM to interpret, classify, and validate the extracted text 
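The text-structuring step (step 3) can be sketched as a pure function. Here we assume the OCR stage (e.g., Tesseract or Azure Document Intelligence) returns raw text with line breaks in reading order; the structured, line-numbered output then goes into the LLM prompt for step 4:

```python
def structure_ocr_text(raw_text: str) -> str:
    """Step 3: organize raw OCR output before handing it to the LLM.

    Drops empty lines, collapses runs of whitespace, and numbers the
    surviving lines so the LLM prompt can reference positions in the
    document.
    """
    lines = [" ".join(ln.split()) for ln in raw_text.splitlines()]
    lines = [ln for ln in lines if ln]
    return "\n".join(f"{i + 1}: {ln}" for i, ln in enumerate(lines))
```

Because the LLM now receives deterministic text rather than pixels, its job reduces to classification and validation, which is where it is most reliable.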

The result was a significant improvement: text-extraction costs dropped dramatically, handwritten notes and fields were captured far more reliably, and the LLM could focus on classification rather than detection, which reduced issues with context windows and prompt engineering. 

To illustrate further, below is an old car registration found online, followed by the JSON Claude Sonnet 3.7 returned when the image was fed in directly with a basic prompt, and the JSON produced by the specialized pipeline: 


Figure 1. An image of an old car registration from the state of Illinois.


Figure 2. A screenshot of the JSON returned by Claude Sonnet 3.7 when asked to analyze the image and search for car registration data fields. 

Figure 3. A screenshot of the JSON returned by Claude Sonnet 3.7 with the advanced pipeline. 

The results illustrate the limitations of direct vision processing. In the raw LLM approach, the model incorrectly extracted the registration number from what appears to be a stamp, misread the license plate as "4X32," and identified the vehicle model as simply "T." All are significant errors that would impact downstream business processes.

The specialized pipeline demonstrates marked improvement in the second example. Fields that couldn't be reliably extracted were conservatively marked as null, and the vehicle model was more accurately identified as "Roadster." This is because the OCR stage produces a structured, more deterministic text output that the LLM can then analyze. This separation of tasks allows for more control when prompt engineering and greater reliability with the vision aspect of the analysis.

This architectural difference proved crucial for our business-critical application, where extraction errors could propagate through subsequent processing stages and impact decision-making, especially since we wanted to be able to flag missing or absent fields. 
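The flagging of missing fields described above can be sketched as a small check on the pipeline's JSON output. The field names here are hypothetical, standing in for whatever schema the extraction stage emits:

```python
# Hypothetical required fields for a car-registration document.
REQUIRED_FIELDS = ["registration_number", "license_plate", "vin", "make", "model"]


def flag_missing_fields(extracted: dict) -> list:
    """Return the required fields the pipeline could not extract.

    Conservative nulls from the OCR+LLM stage surface here for human
    review instead of propagating guesses into downstream decisions.
    """
    return [f for f in REQUIRED_FIELDS
            if extracted.get(f) in (None, "", "null")]
```

A non-empty result routes the document to manual review, which is far safer than acting on a hallucinated value.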

Cost Considerations 

  • Raw LLM Approach: Costs vary significantly based on image resolution, size, and complexity, because LLMs bill by input token and larger images consume more tokens. An image from an iPhone 16 Pro Max could cost substantially more to process than one from a lower-resolution device. In practice, these images required input token counts in the thousands. Processing one image with a powerful model like Claude 3.7 Sonnet would therefore cost: 

Input tokens cost: 1,500 tokens × $3/MTok = 1,500 × ($3/1,000,000) = $0.0045 

Output tokens cost: 600 tokens × $15/MTok = 600 × ($15/1,000,000) = $0.009 

Total Cost = $0.0045 + $0.009 = $0.0135 

  • OCR + LLM Approach: OCR processing, on the other hand, has a fixed cost per image, and the LLM costs were much more predictable when processing just the extracted text, given that we could calculate the average number of words per image and document category. This means an average cost per image would look something like this: 

OCR Text Extraction cost: 1 image × $1/1K images = 1 × ($1/1,000) = $0.001 

Input tokens cost: 600 tokens × $3/MTok = 600 × ($3/1,000,000) = $0.0018 

Output tokens cost: 600 tokens × $15/MTok = 600 × ($15/1,000,000) = $0.009 

Total Cost = $0.001 + $0.0018 + $0.009 = $0.0118 

As we can see, there was a 12.6% decrease in costs for each processed image in just this simple, lower-end cost example. For high-volume operations like our auto insurance client's, this cost predictability and efficiency was crucial for budgeting and scaling.  
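The cost arithmetic above can be reproduced with a short script, using the illustrative rates from this example ($3/MTok input, $15/MTok output, $1 per 1,000 OCR pages):

```python
def llm_cost(input_tokens: int, output_tokens: int,
             in_rate: float = 3.0, out_rate: float = 15.0) -> float:
    """Token cost in dollars at per-million-token rates."""
    return input_tokens * in_rate / 1e6 + output_tokens * out_rate / 1e6


# Raw LLM vision: the whole image is billed as input tokens.
raw_llm = llm_cost(1500, 600)

# OCR + LLM: a fixed OCR fee plus a smaller text-only prompt.
ocr_llm = 1.0 / 1000 + llm_cost(600, 600)

savings = (raw_llm - ocr_llm) / raw_llm
```

Parameterizing the rates this way also makes it easy to re-run the comparison as model pricing changes.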

Key Takeaways: Choosing the Right Approach

When to Use OCR + LLM 

  • High Variability: Documents with handwriting, variable quality, or complex formats 
  • Cost Optimization: When processing costs need to be predictable and optimized 
  • Text-Heavy Documents: Large documents or images with significant text content 
  • Complete Text Extraction: When missing any text would be problematic 
  • High-Volume Processing: When processing thousands of documents daily 

When to Use Raw LLM Vision 

  • Time Constraints: When rapid implementation is critical 
  • Simple Analysis: For straightforward documents with limited text 
  • Visual Context Importance: When understanding the visual layout is crucial 
  • Low Volume: When processing a smaller number of documents 
  • Standardized Formats: For documents with consistent, well-defined formats 

Conclusion

The choice between raw LLM vision capabilities and specialized OCR+LLM approaches ultimately depends on your specific use case, volume requirements, development timelines, cost considerations, and accuracy needs. Document intelligence is no longer just about extracting text; it's about understanding content in context and making that information actionable. By strategically choosing the right approach for your needs, you can build systems that reliably transform documents into business intelligence. 

Authors

Ruperto Martinez

Eduardo Piñeiro

Xtillion - Mastering Document Intelligence: Raw LLM Vision vs. OCR