How to Build a B2B Document Extractor with Both Rules and LLM: A Step-by-Step Comparison

Introduction

Extracting structured data from B2B PDF invoices, purchase orders, and receipts is a common challenge. Many developers turn to rule-based approaches using OCR (like Tesseract) or explore modern LLMs (like LLaMA 3) for more flexible extraction. This guide walks you through building the same extractor twice — once with pytesseract rules and once with Ollama + LLaMA 3 — so you can compare performance, accuracy, and maintenance on a realistic B2B order scenario.

Source: towardsdatascience.com

What You Need

  • Python 3.8+ installed on your system
  • pytesseract and Tesseract OCR engine (follow installation for your OS)
  • Ollama (install from ollama.ai) with LLaMA 3 model pulled (ollama pull llama3)
  • A sample B2B PDF invoice or order document (use a real but anonymized one)
  • Basic Python libraries: pdf2image and Pillow (re and json ship with the standard library)
  • Text editor or IDE

Step-by-Step Guide

Step 1: Set Up the Environment and Sample Document

First, create a project folder and install dependencies:

pip install pytesseract pdf2image Pillow ollama

Place your sample B2B PDF in the folder. For this guide, we assume a purchase order containing fields like Order ID, Supplier Name, Line Items, Total Amount.

Step 2: Build the Rule-Based Extractor with pytesseract

Create a Python script rule_extractor.py. Use pdf2image to convert PDF pages to images, then apply Tesseract OCR:

from pdf2image import convert_from_path
import pytesseract

# Note: pdf2image requires the poppler utilities to be installed on your system
images = convert_from_path('order.pdf')
text = pytesseract.image_to_string(images[0])

Now define rules using regex and keyword matching. For example:

  • Extract Order ID by looking for patterns like Order #:\s*(\w+)
  • Find Supplier Name after the word Supplier or Vendor
  • Parse line items by assuming a tabular layout (fixed column positions or a consistent delimiter)
  • Grab the total via Total:\s*[\$]?(\d+\.\d{2})

Test with your PDF and adjust regex patterns. This approach works well for consistent layouts but fails if the format changes.
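The rules above can be sketched as a small function. The field labels and regex patterns here are assumptions about your document's layout and will need adjusting for your own templates:

```python
import re

def extract_fields(text: str) -> dict:
    """Rule-based extraction; patterns assume labels like 'Order #:' and 'Total:'."""
    fields = {}
    # Order ID: allow letters, digits, underscores, and hyphens (e.g. PO-1042)
    m = re.search(r'Order\s*#?:\s*([\w-]+)', text)
    fields['order_id'] = m.group(1) if m else None
    # Supplier name appears after the 'Supplier' or 'Vendor' keyword
    m = re.search(r'(?:Supplier|Vendor):\s*(.+)', text)
    fields['supplier'] = m.group(1).strip() if m else None
    # Total amount: optional dollar sign, two decimal places
    m = re.search(r'Total:\s*\$?(\d+\.\d{2})', text)
    fields['total'] = float(m.group(1)) if m else None
    return fields

sample = "Order #: PO-1042\nSupplier: Acme Corp\nTotal: $1234.50"
print(extract_fields(sample))
# → {'order_id': 'PO-1042', 'supplier': 'Acme Corp', 'total': 1234.5}
```

Returning None for missed fields (rather than raising) makes it easy to measure coverage later.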

Step 3: Build the LLM-Based Extractor with Ollama and LLaMA 3

Create llm_extractor.py. Read the PDF text as before (or use OCR output). Then pass it to Ollama:

import json
import ollama

prompt = f"""You are a B2B document parser. Extract fields: Order ID, Supplier Name, Line Items (as list), Total. Output only JSON.
Document:
{text}
"""

response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
result = json.loads(response['message']['content'])

This method is layout-agnostic and handles format variations naturally. However, it requires running a local LLM and may be noticeably slower. You can also tighten the prompt, or include a few worked examples, to enforce a strict output schema.


Step 4: Compare Outputs and Handle Failures

Run both scripts on the same document. Compare extracted JSON:

  • Rule-based may miss fields if layout shifts or OCR introduces noise
  • LLM-based may hallucinate or misinterpret ambiguous text

For failures, enhance rules with fallback patterns, or improve LLM prompt by providing examples. Consider using both in a hybrid pipeline where LLM acts as a backup.
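A simple field-by-field diff makes the comparison concrete. This is a minimal sketch; the sample dicts stand in for the outputs of the two extractors:

```python
def compare_extractions(rule_result: dict, llm_result: dict) -> dict:
    """Report per-field agreement; any mismatch flags the document for review."""
    report = {}
    for field in sorted(set(rule_result) | set(llm_result)):
        a, b = rule_result.get(field), llm_result.get(field)
        report[field] = 'match' if a == b else {'rule': a, 'llm': b}
    return report

rule_out = {'order_id': 'PO-1042', 'total': 1234.5}
llm_out = {'order_id': 'PO-1042', 'total': 1234.55}
print(compare_extractions(rule_out, llm_out))
# → {'order_id': 'match', 'total': {'rule': 1234.5, 'llm': 1234.55}}
```

Logging these reports over a batch of documents quickly shows which fields each approach struggles with.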

Step 5: Optimize for Your Use Case

For production, measure accuracy, speed, and maintenance overhead. Rule-based extraction is fast and cheap but brittle; LLM-based extraction offers flexibility but benefits from a GPU and demands careful prompt engineering.

You can also combine them: try rules first, then fall back to the LLM whenever the rule-based confidence drops below a threshold (say, 90%).
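That fallback logic can be sketched as follows. Here, confidence is simply the fraction of expected fields the rules managed to fill in; the two lambdas are stand-ins for the extractors built in Steps 2 and 3, and a real pipeline might use per-field heuristics instead:

```python
def hybrid_extract(text, rule_extractor, llm_extractor, threshold=0.9):
    """Try rules first; fall back to the LLM when rule confidence is low."""
    result = rule_extractor(text)
    # Confidence = share of fields the rules actually extracted
    found = sum(v is not None for v in result.values())
    confidence = found / len(result) if result else 0.0
    if confidence < threshold:
        return llm_extractor(text), 'llm'
    return result, 'rules'

# Stubs standing in for the real extractors
rules = lambda t: {'order_id': 'PO-1042', 'supplier': None, 'total': 1234.5}
llm = lambda t: {'order_id': 'PO-1042', 'supplier': 'Acme Corp', 'total': 1234.5}

result, source = hybrid_extract("...", rules, llm)
print(source)  # → llm  (rules found only 2 of 3 fields, confidence ≈ 0.67)
```

Because the cheap path runs first, the LLM is only invoked for the documents that actually need it.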

Tips for Success

  • Preprocess images before OCR: crop, deskew, convert to grayscale, increase contrast.
  • Use structured output with LLMs: ask for JSON and validate with Pydantic.
  • Test on multiple documents with varying layouts to see where each approach shines.
  • Monitor costs: local LLM via Ollama has no API costs but uses compute; rules need no GPU.
  • Version control both extraction scripts and sample documents to reproduce comparisons.
  • Consider a hybrid system as the best of both worlds: rules for speed, LLM for edge cases.
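On the structured-output tip: Pydantic is the more ergonomic choice, but the same schema check can be sketched with only the standard library, which keeps the example dependency-free. The field names below are assumptions matching this guide's purchase-order scenario:

```python
import json

# Expected fields and their types in the LLM's JSON output
REQUIRED = {'order_id': str, 'supplier': str, 'line_items': list, 'total': float}

def validate_output(raw: str) -> dict:
    """Parse LLM output and enforce a minimal schema (Pydantic does this more elegantly)."""
    data = json.loads(raw)
    for field, typ in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], typ):
            raise ValueError(f"{field} should be {typ.__name__}")
    return data

good = '{"order_id": "PO-1042", "supplier": "Acme Corp", "line_items": [], "total": 99.0}'
print(validate_output(good)['total'])  # → 99.0
```

Validating before downstream use turns a silent hallucination into a loud, catchable error.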

By building the same extractor twice, you gain practical insight into trade-offs and can make an informed choice for your B2B document processing needs.
