Automate Invoice Data Entry from PDF to Spreadsheet
To automate invoice data extraction to a spreadsheet (Excel or Google Sheets), you can use either a **Python-based programmatic approach** (best for digital, text-based PDFs) or a **No-Code OCR workflow** (best for scanned PDFs and varying invoice layouts). Below are the step-by-step implementations for both methods.

---
Method 1: Python Automation (For Digital PDFs)
If your invoices are digital PDFs (where you can highlight and copy text), you can use Python with pdfplumber to extract text and pandas to export the data to an Excel spreadsheet.
Step 1: Install Required Libraries
Run the following command in your terminal to install the necessary libraries:
pip install pdfplumber pandas openpyxl
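Before writing any patterns, it can be worth confirming that a given PDF really has a text layer. The sketch below is one rough way to check. The helper names `has_text_layer` and `read_page_texts`, the `min_chars` threshold, and the example path are illustrative assumptions, not part of pdfplumber.

```python
def has_text_layer(page_texts, min_chars=20):
    """Heuristic: True if any page yielded at least min_chars non-whitespace characters.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def read_page_texts(pdf_path):
    """Collect the extracted text of every page ("" where nothing is found)."""
    import pdfplumber  # imported here so has_text_layer works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


# Usage with a real file (the path is a placeholder):
# print(has_text_layer(read_page_texts("invoices/example.pdf")))

# Quick demonstration on plain strings standing in for page text.
digital_like = has_text_layer(["Invoice Number: INV-1042\nTotal: $1,234.50"])
scanned_like = has_text_layer(["", "   \n"])
print(digital_like, scanned_like)
```

If the check comes back false for most of your files, Method 2 is likely the better fit.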
Step 2: Write the Extraction Script
Create a Python file (e.g., extractor.py) and use the following code. This script reads the first page of each PDF in an invoices folder, uses regular expressions (regex) to locate the Invoice Number, Date, and Total Amount, then saves them to an Excel sheet.
import os
import re

import pandas as pd
import pdfplumber


def extract_invoice_data(pdf_path):
    """Extract invoice number, date, and total from the first page of a PDF."""
    with pdfplumber.open(pdf_path) as pdf:
        # Extract text from the first page; use "" if the page has no text layer
        first_page = pdf.pages[0]
        text = first_page.extract_text() or ""

    # Regex patterns (modify these based on your invoice layouts)
    invoice_num_pat = r"(?:Invoice\s?#|Invoice\s?Number|Inv\s?No\.?)\s*:\s*(\S+)"
    date_pat = r"(?:Date|Invoice\s?Date)\s*:\s*([\d/.-]+)"
    # \b stops "Total" from matching inside "Subtotal"
    total_pat = r"\b(?:Total|Amount\s?Due|Total\s?Due)\s*:\s*\$?([\d,.]+)"

    invoice_num = re.search(invoice_num_pat, text, re.IGNORECASE)
    date = re.search(date_pat, text, re.IGNORECASE)
    total = re.search(total_pat, text, re.IGNORECASE)

    return {
        "File Name": os.path.basename(pdf_path),
        "Invoice Number": invoice_num.group(1) if invoice_num else "Not Found",
        "Date": date.group(1) if date else "Not Found",
        "Total": total.group(1) if total else "Not Found",
    }


# Process a folder of invoices
invoice_folder = "./invoices"
data_list = []
for file in os.listdir(invoice_folder):
    # Compare the extension case-insensitively so "INVOICE.PDF" is not skipped
    if file.lower().endswith(".pdf"):
        path = os.path.join(invoice_folder, file)
        try:
            data = extract_invoice_data(path)
            data_list.append(data)
        except Exception as e:
            print(f"Error processing {file}: {e}")

# Export to Excel
df = pd.DataFrame(data_list)
df.to_excel("extracted_invoices.xlsx", index=False)
print("Extraction complete. Data saved to extracted_invoices.xlsx")
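Because the patterns need tuning for each vendor, it can help to try them on a pasted text sample before processing a whole folder. The sketch below shows one way to do that using only the standard library. The sample text and the `parse_fields` and `to_decimal` helpers are illustrative assumptions, and the amount conversion assumes a comma thousands separator with a dot decimal separator.

```python
import re
from decimal import Decimal, InvalidOperation

# Patterns in the same spirit as the extraction script; adjust per vendor.
PATTERNS = {
    "Invoice Number": r"(?:Invoice\s?#|Invoice\s?Number|Inv\s?No\.?)\s*:\s*(\S+)",
    "Date": r"(?:Invoice\s?Date|Date)\s*:\s*([\d/.-]+)",
    # \b keeps "Total" from matching inside "Subtotal".
    "Total": r"\b(?:Total\s?Due|Amount\s?Due|Total)\s*:\s*\$?([\d,.]+)",
}


def parse_fields(text):
    """Return a dict of field name -> captured string, or None if absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result


def to_decimal(amount):
    """Convert a string such as '1,234.50' to Decimal; None if not parseable."""
    if amount is None:
        return None
    try:
        return Decimal(amount.replace(",", ""))
    except InvalidOperation:
        return None


# Hypothetical sample, standing in for text returned by extract_text().
SAMPLE_TEXT = """ACME Supplies
Invoice Number: INV-1042
Invoice Date: 03/15/2024
Subtotal: $1,100.00
Tax: $134.50
Total: $1,234.50
"""

fields = parse_fields(SAMPLE_TEXT)
total_value = to_decimal(fields["Total"])
print(fields)
print(total_value)
```

Converting the total to a `Decimal` rather than leaving it as text makes later arithmetic checks exact, which matters for currency values.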
---
Method 2: No-Code Automation (For Scanned PDFs & Scale)
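If your invoices are scanned images or come in highly varied layouts, the regex-based Python script will not work reliably: scanned pages have no text layer for pdfplumber to read, and fixed patterns break when labels move or change. In these cases, use a cloud-based OCR parser combined with a workflow automation tool.
The Workflow Stack: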
- Trigger: Google Drive / OneDrive / Email (where invoices are received).
- Parser: Google Document AI, Rossum, or AWS Textract (for OCR and key-value extraction).
- Action: Google Sheets or Microsoft Excel 365.
Step-by-Step Setup using Make.com (formerly Integromat) or Zapier:
- Set up the Trigger: Connect your email or cloud storage account. Set the trigger to run whenever a new file is uploaded to an "Invoices" folder.
- Add the OCR Step: Route the PDF file to an AI parser like Google Document AI (using the "Invoice Parser" processor) or Docparser. These tools automatically identify fields like supplier_name, invoice_date, and total_amount across most layouts.
- Add the Spreadsheet Step: Connect Google Sheets. Select the action "Add Multiple Rows" or "Add Row" (exact action names vary by platform). Map the extracted fields from the OCR step to the corresponding columns in your spreadsheet.
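The field-mapping step is easy to get wrong without noticing, so it may help to see what it does conceptually. The Python sketch below is only an illustration of the idea. Make.com and Zapier do not require you to write it, and the column order, the `to_row` helper, and the sample parser output are all assumptions.

```python
# Hypothetical column order of the target sheet.
COLUMNS = ["supplier_name", "invoice_date", "total_amount"]


def to_row(parsed_fields, columns=COLUMNS):
    """Build one spreadsheet row, in column order, from key-value parser output.

    Missing keys become empty cells so later columns never shift left.
    """
    return [parsed_fields.get(name, "") for name in columns]


# Hypothetical parser output; real parsers return richer structures.
parsed = {"total_amount": "1234.50", "supplier_name": "ACME Supplies"}
row = to_row(parsed)
print(row)
```

The point of the explicit column list is that a field the parser failed to find leaves a visible blank cell instead of shifting the total into the date column.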
Limitations and Reality Checks
- Layout Sensitivity: Python/regex scripts are brittle. The patterns match specific labels and punctuation, so if a vendor renames a label (for example, "Total" to "Balance") or drops the colon after it, the script returns "Not Found" for that field.
- OCR Accuracy: No optical character recognition (OCR) tool is 100% accurate. Handwritten numbers, low-resolution scans, or folded paper invoices will occasionally result in extraction errors (e.g., reading an "8" as a "3").
- Human-in-the-Loop: For critical financial workflows, always implement a manual verification step or validation script to flag missing fields or mathematical mismatches before importing data into your accounting software.
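A validation script of the kind mentioned above could look roughly like the following sketch. It flags missing fields and checks that subtotal plus tax equals the total. The "Subtotal" and "Tax" keys, the `validate_invoice` helper, and the one-cent tolerance are assumptions for illustration; the Method 1 script does not extract subtotal or tax as written.

```python
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ["Invoice Number", "Date", "Total"]
MISSING_MARKERS = (None, "", "Not Found")


def _amount(value):
    """Parse a currency string such as '$1,234.50'; None if missing or invalid."""
    if value in MISSING_MARKERS:
        return None
    try:
        return Decimal(str(value).replace(",", "").replace("$", ""))
    except InvalidOperation:
        return None


def validate_invoice(record, tolerance=Decimal("0.01")):
    """Return a list of issues for manual review; an empty list means none found."""
    issues = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in MISSING_MARKERS:
            issues.append(f"missing field: {field}")

    # The arithmetic check runs only when all three amounts are parseable.
    amounts = [_amount(record.get(key)) for key in ("Subtotal", "Tax", "Total")]
    if all(a is not None for a in amounts):
        subtotal, tax, total = amounts
        if abs(subtotal + tax - total) > tolerance:
            issues.append(f"subtotal + tax = {subtotal + tax}, but total = {total}")
    return issues


# Hypothetical records: one clean, one with a missing number and a misread digit.
good = {"Invoice Number": "INV-1042", "Date": "03/15/2024",
        "Subtotal": "1,100.00", "Tax": "134.50", "Total": "1,234.50"}
bad = {"Invoice Number": "Not Found", "Date": "03/15/2024",
       "Subtotal": "1,100.00", "Tax": "134.50", "Total": "1,284.50"}

good_issues = validate_invoice(good)
bad_issues = validate_invoice(bad)
print(good_issues)
print(bad_issues)
```

Records with a non-empty issue list can be routed to a review sheet instead of the accounting import.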