GUARDLABS
GuardLabs · Technical note

Python Script to Convert PDF to Excel Spreadsheet

To extract tabular data from a PDF and save it as an Excel spreadsheet (.xlsx) using Python, you can use either pdfplumber or tabula-py. Below are two standalone scripts, one for each library. Choose Method 1 for general PDFs that mix tables with running text, or Method 2 for PDFs whose tables have clearly ruled grids.

Prerequisites and Installation

Open your terminal or command prompt and install the required dependencies:
pip install pdfplumber pandas openpyxl tabula-py
Note: Method 2 (tabula-py) requires a Java Runtime Environment (JRE) to be installed on your system and available on your PATH.
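
Before running Method 2, it can help to confirm that Java is actually reachable from Python. The following is a minimal sketch using only the standard library; the helper names find_java and java_version_banner are illustrative and not part of tabula-py.

```python
import shutil
import subprocess
from typing import Optional


def find_java() -> Optional[str]:
    """Return the full path of the `java` executable, or None if it is not on PATH."""
    return shutil.which("java")


def java_version_banner() -> Optional[str]:
    """Return the first line printed by `java -version`, or None if Java is unavailable."""
    java = find_java()
    if java is None:
        return None
    # `java -version` typically writes its banner to stderr, so capture both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True, check=False)
    banner = (result.stderr or result.stdout).strip()
    return banner.splitlines()[0] if banner else None


if __name__ == "__main__":
    banner = java_version_banner()
    print(banner if banner else "Java not found on PATH; install a JRE before using tabula-py.")
```

If find_java() returns None, install a JRE or add its bin directory to PATH before trying the tabula-py script.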

Method 1: Using pdfplumber (Recommended for Mixed Layouts)

The pdfplumber library is good at identifying the visual structure of tables and extracting their text accurately, and it does not require external system dependencies such as Java.
import pdfplumber
import pandas as pd

def extract_pdf_to_excel(pdf_path, excel_path):
    all_tables = []
    
    # Open the PDF file
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # Extract all tables from the current page
            tables = page.extract_tables()
            for table_index, table in enumerate(tables):
                # Convert list of lists to a Pandas DataFrame
                df = pd.DataFrame(table)
                
                # Optional: Treat the first row as the header
                if not df.empty:
                    df.columns = df.iloc[0]
                    df = df[1:]
                
                all_tables.append((f"Page_{page_num}_Table_{table_index+1}", df))
                
    if not all_tables:
        print("No tables detected in the PDF.")
        return

    # Write each extracted table to a separate sheet in the Excel workbook
    with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
        for sheet_name, df in all_tables:
            # Truncate the sheet name to fit Excel's 31-character limit
            safe_sheet_name = sheet_name[:31]
            df.to_excel(writer, sheet_name=safe_sheet_name, index=False)
            
    print(f"Extraction complete. Saved to {excel_path}")

# Example Usage
extract_pdf_to_excel("sample_document.pdf", "output_data.xlsx")

How it works:

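  • pdfplumber.open(): Opens the PDF and exposes its pages through pdf.pages, so the script can process them one at a time.
  • page.extract_tables(): By default, detects table boundaries from the vertical and horizontal lines drawn on the page; detection based on word alignment is available through the table_settings argument.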
  • pd.ExcelWriter(): Uses the openpyxl engine to write multiple DataFrames into distinct sheets of a single .xlsx file.
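
For tables without drawn borders, the word-alignment detection mentioned above can be switched on through table_settings. The sketch below assumes a borderless table; the setting keys are pdfplumber options, but the tolerance values and the helper name extract_borderless_tables are illustrative and will usually need tuning for a given document.

```python
# Settings passed to pdfplumber's table finder. The keys are pdfplumber options;
# the values are illustrative starting points, not recommended defaults.
TEXT_ALIGNED_SETTINGS = {
    "vertical_strategy": "text",    # infer column edges from aligned words
    "horizontal_strategy": "text",  # infer row edges from aligned words
    "snap_tolerance": 3,
    "intersection_tolerance": 5,
}


def extract_borderless_tables(pdf_path, page_number=1, settings=None):
    """Return the tables found on one page using text-alignment detection."""
    # Imported here so the settings can be inspected without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is zero-indexed
        return page.extract_tables(table_settings=settings or TEXT_ALIGNED_SETTINGS)
```

Each returned table is a list of rows, so the result can be passed to pd.DataFrame in the same way as in the script above.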

Method 2: Using tabula-py (Best for Clean, Grid-Based Tables)

If your PDF contains clean, structured tables with clear borders, tabula-py is often the faster and more accurate option. It is a Python wrapper around tabula-java, the Java engine behind the Tabula tool.
import tabula
import pandas as pd

def tabula_to_excel(pdf_path, excel_path):
    try:
        # Read all tables from the PDF
        # pages='all' processes the entire document; multiple_tables=True keeps them separated
        tables = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)
        
        if not tables:
            print("No tables found.")
            return

        with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
            for i, df in enumerate(tables, start=1):
                df.to_excel(writer, sheet_name=f"Table_{i}", index=False)
                
        print(f"Successfully exported {len(tables)} tables to {excel_path}")
    except Exception as e:
        print(f"An error occurred: {e}")
        print("If the error mentions Java, ensure it is installed and on your system PATH.")

# Example Usage
tabula_to_excel("financial_report.pdf", "financial_report.xlsx")
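
tabula-py also lets you state how tables should be detected: lattice mode relies on ruling lines between cells, while stream mode relies on whitespace between columns. The sketch below builds the keyword arguments for tabula.read_pdf under that assumption; the helper names read_options and read_tables are illustrative.

```python
def read_options(ruled_tables: bool) -> dict:
    """Build keyword arguments for tabula.read_pdf."""
    options = {"pages": "all", "multiple_tables": True}
    if ruled_tables:
        options["lattice"] = True  # cells separated by drawn lines
    else:
        options["stream"] = True   # columns separated by whitespace
    return options


def read_tables(pdf_path, ruled_tables=True):
    """Return a list of DataFrames extracted from the PDF."""
    # Imported here because tabula-py needs a Java runtime to do any work.
    import tabula

    return tabula.read_pdf(pdf_path, **read_options(ruled_tables))
```

If neither flag is set, tabula-py guesses the table area itself, which is what the script above does.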

Limitations and Edge Cases

No programmatic PDF extraction method is 100% accurate. Keep the following technical limitations in mind:
  • Scanned PDFs (Image-only): Neither pdfplumber nor tabula-py can read scanned, image-only PDFs, because such files contain no text layer. For scanned documents, run Optical Character Recognition (OCR) first, for example by rasterizing the pages with pdf2image and recognizing the text with pytesseract.
  • Merged Cells: Merged header or data cells often come out as several empty cells or shift the alignment of the resulting Excel sheet. You may need to post-process the Pandas DataFrames with df.ffill() (forward fill) to repopulate the empty cells left behind by merged cells.
  • Complex Layouts: Multi-column text layouts that contain inline tables can cause parsing engines to miss table boundaries. Adjusting the detection parameters (such as table_settings in pdfplumber) is often needed for tables with unusual or missing borders.
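
As an illustration of the merged-cell fix, the sketch below forward-fills the raw rows returned by page.extract_tables() before they become a DataFrame. It is a plain-Python stand-in for df.ffill(), and the function name forward_fill_rows and the sample data are made up for this example.

```python
def forward_fill_rows(rows, fill_columns=None):
    """Fill empty cells with the last non-empty value above them.

    rows: list of row lists, as returned by pdfplumber's extract_tables().
    fill_columns: set of column indexes to fill, or None to fill every column.
    """
    last_seen = {}  # column index -> most recent non-empty value
    filled = []
    for row in rows:
        new_row = []
        for col, cell in enumerate(row):
            is_empty = cell in (None, "")
            fillable = fill_columns is None or col in fill_columns
            if is_empty and fillable and col in last_seen:
                cell = last_seen[col]
            elif not is_empty:
                last_seen[col] = cell
            new_row.append(cell)
        filled.append(new_row)
    return filled


# A "Region" cell merged across two rows comes back as None on the second row.
sample = [
    ["Region", "Item", "Qty"],
    ["North", "Bolts", "10"],
    [None, "Nuts", "4"],
    ["South", "Bolts", None],
]
print(forward_fill_rows(sample, fill_columns={0}))
```

Restricting the fill to columns known to contain merged cells avoids overwriting cells that are empty in the original table, such as the missing quantity in the last row.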


Published 2026-06-23