GuardLabs · Technical note

Docling: Turn Your Documents Into AI-Ready Data — Locally, With Tables Intact

Many RAG and AI-agent projects stumble at the boring first step: getting clean text out of real-world documents. PDFs with multi-column layouts, scanned contracts, Excel exports with merged cells, slide decks: naive extractors flatten all of it into garbage, and your model answers from garbage. Docling, an open-source tool from IBM Research (now hosted by the LF AI & Data Foundation, MIT-licensed), is built to fix exactly this.

What it does

Docling converts PDF, DOCX, PPTX, XLSX, HTML, images, EPUB and email into clean Markdown and lossless JSON — the formats LLMs and vector stores actually want. Crucially, it runs locally: confidential client documents never leave your server. That alone makes it viable for finance, legal and healthcare work where cloud OCR is a non-starter.

Why it beats a plain text extractor

Real table understanding: its TableFormer model reconstructs table structure (rows, columns, headers) instead of dumping cell soup. This is the single biggest quality win for financial and reporting documents.
Layout and reading order: multi-column pages are read one column at a time, in the right sequence, not straight across the page.
OCR for scans: multiple engines (EasyOCR, Tesseract, RapidOCR) and many languages. Formulas and code blocks can be extracted too.
Charts to tables: charts are converted into structured data where possible.

Getting started in five minutes

As a Python library:

pip install docling

from docling.document_converter import DocumentConverter

conv = DocumentConverter()           # default pipeline: layout analysis plus table structure
result = conv.convert("report.pdf")  # accepts a local path or a URL
print(result.document.export_to_markdown())
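To get both output formats mentioned earlier (Markdown for the model, JSON for lossless storage), the snippet above can be extended to a whole folder. This is a minimal sketch, not an official recipe: the function names `output_paths` and `convert_folder` and the folder layout are illustrative assumptions, while `export_to_markdown()` and `export_to_dict()` are the document's export methods.

```python
import json
from pathlib import Path


def output_paths(source: Path, out_dir: Path) -> tuple[Path, Path]:
    """Return the (.md, .json) paths a converted document is written to."""
    return out_dir / f"{source.stem}.md", out_dir / f"{source.stem}.json"


def convert_folder(in_dir: Path, out_dir: Path, pattern: str = "*.pdf") -> list[Path]:
    """Convert every matching file in in_dir and write Markdown plus JSON."""
    # Imported lazily so the path helper above works without docling installed.
    from docling.document_converter import DocumentConverter

    out_dir.mkdir(parents=True, exist_ok=True)
    converter = DocumentConverter()  # reuse one converter so models load once
    written: list[Path] = []
    for source in sorted(in_dir.glob(pattern)):
        doc = converter.convert(source).document
        md_path, json_path = output_paths(source, out_dir)
        md_path.write_text(doc.export_to_markdown(), encoding="utf-8")
        json_path.write_text(
            json.dumps(doc.export_to_dict(), ensure_ascii=False), encoding="utf-8"
        )
        written.extend([md_path, json_path])
    return written
```

Calling `convert_folder(Path("inbox"), Path("converted"))` would then leave one `.md` and one `.json` file per input PDF; the folder names here are placeholders.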

Or as a service for a RAG stack (e.g. Open WebUI): run docling-serve on port 5001 and point your document engine at it. One gotcha: set UVICORN_WORKERS=1, or you can hit "Task Not Found" errors, most likely because task state is held per worker process, so a status request can land on a worker that never saw the job. There is also an MCP server, so an AI agent can parse documents on its own.
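If you want to call the service from your own code rather than through Open WebUI, a request could look roughly like this. Treat it as a sketch under assumptions: the `/v1/convert/source` path, the `sources` / `options` payload shape and the `document.md_content` response field reflect our understanding of recent docling-serve releases and may differ in yours, so check the API docs your server exposes. The helper names are hypothetical.

```python
import json
import urllib.request


def build_convert_request(base_url: str, document_url: str) -> urllib.request.Request:
    """Build a POST request asking docling-serve to convert a document by URL."""
    payload = {
        # Assumed payload shape; verify against your server's API docs.
        "sources": [{"kind": "http", "url": document_url}],
        "options": {"to_formats": ["md"], "do_ocr": True, "table_mode": "accurate"},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/convert/source",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert_remote(base_url: str, document_url: str, timeout: float = 300.0) -> str:
    """Send the request and return the Markdown from the response body."""
    request = build_convert_request(base_url, document_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["document"]["md_content"]  # assumed response field
```

With the service running locally, `convert_remote("http://localhost:5001", "https://example.com/report.pdf")` would return the converted Markdown; the generous timeout is there because OCR on CPU can take a while.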

Tuning for quality (and for non-English scans)

Digital files (PDFs with a text layer, Word, slides) work with no configuration — Cyrillic and other scripts extract cleanly from the embedded text. Scans need the OCR language set explicitly:

{
  "do_ocr": true,
  "table_mode": "accurate",
  "ocr_engine": "tesseract",
  "ocr_lang": ["rus", "eng"]
}

Note the engine-specific codes: EasyOCR uses two-letter codes (ru, en), Tesseract three-letter ones (rus, eng). Set table_mode to accurate when tables matter and to fast when throughput does.
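When you use Docling as a Python library instead of the service, the same settings map onto pipeline options. The sketch below is one way to wire that up, assuming a recent Docling release. `LANG_CODES`, `ocr_langs` and `build_converter` are hypothetical helpers of ours, and the lookup table covers only the two languages from the example above.

```python
# Deliberately tiny lookup; extend it with the languages you need.
LANG_CODES = {
    "en": {"easyocr": "en", "tesseract": "eng"},
    "ru": {"easyocr": "ru", "tesseract": "rus"},
}


def ocr_langs(engine: str, langs: list[str]) -> list[str]:
    """Translate neutral language keys into the codes the chosen engine expects."""
    # Raises KeyError for a language or engine missing from LANG_CODES.
    return [LANG_CODES[lang][engine] for lang in langs]


def build_converter(engine: str, langs: list[str]):
    """Return a DocumentConverter with OCR and accurate table mode for PDFs."""
    # Imported lazily so ocr_langs() works without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TableFormerMode,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    codes = ocr_langs(engine, langs)
    options = PdfPipelineOptions()
    options.do_ocr = True
    options.do_table_structure = True
    options.table_structure_options.mode = TableFormerMode.ACCURATE
    if engine == "tesseract":
        # Uses the tesseract command-line binary, which must be installed.
        options.ocr_options = TesseractCliOcrOptions(lang=codes)
    else:
        options.ocr_options = EasyOcrOptions(lang=codes)
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

`build_converter("tesseract", ["ru", "en"]).convert("scan.pdf")` would then run OCR with the rus and eng language packs; `scan.pdf` is a placeholder file name.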

When to use it — and when not

Tool	Pick it when
Docling	You need structure, tables, locality and zero per-page cost
Apache Tika	You just need fast, rough text from many formats
Cloud OCR API	You want a turnkey API and don't mind sending data out

Limitations to plan for: on CPU it is noticeably slower than a plain parser, so high-volume workloads want a GPU. Very complex Excel files with merged cells still parse imperfectly. The higher-quality settings, such as the accurate table mode, also cost more memory. For small setups that do have a GPU, the compact granite-docling-258M vision-language model handles parsing in a single pass.

Where it pays off

The practical wins are concrete: a knowledge-base RAG that keeps tables readable; financial documents turned into structured JSON; an archive of scanned PDFs made full-text searchable; AI agents that ingest documents through an MCP call; clean datasets for fine-tuning. If your AI project is only as good as the data you feed it, the ingestion layer is where to invest first.

We build document-to-data and RAG pipelines for exactly these cases — structured extraction with tables preserved, run locally so your data stays private. If you have a pile of PDFs, contracts or reports you want your AI to actually understand, that is the work we do.

Published 2026-06-22
