Docling: Turn Your Documents Into AI-Ready Data — Locally, With Tables Intact
Most RAG and AI-agent projects fail at the boring first step: getting clean text out of real-world documents. PDFs with multi-column layouts, scanned contracts, Excel exports with merged cells, slide decks — naive extractors flatten all of it into garbage, and your model answers from garbage. Docling, an open-source tool from IBM Research (now under LF AI & Data, MIT-licensed), is built to fix exactly this.
What it does
Docling converts PDF, DOCX, PPTX, XLSX, HTML, images, EPUB and email into clean Markdown and lossless JSON — the formats LLMs and vector stores actually want. Crucially, it runs locally: confidential client documents never leave your server. That alone makes it viable for finance, legal and healthcare work where cloud OCR is a non-starter.
Why it beats a plain text extractor
- Real table understanding: its TableFormer model reconstructs table structure (rows, columns, headers) instead of dumping cell soup. This is the single biggest quality win for financial and reporting documents.
- Layout & reading order: multi-column pages are read column by column in the right sequence, not straight across the page from left to right.
- OCR for scans: multiple engines (EasyOCR, Tesseract, RapidOCR) covering many languages. Formulas and code blocks can be extracted as well.
- Charts to tables: diagrams are converted into structured data where possible.
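To make the reading-order point concrete, here is a toy sketch. It is not Docling's algorithm (Docling relies on a layout model); the `Box` class, the coordinates and the fixed two-column split are assumptions made up for illustration. It only shows why sorting text blocks by position alone scrambles a two-column page.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x: float  # left edge of the text block
    y: float  # top edge, growing downward


def naive_order(boxes):
    """Sort top-to-bottom, then left-to-right: lines get read across columns."""
    return [b.text for b in sorted(boxes, key=lambda b: (b.y, b.x))]


def column_order(boxes, page_width, columns=2):
    """Assign each box to a column by its x position, then read column by column."""
    col_width = page_width / columns

    def key(b):
        col = min(int(b.x // col_width), columns - 1)
        return (col, b.y)

    return [b.text for b in sorted(boxes, key=key)]


# A two-column page: A1, A2 form the left column, B1, B2 the right one.
page = [
    Box("A1", x=50, y=100), Box("B1", x=350, y=100),
    Box("A2", x=50, y=200), Box("B2", x=350, y=200),
]
print(naive_order(page))                    # ['A1', 'B1', 'A2', 'B2']
print(column_order(page, page_width=600))   # ['A1', 'A2', 'B1', 'B2']
```

The naive order interleaves the two columns, which is exactly the kind of text a model then has to answer from.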
Getting started in five minutes
As a Python library, install the package from your shell, then run the conversion in Python:
pip install docling
from docling.document_converter import DocumentConverter
conv = DocumentConverter()
result = conv.convert("report.pdf")
print(result.document.export_to_markdown())
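For more than one file, the same call can be wrapped in a small batch loop. The sketch below is one possible shape, not an official Docling utility: `convert_folder`, the folder names and the `*.pdf` pattern are hypothetical. It accepts any object exposing the `convert(...).document.export_to_markdown()` chain used above, so it can be tried with a stand-in converter before Docling is installed.

```python
from pathlib import Path


def convert_folder(converter, src_dir, out_dir, pattern="*.pdf"):
    """Convert every matching file in src_dir and write one .md file per input.

    `converter` is any object with a Docling-style interface:
    converter.convert(path).document.export_to_markdown()
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for path in sorted(Path(src_dir).glob(pattern)):
        try:
            markdown = converter.convert(str(path)).document.export_to_markdown()
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((path.name, str(exc)))
            continue
        target = out / (path.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written, failed


# Usage with Docling (requires `pip install docling`; folder names are placeholders):
#   from docling.document_converter import DocumentConverter
#   done, errors = convert_folder(DocumentConverter(), "inbox", "markdown")
#   print(f"{len(done)} converted, {len(errors)} failed")
```

Collecting failures instead of raising keeps a long run going when a single scan is corrupt, and leaves a list to re-check afterwards.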
Or as a service for a RAG stack (e.g. Open WebUI): run docling-serve on port 5001 and point your document engine at it. One gotcha: set UVICORN_WORKERS=1. With several workers, a status request can land on a worker that never saw the job, and you hit "Task Not Found" errors. There is also an MCP server, so an AI agent can parse documents on its own.
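If you want to call the service directly rather than through Open WebUI, a request might be assembled roughly like this. Treat it as a sketch under assumptions: the `/v1/convert/source` path and the `sources` / `options` field names reflect the docling-serve v1 API as best understood here and may differ in your version, so check the interactive API page of your own instance before relying on them. `build_convert_request` and the example URL are hypothetical. The code only builds the request and sends nothing.

```python
import json
from urllib import request


def build_convert_request(source_url, base_url="http://localhost:5001"):
    """Build (but do not send) a conversion request for a docling-serve instance.

    ASSUMPTION: the endpoint path and payload field names are not taken from
    the article; verify them against your server's API documentation page.
    """
    payload = {
        "sources": [{"kind": "http", "url": source_url}],
        "options": {"to_formats": ["md"]},
    }
    return request.Request(
        url=f"{base_url}/v1/convert/source",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_convert_request("https://example.com/report.pdf")
print(req.get_method(), req.full_url)

# To send it against a running server:
#   with request.urlopen(req, timeout=300) as resp:
#       result = json.load(resp)
```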
Tuning for quality (and for non-English scans)
Digital files (PDFs with a text layer, Word, slides) work with no configuration — Cyrillic and other scripts extract cleanly from the embedded text. Scans need the OCR language set explicitly:
{
  "do_ocr": true,
  "table_mode": "accurate",
  "ocr_engine": "tesseract",
  "ocr_lang": ["rus", "eng"]
}
Note that language codes are engine-specific: EasyOCR takes two-letter codes (ru, en), Tesseract three-letter codes (rus, eng). Set table_mode to accurate when tables matter and to fast when throughput does.
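Mixing up the two code styles is an easy mistake, so it can help to generate the settings from one place. The helper below is a hypothetical convenience, not part of Docling: `LANG_CODES` and `ocr_settings` are made-up names, and the lookup table covers only the two engines and two languages named above.

```python
# Language codes per OCR engine, limited to the pairs named in the text.
LANG_CODES = {
    "easyocr": {"russian": "ru", "english": "en"},
    "tesseract": {"russian": "rus", "english": "eng"},
}


def ocr_settings(engine, languages, tables_matter=True):
    """Return a settings dict shaped like the JSON above for the given engine."""
    try:
        codes = LANG_CODES[engine]
    except KeyError:
        raise ValueError(f"no code table for engine {engine!r}") from None
    unknown = [lang for lang in languages if lang not in codes]
    if unknown:
        raise ValueError(f"no {engine} code known for: {', '.join(unknown)}")
    return {
        "do_ocr": True,
        "table_mode": "accurate" if tables_matter else "fast",
        "ocr_engine": engine,
        "ocr_lang": [codes[lang] for lang in languages],
    }


print(ocr_settings("tesseract", ["russian", "english"]))
print(ocr_settings("easyocr", ["russian", "english"], tables_matter=False))
```

Failing loudly on an unknown engine or language is deliberate: a silently wrong code tends to show up only later as poor OCR output.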
When to use it — and when not
| Tool | Pick it when |
|---|---|
| Docling | You need structure, tables, locality and zero per-page cost |
| Apache Tika | You just need fast, rough text from many formats |
| Cloud OCR API | You want a turnkey API and don't mind sending data out |
Limitations to plan for: on CPU it is noticeably slower than a plain parser, so high-volume workloads want a GPU. Very complex Excel files with merged cells still parse imperfectly. And the higher-quality settings (such as table_mode: accurate) cost extra memory. For light, GPU-equipped setups, the compact granite-docling-258M vision-language model handles parsing in a single pass.
Where it pays off
The practical wins are concrete: a knowledge-base RAG that keeps tables readable; financial documents turned into structured JSON; an archive of scanned PDFs made full-text searchable; AI agents that ingest documents through an MCP call; clean datasets for fine-tuning. If your AI project is only as good as the data you feed it, the ingestion layer is where to invest first.
We build document-to-data and RAG pipelines for exactly these cases — structured extraction with tables preserved, run locally so your data stays private. If you have a pile of PDFs, contracts or reports you want your AI to actually understand, that is the work we do.