Docling Pipeline - Ilya Fedotov Blog

Specification

Role: Author

Year: 2025–2026

Stack: Python, Docling, hierarchical chunking, LLM verbatim extraction, JSON-LD

Overview

A reference pipeline that turns any PDF (insurance policies are the running example) into structured artifacts ready for exploration, compliance review, or downstream analytics. It is built on Docling’s PDF pipeline, a hierarchical chunker, and an optional two-pass LLM extraction stage.

Features

High-fidelity Docling conversion with Markdown, JSON, picture captions, and table structure.
Hierarchical chunking with section metadata, neighbor context, per-chunk column estimates, and serialized tables.
Optional two-pass LLM extraction with caching, verbatim provenance, clause/tariff schemas, and token usage telemetry.
Document graph output that stitches sections, chunks, detections, clauses, tariffs, and potential conflicts.
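
To make "section metadata" and "neighbor context" concrete, here is a minimal sketch in plain Python. It does not use Docling's chunker and is not the pipeline's actual code: the Chunk fields, the build_chunks helper, and the (level, heading, body) input shape are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: int
    section_path: list      # headings from the document root down to this chunk
    text: str
    prev_context: str = ""  # tail of the preceding chunk
    next_context: str = ""  # head of the following chunk


def build_chunks(blocks, context_chars=80):
    """Turn (level, heading, body) blocks into chunks (hypothetical helper)."""
    chunks = []
    path = []
    for level, heading, body in blocks:
        # Trim the heading stack back to the parent level, then push this heading.
        path = path[: level - 1] + [heading]
        chunks.append(Chunk(len(chunks), list(path), body))
    # Second sweep: attach a slice of each neighbor so a chunk can be read in context.
    for i, chunk in enumerate(chunks):
        if i > 0:
            chunk.prev_context = chunks[i - 1].text[-context_chars:]
        if i + 1 < len(chunks):
            chunk.next_context = chunks[i + 1].text[:context_chars]
    return chunks


# Made-up sample content, not taken from a real policy.
blocks = [
    (1, "Policy", "This policy covers the insured vehicle."),
    (2, "Exclusions", "Damage caused by racing is excluded."),
    (2, "Premiums", "The annual premium is 480 EUR."),
]
chunks = build_chunks(blocks)
```

Carrying the section path on every chunk means a downstream extractor can tell that a sentence belongs to "Policy > Exclusions" without re-reading the document, and the neighbor slices give it a little surrounding text when a clause straddles a chunk boundary.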

Why two passes

The first pass runs over individual chunks and extracts candidate clauses with strict provenance (verbatim spans only). The second pass reconciles candidates across chunks: when the same concept appears in different sections, the candidates are merged and their surface forms are preserved.
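
The two passes can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's implementation: the candidate dictionaries stand in for LLM output, the "concept" label is assumed to be supplied by the model, and first_pass and second_pass are hypothetical names. The verbatim check here is a plain substring search, and reconciliation is a simple group-by on a normalized concept label.

```python
from collections import defaultdict


def first_pass(chunks, candidates):
    """Keep only candidates whose span occurs verbatim in the chunk they cite."""
    kept = []
    for cand in candidates:
        span = cand["span"]
        start = chunks[cand["chunk_id"]].find(span)
        if not span or start == -1:
            continue  # empty or paraphrased span: no provenance, so drop it
        kept.append({**cand, "start": start, "end": start + len(span)})
    return kept


def second_pass(candidates):
    """Merge candidates by normalized concept, preserving every surface form."""
    merged = defaultdict(list)
    for cand in candidates:
        key = cand["concept"].strip().lower()
        merged[key].append(
            {k: cand[k] for k in ("chunk_id", "span", "start", "end")}
        )
    return dict(merged)


# Made-up sample content, not taken from a real policy.
chunks = {
    0: "Deductible: the insured pays the first 500 EUR of each claim.",
    1: "An excess of 500 EUR applies to every claim.",
}
candidates = [
    {"chunk_id": 0, "concept": "Deductible",
     "span": "the insured pays the first 500 EUR of each claim"},
    {"chunk_id": 1, "concept": "deductible",
     "span": "An excess of 500 EUR applies to every claim."},
    # A paraphrase that does not appear in chunk 1; the first pass rejects it.
    {"chunk_id": 1, "concept": "deductible",
     "span": "A deductible of 500 EUR applies"},
]

verified = first_pass(chunks, candidates)
clauses = second_pass(verified)
```

Splitting the work this way keeps the provenance check local and cheap (each candidate is validated against exactly one chunk), while the cross-chunk merge only ever sees spans that have already been verified.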
