Docling Pipeline - Portfolio

Docling Pipeline

End-to-end PDF extraction pipeline turning policies, tariffs, and papers into structured artifacts.

Active
Specification
Role: Author
Year: 2025–2026
Stack:
  • Python
  • Docling
  • hierarchical chunking
  • LLM verbatim extraction
  • JSON-LD

Overview

A reference pipeline that converts PDFs (insurance policies are the running example) into structured artifacts ready for exploration, compliance review, or downstream analytics. It is built on Docling’s PDF pipeline, with a hierarchical chunker and an optional two-pass LLM extraction stage added on top.
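
The conversion and chunking stages can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it relies on Docling's `DocumentConverter` and `HierarchicalChunker`, while the helper names (`chunk_records`, `convert_and_chunk`), the `artifacts` output directory, and the record fields are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def chunk_records(chunks: Iterable[Any]) -> list[dict]:
    """Flatten chunk objects into plain dicts that carry section metadata.

    Expects objects exposing `.text` and `.meta.headings`, the shape that
    Docling's hierarchical chunker yields. Field names here are hypothetical.
    """
    records = []
    for index, chunk in enumerate(chunks):
        headings = list(getattr(chunk.meta, "headings", None) or [])
        records.append(
            {"index": index, "section_path": headings, "text": chunk.text}
        )
    return records


def convert_and_chunk(pdf_path: str, out_dir: str = "artifacts") -> list[dict]:
    """Convert one PDF, write Markdown and JSON exports, return chunk records."""
    # Imported lazily so chunk_records() stays usable without Docling installed.
    from docling.chunking import HierarchicalChunker
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(pdf_path).document

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    (out / f"{stem}.md").write_text(document.export_to_markdown(), encoding="utf-8")
    (out / f"{stem}.json").write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False), encoding="utf-8"
    )

    return chunk_records(HierarchicalChunker().chunk(document))
```

Calling `convert_and_chunk("policy.pdf")` (a placeholder file name) would write the two exports and return one record per chunk, each tagged with the heading path it sits under.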

Features

  • High-fidelity Docling conversion with Markdown, JSON, picture captions, and table structure.
  • Hierarchical chunking with section metadata, neighbor context, per-chunk column estimates, and serialized tables.
  • Optional two-pass LLM extraction with caching, verbatim provenance, clause/tariff schemas, and token usage telemetry.
  • Document graph output that stitches sections, chunks, detections, clauses, tariffs, and potential conflicts.
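
To make the document graph concrete, here is a small sketch of how sections, chunks, and clauses might be stitched into a JSON-LD structure. The vocabulary URL, node types, and property names (`partOf`, `foundIn`) are invented for illustration and are not the pipeline's actual schema; only the JSON-LD keywords `@context`, `@graph`, `@id`, and `@type` are standard.

```python
import json

# Hypothetical vocabulary; the real pipeline's context is not documented here.
CONTEXT = {"@vocab": "https://example.org/docgraph#"}


def build_graph(doc_id: str, chunks: list[dict], clauses: list[dict]) -> dict:
    """Link document -> sections -> chunks -> clauses as JSON-LD nodes."""
    nodes = [{"@id": doc_id, "@type": "Document"}]
    section_ids: dict[str, str] = {}

    for chunk in chunks:
        title = " / ".join(chunk["section_path"]) or "(root)"
        section_id = section_ids.get(title)
        if section_id is None:
            # First chunk seen for this section: create the section node.
            section_id = f"{doc_id}#section-{len(section_ids)}"
            section_ids[title] = section_id
            nodes.append(
                {"@id": section_id, "@type": "Section", "title": title,
                 "partOf": {"@id": doc_id}}
            )
        nodes.append(
            {"@id": f"{doc_id}#chunk-{chunk['index']}", "@type": "Chunk",
             "partOf": {"@id": section_id}}
        )

    for i, clause in enumerate(clauses):
        nodes.append(
            {"@id": f"{doc_id}#clause-{i}", "@type": "Clause",
             "concept": clause["concept"], "quote": clause["quote"],
             "foundIn": {"@id": f"{doc_id}#chunk-{clause['chunk_index']}"}}
        )

    return {"@context": CONTEXT, "@graph": nodes}


if __name__ == "__main__":
    demo = build_graph(
        "urn:doc:policy",
        [{"index": 0, "section_path": ["1 Scope"], "text": "..."}],
        [{"concept": "cover start", "quote": "...", "chunk_index": 0}],
    )
    print(json.dumps(demo, indent=2))
```

Keeping every node addressable by `@id` means detections, tariffs, or conflict edges could be added later as further nodes that point at existing chunks, without restructuring the graph.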

Why two passes

The first pass runs over individual chunks and extracts candidate clauses with strict provenance: only verbatim spans from the chunk text are accepted. The second pass reconciles candidates across chunks. When the same concept appears in different sections, its candidates are merged into one entry and every surface form is preserved.
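
The two passes can be illustrated with plain functions standing in for the LLM calls. This is a sketch under stated assumptions: the candidate fields (`concept`, `quote`), the substring check used to enforce verbatim provenance, and the normalized concept label used as the merge key are all hypothetical choices for the example, and the real reconciliation step may group candidates differently.

```python
from collections import defaultdict


def keep_verbatim(chunk_index: int, chunk_text: str, candidates: list[dict]) -> list[dict]:
    """Pass 1 filter: keep candidates whose quote occurs verbatim in the chunk."""
    kept = []
    for cand in candidates:
        quote = cand["quote"]
        start = chunk_text.find(quote) if quote else -1
        if start == -1:
            continue  # Paraphrased or invented span: reject it.
        kept.append(
            {"concept": cand["concept"], "quote": quote,
             "chunk_index": chunk_index, "start": start, "end": start + len(quote)}
        )
    return kept


def reconcile(candidates: list[dict]) -> list[dict]:
    """Pass 2: merge candidates that share a concept, keeping every surface form."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for cand in candidates:
        # Assumed merge key: lowercased, whitespace-collapsed concept label.
        key = " ".join(cand["concept"].lower().split())
        groups[key].append(cand)

    merged = []
    for key, mentions in groups.items():
        # dict.fromkeys deduplicates quotes while preserving first-seen order.
        surface_forms = list(dict.fromkeys(m["quote"] for m in mentions))
        merged.append({"concept": key, "surface_forms": surface_forms, "mentions": mentions})
    return merged
```

Because each mention keeps its chunk index and character offsets, a merged entry can still be traced back to the exact text it came from.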
