PDF Parsing Libraries

pypdfium2
  • Apache-2.0 or BSD-3-Clause license (dual-licensed)
  • Text extraction via page.get_textpage()
  • Incremental parsing support
  • Form XObjects support
  • Production-ready performance
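
A minimal sketch of text extraction through page.get_textpage(), assuming pypdfium2 version 4 or later (where the text page exposes get_text_range()). The helper names extract_page_text and normalize_newlines are illustrative and not part of the library; the newline normalization is there on the assumption that PDFium may return "\r\n" line breaks.

```python
def normalize_newlines(text: str) -> str:
    """Convert "\r\n" and bare "\r" line breaks to "\n"."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def extract_page_text(path: str, page_index: int = 0) -> str:
    """Return the text of one page (hypothetical helper, not a library call)."""
    # Imported lazily so normalize_newlines can be used without pypdfium2.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(path)
    page = pdf[page_index]
    textpage = page.get_textpage()
    try:
        # With no arguments, get_text_range() covers the whole page.
        return normalize_newlines(textpage.get_text_range())
    finally:
        # Release native handles child-first.
        textpage.close()
        page.close()
        pdf.close()
```

Usage would look like extract_page_text("my.pdf"), where "my.pdf" is a placeholder path.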
PyMuPDF (fitz)
  • AGPL-3.0 license (commercial license available from Artifex)
  • Full text, image, vector graphics access
  • Coordinate-mapped text extraction: page.get_text("dict")
  • Vector graphics manipulation
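
The coordinate-mapped output of page.get_text("dict") is a nested structure of blocks, lines, and spans. The sketch below flattens it into (text, bbox) pairs; flatten_spans and spans_for_page are illustrative names, not PyMuPDF functions, and only the "text" and "bbox" span keys are read.

```python
from typing import Any, Dict, List, Tuple

BBox = Tuple[float, float, float, float]


def flatten_spans(page_dict: Dict[str, Any]) -> List[Tuple[str, BBox]]:
    """Collect (text, bbox) pairs from a page.get_text("dict") result."""
    spans: List[Tuple[str, BBox]] = []
    for block in page_dict.get("blocks", []):
        # Text blocks have type 0; image blocks (type 1) carry no "lines".
        if block.get("type") != 0:
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                spans.append((span["text"], tuple(span["bbox"])))
    return spans


def spans_for_page(path: str, page_index: int = 0) -> List[Tuple[str, BBox]]:
    """Open a PDF and flatten one page (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so flatten_spans works without it

    with fitz.open(path) as doc:
        return flatten_spans(doc[page_index].get_text("dict"))
```

Each bbox is (x0, y0, x1, y1) in page coordinates, so the pairs can be sorted or filtered by position before further processing.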
pdfplumber
  • MIT license
  • Table extraction: page.extract_table()
  • Configurable vertical/horizontal strategies
  • SVG page preview generation
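
The vertical and horizontal strategies are passed to page.extract_table() through its table_settings dictionary. This sketch covers the "lines", "lines_strict", and "text" strategies; "explicit" is left out because it needs extra keys listing the line positions. make_table_settings and extract_first_table are illustrative helper names.

```python
from typing import Dict, List, Optional

# Strategies accepted by this sketch ("explicit" needs additional settings).
VALID_STRATEGIES = ("lines", "lines_strict", "text")


def make_table_settings(vertical: str = "lines", horizontal: str = "lines") -> Dict[str, str]:
    """Build a table_settings dict, rejecting unknown strategy names."""
    for name in (vertical, horizontal):
        if name not in VALID_STRATEGIES:
            raise ValueError(f"unknown strategy: {name!r}")
    return {"vertical_strategy": vertical, "horizontal_strategy": horizontal}


def extract_first_table(
    path: str,
    page_index: int = 0,
    vertical: str = "lines",
    horizontal: str = "lines",
) -> Optional[List[List[Optional[str]]]]:
    """Return the table extract_table() finds on a page, or None if there is none."""
    import pdfplumber  # imported lazily so the settings helper works alone

    settings = make_table_settings(vertical, horizontal)
    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_index].extract_table(table_settings=settings)
```

For tables drawn without ruling lines, trying vertical="text" and horizontal="text" is a common first adjustment.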

Layout Analysis Models

DocLayout-YOLO
  • 18 ms/page processing speed
  • DocSynth-300K pre-training
  • Fine-tunable
PaddleOCR Layout
  • PP-YOLOv2 / YOLOX architectures
  • OCR pipeline integration
Deformable DETR
  • Multi-column layout support
  • 2+ A100 GPU requirement
  • Hugging Face/Detectron training
Grounding DINO + SAM
  • Zero-shot text prompt detection
  • 10-15 point mAP improvement with fine-tuning

Table Extraction Models

PaddleOCR Table
  • HTML output format
  • COCO-style annotation support
  • Fine-tunable
Table Transformer (TATR)
  • DETR-based architecture
  • PubTables-1M format support
  • Row/column/cell detection
UniTable
  • Pixel-to-token framework that jointly predicts table structure, cell content & bounding boxes
  • State-of-the-art results on four benchmark datasets
  • MIT-licensed
Donut
  • Direct image-to-markdown conversion
  • 10k-step fine-tuning process
  • Hugging Face integration

General Purpose Libraries

PaddleOCR / PP-Structure
  • Layout, table, OCR, key-value pipeline
  • YAML configuration system
  • Standard training interface
Unstructured
  • Element-based document partitioning
  • Custom partitioner support
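
Element-based partitioning returns a flat list of typed elements. A minimal sketch, assuming the partition() entry point from unstructured.partition.auto and the category and text attributes on each element; group_text_by_category and partition_file are illustrative names.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_text_by_category(elements: Iterable[Any]) -> Dict[str, List[str]]:
    """Group element text by category (e.g. "Title", "NarrativeText")."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for element in elements:
        grouped[element.category].append(element.text)
    return dict(grouped)


def partition_file(path: str) -> Dict[str, List[str]]:
    """Partition a document and group its text (hypothetical helper)."""
    # Imported lazily so the grouping helper works without unstructured.
    from unstructured.partition.auto import partition

    return group_text_by_category(partition(filename=path))
```

Grouping by category is only one way to consume the elements; the original order of the list is preserved within each group.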
MarkItDown
  • Multi-format (PDF, Office, HTML, images, audio) → Markdown conversion
  • One-line CLI (markitdown my.pdf > out.md) + Python API
  • MIT-licensed, plug-in architecture
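
The Python API mirrors the one-line CLI. A minimal sketch, assuming the MarkItDown class with its convert() method and the text_content attribute on the result; the to_markdown wrapper and its converter parameter are illustrative, added so the function can be exercised with a stand-in object.

```python
from typing import Any, Optional


def to_markdown(path: str, converter: Optional[Any] = None) -> str:
    """Convert a file to Markdown text (hypothetical wrapper).

    `converter` is any object with a convert(path) method whose result
    has a text_content attribute; a MarkItDown instance is the default.
    """
    if converter is None:
        # Imported lazily so a stand-in converter can be used without the package.
        from markitdown import MarkItDown

        converter = MarkItDown()
    return converter.convert(path).text_content
```

Calling to_markdown("my.pdf") and writing the result to out.md corresponds to the CLI invocation shown above.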
docTR
  • Apache-2.0 license
  • DBNet, CRNN, ViTSTR models
  • Hugging Face checkpoint compatibility
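
A minimal sketch of running a DBNet detector with a CRNN recognizer in docTR, assuming the "db_resnet50" and "crnn_vgg16_bn" architecture names and the nested pages/blocks/lines/words layout of the exported result. words_from_export and ocr_pdf are illustrative helper names.

```python
from typing import Any, Dict, List, Tuple


def words_from_export(export: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Flatten an exported docTR result into (word, confidence) pairs."""
    words: List[Tuple[str, float]] = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    words.append((word["value"], word["confidence"]))
    return words


def ocr_pdf(path: str) -> List[Tuple[str, float]]:
    """Run OCR on a PDF and return its words (hypothetical helper)."""
    # Imported lazily so words_from_export works without docTR installed.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True)
    pages = DocumentFile.from_pdf(path)
    return words_from_export(model(pages).export())
```

Swapping reco_arch for a ViTSTR variant would follow the same pattern, provided the chosen architecture name is one docTR ships.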