PDF Parsing Libraries
pypdfium2
- BSD-3 license
- Text extraction via page.get_textpage()
- Incremental parsing support
- Form XObjects support
- Production-ready performance
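As a rough illustration of the text-extraction path listed above, the sketch below opens a PDF with pypdfium2 and reads each page through page.get_textpage(). It assumes pypdfium2 4.x, where the text page object exposes get_text_range(); the function name extract_page_texts and the file name my.pdf are hypothetical.

```python
def extract_page_texts(pdf_path):
    """Return one text string per page (sketch; assumes pypdfium2 4.x)."""
    # Imported lazily so this module still loads where pypdfium2 is absent.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            # With no arguments, get_text_range() returns the whole page text.
            texts.append(textpage.get_text_range())
            # Release the native handles explicitly, children before parents.
            textpage.close()
            page.close()
    finally:
        pdf.close()
    return texts


# Hypothetical usage:
# for number, text in enumerate(extract_page_texts("my.pdf"), start=1):
#     print(number, text[:80])
```

Closing the text page and page objects explicitly is a cautious choice here, since they wrap native PDFium handles.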
PyMuPDF (fitz)
- AGPL-3.0 license (commercial license available)
- Full text, image, vector graphics access
- Coordinate-mapped text extraction: page.get_text("dict")
- Vector graphics manipulation
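To make the coordinate-mapped output more concrete, here is a minimal sketch that walks the nested structure returned by page.get_text("dict") (blocks, then lines, then spans) and yields each text span with its bounding box. The helper name iter_spans is hypothetical, and the sample dictionary in the usage comment is hand-built to mimic the documented shape rather than taken from a real file.

```python
def iter_spans(page_dict):
    """Yield (text, bbox) for every non-empty text span in a get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        # Type 0 marks a text block; image blocks (type 1) carry no "lines".
        if block.get("type") != 0:
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "")
                if text.strip():
                    yield text, tuple(span["bbox"])


# Hypothetical usage with PyMuPDF installed:
# import fitz
# doc = fitz.open("my.pdf")
# for text, bbox in iter_spans(doc[0].get_text("dict")):
#     print(bbox, text)
```

Each bbox is an (x0, y0, x1, y1) tuple in page coordinates, which is what makes this output useful for layout-aware post-processing.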
pdfplumber
- MIT license
- Table extraction: page.extract_table()
- Configurable vertical/horizontal strategies
- Page image rendering for visual debugging: page.to_image()
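The table-extraction entries above can be sketched as follows. The example passes one possible combination of vertical and horizontal strategies to page.extract_table() and converts the returned rows into dictionaries. The helper names rows_to_records and first_table are hypothetical, and treating the first row as the header is an assumption that will not hold for every table.

```python
def rows_to_records(rows):
    """Turn extract_table() output (assumed header row first) into a list of dicts."""
    if not rows:
        return []
    # pdfplumber reports empty cells as None, so normalize them to "".
    header = [(cell or "").strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def first_table(pdf_path, page_index=0):
    """Extract the largest table on one page as records (sketch)."""
    # Imported lazily so rows_to_records stays usable without pdfplumber.
    import pdfplumber

    # One possible strategy mix: ruled lines for columns, text alignment for rows.
    settings = {"vertical_strategy": "lines", "horizontal_strategy": "text"}
    with pdfplumber.open(pdf_path) as pdf:
        rows = pdf.pages[page_index].extract_table(table_settings=settings)
    return rows_to_records(rows)


# Hypothetical usage:
# for record in first_table("my.pdf"):
#     print(record)
```

Which strategy pair works best depends on whether the table has ruled lines, so the settings above are a starting point rather than a recommendation.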
Layout Analysis Models
DocLayout-YOLO
- 18ms/page processing speed
- DocSynth-300K pre-training
- Fine-tunable
PaddleOCR Layout
- PP-YOLOv2 / YOLOX architectures
- OCR pipeline integration
Deformable DETR
- Multi-column layout support
- 2+ A100 GPU requirement
- Hugging Face/Detectron training
Grounding DINO + SAM
- Zero-shot text prompt detection
- 10-15 mAP improvement with fine-tuning
PaddleOCR Table
- HTML output format
- COCO-style annotation support
- Fine-tunable
Table Transformer (TATR)
- DETR-based architecture
- PubTables-1M format support
- Row/column/cell detection
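One common way to get from detected rows and columns to a cell grid is to intersect every row box with every column box. The sketch below shows only that geometric step, under the assumption that boxes are (x0, y0, x1, y1) tuples in image coordinates. It is an illustration, not the model's own post-processing code, and the function name cells_from_rows_and_columns is hypothetical.

```python
def cells_from_rows_and_columns(row_boxes, column_boxes):
    """Build a row-major grid of cell boxes from row and column boxes."""
    # Sort rows top to bottom and columns left to right.
    rows = sorted(row_boxes, key=lambda box: box[1])
    columns = sorted(column_boxes, key=lambda box: box[0])
    grid = []
    for row in rows:
        # A cell takes its horizontal extent from the column
        # and its vertical extent from the row.
        grid.append([(col[0], row[1], col[2], row[3]) for col in columns])
    return grid


# Example with two rows and two columns given in arbitrary order.
rows = [(0, 20, 100, 40), (0, 0, 100, 20)]
columns = [(50, 0, 100, 40), (0, 0, 50, 40)]
grid = cells_from_rows_and_columns(rows, columns)
print(grid[0][0])  # top-left cell
```

Spanning cells and header detection need extra handling that this sketch leaves out.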
UniTable
- Pixel-to-token framework that jointly predicts table structure, cell content & bounding boxes
- State-of-the-art results on four benchmark datasets
- MIT-licensed
Donut
- Direct image-to-markdown conversion
- Fine-tuning in about 10k steps
- Hugging Face integration
General Purpose Libraries
PaddleOCR / PP-Structure
- Layout, table, OCR, key-value pipeline
- YAML configuration system
- Standard training interface
Unstructured
- Element-based document partitioning
- Custom partitioner support
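To illustrate element-based partitioning, the sketch below calls Unstructured's automatic partitioner and groups the resulting elements by their category label (for example Title or NarrativeText). The helper names group_by_category and partition_file are hypothetical, and the fallback label "Uncategorized" is an assumption added for objects without a category.

```python
from collections import defaultdict


def group_by_category(elements):
    """Group element texts by category label, skipping empty texts."""
    grouped = defaultdict(list)
    for element in elements:
        text = (getattr(element, "text", "") or "").strip()
        if text:
            # "Uncategorized" is a fallback label chosen for this sketch.
            category = getattr(element, "category", "Uncategorized")
            grouped[category].append(text)
    return dict(grouped)


def partition_file(path):
    """Partition a document and group its elements (sketch)."""
    # Imported lazily so group_by_category works without unstructured installed.
    from unstructured.partition.auto import partition

    return group_by_category(partition(filename=path))


# Hypothetical usage:
# for category, texts in partition_file("my.pdf").items():
#     print(category, len(texts))
```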
MarkItDown
- Multi-format (PDF, Office, HTML, images, audio) → Markdown conversion
- One-line CLI (markitdown my.pdf > out.md) + Python API
- MIT-licensed, plug-in architecture
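The Python API mentioned above can do the same job as the CLI one-liner. The following is a minimal sketch, assuming the markitdown package is installed; the function name pdf_to_markdown and the file names are hypothetical.

```python
def pdf_to_markdown(src_path, dst_path):
    """Convert a document to Markdown and write it to dst_path (sketch)."""
    # Imported lazily so this module loads without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(src_path)
    with open(dst_path, "w", encoding="utf-8") as handle:
        # text_content holds the converted Markdown string.
        handle.write(result.text_content)
    return dst_path


# Hypothetical usage, mirroring `markitdown my.pdf > out.md`:
# pdf_to_markdown("my.pdf", "out.md")
```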
docTR
- Apache-2.0 license
- DBNet, CRNN, ViTSTR models
- Hugging Face checkpoint compatibility