PDF Parsing Libraries
pypdfium2
- BSD-3 license
- Text extraction via page.get_textpage()
- Incremental parsing support
- Form XObjects support
- Production-ready performance
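
The text-extraction entry above can be sketched as follows. This is a minimal example under assumptions, not a reference implementation: the function name extract_page_texts is hypothetical, and it assumes pypdfium2 version 4 or later, where PdfDocument, page.get_textpage() and textpage.get_text_range() are available.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of every page as a list of strings."""
    # Imported lazily so the module loads even where pypdfium2 is not installed.
    import pypdfium2 as pdfium  # third-party: pip install pypdfium2

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            # With no arguments, get_text_range() returns the whole page text.
            texts.append(textpage.get_text_range())
            # Release child objects before their parents.
            textpage.close()
            page.close()
    finally:
        pdf.close()
    return texts
```

Calling extract_page_texts("my.pdf") should give one string per page; "my.pdf" is a placeholder path.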
PyMuPDF
- AGPL-3.0 license (commercial license available)
- Full text, image, vector graphics access
- Coordinate-mapped text extraction: page.get_text("dict")
- Vector graphics manipulation
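
The coordinate-mapped output can be walked as in the sketch below, assuming the call above is PyMuPDF's page.get_text("dict"), which returns nested blocks, lines and spans. The helper names iter_spans and load_page_dict are hypothetical.

```python
def iter_spans(page_dict):
    """Yield (text, bbox, font_size) for every text span in a page dict."""
    for block in page_dict.get("blocks", []):
        # Block type 0 is text; image blocks (type 1) carry no lines.
        if block.get("type") != 0:
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], tuple(span["bbox"]), span["size"]


def load_page_dict(pdf_path, page_number=0):
    """Open a PDF with PyMuPDF and return one page as a nested dict."""
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("dict")
```

Each bbox is (x0, y0, x1, y1) in page coordinates, so spans can be sorted or grouped by position before any layout analysis.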
pdfplumber
- MIT license
- Table extraction: page.extract_table()
- Configurable vertical/horizontal strategies
- SVG page preview generation
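
A sketch of table extraction with explicit strategies, assuming the entry above refers to pdfplumber, whose extract_table() accepts a table_settings dict with vertical_strategy and horizontal_strategy keys. The strategy values chosen here and the helper names extract_first_table and rows_to_records are illustrative assumptions.

```python
def rows_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts.

    Empty cells come back as None from the extractor; they become "".
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [
        dict(zip(header, [(cell or "").strip() for cell in row]))
        for row in table[1:]
    ]


def extract_first_table(pdf_path, page_number=0):
    """Extract the largest table on one page as a list of rows (or None)."""
    import pdfplumber  # third-party: pip install pdfplumber

    # Ruling lines define the columns; text alignment defines the rows.
    settings = {"vertical_strategy": "lines", "horizontal_strategy": "text"}
    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_number].extract_table(table_settings=settings)
```

rows_to_records(extract_first_table("my.pdf")) would then give one dict per data row; "my.pdf" is a placeholder path.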
Layout Analysis Models
DocLayout-YOLO
- 18 ms/page processing speed
- DocSynth-300K pre-training
- Fine-tunable
- PP-YOLOv2 / YOLOX architectures
- OCR pipeline integration
- Multi-column layout support
- 2+ A100 GPU requirement
- Hugging Face/Detectron training
- Zero-shot text prompt detection
- 10-15 mAP improvement with fine-tuning
Table Extraction Models
PaddleOCR Table
- HTML output format
- COCO-style annotation support
- Fine-tunable
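
Because the table structure comes back as HTML, a downstream step usually has to turn that markup into rows. The sketch below uses only the Python standard library; the helper name html_table_to_rows is hypothetical, and it assumes a simple table with no nesting and ignores rowspan/colspan.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Return the table in an HTML string as a list of rows of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Spanning cells would need extra bookkeeping to keep columns aligned; this version simply lists the cells each row contains.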
- Deformable DETR backbone
- PubTables-1M format support
- Row/column/cell detection
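
Detectors of this kind typically emit row and column boxes, and cells can be derived by intersecting them. The following is a generic post-processing sketch, not the model's own code; boxes are assumed to be (x0, y0, x1, y1) tuples with the origin at the top left, and the function names are hypothetical.

```python
def intersect(a, b):
    """Return the overlap of two boxes, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def build_cell_grid(row_boxes, col_boxes):
    """Return grid[r][c] = cell box, with rows top-to-bottom and columns left-to-right."""
    rows = sorted(row_boxes, key=lambda box: box[1])  # sort by top edge
    cols = sorted(col_boxes, key=lambda box: box[0])  # sort by left edge
    return [[intersect(row, col) for col in cols] for row in rows]
```

Spanning cells, when detected separately, would be merged into this grid in a later step.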
- Pixel-to-token framework that jointly predicts table structure, cell content & bounding boxes
- SOTA results on four benchmark datasets 
- MIT-licensed
- Direct image-to-markdown conversion
- 10k steps fine-tuning process
- Hugging Face integration
General Purpose Libraries
PaddleOCR / PP-Structure
- Layout, table, OCR, key-value pipeline
- YAML configuration system
- Standard training interface
- Element-based document partitioning
- Custom partitioner support
MarkItDown
- Multi-format (PDF, Office, HTML, images, audio) → Markdown conversion
- One-line CLI (markitdown my.pdf > out.md) + Python API
- MIT-licensed, plug-in architecture
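
The Python API counterpart of the CLI call can be sketched as below, assuming the markitdown package's MarkItDown class, its convert() method and the text_content attribute of the result. The wrapper name pdf_to_markdown and its converter parameter are hypothetical additions that make the function easy to test with a stand-in.

```python
from pathlib import Path


def pdf_to_markdown(source_path, out_path, converter=None):
    """Convert a document to Markdown, write it to out_path, and return the text."""
    if converter is None:
        # Imported lazily so the module loads even where markitdown is not installed.
        from markitdown import MarkItDown  # third-party: pip install markitdown

        converter = MarkItDown()
    result = converter.convert(source_path)
    Path(out_path).write_text(result.text_content, encoding="utf-8")
    return result.text_content
```

pdf_to_markdown("my.pdf", "out.md") mirrors the one-line CLI invocation shown in the list.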
docTR
- Apache-2.0 license
- DBNet, CRNN, ViTSTR models
- Hugging Face checkpoint compatibility