Open Source Tools - Parseport

PDF Parsing Libraries
Layout Analysis Models
Table Extraction Models
General Purpose Libraries

PDF Parsing Libraries

pypdfium2

BSD-3 license
Text extraction via page.get_textpage()
Incremental parsing support
Form XObjects support
Production-ready performance
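
A minimal sketch of the text-extraction path named above, assuming pypdfium2 v4 or later is installed. The function name extract_page_texts is hypothetical; PdfDocument, get_textpage() and get_text_range() are the library calls it relies on.

```python
def extract_page_texts(pdf_path):
    """Return one text string per page of the PDF at pdf_path."""
    # Imported lazily so the module loads even where pypdfium2 is absent.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            # With no arguments, get_text_range() returns the whole page text.
            texts.append(textpage.get_text_range())
            # Release native handles in reverse order of creation.
            textpage.close()
            page.close()
    finally:
        pdf.close()
    return texts
```

Calling extract_page_texts("my.pdf") would return a list with one entry per page; the file name is a placeholder.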

PyMuPDF (fitz)

AGPL-3.0 license (commercial license available)
Full text, image, vector graphics access
Coordinate-mapped text extraction: page.get_text("dict")
Vector graphics manipulation
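
The coordinate-mapped output of page.get_text("dict") is a nested structure of blocks, lines and spans. The sketch below flattens it into (text, bbox) pairs; iter_spans and spans_for_page are hypothetical helper names, and the code assumes PyMuPDF is importable as fitz.

```python
def iter_spans(page_dict):
    """Yield (text, bbox) for every text span in a get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                # bbox is (x0, y0, x1, y1) in page coordinates.
                yield span["text"], tuple(span["bbox"])


def spans_for_page(pdf_path, page_number=0):
    """Open a PDF with PyMuPDF and return the spans of one page."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        page_dict = doc[page_number].get_text("dict")
    return list(iter_spans(page_dict))
```

Keeping the flattening step separate from file access makes it easy to check against a hand-built dictionary before running it on real documents.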

pdfplumber

MIT license
Table extraction: page.extract_table()
Configurable vertical/horizontal strategies
SVG page preview generation
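
As an illustration of page.extract_table() with explicit vertical and horizontal strategies, the sketch below extracts the first table on a page and keys each row by the header. The helper names and the chosen strategy values ("lines" and "text") are assumptions for the example, not recommendations from the original list.

```python
def rows_to_records(rows):
    """Turn extract_table() output (a list of rows) into dicts keyed by the header row."""
    if not rows:  # extract_table() returns None when no table is found
        return []
    header = [(cell or "").strip() for cell in rows[0]]
    return [
        {key: (cell or "").strip() for key, cell in zip(header, row)}
        for row in rows[1:]
    ]


def extract_first_table(pdf_path, page_number=0):
    """Extract the first table on a page using ruling lines and text alignment."""
    import pdfplumber  # imported lazily

    settings = {"vertical_strategy": "lines", "horizontal_strategy": "text"}
    with pdfplumber.open(pdf_path) as pdf:
        rows = pdf.pages[page_number].extract_table(table_settings=settings)
    return rows_to_records(rows)
```

Empty cells come back as None from pdfplumber, which is why the helper normalises them to empty strings.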

Layout Analysis Models

DocLayout-YOLO

18ms/page processing speed
DocSynth-300K pre-training
Fine-tunable

PaddleOCR Layout

PP-YOLOv2 / YOLOX architectures
OCR pipeline integration

Deformable DETR

Multi-column layout support
Requires 2+ A100 GPUs
Hugging Face/Detectron training

Grounding DINO + SAM

Zero-shot text prompt detection
10-15 mAP improvement with fine-tuning

Table Extraction Models

PaddleOCR Table

HTML output format
COCO-style annotation support
Fine-tunable
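
Because the table model emits HTML, downstream code usually needs to turn that markup back into rows. The sketch below does this with the standard library only. It assumes a simple table string made of tr, td and th tags, ignores rowspan and colspan, and uses hypothetical names (TableRows, html_table_to_rows).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_to_rows(html):
    """Parse an HTML table string into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Merged cells would need extra handling of the rowspan and colspan attributes, which this sketch deliberately leaves out.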

Table Transformer (TATR)

DETR-based architecture
PubTables-1M format support
Row/column/cell detection
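
Row and column detections are typically combined into cells by intersecting their boxes. The sketch below shows that step in isolation, assuming boxes are (x0, y0, x1, y1) tuples in image coordinates. It is a simplified illustration, not the model's own post-processing: spanning cells and header classes are ignored, and the function names are hypothetical.

```python
def intersect(row_box, col_box):
    """Cell box: the column's x-extent combined with the row's y-extent."""
    _, row_y0, _, row_y1 = row_box
    col_x0, _, col_x1, _ = col_box
    return (col_x0, row_y0, col_x1, row_y1)


def build_cell_grid(row_boxes, col_boxes):
    """Return a row-major grid of cell boxes from detected rows and columns."""
    rows = sorted(row_boxes, key=lambda box: box[1])  # top to bottom
    cols = sorted(col_boxes, key=lambda box: box[0])  # left to right
    return [[intersect(row, col) for col in cols] for row in rows]
```

Sorting first means the detections can arrive in any order, which is the usual case for object-detection output.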

UniTable

Pixel-to-token framework that jointly predicts table structure, cell content & bounding boxes
SOTA results on four benchmark datasets
MIT-licensed

Donut

Direct image-to-markdown conversion
Fine-tuning over 10k steps
Hugging Face integration

General Purpose Libraries

PaddleOCR / PP-Structure

Layout, table, OCR, key-value pipeline
YAML configuration system
Standard training interface

Unstructured

Element-based document partitioning
Custom partitioner support
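
Element-based partitioning returns a flat list of typed elements. One common next step is grouping them by type, sketched below. It assumes each element exposes category and text attributes, as Unstructured elements do; group_by_category and partition_file are hypothetical helper names.

```python
from collections import defaultdict


def group_by_category(elements):
    """Group element texts by category (for example "Title" or "NarrativeText")."""
    groups = defaultdict(list)
    for element in elements:
        groups[element.category].append(element.text)
    return dict(groups)


def partition_file(path):
    """Partition a document with Unstructured and group the result by category."""
    # Imported lazily; partition() picks a partitioner based on the file type.
    from unstructured.partition.auto import partition

    return group_by_category(partition(filename=path))
```

The grouping helper depends only on the two attributes, so it also works with elements produced by a custom partitioner.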

MarkItDown

Multi-format (PDF, Office, HTML, images, audio) → Markdown conversion
One-line CLI (markitdown my.pdf > out.md) + Python API
MIT-licensed, plug-in architecture
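
The Python API mirrors the one-line CLI. A minimal sketch, assuming the markitdown package is installed; the wrapper name to_markdown is hypothetical.

```python
def to_markdown(path):
    """Convert a supported file (PDF, Office, HTML, ...) to a Markdown string."""
    from markitdown import MarkItDown  # imported lazily

    converter = MarkItDown()
    result = converter.convert(path)
    # text_content holds the Markdown rendering of the document.
    return result.text_content
```

Writing the returned string to out.md reproduces what the CLI redirect does.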

docTR

Apache-2.0 license
DBNet, CRNN, ViTSTR models
Hugging Face checkpoint compatibility
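
A short sketch of running docTR end to end with a DBNet detector and a CRNN recogniser, then reading the words back out of the exported result. The architecture names db_resnet50 and crnn_vgg16_bn are one possible pairing, chosen here as an assumption. The helper names are hypothetical, and the nested pages/blocks/lines/words layout reflects the export() format as understood at the time of writing.

```python
def words_from_export(exported):
    """Flatten a docTR export() dictionary into a list of recognised words."""
    words = []
    for page in exported.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    words.append(word["value"])
    return words


def ocr_pdf_words(path):
    """Run detection and recognition on a PDF and return the recognised words."""
    # Imported lazily; pretrained weights are downloaded on first use.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True)
    pages = DocumentFile.from_pdf(path)
    result = model(pages)
    return words_from_export(result.export())
```

Each exported word also carries a confidence score and geometry, which can be kept if downstream steps need positions.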
