
What is this library?

Parseport is a composable and customizable library for document ingestion: converting documents into formats that LLMs, VLMs, and traditional software can easily work with. Documents come in diverse layouts and content. Turning them into formats that generative models or traditional databases can consume often requires custom fine-tuned models and specialized processing steps. The goal of the library is to provide a framework and the tools for building your own document ingestion pipeline. The library encourages selectively picking out the parts of it that are useful for a particular use case. It can be integrated with other popular libraries for document processing. It also makes it easy to bring in custom models.

Why a Library?

There are many great document processing API products. However, API-based solutions are not always suitable. Some users process highly sensitive documents and prefer to process them in an on-prem environment. Some long-tail document types are not handled well by off-the-shelf APIs and instead require custom fine-tuned models. Other users already have GPU machines on hand and can use those existing assets to save cost. Parseport fills these gaps.

How is it different from other libraries?

There are a few other great libraries that Parseport was inspired by, including the ones below.

PaddleOCR

PaddleOCR provides a zoo of models for layout analysis, OCR, table structure recognition, chart-to-table conversion, and more, as well as abstractions for putting them together in pipelines. The pretrained models are excellent and often state of the art on standard benchmarks. It also provides recipes for fine-tuning.

MarkItDown

MarkItDown can be used to convert arbitrary file types (PDF, Word, Excel, images, HTML) to Markdown. To identify structure in the file, it relies on one of a few backend options: heuristics based on text block coordinates and sizes, a vision model hosted on Azure, or simple OCR models.

Unstructured

Similar to MarkItDown, Unstructured extracts text structure from raw files, except that it outputs its own data structures instead of Markdown. Similarly, it handles parsing via heuristics, OCR, or VLMs hosted on API endpoints.

The key limitation of these libraries is that it is not easy to bring custom models into their pipelines. While PaddleOCR supports fine-tuned models, its APIs only support checkpoints from the same architectures. For MarkItDown and Unstructured, the APIs are much higher level, and the main knobs are the flags you provide. Parseport was designed to expose the individual components in the pipeline, making it easy to use custom models, including the ones provided by PaddleOCR, other open source models, and API-based pipelines.
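To make the idea of exposed, swappable components concrete, here is a minimal sketch in plain Python. It is not Parseport's actual API: every name in it (Region, LayoutDetector, RegionParser, Pipeline, and the stub classes) is hypothetical and chosen only for illustration, and a page is represented as a plain string to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


@dataclass
class Region:
    """A labeled area of a page (hypothetical type, not from Parseport)."""
    label: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized


class LayoutDetector(Protocol):
    """Anything that can split a page into labeled regions."""
    def detect(self, page: str) -> List[Region]: ...


class RegionParser(Protocol):
    """Anything that can turn one region of a page into text."""
    def parse(self, page: str, region: Region) -> str: ...


class WholePageLayout:
    """Stub detector: treats the entire page as a single 'text' region."""
    def detect(self, page: str) -> List[Region]:
        return [Region("text", (0.0, 0.0, 1.0, 1.0))]


class PassthroughParser:
    """Stub parser standing in for an off-the-shelf model."""
    def parse(self, page: str, region: Region) -> str:
        return page


class UppercaseParser:
    """Stub parser standing in for a custom fine-tuned model."""
    def parse(self, page: str, region: Region) -> str:
        return page.upper()


@dataclass
class Pipeline:
    """Routes each detected region to the parser registered for its label."""
    layout: LayoutDetector
    parsers: Dict[str, RegionParser]

    def run(self, page: str) -> List[str]:
        return [
            self.parsers[region.label].parse(page, region)
            for region in self.layout.detect(page)
        ]


# Swapping in a custom component changes one argument, not the pipeline.
default = Pipeline(WholePageLayout(), {"text": PassthroughParser()})
custom = Pipeline(WholePageLayout(), {"text": UppercaseParser()})
print(default.run("invoice 42"))  # ['invoice 42']
print(custom.run("invoice 42"))   # ['INVOICE 42']
```

Because each stage only has to satisfy a small interface, a model from PaddleOCR, another open source model, or a call to a hosted API could each be wrapped in a class with the same method and dropped into the same slot.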