What is Document Ingestion?

The information contained in a raw PDF is messy. Even in a text-only PDF, the file does not encode semantic boundaries like paragraphs, sections, or even sentences. Instead, text is stored as small blocks with coordinates. Tables and charts are stored as lines and curves (i.e., vector graphics), not in any tabular format. PDF files also often come from scanners, in which case they are effectively images and the data is not structured at all. The goal of document ingestion is to convert such data into formats that are easily consumable by downstream systems like LLMs / VLMs, RAG pipelines, or traditional databases. Document ingestion is sometimes casually referred to as OCR, which technically means recognizing text in images. OCR is a part of document ingestion, but document ingestion also includes turning the document into structured data like JSON or HTML.
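As a concrete target, the structured output for a single ingested page might look like the following. This is a hypothetical schema for illustration, not a standard format; real pipelines each define their own:

```python
import json

# Hypothetical structured representation of one ingested PDF page.
# Each element carries its type, a bounding box (x0, y0, x1, y1 in
# page coordinates), and the extracted content.
page = {
    "page_number": 1,
    "elements": [
        {"type": "paragraph", "bbox": [72, 90, 540, 160],
         "text": "Quarterly revenue grew in all regions."},
        {"type": "table", "bbox": [72, 180, 540, 320],
         "rows": [["Region", "Revenue"], ["EMEA", "1.2M"], ["APAC", "0.9M"]]},
        {"type": "figure", "bbox": [72, 340, 540, 520],
         "caption": "Revenue by region, 2023"},
    ],
}

# Serialize for a downstream RAG pipeline or database.
serialized = json.dumps(page, indent=2)
print(serialized)
```

Unlike the raw PDF, this representation makes tables queryable row by row and lets a RAG system index paragraphs and captions directly.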

Are VLMs not Enough?

Documents come in very diverse formats, varying across industries and use cases. While VLMs are improving quickly, they are still not good enough to turn any kind of document from image to structured data in one shot. For tables and charts, accuracy is paramount: a missing digit or a cell entry shifted by one column can critically change the content. These are very common failure modes in VLMs that claim to have solved OCR (try it yourself!). A much better solution is to break the problem into multiple stages that can be handled by specialized modules, some of which leverage VLMs. This leads to higher accuracy, better handling of edge cases, and often cheaper and faster pipelines.

Document Ingestion Pipeline

Layout Analysis

The first stage of document processing is layout analysis. The goal of layout analysis is to identify bounding boxes for the different semantic sections within a page. For example, a page containing a table, a chart, several paragraphs, and a footer needs each of these recognized as a separate element. Layout analysis outputs a series of bounding box coordinates enclosing these elements. Note that the requirements for layout analysis can differ depending on the downstream purpose. For example, a formula within a paragraph can be detected either as part of the paragraph or as a separate element.
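A minimal sketch of what this stage produces, assuming a hypothetical layout detector (real systems use trained detection models); the post-processing shown here, filtering detections by confidence before they go downstream, is a typical step:

```python
from dataclasses import dataclass

@dataclass
class LayoutElement:
    label: str    # e.g. "paragraph", "table", "chart", "footer"
    bbox: tuple   # (x0, y0, x1, y1), top-left origin, page coordinates
    score: float  # detector confidence

def filter_elements(elements, min_score=0.5):
    """Drop low-confidence detections before passing them downstream."""
    return [e for e in elements if e.score >= min_score]

# Example output of a (hypothetical) layout detector on one page:
detections = [
    LayoutElement("paragraph", (72, 90, 540, 200), 0.98),
    LayoutElement("table", (72, 220, 540, 400), 0.95),
    LayoutElement("chart", (72, 420, 300, 560), 0.91),
    LayoutElement("footer", (72, 740, 540, 760), 0.40),
]
kept = filter_elements(detections)
print([e.label for e in kept])  # the low-confidence footer is dropped
```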

Reading Order Detection

A subtle but necessary step is reading order detection. For a two-column page, for example, a simple vertical or horizontal sweep will not cut it. The detected elements need to be re-ordered if knowing their relations in the context of the page is important. Depending on the use case, this step can be skipped.
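To see why a naive top-to-bottom sort fails on a two-column page, consider this sketch. It uses a simple heuristic, split boxes at the page midline and read the left column before the right; production systems use more robust methods (e.g., recursive XY-cut or learned orderings):

```python
def reading_order(boxes, page_width):
    """Order bounding boxes for a two-column page: left column
    top-to-bottom, then right column. Boxes are (x0, y0, x1, y1)
    with a top-left origin. A plain sort on y would incorrectly
    interleave the two columns."""
    mid = page_width / 2
    left = sorted((b for b in boxes if (b[0] + b[2]) / 2 < mid),
                  key=lambda b: b[1])
    right = sorted((b for b in boxes if (b[0] + b[2]) / 2 >= mid),
                   key=lambda b: b[1])
    return left + right

# Two-column page, 600 pt wide; elements arrive in arbitrary detector order.
boxes = [
    (320, 100, 580, 200),  # right column, top
    (20, 400, 280, 500),   # left column, bottom
    (20, 100, 280, 200),   # left column, top
    (320, 400, 580, 500),  # right column, bottom
]
ordered = reading_order(boxes, 600)
print(ordered)
```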

Content Extraction

In many use cases, once the different elements are detected, they need to be converted from images to text. For example, the image of a table is not useful unless it can be converted to tabular format. The goal of this stage is to turn the elements identified in the previous stages into text. For the image of a text paragraph in a scanned document, an OCR model can be used. There are good specialized models for handwriting recognition. For flowcharts, a specialized VLM may be needed to convert them to a domain-specific language (DSL). Natural images may be kept as images, or can be captioned into text, which is helpful for building search systems later.
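The routing logic this stage implies can be sketched as a dispatch table mapping element labels to specialized extractors. The extractor functions below are stand-ins for real models (an OCR engine, a table-structure model, an image captioner), not actual library calls:

```python
# Stand-ins for real specialized models.
def extract_text(crop):
    return {"type": "text", "content": f"OCR({crop})"}

def extract_table(crop):
    return {"type": "table", "content": f"TABLE({crop})"}

def caption_image(crop):
    return {"type": "image", "content": f"CAPTION({crop})"}

# Route each detected element to the extractor suited to its label.
EXTRACTORS = {
    "paragraph": extract_text,
    "table": extract_table,
    "figure": caption_image,
}

def extract(element_label, crop):
    # Fall back to plain OCR for labels without a dedicated model.
    extractor = EXTRACTORS.get(element_label, extract_text)
    return extractor(crop)

print(extract("table", "page1_box2"))
```

This structure makes it easy to swap in a better model for one element type (say, a handwriting-specific recognizer for handwritten paragraphs) without touching the rest of the pipeline.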

Final Processing

As the final step, the outputs from the previous stages may be re-formatted to suit downstream use cases. For example, you may want to aggregate information from multiple tables into a single JSON output, or keep only the paragraphs that mention certain topics.
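Both of those examples can be sketched in a few lines, assuming elements in the hypothetical extracted form used here (tables as row lists with a repeated header, paragraphs as plain text):

```python
# Extracted elements from earlier stages (hypothetical shapes).
elements = [
    {"type": "paragraph", "text": "Revenue grew due to strong APAC demand."},
    {"type": "table", "rows": [["Region", "Revenue"], ["APAC", "0.9M"]]},
    {"type": "paragraph", "text": "The office relocated in March."},
    {"type": "table", "rows": [["Region", "Revenue"], ["EMEA", "1.2M"]]},
]

# Aggregate all table rows (skipping each table's header row)
# into a single list of records, keyed by the shared header.
header = next(e for e in elements if e["type"] == "table")["rows"][0]
records = [dict(zip(header, row))
           for e in elements if e["type"] == "table"
           for row in e["rows"][1:]]

# Keep only the paragraphs that mention a topic of interest.
keyword = "revenue"
relevant = [e["text"] for e in elements
            if e["type"] == "paragraph" and keyword in e["text"].lower()]

print(records)
print(relevant)
```

In practice this step is the most use-case-specific of the pipeline, so it tends to be custom code rather than a reusable module.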