PDF Extraction
Comprehensive Guide to PDF Extraction with Indexify
PDF (Portable Document Format) is a widely used file format for sharing documents. However, extracting useful information from PDFs can be immensely challenging. Indexify provides various extractors to help you extract text, images, and tables from PDF documents efficiently.
| Sample PDF Page |
|---|
| A sample page from a PDF document |
| Source: https://arxiv.org/pdf/2310.06825.pdf |
Let us see how Indexify performs on a page like this.
Sample Output
| Image Extraction | Table Extraction | Content Extraction |
|---|---|---|
| *(extracted figure image)* | {"0": ["Model", "Modality", "MMLU"], "1": ["LLaMA 2 7B", "Pretrained", "44.4"], "2": ["Code-Llama 7B", "Finetuned", "36.99"]} | Size and Efficiency. We computed "equivalent model sizes" of the Llama 2 family, aiming to understand Mistral 7B models... |
This guide will walk you through using Indexify for PDF extraction, from basic concepts to advanced use cases. Although this guide doesn't assume any familiarity with Indexify or its components, we highly recommend going through the Getting Started and Key Concepts guides first.
Getting Started for PDF Extraction
Prerequisites
Before we begin, ensure you have the following:
- Python 3.11 or earlier installed (Indexify currently requires this version range)
- Basic familiarity with Python programming
- An OpenAI API key (for using GPT models in the question-answering part)
- Command-line interface experience
Setup
You’ll need three separate terminal windows open for this tutorial:
- Terminal 1: For downloading and running the Indexify Server
- Terminal 2: For running Indexify extractors
- Terminal 3: For running Python scripts to load and query data from the Indexify server
We’ll use the following notation to indicate which terminal to use:
<command goes here>
Understanding PDF Extraction
PDF extraction is the process of pulling out useful information from PDF files. This can include:
- Text: The main content of the document.
- Images: Any pictures or graphics in the PDF.
- Tables: Structured data presented in a tabular format.
- Metadata: Information about the document itself, such as author, creation date, etc.
Extracting this information can be challenging due to the complex structure of PDF files. Indexify simplifies this process by providing specialized extractors for each type of content.
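As a toy illustration of what metadata extraction involves, the snippet below scrapes a PDF's Info dictionary with a regex. This only works for uncompressed, unencrypted files, which is exactly why dedicated extractors exist; the `SAMPLE` bytes and `read_info` helper are hypothetical and not part of Indexify:

```python
import re

# A minimal in-memory "PDF" carrying an uncompressed Info dictionary.
SAMPLE = (b"%PDF-1.4\n"
          b"1 0 obj\n"
          b"<< /Title (Mistral 7B) /Author (Mistral AI) >>\n"
          b"endobj\n"
          b"%%EOF")

def read_info(raw: bytes) -> dict:
    # Naive scan for /Key (value) pairs; real PDFs often compress or
    # encrypt this dictionary, so use a proper parser in practice.
    pairs = re.findall(rb"/(Title|Author|CreationDate)\s*\(([^)]*)\)", raw)
    return {k.decode(): v.decode() for k, v in pairs}

print(read_info(SAMPLE))  # {'Title': 'Mistral 7B', 'Author': 'Mistral AI'}
```

Real-world PDFs rarely yield to this kind of byte-level scanning, which is why the rest of this guide relies on purpose-built extractors.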
Indexify PDF Extractors
Indexify offers several extractors specifically designed for PDF documents. Here’s an overview of the main ones:
| Extractor Class Name | Features | Best Use Case |
|---|---|---|
| tensorlake/pdfextractor | Extracts text, images, and tables | Best for scientific papers with tabular information |
| tensorlake/easyocr | Focuses on text extraction | Optimized for photocopied or scanned PDFs (GPU-based) |
| tensorlake/marker | Converts PDFs to Markdown | Best for detailed, structured, and formatted PDFs |
Let’s visualize how these extractors fit into an extraction pipeline:
This diagram shows how different Indexify extractors can process a PDF document, each producing specific types of output. To learn more about the various types of extractors, or for a guide to creating your own, read the official docs here.
End-to-End Example: Building a PDF Knowledge Base
Let’s walk through a complete example of how to use Indexify to build a PDF knowledge base. This example will demonstrate how to extract text, images, and tables from a set of PDF documents and store the extracted information for later use. The PDF we will be using for this example is the technical report for Mistral, found here.
Directory Structure
indexify-pdf-extractor/
│
├── venv/ # Virtual environment
│
├── image_extraction.py # Script for image extraction
├── table_extraction.py # Script for table extraction
├── content_extraction.py # Script for content extraction
│
└── indexify # Indexify server executable
Setup
First, set up your environment:
curl https://getindexify.ai | sh
./indexify server -d
python3 -m venv venv
source venv/bin/activate
pip install indexify-extractor-sdk indexify requests
indexify-extractor download tensorlake/pdfextractor
indexify-extractor join-server
Image Extraction
Create a file named image_extraction.py with the following content:
from indexify import IndexifyClient, ExtractionGraph
import requests

# Set up the extraction graph
client = IndexifyClient()
extraction_graph_spec = """
name: 'image_extractor'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: 'pdf_to_image'
    input_params:
      output_types: ["image"]
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

def download_pdf(url, save_path):
    response = requests.get(url)
    with open(save_path, 'wb') as f:
        f.write(response.content)
    print(f"PDF downloaded and saved to {save_path}")

def get_images(pdf_path):
    client = IndexifyClient()

    # Upload the PDF file
    content_id = client.upload_file("image_extractor", pdf_path)

    # Wait for the extraction to complete
    client.wait_for_extraction(content_id)

    # Retrieve the images content
    images = client.get_extracted_content(
        ingested_content_id=content_id,
        graph_name="image_extractor",
        policy_name="pdf_to_image"
    )
    return images

if __name__ == "__main__":
    pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"
    pdf_path = "reference_document.pdf"

    # Download the PDF
    download_pdf(pdf_url, pdf_path)

    # Get images from the PDF
    images = get_images(pdf_path)
    for image in images:
        content_id = image.features[0].name
        with open(f"{content_id}.png", 'wb') as f:
            print("writing image", content_id)
            f.write(image.data)
Run the script:
python3 image_extraction.py
Following is the output of the extraction process.
| Sample Page to extract image from | Sample Image extracted from page |
|---|---|
| *(page image)* | *(extracted image)* |
Table Extraction
Create a file named table_extraction.py with the following content:
from indexify import IndexifyClient, ExtractionGraph
import requests

# Set up the extraction graph
client = IndexifyClient()
extraction_graph_spec = """
name: 'table_extractor'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: 'pdf_to_table'
    input_params:
      output_types: ["table"]
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

def download_pdf(url, save_path):
    response = requests.get(url)
    with open(save_path, 'wb') as f:
        f.write(response.content)
    print(f"PDF downloaded and saved to {save_path}")

def get_tables(pdf_path):
    client = IndexifyClient()

    # Upload the PDF file
    content_id = client.upload_file("table_extractor", pdf_path)

    # Wait for the extraction to complete
    client.wait_for_extraction(content_id)

    # Retrieve the tables content
    tables = client.get_extracted_content(
        ingested_content_id=content_id,
        graph_name="table_extractor",
        policy_name="pdf_to_table"
    )
    return tables[0].features[0].value if tables else None

if __name__ == "__main__":
    pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"
    pdf_path = "reference_document.pdf"

    # Download the PDF
    download_pdf(pdf_url, pdf_path)

    # Get tables from the PDF
    tables = get_tables(pdf_path)
    print("Tables from the PDF:")
    print(tables)
Run the script:
python3 table_extraction.py
Following is the output of the extraction process.
| Sample Page to extract table from | Sample Extracted Table |
|---|---|
| *(page image)* | {"0": ["Model", "Modality", "MMLU"], "1": ["LLaMA 2 7B", "Pretrained", "44.4"], "2": ["Code-Llama 7B", "Finetuned", "36.99"]} |
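The table payload above is a JSON mapping of row indices to cell lists. A small sketch (plain Python, independent of Indexify) of turning such a payload into a list of records:

```python
import json

# The sample table payload shown above.
raw = ('{"0": ["Model", "Modality", "MMLU"], '
       '"1": ["LLaMA 2 7B", "Pretrained", "44.4"], '
       '"2": ["Code-Llama 7B", "Finetuned", "36.99"]}')

table = json.loads(raw)
rows = [table[k] for k in sorted(table, key=int)]  # restore row order
header, body = rows[0], rows[1:]

# One dict per data row, keyed by the header cells.
records = [dict(zip(header, row)) for row in body]
print(records[0])  # {'Model': 'LLaMA 2 7B', 'Modality': 'Pretrained', 'MMLU': '44.4'}
```

From here the records drop straight into a CSV writer or a DataFrame for analysis.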
Text Extraction
For simple text extraction, follow these steps:
- Create a file named content_extraction.py with the following content:
from indexify import IndexifyClient, ExtractionGraph
import requests

client = IndexifyClient()
extraction_graph_spec = """
name: 'pdfknowledgebase'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: 'pdf_to_text'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

def process_pdf(url):
    # Download the PDF
    response = requests.get(url)
    pdf_content = response.content
    filename = url.split("/")[-1]

    # Upload and process the PDF
    content_id = client.upload_bytes("pdfknowledgebase", pdf_content, filename)

    # Wait for the extraction to complete
    client.wait_for_extraction(content_id)

    # Retrieve and print extracted content
    extracted_content = client.get_extracted_content(
        ingested_content_id=content_id,
        graph_name="pdfknowledgebase",
        policy_name="pdf_to_text"
    )
    print(f"Extracted content from {filename}:")
    print(extracted_content)

if __name__ == "__main__":
    pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"
    process_pdf(pdf_url)
- Run the script:
python3 content_extraction.py
Following is the output of the extraction process.
| Sample Page to extract content from | Sample Extracted Content |
|---|---|
| *(page image)* | Size and Efficiency. We computed "equivalent model sizes" of the Llama 2 family, aiming to understand Mistral 7B models... |
This guide demonstrates how to use Indexify for PDF extraction. Key points from these examples:
- Separate scripts handle image extraction, table extraction, and text extraction.
- PDFs are downloaded from URLs instead of using local file paths.
- Extraction graphs are set up for each specific task.
- The extraction scripts can be extended to process multiple PDFs in sequence.
You can further extend these scripts to include additional processing steps or to handle multiple PDFs as needed for your specific use case.
Visualize the Extraction Process
Let’s visualize our PDF knowledge base extraction process:
This diagram illustrates how PDF documents are processed through Indexify, extracted using the tensorlake/pdfextractor extractor, and stored in a knowledge base for later querying and retrieval.
Advanced Use Cases
Once you have extracted content from your PDFs, you can use it for various advanced applications:
- Document Search Engine: Index the extracted text to create a searchable database of your PDFs.
- Image Analysis: Process extracted images for tasks like object detection or classification.
- Data Analysis: Use extracted tables for data analysis and visualization.
- Question Answering: Combine extracted content with language models for automated question answering.
Here’s a simple example of how you might use the extracted text for a basic search functionality:
from indexify import IndexifyClient
from rapidfuzz import fuzz

client = IndexifyClient()

def search_pdfs(query, graph_name="pdfknowledgebase", policy_name="pdf_to_text"):
    results = []
    for content_id in client.list_content_ids(graph_name):
        extracted_content = client.get_extracted_content(
            ingested_content_id=content_id,
            graph_name=graph_name,
            policy_name=policy_name
        )
        for content in extracted_content:
            if content.content_type == "text/plain":
                text = content.data.decode('utf-8')
                similarity = fuzz.partial_ratio(query.lower(), text.lower())
                if similarity > 70:  # Adjust this threshold as needed
                    results.append((content_id, similarity))
    return sorted(results, key=lambda x: x[1], reverse=True)

# Example usage
search_results = search_pdfs("machine learning")
for content_id, similarity in search_results[:5]:  # Top 5 results
    print(f"Content ID: {content_id}, Similarity: {similarity}")
This script implements a simple fuzzy search across all extracted text from your PDFs, returning the most relevant documents based on the search query.
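rapidfuzz is a third-party dependency. If installing it isn't an option, the standard library's difflib can approximate `partial_ratio` with a sliding window; this is a rough, slower sketch whose scores won't match rapidfuzz exactly:

```python
from difflib import SequenceMatcher

def partial_score(query: str, text: str) -> float:
    # Slide a query-sized window over the text and keep the best
    # similarity, scaled to 0-100 like rapidfuzz's partial_ratio.
    q, t = query.lower(), text.lower()
    if len(t) <= len(q):
        return SequenceMatcher(None, q, t).ratio() * 100
    best = 0.0
    for i in range(len(t) - len(q) + 1):
        best = max(best, SequenceMatcher(None, q, t[i:i + len(q)]).ratio())
    return best * 100

print(partial_score("machine learning", "We apply machine learning to PDFs."))  # 100.0
```

Note the quadratic cost per window; for large corpora, prefer rapidfuzz or a proper search index.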
Extractor Performance Analysis
When choosing an extractor for your PDF processing needs, it’s important to consider both accuracy and speed. Here’s a comparison of the performance of different Indexify PDF extractors:
Here’s a quadrant chart diagram of the performance:
This chart provides a visual representation of how different extractors perform in terms of speed and accuracy. The ideal extractor would be in the top-right quadrant (Fast and Accurate).
Category: Scientific Papers and Books
- crowd.pdf - Reference: crowd.md (10 pages)
- multicolcnn.pdf - Reference: multicolcnn.md (10 pages)
- switch_trans.pdf - Reference: switch_trans.md (40 pages)
- thinkdsp.pdf - Reference: thinkdsp.md (153 pages)
- thinkos.pdf - Reference: thinkos.md (99 pages)
- thinkpython.pdf - Reference: thinkpython.md (240 pages)
Scoring Methodology
The scoring was done by comparing the extracted text from each PDF with its reference text using the following steps:
- Chunking: Both the hypothesis (extracted text) and the reference text are divided into chunks of 500 characters, ensuring chunks have a minimum of 25 characters.
- Overlap Score Calculation: For each chunk of the hypothesis text, a fuzzy matching score with the best-matching reference chunk within a defined range is computed using the rapidfuzz library. This range is determined based on a length modifier and search distance.
- Final Score: The average of these chunk scores provides a final alignment score between 0 and 1, indicating how closely the extracted text matches the reference.
Extractor Performance
| Extractor | Accuracy | Speed | Best For |
|---|---|---|---|
| tensorlake/pdfextractor | High | Fast | General-purpose extraction |
| tensorlake/easyocr | High | Slow | Scanned documents (GPU-based) |
| tensorlake/marker | Very High | Slow | Detailed, structured PDFs |
| layoutlm-document-qa | High | Medium | Question-answering tasks |
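When reproducing numbers like these on your own documents, a small timing harness is enough to compare extractors; `extract_fn` below is a stand-in for whichever extraction call you want to benchmark:

```python
import time

def benchmark(extract_fn, docs: dict) -> dict:
    # Run the extractor over each named document and record
    # wall-clock time per document.
    timings = {}
    for name, doc in docs.items():
        start = time.perf_counter()
        extract_fn(doc)
        timings[name] = time.perf_counter() - start
    return timings

# Example with a trivial stand-in extractor:
timings = benchmark(lambda doc: doc.upper(), {"thinkdsp.pdf": "sample text"})
print(sorted(timings))  # ['thinkdsp.pdf']
```

Run the same harness once per extractor (with the same document set) and compare the resulting timing dicts alongside the alignment scores.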
Reproduction Code for Comparison
| Resource | Link |
|---|---|
| Code for Creating Graphs | Google Colab |
| Benchmarking Notebook | Google Colab |
| Benchmarking Local Script | GitHub Repository |
| Extractor Performance Benchmarks | Marker GitHub Repository |
Note: Thanks to the Marker team for the code used in the Extractor Performance Benchmarks.
Explore More Examples
We’ve curated a collection of PDF extraction examples. Check out these notebooks:
- Tax Document Extraction (Uses OpenAI)
- Summarization (Uses Mistral)
- Entity Extraction (Uses Mistral)
- Chunk Extraction
- Multi-Modal RAG (Uses OpenAI)
- Structured Extraction guided by Schema (Uses OpenAI)
Conclusion
Indexify provides a powerful and flexible solution for PDF extraction, enabling you to build sophisticated document processing pipelines. By understanding the capabilities of different extractors and following best practices, you can efficiently extract valuable information from your PDF documents and use it in a wide range of applications.
Recommendation of Extractors
| Aspect | Extractor | Class Name | Analysis |
|---|---|---|---|
| Accuracy | Marker | tensorlake/marker | Consistently provides high accuracy scores across all PDF documents. |
| | EasyOCR | tensorlake/easyocr | Shows competitive accuracy across all the documents. |
| | Unstructured IO | tensorlake/pdfextractor | Fractionally better than EasyOCR on one of the books, and than OCRMyPDF on another. |
| | OpenAI GPT-4o | tensorlake/openai | Performs well for code-based texts, but only average for regular texts. |
| Time Efficiency | Unstructured IO | tensorlake/pdfextractor | The fastest, taking the least time for all PDF documents. |
| | EasyOCR | tensorlake/easyocr | Shows extreme variability in processing times, exceptionally fast for some documents and very slow for others. |
| | Marker | tensorlake/marker | Despite its high accuracy, significantly slower than the other extractors. |
| | OpenAI GPT-4o | tensorlake/openai | Takes by far the longest time and is best avoided for PDF extraction unless necessary. |
| Recommendations | Marker | tensorlake/marker | Use when accuracy is the primary concern and processing time is less critical. Ideal for scenarios requiring detailed and precise text extraction. |
| | EasyOCR | tensorlake/easyocr | Use when accuracy is important but you need faster extraction. EasyOCR also provides training recipes to fine-tune the model on private documents, which can improve its accuracy further. |
| | Unstructured IO | tensorlake/pdfextractor | Shows no significant accuracy advantages, but is known to handle many different variations of PDFs. |
While our extractors do most of the heavy lifting, there are some practical considerations to keep in mind when building with Indexify.
Best Practices and Tips
- Choose the Right Extractor: Select the extractor that best fits your specific use case and document types.
- Preprocess Your PDFs: Ensure your PDFs are of good quality. For scanned documents, consider using OCR-specific extractors.
- Handle Errors Gracefully: Some PDFs may fail to process. Implement error handling to manage these cases.
- Monitor Performance: Keep an eye on processing times and accuracy, especially when dealing with large volumes of documents.
- Post-process Extracted Content: Clean and format extracted text, validate table structures, and optimize images as needed.
- Implement Caching: For frequently accessed documents, consider caching extracted content to improve response times.
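As a sketch of the "Handle Errors Gracefully" point, a small retry helper with exponential backoff can wrap any flaky call; `process` below is a hypothetical stand-in for an extraction call, not part of the Indexify API:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.01):
    # Retry fn() with exponential backoff; re-raise after the last attempt.
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))
    return wrapper

# Stand-in for a flaky extraction call: fails twice, then succeeds.
calls = {"n": 0}
def process(pdf_path):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"extracted:{pdf_path}"

result = with_retries(process)("reference_document.pdf")
print(result, calls["n"])  # extracted:reference_document.pdf 3
```

In production you'd likely retry only on transient errors (timeouts, connection resets) and log permanently failing documents for manual review.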
Remember, the key to successful PDF extraction is choosing the right tools for your specific needs and continuously refining your process. Experiment with different extractors and configurations to find the optimal setup for your use case.