PDF Extraction
Comprehensive Guide to PDF Extraction with Indexify
PDF (Portable Document Format) is a widely used file format for sharing documents. However, extracting useful information from PDFs can be immensely challenging. Indexify provides various extractors to help you extract text, images, and tables from PDF documents efficiently.
| Sample PDF Page |
|---|
| A sample page from a PDF document |
| Source: https://arxiv.org/pdf/2310.06825.pdf |
Let us see how Indexify performs on a page like this.
Sample Output
| Image Extraction | Table Extraction | Content Extraction |
|---|---|---|
| *(extracted figure image)* | {"0": ["Model", "Modality", "MMLU"], "1": ["LLaMA 2 7B", "Pretrained", "44.4"], "2": ["Code-Llama 7B", "Finetuned", "36.99"]} | Size and Efficiency. We computed "equivalent model sizes" of the Llama 2 family, aiming to understand Mistral 7B models... |
This guide will walk you through using Indexify for PDF extraction, from basic concepts to advanced use cases. Although this guide doesn't assume any familiarity with Indexify or its components, we highly recommend going through the Getting Started and Key Concepts guides first.
Getting Started for PDF Extraction
Prerequisites
Before we begin, ensure you have the following:
- Python 3.11 or earlier installed (Indexify currently requires this version range)
- Basic familiarity with Python programming
- An OpenAI API key (for using GPT models in the question-answering part)
- Command-line interface experience
Setup
You’ll need three separate terminal windows open for this tutorial:
- Terminal 1: For downloading and running the Indexify Server
- Terminal 2: For running Indexify extractors
- Terminal 3: For running Python scripts to load and query data from the Indexify server
We’ll use the following notation to indicate which terminal to use:
<command goes here>
Understanding PDF Extraction
PDF extraction is the process of pulling out useful information from PDF files. This can include:
- Text: The main content of the document.
- Images: Any pictures or graphics in the PDF.
- Tables: Structured data presented in a tabular format.
- Metadata: Information about the document itself, such as author, creation date, etc.
Extracting this information can be challenging due to the complex structure of PDF files. Indexify simplifies this process by providing specialized extractors for each type of content.
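As a toy illustration of what metadata extraction involves, the snippet below scrapes a PDF's Info dictionary with a regex. This only works for uncompressed, unencrypted files, which is exactly why dedicated extractors exist; the `SAMPLE` bytes and `read_info` helper are hypothetical and not part of Indexify:

```python
import re

# A minimal in-memory "PDF" carrying an uncompressed Info dictionary.
SAMPLE = (b"%PDF-1.4\n"
          b"1 0 obj\n"
          b"<< /Title (Mistral 7B) /Author (Mistral AI) >>\n"
          b"endobj\n"
          b"%%EOF")

def read_info(raw: bytes) -> dict:
    # Naive scan for /Key (value) pairs; real PDFs often compress or
    # encrypt this dictionary, so use a proper parser in practice.
    pairs = re.findall(rb"/(Title|Author|CreationDate)\s*\(([^)]*)\)", raw)
    return {k.decode(): v.decode() for k, v in pairs}

print(read_info(SAMPLE))  # {'Title': 'Mistral 7B', 'Author': 'Mistral AI'}
```

Real-world PDFs rarely yield to this kind of byte-level scanning, which is why the rest of this guide relies on purpose-built extractors.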
Indexify PDF Extractors
Indexify offers several extractors specifically designed for PDF documents. Here’s an overview of the main ones:
| Extractor Class Name | Features | Best Use Case |
|---|---|---|
| tensorlake/pdfextractor | Extracts text, images, and tables | Best for scientific papers with tabular information |
| tensorlake/easyocr | Focuses on text extraction | Optimized for photocopied or scanned PDFs (GPU-based) |
| tensorlake/marker | Converts PDFs to Markdown | Best for detailed, structured, and formatted PDFs |
Let’s visualize how these extractors fit into an extraction pipeline:
This diagram shows how different Indexify extractors can process a PDF document, each producing specific types of output. To learn more about the various types of extractors, or for a guide to creating your own, read the official docs here.
End-to-End Example: Building a PDF Knowledge Base
Let’s walk through a complete example of how to use Indexify to build a PDF knowledge base. This example will demonstrate how to extract text, images, and tables from a set of PDF documents and store the extracted information for later use. The PDF we will be using for this example is the technical report for Mistral, found here.
Directory Structure
indexify-pdf-extractor/
│
├── venv/ # Virtual environment
│
├── image_extraction.py # Script for image extraction
├── table_extraction.py # Script for table extraction
├── content_extraction.py # Script for content extraction
│
└── indexify # Indexify server executable
Setup
First, set up your environment:
curl https://getindexify.ai | sh
./indexify server -d
python3 -m venv venv
source venv/bin/activate
pip install indexify-extractor-sdk indexify requests
indexify-extractor download tensorlake/pdfextractor
indexify-extractor join-server
Image Extraction
Create a file named image_extraction.py with the following content:
from indexify import IndexifyClient, ExtractionGraph
import requests

# Set up the extraction graph
client = IndexifyClient()
extraction_graph_spec = """
name: 'image_extractor'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: 'pdf_to_image'
    input_params:
      output_types: ["image"]
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

def download_pdf(url, save_path):
    response = requests.get(url)
    with open(save_path, 'wb') as f:
        f.write(response.content)
    print(f"PDF downloaded and saved to {save_path}")

def get_images(pdf_path):
    client = IndexifyClient()

    # Upload the PDF file
    content_id = client.upload_file("image_extractor", pdf_path)

    # Wait for the extraction to complete
    client.wait_for_extraction(content_id)

    # Retrieve the images content
    images = client.get_extracted_content(
        ingested_content_id=content_id,
        graph_name="image_extractor",
        policy_name="pdf_to_image"
    )
    return images

if __name__ == "__main__":
    pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"
    pdf_path = "reference_document.pdf"

    # Download the PDF
    download_pdf(pdf_url, pdf_path)

    # Get images from the PDF
    images = get_images(pdf_path)
    for image in images:
        content_id = image.features[0].name
        with open(f"{content_id}.png", 'wb') as f:
            print("writing image", content_id)
            f.write(image.data)
Run the script:
python3 image_extraction.py
Following is the output of the extraction process.
| Sample Page to extract image from | Sample Image extracted from page |
|---|---|
| *(page image)* | *(extracted image)* |
Table Extraction
Create a file named table_extraction.py with the following content:
from indexify import IndexifyClient, ExtractionGraph
import requests

# Set up the extraction graph
client = IndexifyClient()
extraction_graph_spec = """
name: 'table_extractor'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: 'pdf_to_table'
    input_params:
      output_types: ["table"]
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

def download_pdf(url, save_path):
    response = requests.get(url)
    with open(save_path, 'wb') as f:
        f.write(response.content)
    print(f"PDF downloaded and saved to {save_path}")

def get_tables(pdf_path):
    client = IndexifyClient()

    # Upload the PDF file
    content_id = client.upload_file("table_extractor", pdf_path)

    # Wait for the extraction to complete
    client.wait_for_extraction(content_id)

    # Retrieve the tables content
    tables = client.get_extracted_content(
        ingested_content_id=content_id,
        graph_name="table_extractor",
        policy_name="pdf_to_table"
    )
    return tables[0].features[0].value if tables else None

if __name__ == "__main__":
    pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"
    pdf_path = "reference_document.pdf"

    # Download the PDF
    download_pdf(pdf_url, pdf_path)

    # Get tables from the PDF
    tables = get_tables(pdf_path)
    print("Tables from the PDF:")
    print(tables)
Run the script:
python3 table_extraction.py
Following is the output of the extraction process.
| Sample Page to extract table from | Sample Extracted Table |
|---|---|
| *(page image)* | {"0": ["Model", "Modality", "MMLU"], "1": ["LLaMA 2 7B", "Pretrained", "44.4"], "2": ["Code-Llama 7B", "Finetuned", "36.99"]} |
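The table payload above is a JSON mapping of row indices to cell lists. A small sketch (plain Python, independent of Indexify) of turning such a payload into a list of records:

```python
import json

# The sample table payload shown above.
raw = ('{"0": ["Model", "Modality", "MMLU"], '
       '"1": ["LLaMA 2 7B", "Pretrained", "44.4"], '
       '"2": ["Code-Llama 7B", "Finetuned", "36.99"]}')

table = json.loads(raw)
rows = [table[k] for k in sorted(table, key=int)]  # restore row order
header, body = rows[0], rows[1:]

# One dict per data row, keyed by the header cells.
records = [dict(zip(header, row)) for row in body]
print(records[0])  # {'Model': 'LLaMA 2 7B', 'Modality': 'Pretrained', 'MMLU': '44.4'}
```

From here the records drop straight into a CSV writer or a DataFrame for analysis.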
Text Extraction
For simple text extraction, follow these steps:
- Create a file named content_extraction.py with the following content:
from indexify import IndexifyClient, ExtractionGraph
import requests

client = IndexifyClient()
extraction_graph_spec = """
name: 'pdfknowledgebase'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: 'pdf_to_text'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

def process_pdf(url):
    # Download the PDF
    response = requests.get(url)
    pdf_content = response.content
    filename = url.split("/")[-1]

    # Upload and process the PDF
    content_id = client.upload_bytes("pdfknowledgebase", pdf_content, filename)

    # Wait for the extraction to complete
    client.wait_for_extraction(content_id)

    # Retrieve and print extracted content
    extracted_content = client.get_extracted_content(
        ingested_content_id=content_id,
        graph_name="pdfknowledgebase",
        policy_name="pdf_to_text"
    )
    print(f"Extracted content from {filename}:")
    print(extracted_content)

if __name__ == "__main__":
    pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"
    process_pdf(pdf_url)
- Run the script:
python3 content_extraction.py
Following is the output of the extraction process.
| Sample Page to extract content from | Sample Extracted Content |
|---|---|
| *(page image)* | Size and Efficiency. We computed "equivalent model sizes" of the Llama 2 family, aiming to understand Mistral 7B models... |
This guide demonstrates how to use Indexify for PDF extraction. Key points from these examples:
- Separate scripts handle image extraction, table extraction, and text extraction.
- PDFs are downloaded from URLs instead of using local file paths.
- Extraction graphs are set up for each specific task.
- The extraction scripts can be extended to process multiple PDFs in sequence.
You can further extend these scripts to include additional processing steps or to handle multiple PDFs as needed for your specific use case.
Visualize the Extraction Process
Let’s visualize our PDF knowledge base extraction process:
This diagram illustrates how PDF documents are processed through Indexify, extracted using the tensorlake/pdfextractor extractor, and stored in a knowledge base for later querying and retrieval.
Advanced Use Cases
Once you have extracted content from your PDFs, you can use it for various advanced applications:
- Document Search Engine: Index the extracted text to create a searchable database of your PDFs.
- Image Analysis: Process extracted images for tasks like object detection or classification.
- Data Analysis: Use extracted tables for data analysis and visualization.
- Question Answering: Combine extracted content with language models for automated question answering.
Here’s a simple example of how you might use the extracted text for a basic search functionality:
from indexify import IndexifyClient
from rapidfuzz import fuzz

client = IndexifyClient()

def search_pdfs(query, graph_name="pdfknowledgebase", policy_name="pdf_to_text"):
    results = []
    for content_id in client.list_content_ids(graph_name):
        extracted_content = client.get_extracted_content(
            ingested_content_id=content_id,
            graph_name=graph_name,
            policy_name=policy_name
        )
        for content in extracted_content:
            if content.content_type == "text/plain":
                text = content.data.decode('utf-8')
                similarity = fuzz.partial_ratio(query.lower(), text.lower())
                if similarity > 70:  # Adjust this threshold as needed
                    results.append((content_id, similarity))
    return sorted(results, key=lambda x: x[1], reverse=True)

# Example usage
search_results = search_pdfs("machine learning")
for content_id, similarity in search_results[:5]:  # Top 5 results
    print(f"Content ID: {content_id}, Similarity: {similarity}")
This script implements a simple fuzzy search across all extracted text from your PDFs, returning the most relevant documents based on the search query.
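rapidfuzz is a third-party dependency. If installing it isn't an option, the standard library's difflib can approximate `partial_ratio` with a sliding window; this is a rough, slower sketch whose scores won't match rapidfuzz exactly:

```python
from difflib import SequenceMatcher

def partial_score(query: str, text: str) -> float:
    # Slide a query-sized window over the text and keep the best
    # similarity, scaled to 0-100 like rapidfuzz's partial_ratio.
    q, t = query.lower(), text.lower()
    if len(t) <= len(q):
        return SequenceMatcher(None, q, t).ratio() * 100
    best = 0.0
    for i in range(len(t) - len(q) + 1):
        best = max(best, SequenceMatcher(None, q, t[i:i + len(q)]).ratio())
    return best * 100

print(partial_score("machine learning", "We apply machine learning to PDFs."))  # 100.0
```

Note the quadratic cost per window; for large corpora, prefer rapidfuzz or a proper search index.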
Extractor Performance Analysis
When choosing an extractor for your PDF processing needs, it’s important to consider both accuracy and speed. Here’s a comparison of the performance of different Indexify PDF extractors:
Here’s a quadrant chart diagram of the performance:
This chart provides a visual representation of how different extractors perform in terms of speed and accuracy. The ideal extractor would be in the top-right quadrant (Fast and Accurate).
Category: Scientific Papers and Books
- crowd.pdf - Reference: crowd.md (10 pages)
- multicolcnn.pdf - Reference: multicolcnn.md (10 pages)
- switch_trans.pdf - Reference: switch_trans.md (40 pages)
- thinkdsp.pdf - Reference: thinkdsp.md (153 pages)
- thinkos.pdf - Reference: thinkos.md (99 pages)
- thinkpython.pdf - Reference: thinkpython.md (240 pages)
Scoring Methodology
The scoring was done by comparing the extracted text from each PDF with its reference text using the following steps:
- Chunking: Both the hypothesis (extracted text) and the reference text are divided into chunks of 500 characters, ensuring chunks have a minimum of 25 characters.
- Overlap Score Calculation: For each chunk of the hypothesis text, a fuzzy matching score with the best-matching reference chunk within a defined range is computed using the rapidfuzz library. This range is determined based on a length modifier and search distance.
- Final Score: The average of these chunk scores provides a final alignment score between 0 and 1, indicating how closely the extracted text matches the reference.
Extractor Performance
| Extractor | Accuracy | Speed | Best For |
|---|---|---|---|
| tensorlake/pdfextractor | High | Fast | General-purpose extraction |
| tensorlake/easyocr | High | Slow | Scanned documents (GPU-based) |
| tensorlake/marker | Very High | Slow | Detailed, structured PDFs |
| layoutlm-document-qa | High | Medium | Question-answering tasks |
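When reproducing numbers like these on your own documents, a small timing harness is enough to compare extractors; `extract_fn` below is a stand-in for whichever extraction call you want to benchmark:

```python
import time

def benchmark(extract_fn, docs: dict) -> dict:
    # Run the extractor over each named document and record
    # wall-clock time per document.
    timings = {}
    for name, doc in docs.items():
        start = time.perf_counter()
        extract_fn(doc)
        timings[name] = time.perf_counter() - start
    return timings

# Example with a trivial stand-in extractor:
timings = benchmark(lambda doc: doc.upper(), {"thinkdsp.pdf": "sample text"})
print(sorted(timings))  # ['thinkdsp.pdf']
```

Run the same harness once per extractor (with the same document set) and compare the resulting timing dicts alongside the alignment scores.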
Reproduction Code for Comparison
| Resource | Link |
|---|---|
| Code for Creating Graphs | Google Colab |
| Benchmarking Notebook | Google Colab |
| Benchmarking Local Script | GitHub Repository |
| Extractor Performance Benchmarks | Marker GitHub Repository |
Note: Thanks to the Marker team for the code used in the Extractor Performance Benchmarks.
Explore More Examples
We’ve curated a collection of PDF extraction examples. Check out these notebooks:
- Tax Document Extraction (Uses OpenAI)
- Summarization (Uses Mistral)
- Entity Extraction (Uses Mistral)
- Chunk Extraction
- Multi-Modal RAG (Uses OpenAI)
- Structured Extraction guided by Schema (Uses OpenAI)
Conclusion
Indexify provides a powerful and flexible solution for PDF extraction, enabling you to build sophisticated document processing pipelines. By understanding the capabilities of different extractors and following best practices, you can efficiently extract valuable information from your PDF documents and use it in a wide range of applications.
Recommendation of Extractors
| Aspect | Extractor | Class Name | Analysis |
|---|---|---|---|
| Accuracy | Marker | tensorlake/marker | Consistently provides high accuracy scores across all PDF documents. |
| | EasyOCR | tensorlake/easyocr | Shows competitive accuracy across all the documents. |
| | Unstructured IO | tensorlake/pdfextractor | Fractionally better than EasyOCR on one of the books, and than OCRMyPDF on another. |
| | OpenAI GPT-4o | tensorlake/openai | Performs well for code-based texts, but only average for regular texts. |
| Time Efficiency | Unstructured IO | tensorlake/pdfextractor | The fastest, taking the least time for all PDF documents. |
| | EasyOCR | tensorlake/easyocr | Shows extreme variability in processing times, exceptionally fast for some documents and very slow for others. |
| | Marker | tensorlake/marker | Despite its high accuracy, significantly slower than the other extractors. |
| | OpenAI GPT-4o | tensorlake/openai | Takes by far the longest time and is best avoided for PDF extraction unless necessary. |
| Recommendations | Marker | tensorlake/marker | Use when accuracy is the primary concern and processing time is less critical. Ideal for scenarios requiring detailed and precise text extraction. |
| | EasyOCR | tensorlake/easyocr | Use when accuracy is important but you need faster extraction. EasyOCR also provides training recipes to fine-tune the model on private documents, which can improve its accuracy further. |
| | Unstructured IO | tensorlake/pdfextractor | Shows no significant accuracy advantages, but is known to handle many different variations of PDFs. |
While our extractors do most of the heavy lifting, there are some practical considerations to keep in mind when building with Indexify.
Best Practices and Tips
- Choose the Right Extractor: Select the extractor that best fits your specific use case and document types.
- Preprocess Your PDFs: Ensure your PDFs are of good quality. For scanned documents, consider using OCR-specific extractors.
- Handle Errors Gracefully: Some PDFs may fail to process. Implement error handling to manage these cases.
- Monitor Performance: Keep an eye on processing times and accuracy, especially when dealing with large volumes of documents.
- Post-process Extracted Content: Clean and format extracted text, validate table structures, and optimize images as needed.
- Implement Caching: For frequently accessed documents, consider caching extracted content to improve response times.
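As a sketch of the "Handle Errors Gracefully" point, a small retry helper with exponential backoff can wrap any flaky call; `process` below is a hypothetical stand-in for an extraction call, not part of the Indexify API:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.01):
    # Retry fn() with exponential backoff; re-raise after the last attempt.
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))
    return wrapper

# Stand-in for a flaky extraction call: fails twice, then succeeds.
calls = {"n": 0}
def process(pdf_path):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"extracted:{pdf_path}"

result = with_retries(process)("reference_document.pdf")
print(result, calls["n"])  # extracted:reference_document.pdf 3
```

In production you'd likely retry only on transient errors (timeouts, connection resets) and log permanently failing documents for manual review.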
Remember, the key to successful PDF extraction is choosing the right tools for your specific needs and continuously refining your process. Experiment with different extractors and configurations to find the optimal setup for your use case.