PDF Image Extraction with Indexify

This project demonstrates how to extract images from PDF documents using Indexify. It includes two main components: setting up an extraction graph for image extraction and a script to process PDFs and retrieve the extracted images.

Introduction

This project showcases the use of Indexify to create a pipeline for extracting images from PDF documents. It consists of two main parts:

- An extraction graph that defines the process of converting PDFs to images.
- A script that downloads a PDF, uploads it to Indexify, and retrieves the extracted images.

Prerequisites

Before we begin, ensure you have the following:

- A virtual environment with Python 3.9 or later:

python3.9 -m venv ve
source ve/bin/activate

- Indexify installed and running (see Setup)
- The required Python packages, indexify and requests (install with pip install indexify requests)
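
As a quick sanity check on the prerequisites above, the following sketch reports whether the interpreter is new enough and whether the two packages can be imported. The helper names `python_ok` and `missing_packages` are hypothetical and not part of this project.

```python
import importlib.util
import sys

# Packages this README lists as required.
REQUIRED_PACKAGES = ("indexify", "requests")


def python_ok(minimum=(3, 9)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example:
#   print("Python OK:", python_ok())
#   print("Missing packages:", missing_packages())
```

Run it inside the activated virtual environment, since package lookups depend on the interpreter in use.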

Setup

Download the Indexify binary and start the server:

curl https://getindexify.ai | sh
./indexify server -d

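Install the extractor SDK, download the PDF extractor, and connect it to the server. Run these commands in a second terminal, inside the virtual environment, since the server keeps running in the first: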

pip install indexify-extractor-sdk
indexify-extractor download tensorlake/pdfextractor
indexify-extractor join-server

File Descriptions

- setup.py: Sets up the extraction graph for converting PDFs to images.
- upload_and_retrieve.py: Downloads a PDF, uploads it to Indexify, and retrieves the extracted images.

Usage

First, run the setup.py script to set up the extraction graph:
```
python setup.py
```
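
For orientation, here is a minimal sketch of what setup.py might contain. The graph name `image_extractor`, the policy name `pdf_to_image`, and the `output_types` input parameter are assumptions that this README does not state. The `IndexifyClient` and `ExtractionGraph.from_yaml` calls follow the Indexify Python client as I understand it and may differ in your installed version.

```python
# Sketch of setup.py. Names marked "assumed" are illustrative only.

GRAPH_NAME = "image_extractor"  # assumed graph name
POLICY_NAME = "pdf_to_image"    # assumed policy name

# YAML spec with a single policy: run the PDF extractor and request image output.
extraction_graph_spec = f"""
name: '{GRAPH_NAME}'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: '{POLICY_NAME}'
    input_params:
      output_types: ["image"]
"""


def create_graph(spec: str = extraction_graph_spec) -> None:
    """Register the extraction graph with a running Indexify server."""
    # Imported lazily so the spec can be inspected without indexify installed.
    from indexify import ExtractionGraph, IndexifyClient

    client = IndexifyClient()  # assumes the server is at its default local address
    graph = ExtractionGraph.from_yaml(spec)
    client.create_extraction_graph(graph)


# To register the graph against a running server:
#   create_graph()
```

The graph only needs to be created once. Later uploads refer to it by name.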
Then, run the upload_and_retrieve.py script to process a PDF and extract images:
```
python upload_and_retrieve.py
```

This script will:

- Download a sample PDF from arXiv
- Upload the PDF to Indexify
- Extract images from the PDF
- Print the extracted image content
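
The steps above can be sketched roughly as follows. This is not the project's actual script. The graph and policy names are assumptions, the `describe_images` helper is hypothetical, and the sketch assumes each extracted item is a dict with a `content` field holding bytes. The client methods (`upload_file`, `wait_for_extraction`, `get_extracted_content`) reflect the Indexify Python client as I understand it and may vary between versions.

```python
from pathlib import Path

GRAPH_NAME = "image_extractor"  # assumed; must match the graph created earlier
POLICY_NAME = "pdf_to_image"    # assumed policy name


def download_pdf(url: str, save_path: str) -> None:
    """Download the PDF at `url` and write it to `save_path`."""
    import requests  # imported lazily; listed under Prerequisites

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    Path(save_path).write_bytes(response.content)


def get_images(pdf_path: str) -> list:
    """Upload the PDF, wait for extraction to finish, and return the extracted items."""
    from indexify import IndexifyClient

    client = IndexifyClient()
    content_id = client.upload_file(GRAPH_NAME, pdf_path)
    client.wait_for_extraction(content_id)
    return client.get_extracted_content(
        content_id=content_id,
        graph_name=GRAPH_NAME,
        policy_name=POLICY_NAME,
    )


def describe_images(images: list) -> list:
    """Return one summary line per item (assumes a bytes 'content' field)."""
    return [
        f"Image {i}: {len(image['content'])} bytes"
        for i, image in enumerate(images, start=1)
    ]


# Example flow (needs a running server and network access), where pdf_url
# is the address of the PDF you want to process:
#   download_pdf(pdf_url, "sample.pdf")
#   for line in describe_images(get_images("sample.pdf")):
#       print(line)
```

Printing a byte count rather than the raw bytes keeps the terminal output readable. Write `image["content"]` to a file if you want to view the images.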

Customization

You can customize the image extraction process by modifying the extraction_graph_spec in setup.py. For example, you could add additional extraction steps or change the output format.
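
One way to experiment with such changes is to generate the spec from parameters, as in the hedged sketch below. `build_graph_spec` is a hypothetical helper. The `output_types` parameter and the value `"text"` are assumptions about what the PDF extractor accepts, so check the extractor's documentation before relying on them.

```python
import json


def build_graph_spec(
    graph_name: str = "image_extractor",  # assumed graph name
    policy_name: str = "pdf_to_image",    # assumed policy name
    output_types=("image",),              # assumed extractor parameter
) -> str:
    """Render an extraction graph spec as YAML text."""
    # A JSON list is also a valid YAML flow sequence, e.g. ["image", "text"].
    types = json.dumps(list(output_types))
    return (
        f"name: '{graph_name}'\n"
        "extraction_policies:\n"
        "  - extractor: 'tensorlake/pdfextractor'\n"
        f"    name: '{policy_name}'\n"
        "    input_params:\n"
        f"      output_types: {types}\n"
    )


# Example: request both images and text (assuming the extractor supports it).
#   print(build_graph_spec(output_types=("image", "text")))
```

If you change the graph or policy name, use the same names when uploading files and retrieving results.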

In upload_and_retrieve.py, you can modify the pdf_url variable to process different PDF documents.

Conclusion

This project shows how to extract images from PDF documents with Indexify. Although this example focuses on image extraction, the same extraction graph approach can be applied to other document processing tasks, including pipelines that handle large volumes of documents.

For more information on Indexify and its capabilities, visit the official documentation.