
PDF Image Extraction with Indexify

This project demonstrates how to extract images from PDF documents using Indexify. It includes two main components: setting up an extraction graph for image extraction and a script to process PDFs and retrieve the extracted images.

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Setup
  4. File Descriptions
  5. Usage
  6. Customization
  7. Conclusion


Introduction

This project showcases the use of Indexify to create a pipeline for extracting images from PDF documents. It consists of two main parts:

  • An extraction graph that defines the process of converting PDFs to images.
  • A script that downloads a PDF, uploads it to Indexify, and retrieves the extracted images.


Prerequisites

Before we begin, ensure you have the following:

  • A virtual environment with Python 3.9 or later:
    python3.9 -m venv ve
    source ve/bin/activate
  • Indexify installed and running
  • Required Python packages: indexify, requests
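The prerequisites above can be checked from inside the activated virtual environment. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `missing_packages` are hypothetical and not part of this project.

```python
import sys
from importlib.util import find_spec

# Package names and minimum version taken from the prerequisites list.
REQUIRED_PACKAGES = ["indexify", "requests"]
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_packages()` returns an empty list once both packages are installed. This sketch does not check whether the Indexify server itself is running.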


Setup

  1. Install Indexify:

    curl | sh
    ./indexify server -d

  2. Install the required extractor:

    pip install indexify-extractor-sdk
    indexify-extractor download tensorlake/pdfextractor
    indexify-extractor join-server

File Descriptions

  1. The first script sets up the extraction graph for converting PDFs to images.

  2. The second script downloads a PDF, uploads it to Indexify, and retrieves the extracted images.
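As an illustration of what the graph setup script might contain, here is a minimal sketch. It assumes the indexify Python client exposes `IndexifyClient`, `ExtractionGraph.from_yaml`, and `create_extraction_graph`. The graph name `image_extractor`, the policy name `pdf_to_image`, and the `output_types` parameter are assumptions chosen for illustration; only the extractor name `tensorlake/pdfextractor` comes from the setup steps.

```python
# Assumed YAML layout for the graph spec. The graph and policy names are
# hypothetical, and `output_types` is an assumed extractor parameter.
extraction_graph_spec = """
name: 'image_extractor'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: 'pdf_to_image'
    input_params:
      output_types: ["image"]
"""


def create_graph(spec=extraction_graph_spec):
    """Register the graph with a running Indexify server."""
    # Imported lazily so the spec can be inspected without the client installed.
    from indexify import ExtractionGraph, IndexifyClient

    client = IndexifyClient()  # assumed to target the local server by default
    graph = ExtractionGraph.from_yaml(spec)
    client.create_extraction_graph(graph)
    return graph
```

Calling `create_graph()` once, with the server and extractor running, should be enough; the graph then applies to every PDF uploaded to it.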


Usage

  1. First, run the graph setup script to set up the extraction graph:


  2. Then, run the script to process a PDF and extract images:


This script will:

  • Download a sample PDF from arXiv
  • Upload the PDF to Indexify
  • Extract images from the PDF
  • Print the extracted image content
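The steps above could be sketched roughly as follows. The client is passed in as an argument so the flow can be read, and tried out, without a running server. The method names `upload_file`, `wait_for_extraction`, and `get_extracted_content`, and the `id` and `content` keys on each result, are assumptions about the indexify client rather than confirmed details of this project's script.

```python
from pathlib import Path


def download_pdf(url, save_path):
    """Download `url` to `save_path` using requests."""
    import requests  # imported lazily; listed in the prerequisites

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    Path(save_path).write_bytes(response.content)
    return save_path


def extract_images(client, pdf_path, graph_name, policy_name):
    """Upload a PDF and return the content extracted by one policy.

    `client` is expected to behave like indexify.IndexifyClient; the three
    method names used here are assumptions about that client.
    """
    content_id = client.upload_file(graph_name, pdf_path)
    client.wait_for_extraction(content_id)
    return client.get_extracted_content(
        content_id=content_id,
        graph_name=graph_name,
        policy_name=policy_name,
    )


def summarize_images(images):
    """Return one line per extracted item (assumes 'id' and 'content' keys)."""
    return [f"{item['id']}: {len(item['content'])} bytes" for item in images]
```

In a real run, `client` would be an `IndexifyClient()` instance and the graph and policy names would match whatever the graph setup script registered. Printing each line from `summarize_images` gives a compact view of the extracted images.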


Customization

You can customize the image extraction process by modifying the extraction_graph_spec in the graph setup script. For example, you could add additional extraction steps or change the output format.

In the processing script, you can modify the pdf_url variable to process different PDF documents.
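When pdf_url changes, the local file name usually needs to change with it. One possible way to derive it from the URL is sketched below; the helper `local_pdf_name` and the example URL are hypothetical and not part of this project.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_pdf_name(pdf_url, default="document.pdf"):
    """Derive a local file name from the last path segment of `pdf_url`."""
    name = PurePosixPath(urlparse(pdf_url).path).name
    if not name:
        # The URL has no usable path segment, so fall back to a fixed name.
        return default
    # Some PDF links omit the extension, so add it when it is missing.
    return name if name.lower().endswith(".pdf") else name + ".pdf"


pdf_url = "https://example.com/papers/sample.pdf"  # placeholder URL
pdf_path = local_pdf_name(pdf_url)
```

With this in place, switching to another document only requires editing pdf_url.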


Conclusion

This project demonstrates how Indexify can be used to extract images from PDF documents. While this example focuses on image extraction, Indexify can be used for various other document processing tasks, and the same pattern of an extraction graph plus a client script applies when handling large volumes of documents.

For more information on Indexify and its capabilities, visit the official documentation.