Intermediate
Parsing Complex PDF Documents
What if you could go through a 26-page complex tax document and understand it without a lawyer, an accountant, or the geek from college? What if six easy-to-follow stages were all you needed to do reliable Q&A on a complex, layered tax document? Indexify enables you to do just that.
In this example, we will make an LLM (Large Language Model) answer how much someone would pay in taxes in California, based on their income. We will ingest and extract information from a PDF containing California tax laws, and the LLM will refer to the extracted data to synthesize its responses.
Supported OS:
- macOS (M series)
- Linux: Ubuntu 20.04 and above, Red Hat 9 and above
- Windows: use WSL, but we don't actively test it
Prerequisites
Before we begin, ensure you have the following:
- Python 3.11 or older installed (Indexify currently requires this version)
- An OpenAI API key (for using GPT models)
- Command-line interface experience
Setup
You’ll need three separate terminal windows open for this tutorial:
- Terminal 1: For downloading and running the Indexify Server
- Terminal 2: For running Indexify extractors (handling structured extraction, chunking, and embedding)
- Terminal 3: For running Python scripts to load and query data from the Indexify server
We’ll use the following notation to indicate which terminal to use:
<command goes here>
Understanding Indexify Components
Here are the components you will touch while working through this example:
- Indexify Server: The central coordinator and data ingestion API.
- Extractors: Specialized workers designed to perform specific data processing tasks (e.g., embedding data, generating summaries, or extracting features from unstructured data).
- Extraction Graph: A declarative YAML file that chains together extractors into a complex pipeline.
The directory structure of our project will look like this:
indexify-tax-calculator/
│
├── venv/ # Virtual environment (created by python3 -m venv venv)
│
├── setup_extraction_graph.py # Script to set up the extraction graph
├── ingest_document.py # Script to download and ingest the PDF
├── query_tax_info.py # Script for question-answering functionality
│
├── taxes.pdf # Downloaded California tax law PDF
│
└── indexify # Indexify server executable (downloaded by curl command)
Stage 1: Setting Up the Indexify Server
Download the Indexify server and run it:
curl https://getindexify.ai | sh
./indexify server -d
Once running, the server provides two key endpoints:
- http://localhost:8900 - the main API endpoint for data ingestion and retrieval
- http://localhost:8900/ui - a user interface for monitoring and debugging your Indexify pipelines
Stage 2: Downloading and Setting Up Extractors
Extractors are specialized components in Indexify that process and transform data. For our tax law application, we’ll need three specific extractors:
- Marker Extractor: Converts PDF documents to Markdown format
- Chunk Extractor: Splits text into manageable chunks
- MiniLM-L6 Extractor: Generates embeddings for text chunks
The source code for this tutorial can be found in our examples folder.
Before we begin, let's set up a virtual environment for the Python project and download the extractors. At present, Indexify requires Python 3.11 or older.
python3 -m venv venv
source venv/bin/activate
pip3 install indexify-extractor-sdk indexify wikipedia openai
indexify-extractor download tensorlake/marker
indexify-extractor download tensorlake/minilm-l6
indexify-extractor download tensorlake/chunk-extractor
Start the Extractors
Run the Indexify Extractor server in a separate terminal:
indexify-extractor join-server
You can verify that the three extractors are available by going to the dashboard (http://localhost:8900/ui) and looking at the extractors section.
Stage 3: Installing Required Libraries
Don't forget to install the necessary dependencies before running the rest of this tutorial. If you ran the pip3 install command in Stage 2, you're already set.
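If you skipped that step, the client libraries the remaining scripts rely on can be installed now (this repeats part of the Stage 2 command):
pip3 install indexify-extractor-sdk indexify openai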
Stage 4: Setting Up the Extraction Graph
Set up an extraction graph to process the PDF documents. The extraction graph defines the sequence of operations that will be performed on our input data (the tax law PDF). Let’s set it up:
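Below is a sketch of setup_extraction_graph.py. It assumes the Indexify Python SDK's ExtractionGraph.from_yaml and create_extraction_graph calls; the policy names (mdextract, chunker, pdfembedding) are illustrative choices, not fixed by Indexify.

```python
# setup_extraction_graph.py - a sketch; policy names are illustrative.
from indexify import ExtractionGraph, IndexifyClient

client = IndexifyClient()

# Declarative pipeline: PDF -> Markdown -> 1000-char chunks -> embeddings
extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
  - extractor: 'tensorlake/marker'
    name: 'mdextract'
  - extractor: 'tensorlake/chunk-extractor'
    name: 'chunker'
    input_params:
      chunk_size: 1000
      overlap: 100
    content_source: 'mdextract'
  - extractor: 'tensorlake/minilm-l6'
    name: 'pdfembedding'
    content_source: 'chunker'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
```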
This extraction graph, named ‘pdfqa’, defines a three-step process:
- The tensorlake/marker extractor converts the PDF into Markdown format.
- The tensorlake/chunk-extractor splits the Markdown text into chunks of 1000 characters with a 100-character overlap.
- The tensorlake/minilm-l6 extractor generates embeddings for each chunk, enabling semantic search capabilities.
Run this script to create the extraction graph:
source venv/bin/activate
python3 ./setup_extraction_graph.py
In short, the pipeline runs: PDF → Markdown (marker) → chunks (chunk-extractor) → embeddings (minilm-l6).
Stage 5: Document Ingestion
Add the PDF document to the “pdfqa” extraction graph:
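Here's a sketch of ingest_document.py; the download URL below is an illustrative stand-in for the public California tax law PDF, and the upload call assumes the SDK's upload_file(graph_name, path) signature.

```python
# ingest_document.py - a sketch; the PDF URL below is an assumption.
import urllib.request

from indexify import IndexifyClient

client = IndexifyClient()

# Download the California tax law PDF and save it locally as taxes.pdf
PDF_URL = "https://www.ftb.ca.gov/forms/2023/2023-540-booklet.pdf"  # illustrative URL
urllib.request.urlretrieve(PDF_URL, "taxes.pdf")

# Upload the PDF, associating it with the 'pdfqa' extraction graph
client.upload_file("pdfqa", "taxes.pdf")
```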
This code does the following:
- Downloads the California tax law PDF from a public URL
- Saves the PDF locally as “taxes.pdf”
- Uploads the PDF to Indexify, associating it with our ‘pdfqa’ extraction graph
Run the following script to ingest the tax document:
python3 ./ingest_document.py
Once uploaded, Indexify will automatically process the PDF through our defined extraction graph.
Stage 6: Implementing Question-Answering Functionality
We pair a context-retrieval function with a prompting function to supply the LLM with relevant passages for each question; a sketch of query_tax_info.py follows the list below.
You'll want to have OPENAI_API_KEY exported and set to your API key before running these scripts.
The script does the following:
- Defines a get_context function that retrieves relevant passages from our processed PDF based on the question.
- Creates a create_prompt function that formats the question and context for the LLM.
- Uses OpenAI's GPT-3.5-turbo model to generate an answer based on the provided context.
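Here is a sketch of query_tax_info.py under those assumptions. The index name 'pdfqa.pdfembedding.embedding' is derived from the graph and policy names above and, like the search_index call, is an assumption about the SDK rather than confirmed API.

```python
# query_tax_info.py - a sketch; the index name and search_index signature
# are assumptions derived from the 'pdfqa' graph and its embedding policy.
from indexify import IndexifyClient
from openai import OpenAI

client = IndexifyClient()
client_openai = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_context(question: str, index: str, top_k: int = 3) -> str:
    # Semantic search over the embedded chunks for the most relevant passages
    results = client.search_index(name=index, query=question, top_k=top_k)
    return "\n".join(f"passage: {result['text']}" for result in results)

def create_prompt(question: str, context: str) -> str:
    return f"Answer the question based on the context.\nquestion: {question}\ncontext: {context}"

question = "What are the tax brackets in California, and how much would I owe on an income of $24,000?"
context = get_context(question, "pdfqa.pdfembedding.embedding")

chat_completion = client_openai.chat.completions.create(
    messages=[{"role": "user", "content": create_prompt(question, context)}],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)
```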
The query_tax_info.py script handles the querying of processed tax information. Its design incorporates these patterns:
- Context Retrieval: The script includes a function to retrieve relevant context based on the input question. This separates the concern of finding relevant information from the actual question-answering process.
- Prompt Creation: There's a dedicated function for creating the prompt that will be sent to the language model. This separation allows for easy modification of how the prompt is structured.
- Language Model Integration: The script uses OpenAI's API to generate answers. This is abstracted into a separate call, making it possible to switch to a different language model (like Mistral) if needed.
Run the following script to query the tax document:
python3 ./query_tax_info.py
The example question asks about California tax brackets and calculates taxes for a $24,000 income. The LLM uses the context provided by our Indexify pipeline to formulate an accurate response as shown below.
Based on the provided information, the tax rates and brackets for California are as follows:
- $0 - $11,450: 10% of the amount over $0
- $11,450 - $43,650: $1,145 plus 15% of the amount over $11,450
- $43,650 - $112,650: $5,975 plus 25% of the amount over $43,650
- $112,650 - $182,400: $23,225 plus 28% of the amount over $112,650
- $182,400 - $357,700: $42,755 plus 33% of the amount over $182,400
- $357,700 and above: $100,604 plus 35% of the amount over $357,700
For an income of $24,000, you would calculate the tax owed on the first $11,450 at 10%, the tax owed on the remainder at 15%, and add those together.
Given that $24,000 falls within the $11,450 - $43,650 bracket, you would need to calculate the following:
- Tax on first $11,450: $11,450 x 10% = $1,145
- Tax on next $12,550: ($24,000 - $11,450) x 15% = $1,882.50
Therefore, your total tax liability would be $1,145 + $1,882.50 = $3,027.50.
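As a quick sanity check on the arithmetic in that answer, here is a minimal progressive-tax sketch using the first two bracket figures from the response above:

```python
# Verify the bracket arithmetic: expected total is $3,027.50.
BRACKETS = [(11_450, 0.10), (43_650, 0.15)]  # (upper bound, rate) pairs

def progressive_tax(income: float) -> float:
    tax, lower = 0.0, 0.0
    for upper, rate in BRACKETS:
        taxable = min(income, upper) - lower
        if taxable <= 0:
            break
        tax += taxable * rate
        lower = upper
    return tax

print(progressive_tax(24_000))  # 3027.5 = 1,145 + 1,882.50
```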
Conclusion
This intermediate guide demonstrates how to use Indexify to create a sophisticated question-answering system for California tax laws. By ingesting a PDF, extracting and processing its content, and using an LLM for answer generation, we’ve created a flexible and powerful tool that could be easily expanded to cover more complex scenarios or multiple documents.
The key strengths of this pipeline are:
- Automatic processing of complex documents (PDFs)
- Efficient chunking and embedding of text for quick retrieval
- Use of up-to-date, specific information for answer generation
- Scalability to handle multiple documents or more complex queries