Building a Wikipedia Information Retrieval System

In this guide, we’ll walk you through creating an online ingestion pipeline for Wikipedia pages. This pipeline will demonstrate how to:

  1. Extract structured information (Named Entity Recognition) from web pages using LLMs
  2. Break down (chunk) the text, create embeddings, and store them in a vector database (LanceDB in this example)

By the end of this tutorial, you’ll be able to:

  1. Use Mistral or OpenAI GPT-4 via API calls and Llama 3.1 or Phi-3 locally to answer questions based on indexed information (Retrieval-Augmented Generation or RAG)
  2. Retrieve the Named Entities extracted from the text
  3. Use a User Interface to visually debug your pipelines and inspect how pages are broken down into chunks

Supported OS:
macOS (M series).
Linux: Ubuntu 20.04 and above, Red Hat 9 and above.
Windows: use WSL (not actively tested).

Python versions: 3.9-3.11

The code for this guide can be found here.

Prerequisites

Before we begin, make sure you have:

  • Basic knowledge of Python programming
  • Familiarity with command-line interfaces
  • An OpenAI or Mistral API key (only if you plan to use them)

Setup

You’ll need three terminal windows open for this tutorial: one for downloading and running the Indexify Server, another for running Indexify extractors to handle structured extraction, chunking, and embedding, and a third for running Python scripts to load and query data from the Indexify server.

We’ll use the following notation to indicate which terminal to use:

Terminal X - Description of command
command goes here

Understanding Indexify Components

Here are the components you will work with while going through the example:

  1. Indexify Server: The central coordinator and data ingestion API.
  2. Extractors: Specialized workers designed to perform specific data processing tasks (e.g., embedding data, generating summaries, or extracting features from unstructured data).
  3. Extraction Graph: A declarative YAML file that chains together extractors into a complex pipeline.

The directory structure of our project will look like this:

Directory Structure
indexify-tutorial
│
├── venv                   # Virtual environment (created by python3 -m venv venv)
│
├── graph.yaml             # Extraction graph definition
├── setup.py               # Script to create the extraction graph
├── ingest.py              # Script to ingest Wikipedia data
├── query.py               # Script to query the indexed data
│
└── indexify               # Indexify server executable (downloaded by curl command)

Step 1: Setting Up the Indexify Server

Let’s start by downloading and running the Indexify server:

Terminal 1 - Download Indexify Server
curl https://getindexify.ai | sh
./indexify server -d

Running the server exposes two important endpoints:

  1. Ingestion API: http://localhost:8900
  2. User Interface: http://localhost:8900/ui

The Ingestion API is used for uploading content and retrieving data from indexes and SQL tables, while the User Interface provides a dashboard for visualizing extraction graphs, content, and indexes.

Step 2: Creating a Virtual Environment

It’s good practice to use a virtual environment for Python projects. Let’s create one and install the necessary packages:

Terminal 2 - Install Dependencies
python3 -m venv venv
source venv/bin/activate
pip3 install indexify-extractor-sdk indexify wikipedia openai llama-cpp-python mistralai langchain_community

Step 3: Setting Up Indexify Extractors

The next step is to set up the extractors, which are essential for structured extraction from unstructured data across different modalities. In this example, we use different extractors to extract named entities from the text, chunk it, and embed the chunks.

Extractors consume Content, which consists of raw bytes of unstructured data, and then produce a list of processed Content along with extracted features.

If you want to learn how to build a custom extractor for your own use case, go through the following section. However, if you just want to use the built-in extractors, jump to the next section.

Using Custom Extractors

Custom Extractors are written by implementing a Python class that extends the Extractor abstract class from the indexify-extractor-sdk package.

Here is an example Extractor, which does Named Entity Recognition on text data.

import json
from typing import List

from indexify_extractor_sdk import Content, Extractor


class NerExtract(Extractor):
    name = 'yourorg/nerextractor'

    def extract(self, content: Content) -> List[Content]:
        """
        Extracts named entities from content.
        """
        output = []
        # chunk_content and run_ner_model are illustrative helpers, not part of the SDK
        chunks = chunk_content(content)
        for chunk in chunks:
            entities = run_ner_model(chunk)
            # Wrap the extracted entities in a new Content object with JSON bytes
            metadata_chunk = Content(
                content_type='application/json',
                data=json.dumps(entities).encode('utf-8'),
            )
            output.append(metadata_chunk)
        return output

An extractor receives data in a Content object and transforms it into one or more Content objects, optionally adding Embedding objects during extraction. For example, you could split a PDF into multiple content pieces, each with its text and corresponding embedding or named entities detected in the text.
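For illustration, here is a rough sketch of an extractor that emits text chunks with embeddings attached. The Content.from_text and Feature.embedding helpers are assumed from the indexify-extractor-sdk (verify the exact signatures against the SDK docs), and split_into_chunks and embed_text are hypothetical helpers standing in for your chunking logic and embedding model.

from typing import List

from indexify_extractor_sdk import Content, Extractor, Feature


class ChunkAndEmbed(Extractor):
    name = 'yourorg/chunk-and-embed'

    def extract(self, content: Content) -> List[Content]:
        """Split incoming text into chunks and attach an embedding to each one."""
        text = content.data.decode('utf-8')
        output = []
        for chunk in split_into_chunks(text):      # hypothetical chunking helper
            vector = embed_text(chunk)             # hypothetical embedding model call
            output.append(
                Content.from_text(                 # assumed SDK helper; check the SDK docs
                    chunk,
                    features=[Feature.embedding(values=vector)],
                )
            )
        return output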

Indexify provides tools to test extractors locally, package and deploy them to production. For detailed instructions, please see this page.

Using Available Extractors

As mentioned before, for the purpose of this tutorial, we already have Extractors written, deployed and tested.

Now, let’s download the extractors:
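The commands below are a sketch assuming the indexify-extractor CLI's download subcommand; pick the LLM extractor (tensorlake/openai, tensorlake/mistral, or tensorlake/llama_cpp) that matches the entity-extractor policy you define in Step 4.

Terminal 2 - Download Extractors
indexify-extractor download tensorlake/openai
indexify-extractor download tensorlake/chunk-extractor
indexify-extractor download tensorlake/minilm-l6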

Start all the extractors:

Terminal 2 - Starting Extractor Workers
indexify-extractor join-server

Step 4: Defining Our Data Pipeline

We’ll define our data pipeline using a YAML file to process text documents by splitting them into chunks, extracting entities, and embedding the chunks in parallel. The following diagram outlines the Indexify end-to-end pipeline.

Extraction Policy Graph

Let us create (or open) a file named graph.yaml with the following content:
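A minimal sketch of the file is shown below, assuming the OpenAI extractor handles entity extraction; the system prompt, chunk size, and overlap values are illustrative and should be tuned to your use case.

graph.yaml
name: "wiki_extraction_pipeline"
extraction_policies:
  # 1. Extract named entities from the ingested text with an LLM
  - extractor: "tensorlake/openai"
    name: "entity-extractor"
    input_params:
      system_prompt: "Extract the named entities from the text and return them as JSON."
  # 2. Split the text into overlapping chunks
  - extractor: "tensorlake/chunk-extractor"
    name: "chunker"
    input_params:
      chunk_size: 1000
      overlap: 100
  # 3. Embed each chunk for semantic search; consumes the chunker's output
  - extractor: "tensorlake/minilm-l6"
    name: "wikiembedding"
    content_source: "chunker"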

This YAML file defines three extraction policies:

  1. entity-extractor: Extracts named entities from the text
  2. chunker: Splits the text into smaller chunks
  3. wikiembedding: Creates embeddings for the chunks

These three extractors are utilized to process and analyze the input content from Wikipedia:

  1. LLM Extractors: We use one of the following LLM extractors to extract named entities from the text. Each of them can be customized with a specific system prompt to tailor its output.

    • Llama CPP Extractor (tensorlake/llama_cpp): Uses the Phi-3-mini-4k-instruct model.
    • OpenAI Extractor (tensorlake/openai): Uses OpenAI’s GPT-4, GPT-4o, etc.
    • Mistral Extractor (tensorlake/mistral): Uses Mistral’s LLMs.
  2. Chunk Extractor (tensorlake/chunk-extractor): Breaks down text into chunks, offering flexibility in chunk size and overlap.

  3. MiniLM-L6 Extractor (tensorlake/minilm-l6): Generates embeddings of the text chunks for semantic search and retrieval.

These extractors work in concert to transform raw, unstructured input into processed, indexed, and easily retrievable information, forming the backbone of the Indexify pipeline for tasks such as entity recognition, text segmentation, and semantic embedding. You can learn more about different types of available extractors and their usage here.

Step 5: Creating the Extraction Graph

Now, let’s create a Python script to set up our extraction graph using the YAML file we just created.

Create a file named setup.py with the following content:

setup.py
from indexify import IndexifyClient, ExtractionGraph

client = IndexifyClient()

def create_extraction_graph():
    extraction_graph = ExtractionGraph.from_yaml_file("graph.yaml")
    client.create_extraction_graph(extraction_graph)

if __name__ == "__main__":
    create_extraction_graph()

Run this script to create the extraction graph:

Terminal 3 - Create Extraction Graph
source venv/bin/activate
python3 ./setup.py

Step 6: Loading Data

Now that we have our extraction graph set up, let’s create (or open) a script to load the Wikipedia data into our pipeline.

Create a file named ingest.py with the following content:

ingest.py
from indexify import IndexifyClient, ExtractionGraph
from langchain_community.document_loaders import WikipediaLoader

client = IndexifyClient()

def load_data(player):
    docs = WikipediaLoader(query=player, load_max_docs=1).load()

    for doc in docs:
        client.add_documents("wiki_extraction_pipeline", doc.page_content)

if __name__ == "__main__":
    load_data("Kevin Durant")
    load_data("Stephen Curry")

Run this script to ingest data into Indexify:

Terminal 3 - Ingest Data
source venv/bin/activate
python3 ./ingest.py

Step 7: Querying Indexify

Now that we have data in our system, let’s create a script to query Indexify and retrieve information.

You can query Indexify to:

  1. List ingested content by extraction graph. You can also list content per extraction policy.
  2. Get extracted data from any of the extraction policies of an Extraction Graph.
  3. Perform semantic search on vector indexes populated by embedding extractors.
  4. Run SQL queries on structured data (not covered in this tutorial).

Let us create a file named query.py with the following content:
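A minimal sketch of the script is shown below. It assumes GPT-4o mini is used for answer generation, and that the vector index created by the wikiembedding policy is named wiki_extraction_pipeline.wikiembedding.embedding; verify the exact index name in the Indexify UI and the search_index call against the indexify Python client documentation.

query.py
from indexify import IndexifyClient
from openai import OpenAI

client = IndexifyClient()
client_openai = OpenAI()  # reads the OPENAI_API_KEY environment variable

def query_database(question: str, index: str, top_k: int = 3) -> str:
    # Semantic search over the vector index populated by the embedding extractor
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = "\n-".join(result["text"] for result in results)
    # Ask the LLM to answer using only the retrieved context
    response = client_openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"Answer the question based on the context.\n"
                           f"Question: {question}\nContext: {context}",
            }
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    index_name = "wiki_extraction_pipeline.wikiembedding.embedding"  # assumed index name
    question = "What accomplishments did Kevin Durant achieve during his career?"
    print(query_database(question, index_name))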

Run this script to query the indexed data:
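Terminal 3 - Run Query
source venv/bin/activate
python3 ./query.py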

You should see a response summarizing Kevin Durant’s career accomplishments based on the indexed Wikipedia data.

Output
Kevin Durant has achieved numerous accomplishments during his career, including:

1. NBA Championships: Won two NBA titles with the Golden State Warriors in 2017 and 2018.
2. NBA Most Valuable Player (MVP): Won the regular season MVP award in 2014.
3. NBA Finals MVP: Named Finals MVP in both 2017 and 2018.
4. All-Star selections: Selected to the NBA All-Star team multiple times (exact number may vary depending on the most recent data).
5. All-NBA Team selections: Named to multiple All-NBA Teams, including First Team selections.
6. Scoring titles: Led the league in scoring multiple times.
7. Olympic gold medals: Won gold medals with Team USA in basketball.
8. Rookie of the Year: Won the NBA Rookie of the Year award in 2008.

These accomplishments highlight Durant's status as one of the premier players in the NBA, showcasing his scoring ability, leadership, and overall impact on the game.

Conclusion

Congratulations! You’ve successfully set up an Indexify pipeline for ingesting, processing, and querying Wikipedia data. This guide has walked you through:

  1. Setting up the Indexify server and extractors
  2. Defining an extraction graph for processing Wikipedia pages
  3. Ingesting data into the system
  4. Querying the processed data using semantic search and Phi-3 Mini / GPT-4o Mini / Mistral Large

Indexify’s fault-tolerant design ensures reliability and scalability, making it suitable for mission-critical applications. You can now explore more advanced topics and integrations to further enhance your information retrieval and processing capabilities.

Next Steps

Intermediate Tutorial

Build a tax document processing and Q&A system. The tutorial covers text extraction from PDFs and building a RAG pipeline.

Key Concepts of Indexify

Learn about the core concepts of Indexify, including extractors, transformation, and structured data extraction.