Indexify High Level

Indexify is an open-source data framework for building ingestion and extraction pipelines over unstructured data. The framework uses a declarative approach to perform structured extraction with any AI model or to transform ingested data.

Indexify pipelines operate in real time, processing data immediately after ingestion, which makes them ideal for interactive applications and low-latency scenarios.

Use Cases: pipelines for entity extraction and embedding from documents, audio transcription, summarization, and object detection from images.

RAG: Embeddings and structured data extracted by Indexify from enterprise data can be consumed by RAG applications.

Multi-Stage Data Ingestion and Extraction Workflows

Extraction Graphs are at the center of Indexify. They are pipelines that process data and write the output to databases for retrieval. Extraction policies within a graph are chained using the content_source attribute. Some unique characteristics of Indexify workflows are -

AI Native: Use models from OpenAI, HuggingFace and Ollama in the pipeline.

Extensible: by plugging in Python modules in the pipeline.

Real Time and Scalable: Handles real-time processing of data at scale powered by a low latency and distributed scheduler.

APIs: Pipelines are exposed as HTTP APIs, making them accessible from applications written in TypeScript, Java, Python, Go, etc.

graph.yaml
name: 'pdf-ingestion-pipeline'
extraction_policies:
- extractor: 'tensorlake/marker'
  name: 'pdf_to_markdown'
- extractor: 'tensorlake/ner'
  name: 'entity_extractor'
  content_source: 'pdf_to_markdown'
- extractor: 'tensorlake/minilm-l6'
  name: 'embedding'
  content_source: 'pdf_to_markdown'
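
Since pipelines are managed through HTTP APIs, the graph above can also be registered programmatically. The sketch below uses the Indexify Python client; the class and method names (IndexifyClient, ExtractionGraph.from_yaml, create_extraction_graph) are assumptions about the Python SDK and may differ between versions.

create_graph.py
from indexify import IndexifyClient, ExtractionGraph

# Connect to a locally running Indexify server (default endpoint assumed).
client = IndexifyClient()

# The same spec as graph.yaml above, inlined for the sketch.
extraction_graph_spec = """
name: 'pdf-ingestion-pipeline'
extraction_policies:
  - extractor: 'tensorlake/marker'
    name: 'pdf_to_markdown'
  - extractor: 'tensorlake/ner'
    name: 'entity_extractor'
    content_source: 'pdf_to_markdown'
  - extractor: 'tensorlake/minilm-l6'
    name: 'embedding'
    content_source: 'pdf_to_markdown'
"""

# Register the extraction graph with the server.
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)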

Continuous Data Ingestion
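
A minimal sketch of ingesting a document into the graph above; upload_file(graph_name, path) is an assumed signature from the Python client, and the file path is a placeholder.

ingest.py
from indexify import IndexifyClient

client = IndexifyClient()

# Upload a document into the 'pdf-ingestion-pipeline' graph. The extraction
# policies run on the content as soon as it is ingested.
# upload_file(graph_name, path) is assumed; the path is a placeholder.
content_id = client.upload_file("pdf-ingestion-pipeline", "documents/sample.pdf")
print(content_id)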

Retrieve Extracted Data

Retrieve extracted named entities
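
A sketch of reading back the output of the entity_extractor policy; the get_extracted_content call and its argument order are assumptions about the Python client and may differ by version.

retrieve_entities.py
from indexify import IndexifyClient

client = IndexifyClient()

# The id returned by upload_file for the ingested document (placeholder here).
content_id = "ingested-content-id"

# Fetch the content produced by the 'entity_extractor' policy of the graph.
entities = client.get_extracted_content(
    content_id,
    "pdf-ingestion-pipeline",
    "entity_extractor",
)
print(entities)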

Search vector indexes populated by embeddings
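
A sketch of semantic search over the vector index populated by the embedding policy. The index naming convention ('<graph>.<policy>.embedding') and the search_index signature are assumptions about the Python client.

search.py
from indexify import IndexifyClient

client = IndexifyClient()

# Query the vector index written by the 'embedding' policy; returns the
# top matching chunks. Index name and signature are assumed, not verified.
results = client.search_index(
    name="pdf-ingestion-pipeline.embedding.embedding",
    query="What companies are mentioned in the report?",
    top_k=3,
)
for result in results:
    print(result)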

To understand how to use Indexify for various use cases, we recommend the following starting points.

Basic Tutorial

A Wikipedia ingestion and indexing pipeline. The tutorial teaches the basics of Indexify, introducing all the system components and how to build a simple pipeline.

Intermediate Tutorial

A tax document processing and Q&A system. The tutorial covers text extraction from PDF and building a RAG pipeline.

Multi-Modal RAG

A detailed example of text, table, and image extraction from PDFs. It also covers building image and text indexes and doing cross-modal retrieval, re-ranking, and reciprocal rank fusion.

Why Use Indexify?

Indexify is for continuous ingestion and extraction from unstructured data in a streaming fashion. If you are building LLM applications that need structured data from unstructured data sources, Indexify is the tool for you.

While many tools draw a divide between prototyping and production deployment, Indexify runs locally on laptops for rapid prototyping, and the same pipelines can be deployed as a scaled-up cluster in production for high availability and reliability.

You should use Indexify if you care about -

  • Processing Multi-Modal Data such as PDFs, Videos, Images, and Audio
  • High Availability and Fault Tolerance
  • Local Experience for rapid prototyping
  • Automatically updating indexes whenever upstream data sources are updated
  • Incremental Extraction and Selective Deletion
  • Compatibility with Various LLM Frameworks (Langchain, DSPy, etc.)
  • Integration with any database, blob store, or vector store
  • Owning your data infrastructure and being able to deploy on any cloud
