Indexify High Level

Indexify is an open-source data framework for building ingestion and extraction pipelines over unstructured data. The framework uses a declarative approach to perform structured extraction with any AI model or to transform ingested data.

Indexify pipelines operate in real time, processing data immediately after ingestion, which makes them ideal for interactive applications and low-latency scenarios.

Use Cases: pipelines for entity extraction and embedding from documents, audio transcription, summarization, and object detection from images.

RAG: Embeddings and structured data extracted by Indexify from enterprise data can be consumed by RAG applications.

Multi-Stage Data Ingestion and Extraction Workflows

Extraction Graphs are at the center of Indexify. They are pipelines that process data and write the output to databases for retrieval. Extraction policies within a graph are chained using the content_source attribute. Some unique characteristics of Indexify workflows are -

AI Native: Use models from OpenAI, HuggingFace and Ollama in the pipeline.

Extensible: by plugging in Python modules in the pipeline.

Real Time and Scalable: Handles real-time processing of data at scale powered by a low latency and distributed scheduler.

APIs: Pipelines are exposed as HTTP APIs, making them accessible from applications written in TypeScript, Java, Python, Go, etc.

graph.yaml
name: 'pdf-ingestion-pipeline'
extraction_policies:
- extractor: 'tensorlake/marker'
  name: 'pdf_to_markdown'
- extractor: 'tensorlake/ner'
  name: 'entity_extractor'
  content_source: 'pdf_to_markdown'
- extractor: 'tensorlake/minilm-l6'
  name: 'embedding'
  content_source: 'pdf_to_markdown'
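
Since pipelines are managed through HTTP APIs, the graph above can also be registered programmatically. The sketch below uses the Indexify Python client; the class and method names (IndexifyClient, ExtractionGraph.from_yaml, create_extraction_graph) are assumptions about the Python SDK and may differ between versions.

create_graph.py
from indexify import IndexifyClient, ExtractionGraph

# Connect to a locally running Indexify server (default endpoint assumed).
client = IndexifyClient()

# The same spec as graph.yaml above, inlined for the sketch.
extraction_graph_spec = """
name: 'pdf-ingestion-pipeline'
extraction_policies:
  - extractor: 'tensorlake/marker'
    name: 'pdf_to_markdown'
  - extractor: 'tensorlake/ner'
    name: 'entity_extractor'
    content_source: 'pdf_to_markdown'
  - extractor: 'tensorlake/minilm-l6'
    name: 'embedding'
    content_source: 'pdf_to_markdown'
"""

# Register the extraction graph with the server.
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)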

Continuous Data Ingestion
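
A minimal sketch of ingesting a document into the graph above; upload_file(graph_name, path) is an assumed signature from the Python client, and the file path is a placeholder.

ingest.py
from indexify import IndexifyClient

client = IndexifyClient()

# Upload a document into the 'pdf-ingestion-pipeline' graph. The extraction
# policies run on the content as soon as it is ingested.
# upload_file(graph_name, path) is assumed; the path is a placeholder.
content_id = client.upload_file("pdf-ingestion-pipeline", "documents/sample.pdf")
print(content_id)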

Retrieve Extracted Data

Retrieve extracted named entities
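
A sketch of reading back the output of the entity_extractor policy; the get_extracted_content call and its argument order are assumptions about the Python client and may differ by version.

retrieve_entities.py
from indexify import IndexifyClient

client = IndexifyClient()

# The id returned by upload_file for the ingested document (placeholder here).
content_id = "ingested-content-id"

# Fetch the content produced by the 'entity_extractor' policy of the graph.
entities = client.get_extracted_content(
    content_id,
    "pdf-ingestion-pipeline",
    "entity_extractor",
)
print(entities)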

Search vector indexes populated by embeddings
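
A sketch of semantic search over the vector index populated by the embedding policy. The index naming convention ('<graph>.<policy>.embedding') and the search_index signature are assumptions about the Python client.

search.py
from indexify import IndexifyClient

client = IndexifyClient()

# Query the vector index written by the 'embedding' policy; returns the
# top matching chunks. Index name and signature are assumed, not verified.
results = client.search_index(
    name="pdf-ingestion-pipeline.embedding.embedding",
    query="What companies are mentioned in the report?",
    top_k=3,
)
for result in results:
    print(result)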

To understand how to use Indexify for various use cases, we recommend the following starting points.

Basic Tutorial

A Wikipedia ingestion and indexing pipeline. The tutorial teaches the basics of Indexify, introducing all the system components and how to build a simple pipeline.

Intermediate Tutorial

A tax document processing and Q&A system. The tutorial covers text extraction from PDF and building a RAG pipeline.

Multi-Modal RAG

A detailed example of text, table, and image extraction from PDFs. It also covers building image and text indexes and doing cross-modal retrieval, re-ranking, and reciprocal rank fusion.

Why Use Indexify?

Indexify is for continuous ingestion and extraction from unstructured data in a streaming fashion. If you are building LLM applications that need structured data from unstructured data sources, Indexify is the tool for you.

While many tools draw a divide between prototyping and production deployment, Indexify runs locally on laptops for rapid prototyping, and the same pipelines can be deployed as a scaled-up cluster in production for high availability and reliability.

You should use Indexify if you care about -

  • Processing Multi-Modal Data such as PDFs, Videos, Images, and Audio
  • High Availability and Fault Tolerance
  • Local Experience for rapid prototyping
  • Automatically updating indexes whenever upstream data sources are updated
  • Incremental Extraction and Selective Deletion
  • Compatibility with Various LLM Frameworks (Langchain, DSPy, etc.)
  • Integration with any database, blob store, or vector store
  • Owning your data infrastructure and being able to deploy on any cloud
