Key Concepts

Overview

Indexify is a data framework for building ingestion and extraction pipelines over unstructured data for AI applications. Its pipelines can process documents, videos, images, and audio files. A typical workflow involves:

  1. Uploading unstructured data (documents, videos, images, audio)
  2. Applying extractors to process the content
  3. Updating vector indexes and structured stores
  4. Retrieving information via semantic search on vector indexes and SQL queries on structured data tables

[Figure: Block diagram]

Core Components

To work with Indexify, it helps to understand its core concepts:

  1. Extractors: Functions that take data from upstream sources and produce three types of output:
     • Transformed data: For example, converting a PDF to plain text.
     • Embeddings: Vector representations of the data, useful for semantic search.
     • Structured data: Extracted metadata or features in a structured format.

  2. Extraction Graphs: Multi-step workflows created by chaining multiple extractors together. They let you define a data processing pipeline that performs several transformations and extractions in a single flow.

  3. Namespaces: Logical abstractions for storing related content. They let you partition data along security requirements or organizational boundaries, which makes large-scale data operations easier to manage.

  4. Content: Raw unstructured data, such as documents, videos, images, or audio. Content objects contain the raw bytes of the data along with metadata like MIME types.

  5. Vector Indexes: Created automatically from extractors that return embeddings. Vector indexes enable semantic search, so you can find similar content based on meaning rather than just keywords.

  6. Structured Data Tables: Metadata extracted from content is exposed via SQL queries, so you can query and analyze the structured information derived from your unstructured data.

1. Extractor

An Extractor is essentially a Python class that can:

a) Transform unstructured data into intermediate forms
b) Extract features like embeddings or metadata (JSON) for LLM applications

Extractors consume Content, which contains the raw bytes of unstructured data, and produce a list of Content and features from it.
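The shape described above can be sketched as a small Python class. This is a minimal illustration, not the Indexify SDK: the `Content` and `Feature` dataclasses, their field names, and the `extract` method are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Feature:
    """Hypothetical feature record (not the Indexify SDK type)."""
    feature_type: str                      # "metadata" or "embedding"
    name: str
    value: Union[dict, List[float]]


@dataclass
class Content:
    """Hypothetical content record: raw bytes plus a MIME type."""
    data: bytes
    content_type: str


class WordCountExtractor:
    """Toy extractor: consumes one Content, returns Content and Features."""

    def extract(self, content: Content) -> List[Union[Content, Feature]]:
        text = content.data.decode("utf-8")
        # Transformation: emit a whitespace-normalized copy of the text.
        normalized = " ".join(text.split())
        cleaned = Content(data=normalized.encode("utf-8"), content_type="text/plain")
        # Feature: emit metadata derived from the same input.
        stats = Feature(
            feature_type="metadata",
            name="stats",
            value={"word_count": len(normalized.split())},
        )
        return [cleaned, stats]
```

Calling `WordCountExtractor().extract(...)` on a `text/plain` Content returns one transformed Content followed by one metadata Feature, which mirrors the "list of Content and features" described above.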

[Figure: Extractor working]

Types of Extraction:

  1. Transformation
     • Converts a source into a list of Content
     • Example: PDF to text, video to audio
     • Extractor(Content) -> List[Content]

     [Figure: Content Transformation]

  2. Structured Data Extraction
     • Enriches content with structured data
     • Example: Adding bounding boxes to detected objects in an image
     • Extractor(Content) -> List[Feature(Type=Metadata)]

     [Figure: Feature Extraction]

  3. Embedding Extraction
     • Generates embeddings for ingested content
     • Indexify automatically creates indexes from these embeddings
     • Extractor(Content) -> List[Feature(Type=Embedding)]

     [Figure: Embedding Extraction]

  4. Combined Extraction
     • Performs transformation, embedding, and metadata extraction simultaneously
     • Extractor(Content) -> List[Feature... Content ...]

     [Figure: Combined Extraction]
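To make the four signatures concrete, here is a hedged sketch with one plain function per extraction type. Everything in it is illustrative: the `Content` and `Feature` tuples are hypothetical, and the "embedding" is a toy vowel count standing in for a real embedding model.

```python
from typing import List, NamedTuple, Union


class Content(NamedTuple):
    content_type: str
    data: bytes


class Feature(NamedTuple):
    feature_type: str   # "metadata" or "embedding"
    value: object


def transform(content: Content) -> List[Content]:
    """Transformation: one source becomes a list of Content (one per non-empty line)."""
    lines = content.data.decode("utf-8").splitlines()
    return [Content("text/plain", line.encode("utf-8")) for line in lines if line.strip()]


def extract_metadata(content: Content) -> List[Feature]:
    """Structured data extraction: enrich content with a metadata record."""
    text = content.data.decode("utf-8")
    return [Feature("metadata", {"num_chars": len(text), "num_words": len(text.split())})]


def extract_embedding(content: Content) -> List[Feature]:
    """Embedding extraction. Toy stand-in for a model: counts of a, e, i, o, u."""
    text = content.data.decode("utf-8").lower()
    return [Feature("embedding", [float(text.count(v)) for v in "aeiou"])]


def combined(content: Content) -> List[Union[Content, Feature]]:
    """Combined extraction: transformed chunks plus their metadata and embeddings."""
    out: List[Union[Content, Feature]] = []
    for chunk in transform(content):
        out.append(chunk)
        out.extend(extract_metadata(chunk))
        out.extend(extract_embedding(chunk))
    return out
```

The combined function returns a mixed list of Content and Feature items, which is one plausible reading of the `List[Feature... Content ...]` signature above.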

You can either build custom extractors yourself or use one of the many pre-existing extractors.

Modality | Extractor Name | Use Case | Supported Input Types
Text | OpenAI | General-purpose text processing | text/plain, application/pdf
Text | Chunking | Text splitting into smaller chunks | text/plain
Image | Gemini | General-purpose image processing | image/jpeg, image/png
Image | EasyOCR | Text extraction from images using OCR | image/jpeg, image/png
PDF | PDFExtractor | Text, image, and table extraction from PDFs | application/pdf
PDF | Marker | PDF to markdown conversion | application/pdf
Audio | Whisper | Audio transcription | audio, audio/mpeg
Audio | ASR Diarization | Speech recognition and speaker diarization | audio, audio/mpeg
Presentation | PPT | Information extraction from presentations | application/vnd.ms-powerpoint, application/vnd.openxmlformats-officedocument.presentationml.presentation
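The "Supported Input Types" column suggests that an extractor only applies to content whose MIME type it accepts. The sketch below shows one way to look up candidate extractors for a given MIME type, using a subset of the table above. The `SUPPORTED_INPUTS` dictionary and the `extractors_for` helper are hypothetical names for this example; how Indexify itself matches content to extractors is not described here.

```python
from typing import Dict, List

# Subset of the table above: extractor name -> accepted MIME types.
SUPPORTED_INPUTS: Dict[str, List[str]] = {
    "OpenAI": ["text/plain", "application/pdf"],
    "Chunking": ["text/plain"],
    "EasyOCR": ["image/jpeg", "image/png"],
    "PDFExtractor": ["application/pdf"],
    "Marker": ["application/pdf"],
}


def extractors_for(mime_type: str) -> List[str]:
    """Return the names of extractors that list this MIME type, sorted by name."""
    return sorted(
        name for name, accepted in SUPPORTED_INPUTS.items() if mime_type in accepted
    )
```

For example, `extractors_for("application/pdf")` lists the PDF-capable extractors from the table, while an unlisted type such as `video/mp4` yields an empty list.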

For an exhaustive list of all extractors and a guide to building custom extractors, visit the official extractor docs.

2. Namespaces

  • Logical abstractions for storing related content
  • Allow data partitioning based on security and organizational boundaries

3. Content

  • Representation of unstructured data (documents, video, images)

4. Extraction Graphs

  • Apply a sequence of extractors on ingested content in a streaming manner
  • Individual steps in an Extraction Graph are known as Extractors
  • Track lineage of transformed content and extracted features
  • Enable deletion of all transformed content and features when sources are deleted
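The chaining, lineage tracking, and cascading deletion listed above can be sketched in a few lines. This is a simplified in-memory model under stated assumptions: the `ExtractionGraph` class, its methods, and the two toy steps are invented for illustration, each step maps a string to a list of strings, and the streaming behaviour is not modelled.

```python
from typing import Callable, Dict, List, Optional

Step = Callable[[str], List[str]]


class ExtractionGraph:
    """Toy graph: applies steps in sequence and records each item's parent."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps
        self.items: Dict[int, str] = {}
        self.parent: Dict[int, Optional[int]] = {}
        self._next_id = 0

    def _store(self, value: str, parent: Optional[int]) -> int:
        cid = self._next_id
        self._next_id += 1
        self.items[cid] = value
        self.parent[cid] = parent
        return cid

    def ingest(self, source: str) -> int:
        """Run every step over the outputs of the previous one; return the source id."""
        root = self._store(source, None)
        frontier = [root]
        for step in self.steps:
            frontier = [
                self._store(out, cid) for cid in frontier for out in step(self.items[cid])
            ]
        return root

    def lineage(self, cid: int) -> List[int]:
        """Ids from this item back to its original source."""
        chain: List[int] = []
        current: Optional[int] = cid
        while current is not None:
            chain.append(current)
            current = self.parent[current]
        return chain

    def delete(self, root: int) -> None:
        """Delete a source together with everything derived from it."""
        doomed = [cid for cid in self.items if root in self.lineage(cid)]
        for cid in doomed:
            del self.items[cid]
            del self.parent[cid]


def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def uppercase(text: str) -> List[str]:
    return [text.upper()]
```

Recording a parent pointer per item is what makes both features cheap here: lineage is a walk up the pointers, and deleting a source removes every item whose lineage contains it.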

[Figure: Extraction Policy]

5. Vector Index and Retrieval APIs

  • Automatically created from extractors that return embeddings
  • Support various vector databases (Qdrant, Elasticsearch, OpenSearch, PostgreSQL, LanceDB)
  • Enable semantic/KNN search
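For intuition about what a KNN query over an index of embeddings does, here is a brute-force sketch using cosine similarity. It is not how the listed vector databases implement search (they typically use approximate indexes), and the `cosine` and `knn` helpers and the two-dimensional vectors are made up for this example.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(index: Dict[str, List[float]], query: List[float], k: int) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best match first."""
    scored = [(content_id, cosine(vector, query)) for content_id, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Given an index such as `{"ball": [1.0, 0.0], "bat": [0.9, 0.1], "tree": [0.0, 1.0]}` and the query `[1.0, 0.0]`, the two nearest neighbours are `ball` followed by `bat`.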

6. Structured Data Tables

  • Expose metadata extracted from content using SQL Queries
  • Each Extraction Graph has a virtual SQL table
  • Allow querying of metadata added to content

Example Usage:

select * from object_detector where object_name='ball'

This query returns every image in which a ball was detected, assuming an extraction policy named object_detector that runs YOLO object detection.
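To try the query shape locally, the sketch below runs the same statement against an in-memory SQLite table. This only imitates the idea of a virtual table per Extraction Graph: SQLite is a stand-in, and the `content_id` column and the sample rows are assumptions, not Indexify's actual schema.

```python
import sqlite3


def build_demo_table() -> sqlite3.Connection:
    """Create an in-memory table shaped like a hypothetical object_detector table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE object_detector (content_id TEXT, object_name TEXT)")
    conn.executemany(
        "INSERT INTO object_detector VALUES (?, ?)",
        [("img-1", "ball"), ("img-2", "dog"), ("img-3", "ball")],
    )
    conn.commit()
    return conn


conn = build_demo_table()
# Same query as in the text above.
rows = conn.execute("select * from object_detector where object_name='ball'").fetchall()
print(rows)
```

Only the rows whose `object_name` is `ball` come back; the `dog` row is filtered out by the `where` clause.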

Next Steps

To continue your journey with Indexify, consider exploring the following topics in order:

  1. Getting Started - Basic
     - Setting up the Indexify Server
     - Creating a Virtual Environment
     - Downloading and Setting Up Extractors
     - Defining Data Pipeline with YAML
     - Loading Wikipedia Data
     - Querying Indexed Data
     - Building a Simple RAG Application

  2. Intermediate Use Case: Unstructured Data Extraction from a Tax PDF
     - Understanding the challenge of tax document processing
     - Setting up an Indexify pipeline for PDF extraction
     - Implementing extractors for key tax information
     - Querying and retrieving processed tax data

  3. Key Concepts of Indexify
     - Extractors
       • Transformation
       • Structured Data Extraction
       • Embedding Extraction
       • Combined Transformation, Embedding, and Metadata Extraction
     - Namespaces
     - Content
     - Extraction Graphs
     - Vector Index and Retrieval APIs
     - Structured Data Tables

  4. Architecture of Indexify
     - Indexify Server
       • Coordinator
       • Ingestion Server
     - Extractors
     - Deployment Layouts
       • Local Mode
       • Production Mode

  5. Building a Custom Extractor for Your Use Case
     - Understanding the Extractor SDK
     - Designing your extractor's functionality
     - Implementing the extractor class
     - Testing and debugging your custom extractor
     - Integrating the custom extractor into your Indexify pipeline

  6. Examples and Use Cases
     - Document processing and analysis
     - Image and video content extraction
     - Audio transcription and analysis
     - Multi-modal data processing
     - Large-scale data ingestion and retrieval systems

Each section builds upon the previous ones, providing a logical progression from practical application to deeper technical understanding and finally to customization and real-world examples.

For more information on how to use Indexify, refer to the official documentation.

Happy coding!