Embedding Extractors

Embedding extractors transform raw data (text, images, or other types of input) into dense vector representations. These representations support natural language processing and computer vision tasks such as semantic search, clustering, and downstream machine learning. Indexify lets you choose among several extractors depending on your use case and data source. To learn more about how extractors are designed and used, read the Indexify documentation.

Extractor Name   | Use Case                                   | Supported Input Types
OpenAI Clip      | Image and text embedding                   | image/jpeg, image/png, image/gif, text/plain
ColBERT          | Text embedding                             | text/plain
E5               | Text embedding for similarity search       | text/plain
Hash             | Identity embedding for duplicate detection | text/plain
Jina             | Text embedding                             | text/plain
Arctic           | Text embedding                             | text/plain
MiniLML6         | Text embedding                             | text/plain
MPnet            | Multilingual text embedding                | text/plain
OpenAI Embedding | Text embedding                             | text/plain
SciBERT Uncased  | Scientific text embedding                  | text/plain
BGE Base         | Text embedding                             | text/plain
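
The table above can also be read as a lookup from content type to candidate extractors. The following is a minimal sketch of that idea: the dictionary only transcribes the table, and the helper name `extractors_for` is hypothetical and not part of Indexify.

```python
from typing import Dict, List

# Supported input types per extractor, transcribed from the table above.
SUPPORTED_INPUT_TYPES: Dict[str, List[str]] = {
    "OpenAI Clip": ["image/jpeg", "image/png", "image/gif", "text/plain"],
    "ColBERT": ["text/plain"],
    "E5": ["text/plain"],
    "Hash": ["text/plain"],
    "Jina": ["text/plain"],
    "Arctic": ["text/plain"],
    "MiniLML6": ["text/plain"],
    "MPnet": ["text/plain"],
    "OpenAI Embedding": ["text/plain"],
    "SciBERT Uncased": ["text/plain"],
    "BGE Base": ["text/plain"],
}


def extractors_for(mime_type: str) -> List[str]:
    """Return the extractors whose supported input types include mime_type.

    Hypothetical helper for illustration only; not an Indexify API.
    """
    return [
        name
        for name, types in SUPPORTED_INPUT_TYPES.items()
        if mime_type in types
    ]
```

Calling `extractors_for("image/png")` narrows the choice to the single image-capable extractor, while `"text/plain"` returns every entry, so the use case column decides among them.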

OpenAI Clip

Description

This extractor utilizes OpenAI's CLIP model to generate embeddings for both images and text.

Input Parameters

  • None specified.

Input Data Types

["image/jpeg", "image/png", "image/gif", "text/plain"]

Class Name

ClipEmbeddingExtractor

Download Command

indexify-extractor download tensorlake/clip-extractor

ColBERT

Description

This extractor converts text inputs into vector embeddings using the ColBERTv2 model. It relies on ColBERTv2's transformer-based architecture to generate context-aware embeddings suitable for a range of natural language processing tasks.

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

ColBERTv2Base

Download Command

indexify-extractor download tensorlake/colbert

E5

Description

A small, fast, general-purpose model for similarity search or downstream enrichment. It is based on E5_Small_V2, which works only for English text. Long texts are truncated to at most 512 tokens.
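
Because anything beyond the 512-token limit is dropped, long documents may need to be split before extraction. The sketch below shows one way to do that, under a loud assumption: whitespace-separated words stand in for model tokens, whereas the real tokenizer works on subwords and will count differently. The function name `chunk_words` and its parameters are hypothetical.

```python
from typing import List


def chunk_words(text: str, max_tokens: int = 512, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    ASSUMPTION: words approximate tokens. A subword tokenizer usually
    produces more tokens than words, so a smaller max_tokens is safer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    words = text.split()
    step = max_tokens - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the current window already reached the end of the text
    return chunks
```

Each chunk can then be embedded separately. A non-zero `overlap` repeats a few words between neighbouring chunks so that a sentence cut at a boundary still appears whole in one of them.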

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

E5SmallEmbeddings

Download Command

indexify-extractor download tensorlake/e5_small_embedding

Hash

Description

This extractor computes an "identity" embedding for a piece of text or a file. It uses SHA-256 to derive a unique embedding for the given input, which can be used to quickly find duplicates in a large set of data.
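
The idea can be sketched with Python's standard `hashlib` module. How the digest is turned into a vector is an assumption here (one integer per digest byte, giving 32 dimensions); the extractor itself may encode it differently. The names `identity_embedding` and `find_duplicates` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def identity_embedding(text: str) -> List[int]:
    """Map text to a 32-dimensional vector of its SHA-256 digest bytes.

    ASSUMPTION: one integer (0-255) per byte; illustrative encoding only.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return list(digest)


def find_duplicates(documents: List[str]) -> List[Tuple[int, int]]:
    """Return (first_index, duplicate_index) pairs for exact duplicates."""
    seen: Dict[Tuple[int, ...], int] = {}
    duplicates: List[Tuple[int, int]] = []
    for index, doc in enumerate(documents):
        key = tuple(identity_embedding(doc))
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates
```

Since a cryptographic hash changes completely when a single byte changes, this approach finds exact duplicates only. Near-duplicates need one of the semantic embedding extractors instead.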

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

IdentityHashEmbedding

Download Command

indexify-extractor download yenicelik/identity-hash-extractor

Jina

Description

This extractor generates an embedding for a piece of text. It uses the Jina model from Hugging Face, an English, monolingual embedding model that supports sequence lengths of up to 8192 tokens. It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence lengths.
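
To make the ALiBi reference concrete, here is a minimal sketch of the attention bias it adds. ALiBi replaces positional embeddings with a penalty proportional to the distance between query and key positions, and the symmetric variant uses the absolute distance so that tokens on both sides are treated alike. The slope formula follows the geometric sequence described for ALiBi with a power-of-two number of heads. The function names are hypothetical, and this is not the JinaBERT implementation.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes as a geometric sequence: 2^(-8/n), 2^(-16/n), ...

    ASSUMPTION: num_heads is a power of two, the simple case for ALiBi.
    """
    return [2.0 ** (-8.0 * (head + 1) / num_heads) for head in range(num_heads)]


def symmetric_alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Bias matrix added to attention scores: -slope * |i - j|."""
    return [
        [-slope * abs(i - j) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

Because the bias depends only on relative distance, the same rule applies to positions never seen during training, which is what allows the longer sequence lengths.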

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

JinaEmbeddingsBase

Download Command

indexify-extractor download tensorlake/jina-embeddings-base-en

Arctic - Sentence Transformer

Description

This extractor generates an embedding for a piece of text. It uses Snowflake's Arctic-embed-m model from Hugging Face. The snowflake-arctic-embedding models are reported to achieve state-of-the-art performance on the MTEB/BEIR leaderboard for each of their size variants.

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

ArcticExtractor

Download Command

indexify-extractor download tensorlake/arctic

MiniLML6 - Sentence Transformer

Description

This extractor generates an embedding for a piece of text. It uses the MiniLM-6 model from Hugging Face, a tiny but very robust text embedding model.

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

MiniLML6Extractor

Download Command

indexify-extractor download tensorlake/minilm-l6

MPnet - Sentence Transformer

Description

This is a sentence embedding extractor based on MPNET Multilingual Base V2, a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. Its best use case is paraphrasing, but it can also be used for other tasks.
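
Semantic search over such vectors usually comes down to ranking by a similarity measure. The sketch below uses cosine similarity on plain Python lists. The vectors in the usage note are toy 3-dimensional values rather than real 768-dimensional model output, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], corpus: List[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (corpus_index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_by_similarity([1, 0, 0], [[0, 1, 0], [2, 0, 0], [1, 1, 0]])` places index 1 first, because that vector points in the same direction as the query even though its length differs.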

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

MPNetV2

Download Command

indexify-extractor download tensorlake/mpnet

OpenAI Embedding

Description

This extractor generates an embedding for a piece of text using the OpenAI text-embedding-ada-002 model.

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

OpenAIEmbeddingExtractor

Download Command

indexify-extractor download tensorlake/openai-embedding-ada-002-extractor

SciBERT Uncased

Description

This extractor uses the pretrained model presented in "SciBERT: A Pretrained Language Model for Scientific Text", a BERT model trained on scientific text. It works best for embedding scientific text.

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

SciBERTExtractor

Download Command

indexify-extractor download tensorlake/scibert

BGE Base

Description

This extractor uses the BGE Base English model to generate sentence embeddings.

Input Parameters

  • None specified.

Input Data Types

["text/plain"]

Class Name

BGEBase

Download Command

indexify-extractor download tensorlake/bge-base-en