Skip to content

LLM Extractors

LLM based text extractors are the backbone of structured data parsing. Indexify allows you to choose between multiple extractors as per the use case and the type of data. If you want to read more about how extractors are created and used read the documentation.

Extractor Name Use Case Supported Input Types
Gemini General-purpose text and image processing text/plain, application/pdf, image/jpeg, image/png
Mistral AI General-purpose text processing text/plain
Ollama Local LLM processing text/plain
OpenAI General-purpose text and image processing text/plain, application/pdf, image/jpeg, image/png
NER Named Entity Recognition text/plain
Summarization Text summarization using BART text/plain
LLM-Summary Text summarization using lightweight LLM text/plain
Chunking Text splitting into smaller chunks text/plain
Schema Structured information extraction based on JSON schema text/plain

Gemini Extractor

The Gemini extractors supports various Gemini models like 1.5 Pro and 1.5 Flash. It supports input documents like text, pdf and images and returns output in text. The system and user prompts can be configured using the input parameter configuration. The input to the extractor is appended into the user prompt.

Input Params

model_name(default:'gemini-1.5-flash-latest'): Name of the gemini model to use.

key(default: None): API Key

system_prompt(default: "You are a helpful assistant"): Default system prompt.

user_prompt(default: None): User prompt

Input Data Types

["text/plain", "application/pdf", "image/jpeg", "image/png"]

Class Name

GeminiExtractor

Download Command

indexify-extractor download tensorlake/gemini

Mistral AI Extractor

This Mistral extractor supports various Mistral AI models like mistral-large-latest. The system and user prompts can be configured using the input parameter configuration. The input to the extractor is appended into the user prompt.

Input Params

model_name(default:'mistral-large-latest'): Name of the Mistral model to use.

key:(default: None): API Key

system_prompt(default: "You are a helpful assistant"): Default system prompt.

user_prompt(default: None): User prompt

Input Data Types

["text/plain"]

Class Name

MistralExtractor

Download Command

indexify-extractor download tensorlake/mistral

Ollama Extractor

The extractor provides access to local LLMs using the Ollama API. It is Llama3 by default. The system and user prompts can be configured using the input parameter configuration. The input to the extractor is appended to the user prompt.

Input Params

model_name(default='llama3'): Model Name

system_prompt(default='You are a helpful assistant.'): System prompt

user_prompt(default=None): User prompt

Input Data Types

["text/plain"]

Class Name

OllamaExtractor

Download Command

indexify-extractor download tensorlake/ollama

OpenAI Extractor

Description

This extractor supports multiple types of input documents like text, PDF, and images, returning output in text format using OpenAI. It supports all OpenAI LLM models like 3.5 Turbo and 4, working on the content of previous extractors as messages. The user can manually overwrite prompts and messages.

Input Parameters

  • model_name (default: 'gpt-3.5-turbo'): Name of the OpenAI model to use.
  • key (default: None): API Key
  • system_prompt (default: 'You are a helpful assistant.'): Default system prompt.
  • user_prompt (default: None): User prompt

Input Data Types

["text/plain", "application/pdf", "image/jpeg", "image/png"]

Class Name

OAIExtractor

Download Command

indexify-extractor download tensorlake/openai

NER Extractor

Description

This is a bert-base-NER (Named-Entity-Recognition) model. It is a fine-tuned BERT model ready for Named Entity Recognition, achieving state-of-the-art performance. It recognizes four types of entities: location (LOC), organizations (ORG), person (PER), and Miscellaneous (MISC). This model is a bert-base-based model fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.

Input Parameters

  • model_name (default: "dslim/bert-base-NER"): Name of the NER model to use.

Input Data Types

["text/plain"]

Class Name

NERExtractor

Download Command

indexify-extractor download tensorlake/ner

Summarization Extractor

Description

This BART-based Summary Extractor can convert text from Audio, PDF, and other files into summaries. It uses facebook/bart-large-cnn, which excels at generating factual summaries from English news articles. For chunking, it uses FastRecursiveTextSplitter along with support for LangChain-based text splitters.

Input Parameters

  • max_length (default: 3000): Maximum length of the summary.
  • min_length (default: 30): Minimum length of the summary.
  • chunk_method (default: "indexify"): Method used for text chunking.

Input Data Types

["text/plain"]

Class Name

SummaryExtractor

Download Command

indexify-extractor download tensorlake/summarization

LLM-Summary Extractor

Description

This LLM-based Summary Extractor converts text from Audio, PDF, and other files into summaries. It uses h2oai/h2o-danube2-1.8b-chat, a <3B parameter Large Language Model suitable for low-end machines. For chunking, it uses FastRecursiveTextSplitter along with support for LangChain-based text splitters.

Input Parameters

  • max_length (default: 130): Maximum length of the summary.
  • chunk_method (default: "indexify"): Method used for text chunking.

Input Data Types

["text/plain"]

Class Name

SummaryExtractor

Download Command

indexify-extractor download tensorlake/llm-summary

Chunking Extractor

Description

ChunkExtractor splits text into smaller chunks. It can handle input from any source producing text with a mime type of text/plain. The extractor can be configured to use various chunking strategies available through Langchain.

Input Parameters

  • overlap (default: 0): Overlap between chunks.
  • chunk_size (default: 100): Size of each chunk.
  • text_splitter (default: "recursive"): Type of text splitter to use (options: "char", "recursive", "markdown", "html").
  • headers_to_split_on (default: []): List of headers to use for splitting.

Input Data Types

["text/plain"]

Class Name

ChunkExtractor

Download Command

indexify-extractor download tensorlake/chunk-extractor

Schema Extractor

Description

The Schema Extractor enables structured extraction using LLMs. It accepts a user-provided JSON Schema and extracts information from text based on the schema. By default, it uses OpenAI, with the ability to use other LLMs as well. This extractor is inspired by Instructor from @jxnlco.

Input Parameters

  • service (default: 'openai'): LLM Service to use.
  • model_name (default: 'gpt-3.5-turbo'): LLM model to use.
  • key (default: None): API key of the service.
  • schema_config (default: None): The JSON Schema to use for structured extraction.
  • example_text (default: None): Few examples of how the scema should look like
  • data: Unstructured text data
  • additional_messages (default: None): Any additional instrunction for the model

Input Data Types

["text/plain"]

Class Name

SchemaExtractor

Download Command

indexify-extractor download tensorlake/schema