Extractors are managed by the indexify-extractor cli.

Indexify Extractor CLI

Download the indexify-extractor cli by

pip install indexify-extractor-sdk

List Available Extractors

indexify-extractor list

Extractor List Photo

Download Extractors

The extractors has to be downloaded before they can be used locally or in production. For ex, you can download the PDF extractor like this -

indexify-extractor download tensorlake/pdfextractor

Test Extractors Locally

You can test extractors locally without running them with the server in a production setting. Let’s say we want to test the PDF extractor

indexify-extractor run-local pdfextractor.pdf_extractor:PDFExtractor --file /path/to/pdf
Options
  • --file to pass in a file to the extractor
  • --text to pass in text to the extractor

Join the Extractor to the Server

You can join the extractor to the server to start extracting data ingested by the server

indexify-extractor join-server
Options
  • --coordinator-addr - Address of the coordinator. Default: localhost:8950
  • --ingestion-addr - Address of the ingestion server. Default: localhost:8900
  • --listen-port - The port on which the extractor listens of on-demand extraction
  • --advertise-addr - The address that is advertized to the ingestion server. This should be reachable by the server for embedding lookups to work if this is an embedding extractor.
  • --workers - Number of workers that the extractor spawns

These configurations are printed in log when the extractor starts up