If you’d like to try this out on your own cluster, check out the instructions.

Components

  • Server - This is where all your requests go. An ingress exposes / by default (see the sketch after this list).
  • Executor - Extractors can take multiple forms; this example is generic and works with all the extractors distributed by the project.
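
A minimal sketch of that ingress, assuming the server Service is named server and listens on port 8900; adjust both to match whatever your kustomization actually creates:

```yaml
# Sketch: route everything under / to the server Service.
# The Service name "server" and port 8900 are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: server
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: server
                port:
                  number: 8900
```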

Dependencies

Blob Store

We recommend using an S3-compatible service for the blob store. Our ephemeral example uses MinIO for this. See the environment variable patch for how this gets configured.

The API server and coordinator need an AWS_ENDPOINT environment variable pointing at wherever your S3-compatible service is hosted. Extractors need a slightly different variable, AWS_ENDPOINT_URL.
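
As a rough sketch, the patch boils down to strategic-merge patches like the following (shown as one multi-document file), pointing at an in-cluster MinIO Service. The Deployment and container names and the MinIO address are assumptions, so match them to your own manifests:

```yaml
# Add the blob-store endpoint to the API server (the coordinator gets the
# same AWS_ENDPOINT variable). Names here are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api-server
          env:
            - name: AWS_ENDPOINT
              value: http://minio:9000
---
# Extractors read AWS_ENDPOINT_URL instead.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: extractor
spec:
  template:
    spec:
      containers:
        - name: extractor
          env:
            - name: AWS_ENDPOINT_URL
              value: http://minio:9000
```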

GCP

  • You’ll want to create an HMAC key to use as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
  • Set AWS_ENDPOINT_URL to https://storage.googleapis.com/ (a sketch of wiring these up follows this list).
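
As a sketch, the HMAC pair can live in a Secret and be injected through those same variables; the Secret name and the envFrom wiring are assumptions, not something the example ships as-is:

```yaml
# Hypothetical Secret holding the GCS HMAC key. Reference it from the
# extractor Deployment with envFrom/secretRef; the API server and
# coordinator use AWS_ENDPOINT instead of AWS_ENDPOINT_URL.
apiVersion: v1
kind: Secret
metadata:
  name: blob-store-credentials
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <your-hmac-access-id>
  AWS_SECRET_ACCESS_KEY: <your-hmac-secret>
  AWS_ENDPOINT_URL: https://storage.googleapis.com/
```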

Other Clouds

Not all clouds expose an S3 interface. For those that don’t, check out the s3proxy project. However, we’d love help implementing your native blob storage of choice! Please open an issue so that we can discuss how that would look for the project.

Vector Store

We support multiple vector store backends, including LanceDB, Qdrant, and PgVector. The ephemeral example uses Postgres with PgVector for this. The database itself is pretty simple. Pay extra attention to the patch which configures the API server and coordinator to use that backend.
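
For the “pretty simple” part, a single-replica Postgres with the pgvector extension is enough for the ephemeral example. The image, credentials, and names below are assumptions for illustration and are not suitable for production:

```yaml
# Minimal Postgres-with-pgvector sketch for the ephemeral example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: pgvector/pgvector:pg16
          env:
            - name: POSTGRES_PASSWORD
              value: changeme   # demo only; use a Secret in real deployments
          ports:
            - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
    - port: 5432
```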

Structured Store

Take a look at the vector store component in kustomize. It implements the structured store as well.