PDF is one of the most complex, multi-layered sources for unstructured data extraction, and extracting content from it is rarely straightforward. Indexify lets you choose among several extractors depending on your use case and the kind of data in the PDF. To learn more about extractors, their design, and their usage, read the Indexify documentation.

| Extractor Name | Use Case | Supported Input Types |
| --- | --- | --- |
| EasyOCR | Text extraction from PDFs and images using OCR | application/pdf, image/jpeg, image/png |
| PDFExtractor | Text, image, and table extraction from PDFs | application/pdf |
| OCRMyPDF | Text extraction from image-based PDFs | application/pdf |
| UnstructuredIO | Structured extraction of PDF content | application/pdf |
| LayoutLM Document QA | Question answering on PDF documents | application/pdf, image/jpeg, image/png |
| Marker | PDF, EPUB, and MOBI to markdown conversion | application/pdf |
| PaddleOCR | Text extraction from PDFs using PaddleOCR | application/pdf |
| PPT | Information extraction from presentations | application/vnd.ms-powerpoint, application/vnd.openxmlformats-officedocument.presentationml.presentation |
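
Choosing an extractor usually starts from the MIME type of the file you have. The sketch below transcribes the table above into a plain Python dictionary and filters it by input type. The names `SUPPORTED_INPUT_TYPES` and `extractors_for` are hypothetical helpers for illustration and are not part of Indexify.

```python
# Lookup table transcribed from the extractor table above.
# Keys are the extractor names as listed; values are the supported MIME types.
SUPPORTED_INPUT_TYPES = {
    "EasyOCR": ["application/pdf", "image/jpeg", "image/png"],
    "PDFExtractor": ["application/pdf"],
    "OCRMyPDF": ["application/pdf"],
    "UnstructuredIO": ["application/pdf"],
    "LayoutLM Document QA": ["application/pdf", "image/jpeg", "image/png"],
    "Marker": ["application/pdf"],
    "PaddleOCR": ["application/pdf"],
    "PPT": [
        "application/vnd.ms-powerpoint",
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ],
}


def extractors_for(mime_type: str) -> list[str]:
    """Return the extractors whose supported input types include mime_type."""
    return [
        name
        for name, types in SUPPORTED_INPUT_TYPES.items()
        if mime_type in types
    ]


if __name__ == "__main__":
    # Candidates for a scanned page saved as a PNG image.
    print(extractors_for("image/png"))
```

The result only narrows the candidates by input type; the use case column still decides which of them fits the job.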

EasyOCR

Description

This extractor uses EasyOCR to generate searchable PDF content from a regular PDF and then extracts the text as plain text.

Input Parameters

  • None specified

Input Data Types

["application/pdf", "image/jpeg", "image/png"]

Class Name

OCRExtractor

Download Command

indexify-extractor download tensorlake/easyocr
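
The download command has the same shape for every extractor on this page, so it can be scripted. The following is a minimal sketch that only builds the command line; `download_command` is a hypothetical helper, and the extractor names are the ones listed in the Download Command sections of this page.

```python
import shlex


def download_command(extractor: str) -> list[str]:
    """Build the argv for `indexify-extractor download <extractor>`."""
    if not extractor or any(ch.isspace() for ch in extractor):
        raise ValueError(f"invalid extractor name: {extractor!r}")
    return ["indexify-extractor", "download", extractor]


# Extractor names as they appear in the download commands on this page.
EXTRACTORS = [
    "tensorlake/easyocr",
    "tensorlake/pdfextractor",
    "tensorlake/unstructuredio",
    "tensorlake/marker",
    "tensorlake/paddleocr_extractor",
    "tensorlake/ppt",
]

if __name__ == "__main__":
    for name in EXTRACTORS:
        # Print the shell-safe command instead of running it.
        print(shlex.join(download_command(name)))
```

To run a command rather than print it, the argv list can be passed to `subprocess.run(argv, check=True)`, assuming the `indexify-extractor` CLI is installed and on your `PATH`.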

PDFExtractor

Description

This extractor extracts text, images, and tables from a PDF and returns them as strings, bytes, and JSON, respectively.

Input Parameters

  • output_types (default: ["text"]): List of output types to extract (options: "text", "image", "table")
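
Because each output type arrives in a different form (strings for text, bytes for images, JSON for tables), downstream code typically branches on the type. The sketch below shows one way to normalize a single result. The `decode_output` helper and the idea of receiving an `(output_type, payload)` pair are assumptions made for illustration; they are not the Indexify client API.

```python
import json
from typing import Any, Union


def decode_output(output_type: str, payload: Union[str, bytes]) -> Any:
    """Normalize one extractor output according to its declared type.

    Assumed mapping, taken from the description above:
    text -> str, image -> raw bytes, table -> parsed JSON.
    """
    if output_type == "text":
        # Accept either str or UTF-8 encoded bytes for text.
        return payload.decode("utf-8") if isinstance(payload, bytes) else payload
    if output_type == "image":
        if not isinstance(payload, bytes):
            raise TypeError("image payloads are expected to be bytes")
        return payload
    if output_type == "table":
        # json.loads accepts both str and bytes input.
        return json.loads(payload)
    raise ValueError(f"unsupported output type: {output_type!r}")
```

Rejecting unknown types explicitly keeps a typo in `output_types` from passing silently through the pipeline.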

Input Data Types

["application/pdf"]

Class Name

PDFExtractor

Download Command

indexify-extractor download tensorlake/pdfextractor

UnstructuredIO

Description

This extractor uses unstructured.io to split a PDF document into pieces and return each piece as separate plain-text content.

Input Parameters

  • strategy (default: "auto"): Extraction strategy (options: "auto", "hi_res", "ocr_only", "fast")
  • hi_res_model_name (default: "yolox"): High-resolution model name
  • infer_table_structure (default: True): Whether to infer table structure
  • output_types (default: ["text"]): List of output types to extract (options: "text", "table")
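
The parameters above can be checked before they are sent to the extractor. This is a minimal sketch that mirrors the documented defaults and options in a dataclass; `UnstructuredIOParams` is a hypothetical name chosen for this example, not a class shipped with the extractor.

```python
from dataclasses import asdict, dataclass, field

# Allowed values, copied from the parameter list above.
STRATEGIES = {"auto", "hi_res", "ocr_only", "fast"}
OUTPUT_TYPES = {"text", "table"}


@dataclass
class UnstructuredIOParams:
    """Hypothetical container mirroring the documented defaults."""

    strategy: str = "auto"
    hi_res_model_name: str = "yolox"
    infer_table_structure: bool = True
    output_types: list[str] = field(default_factory=lambda: ["text"])

    def __post_init__(self) -> None:
        # Fail early on values outside the documented options.
        if self.strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        unknown = set(self.output_types) - OUTPUT_TYPES
        if unknown:
            raise ValueError(f"unknown output types: {sorted(unknown)}")


if __name__ == "__main__":
    params = UnstructuredIOParams(strategy="hi_res", output_types=["text", "table"])
    # asdict() gives a plain dict that can be serialized as input parameters.
    print(asdict(params))
```

Validating on construction means a misspelled strategy fails on your machine rather than after the document has been uploaded.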

Input Data Types

["application/pdf"]

Class Name

UnstructuredIOExtractor

Download Command

indexify-extractor download tensorlake/unstructuredio

Marker Extractor

Description

The Marker extractor converts PDF, EPUB, and MOBI files to Markdown. It is 10x faster than Nougat, more accurate on most documents, and has a low risk of hallucination.

Input Parameters

  • max_pages (optional): Maximum number of pages to process
  • start_page (optional): Starting page for processing
  • langs (optional): Languages to consider
  • batch_multiplier (default: 2): Batch multiplier for processing
  • output_types (default: ["text"]): List of output types to extract (options: "text", "image")

Input Data Types

["application/pdf"]

Class Name

MarkdownExtractor

Download Command

indexify-extractor download tensorlake/marker

PaddleOCR Extractor

Description

This extractor extracts text from PDFs using PaddleOCR.

Input Parameters

  • output_types (default: ["text"]): List of output types to extract (options: "text")

Input Data Types

["application/pdf"]

Class Name

PaddleOCRExtractor

Download Command

indexify-extractor download tensorlake/paddleocr_extractor

PPT Extractor

Description

This extractor extracts information, such as text, tables, and images, from presentations.

Input Parameters

  • output_types (default: ["text", "table"]): List of output types to extract (options: "text", "table", "image")

Input Data Types

["application/vnd.ms-powerpoint", "application/vnd.openxmlformats-officedocument.presentationml.presentation"]

Class Name

PPTExtractor

Download Command

indexify-extractor download tensorlake/ppt
