PDFs are among the most complex, multi-layered sources for unstructured data extraction, and extracting content from them is rarely straightforward. Indexify lets you choose among several extractors depending on your use case and the kind of content in the PDF. To learn more about extractors, their design, and their usage, read the Indexify documentation.
Extractor Name | Use Case | Supported Input Types |
---|---|---|
EasyOCR | Text extraction from PDFs and images using OCR | application/pdf, image/jpeg, image/png |
PDFExtractor | Text, image, and table extraction from PDFs | application/pdf |
OCRMyPDF | Text extraction from image-based PDFs | application/pdf |
UnstructuredIO | Structured extraction of PDF content | application/pdf |
LayoutLM Document QA | Question answering on PDF documents | application/pdf, image/jpeg, image/png |
Marker | PDF, EPUB, and MOBI to markdown conversion | application/pdf |
PaddleOCR | Text extraction from PDFs using PaddleOCR | application/pdf |
PPT | Information extraction from presentations | application/vnd.ms-powerpoint, application/vnd.openxmlformats-officedocument.presentationml.presentation |
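As a minimal sketch of how the table can guide extractor choice, the supported input types can be kept in a lookup keyed by extractor name. The `SUPPORTED_INPUTS` mapping and the `extractors_for` helper are illustrative names, not part of Indexify; the data is copied from the table above.

```python
# Supported input MIME types per extractor, copied from the table above.
# SUPPORTED_INPUTS and extractors_for are hypothetical helpers, not Indexify APIs.
SUPPORTED_INPUTS: dict[str, set[str]] = {
    "EasyOCR": {"application/pdf", "image/jpeg", "image/png"},
    "PDFExtractor": {"application/pdf"},
    "OCRMyPDF": {"application/pdf"},
    "UnstructuredIO": {"application/pdf"},
    "LayoutLM Document QA": {"application/pdf", "image/jpeg", "image/png"},
    "Marker": {"application/pdf"},
    "PaddleOCR": {"application/pdf"},
    "PPT": {
        "application/vnd.ms-powerpoint",
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    },
}


def extractors_for(mime_type: str) -> list[str]:
    """Return the names of extractors that accept the given MIME type, sorted."""
    return sorted(
        name for name, types in SUPPORTED_INPUTS.items() if mime_type in types
    )


if __name__ == "__main__":
    print(extractors_for("image/png"))
    print(extractors_for("application/vnd.ms-powerpoint"))
```

The lookup only narrows the candidates by input type; the use-case column still decides which of them fits a given document.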
EasyOCR
Description
This extractor uses EasyOCR to generate searchable PDF content from a regular PDF, then extracts the text as plain text.
Input Parameters
- None specified
Input Data Types
["application/pdf", "image/jpeg", "image/png"]
Class Name
OCRExtractor
Download Command
indexify-extractor download tensorlake/easyocr
PDFExtractor
Description
This extractor extracts text, images, and tables from a PDF and returns them as strings, bytes, and JSON respectively.
Input Parameters
- output_types (default: ["text"]): List of output types to extract (options: "text", "image", "table")
Input Data Types
["application/pdf"]
Class Name
PDFExtractor
Download Command
indexify-extractor download tensorlake/pdfextractor
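Because each output type comes back in a different form, downstream code typically branches on the type. The sketch below assumes each extracted item arrives as raw bytes tagged with its output type; the `decode_output` helper and that payload shape are assumptions for illustration, not the extractor's documented return format.

```python
import json
from typing import Any


def decode_output(output_type: str, payload: bytes) -> Any:
    """Convert a raw payload into str, bytes, or parsed JSON by output type.

    Hypothetical helper: assumes text and table payloads are UTF-8 encoded.
    """
    if output_type == "text":
        return payload.decode("utf-8")  # text -> string
    if output_type == "image":
        return payload  # image -> raw bytes, returned unchanged
    if output_type == "table":
        return json.loads(payload)  # table -> parsed JSON
    raise ValueError(f"unsupported output type: {output_type!r}")


if __name__ == "__main__":
    print(decode_output("text", b"Quarterly report"))
    print(decode_output("table", b'{"rows": [[1, 2]]}'))
```

Raising on an unknown type, instead of passing the payload through, keeps a misconfigured `output_types` value from going unnoticed.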
UnstructuredIO
Description
This extractor uses unstructured.io to break a PDF document into pieces and returns each piece as separate plain-text content.
Input Parameters
- strategy (default: "auto"): Extraction strategy (options: "auto", "hi_res", "ocr_only", "fast")
- hi_res_model_name (default: "yolox"): High-resolution model name
- infer_table_structure (default: True): Whether to infer table structure
- output_types (default: ["text"]): List of output types to extract (options: "text", "table")
Input Data Types
["application/pdf"]
Class Name
UnstructuredIOExtractor
Download Command
indexify-extractor download tensorlake/unstructuredio
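With several interdependent parameters, it can help to validate values before handing them to the extractor. The following is a minimal sketch that uses the defaults and options listed above; `unstructured_params` is a hypothetical helper, and how the resulting dictionary is passed to Indexify is not shown here (consult the Indexify documentation for that).

```python
import json

# Options as listed in the UnstructuredIO section above.
STRATEGIES = ("auto", "hi_res", "ocr_only", "fast")
OUTPUT_TYPES = ("text", "table")


def unstructured_params(
    strategy: str = "auto",
    hi_res_model_name: str = "yolox",
    infer_table_structure: bool = True,
    output_types: tuple[str, ...] = ("text",),
) -> dict:
    """Build a validated parameter dict (hypothetical helper, not an Indexify API)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"strategy must be one of {STRATEGIES}, got {strategy!r}")
    unknown = [t for t in output_types if t not in OUTPUT_TYPES]
    if unknown:
        raise ValueError(f"unsupported output types: {unknown}")
    return {
        "strategy": strategy,
        "hi_res_model_name": hi_res_model_name,
        "infer_table_structure": infer_table_structure,
        "output_types": list(output_types),
    }


if __name__ == "__main__":
    params = unstructured_params(strategy="hi_res", output_types=("text", "table"))
    print(json.dumps(params, indent=2))
```

Calling `unstructured_params()` with no arguments reproduces the documented defaults, so only the values that differ need to be spelled out.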
Marker Extractor
Description
The Marker extractor converts PDF, EPUB, and MOBI files to Markdown. It is 10x faster than nougat, more accurate on most documents, and has a low hallucination risk.
Input Parameters
- max_pages (optional): Maximum number of pages to process
- start_page (optional): Starting page for processing
- langs (optional): Languages to consider
- batch_multiplier (default: 2): Batch multiplier for processing
- output_types (default: ["text"]): List of output types to extract (options: "text", "image")
Input Data Types
["application/pdf"]
Class Name
MarkdownExtractor
Download Command
indexify-extractor download tensorlake/marker
PaddleOCR Extractor
Description
This extractor extracts text from PDFs using PaddleOCR.
Input Parameters
- output_types (default: ["text"]): List of output types to extract (options: "text")
Input Data Types
["application/pdf"]
Class Name
PaddleOCRExtractor
Download Command
indexify-extractor download tensorlake/paddleocr_extractor
PPT Extractor
Description
This extractor extracts text, tables, and images from PowerPoint presentations.
Input Parameters
- output_types (default: ["text", "table"]): List of output types to extract (options: "text", "table", "image")
Input Data Types
["application/vnd.ms-powerpoint", "application/vnd.openxmlformats-officedocument.presentationml.presentation"]
Class Name
PPTExtractor
Download Command
indexify-extractor download tensorlake/ppt