PDF Extractors
PDFExtractor
Extract text, images, and tables from PDF files as strings, bytes, and JSON, respectively, using this extractor.
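The type mapping above can be pictured with a small sketch. The `ExtractedContent` class and `make_table` helper are hypothetical names used only for illustration, not the extractor's actual API. The sketch simply shows text carried as a string, an image as raw bytes, and a table as a JSON string.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class ExtractedContent:
    """Hypothetical container for one extracted piece of a PDF."""
    kind: str                # "text", "image", or "table"
    data: Union[str, bytes]  # str for text and JSON tables, bytes for images


def make_table(rows: list) -> ExtractedContent:
    # Tables are serialized to a JSON string so their structure survives.
    return ExtractedContent(kind="table", data=json.dumps(rows))


# Made-up sample values standing in for one page of output.
page = [
    ExtractedContent(kind="text", data="Quarterly report"),
    ExtractedContent(kind="image", data=b"\x89PNG\r\n\x1a\n"),
    make_table([{"quarter": "Q1", "revenue": 120}]),
]

# Decode the table back into Python objects.
table_rows = json.loads(page[2].data)
```

Keeping tables as JSON rather than flattened text lets a consumer recover rows and columns with a single `json.loads` call.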
OCRMyPdf
Extract text content from image-based (scanned) PDF files using this OCRmyPDF-based extractor.
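As a rough illustration of the underlying tool, the sketch below drives the `ocrmypdf` command line directly and reads the recognized text from a sidecar file. It assumes `ocrmypdf` is installed and on the PATH; the function names and file paths are hypothetical, and this is not necessarily how the extractor itself invokes OCRmyPDF.

```python
import subprocess
from typing import List


def build_ocr_command(input_pdf: str, output_pdf: str, sidecar_txt: str) -> List[str]:
    """Build an ocrmypdf command that also writes recognized text to a file."""
    return [
        "ocrmypdf",
        "--sidecar", sidecar_txt,  # plain-text copy of the OCR result
        input_pdf,
        output_pdf,
    ]


def ocr_pdf_to_text(input_pdf: str, output_pdf: str, sidecar_txt: str) -> str:
    """Run ocrmypdf and return the recognized text (requires ocrmypdf on PATH)."""
    subprocess.run(build_ocr_command(input_pdf, output_pdf, sidecar_txt), check=True)
    with open(sidecar_txt, encoding="utf-8") as f:
        return f.read()
```

Calling `ocr_pdf_to_text("scan.pdf", "scan_ocr.pdf", "scan.txt")` would produce a searchable PDF and return its text, provided the input pages do not already carry a text layer.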
UnstructuredIO
This extractor uses unstructured.io to split a PDF document into pieces and returns each piece as separate plain-text content.
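A minimal sketch of that splitting step, assuming the open-source `unstructured` package and its `partition_pdf` function. The helper names and the output dictionary shape are hypothetical and chosen for illustration; the extractor's real output format may differ.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def elements_to_contents(elements: Iterable[Any]) -> List[Dict[str, str]]:
    """Turn each non-empty element into its own plain-text record."""
    contents = []
    for el in elements:
        text = (getattr(el, "text", "") or "").strip()
        if text:
            contents.append({"category": getattr(el, "category", "Unknown"), "text": text})
    return contents


def extract_pdf(path: str) -> List[Dict[str, str]]:
    # Imported lazily so the helper above works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf
    return elements_to_contents(partition_pdf(filename=path))


# Stand-in elements, so the sketch runs without a PDF or the library.
sample = [
    SimpleNamespace(category="Title", text="Annual Report"),
    SimpleNamespace(category="NarrativeText", text="  Revenue grew.  "),
    SimpleNamespace(category="PageBreak", text=""),
]
contents = elements_to_contents(sample)
```

Each record keeps the element category next to its text, so downstream code can treat titles and body paragraphs differently.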
LayoutLM Document QA
This is a fine-tuned version of the multi-modal LayoutLM model for the task of question answering on documents. It has been fine-tuned using both the SQuAD2.0 and DocVQA datasets.
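One way to query such a model is the Hugging Face `transformers` document-question-answering pipeline, sketched below. The model ID `impira/layoutlm-document-qa` is an assumption inferred from the description above, and `best_answer` and `ask_document` are hypothetical helper names. Running `ask_document` also requires `torch`, Pillow, and `pytesseract` with a Tesseract install.

```python
from typing import Dict, List, Union


def best_answer(result: Union[Dict, List[Dict]]) -> str:
    """Return the highest-scoring answer text, or "" if there are no candidates."""
    # The pipeline may return a single dict or a list of dicts; handle both.
    candidates = [result] if isinstance(result, dict) else list(result)
    if not candidates:
        return ""
    return max(candidates, key=lambda c: c["score"])["answer"]


def ask_document(image_path: str, question: str) -> str:
    # Lazy import: the heavy dependencies are only needed when this is called.
    from transformers import pipeline

    # Model ID is an assumption, not confirmed by this document.
    qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
    return best_answer(qa(image_path, question))
```

For example, `ask_document("invoice.png", "What is the invoice number?")` would return the answer span the model scores highest for that page image.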
Marker Extractor
The Marker extractor converts PDF, EPUB, and MOBI files to Markdown. It is 10x faster than Nougat, more accurate on most documents, and has low hallucination risk.