Unleash the Power of PDFs: Introducing olmOCR, the Ultimate Toolkit for PDF Text Extraction
Tired of wrestling with PDFs and struggling to extract accurate text? olmOCR, the open-source toolkit developed by the AllenNLP team at AI2, is here to revolutionize how you interact with PDF documents. Designed to leverage the latest advances in Vision Language Models (VLMs), including a prompting strategy built around ChatGPT 4o, olmOCR empowers you to unlock hidden knowledge and make the most of your PDF data.
Why olmOCR? Key Benefits You Can't Ignore
Why should you choose olmOCR? Here's how it can transform your PDF processing workflow:
- ChatGPT-Powered Natural Text Parsing: Achieve unparalleled accuracy in text extraction with an innovative prompting strategy using ChatGPT 4o. Say goodbye to garbled text and hello to clean, usable data.
- Streamlined Evaluation: Compare different pipeline versions side-by-side with the built-in eval toolkit (runeval.py), ensuring optimal performance and accuracy.
- Intelligent Filtering: Eliminate noise and focus on relevant content with language filtering and SEO spam removal via the integrated filter.py.
- Scalable Processing: Process millions of PDFs efficiently using Sglang. Whether working locally or on a multi-node cluster, olmOCR delivers speed and scalability.
- Cutting-Edge Model Support: Fine-tuning code for state-of-the-art models like Qwen2-VL and Molmo-O ensures compatibility with the latest advancements in VLM technology.
- Easy Result Visualization: View extracted text alongside the original PDFs for seamless verification and analysis, using dolmaviewer.py.
- Automated Workflow: Automate the end-to-end process of working with PDFs thanks to the flexible pipeline tool.
Getting Started with olmOCR: Installation and Local Usage
Ready to dive in? Here’s a quick guide to getting olmOCR up and running:
1. Prerequisites:
- Powerful NVIDIA GPU: A recent NVIDIA GPU with at least 20 GB of GPU RAM (e.g., RTX 4090, L40S, A100, H100) is required, since the pipeline runs a large vision language model locally.
- Storage: 30GB of free disk space.
- Software: poppler-utils and additional fonts for rendering PDF images.
Dependencies (Ubuntu/Debian):
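The package list below follows the olmOCR README at the time of writing and may shift between releases:

```bash
# poppler-utils renders PDF pages; the remaining packages supply fonts used during rendering
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts \
    fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```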
2. Installation Steps:
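A typical setup, following the project README, creates a fresh conda environment and installs olmOCR from source (the environment name here is just an example):

```bash
# Create and activate a clean Python environment
conda create -n olmocr python=3.11
conda activate olmocr

# Clone the repository and install it in editable mode
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
```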
3. GPU Inference Setup (Optional):
For GPU-accelerated inference, install Sglang with flashinfer:
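The pinned versions below reflect one combination documented in the README; check the repository for the versions matching your CUDA and PyTorch setup before copying them verbatim:

```bash
# Install sglang together with the flashinfer attention kernels for fast GPU inference
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```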
4. Local Usage Example:
Convert a single PDF:
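For example, using one of the sample PDFs that ships with the repository (the sample path comes from the README):

```bash
# Process a single PDF into a local workspace directory
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```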
Convert multiple PDFs:
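Glob patterns are accepted, so an entire folder of PDFs can be queued in one command:

```bash
# Process every PDF matching the pattern into the same workspace
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```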
Results are stored in JSON format in the ./localworkspace directory.
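Assuming the default workspace layout (the results/ subdirectory and output_*.jsonl naming are assumptions here, not guarantees), a quick way to peek at one extracted document is:

```bash
# Each line of the output files is one JSON document with the extracted text and metadata
head -n 1 ./localworkspace/results/output_*.jsonl
```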
Scaling Up: Multi-Node and Cluster Usage for Massive PDF Processing
For large-scale PDF processing, olmOCR supports multi-node and cluster deployments. You can leverage AWS S3 for input and output, enabling parallel processing across multiple machines.
Example: Running on a Cluster
To process millions of PDFs efficiently:
- Prepare your cloud environment: Set up an AWS S3 bucket to store all of your PDFs.
- First Node (Worker): Launch the pipeline with an S3 workspace path and the S3 location of your PDFs (see the sketch after this list).
- Subsequent Nodes (Workers): The work queue is coordinated through the shared S3 workspace, so additional nodes simply run the pipeline against the same workspace.
- Access the results: Once the workers finish, the extracted text is written back to the workspace, ready for download and analysis.
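A minimal sketch of the multi-node invocation, with a placeholder bucket name standing in for your own S3 paths:

```bash
# First worker: point the pipeline at a shared S3 workspace and the PDFs to process
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/examplework --pdfs s3://my_bucket/pdfs/*.pdf

# Additional workers: attach to the same workspace; the shared work queue
# hands out the remaining PDFs automatically
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/examplework
```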
Ai2 Beaker Integration
If you are at Ai2, add the --beaker flag to run on the Beaker cluster.
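For example (the S3 paths below are placeholders):

```bash
# Prepare the workspace locally, then launch the workers on the Beaker cluster
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/examplework --pdfs s3://my_bucket/pdfs/*.pdf --beaker
```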
Optimize Your Workflow
olmOCR offers a robust set of options for controlling the extraction pipeline:
- --apply_filter: Apply basic filtering so only English PDFs that are not forms and not likely SEO spam are processed.
- --model: One or more paths where the conversion model can be found; the script uses whichever path is fastest to access.
- --target_longest_image_dim: The dimension, in pixels, of the longest side used when rendering PDF pages; larger values favor quality over speed.
Type python -m olmocr.pipeline --help to see the complete documentation.
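A hypothetical invocation combining these flags might look like the following (the input path and the 1024-pixel value are purely illustrative):

```bash
# Filter to clean English PDFs and render pages at 1024 px on the longest side
python -m olmocr.pipeline ./localworkspace --pdfs mypdfs/*.pdf \
    --apply_filter --target_longest_image_dim 1024
```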
Dive Deeper: Unlock the Potential of Your PDF Data with olmOCR
olmOCR represents a significant leap forward in PDF text extraction. By combining state-of-the-art VLM technology with a flexible and scalable architecture, olmOCR empowers researchers, developers, and businesses to unlock the hidden potential of their PDF data. Start exploring olmOCR today and experience the future of PDF processing.