Unleash the Power of PDFs: Introducing olmOCR, the Ultimate Toolkit for PDF Text Extraction
Tired of wrestling with PDFs and struggling to extract accurate text? olmOCR, the open-source toolkit developed by the AllenNLP team at AI2, is here to revolutionize how you interact with PDF documents. Designed to leverage the latest advances in Vision Language Models (VLMs), including a prompting strategy built around ChatGPT 4o, olmOCR empowers you to unlock hidden knowledge and make the most of your PDF data.
Why olmOCR? Key Benefits You Can't Ignore
Why should you choose olmOCR? Here's how it can transform your PDF processing workflow:
- ChatGPT-Powered Natural Text Parsing: Achieve unparalleled accuracy in text extraction with an innovative prompting strategy using ChatGPT 4o. Say goodbye to garbled text and hello to clean, usable data.
- Streamlined Evaluation: Compare different pipeline versions side-by-side with the built-in eval toolkit (runeval.py), ensuring optimal performance and accuracy.
- Intelligent Filtering: Eliminate noise and focus on relevant content with language filtering and SEO spam removal via the integrated filter.py.
- Scalable Processing: Process millions of PDFs efficiently using Sglang. Whether working locally or on a multi-node cluster, olmOCR delivers speed and scalability.
- Cutting-Edge Model Support: Fine-tuning code for state-of-the-art models like Qwen2-VL and Molmo-O ensures compatibility with the latest advancements in VLM technology.
- Easy Result Visualization: View extracted text alongside the original PDFs for seamless verification and analysis, using dolmaviewer.py.
- Automated Workflow: Automate the end-to-end process of working with PDFs thanks to the flexible pipeline tool.
Getting Started with olmOCR: Installation and Local Usage
Ready to dive in? Here’s a quick guide to getting olmOCR up and running:
1. Prerequisites:
- Powerful NVIDIA GPU: A recent NVIDIA GPU with at least 20 GB of GPU RAM (e.g., RTX 4090, L40S, A100, H100) is required, since the pipeline runs a large vision language model locally.
- Storage: 30GB of free disk space.
- Software: poppler-utils and additional fonts for rendering PDF images.
Dependencies (Ubuntu/Debian):
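The package list below follows the olmOCR README at the time of writing and may shift between releases:

```bash
# poppler-utils renders PDF pages; the remaining packages supply fonts used during rendering
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts \
    fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```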
2. Installation Steps:
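A typical setup, following the project README, creates a fresh conda environment and installs olmOCR from source (the environment name here is just an example):

```bash
# Create and activate a clean Python environment
conda create -n olmocr python=3.11
conda activate olmocr

# Clone the repository and install it in editable mode
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
```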
3. GPU Inference Setup (Optional):
For GPU-accelerated inference, install Sglang with flashinfer:
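The pinned versions below reflect one combination documented in the README; check the repository for the versions matching your CUDA and PyTorch setup before copying them verbatim:

```bash
# Install sglang together with the flashinfer attention kernels for fast GPU inference
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```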
4. Local Usage Example:
Convert a single PDF:
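For example, using one of the sample PDFs that ships with the repository (the sample path comes from the README):

```bash
# Process a single PDF into a local workspace directory
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```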
Convert multiple PDFs:
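Glob patterns are accepted, so an entire folder of PDFs can be queued in one command:

```bash
# Process every PDF matching the pattern into the same workspace
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```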
Results are stored in JSON format in the ./localworkspace directory.
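Assuming the default workspace layout (the results/ subdirectory and output_*.jsonl naming are assumptions here, not guarantees), a quick way to peek at one extracted document is:

```bash
# Each line of the output files is one JSON document with the extracted text and metadata
head -n 1 ./localworkspace/results/output_*.jsonl
```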
Scaling Up: Multi-Node and Cluster Usage for Massive PDF Processing
For large-scale PDF processing, olmOCR supports multi-node and cluster deployments. You can leverage AWS S3 for input and output, enabling parallel processing across multiple machines.
Example: Running on a Cluster
To process millions of PDFs efficiently:
- Prepare your cloud environment: Set up an AWS S3 bucket to store all of your PDFs.
- First Node (Worker): Launch the pipeline with an S3 workspace path and the S3 location of your PDFs (see the sketch after this list).
- Subsequent Nodes (Workers): The work queue is coordinated through the shared S3 workspace, so additional nodes simply run the pipeline against the same workspace.
- Access the results: Once the workers finish, the extracted text is written back to the workspace, ready for download and analysis.
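A minimal sketch of the multi-node invocation, with a placeholder bucket name standing in for your own S3 paths:

```bash
# First worker: point the pipeline at a shared S3 workspace and the PDFs to process
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/examplework --pdfs s3://my_bucket/pdfs/*.pdf

# Additional workers: attach to the same workspace; the shared work queue
# hands out the remaining PDFs automatically
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/examplework
```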
Ai2 Beaker Integration
If you are at Ai2, add the --beaker flag to run on the Beaker cluster.
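For example (the S3 paths below are placeholders):

```bash
# Prepare the workspace locally, then launch the workers on the Beaker cluster
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/examplework --pdfs s3://my_bucket/pdfs/*.pdf --beaker
```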
Optimize Your Workflow
olmOCR offers a robust set of options for controlling the extraction pipeline:
- --apply_filter: Apply basic filtering so only English PDFs that are not forms and not likely SEO spam are processed.
- --model: One or more paths where the conversion model can be found; the script uses whichever path is fastest to access.
- --target_longest_image_dim: The dimension, in pixels, of the longest side used when rendering PDF pages; larger values favor quality over speed.
Type python -m olmocr.pipeline --help to see the complete documentation.
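A hypothetical invocation combining these flags might look like the following (the input path and the 1024-pixel value are purely illustrative):

```bash
# Filter to clean English PDFs and render pages at 1024 px on the longest side
python -m olmocr.pipeline ./localworkspace --pdfs mypdfs/*.pdf \
    --apply_filter --target_longest_image_dim 1024
```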
Dive Deeper: Unlock the Potential of Your PDF Data with olmOCR
olmOCR represents a significant leap forward in PDF text extraction. By combining state-of-the-art VLM technology with a flexible and scalable architecture, olmOCR empowers researchers, developers, and businesses to unlock the hidden potential of their PDF data. Start exploring olmOCR today and experience the future of PDF processing.