Unlock Hidden Data: Convert PDFs to Text with Cutting-Edge AI Using olmOCR
Tired of manually extracting data from PDFs? olmOCR, a powerful toolkit developed by AI2, leverages modern vision language models to accurately convert PDFs and scanned documents into usable text. Whether you're processing a single PDF or millions, olmOCR streamlines the process and unlocks valuable insights. Imagine quickly analyzing contracts, research papers, or financial statements with unparalleled efficiency. This guide walks you through olmOCR's capabilities, installation, and practical usage, helping you harness the power of AI for PDF data extraction.
What is olmOCR and Why Should You Use It?
olmOCR is a game-changing toolkit that empowers you to:
- Extract Natural Text: Uses an advanced prompting strategy with GPT-4o for high-quality natural-text parsing.
- Evaluate and Compare: A side-by-side evaluation toolkit lets you compare different pipeline versions.
- Filter Out the Noise: Basic language filtering and SEO-spam removal keep the extracted data clean and relevant.
- Fine-Tune Models: Includes code for fine-tuning Qwen2-VL and Molmo-O models to your specific needs.
- Scale Your Operations: Processes millions of PDFs efficiently, using SGLang for parallel inference.
- Visualize Results: The Dolma viewer lets you compare the extracted text with the original PDF layout.
The key benefit? Access to trillions of tokens locked away within PDFs, enabling data-driven decisions and insights.
Installation and Setup: Get Started with olmOCR
Getting olmOCR up and running requires a compatible environment. Here's a step-by-step guide to help you set up the toolkit:
1. System Requirements:
- NVIDIA GPU with at least 20 GB of GPU RAM (tested on RTX 4090, L40S, A100, H100).
- 30 GB of free disk space.
- Poppler-utils and additional fonts for rendering PDF images.
2. Install Dependencies (Ubuntu/Debian).
3. Create a Conda Environment and Install olmOCR.
4. Install SGLang (for GPU Inference). The commands for steps 2-4 are sketched after this list.
Note: An NVIDIA GPU is required for local inference, which is powered by SGLang and keeps PDF-to-text conversion fast and efficient.
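The exact package list and pinned versions change between releases, so treat the following as a sketch based on the project README rather than a canonical install script; in particular, the SGLang install is usually pinned to a specific version and CUDA build, so check the olmOCR README for the current line.

```bash
# Step 2: system packages for rendering PDF pages (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts \
    fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools

# Step 3: fresh conda environment, then olmOCR itself from source
conda create -n olmocr python=3.11
conda activate olmocr
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .

# Step 4: SGLang for local GPU inference (illustrative; the README pins an exact version)
pip install "sglang[all]"
```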
Local Usage: Converting PDFs Made Easy
Once installed, olmOCR offers simple commands for converting PDFs:
- Convert a Single PDF (first command in the sketch below).
- Convert Multiple PDFs (second command in the sketch below).
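Assuming olmOCR is installed in the active environment, the invocations look roughly like this (the PDF paths are placeholders for your own files):

```bash
# Convert a single PDF; the workspace directory is created if it does not exist
python -m olmocr.pipeline ./localworkspace --pdfs path/to/document.pdf

# Convert many PDFs at once with a glob
python -m olmocr.pipeline ./localworkspace --pdfs path/to/pdfs/*.pdf
```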
Results will be stored as JSON in the ./localworkspace directory, providing easy access to your extracted data.
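To spot-check what was produced, you can pretty-print the first record; the results/ subfolder and the output filename used here are assumptions, so adjust the path to whatever your run actually wrote:

```bash
# Peek at the first extracted record (adjust the filename to your run's output)
head -n 1 ./localworkspace/results/output_example.jsonl | python -m json.tool
```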
Multi-Node/Cluster Usage: Scaling to Millions of PDFs
olmOCR is designed for large-scale PDF processing. You can leverage AWS S3 for reading PDFs and coordinating work across multiple nodes.
- First Worker Node: Set up the work queue in your AWS S3 bucket and start converting PDFs (first command in the sketch below).
- Subsequent Nodes: Point them at the same workspace so they grab items from the shared queue (second command below).
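Assuming both the PDFs and the workspace live in S3, the pattern looks roughly like this (bucket and prefix names are placeholders):

```bash
# First node: builds the work queue in the S3 workspace and starts processing
python -m olmocr.pipeline s3://my_bucket/pdfworkspace --pdfs s3://my_bucket/pdfs/*.pdf

# Every additional node: same workspace, no --pdfs; it pulls items from the shared queue
python -m olmocr.pipeline s3://my_bucket/pdfworkspace
```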
For users at AI2, the --beaker flag simplifies launching GPU workers in the cluster for efficient PDF data extraction.
Viewing Results and Optimizing Performance
Extracted text is stored as Dolma-style JSONL files. Use the Dolma viewer to compare results side-by-side with the original PDFs.
- View Results:
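A sketch of the viewer invocation, assuming results were written under ./localworkspace; the dolmaviewer module path and the output_*.jsonl naming are taken from memory of the toolkit, so verify them against the project README:

```bash
# Generate side-by-side HTML pages from the Dolma-style JSONL results
python -m olmocr.viewer.dolmaviewer ./localworkspace/results/output_*.jsonl
```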
Open the generated HTML file in your browser to visually inspect the extracted text; the side-by-side view makes it much easier to judge extraction quality, especially on documents with poor image quality.
Key Options for Fine-Tuning the Pipeline
The pipeline.py script provides numerous options to tailor the PDF-to-text conversion process. Key parameters include:
- --pdfs: Path to PDF files for processing.
- --workspace: Local folder or S3 path for storing work.
- --apply_filter: Apply basic filtering to English PDFs.
- --model: Specify the path to your language model.
- --beaker: Submit the job to Beaker for cluster processing.
For complete documentation, run python -m olmocr.pipeline --help.
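As an illustration of combining these options, a local run that filters to English documents and points at a specific model checkpoint might look like the sketch below; the flag spellings follow the list above and the checkpoint name is only an example, so confirm both with --help before relying on them:

```bash
# Hypothetical combined invocation (verify flags and model name against --help)
python -m olmocr.pipeline ./localworkspace \
    --pdfs path/to/pdfs/*.pdf \
    --apply_filter \
    --model allenai/olmOCR-7B-0225-preview
```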
Contribute to the Future of Document Understanding
olmOCR, developed by the AllenNLP team at AI2, is licensed under Apache 2.0. We encourage you to explore the toolkit, contribute to its development, and unlock the potential hidden within your PDF documents. Start using olmOCR today and experience the power of AI-driven PDF data extraction!