Unlock the Power of PDFs: Introducing olmOCR for AI-Powered Document Processing
Tired of wrestling with inaccessible PDF documents? olmOCR, a cutting-edge toolkit from AllenAI, is here to revolutionize how you interact with PDFs, leveraging the power of language models. This open-source solution provides state-of-the-art natural text parsing and unlocks hidden insights within your documents.
What is olmOCR and Why Should You Use It?
olmOCR is a comprehensive toolkit for training and running language models that extract and process text from PDF documents, even those with complex layouts or poor image quality. This makes it possible to recover usable text from documents that previously may have been discarded as unreadable.
Here's what olmOCR brings to the table:
- GPT-4o Integration: A prompting strategy that produces exceptionally accurate natural-text parsing.
- Side-by-Side Evaluation: Easily compare different pipeline versions to optimize your processing workflow.
- Advanced Filtering: Cleans your data by removing SEO spam and non-English documents.
- Scalable Processing: Convert millions of PDFs using SGLang with efficient GPU utilization.
Key Features and Benefits of Using olmOCR
- Improved OCR Accuracy: Leverage the power of Qwen2-VL and Molmo-O fine-tuning for unparalleled accuracy.
- Efficient Data Extraction: Extract text and data from PDFs at scale, saving time and resources.
- Seamless Integration: Easily incorporate olmOCR into your existing data pipelines.
- Open-Source and Customizable: Adapt the toolkit to your specific needs and contribute to the community.
Getting Started with olmOCR: Installation and Local Usage
Ready to unlock the potential of your PDFs? Here's how to get started with olmOCR:
1. Prerequisites:
- A recent NVIDIA GPU with at least 20 GB of GPU RAM.
- 30 GB of free disk space.
- poppler-utils and additional fonts for rendering PDF pages as images.
2. Installation (Ubuntu/Debian):
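On Ubuntu/Debian, the system dependencies can be installed with apt. The package list below follows the olmOCR README at the time of writing and may differ on other distributions:

```shell
# Install the PDF rendering tools and fonts olmOCR expects
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts \
  fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```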
3. Conda Environment Setup:
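A typical setup, assuming conda and git are available (the Python version pin follows the project's README and may change between releases):

```shell
# Create and activate an isolated environment
conda create -n olmocr python=3.11
conda activate olmocr

# Install olmOCR from source in editable mode
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
```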
4. (Optional) Sglang Installation for GPU Inference:
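For GPU inference, the README installs sglang together with a flashinfer wheel matched to your CUDA and torch versions. The exact version pins change frequently, so treat the ones below as illustrative and check the repository for the current commands:

```shell
# Illustrative pins; consult the olmOCR repo for up-to-date versions
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```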
5. Local Usage Example:
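A minimal local run looks like the following. Here the first argument is the workspace directory where intermediate and final outputs are stored, and the sample PDF path assumes you are running from a checkout of the repository:

```shell
# Convert a single (deliberately difficult) test PDF
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```

The extracted text ends up inside the workspace as JSONL result files.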
You can also convert multiple PDFs by using the following command:
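Passing a glob pattern converts every matching file; the directory below is the repository's test folder and stands in for wherever your PDFs live:

```shell
# Convert every PDF matching the glob
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```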
Scaling Up: Multi-Node Cluster Usage with olmOCR
For large-scale PDF processing, olmOCR supports multi-node clusters using AWS S3 for storage and coordination. This makes it ideal for organizations dealing with massive document archives.
Example Command (First Worker Node):
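On the first worker node, point the pipeline at a shared S3 workspace and queue the PDFs to process. The bucket and prefix names below are placeholders:

```shell
# First node: creates the shared work queue in S3
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_bucket/pdfs/*.pdf
```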
Example Command (Subsequent Worker Nodes):
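Subsequent workers only need the shared S3 workspace; they pull pending work items from the queue. The bucket name is again a placeholder:

```shell
# Additional nodes: attach to the existing workspace and pick up work
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/exampleworkspace
```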
Diving Deeper: Exploring the olmOCR Pipeline
The `olmocr.pipeline` module provides extensive options for customizing your PDF processing workflow.
Key Parameters:
- `--pdfs`: Path to the PDF files to process. Can be a glob pattern or a file containing a list of paths.
- `--workspace`: Location where work will be stored. Can be a local folder or an S3 path.
- `--apply_filter`: Applies filtering for English-language PDFs to reduce noise and SEO spam.
- `--model`: Model path(s) to use for processing.
- `--workers`: Number of worker threads to use.
For complete documentation, run:
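```shell
python -m olmocr.pipeline --help
```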
License and Citation
olmOCR is licensed under Apache 2.0, promoting open collaboration and innovation.
Citation:
```bibtex
@misc{olmocr,
  title={olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models},
  author={Jake Poznanski and Jon Borchardt and Jason Dunkelberger and Regan Huff and Daniel Lin and Aman Rangapur and Christopher Wilhelm and Kyle Lo and Luca Soldaini},
  year={2025},
  eprint={2502.18443},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.18443},
}
```
Unlock the insights hidden within your PDFs today with olmOCR!