Unleash the Power of PDFs: Introducing olmOCR for AI-Powered Document Understanding

Tired of wrestling with messy PDFs and inaccurate OCR? olmOCR, a toolkit from Ai2 (the Allen Institute for AI), is built to help language models work with PDF documents in the wild. It tackles the complexities of PDF processing and opens up new possibilities for data extraction and analysis. Ready to transform your PDF workflow?

What Can olmOCR Do for You? Key Features & Benefits

Next-Level Text Parsing with ChatGPT 4o: Use a purpose-built prompting strategy to extract clean, natural text from PDFs, aiming for higher accuracy than traditional OCR methods.
Streamlined Evaluation: Easily compare different pipeline versions side-by-side with the built-in evaluation toolkit (runeval.py), ensuring optimal performance. No more guesswork – see the improvements firsthand!
Intelligent Filtering: Automatically filter out irrelevant content, including language mismatches and SEO spam, thanks to the built-in filtering system (filter.py). Focus only on the data that matters.
Fine-Tuned Models for Peak Performance: Leverage fine-tuning code for state-of-the-art models like Qwen2-VL and Molmo-O (train.py), maximizing accuracy and efficiency for your specific needs. olmOCR empowers you to customize the models to your document types.
Massive-Scale PDF Processing with Sglang: Process millions of PDFs efficiently using a fine-tuned model and Sglang (pipeline.py). Scale your operations without compromising accuracy or speed.
Intuitive Result Viewing: Easily access and review extracted text in Dolma-style JSONL format, and visualize the output alongside the original PDFs using the dolmaviewer.py tool.

Get Started: Installation & Local Usage

Ready to dive in? Here's what you'll need:

Prerequisites:

Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 20 GB of GPU RAM
30 GB of free disk space
poppler-utils and additional fonts for rendering PDF images
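
Before installing, it can help to check a machine against the prerequisites above. The following is a minimal sketch, not part of olmOCR: the function names are hypothetical, `pdftoppm` is assumed to be a reasonable probe for poppler-utils being on the PATH, and GPU memory is read from `nvidia-smi` if it is available. It does not check fonts.

```python
import shutil
import subprocess


def find_problems(free_disk_gb, gpu_mem_gb, has_poppler):
    """Compare measured values against the prerequisites listed above."""
    problems = []
    if gpu_mem_gb is None:
        problems.append("no NVIDIA GPU detected")
    elif gpu_mem_gb < 20:
        problems.append(f"GPU has {gpu_mem_gb:.0f} GB of memory, need at least 20 GB")
    if free_disk_gb < 30:
        problems.append(f"only {free_disk_gb:.0f} GB of disk free, need 30 GB")
    if not has_poppler:
        problems.append("poppler-utils not found on PATH")
    return problems


def largest_gpu_memory_gb():
    """Return the memory of the largest visible GPU in GB, or None if unknown."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    values = []
    for token in result.stdout.split():
        try:
            values.append(float(token))
        except ValueError:
            continue  # skip anything that is not a plain number
    # nvidia-smi reports MiB; convert to decimal GB.
    return max(values) * 1024**2 / 1e9 if values else None


def check_machine(path="."):
    """Gather real measurements for this machine and report any problems."""
    free_gb = shutil.disk_usage(path).free / 1e9
    has_poppler = shutil.which("pdftoppm") is not None  # assumed probe binary
    return find_problems(free_gb, largest_gpu_memory_gb(), has_poppler)


if __name__ == "__main__":
    for problem in check_machine():
        print("WARNING:", problem)
```

The thresholds live in `find_problems` so they can be checked without a GPU present; `check_machine` only gathers the measurements.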

Installation Steps (Ubuntu/Debian):

Update your package list:
```
sudo apt-get update
```

Install dependencies:

```
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```

Set up a conda environment:

```
conda create -n olmocr python=3.11
conda activate olmocr
```

Clone the olmOCR repository and install it in editable mode:

```
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
```

Install sglang with flashinfer for GPU inference (optional but recommended):

```
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```
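
Because these installs pin exact versions, a quick check that the environment matches the pins can save debugging time later. This is a small sketch using only the standard library; the helper names are hypothetical and the pins are copied from the pip commands above.

```python
from importlib import metadata

# Pins copied from the pip commands above.
PINNED = {"sgl-kernel": "0.0.3.post1", "sglang": "0.4.2"}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that differs from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run it inside the activated olmocr conda environment; no output means both packages match their pins.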

Local Usage Example:

Convert a single PDF:

```
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```

Convert multiple PDFs:

```
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```

(Results are stored as JSONL files in ./localworkspace/results.)

View the extracted content:

```
cat localworkspace/results/output_*.jsonl
```
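
Raw JSONL is hard to skim with cat alone. The sketch below prints one short preview line per extracted document. It assumes each line of the output files is a JSON object with Dolma-style `id` and `text` keys; the helper names are hypothetical and not part of olmOCR.

```python
import glob
import json


def iter_documents(pattern):
    """Yield one dict per non-empty line across all files matching the glob."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def summarize(doc, width=80):
    """Return a one-line preview; 'id' and 'text' are assumed Dolma-style keys."""
    # Collapse newlines and runs of spaces so the preview fits on one line.
    text = " ".join(str(doc.get("text", "")).split())
    return f"{doc.get('id', '?')}: {text[:width]}"


if __name__ == "__main__":
    for doc in iter_documents("localworkspace/results/output_*.jsonl"):
        print(summarize(doc))
```

Missing keys fall back to placeholders rather than raising, so the script still runs if the real schema differs from the assumption.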

Preview results alongside the original PDFs:
```
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
```
Then, open ./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html in your browser.

Scaling Up: Multi-Node & Cluster Usage for Large Datasets

Need to process millions of PDFs? olmOCR supports distributed processing using AWS S3 for both input and output. This allows for efficient parallel processing across multiple nodes.

Basic Multi-Node Example:

Worker Node 1:

```
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
```

This command sets up a work queue in your S3 bucket and starts converting PDFs.

Subsequent Worker Nodes:
```
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
```
These nodes will automatically grab work items from the queue.

Ai2 Beaker Integration:

If you're at Ai2, you can efficiently process PDFs using Beaker:

```
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4
```

This prepares the workspace locally and launches the requested number of GPU workers in the cluster (4 in this example, set with --beaker_gpus).
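
The three invocations in this section differ only in which flags they pass: the first worker seeds the queue with --pdfs, later workers pass only the workspace, and the Beaker variant adds --beaker and --beaker_gpus. If you generate these commands from a script, a small helper like the following sketch keeps them consistent. The function name is hypothetical; it uses only the flags shown above and does not run anything.

```python
import shlex


def build_pipeline_command(workspace, pdfs=None, beaker_gpus=None):
    """Return the argv list for one olmocr.pipeline invocation."""
    argv = ["python", "-m", "olmocr.pipeline", workspace]
    if pdfs is not None:
        # Only the node that seeds the work queue passes --pdfs.
        argv += ["--pdfs", pdfs]
    if beaker_gpus is not None:
        argv += ["--beaker", "--beaker_gpus", str(beaker_gpus)]
    return argv


if __name__ == "__main__":
    workspace = "s3://my_s3_bucket/pdfworkspaces/exampleworkspace"
    pdfs = "s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf"
    commands = (
        build_pipeline_command(workspace, pdfs=pdfs),                 # first worker
        build_pipeline_command(workspace),                            # later workers
        build_pipeline_command(workspace, pdfs=pdfs, beaker_gpus=4),  # Beaker launch
    )
    for argv in commands:
        # shlex.join quotes the glob so the local shell does not expand it.
        print(shlex.join(argv))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting concerns.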

Unlock the Potential of Your PDFs with olmOCR

olmOCR provides a comprehensive toolkit for processing PDFs at any scale. From improved OCR accuracy to scalable cluster deployment, it helps you unlock data that was previously locked inside your files, even across billions of tokens. Start using olmOCR today and transform your document understanding capabilities.