Unleash the Power of PDFs: Introducing olmOCR for AI-Powered Document Understanding
Tired of wrestling with messy PDFs and inaccurate OCR? olmOCR, a cutting-edge toolkit from AllenAI, is here to revolutionize how language models interact with documents in the wild. Built to leverage the latest advancements in AI, it tackles the complexities of PDF processing, opening up a world of possibilities for data extraction and analysis. Ready to transform your PDF workflow?
What Can olmOCR Do for You? Key Features & Benefits
- Next-Level Text Parsing with ChatGPT 4o: A prompting strategy built around ChatGPT 4o extracts natural, readable text from PDFs, aiming for higher accuracy than traditional OCR methods.
- Streamlined Evaluation: Easily compare different pipeline versions side-by-side with the built-in evaluation toolkit (runeval.py), ensuring optimal performance. No more guesswork: see the improvements firsthand!
- Intelligent Filtering: Automatically filter out irrelevant content, including language mismatches and SEO spam, thanks to the built-in filtering system (filter.py). Focus only on the data that matters.
- Fine-Tuned Models for Peak Performance: Leverage fine-tuning code for state-of-the-art models like Qwen2-VL and Molmo-O (train.py), maximizing accuracy and efficiency for your specific needs. olmOCR empowers you to customize the models to your document types.
- Massive-Scale PDF Processing with Sglang: Process millions of PDFs efficiently using a fine-tuned model and Sglang (pipeline.py). Scale your operations without compromising accuracy or speed.
- Intuitive Result Viewing: Easily access and review extracted text in Dolma-style JSONL format, and visualize the output alongside the original PDFs using the dolmaviewer.py tool.
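To make the Dolma-style JSONL output more concrete, here is a minimal Python sketch that reads every .jsonl file under a workspace directory and prints a one-line summary per document. It assumes each line is a standalone JSON object with "id" and "text" fields, as in the general Dolma document layout; the exact fields olmOCR writes may differ, and the helper names iter_documents and summarize are made up for this example.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(workspace: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of every .jsonl file under `workspace`."""
    for path in sorted(Path(workspace).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)


def summarize(doc: dict, width: int = 60) -> str:
    """Return a one-line summary; "id" and "text" are assumed field names."""
    doc_id = doc.get("id", "<no id>")
    text = doc.get("text", "")
    # Collapse runs of whitespace so the preview fits on one line.
    preview = " ".join(text.split())[:width]
    return f"{doc_id}: {len(text)} chars | {preview}"
```

Calling summarize on each item from iter_documents("./localworkspace") gives a quick overview of what was extracted before you open the full viewer.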
Get Started: Installation & Local Usage
Ready to dive in? Here's what you'll need:
Prerequisites:
- Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 20 GB of GPU RAM
- 30 GB of free disk space
- poppler-utils and additional fonts for rendering PDF images
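Before installing, it can help to check the prerequisites above programmatically. The following Python sketch is only a rough pre-flight check, not part of olmOCR: the function name check_prerequisites is hypothetical, it looks for the pdftoppm and pdfinfo tools that ship with poppler-utils, and it measures free disk space in decimal gigabytes. Finding nvidia-smi on the PATH only hints that an NVIDIA driver is present; it does not verify the 20 GB GPU memory requirement.

```python
import shutil


def check_prerequisites(path: str = ".", min_free_gb: float = 30.0) -> dict:
    """Report whether poppler's tools and enough free disk space are available."""
    # Free space on the filesystem containing `path`, in decimal gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        # pdftoppm and pdfinfo are command-line tools shipped with poppler-utils.
        "pdftoppm": shutil.which("pdftoppm") is not None,
        "pdfinfo": shutil.which("pdfinfo") is not None,
        "free_disk_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
        # Presence of nvidia-smi suggests a driver is installed; it does not
        # confirm how much GPU memory is available.
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
    }
```

Any entry that comes back False (or a free_disk_gb value below 30) points to a prerequisite worth fixing before you continue.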
Installation Steps (Ubuntu/Debian):
- Update your package list:
- Install dependencies:
- Set up a conda environment:
- Clone the olmOCR repository:
- Install sglang with flashinfer for GPU inference (optional but recommended):
Local Usage Example:
- Convert a single PDF:
- Convert multiple PDFs:
  (Results are stored as JSON in ./localworkspace.)
- View the extracted content:
- Preview results alongside the original PDFs:
  Then, open ./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html in your browser.
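If you would rather drive the conversion from a script, the sketch below assembles and runs the pipeline command from Python. It assumes the pipeline is invoked as python -m olmocr.pipeline followed by a workspace path and a --pdfs list, which matches the project's README at the time of writing; check the module's --help output for your installed version. The helper names build_pipeline_command and run_pipeline, and the PDF path in the usage note, are placeholders for illustration.

```python
import subprocess
import sys
from typing import List, Sequence


def build_pipeline_command(workspace: str, pdfs: Sequence[str] = ()) -> List[str]:
    """Assemble an argument list for the olmOCR pipeline module (assumed CLI shape)."""
    command = [sys.executable, "-m", "olmocr.pipeline", workspace]
    if pdfs:
        # Assumption: input files are passed after a --pdfs flag.
        command.append("--pdfs")
        command.extend(pdfs)
    return command


def run_pipeline(workspace: str, pdfs: Sequence[str] = ()) -> None:
    """Run the pipeline and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pipeline_command(workspace, pdfs), check=True)
```

For example, run_pipeline("./localworkspace", ["path/to/document.pdf"]) would convert one file into the local workspace. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.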
Scaling Up: Multi-Node & Cluster Usage for Large Datasets
Need to process millions of PDFs? olmOCR supports distributed processing using AWS S3 for both input and output. This allows for efficient parallel processing across multiple nodes.
Basic Multi-Node Example:
- Worker Node 1:
  This command sets up a work queue in your S3 bucket and starts converting PDFs.
- Subsequent Worker Nodes:
  These nodes will automatically grab work items from the queue.
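The shared work-queue idea described above can be illustrated with a small in-memory Python sketch. This is not olmOCR's implementation, which coordinates through an S3 bucket; here a lock-protected list stands in for the bucket, threads stand in for worker nodes, and the names WorkQueue and run_worker are invented for the example. The first step splits the PDFs into batches (work items), and each worker keeps claiming batches until none remain.

```python
import threading
from typing import Dict, List, Optional, Sequence


class WorkQueue:
    """In-memory stand-in for a shared queue of PDF batches (illustrative only)."""

    def __init__(self, pdf_paths: Sequence[str], batch_size: int = 2) -> None:
        self._lock = threading.Lock()
        # Split the input into fixed-size work items, as the first worker would.
        self._pending: List[List[str]] = [
            list(pdf_paths[i:i + batch_size])
            for i in range(0, len(pdf_paths), batch_size)
        ]
        self.completed: Dict[str, List[List[str]]] = {}

    def claim(self) -> Optional[List[str]]:
        """Atomically hand out the next unclaimed work item, or None when empty."""
        with self._lock:
            return self._pending.pop(0) if self._pending else None

    def mark_done(self, worker: str, item: List[str]) -> None:
        """Record that `worker` finished `item`."""
        with self._lock:
            self.completed.setdefault(worker, []).append(item)


def run_worker(name: str, queue: WorkQueue) -> None:
    """Keep claiming work items until the queue is drained."""
    while (item := queue.claim()) is not None:
        # A real worker would convert each PDF in `item` here.
        queue.mark_done(name, item)
```

Because claiming is atomic, no batch is handed to two workers, and adding workers only changes how quickly the queue drains. A storage-backed queue needs an equivalent guarantee, for example through lock or marker files.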
Ai2 Beaker Integration:
If you're at Ai2, you can efficiently process PDFs using Beaker:
This prepares the workspace locally and launches N GPU workers in the cluster.
Unlock the Potential of Your PDFs with olmOCR
olmOCR provides a comprehensive toolkit for processing PDFs at any scale, from more accurate text extraction on a single GPU machine to cluster deployments that work through millions of documents. Start using olmOCR today to unlock data that was previously trapped inside your PDFs.