Extract Data from PDFs Like a Pro: A Guide to MinerU for Machine-Readable Conversions

Want to transform PDFs into easily usable data? MinerU is an open-source tool designed to convert PDFs into machine-readable formats like Markdown and JSON, simplifying data extraction. If you're looking for accurate PDF data extraction and an efficient way to handle scientific documents, MinerU might be your solution.

Why Choose MinerU for PDF Conversion?

MinerU stands out with features tailored for scientific literature and complex layouts. It strips away irrelevant elements and preserves document structure, making it ideal for researchers and data scientists. Plus, MinerU supports OCR in over 80 languages and runs on CPU or with GPU acceleration.

Key Benefits of Using MinerU:

Removes headers, footers, and page numbers for clean data.
Preserves document structure, including headings and lists.
Extracts images, tables, and formulas.
Supports multiple output formats (Markdown, JSON).
Offers layout and span visualization for quality control.
Can run on CPU or GPU (CUDA/NPU/MPS).
Cross-platform compatibility (Windows, Linux, macOS).

New Features and Updates in MinerU

MinerU is updated frequently with new features and performance improvements. Version 1.3.8 upgrades the default OCR model to PP-OCRv4_server_rec_doc, which improves recognition of Chinese, Japanese, and special characters.

Recent Improvements:

Improved OCR Model: PP-OCRv4_server_rec_doc enhances recognition for diverse characters.
Speed Optimization: Faster performance in CPU mode and optimized OCR detection.
Compatibility Fixes: Resolved dependency issues in Python 3.13 on Windows.
Memory Efficiency: Reduced memory usage during batch processing.
Table Parsing: Improved parsing of rotated tables and better table accuracy in financial reports.
Word Concatenation Fix: Resolved occasional word concatenation in English text.

Getting Started with MinerU: Installation and Usage

Ready to try MinerU? Here's how to get started, from online demos to accelerated GPU inference.

Choose Your MinerU Experience:

Online Demo: No installation needed, try MinerU instantly.
Quick CPU Demo: Easy setup for Windows, Linux, and Mac.
GPU Acceleration (CUDA/CANN/MPS): For faster processing.

Quick CPU Demo Instructions:

Install magic-pdf:

conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]"

Download model weight files: Follow the detailed instructions in the documentation.
Modify the configuration file: Adjust magic-pdf.json (created in your user directory) to enable/disable features like table recognition.
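The configuration step above can also be scripted. The following is a minimal sketch, assuming magic-pdf.json is a JSON object whose feature sections each carry an "enable" flag; the section name "table-config" used in the usage note is an assumption, so check the keys in your own file before relying on it.

```python
import json
from pathlib import Path


def set_feature_enabled(config_path: Path, section: str, enabled: bool) -> dict:
    """Set the "enable" flag of one config section and write the file back."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Create the section if it is missing, then change only its "enable" flag
    # so any other settings stored in that section are preserved.
    config.setdefault(section, {})["enable"] = enabled
    config_path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

For example, set_feature_enabled(Path.home() / "magic-pdf.json", "table-config", True) would turn table recognition on under these assumptions. Editing the file by hand achieves the same result.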

Enable GPU Acceleration for Faster Processing

If you have a CUDA-enabled GPU, you can use it to speed up PDF conversion. Follow the project's setup guides for Ubuntu or Windows.

You can also use Docker for quick deployment; make sure your GPU has at least 6 GB of VRAM.
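The VRAM requirement can be checked before building anything. This is a rough sketch, assuming an NVIDIA GPU with nvidia-smi on the PATH and treating 6 GB as 6144 MiB; the function names are illustrative and not part of MinerU.

```python
import shutil
import subprocess

MIN_VRAM_MIB = 6 * 1024  # assumption: the 6 GB minimum expressed in MiB


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one memory.total value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def gpus_with_enough_vram(smi_output: str, minimum_mib: int = MIN_VRAM_MIB) -> list[int]:
    """Return the indices of GPUs whose total memory meets the minimum."""
    return [
        index
        for index, mib in enumerate(parse_vram_mib(smi_output))
        if mib >= minimum_mib
    ]


def query_vram() -> str:
    """Ask nvidia-smi for total memory per GPU; returns "" if the tool is absent."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling gpus_with_enough_vram(query_vram()) returns an empty list when no suitable GPU is found, which is a reasonable signal to fall back to the CPU setup.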

Docker Deployment:

Check CUDA support:

docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Build and run the Docker image:

wget https://github.com/opendatalab/MinerU/raw/master/docker/global/Dockerfile -O Dockerfile
docker build -t mineru:latest .
docker run -it --name mineru --gpus=all mineru:latest /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
magic-pdf --help

MinerU Usage: Command Line and Python API

MinerU offers flexible usage options to fit your workflow, whether you prefer command-line operations or Python scripting.

Command-Line Interface:

Use the magic-pdf command followed by appropriate arguments to convert PDFs.
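As an illustration, a single conversion can be wrapped in a small helper. This sketch assumes the -p (input path), -o (output directory), and -m (parsing method) flags from the project's documented usage; run magic-pdf --help to confirm them for your installed version. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_magic_pdf_command(pdf: Path, output_dir: Path, method: str = "auto") -> list[str]:
    """Build the argument list for one magic-pdf conversion."""
    # Flags are assumed from the documented CLI usage; verify with --help.
    return ["magic-pdf", "-p", str(pdf), "-o", str(output_dir), "-m", method]


def convert(pdf: Path, output_dir: Path, method: str = "auto") -> None:
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_magic_pdf_command(pdf, output_dir, method), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.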

Python API:

Integrate MinerU into your Python projects for automated PDF data extraction tasks.
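For automated jobs, the surrounding batch logic often matters as much as the conversion call itself. The sketch below does not use MinerU's own Python API; it takes a hypothetical convert_one callable (for example, a thin wrapper around the magic-pdf command or around the library's API) and applies it to every PDF in a folder, recording failures instead of stopping at the first one.

```python
from pathlib import Path
from typing import Callable


def convert_folder(
    input_dir: Path,
    output_dir: Path,
    convert_one: Callable[[Path, Path], None],
) -> dict[str, list[Path]]:
    """Convert every PDF in input_dir, recording successes and failures."""
    results: dict[str, list[Path]] = {"converted": [], "failed": []}
    output_dir.mkdir(parents=True, exist_ok=True)
    # Sorted for a predictable order; only top-level "*.pdf" files are matched.
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            convert_one(pdf, output_dir)
        except Exception:
            # One unreadable PDF should not abort the whole batch.
            results["failed"].append(pdf)
        else:
            results["converted"].append(pdf)
    return results
```

Keeping the converter injectable makes the batch loop easy to test with a stand-in function and lets you swap the command-line tool for the Python API later without touching the loop.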

Addressing Known Issues and Seeking Support

MinerU has some known limitations. Consult the FAQ and the Known Issues section of the project documentation for troubleshooting.

Common Issues:

Reading order issues in complex layouts.
Lack of support for vertical text.
Potential table recognition errors.
OCR inaccuracies in lesser-known languages.

Contributing and Acknowledging MinerU

MinerU thrives on community contributions. If you encounter issues or have suggestions, submit them on GitHub Issues. MinerU acknowledges and appreciates the contributions from the open-source community.

License and Acknowledgments:

MinerU uses PyMuPDF for advanced functionality. PyMuPDF is AGPL-licensed, which may restrict some usage scenarios.
Acknowledgments to projects like PDF-Extract-Kit, DocLayout-YOLO, and PaddleOCR.

By using MinerU, you're joining a community dedicated to improving PDF to machine-readable data conversion. Start extracting valuable insights from your documents today!