Unlock the Power of Your PDFs: Introducing MinerU for Flawless Data Extraction

Struggling to extract data from PDFs? MinerU, an open-source solution, converts PDFs into machine-readable formats like Markdown and JSON, making data extraction a breeze. Say goodbye to manual data entry and hello to streamlined workflows. Unlock the potential hidden within your documents and contribute to the future of large language models.

Why MinerU? Key Features That Set Us Apart

MinerU excels where others fall short. Here's what makes it the go-to tool for PDF data extraction:

Semantic Coherence: Effortlessly removes headers, footers, and footnotes for consistent, meaningful text.
Intelligent Layout Processing: Accurately outputs text in the correct reading order, even in complex multi-column layouts.
Structural Preservation: Maintains headings, paragraphs, and list structures for easy readability.
Comprehensive Data Extraction: Extracts images, descriptions, tables (including titles and footnotes), and formulas in LaTeX format.
Automatic OCR: Automatically detects scanned or corrupted PDFs and activates OCR for accurate text recognition in 84 languages.
Versatile Output Formats: Supports Markdown, JSON, and rich intermediate formats for diverse applications.
Visualization Tools: Offers layout and span visualizations to ensure output quality.
Platform Flexibility: Compatible with Windows, Linux, and Mac, supporting both CPU and GPU/NPU/MPS acceleration.

Experience MinerU: Three Ways to Get Started

Ready to unleash the power of MinerU? Choose the setup that best suits your needs:

Online Demo (No Installation): Try MinerU on our website.
Quick CPU Demo (Windows, Linux, Mac): Ideal for initial testing and exploration.
Accelerated Inference (CUDA/CANN/MPS): Optimize performance with GPU or NPU acceleration for demanding tasks.

Supercharge Your Workflow with MinerU: Real-World Examples

MinerU empowers you to extract valuable information from PDFs, enabling you to:

Automate Data Entry: Seamlessly transfer data from invoices, reports, and other PDFs into your systems.
Enhance Research: Extract scientific formulas in LaTeX format to use in research papers and presentations.
Build Knowledge Bases: Convert legal documents, technical manuals, and academic articles into structured knowledge bases.
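As a rough illustration of the research use case, the sketch below pulls display formulas out of a converted Markdown file. It assumes the Markdown output wraps display formulas in `$$ ... $$` delimiters, which this article does not state; the function name `extract_display_formulas` is hypothetical and not part of MinerU.

```python
import re

# Assumption: display formulas in the Markdown output are delimited by $$ ... $$.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def extract_display_formulas(markdown: str) -> list[str]:
    """Return the LaTeX source of every $$-delimited formula, in document order."""
    return [match.strip() for match in DISPLAY_MATH.findall(markdown)]


if __name__ == "__main__":
    sample = "Mass-energy:\n$$\nE = mc^2\n$$\nInline $x$ stays.\n$$a^2 + b^2 = c^2$$\n"
    for formula in extract_display_formulas(sample):
        print(formula)
```

Inline formulas written with single `$` delimiters are left alone here; handling them would need a separate pattern.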

MinerU: Installation and Optimization

Quick CPU Demo: Step-by-Step Guide

Install magic-pdf:

conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]"
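To confirm the install landed in the active environment, a small check along these lines may help. It uses only the Python standard library; the helper name `installed_version` is hypothetical, and the sketch assumes the distribution is registered under the name `magic-pdf`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("magic-pdf")
    if found is None:
        print("magic-pdf is not installed in this environment")
    else:
        print(f"magic-pdf {found} is installed")
```

If the package is reported missing, the most likely cause is that the `mineru` conda environment is not activated.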

Download Model Weight Files: Follow the project's model download instructions to fetch the required model files. A magic-pdf.json file is then generated automatically in your user directory (Windows: C:\Users\username, Linux: /home/username, macOS: /Users/username).

Modify the Configuration File: Edit magic-pdf.json to customize settings. The "enable" flags under formula-config and table-config turn formula and table recognition on or off (both default to true); adjust either section to fit your needs.

{
  "layout-config": {
    "model": "doclayout_yolo"
  },
  "formula-config": {
    "mfd_model": "yolo_v8_mfd",
    "mfr_model": "unimernet_small",
    "enable": true
  },
  "table-config": {
    "model": "rapid_table",
    "sub_model": "slanet_plus",
    "enable": true,
    "max_time": 400
  }
}
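The same change can be scripted rather than made by hand. The following is a minimal sketch, assuming magic-pdf.json is strict JSON (no `//` comments) and lives in your user directory as described above; `set_feature` is a hypothetical helper, not part of MinerU.

```python
import json
from pathlib import Path


def set_feature(config_path: Path, section: str, enabled: bool) -> dict:
    """Set the `enable` flag in one config section and write the file back."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if section not in config:
        raise KeyError(f"{section!r} not found in {config_path}")
    config[section]["enable"] = enabled
    # Rewrite the whole file; all other keys are preserved as loaded.
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

For example, calling `set_feature(Path.home() / "magic-pdf.json", "table-config", False)` would switch table recognition off while leaving the model settings untouched. Back up the file first if you have customized it.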

GPU Acceleration: Unleash Maximum Performance

If your system meets the GPU requirements, use CUDA or MPS acceleration for significantly faster parsing.
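One way to check an NVIDIA GPU against the 6GB VRAM figure given for the Docker setup in this guide is to query `nvidia-smi` from Python. This is a sketch under assumptions: the helper names are hypothetical, the threshold of 6144 MiB is my reading of "6GB", and some 6GB cards report slightly less than that, so treat a near miss as a prompt to check manually.

```python
import subprocess

MIN_VRAM_MIB = 6 * 1024  # assumed interpretation of the 6GB guideline


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one total-memory value (MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for each GPU's total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def has_enough_vram(totals_mib: list[int], minimum_mib: int = MIN_VRAM_MIB) -> bool:
    """True if at least one GPU meets the minimum."""
    return any(total >= minimum_mib for total in totals_mib)


if __name__ == "__main__":
    try:
        totals = query_vram_mib()
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        print("Could not read GPU memory from nvidia-smi")
    else:
        print("GPU memory (MiB):", totals)
        print("Meets the 6GB guideline:", has_enough_vram(totals))
```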

Docker Deployment: Streamlined Setup

Deploy MinerU using Docker for a quick and easy setup (requires a GPU with at least 6GB of VRAM):

docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
wget https://github.com/opendatalab/MinerU/raw/master/docker/global/Dockerfile -O Dockerfile
docker build -t mineru:latest .
docker run -it --name mineru --gpus=all mineru:latest /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
magic-pdf --help
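Once `magic-pdf --help` works, a conversion can be driven from Python by shelling out to the CLI. The sketch below is illustrative only: the `-p` (input PDF), `-o` (output directory), and `-m auto` (parse method) flags are not shown in this article and are an assumption based on the tool's command-line help, so confirm them with `magic-pdf --help` before relying on this. The helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf: Path, out_dir: Path, method: str = "auto") -> list[str]:
    """Build the CLI invocation. The -p/-o/-m flags are assumed; verify with --help."""
    return ["magic-pdf", "-p", str(pdf), "-o", str(out_dir), "-m", method]


def convert(pdf: Path, out_dir: Path) -> None:
    """Run the conversion, failing early if the CLI is not on PATH."""
    if shutil.which("magic-pdf") is None:
        raise RuntimeError("magic-pdf not found on PATH; activate the environment first")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(pdf, out_dir), check=True)
```

Keeping command construction separate from execution makes it easy to log or inspect the exact command before running it on a large batch of files.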

Take control of your PDFs today!

What’s New? MinerU Updates and Improvements

MinerU is continuously evolving with frequent updates and optimizations. The most recent updates include:

Version 1.3.7 (2025/04/22): Bug fixes and performance enhancements.
Version 1.3.4 (2025/04/16): Improved OCR detection and fixed page sorting issues.
Version 1.3.2 (2025/04/12): Enhanced dependency management, memory usage, and parsing accuracy for rotated tables. Solved issues with word concatenation in English text.
Version 1.3.0 (2025/04/03): Comprehensive optimizations for installation, compatibility, performance, parsing quality, and usability, including a formula parsing speed increase of more than 1400%.

Addressing Common Challenges with MinerU

While MinerU strives for perfection, some limitations exist:

Complex layouts may occasionally result in reading order errors.
Vertical text is not currently supported.
Certain uncommon list formats may not be recognized.
Code block recognition is under development.
Performance may vary with comic books, art albums, textbooks, and exercises.

Join the MinerU Community

We welcome you to contribute to the MinerU project. Your feedback helps us improve and expand its capabilities.

Disclaimer

This project utilizes PyMuPDF, which is licensed under AGPL. Be aware of the licensing implications for specific use cases. We are exploring alternative PDF processing libraries for greater user flexibility in the future.