Unlock the Power of Your PDFs: Introducing MinerU for Flawless Data Extraction
Struggling to extract data from PDFs? MinerU, an open-source solution, converts PDFs into machine-readable formats like Markdown and JSON, making data extraction a breeze. Say goodbye to manual data entry and hello to streamlined workflows. Unlock the potential hidden within your documents and contribute to the future of large language models.
Why MinerU? Key Features That Set Us Apart
MinerU excels where others fall short. Here's what makes it the go-to tool for PDF data extraction:
- Semantic Coherence: Effortlessly removes headers, footers, and footnotes for consistent, meaningful text.
- Intelligent Layout Processing: Accurately outputs text in the correct reading order, even in complex multi-column layouts.
- Structural Preservation: Maintains headings, paragraphs, and list structures for easy readability.
- Comprehensive Data Extraction: Extracts images, image descriptions, tables (including titles and footnotes), and formulas converted to LaTeX format.
- Automatic OCR: Automatically detects scanned PDFs and PDFs with garbled text, and activates OCR with text recognition in 84 languages.
- Versatile Output Formats: Supports Markdown, JSON, and rich intermediate formats for diverse applications.
- Visualization Tools: Offers layout and span visualizations to ensure output quality.
- Platform Flexibility: Compatible with Windows, Linux, and Mac, supporting both CPU and GPU/NPU/MPS acceleration.
Experience MinerU: Three Ways to Get Started
Ready to unleash the power of MinerU? Choose the setup that best suits your needs:
- Online Demo (No Installation): Try MinerU on our website.
- Quick CPU Demo (Windows, Linux, Mac): Ideal for initial testing and exploration.
- Accelerated Inference (CUDA/CANN/MPS): Optimize performance with GPU or NPU acceleration for demanding tasks.
Supercharge Your Workflow with MinerU: Real-World Examples
MinerU empowers you to extract valuable information from PDFs, enabling you to:
- Automate Data Entry: Seamlessly transfer data from invoices, reports, and other PDFs into your systems.
- Enhance Research: Extract scientific formulas in LaTeX format to use in research papers and presentations.
- Build Knowledge Bases: Convert legal documents, technical manuals, and academic articles into structured knowledge bases.
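For the knowledge-base use case, one possible first step is to split the Markdown output into heading-delimited sections that can be indexed individually. The sketch below is an illustration only and not part of MinerU: `split_markdown_sections` is a hypothetical helper, it assumes the output uses `#`-style headings, and it does not account for `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List

# A heading is 1-6 '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_sections(markdown: str) -> List[Dict[str, object]]:
    """Split Markdown text into one record per heading (hypothetical helper).

    Text before the first heading is returned as a section with level 0
    and an empty title.
    """
    sections = []
    current = {"level": 0, "title": "", "lines": []}

    def has_content(section):
        # Keep a section if it has a title or at least one non-blank line.
        return bool(section["title"]) or any(l.strip() for l in section["lines"])

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if has_content(current):
                sections.append(current)
            current = {
                "level": len(match.group(1)),
                "title": match.group(2),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    if has_content(current):
        sections.append(current)

    return [
        {"level": s["level"], "title": s["title"], "body": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Each returned record carries the heading level, the title, and the body text, which is usually enough to feed a search index or an embedding pipeline.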
MinerU: Installation and Optimization
Quick CPU Demo: Step-by-Step Guide
- Install magic-pdf.
- Download Model Weight Files: Follow the detailed instructions to download the necessary model files. A magic-pdf.json file will be automatically generated in your user directory (Windows: C:\Users\username, Linux: /home/username, macOS: /Users/username).
- Modify the Configuration File: Edit magic-pdf.json to customize settings, such as enabling or disabling table recognition. Adjust the table-config and formula-config to fit your needs.
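The configuration step can also be scripted. The following is a minimal sketch, assuming that magic-pdf.json is plain JSON and that each of the `table-config` and `formula-config` sections carries a boolean `enable` key. The key name `enable` and the helper `set_feature_enabled` are assumptions made for illustration, so check your generated file before relying on them.

```python
import json
from pathlib import Path


def set_feature_enabled(config_path: Path, section: str, enabled: bool) -> dict:
    """Toggle a feature section in a magic-pdf.json style file (hypothetical helper).

    `section` is a top-level key such as "table-config" or "formula-config".
    The "enable" key inside it is an assumption about the file layout.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Create the section if it is missing; refuse to overwrite a non-object value.
    block = config.setdefault(section, {})
    if not isinstance(block, dict):
        raise TypeError(f"{section!r} is not a JSON object in {config_path}")

    block["enable"] = enabled

    # Write the file back, leaving all other settings untouched.
    config_path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

For example, `set_feature_enabled(Path.home() / "magic-pdf.json", "table-config", False)` would switch table recognition off under these assumptions. Back up the file first if you have customized it by hand.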
GPU Acceleration: Unleash Maximum Performance
If your system meets the GPU requirements, leverage CUDA or MPS acceleration for significantly faster parsing.
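As a rough sketch of how the accelerator choice could be automated, the code below probes PyTorch (if it is installed) for CUDA or MPS support and records the result in the configuration file. The `device-mode` key and the values "cpu", "cuda", and "mps" are assumptions about the magic-pdf.json layout, and both helper functions are hypothetical. Confirm the key against your own generated file.

```python
import json
from pathlib import Path


def choose_device_mode(prefer_gpu: bool = True) -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # optional dependency; fall back to CPU if it is absent
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def set_device_mode(config_path: Path, mode: str) -> dict:
    """Write the chosen mode under the assumed "device-mode" key."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["device-mode"] = mode
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

Calling `set_device_mode(Path.home() / "magic-pdf.json", choose_device_mode())` would then select the fastest backend that PyTorch can see on the current machine and fall back to CPU otherwise.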
Docker Deployment: Streamlined Setup
Deploy MinerU using Docker for a quick and easy setup (requires a GPU with at least 6 GB of VRAM).
Take control of your PDFs today!
What’s New? MinerU Updates and Improvements
MinerU is continuously evolving with frequent updates and optimizations. The most recent updates include:
- Version 1.3.7 (2025/04/22): Bug fixes and performance enhancements.
- Version 1.3.4 (2025/04/16): Improved OCR detection and fixed page sorting issues.
- Version 1.3.2 (2025/04/12): Enhanced dependency management, memory usage, and parsing accuracy for rotated tables. Solved issues with word concatenation in English text.
- Version 1.3.0 (2025/04/03): Comprehensive optimizations for installation, compatibility, performance, parsing quality, and usability. Speed improvements include formula parsing that is more than 1400% faster.
Addressing Common Challenges with MinerU
While MinerU strives for perfection, some limitations exist:
- Complex layouts may occasionally result in reading order errors.
- Vertical text is not currently supported.
- Certain uncommon list formats may not be recognized.
- Code block recognition is under development.
- Parsing quality may be poor for comic books, art albums, textbooks, and exercises.
Join the MinerU Community
We welcome you to contribute to the MinerU project. Your feedback helps us improve and expand its capabilities.
Disclaimer
This project utilizes PyMuPDF, which is licensed under AGPL. Be aware of the licensing implications for specific use cases. We are exploring alternative PDF processing libraries for greater user flexibility in the future.