Ditch Semantic Similarity: Unlock Reasoning-Based RAG with PageIndex
Are you tired of inaccurate results when using vector databases for long, complex documents? Traditional Retrieval-Augmented Generation (RAG) relies on semantic similarity, which often misses the true relevance hidden within professional documents. It's time for a smarter solution: PageIndex.
What is Reasoning-Based RAG and Why Should You Care?
Reasoning-based RAG lets Large Language Models (LLMs) reason their way through a document to find the relevant sections, rather than relying on similarity matching alone. This matters most for documents that demand domain expertise and multi-step reasoning, where it delivers more accurate and relevant results than a simple similarity search.
PageIndex is a document indexing system designed to transform lengthy documents into semantic tree structures, making them ready for powerful reasoning-based RAG.
Key Benefits of PageIndex:
- Enhanced Accuracy: Reasoning-based RAG provides more relevant and accurate results than similarity search, crucial for professional documents.
- Deeper Understanding: Enables LLMs to truly understand the context and relationships within documents.
- Improved Efficiency: Streamlines the retrieval process, saving time and resources.
PageIndex: Your Open-Source Solution for Smarter Document Retrieval
PageIndex, built by Vectify AI, is an open-source document indexing system that builds searchable tree structures from long documents. You can self-host it or try the cloud service, which offers features like OCR for complex PDFs.
Key Features That Set PageIndex Apart:
- Hierarchical Tree Structure: Navigates documents logically, like an intelligent table of contents optimized for LLMs.
- Precise Page Referencing: Pinpoints retrieval with node summaries and precise start/end page indexes.
- Chunk-Free Segmentation: Maintains natural document structure without arbitrary chunking.
- Massive Scalability: Handles hundreds or even thousands of pages with ease.
PageIndex is ideal for:
- Financial reports
- Regulatory filings
- Academic textbooks
- Legal or technical manuals
- Any document exceeding LLM context limits
Understanding the PageIndex Format
Ever wondered how PageIndex organizes information? Here's a glimpse. Imagine a financial report:
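A tree for such a report might look like the sketch below. The field names (`title`, `node_id`, `start_index`, `end_index`, `summary`, `nodes`) and the report contents are assumptions chosen to mirror the features listed earlier (node IDs, node summaries, start/end page indexes); the actual PageIndex output may differ.

```python
# Hypothetical tree for a financial report. Field names and values are
# illustrative assumptions, not PageIndex's confirmed output schema.
report_tree = {
    "title": "Annual Report",
    "node_id": "0000",
    "start_index": 1,
    "end_index": 120,
    "summary": "Full-year results and outlook.",
    "nodes": [
        {
            "title": "Financial Statements",
            "node_id": "0001",
            "start_index": 40,
            "end_index": 95,
            "summary": "Balance sheet, income statement, and cash flow statements.",
            "nodes": [
                {"title": "Income Statement", "node_id": "0002",
                 "start_index": 40, "end_index": 52,
                 "summary": "Revenue, expenses, and net income for the year.",
                 "nodes": []},
                {"title": "Balance Sheet", "node_id": "0003",
                 "start_index": 53, "end_index": 60,
                 "summary": "Assets, liabilities, and equity.",
                 "nodes": []},
            ],
        },
        {"title": "Risk Factors", "node_id": "0004",
         "start_index": 96, "end_index": 110,
         "summary": "Market, credit, and regulatory risks.",
         "nodes": []},
    ],
}


def outline(node, depth=0):
    """Return the tree as indented outline lines with page ranges."""
    line = f"{'  ' * depth}{node['title']} (pp. {node['start_index']}-{node['end_index']})"
    lines = [line]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(report_tree)))
```

Printing the outline gives something close to a table of contents, with each entry tied to a page range that a retriever can fetch later.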
This hierarchical structure allows LLMs to navigate complex documents efficiently.
Get Started with PageIndex Today: A Step-by-Step Guide
Ready to unlock the power of reasoning-based RAG? Here's how to get started with PageIndex:
- Install Dependencies:
pip3 install -r requirements.txt
- Set Your OpenAI API Key: Create a .env file and add your key:
CHATGPT_API_KEY=your_openai_key_here
- Run PageIndex on Your PDF:
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
Customize your processing with these optional arguments:
- --model: OpenAI model to use (default: gpt-4o-2024-11-20)
- --toc-check-pages: Pages to check for table of contents (default: 20)
- --max-pages-per-node: Max pages per node (default: 10)
- --max-tokens-per-node: Max tokens per node (default: 20000)
- --if-add-node-id: Add node ID (yes/no, default: yes)
- --if-add-node-summary: Add node summary (yes/no, default: no)
- --if-add-doc-description: Add doc description (yes/no, default: yes)
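If you process many PDFs, it can help to assemble the command in a script. The sketch below only builds the argument list from the flags listed above; `build_command` is a hypothetical helper written for this example, not part of PageIndex, and it assumes each flag takes a single value.

```python
def build_command(pdf_path, **options):
    """Build the argument list for run_pageindex.py.

    Keyword names use underscores and are converted to the hyphenated
    flags listed above, e.g. max_pages_per_node -> --max-pages-per-node.
    """
    cmd = ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


cmd = build_command(
    "/path/to/your/document.pdf",
    max_pages_per_node=5,
    if_add_node_summary="yes",
)
print(" ".join(cmd))
# To run it from the repository root, pass the list to
# subprocess.run(cmd, check=True) from the standard library.
```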
Cloud API (Beta): Effortless PageIndex Integration
Don't want to self-host? Try the hosted API for PageIndex. Benefit from their custom OCR model for improved accuracy with complex documents. Plus, explore results visually with the web Dashboard – no coding needed! Leave your email to get 1,000 pages free.
Case Study: Mafin 2.5 - Proof of PageIndex's Power
Mafin 2.5, a reasoning-based RAG model built on PageIndex, achieved a remarkable 98.7% accuracy on the FinanceBench benchmark. This significantly outperformed traditional vector-based RAG systems, demonstrating the power of hierarchical indexing for precise content extraction.
Reasoning-Based RAG Framework Example
PageIndex revolutionizes the RAG process. Here's an example of how you can leverage it.
- Query Preprocessing: Analyze the user query to identify the knowledge it requires.
- Document Selection: Retrieve relevant documents and fetch their tree structures from the database.
- Node Selection: Search through tree structures to find relevant nodes.
- LLM Generation: Fetch node contents, format, and send to LLM for contextually informed responses.
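The node selection and generation steps above can be sketched roughly as follows. Everything here is an assumption for illustration: the tree fields, the `select_nodes` and `build_prompt` helpers, and the page fetcher are hypothetical. In particular, `keyword_judge` is a crude keyword-overlap stand-in for the relevance judgment an LLM would make in a real reasoning-based pipeline, and the final LLM call is left out.

```python
import re

# Hypothetical tree; field names are assumptions, not a confirmed schema.
tree = {
    "title": "Annual Report", "start_index": 1, "end_index": 120,
    "summary": "Full-year results and outlook.",
    "nodes": [
        {"title": "Financial Statements", "start_index": 40, "end_index": 95,
         "summary": "Balance sheet, income statement, and cash flow statements.",
         "nodes": [
             {"title": "Income Statement", "start_index": 40, "end_index": 52,
              "summary": "Revenue, expenses, and net income for the year.",
              "nodes": []},
             {"title": "Balance Sheet", "start_index": 53, "end_index": 60,
              "summary": "Assets, liabilities, and equity.", "nodes": []},
         ]},
        {"title": "Risk Factors", "start_index": 96, "end_index": 110,
         "summary": "Market, credit, and regulatory risks.", "nodes": []},
    ],
}


def keyword_judge(query):
    """Stand-in for an LLM relevance check: word overlap with title/summary."""
    wanted = set(re.findall(r"[a-z]+", query.lower()))

    def judge(node):
        text = (node["title"] + " " + node.get("summary", "")).lower()
        return bool(wanted & set(re.findall(r"[a-z]+", text)))

    return judge


def select_nodes(node, is_relevant):
    """Descend only into relevant branches; keep the deepest relevant nodes."""
    selected = []
    for child in node.get("nodes", []):
        if is_relevant(child):
            deeper = select_nodes(child, is_relevant)
            # If no descendant is relevant, the child itself is the match.
            selected.extend(deeper or [child])
    return selected


def build_prompt(query, nodes, fetch_pages):
    """Format the selected nodes' page contents into an LLM prompt."""
    parts = []
    for n in nodes:
        body = fetch_pages(n["start_index"], n["end_index"])
        parts.append(f"[{n['title']}, pp. {n['start_index']}-{n['end_index']}]\n{body}")
    return "Context:\n" + "\n\n".join(parts) + f"\n\nQuestion: {query}"


def fake_fetch(start, end):
    """Placeholder for reading page text from storage."""
    return f"<text of pages {start}-{end}>"


query = "net income"
nodes = select_nodes(tree, keyword_judge(query))
prompt = build_prompt(query, nodes, fake_fetch)
print(prompt)
```

Because the search descends only into branches judged relevant, the number of relevance checks grows with the depth of the matching path rather than with the total page count, and each selected node carries the page range needed to fetch its content.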
Roadmap: The Future of PageIndex
Stay tuned for these upcoming enhancements as PageIndex continues to improve.
Planned Features for PageIndex:
- Detailed examples of document selection, node selection, and RAG pipelines (due 2025/04/14)
- Integration of reasoning-based retrieval and semantic-based retrieval (due 2025/04/21)
- An introduction to efficient tree search methods
- Technical report on the design of PageIndex
Contribute to the Reasoning-Based RAG Revolution!
PageIndex is in early beta. Your feedback is invaluable! Report issues, ask questions, or contribute directly to the project.