Ditch Semantic Similarity: Unlock Reasoning-Based RAG with PageIndex

Are you tired of inaccurate results when using vector databases for long, complex documents? Traditional Retrieval-Augmented Generation (RAG) relies on semantic similarity, but similarity is not the same as relevance: the passage that reads most like the query is often not the one that answers it, especially in professional documents. It's time for a smarter solution: PageIndex.

What is Reasoning-Based RAG and Why Should You Care?

Reasoning-based RAG lets Large Language Models (LLMs) reason their way through a document's structure to decide where the answer is likely to be, instead of ranking text by similarity to the query. This approach matters most for documents that require domain expertise and multi-step reasoning, where it delivers more accurate and relevant results than similarity search alone.

PageIndex is a document indexing system designed to transform lengthy documents into semantic tree structures, making them ready for powerful reasoning-based RAG.

Key Benefits of PageIndex:

Enhanced Accuracy: Reasoning-based RAG returns more relevant and accurate results than similarity search, which is crucial for professional documents.
Deeper Understanding: Enables LLMs to truly understand the context and relationships within documents.
Improved Efficiency: Streamlines the retrieval process, saving time and resources.

PageIndex: Your Open-Source Solution for Smarter Document Retrieval

PageIndex, built by Vectify AI, is an open-source document indexing system that builds searchable tree structures from long documents. You can self-host it or try the cloud service, which offers features like OCR for complex PDFs.

Key Features That Set PageIndex Apart:

Hierarchical Tree Structure: Navigates documents logically, like an intelligent table of contents optimized for LLMs.
Precise Page Referencing: Pinpoints retrieval with node summaries and precise start/end page indexes.
Chunk-Free Segmentation: Maintains natural document structure without arbitrary chunking.
Massive Scalability: Handles hundreds or even thousands of pages with ease.

PageIndex is ideal for:

Financial reports
Regulatory filings
Academic textbooks
Legal or technical manuals
Any document exceeding LLM context limits

Understanding the PageIndex Format

Ever wondered how PageIndex organizes information? Here's an example node, with its child nodes, from the tree for a financial report:

{
"title": "Financial Stability",
"node_id": "0006",
"start_index": 21,
"end_index": 22,
"summary": "The Federal Reserve...",
"nodes": [
  {
   "title": "Monitoring Financial Vulnerabilities",
   "node_id": "0007",
   "start_index": 22,
   "end_index": 28,
   "summary": "The Federal Reserve's monitoring..."
  },
  {
   "title": "Domestic and International Cooperation and Coordination",
   "node_id": "0008",
   "start_index": 28,
   "end_index": 31,
   "summary": "In 2023, the Federal Reserve collaborated..."
  }
 ]
}

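Each node carries a title, a node_id, the start_index and end_index of the pages it covers, a summary, and its child nodes under nodes. This hierarchical structure lets an LLM navigate a complex document section by section instead of scanning all of it.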
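
To make the format concrete, here is a minimal Python sketch that walks a tree shaped like the example above and prints an indented outline with page ranges. The field names (title, node_id, start_index, end_index, nodes) come from the example; the helper names iter_nodes and outline are hypothetical and are not part of PageIndex.

```python
import json
from typing import Iterator

# Trimmed copy of the example tree above (summaries omitted for brevity).
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31}
  ]
}
"""


def iter_nodes(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs depth-first, in document order."""
    yield depth, node
    for child in node.get("nodes", []):  # leaf nodes have no "nodes" key
        yield from iter_nodes(child, depth + 1)


def outline(root: dict) -> list[str]:
    """Render the tree as an indented table of contents with page ranges."""
    return [
        f"{'  ' * depth}[{n['node_id']}] {n['title']} "
        f"(pages {n['start_index']}-{n['end_index']})"
        for depth, n in iter_nodes(root)
    ]


tree = json.loads(TREE_JSON)
for line in outline(tree):
    print(line)
```

An outline like this is compact enough to place in a prompt, which is roughly what "an intelligent table of contents optimized for LLMs" suggests.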

Get Started with PageIndex Today: A Step-by-Step Guide

Ready to unlock the power of reasoning-based RAG? Here's how to get started with PageIndex:

Install Dependencies: pip3 install -r requirements.txt
Set Your OpenAI API Key: Create a .env file and add your key: CHATGPT_API_KEY=your_openai_key_here
Run PageIndex on Your PDF: python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

Customize your processing with these optional arguments:

--model: OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages: Pages to check for table of contents (default: 20)
--max-pages-per-node: Max pages per node (default: 10)
--max-tokens-per-node: Max tokens per node (default: 20000)
--if-add-node-id: Add node ID (yes/no, default: yes)
--if-add-node-summary: Add node summary (yes/no, default: no)
--if-add-doc-description: Add doc description (yes/no, default: yes)
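
If you drive PageIndex from a script, it can help to assemble the command programmatically. The sketch below only builds and prints the argument list; it uses the script name and flags listed above, while build_command is a hypothetical helper and the option values are arbitrary examples. Note that, as listed above, --pdf_path uses an underscore while the optional flags use dashes.

```python
import shlex


def build_command(pdf_path: str, **options: object) -> list[str]:
    """Assemble an argv list for run_pageindex.py.

    Keyword names use underscores and are mapped to the dashed flags
    listed above, e.g. max_pages_per_node -> --max-pages-per-node.
    """
    cmd = ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
    for name, value in options.items():
        cmd += [f"--{name.replace('_', '-')}", str(value)]
    return cmd


cmd = build_command(
    "/path/to/your/document.pdf",
    max_pages_per_node=5,        # example value, not a recommendation
    if_add_node_summary="yes",   # turn node summaries on
)
print(shlex.join(cmd))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(cmd, check=True)
```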

Cloud API (Beta): Effortless PageIndex Integration

Don't want to self-host? Try the hosted API for PageIndex. Benefit from their custom OCR model for improved accuracy with complex documents. Plus, explore results visually with the web Dashboard – no coding needed! Leave your email to get 1,000 pages free.

Case Study: Mafin 2.5 - Proof of PageIndex's Power

Mafin 2.5, a reasoning-based RAG model built on PageIndex, achieved 98.7% accuracy on the FinanceBench benchmark. That result significantly outperformed traditional vector-based RAG systems and demonstrates the power of hierarchical indexing for precise content extraction.

Reasoning-Based RAG Framework Example

PageIndex changes how the retrieval step of RAG works. Here's one way to structure a pipeline around it:

Query Preprocessing: Analyze the user query to identify the knowledge it needs.
Document Selection: Retrieve the relevant documents and fetch their tree structures from the database.
Node Selection: Search through tree structures to find relevant nodes.
LLM Generation: Fetch node contents, format, and send to LLM for contextually informed responses.
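
The last two steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PageIndex's implementation: select_nodes, build_prompt, and keyword_judge are hypothetical names, the summaries and page texts are placeholders, end_index is assumed to be inclusive, and keyword_judge is a crude stand-in for the LLM call that would decide whether a node is worth opening.

```python
from typing import Callable

Node = dict
Judge = Callable[[str, Node], bool]


def keyword_judge(query: str, node: Node) -> bool:
    """Stand-in for an LLM relevance call: true if any query word
    appears in the node's title or summary."""
    text = f"{node.get('title', '')} {node.get('summary', '')}".lower()
    return any(word in text for word in query.lower().split())


def select_nodes(query: str, node: Node, judge: Judge) -> list[Node]:
    """Node Selection: descend only into branches the judge accepts and
    return the deepest accepted nodes."""
    if not judge(query, node):
        return []
    hits: list[Node] = []
    for child in node.get("nodes", []):
        hits += select_nodes(query, child, judge)
    return hits or [node]  # no accepted child: this node is the hit


def build_prompt(query: str, nodes: list[Node], pages: dict[int, str]) -> str:
    """LLM Generation (input side): fetch each node's pages and format them."""
    parts = []
    for n in nodes:
        span = range(n["start_index"], n["end_index"] + 1)  # inclusive (assumed)
        body = "\n".join(pages.get(p, "") for p in span)
        parts.append(
            f"## {n['title']} (pages {n['start_index']}-{n['end_index']})\n{body}"
        )
    return "Context:\n" + "\n\n".join(parts) + f"\n\nQuestion: {query}"


# Toy data: titles and page indexes follow the format example in this
# article; the summaries and page texts are placeholders.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "summary": "Placeholder: monitoring vulnerabilities and cooperation.",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28, "summary": "Placeholder."},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31,
         "summary": "Placeholder: work with other agencies."},
    ],
}
pages = {p: f"(text of page {p})" for p in range(21, 32)}

query = "monitoring vulnerabilities"
selected = select_nodes(query, tree, keyword_judge)
prompt = build_prompt(query, selected, pages)
print([n["node_id"] for n in selected])
```

Because the judge is passed in as a function, the keyword stand-in can be swapped for a real LLM call without touching the tree search. Query Preprocessing and Document Selection are left out of the sketch.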

Roadmap: The Future of PageIndex

Stay tuned for these upcoming enhancements as PageIndex continues to improve.

Planned Features for PageIndex:

Detailed examples of document selection, node selection, and RAG pipelines (due 2025/04/14)
Integration of reasoning-based retrieval and semantic-based retrieval (due 2025/04/21)
Introduction to efficient tree search methods
Technical report on the design of PageIndex

Contribute to the Reasoning-Based RAG Revolution!

PageIndex is in early beta. Your feedback is invaluable! Report issues, ask questions, or contribute directly to the project.