Unlock Deep Insights: Effortless PDF Analysis with OpenAI File Search API

Stop struggling with complex RAG setups! Discover how to use the OpenAI File Search API for streamlined PDF data retrieval and analysis. This guide provides a practical, code-driven approach to boost your LLM workflows with ease, focusing on actionable steps and clear explanations. Dive in and learn how to leverage this powerful tool to extract the most value from your PDF documents.

Ditch the Complexity: Simplified RAG with File Search

Traditional RAG pipelines for PDFs can be overwhelming: parsing documents, choosing a chunking strategy, picking a storage provider, computing embeddings, and running a vector database. It's a lot to handle.

The File Search API simplifies this process. As a hosted tool within OpenAI's Responses API, it indexes and searches your PDF knowledge base, allowing you to retrieve relevant content and generate answers easily. Forget the infrastructure headaches and focus on extracting valuable insights.

This API helps you:

Avoid intricate setups: No more manual chunking, embedding calculations, or managing vector databases.
Focus on results: Concentrate on retrieving accurate information and generating meaningful responses.
Integrate seamlessly: Easily incorporate file search into your existing LLM workflows.
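
For a sense of the manual work the first point refers to, here is a minimal sketch of just one pipeline step, fixed-size chunking with overlap. The function name `chunk_text` and the sizes are illustrative assumptions; this is not how OpenAI chunks files internally.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

In a hand-built pipeline, every chunk would then still need to be embedded and stored before anything can be searched. File Search handles all of that for you.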

Quick Start: Setting Up Your Environment

Before diving into the code, install the required packages:

pip install PyPDF2 pandas tqdm openai

Now configure your OpenAI API key:

from openai import OpenAI
import os

client = OpenAI(api_key = os.getenv('OPENAI_API_KEY'))
dir_pdfs = 'openai_blog_pdfs' # PDFs stored locally
pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]
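
Note that `os.listdir` returns every entry in the folder, including non-PDF files and subdirectories. If your folder is not guaranteed to contain only PDFs, a small filter helps. This is a sketch; `list_pdf_files` is a hypothetical helper name, not part of any library.

```python
from pathlib import Path


def list_pdf_files(directory: str) -> list[str]:
    """Return sorted paths of regular files ending in .pdf (case-insensitive)."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

You could then build the list with `pdf_files = list_pdf_files(dir_pdfs)`.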

Step-by-Step: Creating Your PDF Vector Store

Leverage OpenAI's API to create a managed vector store and upload your PDF files. OpenAI handles chunking, embedding, and storage so that you can query the content.

import concurrent.futures  # provides ThreadPoolExecutor and as_completed
from tqdm import tqdm
import PyPDF2

def upload_single_pdf(file_path: str, vector_store_id: str):
    file_name = os.path.basename(file_path)
    try:
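        # Open the file in a context manager so the handle is closed after the upload
        with open(file_path, 'rb') as pdf_file:
            file_response = client.files.create(file = pdf_file, purpose = "assistants")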
        attach_response = client.vector_stores.files.create(
            vector_store_id = vector_store_id,
            file_id = file_response.id
        )
        return { "file": file_name, "status": "success"}
    except Exception as e:
        print(f"Error with {file_name}: {str(e)}")
        return { "file": file_name, "status": "failed", "error": str(e)}

def upload_pdf_files_to_vector_store(vector_store_id: str):
    pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]
    stats = { "total_files": len(pdf_files), "successful_uploads": 0, "failed_uploads": 0, "errors": []}

    print(f"{len(pdf_files)} PDF files to process. Uploading in parallel...")

    with concurrent.futures.ThreadPoolExecutor(max_workers = 10) as executor:
        futures = {executor.submit(upload_single_pdf, file_path, vector_store_id): file_path for file_path in pdf_files}
        for future in tqdm(concurrent.futures.as_completed(futures), total = len(pdf_files)):
            result = future.result()
            if result["status"] == "success":
                stats["successful_uploads"] += 1
            else:
                stats["failed_uploads"] += 1
                stats["errors"].append(result)

    return stats

def create_vector_store(store_name: str) -> dict:
    try:
        vector_store = client.vector_stores.create(name = store_name)
        details = {
            "id": vector_store.id,
            "name": vector_store.name,
            "created_at": vector_store.created_at,
            "file_count": vector_store.file_counts.completed
        }
        print("Vector store created:", details)
        return details
    except Exception as e:
        print(f"Error creating vector store: {e}")
        return {}

store_name = "openai_blog_store"
vector_store_details = create_vector_store(store_name)
upload_pdf_files_to_vector_store(vector_store_details["id"])
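
Attaching a file starts indexing in the background, so a search issued right after the upload may miss files that are still being processed. Below is a minimal sketch of a polling helper. It assumes the vector store object returned by `client.vector_stores.retrieve` exposes `file_counts.in_progress`, in the same way the `file_counts.completed` field is used above. The name `wait_for_indexing` and its parameters are hypothetical.

```python
import time


def wait_for_indexing(client, vector_store_id: str,
                      poll_seconds: float = 2.0,
                      timeout_seconds: float = 300.0) -> dict:
    """Poll the vector store until no file is still in progress.

    Returns a small summary dict, or raises TimeoutError if indexing
    has not finished before the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        counts = client.vector_stores.retrieve(vector_store_id).file_counts
        if counts.in_progress == 0:
            return {
                "completed": counts.completed,
                "failed": counts.failed,
                "total": counts.total,
            }
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{counts.in_progress} file(s) still indexing after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

Calling `wait_for_indexing(client, vector_store_details["id"])` before the first query is one way to make sure every uploaded PDF is searchable.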

Querying your PDF Vector Store

You can query the vector store directly, without wrapping the search in a Responses API call.

query = "What's Deep Research?"
search_results = client.vector_stores.search(
    vector_store_id = vector_store_details['id'],
    query = query
)

for result in search_results.data:
    print(f"{len(result.content[0].text)} characters of content from {result.filename} with a relevance score of {result.score}")
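
If you want to post-process the hits rather than just print them, a small helper can filter by score and build short previews. This sketch assumes each result exposes `filename`, `score`, and a `content` list whose items have a `text` field, as used in the loop above. `summarize_results`, `min_score`, and `preview_chars` are hypothetical names.

```python
def summarize_results(results, min_score: float = 0.0, preview_chars: int = 80) -> list[dict]:
    """Turn search results into plain dicts, dropping hits below min_score."""
    rows = []
    for result in results:
        if result.score < min_score:
            continue
        # A result may carry several content parts; join them into one string
        text = "".join(part.text for part in result.content)
        rows.append({
            "filename": result.filename,
            "score": round(result.score, 3),
            "chars": len(text),
            "preview": text[:preview_chars],
        })
    # Highest-scoring hits first
    return sorted(rows, key=lambda row: row["score"], reverse=True)
```

For example, `summarize_results(search_results.data, min_score=0.5)` would keep only the stronger matches. The threshold of 0.5 is arbitrary and should be tuned on your own data.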

Integrating LLM and File Search: Responses API

The true power lies in combining file search with LLMs. Use the Responses API with the file_search tool to get answers grounded in your PDF knowledge base.

query = "What's Deep Research?"
response = client.responses.create(
    input = query,
    model = "gpt-4o-mini",
    tools = [{
        "type": "file_search",
        "vector_store_ids": [vector_store_details['id']],
    }]
)

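# Extract annotations from the response.
# When the model calls file_search, output[0] is the tool call and output[1] is the message.
annotations = response.output[1].content[0].annotations

# Collect the distinct filenames cited in the answer
retrieved_files = {annotation.filename for annotation in annotations}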

print(f'Files used: {retrieved_files}')
print('Response:')
print(response.output[1].content[0].text)
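
Indexing `response.output[1]` works when the model makes exactly one file search call before answering, but the position can shift if it makes none or several. A more defensive sketch looks items up by type instead. It assumes output items carry a `type` of "message", text parts a `type` of "output_text", and citations a `type` of "file_citation" with a `filename` field. `extract_answer_and_sources` is a hypothetical helper name.

```python
def extract_answer_and_sources(response) -> tuple[str, list[str]]:
    """Return the answer text and the sorted, de-duplicated cited filenames."""
    texts = []
    filenames = set()
    for item in response.output:
        # Skip tool calls and anything else that is not the assistant message
        if getattr(item, "type", None) != "message":
            continue
        for part in item.content:
            if getattr(part, "type", None) != "output_text":
                continue
            texts.append(part.text)
            for annotation in getattr(part, "annotations", None) or []:
                if getattr(annotation, "type", None) == "file_citation":
                    filenames.add(annotation.filename)
    return "".join(texts), sorted(filenames)
```

Usage would be `answer, sources = extract_answer_and_sources(response)`.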

Evaluating Performance: A Crucial Step

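Measuring the relevance and quality of the retrieved files is important. One way to do this is to generate an evaluation dataset and calculate metrics against it; the code below builds such a dataset by asking a model for one question per PDF. This is an imperfect approach, and we always recommend a human-verified dataset for your own use cases.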

def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text()
                # extract_text() can return an empty result for image-only pages
                if page_text:
                    text += page_text
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
    # Return whatever was extracted, even if reading failed partway through
    return text

def generate_questions(pdf_path):
    text = extract_text_from_pdf(pdf_path)

    prompt = (
        "Can you generate a question that can only be answered from this document?: \n"
        f"{text}\n\n"
    )

    response = client.responses.create(
        input = prompt,
        model = "gpt-4o",
    )

    question = response.output[0].content[0].text

    return question

# Generate questions for each PDF and store in a dictionary
questions_dict = {}
for pdf_path in pdf_files:
    questions = generate_questions(pdf_path)
    questions_dict[os.path.basename(pdf_path)] = questions
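
With each generated question paired to the PDF it came from, retrieval quality can be scored by checking whether that PDF shows up among the retrieved files. Below is a minimal sketch of two common metrics, recall@k and mean reciprocal rank (MRR). It assumes you have already collected, for every question, the ranked list of retrieved filenames (for example from the search results). The function names and the `rows` structure are illustrative assumptions.

```python
def recall_at_k(expected_file: str, retrieved_files: list[str], k: int = 5) -> float:
    """1.0 if the expected file is among the top-k retrieved files, else 0.0."""
    return 1.0 if expected_file in retrieved_files[:k] else 0.0


def reciprocal_rank(expected_file: str, retrieved_files: list[str]) -> float:
    """1 / rank of the expected file (rank starts at 1), or 0.0 if it is absent."""
    for rank, filename in enumerate(retrieved_files, start=1):
        if filename == expected_file:
            return 1.0 / rank
    return 0.0


def evaluate(rows: list[tuple[str, list[str]]], k: int = 5) -> dict:
    """Average both metrics over (expected_file, ranked_retrieved_files) pairs."""
    if not rows:
        return {"recall_at_k": 0.0, "mrr": 0.0, "n": 0}
    n = len(rows)
    return {
        "recall_at_k": sum(recall_at_k(exp, ret, k) for exp, ret in rows) / n,
        "mrr": sum(reciprocal_rank(exp, ret) for exp, ret in rows) / n,
        "n": n,
    }
```

Because each question here has exactly one expected file, recall@k reduces to a hit rate. As noted above, scores computed on model-generated questions are only a rough signal.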

Conclusion: Unlock Powerful PDF Insights

The OpenAI File Search API provides a simplified, hosted solution for RAG on PDFs. The vector store search endpoint lets you find relevant passages in your knowledge base without an LLM call, and the file_search tool in the Responses API grounds model answers in those same documents. By following this guide, you can create vector stores of PDFs, query them with LLMs, and extract valuable information for various applications. Embrace this approach to accelerate your document analysis workflows and discover the hidden insights within your PDF files. Maximize the power of your data today!