Extract, Transform, Load: Use GPT-4o to Boost Your Data Workflow from PDFs
Unstructured data trapped in PDFs? Stop manually wrangling invoices and unlock powerful insights with an innovative ETL workflow built on GPT-4o. This article shows you how to seamlessly extract, transform, and load (ETL) data with GPT-4o, even from complex, multilingual documents. Say goodbye to traditional OCR limitations and hello to efficient data analysis.
Why GPT-4o is a Game Changer for Data Extraction and Transformation
Traditional Optical Character Recognition (OCR) often struggles with layout complexities and multilingual content within documents like PDFs. GPT-4o offers a smarter alternative, leveraging its multimodal capabilities to understand and interpret data in various formats. Here's how GPT-4o revolutionizes ETL:
- Superior Data Extraction: GPT-4o adapts to diverse document layouts, reducing errors and handling multiple languages effortlessly. It understands context to extract meaningful relationships, and it processes images and tables seamlessly.
- Smarter Data Transformation: GPT-4o dynamically adapts to different data structures, mapping them flexibly to fit specific database schemas. It uses reasoning to create insightful transformations, enriching your datasets with derived metrics and metadata.
Streamline Your ETL Process: A Three-Part Cookbook
This guide walks you through building an ETL workflow to convert data from PDFs into a usable database. The workflow is broken down into three parts:
- Extracting Data from Multilingual PDFs: Use GPT-4o's vision capabilities to pull unstructured data out of PDFs.
- Transforming Data with a Schema: Convert the extracted data into a consistent, easy-to-use schema.
- Loading Transformed Data into a Database: Load the schematized data into a relational database so you can analyze it.
The following sections will walk you through each part.
Part 1: Effortless PDF Data Extraction with GPT-4o's Vision
Let's dive into extracting data from those pesky PDFs. Because GPT-4o's vision model works on images rather than raw PDFs, we first convert each PDF page to an image and encode it as base64.
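A minimal sketch of this conversion step. It assumes the third-party `pdf2image` package (which in turn needs poppler) for rendering pages; the helper names are illustrative, not fixed API:

```python
import base64
from io import BytesIO


def encode_image_bytes(data: bytes) -> str:
    """Base64-encode raw image bytes for use in an API payload."""
    return base64.b64encode(data).decode("utf-8")


def pdf_to_base64_images(pdf_path: str) -> list:
    """Render each PDF page to a PNG and return one base64 string per page.

    Assumes the third-party pdf2image package (and poppler) is installed.
    """
    from pdf2image import convert_from_path  # lazy import: only needed here

    encoded = []
    for image in convert_from_path(pdf_path):
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        encoded.append(encode_image_bytes(buffer.getvalue()))
    return encoded
```

Each element of the returned list is a self-contained page image, ready to be attached to a vision prompt.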
Next, we write a prompt that instructs GPT-4o to extract the data from each page image.
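One way to wire this up, assuming the `openai` Python package is installed and `OPENAI_API_KEY` is set in the environment. The prompt wording and function names are illustrative:

```python
import json

# Illustrative prompt: ask for key-value pairs in the source language,
# grouped logically, with null for anything missing.
EXTRACTION_PROMPT = (
    "Extract all data from this invoice page as key-value pairs. "
    "Keep the original language, group related fields together, and "
    "use null for any field that is missing or unreadable. "
    "Return only valid JSON."
)


def build_vision_messages(base64_image: str) -> list:
    """Pair the extraction prompt with one base64-encoded page image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": EXTRACTION_PROMPT},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
            ],
        }
    ]


def extract_invoice_page(base64_image: str, model: str = "gpt-4o") -> dict:
    """Send one page image to GPT-4o and parse the JSON reply.

    Assumes the openai package is installed and OPENAI_API_KEY is set.
    """
    from openai import OpenAI  # lazy import so the helpers above work offline

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=build_vision_messages(base64_image),
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

The `response_format={"type": "json_object"}` setting keeps the model's reply parseable without extra cleanup.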
Finally, we loop through all the pages and combine the per-page results into a single JSON file.
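The looping step above can be sketched as follows. The per-page extraction call is passed in as a plain callable (for example, a GPT-4o extraction function like the one sketched earlier), which keeps the loop itself independent of any API; all names here are illustrative:

```python
import json
from pathlib import Path


def merge_page_results(page_results: list) -> dict:
    """Combine the per-page dicts into one document-level dict."""
    return {"pages": page_results}


def extract_document(base64_pages: list, extract_page, output_path: str) -> dict:
    """Run `extract_page` on every page image and save one JSON file.

    `extract_page` is any callable mapping a base64 page image to a dict.
    """
    results = [extract_page(page) for page in base64_pages]
    document = merge_page_results(results)
    Path(output_path).write_text(
        json.dumps(document, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return document
```

`ensure_ascii=False` preserves non-ASCII characters (umlauts, accents) as-is, which matters for multilingual invoices.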
Imagine you're processing German hotel invoices. The extracted JSON will contain key-value pairs in German, grouped logically, with null values for any missing fields. This unstructured data can be stored in a data lake, ready for the next step.
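A hypothetical fragment of such output (every value below is invented for illustration):

```json
{
  "Hotelname": "Hotel Beispiel Berlin",
  "Rechnungsnummer": "2024-0042",
  "Gast": {
    "Name": "Max Mustermann",
    "Firma": null
  },
  "Positionen": [
    {"Beschreibung": "Übernachtung", "Betrag": "125,00 EUR"}
  ]
}
```

Note that the keys are still in German and follow whatever grouping the invoice itself used; imposing a consistent, English-keyed structure is the job of the next part.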
Part 2: Transforming Unstructured Data into a Consistent Schema
Now, transform the extracted data into a schema for a database. This stage ensures data consistency and facilitates efficient querying. Here's an example of such a schema:
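The field names below are illustrative, chosen to line up with the Hotels, Invoices, Charges, and Taxes tables used in Part 3:

```json
{
  "hotel_information": {
    "name": "string",
    "address": "string"
  },
  "invoice_information": {
    "invoice_number": "string",
    "invoice_date": "YYYY-MM-DD",
    "total_amount": "number",
    "currency": "string"
  },
  "charges": [
    {"description": "string", "amount": "number"}
  ],
  "taxes": [
    {"tax_type": "string", "tax_rate": "string", "tax_amount": "number"}
  ]
}
```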
Use the schema to drive the transformation with GPT-4o.
We then loop through the folder of extracted JSON files and write the transformed, schema-conforming documents into a new folder.
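A sketch of this transformation pass, again assuming the `openai` package and an `OPENAI_API_KEY`. The prompt wording, function names, and folder layout are illustrative; the folder loop accepts an injectable `transform` callable so it can be exercised without any API calls:

```python
import json
from pathlib import Path

# Illustrative prompt template: the schema and one extracted document
# are interpolated into {schema} and {data}.
TRANSFORM_PROMPT = (
    "Map the following extracted invoice data onto the target JSON schema. "
    "Translate keys to English, normalize dates to YYYY-MM-DD, and use null "
    "for fields the source does not contain. Return only valid JSON.\n\n"
    "Target schema:\n{schema}\n\nExtracted data:\n{data}"
)


def build_transform_prompt(schema: dict, extracted: dict) -> str:
    """Fill the prompt template with the schema and one extracted document."""
    return TRANSFORM_PROMPT.format(
        schema=json.dumps(schema, indent=2),
        data=json.dumps(extracted, ensure_ascii=False, indent=2),
    )


def transform_document(schema: dict, extracted: dict, model: str = "gpt-4o") -> dict:
    """Ask GPT-4o to reshape one extracted document into the target schema."""
    from openai import OpenAI  # lazy import keeps the other helpers offline-safe

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_transform_prompt(schema, extracted)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


def transform_folder(schema: dict, input_dir: str, output_dir: str, transform=None) -> None:
    """Transform every extracted JSON file in input_dir into output_dir."""
    transform = transform or (lambda doc: transform_document(schema, doc))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_dir).glob("*.json")):
        extracted = json.loads(path.read_text(encoding="utf-8"))
        result = transform(extracted)
        (out / path.name).write_text(
            json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
        )
```

Keeping input and output in separate folders preserves the raw extractions in your data lake, so the transformation can be re-run with a revised schema at any time.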
Part 3: Loading Your Cleaned Data into a Database for Analysis
With your data neatly schematized, it's time to load it into a relational database like SQLite. This involves structuring the data into tables (Hotels, Invoices, Charges, Taxes) for efficient querying and analysis.
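The loading step can be sketched with Python's built-in `sqlite3` module. The four tables match those named above, but the column names and the shape of the input document are illustrative, following the schema sketched in Part 2:

```python
import sqlite3

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS Hotels (
    hotel_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    address TEXT
);
CREATE TABLE IF NOT EXISTS Invoices (
    invoice_id INTEGER PRIMARY KEY AUTOINCREMENT,
    hotel_id INTEGER REFERENCES Hotels(hotel_id),
    invoice_number TEXT,
    invoice_date TEXT,
    total_amount REAL,
    currency TEXT
);
CREATE TABLE IF NOT EXISTS Charges (
    charge_id INTEGER PRIMARY KEY AUTOINCREMENT,
    invoice_id INTEGER REFERENCES Invoices(invoice_id),
    description TEXT,
    amount REAL
);
CREATE TABLE IF NOT EXISTS Taxes (
    tax_id INTEGER PRIMARY KEY AUTOINCREMENT,
    invoice_id INTEGER REFERENCES Invoices(invoice_id),
    tax_type TEXT,
    tax_rate TEXT,
    tax_amount REAL
);
"""


def init_db(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and ensure all four tables exist."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA_SQL)
    return conn


def load_invoice(conn: sqlite3.Connection, doc: dict) -> int:
    """Insert one transformed invoice document; return its invoice_id."""
    hotel = doc.get("hotel_information", {})
    inv = doc.get("invoice_information", {})
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO Hotels (name, address) VALUES (?, ?)",
        (hotel.get("name"), hotel.get("address")),
    )
    hotel_id = cur.lastrowid
    cur.execute(
        "INSERT INTO Invoices (hotel_id, invoice_number, invoice_date, "
        "total_amount, currency) VALUES (?, ?, ?, ?, ?)",
        (hotel_id, inv.get("invoice_number"), inv.get("invoice_date"),
         inv.get("total_amount"), inv.get("currency")),
    )
    invoice_id = cur.lastrowid
    for charge in doc.get("charges", []):
        cur.execute(
            "INSERT INTO Charges (invoice_id, description, amount) VALUES (?, ?, ?)",
            (invoice_id, charge.get("description"), charge.get("amount")),
        )
    for tax in doc.get("taxes", []):
        cur.execute(
            "INSERT INTO Taxes (invoice_id, tax_type, tax_rate, tax_amount) "
            "VALUES (?, ?, ?, ?)",
            (invoice_id, tax.get("tax_type"), tax.get("tax_rate"),
             tax.get("tax_amount")),
        )
    conn.commit()
    return invoice_id
```

Once loaded, ordinary SQL answers the questions that mattered all along, e.g. `SELECT currency, SUM(total_amount) FROM Invoices GROUP BY currency`.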