Squeeze Every Drop: Distill GPT-4o Knowledge into GPT-4o-mini for Wine Expertise
OpenAI's model distillation offers a game-changing approach: transfer the smarts of powerful (but pricey) models like GPT-4o into smaller, faster ones. Imagine GPT-4o-mini, but with the wine-guessing abilities of its big brother. This guide explores how to achieve this, cutting costs and latency without sacrificing accuracy. We'll focus on a wine classification problem, using enums for structured outputs to boost performance even further.
The Magic of Model Distillation: Why It Matters
Model distillation is like cloning the knowledge of a large model into a smaller one.
- Cheaper Inferences: Run smaller models on less expensive hardware.
- Lower Latency: Get answers faster with smaller models.
- Specialized Expertise: Fine-tune for specific tasks, like wine tasting.
Dependencies
```python
!pip install openai tiktoken numpy pandas tqdm --quiet
```
```python
import concurrent.futures
import json

import numpy as np
import pandas as pd
import tiktoken
from openai import OpenAI
from tqdm import tqdm

client = OpenAI()
```
French Wine Focus: Unearthing a Dataset
For this experiment, we'll leverage a dataset of roughly 130,000 wine reviews available on Kaggle.
- Dataset Source: https://www.kaggle.com/datasets/zynicide/wine-reviews
- Narrowed Scope: We'll focus specifically on French wines to streamline the process, focusing on a subset of 500 to keep testing efficient.
- The Goal: To predict the grape variety based on review details like description, region, and winery.
Crafting the Perfect Prompt: Setting the Stage for Success
A well-structured prompt is key for accurate predictions and for successfully distilling knowledge into a smaller model.
Let's filter out the grape varieties that have fewer than 5 occurrences in reviews.
Let's proceed with a subset of 500 random rows from this dataset.
```python
df = pd.read_csv('data/winemag/winemag-data-130k-v2.csv')
df_france = df[df['country'] == 'France']

# Filter out varieties with fewer than 5 reviews. Even though we'd like to
# identify those too, they're outliers we don't want to optimize for: they
# would make our enum list too long, and they'd add noise for the rest of
# the dataset, eventually reducing our accuracy.
variety_counts = df_france['variety'].value_counts()
varieties_less_than_five_list = variety_counts[variety_counts < 5].index.tolist()
df_france = df_france[~df_france['variety'].isin(varieties_less_than_five_list)]

df_france_subset = df_france.sample(n=500)
df_france_subset.head()
```
```python
# Retrieve all grape varieties to include them in the prompt and in our
# structured outputs enum list.
varieties = np.array(df_france['variety'].unique()).astype('str')
varieties
```
Consider this when creating your prompts for your models:
- Detailed Context: Provide winery, region, description, reviewer, and points.
- Variety List: Explicitly list possible grape varieties for the model, improving focus.
- Concise Instructions: Request a single-word answer (the grape variety), ensuring consistency.
Prompt Example
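Here's a minimal sketch of how such a prompt could be assembled. The function name `generate_prompt` and the exact wording are illustrative; the column names (`winery`, `region_1`, `province`, `points`, `taster_name`, `description`) come from the Kaggle CSV.

```python
# Illustrative prompt builder -- the wording is a sketch, not a canonical
# prompt; column names match the Kaggle winemag CSV.
def generate_prompt(row, varieties):
    # Spell out every candidate variety so the model answers on-distribution.
    variety_list = ", ".join(varieties)
    return (
        "You are a sommelier. Based on the wine review below, identify the grape variety.\n"
        f"Answer with exactly one variety from this list: {variety_list}.\n\n"
        f"Winery: {row['winery']}\n"
        f"Region: {row['region_1']}, {row['province']}\n"
        f"Points: {row['points']}\n"
        f"Reviewer: {row['taster_name']}\n"
        f"Description: {row['description']}"
    )
```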
Cost-Effective Completion Calls: Token Estimates
Before launching a large number of API calls, estimate token usage and cost with tiktoken so there are no billing surprises.
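As a rough sketch, count tokens with the `o200k_base` encoding used by the GPT-4o family, reusing the illustrative `generate_prompt` from above. The per-million-token rate below is a placeholder; check the current pricing page for your model.

```python
# Rough input-token estimate before firing off 500 calls.
# gpt-4o and gpt-4o-mini both use the o200k_base encoding.
encoding = tiktoken.get_encoding("o200k_base")

total_tokens = sum(
    len(encoding.encode(generate_prompt(row, varieties)))
    for _, row in df_france_subset.iterrows()
)
print(f"Estimated input tokens: {total_tokens:,}")

# Placeholder rate -- look up the current price before trusting this number.
usd_per_million_input_tokens = 2.50
print(f"Estimated input cost: ${total_tokens / 1e6 * usd_per_million_input_tokens:.4f}")
```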
Structured Wine Wisdom: Enums for Accurate Output
Using structured outputs ensures the model answers predictably and accurately.
- Enumerated List: Define a JSON schema with an `enum` listing all grape varieties (see the sketch below).
- Deterministic Answers: Force the model to select from that list, avoiding irrelevant responses.
- Performance Boost: Eliminate the need for further post-processing - an immediate accuracy win!
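Here's what such a schema can look like in the Chat Completions structured-outputs format; the schema name and the `variety` field name are our choices, not fixed by the API.

```python
# Constrain the answer to one of the known varieties via a strict JSON schema.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "grape_variety",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "variety": {"type": "string", "enum": varieties.tolist()}
            },
            "required": ["variety"],
            "additionalProperties": False,
        },
    },
}
```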
Store Completions with Metadata
To distill a model, it's essential to store all completions: the stored completions are what the smaller model is fine-tuned against. Storing with metadata also simplifies filtering for distillation and future evaluations.
- Store Parameter: Set `store=True` in `client.chat.completions.create`.
- Metadata Tag: Add a descriptive metadata tag like `"distillation": "wine-distillation"` (both are shown in the sketch below).
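A minimal sketch of such a call, reusing the `response_format` defined above; the helper name `call_model` is ours.

```python
def call_model(model, prompt):
    # store=True persists the completion on the OpenAI platform; the metadata
    # tag lets us filter for exactly these completions later.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format=response_format,
        store=True,
        metadata={"distillation": "wine-distillation"},
    )
    return json.loads(response.choices[0].message.content)["variety"]
```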
Parallel Processing for Speed: Concurrent Futures to the Rescue
Parallelizing API calls with `concurrent.futures` dramatically reduces total processing time.
- `ThreadPoolExecutor`: Use this to run completions concurrently (shown in the sketch after this list).
- Progress Bar: Wrap the loop in `tqdm` for a visual display of progress.
- Error Handling: Include `try...except` blocks to gracefully handle any API errors.
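A sketch that ties the pieces together, assuming the `generate_prompt` and `call_model` helpers above; the worker count is arbitrary, so tune it to your rate limits.

```python
def process_dataframe(df, model, max_workers=10):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one completion per row, keyed by the row index.
        futures = {
            executor.submit(call_model, model, generate_prompt(row, varieties)): index
            for index, row in df.iterrows()
        }
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as e:
                # One failed call shouldn't sink the whole batch.
                print(f"Row {index} failed: {e}")
                results[index] = None
    return results
```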
How to Ensure Accuracy in Results
Now that the code runs end to end, sanity-check its output on a handful of rows before spending tokens on the full distillation run.
Once those spot checks pass, process the whole dataset: run both gpt-4o and gpt-4o-mini against the French wine subset using the parallelized helper above.
Compare the Two Models
Compare completion results between the 'teacher' model (GPT-4o) and the 'student' model (GPT-4o-mini). This is an important step: it establishes the accuracy gap that distillation should close.
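A simple way to score both models against the labeled `variety` column. The prediction column names below are hypothetical; substitute whatever you stored each model's answers under.

```python
# Fraction of predictions that exactly match the labeled grape variety.
def accuracy(df, prediction_column):
    return (df[prediction_column] == df["variety"]).mean()

# Hypothetical column names -- replace with the ones you wrote predictions into.
df_france_subset["gpt-4o_prediction"] = pd.Series(process_dataframe(df_france_subset, "gpt-4o"))
df_france_subset["gpt-4o-mini_prediction"] = pd.Series(process_dataframe(df_france_subset, "gpt-4o-mini"))

for column in ["gpt-4o_prediction", "gpt-4o-mini_prediction"]:
    print(f"{column}: {accuracy(df_france_subset, column):.2%}")
```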
Distilling Knowledge: The OpenAI Platform Steps
Upload the processed dataset with stored gpt-4o completions to the OpenAI platform and begin distillation.
- Stored Completions: Navigate to the OpenAI Stored completions page.
- Filter: Select `gpt-4o` as the model and `"distillation: wine-distillation"` as the metadata.
- Initiate Distillation: Click "Distill" in the top right corner.
- Choose Base Model: Select `gpt-4o-mini` and configure the fine-tuning parameters.
- Track Progress: Retrieve the fine-tuning job ID and monitor progress, as sketched below.
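Once the job exists, you can poll it from the API as well as from the UI. The job ID below is a placeholder; copy yours from the fine-tuning page.

```python
# Placeholder job ID -- replace with the one shown on the platform.
job = client.fine_tuning.jobs.retrieve("ftjob-...")
print(job.status)            # e.g. "running", then "succeeded"
print(job.fine_tuned_model)  # name of the distilled model once training finishes
```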