Speed Up LLM Preference Tuning: Optimize DPO and Reward Modeling with Flash Preference
Want to accelerate your Large Language Model (LLM) preference tuning? Discover how Flash Preference, a powerful new tool, can drastically speed up your Direct Preference Optimization (DPO), Reward Modeling (RM), and Group Relative Policy Optimization (GRPO) processes with minimal code changes. This article will show you how to leverage prefix sharing to reduce computation and memory usage, leading to faster and more efficient training.
What is Flash Preference and How Does it Work?
Flash Preference is a library designed to improve the efficiency of LLM preference tuning. It works by identifying and sharing common prefixes in input sequences during model forward and backward passes. This intelligent sharing dramatically reduces computational overhead and memory footprint without sacrificing accuracy. With just a single line of code, you can integrate Flash Preference into your existing workflows.
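To make the prefix-sharing idea concrete, here is a toy snippet (not part of Flash Preference itself) that measures how large the shared prompt prefix is in a typical DPO pair; the prompt text is a placeholder and the tokenizer is used only for illustration.

```python
# Toy illustration (not part of Flash Preference): in a DPO pair, the chosen
# and rejected sequences start with the same prompt, so their token IDs share
# a long common prefix whose forward/backward computation can be reused.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

prompt = "Explain why the sky is blue in one paragraph."
chosen = prompt + " Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most..."
rejected = prompt + " Because it just is."

chosen_ids = tokenizer(chosen).input_ids
rejected_ids = tokenizer(rejected).input_ids

# Length of the longest common token prefix between the two sequences.
shared = 0
for a, b in zip(chosen_ids, rejected_ids):
    if a != b:
        break
    shared += 1

total = len(chosen_ids) + len(rejected_ids)
print(f"Shared prefix: {shared} tokens; "
      f"{shared / total:.0%} of the pairwise batch is redundant without sharing.")
```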
Getting Started with Flash Preference
Integrating Flash Preference into your workflow is straightforward. Here’s how to get started:
- Installation: Install the latest version directly from GitHub, or use pip.
- Implementation: Wrap your model's forward and backward passes in the `shared_prefix` context; the library automatically detects and shares common prefixes. An example using the Qwen2.5-7B-Instruct model is sketched below.
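Below is a minimal sketch covering both steps. The pip package name, the import path, and the arguments accepted by `shared_prefix` are assumptions inferred from the description above rather than a verified API; consult the project's repository for the authoritative instructions.

```python
# Minimal sketch of integrating Flash Preference into a pairwise (DPO-style)
# forward/backward pass. The import path, the shared_prefix arguments, and the
# pip package name below are assumptions, not the verified API.
#
# Installation (assumed package name; see the GitHub repository for the
# authoritative instructions):
#   pip install flash-preference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from flash_preference import shared_prefix  # assumed import path

model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Summarize the benefits of prefix sharing."
chosen = prompt + " It avoids recomputing the shared prompt for every response..."
rejected = prompt + " No idea."

# Chosen and rejected responses share the prompt as a common prefix.
batch = tokenizer(
    [chosen, rejected], return_tensors="pt", padding=True
).to("cuda")

# Wrap the forward and backward passes in the shared_prefix context so the
# common prompt prefix is detected and computed only once.
with shared_prefix(model):  # assumed signature
    logits = model(**batch).logits
    loss = logits.float().mean()  # placeholder loss; use your DPO/RM loss here
    loss.backward()
```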
Key Benefits of Using Flash Preference
Flash Preference delivers remarkable speedup and memory savings compared to baseline models. Here's what you can expect:
- Accelerated Training: By sharing prefixes, Flash Preference reduces redundant computations, leading to faster training cycles. This is especially crucial for Direct Preference Optimization (DPO).
- Reduced Memory Footprint: Prefix sharing minimizes the memory required during training, allowing you to work with larger models and datasets on limited hardware.
- Seamless Integration: Adopting Flash Preference requires only minimal code changes, so existing DPO pipelines benefit almost immediately.
Benchmark Performance
The following setup was used to benchmark Flash Preference against the baseline (a rough reproduction sketch follows the list):
- Model: Qwen/Qwen2.5-7B-Instruct with gradient checkpointing, Liger-Kernel, and FlashAttention-2 enabled.
- Data: Mocked pairwise preference data with prompt and response lengths varying from 64 to 16k.
- Computation: One forward pass followed by one backward pass.
- Hardware: 1x NVIDIA A800-SXM4-80GB GPU.
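For reference, here is a rough, hedged sketch of how such a measurement could be reproduced. The `shared_prefix` import and call signature remain assumptions as noted earlier, and the sequence lengths below are just one point from the 64 to 16k sweep.

```python
# Rough sketch of the benchmark procedure described above: mocked pairwise
# data, one forward pass followed by one backward pass, timed with and without
# the shared_prefix context. The shared_prefix import/signature and the
# sequence lengths chosen here are assumptions for illustration.
import time
import torch
from transformers import AutoModelForCausalLM
from flash_preference import shared_prefix  # assumed import path

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()
model.gradient_checkpointing_enable()

prompt_len, response_len = 2048, 2048  # the benchmark sweeps lengths from 64 to 16k
vocab = model.config.vocab_size
prompt = torch.randint(0, vocab, (1, prompt_len), device="cuda")
responses = torch.randint(0, vocab, (2, response_len), device="cuda")
# Pairwise batch in which both rows start with the same mocked prompt.
input_ids = torch.cat([prompt.expand(2, -1), responses], dim=1)

def forward_backward(use_prefix_sharing: bool) -> tuple[float, float]:
    """Run one forward + backward pass and return (seconds, peak GiB)."""
    model.zero_grad(set_to_none=True)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    if use_prefix_sharing:
        with shared_prefix(model):  # assumed signature
            model(input_ids=input_ids).logits.mean().backward()
    else:
        model(input_ids=input_ids).logits.mean().backward()
    torch.cuda.synchronize()
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return time.perf_counter() - start, peak_gib

for flag in (False, True):
    seconds, peak = forward_backward(flag)
    print(f"prefix sharing={flag}: {seconds:.2f}s, peak memory {peak:.1f} GiB")
```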
Supported Architectures
Flash Preference currently supports a variety of popular LLM architectures, including:
- LLaMA
- Gemma
- Gemma2
- Qwen2
- Qwen2VL
- Qwen2.5VL
Contributing and Development
The project is actively developed and welcomes contributions.
- Unit Tests: Run the unit tests to verify your changes; at least two GPUs are required.
- Code Formatting: Maintain a clean and consistent codebase by formatting your code.
License
Flash Preference is released under the MIT License, promoting open-source collaboration and usage.