Train Your Own DeepSeek-R1: Simplified Code for Maximum Impact
Frustrated with complex, bloated code when trying to train your own DeepSeek-R1 model? This repository offers a streamlined solution, focusing on simplicity and efficiency. Achieve your DeepSeek-R1 training goals faster and with less overhead.
Why This DeepSeek-R1 Repository Stands Out
Instead of navigating thousands of lines of code and countless configuration files, this repository distills the essential components into just three files: main.py, trainer.py, and utils.py. This simplified approach makes it easier to understand, customize, and troubleshoot your DeepSeek-R1 training process. Say goodbye to the complexity often associated with GRPOTrainer and embrace clarity.
Here's what you gain:
- Minimal Code: Only 831 lines across three files for easy comprehension.
- No Hugging Face GRPOTrainer: Avoid the complexities and customization frustrations.
- vLLM Integration: Achieve fast answer generation without sacrificing code simplicity (see the sketch after this list).
- Multi-GPU Support: Dedicated GPU for vLLM inference alongside training GPUs.
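As a minimal sketch of what the vLLM side of this setup can look like, the snippet below pins generation to a dedicated GPU and samples several completions per prompt, GRPO-style. The model name, GPU index, memory fraction, and sampling values here are illustrative assumptions, not the repository's actual settings; the real wiring lives in trainer.py and utils.py.

```python
# Illustrative vLLM generation sketch (not the repo's exact code).
# Pin inference to one dedicated GPU by setting CUDA_VISIBLE_DEVICES
# *before* importing vllm/torch.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "7"  # hypothetical: last GPU reserved for inference

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct", gpu_memory_utilization=0.8)
# Sample multiple completions per prompt, as GRPO-style training requires.
params = SamplingParams(temperature=1.0, max_tokens=512, n=8)
outputs = llm.generate(["What is 17 * 24? Think step by step."], params)
print(outputs[0].outputs[0].text)
```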
Unleash the Speed: Training Time and GPU Usage
The code achieves impressive training times on NVIDIA A100 GPUs. Here’s a glimpse of the performance you can expect:
- Qwen2-VL-2B-Instruct (100k QA samples):
  - 2 A100 GPUs: 14 hours
  - 8 A100 GPUs: 4.5 hours
- Qwen2.5-VL-3B-Instruct: 6 hours
- GPU Memory Usage: 40-60 GB (depending on parameter settings)
Troubleshooting vLLM: A Quick Fix
The repository addresses a critical GPU index-mapping bug present in older vLLM versions. A simple modification to vllm/worker/worker.py resolves the issue: commenting out the call to torch.cuda.set_device(self.device) fixes the GPU mapping problem. For optimal tensor parallelism, consider using vllm==0.8.3. Maintaining separate conda environments for different vLLM versions (e.g., 0.7.3 and 0.8.3) is recommended.
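For reference, the change looks roughly like the excerpt below. The exact location and surrounding code inside vllm/worker/worker.py vary between vLLM versions, so treat this as a hedged illustration rather than a verbatim patch:

```python
# vllm/worker/worker.py -- illustrative excerpt, not the full method.
class Worker:
    def init_device(self):
        ...
        # Commenting out the following call resolves the GPU index-mapping
        # problem described above:
        # torch.cuda.set_device(self.device)
        ...
```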
Simple Installation: Get Started Quickly
The installation process is straightforward, utilizing conda and pip. Follow these steps to set up your environment:
- Create a new conda environment:
conda create -n deepsick python=3.12 -y
- Activate the environment:
conda activate deepsick
- Install vLLM:
pip install vllm==0.7.3
- Install necessary packages:
pip install trl wandb debugpy datasets deepspeed accelerate
- Install flash attention:
pip install flash-attn --no-build-isolation
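After installation, a quick sanity check like the one below (a hypothetical helper script, not part of the repository) confirms the key packages import cleanly and the GPUs are visible:

```python
# check_env.py -- hypothetical post-install sanity check.
import torch
import vllm
import trl
import deepspeed

print("vLLM version:", vllm.__version__)          # expect 0.7.3
print("TRL version:", trl.__version__)
print("DeepSpeed version:", deepspeed.__version__)
print("Visible CUDA GPUs:", torch.cuda.device_count())
```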
Training with Multiple GPUs using DeepSpeed
This repository leverages DeepSpeed-ZeRO3 for efficient multi-GPU training, configured via ds_accel.yaml to distribute training properly across devices. To launch the training script, use the following command:
bash train.sh
This script automatically computes the number of processes for DeepSpeed, a clever fix for compatibility issues between vLLM and accelerate; a sketch of that logic follows.
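The actual contents of train.sh are in the repository. As a hedged illustration of the idea, the core logic is: count the available GPUs, reserve one for vLLM generation, and hand the rest to accelerate/DeepSpeed. The snippet below is a hypothetical Python rendering of that arithmetic, not the script itself:

```python
# Hypothetical illustration of the process-count logic in train.sh:
# one GPU is reserved for vLLM generation, the rest go to DeepSpeed training.
import torch

num_gpus = torch.cuda.device_count()
num_train_processes = max(num_gpus - 1, 1)  # reserve the last GPU for vLLM
print(
    "accelerate launch "
    "--config_file ds_accel.yaml "
    f"--num_processes {num_train_processes} main.py"
)
```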
From Vision Language Models (VLMs) to Large Language Models (LLMs)
Although primarily focused on DeepSeek-R1-style training of Vision Language Models (VLMs), the code's simplicity allows for easy modification to train Large Language Models (LLMs). Leverage the clean structure to adapt the code to your specific LLM training needs.
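As a hedged sketch of that adaptation (the model name is an example and the repository's actual loading code may differ), the core change is swapping the vision-language model and processor for a text-only model and tokenizer using standard transformers APIs:

```python
# Hypothetical VLM -> LLM swap using standard transformers classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example text-only checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# From here, drop the image inputs from the dataset and feed text-only
# prompts through the same GRPO-style training loop.
```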