Train Your Own DeepSeek-R1: Simplified Code for Maximum Impact
Frustrated with complex, bloated code when trying to train your own DeepSeek-R1 model? This repository offers a streamlined solution, focusing on simplicity and efficiency. Achieve your DeepSeek-R1 training goals faster and with less overhead.
Why This DeepSeek-R1 Repository Stands Out
Instead of navigating thousands of lines of code and countless configuration files, this repository distills the essential components into just three files: main.py, trainer.py, and utils.py. This simplified approach makes it easier to understand, customize, and troubleshoot your DeepSeek-R1 training process. Say goodbye to the complexity often associated with GRPOTrainer and embrace clarity.
Here's what you gain:
- Minimal Code: Only 831 lines across three files for easy comprehension.
- No Hugging Face GRPOTrainer: Avoid the complexities and customization frustrations.
- vLLM Integration: Achieve fast answer generation without sacrificing code simplicity (see the sketch after this list).
- Multi-GPU Support: Dedicated GPU for vLLM inference alongside training GPUs.
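As a minimal sketch of what the vLLM side of this setup can look like, the snippet below pins generation to a dedicated GPU and samples several completions per prompt, GRPO-style. The model name, GPU index, memory fraction, and sampling values here are illustrative assumptions, not the repository's actual settings; the real wiring lives in trainer.py and utils.py.

```python
# Illustrative vLLM generation sketch (not the repo's exact code).
# Pin inference to one dedicated GPU by setting CUDA_VISIBLE_DEVICES
# *before* importing vllm/torch.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "7"  # hypothetical: last GPU reserved for inference

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct", gpu_memory_utilization=0.8)
# Sample multiple completions per prompt, as GRPO-style training requires.
params = SamplingParams(temperature=1.0, max_tokens=512, n=8)
outputs = llm.generate(["What is 17 * 24? Think step by step."], params)
print(outputs[0].outputs[0].text)
```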
Unleash the Speed: Training Time and GPU Usage
The code achieves impressive training times on NVIDIA A100 GPUs. Here’s a glimpse of the performance you can expect:
- Qwen2-VL-2B-Instruct (100k QA samples):
  - 2 A100 GPUs: 14 hours
  - 8 A100 GPUs: 4.5 hours
- Qwen2.5-VL-3B-Instruct: 6 hours
- GPU Memory Usage: 40-60 GB (depending on parameter settings)
Troubleshooting vLLM: A Quick Fix
The repository addresses a critical GPU index-mapping bug present in older vLLM versions. A simple modification to vllm/worker/worker.py resolves the issue: commenting out the call to torch.cuda.set_device(self.device) fixes the GPU mapping problem. For optimal tensor parallelism, consider using vllm==0.8.3. Maintaining separate conda environments for different vLLM versions (e.g., 0.7.3 and 0.8.3) is recommended.
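For reference, the change looks roughly like the excerpt below. The exact location and surrounding code inside vllm/worker/worker.py vary between vLLM versions, so treat this as a hedged illustration rather than a verbatim patch:

```python
# vllm/worker/worker.py -- illustrative excerpt, not the full method.
class Worker:
    def init_device(self):
        ...
        # Commenting out the following call resolves the GPU index-mapping
        # problem described above:
        # torch.cuda.set_device(self.device)
        ...
```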
Simple Installation: Get Started Quickly
The installation process is straightforward, utilizing conda and pip. Follow these steps to set up your environment:
- Create a new conda environment:
conda create -n deepsick python=3.12 -y
- Activate the environment:
conda activate deepsick
- Install vLLM:
pip install vllm==0.7.3
- Install necessary packages:
pip install trl wandb debugpy datasets deepspeed accelerate
- Install flash attention:
pip install flash-attn --no-build-isolation
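After installation, a quick sanity check like the one below (a hypothetical helper script, not part of the repository) confirms the key packages import cleanly and the GPUs are visible:

```python
# check_env.py -- hypothetical post-install sanity check.
import torch
import vllm
import trl
import deepspeed

print("vLLM version:", vllm.__version__)          # expect 0.7.3
print("TRL version:", trl.__version__)
print("DeepSpeed version:", deepspeed.__version__)
print("Visible CUDA GPUs:", torch.cuda.device_count())
```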
Training with Multiple GPUs using DeepSpeed
This repository leverages DeepSpeed-ZeRO3 for efficient multi-GPU training, configured via ds_accel.yaml to distribute training properly across devices. To launch the training script, use the following command:
bash train.sh
This script automatically computes the number of processes for DeepSpeed, a clever fix for compatibility issues between vLLM and accelerate; a sketch of that logic follows.
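The actual contents of train.sh are in the repository. As a hedged illustration of the idea, the core logic is: count the available GPUs, reserve one for vLLM generation, and hand the rest to accelerate/DeepSpeed. The snippet below is a hypothetical Python rendering of that arithmetic, not the script itself:

```python
# Hypothetical illustration of the process-count logic in train.sh:
# one GPU is reserved for vLLM generation, the rest go to DeepSpeed training.
import torch

num_gpus = torch.cuda.device_count()
num_train_processes = max(num_gpus - 1, 1)  # reserve the last GPU for vLLM
print(
    "accelerate launch "
    "--config_file ds_accel.yaml "
    f"--num_processes {num_train_processes} main.py"
)
```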
From Vision Language Models (VLMs) to Large Language Models (LLMs)
Although primarily focused on DeepSeek-R1-style training of Vision Language Models (VLMs), the code's simplicity allows for easy modification to train Large Language Models (LLMs). Leverage the clean structure to adapt the code to your specific LLM training needs.
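As a hedged sketch of that adaptation (the model name is an example and the repository's actual loading code may differ), the core change is swapping the vision-language model and processor for a text-only model and tokenizer using standard transformers APIs:

```python
# Hypothetical VLM -> LLM swap using standard transformers classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example text-only checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# From here, drop the image inputs from the dataset and feed text-only
# prompts through the same GRPO-style training loop.
```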