Tired of Bloated Codebases? Train Your Vision Language Model with Just 3 Files!
Are you drowning in thousands of lines of code just to train a vision language model? DeepSick-R1 offers a streamlined approach. This repository delivers a surprisingly simple and efficient way to train your model, focusing on clarity and customization. Get ready to ditch the complexity and dive straight into training.
Why DeepSick-R1 Stands Out From The Crowd
Training models like DeepSeek-R1 can be complex. You might think a few hundred lines of code can't handle that kind of training, but DeepSick-R1 delivers. Here's why:
- Simplicity First: Forget navigating massive codebases. DeepSick-R1 boils down the training process to just three key files: main.py, trainer.py, and utils.py.
- No More Hugging Face Frustration: Skip the headaches of complex GRPOTrainer customization. This repo eliminates the need for Hugging Face's GRPOTrainer class.
- vLLM Integration for Speed: Generate answer candidates at lightning speed with built-in vLLM support.
- Multi-GPU Training Made Easy: Seamlessly leverage multiple GPUs, dedicating one to vLLM inference while others focus on training.
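The one-GPU-for-inference split described above can be sketched as a small helper. This is a hypothetical illustration of the idea, not DeepSick-R1's actual device-handling code:

```python
def assign_devices(num_gpus: int):
    """Reserve the last GPU for vLLM generation; train on the rest.

    Hypothetical helper illustrating the split described above;
    the repo's actual device handling may differ.
    """
    if num_gpus < 2:
        raise ValueError("need at least 2 GPUs: one for vLLM, the rest for training")
    # GPUs 0..n-2 run DeepSpeed training workers; GPU n-1 serves vLLM generation.
    train_devices = [f"cuda:{i}" for i in range(num_gpus - 1)]
    vllm_device = f"cuda:{num_gpus - 1}"
    return train_devices, vllm_device
```

With 8 GPUs, for example, this yields seven training devices and reserves `cuda:7` for vLLM inference.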
Train Vision Language Models Faster: Real-World Performance
Here's what you can expect in terms of training performance:
- Qwen2-VL-2B-Instruct: 14 hours on 2 NVIDIA A100 80GB GPUs (100k QA samples).
- Qwen2-VL-2B-Instruct: 4.5 hours on 8 NVIDIA A100 80GB GPUs (100k QA samples).
- Qwen2.5-VL-3B-Instruct: 6 hours on multiple GPUs.
Achieve impressive results with minimal code and maximum efficiency.
Getting Started: Installation and Usage
Ready to start training your own vision language model with DeepSick-R1? Here's a quick guide:
- Create a Conda environment
- Install vLLM
- Install the repo dependencies
- Install Flash Attention
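The steps above might look like the following. Environment name and Python version are assumptions; check the repository's requirements for the exact pins:

```shell
# Assumed environment name and versions -- adjust to the repo's requirements.
conda create -n deepsick python=3.10 -y
conda activate deepsick

# vLLM 0.7.3 is the version the repo's workaround targets;
# 0.8.3 is needed for tensor_parallel_size=8 (see the vLLM notes below).
pip install vllm==0.7.3

# Remaining dependencies from the repository.
pip install -r requirements.txt

# Flash Attention typically needs --no-build-isolation to build against
# the already-installed torch.
pip install flash-attn --no-build-isolation
```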
With a simple setup, you're ready to train. The launch script supports multi-GPU training out of the box: it utilizes DeepSpeed ZeRO3 and automatically configures the number of processes based on available GPUs.
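A sketch of such a launch command, assuming an Accelerate + DeepSpeed ZeRO-3 setup (the config filename is illustrative; `main.py` is the repo's entry point):

```shell
# Hypothetical launch command; the repo's actual script and flags may differ.
# One GPU is left free for vLLM inference, so training uses NUM_GPUS - 1 processes.
NUM_GPUS=$(nvidia-smi -L | wc -l)
accelerate launch \
  --config_file deepspeed_zero3.yaml \
  --num_processes $((NUM_GPUS - 1)) \
  main.py
```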
Navigating vLLM Issues: A Quick Fix
Encountering issues with vLLM? DeepSick-R1 provides a practical workaround for vLLM version 0.7.3: to avoid GPU index mapping problems, comment out the specified line in vllm/worker/worker.py.
For tensor parallelism with tensor_parallel_size=8, use vLLM version 0.8.3. Consider using separate Conda environments for versions 0.7.3 and 0.8.3 to avoid conflicts.
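One way to keep the two versions side by side, as suggested above (environment names are assumptions):

```shell
# One Conda environment per vLLM version avoids dependency conflicts.
conda create -n deepsick-vllm073 python=3.10 -y
conda run -n deepsick-vllm073 pip install vllm==0.7.3

conda create -n deepsick-vllm083 python=3.10 -y
conda run -n deepsick-vllm083 pip install vllm==0.8.3
```

Activate whichever environment matches the tensor-parallel setting you need before launching training.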
Dive Into the Code: Key Files Explained
Understanding the structure of DeepSick-R1 is easy:
- main.py (292 lines): The heart of the training process, managing data loading, model initialization, and training loops.
- trainer.py (108 lines): Contains the training logic.
- utils.py (431 lines): Provides utility functions for data processing, logging, and more.
With just 831 lines of code, you can quickly grasp the entire training pipeline and adapt it to your specific needs. This makes it possible to train powerful vision language models with ease and flexibility.