Tired of Bloated Codebases? Train Your Vision Language Model with Just 3 Files!
Are you drowning in thousands of lines of code just to train a vision language model? DeepSick-R1 offers a streamlined approach. This repository delivers a surprisingly simple and efficient way to train your model, focusing on clarity and customization. Get ready to ditch the complexity and dive straight into training.
Why DeepSick-R1 Stands Out From The Crowd
Training models like DeepSeek-R1 can be complex. You might think a few hundred lines of code can't handle that kind of training, but DeepSick-R1 delivers. Here's why:
- Simplicity First: Forget navigating massive codebases. DeepSick-R1 boils down the training process to just three key files: main.py, trainer.py, and utils.py.
- No More Hugging Face Frustration: Skip the headaches of complex GRPOTrainer customization. This repo eliminates the need for Hugging Face's GRPOTrainer class.
- vLLM Integration for Speed: Generate answer candidates at lightning speed with built-in vLLM support.
- Multi-GPU Training Made Easy: Seamlessly leverage multiple GPUs, dedicating one to vLLM inference while others focus on training.
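The one-GPU-for-inference split described above can be sketched as a small helper. This is a hypothetical illustration of the idea, not DeepSick-R1's actual device-handling code:

```python
def assign_devices(num_gpus: int):
    """Reserve the last GPU for vLLM generation; train on the rest.

    Hypothetical helper illustrating the split described above;
    the repo's actual device handling may differ.
    """
    if num_gpus < 2:
        raise ValueError("need at least 2 GPUs: one for vLLM, the rest for training")
    # GPUs 0..n-2 run DeepSpeed training workers; GPU n-1 serves vLLM generation.
    train_devices = [f"cuda:{i}" for i in range(num_gpus - 1)]
    vllm_device = f"cuda:{num_gpus - 1}"
    return train_devices, vllm_device
```

With 8 GPUs, for example, this yields seven training devices and reserves `cuda:7` for vLLM inference.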
Train Vision Language Models Faster: Real-World Performance
Here's what you can expect in terms of training performance:
- Qwen2-VL-2B-Instruct: 14 hours on 2 NVIDIA A100 80GB GPUs (100k QA samples).
- Qwen2-VL-2B-Instruct: 4.5 hours on 8 NVIDIA A100 80GB GPUs (100k QA samples).
- Qwen2.5-VL-3B-Instruct: 6 hours on multiple GPUs.
Achieve impressive results with minimal code and maximum efficiency.
Getting Started: Installation and Usage
Ready to start training your own vision language model with DeepSick-R1? Here's a quick guide:
- Create a Conda environment
- Install vLLM
- Install the repo dependencies
- Install Flash Attention
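The steps above might look like the following. Environment name and Python version are assumptions; check the repository's requirements for the exact pins:

```shell
# Assumed environment name and versions -- adjust to the repo's requirements.
conda create -n deepsick python=3.10 -y
conda activate deepsick

# vLLM 0.7.3 is the version the repo's workaround targets;
# 0.8.3 is needed for tensor_parallel_size=8 (see the vLLM notes below).
pip install vllm==0.7.3

# Remaining dependencies from the repository.
pip install -r requirements.txt

# Flash Attention typically needs --no-build-isolation to build against
# the already-installed torch.
pip install flash-attn --no-build-isolation
```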
With a simple setup, you're ready to train. The launch script supports multi-GPU training out of the box: it utilizes DeepSpeed ZeRO3 and automatically configures the number of processes based on available GPUs.
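A sketch of such a launch command, assuming an Accelerate + DeepSpeed ZeRO-3 setup (the config filename is illustrative; `main.py` is the repo's entry point):

```shell
# Hypothetical launch command; the repo's actual script and flags may differ.
# One GPU is left free for vLLM inference, so training uses NUM_GPUS - 1 processes.
NUM_GPUS=$(nvidia-smi -L | wc -l)
accelerate launch \
  --config_file deepspeed_zero3.yaml \
  --num_processes $((NUM_GPUS - 1)) \
  main.py
```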
Navigating vLLM Issues: A Quick Fix
Encountering issues with vLLM? DeepSick-R1 provides a practical workaround for vLLM version 0.7.3: to avoid GPU index mapping problems, comment out the specified line in vllm/worker/worker.py.
For tensor parallelism with tensor_parallel_size=8, use vLLM version 0.8.3. Consider using separate Conda environments for versions 0.7.3 and 0.8.3 to avoid conflicts.
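One way to keep the two versions side by side, as suggested above (environment names are assumptions):

```shell
# One Conda environment per vLLM version avoids dependency conflicts.
conda create -n deepsick-vllm073 python=3.10 -y
conda run -n deepsick-vllm073 pip install vllm==0.7.3

conda create -n deepsick-vllm083 python=3.10 -y
conda run -n deepsick-vllm083 pip install vllm==0.8.3
```

Activate whichever environment matches the tensor-parallel setting you need before launching training.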
Dive Into the Code: Key Files Explained
Understanding the structure of DeepSick-R1 is easy:
- main.py (292 lines): The heart of the training process, managing data loading, model initialization, and training loops.
- trainer.py (108 lines): Contains the training logic.
- utils.py (431 lines): Provides utility functions for data processing, logging, and more.
With just 831 lines of code, you can quickly grasp the entire training pipeline and adapt it to your specific needs. This makes it possible to train powerful vision language models with ease and flexibility.