Real-Time Video Commentary: Train Your Own Video LLM with Streaming Speech Transcription
Want to build a video LLM that can comment on videos in real time? LiveCC makes that possible. It is trained with a novel video-ASR streaming method and achieves state-of-the-art (SOTA) performance on both streaming and offline benchmarks.
What is LiveCC?
LiveCC is the first video LLM capable of generating real-time commentary, trained with streaming speech transcription at scale. This opens the door to genuinely interactive video experiences.
Get Started with LiveCC: Installation Guide
Ready to dive in? Here's how to install LiveCC and its dependencies.
- Python: Make sure you have Python 3.11 or later installed.
- Core Packages: Install the essential packages with pip (a hedged example follows this list). Note: development was done with torch==2.6.0 and transformers==4.50.0.
- Advanced (Data Pipeline): To work with the data production pipeline, install the additional data-pipeline dependencies listed in the repository.
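The exact dependency list lives in the project repository; the commands below are only a minimal sketch of a pip-based setup that pins the two versions mentioned above. The virtual-environment step and the requirements file name are placeholders, not the project's actual instructions.

# Minimal sketch of a pip-based setup. Only torch==2.6.0 and transformers==4.50.0
# come from the text above; everything else is a placeholder.
python -m venv .venv && source .venv/bin/activate   # requires Python 3.11+
pip install torch==2.6.0 transformers==4.50.0
pip install -r requirements.txt                     # placeholder: remaining dependencies from the repo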
Quick Start: Gradio Demo and CLI
Once installed, familiarize yourself with LiveCC through the Gradio demo and the command-line interface (CLI). Details and instructions can be found in inference.md.
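As a rough orientation only, a quick start typically boils down to two commands like the ones below; both entry points are hypothetical placeholders, and the real commands and options are documented in inference.md.

# Hypothetical placeholders -- the actual commands are in inference.md.
python app.py                            # launch the Gradio demo (hypothetical entry point)
python cli.py --video your_video.mp4     # run CLI inference on a local video (hypothetical entry point)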
Training Your Own Video LLM: Pre-training and SFT
LiveCC offers scripts for both pre-training and Supervised Fine-Tuning (SFT). Here's a breakdown:
Pre-training for your Video LLM
Use pre-training to lay the foundation for your model.
- Data: Use the Live-CC-5M dataset, available on Hugging Face.
- Script: Run scripts/pt_local.sh, making sure the following parameters are configured (a hedged launch sketch follows this list):
  - VIDEO_MIN_PIXELS: sets the minimum visual tokens per frame sent to the LLM.
  - FPS_MAX_FRAMES: sets the maximum number of frames per video (480 frames equals 4 minutes at 2 FPS).
  - VIDEO_MAX_PIXELS: caps the overall video tokens sent to the LLM.
  - learning_rate: sets the learning rate for pre-training.
- Key Arguments Explained:
  - --deepspeed ./scripts/deepspeed_zero2.json: enables DeepSpeed ZeRO-2 for memory efficiency.
  - --output_dir: directory where model checkpoints are saved.
  - --pretrained_model_name_or_path Qwen/Qwen2-VL-7B: initializes training from the Qwen2-VL-7B model.
  - --freeze_modules visual: freezes the parameters of the visual encoder.
  - --use_liger_kernel True: enables Liger kernels for faster, more memory-efficient training.
  - --annotation_paths: path(s) to the training annotation files.
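To make the relationship between these knobs concrete, here is a hedged sketch of the kind of launch command scripts/pt_local.sh wraps. The entry-point name (train.py), process count, pixel budgets, learning rate, and annotation path are illustrative placeholders; the authoritative values are the ones set inside the script itself.

# Hedged sketch only -- placeholders, not the repository's actual script contents.
export VIDEO_MIN_PIXELS=$((64 * 28 * 28))     # placeholder: minimum visual tokens per frame
export FPS_MAX_FRAMES=480                     # max frames per video (4 minutes at 2 FPS)
export VIDEO_MAX_PIXELS=$((24576 * 28 * 28))  # placeholder: overall video token budget

torchrun --nproc_per_node 8 train.py \
    --deepspeed ./scripts/deepspeed_zero2.json \
    --output_dir ./checkpoints/livecc_pretrain \
    --pretrained_model_name_or_path Qwen/Qwen2-VL-7B \
    --freeze_modules visual \
    --use_liger_kernel True \
    --annotation_paths /path/to/live_cc_5m.jsonl \
    --learning_rate 1e-5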
Supervised Fine-Tuning (SFT) for your Video LLM
Fine-tune your model for specific tasks using SFT.
- Data: Datasets like Live-WhisperX-526K and LLaVA-Video-178K.
- Script: Run scripts/sft_local.sh with the following considerations (a hedged sketch follows this list):
  - Leverage the provided datasets.
  - Adjust the learning_rate for SFT.
- Configuration Highlights:
  - Utilize multiple datasets to enhance versatility.
  - Ensure consistent Liger kernel usage between training and inference.
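As with pre-training, the following is only an illustrative sketch of how an SFT launch might combine several annotation files and an adjusted learning rate. The entry point, file names, learning-rate value, and the choice to initialize from the pre-training checkpoint are all assumptions; the real configuration lives in scripts/sft_local.sh.

# Hedged sketch -- placeholder paths and values, not the repository's actual SFT configuration.
torchrun --nproc_per_node 8 train.py \
    --deepspeed ./scripts/deepspeed_zero2.json \
    --output_dir ./checkpoints/livecc_sft \
    --pretrained_model_name_or_path ./checkpoints/livecc_pretrain \
    --use_liger_kernel True \
    --annotation_paths /path/to/live_whisperx_526k.jsonl /path/to/llava_video_178k.jsonl \
    --learning_rate 2e-6    # placeholder: adjust the learning rate for SFT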
Evaluation: Benchmarking Your Video LLM
Assess your model's performance using various benchmarks.
LiveSports3KCC: Real-time Video Commentary Evaluation
Evaluate your model's ability to generate real-time video commentary.
- Generation: Use distributed_generate_livecc.py to generate commentary.
- LLM Judge: Use llm_judge.py to compute winning rates; note that judge results can vary slightly between runs (a hedged sketch follows this list).
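A hedged sketch of the two-step flow is shown below, assuming the generation script is launched in a standard distributed fashion; the actual flags are defined in the scripts themselves and are not reproduced here.

# Hedged sketch of the two-step evaluation flow; consult each script for its real arguments.
torchrun --nproc_per_node 8 distributed_generate_livecc.py   # step 1: generate streaming commentary per clip
python llm_judge.py                                          # step 2: score winning rates with an LLM judge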
VideoMME: Multi-modal Evaluation
Evaluate your model on video multi-modal understanding.
- Data Preparation: Format your data according to the VideoMME format.
- Execution: Run distributed_evaluate_videomme.py both with and without subtitles, as sketched below.
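For instance, the with/without-subtitles comparison might be driven as in the sketch below; the subtitle flag name is hypothetical, so check the script's own argument parser for the real option.

# Hedged sketch; the --use_subtitles flag is hypothetical -- see the script's actual arguments.
torchrun --nproc_per_node 8 distributed_evaluate_videomme.py                    # without subtitles
torchrun --nproc_per_node 8 distributed_evaluate_videomme.py --use_subtitles    # with subtitles (hypothetical flag)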
Citation
If you find LiveCC useful in your research, cite the following paper:
@inproceedings{livecc,
author = {Joya Chen and Ziyun Zeng and Yiqi Lin and Wei Li and Zejun Ma and Mike Zheng Shou},
title = {LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale},
booktitle = {CVPR},
year = {2025},
}