Real-Time Video Commentary: Train Your Own Video LLM with Streaming Speech Transcription
Want to build a video LLM that can comment on videos in real time? LiveCC makes that possible. It is trained with a novel video-ASR streaming method and achieves state-of-the-art (SOTA) performance on both streaming and offline benchmarks.
What is LiveCC?
LiveCC is the first video LLM capable of generating real-time commentary, trained with streaming speech transcription at scale. This opens the door to genuinely interactive video experiences.
Get Started with LiveCC: Installation Guide
Ready to dive in? Here's how to install LiveCC and its dependencies.
- Python: Make sure you have Python 3.11 or later installed.
- Core Packages: Install the essential packages with pip (a hedged example follows this list). Note: development was done with torch==2.6.0 and transformers==4.50.0.
- Advanced (Data Pipeline): To work with the data production pipeline, install the additional data-pipeline dependencies listed in the repository.
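The exact dependency list lives in the project repository; the commands below are only a minimal sketch of a pip-based setup that pins the two versions mentioned above. The virtual-environment step and the requirements file name are placeholders, not the project's actual instructions.

# Minimal sketch of a pip-based setup. Only torch==2.6.0 and transformers==4.50.0
# come from the text above; everything else is a placeholder.
python -m venv .venv && source .venv/bin/activate   # requires Python 3.11+
pip install torch==2.6.0 transformers==4.50.0
pip install -r requirements.txt                     # placeholder: remaining dependencies from the repo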
Quick Start: Gradio Demo and CLI
Once installed, familiarize yourself with LiveCC through the Gradio demo and the command-line interface (CLI). Details and instructions can be found in inference.md.
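As a rough orientation only, a quick start typically boils down to two commands like the ones below; both entry points are hypothetical placeholders, and the real commands and options are documented in inference.md.

# Hypothetical placeholders -- the actual commands are in inference.md.
python app.py                            # launch the Gradio demo (hypothetical entry point)
python cli.py --video your_video.mp4     # run CLI inference on a local video (hypothetical entry point)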
Training Your Own Video LLM: Pre-training and SFT
LiveCC offers scripts for both pre-training and Supervised Fine-Tuning (SFT). Here's a breakdown:
Pre-training for your Video LLM
Use pre-training to lay the foundation for your model.
- Data: Use the Live-CC-5M dataset, available on Hugging Face.
- Script: Run scripts/pt_local.sh, making sure the following parameters are configured (a hedged launch sketch follows this list):
  - VIDEO_MIN_PIXELS: sets the minimum visual tokens per frame sent to the LLM.
  - FPS_MAX_FRAMES: sets the maximum number of frames per video (480 frames equals 4 minutes at 2 FPS).
  - VIDEO_MAX_PIXELS: caps the overall video tokens sent to the LLM.
  - learning_rate: sets the learning rate for pre-training.
- Key Arguments Explained:
  - --deepspeed ./scripts/deepspeed_zero2.json: enables DeepSpeed ZeRO-2 for memory efficiency.
  - --output_dir: directory where model checkpoints are saved.
  - --pretrained_model_name_or_path Qwen/Qwen2-VL-7B: initializes training from the Qwen2-VL-7B model.
  - --freeze_modules visual: freezes the parameters of the visual encoder.
  - --use_liger_kernel True: enables Liger kernels for faster, more memory-efficient training.
  - --annotation_paths: path(s) to the training annotation files.
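To make the relationship between these knobs concrete, here is a hedged sketch of the kind of launch command scripts/pt_local.sh wraps. The entry-point name (train.py), process count, pixel budgets, learning rate, and annotation path are illustrative placeholders; the authoritative values are the ones set inside the script itself.

# Hedged sketch only -- placeholders, not the repository's actual script contents.
export VIDEO_MIN_PIXELS=$((64 * 28 * 28))     # placeholder: minimum visual tokens per frame
export FPS_MAX_FRAMES=480                     # max frames per video (4 minutes at 2 FPS)
export VIDEO_MAX_PIXELS=$((24576 * 28 * 28))  # placeholder: overall video token budget

torchrun --nproc_per_node 8 train.py \
    --deepspeed ./scripts/deepspeed_zero2.json \
    --output_dir ./checkpoints/livecc_pretrain \
    --pretrained_model_name_or_path Qwen/Qwen2-VL-7B \
    --freeze_modules visual \
    --use_liger_kernel True \
    --annotation_paths /path/to/live_cc_5m.jsonl \
    --learning_rate 1e-5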
Supervised Fine-Tuning (SFT) for your Video LLM
Fine-tune your model for specific tasks using SFT.
- Data: Datasets like Live-WhisperX-526K and LLaVA-Video-178K.
- Script: Run scripts/sft_local.sh with the following considerations (a hedged sketch follows this list):
  - Leverage the provided datasets.
  - Adjust the learning_rate for SFT.
- Configuration Highlights:
  - Utilize multiple datasets to enhance versatility.
  - Ensure consistent Liger kernel usage between training and inference.
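As with pre-training, the following is only an illustrative sketch of how an SFT launch might combine several annotation files and an adjusted learning rate. The entry point, file names, learning-rate value, and the choice to initialize from the pre-training checkpoint are all assumptions; the real configuration lives in scripts/sft_local.sh.

# Hedged sketch -- placeholder paths and values, not the repository's actual SFT configuration.
torchrun --nproc_per_node 8 train.py \
    --deepspeed ./scripts/deepspeed_zero2.json \
    --output_dir ./checkpoints/livecc_sft \
    --pretrained_model_name_or_path ./checkpoints/livecc_pretrain \
    --use_liger_kernel True \
    --annotation_paths /path/to/live_whisperx_526k.jsonl /path/to/llava_video_178k.jsonl \
    --learning_rate 2e-6    # placeholder: adjust the learning rate for SFT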
Evaluation: Benchmarking Your Video LLM
Assess your model's performance using various benchmarks.
LiveSports3KCC: Real-time Video Commentary Evaluation
Evaluate your model's ability to generate real-time video commentary.
- Generation: Use distributed_generate_livecc.py to generate commentary.
- LLM Judge: Use llm_judge.py to compute winning rates; note that judge results can vary slightly between runs (a hedged sketch follows this list).
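A hedged sketch of the two-step flow is shown below, assuming the generation script is launched in a standard distributed fashion; the actual flags are defined in the scripts themselves and are not reproduced here.

# Hedged sketch of the two-step evaluation flow; consult each script for its real arguments.
torchrun --nproc_per_node 8 distributed_generate_livecc.py   # step 1: generate streaming commentary per clip
python llm_judge.py                                          # step 2: score winning rates with an LLM judge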
VideoMME: Multi-modal Evaluation
Evaluate your model on video multi-modal understanding.
- Data Preparation: Format your data according to the VideoMME format.
- Execution: Run distributed_evaluate_videomme.py both with and without subtitles, as sketched below.
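For instance, the with/without-subtitles comparison might be driven as in the sketch below; the subtitle flag name is hypothetical, so check the script's own argument parser for the real option.

# Hedged sketch; the --use_subtitles flag is hypothetical -- see the script's actual arguments.
torchrun --nproc_per_node 8 distributed_evaluate_videomme.py                    # without subtitles
torchrun --nproc_per_node 8 distributed_evaluate_videomme.py --use_subtitles    # with subtitles (hypothetical flag)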
Citation
If you find LiveCC useful in your research, cite the following paper:
@inproceedings{livecc,
author = {Joya Chen and Ziyun Zeng and Yiqi Lin and Wei Li and Zejun Ma and Mike Zheng Shou},
title = {LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale},
booktitle = {CVPR},
year = {2025},
}