Unleash the Power of Liquid: A Unified Multimodal Generation Paradigm
Discover Liquid, a groundbreaking auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation. This innovative approach utilizes a single large language model (LLM) to bridge the gap between text and visuals, unlocking unprecedented possibilities for multimodal applications.
Why Liquid is Revolutionizing Multimodal AI
Liquid stands apart from traditional multimodal large language models (MLLMs) by eliminating the reliance on external pretrained visual embeddings like CLIP. This streamlined architecture offers several key advantages:
- Simplified Architecture: Reduces complexity and dependencies, making integration easier.
- Enhanced Efficiency: Leverages the full potential of a single LLM for both visual and textual processing.
- Unified Token Space: Allows visual generation and comprehension to mutually enhance each other.
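To make the unified token space concrete, here is a minimal sketch of the general pattern: a VQ-style codebook's discrete image codes are appended to an LLM's vocabulary so text and image tokens share one embedding table. The base model, codebook size, and token names below are illustrative assumptions, not Liquid's actual configuration.

```python
# Illustrative sketch only: how a unified text+image token space can be set up
# with Hugging Face transformers. The base model, codebook size, and token
# naming scheme are assumptions, not Liquid's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "gpt2"      # stand-in base LLM (assumption)
CODEBOOK_SIZE = 8192     # stand-in VQ codebook size (assumption)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Append one new token per discrete image code so image and text tokens
# share a single vocabulary and embedding table.
image_tokens = [f"<img_{i}>" for i in range(CODEBOOK_SIZE)]
tokenizer.add_tokens(image_tokens)
model.resize_token_embeddings(len(tokenizer))

# A mixed sequence: text tokens and image-code tokens interleave freely.
mixed = "A photo of a cat: " + "".join(f"<img_{i}>" for i in [3, 77, 4091])
print(tokenizer(mixed).input_ids)
```

Because both modalities live in one vocabulary, the same next-token objective trains comprehension and generation jointly, which is what lets the two tasks reinforce each other.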
Key Dates & Updates
Stay up-to-date with the latest advancements in Liquid:
- 2025-03-25: Updated data processing and model pretraining scripts.
- 2025-03-04: Released text-to-image and visual understanding evaluation scripts.
- 2025-02-28: Paper, demo, model, and project page officially launched.
Open-Source Potential: Liquid's Accessibility
Foundation Vision is committed to open-source principles, making Liquid accessible to researchers and developers worldwide. Explore the available resources:
Liquid-7B-IT (Instruction Tuned Multimodal Model)
- [✅] Web Demo
- [✅] Evaluation
- [✅] Checkpoints
- [✅] Training Codes
Liquid-0.5B~32B-Pretrain (Multimodal Extension Models)
- Checkpoints available for various scales across three model families.
Hands-On with Liquid: Simple Inference Guide
Getting started with Liquid is straightforward, thanks to its HuggingFace-compatible format. You can perform both inference and evaluation with minimal dependencies.
- Installation:

```bash
pip install gradio==4.44.1
pip install gradio_client==1.3.0
```
- Run the Gradio demo locally:

```bash
cd evaluation
python app.py
```
If deploying on a GPU with less than 30GB VRAM, enable `load_in_8bit` in `AutoModelForCausalLM.from_pretrained` within `app.py` to prevent out-of-memory errors.
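As a concrete illustration of this tip, here is a minimal sketch of the `from_pretrained` call with 8-bit loading enabled; the checkpoint path is a placeholder, and the `bitsandbytes` package must be installed.

```python
# Minimal sketch of the 8-bit loading change suggested above. The model path
# is a placeholder; requires bitsandbytes and a CUDA GPU.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/Liquid-7B-IT",  # placeholder checkpoint path (assumption)
    load_in_8bit=True,       # quantize weights to 8-bit to fit < 30GB VRAM
    device_map="auto",       # let accelerate place layers across devices
)
```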
Real-World Examples: Putting Liquid to Work
See Liquid in action across three inference modes; an illustrative sketch for each follows the list:

- Pure Language Dialogue
- Image Understanding
- Image Generation (Text-to-Image)
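Below is a minimal sketch of a pure language dialogue round-trip using the standard transformers API. The checkpoint path and prompt format are assumptions; consult the released evaluation scripts for the exact usage.

```python
# Sketch 1: pure language dialogue. Checkpoint name and prompt format are
# assumptions, not the repository's exact interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/Liquid-7B-IT"  # placeholder (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

prompt = "Explain autoregressive image generation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```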
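For image understanding, an image must first be mapped to discrete visual tokens. The sketch below assumes a hypothetical `encode_image` helper standing in for the repository's visual tokenizer; it is not a real Liquid API.

```python
# Sketch 2: image understanding. Liquid consumes discrete image tokens, so the
# image is first encoded by the model's visual tokenizer. `encode_image` is a
# hypothetical stand-in for that step, not a real Liquid API.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/Liquid-7B-IT"  # placeholder (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def encode_image(path: str) -> str:
    """Hypothetical helper: run the VQ image tokenizer and return the image's
    discrete codes rendered as special tokens such as <img_0><img_1>..."""
    raise NotImplementedError("use the image tokenizer shipped with the repo")

prompt = encode_image("cat.jpg") + "\nDescribe this image."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```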
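For text-to-image generation, the model emits discrete image tokens that a VQ decoder converts back to pixels. The `decode_image` helper below is likewise a hypothetical stand-in for the repository's decoder.

```python
# Sketch 3: text-to-image generation. The model emits discrete image tokens;
# a VQ decoder turns them back into pixels. `decode_image` is a hypothetical
# stand-in for that decoder, not a real Liquid API.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/Liquid-7B-IT"  # placeholder (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

prompt = "Generate an image: a red fox in fresh snow at golden hour."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, do_sample=True)

def decode_image(token_ids) -> "PIL.Image.Image":
    """Hypothetical helper: map generated <img_*> tokens back to pixels via
    the VQ decoder shipped with the repo."""
    raise NotImplementedError

image = decode_image(output[0])
image.save("fox.png")
```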
Scaling Laws: The Future of Multimodal Generation
Liquid reveals a crucial insight: the performance drop associated with jointly training visual and language tasks diminishes as model size increases. This scaling law for unified multimodal generation holds consistently across model sizes from 0.5B to 32B parameters.
Unleashing Visual Creativity: Autoregressive Image Generation
Liquid generates high-quality, photorealistic images at any aspect ratio from descriptive text prompts, giving autoregressive image generation precise, language-driven control.
Dive Deeper: Installation, Training, and More
For detailed instructions on installation, training, and data processing, refer to Data.md and TRAIN.md. Start exploring the world of unified multimodal generation with Liquid, and discover the potential of this powerful AI paradigm.
License & Citation
This project is licensed under the MIT License. If you find Liquid useful, please cite the following:
```bibtex
@article{wu2024liquid,
  title={Liquid: Language models are scalable multi-modal generators},
  author={Wu, Junfeng and Jiang, Yi and Ma, Chuofan and Liu, Yuliang and Zhao, Hengshuang and Yuan, Zehuan and Bai, Song and Bai, Xiang},
  journal={arXiv preprint arXiv:2412.04332},
  year={2024}
}
```