Unleash the Power of Liquid: A Scalable Multimodal Generation Paradigm
Ready to experience the future of AI? Liquid offers a groundbreaking approach to multimodal generation, seamlessly blending visual comprehension and text generation. Learn how Liquid can revolutionize your AI projects!
What is Liquid?
Liquid is a cutting-edge autoregressive generation paradigm that unifies multimodal comprehension and generation. Instead of relying on external visual embeddings, Liquid achieves integration using a single large language model (LLM). This innovative approach delivers a new level of scalability and versatility.
Key Benefits of Using Liquid
- Unified Multimodal Generation: Liquid seamlessly integrates visual and textual data, enabling powerful applications.
- Single LLM Architecture: Eliminates the need for external pre-trained visual embeddings like CLIP, simplifying the architecture.
- Scalable Performance: The performance drop from unifying comprehension and generation diminishes as model size increases.
- Mutual Enhancement: The unified token space enables visual generation and comprehension to mutually enhance each other.
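The unified token space behind these benefits can be sketched in a few lines: text tokens and discrete (VQ) image codes share a single vocabulary, so one LLM can autoregressively predict both. The vocabulary sizes and helper names below are illustrative assumptions, not Liquid's actual configuration.

```python
# Illustrative sketch of a unified text/image token space (not Liquid's
# real tokenizer): image codebook indices are offset past the text vocab,
# so both modalities live in one vocabulary.

TEXT_VOCAB_SIZE = 32_000     # ordinary BPE text tokens occupy ids [0, 32000)
IMAGE_CODEBOOK_SIZE = 8_192  # VQ image codes are mapped into [32000, 40192)

def image_code_to_token(code: int) -> int:
    """Map a VQ codebook index into the shared vocabulary."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

def is_image_token(token_id: int) -> bool:
    """True if this id falls in the image-token range of the vocabulary."""
    return token_id >= TEXT_VOCAB_SIZE

# Interleave a text prompt with image tokens into one sequence that a
# single LLM models autoregressively, with no external vision encoder.
text_tokens = [17, 942, 88]  # pretend-tokenized prompt
image_tokens = [image_code_to_token(c) for c in (5, 4095, 8191)]
sequence = text_tokens + image_tokens

print(sequence)  # [17, 942, 88, 32005, 36095, 40191]
```

Because both modalities share one next-token objective over this sequence, gradients from image generation and image understanding flow through the same weights, which is what allows the two tasks to reinforce each other.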
Getting Started with Liquid: Inference and Evaluation
Diving into the world of Liquid is easier than you think! Inference and evaluation require no complex setup: because Liquid is a standard HuggingFace-format language model, all you need is the transformers library and a few basic components.
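As a rough sketch of what that looks like with plain transformers (the checkpoint path is a placeholder and the generation settings are illustrative defaults, not Liquid's actual interface):

```python
# Hedged sketch: loading a HuggingFace-format checkpoint and running a
# text prompt through it. Path and settings are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

def chat(model_path: str, prompt: str, max_new_tokens: int = 128) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example (requires a downloaded checkpoint):
# print(chat("path/to/Liquid-7B-IT", "What is multimodal generation?"))
```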
Simple Steps for Inference
- Install the dependencies.
- Run the Gradio demo.
Examples of Single Inference
- Text to Text (T2T): Pure Language Dialogue
- Image to Text (I2T): Image Understanding
- Text to Image (T2I): Image Generation

Add `--load_8bit` when using GPUs with less than 30GB of VRAM.
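For reference, a `--load_8bit`-style flag in HuggingFace demo scripts typically toggles 8-bit quantized loading via bitsandbytes. A hedged sketch of the equivalent transformers call (placeholder path; not necessarily Liquid's exact implementation):

```python
# Hedged sketch: 8-bit weight loading, the usual mechanism behind a
# --load_8bit flag, to fit a large model under ~30GB of VRAM.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_8bit(model_path: str):
    """Load a checkpoint with 8-bit quantized weights."""
    return AutoModelForCausalLM.from_pretrained(
        model_path,  # placeholder, e.g. a local Liquid checkpoint directory
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
```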
Liquid's Open-Source Plan: Models and Capabilities
Liquid is designed to be accessible and adaptable, with a detailed open-source plan:
- Liquid-7B-IT: Instruction Tuned Multimodal Model with Instruction Following Ability
- [✅] Web Demo
- [✅] Evaluation
- [✅] Checkpoints
- [✅] Training Codes
- Liquid-0.5B~32B-Pretrain: Multimodal extension models of six different scales ranging from 0.5B to 32B across three model families.
- [ ] Checkpoints
Scaling Law and Multimodal Generation
Liquid exhibits a clear scaling law in multimodal generation across model sizes from 0.5B to 32B. It generates high-quality, photorealistic images at any aspect ratio from textual prompts using a purely autoregressive paradigm.
Dive Deeper: Installation, Training, and Further Resources
For detailed instructions on installation and training, refer to Data.md and TRAIN.md.
License Information
This project is licensed under the MIT License. See the LICENSE file for details.