Unleash the Power of Liquid: A Unified Approach to Multimodal Generation

Meet Liquid, an autoregressive generation paradigm that integrates visual comprehension and visual generation in a single model working over both visual and textual data. Let's dive into Liquid's capabilities, see how to run it, and explore how it can fit into your projects.

What is Liquid and Why Should You Care?

Liquid is a multimodal large language model (MLLM) built on a single large language model (LLM). The same model handles both visual comprehension and visual generation, which eliminates the need for external pretrained visual embeddings such as CLIP. This simplified architecture offers:

Enhanced Scalability: Liquid demonstrates a clear scaling law, meaning performance improves significantly as the model size increases (from 0.5B to 32B).
Unified Training: Liquid's design minimizes the performance drop often associated with combining visual and language tasks.
Mutual Enhancement: The unified token space allows visual generation and comprehension tasks to reinforce each other, leading to superior results.
Text-to-Image Innovation: Want to generate stunning, photorealistic images from text prompts? Liquid delivers high-quality results in any aspect ratio.

Getting Started with Liquid: Inference Made Easy

One of the most appealing aspects of Liquid is its user-friendliness. You don't need a complex environment to get started with inference or evaluation. Because Liquid is built on the HuggingFace Transformers library, you only need the transformers library and a few basic dependencies, making integration a breeze. Consult EVAL.md for recommended library versions.

Run a Gradio Demo Locally

Install the necessary libraries:

```
pip install gradio==4.44.1
pip install gradio_client==1.3.0
```

Navigate to the evaluation directory:
```
cd evaluation
```
Launch the demo:
```
python app.py
```
Pro Tip: If you're running on a GPU with less than 30GB VRAM, enable load_in_8bit in AutoModelForCausalLM.from_pretrained within app.py to prevent out-of-memory errors during image generation.
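
The tip above can be sketched as a small helper that decides whether to request 8-bit loading. This is a minimal sketch under stated assumptions, not the code in app.py: the names build_load_kwargs and load_liquid are hypothetical, and the available VRAM is passed in by the caller rather than detected. Only the 30GB threshold and the load_in_8bit argument come from the tip. Note that load_in_8bit relies on the bitsandbytes package, and recent transformers releases may prefer a BitsAndBytesConfig passed through quantization_config.

```python
VRAM_THRESHOLD_GB = 30  # threshold quoted in the tip above


def build_load_kwargs(vram_gb: float) -> dict:
    """Hypothetical helper: extra keyword arguments for from_pretrained."""
    kwargs = {}
    if vram_gb < VRAM_THRESHOLD_GB:
        # Trade some speed for a smaller memory footprint on smaller GPUs.
        kwargs["load_in_8bit"] = True
    return kwargs


def load_liquid(model_path: str, vram_gb: float):
    """Hypothetical loader; downloads the model weights when called."""
    # Imported lazily so build_load_kwargs works without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path, **build_load_kwargs(vram_gb)
    )
```

For example, build_load_kwargs(24) requests 8-bit loading, while build_load_kwargs(40) leaves the default precision untouched.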

Single Inference Examples

Here are some examples of how to use Liquid for different tasks:

Pure Language Dialogue:

```
python inference_t2t.py --model_path Junfeng5/Liquid_V1_7B --prompt "Write me a poem about Machine Learning."
```

Image Understanding (Visual Question Answering):

```
python inference_i2t.py --model_path Junfeng5/Liquid_V1_7B --image_path samples/baklava.png --prompt 'How to make this pastry?'
```

Text-to-Image Generation:
```
python inference_t2i.py --model_path Junfeng5/Liquid_V1_7B --prompt "young blue dragon with horn lightning in the style of dd fantasy full body"
```
For GPUs with less than 30GB VRAM, add the --load_8bit flag. This lets users with limited GPU memory run image generation without out-of-memory errors.
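
The three commands above share the same shape, so they can also be assembled programmatically. The following is a minimal sketch, assuming the scripts are run from the directory that contains them. The helper build_command is hypothetical and uses only the script names and flags shown above. The text mentions --load_8bit alongside text-to-image generation, so whether the other scripts accept it is an assumption to verify.

```python
import shlex

MODEL_PATH = "Junfeng5/Liquid_V1_7B"


def build_command(script, prompt, image_path=None, load_8bit=False):
    """Hypothetical helper: build the argv list for one inference script."""
    cmd = ["python", script, "--model_path", MODEL_PATH]
    if image_path is not None:
        # Only the image-understanding example above takes an image.
        cmd += ["--image_path", image_path]
    cmd += ["--prompt", prompt]
    if load_8bit:
        # Mentioned in the text for GPUs with less than 30GB VRAM.
        cmd.append("--load_8bit")
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "inference_i2t.py",
        "How to make this pastry?",
        image_path="samples/baklava.png",
    )
    # Print a shell-safe version; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when a prompt contains quotes or other special characters.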

Diving Deeper: Training and Data

If you're interested in training your own Liquid models, refer to Data.md and TRAIN.md for detailed instructions on data processing and training scripts. These resources walk through how to prepare data and train multimodal generation models.

The Liquid Advantage: Scaling Laws and Unified Token Space

Two properties set Liquid apart: a scaling law for multimodal generation and a unified token space. Together they underpin the text-to-image generation described above. Here's how each contributes to performance:

Scaling Law: As the model size increases to 32B, Liquid demonstrates improved performance across various multimodal tasks.
Unified Token Space: By combining visual and textual information into a single token space, Liquid enables seamless interaction between visual understanding and generation tasks.
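
To make the idea of a single token space concrete, here is a toy sketch. It assumes that an image has already been quantized into discrete codebook indices by some image tokenizer, and that those indices are placed after the text vocabulary using a fixed offset. The vocabulary sizes, the offset scheme, and the function names are illustrative assumptions, not Liquid's actual values or implementation.

```python
TEXT_VOCAB_SIZE = 32_000     # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192  # hypothetical image codebook size


def image_code_to_token(code: int) -> int:
    """Map an image codebook index into the shared token ID range."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def token_to_modality(token_id: int):
    """Return (modality, index within that modality's vocabulary)."""
    if 0 <= token_id < TEXT_VOCAB_SIZE:
        return ("text", token_id)
    if TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE:
        return ("image", token_id - TEXT_VOCAB_SIZE)
    raise ValueError(f"token id out of range: {token_id}")


def build_sequence(text_ids, image_codes):
    """Concatenate text tokens and image tokens into one ID sequence."""
    return list(text_ids) + [image_code_to_token(c) for c in image_codes]
```

Because every ID in such a sequence belongs to one shared vocabulary, a single autoregressive model can predict the next token with the same output layer whether that token is text or image.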

License and Citation

This project is licensed under the MIT License. If you find Liquid valuable, please cite the following paper:

@article{wu2024liquid,
  title={Liquid: Language models are scalable multi-modal generators},
  author={Wu, Junfeng and Jiang, Yi and Ma, Chuofan and Liu, Yuliang and Zhao, Hengshuang and Yuan, Zehuan and Bai, Song and Bai, Xiang},
  journal={arXiv preprint arXiv:2412.04332},
  year={2024}
}

Embrace the Future with Liquid

Liquid represents a significant leap forward in multimodal AI. Its ease of use, scalability, and unified approach make it a powerful tool for visual understanding, visual generation, and multimodal generation. Explore the possibilities and unlock the potential of Liquid in your projects today!
