Unleash the Power of Moonlight: Train Large Language Models 2x Faster!
Are you ready to revolutionize your approach to Large Language Model (LLM) training? Introducing Moonlight, a game-changing Mixture-of-Experts (MoE) model that redefines the boundaries of computational efficiency. Trained with the innovative Muon optimizer on a massive 5.7T-token dataset, Moonlight offers strong performance at a significantly reduced training cost.
What is Muon and Why Should You Care?
Muon is a cutting-edge optimizer based on matrix orthogonalization. It has shown promising results in LLM training, but previously faced scalability challenges with larger models. The work behind Moonlight addresses these limitations, providing a streamlined, optimized recipe for training LLMs at scale. The technical report "Muon is Scalable for LLM Training" on arXiv dives deep into the mechanics of the Muon optimizer that powers Moonlight.
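To make the orthogonalization idea concrete, here is a minimal sketch of the Newton-Schulz iteration at the heart of Muon. The coefficients and normalization follow the publicly available reference implementation of Muon; treat this as an illustration rather than the exact code used to train Moonlight.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D gradient/momentum matrix.

    Sketch of the quintic Newton-Schulz iteration used by public Muon
    implementations; the coefficients come from the reference code, not
    necessarily from Moonlight's training runs.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)               # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

Instead of applying the raw momentum as AdamW-style optimizers do, Muon replaces it with this approximately orthogonal matrix, which equalizes the scale of the update across directions.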
Key benefits of using Muon:
- Unmatched Speed: Achieve approximately 2x computational efficiency compared to AdamW, the de facto standard for LLM training.
- Effortless Scalability: Train large models out-of-the-box without complex hyperparameter tuning.
- Enhanced Stability: Benefit from increased training stability due to innovative parameter-wise update scale adjustments.
Moonlight: A New Era in LLM Performance
Moonlight isn't just about speed; it's about pushing the Pareto frontier of performance and efficiency. Compared to other state-of-the-art models such as Llama3.2-3B and Qwen2.5-3B, Moonlight achieves better performance with fewer training FLOPs.
Moonlight Key Features:
- 3B/16B-parameter MoE: 16B total parameters with roughly 3B activated per token, equipped to handle complex tasks and datasets.
- Trained with 5.7T Tokens: Ensures robust learning and generalization.
- Efficient Distributed Implementation: Optimized for memory and communication overhead with a ZeRO-1 style distributed optimizer (see the sketch after this list).
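As a rough, conceptual sketch of the distributed idea (not the repository's actual code): in a ZeRO-1 style setup each data-parallel rank permanently stores only a shard of the momentum buffer, and the shards are gathered just before the orthogonalization step, which needs the full matrix. The sharding scheme and helper below are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def gather_full_momentum(momentum_shard: torch.Tensor, param_shape: torch.Size) -> torch.Tensor:
    """Gather a ZeRO-1-sharded momentum buffer back into a full matrix.

    Toy illustration: each rank holds an equally sized, contiguous shard; the
    full matrix exists only transiently for the Newton-Schulz step.
    """
    world = dist.get_world_size()
    shards = [torch.empty_like(momentum_shard) for _ in range(world)]
    dist.all_gather(shards, momentum_shard)          # collect shards in rank order
    return torch.cat(shards).view(param_shape)
```

Only the small momentum shard lives on each rank (the ZeRO-1 memory saving); once the full matrix is reconstructed, the update proceeds as in the single-device case.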
Unlock the Power: How to Use Moonlight
Moonlight is readily accessible through Hugging Face Transformers, allowing seamless integration into your existing workflows.
Here's how to get started with the pretrained Moonlight model:
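A standard Transformers loading snippet along these lines should work, assuming the base model is published on the Hugging Face Hub as moonshotai/Moonlight-16B-A3B (check the model card for the exact id and recommended settings):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "moonshotai/Moonlight-16B-A3B"  # assumed Hub id; verify on the model card
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "1 + 1 = 2, 1 + 2 ="
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```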
For the instruction-tuned model (Moonlight-Instruct):
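Again as a sketch, assuming an instruction-tuned checkpoint named moonshotai/Moonlight-16B-A3B-Instruct with a chat template defined in its tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "moonshotai/Moonlight-16B-A3B-Instruct"  # assumed Hub id; verify on the model card
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the Muon optimizer in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
generated = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(generated[0][input_ids.shape[1]:], skip_special_tokens=True))
```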
Train Your Own Models with Muon: A Practical Guide
Eager to train your own models using the power of Muon? The example below offers a straightforward starting point.
Training examples:
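The snippet below is a deliberately tiny, self-contained sketch of that comparison. The `Muon` import path and its constructor arguments are assumptions for illustration only; the Moonlight repository's own example scripts are the authoritative reference.

```python
import torch
from torch import nn

def train(optimizer_name: str = "muon", steps: int = 100) -> float:
    """Train a toy MLP on a reconstruction objective with either Muon or AdamW."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    if optimizer_name == "muon":
        from muon import Muon  # hypothetical import; use the optimizer shipped with the repo
        optimizer = Muon(model.parameters(), lr=2e-2, weight_decay=0.1)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    for _ in range(steps):
        x = torch.randn(32, 512)
        loss = (model(x) - x).pow(2).mean()   # toy reconstruction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()

# Run the same toy task with both optimizers to compare convergence.
print("muon :", train("muon"))
print("adamw:", train("adamw"))
```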
This lets you compare the performance of Muon against AdamW directly on the same task.
Weight Decay's Crucial Role
Two adjustments, adding weight decay and keeping the update Root Mean Square (RMS) consistent across parameters, are critical upgrades that markedly improve Muon's training stability in large-scale training. They help Muon maintain its edge as a go-to optimizer for demanding LLM training tasks.
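As a rough illustration of what this looks like in practice, here is a single-parameter Muon step with decoupled weight decay and an RMS-matching scale factor of 0.2 * sqrt(max(n, m)), reusing the newton_schulz helper sketched earlier. The exact constants and schedule used for Moonlight may differ; treat this as a reading aid, not the official optimizer.

```python
import torch

def muon_update(param: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
                lr: float = 2e-2, beta: float = 0.95, weight_decay: float = 0.1) -> None:
    """One Muon step with decoupled weight decay and RMS-matched scaling (sketch)."""
    momentum.mul_(beta).add_(grad)                 # momentum accumulation
    update = newton_schulz(momentum)               # orthogonalized update direction
    rms_scale = 0.2 * max(param.shape) ** 0.5      # keep update RMS comparable to AdamW's
    param.mul_(1 - lr * weight_decay)              # decoupled (AdamW-style) weight decay
    param.add_(update, alpha=-lr * rms_scale)
```

Scaling by 0.2 * sqrt(max(n, m)) keeps the per-element update magnitude roughly in line with AdamW's, so learning-rate and weight-decay settings tuned for AdamW transfer more directly.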
Join the Future of LLM Training
Moonlight, powered by the Muon optimizer, represents a significant leap forward in LLM technology. By leveraging its efficiency and scalability, you can achieve state-of-the-art results while drastically reducing computational costs. Explore the code, experiment with the models, and contribute to the ongoing research that is reshaping the future of AI.