Unleash Unprecedented LLM Efficiency: Introducing Moonlight and the Scalable Muon Optimizer
Tired of exorbitant compute costs for large language model (LLM) training? MoonshotAI's Moonlight, powered by the groundbreaking Muon optimizer, is here to revolutionize the landscape. Achieving roughly 2x the computational efficiency of AdamW, Moonlight pushes the boundaries of performance while drastically reducing training costs.
Key Benefits of Moonlight and the Muon Optimizer:
- Unmatched Efficiency: Train faster and cheaper with Muon, achieving comparable performance to AdamW with approximately 52% of the training FLOPs.
- Scalability Redefined: Overcome limitations of previous optimizers with Muon's enhanced scalability, now validated for large-scale models.
- Pareto-Optimal Performance: Moonlight advances the performance-compute Pareto frontier, outperforming current state-of-the-art models at a similar scale while requiring fewer training FLOPs.
- Open-Source Accessibility: Leverage our distributed Muon implementation, optimized for memory efficiency and communication, and build upon our released pretrained and instruction-tuned checkpoints.
Scaling Up Your LLMs with Muon: The Secret Ingredients
Our research pinpoints crucial advancements that unlock Muon's potential for large-scale training:
- Weight Decay: Adding AdamW-style weight decay is critical for Muon's effective scaling: it keeps weight magnitudes under control over long training runs and improves generalization.
- Per-Parameter Update Scale Adjustment: Maintaining consistent update root mean square (RMS) across all parameters, both matrix and non-matrix, is vital. Precisely calibrating update scales leads to significant gains in training stability.
These adjustments enable Muon to be used out of the box for large-scale training without extensive hyperparameter tuning, saving valuable time and resources.
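To make the RMS-matching idea concrete, here is a minimal single-matrix sketch. It assumes the update scale is adjusted by a factor of 0.2 * sqrt(max(A, B)) for an A x B weight, so the orthogonalized update lands in the same RMS range as a typical AdamW update, and that weight decay is applied in decoupled, AdamW style. The function names and hyperparameter values are illustrative, not the released API.

```python
import math
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1) with the
    quintic Newton-Schulz iteration used in open-source Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly used quintic coefficients
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def adjusted_lr(lr: float, shape: torch.Size) -> float:
    # Assumption: rescale by 0.2 * sqrt(max(A, B)) so the update RMS stays
    # consistent across matrices of different shapes and roughly matches AdamW.
    return lr * 0.2 * math.sqrt(max(shape[0], shape[1]))

@torch.no_grad()
def muon_matrix_step(p, grad, momentum, lr=2e-2, beta=0.95, weight_decay=0.1):
    """One illustrative Muon step for a single 2D weight matrix."""
    momentum.mul_(beta).add_(grad)        # momentum accumulation
    update = newton_schulz(momentum)      # orthogonalized update direction
    p.mul_(1 - lr * weight_decay)         # decoupled (AdamW-style) weight decay
    p.add_(update, alpha=-adjusted_lr(lr, p.shape))
```

Because the update RMS no longer depends on the matrix shape, Muon can reuse the learning-rate and weight-decay settings of a tuned AdamW recipe, which is what makes it usable out of the box.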
Efficient Distributed Implementation for Maximum Impact
We've developed a distributed version of Muon featuring ZeRO-1 style optimization, ensuring optimal memory efficiency and minimized communication overhead. This allows you to train larger models without compromising performance.
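As a rough illustration of that memory/communication trade-off (not the released implementation), the sketch below shards the momentum buffers ZeRO-1 style across data-parallel ranks: each rank updates only the parameters it owns and then broadcasts the refreshed weights. Ownership assignment, bucketing, and the Muon orthogonalization (stubbed out here for brevity) will differ in the actual optimizer.

```python
import torch
import torch.distributed as dist

def muon_like_update(momentum: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real Muon step, which orthogonalizes the momentum via
    # Newton-Schulz (see the sketch above); a sign update keeps this sketch short.
    return momentum.sign()

@torch.no_grad()
def zero1_sharded_step(params, momenta, lr=2e-2, beta=0.95, weight_decay=0.1):
    """ZeRO-1-style step: gradients are already averaged across ranks, while the
    optimizer state is sharded. `momenta` maps a parameter index to its momentum
    buffer and is populated only on the rank that owns that parameter."""
    rank, world = dist.get_rank(), dist.get_world_size()
    for i, p in enumerate(params):
        owner = i % world                        # simple round-robin ownership
        if rank == owner:
            m = momenta[i]                       # state lives only on the owner
            m.mul_(beta).add_(p.grad)
            p.mul_(1 - lr * weight_decay)        # decoupled weight decay
            p.add_(muon_like_update(m), alpha=-lr)
        dist.broadcast(p.data, src=owner)        # keep every replica in sync
```

Sharding the state this way divides the optimizer's extra memory by the data-parallel world size, at the cost of one broadcast (or all-gather) of the updated weights per step.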
Validated by Rigorous Scaling Law Experiments
Our scaling law experiments establish Muon's advantage over strong, well-tuned AdamW baselines: as Figure 1 shows, Muon matches AdamW's performance while requiring only about 52% of the training FLOPs.
Experience Moonlight: Effortless Inference with Hugging Face Transformers
Ready to experience the power of Moonlight? Our models are readily available and easily integrated using Hugging Face Transformers:
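For instance, a minimal generation snippet with Transformers might look like the following. The Hugging Face Hub repository id used here is an assumption; substitute the id of the released checkpoint you actually download.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id for the instruction-tuned checkpoint; adjust as needed.
model_path = "moonshotai/Moonlight-16B-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,   # custom DeepSeek-V3-style architectures ship their own modeling code
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [{"role": "user", "content": "Give me a one-line summary of the Muon optimizer."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```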
Moonlight, with its DeepSeek-V3-inspired architecture, also integrates with popular inference engines such as vLLM and SGLang, further streamlining deployment.
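As a quick illustration of offline inference with vLLM (the model id is again an assumption, and custom architectures generally require `trust_remote_code=True`):

```python
from vllm import LLM, SamplingParams

# Assumed Hub id; replace with the checkpoint you want to serve.
llm = LLM(model="moonshotai/Moonlight-16B-A3B-Instruct", trust_remote_code=True)
sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain the Muon optimizer in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```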
Training Your Own Models with Muon
Want to train with Muon? Here is an example training setup:
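The snippet below is only a minimal sketch: 2D weight matrices are handed to a Muon-style optimizer while the remaining parameters (biases, norms, embeddings) stay on AdamW. The `muon` import and the `Muon` constructor arguments shown here are assumptions for illustration; check the released implementation in this repository for the actual module path and signature.

```python
import torch
import torch.nn as nn

# Hypothetical import: the distributed Muon released with this repo may live
# under a different module path and expose a different constructor.
from muon import Muon

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)

# Muon handles 2D weight matrices; 1D parameters (biases, norms) and embeddings
# are typically left to AdamW, as discussed above.
muon_params = [p for p in model.parameters() if p.ndim == 2]
adamw_params = [p for p in model.parameters() if p.ndim != 2]

optimizers = [
    Muon(muon_params, lr=2e-2, momentum=0.95, weight_decay=0.1),   # assumed signature
    torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.1),
]

for step in range(100):
    x = torch.randn(8, 1024, device=device)
    loss = model(x).pow(2).mean()     # toy objective, stands in for an LM loss
    loss.backward()
    for opt in optimizers:
        opt.step()
        opt.zero_grad()
```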
Join the Revolution: Unlock Unprecedented LLM Efficiency Today
Moonlight and the Muon optimizer represent a significant leap forward in LLM training. By embracing these innovations, you can achieve unparalleled efficiency, reduce costs, and push the boundaries of what's possible.