Unleash the Power of Moonlight: Train Large Language Models 2x Faster
Tired of slow, computationally expensive LLM training? Discover Moonlight, a groundbreaking Mixture-of-Experts (MoE) model with 3B activated parameters out of 16B total that redefines the Pareto frontier of performance versus training compute. Trained on 5.7T tokens with the highly efficient Muon optimizer, Moonlight achieves superior performance while spending significantly fewer training FLOPs. Get ready to revolutionize your AI projects with unparalleled speed and efficiency.
The Secret Weapon: Muon Optimizer for Scalable LLM Training
Moonlight's exceptional performance is powered by the Muon optimizer, a matrix-orthogonalization-based optimizer enhanced for large-scale training. Two additions overcome its previous limitations: weight decay and a careful adjustment of the per-parameter update scale, which together let Muon deliver strong results without extensive hyperparameter tuning. Compared to AdamW, Muon roughly doubles computational efficiency, saving you time and resources (a minimal sketch of the update step follows the list below).
Key Advantages of the Muon Optimizer:
- Weight Decay: Crucial for Muon's scalability in large-scale training scenarios, significantly improving performance.
- Consistent Update RMS: Parameter-wise update scale adjustments maintain a consistent update root mean square (RMS) across different matrix and non-matrix parameters, enhancing training stability.
- Efficient Distributed Implementation: Implemented with ZeRO-1 style optimization for optimal memory efficiency and reduced communication overhead.
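To make the ingredients above concrete, here is a minimal, single-device sketch of a Muon-style update step: heavy-ball momentum, Newton-Schulz orthogonalization of the matrix update, an RMS-matching rescale, and decoupled weight decay. The iteration coefficients, the 0.2 scaling factor, and the function layout are illustrative assumptions, not the authors' exact implementation; in practice, non-matrix parameters (embeddings, norms, biases) stay on AdamW as described in the paper.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update with a Newton-Schulz iteration (sketch)."""
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic iteration coefficients (assumed)
    x = g / (g.norm() + eps)                    # normalize so the iteration is stable
    transposed = x.size(0) > x.size(1)
    if transposed:                              # work with the short side on the left
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=2e-2, momentum=0.95, weight_decay=0.1):
    """One Muon-style step for a single 2-D weight matrix (illustrative only)."""
    momentum_buf.mul_(momentum).add_(grad)                  # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)      # orthogonalized update direction
    # Rescale so the update RMS roughly matches AdamW's typical magnitude (assumed factor):
    update *= 0.2 * max(param.size(0), param.size(1)) ** 0.5
    param.mul_(1.0 - lr * weight_decay)                     # decoupled weight decay
    param.add_(update, alpha=-lr)
```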
Scaling Laws Revealed: Muon vs. AdamW
Extensive scaling law research validates Muon's superiority. Compared to strong AdamW baselines, Muon achieves comparable performance while requiring only approximately 52% of the training FLOPs. This 2x computational efficiency allows you to train larger, more complex models faster and more cost-effectively.
Moonlight: Outperforming the Competition
The "Moonlight" model, trained with Muon, surpasses other state-of-the-art public models of similar scale. How does it stack up?
- Beats LLAMA3-3B (3B parameters, trained with 9T tokens)
- Outperforms Qwen2.5-3B (3B parameters, trained with 18T tokens)
- Surpasses Deepseek-v2-Lite (a 2.4B/16B-parameter MoE model trained with 5.7T tokens)
Moonlight pushes the boundaries of what's possible in LLM performance and efficiency.
Get Started with Moonlight: Model Download and Inference
Ready to try Moonlight? Here's how to get started:
- Model Download: Access the pretrained and instruction-tuned checkpoints for your research.
- Inference with Hugging Face Transformers: Easily integrate Moonlight into your projects using the transformers library:
Pretrained Model (Moonlight):
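A minimal sketch using Hugging Face transformers; the Hub model id `moonshotai/Moonlight-16B-A3B` is an assumption here and should be replaced with the official checkpoint name if it differs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id for the pretrained checkpoint; substitute the official one if it differs.
model_path = "moonshotai/Moonlight-16B-A3B"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,   # DeepSeek-V3-style architectures typically ship custom modeling code
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "1+1=2, 1+2="
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```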
Instruct Model (Moonlight-Instruct):
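The instruction-tuned checkpoint works the same way, but goes through the chat template; again, the model id and the system prompt are assumptions for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id for the instruction-tuned checkpoint.
model_path = "moonshotai/Moonlight-16B-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Is 123 a prime?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```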
Moonlight shares the same architecture as DeepSeek-V3, ensuring compatibility with popular inference engines like vLLM and SGLang for easy deployment.
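As a deployment illustration, here is a minimal offline-inference sketch with vLLM (the model id and the need for trust_remote_code are assumptions; SGLang offers a similar offline and serving API).

```python
from vllm import LLM, SamplingParams

# Assumed Hub id; custom DeepSeek-V3-style modeling code typically requires trust_remote_code.
llm = LLM(model="moonshotai/Moonlight-16B-A3B-Instruct", trust_remote_code=True)
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain the Muon optimizer in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```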
Training Your Own Models with Muon
Want to train your own models using the Muon optimizer? Example commands are provided in the repository; a rough sketch of how Muon fits into a standard training loop follows below.
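This is a hypothetical toy loop, not the repository's scripts: it reuses the `muon_step` helper sketched earlier, applies it to 2-D matrix parameters, and leaves everything else (biases, norms, embeddings) to AdamW, mirroring the paper's description. The shared learning rate and the dummy model/objective are assumptions for illustration.

```python
import torch
import torch.nn as nn

# muon_step is the illustrative helper defined in the sketch earlier in this post.

# Toy two-layer model; in practice this would be the LLM's transformer blocks.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512))

# 2-D matrix parameters get Muon; everything else stays on AdamW.
matrix_params = [p for p in model.parameters() if p.ndim == 2]
other_params = [p for p in model.parameters() if p.ndim != 2]
adamw = torch.optim.AdamW(other_params, lr=2e-2, weight_decay=0.1)
momentum_bufs = [torch.zeros_like(p) for p in matrix_params]

for step in range(100):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()          # dummy objective for the sketch
    adamw.zero_grad()
    loss.backward()
    adamw.step()                           # AdamW handles the non-matrix parameters
    for p, buf in zip(matrix_params, momentum_bufs):
        # Shared learning rate is plausible because the Muon update RMS is matched to AdamW's.
        muon_step(p, p.grad, buf, lr=2e-2, momentum=0.95, weight_decay=0.1)
        p.grad = None
```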
Dive Deeper: Intermediate Checkpoints and Citation
Intermediate checkpoints will be released soon to further support ongoing research. If you find Moonlight valuable, please cite the paper:
@misc{liu2025muonscalablellmtraining,
title={Muon is Scalable for LLM Training},
author={Jingyuan Liu and Jianlin Su and Xingcheng Yao and Zhejun Jiang and Guokun Lai and Yulun Du and Yidao Qin and Weixin Xu and Enzhe Lu and Junjie Yan and Yanru Chen and Huabin Zheng and Yibo Liu and Shaowei Liu and Bohong Yin and Weiran He and Han Zhu and Yuzhi Wang and Jianzhou Wang and Mengnan Dong and Zheng Zhang and Yongsheng Kang and Hao Zhang and Xinran Xu and Yutao Zhang and Yuxin Wu and Xinyu Zhou and Zhilin Yang},
year={2025},
eprint={2502.16982},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.16982},
}
Elevate your LLM training with Moonlight and experience the power of the Muon optimizer. Achieve unprecedented speed, efficiency, and performance in your AI endeavors.