Kimi-Audio: Open Source Audio Foundation Model for Understanding, Generation, and Conversation
Kimi-Audio is an open-source audio foundation model from MoonshotAI that handles audio understanding, audio generation, and speech conversation. It is designed to cover a wide array of audio processing tasks within a single, unified framework.
Kimi-Audio: Key Features & Capabilities
Kimi-Audio goes beyond simple transcription. It offers a range of impressive features:
- Universal Capabilities: Covers speech recognition, audio question answering, speech emotion recognition, and more within a single model.
- State-of-the-Art Performance: Kimi-Audio achieves top-tier results on various audio benchmarks.
- Large-Scale Pre-training: Trained on over 13 million hours of diverse audio and text data for robust performance.
- Novel Architecture: A hybrid audio input (discrete semantic tokens plus continuous acoustic features) feeds an LLM core that generates both text and audio tokens.
- Efficient Inference: Low-latency audio generation thanks to a chunk-wise streaming detokenizer.
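To illustrate why a chunk-wise streaming detokenizer lowers latency, here is a toy sketch (not the actual Kimi-Audio detokenizer; `CHUNK_SIZE` and `fake_detokenize` are stand-ins): audio for each chunk of tokens is emitted as soon as that chunk arrives, rather than after the full token sequence is generated.

```python
CHUNK_SIZE = 4  # tokens per chunk (hypothetical value)

def fake_detokenize(tokens):
    # Stand-in for a neural vocoder: map each token to a couple of samples.
    return [t * 0.1 for t in tokens for _ in range(2)]

def streaming_detokenize(token_stream, chunk_size=CHUNK_SIZE):
    """Yield waveform chunks as soon as each chunk of tokens is available."""
    buffer = []
    for tok in token_stream:
        buffer.append(tok)
        if len(buffer) == chunk_size:
            yield fake_detokenize(buffer)  # emit audio for this chunk now
            buffer = []
    if buffer:  # flush the final partial chunk
        yield fake_detokenize(buffer)

chunks = list(streaming_detokenize(range(10)))  # 10 tokens -> 3 chunks (4, 4, 2)
```

The design point is that the first audio chunk is available after only `CHUNK_SIZE` tokens, so playback can begin while the model is still generating.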
On top of that, Kimi-Audio is completely open source!
What can I do with Kimi-Audio?
Kimi-Audio provides powerful tools for many use-cases. Here are just a few:
- Speech Recognition (ASR): Accurately transcribe spoken words into text.
- Audio Understanding: Enable machines to comprehend the content and context of audio.
- Audio-to-Text Chat: Create innovative conversational experiences.
- Speech Conversation: Develop interactive voice-based applications.
Kimi-Audio Architecture: How it Works
Kimi-Audio processes audio end to end through three main components:
- Audio Tokenizer: Transforms raw audio into discrete semantic tokens and continuous acoustic features.
- Audio LLM: Shared layers process the multimodal input and generate both text and audio tokens.
- Audio Detokenizer: Converts audio tokens back into high-fidelity waveforms.
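The data flow through the three components above can be sketched with toy stand-ins (illustrative only; these functions are not the real model): raw samples become discrete tokens, the "LLM" produces a text reply plus response audio tokens, and the "detokenizer" maps tokens back to samples.

```python
def audio_tokenizer(samples):
    # Stand-in: quantize each sample into a discrete semantic token id.
    return [int(abs(s) * 10) % 8 for s in samples]

def audio_llm(tokens):
    # Stand-in: produce a text reply and a sequence of response audio tokens.
    text = f"heard {len(tokens)} tokens"
    audio_tokens = [(t + 1) % 8 for t in tokens]
    return text, audio_tokens

def audio_detokenizer(tokens):
    # Stand-in vocoder: map each token back to a waveform sample.
    return [t / 8.0 for t in tokens]

waveform_in = [0.1, -0.4, 0.7, 0.2]
text_out, tok_out = audio_llm(audio_tokenizer(waveform_in))
waveform_out = audio_detokenizer(tok_out)
```

The key architectural idea this mirrors is that one shared core consumes tokenized audio and emits both modalities, so understanding and generation live in a single model.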
Get Started: Quick Start Guide
Ready to test-drive Kimi-Audio? The repository's quick-start scripts cover generating text from audio (ASR) and running audio-to-audio/text conversations.
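As a sketch of what an ASR call might look like: the chat-style message schema and the `kimia_infer.api.kimia.KimiAudio` entry point below are assumptions to verify against the repository's quick-start code, and the actual model call is left commented out because it requires downloading the released checkpoint.

```python
# Hedged sketch: the message schema and KimiAudio API are assumptions
# based on the repository's quick-start, not guaranteed signatures.

def build_asr_messages(audio_path):
    """Build the chat-style message list for an ASR request (assumed schema)."""
    return [
        {"role": "user", "message_type": "text",
         "content": "Please transcribe the following audio:"},
        {"role": "user", "message_type": "audio", "content": audio_path},
    ]

messages = build_asr_messages("test_audios/asr_example.wav")

# Actual inference (requires the checkpoint; see the repo for exact usage):
# from kimia_infer.api.kimia import KimiAudio
# model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct",
#                   load_detokenizer=True)
# _, text = model.generate(messages, output_type="text")
# print(text)  # the transcription
```

For speech conversation, the same message list would carry audio turns and the model would return a waveform alongside (or instead of) text, per the repository's examples.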
Evaluation & Benchmarking
Kimi-Audio's performance speaks for itself. It achieves state-of-the-art results across numerous audio benchmarks. See the technical report for more details.
The team also recognizes the challenges in evaluating audio models. To address this, they developed and released the Kimi-Audio-Evalkit, a toolkit that standardizes metric calculation and provides a unified platform for comparing models.
Kimi-Audio Generation Testset: Audio Dialogue Evaluation
The Kimi-Audio-Generation-Testset is designed to benchmark and evaluate the conversational abilities of audio-based dialogue models. Use this testset to assess a model's ability to generate relevant and appropriately styled audio responses. The speech conversation dataset is in Chinese.
License & Acknowledgements
Kimi-Audio is based on Qwen2.5-7B. Code derived from Qwen2.5-7B is licensed under the Apache 2.0 License, while other parts are under the MIT License.
The development team would like to thank these projects and individuals for their contributions: Whisper, Transformers, BigVGAN, GLM-4-Voice.
Citation
If you find Kimi-Audio useful, cite the technical report:
@misc{kimiteam2025kimiaudiotechnicalreport,
  title={Kimi-Audio Technical Report},
  author={KimiTeam and Ding Ding and Zeqian Ju and Yichong Leng and Songxiang Liu and Tong Liu and Zeyu Shang and Kai Shen and Wei Song and Xu Tan and Heyi Tang and Zhengtao Wang and Chu Wei and Yifei Xin and Xinran Xu and Jianwei Yu and Yutao Zhang and Xinyu Zhou and Y. Charles and Jun Chen and Yanru Chen and Yulun Du and Weiran He and Zhenxing Hu and Guokun Lai and Qingcheng Li and Yangyang Liu and Weidong Sun and Jianzhou Wang and Yuzhi Wang and Yuefeng Wu and Yuxin Wu and Dongchao Yang and Hao Yang and Ying Yang and Zhilin Yang and Aoxiong Yin and Ruibin Yuan and Yutong Zhang and Zaida Zhou},
  year={2025},
  eprint={2504.18425},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2504.18425},
}
Connect & Contribute
Have questions or need help? Open an issue on the Kimi-Audio GitHub repository. Contributions are welcome!