*This model was released on 2024-09-17 and added to Hugging Face Transformers on 2024-09-18.*
PyTorch FlashAttention SDPA
# Mimi

[Mimi](https://huggingface.co/papers/2410.00037) is a neural audio codec model with pretrained and quantized variants, designed for efficient speech representation and compression. The model operates at 1.1 kbps with a 12.5 Hz frame rate and uses a convolutional encoder-decoder architecture combined with a residual vector quantizer with 16 codebooks. Mimi outputs two token streams, semantic and acoustic, to balance linguistic richness with high-fidelity reconstruction. Key features include a causal streaming encoder for low-latency use, dual-path tokenization for flexible downstream generation, and readiness for integration with large speech models like Moshi.

You can find the original Mimi checkpoints under the [Kyutai](https://huggingface.co/kyutai/models?search=mimi) organization.

> [!TIP]
> This model was contributed by [ylacombe](https://huggingface.co/ylacombe).
>
> Click on the Mimi models in the right sidebar for more examples of how to apply Mimi.

The example below demonstrates how to encode and decode audio with the [`MimiModel`] class.

```python
>>> from datasets import load_dataset, Audio
>>> from transformers import MimiModel, AutoFeatureExtractor

>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

>>> # load the model and feature extractor
>>> model = MimiModel.from_pretrained("kyutai/mimi")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

>>> # load an audio sample, resampled to the model's expected sampling rate
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> audio_sample = librispeech_dummy[-1]["audio"]["array"]
>>> inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

>>> # encode to discrete codes, then decode back to a waveform
>>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
>>> audio_values = model.decode(encoder_outputs.audio_codes, inputs["padding_mask"])[0]

>>> # or the equivalent with a single forward pass
>>> audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
```

## MimiConfig

[[autodoc]] MimiConfig

## MimiModel

[[autodoc]] MimiModel
    - decode
    - encode
    - forward
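The residual vector quantizer makes Mimi's bitrate adjustable at inference time: encoding with fewer codebooks trades reconstruction fidelity for a lower bitrate, while the first (semantic) codebook is always kept. The sketch below continues from the encoding example above and assumes the `num_quantizers` argument exposed by [`MimiModel.encode`]; the number of available codebooks depends on the checkpoint.

```python
>>> # audio_codes is shaped (batch_size, num_quantizers, codes_length); the first
>>> # codebook carries the semantic tokens, the rest carry acoustic detail
>>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
>>> print(encoder_outputs.audio_codes.shape)

>>> # request fewer codebooks for a lower bitrate at reduced fidelity
>>> # (assumes the num_quantizers keyword argument)
>>> coarse_outputs = model.encode(inputs["input_values"], inputs["padding_mask"], num_quantizers=8)
>>> coarse_audio = model.decode(coarse_outputs.audio_codes, inputs["padding_mask"])[0]
```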
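As the badges at the top indicate, Mimi's transformer layers support PyTorch's scaled dot-product attention (SDPA) and FlashAttention-2. A minimal sketch of selecting a backend through the standard `attn_implementation` argument of `from_pretrained`; FlashAttention-2 additionally requires the `flash-attn` package, a CUDA device, and half-precision weights.

```python
>>> import torch
>>> from transformers import MimiModel

>>> # SDPA, the default on recent PyTorch versions
>>> model = MimiModel.from_pretrained("kyutai/mimi", attn_implementation="sdpa")

>>> # FlashAttention-2, requires the flash-attn package and fp16/bf16 weights on GPU
>>> model = MimiModel.from_pretrained(
...     "kyutai/mimi", attn_implementation="flash_attention_2", torch_dtype=torch.float16
... ).to("cuda")
```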