*This model was released on 2024-09-03 and added to Hugging Face Transformers on 2024-09-03.*
# OLMoE
[OLMoE](https://huggingface.co/papers/2409.02060) is a sparse Mixture-of-Experts (MoE) language model with 7B total parameters, of which only 1B are activated per input token. This gives it inference costs similar to a dense model of the same active size while training ~3x faster. OLMoE uses fine-grained routing with 64 small experts in each layer and a dropless token-based routing algorithm.
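The sparse routing hyperparameters are exposed on [`OlmoeConfig`]. A quick way to inspect them (attribute names follow the Transformers OLMoE implementation):

```py
from transformers import OlmoeConfig

# default OLMoE configuration; these fields control the sparse routing
config = OlmoeConfig()
print(config.num_experts)          # experts per layer (64)
print(config.num_experts_per_tok)  # experts activated per token
```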
You can find all the original OLMoE checkpoints under the [OLMoE](https://huggingface.co/collections/allenai/olmoe-november-2024-66cf678c047657a30c8cd3da) collection.
> [!TIP]
> This model was contributed by [Muennighoff](https://huggingface.co/Muennighoff).
>
> Click on the OLMoE models in the right sidebar for more examples of how to apply OLMoE to different language tasks.
The examples below demonstrate how to generate text with [`Pipeline`] or the [`AutoModel`] class.
```py
import torch
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="allenai/OLMoE-1B-7B-0125",
    dtype=torch.float16,
    device=0,
)

result = pipe("Dionysus is the god of")
print(result)
```
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" dispatches the model across available devices,
# so no explicit .to(device) call is needed afterwards
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMoE-1B-7B-0924",
    attn_implementation="sdpa",
    dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")

inputs = tokenizer("Bitcoin is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output[0]))
```
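For interactive use, [`TextStreamer`] prints tokens as they are generated instead of waiting for the full sequence. A minimal sketch reusing the model, tokenizer, and inputs from above:

```py
from transformers import TextStreamer

# stream decoded tokens to stdout as they are generated,
# skipping the prompt itself
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, max_length=64, streamer=streamer)
```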
## Quantization
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to 4-bits.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# quantized models are dispatched automatically and must not be moved with .to()
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMoE-1B-7B-0924",
    attn_implementation="sdpa",
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")

inputs = tokenizer("Bitcoin is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output[0]))
```
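To verify the savings, compare the quantized model against the roughly 14GB a 7B-parameter model needs in fp16 ([`~PreTrainedModel.get_memory_footprint`] reports the size of the loaded weights in bytes):

```py
# size of the loaded (quantized) weights in GB
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```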
## OlmoeConfig
[[autodoc]] OlmoeConfig
## OlmoeModel
[[autodoc]] OlmoeModel
- forward
## OlmoeForCausalLM
[[autodoc]] OlmoeForCausalLM
- forward