*This model was released on {release_date} and added to Hugging Face Transformers on 2025-09-16.*
PyTorch FlashAttention SDPA
# OLMo3

OLMo3 is an improvement on [OLMo2](./olmo2). More details will be released *soon*.

> [!TIP]
> Click on the OLMo3 models in the right sidebar for more examples of how to apply OLMo3 to different language tasks.

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

```py
import torch
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="allenai/TBA",
    dtype=torch.bfloat16,
    device=0,
)

result = pipe("Plants create energy through a process known as")
print(result)
```

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "allenai/TBA"
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/TBA",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)

input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, max_length=50, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

```bash
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model allenai/TBA --device 0
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [torchao](../quantization/torchao) to quantize only the weights to 4-bits.

```py
# pip install torchao
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

torchao_config = TorchAoConfig(
    "int4_weight_only",
    group_size=128
)
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/TBA"
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/TBA",
    quantization_config=torchao_config,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)

input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, max_length=50, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Notes

- Load specific intermediate checkpoints by adding the `revision` parameter to [`~PreTrainedModel.from_pretrained`].

    ```py
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("allenai/TBA", revision="stage1-step140000-tokens294B")
    ```

## Olmo3Config

[[autodoc]] Olmo3Config

## Olmo3ForCausalLM

[[autodoc]] Olmo3ForCausalLM

## Olmo3Model

[[autodoc]] Olmo3Model
    - forward

## Olmo3PreTrainedModel

[[autodoc]] Olmo3PreTrainedModel
    - forward
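The classes above follow the standard Transformers config/model pattern. As a minimal sketch of how they fit together (the hyperparameter values below are illustrative assumptions, not the released OLMo3 values), a small randomly initialized model can be built directly from [`Olmo3Config`]:

```py
from transformers import Olmo3Config, Olmo3ForCausalLM

# Illustrative, assumed hyperparameters -- not the released OLMo3 values.
config = Olmo3Config(
    vocab_size=50304,
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=4,
    num_attention_heads=8,
)

# Randomly initialized model; useful for smoke tests and debugging
# without downloading a checkpoint.
model = Olmo3ForCausalLM(config)
print(model.num_parameters())
```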