>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
```
The `model_inputs` variable holds the tokenized text input, as well as the attention mask. While [`~generation.GenerationMixin.generate`] does its best to infer the attention mask when it is not passed, we recommend passing it whenever possible for optimal results.
After tokenizing the inputs, call the [`~generation.GenerationMixin.generate`] method to return the generated tokens, which should then be converted to text before printing.
```py
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'A list of colors: red, blue, green, yellow, orange, purple, pink,'
```
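If you want to be explicit about the attention mask, here is a sketch of the same call with the tensors passed by hand instead of through `**model_inputs` unpacking — equivalent to the call above, just spelled out:

```py
>>> # Equivalent to model.generate(**model_inputs), with the attention mask passed explicitly
>>> generated_ids = model.generate(
...     input_ids=model_inputs["input_ids"], attention_mask=model_inputs["attention_mask"]
... )
```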
Finally, you don't need to process one sequence at a time! You can batch your inputs, which will greatly improve throughput at a small latency and memory cost. All you need to do is make sure you pad your inputs properly (more on that below).
```py
>>> tokenizer.pad_token = tokenizer.eos_token  # Most LLMs don't have a pad token by default
>>> model_inputs = tokenizer(
...     ["A list of colors: red, blue", "Portugal is"], return_tensors="pt", padding=True
... ).to("cuda")
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
['A list of colors: red, blue, green, yellow, orange, purple, pink,', 'Portugal is a country in southwestern Europe, on the Iber']
```
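On padding: LLMs are [decoder-only](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt) architectures that keep writing from the end of your prompt, so batched inputs should be padded on the left — otherwise the model has to continue from pad tokens. A minimal sketch, continuing the example above:

```py
>>> # Decoder-only models continue from the last token, so pad on the left
>>> tokenizer.padding_side = "left"
>>> model_inputs = tokenizer(
...     ["A list of colors: red, blue", "Portugal is"], return_tensors="pt", padding=True
... ).to("cuda")
```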
### Wrong prompt
Some models and tasks expect a certain input prompt format to work properly. When this format is not applied, you will get a silent performance degradation: the model kinda works, but not as well as if you were following the expected prompt. More information about prompting, including which models and tasks require a specific prompt format, is available in this [guide](tasks/prompting). Let's see an example with a chat LLM, which makes use of [chat templating](chat_templating):
```py
>>> from transformers import set_seed
>>> set_seed(0)  # for reproducibility of the sampled output

>>> messages = [
...     {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a thug"},
...     {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
>>> input_length = model_inputs.shape[1]
>>> generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=20)
>>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
'None, you thug. How bout you try to focus on more useful questions?'
>>> # As we can see, it followed a proper thug style 😎
```
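If you're unsure what the template actually produces, `apply_chat_template` can also return the formatted prompt as a string instead of token IDs — handy for a sanity check before generating:

```py
>>> # Inspect the fully formatted prompt without tokenizing it
>>> print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```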
## Further resources
While the autoregressive generation process is relatively straightforward, making the most out of your LLM can be a challenging endeavor because there are many moving parts. Here are some next steps to help you dive deeper into LLM usage and understanding:
<!-- TODO: complete with new guides -->
### Advanced generate usage
1. [Guide](generation_strategies) on how to control different generation methods, how to set up the generation configuration file, and how to stream the output — see the sketch after this list for a taste;
2. [Guide](chat_templating) on the prompt template for chat LLMs;
3. [Guide](tasks/prompting) on how to get the most out of prompt design;
4. API reference on [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`], and [generate-related classes](internal/generation_utils).
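As a taste of what the first guide covers, here is a minimal sketch of customizing generation with a [`~generation.GenerationConfig`] and streaming the output with `TextStreamer` (the sampling parameters are illustrative, not a recommendation):

```py
>>> from transformers import GenerationConfig, TextStreamer

>>> # Illustrative sampling settings -- tune them for your model and task
>>> generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.7)
>>> streamer = TextStreamer(tokenizer)  # prints tokens to stdout as they are generated
>>> _ = model.generate(**model_inputs, generation_config=generation_config, streamer=streamer)
```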
### LLM leaderboards
1. [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which focuses on the quality of the open-source models;
2. [Open LLM-Perf Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard), which focuses on LLM throughput.
### Latency, throughput and memory utilization
1. [Guide](llm_tutorial_optimization) on how to optimize LLMs for speed and memory;
2. [Guide](main_classes/quantization) on quantization such as bitsandbytes and autogptq, which shows you how to drastically reduce your memory requirements — a minimal loading sketch follows this list.
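For instance, here is a sketch of loading a model with 4-bit weights through bitsandbytes — the checkpoint name is a placeholder, and you'll need `pip install bitsandbytes`:

```py
>>> from transformers import AutoModelForCausalLM

>>> # Placeholder checkpoint -- swap in the model you actually want to load
>>> model = AutoModelForCausalLM.from_pretrained(
...     "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True
... )
```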