*This model was released on 2019-02-14 and added to Hugging Face Transformers on 2020-11-16.*
# GPT-2
[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) is a scaled up version of GPT, a causal transformer language model, with 10x more parameters and training data. The model was pretrained on a 40GB dataset to predict the next word in a sequence based on all the previous words. This approach enabled the model to perform many downstream tasks in a zero-shot setting. The blog post released by OpenAI can be found [here](https://openai.com/index/better-language-models/).
The model architecture uses a unidirectional (causal) attention mechanism where each token can only attend to previous tokens, making it particularly effective for text generation tasks.
You can find all the original GPT-2 checkpoints under the [OpenAI community](https://huggingface.co/openai-community?search_models=gpt) organization.
> [!TIP]
> Click on the GPT-2 models in the right sidebar for more examples of how to apply GPT-2 to different language tasks.
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`], and from the command line.
```py
import torch
from transformers import pipeline
pipeline = pipeline(task="text-generation", model="openai-community/gpt2", dtype=torch.float16, device=0)
pipeline("Hello, I'm a language model")
```
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
```bash
echo -e "Hello, I'm a language model" | transformers run --task text-generation --model openai-community/gpt2 --device 0
```
One can also serve the model using vLLM with the `transformers backend`.
```bash
vllm serve openai-community/gpt2 --model-imp transformers
```
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
"openai-community/gpt2-xl",
quantization_config=quantization_config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-xl")
inputs = tokenizer("Once upon a time, there was a magical forest", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Notes
- Pad inputs on the right because GPT-2 uses absolute position embeddings.
- GPT-2 can reuse previously computed key-value attention pairs. Access this feature with the [past_key_values](https://huggingface.co/docs/transformers//en/model_doc/gpt2#transformers.GPT2Model.forward.past_key_values) parameter in [`GPT2Model.forward`].
- Enable the [scale_attn_by_inverse_layer_idx](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2Config.scale_attn_by_inverse_layer_idx) and [reorder_and_upcast_attn](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2Config.reorder_and_upcast_attn) parameters to apply the training stability improvements from [Mistral](./mistral).
## GPT2Config
[[autodoc]] GPT2Config
## GPT2Tokenizer
[[autodoc]] GPT2Tokenizer
- save_vocabulary
## GPT2TokenizerFast
[[autodoc]] GPT2TokenizerFast
## GPT2 specific outputs
[[autodoc]] models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
## GPT2Model
[[autodoc]] GPT2Model
- forward
## GPT2LMHeadModel
[[autodoc]] GPT2LMHeadModel
- forward
## GPT2DoubleHeadsModel
[[autodoc]] GPT2DoubleHeadsModel
- forward
## GPT2ForQuestionAnswering
[[autodoc]] GPT2ForQuestionAnswering
- forward
## GPT2ForSequenceClassification
[[autodoc]] GPT2ForSequenceClassification
- forward
## GPT2ForTokenClassification
[[autodoc]] GPT2ForTokenClassification
- forward