<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# StableLM

## Overview

`StableLM 3B 4E1T` was proposed in [`StableLM 3B 4E1T`: Technical Report](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Stability AI and is the first model in a series of multi-epoch pre-trained language models.

### Model Details

`StableLM 3B 4E1T` is a decoder-only base language model pre-trained on 1 trillion tokens of diverse English and code datasets for four epochs.
The architecture is transformer-based, using partial Rotary Position Embeddings, SwiGLU activation, and LayerNorm.

We also provide `StableLM Zephyr 3B`, an instruction fine-tuned version of the model that can be used for chat-based applications.
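
A minimal sketch of chat-style usage with `StableLM Zephyr 3B`, assuming the checkpoint at the Hub id `stabilityai/stablelm-zephyr-3b` ships a chat template usable with `apply_chat_template` (verify against the model card):

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> # Assumption: this Hub id hosts the instruction-tuned checkpoint and
>>> # defines a chat template; check the model card for your version.
>>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-zephyr-3b")  # doctest: +SKIP
>>> model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-zephyr-3b")  # doctest: +SKIP

>>> messages = [{"role": "user", "content": "What is the capital of Costa Rica?"}]
>>> inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")  # doctest: +SKIP
>>> generated_ids = model.generate(inputs, max_new_tokens=64)  # doctest: +SKIP
>>> print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))  # doctest: +SKIP
```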

### Usage Tips

- The architecture is similar to LLaMA but with RoPE applied to 25% of head embedding dimensions, LayerNorm instead of RMSNorm, and optional QKV bias terms; these choices surface in the model configuration, as sketched below.
- `StableLM 3B 4E1T`-based models use the same tokenizer as [`GPTNeoXTokenizerFast`].
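
A quick sketch for inspecting those architectural choices via the configuration; the field names `partial_rotary_factor` and `use_qkv_bias` are assumed from [`StableLmConfig`], so verify them against your `transformers` version:

```python
>>> from transformers import AutoConfig

>>> # Assumed StableLmConfig field names; check your transformers version.
>>> config = AutoConfig.from_pretrained("stabilityai/stablelm-3b-4e1t")  # doctest: +SKIP
>>> config.partial_rotary_factor  # fraction of head dims receiving RoPE (25% per the tip above)  # doctest: +SKIP
>>> config.use_qkv_bias  # whether the QKV projections carry bias terms  # doctest: +SKIP
```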

`StableLM 3B 4E1T` and `StableLM Zephyr 3B` can be found on the [Hugging Face Hub](https://huggingface.co/stabilityai).

The following code snippet demonstrates how to use `StableLM 3B 4E1T` for inference:

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

>>> device = "cuda"  # the device to load the model onto

>>> set_seed(0)

>>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")
>>> model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-3b-4e1t")
>>> model.to(device)  # doctest: +IGNORE_RESULT

>>> model_inputs = tokenizer("The weather is always wonderful in", return_tensors="pt").to(model.device)

>>> generated_ids = model.generate(**model_inputs, max_length=32, do_sample=True)
>>> responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
>>> responses
['The weather is always wonderful in Costa Rica, which makes it a prime destination for retirees. That’s where the Pensionado program comes in, offering']
```
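
For batched prompts, note that the tokenizer defines no padding token by default. Continuing from the snippet above, here is a minimal sketch, assuming it is acceptable to reuse the EOS token for padding:

```python
>>> # A hedged sketch for batched generation; reusing the EOS token as the
>>> # pad token is an assumption, not an official recommendation.
>>> tokenizer.pad_token = tokenizer.eos_token
>>> tokenizer.padding_side = "left"  # left-pad so generation continues from the prompt end

>>> prompts = ["The weather is always wonderful in", "The best way to learn a language is"]
>>> model_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

>>> generated_ids = model.generate(**model_inputs, max_length=32, do_sample=True)  # doctest: +SKIP
>>> responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)  # doctest: +SKIP
```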

## Combining StableLM and Flash Attention 2

First, make sure to install the latest version of Flash Attention v2.

```bash
pip install -U flash-attn --no-build-isolation
```

Make sure your hardware is compatible with Flash Attention 2; see the official documentation of the [`flash-attn`](https://github.com/Dao-AILab/flash-attention) repository for details. Note: the model must be loaded in half precision (e.g. `torch.bfloat16`).

Now, to run the model with Flash Attention 2, refer to the snippet below:

```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

>>> device = "cuda"  # the device to load the model onto

>>> set_seed(0)

>>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")
>>> model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-3b-4e1t", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")  # doctest: +SKIP
>>> model.to(device)  # doctest: +SKIP

>>> model_inputs = tokenizer("The weather is always wonderful in", return_tensors="pt").to(model.device)

>>> generated_ids = model.generate(**model_inputs, max_length=32, do_sample=True)  # doctest: +SKIP
>>> responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)  # doctest: +SKIP
>>> responses  # doctest: +SKIP
['The weather is always wonderful in Costa Rica, which makes it a prime destination for retirees. That’s where the Pensionado program comes in, offering']
```
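
If Flash Attention 2 is not available on your hardware, PyTorch's scaled dot-product attention can serve as a fallback. A hedged sketch, assuming your installed `transformers` version supports `attn_implementation="sdpa"` for StableLM:

```python
>>> import torch
>>> from transformers import AutoModelForCausalLM

>>> # Assumption: "sdpa" is supported for StableLM in your transformers version;
>>> # it dispatches to torch.nn.functional.scaled_dot_product_attention.
>>> model = AutoModelForCausalLM.from_pretrained(
...     "stabilityai/stablelm-3b-4e1t",
...     torch_dtype=torch.bfloat16,
...     attn_implementation="sdpa",
... )  # doctest: +SKIP
```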


## StableLmConfig

[[autodoc]] StableLmConfig

## StableLmModel

[[autodoc]] StableLmModel
    - forward

## StableLmForCausalLM

[[autodoc]] StableLmForCausalLM
    - forward

## StableLmForSequenceClassification

[[autodoc]] StableLmForSequenceClassification
    - forward

## StableLmForTokenClassification

[[autodoc]] StableLmForTokenClassification
    - forward