"docs/git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "5333f4c0ec1c4a69ad2ada88364c5dd5836ac1b7"
Commit ccfcffb1 authored by chenzk's avatar chenzk
Browse files

v1.0

parents
Pipeline #805 canceled with stages
__pycache__
.idea
.DS_Store
*.egg-info
build
.venv
.vscode
# data
data
checkpoints
out
wandb
tests/original_falcon_40b.py
sft/output
sft/wandb
## Evaluate TinyLlama
### GPT4All Benchmarks
We evaluate TinyLlama's commonsense reasoning ability following the [GPT4All](https://gpt4all.io/index.html) evaluation suite. We include Pythia as our baseline. We report the acc_norm by default.
Base models:
| Model | Pretrain Tokens | HellaSwag | Obqa | WinoGrande | ARC_c | ARC_e | boolq | piqa | avg |
|-------------------------------------------|-----------------|-----------|------|------------|-------|-------|-------|------|-----|
| Pythia-1.0B | 300B | 47.16 | 31.40| 53.43 | 27.05 | 48.99 | 60.83 | 69.21 | 48.30 |
| TinyLlama-1.1B-intermediate-step-50K-104b | 103B | 43.50 | 29.80| 53.28 | 24.32 | 44.91 | 59.66 | 67.30 | 46.11|
| TinyLlama-1.1B-intermediate-step-240k-503b| 503B | 49.56 |31.40 |55.80 |26.54 |48.32 |56.91 |69.42 | 48.28 |
| TinyLlama-1.1B-intermediate-step-480k-1007B | 1007B | 52.54 | 33.40 | 55.96 | 27.82 | 52.36 | 59.54 | 69.91 | 50.22 |
| TinyLlama-1.1B-intermediate-step-715k-1.5T | 1.5T | 53.68 | 35.20 | 58.33 | 29.18 | 51.89 | 59.08 | 71.65 | 51.29 |
| TinyLlama-1.1B-intermediate-step-955k-2T | 2T | 54.63 | 33.40 | 56.83 | 28.07 | 54.67 | 63.21 | 70.67 | 51.64 |
| TinyLlama-1.1B-intermediate-step-1195k-2.5T | 2.5T | 58.96 | 34.40 | 58.72 | 31.91 | 56.78 | 63.21 | 73.07 | 53.86|
| TinyLlama-1.1B-intermediate-step-1431k-3T | 3T | 59.20 | 36.00 | 59.12 | 30.12 | 55.25 | 57.83 | 73.29 | 52.99|
Chat models:
| Model | Pretrain Tokens | HellaSwag | Obqa | WinoGrande | ARC_c | ARC_e | boolq | piqa | avg |
|-------------------------------------------|-----------------|-----------|------|------------|-------|-------|-------|------|-----|
| [TinyLlama-1.1B-Chat-v0.1](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1) | 503B | 53.81 |32.20 | 55.01 | 28.67 |49.62 | 58.04 | 69.64 | 49.57 |
| [TinyLlama-1.1B-Chat-v0.2](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.2) | 503B | 53.63 |32.80 | 54.85 | 28.75 |49.16 | 55.72 | 69.48 | 49.20 |
| [TinyLlama-1.1B-Chat-v0.3](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3) | 1T | 56.81 |34.20 | 55.80 | 30.03 |53.20 | 59.57 | 69.91 | 51.36 |
| [TinyLlama-1.1B-Chat-v0.4](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.4) | 1.5T | 58.59 |35.40 | 58.80 | 30.80 |54.04 | 57.31 | 71.16 | 52.30 |
We observed huge improvements once we finetuned the model. We attribute this to: 1) the base model has not undergone learning-rate cool-down, and finetuning effectively cools down the learning rate; 2) the SFT stage better elicits the model's internal knowledge.
You can obtain the above scores by running [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness):
```bash
python main.py \
--model hf-causal \
--model_args pretrained=PY007/TinyLlama-1.1B-Chat-v0.1,dtype="float" \
--tasks hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa \
--device cuda:0 --batch_size 32
```
### Instruct-Eval Benchmarks
We evaluate TinyLlama's problem-solving ability on the [Instruct-Eval](https://github.com/declare-lab/instruct-eval) evaluation suite.
| Model | MMLU | BBH | HumanEval | DROP |
| ------------------------------------------------- | ----- | ----- | --------- | ----- |
| Pythia-1.0B | 25.70 | 28.19 | 1.83 | 4.25 |
| TinyLlama-1.1B-intermediate-step-50K-104b | 26.45 | 28.82 | 5.49 | 11.42 |
| TinyLlama-1.1B-intermediate-step-240k-503b | 26.16 | 28.83 | 4.88 | 12.43 |
| TinyLlama-1.1B-intermediate-step-480K-1T | 24.65 | 29.21 | 6.1 | 13.03 |
| TinyLlama-1.1B-intermediate-step-715k-1.5T | 24.85 | 28.2 | 7.93 | 14.43 |
| TinyLlama-1.1B-intermediate-step-955k-2T | 25.97 | 29.07 | 6.71 | 13.14 |
| TinyLlama-1.1B-intermediate-step-1195k-token-2.5T | 25.92 | 29.32 | 9.15 | 15.45 |
You can obtain the above scores by running [instruct-eval](https://github.com/declare-lab/instruct-eval):
```bash
CUDA_VISIBLE_DEVICES=0 python main.py mmlu --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=1 python main.py bbh --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=2 python main.py drop --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=3 python main.py humaneval --model_name llama --n_sample 1 --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
```
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [2023] Lightning AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
## Pretrain TinyLlama
### Installation
We assume you have CUDA 11.8 installed.
#### Install Pytorch Nightly.
```bash
pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'
```
#### Build XFormers from Source
Note: as of 2023/09/02, xformers does not provide pre-built binaries for torch 2.1. You have to build it from source.
```bash
pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
```
#### Install Flash-Attention 2 and other fused operators:
```bash
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../.. && rm -rf flash-attention
```
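As a quick sanity check that the fused kernels built above are usable, you can try importing them (a minimal sketch; the module names are inferred from the wheels referenced elsewhere in this repository, so adjust them if your build names them differently):
```python
# Verify the fused operators built above can be imported.
import flash_attn                # Flash-Attention 2
import rotary_emb                # fused rotary positional embedding (csrc/rotary)
import dropout_layer_norm        # fused layer norm (csrc/layer_norm)
import xentropy_cuda_lib         # fused cross entropy loss (csrc/xentropy)

print("flash-attn version:", flash_attn.__version__)
```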
#### Install Remaining Dependencies
Run
```bash
pip install -r requirements.txt tokenizers sentencepiece
```
to install the other dependencies.
It may take 5 minutes or more to build xformers/flash-attention. Do not worry if the process seems stagnant or the terminal prints out many warnings.
Then you are ready to go 🎉!
### Data Preparation
#### Download Datasets
Download the Slimpajama and Starcoderdata datasets to your chosen directory.
```bash
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
git clone https://huggingface.co/datasets/bigcode/starcoderdata
```
The SlimPajama dataset takes up 893GB of disk space and Starcoderdata takes 290GB.
#### Tokenize data
Use the provided scripts to tokenize the datasets and divide them into chunks.
```bash
python scripts/prepare_starcoder.py --source_path /path/to/starcoderdata/ --tokenizer_path data/llama --destination_path data/slim_star_combined --split train --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama --destination_path data/slim_star_combined --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama --destination_path data/slim_star_combined --split train --percentage 1.0
```
The processed data will take about 1.8T of storage.
### Pretraining
If your setup comprises two nodes, each with 8 GPUs, you can initiate pretraining with the following commands:
On node 1:
```
lightning run model \
--node-rank=0 \
--main-address=172.16.101.5 \
--accelerator=cuda \
--devices=8 \
--num-nodes=2 \
pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star --val_data_dir data/slim_star
```
On node 2:
```
lightning run model \
--node-rank=1 \
--main-address=172.16.101.5 \
--accelerator=cuda \
--devices=8 \
--num-nodes=2 \
pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star --val_data_dir data/slim_star
```
You can follow [these instructions](https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html) if you have a Slurm cluster.
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
TinyLlama_logo.png filter=lfs diff=lfs merge=lfs -text
---
license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
language:
- en
---
<div align="center">
# TinyLlama-1.1B
</div>
https://github.com/jzhang38/TinyLlama
The TinyLlama project aims to **pretrain** a **1.1B Llama model on 3 trillion tokens**. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.
<div align="center">
<img src="./TinyLlama_logo.png" width="300"/>
</div>
We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters. This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint.
#### This Model
This is an intermediate checkpoint with 240K steps and 503B tokens. **We suggest you not use this directly for inference.** The [chat model](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1) is always preferred.
#### How to use
You will need transformers>=4.31.
Do check the [TinyLlama](https://github.com/jzhang38/TinyLlama) GitHub page for more information.
```
from transformers import AutoTokenizer
import transformers
import torch
model = "PY007/TinyLlama-1.1B-intermediate-step-240k-503b"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
"text-generation",
model=model,
torch_dtype=torch.float16,
device_map="auto",
)
sequences = pipeline(
'The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.',
do_sample=True,
top_k=10,
num_return_sequences=1,
repetition_penalty=1.5,
eos_token_id=tokenizer.eos_token_id,
max_length=500,
)
for seq in sequences:
print(f"Result: {seq['generated_text']}")
```
{
"_name_or_path": "meta-llama/Llama-2-7b-hf",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 5632,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 22,
"num_key_value_heads": 4,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"tie_word_embeddings": false,
"torch_dtype": "float32",
"transformers_version": "4.31.0.dev0",
"use_cache": true,
"vocab_size": 32000
}
{
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"max_length": 2048,
"transformers_version": "4.31.0.dev0"
}
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
{
"add_bos_token": true,
"add_eos_token": false,
"bos_token": {
"__type": "AddedToken",
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"clean_up_tokenization_spaces": false,
"eos_token": {
"__type": "AddedToken",
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"legacy": false,
"model_max_length": 1000000000000000019884624838656,
"pad_token": null,
"padding_side": "right",
"sp_model_kwargs": {},
"tokenizer_class": "LlamaTokenizer",
"unk_token": {
"__type": "AddedToken",
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
# TinyLlama
With only 1.1B parameters, TinyLlama shrinks the Llama 2 model size and training-data requirements, and can be plugged into many open-source projects built on Llama. The steps below cover finetuning and inference.
## Paper
`Llama 2: Open Foundation and Fine-Tuned Chat Models`
- https://arxiv.org/pdf/2307.09288.pdf
## Model Architecture
Llama 2 is based on the original Transformer decoder. In the input stage, the text is tokenized and each token is mapped to an embedding vector; TinyLlama uses the same architecture and tokenizer as Llama 2. In the feature-extraction stage, Llama 2 extracts features through stacked attention blocks and fully connected FeedForward layers. In the output stage, an MLP head transforms the hidden states to produce the generation result, and a strategy such as greedy search picks the highest-probability token as the output. To further improve prediction quality, reinforcement learning from human feedback (RLHF) is added for supervision. After extensive experiments, the authors conclude: (data) quality is all you need!
<div align=center>
<img src="./doc/bockbone.png"/>
</div>
## Algorithm
Llama 2 turns the tokenized input into vectors, extracts features with query-key-value self-attention and fully connected layers, then uses a fully connected output head trained with supervision together with a search (decoding) strategy to select the desired tokens. The principle can be understood from the decoder (right-hand) part of the original Transformer architecture shown below. On top of the original Transformer, the Llama 2 authors introduced three changes that reduce computation and improve accuracy: RMSNorm, SwiGLU, and RoPE.
<div align=center>
<img src="./doc/transformer.png"/>
</div>
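To make the first of these concrete, here is a minimal RMSNorm sketch in PyTorch (illustrative standalone code, not taken from this repository's sources):
```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm sketch: rescale by the root mean square of the features,
    without the mean subtraction used in LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

x = torch.randn(2, 8, 2048)      # (batch, seq, hidden); 2048 matches TinyLlama's hidden_size
print(RMSNorm(2048)(x).shape)    # torch.Size([2, 8, 2048])
```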
## Environment Setup
```
mv TinyLlama_pytorch TinyLlama # drop the framework suffix from the directory name
```
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10-py38
# Replace <your IMAGE ID> below with the ID of the image pulled above; for this image it is ffa1f63239fc
docker run -it --shm-size=32G -v $PWD/TinyLlama:/home/TinyLlama -v /opt/hyhal:/opt/hyhal --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name tinyllama <your IMAGE ID> bash
cd TinyLlama
pip install -r requirements.txt
```
### Dockerfile (Option 2)
```
cd TinyLlama/docker
docker build --no-cache -t tinyllama:latest .
docker run --shm-size=32G --name tinyllama -v /opt/hyhal:/opt/hyhal --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../TinyLlama:/home/TinyLlama -it tinyllama bash
# If installing the environment through the Dockerfile takes too long, comment out the pip install inside it and install the Python packages after starting the container: pip install -r requirements.txt
```
### Anaconda (Option 3)
1. The special deep learning libraries required for the DCU cards used by this project can be downloaded from the developer community:
- https://developer.hpccube.com/tool/
```
DTK driver: dtk23.10
python: python3.8
torch: 2.1.0
torchvision: 0.16.0
triton: 2.1.0
apex: 0.1
```
`Tips: the DTK driver, python, torch, and other DCU-related tool versions above must match each other exactly.`
2. Install the other, non-special libraries according to requirements.txt:
```
pip install -r requirements.txt
```
If bitsandbytes fails to load during finetuning, upgrade the system libstdc++ to fix it:
```
wget http://www.vuln.cn/wp-content/uploads/2019/08/libstdc.so_.6.0.26.zip
unzip libstdc.so_.6.0.26.zip
cp libstdc++.so.6.0.26 /usr/lib64
rm -rf /lib64/libstdc++.so.6
ln -s /lib64/libstdc++.so.6.0.26 /lib64/libstdc++.so.6
```
The environment configuration above is sufficient for finetuning and inference. To pretrain from scratch, follow the environment in [`PRETRAIN.md`](./PRETRAIN.md) and additionally install the following libraries:
```
# The following packages can be obtained from the whl.zip archive
# flash_attn-2
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl
# xformers
tar -xvf xformers.tar -C .
cd xformers
pip install xformers==0.0.23 --no-deps
bash patch_xformers.rocm.sh
# rotary
pip install rotary_emb-0.1_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl
# layer_norm
pip install dropout_layer_norm-0.1_torch2.1_dtk23.10-cp38-cp38-linux_x86_64.whl
# xentropy
pip install xentropy_cuda_lib-0.1_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl
```
## Dataset
`openassistant-guanaco`
- https://huggingface.co/datasets/timdettmers/openassistant-guanaco/tree/main
A mini dataset for finetuning is already provided in the project; the data directory is structured as follows:
```
timdettmers/
├── openassistant_best_replies_train.jsonl
└── openassistant_best_replies_eval.jsonl
```
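For a quick sanity check of the data format, the sketch below reads the first record (an assumption: each line of the guanaco JSONL files is one JSON object with a single `text` field containing the dialogue):
```python
import json

# Hypothetical inspection snippet for the provided mini finetuning dataset.
path = "timdettmers/openassistant_best_replies_train.jsonl"
with open(path, encoding="utf-8") as f:
    first = json.loads(f.readline())

print(list(first.keys()))       # expected: ['text']
print(first["text"][:200])      # dialogue in "### Human: ... ### Assistant: ..." form
```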
The official datasets for pretraining from scratch are listed below; see [`PRETRAIN.md`](./PRETRAIN.md) for preprocessing of the full datasets.
`SlimPajama-627B`
- https://huggingface.co/datasets/cerebras/SlimPajama-627B
`starcoderdata`
- https://huggingface.co/datasets/bigcode/starcoderdata
`For more details, refer to the upstream project's README_origin.md.`
## Training
### Single node, multi-GPU (finetune)
```
# Download URL for the pretrained weights needed for finetuning (the weights are large and must be downloaded from Hugging Face): https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-240k-503b
# This walkthrough uses the 503b pretrained weights; after downloading, place them under the PY007 directory: PY007/TinyLlama-1.1B-intermediate-step-240k-503b
cd TinyLlama
sh sft/script.sh # full-parameter finetune
# When wandb prompts "Enter your choice" during training startup, enter 3
```
To pretrain from scratch, refer to the training commands in [`PRETRAIN.md`](./PRETRAIN.md).
## Inference
```
python sft/infer.py
# To run inference with the official default weights, set model="PY007/TinyLlama-1.1B-intermediate-step-240k-503b" in the code
```
## Result
```
# Question
Human: Do you support the Biden or Sanders campaign for President?
# Generated answer
Assistant: Well, I really don't want him to be president because of his positions on so many issues. But I do agree with Sanders that the US needs a change. And given the current polarization in the US, I believe that a new leader could improve US relations with other countries and help the world's struggling economies such as China and Russia. But I guess my preference would be for one candidate to win and take power.
```
### Accuracy
Test data: [`openassistant-guanaco`](./timdettmers/openassistant-guanaco/openassistant_best_replies_eval.jsonl); inference framework: PyTorch.
| device | train_loss | eval_loss |
|:---------:|:----:|:----:|
| DCU Z100SM | 1.7787 | 1.8038 |
| GPU V100S | 1.7787 | 1.8036 |
## Application Scenarios
### Algorithm Category
`Conversational Q&A`
### Key Application Industries
`Manufacturing, media, finance, energy, healthcare, smart home, education`
## Source Repository and Issue Reporting
- http://developer.hpccube.com/codes/modelzoo/mmpose-rtmo_pytorch.git
## References
- https://github.com/jzhang38/TinyLlama.git
- https://hf-mirror.com/ # Hugging Face mirror site; see its download tutorial
- https://hf-mirror.com/datasets # Hugging Face mirror dataset index
<div align="center">
# TinyLlama-1.1B
English | [中文](README_zh-CN.md)
[Chat Demo](https://huggingface.co/spaces/TinyLlama/tinyllama-chat) | [Discord](https://discord.gg/74Wcx4j5Nb)
</div>
The TinyLlama project aims to **pretrain** a **1.1B Llama model on 3 trillion tokens**. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.
<div align="center">
<img src=".github/TinyLlama_logo.png" width="300"/>
</div>
We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters. This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint.
#### News
- 2023-12-18: Add two notes [1](https://whimsical-aphid-86d.notion.site/Release-of-TinyLlama-1-5T-Checkpoints-Postponed-01b266998c1c47f78f5ae1520196d194?pvs=4), [2](https://whimsical-aphid-86d.notion.site/Latest-Updates-from-TinyLlama-Team-7d30c01fff794da28ccc952f327c8d4f?pvs=4) explaining the changes of training curves, project schedules, and bug fixes.
- 2023-10-03: Add examples in speculative decoding with llama.cpp. Do check out the [speculative_decoding/README.md](speculative_decoding/README.md).
- 2023-10-02: 1. 1T-token checkpoint just dropped. 2. We document **all** intermediate checkpoints [here](https://huggingface.co/TinyLlama/tinyLlama-intermediate-checkpoints/tree/step-480k-token-1007B).
- 2023-09-28: Add a discord server.
- 2023-09-18: 1. We added a [chat demo](https://huggingface.co/spaces/PY007/TinyLlama-Chat) so that you can play with TinyLlama-Chat-V0.1 right away.
- 2023-09-16: 1. We released the intermediate checkpoint trained on 503B tokens. 2. We released a chat model finetuned on OpenAssistant, and simple [finetuning](sft) scripts are added. 3. More eval benchmarks are added and documented in [EVAL.md](EVAL.md).
#### Evaluation
You can find the evaluation results of TinyLlama in [EVAL.md](EVAL.md).
#### Releases Schedule
We will be rolling out intermediate checkpoints according to the schedule below.
Base models:
| Date | HF Checkpoint | Tokens | Step | Commonsense Avg |
|------------|-------------------------------------------------|--------|------| --------------- |
| 2023-09-01 | Pythia-1.0B | 300B | 143k | 48.30 |
| 2023-09-04 | [TinyLlama-1.1B-intermediate-step-50k-105b](https://huggingface.co/PY007/TinyLlama-1.1B-step-50K-105b) | 105B | 50k | 46.11|
| 2023-09-16 | [TinyLlama-1.1B-intermediate-step-240k-503b](https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b) | 503B | 240K | 48.28 |
| 2023-10-01 | [TinyLlama-1.1B-intermediate-step-480k-1T](https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-480k-1T) | 1T | 480k | 50.22 |
| 2023-11-04 | [TinyLlama-1.1B-intermediate-step-715k-1.5T](https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-715k-1.5T) | 1.5T | 715k | 51.28 |
| 2023-11-20 | [TinyLlama-1.1B-intermediate-step-955k-2T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-955k-token-2T) | 2T | 955k | 51.64 |
| 2023-12-11 | [TinyLlama-1.1B-intermediate-step-1195k-2.5T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) | 2.5T | 1195k |53.86 |
| 2023-12-28 | [TinyLlama-1.1B-intermediate-step-1431k-3T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) | 3T | 1431k | 52.99 |
We are crafting a note offering a possible explanation for the significant improvement from the 2T to the 2.5T checkpoint (it is related to the [bos_id issue](https://github.com/jzhang38/TinyLlama/issues/83)).
Chat models:
| Date | HF Checkpoint | Tokens | Step | Commonsense Avg |
|------------|-------------------------------------------------|--------|------| --------------- |
| 2023-09-16 | [TinyLlama-1.1B-Chat-V0.1](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1) | 503B | 240K | 49.57 |
| 2023-10-1 | [TinyLlama-1.1B-Chat-V0.3](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3) | 1T | 480K | 51.36 |
| 2023-11-04 | [TinyLlama-1.1B-Chat-V0.4](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.4) | 1.5T | 715K | 52.30 |
Note that the learning rate of the base model has not cooled down yet, so we recommend also using the finetuned chat model.
Meanwhile, you can track the live cross entropy loss [here](https://wandb.ai/lance777/lightning_logs/reports/metric-train_loss-23-09-04-23-38-15---Vmlldzo1MzA4MzIw?accessToken=5eu2sndit2mo6eqls8h38sklcgfwt660ek1f2czlgtqjv2c6tida47qm1oty8ik9).
## Potential Use Cases
Tiny but strong language models are useful for many applications. Here are some potential use cases:
- Assisting speculative decoding of larger models. (See this [tutorial](https://twitter.com/karpathy/status/1697318534555336961) by Andrej Karpathy)
- Deployment on edge devices with restricted memory and computational capacities, for functionalities like real-time machine translation without an internet connection (the 4-bit-quantized TinyLlama-1.1B weights take up only 637 MB; see the quick size check after this list).
- Enabling real-time dialogue generation in video games.
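As a rough sanity check on that number, the back-of-envelope sketch below estimates the raw 4-bit weight size; the gap to the quoted on-disk figure is assumed to come from quantization metadata and tensors kept at higher precision:
```python
# Back-of-envelope estimate of the 4-bit checkpoint size quoted above.
params = 1.1e9                         # TinyLlama parameter count
raw_mb = params * 4 / 8 / 1e6          # 4 bits per weight -> bytes -> MB
print(f"raw 4-bit weights: ~{raw_mb:.0f} MB")   # ~550 MB
# Per-group scales/zero-points and any layers kept in higher precision add
# overhead on top of this, which is consistent with a file of roughly 637 MB.
```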
Moreover, our code can be a **reference for enthusiasts keen on pretraining language models under 5 billion parameters** without diving too early into [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
## Training Details
Below are some details of our training setup:
| Setting | Description |
|---------------------------------|----------------------------------------------------------------|
| Parameters | 1.1B |
| Attention Variant | Grouped Query Attention |
| Model Size | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size (Swiglu): 5632|
| Sequence Length | 2048 |
| Batch Size | 2 million tokens (2048 * 1024) |
| Learning Rate | 4e-4 |
| Learning Rate Schedule | Cosine with 2000 warmup steps. See [Issue 27](https://github.com/jzhang38/TinyLlama/issues/27) for a minor bug |
| Training Data | [Slimpajama](https://huggingface.co/datasets/cerebras/slimpajama-627b) & [Starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) |
| Data Preprocessing | Excluded GitHub subset of Slimpajama; Sampled all code from Starcoderdata |
| Combined Dataset Size | Around 950B tokens |
| Total Tokens During Training | 3 trillion (slightly more than 3 epochs/1430k steps) |
| Natural Language to Code Ratio | 7:3 |
| Hardware | 16 A100-40G GPUs |
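If you want to instantiate the same architecture with Hugging Face `transformers`, the sketch below is assembled from the table above and the `config.json` shipped with the checkpoints; it builds a randomly initialized model, not the trained weights:
```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hyperparameters taken from the table above / TinyLlama's config.json.
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=22,
    num_attention_heads=32,
    num_key_value_heads=4,        # grouped query attention with 4 query groups
    max_position_embeddings=2048,
    rms_norm_eps=1e-5,
    tie_word_embeddings=False,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")  # ~1.10B
```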
## Blazingly Fast
Our codebase supports the following features:
- multi-gpu and multi-node distributed training with FSDP.
- flash attention 2.
- fused layernorm.
- fused swiglu.
- fused cross entropy loss.
- fused rotary positional embedding.
Credit: flash attention 2, fused layernorm, fused cross entropy loss, and fused
rotary positional embedding are from the [FlashAttention repo](https://github.com/Dao-AILab/flash-attention/). Fused swiglu is from [xformers](https://github.com/facebookresearch/xformers).
Thanks to those optimizations, we achieve a throughput of **24k** tokens per second per A100-40G GPU, which translates to **56% model flops utilization** without activation checkpointing (We expect the MFU to be even higher on A100-80G). It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in **32 hours with 8 A100**. Those optimizations also greatly reduce the memory footprint, allowing us to stuff our 1.1B model into 40GB GPU RAM and train with a per-gpu batch size of 16k tokens. **You can also pretrain TinyLlama on 3090/4090 GPUs with a smaller per-gpu batch size**.
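For intuition, here is a rough MFU estimate from those numbers using the common 6·N·D approximation for training FLOPs (a sketch only; the quoted 56% presumably comes from a more detailed per-layer FLOP count, e.g. one that includes attention):
```python
# Rough model-flops-utilization estimate from the throughput quoted above.
params = 1.1e9                  # TinyLlama parameter count
tokens_per_sec = 24_000         # measured per-GPU training throughput
peak_flops = 312e12             # A100 peak dense BF16 throughput, FLOP/s

achieved = 6 * params * tokens_per_sec              # ~1.6e14 FLOP/s
print(f"approx. MFU: {achieved / peak_flops:.0%}")  # ~51% with this crude formula
```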
Below is a comparison of the training speed of our codebase with that of Pythia and MPT.
| Model | A100 GPU hours taken on 300B tokens|
|-----------------------------------|------------------------------------|
|TinyLlama-1.1B | 3456 |
|[Pythia-1.0B](https://huggingface.co/EleutherAI/pythia-1b) | 4830 |
|[MPT-1.3B](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b) | 7920 |
<small> The Pythia number comes from their [paper](https://arxiv.org/abs/2304.01373). The MPT number comes from [here](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b), in which they say MPT-1.3B " was trained on 440 A100-40GBs for about half a day" on 200B tokens. </small>
The fact that TinyLlama is a relatively small model with grouped query attention means it is also fast during inference. Below are some throughputs that we measured:
| Framework | Device | Settings | Throughput (tokens/sec) |
|-----------|--------------|-----|-----------|
|[Llama.cpp](https://github.com/ggerganov/llama.cpp) | Mac M2 16GB RAM | batch_size=1; 4-bit inference| 71.8 |
|[vLLM](https://github.com/vllm-project/vllm) | A40 GPU | batch_size=100, n=10 | 7094.5 |
## Pretrain
Please refer to [PRETRAIN.md](PRETRAIN.md) for instructions on how to pretrain TinyLlama.
## Finetune
We include a simple full-parameter finetuning & inference script in [sft](sft). Our V0.1 chat model is finetuned using this script. The FT dataset we use is [openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).
For finetuning with less than 4GB RAM, we refer you to the [Qlora](https://github.com/artidoro/qlora) and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) repos.
We did not perform extensive hyperparameter tuning, nor did we select more performant FT datasets. We hope the community can explore finetuning TinyLlama and come up with better chat models; we will include community-finetuned models in this repo.
## TODO
This project is still under active development. We are a really small team. Community feedback and contributions are highly appreciated. Here are some things we plan to work on:
- [ ] Add scripts for pretraining on other datasets.
- [ ] Sequence length extrapolation.
- [ ] Test out speculative decoding for Llama-2-7B.
- [ ] Test the throughput on RTX 3090/4090.
- [ ] Add fine-tuning scripts.
- [ ] Properly evaluate the model on downstream tasks.
- [ ] A demo running on mobile phones.
- [ ] Explore retrieval-augmentation.
## Acknowledgements
This repository is built upon [lit-gpt](https://github.com/Lightning-AI/lit-gpt) and [flash-attention](https://github.com/Dao-AILab/flash-attention). Be sure to explore these fantastic open-source projects if they're new to you!
```
@online{lit-gpt,
author = {Lightning AI},
title = {Lit-GPT},
url = {https://github.com/Lightning-AI/lit-gpt},
year = {2023},
}
@article{dao2023flashattention2,
title ={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
author ={Dao, Tri},
year ={2023}
}
```
## Citation
This project is currently contributed by [Peiyuan Zhang](https://veiled-texture-20c.notion.site/Peiyuan-Zhang-ab24b48621c9491db767a76df860873a?pvs=4) *, [Guangtao Zeng](https://github.com/ChaosCodes) *, [Tianduo Wang](https://github.com/TianduoWang) and [Wei Lu](https://istd.sutd.edu.sg/people/faculty/lu-wei/) from the StatNLP Research Group of Singapore University of Technology and Design.
If you find our work valuable, please cite:
```
@misc{zhang2024tinyllama,
title={TinyLlama: An Open-Source Small Language Model},
author={Peiyuan Zhang and Guangtao Zeng and Tianduo Wang and Wei Lu},
year={2024},
eprint={2401.02385},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Frequently Asked Questions
#### 1. Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?
<img src=".github/llama2-training.png" alt="The training loss curve of Llama 2" width="500"/>
Above is the training loss curve taken from the Llama 2 paper. Here I quote from that paper: "We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation". That is why we believe pretraining a 1.1B model for 3T tokens is a reasonable thing to do. Even if the loss curve eventually stops going down, we can still study the phenomenon of saturation and learn something from it.
#### 2. What does "saturation" mean?
<img src=".github/Pythia_saturation.png" alt="Figure 10 of the Pythia paper" width="500"/>
The figure from the Pythia paper displays the LAMBADA accuracy plotted against the total training tokens (300B). The term "saturation" pertains specifically to the 70M and 160M models. Notably, even the 410M model does not saturate with 300B tokens, as it continues to show an increasing trend, similar to the trend of larger models.
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=jzhang38/TinyLlama&type=Date)](https://star-history.com/#jzhang38/TinyLlama&Date)
<div align="center">
# TinyLlama-1.1B
[English](README.md) | Chinese
[Chat Demo](https://huggingface.co/spaces/PY007/TinyLlama-Chat)
</div>
The TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. With careful optimization, we can achieve this within "just" 90 days using 16 A100-40G GPUs 🚀🚀. Training started on 2023-09-01.
<div align="center">
<img src=".github/TinyLlama_logo.png" width="300"/>
</div>
We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged into many open-source projects built upon Llama. In addition, TinyLlama has only 1.1B parameters, making it compact and suitable for many applications with restricted computation and memory footprints.
#### News
* 2023-09-18:
  * Released a [chat demo](https://huggingface.co/spaces/PY007/TinyLlama-Chat); click the link to try our model.
* 2023-09-16:
  * Released the [checkpoint](https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b) trained on 503B tokens so far.
  * Finetuned the 503B-token [checkpoint](https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b) on the OpenAssistant dataset and open-sourced the chat model [TinyLlama-Chat-V0.1](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1), together with our [finetuning scripts](sft).
  * Added more evaluation benchmarks; see [EVAL.md](EVAL.md) for the results of each model.
#### Release Schedule
We will gradually release intermediate checkpoints according to the schedule below. We also list some baseline models for comparison.
| Date | ModelScope Model | Tokens | Step | Commonsense Avg |
| ---------- | ------------------------------------------------------------ | ------ | ---- | --------------- |
| 2023-09-01 | Pythia-1.0B | 300B | 143k | 48.30 |
| 2023-09-04 | [TinyLlama-1.1B-intermediate-step-50k-105b](https://www.modelscope.cn/models/chaoscodes/TinyLlama-1.1B-step-50K-105b/files) | 105B | 50k | 46.11 |
| 2023-09-16 | [TinyLlama-1.1B-intermediate-step-240k-503b](https://www.modelscope.cn/models/chaoscodes/TinyLlama-1.1B-intermediate-step-240k-503b/files) | 503B | 240K | 48.28 |
| 2023-09-16 | [TinyLlama-1.1B-Chat-V0.1](https://www.modelscope.cn/models/chaoscodes/TinyLlama-1.1B-Chat-v0.1/files) | 503B | 240K | 49.57 |
| 2023-10-01 | TinyLlama-1.1B-intermediate-step-480k-1007B | 1T | 480K | 50.22 |
| 2023-10-16 | -- | 1.5T | -- | -- |
| 2023-10-31 | -- | 2T | -- | -- |
| 2023-11-15 | -- | 2.5T | -- | -- |
| 2023-12-01 | -- | 3T | -- | -- |
Note that the model is still in the early stage of training and its learning rate has not yet cooled down. For a better experience, we recommend downloading our [chat model](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1) or trying it through the [chat demo](https://huggingface.co/spaces/PY007/TinyLlama-Chat).
You can also track TinyLlama's training loss live [here](https://api.wandb.ai/links/lance777/pgvhrsny).
## Potential Use Cases
Small but strong language models are useful for many applications. Here are some potential scenarios:
- Assisting speculative decoding of larger models.
- Running on edge devices, for example offline real-time machine translation (the 4-bit quantized TinyLlama weights only take about 550MB of memory).
- Real-time dialogue generation in video games (the model has to be small because the game itself also needs GPU memory).
Moreover, our code can serve as a **concise reference for getting started with pretraining**. If you want to train a language model with fewer than 5 billion parameters, you do not really need Megatron-LM.
## Training Details
Below are some details of our training setup:
| Setting | Description |
|---------------------------------|----------------------------------------------------------------|
| Parameters | 1.1B |
| Attention Variant | Grouped Query Attention |
| Model Size | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size (Swiglu): 5632|
| Sequence Length | 2048 |
| Batch Size | 2 million tokens (2048 * 1024) |
| Learning Rate | 4e-4 |
| Learning Rate Schedule | Cosine with 2000 warmup steps |
| Training Data | [Slimpajama](https://huggingface.co/datasets/cerebras/slimpajama-627b) & [Starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) |
| Data Preprocessing | Excluded GitHub subset of Slimpajama; Sampled all code from Starcoderdata |
| Combined Dataset Size | Around 950B tokens |
| Total Tokens During Training | 3 trillion (slightly more than 3 epochs/1430k steps) |
| Natural Language to Code Ratio | 7:3 |
| Hardware | 16 A100-40G GPUs |
## Blazingly Fast
Our codebase supports the following features:
- multi-gpu and multi-node distributed training with FSDP.
- flash attention 2.
- fused layernorm.
- fused swiglu.
- fused cross entropy loss.
- fused rotary positional embedding.
Credit: flash attention 2, fused layernorm, fused cross entropy loss, and fused
rotary positional embedding are from the [FlashAttention repo](https://github.com/Dao-AILab/flash-attention/). Fused swiglu is from [xformers](https://github.com/facebookresearch/xformers).
With these optimizations, we achieve a training throughput of **24k tokens/sec per A100**, i.e. 56% MFU (it would be even higher on A100-80G). This means you can train a chinchilla-optimal model (1.1B parameters, 22B tokens) on **8 A100s in 32 hours**. The optimizations also greatly reduce the memory footprint: we can fit the 1.1B model into 40GB of GPU memory while keeping a per-GPU batch size of 16k tokens. Just lower the batch size a little and you can train TinyLlama on **RTX 3090/4090** GPUs.
Below is a comparison of the training speed of our codebase with that of Pythia and MPT.
| Model | A100 GPU hours taken on 300B tokens|
|-----------------------------------|------------------------------------|
|TinyLlama-1.1B | 3456 |
|[Pythia-1.0B](https://huggingface.co/EleutherAI/pythia-1b) | 4830 |
|[MPT-1.3B](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b) | 7920 |
<small> The Pythia number comes from their paper. The MPT number comes from [here](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b), where the authors say MPT-1.3B "was trained on 440 A100-40GBs for about half a day" on 200B tokens. </small>
TinyLlama is a relatively small model and uses grouped query attention, which means it is also fast at inference time. Below are some inference speeds we measured:
| Framework | Device | Settings | Throughput (tokens/sec) |
|-----------|--------------|-----|-----------|
|[Llama.cpp](https://github.com/ggerganov/llama.cpp) | Mac M2 16GB RAM | batch_size=1; 4-bit inference| 71.8 |
|[vLLM](https://github.com/vllm-project/vllm) | A40 GPU | batch_size=100, n=10 | 7094.5 |
## Pretrain
Please refer to [PRETRAIN.md](PRETRAIN.md).
## Finetune
* We provide our finetuning and inference code in [sft](sft). Using this code, we finetuned on the [openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset and obtained our first [chat model](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1).
* If you want to finetune our model on a GPU with less than 4GB of memory, refer to the [Qlora](https://github.com/artidoro/qlora) and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) projects.
* We have not searched hyperparameters extensively for finetuning, nor selected potentially better instruction datasets. We hope to foster open research on TinyLlama in the NLP community and the release of better finetuned chat models; we will also include such models in this project.
## TODO
This project is still under active development. Our team is very small, and community feedback and contributions are highly welcome. Here are some things we plan to work on:
- [ ] Add scripts for pretraining on other datasets.
- [ ] Sequence length extrapolation.
- [ ] Test out speculative decoding for Llama-2-7B.
- [ ] Test the throughput on RTX 3090/4090.
- [ ] Add fine-tuning scripts.
- [ ] Properly evaluate the model on downstream tasks.
- [ ] A demo running on mobile phones.
- [ ] Explore retrieval-augmentation.
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=jzhang38/TinyLlama&type=Date)](https://star-history.com/#jzhang38/TinyLlama&Date)
## Acknowledgements
This repository is built upon the excellent open-source projects [lit-gpt](https://github.com/Lightning-AI/lit-gpt) and [flash-attention](https://github.com/Dao-AILab/flash-attention).
```
@online{lit-gpt,
author = {Lightning AI},
title = {Lit-GPT},
url = {https://github.com/Lightning-AI/lit-gpt},
year = {2023},
}
@article{dao2023flashattention2,
title ={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
author ={Dao, Tri},
year ={2023}
}
```
## Citation
This project is currently contributed by [Peiyuan Zhang](https://github.com/jzhang38), [Guangtao Zeng](https://github.com/ChaosCodes), [Tianduo Wang](https://github.com/TianduoWang) and [Wei Lu](https://istd.sutd.edu.sg/people/faculty/lu-wei/).
If you find our work valuable, please cite:
```
@online{tinyllama,
author = {Peiyuan Zhang, Guangtao Zeng, Tianduo Wang and Wei Lu},
title = {TinyLlama},
url = {https://github.com/jzhang38/TinyLlama},
year = {2023},
month = {Sep}
}
```
## Tinyllama Chatbot Implementation with Gradio
We offer an easy way to interact with Tinyllama. This guide explains how to set up a local Gradio demo for a chatbot using TinyLlama.
(A demo is also available on the Hugging Face Space [TinyLlama/tinyllama_chatbot](https://huggingface.co/spaces/TinyLlama/tinyllama-chat) or on [Colab](https://colab.research.google.com/drive/1qAuL5wTIa-USaNBu8DH35KQtICTnuLsy?usp=sharing).)
### Requirements
* Python>=3.8
* PyTorch>=2.0
* Transformers>=4.34.0
* Gradio>=4.13.0
### Installation
`pip install -r requirements.txt`
### Usage
`python TinyLlama/chat_gradio/app.py`
* After running it, open the local URL displayed in your terminal in your web browser. (For server setup, use SSH local port forwarding with the command: `ssh -L [local port]:localhost:[remote port] [username]@[server address]`.)
* Interact with the chatbot by typing questions or commands.
**Note:** The chatbot's performance may vary based on your system's hardware. Ensure your system meets the above requirements for an optimal experience.
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
from threading import Thread

# Loading the tokenizer and model from Hugging Face's model hub.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Using CUDA for an optimal experience.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)


# Defining a custom stopping criteria class for the model's text generation.
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        stop_ids = [2]  # IDs of tokens where the generation should stop.
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:  # Checking if the last generated token is a stop token.
                return True
        return False


# Function to generate model predictions.
def predict(message, history):
    history_transformer_format = history + [[message, ""]]
    stop = StopOnTokens()

    # Formatting the input for the model.
    messages = "</s>".join(["</s>".join(["\n<|user|>:" + item[0], "\n<|assistant|>:" + item[1]])
                            for item in history_transformer_format])
    model_inputs = tokenizer([messages], return_tensors="pt").to(device)
    streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        temperature=0.7,
        num_beams=1,
        stopping_criteria=StoppingCriteriaList([stop])
    )
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()  # Starting the generation in a separate thread.

    partial_message = ""
    for new_token in streamer:
        partial_message += new_token
        if '</s>' in partial_message:  # Breaking the loop if the stop token is generated.
            break
        yield partial_message


# Setting up the Gradio chat interface.
gr.ChatInterface(predict,
                 title="Tinyllama_chatBot",
                 description="Ask Tiny llama any questions",
                 examples=['How to cook a fish?', 'Who is the president of US now?']
                 ).launch()  # Launching the web interface.
torch>=2.0
transformers>=4.35.0
gradio>=4.13.0