Unverified Commit 862fbaaa authored by Tong Li's avatar Tong Li Committed by GitHub
Browse files

[Feature] Support LLaMA-3 CPT and ST (#5619)

* support LLaMA-3

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Run pre-commit

---------
Co-authored-by: default avatarpre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
parent e094933d
<div align="center">
<h1>
<img src="https://github.com/hpcaitech/public_assets/blob/main/applications/colossal-llama-2/colossalllam2.jpg?raw=true" width=800/>
Colossal-LLaMA
</h1>
</div>
......@@ -47,6 +47,7 @@
- [Citations](#citations)
## News
* [2024/4] Support continual pre-training and supervised fine-tuning of LLaMA-3.
* [2024/01] [Construct Refined 13B Private Model With Just $5000 USD, Upgraded Colossal-AI Llama-2 Open Source](https://hpc-ai.com/blog/colossal-llama-2-13b).
[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
[[blog]](https://hpc-ai.com/blog/colossal-llama-2-13b)
......@@ -289,7 +290,7 @@ Here is details about CLI arguments:
#### 1. Install required packages
```
cd Colossal-LLaMA-2
cd Colossal-LLaMA
pip install -r requirements.txt
```
#### 2. Install `xentropy`, `layer_norm` and `rotary`
......@@ -314,7 +315,7 @@ Initialize new tokenizer with additional Chinese tokens. Additional Chinese toke
Command to initialize new tokenizer:
```bash
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
python colossal_llama2/tokenizer/init_tokenizer.py \
python colossal_llama/tokenizer/init_tokenizer.py \
--source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \
--target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \
--expand_tokens_file "<NEW_TOKENS_FILE>.jsonl"
......@@ -328,7 +329,7 @@ Here is details about CLI arguments:
Initialize the new model checkpoint by calculating the mean values from the original model checkpoint.
Command to initialize new model checkpoint:
```bash
python colossal_llama2/model/init_model.py \
python colossal_llama/model/init_model.py \
--source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \
--target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \
--target_model_path "<TARGET_MODEL_DIR>"
......@@ -362,18 +363,17 @@ Command to convert jsonl dataset to arrow format:
python prepare_pretrain_dataset.py \
--data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
--tokenizer_dir "<TOKENIZER_DIR>" \
--data_cache_dir "jsonl_to_arrow_cache" \
--data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
--data_arrow_output_dir "spliced_tokenized_output_arrow" \
--data_output_dirs "spliced tokenized output" \
--max_length 4096 \
--num_spliced_dataset_bins 10
```
Here is details about CLI arguments:
* Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple file in `jsonl` format.
* Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format.
* Data cache directory: `data_cache_dir`. Directory to store Hugging Face data cache. Default case will create `cache` folder locally.
* Output directory for jsonl format: `data_jsonl_output_dir`. Output directory to store converted dataset in jsonl format.
* Output directory for arrow format: `data_arrow_output_dir`. Output directory to store converted dataset in arrow format, which can be used for training directly.
* Data output directory: `data_output_dirs`. Directory to store preprocessed output, including three sub-directories:
* `cache`: Directory to store Hugging Face data cache.
* `jsonl`: Output directory to store converted dataset in jsonl format.
* `arrow`: Output directory to store converted dataset in arrow format, which can be used for training directly.
* Max length: `max_length`. Max length of spliced samples. Default value is 4096.
* Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training.
......@@ -392,13 +392,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3
python prepare_sft_dataset.py.py \
--data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
--tokenizer_dir "<TOKENIZER_DIR>" \
--data_cache_dir "jsonl_to_arrow_cache" \
--data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
--data_arrow_output_dir "spliced_tokenized_output_arrow" \
--data_output_dirs "spliced tokenized output" \
--max_length 4096 \
--num_spliced_dataset_bins 10
--num_spliced_dataset_bins 10 \
--llama_version 3
```
Additional CLI arguments:
* LLaMA verison: `llama_version`. Specify the LLaMA version.
#### 4. Command Line Arguments for Training
##### 4.1 Arguments for Pretraining
......
......@@ -83,7 +83,7 @@ class Conversation:
}
conv = Conversation(
LLaMA2_Conv = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
roles=("Human", "Assistant"),
......@@ -93,4 +93,14 @@ conv = Conversation(
seps=["<s>", "</s>"],
)
default_conversation = conv
LLaMA3_Conv = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
roles=("Human", "Assistant"),
messages=[],
offset=0,
sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
seps=["<|begin_of_text|>", "<|end_of_text|>"],
)
default_conversation = LLaMA3_Conv
......@@ -12,6 +12,7 @@ from typing import Any, Callable, Dict, Iterable, List, Tuple, Union
from datasets import dataset_dict
from torch.utils.data import ConcatDataset, Dataset, IterableDataset
from transformers import AutoTokenizer
from transformers.models.llama.tokenization_llama import LlamaTokenizer
from transformers.tokenization_utils import PreTrainedTokenizer
......@@ -71,7 +72,7 @@ def supervised_tokenize_pretrain(
def supervised_tokenize_sft(
data_point: Dict[str, str],
tokenizer: LlamaTokenizer,
tokenizer: AutoTokenizer,
conversation_template: Conversation = default_conversation,
ignore_index: int = None,
max_length: int = 4096,
......
import argparse
import torch
from colossal_llama2.dataset.conversation import default_conversation
from colossal_llama.dataset.conversation import default_conversation
from transformers import AutoModelForCausalLM, AutoTokenizer
from colossalai.logging import get_dist_logger
......
......@@ -11,12 +11,12 @@ import os
import time
from multiprocessing import cpu_count
from colossal_llama2.dataset.spliced_and_tokenized_dataset import (
from colossal_llama.dataset.spliced_and_tokenized_dataset import (
ClosedToConstantLengthSplicedDataset,
supervised_tokenize_pretrain,
)
from datasets import dataset_dict, load_dataset
from transformers.models.llama.tokenization_llama import LlamaTokenizer
from transformers import AutoTokenizer
from colossalai.logging import get_dist_logger
......@@ -35,34 +35,23 @@ def main():
parser.add_argument(
"--tokenizer_dir", type=str, required=True, default=None, help="A directory containing the tokenizer"
)
parser.add_argument("--data_cache_dir", type=str, default="cache", help="Data cache directory")
parser.add_argument(
"--data_jsonl_output_dir",
type=str,
default="jsonl_output",
help="Output directory of spliced dataset with jsonl format",
)
parser.add_argument(
"--data_arrow_output_dir",
type=str,
default="arrow_output",
help="Output directory of spliced dataset with arrow format",
)
parser.add_argument("--max_length", type=int, default=4096, help="Max length of each spliced tokenized sequence")
parser.add_argument("--data_output_dirs", type=str, default="data_output_dirs", help="Data output directory")
parser.add_argument("--max_length", type=int, default=8192, help="Max length of each spliced tokenized sequence")
parser.add_argument("--num_spliced_dataset_bins", type=int, default=10, help="Number of spliced dataset bins")
args = parser.parse_args()
if args.num_spliced_dataset_bins >= 100000:
raise ValueError("Too many spliced divisions, must be smaller than 100000")
assert not os.path.exists(args.data_cache_dir), f"Find existed data cache dir {args.data_cache_dir}"
assert not os.path.exists(
args.data_jsonl_output_dir
), f"Find existed jsonl data output dir {args.data_jsonl_output_dir}"
assert not os.path.exists(
args.data_arrow_output_dir
), f"Find existed arrow data output dir {args.data_arrow_output_dir}"
args.data_cache_dir = os.path.join(args.data_output_dirs, "cache")
args.data_jsonl_output_dir = os.path.join(args.data_output_dirs, "jsonl")
args.data_arrow_output_dir = os.path.join(args.data_output_dirs, "arrow")
if not os.path.exists(args.data_cache_dir):
os.makedirs(args.data_cache_dir)
if not os.path.exists(args.data_jsonl_output_dir):
os.makedirs(args.data_jsonl_output_dir)
if not os.path.exists(args.data_arrow_output_dir):
os.makedirs(args.data_arrow_output_dir)
# Prepare to all input datasets
......@@ -86,7 +75,7 @@ def main():
train_splits.append(f"train[{start}%:{end}%]")
# Prepare to the tokenizer.
tokenizer = LlamaTokenizer.from_pretrained(args.tokenizer_dir)
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
if tokenizer.pad_token is None:
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment