Commit 862fbaaa authored by Tong Li, committed by GitHub

[Feature] Support LLaMA-3 CPT and ST (#5619)

* support LLaMA-3

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Run pre-commit

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
parent e094933d
<div align="center"> <div align="center">
<h1> <h1>
<img src="https://github.com/hpcaitech/public_assets/blob/main/applications/colossal-llama-2/colossalllam2.jpg?raw=true" width=800/> Colossal-LLaMA
</h1> </h1>
</div> </div>
...@@ -47,6 +47,7 @@ ...@@ -47,6 +47,7 @@
- [Citations](#citations) - [Citations](#citations)
## News ## News
* [2024/4] Support continual pre-training and supervised fine-tuning of LLaMA-3.
* [2024/01] [Construct Refined 13B Private Model With Just $5000 USD, Upgraded Colossal-AI Llama-2 Open Source](https://hpc-ai.com/blog/colossal-llama-2-13b). * [2024/01] [Construct Refined 13B Private Model With Just $5000 USD, Upgraded Colossal-AI Llama-2 Open Source](https://hpc-ai.com/blog/colossal-llama-2-13b).
[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2) [[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
[[blog]](https://hpc-ai.com/blog/colossal-llama-2-13b) [[blog]](https://hpc-ai.com/blog/colossal-llama-2-13b)
...@@ -289,7 +290,7 @@ Here is details about CLI arguments: ...@@ -289,7 +290,7 @@ Here is details about CLI arguments:
#### 1. Install required packages #### 1. Install required packages
``` ```
cd Colossal-LLaMA-2 cd Colossal-LLaMA
pip install -r requirements.txt pip install -r requirements.txt
``` ```
#### 2. Install `xentropy`, `layer_norm` and `rotary` #### 2. Install `xentropy`, `layer_norm` and `rotary`
...@@ -314,7 +315,7 @@ Initialize new tokenizer with additional Chinese tokens. Additional Chinese toke ...@@ -314,7 +315,7 @@ Initialize new tokenizer with additional Chinese tokens. Additional Chinese toke
Command to initialize new tokenizer: Command to initialize new tokenizer:
```bash ```bash
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python' export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
python colossal_llama2/tokenizer/init_tokenizer.py \ python colossal_llama/tokenizer/init_tokenizer.py \
--source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \ --source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \
--target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \ --target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \
--expand_tokens_file "<NEW_TOKENS_FILE>.jsonl" --expand_tokens_file "<NEW_TOKENS_FILE>.jsonl"
...@@ -328,7 +329,7 @@ Here is details about CLI arguments: ...@@ -328,7 +329,7 @@ Here is details about CLI arguments:
Initialize the new model checkpoint by calculating the mean values from the original model checkpoint. Initialize the new model checkpoint by calculating the mean values from the original model checkpoint.
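The mean-value initialization described above can be sketched as follows. This is a minimal illustration, not the code in `init_model.py`: it assumes that each added token gets an embedding row equal to the mean of the source-model rows for the pieces it decomposes into under the source tokenizer. The helper name `mean_init_rows` and the toy data are hypothetical.

```python
from typing import Dict, List


def mean_init_rows(
    source_embeddings: List[List[float]],
    new_token_pieces: Dict[str, List[int]],
) -> Dict[str, List[float]]:
    """Return one embedding row per new token, averaged over its source pieces.

    `new_token_pieces` maps each added token to the ids the *source* tokenizer
    would have produced for it (an assumption made for this sketch).
    """
    hidden_size = len(source_embeddings[0])
    rows: Dict[str, List[float]] = {}
    for token, piece_ids in new_token_pieces.items():
        if not piece_ids:
            raise ValueError(f"no source pieces for token {token!r}")
        rows[token] = [
            sum(source_embeddings[i][d] for i in piece_ids) / len(piece_ids)
            for d in range(hidden_size)
        ]
    return rows


# Toy source embedding table: 3 tokens, hidden size 2.
source = [[0.0, 2.0], [4.0, 6.0], [8.0, 10.0]]
rows = mean_init_rows(source, {"你好": [0, 1], "世界": [2]})
print(rows)
```

Starting a new row from the average of existing rows keeps it in the same region of embedding space as the vocabulary the model already knows, which is usually gentler than random initialization.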
Command to initialize new model checkpoint: Command to initialize new model checkpoint:
```bash ```bash
python colossal_llama2/model/init_model.py \ python colossal_llama/model/init_model.py \
--source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \ --source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \
--target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \ --target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \
--target_model_path "<TARGET_MODEL_DIR>" --target_model_path "<TARGET_MODEL_DIR>"
...@@ -362,18 +363,17 @@ Command to convert jsonl dataset to arrow format: ...@@ -362,18 +363,17 @@ Command to convert jsonl dataset to arrow format:
python prepare_pretrain_dataset.py \ python prepare_pretrain_dataset.py \
--data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \ --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
--tokenizer_dir "<TOKENIZER_DIR>" \ --tokenizer_dir "<TOKENIZER_DIR>" \
--data_cache_dir "jsonl_to_arrow_cache" \ --data_output_dirs "spliced tokenized output" \
--data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
--data_arrow_output_dir "spliced_tokenized_output_arrow" \
--max_length 4096 \ --max_length 4096 \
--num_spliced_dataset_bins 10 --num_spliced_dataset_bins 10
``` ```
Here are the details about CLI arguments: Here are the details about CLI arguments:
* Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple files in `jsonl` format. * Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple files in `jsonl` format.
* Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format. * Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format.
* Data cache directory: `data_cache_dir`. Directory to store Hugging Face data cache. Default case will create `cache` folder locally. * Data output directory: `data_output_dirs`. Directory to store preprocessed output, including three sub-directories:
* Output directory for jsonl format: `data_jsonl_output_dir`. Output directory to store converted dataset in jsonl format. * `cache`: Directory to store Hugging Face data cache.
* Output directory for arrow format: `data_arrow_output_dir`. Output directory to store converted dataset in arrow format, which can be used for training directly. * `jsonl`: Output directory to store converted dataset in jsonl format.
* `arrow`: Output directory to store converted dataset in arrow format, which can be used for training directly.
* Max length: `max_length`. Max length of spliced samples. Default value is 4096. * Max length: `max_length`. Max length of spliced samples. Default value is 4096.
* Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training. * Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training.
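The layout under `data_output_dirs` can be sketched as below. The sub-directory names `cache`, `jsonl`, and `arrow` come from the argument description above. The helper name `prepare_output_dirs` is hypothetical, and `exist_ok=True` is a simplification chosen for this sketch rather than the script's exact logic.

```python
import os
import tempfile
from typing import Dict


def prepare_output_dirs(data_output_dirs: str) -> Dict[str, str]:
    """Create the cache/jsonl/arrow sub-directories and return their paths."""
    paths = {
        name: os.path.join(data_output_dirs, name)
        for name in ("cache", "jsonl", "arrow")
    }
    for path in paths.values():
        # exist_ok=True makes reruns harmless; this is a sketch-level choice.
        os.makedirs(path, exist_ok=True)
    return paths


# Demonstrate inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as root:
    output_root = os.path.join(root, "spliced_tokenized_output")
    paths = prepare_output_dirs(output_root)
    created = sorted(os.listdir(output_root))

print(created)
```

A single root argument replaces three separate directory flags, so a preprocessing run only needs one path to be chosen.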
...@@ -392,13 +392,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3 ...@@ -392,13 +392,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3
python prepare_sft_dataset.py \ python prepare_sft_dataset.py \
--data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \ --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
--tokenizer_dir "<TOKENIZER_DIR>" \ --tokenizer_dir "<TOKENIZER_DIR>" \
--data_cache_dir "jsonl_to_arrow_cache" \ --data_output_dirs "spliced tokenized output" \
--data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
--data_arrow_output_dir "spliced_tokenized_output_arrow" \
--max_length 4096 \ --max_length 4096 \
--num_spliced_dataset_bins 10 --num_spliced_dataset_bins 10 \
--llama_version 3
``` ```
Additional CLI arguments:
* LLaMA version: `llama_version`. Specify the LLaMA version.
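One way a script could turn `llama_version` into a conversation template is sketched here. The template names `LLaMA2_Conv` and `LLaMA3_Conv` appear in the conversation module changed by this commit. The mapping, the helper name `pick_template_name`, the `choices` restriction, and the default of 3 are assumptions for illustration, not the script's confirmed behaviour.

```python
import argparse
from typing import List, Optional

# Hypothetical mapping from version number to template name.
TEMPLATE_BY_VERSION = {2: "LLaMA2_Conv", 3: "LLaMA3_Conv"}


def pick_template_name(argv: Optional[List[str]] = None) -> str:
    """Parse --llama_version and return the matching template name."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--llama_version",
        type=int,
        default=3,  # assumed default for this sketch
        choices=sorted(TEMPLATE_BY_VERSION),
        help="LLaMA version",
    )
    args = parser.parse_args(argv)
    return TEMPLATE_BY_VERSION[args.llama_version]


print(pick_template_name(["--llama_version", "2"]))
```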
#### 4. Command Line Arguments for Training #### 4. Command Line Arguments for Training
##### 4.1 Arguments for Pretraining ##### 4.1 Arguments for Pretraining
......
...@@ -83,7 +83,7 @@ class Conversation: ...@@ -83,7 +83,7 @@ class Conversation:
} }
conv = Conversation( LLaMA2_Conv = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n", "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
roles=("Human", "Assistant"), roles=("Human", "Assistant"),
...@@ -93,4 +93,14 @@ conv = Conversation( ...@@ -93,4 +93,14 @@ conv = Conversation(
seps=["<s>", "</s>"], seps=["<s>", "</s>"],
) )
default_conversation = conv LLaMA3_Conv = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
roles=("Human", "Assistant"),
messages=[],
offset=0,
sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
seps=["<|begin_of_text|>", "<|end_of_text|>"],
)
default_conversation = LLaMA3_Conv
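The two templates differ only in their `seps` values. The sketch below shows how a BOS/EOS separator style might render a prompt. It assumes each turn is written as the role, a colon, the BOS token, the message, and the EOS token. `MiniConversation` is a hypothetical stand-in, not the repository's `Conversation` class, and the system text is shortened.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MiniConversation:
    """Hypothetical, trimmed-down stand-in for a conversation template."""

    system: str
    roles: Tuple[str, str]
    seps: Tuple[str, str]  # (BOS token, EOS token)
    messages: List[Tuple[str, str]] = field(default_factory=list)

    def append_message(self, role: str, message: str) -> None:
        self.messages.append((role, message))

    def get_prompt(self) -> str:
        bos, eos = self.seps
        prompt = self.system
        for role, message in self.messages:
            # Assumed rendering: "<role>: <BOS><message><EOS>"
            prompt += f"{role}: {bos}{message}{eos}"
        return prompt


conv = MiniConversation(
    system="System text.\n\n",
    roles=("Human", "Assistant"),
    seps=("<|begin_of_text|>", "<|end_of_text|>"),
)
conv.append_message(conv.roles[0], "Hi")
conv.append_message(conv.roles[1], "Hello!")
prompt = conv.get_prompt()
print(prompt)
```

Swapping `seps` for `("<s>", "</s>")` would give the LLaMA-2 flavour of the same prompt under the same assumed rendering.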
...@@ -12,6 +12,7 @@ from typing import Any, Callable, Dict, Iterable, List, Tuple, Union ...@@ -12,6 +12,7 @@ from typing import Any, Callable, Dict, Iterable, List, Tuple, Union
from datasets import dataset_dict from datasets import dataset_dict
from torch.utils.data import ConcatDataset, Dataset, IterableDataset from torch.utils.data import ConcatDataset, Dataset, IterableDataset
from transformers import AutoTokenizer
from transformers.models.llama.tokenization_llama import LlamaTokenizer from transformers.models.llama.tokenization_llama import LlamaTokenizer
from transformers.tokenization_utils import PreTrainedTokenizer from transformers.tokenization_utils import PreTrainedTokenizer
...@@ -71,7 +72,7 @@ def supervised_tokenize_pretrain( ...@@ -71,7 +72,7 @@ def supervised_tokenize_pretrain(
def supervised_tokenize_sft( def supervised_tokenize_sft(
data_point: Dict[str, str], data_point: Dict[str, str],
tokenizer: LlamaTokenizer, tokenizer: AutoTokenizer,
conversation_template: Conversation = default_conversation, conversation_template: Conversation = default_conversation,
ignore_index: int = None, ignore_index: int = None,
max_length: int = 4096, max_length: int = 4096,
......
import argparse import argparse
import torch import torch
from colossal_llama2.dataset.conversation import default_conversation from colossal_llama.dataset.conversation import default_conversation
from transformers import AutoModelForCausalLM, AutoTokenizer from transformers import AutoModelForCausalLM, AutoTokenizer
from colossalai.logging import get_dist_logger from colossalai.logging import get_dist_logger
......
...@@ -11,12 +11,12 @@ import os ...@@ -11,12 +11,12 @@ import os
import time import time
from multiprocessing import cpu_count from multiprocessing import cpu_count
from colossal_llama2.dataset.spliced_and_tokenized_dataset import ( from colossal_llama.dataset.spliced_and_tokenized_dataset import (
ClosedToConstantLengthSplicedDataset, ClosedToConstantLengthSplicedDataset,
supervised_tokenize_pretrain, supervised_tokenize_pretrain,
) )
from datasets import dataset_dict, load_dataset from datasets import dataset_dict, load_dataset
from transformers.models.llama.tokenization_llama import LlamaTokenizer from transformers import AutoTokenizer
from colossalai.logging import get_dist_logger from colossalai.logging import get_dist_logger
...@@ -35,34 +35,23 @@ def main(): ...@@ -35,34 +35,23 @@ def main():
parser.add_argument( parser.add_argument(
"--tokenizer_dir", type=str, required=True, default=None, help="A directory containing the tokenizer" "--tokenizer_dir", type=str, required=True, default=None, help="A directory containing the tokenizer"
) )
parser.add_argument("--data_cache_dir", type=str, default="cache", help="Data cache directory") parser.add_argument("--data_output_dirs", type=str, default="data_output_dirs", help="Data output directory")
parser.add_argument( parser.add_argument("--max_length", type=int, default=8192, help="Max length of each spliced tokenized sequence")
"--data_jsonl_output_dir",
type=str,
default="jsonl_output",
help="Output directory of spliced dataset with jsonl format",
)
parser.add_argument(
"--data_arrow_output_dir",
type=str,
default="arrow_output",
help="Output directory of spliced dataset with arrow format",
)
parser.add_argument("--max_length", type=int, default=4096, help="Max length of each spliced tokenized sequence")
parser.add_argument("--num_spliced_dataset_bins", type=int, default=10, help="Number of spliced dataset bins") parser.add_argument("--num_spliced_dataset_bins", type=int, default=10, help="Number of spliced dataset bins")
args = parser.parse_args() args = parser.parse_args()
if args.num_spliced_dataset_bins >= 100000: if args.num_spliced_dataset_bins >= 100000:
raise ValueError("Too many spliced divisions, must be smaller than 100000") raise ValueError("Too many spliced divisions, must be smaller than 100000")
assert not os.path.exists(args.data_cache_dir), f"Find existed data cache dir {args.data_cache_dir}" args.data_cache_dir = os.path.join(args.data_output_dirs, "cache")
assert not os.path.exists( args.data_jsonl_output_dir = os.path.join(args.data_output_dirs, "jsonl")
args.data_jsonl_output_dir args.data_arrow_output_dir = os.path.join(args.data_output_dirs, "arrow")
), f"Find existed jsonl data output dir {args.data_jsonl_output_dir}"
assert not os.path.exists( if not os.path.exists(args.data_cache_dir):
args.data_arrow_output_dir os.makedirs(args.data_cache_dir)
), f"Find existed arrow data output dir {args.data_arrow_output_dir}" if not os.path.exists(args.data_jsonl_output_dir):
os.makedirs(args.data_jsonl_output_dir) os.makedirs(args.data_jsonl_output_dir)
if not os.path.exists(args.data_arrow_output_dir):
os.makedirs(args.data_arrow_output_dir) os.makedirs(args.data_arrow_output_dir)
# Prepare to all input datasets # Prepare to all input datasets
...@@ -86,7 +75,7 @@ def main(): ...@@ -86,7 +75,7 @@ def main():
train_splits.append(f"train[{start}%:{end}%]") train_splits.append(f"train[{start}%:{end}%]")
# Prepare to the tokenizer. # Prepare to the tokenizer.
tokenizer = LlamaTokenizer.from_pretrained(args.tokenizer_dir) tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
tokenizer.add_bos_token = False tokenizer.add_bos_token = False
tokenizer.add_eos_token = False tokenizer.add_eos_token = False
if tokenizer.pad_token is None: if tokenizer.pad_token is None:
......
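The `train[{start}%:{end}%]` strings built in the hunk above are percent slices of the training split. The loop that produces `start` and `end` is elided in this diff, so the sketch below is an assumption: it gives each bin a width of `ceil(100 / num_bins)` percent and clamps the last slice at 100. The helper name `make_train_splits` is hypothetical.

```python
import math
from typing import List


def make_train_splits(num_bins: int) -> List[str]:
    """Build percent-slice split strings that together cover the train split."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    interval = math.ceil(100 / num_bins)  # assumed bin width in percent
    splits = []
    for start in range(0, 100, interval):
        end = min(start + interval, 100)  # clamp the last slice at 100%
        splits.append(f"train[{start}%:{end}%]")
    return splits


print(make_train_splits(10))
print(make_train_splits(3))
```

Because the width is rounded up, a bin count that does not divide 100 evenly gives a shorter final slice, and a count above 100 gives at most 100 slices under this assumption.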