Merge branch 'main' of https://github.com/hpcaitech/ColossalAI

Commit 9e768b59 authored Oct 10, 2023 by zhuwenwen
20 changed files
--- a/applications/Chat/evaluate/README.md
+++ b/applications/Chat/evaluate/README.md
-# Evaluation
-
-This directory shows how to evaluate your model with GPT-4.
-
-## Evaluation Pipeline
-
-The evaluation process consists of three steps:
-1. Prepare the questions in the format described in the Data Format section below.
-2. Generate answers from different models: 
-    * Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py).
-    * Generate answers using your own models: [`generate_answers.py`](generate_answers.py).
-3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py).
-
-### Generate Answers
-#### Generate Answers Using GPT-3.5
-You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py).
-
-An example script is provided as follows:
-```shell
-python generate_gpt35_answers.py \
-    --dataset "path to the question dataset" \
-    --answer_path "path to answer folder" \
-    --num_workers 4 \
-    --openai_key "your openai key" \
-    --max_tokens 512
-```
-
-#### Generate Answers Using Your Own Model
-You can also generate answers using your own models. The generation process is divided into two stages:
-1. Generate answers using multiple GPUs (optional) with batch processing: [`generate_answers.py`](./generate_answers.py).
-2. Merge multiple shards and output a single file: [`merge.py`](./merge.py).
-
-An example script is given as follows:
-
-```shell
-device_number="number of your devices"
-model_name="name of your model"
-model_path="path to your model"
-dataset="path to the question dataset"
-answer_path="path to save the model answers"
-
-torchrun --standalone --nproc_per_node=$device_number generate_answers.py \
-    --model 'llama' \
-    --strategy ddp \
-    --model_path $model_path \
-    --model_name $model_name \
-    --dataset $dataset \
-    --batch_size 8 \
-    --max_datasets_size 80 \
-    --answer_path $answer_path \
-    --max_length 512
-
-python merge.py \
-    --model_name $model_name \
-    --shards $device_number \
-    --answer_path $answer_path
-
-for (( i=0; i<device_number; i++ )) do
-    rm -rf "${answer_path}/${model_name}_answers_rank${i}.json"
-done
-
-```
-
-### Evaluate Answers
-
-In [`evaluate.py`](./evaluate.py), GPT-4 reviews and scores the answers of two different models. Here `Model 1` refers to the first model you specify in `--answer_file_list` and `Model 2` refers to the second. The script reports several metrics and writes the corresponding JSON files.
-
-The metrics include:
-
-* `Invalid Count`: The number of reviews where the program fails to parse the score pair.
-* `Better Count`: The number of reviews where Model 2 receives a higher score.
-* `Worse Count`: The number of reviews where Model 2 receives a lower score.
-* `Tie Count`: The number of reviews where the two models tie.
-* `Win Rate of Model 2`: The fraction of successfully parsed reviews in which Model 2 receives a higher score.
-* `Model 1 Average Score`: Average score of Model 1 over successfully parsed reviews.
-* `Model 2 Average Score`: Average score of Model 2 over successfully parsed reviews.
-
-Besides the `review` file, which includes all reviews, and `results.json`, which stores the summary metrics, the output includes `invalid`, `better`, `worse` and `tie` JSON files that contain only the corresponding reviews.
-
-```shell
-python evaluate.py \
-    --answer_file_list "path to answers of model 1" "path to answers of model 2" \
-    --prompt_file "path to prompt file" \
-    --reviewer_file "path to reviewer file" \
-    --output_folder "path to output folder" \
-    --openai_key "your openai key" \
-    --model "the gpt model" \
-    --num_workers 8 \
-    --max_tokens 512
-
-```
-
-## Results
-
-We compare our model with Alpaca and Vicuna. The results are shown below. Note that the better cases do not add up to 80 because the program cannot parse a score pair from some reviews. Our Coati-7B model performs better than Alpaca-7B. The Coati-7B model we evaluate is an older version trained a few weeks ago; a new version is coming soon.
-
-|  Model Pair   | Alpaca-7B ⚔ Coati-7B | Coati-7B ⚔ Alpaca-7B |
-| :-----------: | :------------------: | :------------------: |
-| Better Cases  |     38 ⚔ **41**      |     **45** ⚔ 33      |
-|   Win Rate    |    48% ⚔ **52%**     |    **58%** ⚔ 42%     |
-| Average Score |   7.06 ⚔ **7.13**    |   **7.31** ⚔ 6.82    |
-
-Note that evaluating model answers with GPT-3.5 is not reliable: GPT-3.5 tends to give a higher score to the second answer (`{answer2}` in the prompt). In our GPT-4 evaluation we still swap the order of the two model answers. As the table shows, GPT-4 produces consistent results across both orders and is less biased than GPT-3.5.
-
-## Data Format
-
-### Questions
-The file [`questions.json`](./sample/questions.json) shows example questions used to evaluate the model. Each question record has the following fields:
-* `id` (int, compulsory): The ID of the instruction / question.
-* `instruction` (str, compulsory): The instruction / question for the LLM.
-* `input` (str, optional): The additional context of the instruction / question.
-* `output` (str, optional): The sample output of the instruction / question.
-* `category` (str, compulsory): The category of the instruction / question.
-
-Example:
-```
-{
-    "id": 0,
-    "instruction": "Help me summarize the following short story?",
-    "input": "{story}",
-    "output": "{summarized story}",
-    "category": "closed qa"
-}
-```
-
-### Answers
-
-We store model answers in `{model_name}_answers.json`. The JSON file contains one list. Each element in the list is an answer record to one question.
-
-An answer record has the following fields:
-
-* `category` (str, compulsory): The category of the instruction / question.
-* `instruction` (str, compulsory): The instruction / question for the LLM.
-* `input` (str, optional): The additional context of the instruction / question.
-* `output` (str, compulsory): The output from the LLM.
-* `id` (int, compulsory): The ID of the instruction / question.
-
-### Results
-
-We store evaluation results in `results.json`. The JSON file contains one dictionary. Each key is formatted as `{model 1}_vs_{model 2}` and each value is a dictionary that contains the metrics of that evaluation.
-
-The value has the following fields:
-
-* `model` (list, compulsory): The names of the two models.
-* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score.
-* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score.
-* `tie` (int, compulsory): The number of reviews where two models play to a tie.
-* `win_rate` (float, compulsory): Win rate of Model 2.
-* `score` (list, compulsory): Average score of the two models.
-
-### Better, Worse, Tie, Invalid, Review
-
-To help better compare the model answers, we store JSON files whose name ends with `_better`, `_worse`, `_tie`, `_invalid` or `_review`. Each JSON file contains one list. Each element in the list is a record of better, worse, tie, invalid or all cases.
-
-A record has the following fields:
-
-* `review_id` (str, optional): Random UUID, not in use.
-* `id` (int, compulsory): The ID of the instruction / question.
-* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer IDs use different prompts.
-* `metadata` (dict, optional): It is empty.
-* `review` (str, optional): GPT-4's review.
-* `score` (list, compulsory): The scores of two models.
-
-### Prompts
-
-The data format is the same as [`FastChat's`](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.
-
-### Reviewer
-
-The data format is the same as [`FastChat's`](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers.
-
-## Citations
-
-```bibtex
-@misc{vicuna2023,
-    title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
-    url = {https://vicuna.lmsys.org},
-    author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
-    month = {March},
-    year = {2023}
-}
-```
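
The metrics described in the removed README (better, worse, tie, invalid, win rate, average scores) can be sketched in a few lines. This is a minimal illustration, not the removed `evaluate.py` itself: the function name `summarize_scores` is hypothetical, score pairs are assumed to be `[model_1_score, model_2_score]` with `[-1, -1]` marking an unparsable review, and the guard against zero valid reviews is an addition.

```python
from typing import Dict, List, Sequence


def summarize_scores(score_pairs: Sequence[Sequence[float]]) -> Dict[str, object]:
    """Tally Model 2's better/worse/tie counts from [model_1, model_2] score pairs."""
    better = worse = tie = invalid = 0
    total_1 = total_2 = 0.0

    for score_1, score_2 in score_pairs:
        # A pair of -1 values marks a review whose scores could not be parsed.
        if score_1 == -1 and score_2 == -1:
            invalid += 1
            continue
        if score_2 > score_1:
            better += 1
        elif score_2 < score_1:
            worse += 1
        else:
            tie += 1
        total_1 += score_1
        total_2 += score_2

    valid = len(score_pairs) - invalid
    # Assumption: report zeros instead of dividing by zero when nothing parsed.
    win_rate = better / valid if valid else 0.0
    averages: List[float] = [total_1 / valid, total_2 / valid] if valid else [0.0, 0.0]

    return {
        "better": better,
        "worse": worse,
        "tie": tie,
        "invalid": invalid,
        "win_rate": win_rate,
        "score": averages,
    }


if __name__ == "__main__":
    print(summarize_scores([[7, 8], [9, 6], [5, 5], [-1, -1]]))
```

Invalid reviews are excluded from both the win rate and the averages, which matches the denominators used in the removed script.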
--- a/applications/Chat/evaluate/evaluate.py
+++ b/applications/Chat/evaluate/evaluate.py
-#    Adapted from https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/eval_gpt_review.py
-#    Copyright 2023 LM-SYS@FastChat
-
-#    Licensed under the Apache License, Version 2.0 (the "License");
-#    you may not use this file except in compliance with the License.
-#    You may obtain a copy of the License at
-
-#        http://www.apache.org/licenses/LICENSE-2.0
-
-#    Unless required by applicable law or agreed to in writing, software
-#    distributed under the License is distributed on an "AS IS" BASIS,
-#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#    See the License for the specific language governing permissions and
-#    limitations under the License.
-
-
-import argparse
-import json
-import os
-import time
-import re
-import concurrent.futures
-
-import openai
-import tqdm
-import shortuuid
-import logging
-
-from utils import jload, jdump, get_json_list
-
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-MAX_API_RETRY = 3
-
-
-def get_eval(sys_prompt, user_prompt: str, answer_id: int, max_tokens: int, model: str):
-    logging.basicConfig(level=logging.INFO)
-    for _ in range(MAX_API_RETRY):
-        try:
-            response = openai.ChatCompletion.create(
-                model=model,
-                messages=[{
-                    'role': 'system',
-                    'content': sys_prompt
-                }, {
-                    'role': 'user',
-                    'content': user_prompt,
-                }],
-                temperature=0.2,
-                max_tokens=max_tokens,
-            )
-            review = response['choices'][0]['message']['content']
-            return {"review": review, 'id': answer_id}
-        except Exception as e:
-            logger.error(e)
-            time.sleep(1)
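-    logger.error(f' Review {answer_id} failed after {MAX_API_RETRY} retries.')
-    # Return an empty review (not a bare string) so callers can still sort by 'id'; it is later counted as invalid.
-    return {"review": "", 'id': answer_id}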
-
-
-def parse_score(review):
-    try:
-        pattern = re.compile('([0-9]|10) out of 10')
-        sp = re.findall(pattern, review)
-        if len(re.findall(pattern, review)) == 2:
-            return [float(sp[0]), float(sp[1])]
-
-        # Try '10' before a single digit so that "a score of 10" is not parsed as 1.
-        pattern = re.compile('a score of (10|[0-9])')
-        sp = re.findall(pattern, review)
-        if len(re.findall(pattern, review)) == 2:
-            return [float(sp[0]), float(sp[1])]
-
-        pattern = re.compile('([0-9]|10)/10')
-        sp = re.findall(pattern, review)
-        if len(re.findall(pattern, review)) == 2:
-            return [float(sp[0]), float(sp[1])]
-
-        score_pair = review.split('\n')[0]
-        score_pair = score_pair.replace(',', ' ')
-        sp = score_pair.split(' ')
-        if len(sp) == 2:
-            return [float(sp[0]), float(sp[1])]
-        else:
-            raise Exception('Invalid score pair.')
-    except Exception as e:
-        return [-1, -1]
-
-
-def gen_prompt(reviewer_jsons, prompt_jsons, cat, ques, ans1, ans2):
-    reviewer_idx = 0
-    for idx, reviewer in enumerate(reviewer_jsons):
-        if reviewer['category'] == cat:
-            reviewer_idx = idx
-            break
-    prompt_id = reviewer_jsons[reviewer_idx]['prompt_id']
-    prompt_json = prompt_jsons[prompt_id-1]
-    assert prompt_json['prompt_id'] == prompt_id
-
-    sys_prompt = prompt_json['system_prompt']
-    prompt_template = prompt_json['prompt_template']
-    defaults = prompt_json['defaults']
-    prompt = prompt_template.format(
-        question=ques, answer_1=ans1, answer_2=ans2, **defaults)
-
-    return sys_prompt, prompt, reviewer_idx+1
-
-
-def evaluate(args):
-    answer1_jsons = jload(args.answer_file_list[0])
-    answer2_jsons = jload(args.answer_file_list[1])
-    reviewer_jsons = get_json_list(args.reviewer_file)
-    prompt_jsons = get_json_list(args.prompt_file)
-
-    assert len(answer1_jsons) == len(answer2_jsons)
-
-    handles = []
-    review_jsons = []
-
-    total_len = len(answer1_jsons)
-    question_idx_list = list(range(total_len))
-
-    logger.info(
-        f' Total number of answers: {len(answer2_jsons)}.')
-
-    reviews = []
-    with concurrent.futures.ThreadPoolExecutor(max_workers=args.num_workers) as executor:
-        futures = []
-        for i in question_idx_list:
-            assert answer1_jsons[i]['id'] == answer2_jsons[i]['id']
-            answer_id = answer1_jsons[i]['id']
-
-            ques = answer1_jsons[i]['instruction'] if answer1_jsons[i]['input'] == "" else answer1_jsons[i]['instruction'] + \
-                " " + answer1_jsons[i]['input']
-            cat = answer1_jsons[i]['category']
-            ans1 = answer1_jsons[i]['output']
-            ans2 = answer2_jsons[i]['output']
-
-            sys_prompt, prompt, reviewer_id = gen_prompt(
-                reviewer_jsons, prompt_jsons, cat, ques, ans1, ans2)
-
-            review_id = shortuuid.uuid()
-            review_jsons.append({
-                'review_id': review_id,
-                'id': answer_id,
-                'reviewer_id': reviewer_id,
-                'metadata': {}
-            })
-
-            future = executor.submit(
-                get_eval, sys_prompt, prompt, answer_id, args.max_tokens, args.model)
-            futures.append(future)
-
-        for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
-            reviews.append(future.result())
-
-    reviews.sort(key=lambda x: x['id'])
-    review_jsons.sort(key=lambda x: x['id'])
-
-    ans1_score = 0
-    ans2_score = 0
-    better_count = 0
-    worse_count = 0
-    tie_count = 0
-    invalid_count = 0
-
-    better_file = []
-    worse_file = []
-    tie_file = []
-    invalid_file = []
-    output_review_file = []
-
-    for idx, review in enumerate(reviews):
-        scores = parse_score(review['review'])
-        review_jsons[idx]['review'] = review['review']
-        review_jsons[idx]['score'] = scores
-
-        if scores[0] == -1 and scores[1] == -1:
-            invalid_count += 1
-            invalid_file.append(review_jsons[idx])
-            logger.info(f' Invalid score pair: {review_jsons[idx]["id"]}.')
-        else:
-            if scores[0] > scores[1]:
-                worse_count += 1
-                worse_file.append(review_jsons[idx])
-            elif scores[0] < scores[1]:
-                better_count += 1
-                better_file.append(review_jsons[idx])
-            else:
-                tie_count += 1
-                tie_file.append(review_jsons[idx])
-            ans1_score += scores[0]
-            ans2_score += scores[1]
-
-        output_review_file.append(review_jsons[idx])
-
-    better_file.sort(key=lambda x: x['id'])
-    worse_file.sort(key=lambda x: x['id'])
-    tie_file.sort(key=lambda x: x['id'])
-    invalid_file.sort(key=lambda x: x['id'])
-    output_review_file.sort(key=lambda x: x['id'])
-
-    name1 = os.path.basename(args.answer_file_list[0]).split("_answers")[0]
-    name2 = os.path.basename(args.answer_file_list[1]).split("_answers")[0]
-    prefix = f"{name1}_vs_{name2}"
-
-    jdump(better_file, os.path.join(
-        args.output_folder, prefix, f"{prefix}_better.json"))
-    jdump(worse_file, os.path.join(
-        args.output_folder, prefix, f"{prefix}_worse.json"))
-    jdump(tie_file, os.path.join(
-        args.output_folder, prefix, f"{prefix}_tie.json"))
-    jdump(invalid_file, os.path.join(
-        args.output_folder, prefix, f"{prefix}_invalid.json"))
-    jdump(output_review_file, os.path.join(
-        args.output_folder, prefix, f"{prefix}_review.json"))
-
-    if os.path.exists(os.path.join(args.output_folder, "results.json")):
-        results = jload(os.path.join(args.output_folder, "results.json"))
-    else:
-        results = {}
-    results[prefix] = {'model': [name1, name2], 'better': better_count, 'worse': worse_count, 'tie': tie_count, 'win_rate': better_count /
-                       (len(reviews)-invalid_count), 'score': [ans1_score/(len(reviews)-invalid_count), ans2_score/(len(reviews)-invalid_count)]}
-    jdump(results, os.path.join(args.output_folder, "results.json"))
-
-    logger.info(f' Total {invalid_count} invalid score pair(s).')
-    logger.info(f' Model {name2} has {better_count} better answer(s).')
-    logger.info(f' Model {name2} has {worse_count} worse answer(s).')
-    logger.info(f' {tie_count} answer(s) play(s) to a tie.')
-    logger.info(
-        f' Win rate of model {name2}: {better_count/(len(reviews)-invalid_count):.2f}')
-    logger.info(
-        f' Model {name1} average score: {ans1_score/(len(reviews)-invalid_count):.2f}')
-    logger.info(
-        f' Model {name2} average score: {ans2_score/(len(reviews)-invalid_count):.2f}')
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(
-        description='Model evaluation.')
-    parser.add_argument('--answer_file_list', nargs='+', default=[])
-    parser.add_argument('--prompt_file')
-    parser.add_argument('--reviewer_file')
-    parser.add_argument('--output_folder', type=str, default="./output")
-    parser.add_argument('--openai_key', type=str, default=None)
-    parser.add_argument('--model', type=str, default="gpt-4")
-    parser.add_argument('--num_workers', type=int, default=8)
-    parser.add_argument('--max_tokens', type=int, default=512,
-                        help='maximum number of tokens produced in the output')
-    args = parser.parse_args()
-
-    if args.openai_key is not None:
-        os.environ["OPENAI_API_KEY"] = args.openai_key
-    openai.api_key = os.getenv("OPENAI_API_KEY")
-
-    evaluate(args)
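
The score extraction performed by `parse_score` above can be exercised in isolation. The following is a minimal sketch under stated assumptions rather than the removed implementation: the name `parse_score_pair` is hypothetical, each pattern lists `10` before `[0-9]` so that "a score of 10" yields 10 rather than 1, and the fallback splits the first line on any whitespace (the removed code splits on single spaces).

```python
import re
from typing import List

# Patterns are tried in order; each must match exactly twice (one score per model).
_SCORE_PATTERNS = [
    re.compile(r"(10|[0-9]) out of 10"),
    re.compile(r"a score of (10|[0-9])"),
    re.compile(r"(10|[0-9])/10"),
]


def parse_score_pair(review: str) -> List[float]:
    """Return [score_1, score_2], or [-1.0, -1.0] if no score pair is found."""
    for pattern in _SCORE_PATTERNS:
        found = pattern.findall(review)
        if len(found) == 2:
            return [float(found[0]), float(found[1])]

    # Fallback: the first line holds two numbers separated by spaces or a comma.
    tokens = review.split("\n")[0].replace(",", " ").split()
    if len(tokens) == 2:
        try:
            return [float(tokens[0]), float(tokens[1])]
        except ValueError:
            pass
    return [-1.0, -1.0]


if __name__ == "__main__":
    print(parse_score_pair("Assistant 1 gets a score of 10 and Assistant 2 a score of 7."))
    print(parse_score_pair("8 9\nBoth answers are relevant."))
```

Returning a sentinel pair instead of raising keeps one malformed review from aborting the whole run; the caller can count it as invalid.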
--- a/applications/Chat/evaluate/evaluate.sh
+++ b/applications/Chat/evaluate/evaluate.sh
-python evaluate.py \
-    --answer_file_list "path to answers of model 1" "path to answers of model 2" \
-    --prompt_file "path to prompt file" \
-    --reviewer_file "path to reviewer file" \
-    --output_folder "path to output folder" \
-    --openai_key "your openai key" \
-    --model "gpt-4" \
-    --num_workers 8 \
-    --max_tokens 512
--- a/applications/Chat/evaluate/generate_answers.py
+++ b/applications/Chat/evaluate/generate_answers.py
-import argparse
-import os
-import random
-import copy
-import math
-from tqdm import tqdm
-
-import torch
-import torch.distributed as dist
-import transformers
-
-from coati.models.bloom import BLOOMActor
-from coati.models.gpt import GPTActor
-from coati.models.opt import OPTActor
-from coati.models.roberta import RoBERTaActor
-from coati.models.llama import LlamaActor
-from coati.trainer.strategies import ColossalAIStrategy, DDPStrategy, NaiveStrategy
-from transformers import AutoTokenizer, RobertaTokenizer
-from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
-
-from colossalai.logging import get_dist_logger
-
-from utils import jload, jdump, is_rank_0
-
-
-logger = get_dist_logger()
-
-PROMPT_DICT = {
-    "prompt_input":
-        ("Below is an instruction that describes a task, paired with an input that provides further context. "
-         "Write a response that appropriately completes the request.\n\n"
-         "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"),
-    "prompt_no_input": ("Below is an instruction that describes a task. "
-                        "Write a response that appropriately completes the request.\n\n"
-                        "### Instruction:\n{instruction}\n\n### Response:"),
-}
-
-
-def generate(args):
-    # torch.cuda.set_per_process_memory_fraction(0.4)
-    if args.strategy == 'naive':
-        strategy = NaiveStrategy()
-    elif args.strategy == 'ddp':
-        strategy = DDPStrategy()
-    elif args.strategy == 'colossalai_gemini':
-        strategy = ColossalAIStrategy(stage=3, placement_policy='cuda')
-    elif args.strategy == 'colossalai_zero2':
-        strategy = ColossalAIStrategy(stage=2, placement_policy='cuda')
-    elif args.strategy == 'colossalai_zero2_cpu':
-        strategy = ColossalAIStrategy(stage=2, placement_policy='cpu')
-    else:
-        raise ValueError(f'Unsupported strategy "{args.strategy}"')
-
-    world_size = dist.get_world_size()
-    rank = dist.get_rank()
-
-    with strategy.model_init_context():
-        if args.model == 'gpt2':
-            actor = GPTActor(pretrained=args.model_path).to(
-                torch.cuda.current_device())
-        elif args.model == 'bloom':
-            actor = BLOOMActor(pretrained=args.model_path).to(
-                torch.cuda.current_device())
-        elif args.model == 'opt':
-            actor = OPTActor(pretrained=args.model_path).to(
-                torch.cuda.current_device())
-        elif args.model == 'roberta':
-            actor = RoBERTaActor(pretrained=args.model_path).to(
-                torch.cuda.current_device())
-        elif args.model == 'llama':
-            actor = LlamaActor(pretrained=args.model_path).to(
-                torch.float16).to(torch.cuda.current_device())
-        else:
-            raise ValueError(f'Unsupported model "{args.model}"')
-
-    if args.model == 'gpt2':
-        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        tokenizer.pad_token = tokenizer.eos_token
-    elif args.model == 'bloom':
-        tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
-        tokenizer.pad_token = tokenizer.eos_token
-    elif args.model == 'opt':
-        tokenizer = AutoTokenizer.from_pretrained('facebook/opt-350m')
-    elif args.model == 'roberta':
-        tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
-    elif args.model == 'llama':
-        tokenizer = AutoTokenizer.from_pretrained(args.model_path,
-                                                  padding_side="right",
-                                                  use_fast=False,
-                                                  )
-        tokenizer.eos_token = '</s>'
-    else:
-        raise ValueError(f'Unsupported model "{args.model}"')
-    
-    questions = jload(args.dataset)
-    if args.max_datasets_size is not None:
-        questions = random.sample(questions, args.max_datasets_size)
-        if is_rank_0():
-            logger.info(
-                f"Limiting dataset to {args.max_datasets_size} examples.")
-    # Shard round-robin across ranks; previously nothing was loaded when --max_datasets_size was unset.
-    questions = questions[rank::world_size]
-
-    answers = copy.deepcopy(questions)
-
-    prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"]
-    sources = [
-        prompt_input.format_map(example) if example.get(
-            "input", "") != "" else prompt_no_input.format_map(example)
-        for example in questions
-    ]
-
-    if is_rank_0():
-        logger.info("Tokenizing inputs... This may take some time...")
-
-    input_ids_list = []
-
-    for string in sources:
-        input_ids = tokenizer.encode(string, return_tensors='pt').squeeze(0)
-        input_ids_list.append(input_ids)
-
-    bar = tqdm(range(math.ceil(len(input_ids_list)/args.batch_size)),
-               desc=f'steps', disable=not is_rank_0())
-
-    actor.eval()
-    with torch.no_grad():
-        for i in range(0, len(input_ids_list), args.batch_size):
-            batch = input_ids_list[i:i+args.batch_size]
-            batch = [i.flip(dims=[0]) for i in batch]
-            batch = torch.nn.utils.rnn.pad_sequence(batch,
-                                                    batch_first=True,
-                                                    padding_value=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0).to(torch.cuda.current_device())
-            batch = batch.flip(dims=[1])
-            attention_mask = batch.ne(tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0)
-
-            outputs = actor.model.generate(batch, attention_mask=attention_mask,
-                                           max_length=args.max_length,
-                                           do_sample=True,
-                                           top_k=50,
-                                           top_p=0.95,
-                                           num_return_sequences=1)
-
-            outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
-            for j in range(batch.size(0)):
-                answers[i +
-                        j]['output'] = outputs[j].split("### Response:")[1].strip()
-
-            bar.update()
-
-    jdump(answers, os.path.join(args.answer_path,
-          f'{args.model_name}_answers_rank{rank}.json'))
-
-    if is_rank_0():
-        logger.info(
-            f'Peak CUDA mem: {torch.cuda.max_memory_allocated()/1024**3:.3f} GB')
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--strategy',
-                        choices=['naive', 'ddp', 'colossalai_gemini',
-                                 'colossalai_zero2', 'colossalai_zero2_cpu'],
-                        default='naive')
-    parser.add_argument('--model', default='gpt2',
-                        choices=['gpt2', 'bloom', 'opt', 'roberta', 'llama'])
-    parser.add_argument('--model_path', type=str, default=None)
-    parser.add_argument('--model_name', type=str, default='model')
-    parser.add_argument('--dataset', type=str, default=None)
-    parser.add_argument('--batch_size', type=int, default=1)
-    parser.add_argument('--max_datasets_size', type=int, default=None)
-    parser.add_argument('--answer_path', type=str, default="answer")
-    parser.add_argument('--max_length', type=int, default=1024)
-    args = parser.parse_args()
-    generate(args)
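
The generation loop above left-pads each batch by flipping the sequences, right-padding them, and flipping back. The same idea can be sketched without tensors. This is an illustrative sketch only: `left_pad` is a hypothetical helper, the input is assumed to be a non-empty list of token-ID lists, and the attention mask is built from sequence lengths rather than by comparing against the pad ID as the removed script does.

```python
from typing import List, Tuple


def left_pad(sequences: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token-ID lists to a common width and build a matching attention mask."""
    width = max(len(seq) for seq in sequences)
    padded: List[List[int]] = []
    mask: List[List[int]] = []
    for seq in sequences:
        n_pad = width - len(seq)
        # Padding goes in front so every prompt ends at the same position.
        padded.append([pad_id] * n_pad + list(seq))
        # 0 marks padding, 1 marks real tokens.
        mask.append([0] * n_pad + [1] * len(seq))
    return padded, mask


if __name__ == "__main__":
    ids, attention = left_pad([[5, 6, 7], [8]], pad_id=0)
    print(ids)
    print(attention)
```

Left padding matters for decoder-only generation because new tokens are appended on the right of each prompt. Building the mask from lengths also avoids masking a real token that happens to share the pad ID.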
--- a/applications/Chat/evaluate/generate_answers.sh
+++ b/applications/Chat/evaluate/generate_answers.sh
-device_number="number of your devices"
-model_name="name of your model"
-model_path="path to your model"
-dataset="path to the question dataset"
-answer_path="path to save the model answers"
-
-torchrun --standalone --nproc_per_node=$device_number generate_answers.py \
-    --model 'llama' \
-    --strategy ddp \
-    --model_path $model_path \
-    --model_name $model_name \
-    --dataset $dataset \
-    --batch_size 8 \
-    --max_datasets_size 80 \
-    --answer_path $answer_path \
-    --max_length 512
-
-python merge.py \
-    --model_name $model_name \
-    --shards $device_number \
-    --answer_path $answer_path
-
-for (( i=0; i<device_number; i++ )) do
-    rm -rf "${answer_path}/${model_name}_answers_rank${i}.json"
-done
--- a/applications/Chat/evaluate/generate_gpt35_answers.py
+++ b/applications/Chat/evaluate/generate_gpt35_answers.py
-#    Adapted from https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/qa_baseline_gpt35.py
-#    Copyright 2023 LM-SYS@FastChat
-
-#    Licensed under the Apache License, Version 2.0 (the "License");
-#    you may not use this file except in compliance with the License.
-#    You may obtain a copy of the License at
-
-#        http://www.apache.org/licenses/LICENSE-2.0
-
-#    Unless required by applicable law or agreed to in writing, software
-#    distributed under the License is distributed on an "AS IS" BASIS,
-#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#    See the License for the specific language governing permissions and
-#    limitations under the License.
-
-
-import argparse
-import json
-import os
-import time
-import concurrent.futures
-
-import openai
-import tqdm
-import shortuuid
-import logging
-
-from utils import jload, jdump
-
-MODEL = 'gpt-3.5-turbo'
-MAX_API_RETRY = 3
-
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-def get_answer(question: dict, max_tokens: int):
-    answer = question
-    prompt = question['instruction'] if question['input'] == "" else question['instruction'] + \
-            " " + question['input']
-    for _ in range(MAX_API_RETRY):
-        try:
-            response = openai.ChatCompletion.create(
-                model='gpt-3.5-turbo',
-                messages=[{
-                    'role': 'system',
-                    'content': 'You are a helpful assistant.'
-                }, {
-                    'role': 'user',
-                    'content': prompt,
-                }],
-                max_tokens=max_tokens,
-            )
-            answer['output'] = response['choices'][0]['message']['content']
-            return answer
-        except Exception as e:
-            logger.error(e)
-            time.sleep(1)
-    logger.error(f' Answer {question["id"]} failed after {MAX_API_RETRY} retries.')
-    return answer
-
-def evaluate_gpt35(args):
-    questions=jload(args.dataset)
-    
-    logger.info(
-        f' Total number of questions: {len(questions)}.')
-    
-    answers = []
-    with concurrent.futures.ThreadPoolExecutor(max_workers=args.num_workers) as executor:
-        futures = []
-        for question in questions:
-            future = executor.submit(get_answer, question, args.max_tokens)
-            futures.append(future)
-
-        for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
-            answers.append(future.result())
-
-    answers.sort(key=lambda x: x['id'])
-
-    jdump(answers, os.path.join(args.answer_path,
-          f'gpt35_answers.json'))
-        
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(description='Evaluate GPT 3.5.')
-    parser.add_argument('--dataset', type=str, default="questions.json")
-    parser.add_argument('--answer_path', type=str, default="answer")
-    parser.add_argument('--num_workers', type=int, default=4)
-    parser.add_argument('--openai_key', type=str, default=None)
-    parser.add_argument('--max_tokens', type=int, default=1024)
-    
-    args = parser.parse_args()
-    
-    if args.openai_key is not None:
-        os.environ["OPENAI_API_KEY"] = args.openai_key
-    openai.api_key = os.getenv("OPENAI_API_KEY")
-        
-    evaluate_gpt35(args)
--- a/applications/Chat/evaluate/generate_gpt35_answers.sh
+++ b/applications/Chat/evaluate/generate_gpt35_answers.sh
-python generate_gpt35_answers.py \
-    --dataset "path to the question dataset" \
-    --answer_path "path to answer folder" \
-    --num_workers 4 \
-    --openai_key "your openai key" \
-    --max_tokens 512
--- a/applications/Chat/evaluate/merge.py
+++ b/applications/Chat/evaluate/merge.py
-import argparse
-import os
-
-from utils import jload, jdump
-
-
-def generate(args):
-    dataset = []
-    for i in range(args.shards):
-        shard = jload(os.path.join(args.answer_path,
-                      f'{args.model_name}_answers_rank{i}.json'))
-        dataset.extend(shard)
-
-    dataset.sort(key=lambda x: x['id'])
-    jdump(dataset, os.path.join(args.answer_path,
-                                f'{args.model_name}_answers.json'))
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--model_name', type=str, default='model')
-    parser.add_argument('--shards', type=int, default=4)
-    parser.add_argument('--answer_path', type=str, default="answer")
-    args = parser.parse_args()
-    generate(args)
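
The two-stage flow in the removed scripts, where each rank answers a round-robin slice of the questions and `merge.py` recombines the per-rank files, can be sketched as follows. This is a minimal in-memory illustration: the helper names `shard` and `merge_shards` are hypothetical, and records are assumed to be dictionaries with an integer `id`.

```python
from typing import Dict, List


def shard(records: List[Dict], rank: int, world_size: int) -> List[Dict]:
    """Return the round-robin slice of records assigned to one rank."""
    return records[rank::world_size]


def merge_shards(shards: List[List[Dict]]) -> List[Dict]:
    """Concatenate per-rank shards and restore the original order by id."""
    merged = [record for part in shards for record in part]
    merged.sort(key=lambda record: record["id"])
    return merged


if __name__ == "__main__":
    questions = [{"id": i} for i in range(5)]
    parts = [shard(questions, rank, 2) for rank in range(2)]
    print(parts)
    print(merge_shards(parts))
```

Sorting by `id` after merging is what makes the final answer file independent of how many devices produced it.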
--- a/applications/Chat/evaluate/sample/questions.json
+++ b/applications/Chat/evaluate/sample/questions.json
-[
-    {
-        "id": 0,
-        "instruction": "Help me summarize the following news?",
-        "input": "National Commercial Bank (NCB), Saudi Arabia's largest lender by assets, agreed to buy rival Samba Financial Group for $15 billion in the biggest banking takeover this year.NCB will pay 28.45 riyals ($7.58) for each Samba share, according to a statement on Sunday, valuing it at about 55.7 billion riyals. NCB will offer 0.739 new shares for each Samba share, at the lower end of the 0.736-0.787 ratio the banks set when they signed an initial framework agreement in June.The offer is a 3.5% premium to Samba's Oct. 8 closing price of 27.50 riyals and about 24% higher than the level the shares traded at before the talks were made public. Bloomberg News first reported the merger discussions.The new bank will have total assets of more than $220 billion, creating the Gulf region's third-largest lender. The entity's $46 billion market capitalization nearly matches that of Qatar National Bank QPSC, which is still the Middle East's biggest lender with about $268 billion of assets.",
-        "output": "NCB to pay 28.45 riyals for each Samba share. Deal will create Gulf region's third-largest lender",
-        "category": "closed qa"
-    }
-]
\ No newline at end of file
--- a/applications/Chat/examples/README.md
+++ b/applications/Chat/examples/README.md
@@ -6,6 +6,7 @@
  - [Table of Contents](#table-of-contents)
  - [Install requirements](#install-requirements)
  - [Supervised datasets collection](#supervised-datasets-collection)
+    - [Conversation dataset generation](#conversation-dataset-generation)
  - [Stage1 - Supervised instructs tuning](#stage1---supervised-instructs-tuning)
    - [Arg List](#arg-list)
  - [Stage2 - Training reward model](#stage2---training-reward-model)
@@ -24,12 +25,11 @@
    - [LLaMA](#llama)
  - [Add your own models](#add-your-own-models)
    - [Actor model](#actor-model)
-    - [LM model](#lm-model)
    - [Reward model](#reward-model)
    - [Critic model](#critic-model)

-
 ---
+
 ## Install requirements

 ```shell
@@ -38,27 +38,74 @@ pip install -r requirements.txt

 ## Supervised datasets collection

-We collected 104K bilingual dataset of Chinese and English, and you can find the datasets in this repo
-[InstructionWild](https://github.com/XueFuzhao/InstructionWild).
+We collected a 104K bilingual dataset of Chinese and English. You can find it in the
+[InstructionWild](https://github.com/XueFuzhao/InstructionWild) repo and in this [file](https://github.com/XueFuzhao/InstructionWild/blob/main/data/README.md).
+
+Here is how we collected the data:

-The following pic shows how we collected the data.
 <p align="center">
 <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/data-collect.png" width=500/>
 </p>

+### Conversation dataset generation
+
+In order to further improve the model's ability to handle multi-turn conversations, we need to include samples with multi-turn conversations in the dataset. However, the samples in InstructWild and Alpaca datasets currently consist of only single-turn conversations, and their dataset organization is not suitable for storing multi-turn conversations. Additionally, after converting the aforementioned datasets, we also need to include multi-turn conversation datasets like ShareGPT, and we should transform them into the training format supported by ColossalChat.
+
+A sample in a conversation dataset has the following fields:
+
+- `type` (str, optional): The type of the data sample.
+- `language` (str, optional): The language of the data sample.
+- `dataset` (str, optional): The dataset the data sample originates from.
+- `conversations` (list, compulsory): Conversation content of the data sample, as a list of turns with `from` and `value` keys.
+- `id` (int, optional): The ID of the data sample.
+
+A simple example:
+
+```json
+{
+  "type": "instruction",
+  "language": "English",
+  "dataset": "Alpaca",
+  "conversations": [
+    {
+      "from": "human",
+      "value": "Give three tips for staying healthy."
+    },
+    {
+      "from": "gpt",
+      "value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
+    }
+  ],
+  "id": 1
+}
+```
+
+> **NOTE:** Only the `conversations` key is compulsory for training; the other keys serve as metadata. The length of `conversations` varies.
+
+You can run `examples/generate_conversation_dataset.py` to generate a conversation dataset supported by ColossalChat, for example with the following command:
+
+```bash
+python generate_conversation_dataset.py \
+    --dataset "All" \
+    --save_path "/path/to/dataset"
+```
+
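A minimal sketch of how a single-turn instruction record could be mapped to the conversation format described above. This is not the logic of `generate_conversation_dataset.py` (which is not shown here); `to_conversation_sample` and `is_valid_sample` are hypothetical helpers, and joining `instruction` and `input` with a newline is an assumption.

```python
from typing import Any, Dict, Optional


def to_conversation_sample(
    record: Dict[str, str],
    dataset: Optional[str] = None,
    language: Optional[str] = None,
    sample_id: Optional[int] = None,
) -> Dict[str, Any]:
    """Wrap one instruction/input/output record as a single-turn conversation."""
    prompt = record["instruction"]
    if record.get("input"):
        # Assumption: a non-empty input is appended to the instruction on a new line.
        prompt = f"{prompt}\n{record['input']}"
    sample: Dict[str, Any] = {
        "type": "instruction",
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ],
    }
    # Metadata keys are optional, so only add the ones that were provided.
    if language is not None:
        sample["language"] = language
    if dataset is not None:
        sample["dataset"] = dataset
    if sample_id is not None:
        sample["id"] = sample_id
    return sample


def is_valid_sample(sample: Dict[str, Any]) -> bool:
    """Check the only compulsory key: a non-empty list of {from, value} turns."""
    turns = sample.get("conversations")
    if not isinstance(turns, list) or not turns:
        return False
    return all(isinstance(t, dict) and "from" in t and "value" in t for t in turns)
```

A multi-turn sample has the same shape, only with more alternating `human` and `gpt` entries in `conversations`.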
 ## Stage1 - Supervised instructs tuning

 Stage1 is supervised instructs fine-tuning, which uses the datasets mentioned earlier to fine-tune the model.
+[[Stage1 tutorial video]](https://www.youtube.com/watch?v=-qFBZFmOJfg)

 You can run the `examples/train_sft.sh` to start a supervised instructs fine-tuning.

 You can also use the following cmd to start a supervised instructs fine-tuning with your own settings.
-```
+
+```bash
 torchrun --standalone --nproc_per_node=4 train_sft.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
-    --log_interval 10 \
    --save_path  /path/to/Coati-7B \
    --dataset /path/to/data.json \
    --batch_size 4 \
@@ -68,27 +115,44 @@ torchrun --standalone --nproc_per_node=4 train_sft.py \
    --max_epochs 1 \
    --grad_checkpoint
 ```
+
+**Note**: the supervised dataset uses the following format:
+
+```json
+[
+    {
+        "instruction": "Provide a list of the top 10 most popular mobile games in Asia",
+        "input": "",
+        "output": "The top 10 most popular mobile games in Asia are:\n1) PUBG Mobile\n2) Pokemon Go\n3) Candy Crush Saga\n4) Free Fire\n5) Clash of Clans\n6) Mario Kart Tour\n7) Arena of Valor\n8) Fantasy Westward Journey\n9) Subway Surfers\n10) ARK Survival Evolved",
+        "id": 0
+    },
+    ...
+]
+```
+
 ### Arg List
- --strategy:          the strategy using for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --max_datasets_size: the max size of dataset, type=int, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --log_interval:      how many steps to log, type=int, default=100
- --grad_checkpoint:   enable gradient checkpointing, type=bool, default=False
+
+- `--strategy`: the strategy used for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
+- `--model`: model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
+- `--pretrain`: pretrain model, type=str, default=None
+- `--max_datasets_size`: the max size of dataset, type=int, default=None
+- `--save_path`: path to save the model, type=str, default='output'
+- `--need_optim_ckpt`: whether to save optim ckpt, type=bool, default=False
+- `--max_epochs`: max epochs for training, type=int, default=3
+- `--batch_size`: batch size while training, type=int, default=4
+- `--lora_rank`: low-rank adaptation matrices rank, type=int, default=0
+- `--grad_checkpoint`: enable gradient checkpointing, type=bool, default=False

 ## Stage2 - Training reward model

 We train a reward model in stage 2. Different outputs for the same prompt are ranked manually to obtain the corresponding scores, which then supervise the training of the reward model.
+[[Stage2 tutorial video]](https://www.youtube.com/watch?v=gMx2CApKhuo)

 You can run the `examples/train_rm.sh` to start a reward model training.

 You can also use the following cmd to start training a reward model.
-```
+
+```bash
 torchrun --standalone --nproc_per_node=4 train_reward_model.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
@@ -96,16 +160,19 @@ torchrun --standalone --nproc_per_node=4 train_reward_model.py \
    --loss_fn 'log_exp' \
    --save_path 'rmstatic.pt'
 ```
+
 ### Features and tricks in RM training
+
 - We support the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [rm-static](https://huggingface.co/datasets/Dahoas/rm-static) datasets.
- We support 2 kinds of loss_function named 'log_sig'(used by OpenAI) and 'log_exp'(used by Anthropic).
- We change the loss to valid_acc and pair_dist to monitor progress during training.
+- We support 2 loss functions: `log_sig` (used by OpenAI) and `log_exp` (used by Anthropic).
+- We report `valid_acc` and `pair_dist` instead of the loss to monitor progress during training.
 - We add a special token to the end of the sequence to get better results.
 - We use a cosine-reducing lr-scheduler for RM training.
 - We set value_head as 1 linear layer and initialize the weight of value_head using the N(0, 1/(d_model + 1)) distribution.
 - We train a Bloom-560m reward model for 1 epoch and find that its test accuracy reaches the performance reported in [Anthropic's paper](https://arxiv.org/abs/2204.05862).
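
The two loss names above can be illustrated on a single (chosen, rejected) reward pair. This is a sketch using the usual pairwise formulations, not the repository's implementation (which is not shown in this diff); the function names are hypothetical and the code is not hardened against overflow for very large reward gaps.

```python
import math


def log_sig_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected))."""
    diff = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-diff)))


def log_exp_loss(chosen_reward: float, rejected_reward: float) -> float:
    """log(1 + exp(r_rejected - r_chosen))."""
    return math.log(1.0 + math.exp(rejected_reward - chosen_reward))


if __name__ == "__main__":
    # The loss shrinks as the chosen answer is scored further above the rejected one.
    for chosen, rejected in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 3.0)]:
        print(chosen, rejected, log_sig_loss(chosen, rejected), log_exp_loss(chosen, rejected))
```

Written this way the two expressions are algebraically equal for a single pair, so any difference in practice comes from numerical handling and reduction over the batch.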

 ### Experiment result
+
 Model performance in [Anthropic's paper](https://arxiv.org/abs/2204.05862):

 <div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225263321-8d64c3a8-6877-4cc8-9b61-0e1c52d3d94f.png">
@@ -117,20 +184,20 @@ Model performance in [Anthropics paper](https://arxiv.org/abs/2204.05862):
 <div align=left>We also train the reward model based on LLaMA-7B, which reaches the ACC of 72.06% after 1 epoch, performing almost the same as Anthropic's best RM.

 ### Arg List
- --strategy:          the strategy using for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --model_path:        the path of rm model(if continue to train), type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --dataset:           dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
- --subset:            subset of the dataset, type=str, default=None
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --loss_func:         which kind of loss function, choices=['log_sig', 'log_exp']
- --max_len:           max sentence length for generation, type=int, default=512
- --test:              whether is only testing, if it's true, the dataset will be small
+
+- `--strategy`: the strategy used for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
+- `--model`: model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
+- `--pretrain`: pretrain model, type=str, default=None
+- `--model_path`: the path of rm model(if continue to train), type=str, default=None
+- `--save_path`: path to save the model, type=str, default='output'
+- `--need_optim_ckpt`: whether to save optim ckpt, type=bool, default=False
+- `--max_epochs`: max epochs for training, type=int, default=3
+- `--dataset`: dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
+- `--subset`: subset of the dataset, type=str, default=None
+- `--batch_size`: batch size while training, type=int, default=4
+- `--lora_rank`: low-rank adaptation matrices rank, type=int, default=0
+- `--loss_fn`: which kind of loss function, choices=['log_sig', 'log_exp']
+- `--max_len`: max sentence length for generation, type=int, default=512

 ## Stage3 - Training model using prompts with RL

@@ -141,53 +208,89 @@ Stage3 uses reinforcement learning algorithm, which is the most complex part of
 </p>

 You can run the `examples/train_prompts.sh` to start PPO training.
+
 You can also use the following cmd to start PPO training.
+[[Stage3 tutorial video]](https://www.youtube.com/watch?v=Z8wwSHxPL9g)

-```
+```bash
 torchrun --standalone --nproc_per_node=4 train_prompts.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --prompt_dataset /path/to/your/prompt_dataset \
    --pretrain_dataset /path/to/your/pretrain_dataset \
-         --rm_pretrain /your/pretrain/rm/defination \
+    --rm_pretrain /your/pretrain/rm/definition \
    --rm_path /your/rm/model/path
 ```

-Prompt dataset: the instruction dataset mentioned in the above figure which includes the instructions, e.g. you can use the [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/example_data_reformat.py) to reformat [seed_prompts_ch.jsonl](https://github.com/XueFuzhao/InstructionWild/blob/main/data/seed_prompts_ch.jsonl) or [seed_prompts_en.jsonl](https://github.com/XueFuzhao/InstructionWild/blob/main/data/seed_prompts_en.jsonl) in InstructionWild.  
+Prompt dataset: the instruction dataset mentioned in the figure above, which contains the instructions. For example, you can use this [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/generate_prompt_dataset.py), which samples `instinwild_en.json` or `instinwild_ch.json` in [InstructionWild](https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data), to generate the prompt dataset.
 Pretrain dataset: the pretrain dataset including the instruction and corresponding response, e.g. you can use the [InstructWild Data](https://github.com/XueFuzhao/InstructionWild/tree/main/data) in stage 1 supervised instructs tuning.

+**Note**: the required datasets use the following formats:
+
+- `pretrain dataset`
+
+  ```json
+  [
+      {
+          "instruction": "Provide a list of the top 10 most popular mobile games in Asia",
+          "input": "",
+          "output": "The top 10 most popular mobile games in Asia are:\n1) PUBG Mobile\n2) Pokemon Go\n3) Candy Crush Saga\n4) Free Fire\n5) Clash of Clans\n6) Mario Kart Tour\n7) Arena of Valor\n8) Fantasy Westward Journey\n9) Subway Surfers\n10) ARK Survival Evolved",
+          "id": 0
+      },
+      ...
+  ]
+  ```
+
+- `prompt dataset`
+
+  ```json
+  [
+      {
+          "instruction": "Edit this paragraph to make it more concise: \"Yesterday, I went to the store and bought some things. Then, I came home and put them away. After that, I went for a walk and met some friends.\"",
+          "id": 0
+      },
+      {
+          "instruction": "Write a descriptive paragraph about a memorable vacation you went on",
+          "id": 1
+      },
+      ...
+  ]
+  ```
+
 ### Arg List
- --strategy:          the strategy using for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type of actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --rm_model:          reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
- --rm_pretrain:       pretrain model for reward model, type=str, default=None
- --rm_path:           the path of rm model, type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --prompt_dataset:       path of the prompt dataset, type=str, default=None
- --pretrain_dataset:  path of the ptx dataset, type=str, default=None
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --num_episodes:      num of episodes for training, type=int, default=10
- --max_epochs:        max epochs for training in one episode, type=int, default=5
- --max_timesteps:     max episodes in one batch, type=int, default=10
- --update_timesteps:  timesteps to update, type=int, default=10
- --train_batch_size:  batch size while training, type=int, default=8
- --ptx_batch_size:    batch size to compute ptx loss, type=int, default=1
- --experience_batch_size: batch size to make experience, type=int, default=8
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --kl_coef:           kl_coef using for computing reward, type=float, default=0.1
- --ptx_coef:          ptx_coef using for computing policy loss, type=float, default=0.9
+
+- `--strategy`: the strategy used for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
+- `--model`: model type of actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
+- `--pretrain`: pretrain model, type=str, default=None
+- `--rm_model`: reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
+- `--rm_pretrain`: pretrain model for reward model, type=str, default=None
+- `--rm_path`: the path of rm model, type=str, default=None
+- `--save_path`: path to save the model, type=str, default='output'
+- `--prompt_dataset`: path of the prompt dataset, type=str, default=None
+- `--pretrain_dataset`: path of the ptx dataset, type=str, default=None
+- `--need_optim_ckpt`: whether to save optim ckpt, type=bool, default=False
+- `--num_episodes`: num of episodes for training, type=int, default=10
+- `--num_update_steps`: number of steps to update policy per episode, type=int
+- `--num_collect_steps`: number of steps to collect experience per episode, type=int
+- `--train_batch_size`: batch size while training, type=int, default=8
+- `--ptx_batch_size`: batch size to compute ptx loss, type=int, default=1
+- `--experience_batch_size`: batch size to make experience, type=int, default=8
+- `--lora_rank`: low-rank adaptation matrices rank, type=int, default=0
+- `--kl_coef`: kl_coef using for computing reward, type=float, default=0.1
+- `--ptx_coef`: ptx_coef using for computing policy loss, type=float, default=0.9

 ## Inference example - After Stage3
+
 We support different inference options, including int8 and int4 quantization.
 For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).

-
 ## Attention
+
 The examples are demos for the whole training process. You need to tune the hyper-parameters to reach good performance.

 #### data
+
 - [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
 - [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
@@ -197,6 +300,7 @@ The examples are demos for the whole training process.You need to change the hyp
 ## Support Model

 ### GPT
+
 - [x] GPT2-S (s)
 - [x] GPT2-M (m)
 - [x] GPT2-L (l)
@@ -205,6 +309,7 @@ The examples are demos for the whole training process.You need to change the hyp
 - [ ] GPT2-6B (6b)

 ### BLOOM
+
 - [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
 - [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
 - [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
@@ -212,6 +317,7 @@ The examples are demos for the whole training process.You need to change the hyp
 - [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)

 ### OPT
+
 - [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
 - [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
 - [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
@@ -221,6 +327,7 @@ The examples are demos for the whole training process.You need to change the hyp
 - [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)

 ### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
+
 - [x] LLaMA-7B
 - [x] LLaMA-13B
 - [ ] LLaMA-33B
@@ -237,12 +344,12 @@ if it is supported in huggingface [transformers](https://github.com/huggingface/
 r you can build your own model by yourself.

 ### Actor model
-```
+
+```python
 from ..base import Actor
 from transformers.models.coati import CoatiModel

 class CoatiActor(Actor):
-
    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
@@ -257,7 +364,8 @@ class CoatiActor(Actor):
 ```

 ### Reward model
-```
+
+```python
 from ..base import RewardModel
 from transformers.models.coati import CoatiModel

@@ -280,12 +388,11 @@ class CoatiRM(RewardModel):

 ### Critic model

-```
+```python
 from ..base import Critic
 from transformers.models.coati import CoatiModel

 class CoatiCritic(Critic):
-
    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,

--- a/applications/Chat/examples/community/README.md
+++ b/applications/Chat/examples/community/README.md
+:warning: **This content may be outdated since the major update of Colossal Chat. We will update this content soon.**
+
 # Community Examples
+
 ---
+
 We are thrilled to announce the latest updates to ColossalChat, an open-source solution for cloning ChatGPT with a complete RLHF (Reinforcement Learning with Human Feedback) pipeline.

 As Colossal-AI undergoes major updates, we are actively maintaining ColossalChat to stay aligned with the project's progress. With the introduction of community-driven examples, we aim to create a collaborative platform for developers to contribute exotic features built on top of ColossalChat.
@@ -15,10 +19,11 @@ For more information about community pipelines, please have a look at this [issu
 Community examples consist of both inference and training examples that have been added by the community. Please have a look at the following table to get an overview of all community examples. Click on the Code Example to get a copy-and-paste ready code example that you can try out. If a community example doesn't work as expected, please open an issue and ping the author on it.

 | Example              | Description                                            | Code Example                                                                                                    | Colab |                                            Author |
-|:---------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------:|
+| :------------------- | :----------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------- | :---- | ------------------------------------------------: |
 | Peft                 | Adding Peft support for SFT and Prompts model training | [Huggingface Peft](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/community/peft) | -     |                [YY Lin](https://github.com/yynil) |
-| Train prompts on Ray           | A Ray based implementation of Train prompts example                                                                                                                                                                                                                                                                                                                                                                                                                                   | [Huggingface Peft](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/community/ray)     | - |             [MisterLin1995](https://github.com/MisterLin1995) |
-|...|...|...|...|...|
+| Train prompts on Ray | A Ray based implementation of Train prompts example    | [Training On Ray](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/community/ray)   | -     | [MisterLin1995](https://github.com/MisterLin1995) |
+| ...                  | ...                                                    | ...                                                                                                             | ...   |                                               ... |

 ### How to get involved
+
 To join our community-driven initiative, please visit the [ColossalChat GitHub repository](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples), review the provided information, and explore the codebase. To contribute, create a new issue outlining your proposed feature or enhancement, and our team will review and provide feedback. We look forward to collaborating with you on this exciting project!
--- a/applications/Chat/examples/community/peft/README.md
+++ b/applications/Chat/examples/community/peft/README.md
+:warning: **This content may be outdated since the major update of Colossal Chat. We will update this content soon.**
+
 # Add Peft support for SFT and Prompts model training

 The original implementation just adopts loralib and merges the layers into the final model. Hugging Face peft is a better LoRA implementation and can be easily trained and distributed.
@@ -5,7 +7,9 @@ The original implementation just adopts the loralib and merges the layers into t
 Since the reward model is relatively small, I keep it as the original one. I suggest training the full model to get a proper reward/critic model.

 # Preliminary installation
+
 Since the current pypi peft package (0.2) has some bugs, please install the peft package from source.
+
 ```
 git clone https://github.com/huggingface/peft
 cd peft
@@ -13,12 +17,14 @@ pip install .
 ```

 # Usage
+
 For SFT training, just call train_peft_sft.py

-Its arguments are almost identical to train_sft.py instead adding a new eval_dataset if you have a eval_dataset file. The data file is just a plain datafile, please check the format in the easy_dataset.py.
+Its arguments are almost identical to those of train_sft.py, except for an additional eval_dataset argument that you can set if you have an eval_dataset file. The data file is just a plain data file; please check the format in easy_dataset.py.

 For stage-3 rlhf training, call train_peft_prompts.py.
-Its arguments are almost idential to train_prompts.py. The only difference is that I use text files to indicate the prompt and pretrained data file. The models are included in easy_models.py. Currently only bloom models are tested, but technically gpt2/opt/llama should be supported.
+Its arguments are almost identical to train_prompts.py. The only difference is that I use text files to indicate the prompt and pretrained data file. The models are included in easy_models.py. Currently only bloom models are tested, but technically gpt2/opt/llama should be supported.

 # Dataformat
+
 Please refer to the formats in test_sft.txt, test_prompts.txt, test_pretrained.txt.
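
As a rough illustration of the line format, the sketch below mirrors the splitting done in `EasySupervisedDataset` in `easy_dataset.py`: everything up to and including "回答：" becomes the source, and the remainder plus the EOS token becomes the target. `split_line` is a hypothetical helper and `"</s>"` is only a placeholder for `tokenizer.eos_token`.

```python
from typing import Tuple

SEPARATOR = "回答："  # marks where the answer starts in each line


def split_line(line: str, eos_token: str) -> Tuple[str, str]:
    """Split one raw line into (source, target) at the answer separator."""
    if SEPARATOR in line:
        end = line.index(SEPARATOR) + len(SEPARATOR)
        return line[:end], line[end:] + eos_token
    # No separator: the whole line is the source and the target is only EOS.
    return line, eos_token


if __name__ == "__main__":
    print(split_line("提问：你好 回答：你好！", "</s>"))
    print(split_line("a line without an answer", "</s>"))
```

As in the original code, lines are not stripped, so a trailing newline read from the file stays in the target.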
--- a/applications/Chat/examples/community/peft/easy_dataset.py
+++ b/applications/Chat/examples/community/peft/easy_dataset.py
@@ -3,7 +3,6 @@ import json
 from typing import Dict, Sequence

 import torch
-from datasets import load_dataset
 from torch.utils.data import Dataset
 from tqdm import tqdm
 from transformers import AutoTokenizer
@@ -20,7 +19,8 @@ def _tokenize_fn(strings: Sequence[str], tokenizer: AutoTokenizer, max_length: i
            padding="longest",
            max_length=max_length,
            truncation=True,
-        ) for text in strings
+        )
+        for text in strings
    ]
    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
@@ -48,18 +48,17 @@ def preprocess(sources: Sequence[str], targets: Sequence[str], tokenizer: AutoTo


 class EasySupervisedDataset(Dataset):
-
    def __init__(self, data_file: str, tokenizer: AutoTokenizer, max_length: int = 512) -> None:
        super(EasySupervisedDataset, self).__init__()
        with open(data_file, "r", encoding="UTF-8") as f:
            all_lines = f.readlines()
-        #split to source and target ,source the characters before "回答：" including "回答：", target the characters after "回答："
+        # split each line into source and target: source is the text up to and including "回答：", target is the text after "回答："
        sources, targets = [], []
        for line in all_lines:
            if "回答：" in line:
                sep_index = line.index("回答：")
-                sources.append(line[:sep_index + 3])
-                targets.append(line[sep_index + 3:] + tokenizer.eos_token)
+                sources.append(line[: sep_index + 3])
+                targets.append(line[sep_index + 3 :] + tokenizer.eos_token)
            else:
                sources.append(line)
                targets.append("" + tokenizer.eos_token)
@@ -83,15 +82,17 @@ class EasySupervisedDataset(Dataset):


 class EasyPromptsDataset(Dataset):
-
    def __init__(self, data_file: str, tokenizer: AutoTokenizer, max_length: int = 96) -> None:
        super(EasyPromptsDataset, self).__init__()
        with open(data_file, "r", encoding="UTF-8") as f:
            all_lines = f.readlines()
-            all_lines = [line if "回答：" not in line else line[:line.index("回答：") + 3] for line in all_lines]
+            all_lines = [line if "回答：" not in line else line[: line.index("回答：") + 3] for line in all_lines]
        self.prompts = [
-            tokenizer(line, return_tensors='pt', max_length=max_length, padding='max_length',
-                      truncation=True)['input_ids'].to(torch.cuda.current_device()).squeeze(0)
+            tokenizer(line, return_tensors="pt", max_length=max_length, padding="max_length", truncation=True)[
+                "input_ids"
+            ]
+            .to(torch.cuda.current_device())
+            .squeeze(0)
            for line in tqdm(all_lines)
        ]
        self.data_file = data_file
@@ -110,7 +111,6 @@ class EasyPromptsDataset(Dataset):


 class EasyRewardDataset(Dataset):
-
    def __init__(self, train_file: str, tokenizer: AutoTokenizer, special_token=None, max_length=512) -> None:
        super(EasyRewardDataset, self).__init__()
        self.chosen = []
@@ -120,44 +120,42 @@ class EasyRewardDataset(Dataset):
        else:
            self.end_token = special_token
        print(self.end_token)
-        #read all lines in the train_file to a list
+        # read all lines in the train_file to a list
        with open(train_file, "r", encoding="UTF-8") as f:
            all_lines = f.readlines()
        for line in tqdm(all_lines):
            data = json.loads(line)
-            prompt = "提问：" + data['prompt'] + " 回答："
+            prompt = "提问：" + data["prompt"] + " 回答："

-            chosen = prompt + data['chosen'] + self.end_token
-            chosen_token = tokenizer(chosen,
-                                     max_length=max_length,
-                                     padding="max_length",
-                                     truncation=True,
-                                     return_tensors="pt")
-            self.chosen.append({
-                "input_ids": chosen_token['input_ids'],
-                "attention_mask": chosen_token['attention_mask']
-            })
-
-            reject = prompt + data['rejected'] + self.end_token
-            reject_token = tokenizer(reject,
-                                     max_length=max_length,
-                                     padding="max_length",
-                                     truncation=True,
-                                     return_tensors="pt")
-            self.reject.append({
-                "input_ids": reject_token['input_ids'],
-                "attention_mask": reject_token['attention_mask']
-            })
+            chosen = prompt + data["chosen"] + self.end_token
+            chosen_token = tokenizer(
+                chosen, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
+            )
+            self.chosen.append(
+                {"input_ids": chosen_token["input_ids"], "attention_mask": chosen_token["attention_mask"]}
+            )
+
+            reject = prompt + data["rejected"] + self.end_token
+            reject_token = tokenizer(
+                reject, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
+            )
+            self.reject.append(
+                {"input_ids": reject_token["input_ids"], "attention_mask": reject_token["attention_mask"]}
+            )

    def __len__(self):
        length = len(self.chosen)
        return length

    def __getitem__(self, idx):
-        return self.chosen[idx]["input_ids"], self.chosen[idx]["attention_mask"], self.reject[idx][
-            "input_ids"], self.reject[idx]["attention_mask"]
+        return (
+            self.chosen[idx]["input_ids"],
+            self.chosen[idx]["attention_mask"],
+            self.reject[idx]["input_ids"],
+            self.reject[idx]["attention_mask"],
+        )

-    #python representation of the object and the string representation of the object
+    # python representation of the object and the string representation of the object
    def __repr__(self):
        return f"LawRewardDataset(chosen_len={len(self.chosen)}, reject_len={len(self.reject)})"

@@ -165,30 +163,29 @@ class EasyRewardDataset(Dataset):
        return f"LawRewardDataset(chosen_len={len(self.chosen)}, reject_len={len(self.reject)})"


-'''
+"""
 Easy SFT just accepts a text file that can be read line by line. However, the dataset groups texts together up to max_length so that the LLM learns the meaning of the texts better.
 If individual lines are not related, just set is_group_texts to False.
-'''
+"""


 class EasySFTDataset(Dataset):
-
    def __init__(self, data_file: str, tokenizer: AutoTokenizer, max_length=512, is_group_texts=True) -> None:
        super().__init__()
-        #read the data_file line by line
+        # read the data_file line by line
        with open(data_file, "r", encoding="UTF-8") as f:
-            #encode the text data line by line and put raw python list input_ids only to raw_input_ids list
+            # encode the text data line by line and put raw python list input_ids only to raw_input_ids list
            raw_input_ids = []
            for line in f:
                encoded_ids = tokenizer.encode(line)
-                #if the encoded_ids is longer than max_length, then split it into several parts
+                # if the encoded_ids is longer than max_length, then split it into several parts
                if len(encoded_ids) > max_length:
                    for i in range(0, len(encoded_ids), max_length):
-                        raw_input_ids.append(encoded_ids[i:i + max_length])
+                        raw_input_ids.append(encoded_ids[i : i + max_length])
                else:
                    raw_input_ids.append(encoded_ids)

-        grouped_inpup_ids = []
+        grouped_input_ids = []
        current_input_ids = []
        attention_mask = []
        if tokenizer.pad_token_id is None:
@@ -196,30 +193,33 @@ class EasySFTDataset(Dataset):
        if is_group_texts:
            for input_ids in raw_input_ids:
                if len(current_input_ids) + len(input_ids) > max_length:
-                    #pad the current_input_ids to max_length with tokenizer.pad_token_id
+                    # pad the current_input_ids to max_length with tokenizer.pad_token_id
                    padded_length = max_length - len(current_input_ids)
                    current_input_ids.extend([tokenizer.pad_token_id] * padded_length)
-                    grouped_inpup_ids.append(torch.tensor(current_input_ids, dtype=torch.long))
+                    grouped_input_ids.append(torch.tensor(current_input_ids, dtype=torch.long))
                    attention_mask.append(
-                        torch.tensor([1] * (max_length - padded_length) + [0] * padded_length, dtype=torch.long))
+                        torch.tensor([1] * (max_length - padded_length) + [0] * padded_length, dtype=torch.long)
+                    )
                    current_input_ids = []
                else:
                    current_input_ids.extend(input_ids)
            if len(current_input_ids) > 0:
                padded_length = max_length - len(current_input_ids)
                current_input_ids.extend([tokenizer.pad_token_id] * padded_length)
-                grouped_inpup_ids.append(torch.tensor(current_input_ids, dtype=torch.long))
+                grouped_input_ids.append(torch.tensor(current_input_ids, dtype=torch.long))
                attention_mask.append(
-                    torch.tensor([1] * (max_length - padded_length) + [0] * padded_length, dtype=torch.long))
+                    torch.tensor([1] * (max_length - padded_length) + [0] * padded_length, dtype=torch.long)
+                )
        else:
-            #just append the raw_input_ids to max_length
+            # just append the raw_input_ids to max_length
            for input_ids in raw_input_ids:
                padded_length = max_length - len(input_ids)
                input_ids.extend([tokenizer.pad_token_id] * padded_length)
                attention_mask.append(
-                    torch.tensor([1] * (max_length - padded_length) + [0] * padded_length, dtype=torch.long))
-                grouped_inpup_ids.append(torch.tensor(input_ids, dtype=torch.long))
-        self.input_ids = grouped_inpup_ids
+                    torch.tensor([1] * (max_length - padded_length) + [0] * padded_length, dtype=torch.long)
+                )
+                grouped_input_ids.append(torch.tensor(input_ids, dtype=torch.long))
+        self.input_ids = grouped_input_ids
        self.labels = copy.deepcopy(self.input_ids)
        self.file_name = data_file
        self.attention_mask = attention_mask
@@ -227,14 +227,14 @@ class EasySFTDataset(Dataset):
    def __len__(self):
        return len(self.input_ids)

-    #get item from dataset
+    # get item from dataset
    def __getitem__(self, idx):
        return dict(input_ids=self.input_ids[idx], labels=self.labels[idx], attention_mask=self.attention_mask[idx])

-    #generate the dataset description to be printed by print in python
+    # generate the dataset description to be printed by print in python
    def __repr__(self):
        return f"EasySFTDataset(len={len(self)},\nfile_name is {self.file_name})"

-    #generate the dataset description to be printed by print in python
+    # generate the dataset description to be printed by print in python
    def __str__(self):
        return f"EasySFTDataset(len={len(self)},\nfile_name is {self.file_name})"
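Review note on the `is_group_texts` branch of `EasySFTDataset` above: when appending `input_ids` would exceed `max_length`, the loop pads and stores the current row and then resets `current_input_ids` to an empty list, so the chunk that triggered the flush does not appear to be carried into the next row. The sketch below shows one way the packing could keep that chunk. It is an illustration, not the code in this commit: it uses plain Python lists instead of tensors, and `group_and_pad` is a hypothetical name.

```python
from typing import List, Tuple


def group_and_pad(
    raw_input_ids: List[List[int]], max_length: int, pad_token_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pack token-id chunks into rows of exactly max_length, padding each tail.

    Assumes every chunk is at most max_length long, which the splitting
    loop in EasySFTDataset already guarantees.
    """
    grouped: List[List[int]] = []
    masks: List[List[int]] = []
    current: List[int] = []

    def flush(row: List[int]) -> None:
        pad = max_length - len(row)
        grouped.append(row + [pad_token_id] * pad)
        masks.append([1] * len(row) + [0] * pad)  # 1 = real token, 0 = padding

    for ids in raw_input_ids:
        if current and len(current) + len(ids) > max_length:
            flush(current)
            current = list(ids)  # carry the overflowing chunk into the next row
        else:
            current = current + list(ids)
    if current:
        flush(current)
    return grouped, masks


rows, masks = group_and_pad([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_length=4, pad_token_id=0)
```

With this input every token from the three chunks ends up in some row, and the attention mask marks only the padded positions with 0.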
--- a/applications/Chat/examples/community/peft/easy_models.py
+++ b/applications/Chat/examples/community/peft/easy_models.py
@@ -4,7 +4,7 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F
 from coati.models.generation import generate
-from coati.models.utils import log_probs_from_logits, masked_mean
+from coati.models.utils import log_probs_from_logits
 from peft import PeftModel
 from torch.nn.modules import Module
 from transformers import BloomConfig, BloomForCausalLM
@@ -24,20 +24,17 @@ class Actor(Module):

    @torch.no_grad()
    def generate(
-        self,
-        input_ids: torch.Tensor,
-        return_action_mask: bool = True,
-        **kwargs
+        self, input_ids: torch.Tensor, return_action_mask: bool = True, **kwargs
    ) -> Union[Tuple[torch.LongTensor, torch.LongTensor], Tuple[torch.LongTensor, torch.LongTensor, torch.BoolTensor]]:
        sequences = generate(self.model, input_ids, **kwargs)
        attention_mask = None
-        pad_token_id = kwargs.get('pad_token_id', None)
+        pad_token_id = kwargs.get("pad_token_id", None)
        if pad_token_id is not None:
            attention_mask = sequences.not_equal(pad_token_id).to(dtype=torch.long, device=sequences.device)
        if not return_action_mask:
            return sequences, attention_mask, None
        input_len = input_ids.size(1)
-        eos_token_id = kwargs.get('eos_token_id', None)
+        eos_token_id = kwargs.get("eos_token_id", None)
        if eos_token_id is None:
            action_mask = torch.ones_like(sequences, dtype=torch.bool)
        else:
@@ -46,16 +43,14 @@ class Actor(Module):
            action_mask = F.pad(action_mask, (1 + input_len, -1), value=True)  # include eos token and input
        action_mask[:, :input_len] = False
        action_mask = action_mask[:, 1:]
-        return sequences, attention_mask, action_mask[:, -(sequences.size(1) - input_len):]
+        return sequences, attention_mask, action_mask[:, -(sequences.size(1) - input_len) :]

-    def forward(self,
-                sequences: torch.LongTensor,
-                num_actions: int,
-                attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
-        """Returns action log probs
-        """
+    def forward(
+        self, sequences: torch.LongTensor, num_actions: int, attention_mask: Optional[torch.Tensor] = None
+    ) -> torch.Tensor:
+        """Returns action log probs"""
        output = self.model(sequences, attention_mask=attention_mask)
-        logits = output['logits']
+        logits = output["logits"]
        log_probs = log_probs_from_logits(logits[:, :-1, :], sequences[:, 1:])
        return log_probs[:, -num_actions:]

@@ -75,11 +70,13 @@ class BLOOMActor(Actor):
        lora_train_bias (str): LoRA bias training mode.
    """

-    def __init__(self,
+    def __init__(
+        self,
        pretrained: str = None,
        config: Optional[BloomConfig] = None,
        checkpoint: bool = False,
-                 lora_path: str = None) -> None:
+        lora_path: str = None,
+    ) -> None:
        if pretrained is not None:
            model = BloomForCausalLM.from_pretrained(pretrained)
        elif config is not None:
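Review note on `Actor.forward` above: the call `log_probs_from_logits(logits[:, :-1, :], sequences[:, 1:])` relies on a one-position shift, because the logits at position t score the token at position t + 1. The sketch below reproduces that alignment for a single sequence in plain Python. It assumes `log_probs_from_logits` applies a log-softmax over the vocabulary and picks the entry of the actual next token, which this diff does not show; `log_softmax` and `action_log_probs` are hypothetical helper names.

```python
import math
from typing import List


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def action_log_probs(logits: List[List[float]], sequence: List[int], num_actions: int) -> List[float]:
    """Log-probability of each of the last num_actions tokens of sequence.

    logits[t] scores the token at position t + 1, so the last row is dropped
    and each remaining row is paired with the following token.
    """
    per_token = [log_softmax(row)[tok] for row, tok in zip(logits[:-1], sequence[1:])]
    return per_token[-num_actions:]


# Vocabulary of size 2; the first row gives token 1 a probability of 0.75.
example_logits = [[0.0, math.log(3.0)], [0.0, 0.0], [0.0, 0.0]]
example_sequence = [0, 1, 0]
last_two = action_log_probs(example_logits, example_sequence, num_actions=2)
```

The final slice mirrors `log_probs[:, -num_actions:]`, which keeps only the generated tokens and drops the prompt positions.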

--- a/applications/Chat/examples/community/peft/train_peft_prompts.py
+++ b/applications/Chat/examples/community/peft/train_peft_prompts.py
 import argparse

-import pandas as pd
 import torch
 import torch.distributed as dist
-from coati.dataset import DataCollatorForSupervisedDataset, PromptDataset, SupervisedDataset
+from coati.dataset import DataCollatorForSupervisedDataset
 from coati.models.bloom import BLOOMRM, BLOOMCritic
-from coati.models.gpt import GPTRM, GPTActor, GPTCritic
-from coati.models.llama import LlamaActor, LlamaCritic, LlamaRM
-from coati.models.opt import OPTRM, OPTActor, OPTCritic
+from coati.models.gpt import GPTRM, GPTCritic
+from coati.models.llama import LlamaCritic, LlamaRM
+from coati.models.opt import OPTRM, OPTCritic
 from coati.trainer import PPOTrainer
-from coati.trainer.strategies import ColossalAIStrategy, DDPStrategy, NaiveStrategy
-from coati.utils import prepare_llama_tokenizer_and_embedding
+from coati.trainer.strategies import DDPStrategy, GeminiStrategy, LowLevelZeroStrategy
 from easy_dataset import EasyPromptsDataset, EasySupervisedDataset
 from easy_models import BLOOMActor
-from peft import PeftModel
 from torch.optim import Adam
 from torch.utils.data import DataLoader
 from torch.utils.data.distributed import DistributedSampler
@@ -24,26 +21,24 @@ from colossalai.nn.optimizer import HybridAdam

 def main(args):
    # configure strategy
-    if args.strategy == 'naive':
-        strategy = NaiveStrategy()
-    elif args.strategy == 'ddp':
+    if args.strategy == "ddp":
        strategy = DDPStrategy()
-    elif args.strategy == 'colossalai_gemini':
-        strategy = ColossalAIStrategy(stage=3, placement_policy='cpu', initial_scale=2**5)
-    elif args.strategy == 'colossalai_zero2':
-        strategy = ColossalAIStrategy(stage=2, placement_policy='cpu')
+    elif args.strategy == "colossalai_gemini":
+        strategy = GeminiStrategy(placement_policy="static", offload_optim_frac=1.0, offload_param_frac=1.0, initial_scale=2**5)
+    elif args.strategy == "colossalai_zero2":
+        strategy = LowLevelZeroStrategy(stage=2, placement_policy="cpu")
    else:
        raise ValueError(f'Unsupported strategy "{args.strategy}"')

    if args.rm_path is not None:
-        state_dict = torch.load(args.rm_path, map_location='cpu')
+        state_dict = torch.load(args.rm_path, map_location="cpu")

    # configure model
-    if args.model == 'bloom':
+    if args.model == "bloom":
        # initial_model = BLOOMActor(pretrained=args.pretrain)
-        print('Using peft lora to load Bloom model as inital_model')
+        print("Using peft lora to load Bloom model as initial_model")
        initial_model = BLOOMActor(pretrained=args.pretrain, lora_path=args.sft_lora_path)
-        print('Using peft lora to load Bloom model as initial_model (Done)')
+        print("Using peft lora to load Bloom model as initial_model (Done)")
    else:
        raise ValueError(f'Unsupported actor model "{args.model}"')

@@ -52,59 +47,59 @@ def main(args):
    else:
        rm_model_name = args.rm_model

-    if rm_model_name == 'gpt2':
+    if rm_model_name == "gpt2":
        reward_model = GPTRM(pretrained=args.rm_pretrain)
-    elif rm_model_name == 'bloom':
+    elif rm_model_name == "bloom":
        print("load bloom reward model ", args.rm_pretrain)
        reward_model = BLOOMRM(pretrained=args.rm_pretrain)
-    elif rm_model_name == 'opt':
+    elif rm_model_name == "opt":
        reward_model = OPTRM(pretrained=args.rm_pretrain)
-    elif rm_model_name == 'llama':
+    elif rm_model_name == "llama":
        reward_model = LlamaRM(pretrained=args.rm_pretrain)
    else:
        raise ValueError(f'Unsupported reward model "{rm_model_name}"')

    if args.rm_path is not None:
-        print('Loading reward model from', args.rm_path)
+        print("Loading reward model from", args.rm_path)
        reward_model.load_state_dict(state_dict)

-    if args.strategy != 'colossalai_gemini':
+    if args.strategy != "colossalai_gemini":
        initial_model.to(torch.float16).to(torch.cuda.current_device())
        reward_model.to(torch.float16).to(torch.cuda.current_device())

    with strategy.model_init_context():
-        if args.model == 'bloom':
+        if args.model == "bloom":
            # actor = BLOOMActor(pretrained=args.pretrain, lora_rank=args.lora_rank)
-            print('Using peft lora to load Bloom model as Actor')
+            print("Using peft lora to load Bloom model as Actor")
            actor = BLOOMActor(pretrained=args.pretrain, lora_path=args.sft_lora_path)
-            print('Using peft lora to load Bloom model as Actor (Done)')
+            print("Using peft lora to load Bloom model as Actor (Done)")
        else:
            raise ValueError(f'Unsupported actor model "{args.model}"')

-        if rm_model_name == 'gpt2':
+        if rm_model_name == "gpt2":
            critic = GPTCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank, use_action_mask=True)
-        elif rm_model_name == 'bloom':
+        elif rm_model_name == "bloom":
            print("load bloom critic ", args.rm_pretrain, " lora_rank ", args.lora_rank, " use_action_mask ", True)
            critic = BLOOMCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank, use_action_mask=True)
            print("load bloom critic (Done) ")
-        elif rm_model_name == 'opt':
+        elif rm_model_name == "opt":
            critic = OPTCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank, use_action_mask=True)
-        elif rm_model_name == 'llama':
+        elif rm_model_name == "llama":
            critic = LlamaCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank, use_action_mask=True)
        else:
            raise ValueError(f'Unsupported reward model "{rm_model_name}"')

        if args.rm_path is not None:
-            print('Loading reward model from', args.rm_path)
+            print("Loading reward model from", args.rm_path)
            critic.load_state_dict(state_dict)
            del state_dict

-    if args.strategy != 'colossalai_gemini':
+    if args.strategy != "colossalai_gemini":
        critic.to(torch.float16).to(torch.cuda.current_device())
        actor.to(torch.float16).to(torch.cuda.current_device())

    # configure optimizer
-    if args.strategy.startswith('colossalai'):
+    if args.strategy.startswith("colossalai"):
        actor_optim = HybridAdam(actor.parameters(), lr=1e-7)
        critic_optim = HybridAdam(critic.parameters(), lr=1e-7)
    else:
@@ -112,23 +107,22 @@ def main(args):
        critic_optim = Adam(critic.parameters(), lr=1e-7)

    # configure tokenizer
-    if args.model == 'gpt2':
+    if args.model == "gpt2":
        tokenizer = GPT2Tokenizer.from_pretrained(args.rm_pretrain)
-    elif args.model == 'bloom':
+        tokenizer.pad_token = tokenizer.eos_token
+    elif args.model == "bloom":
        tokenizer = BloomTokenizerFast.from_pretrained(args.rm_pretrain)
-    elif args.model == 'opt':
+        tokenizer.pad_token = tokenizer.eos_token
+    elif args.model == "opt":
        tokenizer = AutoTokenizer.from_pretrained(args.rm_pretrain)
-    elif args.model == 'llama':
+        tokenizer.pad_token = tokenizer.eos_token
+    elif args.model == "llama":
        tokenizer = LlamaTokenizer.from_pretrained(args.pretrain)
-        tokenizer.eos_token = '<\s>'
+        tokenizer.eos_token = "<\s>"
+        tokenizer.pad_token = tokenizer.unk_token
    else:
        raise ValueError(f'Unsupported model "{args.model}"')

-    if args.model == 'llama':
-        tokenizer = prepare_llama_tokenizer_and_embedding(tokenizer, actor)
-    else:
-        tokenizer.pad_token = tokenizer.eos_token
-
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)

    prompt_dataset = EasyPromptsDataset(args.prompt_path, tokenizer)
@@ -136,26 +130,27 @@ def main(args):
        prompt_sampler = DistributedSampler(prompt_dataset, shuffle=True, seed=42, drop_last=True)
    else:
        prompt_sampler = None
-    prompt_dataloader = DataLoader(prompt_dataset,
-                                   shuffle=(prompt_sampler is None),
-                                   sampler=prompt_sampler,
-                                   batch_size=args.train_batch_size)
+    prompt_dataloader = DataLoader(
+        prompt_dataset, shuffle=(prompt_sampler is None), sampler=prompt_sampler, batch_size=args.train_batch_size
+    )

    pretrain_dataset = EasySupervisedDataset(args.pretrain_dataset, tokenizer)
    if dist.is_initialized() and dist.get_world_size() > 1:
        pretrain_sampler = DistributedSampler(pretrain_dataset, shuffle=True, seed=42, drop_last=True)
    else:
        pretrain_sampler = None
-    pretrain_dataloader = DataLoader(pretrain_dataset,
+    pretrain_dataloader = DataLoader(
+        pretrain_dataset,
        shuffle=(pretrain_sampler is None),
        sampler=pretrain_sampler,
        batch_size=args.ptx_batch_size,
-                                     collate_fn=data_collator)
+        collate_fn=data_collator,
+    )

    def tokenize_fn(texts):
        # MUST padding to max length to ensure inputs of all ranks have the same length
        # Different length may lead to hang when using gemini, as different generation steps
-        batch = tokenizer(texts, return_tensors='pt', max_length=96, padding='max_length', truncation=True)
+        batch = tokenizer(texts, return_tensors="pt", max_length=96, padding="max_length", truncation=True)
        return {k: v.to(torch.cuda.current_device()) for k, v in batch.items()}

    (actor, actor_optim), (critic, critic_optim) = strategy.prepare((actor, actor_optim), (critic, critic_optim))
@@ -171,7 +166,6 @@ def main(args):
        critic_optim,
        kl_coef=args.kl_coef,
        ptx_coef=args.ptx_coef,
-        max_epochs=args.max_epochs,
        train_batch_size=args.train_batch_size,
        experience_batch_size=args.experience_batch_size,
        tokenizer=tokenize_fn,
@@ -183,46 +177,46 @@ def main(args):
        eos_token_id=tokenizer.eos_token_id,
    )

-    trainer.fit(prompt_dataloader=prompt_dataloader,
+    trainer.fit(
+        prompt_dataloader=prompt_dataloader,
        pretrain_dataloader=pretrain_dataloader,
        num_episodes=args.num_episodes,
-                max_timesteps=args.max_timesteps,
-                update_timesteps=args.update_timesteps)
+        num_update_steps=args.num_update_steps,
+        num_collect_steps=args.num_collect_steps,
+    )

    # save model checkpoint after fitting
    trainer.save_model(args.save_path, only_rank0=True, tokenizer=tokenizer)
    # save optimizer checkpoint on all ranks
    if args.need_optim_ckpt:
-        strategy.save_optimizer(actor_optim,
-                                'actor_optim_checkpoint_prompts_%d.pt' % (torch.cuda.current_device()),
-                                only_rank0=False)
+        strategy.save_optimizer(
+            actor_optim, "actor_optim_checkpoint_prompts_%d.pt" % (torch.cuda.current_device()), only_rank0=False
+        )


-if __name__ == '__main__':
+if __name__ == "__main__":
    parser = argparse.ArgumentParser()
-    parser.add_argument('--prompt_path', type=str, default=None, help='path to the prompt dataset')
-    parser.add_argument('--pretrain_dataset', type=str, default=None, help='path to the pretrained dataset')
-    parser.add_argument('--strategy',
-                        choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'],
-                        default='naive',
-                        help='strategy to use')
-    parser.add_argument('--model', default='gpt2', choices=['gpt2', 'bloom', 'opt', 'llama'])
-    parser.add_argument('--pretrain', type=str, default=None)
-    parser.add_argument('--sft_lora_path', type=str, default=None)
-    parser.add_argument('--rm_model', default=None, choices=['gpt2', 'bloom', 'opt', 'llama'])
-    parser.add_argument('--rm_path', type=str, default=None)
-    parser.add_argument('--rm_pretrain', type=str, default=None)
-    parser.add_argument('--save_path', type=str, default='actor_checkpoint_prompts')
-    parser.add_argument('--need_optim_ckpt', type=bool, default=False)
-    parser.add_argument('--num_episodes', type=int, default=10)
-    parser.add_argument('--max_timesteps', type=int, default=10)
-    parser.add_argument('--update_timesteps', type=int, default=10)
-    parser.add_argument('--max_epochs', type=int, default=5)
-    parser.add_argument('--train_batch_size', type=int, default=2)
-    parser.add_argument('--ptx_batch_size', type=int, default=1)
-    parser.add_argument('--experience_batch_size', type=int, default=8)
-    parser.add_argument('--lora_rank', type=int, default=0, help="low-rank adaptation matrices rank")
-    parser.add_argument('--kl_coef', type=float, default=0.1)
-    parser.add_argument('--ptx_coef', type=float, default=0.9)
+    parser.add_argument("--prompt_path", type=str, default=None, help="path to the prompt dataset")
+    parser.add_argument("--pretrain_dataset", type=str, default=None, help="path to the pretrained dataset")
+    parser.add_argument(
+        "--strategy", choices=["ddp", "colossalai_gemini", "colossalai_zero2"], default="ddp", help="strategy to use"
+    )
+    parser.add_argument("--model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
+    parser.add_argument("--pretrain", type=str, default=None)
+    parser.add_argument("--sft_lora_path", type=str, default=None)
+    parser.add_argument("--rm_model", default=None, choices=["gpt2", "bloom", "opt", "llama"])
+    parser.add_argument("--rm_path", type=str, default=None)
+    parser.add_argument("--rm_pretrain", type=str, default=None)
+    parser.add_argument("--save_path", type=str, default="actor_checkpoint_prompts")
+    parser.add_argument("--need_optim_ckpt", type=bool, default=False)
+    parser.add_argument("--num_episodes", type=int, default=10)
+    parser.add_argument("--num_collect_steps", type=int, default=10)
+    parser.add_argument("--num_update_steps", type=int, default=5)
+    parser.add_argument("--train_batch_size", type=int, default=2)
+    parser.add_argument("--ptx_batch_size", type=int, default=1)
+    parser.add_argument("--experience_batch_size", type=int, default=8)
+    parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
+    parser.add_argument("--kl_coef", type=float, default=0.1)
+    parser.add_argument("--ptx_coef", type=float, default=0.9)
    args = parser.parse_args()
    main(args)
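Review note on two details this formatting pass carries over unchanged in `train_peft_prompts.py`. First, `--need_optim_ckpt` is declared with `type=bool`, and argparse then calls `bool()` on the raw string, so any non-empty value (including `False`) parses as `True`. Second, the string `"<\s>"` assigned to `tokenizer.eos_token` is four characters containing a backslash, which differs from `</s>` if that was the intended LLaMA end-of-sequence token. The sketch below checks both in isolation; the `store_true` variant is an assumption about the intended behavior, not part of this commit.

```python
import argparse

# Part 1: type=bool converts any non-empty string to True.
parser = argparse.ArgumentParser()
parser.add_argument("--need_optim_ckpt", type=bool, default=False)
parsed_with_bool = parser.parse_args(["--need_optim_ckpt", "False"])

# A flag-style alternative: absent means False, present means True.
flag_parser = argparse.ArgumentParser()
flag_parser.add_argument("--need_optim_ckpt", action="store_true")
flag_absent = flag_parser.parse_args([])
flag_present = flag_parser.parse_args(["--need_optim_ckpt"])

# Part 2: "\s" is not a recognized escape, so the backslash stays in the string.
# A raw string is used here to spell the same value without a warning.
literal = r"<\s>"
slash_form = "</s>"
same_token = literal == slash_form
```

Switching to `action="store_true"` would change the command line from `--need_optim_ckpt True` to a bare `--need_optim_ckpt`, so any launch scripts would need the same update.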
--- a/applications/Chat/examples/community/peft/train_peft_sft.py
+++ b/applications/Chat/examples/community/peft/train_peft_sft.py
 import argparse
 import os

-import loralib as lora
 import torch
 import torch.distributed as dist
-from coati.dataset import DataCollatorForSupervisedDataset, SFTDataset, SupervisedDataset
-from coati.models.base import RewardModel
-from coati.models.bloom import BLOOMLM
-from coati.models.gpt import GPTLM
-from coati.models.llama import LlamaLM
-from coati.models.opt import OPTLM
 from coati.trainer import SFTTrainer
-from coati.trainer.strategies import ColossalAIStrategy, DDPStrategy, NaiveStrategy
-from coati.utils import prepare_llama_tokenizer_and_embedding
-from datasets import load_dataset
+from coati.trainer.strategies import DDPStrategy, GeminiStrategy, LowLevelZeroStrategy
 from easy_dataset import EasyDataset
 from peft import LoraConfig, PeftModel, TaskType, get_peft_model
 from torch.optim import Adam
@@ -30,80 +21,76 @@ from colossalai.tensor import ColoParameter

 def train(args):
    # configure strategy
-    if args.strategy == 'naive':
-        strategy = NaiveStrategy()
-    elif args.strategy == 'ddp':
+    if args.strategy == "ddp":
        strategy = DDPStrategy()
-    elif args.strategy == 'colossalai_gemini':
-        strategy = ColossalAIStrategy(stage=3, placement_policy='cuda')
-    elif args.strategy == 'colossalai_zero2':
-        strategy = ColossalAIStrategy(stage=2, placement_policy='cuda')
+    elif args.strategy == "colossalai_gemini":
+        strategy = GeminiStrategy(placement_policy="static")
+    elif args.strategy == "colossalai_zero2":
+        strategy = LowLevelZeroStrategy(stage=2, placement_policy="cuda")
    else:
        raise ValueError(f'Unsupported strategy "{args.strategy}"')

    # configure model
    with strategy.model_init_context():
-        print('Warning: currently only bloom is tested, gpt2,llama and opt are not tested')
+        print("Warning: currently only bloom is tested, gpt2,llama and opt are not tested")
        model = AutoModelForCausalLM.from_pretrained(args.pretrain).to(torch.cuda.current_device())
-        #if the args.save_path exists and args.save_path+'/adapter_config.json' exists, we'll load the adapter_config.json
-        if os.path.exists(args.save_path) and os.path.exists(args.save_path+'/adapter_config.json') \
-            and os.path.exists(args.save_path+'/adapter_model.bin'):
+        # if the args.save_path exists and args.save_path+'/adapter_config.json' exists, we'll load the adapter_config.json
+        if (
+            os.path.exists(args.save_path)
+            and os.path.exists(args.save_path + "/adapter_config.json")
+            and os.path.exists(args.save_path + "/adapter_model.bin")
+        ):
            print("loading from saved peft model ", args.save_path)
            model = PeftModel.from_pretrained(model, args.save_path)
        else:
-            #we'll use peft lora library to do the lora
+            # we'll use peft lora library to do the lora
            lora_rank = args.lora_rank if args.lora_rank > 0 else 32
-            #config lora with rank of lora_rank
-            lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM,
-                                     inference_mode=False,
-                                     r=lora_rank,
-                                     lora_alpha=32,
-                                     lora_dropout=0.1)
+            # config lora with rank of lora_rank
+            lora_config = LoraConfig(
+                task_type=TaskType.CAUSAL_LM, inference_mode=False, r=lora_rank, lora_alpha=32, lora_dropout=0.1
+            )
            model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()

    # configure tokenizer
-    if args.model == 'gpt2':
-        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+    if args.model == "gpt2":
+        tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        tokenizer.pad_token = tokenizer.eos_token
-    elif args.model == 'bloom':
-        tokenizer = BloomTokenizerFast.from_pretrained(args.pretrain)
+    elif args.model == "bloom":
+        tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
        tokenizer.pad_token = tokenizer.eos_token
-    elif args.model == 'opt':
+    elif args.model == "opt":
        tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
-    elif args.model == 'llama':
+        tokenizer.pad_token = tokenizer.eos_token
+    elif args.model == "llama":
        tokenizer = AutoTokenizer.from_pretrained(
            args.pretrain,
            padding_side="right",
            use_fast=False,
        )
-        tokenizer.eos_token = '<\s>'
+        tokenizer.eos_token = "<\s>"
+        tokenizer.pad_token = tokenizer.unk_token
    else:
        raise ValueError(f'Unsupported model "{args.model}"')
-    tokenizer.pad_token = tokenizer.eos_token
-    if args.model == 'llama':
-        tokenizer = prepare_llama_tokenizer_and_embedding(tokenizer, model)

-        if args.strategy == 'colossalai_gemini':
+    if args.model == "llama" and args.strategy == "colossalai_gemini":
        # this is a hack to deal with the resized embedding
-            # to make sure all parameters are ColoParameter for Colossal-AI Gemini Compatiblity
+        # to make sure all parameters are ColoParameter for Colossal-AI Gemini Compatibility
        for name, param in model.named_parameters():
            if not isinstance(param, ColoParameter):
-                    sub_module_name = '.'.join(name.split('.')[:-1])
-                    weight_name = name.split('.')[-1]
+                sub_module_name = ".".join(name.split(".")[:-1])
+                weight_name = name.split(".")[-1]
                sub_module = model.get_submodule(sub_module_name)
                setattr(sub_module, weight_name, ColoParameter(param))
-    else:
-        tokenizer.pad_token = tokenizer.eos_token

    # configure optimizer
-    if args.strategy.startswith('colossalai'):
+    if args.strategy.startswith("colossalai"):
        optim = HybridAdam(model.parameters(), lr=args.lr, clipping_norm=1.0)
    else:
        optim = Adam(model.parameters(), lr=args.lr)

    logger = get_dist_logger()
-    logger.set_level('WARNING')
+    logger.set_level("WARNING")

    # configure dataset
    law_dataset = EasyDataset(args.dataset, tokenizer=tokenizer, is_group_texts=not args.is_short_text)
@@ -114,47 +101,57 @@ def train(args):
        eval_dataset = EasyDataset(args.eval_dataset, tokenizer=tokenizer, is_group_texts=not args.is_short_text)
    data_collator = default_collate
    if dist.is_initialized() and dist.get_world_size() > 1:
-        train_sampler = DistributedSampler(train_dataset,
+        train_sampler = DistributedSampler(
+            train_dataset,
            shuffle=True,
            seed=42,
            drop_last=True,
            rank=dist.get_rank(),
-                                           num_replicas=dist.get_world_size())
+            num_replicas=dist.get_world_size(),
+        )
        if eval_dataset is not None:
-            eval_sampler = DistributedSampler(eval_dataset,
+            eval_sampler = DistributedSampler(
+                eval_dataset,
                shuffle=False,
                seed=42,
                drop_last=False,
                rank=dist.get_rank(),
-                                              num_replicas=dist.get_world_size())
+                num_replicas=dist.get_world_size(),
+            )
    else:
        train_sampler = None
        eval_sampler = None

-    train_dataloader = DataLoader(train_dataset,
+    train_dataloader = DataLoader(
+        train_dataset,
        shuffle=(train_sampler is None),
        sampler=train_sampler,
        batch_size=args.batch_size,
        collate_fn=data_collator,
-                                  pin_memory=True)
+        pin_memory=True,
+    )
    if eval_dataset is not None:
-        eval_dataloader = DataLoader(eval_dataset,
+        eval_dataloader = DataLoader(
+            eval_dataset,
            shuffle=(eval_sampler is None),
            sampler=eval_sampler,
            batch_size=args.batch_size,
            collate_fn=data_collator,
-                                     pin_memory=True)
+            pin_memory=True,
+        )
    else:
        eval_dataloader = None

-    trainer = SFTTrainer(model=model,
+    trainer = SFTTrainer(
+        model=model,
        strategy=strategy,
        optim=optim,
        train_dataloader=train_dataloader,
        eval_dataloader=eval_dataloader,
        batch_size=args.batch_size,
        max_epochs=args.max_epochs,
-                         accumulation_steps=args.accumulation_steps)
+        accumulation_steps=args.accumulation_steps,
+    )

    trainer.fit(logger=logger, log_interval=args.log_interval)

@@ -162,29 +159,27 @@ def train(args):
    trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
    # save optimizer checkpoint on all ranks
    if args.need_optim_ckpt:
-        strategy.save_optimizer(trainer.optimizer,
-                                'rm_optim_checkpoint_%d.pt' % (torch.cuda.current_device()),
-                                only_rank0=False)
+        strategy.save_optimizer(
+            trainer.optimizer, "rm_optim_checkpoint_%d.pt" % (torch.cuda.current_device()), only_rank0=False
+        )


-if __name__ == '__main__':
+if __name__ == "__main__":
    parser = argparse.ArgumentParser()
-    parser.add_argument('--strategy',
-                        choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'],
-                        default='naive')
-    parser.add_argument('--model', choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom')
-    parser.add_argument('--pretrain', type=str, default=None)
-    parser.add_argument('--dataset', type=str, default=None)
-    parser.add_argument('--eval_dataset', type=str, default=None)
-    parser.add_argument('--save_path', type=str, default='output')
-    parser.add_argument('--need_optim_ckpt', type=bool, default=False)
-    parser.add_argument('--max_epochs', type=int, default=3)
-    parser.add_argument('--batch_size', type=int, default=4)
-    parser.add_argument('--lora_rank', type=int, default=0, help="low-rank adaptation matrices rank")
-    parser.add_argument('--log_interval', type=int, default=100, help="how many steps to log")
-    parser.add_argument('--lr', type=float, default=5e-6)
-    parser.add_argument('--accumulation_steps', type=int, default=8)
-    parser.add_argument('--enable_peft_lora', action='store_true', default=False)
-    parser.add_argument("--is_short_text", action='store_true', default=False)
+    parser.add_argument("--strategy", choices=["ddp", "colossalai_gemini", "colossalai_zero2"], default="ddp")
+    parser.add_argument("--model", choices=["gpt2", "bloom", "opt", "llama"], default="bloom")
+    parser.add_argument("--pretrain", type=str, default=None)
+    parser.add_argument("--dataset", type=str, default=None)
+    parser.add_argument("--eval_dataset", type=str, default=None)
+    parser.add_argument("--save_path", type=str, default="output")
+    parser.add_argument("--need_optim_ckpt", type=bool, default=False)
+    parser.add_argument("--max_epochs", type=int, default=3)
+    parser.add_argument("--batch_size", type=int, default=4)
+    parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
+    parser.add_argument("--log_interval", type=int, default=100, help="how many steps to log")
+    parser.add_argument("--lr", type=float, default=5e-6)
+    parser.add_argument("--accumulation_steps", type=int, default=8)
+    parser.add_argument("--enable_peft_lora", action="store_true", default=False)
+    parser.add_argument("--is_short_text", action="store_true", default=False)
    args = parser.parse_args()
    train(args)
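
Review note on `--need_optim_ckpt`: the flag is declared with `type=bool`, and `argparse` converts the raw string with `bool(...)`, so any non-empty value (including `"False"`) parses as `True`. The sketch below reproduces that behaviour and shows two possible alternatives. The `str2bool` helper is hypothetical and not part of this commit; the switch form mirrors how `--enable_peft_lora` and `--is_short_text` are already declared.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common textual spellings of a boolean (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in {"1", "true", "yes", "y"}:
        return True
    if lowered in {"0", "false", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# As declared in the script: bool("False") is True because the string is non-empty.
naive = argparse.ArgumentParser()
naive.add_argument("--need_optim_ckpt", type=bool, default=False)
naive_args = naive.parse_args(["--need_optim_ckpt", "False"])

# Alternative 1: an explicit converter keeps the `--flag VALUE` calling style.
explicit = argparse.ArgumentParser()
explicit.add_argument("--need_optim_ckpt", type=str2bool, default=False)
explicit_args = explicit.parse_args(["--need_optim_ckpt", "False"])

# Alternative 2: a plain switch that is False unless the flag is present.
switch = argparse.ArgumentParser()
switch.add_argument("--need_optim_ckpt", action="store_true", default=False)
switch_args = switch.parse_args([])
```

Switching to `action="store_true"` would change the command line (callers would pass `--need_optim_ckpt` with no value), so existing launch scripts would need a matching update.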
--- a/applications/Chat/examples/community/ray/README.md
+++ b/applications/Chat/examples/community/ray/README.md
+:warning: **This content may be outdated since the major update of Colossal Chat. We will update this content soon.**
+
 # ColossalAI on Ray
+
 ## Abstract
+
 This is an experimental effort to run ColossalAI Chat training on Ray.
+
 ## How to use?
+
 ### 1. Set up a Ray cluster
+
 Please follow the official [Ray cluster setup instructions](https://docs.ray.io/en/latest/cluster/getting-started.html) to set up a cluster with GPU support. Record the cluster's API server endpoint; it should look similar to http://your.head.node.addrees:8265
+
 ### 2. Clone repo
+
 Clone this project:
+
 ```shell
 git clone https://github.com/hpcaitech/ColossalAI.git
 ```
+
 ### 3. Submit the Ray job
+
 ```shell
 python applications/Chat/examples/community/ray/ray_job_script.py http://your.head.node.addrees:8265
 ```
+
 ### 4. View your job on the Ray Dashboard
+
 Open your Ray cluster dashboard at http://your.head.node.addrees:8265 to view your submitted training job.
--- a/applications/Chat/examples/community/ray/ray_job_script.py
+++ b/applications/Chat/examples/community/ray/ray_job_script.py
@@ -6,16 +6,25 @@ from ray.job_submission import JobSubmissionClient
 def main(api_server_endpoint="http://127.0.0.1:8265"):
    client = JobSubmissionClient(api_server_endpoint)
    client.submit_job(
-        entrypoint=
-        "python experimental/ray/train_prompts_on_ray.py --strategy colossalai_zero2 --prompt_csv_url https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/resolve/main/prompts.csv",
+        entrypoint="python experimental/ray/train_prompts_on_ray.py --strategy colossalai_zero2 --prompt_csv_url https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/resolve/main/prompts.csv",
        runtime_env={
-            "working_dir":
-                "applications/Chat",
+            "working_dir": "applications/Chat",
            "pip": [
-                "torch==1.13.1", "transformers>=4.20.1", "datasets", "loralib", "colossalai>=0.2.4", "langchain",
-                "tokenizers", "fastapi", "sse_starlette", "wandb", "sentencepiece", "gpustat"
-            ]
-        })
+                "torch==1.13.1",
+                "transformers>=4.20.1",
+                "datasets",
+                "loralib",
+                "colossalai>=0.2.4",
+                "langchain",
+                "tokenizers",
+                "fastapi",
+                "sse_starlette",
+                "wandb",
+                "sentencepiece",
+                "gpustat",
+            ],
+        },
+    )


 if __name__ == "__main__":

--- a/applications/Chat/examples/community/ray/train_prompts_on_ray.py
+++ b/applications/Chat/examples/community/ray/train_prompts_on_ray.py
@@ -15,7 +15,7 @@ from coati.models.lora import LoRAModule
 from coati.models.loss import PolicyLoss, ValueLoss
 from coati.models.opt import OPTActor, OPTCritic
 from coati.models.utils import compute_reward
-from coati.trainer.strategies import ColossalAIStrategy, DDPStrategy, NaiveStrategy
+from coati.trainer.strategies import DDPStrategy, GeminiStrategy, LowLevelZeroStrategy
 from ray.util.placement_group import placement_group
 from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
 from torch.optim import Adam
@@ -26,9 +26,14 @@ from colossalai.nn.optimizer import HybridAdam


 class ExperienceCompositionRefs:
-
-    def __init__(self, sequences_attention_mask_action_mask_ref: ray.ObjectRef, action_log_probs_ref: ray.ObjectRef,
-                 base_action_log_probs_ref: ray.ObjectRef, value_ref: ray.ObjectRef, r_ref: ray.ObjectRef) -> None:
+    def __init__(
+        self,
+        sequences_attention_mask_action_mask_ref: ray.ObjectRef,
+        action_log_probs_ref: ray.ObjectRef,
+        base_action_log_probs_ref: ray.ObjectRef,
+        value_ref: ray.ObjectRef,
+        r_ref: ray.ObjectRef,
+    ) -> None:
        self.sequences_attention_mask_action_mask_ref = sequences_attention_mask_action_mask_ref
        self.action_log_probs_ref = action_log_probs_ref
        self.base_action_log_probs_ref = base_action_log_probs_ref
@@ -37,14 +42,14 @@ class ExperienceCompositionRefs:


 class ExperienceMaker:
-
    def __init__(self, kl_coef) -> None:
        self.kl_coef = kl_coef

    @torch.no_grad()
    def make_experience(self, experiment_computation_refs: ExperienceCompositionRefs):
        sequences, attention_mask, action_mask = ray.get(
-            experiment_computation_refs.sequences_attention_mask_action_mask_ref)
+            experiment_computation_refs.sequences_attention_mask_action_mask_ref
+        )
        action_log_probs = ray.get(experiment_computation_refs.action_log_probs_ref)
        base_action_log_probs = ray.get(experiment_computation_refs.base_action_log_probs_ref)
        r = ray.get(experiment_computation_refs.r_ref)
@@ -58,11 +63,10 @@ class ExperienceMaker:


 class DistributedTorchRayActor:
-
    def __init__(self, world_size, rank, local_rank, master_addr, master_port):
-        logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
-                            level=logging.INFO,
-                            datefmt='%Y-%m-%d %H:%M:%S')
+        logging.basicConfig(
+            format="%(asctime)s %(levelname)-8s %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
+        )
        self._model = None
        self._world_size = world_size
        self._rank = rank
@@ -82,7 +86,7 @@ class DistributedTorchRayActor:
    @staticmethod
    def _get_free_port():
        with socket.socket() as sock:
-            sock.bind(('', 0))
+            sock.bind(("", 0))
            return sock.getsockname()[1]
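
Review note on `_get_free_port`: binding to port `0` asks the operating system to choose an unused port, and `getsockname()` reports which one it chose. A standalone sketch of the same idea follows; the function name `get_free_port` is chosen here for illustration only.

```python
import socket


def get_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket() as sock:
        sock.bind(("", 0))  # port 0 means "let the OS choose"
        return sock.getsockname()[1]  # (host, port) -> port


port = get_free_port()
```

The port is released as soon as the `with` block exits, so another process could in principle claim it before the training process binds to it. For a short-lived rendezvous port this is usually acceptable, but it is not a reservation.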

    def get_master_addr_port(self):
@@ -90,7 +94,6 @@ class DistributedTorchRayActor:


 class BasePPORole(DistributedTorchRayActor):
-
    def add_experience_maker(self, kl_coef: float = 0.1):
        self._experience_maker = ExperienceMaker(kl_coef)

@@ -99,19 +102,17 @@ class BasePPORole(DistributedTorchRayActor):

    def _init_strategy(self, strategy: str):
        # configure strategy
-        if strategy == 'naive':
-            self._strategy = NaiveStrategy()
-        elif strategy == 'ddp':
+        if strategy == "ddp":
            self._strategy = DDPStrategy()
-        elif strategy == 'colossalai_gemini':
-            self._strategy = ColossalAIStrategy(stage=3, placement_policy='cuda', initial_scale=2**5)
-        elif strategy == 'colossalai_zero2':
-            self._strategy = ColossalAIStrategy(stage=2, placement_policy='cuda')
+        elif strategy == "colossalai_gemini":
+            self._strategy = GeminiStrategy(placement_policy="cuda", initial_scale=2**5)
+        elif strategy == "colossalai_zero2":
+            self._strategy = LowLevelZeroStrategy(stage=2, placement_policy="cuda")
        else:
            raise ValueError(f'Unsupported strategy "{strategy}"')

    def _init_optimizer(self):
-        if isinstance(self._strategy, ColossalAIStrategy):
+        if isinstance(self._strategy, (GeminiStrategy, LowLevelZeroStrategy)):
            self._optimizer = HybridAdam(self._model.parameters(), lr=5e-6)
        else:
            self._optimizer = Adam(self._model.parameters(), lr=5e-6)
@@ -126,11 +127,9 @@ class BasePPORole(DistributedTorchRayActor):
    def _load_model_from_pretrained(self, model_class: Type[LoRAModule], pretrain: str):
        raise NotImplementedError()

-    def init_model_from_pretrained(self,
-                                   strategy: str,
-                                   model_class: Type[LoRAModule],
-                                   pretrain: str,
-                                   has_optimizer=False):
+    def init_model_from_pretrained(
+        self, strategy: str, model_class: Type[LoRAModule], pretrain: str, has_optimizer=False
+    ):
        self._init_strategy(strategy)
        self._load_model_from_pretrained(model_class, pretrain)
        self._prepare_model_with_strategy(has_optimizer)
@@ -140,7 +139,6 @@ class BasePPORole(DistributedTorchRayActor):


 class TrainablePPORole(BasePPORole):
-
    def _load_model_from_pretrained(self, model_class, pretrain):
        with self._strategy.model_init_context():
            self._model = model_class(pretrain).to(torch.cuda.current_device())
@@ -163,38 +161,39 @@ class TrainablePPORole(BasePPORole):

 @ray.remote(num_gpus=1)
 class RayPPOActor(TrainablePPORole):
-
    def set_loss_function(self, eps_clip: float):
        self._actor_loss_fn = PolicyLoss(eps_clip)

    def load_tokenizer_from_pretrained(self, model_type: str, pretrained):
-        if model_type == 'gpt2':
+        if model_type == "gpt2":
            self._model_tokenizer = GPT2Tokenizer.from_pretrained(pretrained)
            self._model_tokenizer.pad_token = self._model_tokenizer.eos_token
-        elif model_type == 'bloom':
+        elif model_type == "bloom":
            self._model_tokenizer = BloomTokenizerFast.from_pretrained(pretrained)
            self._model_tokenizer.pad_token = self._model_tokenizer.eos_token
-        elif model_type == 'opt':
+        elif model_type == "opt":
            self._model_tokenizer = AutoTokenizer.from_pretrained(pretrained)
        else:
            raise ValueError(f'Unsupported model "{model_type}"')

        # Set tokenize function for sequence generation
        def _text_input_tokenize_fn(texts):
-            batch = self._model_tokenizer(texts, return_tensors='pt', max_length=96, padding=True, truncation=True)
+            batch = self._model_tokenizer(texts, return_tensors="pt", max_length=96, padding=True, truncation=True)
            return {k: v.cuda() for k, v in batch.items()}

        self._sample_tokenize_function = _text_input_tokenize_fn

    def setup_generate_kwargs(self, generate_kwargs: dict):
        from coati.trainer.ppo import _set_default_generate_kwargs
+
        self._generate_kwargs = _set_default_generate_kwargs(self._strategy, generate_kwargs, self._model)
-        self._generate_kwargs['pad_token_id'] = self._model_tokenizer.pad_token_id
-        self._generate_kwargs['eos_token_id'] = self._model_tokenizer.eos_token_id
+        self._generate_kwargs["pad_token_id"] = self._model_tokenizer.pad_token_id
+        self._generate_kwargs["eos_token_id"] = self._model_tokenizer.eos_token_id

    def load_csv_prompt_file_from_url_to_sampler(self, prompt_url):
        import pandas as pd
-        prompts = pd.read_csv(prompt_url)['prompt']
+
+        prompts = pd.read_csv(prompt_url)["prompt"]
        self._sampler = self._strategy.setup_sampler(prompts)

    def _generate(self, input_ids, **generate_kwargs):
@@ -216,10 +215,9 @@ class RayPPOActor(TrainablePPORole):
    def _training_step(self, experience):
        num_actions = experience.action_mask.size(1)
        action_log_probs = self._model(experience.sequences, num_actions, attention_mask=experience.attention_mask)
-        actor_loss = self._actor_loss_fn(action_log_probs,
-                                         experience.action_log_probs,
-                                         experience.advantages,
-                                         action_mask=experience.action_mask)
+        actor_loss = self._actor_loss_fn(
+            action_log_probs, experience.action_log_probs, experience.advantages, action_mask=experience.action_mask
+        )
        self._strategy.backward(actor_loss, self._model, self._optimizer)
        self._strategy.optimizer_step(self._optimizer)
        self._optimizer.zero_grad()
@@ -231,17 +229,18 @@ class RayPPOActor(TrainablePPORole):
            self._strategy.save_model(self._model, save_path, only_rank0=True)
        # save optimizer checkpoint on all ranks
        if should_save_optimizer:
-            self._strategy.save_optimizer(self._optimizer,
-                                          'actor_optim_checkpoint_prompts_%d.pt' % (torch.cuda.current_device()),
-                                          only_rank0=False)
+            self._strategy.save_optimizer(
+                self._optimizer,
+                "actor_optim_checkpoint_prompts_%d.pt" % (torch.cuda.current_device()),
+                only_rank0=False,
+            )

    def generate_answer(self, prompt, max_length=30, num_return_sequences=5):
-        encoded_input = self._model_tokenizer(prompt, return_tensors='pt')
+        encoded_input = self._model_tokenizer(prompt, return_tensors="pt")
        input_ids = {k: v.cuda() for k, v in encoded_input.items()}
-        sequence, _ = self._model.generate(**input_ids,
-                                           max_length=max_length,
-                                           return_action_mask=False,
-                                           num_return_sequences=num_return_sequences)
+        sequence, _ = self._model.generate(
+            **input_ids, max_length=max_length, return_action_mask=False, num_return_sequences=num_return_sequences
+        )
        token_list = list(sequence.data[0])
        output = " ".join([self._model_tokenizer.decode(token) for token in token_list])
        return output
@@ -249,18 +248,16 @@ class RayPPOActor(TrainablePPORole):

 @ray.remote(num_gpus=1)
 class RayPPOCritic(TrainablePPORole):
-
    def set_loss_function(self, value_clip: float):
        self._critic_loss_fn = ValueLoss(value_clip)

    def _training_step(self, experience):
-        values = self._model(experience.sequences,
-                             action_mask=experience.action_mask,
-                             attention_mask=experience.attention_mask)
-        critic_loss = self._critic_loss_fn(values,
-                                           experience.values,
-                                           experience.reward,
-                                           action_mask=experience.action_mask)
+        values = self._model(
+            experience.sequences, action_mask=experience.action_mask, attention_mask=experience.attention_mask
+        )
+        critic_loss = self._critic_loss_fn(
+            values, experience.values, experience.reward, action_mask=experience.action_mask
+        )
        self._strategy.backward(critic_loss, self._model, self._optimizer)
        self._strategy.optimizer_step(self._optimizer)
        self._optimizer.zero_grad()
@@ -274,12 +271,12 @@ class RayPPOCritic(TrainablePPORole):

 @ray.remote(num_gpus=1)
 class RayPPORewardModel(BasePPORole):
-
    def _load_model_from_pretrained(self, model_class, pretrain):
        with self._strategy.model_init_context():
            critic = model_class(pretrained=pretrain).to(torch.cuda.current_device())
-            self._model = RewardModel(deepcopy(critic.model),
-                                      deepcopy(critic.value_head)).to(torch.cuda.current_device())
+            self._model = RewardModel(deepcopy(critic.model), deepcopy(critic.value_head)).to(
+                torch.cuda.current_device()
+            )

    @torch.no_grad()
    def calculate_r(self, sequence_attention_action_mask):
@@ -289,7 +286,6 @@ class RayPPORewardModel(BasePPORole):

 @ray.remote(num_gpus=1)
 class RayPPOInitialModel(BasePPORole):
-
    def _load_model_from_pretrained(self, model_class, pretrain):
        with self._strategy.model_init_context():
            self._model = model_class(pretrain).to(torch.cuda.current_device())
@@ -321,8 +317,9 @@ class PPORayActorGroup:
            pg = placement_group(bundles, strategy="STRICT_SPREAD")
            ray.get(pg.ready())
        if pg:
-            master_actor = self.ray_actor_type.options(scheduling_strategy=PlacementGroupSchedulingStrategy(
-                placement_group=pg, placement_group_bundle_index=0)).remote(world_size, 0, 0, None, None)
+            master_actor = self.ray_actor_type.options(
+                scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)
+            ).remote(world_size, 0, 0, None, None)
        else:
            master_actor = self.ray_actor_type.options(num_gpus=1).remote(world_size, 0, 0, None, None)
        self._actor_handlers = [master_actor]
@@ -333,16 +330,20 @@ class PPORayActorGroup:
            for rank in range(1, world_size):
                local_rank = rank % self._num_gpus_per_node
                if pg:
-                    worker_actor = self.ray_actor_type.options(scheduling_strategy=PlacementGroupSchedulingStrategy(
-                        placement_group=pg, placement_group_bundle_index=rank // self._num_gpus_per_node)).remote(
-                            world_size, rank, local_rank, master_addr, master_port)
+                    worker_actor = self.ray_actor_type.options(
+                        scheduling_strategy=PlacementGroupSchedulingStrategy(
+                            placement_group=pg, placement_group_bundle_index=rank // self._num_gpus_per_node
+                        )
+                    ).remote(world_size, rank, local_rank, master_addr, master_port)
                else:
-                    worker_actor = self.ray_actor_type.options(num_gpus=1).remote(world_size, rank, local_rank,
-                                                                                  master_addr, master_port)
+                    worker_actor = self.ray_actor_type.options(num_gpus=1).remote(
+                        world_size, rank, local_rank, master_addr, master_port
+                    )
                self._actor_handlers.append(worker_actor)
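
Review note on worker placement: each worker's `local_rank` is `rank % num_gpus_per_node` and its placement-group bundle index is `rank // num_gpus_per_node`. The sketch below tabulates that mapping. It assumes `world_size = num_nodes * num_gpus_per_node`, which is not visible in this hunk, and `rank_layout` is a hypothetical helper rather than code from the commit.

```python
from typing import List, Tuple


def rank_layout(num_nodes: int, num_gpus_per_node: int) -> List[Tuple[int, int, int]]:
    """Return (rank, local_rank, bundle_index) for every worker."""
    # Assumption: one worker per GPU across all nodes.
    world_size = num_nodes * num_gpus_per_node
    return [
        (rank, rank % num_gpus_per_node, rank // num_gpus_per_node)
        for rank in range(world_size)
    ]


layout = rank_layout(num_nodes=2, num_gpus_per_node=4)
for rank, local_rank, bundle_index in layout:
    print(f"rank={rank} local_rank={local_rank} bundle={bundle_index}")
```

Under this mapping consecutive ranks fill one bundle before moving to the next, and rank 0 lands on bundle 0 with `local_rank` 0, which matches the arguments the master actor is created with.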

-    def async_init_model_from_pretrained(self, strategy: str, model_class: Type[LoRAModule], pretrain: str,
-                                         has_optimizer: bool):
+    def async_init_model_from_pretrained(
+        self, strategy: str, model_class: Type[LoRAModule], pretrain: str, has_optimizer: bool
+    ):
        return [
            actor.init_model_from_pretrained.remote(strategy, model_class, pretrain, has_optimizer)
            for actor in self._actor_handlers
@@ -350,7 +351,6 @@ class PPORayActorGroup:


 class TrainableModelRayActorGroup(PPORayActorGroup):
-
    def async_learn_on_experiences(self, experience_refs):
        num_actors = len(self._actor_handlers)
        learn_result_refs = []
@@ -361,7 +361,6 @@ class TrainableModelRayActorGroup(PPORayActorGroup):


 class PPOActorRayActorGroup(TrainableModelRayActorGroup):
-
    def __init__(self, num_nodes, num_gpus_per_node) -> None:
        super().__init__(num_nodes, num_gpus_per_node, RayPPOActor)

@@ -383,7 +382,8 @@ class PPOActorRayActorGroup(TrainableModelRayActorGroup):
        action_log_probs_refs = []
        for i in range(len(sequences_attention_mask_action_mask_refs)):
            action_log_probs_ref = self._actor_handlers[i % num_actors].calculate_action_log_probs.remote(
-                sequences_attention_mask_action_mask_refs[i])
+                sequences_attention_mask_action_mask_refs[i]
+            )
            action_log_probs_refs.append(action_log_probs_ref)
        return action_log_probs_refs

@@ -395,7 +395,6 @@ class PPOActorRayActorGroup(TrainableModelRayActorGroup):


 class PPOCriticRayActorGroup(TrainableModelRayActorGroup):
-
    def __init__(self, num_nodes, num_gpus_per_node) -> None:
        super().__init__(num_nodes, num_gpus_per_node, RayPPOCritic)

@@ -404,7 +403,8 @@ class PPOCriticRayActorGroup(TrainableModelRayActorGroup):
        value_refs = []
        for i in range(len(sequences_attention_mask_action_mask_refs)):
            value_ref = self._actor_handlers[i % num_actors].calculate_value.remote(
-                sequences_attention_mask_action_mask_refs[i])
+                sequences_attention_mask_action_mask_refs[i]
+            )
            value_refs.append(value_ref)
        return value_refs

@@ -413,7 +413,6 @@ class PPOCriticRayActorGroup(TrainableModelRayActorGroup):


 class PPOInitialRayActorGroup(PPORayActorGroup):
-
    def __init__(self, num_nodes, num_gpus_per_node) -> None:
        super().__init__(num_nodes, num_gpus_per_node, RayPPOInitialModel)

@@ -422,13 +421,13 @@ class PPOInitialRayActorGroup(PPORayActorGroup):
        base_action_log_probs_refs = []
        for i in range(len(sequences_attention_mask_action_mask_refs)):
            base_action_log_probs_ref = self._actor_handlers[i % num_actors].calculate_base_action_log_probs.remote(
-                sequences_attention_mask_action_mask_refs[i])
+                sequences_attention_mask_action_mask_refs[i]
+            )
            base_action_log_probs_refs.append(base_action_log_probs_ref)
        return base_action_log_probs_refs


 class PPORewardRayActorGroup(PPORayActorGroup):
-
    def __init__(self, num_nodes, num_gpus_per_node) -> None:
        super().__init__(num_nodes, num_gpus_per_node, RayPPORewardModel)

@@ -437,20 +436,21 @@ class PPORewardRayActorGroup(PPORayActorGroup):
        r_refs = []
        for i in range(len(sequences_attention_mask_action_mask_refs)):
            r_ref = self._actor_handlers[i % num_actors].calculate_r.remote(
-                sequences_attention_mask_action_mask_refs[i])
+                sequences_attention_mask_action_mask_refs[i]
+            )
            r_refs.append(r_ref)
        return r_refs
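
Review note on dispatch: the `async_calculate_*` methods in the actor groups all hand the i-th sequence reference to handler `i % num_actors`, a round-robin assignment. The sketch below shows the pattern with plain strings standing in for Ray object references and actor handles; `round_robin` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def round_robin(items: Sequence[str], workers: Sequence[str]) -> Dict[str, List[str]]:
    """Assign items[i] to workers[i % len(workers)]."""
    assignment: Dict[str, List[str]] = {worker: [] for worker in workers}
    for i, item in enumerate(items):
        assignment[workers[i % len(workers)]].append(item)
    return assignment


assignment = round_robin(
    items=["ref0", "ref1", "ref2", "ref3", "ref4"],
    workers=["actor0", "actor1"],
)
```

Round-robin keeps the per-actor load within one item of even, but it does not account for sequences of different lengths or for actors that are still busy.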


 def main(args):
-    logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
-                        level=logging.INFO,
-                        datefmt='%Y-%m-%d %H:%M:%S')
-    if args.model == 'gpt2':
+    logging.basicConfig(
+        format="%(asctime)s %(levelname)-8s %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
+    )
+    if args.model == "gpt2":
        actor_model_class, critic_model_class = GPTActor, GPTCritic
-    elif args.model == 'bloom':
+    elif args.model == "bloom":
        actor_model_class, critic_model_class = BLOOMActor, BLOOMCritic
-    elif args.model == 'opt':
+    elif args.model == "opt":
        actor_model_class, critic_model_class = OPTActor, OPTCritic
    else:
        raise ValueError(f'Unsupported model "{args.model}"')
@@ -464,13 +464,14 @@ def main(args):
    logging.info("Actors created")

    # Prepare model for training
-    generate_kwargs = {'max_length': 128, 'do_sample': True, 'temperature': 1.0, 'top_k': 50}
+    generate_kwargs = {"max_length": 128, "do_sample": True, "temperature": 1.0, "top_k": 50}
    ray.get(
-        actor_group.async_init_model_from_pretrained(args.strategy, actor_model_class, args.pretrain, True) +
-        critic_group.async_init_model_from_pretrained(args.strategy, critic_model_class, args.pretrain, True) +
-        initial_group.async_init_model_from_pretrained(args.strategy, actor_model_class, args.pretrain, False) +
-        reward_group.async_init_model_from_pretrained(args.strategy, critic_model_class, args.pretrain, False) +
-        actor_group.async_prepare_for_sequence_generation(args.model, args.pretrain, generate_kwargs))
+        actor_group.async_init_model_from_pretrained(args.strategy, actor_model_class, args.pretrain, True)
+        + critic_group.async_init_model_from_pretrained(args.strategy, critic_model_class, args.pretrain, True)
+        + initial_group.async_init_model_from_pretrained(args.strategy, actor_model_class, args.pretrain, False)
+        + reward_group.async_init_model_from_pretrained(args.strategy, critic_model_class, args.pretrain, False)
+        + actor_group.async_prepare_for_sequence_generation(args.model, args.pretrain, generate_kwargs)
+    )
    logging.info("Models prepared for training")

    # Prepare models for training
@@ -485,8 +486,12 @@ def main(args):
    # Start training
    logging.info("Training start")
    # Set all models to eval and add experience maker
-    all_ray_actors = actor_group._actor_handlers + critic_group._actor_handlers + \
-        initial_group._actor_handlers + reward_group._actor_handlers
+    all_ray_actors = (
+        actor_group._actor_handlers
+        + critic_group._actor_handlers
+        + initial_group._actor_handlers
+        + reward_group._actor_handlers
+    )
    num_ray_actors = len(all_ray_actors)
    ray.get([ray_actor.eval.remote() for ray_actor in all_ray_actors])
    ray.get([ray_actor.add_experience_maker.remote() for ray_actor in all_ray_actors])
@@ -499,18 +504,28 @@ def main(args):
            time += 1
            # Experience queueing stage
            sequences_attention_mask_action_mask_refs = actor_group.async_sample_prompts_and_make_sequence(
-                experience_batch_size)
+                experience_batch_size
+            )
            base_action_log_probs_refs = initial_group.async_calculate_base_action_log_probs(
-                sequences_attention_mask_action_mask_refs)
+                sequences_attention_mask_action_mask_refs
+            )
            values_refs = critic_group.async_calculate_value(sequences_attention_mask_action_mask_refs)
            r_refs = reward_group.async_calculate_r(sequences_attention_mask_action_mask_refs)
            action_log_probs_refs = actor_group.async_calculate_action_log_probs(
-                sequences_attention_mask_action_mask_refs)
-            experience_composition_refs.extend([
-                ExperienceCompositionRefs(sequences_attention_mask_action_mask_refs[i], action_log_probs_refs[i],
-                                          base_action_log_probs_refs[i], values_refs[i], r_refs[i])
+                sequences_attention_mask_action_mask_refs
+            )
+            experience_composition_refs.extend(
+                [
+                    ExperienceCompositionRefs(
+                        sequences_attention_mask_action_mask_refs[i],
+                        action_log_probs_refs[i],
+                        base_action_log_probs_refs[i],
+                        values_refs[i],
+                        r_refs[i],
+                    )
                    for i in range(len(sequences_attention_mask_action_mask_refs))
-            ])
+                ]
+            )
            # Learning stage
            if time % update_timesteps == 0:
                experience_refs = []
@@ -521,8 +536,9 @@ def main(args):
                    experience_refs.append(selected_ray_actor.make_experience.remote(exp_composition_ref))
                # backward
                ray.get(
-                    actor_group.async_learn_on_experiences(experience_refs) +
-                    critic_group.async_learn_on_experiences(experience_refs))
+                    actor_group.async_learn_on_experiences(experience_refs)
+                    + critic_group.async_learn_on_experiences(experience_refs)
+                )
                # clear refs queue
                experience_composition_refs.clear()
    logging.info("Training finished")
@@ -530,26 +546,24 @@ def main(args):
    actor_group.save_checkpoint(args.save_path, args.need_optim_ckpt)


-if __name__ == '__main__':
+if __name__ == "__main__":
    parser = argparse.ArgumentParser()
-    parser.add_argument('--prompt_csv_url', type=str)
-    parser.add_argument('--strategy',
-                        choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'],
-                        default='naive')
-    parser.add_argument('--model', default='gpt2', choices=['gpt2', 'bloom', 'opt'])
-    parser.add_argument('--pretrain', type=str, default='gpt2')
-    parser.add_argument('--save_path', type=str, default='actor_checkpoint_prompts.pt')
-    parser.add_argument('--need_optim_ckpt', type=bool, default=False)
-    parser.add_argument('--num_episodes', type=int, default=10)
-    parser.add_argument('--max_timesteps', type=int, default=10)
-    parser.add_argument('--update_timesteps', type=int, default=10)
-    parser.add_argument('--train_batch_size', type=int, default=8)
-    parser.add_argument('--experience_batch_size', type=int, default=8)
-    parser.add_argument('--num_actor_nodes', type=int, help='num of nodes to use to host actor model', default=1)
-    parser.add_argument('--num_critic_nodes', type=int, help='num of nodes to use to host critic model', default=1)
-    parser.add_argument('--num_initial_nodes', type=int, help='num of nodes to use to host initial model', default=1)
-    parser.add_argument('--num_reward_nodes', type=int, help='num of nodes to use to host reward model', default=1)
-    parser.add_argument('--num_gpus_per_node', type=int, help='num of gpus on a ray node', default=1)
+    parser.add_argument("--prompt_csv_url", type=str)
+    parser.add_argument("--strategy", choices=["ddp", "colossalai_gemini", "colossalai_zero2"], default="ddp")
+    parser.add_argument("--model", default="gpt2", choices=["gpt2", "bloom", "opt"])
+    parser.add_argument("--pretrain", type=str, default="gpt2")
+    parser.add_argument("--save_path", type=str, default="actor_checkpoint_prompts.pt")
+    parser.add_argument("--need_optim_ckpt", type=bool, default=False)
+    parser.add_argument("--num_episodes", type=int, default=10)
+    parser.add_argument("--max_timesteps", type=int, default=10)
+    parser.add_argument("--update_timesteps", type=int, default=10)
+    parser.add_argument("--train_batch_size", type=int, default=8)
+    parser.add_argument("--experience_batch_size", type=int, default=8)
+    parser.add_argument("--num_actor_nodes", type=int, help="num of nodes to use to host actor model", default=1)
+    parser.add_argument("--num_critic_nodes", type=int, help="num of nodes to use to host critic model", default=1)
+    parser.add_argument("--num_initial_nodes", type=int, help="num of nodes to use to host initial model", default=1)
+    parser.add_argument("--num_reward_nodes", type=int, help="num of nodes to use to host reward model", default=1)
+    parser.add_argument("--num_gpus_per_node", type=int, help="num of gpus on a ray node", default=1)
    args = parser.parse_args()
    ray.init()
    main(args)
--- a/applications/Chat/examples/download_model.py
+++ b/applications/Chat/examples/download_model.py
+import argparse
+import dataclasses
+import os
+from typing import List
+
+import tqdm
+from coati.models.bloom import BLOOMRM, BLOOMActor, BLOOMCritic
+from coati.models.gpt import GPTRM, GPTActor, GPTCritic
+from coati.models.opt import OPTRM, OPTActor, OPTCritic
+from huggingface_hub import hf_hub_download, snapshot_download
+from transformers import AutoConfig, AutoTokenizer, BloomConfig, BloomTokenizerFast, GPT2Config, GPT2Tokenizer
+
+
+@dataclasses.dataclass
+class HFRepoFiles:
+    repo_id: str
+    files: List[str]
+
+    def download(self, dir_path: str):
+        for file in self.files:
+            hf_hub_download(self.repo_id, file, local_dir=dir_path)
+
+    def download_all(self):
+        snapshot_download(self.repo_id)
+
+
+def test_init(model: str, dir_path: str):
+    if model == "gpt2":
+        config = GPT2Config.from_pretrained(dir_path)
+        actor = GPTActor(config=config)
+        critic = GPTCritic(config=config)
+        reward_model = GPTRM(config=config)
+        GPT2Tokenizer.from_pretrained(dir_path)
+    elif model == "bloom":
+        config = BloomConfig.from_pretrained(dir_path)
+        actor = BLOOMActor(config=config)
+        critic = BLOOMCritic(config=config)
+        reward_model = BLOOMRM(config=config)
+        BloomTokenizerFast.from_pretrained(dir_path)
+    elif model == "opt":
+        config = AutoConfig.from_pretrained(dir_path)
+        actor = OPTActor(config=config)
+        critic = OPTCritic(config=config)
+        reward_model = OPTRM(config=config)
+        AutoTokenizer.from_pretrained(dir_path)
+    else:
+        raise NotImplementedError(f"Model {model} not implemented")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model-dir", type=str, default="test_models")
+    parser.add_argument("--config-only", default=False, action="store_true")
+    args = parser.parse_args()
+
+    if os.path.exists(args.model_dir):
+        print(f"[INFO]: {args.model_dir} already exists")
+        raise SystemExit(0)
+
+    repo_list = {
+        "gpt2": HFRepoFiles(repo_id="gpt2", files=["config.json", "tokenizer.json", "vocab.json", "merges.txt"]),
+        "bloom": HFRepoFiles(
+            repo_id="bigscience/bloom-560m", files=["config.json", "tokenizer.json", "tokenizer_config.json"]
+        ),
+        "opt": HFRepoFiles(
+            repo_id="facebook/opt-350m", files=["config.json", "tokenizer_config.json", "vocab.json", "merges.txt"]
+        ),
+    }
+
+    os.mkdir(args.model_dir)
+    for model_name in tqdm.tqdm(repo_list):
+        dir_path = os.path.join(args.model_dir, model_name)
+        if args.config_only:
+            os.mkdir(dir_path)
+            repo_list[model_name].download(dir_path)
+        else:
+            repo_list[model_name].download_all()
+        test_init(model_name, dir_path)