[example] add opt model in language (#1809)

Unverified Commit fd2c8d81 authored Nov 08, 2022 by Jiarui Fang Committed by GitHub Nov 08, 2022
8 changed files
--- a/examples/language/opt/README.md
+++ b/examples/language/opt/README.md
+<!---
+Copyright 2020 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+## OPT
+Meta recently released [Open Pretrained Transformer (OPT)](https://github.com/facebookresearch/metaseq), a 175-billion-parameter AI language model, which encourages AI programmers to perform various downstream tasks and application deployments.
+The following [Colossal-AI](https://github.com/hpcaitech/ColossalAI) example demonstrates fine-tuning for causal language modelling at low cost.
+We fine-tune the pretrained OPT weights provided on the Hugging Face Hub on the raw WikiText-2 dataset (no tokens were replaced before
+tokenization). The training script is adapted from the [HuggingFace Language Modelling examples](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling).
+## Quick Start
+You can launch training with the following bash script:
+```bash
+bash ./run_clm.sh <batch-size-per-gpu> <mem-cap> <model> <gpu-num>
+```
+- batch-size-per-gpu: number of samples fed to each GPU, default is 16
+- mem-cap: limit memory usage within a value in GB, default is 0 (no limit)
+- model: the size of the OPT model, default is `125m`. Acceptable values include `125m`, `350m`, `1.3b`, `2.7b`, `6.7b`, `13b`, `30b`, `66b`. For `175b`, you can request
+the pretrained weights from [OPT weight downloading page](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT).
+- gpu-num: the number of GPUs to use, default is 1.
+## Remarkable Performance
+On a single GPU, Colossal-AI’s automatic strategy provides remarkable performance gains over the ZeRO Offloading strategy of Microsoft DeepSpeed.
+Users can experience up to a 40% speedup at a variety of model scales. By contrast, a traditional deep learning training framework such as PyTorch can no longer train models at such a scale on a single GPU.
+<p align="center">
+<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/OPT.png" width=1000/>
+</p>
+Adopting the distributed training strategy with 8 GPUs is as simple as adding `-nprocs 8` to the Colossal-AI training command!
+More details on what happens behind the scenes can be found in the corresponding [blog](https://medium.com/@yangyou_berkeley/colossal-ai-seamlessly-accelerates-large-models-at-low-costs-with-hugging-face-4d1a887e500d),
+and a detailed tutorial will be added to the [Documentation](https://www.colossalai.org/docs/get_started/installation) very soon.
--- a/examples/language/opt/benchmark.sh
+++ b/examples/language/opt/benchmark.sh
+export BS=16
+export MEMCAP=0
+export MODEL="6.7b"
+export GPUNUM=1
+for MODEL in "6.7b" "13b" "1.3b"
+do
+for GPUNUM in 8 1
+do
+for BS in 16 24 32 8
+do
+for MEMCAP in 0 40
+do
+pkill -9 torchrun
+pkill -9 python
+bash ./run_clm.sh $BS $MEMCAP $MODEL $GPUNUM
+done
+done
+done
+done
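The nested loops in `benchmark.sh` sweep a grid of settings, and `run_clm.sh` writes one log file per combination. As a rough illustration (not part of this commit), the sketch below enumerates the same grid in Python and reconstructs each command line and log file name. The function name `benchmark_runs` and the constant names are hypothetical; the values and the log name pattern are copied from the two scripts.

```python
from itertools import product

# Values copied from the loops in benchmark.sh, in the same nesting order.
MODELS = ["6.7b", "13b", "1.3b"]
GPU_NUMS = [8, 1]
BATCH_SIZES = [16, 24, 32, 8]
MEM_CAPS = [0, 40]


def benchmark_runs():
    """Return (command, log_name) pairs in the order benchmark.sh launches them."""
    runs = []
    # product() varies the last iterable fastest, like the innermost shell loop.
    for model, gpunum, bs, memcap in product(MODELS, GPU_NUMS, BATCH_SIZES, MEM_CAPS):
        # Positional order expected by run_clm.sh: batch size, mem cap, model, GPU count.
        command = f"bash ./run_clm.sh {bs} {memcap} {model} {gpunum}"
        # Log name pattern used by the `tee` call in run_clm.sh.
        log_name = f"colo_{model}_bs_{bs}_cap_{memcap}_gpu_{gpunum}.log"
        runs.append((command, log_name))
    return runs


if __name__ == "__main__":
    runs = benchmark_runs()
    print(f"{len(runs)} runs in total")
    for command, log_name in runs[:3]:
        print(command, "->", f"./logs/{log_name}")
```

Listing the grid this way makes it easy to see that a full sweep launches 48 training runs, which may help when estimating how long the benchmark will take.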
--- a/examples/language/opt/colossalai_zero.py
+++ b/examples/language/opt/colossalai_zero.py
+from colossalai.zero.shard_utils import TensorShardStrategy
+zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
+                              tensor_placement_policy="auto",
+                              reuse_fp16_shard=True),
+            optimizer_config=dict(gpu_margin_mem_ratio=0.8, initial_scale=16384))
--- a/examples/language/opt/log
+++ b/examples/language/opt/log
--- a/examples/language/opt/requirements.txt
+++ b/examples/language/opt/requirements.txt
+colossalai
+torch >= 1.8.1
+datasets >= 1.8.0
+sentencepiece != 0.1.92
+protobuf
--- a/examples/language/opt/run_clm.py
+++ b/examples/language/opt/run_clm.py
--- a/examples/language/opt/run_clm.sh
+++ b/examples/language/opt/run_clm.sh
+set -x
+export BS=${1:-16}
+export MEMCAP=${2:-0}
+export MODEL=${3:-"125m"}
+export GPUNUM=${4:-1}
+# make directory for logs
+mkdir -p ./logs
+export MODEL_PATH="facebook/opt-${MODEL}"
+# HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1
+torchrun \
+  --nproc_per_node ${GPUNUM} \
+  --master_port 19198 \
+  run_clm.py \
+  --dataset_name wikitext \
+  --dataset_config_name wikitext-2-raw-v1 \
+  --output_dir $PWD \
+  --mem_cap ${MEMCAP} \
+  --model_name_or_path ${MODEL_PATH} \
+  --per_device_train_batch_size ${BS} 2>&1 | tee ./logs/colo_${MODEL}_bs_${BS}_cap_${MEMCAP}_gpu_${GPUNUM}.log
--- a/examples/language/opt/utils.py
+++ b/examples/language/opt/utils.py
+import torch
+import torch.distributed as dist
+def memory_cap(size_in_GB):
+    """Limit this process to at most `size_in_GB` GB of memory on its GPU."""
+    print(f"use only {size_in_GB} GB of CUDA memory")
+    assert dist.is_initialized(), "memory_cap must be used after dist init"
+    # dist.get_rank() is the global rank; map it to a device index on this node
+    # (assumes every node has the same number of GPUs).
+    local_rank = dist.get_rank() % torch.cuda.device_count()
+    cuda_capacity = torch.cuda.get_device_properties(local_rank).total_memory
+    size_in_B = size_in_GB * 1024**3
+    if size_in_B > cuda_capacity:
+        print(f'memory_cap is useless since the device capacity {cuda_capacity / 1024**3} GB is less than {size_in_GB} GB')
+        return
+    fraction = size_in_B / cuda_capacity
+    print(f'mem fraction is {fraction}')
+    torch.cuda.set_per_process_memory_fraction(fraction, local_rank)
+def colo_memory_cap(size_in_GB):
+    from colossalai.utils import colo_device_memory_capacity, colo_set_process_memory_fraction, get_current_device
+    cuda_capacity = colo_device_memory_capacity(get_current_device())
+    if size_in_GB * (1024**3) < cuda_capacity:
+        colo_set_process_memory_fraction(size_in_GB * (1024**3) / cuda_capacity)
+        print("Using {} GB of GPU memory".format(size_in_GB))
+if __name__ == '__main__':
+    memory_cap(40)
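The arithmetic behind `memory_cap` can be checked without a GPU. The sketch below (not part of this commit) isolates that calculation in a pure function. The name `memory_fraction` and the constant `GIB` are hypothetical; the logic mirrors `utils.py`, which treats one GB as `1024**3` bytes and skips the cap when the requested size exceeds the device capacity.

```python
from typing import Optional

GIB = 1024**3  # utils.py treats "GB" as 1024**3 bytes


def memory_fraction(size_in_gb: float, capacity_bytes: int) -> Optional[float]:
    """Return the fraction of device memory for a cap, or None if the cap exceeds capacity."""
    size_in_bytes = size_in_gb * GIB
    if size_in_bytes > capacity_bytes:
        # Same early exit as memory_cap: a cap above capacity has no effect.
        return None
    return size_in_bytes / capacity_bytes


if __name__ == "__main__":
    # A 40 GB cap on a hypothetical 80 GB device uses half of the memory.
    print(memory_fraction(40, 80 * GIB))  # 0.5
    # The same cap on a hypothetical 24 GB device is ignored.
    print(memory_fraction(40, 24 * GIB))  # None
```

In `memory_cap`, a non-`None` result is what gets passed to `torch.cuda.set_per_process_memory_fraction`.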