# Repo for LLM Supervised Fine-Tuning and Text Generation

Date: 2024-01-02

Authors:
- menglibin.mlb


## Fine-Tuning
**Setting `use_cache=true` in the model checkpoint's `config.json` can significantly speed up inference.**
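
If you would rather not edit `config.json` by hand, the same flag can be flipped through the `transformers` API. A minimal sketch, assuming `transformers` is installed in the container and using the Llama-2 checkpoint path from the demos below:
```python
# Sketch: enable the KV cache by rewriting the checkpoint config.
from transformers import AutoConfig

ckpt = "/mnt/llama2-ckpts/Llama-2-13b-hf"  # example checkpoint path from the demos below
config = AutoConfig.from_pretrained(ckpt)
config.use_cache = True                    # same effect as editing "use_cache": true by hand
config.save_pretrained(ckpt)               # rewrites config.json in the checkpoint directory
```
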
Run the following commands inside a container started from the image `dsw-registry.cn-wulanchabu.cr.aliyuncs.com/pai/ngc:23.07-py310-cu121-ubuntu22.04-megatron-patch-llm`.
### Usage
```
$ bash run_ds_train_huggingface_finetune.sh --help
```
```
Usage: bash run_ds_train_huggingface_finetune.sh \
    [--env ENV] \
    [--model-size MODEL_SIZE] \
    [--micro-batch-size MICRO_BATCH_SIZE] \
    [--gradient-accumulation-steps GRADIENT_ACCUMULATION_STEPS] \
    [--learning-rate LEARNING_RATE] \
    [--sequence-length SEQUENCE_LENGTH] \
    [--precision PRECISION] \
    [--zero-stage ZERO_STAGE] \
    [--enable-gradient-checkpointing ENABLE_GRADIENT_CHECKPOINTING] \
    [--model-name MODEL_NAME {llama2-13b, qwen-7b, qwen-14b, qwen-72b}] \
    [--flash-attention FLASH_ATTENTION] \
    [--epoch EPOCH] \
    [--train-dataset TRAIN_DATASET] \
    [--validation-dataset VALIDATION_DATASET] \
    [--pretrain-model-path PRETRAIN_MODEL_PATH] \
    [--finetune-output-path FINETUNE_OUTPUT_PATH]
```
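
As a quick sanity check when choosing `--micro-batch-size` and `--gradient-accumulation-steps`: the effective global batch size also scales with the number of data-parallel GPUs. A small worked example (the GPU count is an assumption for illustration):
```python
# Effective global batch size = micro batch size x gradient accumulation steps x data-parallel GPUs.
# The GPU count below is an assumption for illustration; use your actual world size.
micro_batch_size = 1
gradient_accumulation_steps = 2
num_gpus = 8

global_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(global_batch_size)  # 16
```
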
### Demo
#### Llama2
```
$ bash run_ds_train_huggingface_finetune.sh --env dsw --model-size 13B --micro-batch-size 1 --gradient-accumulation-steps 2 --learning-rate 1e-5 --sequence-length 2048 --precision bf16 --zero-stage 2 --enable-gradient-checkpointing true --model-name llama2-13b --flash-attention true --epoch 2 --train-dataset /mnt/llama2-datasets/wudao_train.json --validation-dataset /mnt/llama2-datasets/wudao_valid.json --pretrain-model-path /mnt/llama2-ckpts/Llama-2-13b-hf --finetune-output-path /mnt/output_llama2_finetune
```
#### Other models (qwen, chatglm, baichuan2, falcon, bloom, ...)
```
$ bash run_ds_train_huggingface_finetune.sh --env dsw --model-size 7B --micro-batch-size 1 --gradient-accumulation-steps 2 --learning-rate 1e-5 --sequence-length 2048 --precision bf16 --zero-stage 2 --enable-gradient-checkpointing true --model-name qwen-7b --flash-attention false --epoch 2 --train-dataset /mnt/qwen-datasets/wudao_train.json --validation-dataset /mnt/qwen-datasets/wudao_valid.json --pretrain-model-path /mnt/qwen-ckpts/qwen-7b-hf --finetune-output-path /mnt/output_qwen_7b_finetune
```

## Text Generation
Run the following commands inside a container started from the image `pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:vllm-0.2.6-v2`.
### Usage
```
$ python text_generation_huggingface.py --help
```
```
Usage: text_generation_huggingface.py \
    --checkpoint CHECKPOINT \
    --input-file INPUT_FILE \
    --output-file OUTPUT_FILE \
    [--cuda-visible-devices CUDA_VISIBLE_DEVICES] \
    [--output-max-tokens OUTPUT_MAX_TOKENS]
```
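
For orientation, a minimal sketch of the kind of generation call `text_generation_huggingface.py` wraps; this is not the script itself, and the checkpoint path, prompt, and token limit are placeholders:
```python
# Illustration only: a minimal Hugging Face Transformers generation loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "/mnt/llama2-ckpts/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"  # device_map="auto" requires `accelerate`
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)   # roughly what --output-max-tokens controls
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
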
```
$ python text_generation_vllm.py --help
```
```
Usage: text_generation_vllm.py \
    --checkpoint CHECKPOINT \
    --input-file INPUT_FILE \
    --output-file OUTPUT_FILE \
    [--tensor-parallel-size TENSOR_PARALLEL_SIZE] \
    [--output-max-tokens OUTPUT_MAX_TOKENS]
```
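
Likewise, a minimal sketch of the kind of vLLM call `text_generation_vllm.py` is built around; not the script itself, with placeholder paths and sampling settings:
```python
# Illustration only: a minimal vLLM generation call.
from vllm import LLM, SamplingParams

llm = LLM(model="/mnt/llama2-ckpts/Llama-2-13b-chat-hf", tensor_parallel_size=1)
sampling = SamplingParams(max_tokens=128)  # roughly what --output-max-tokens controls

for output in llm.generate(["Hello, how are you?"], sampling):
    print(output.outputs[0].text)
```
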
### Demo
#### Text generation with vLLM
```
$ python text_generation_vllm.py --checkpoint /mnt/llama2-ckpts/Llama-2-13b-chat-hf --input-file /mnt/llama2-datasets/wudao_valid.jsonl --output-file /mnt/llama2-datasets/wudao_valid_output.txt --tensor-parallel-size 1
```
#### Text generation with Hugging Face Transformers
```
$ python text_generation_huggingface.py --checkpoint /mnt/llama2-ckpts/Llama-2-13b-chat-hf --input-file /mnt/llama2-datasets/wudao_valid.jsonl --output-file /mnt/llama2-datasets/wudao_valid_output.txt --cuda-visible-devices 0
```
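
Both demos read prompts from a JSONL file (one JSON object per line). The exact field name the scripts expect is not documented here; purely as an illustration, assuming a hypothetical `text` field:
```python
# Illustration only: write a tiny JSONL prompt file. The field name ("text") is an
# assumption; check text_generation_vllm.py / text_generation_huggingface.py for the
# schema the scripts actually expect.
import json

prompts = ["Hello, how are you?", "Write a short poem about the sea."]
with open("toy_prompts.jsonl", "w", encoding="utf-8") as f:
    for p in prompts:
        f.write(json.dumps({"text": p}, ensure_ascii=False) + "\n")
```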