# Byte LLM Perf
## Requirements
* Python >= 3.8
* torch >= 2.1.0

## Installation
```shell
# modify according to torch version and hardware
pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# install required packages
pip3 install -r requirements.txt
```
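
After installing, you can run a quick sanity check that the torch build matches your hardware (the CUDA check applies to the GPU backend):
```python
import torch

print(torch.__version__)          # expect >= 2.1.0
print(torch.cuda.is_available())  # expect True when targeting the GPU backend
```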

## Quick Start (run accuracy and performance tests)
Please complete the installation steps above before proceeding:
1. Modify the task workload if needed, for example [chatglm2-torch-fp16-6b.json](https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/workloads/chatglm2-torch-fp16-6b.json)
2. Download the model weights using prepare_model.sh or huggingface-cli (see the sketch below).
3. Download the reference model output logits for specific input cases (.npy files) using prepare_model.sh.
4. Run the accuracy and performance tests.
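
For step 2, as an alternative to prepare_model.sh, weights can be pulled directly from the Hugging Face Hub. A minimal sketch using `huggingface_hub` (the `local_dir` below is an assumed layout; match it to your model_zoo config):
```python
from huggingface_hub import snapshot_download

# local_dir is an assumption; point it wherever your workload config expects weights.
snapshot_download(
    repo_id="THUDM/chatglm2-6b",
    local_dir="byte_infer_perf/llm_perf/model_zoo/sota/chatglm2-torch-fp16-6b",
)
```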

You can run the following command to automate all steps with the chatglm2 model on the GPU backend:
```shell
python3 byte_infer_perf/llm_perf/launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b 
```

## Test accuracy (single query with a specified prompt)
Launch a server running mixtral-8x22b (tp_size=8, max_batch_size=8) with the following command:
```shell
cd byte_infer_perf/llm_perf
python3 ./server/launch_server.py --hardware_type GPU --model_config ./model_zoo/mixtral-torch-bf16-8x22b.json --tp_size 8 --max_batch_size 8
```

Query the server with a single prompt to get the inference result, a logits numpy file, and the model forward time. Output files will be located in `./reports/single_query/`:
```shell
python3 ./script/single_query.py --prompt "What is 7 multiplied by 7?" --batch_size 8
```
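
The saved logits can then be inspected, or diffed against a reference run, with numpy. A minimal sketch (the exact `.npy` file name under `./reports/single_query/` is an assumption; check the directory for the real one):
```python
import numpy as np

# Hypothetical file name; list ./reports/single_query/ for the actual output.
logits = np.load("./reports/single_query/logits.npy")
print(logits.shape, logits.dtype)
```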

## Test model_impl forward performance
This test only needs to instantiate an MpEngine running mixtral-8x22b (tp_size=8, max_batch_size=8) and feed it proper inputs. Running the following command will produce performance outputs. Currently, you can modify the test cases in `./bench_model.py` (see the sketch below).
```shell
python3 ./bench_model.py --hardware_type GPU --model_config ./model_zoo/mixtral-torch-bf16-8x22b.json --tp_size 8 --max_batch_size 8
```
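
The "test cases" here amount to a grid of batch sizes and sequence lengths that the benchmark sweeps. A hypothetical illustration of such a sweep (these names are assumptions, not the actual contents of `bench_model.py`):
```python
# Hypothetical sweep definition; edit the real lists in ./bench_model.py.
batch_sizes = [1, 2, 4, 8]
seq_lens = [1024, 2048, 4096]

test_cases = [(bs, sl) for bs in batch_sizes for sl in seq_lens]
```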

The output will be located in `./reports/{hardware_type}/{model_config}/bench_model`:
- **config.json**: perf config
- **context_perf.csv**: prefill latency for each specified {batch_size, seq_len}
- **decode_perf.csv**: decode latency for each specified {batch_size, seq_len}
- **output.txt**: raw latency data
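
A minimal sketch for loading the latency tables afterwards, assuming pandas is available (the concrete report directory follows the `{hardware_type}/{model_config}` pattern above):
```python
import pandas as pd

# Directory name derived from the {hardware_type}/{model_config} pattern above.
report_dir = "./reports/GPU/mixtral-torch-bf16-8x22b/bench_model"
context = pd.read_csv(f"{report_dir}/context_perf.csv")  # prefill latency
decode = pd.read_csv(f"{report_dir}/decode_perf.csv")    # decode latency
print(context.head())
print(decode.head())
```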


## Demo Project
[GPU Backend](https://github.com/bytedance/ByteMLPerf/tree/main/byte_infer_perf/llm_perf/backends/GPU) provides a demo project that implements LLM inference for chatglm2-6b on A100 with the following features (see the sketch after this list):
- Separate functional components:
    * Scheduler
        - custom scheduling of tasks
    * Inferencer
        - translates tasks into real model inputs and collects outputs
    * Mp Engine
        - handles tensor-parallel (TP) logic using multiple processes
    * Sampler
        - postprocessing logic
    * Ckpt Loader
        - custom checkpoint loader whose weight-split logic matches the TP layout
    * Custom model implementation
        - custom model implementation using the hardware backend's torch realization
- Separate scheduling logic:
    * Context: one task, input_ids shape is [1, q_len]
    * Decode: multiple tasks, input_ids shape is up to [max_batch_size, 1]
- Tensor parallelism
- KV cache
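
As referenced above, a schematic sketch of how these components fit together; the class and method names are illustrative assumptions, not the demo's actual API:
```python
# Schematic only: shows the division of labor among the demo's components.
class Scheduler:
    def next_batch(self, tasks):
        # Context phase: one task, input_ids shape [1, q_len];
        # decode phase: batched tasks, input_ids up to [max_batch_size, 1].
        ...

class MpEngine:
    def forward(self, inputs):
        # Dispatch inputs to tp_size worker processes, run the custom
        # model implementation on each weight shard, gather the logits.
        ...

class Sampler:
    def sample(self, logits):
        # Postprocess logits into next tokens.
        ...

def serve_step(scheduler, engine, sampler, tasks):
    inputs = scheduler.next_batch(tasks)  # schedule + build model inputs
    logits = engine.forward(inputs)       # TP forward pass with KV cache
    return sampler.sample(logits)         # next-token postprocessing
```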

The demo project is intended as a reference implementation; there is no guarantee of optimal performance. More technical details will be provided later on [ByteMLPerf](https://bytemlperf.ai).


## Vendor Integration
Vendors can refer to this document for guidance on building a backend: [Byte LLM Perf](https://bytemlperf.ai/zh/guide/inference_llm_vendor.html)

## Models
The following models are planned to be supported:
* [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
* [shenzhi-wang/Llama3-70B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-70B-Chinese-Chat)
* [tiiuae/falcon-180B](https://huggingface.co/tiiuae/falcon-180B)
    - test_accuracy is temporarily unavailable.
* [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)
    - test_accuracy is temporarily unavailable.