# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/pytorch:24.07-py3
SHELL ["/bin/bash", "-c"]
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV TZ=US/Pacific
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN rm -rf /var/lib/apt/lists/* && rm -rf /etc/apt/sources.list.d/* \
&& apt update \
&& apt install -y --no-install-recommends build-essential autoconf \
libtool git ccache curl wget pkg-config sudo ca-certificates \
automake libssl-dev bc python3-dev python3-pip google-perftools \
gdb libglib2.0-dev clang sshfs libre2-dev libboost-dev \
libnuma-dev numactl sysstat sshpass ntpdate less iputils-ping \
&& apt -y autoremove \
&& apt remove -y cmake \
&& apt install -y --no-install-recommends pkg-config zip g++ zlib1g-dev \
unzip libarchive-dev
RUN apt install -y --no-install-recommends rsync
# Install setuptools
RUN python3 -m pip install --upgrade pip \
&& python3 -m pip install --upgrade setuptools wheel virtualenv
# Install conda
WORKDIR /tmp
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh \
&& bash Miniconda3-* -b -p /opt/miniconda3
ENV PATH="$PATH:/opt/miniconda3/bin"
RUN conda create -n llama3.1-405b python=3.10
RUN chmod -R 777 /opt/miniconda3
# Set the env variable for vLLM
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
# Reference Implementation for llama3.1-405b
**Basic implementation for llama3.1-405b. A few noteworthy items:**
+ The streamer used for communicating with loadgen has quite some overhead. This implementation is only meant to be functional, not fast.
+ For custom/optimized implementations of this benchmark it is important to include the following:
- For the Server scenario, it is necessary to call `lg.FirstTokenComplete(response)` for each query. This way the first token will be reported and its latency will be measured.
- For all scenarios, when calling `lg.QuerySamplesComplete(response)`, it is necessary that each element in `response` is a `lg.QuerySampleResponse` that contains the number of tokens (it can be created this way: `lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)`). The number of tokens reported should match the number of tokens in your answer; this is checked in [TEST06](../../compliance/nvidia/TEST06/). A minimal sketch follows this list.
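For reference, here is a minimal sketch (placeholders only, not part of the reference code; `qitem` stands in for a loadgen query sample) of reporting a completed sample with its token count, mirroring the pattern in the SUT implementation below:
```python
import array
from types import SimpleNamespace

import numpy as np
import mlperf_loadgen as lg

qitem = SimpleNamespace(id=0)  # placeholder for a loadgen QuerySample
output_tokens = [42, 17, 9]    # placeholder: token IDs generated for this sample

n_tokens = len(output_tokens)
response_array = array.array("B", np.array(output_tokens, np.int32).tobytes())
bi = response_array.buffer_info()  # (buffer address, length)
# Pass n_tokens so the reported token count can be verified in TEST06.
lg.QuerySamplesComplete(
    [lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)])
```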
## Automated command to run the benchmark via MLCommons CM
Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/llama3_1-405b/) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker.
You can also do `pip install cm4mlops` and then use `cm` commands to download the model and datasets, as shown in the later sections.
## Prepare environment
### Local Environment Run
The following steps were tested on Ubuntu 22.04 with Python 3.10.
- **Prerequisite for GPU runs:** Install the NVIDIA driver and CUDA 12.1.
The following links contain the commands for installing the [NVIDIA Driver](https://developer.nvidia.com/datacenter-driver-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local) and [CUDA](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local).
- **Prerequisite:** Install conda.
```bash
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init
```
- Set the following helper variables
```bash
export ROOT=$PWD/inference
export LLAMA_FOLDER=$PWD/inference/language/llama3.1-405b
export LOADGEN_FOLDER=$PWD/inference/loadgen
export DATASET_FOLDER=$PWD/inference/language/llama3.1-405b/dataset
```
- Clone the inference repository:
```bash
git clone --recurse-submodules https://github.com/mlcommons/inference.git \
--depth 1
```
- Create a conda environment:
```bash
conda create -y -n llama3.1-405b python=3.10
conda activate llama3.1-405b
conda install -y -c conda-forge libstdcxx-ng=12
```
- Install requirements and loadgen:
```bash
cd $LLAMA_FOLDER
# Install packages
pip install -r requirements.txt
```
```bash
cd $LOADGEN_FOLDER
pip install -e .
```
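To sanity-check that loadgen installed correctly (an optional step, not part of the reference instructions):
```bash
python -c "import mlperf_loadgen; print('loadgen OK')"
```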
### Docker Run
A Dockerfile is provided, along with scripts to help launch it. First, add any docker volume mounts you want in
`launch.sh`. There is a section at the top of the file that looks like:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
```
For example if you have a raid space located at `/raid/data` on your local machine, you can add it to the same path in the container like so:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
/raid/data:/raid/data
)
```
Once you have added all your mounts, build and launch the container with `bash launch.sh`.
Now install all the dependencies:
```
pip install -r requirements.txt
pip install -e ../../loadgen
```
## Get Model
### MLCommons Members Download
TODO: Host model and grant access to submitters
### External Download
+ First go to [llama3.1-request-link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and make a request, then sign in to Hugging Face (if you don't have an account, you'll need to create one). **Please note your authentication credentials**, as you may be required to provide them when cloning below.
+ Requires Git Large File Storage (git-lfs).
```
export CHECKPOINT_PATH=Meta-Llama-3.1-405B-Instruct
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct ${CHECKPOINT_PATH}
cd ${CHECKPOINT_PATH} && git checkout be673f326cab4cd22ccfef76109faf68e41aa5f1
```
### Download model through CM (Collective Mind)
```
cm run script --tags=get,ml-model,llama3 --outdirname=${CHECKPOINT_PATH} --hf_token=<huggingface access token> -j
```
**Note:**
Downloading llama3.1-405B model from Hugging Face will require an [**access token**](https://huggingface.co/settings/tokens) which could be generated for your account. Additionally, ensure that your account has access to the [llama3.1-405B](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model.
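If you work from the command line, one way to register the token locally (assuming the `huggingface_hub` package is installed) is:
```bash
huggingface-cli login   # paste the access token when prompted
```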
## Get Dataset
### Preprocessed
You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.
To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_405b/mlperf_llama3.1_405b_dataset_8313_processed_fp16_eval.pkl ./ -P
```
**CM Command**
```
cm run script --tags=get,dataset,mlperf,inference,llama3,_validation --outdirname=<path to download> -j
```
You can also download the calibration dataset from the Cloudflare R2 bucket by running the following command:
```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_405b/mlperf_llama3.1_405b_calibration_dataset_512_processed_fp16_eval.pkl ./ -P
```
**CM Command**
```
cm run script --tags=get,dataset,mlperf,inference,llama3,_calibration --outdirname=<path to download> -j
```
## Run Performance Benchmarks
### Offline
```
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
```
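The command above references `${GPU_COUNT}`, which this README does not set; one way to set it, assuming `nvidia-smi` is available, is:
```bash
export GPU_COUNT=$(nvidia-smi -L | wc -l)   # number of visible GPUs
```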
### Server
```
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
```
The ServerSUT was not tested for GPU runs.
## Run Accuracy Benchmarks
### Offline
```
OUTPUT_LOG_DIR=offline-accuracy-logs
mkdir -p "run_outputs" # The script will dump all the outputs to 'run_outputs'.
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--accuracy \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
```
For the GPU run, the above steps have been automated in `run_accuracy.sh`. Note that this implementation runs through vLLM, and `main.py` does not expose a `--device` flag.
### Server
```
OUTPUT_LOG_DIR=server-accuracy-logs
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--accuracy \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
```
The ServerSUT was not tested for GPU runs.
### Evaluate the accuracy using CM
You can also evaluate the accuracy from the generated accuracy log by using the following CM command:
```
cm run script --tags=process,mlperf,accuracy,_dataset_llama3 --result_dir=<Path to accuracy log directory>
```
## Accuracy Target
Running the GPU implementation in FP16 precision resulted in the following accuracy targets:
```
{
'rougeL': 21.6666,
'exact_match': 90.1335,
'tokens_per_sample': 684.68,
}
```
The accuracy target is 99% of the reference value for rougeL and exact_match, and 90% for tokens_per_sample.
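As a worked example (a sketch of the statement above, not part of the reference scripts), the acceptance check amounts to:
```python
# Reference values from the FP16 run above; 99% floor for rougeL and
# exact_match, 90% floor for tokens_per_sample.
reference = {"rougeL": 21.6666, "exact_match": 90.1335,
             "tokens_per_sample": 684.68}
floors = {"rougeL": 0.99, "exact_match": 0.99, "tokens_per_sample": 0.90}

def passes(result):
    """result: dict of metrics as printed by evaluate-accuracy.py."""
    return all(result[k] >= floors[k] * reference[k] for k in reference)

print(passes({"rougeL": 21.55, "exact_match": 89.5,
              "tokens_per_sample": 650.0}))  # True
```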
import asyncio
import os
import time
import numpy as np
import array
import torch
from torch.nn.functional import pad
from vllm import LLM, AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.inputs import TokensPrompt
import pickle
import threading
import tqdm
import queue
import logging
from typing import TYPE_CHECKING, Optional, List
from pathlib import Path
import mlperf_loadgen as lg
from dataset import Dataset
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Llama-405B-SUT")
class SUT:
def __init__(
self,
model_path=None,
dtype="bfloat16",
batch_size=None,
total_sample_count=8313,
dataset_path=None,
use_cached_outputs=False,
# Set this to True *only for test accuracy runs* in case your prior
# session was killed partway through
workers=1,
tensor_parallel_size=8
):
self.model_path = model_path or f"Meta-Llama-3.1-405B-Instruct{'-FP8' if dtype == 'float8' else ''}"
if not batch_size:
batch_size = 1
self.batch_size = batch_size
self.dtype = dtype
self.tensor_parallel_size = tensor_parallel_size
assert torch.cuda.is_available(), "torch gpu is not available, exiting..."
self.dataset_path = dataset_path
self.data_object = Dataset(
self.model_path,
dataset_path=self.dataset_path,
total_sample_count=total_sample_count,
dtype=dtype
)
self.qsl = lg.ConstructQSL(
self.data_object.total_sample_count,
self.data_object.perf_count,
self.data_object.LoadSamplesToRam,
self.data_object.UnloadSamplesFromRam,
)
self.load_model()
gen_kwargs = {
"temperature": 1,
"top_p": 1,
"top_k": 1,
"seed": 42,
"max_tokens": 20000,
"min_tokens": 2
}
self.sampling_params = SamplingParams(**gen_kwargs)
# self.sampling_params.all_stop_token_ids.add(self.model.get_tokenizer().eos_token_id)
self.num_workers = workers
self.worker_threads = [None] * self.num_workers
self.query_queue = queue.Queue()
self.use_cached_outputs = use_cached_outputs
self.sample_counter = 0
self.sample_counter_lock = threading.Lock()
def start(self):
# Create worker threads
for j in range(self.num_workers):
worker = threading.Thread(target=self.process_queries)
worker.start()
self.worker_threads[j] = worker
def stop(self):
for _ in range(self.num_workers):
self.query_queue.put(None)
for worker in self.worker_threads:
worker.join()
def process_queries(self):
"""Processor of the queued queries. User may choose to add batching logic"""
while True:
qitem = self.query_queue.get()
if qitem is None:
break
query_ids = [q.index for q in qitem]
tik1 = time.time()
input_ids_tensor = [
self.data_object.input_ids[q.index] for q in qitem]
tik2 = time.time()
outputs = self.model.generate(
prompt_token_ids=input_ids_tensor, sampling_params=self.sampling_params
)
pred_output_tokens = []
for output in outputs:
pred_output_tokens.append(list(output.outputs[0].token_ids))
tik3 = time.time()
processed_output = self.data_object.postProcess(
pred_output_tokens,
query_id_list=query_ids,
)
for i in range(len(qitem)):
n_tokens = processed_output[i].shape[0]
response_array = array.array(
"B", processed_output[i].tobytes())
bi = response_array.buffer_info()
response = [
lg.QuerySampleResponse(
qitem[i].id,
bi[0],
bi[1],
n_tokens)]
lg.QuerySamplesComplete(response)
tok = time.time()
with self.sample_counter_lock:
self.sample_counter += len(qitem)
log.info(f"Samples run: {self.sample_counter}")
if tik1:
log.info(f"\tBatchMaker time: {tik2 - tik1}")
log.info(f"\tInference time: {tik3 - tik2}")
log.info(f"\tPostprocess time: {tok - tik3}")
log.info(f"\t==== Total time: {tok - tik1}")
def load_model(self):
log.info("Loading model...")
self.model = LLM(
self.model_path,
dtype=self.dtype,
tensor_parallel_size=self.tensor_parallel_size,
)
log.info("Loaded model")
def get_sut(self):
self.sut = lg.ConstructSUT(self.issue_queries, self.flush_queries)
return self.sut
def get_qsl(self):
return self.qsl
def predict(self, **kwargs):
raise NotImplementedError
def issue_queries(self, query_samples):
"""Receives samples from loadgen and adds them to queue. Users may choose to batch here"""
list_prompts_tokens = []
list_prompts_attn_masks = []
log.info(f"IssueQuery started with {len(query_samples)} samples")
while len(query_samples) > 0:
self.query_queue.put(query_samples[: self.batch_size])
query_samples = query_samples[self.batch_size:]
log.info(f"IssueQuery done")
def flush_queries(self):
pass
def __del__(self):
pass
class SUTServer(SUT):
def __init__(
self,
model_path=None,
dtype="bfloat16",
total_sample_count=8313,
dataset_path=None,
batch_size=None,
workers=1,
tensor_parallel_size=8
):
super().__init__(
model_path=model_path,
dtype=dtype,
total_sample_count=total_sample_count,
dataset_path=dataset_path,
workers=workers,
tensor_parallel_size=tensor_parallel_size,
)
self.request_id = 0
self.first_token_queue = queue.Queue()
def start(self):
# Create worker threads
for j in range(self.num_workers):
worker = threading.Thread(target=self.process_queries)
worker.start()
self.worker_threads[j] = worker
async def stream_output(self, qitem, results_generator):
first = True
async for request_output in results_generator:
output_response = request_output
if first:
first_tokens = list(output_response.outputs[0].token_ids)
response_data = array.array(
"B", np.array(first_tokens, np.int32).tobytes())
bi = response_data.buffer_info()
response = [lg.QuerySampleResponse(qitem.id, bi[0], bi[1])]
lg.FirstTokenComplete(response)
first = False
outputs = output_response
pred_output_tokens = list(output_response.outputs[0].token_ids)
n_tokens = len(pred_output_tokens)
response_array = array.array(
"B", np.array(pred_output_tokens, np.int32).tobytes()
)
bi = response_array.buffer_info()
response = [
lg.QuerySampleResponse(
qitem.id,
bi[0],
bi[1],
n_tokens)]
lg.QuerySamplesComplete(response)
def process_queries(self):
"""Processor of the queued queries. User may choose to add batching logic"""
while True:
qitem = self.query_queue.get()
if qitem is None:
break
input_ids_tensor = TokensPrompt(
prompt_token_ids=self.data_object.input_ids[qitem.index])
# TODO: This PoC is super slow with significant overhead. Best to
# create a patch to `generate`
results_generator = self.model.generate(
prompt=input_ids_tensor, sampling_params=self.sampling_params, request_id=str(
self.request_id)
)
self.request_id += 1
asyncio.run(self.stream_output(qitem, results_generator))
def issue_queries(self, query_samples):
self.query_queue.put(query_samples[0])
def stop(self):
for _ in range(self.num_workers):
self.query_queue.put(None)
for worker in self.worker_threads:
worker.join()
# No first-token response thread is created by this class; first tokens
# are reported inline from stream_output(), so there is nothing to join.
def load_model(self):
log.info("Loading model")
self.engine_args = AsyncEngineArgs(
self.model_path,
dtype=self.dtype,
tensor_parallel_size=self.tensor_parallel_size)
self.model = AsyncLLMEngine.from_engine_args(self.engine_args)
log.info("Loaded model")
set -e
conda install pybind11==2.10.4 -c conda-forge -y
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia
python -m pip install transformers==4.31.0 nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0
cd ../../loadgen && python3 -m pip install .
import random
import os
import time
import numpy as np
import torch
from datasets import load_dataset, load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn.functional import pad
from torch.utils.data import DataLoader
from typing import Optional, Dict, Sequence
import io
# import utils
import copy
import pickle
import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Llama-405B-Dataset")
class Dataset:
def __init__(
self,
model_name=None,
total_sample_count=8313,
perf_count_override=None,
dataset_path=None,
dtype="bfloat16"
):
self.model_name = model_name or f"Meta-Llama-3.1-405B-Instruct{'-FP8' if dtype == 'float8' else ''}"
self.dataset_path = dataset_path
# self.total_sample_count = total_sample_count
self.load_processed_dataset()
self.total_sample_count = min(len(self.input_ids), total_sample_count)
self.perf_count = perf_count_override or self.total_sample_count
def load_processed_dataset(self):
if not os.path.isfile(self.dataset_path):
log.warning(
"Processed pickle file {} not found. Please check that the path is correct".format(
self.dataset_path
)
)
log.info("Loading dataset...")
import pandas as pd
self.processed_data = pd.read_pickle(self.dataset_path)
self.input = self.processed_data.input.tolist()
self.input_ids = self.processed_data.tok_input.tolist()
self.input_lens = self.processed_data.tok_input_len.tolist()
log.info("Finished loading dataset.")
def postProcess(
self,
out_tokens,
query_id_list=None,
sample_index_list=None,
):
"""Postprocesses output prediction"""
# TODO: Create response object in postProcess(?)
"""
preds = []
for i in range(out_tokens.shape[0]):
#pred = out_tokens[i].reshape(-1).cpu().numpy() # Slice up to original input length as below?
input_len = input_seq_lens[i] if input_seq_lens else 0
pred = out_tokens[i, input_len:].reshape(-1).cpu().numpy()
preds.append(pred)
"""
# vLLM returns only the generated tokens (no input padding to prune), so
# just hand the raw token lists through to numpy below.
output_seq = out_tokens
assert len(query_id_list) == len(output_seq)
return [np.asarray(out, dtype=np.int32) for out in output_seq]
def LoadSamplesToRam(self, sample_list):
pass
def UnloadSamplesFromRam(self, sample_list):
pass
def __del__(self):
pass
import argparse
from transformers import AutoTokenizer
import nltk
from multiprocessing import Pool, cpu_count
from tqdm import tqdm
import numpy as np
import pandas as pd
import json
import re
from rouge_score import rouge_scorer
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--checkpoint-path",
default="meta-llama/Meta-Llama-3-8B",
help="Path to Llama3.1-405b-hf-chat checkpoint"
)
parser.add_argument(
"--mlperf-accuracy-file", required=True, help="path to mlperf_log_accuracy.json"
)
parser.add_argument(
"--dataset-file",
required=True,
help="path to processed dataset set",
)
parser.add_argument(
"--verbose",
action="store_true",
help="verbose messages")
parser.add_argument(
"--dtype",
default="int64",
help="dtype of the accuracy log",
choices=["int32", "int64", "float"],
)
args = parser.parse_args()
return args
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
def rouge(label, pred):
score = scorer.score(label, pred)
return {
'rougeL': 100 * score['rougeL'].fmeasure,
}
def niah_em(label, pred):
label_uuids = re.findall(
r'[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}', label)
pred_uuids = re.findall(r'[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}', pred)
if len(pred_uuids) == 0:
return {'exact_match': 0.0}
# https://github.com/hsiehjackson/RULER/blob/main/scripts/eval/synthetic/constants.py#L28
score = sum([
sum([1.0 if r.lower() in pred.lower() else 0.0 for r in ref]) / len(ref)
for pred, ref in zip(pred_uuids, label_uuids)
]) / len(pred_uuids) * 100
return {'exact_match': round(score, 2)}
def qa_em(label, pred):
answer_substring = pred
if 'Answer: ' in pred:
last_answer_index = pred.rfind("Answer: ")
if last_answer_index == -1:
return {'exact_match': 0.0}
answer_substring = pred[last_answer_index + len("Answer: "):]
if answer_substring in label:
return {'exact_match': 100.0}
normalized_answer = re.sub(r'\s+', '', answer_substring).lower()
label_entries = [re.sub(r'\s+', '', entry).lower()
for entry in label.split('|')]
match_found = any(entry in normalized_answer for entry in label_entries)
return {'exact_match': 100.0 if match_found else 0.0}
metrics = {
fn.__name__: fn
for fn in [rouge, niah_em, qa_em]
}
def get_groundtruth(processed_dataset_file, return_metrics=True):
data = pd.read_pickle(processed_dataset_file)
ground_truths = data["gt_output"]
if return_metrics:
metrics = data["metric"]
return ground_truths, metrics
return ground_truths
def postprocess_text(preds, targets):
preds = [pred.strip() for pred in preds]
targets = [target.strip() for target in targets]
# rougeLSum expects newline after each sentence
preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
targets = ["\n".join(nltk.sent_tokenize(target)) for target in targets]
return preds, targets
def process_item(item):
pred, target, metric = item
metric_fn = metrics[metric]
metric_eval = metric_fn(target, pred)
return metric_eval
def run_evaluation(preds, targets, metrics, n_process=None):
n_process = cpu_count() if n_process is None else n_process
with Pool(n_process) as pool:
accuracies = list(
tqdm(
pool.imap(
process_item, zip(
preds, targets, metrics)), total=len(preds)))
df = pd.DataFrame({"accuracy": accuracies, "metric": metrics})
return df.accuracy.apply(pd.Series).describe().loc["mean"].to_dict()
def main():
args = get_args()
dataset_path = args.dataset_file
checkpoint_path = args.checkpoint_path
nltk.download("punkt")
nltk.download('punkt_tab')
tokenizer = AutoTokenizer.from_pretrained(
checkpoint_path,
model_max_length=22000,
padding_side="left",
use_fast=False,
)
targets, metrics = get_groundtruth(args.dataset_file)
target_required = []
metrics_required = []
preds_token_ids = []
eval_dtype = np.int64
if args.dtype == "int32":
eval_dtype = np.int32
elif args.dtype == "float":
eval_dtype = np.float32
with open(args.mlperf_accuracy_file, "r") as f:
results = json.load(f)
seen = set()
gen_tok_len = 0
for pred in results:
qsl_idx = pred["qsl_idx"]
if qsl_idx in seen:
continue
seen.add(qsl_idx)
target_required.append(targets[qsl_idx])
metrics_required.append(metrics[qsl_idx])
pred = np.frombuffer(bytes.fromhex(pred["data"]), eval_dtype)
gen_tok_len += len(pred)
preds_token_ids.append(pred)
preds_decoded_text = tokenizer.batch_decode(
preds_token_ids, skip_special_tokens=True
)
preds, targets = postprocess_text(preds_decoded_text, target_required)
result = run_evaluation(preds, targets, metrics_required)
result = dict(result)
prediction_lens = [len(pred) for pred in preds]
gen_num = len(preds)
result = {
**result,
"gen_len": np.sum(prediction_lens),
"gen_num": gen_num,
"gen_tok_len": gen_tok_len,
"tokens_per_sample": round(gen_tok_len / gen_num, 1),
}
print("\nResults\n")
print(result)
if __name__ == "__main__":
main()
#!/bin/bash
MLCOMMONS_REPO_PATH="$(dirname "$(dirname "$PWD")")"
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
# Set up docker environment file for current user
rm -f .docker_env
echo "CI_BUILD_USER=`id -u -n`" >> .docker_env
echo "CI_BUILD_UID=`id -u`" >> .docker_env
echo "CI_BUILD_GROUP=`id -g -n`" >> .docker_env
echo "CI_BUILD_GID=`id -g`" >> .docker_env
cat .docker_env
# Build container
docker build . -t llm/gpubringup
# Build mount flags
declare -a MOUNT_FLAGS
for _mount in ${MOUNTS[@]}; do
_split=($(echo $_mount | tr ':' '\n'));
MOUNT_FLAGS+=("--mount type=bind,source=${_split[0]},target=${_split[1]}");
done
set -x
nvidia-docker run -it --rm --net=host --runtime=nvidia --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--cap-add=SYS_PTRACE --cap-add=SYS_ADMIN --cap-add=DAC_READ_SEARCH \
--security-opt seccomp=unconfined \
-w $PWD \
--env-file `pwd`/.docker_env \
${MOUNT_FLAGS[*]} \
llm/gpubringup \
bash ./with_the_same_user
import subprocess
import mlperf_loadgen as lg
import argparse
import os
import logging
import sys
import requests
import json
sys.path.insert(0, os.getcwd())
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Llama-405B-MAIN")
# function to check the model name in server matches the user specified one
def verify_model_name(user_specified_name, url):
response = requests.get(url)
if response.status_code == 200:
response_dict = response.json()
server_model_name = response_dict["data"][0]["id"]
if user_specified_name == server_model_name:
return {"matched": True, "error": False}
else:
return {
"matched": False,
"error": f"User specified {user_specified_name} and server model name {server_model_name} mismatch!",
}
else:
return {
"matched": False,
"error": f"Failed to get a valid response. Status code: {response.status_code}",
}
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--scenario",
type=str,
choices=["Offline", "Server"],
default="Offline",
help="Scenario",
)
parser.add_argument(
"--model-path",
type=str,
default="Meta-Llama-3.1-405B-Instruct",
help="Model name",
)
parser.add_argument("--dataset-path", type=str, default=None, help="")
parser.add_argument(
"--accuracy",
action="store_true",
help="Run accuracy mode")
parser.add_argument(
"--dtype",
type=str,
default="float32",
help="data type of the model, choose from float16, bfloat16 and float32",
)
parser.add_argument(
"--audit-conf",
type=str,
default="audit.conf",
help="audit config for LoadGen settings during compliance runs",
)
parser.add_argument(
"--user-conf",
type=str,
default="user.conf",
help="user config for user LoadGen settings such as target QPS",
)
# TODO: This interpretation of 'total-sample-count' is a little
# misleading. Fix it
parser.add_argument(
"--total-sample-count",
type=int,
default=8313,
help="Number of samples to use in benchmark.",
)
parser.add_argument(
"--batch-size",
type=int,
default=1,
help="Model batch-size to use in benchmark.",
)
parser.add_argument(
"--output-log-dir", type=str, default="output-logs", help="Where logs are saved"
)
parser.add_argument(
"--enable-log-trace",
action="store_true",
help="Enable log tracing. This file can become quite large",
)
parser.add_argument(
"--num-workers",
type=int,
default=1,
help="Number of workers to process queries",
)
parser.add_argument(
"--tensor-parallel-size",
type=int,
default=8,
help="Number of workers to process queries",
)
parser.add_argument("--vllm", action="store_true", help="vllm mode")
parser.add_argument(
"--api-model-name",
type=str,
default="Meta-Llama-3.1-405B-Instruct",
help="Model name(specified in llm server)",
)
parser.add_argument(
"--api-server",
type=str,
default=None,
help="Specify an api endpoint call to use api mode",
)
args = parser.parse_args()
return args
scenario_map = {
"offline": lg.TestScenario.Offline,
"server": lg.TestScenario.Server,
}
def main():
args = get_args()
settings = lg.TestSettings()
settings.scenario = scenario_map[args.scenario.lower()]
# mlperf.conf is automatically loaded by the loadgen
# settings.FromConfig(args.mlperf_conf, "llama3_1-405b", args.scenario)
settings.FromConfig(args.user_conf, "llama3_1-405b", args.scenario)
if args.accuracy:
settings.mode = lg.TestMode.AccuracyOnly
else:
settings.mode = lg.TestMode.PerformanceOnly
os.makedirs(args.output_log_dir, exist_ok=True)
log_output_settings = lg.LogOutputSettings()
log_output_settings.outdir = args.output_log_dir
log_output_settings.copy_summary_to_stdout = True
log_settings = lg.LogSettings()
log_settings.log_output = log_output_settings
log_settings.enable_trace = args.enable_log_trace
if args.vllm:
from SUT_VLLM import SUT, SUTServer
else:
raise NotImplementedError
sut_map = {"offline": SUT, "server": SUTServer}
sut_cls = sut_map[args.scenario.lower()]
if args.vllm:
sut = sut_cls(
model_path=args.model_path,
dtype=args.dtype,
batch_size=args.batch_size,
dataset_path=args.dataset_path,
total_sample_count=args.total_sample_count,
workers=args.num_workers,
tensor_parallel_size=args.tensor_parallel_size
)
else:
sut = sut_cls(
model_path=args.model_path,
dtype=args.dtype,
batch_size=args.batch_size,
dataset_path=args.dataset_path,
total_sample_count=args.total_sample_count,
workers=args.num_workers,
)
# Start sut before loadgen starts
sut.start()
lgSUT = lg.ConstructSUT(sut.issue_queries, sut.flush_queries)
log.info("Starting Benchmark run")
lg.StartTestWithLogSettings(
lgSUT,
sut.qsl,
settings,
log_settings,
args.audit_conf)
# Stop sut after completion
sut.stop()
log.info("Run Completed!")
log.info("Destroying SUT...")
lg.DestroySUT(lgSUT)
log.info("Destroying QSL...")
lg.DestroyQSL(sut.qsl)
if __name__ == "__main__":
main()
transformers==4.46.2
nltk==3.8.1
evaluate==0.4.0
absl-py==1.4.0
rouge-score==0.1.2
sentencepiece==0.2.0
accelerate==0.21.0
vllm==0.6.3
pybind11==2.10.4
CHECKPOINT_PATH="${CHECKPOINT_PATH:-Meta-Llama-3.1-405B-Instruct}"
DATASET_PATH="${DATASET_PATH:-mlperf_llama3.1_405b_dataset_8318.pkl}"
mkdir -p "run_outputs"
python3 -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--accuracy \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir offline_accuracy_loadgen_logs \
--vllm \
--dtype float32 | tee offline_accuracy_log.log
# Note: mlperf.conf is loaded automatically by loadgen; main.py accepts no
# --mlperf-conf flag.
python3 evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file offline_accuracy_loadgen_logs/mlperf_log_accuracy.json \
--dataset-file ${DATASET_PATH} \
--dtype int32
CHECKPOINT_PATH="${CHECKPOINT_PATH:-Meta-Llama-3.1-405B-Instruct}"
DATASET_PATH="${DATASET_PATH:-mlperf_llama3.1_405b_dataset_8318.pkl}"
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm 2>&1 | tee offline.log
CHECKPOINT_PATH="${CHECKPOINT_PATH:-Meta-Llama-3.1-405B-Instruct}"
DATASET_PATH="${DATASET_PATH:-mlperf_llama3.1_405b_dataset_8318.pkl}"
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm 2>&1 | tee server.log
# The format of this config file is 'key = value'.
# The key has the format 'model.scenario.key'. Value is mostly int64_t.
# Model may be '*' as a wildcard. In that case the value applies to all models.
# All times are in milliseconds.
#
*.Offline.min_duration = 600000
*.Offline.min_query_count = 2000
*.Server.target_qps = 0.5
*.Server.min_duration = 120000
*.Server.min_query_count = 100
llama3_1-405b.Server.sample_concatenate_permutation = 1
#!/usr/bin/env bash
# wkong: manually set the user info in env first
set -ex
if [ -z "$@" ]; then
COMMAND=(bash)
else
COMMAND=("$@")
fi
apt-get update && apt-get install -y sudo
getent group "${CI_BUILD_GID}" || addgroup --gid "${CI_BUILD_GID}" "${CI_BUILD_GROUP}"
getent passwd "${CI_BUILD_UID}" || adduser --gid "${CI_BUILD_GID}" --uid "${CI_BUILD_UID}" --gecos "${CI_BUILD_USER} (generated by with_the_same_user script)" --disabled-password --quiet "${CI_BUILD_USER}"
usermod -a -G dip "${CI_BUILD_USER}"
usermod -a -G sudo "${CI_BUILD_USER}"
usermod -a -G root "${CI_BUILD_USER}"
echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
sudo -H -u "#${CI_BUILD_UID}" --preserve-env \
PATH="${PATH}" \
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
PYTHONPATH="${PYTHONPATH}" \
"${COMMAND[@]}"
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/pytorch:24.07-py3
SHELL ["/bin/bash", "-c"]
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV TZ=US/Pacific
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN rm -rf /var/lib/apt/lists/* && rm -rf /etc/apt/sources.list.d/* \
&& apt update \
&& apt install -y --no-install-recommends build-essential autoconf \
libtool git ccache curl wget pkg-config sudo ca-certificates \
automake libssl-dev bc python3-dev python3-pip google-perftools \
gdb libglib2.0-dev clang sshfs libre2-dev libboost-dev \
libnuma-dev numactl sysstat sshpass ntpdate less iputils-ping \
&& apt -y autoremove \
&& apt remove -y cmake \
&& apt install -y --no-install-recommends pkg-config zip g++ zlib1g-dev \
unzip libarchive-dev
RUN apt install -y --no-install-recommends rsync
# Install setuptools
RUN python3 -m pip install --upgrade pip \
&& python3 -m pip install --upgrade setuptools wheel virtualenv
# Install conda
WORKDIR /tmp
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh \
&& bash Miniconda3-* -b -p /opt/miniconda3
ENV PATH="$PATH:/opt/miniconda3/bin"
RUN conda create -n llm python=3.10
RUN chmod -R 777 /opt/miniconda3
# Use Ubuntu 22.04 as the base image
FROM ubuntu:22.04
ARG DEBIAN_FRONTEND=noninteractive
# Update package lists
RUN apt-get update
# Install Python 3 and pip
RUN apt-get install -y python3 python3-pip git
# Set Python 3 as the default python interpreter
RUN ln -s /usr/bin/python3 /usr/bin/python
# Verify installation
RUN python --version
RUN pip --version
RUN git --version
# Install requirements
RUN pip install transformers nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0 huggingface_hub[cli]
# Clone and install mxeval
RUN git clone https://github.com/amazon-science/mxeval.git
RUN pip install -e mxeval
# Get language dependencies
RUN apt install -y wget
# Ruby
RUN apt install -y curl libssl-dev libreadline-dev zlib1g-dev autoconf bison build-essential libyaml-dev libreadline-dev libncurses5-dev libffi-dev libgdbm-dev
# PHP
RUN apt install -y software-properties-common ca-certificates lsb-release apt-transport-https
RUN add-apt-repository ppa:ondrej/php
RUN apt-get update
RUN apt install -y php8.0
# RUN apt install -y php-{pear,cgi,pdo,common,curl,mbstring,gd,mysqlnd,gettext,bcmath,json,xml,fpm,intl,zip}
# JAVA
RUN apt-get install -y openjdk-8-jdk
# JAVASCRIPT
RUN apt install -y npm
# SCALA
RUN apt-get install -y scala
# C#
RUN apt-get install -y dotnet6
# Kotlin
RUN apt install -y zip unzip
SHELL ["/bin/bash", "-c"]
WORKDIR "/mxeval"
RUN sed -i 's/sudo//g' /mxeval/language_setup/ubuntu.sh
RUN sed -i 's/source/PS1=1 source/g' /mxeval/language_setup/ubuntu.sh # Need this to make sure that the "source ~/.bashrc" lines work correctly
RUN sed -i 's/npx tsc/tsc/g' /mxeval/mxeval/execution.py # npx tsc runs into permission issues
RUN PATH="$HOME/.rbenv/bin:$PATH" bash /mxeval/language_setup/ubuntu.sh
WORKDIR "/"
CMD bash
# Reference Implementation for Mixtral-8x7B-instruct-v0.1
**Basic implementation for Mixtral-8x7B-instruct-v0.1. A few noteworthy items:**
+ The dataset was constructed by randomly sampling 5K samples from the validation split of each of 3 datasets: open_orca_gpt4, GSM8K, and MBXP.
+ The streamer used for communicating with loadgen has quite some overhead. This implementation is only meant to be functional, not fast.
+ For custom/optimized implementations of this benchmark it is important to include the following:
- For the Server scenario, it is necessary to call `lg.FirstTokenComplete(response)` for each query (see the sketch after this list). This way the first token will be reported and its latency will be measured.
- For all scenarios, when calling `lg.QuerySamplesComplete(response)`, it is necessary that each element in `response` is a `lg.QuerySampleResponse` that contains the number of tokens (it can be created this way: `lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)`). The number of tokens reported should match the number of tokens in your answer; this is checked in [TEST06](../../compliance/nvidia/TEST06/).
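For reference, a minimal sketch (placeholders only, not part of the reference code) of reporting the first token in the Server scenario, mirroring `process_first_tokens` in the SUT implementation below:
```python
import array

import numpy as np
import mlperf_loadgen as lg

first_tokens = [42]  # placeholder: the first generated token ID
response_id = 0      # placeholder: the loadgen query sample id
response_data = array.array("B", np.array(first_tokens, np.int32).tobytes())
bi = response_data.buffer_info()  # (buffer address, length)
# Report the first token so its latency can be measured.
lg.FirstTokenComplete([lg.QuerySampleResponse(response_id, bi[0], bi[1])])
```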
Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/mixtral-8x7b) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker.
## Prepare environment
For a CPU-only run:
```
conda create -n Mixtral-8x7B python=3.9
conda activate Mixtral-8x7B
# Install packages
conda install pybind11==2.10.4 -c conda-forge -y
python -m pip install torch==2.2.0.dev20231006+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers==4.31.0 nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0
pip install git+https://github.com/amazon-science/mxeval.git@e09974f990eeaf0c0e8f2b5eaff4be66effb2c86
export CUR_DIR=${PWD}
cd <inference-repo-root>/loadgen
python -m pip install .
```
For a GPU-based run:
A dockerfile is provided, along with scripts to help launch it. First, add any docker volume mounts you want in
`launch.sh`. There is a section at the top of the file that looks like:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
```
For example if you have a raid space located at `/raid/data` on your local machine, you can add it to the same path in the container like so:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
/raid/data:/raid/data
)
```
Once you have added all your mounts, launch the container with `bash launch.sh`.
Inside the container, set up the environment with `bash build.sh`. This will install all the dependencies from the
CPU-only setup, as well as any GPU versions for applicable libraries like PyTorch.
## Model
| model | accuracy | model source | precision |
| ---- | ---- | ---- | ---- |
| Mixtral-8x7B-Instruct-v0.1 | [Accuracy target](#accuracy-target) | [Hugging Face](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | fp16 |
**Important Note:** Files and configurations of the model have changed, and might change in the future. If you are going to get the model from Hugging Face or any external source, use a version of the model that exactly matches the one in this [commit](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commit/a60832cb6c88d5cb6e507680d0e9996fbad77050). We strongly recommend getting the model by following the steps in the next section:
### Get Checkpoint
#### Using Rclone
To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the model checkpoint:
```
rclone copy mlc-inference:mlcommons-inference-wg-public/mixtral_8x7b/mixtral-8x7b-instruct-v0.1 ./mixtral-8x7b-instruct-v0.1 -P
```
## Get Dataset
### Preprocessed
#### Using Rclone
We make many of the MLPerf inference models and datasets available using Rclone. For compatibility, you can use Rclone to get the preprocessed dataset:
To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```bash
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, cd into the folder where you want to place the dataset and run:
```bash
rclone copyurl https://inference.mlcommons-storage.org/mixtral_8x7b/09292024_mixtral_15k_mintoken2_v1.pkl ./ -a -P
```
#### Using wget
Alternatively, you can simply cd into the folder where you want to place the dataset and run
```bash
wget https://inference.mlcommons-storage.org/mixtral_8x7b/09292024_mixtral_15k_mintoken2_v1.pkl
```
### Calibration dataset
#### Using Rclone
Once Rclone is installed, cd into the folder where you want to place the dataset and run:
```bash
rclone copyurl https://inference.mlcommons-storage.org/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl ./ -a -P
```
#### Using wget
Alternatively, you can simply cd into the folder where you want to place the dataset and run
```bash
wget https://inference.mlcommons-storage.org/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl
```
## Run Performance Benchmarks
### Offline
```
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--user-conf user.conf \
--total-sample-count 15000 \
--device cpu \
--dataset-path ${DATASET_PATH} \
--output-log-dir offline-logs
```
For a GPU-based run:
```
python3 -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--user-conf user.conf \
--total-sample-count 15000 \
--dataset-path ${DATASET_PATH} \
--output-log-dir offline-logs \
--dtype float32 \
--device cuda:0 2>&1 | tee offline_performance_log.log
```
### Server
```
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--user-conf user.conf \
--total-sample-count 15000 \
--device cpu \
--dataset-path ${DATASET_PATH} \
--output-log-dir server-logs
```
The ServerSUT was not tested for GPU runs.
## Run Accuracy Benchmarks
### Offline
```
OUTPUT_LOG_DIR=offline-accuracy-logs
mkdir -p "run_outputs" # The script will dump all the outputs to 'run_outputs'.
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--accuracy \
--user-conf user.conf \
--total-sample-count 15000 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--device cpu
ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
# Optional: Create a pickled pandas DataFrame that is the original dataset with extra columns with output data from the
# accuracy run. The following columns will be added:
# - "gen_output_tok_id": A list of ints representing the tokenized output sequence.
# - "gen_output_text": A str representing the untokenized output sequence.
# - "gen_output_tok_len": An int representing the number of output tokens.
# - "rouge1": The rouge1 score for this sample
# - "rouge2": The rouge2 score for this sample
# - "rougeL": The rougeL score for this sample
# This file will by default be saved to 'full_output.pkl'. You can modify this with --output-pkl-path.
python consolidate_results.py --dataset-path ${DATASET_PATH} --model-dir ${CHECKPOINT_PATH}
```
For the GPU run - The above steps have been automated in `run_accuracy.sh`. You can also modify this script to use
`--device cpu` to adapt it to a CPU-only run.
### Server
```
OUTPUT_LOG_DIR=server-accuracy-logs
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--accuracy \
--user-conf user.conf \
--total-sample-count 15000 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--device cpu
ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
```
The ServerSUT was not tested for GPU runs.
### Evaluation
Recreating the environment for evaluating the quality metrics can be quite tedious, so we provide a Dockerfile and recommend using Docker for this task.
1. Build the evaluation container
```bash
docker build . -f Dockerfile.eval -t evaluation
```
2. Run the container in interactive mode:
```bash
docker run -it --rm --net=host --runtime=nvidia --ipc=host -v $PWD:$PWD -w $PWD evaluation
```
3. Run the evaluation:
```bash
cd eval
python -u evaluate-accuracy.py --checkpoint-path [path_to_model_checkpoint] \
--mlperf-accuracy-file [path_to_mlperf_accuracy_file] \
--dataset-file [path_to_dataset] \
--n_workers 8
```
## Accuracy Target
Reference scores:
Open Orca:
```json
{'rouge1': 45.5989, 'rouge2': 23.3526, 'rougeL': 30.4608}
```
GSM8K:
```json
{'gsm8k': 73.66}
```
MBXP:
```json
{'mbxp': 60.16}
```
For official submissions, 99% of each reference score is enforced. Additionally, the generated tokens_per_sample (counting all non-EOS tokens) must fall within 90%-110% of the reference:
```json
{'tokens_per_sample': 144.84}
```
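As a worked example (a sketch of the stated rules, not part of the reference scripts), the acceptance checks amount to:
```python
# 99% floor on each quality metric; 90%-110% band on tokens_per_sample.
reference = {"rouge1": 45.5989, "rouge2": 23.3526, "rougeL": 30.4608,
             "gsm8k": 73.66, "mbxp": 60.16}
ref_tokens_per_sample = 144.84

def passes(result, tokens_per_sample):
    """result: dict of quality scores from the evaluation step above."""
    quality_ok = all(result[k] >= 0.99 * reference[k] for k in reference)
    tps_ok = (0.90 * ref_tokens_per_sample
              <= tokens_per_sample
              <= 1.10 * ref_tokens_per_sample)
    return quality_ok and tps_ok
```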
import os
import time
import numpy as np
import array
import torch
from torch.nn.functional import pad
from torch.utils.data import DataLoader
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
LogitsProcessor,
LogitsProcessorList,
)
from transformers.generation.streamers import BaseStreamer
import pickle
import time
import threading
import tqdm
import queue
import logging
from typing import TYPE_CHECKING, Optional, List
from pathlib import Path
import mlperf_loadgen as lg
from dataset import Dataset
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Mixtral-8x7B-Instruct-v0.1")
gen_kwargs = {
"early_stopping": True,
"max_new_tokens": 1024,
"min_new_tokens": 2,
"num_beams": 1,
"do_sample": False,
}
class StopAfterSequence(LogitsProcessor):
"""Logits processor (to use with HuggingFace `generate()` method :
https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/
text_generation#transformers.generation_utils.GenerationMixin).
This logits processor makes that when the model generates a specified
stopping sequence, it stops generating new tokens
Args:
stop_seq (List[int]): ID of the space token.
eos_token_id (int): ID of the EOS token.
device (str): Device that the model is running
"""
def __init__(
self,
eos_token_id: int,
stop_seq: List[int] = [13, 13940, 28832, 13],
device="cpu",
):
super().__init__()
assert len(stop_seq) >= 1
self.device = device
self.stop_seq = torch.tensor(stop_seq, dtype=torch.long).to(device)
self.stop_seq_length = len(stop_seq)
self.eos_token_id = eos_token_id
def check_stop_condition(self, input_ids: torch.LongTensor):
stop_condition_met = (
input_ids[:, -self.stop_seq_length:] == self.stop_seq
).all(dim=1)
return stop_condition_met
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor
) -> torch.FloatTensor:
if input_ids.size(1) > self.stop_seq_length:
forced_eos = torch.full(
(scores.size(1),), -float("inf")).to(self.device)
forced_eos[self.eos_token_id] = 0
scores[self.check_stop_condition(input_ids)] = forced_eos
return scores
class FirstTokenStreamer(BaseStreamer):
"""Streams first tokens to a 'holder'"""
def __init__(
self, first_token, tokens_cache=[], is_first_token=True, response_ids=[]
):
"""Response ids added to 'sign' the first token"""
self.first_token = first_token # Queue for first token
self.is_first_token = is_first_token
# Cache for subsequent generated tokens
self.tokens_cache = tokens_cache
self.response_ids = response_ids
# The first tokens sent to the streamer are actually the input prompts
self.is_prompt = True
def put(self, value):
"""Caches the tokens as they're generated. Assumes bs=1"""
# Prompts are streamed first so we need to skip the first time value
# that arrives
if self.is_prompt:
self.is_prompt = False
return
value = value.item()
if self.is_first_token:
# Add generated first token together with its query response_id to
# first tokens queue
self.first_token.put((value, self.response_ids[0]))
self.is_first_token = False
return
self.tokens_cache.append(value)
def end(self):
pass
def get_out_tokens(self):
return self.tokens_cache
class SUT:
def __init__(
self,
model_path=None,
dtype="bfloat16",
device="cpu",
batch_size=None,
total_sample_count=24576,
dataset_path=None,
use_cached_outputs=False,
# Set this to True *only for test accuracy runs* in case your
# prior session was killed partway through
workers=1,
):
self.model_path = model_path or "mistralai/Mixtral-8x7B-Instruct-v0.1"
self.device = device
if not batch_size:
if device == "cpu":
batch_size = 1
else:
batch_size = 32 # Reduce to 8 if using 4 GPUs, 16 for 8.
self.batch_size = batch_size
# dtype
if dtype == "bfloat16":
self.amp_enabled = True
self.amp_dtype = torch.bfloat16
elif dtype == "float16":
self.amp_enabled = True
self.amp_dtype = torch.float16
else:
self.amp_enabled = False
self.amp_dtype = torch.float32
if "cuda" in self.device:
assert torch.cuda.is_available(), "torch gpu is not available, exiting..."
self.dataset_path = dataset_path
self.data_object = Dataset(
self.model_path,
dataset_path=self.dataset_path,
total_sample_count=total_sample_count,
device=self.device,
)
self.qsl = lg.ConstructQSL(
self.data_object.total_sample_count,
self.data_object.perf_count,
self.data_object.LoadSamplesToRam,
self.data_object.UnloadSamplesFromRam,
)
self.load_model()
self.num_workers = workers
self.worker_threads = [None] * self.num_workers
self.query_queue = queue.Queue()
self.use_cached_outputs = use_cached_outputs
self.sample_counter = 0
self.sample_counter_lock = threading.Lock()
def start(self):
# Create worker threads
for j in range(self.num_workers):
worker = threading.Thread(target=self.process_queries)
worker.start()
self.worker_threads[j] = worker
def stop(self):
for _ in range(self.num_workers):
self.query_queue.put(None)
for worker in self.worker_threads:
worker.join()
def process_queries(self):
"""Processor of the queued queries. User may choose to add batching logic"""
while True:
qitem = self.query_queue.get()
if qitem is None:
break
query_ids = [q.index for q in qitem]
fname = "q" + "_".join([str(i) for i in query_ids])
fname = f"run_outputs/{fname}.pkl"
_p = Path(fname)
if self.use_cached_outputs and _p.exists():
# Read cache
with _p.open(mode="rb") as f:
d = pickle.load(f)
processed_output = d["outputs"]
tik1 = None
tik2 = None
tik3 = None
tok = None
else:
# Construct / collate batch
max_seq_len = 1024
tik1 = time.time()
input_ids_tensor = []
input_masks_tensor = []
input_len = []
input_dataset = []
for q in qitem:
input_ids_tensor.append(
pad(
self.data_object.input_ids[q.index],
(
max_seq_len -
self.data_object.input_lens[q.index],
0,
0,
0,
),
value=self.tokenizer.pad_token_id,
)
)
input_masks_tensor.append(
pad(
self.data_object.attention_masks[q.index],
(
max_seq_len -
self.data_object.input_lens[q.index],
0,
0,
0,
),
value=0,
)
)
input_len.append(self.data_object.input_lens[q.index])
# In case we predict code generation, we can specify an
# additional stop sequence
input_dataset.append(
self.data_object.dataset_names[q.index])
input_ids_tensor = torch.cat(input_ids_tensor)
input_masks_tensor = torch.cat(input_masks_tensor)
assert input_ids_tensor.shape == input_masks_tensor.shape
assert input_ids_tensor.shape[0] <= self.batch_size
tik2 = time.time()
logits_processor = LogitsProcessorList(
[StopAfterSequence(
self.tokenizer.eos_token_id, device=self.device)]
)
for i in range(len(input_ids_tensor)):
ids, masks, dataset = (
input_ids_tensor[i: i + 1],
input_masks_tensor[i: i + 1],
input_dataset[i],
)
pred_output_tokens = []
if dataset == "MBXP":
out = self.model.generate(
input_ids=ids,
attention_mask=masks,
pad_token_id=self.tokenizer.pad_token_id,
logits_processor=logits_processor,
**gen_kwargs,
)
else:
out = self.model.generate(
input_ids=ids,
attention_mask=masks,
pad_token_id=self.tokenizer.pad_token_id,
**gen_kwargs,
)
pred_output_tokens.append(out)
pred_output_tokens = torch.cat(pred_output_tokens)
tik3 = time.time()
processed_output = self.data_object.postProcess(
pred_output_tokens,
input_seq_lens=input_len,
query_id_list=query_ids,
)
for i in range(len(qitem)):
n_tokens = processed_output[i].shape[0]
response_array = array.array(
"B", processed_output[i].tobytes())
bi = response_array.buffer_info()
response = [
lg.QuerySampleResponse(
qitem[i].id,
bi[0],
bi[1],
n_tokens)]
lg.QuerySamplesComplete(response)
tok = time.time()
with self.sample_counter_lock:
self.sample_counter += len(qitem)
print(f"Samples run: {self.sample_counter}")
if tik1:
print(f"\tBatchMaker time: {tik2 - tik1}")
print(f"\tInference time: {tik3 - tik2}")
print(f"\tPostprocess time: {tok - tik3}")
print(f"\t==== Total time: {tok - tik1}")
else:
print(f"\tLoaded from cache: {_p}")
def load_model(self):
self.model = AutoModelForCausalLM.from_pretrained(
self.model_path,
device_map="auto",
low_cpu_mem_usage=True,
torch_dtype=self.amp_dtype,
)
print("Loaded model")
self.device = torch.device(self.device)
if self.device == "cpu":
# Force CPU if your system has GPU and you specifically want
# CPU-only run
self.model = self.model.to(self.device)
self.model.eval()
try: # for systems with low ram, the below command gives error as some part is offloaded to disk
self.model = self.model.to(memory_format=torch.channels_last)
except BaseException:
pass
self.tokenizer = AutoTokenizer.from_pretrained(
self.model_path,
model_max_length=1024,
padding_side="left",
use_fast=False,
)
self.tokenizer.pad_token = self.tokenizer.eos_token
print("Loaded tokenizer")
def get_sut(self):
self.sut = lg.ConstructSUT(self.issue_queries, self.flush_queries)
return self.sut
def get_qsl(self):
return self.qsl
def predict(self, **kwargs):
raise NotImplementedError
def issue_queries(self, query_samples):
"""Receives samples from loadgen and adds them to queue. Users may choose to batch here"""
list_prompts_tokens = []
list_prompts_attn_masks = []
print(f"IssueQuery started with {len(query_samples)} samples")
while len(query_samples) > 0:
self.query_queue.put(query_samples[: self.batch_size])
query_samples = query_samples[self.batch_size:]
print(f"IssueQuery done")
def flush_queries(self):
pass
def __del__(self):
pass
class SUTServer(SUT):
def __init__(
self,
model_path=None,
dtype="bfloat16",
device="cpu",
total_sample_count=24576,
dataset_path=None,
workers=1,
):
super().__init__(
model_path=model_path,
dtype=dtype,
device=device,
total_sample_count=total_sample_count,
dataset_path=dataset_path,
workers=workers,
)
self.first_token_queue = queue.Queue()
def start(self):
# Create worker threads
for j in range(self.num_workers):
worker = threading.Thread(target=self.process_queries)
worker.start()
self.worker_threads[j] = worker
# Create first token response thread
self.ft_response_thread = threading.Thread(
target=self.process_first_tokens)
self.ft_response_thread.start()
def process_first_tokens(self):
while True:
first_token_item = self.first_token_queue.get()
if first_token_item is None:
log.info("Exiting First token response thread")
break
first_tokens, response_id = first_token_item
response_data = array.array(
"B", np.array(
first_tokens, np.int32).tobytes())
bi = response_data.buffer_info()
response = [lg.QuerySampleResponse(response_id, bi[0], bi[1])]
lg.FirstTokenComplete(response)
def process_queries(self):
"""Processor of the queued queries. User may choose to add batching logic"""
while True:
qitem = self.query_queue.get()
if qitem is None:
break
input_ids_tensor = self.data_object.input_ids[qitem.index]
input_masks_tensor = self.data_object.attention_masks[qitem.index]
dataset = self.data_object.dataset_names[qitem.index]
# TODO: This PoC is super slow with significant overhead. Best to
# create a patch to `generate`
tokens_cache = []
tokens_streamer = FirstTokenStreamer(
self.first_token_queue,
tokens_cache=tokens_cache,
is_first_token=True,
response_ids=[qitem.id],
)
logits_processor = LogitsProcessorList(
[StopAfterSequence(
self.tokenizer.eos_token_id, device=self.device)]
)
if dataset == "MBXP":
_ = self.model.generate(
input_ids=input_ids_tensor,
attention_mask=input_masks_tensor,
pad_token_id=self.tokenizer.pad_token_id,
streamer=tokens_streamer,
logits_processor=logits_processor,
**gen_kwargs,
)
else:
_ = self.model.generate(
input_ids=input_ids_tensor,
attention_mask=input_masks_tensor,
pad_token_id=self.tokenizer.pad_token_id,
streamer=tokens_streamer,
**gen_kwargs,
)
output_tokens = tokens_streamer.get_out_tokens()
n_tokens = len(output_tokens)
response_array = array.array(
"B", np.array(output_tokens, np.int32).tobytes()
)
bi = response_array.buffer_info()
response = [
lg.QuerySampleResponse(
qitem.id,
bi[0],
bi[1],
n_tokens)]
lg.QuerySamplesComplete(response)
def issue_queries(self, query_samples):
self.query_queue.put(query_samples[0])
def stop(self):
for _ in range(self.num_workers):
self.query_queue.put(None)
for worker in self.worker_threads:
worker.join()
self.first_token_queue.put(None)
self.ft_response_thread.join()
set -e
conda install pybind11==2.10.4 -c conda-forge -y
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia
python -m pip install transformers==4.31.0 nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0
cd ../../loadgen && python3 -m pip install .
import random
import os
import time
import numpy as np
import torch
from datasets import load_dataset, load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn.functional import pad
from torch.utils.data import DataLoader
from typing import Optional, Dict, Sequence
import io
# import utils
import copy
import pickle
import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Llama-70B-Dataset")
class Dataset:
def __init__(
self,
model_name=None,
total_sample_count=15000,
perf_count_override=None,
dataset_path=None,
device="cpu",
):
self.model_name = model_name or "mistralai/Mixtral-8x7B-v0.1"
self.dataset_path = dataset_path
self.max_length = 1024
self.device = device
# self.total_sample_count = total_sample_count
self.load_tokenizer()
self.load_processed_dataset()
self.total_sample_count = min(len(self.input_ids), total_sample_count)
self.perf_count = perf_count_override or self.total_sample_count
def load_tokenizer(self):
"""Returns tokenizer"""
self.tokenizer = AutoTokenizer.from_pretrained(
self.model_name,
model_max_length=1024,
padding_side="left",
use_fast=False,
)
self.tokenizer.pad_token = self.tokenizer.eos_token
def load_processed_dataset(self):
if not os.path.isfile(self.dataset_path):
log.warning(
"Processed pickle file {} not found. Please check that the path is correct".format(
self.dataset_path
)
)
print("Loading dataset...")
import pandas as pd
processed_data = pd.read_pickle(self.dataset_path)
input_tokens = processed_data["tok_input"]
self.input_ids = []
self.input_lens = []
self.attention_masks = []
self.dataset_names = []
for ids in input_tokens:
input_ids = torch.tensor(ids, dtype=torch.int32).view(
1, -1).to(self.device)
attn_mask = torch.ones_like(input_ids)
self.input_ids.append(input_ids)
self.attention_masks.append(attn_mask)
self.input_lens.append(input_ids.shape[-1])
for dataset in processed_data["dataset"]:
self.dataset_names.append(dataset)
print("Finished loading dataset.")
def postProcess(
self,
out_tokens,
input_seq_lens=None,
query_id_list=None,
sample_index_list=None,
):
"""Postprocesses output prediction"""
# TODO: Create response object in postProcess(?)
"""
preds = []
for i in range(out_tokens.shape[0]):
#pred = out_tokens[i].reshape(-1).cpu().numpy() # Slice up to original input length as below?
input_len = input_seq_lens[i] if input_seq_lens else 0
pred = out_tokens[i, input_len:].reshape(-1).cpu().numpy()
preds.append(pred)
"""
# Everything is padded to max_len (1024), so prune the input and parse
# to numpy
output_seq = out_tokens[:, 1024:].cpu().numpy()
aux_seq = []
assert len(query_id_list) == output_seq.shape[0]
for i in range(len(output_seq)):
aux = output_seq[i]
# Track `aux` (which grows) rather than the fixed-length output_seq[i];
# otherwise a degenerate output would loop forever.
while len(aux) <= 1:
aux = np.append(aux, self.tokenizer.eos_token_id)
aux_seq.append(aux)
output_seq = np.stack(aux_seq)
# Save outputs
if not os.path.exists("run_outputs"):
os.makedirs("run_outputs")
fname = "q" + "_".join([str(i) for i in query_id_list])
fname = f"run_outputs/{fname}.pkl"
with open(fname, mode="wb") as f:
d = {"query_ids": query_id_list, "outputs": output_seq}
print(f"Saving outputs to {fname}")
pickle.dump(d, f)
return output_seq
def LoadSamplesToRam(self, sample_list):
pass
def UnloadSamplesFromRam(self, sample_list):
pass
def __del__(self):
pass