# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/pytorch:24.07-py3
SHELL ["/bin/bash", "-c"]
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV TZ=US/Pacific
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN rm -rf /var/lib/apt/lists/* && rm -rf /etc/apt/sources.list.d/* \
&& apt update \
&& apt install -y --no-install-recommends build-essential autoconf \
libtool git ccache curl wget pkg-config sudo ca-certificates \
automake libssl-dev bc python3-dev python3-pip google-perftools \
gdb libglib2.0-dev clang sshfs libre2-dev libboost-dev \
libnuma-dev numactl sysstat sshpass ntpdate less iputils-ping \
&& apt -y autoremove \
&& apt remove -y cmake \
&& apt install -y --no-install-recommends pkg-config zip g++ zlib1g-dev \
unzip libarchive-dev
RUN apt install -y --no-install-recommends rsync
# Install setuptools
RUN python3 -m pip install --upgrade pip \
&& python3 -m pip install --upgrade setuptools wheel virtualenv
# Install conda
WORKDIR /tmp
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh \
&& bash Miniconda3-* -b -p /opt/miniconda3
ENV PATH="$PATH:/opt/miniconda3/bin"
RUN conda create -n llama3.1-405b python=3.10
RUN chmod -R 777 /opt/miniconda3
# Set the env variable for vLLM
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
# Reference Implementation for llama3.1-405b
**Basic implementation for llama3.1-405b. A few noteworthy items:**
+ The streamer used for communicating with loadgen has quite some overhead. This implementation is only meant to be functional, not fast.
+ For custom/optimized implementations of this benchmark it is important to include the following:
- For the Server scenario, it is necessary to call `lg.FirstTokenComplete(response)` for each query. This way the first token will be reported and its latency will be measured.
- For all scenarios, when calling `lg.QuerySamplesComplete(response)`, it is necessary that each element in `response` is a `lg.QuerySampleResponse` that contains the number of tokens (it can be created this way: `lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)`). The number of tokens reported should match the number of tokens in your answer; this is checked in [TEST06](../../compliance/nvidia/TEST06/). A minimal sketch follows this list.
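For reference, here is a minimal sketch (placeholders only, not part of the reference code; `qitem` stands in for a loadgen query sample) of reporting a completed sample with its token count, mirroring the pattern in the SUT implementation below:
```python
import array
from types import SimpleNamespace

import numpy as np
import mlperf_loadgen as lg

qitem = SimpleNamespace(id=0)  # placeholder for a loadgen QuerySample
output_tokens = [42, 17, 9]    # placeholder: token IDs generated for this sample

n_tokens = len(output_tokens)
response_array = array.array("B", np.array(output_tokens, np.int32).tobytes())
bi = response_array.buffer_info()  # (buffer address, length)
# Pass n_tokens so the reported token count can be verified in TEST06.
lg.QuerySamplesComplete(
    [lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)])
```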
## Automated command to run the benchmark via MLCommons CM
Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/llama3_1-405b/) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker.
You can also do `pip install cm4mlops` and then use `cm` commands to download the model and datasets, as shown in the later sections.
## Prepare environment
### Local Environment Run
The following steps were tested on Ubuntu 22.04 with Python 3.10.
- **Prerequisite for GPU runs:** Install the NVIDIA driver and CUDA 12.1.
The following links contain the commands for installing the [NVIDIA Driver](https://developer.nvidia.com/datacenter-driver-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local) and [CUDA](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local).
- **Prerequisite:** Install conda.
```bash
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init
```
- Set the following helper variables
```bash
export ROOT=$PWD/inference
export LLAMA_FOLDER=$PWD/inference/language/llama3.1-405b
export LOADGEN_FOLDER=$PWD/inference/loadgen
export DATASET_FOLDER=$PWD/inference/language/llama3.1-405b/dataset
```
- Clone the inference repository:
```bash
git clone --recurse-submodules https://github.com/mlcommons/inference.git \
--depth 1
```
- Create a conda environment:
```bash
conda create -y -n llama3.1-405b python=3.10
conda activate llama3.1-405b
conda install -y -c conda-forge libstdcxx-ng=12
```
- Install requirements and loadgen:
```bash
cd $LLAMA_FOLDER
# Install packages
pip install -r requirements.txt
```
```bash
cd $LOADGEN_FOLDER
pip install -e .
```
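To sanity-check that loadgen installed correctly (an optional step, not part of the reference instructions):
```bash
python -c "import mlperf_loadgen; print('loadgen OK')"
```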
### Docker Run
A Dockerfile is provided, along with scripts to help launch it. First, add any docker volume mounts you want in
`launch.sh`. There is a section at the top of the file that looks like:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
```
For example if you have a raid space located at `/raid/data` on your local machine, you can add it to the same path in the container like so:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
/raid/data:/raid/data
)
```
Once you have added all your mounts, build and launch the container with `bash launch.sh`.
Now install all the dependencies:
```
pip install -r requirements.txt
pip install -e ../../loadgen
```
## Get Model
### MLCommons Members Download
TODO: Host model and grant access to submitters
### External Download
+ First go to [llama3.1-request-link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and make a request, then sign in to Hugging Face (if you don't have an account, you'll need to create one). **Please note your authentication credentials**, as you may be required to provide them when cloning below.
+ Requires Git Large File Storage (git-lfs).
```
export CHECKPOINT_PATH=Meta-Llama-3.1-405B-Instruct
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct ${CHECKPOINT_PATH}
cd ${CHECKPOINT_PATH} && git checkout be673f326cab4cd22ccfef76109faf68e41aa5f1
```
### Download model through CM (Collective Mind)
```
cm run script --tags=get,ml-model,llama3 --outdirname=${CHECKPOINT_PATH} --hf_token=<huggingface access token> -j
```
**Note:**
Downloading llama3.1-405B model from Hugging Face will require an [**access token**](https://huggingface.co/settings/tokens) which could be generated for your account. Additionally, ensure that your account has access to the [llama3.1-405B](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model.
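If you work from the command line, one way to register the token locally (assuming the `huggingface_hub` package is installed) is:
```bash
huggingface-cli login   # paste the access token when prompted
```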
## Get Dataset
### Preprocessed
You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.
To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_405b/mlperf_llama3.1_405b_dataset_8313_processed_fp16_eval.pkl ./ -P
```
**CM Command**
```
cm run script --tags=get,dataset,mlperf,inference,llama3,_validation --outdirname=<path to download> -j
```
You can also download the calibration dataset from the Cloudflare R2 bucket by running the following command:
```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_405b/mlperf_llama3.1_405b_calibration_dataset_512_processed_fp16_eval.pkl ./ -P
```
**CM Command**
```
cm run script --tags=get,dataset,mlperf,inference,llama3,_calibration --outdirname=<path to download> -j
```
## Run Performance Benchmarks
### Offline
```
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
```
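The command above references `${GPU_COUNT}`, which this README does not set; one way to set it, assuming `nvidia-smi` is available, is:
```bash
export GPU_COUNT=$(nvidia-smi -L | wc -l)   # number of visible GPUs
```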
### Server
```
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
```
The ServerSUT was not tested for GPU runs.
## Run Accuracy Benchmarks
### Offline
```
OUTPUT_LOG_DIR=offline-accuracy-logs
mkdir -p "run_outputs" # The script will dump all the outputs to 'run_outputs'.
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--accuracy \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
```
For the GPU run, the above steps have been automated in `run_accuracy.sh`. Note that this implementation runs through vLLM, and `main.py` does not expose a `--device` flag.
### Server
```
OUTPUT_LOG_DIR=server-accuracy-logs
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--accuracy \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
```
The ServerSUT was not tested for GPU runs.
### Evaluate the accuracy using CM
You can also evaluate the accuracy from the generated accuracy log by using the following CM command:
```
cm run script --tags=process,mlperf,accuracy,_dataset_llama3 --result_dir=<Path to accuracy log directory>
```
## Accuracy Target
Running the GPU implementation in FP16 precision resulted in the following accuracy targets:
```
{
'rougeL': 21.6666,
'exact_match': 90.1335,
'tokens_per_sample': 684.68,
}
```
The accuracy target is 99% of the reference value for rougeL and exact_match, and 90% for tokens_per_sample.
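As a worked example (a sketch of the statement above, not part of the reference scripts), the acceptance check amounts to:
```python
# Reference values from the FP16 run above; 99% floor for rougeL and
# exact_match, 90% floor for tokens_per_sample.
reference = {"rougeL": 21.6666, "exact_match": 90.1335,
             "tokens_per_sample": 684.68}
floors = {"rougeL": 0.99, "exact_match": 0.99, "tokens_per_sample": 0.90}

def passes(result):
    """result: dict of metrics as printed by evaluate-accuracy.py."""
    return all(result[k] >= floors[k] * reference[k] for k in reference)

print(passes({"rougeL": 21.55, "exact_match": 89.5,
              "tokens_per_sample": 650.0}))  # True
```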
import asyncio
import os
import time
import numpy as np
import array
import torch
from torch.nn.functional import pad
from vllm import LLM, AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.inputs import TokensPrompt
import pickle
import threading
import tqdm
import queue
import logging
from typing import TYPE_CHECKING, Optional, List
from pathlib import Path
import mlperf_loadgen as lg
from dataset import Dataset
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Llama-405B-SUT")
class SUT:
def __init__(
self,
model_path=None,
dtype="bfloat16",
batch_size=None,
total_sample_count=8313,
dataset_path=None,
use_cached_outputs=False,
# Set this to True *only for test accuracy runs* in case your prior
# session was killed partway through
workers=1,
tensor_parallel_size=8
):
self.model_path = model_path or f"Meta-Llama-3.1-405B-Instruct{'-FP8' if dtype == 'float8' else ''}"
if not batch_size:
batch_size = 1
self.batch_size = batch_size
self.dtype = dtype
self.tensor_parallel_size = tensor_parallel_size
assert torch.cuda.is_available(), "torch gpu is not available, exiting..."
self.dataset_path = dataset_path
self.data_object = Dataset(
self.model_path,
dataset_path=self.dataset_path,
total_sample_count=total_sample_count,
dtype=dtype
)
self.qsl = lg.ConstructQSL(
self.data_object.total_sample_count,
self.data_object.perf_count,
self.data_object.LoadSamplesToRam,
self.data_object.UnloadSamplesFromRam,
)
self.load_model()
gen_kwargs = {
"temperature": 1,
"top_p": 1,
"top_k": 1,
"seed": 42,
"max_tokens": 20000,
"min_tokens": 2
}
self.sampling_params = SamplingParams(**gen_kwargs)
# self.sampling_params.all_stop_token_ids.add(self.model.get_tokenizer().eos_token_id)
self.num_workers = workers
self.worker_threads = [None] * self.num_workers
self.query_queue = queue.Queue()
self.use_cached_outputs = use_cached_outputs
self.sample_counter = 0
self.sample_counter_lock = threading.Lock()
def start(self):
# Create worker threads
for j in range(self.num_workers):
worker = threading.Thread(target=self.process_queries)
worker.start()
self.worker_threads[j] = worker
def stop(self):
for _ in range(self.num_workers):
self.query_queue.put(None)
for worker in self.worker_threads:
worker.join()
def process_queries(self):
"""Processor of the queued queries. User may choose to add batching logic"""
while True:
qitem = self.query_queue.get()
if qitem is None:
break
query_ids = [q.index for q in qitem]
tik1 = time.time()
input_ids_tensor = [
self.data_object.input_ids[q.index] for q in qitem]
tik2 = time.time()
outputs = self.model.generate(
prompt_token_ids=input_ids_tensor, sampling_params=self.sampling_params
)
pred_output_tokens = []
for output in outputs:
pred_output_tokens.append(list(output.outputs[0].token_ids))
tik3 = time.time()
processed_output = self.data_object.postProcess(
pred_output_tokens,
query_id_list=query_ids,
)
for i in range(len(qitem)):
n_tokens = processed_output[i].shape[0]
response_array = array.array(
"B", processed_output[i].tobytes())
bi = response_array.buffer_info()
response = [
lg.QuerySampleResponse(
qitem[i].id,
bi[0],
bi[1],
n_tokens)]
lg.QuerySamplesComplete(response)
tok = time.time()
with self.sample_counter_lock:
self.sample_counter += len(qitem)
log.info(f"Samples run: {self.sample_counter}")
if tik1:
log.info(f"\tBatchMaker time: {tik2 - tik1}")
log.info(f"\tInference time: {tik3 - tik2}")
log.info(f"\tPostprocess time: {tok - tik3}")
log.info(f"\t==== Total time: {tok - tik1}")
def load_model(self):
log.info("Loading model...")
self.model = LLM(
self.model_path,
dtype=self.dtype,
tensor_parallel_size=self.tensor_parallel_size,
)
log.info("Loaded model")
def get_sut(self):
self.sut = lg.ConstructSUT(self.issue_queries, self.flush_queries)
return self.sut
def get_qsl(self):
return self.qsl
def predict(self, **kwargs):
raise NotImplementedError
def issue_queries(self, query_samples):
"""Receives samples from loadgen and adds them to queue. Users may choose to batch here"""
list_prompts_tokens = []
list_prompts_attn_masks = []
log.info(f"IssueQuery started with {len(query_samples)} samples")
while len(query_samples) > 0:
self.query_queue.put(query_samples[: self.batch_size])
query_samples = query_samples[self.batch_size:]
log.info(f"IssueQuery done")
def flush_queries(self):
pass
def __del__(self):
pass
class SUTServer(SUT):
def __init__(
self,
model_path=None,
dtype="bfloat16",
total_sample_count=8313,
dataset_path=None,
batch_size=None,
workers=1,
tensor_parallel_size=8
):
super().__init__(
model_path=model_path,
dtype=dtype,
total_sample_count=total_sample_count,
dataset_path=dataset_path,
workers=workers,
tensor_parallel_size=tensor_parallel_size,
)
self.request_id = 0
self.first_token_queue = queue.Queue()
def start(self):
# Create worker threads
for j in range(self.num_workers):
worker = threading.Thread(target=self.process_queries)
worker.start()
self.worker_threads[j] = worker
async def stream_output(self, qitem, results_generator):
first = True
async for request_output in results_generator:
output_response = request_output
if first:
first_tokens = list(output_response.outputs[0].token_ids)
response_data = array.array(
"B", np.array(first_tokens, np.int32).tobytes())
bi = response_data.buffer_info()
response = [lg.QuerySampleResponse(qitem.id, bi[0], bi[1])]
lg.FirstTokenComplete(response)
first = False
outputs = output_response
pred_output_tokens = list(output_response.outputs[0].token_ids)
n_tokens = len(pred_output_tokens)
response_array = array.array(
"B", np.array(pred_output_tokens, np.int32).tobytes()
)
bi = response_array.buffer_info()
response = [
lg.QuerySampleResponse(
qitem.id,
bi[0],
bi[1],
n_tokens)]
lg.QuerySamplesComplete(response)
def process_queries(self):
"""Processor of the queued queries. User may choose to add batching logic"""
while True:
qitem = self.query_queue.get()
if qitem is None:
break
input_ids_tensor = TokensPrompt(
prompt_token_ids=self.data_object.input_ids[qitem.index])
# TODO: This PoC is super slow with significant overhead. Best to
# create a patch to `generate`
results_generator = self.model.generate(
prompt=input_ids_tensor, sampling_params=self.sampling_params, request_id=str(
self.request_id)
)
self.request_id += 1
asyncio.run(self.stream_output(qitem, results_generator))
def issue_queries(self, query_samples):
self.query_queue.put(query_samples[0])
def stop(self):
for _ in range(self.num_workers):
self.query_queue.put(None)
for worker in self.worker_threads:
worker.join()
# No first-token response thread is created by this class; first tokens
# are reported inline from stream_output(), so there is nothing to join.
def load_model(self):
log.info("Loading model")
self.engine_args = AsyncEngineArgs(
self.model_path,
dtype=self.dtype,
tensor_parallel_size=self.tensor_parallel_size)
self.model = AsyncLLMEngine.from_engine_args(self.engine_args)
log.info("Loaded model")
set -e
conda install pybind11==2.10.4 -c conda-forge -y
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia
python -m pip install transformers==4.31.0 nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0
cd ../../loadgen && python3 -m pip install .
import random
import os
import time
import numpy as np
import torch
from datasets import load_dataset, load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn.functional import pad
from torch.utils.data import DataLoader
from typing import Optional, Dict, Sequence
import io
# import utils
import copy
import pickle
import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Llama-405B-Dataset")
class Dataset:
def __init__(
self,
model_name=None,
total_sample_count=8313,
perf_count_override=None,
dataset_path=None,
dtype="bfloat16"
):
self.model_name = model_name or f"Meta-Llama-3.1-405B-Instruct{'-FP8' if dtype == 'float8' else ''}"
self.dataset_path = dataset_path
# self.total_sample_count = total_sample_count
self.load_processed_dataset()
self.total_sample_count = min(len(self.input_ids), total_sample_count)
self.perf_count = perf_count_override or self.total_sample_count
def load_processed_dataset(self):
if not os.path.isfile(self.dataset_path):
log.warning(
"Processed pickle file {} not found. Please check that the path is correct".format(
self.dataset_path
)
)
log.info("Loading dataset...")
import pandas as pd
self.processed_data = pd.read_pickle(self.dataset_path)
self.input = self.processed_data.input.tolist()
self.input_ids = self.processed_data.tok_input.tolist()
self.input_lens = self.processed_data.tok_input_len.tolist()
log.info("Finished loading dataset.")
def postProcess(
self,
out_tokens,
query_id_list=None,
sample_index_list=None,
):
"""Postprocesses output prediction"""
# TODO: Create response object in postProcess(?)
"""
preds = []
for i in range(out_tokens.shape[0]):
#pred = out_tokens[i].reshape(-1).cpu().numpy() # Slice up to original input length as below?
input_len = input_seq_lens[i] if input_seq_lens else 0
pred = out_tokens[i, input_len:].reshape(-1).cpu().numpy()
preds.append(pred)
"""
# vLLM returns only the generated tokens (no input padding to prune), so
# just hand the raw token lists through to numpy below.
output_seq = out_tokens
assert len(query_id_list) == len(output_seq)
return [np.asarray(out, dtype=np.int32) for out in output_seq]
def LoadSamplesToRam(self, sample_list):
pass
def UnloadSamplesFromRam(self, sample_list):
pass
def __del__(self):
pass
import argparse
from transformers import AutoTokenizer
import nltk
from multiprocessing import Pool, cpu_count
from tqdm import tqdm
import numpy as np
import pandas as pd
import json
import re
from rouge_score import rouge_scorer
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--checkpoint-path",
default="meta-llama/Meta-Llama-3-8B",
help="Path to Llama3.1-405b-hf-chat checkpoint"
)
parser.add_argument(
"--mlperf-accuracy-file", required=True, help="path to mlperf_log_accuracy.json"
)
parser.add_argument(
"--dataset-file",
required=True,
help="path to processed dataset set",
)
parser.add_argument(
"--verbose",
action="store_true",
help="verbose messages")
parser.add_argument(
"--dtype",
default="int64",
help="dtype of the accuracy log",
choices=["int32", "int64", "float"],
)
args = parser.parse_args()
return args
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
def rouge(label, pred):
score = scorer.score(label, pred)
return {
'rougeL': 100 * score['rougeL'].fmeasure,
}
def niah_em(label, pred):
label_uuids = re.findall(
r'[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}', label)
pred_uuids = re.findall(r'[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}', pred)
if len(pred_uuids) == 0:
return {'exact_match': 0.0}
# https://github.com/hsiehjackson/RULER/blob/main/scripts/eval/synthetic/constants.py#L28
score = sum([
sum([1.0 if r.lower() in pred.lower() else 0.0 for r in ref]) / len(ref)
for pred, ref in zip(pred_uuids, label_uuids)
]) / len(pred_uuids) * 100
return {'exact_match': round(score, 2)}
def qa_em(label, pred):
answer_substring = pred
if 'Answer: ' in pred:
last_answer_index = pred.rfind("Answer: ")
if last_answer_index == -1:
return {'exact_match': 0.0}
answer_substring = pred[last_answer_index + len("Answer: "):]
if answer_substring in label:
return {'exact_match': 100.0}
normalized_answer = re.sub(r'\s+', '', answer_substring).lower()
label_entries = [re.sub(r'\s+', '', entry).lower()
for entry in label.split('|')]
match_found = any(entry in normalized_answer for entry in label_entries)
return {'exact_match': 100.0 if match_found else 0.0}
metrics = {
fn.__name__: fn
for fn in [rouge, niah_em, qa_em]
}
def get_groundtruth(processed_dataset_file, return_metrics=True):
data = pd.read_pickle(processed_dataset_file)
ground_truths = data["gt_output"]
if return_metrics:
metrics = data["metric"]
return ground_truths, metrics
return ground_truths
def postprocess_text(preds, targets):
preds = [pred.strip() for pred in preds]
targets = [target.strip() for target in targets]
# rougeLSum expects newline after each sentence
preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
targets = ["\n".join(nltk.sent_tokenize(target)) for target in targets]
return preds, targets
def process_item(item):
pred, target, metric = item
metric_fn = metrics[metric]
metric_eval = metric_fn(target, pred)
return metric_eval
def run_evaluation(preds, targets, metrics, n_process=None):
n_process = cpu_count() if n_process is None else n_process
with Pool(n_process) as pool:
accuracies = list(
tqdm(
pool.imap(
process_item, zip(
preds, targets, metrics)), total=len(preds)))
df = pd.DataFrame({"accuracy": accuracies, "metric": metrics})
return df.accuracy.apply(pd.Series).describe().loc["mean"].to_dict()
def main():
args = get_args()
dataset_path = args.dataset_file
checkpoint_path = args.checkpoint_path
nltk.download("punkt")
nltk.download('punkt_tab')
tokenizer = AutoTokenizer.from_pretrained(
checkpoint_path,
model_max_length=22000,
padding_side="left",
use_fast=False,
)
targets, metrics = get_groundtruth(args.dataset_file)
target_required = []
metrics_required = []
preds_token_ids = []
eval_dtype = np.int64
if args.dtype == "int32":
eval_dtype = np.int32
elif args.dtype == "float":
eval_dtype = np.float32
with open(args.mlperf_accuracy_file, "r") as f:
results = json.load(f)
seen = set()
gen_tok_len = 0
for pred in results:
qsl_idx = pred["qsl_idx"]
if qsl_idx in seen:
continue
seen.add(qsl_idx)
target_required.append(targets[qsl_idx])
metrics_required.append(metrics[qsl_idx])
pred = np.frombuffer(bytes.fromhex(pred["data"]), eval_dtype)
gen_tok_len += len(pred)
preds_token_ids.append(pred)
preds_decoded_text = tokenizer.batch_decode(
preds_token_ids, skip_special_tokens=True
)
preds, targets = postprocess_text(preds_decoded_text, target_required)
result = run_evaluation(preds, targets, metrics_required)
result = dict(result)
prediction_lens = [len(pred) for pred in preds]
gen_num = len(preds)
result = {
**result,
"gen_len": np.sum(prediction_lens),
"gen_num": gen_num,
"gen_tok_len": gen_tok_len,
"tokens_per_sample": round(gen_tok_len / gen_num, 1),
}
print("\nResults\n")
print(result)
if __name__ == "__main__":
main()
#!/bin/bash
MLCOMMONS_REPO_PATH="$(dirname "$(dirname "$PWD")")"
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
# Set up docker environment file for current user
rm -f .docker_env
echo "CI_BUILD_USER=`id -u -n`" >> .docker_env
echo "CI_BUILD_UID=`id -u`" >> .docker_env
echo "CI_BUILD_GROUP=`id -g -n`" >> .docker_env
echo "CI_BUILD_GID=`id -g`" >> .docker_env
cat .docker_env
# Build container
docker build . -t llm/gpubringup
# Build mount flags
declare -a MOUNT_FLAGS
for _mount in ${MOUNTS[@]}; do
_split=($(echo $_mount | tr ':' '\n'));
MOUNT_FLAGS+=("--mount type=bind,source=${_split[0]},target=${_split[1]}");
done
set -x
nvidia-docker run -it --rm --net=host --runtime=nvidia --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--cap-add=SYS_PTRACE --cap-add=SYS_ADMIN --cap-add=DAC_READ_SEARCH \
--security-opt seccomp=unconfined \
-w $PWD \
--env-file `pwd`/.docker_env \
${MOUNT_FLAGS[*]} \
llm/gpubringup \
bash ./with_the_same_user
import subprocess
import mlperf_loadgen as lg
import argparse
import os
import logging
import sys
import requests
import json
sys.path.insert(0, os.getcwd())
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Llama-405B-MAIN")
# function to check the model name in server matches the user specified one
def verify_model_name(user_specified_name, url):
response = requests.get(url)
if response.status_code == 200:
response_dict = response.json()
server_model_name = response_dict["data"][0]["id"]
if user_specified_name == server_model_name:
return {"matched": True, "error": False}
else:
return {
"matched": False,
"error": f"User specified {user_specified_name} and server model name {server_model_name} mismatch!",
}
else:
return {
"matched": False,
"error": f"Failed to get a valid response. Status code: {response.status_code}",
}
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--scenario",
type=str,
choices=["Offline", "Server"],
default="Offline",
help="Scenario",
)
parser.add_argument(
"--model-path",
type=str,
default="Meta-Llama-3.1-405B-Instruct",
help="Model name",
)
parser.add_argument("--dataset-path", type=str, default=None, help="")
parser.add_argument(
"--accuracy",
action="store_true",
help="Run accuracy mode")
parser.add_argument(
"--dtype",
type=str,
default="float32",
help="data type of the model, choose from float16, bfloat16 and float32",
)
parser.add_argument(
"--audit-conf",
type=str,
default="audit.conf",
help="audit config for LoadGen settings during compliance runs",
)
parser.add_argument(
"--user-conf",
type=str,
default="user.conf",
help="user config for user LoadGen settings such as target QPS",
)
# TODO: This interpretation of 'total-sample-count' is a little
# misleading. Fix it
parser.add_argument(
"--total-sample-count",
type=int,
default=8313,
help="Number of samples to use in benchmark.",
)
parser.add_argument(
"--batch-size",
type=int,
default=1,
help="Model batch-size to use in benchmark.",
)
parser.add_argument(
"--output-log-dir", type=str, default="output-logs", help="Where logs are saved"
)
parser.add_argument(
"--enable-log-trace",
action="store_true",
help="Enable log tracing. This file can become quite large",
)
parser.add_argument(
"--num-workers",
type=int,
default=1,
help="Number of workers to process queries",
)
parser.add_argument(
"--tensor-parallel-size",
type=int,
default=8,
help="Number of workers to process queries",
)
parser.add_argument("--vllm", action="store_true", help="vllm mode")
parser.add_argument(
"--api-model-name",
type=str,
default="Meta-Llama-3.1-405B-Instruct",
help="Model name(specified in llm server)",
)
parser.add_argument(
"--api-server",
type=str,
default=None,
help="Specify an api endpoint call to use api mode",
)
args = parser.parse_args()
return args
scenario_map = {
"offline": lg.TestScenario.Offline,
"server": lg.TestScenario.Server,
}
def main():
args = get_args()
settings = lg.TestSettings()
settings.scenario = scenario_map[args.scenario.lower()]
# mlperf.conf is automatically loaded by the loadgen
# settings.FromConfig(args.mlperf_conf, "llama3_1-405b", args.scenario)
settings.FromConfig(args.user_conf, "llama3_1-405b", args.scenario)
if args.accuracy:
settings.mode = lg.TestMode.AccuracyOnly
else:
settings.mode = lg.TestMode.PerformanceOnly
os.makedirs(args.output_log_dir, exist_ok=True)
log_output_settings = lg.LogOutputSettings()
log_output_settings.outdir = args.output_log_dir
log_output_settings.copy_summary_to_stdout = True
log_settings = lg.LogSettings()
log_settings.log_output = log_output_settings
log_settings.enable_trace = args.enable_log_trace
if args.vllm:
from SUT_VLLM import SUT, SUTServer
else:
raise NotImplementedError
sut_map = {"offline": SUT, "server": SUTServer}
sut_cls = sut_map[args.scenario.lower()]
if args.vllm:
sut = sut_cls(
model_path=args.model_path,
dtype=args.dtype,
batch_size=args.batch_size,
dataset_path=args.dataset_path,
total_sample_count=args.total_sample_count,
workers=args.num_workers,
tensor_parallel_size=args.tensor_parallel_size
)
else:
sut = sut_cls(
model_path=args.model_path,
dtype=args.dtype,
batch_size=args.batch_size,
dataset_path=args.dataset_path,
total_sample_count=args.total_sample_count,
workers=args.num_workers,
)
# Start sut before loadgen starts
sut.start()
lgSUT = lg.ConstructSUT(sut.issue_queries, sut.flush_queries)
log.info("Starting Benchmark run")
lg.StartTestWithLogSettings(
lgSUT,
sut.qsl,
settings,
log_settings,
args.audit_conf)
# Stop sut after completion
sut.stop()
log.info("Run Completed!")
log.info("Destroying SUT...")
lg.DestroySUT(lgSUT)
log.info("Destroying QSL...")
lg.DestroyQSL(sut.qsl)
if __name__ == "__main__":
main()
transformers==4.46.2
nltk==3.8.1
evaluate==0.4.0
absl-py==1.4.0
rouge-score==0.1.2
sentencepiece==0.2.0
accelerate==0.21.0
vllm==0.6.3
pybind11==2.10.4
CHECKPOINT_PATH="${CHECKPOINT_PATH:-Meta-Llama-3.1-405B-Instruct}"
DATASET_PATH="${DATASET_PATH:-mlperf_llama3.1_405b_dataset_8318.pkl}"
mkdir -p "run_outputs"
python3 -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--accuracy \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir offline_accuracy_loadgen_logs \
--vllm \
--dtype float32 | tee offline_accuracy_log.log
# Note: mlperf.conf is loaded automatically by loadgen; main.py accepts no
# --mlperf-conf flag.
python3 evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file offline_accuracy_loadgen_logs/mlperf_log_accuracy.json \
--dataset-file ${DATASET_PATH} \
--dtype int32
CHECKPOINT_PATH="${CHECKPOINT_PATH:-Meta-Llama-3.1-405B-Instruct}"
DATASET_PATH="${DATASET_PATH:-mlperf_llama3.1_405b_dataset_8318.pkl}"
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm 2>&1 | tee offline.log
CHECKPOINT_PATH="${CHECKPOINT_PATH:-Meta-Llama-3.1-405B-Instruct}"
DATASET_PATH="${DATASET_PATH:-mlperf_llama3.1_405b_dataset_8318.pkl}"
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--batch-size 16 \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8313 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm 2>&1 | tee server.log
# The format of this config file is 'key = value'.
# The key has the format 'model.scenario.key'. Value is mostly int64_t.
# Model may be '*' as a wildcard. In that case the value applies to all models.
# All times are in milliseconds.
#
*.Offline.min_duration = 600000
*.Offline.min_query_count = 2000
*.Server.target_qps = 0.5
*.Server.min_duration = 120000
*.Server.min_query_count = 100
llama3_1-405b.Server.sample_concatenate_permutation = 1
#!/usr/bin/env bash
# wkong: manually set the user info in env first
set -ex
if [ -z "$@" ]; then
COMMAND=(bash)
else
COMMAND=("$@")
fi
apt-get update && apt-get install -y sudo
getent group "${CI_BUILD_GID}" || addgroup --gid "${CI_BUILD_GID}" "${CI_BUILD_GROUP}"
getent passwd "${CI_BUILD_UID}" || adduser --gid "${CI_BUILD_GID}" --uid "${CI_BUILD_UID}" --gecos "${CI_BUILD_USER} (generated by with_the_same_user script)" --disabled-password --quiet "${CI_BUILD_USER}"
usermod -a -G dip "${CI_BUILD_USER}"
usermod -a -G sudo "${CI_BUILD_USER}"
usermod -a -G root "${CI_BUILD_USER}"
echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
sudo -H -u "#${CI_BUILD_UID}" --preserve-env \
PATH="${PATH}" \
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
PYTHONPATH="${PYTHONPATH}" \
"${COMMAND[@]}"
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/pytorch:24.07-py3
SHELL ["/bin/bash", "-c"]
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV TZ=US/Pacific
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN rm -rf /var/lib/apt/lists/* && rm -rf /etc/apt/sources.list.d/* \
&& apt update \
&& apt install -y --no-install-recommends build-essential autoconf \
libtool git ccache curl wget pkg-config sudo ca-certificates \
automake libssl-dev bc python3-dev python3-pip google-perftools \
gdb libglib2.0-dev clang sshfs libre2-dev libboost-dev \
libnuma-dev numactl sysstat sshpass ntpdate less iputils-ping \
&& apt -y autoremove \
&& apt remove -y cmake \
&& apt install -y --no-install-recommends pkg-config zip g++ zlib1g-dev \
unzip libarchive-dev
RUN apt install -y --no-install-recommends rsync
# Install setuptools
RUN python3 -m pip install --upgrade pip \
&& python3 -m pip install --upgrade setuptools wheel virtualenv
# Install conda
WORKDIR /tmp
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh \
&& bash Miniconda3-* -b -p /opt/miniconda3
ENV PATH="$PATH:/opt/miniconda3/bin"
RUN conda create -n llm python=3.10
RUN chmod -R 777 /opt/miniconda3
# Use Ubuntu 22.04 as the base image
FROM ubuntu:22.04
ARG DEBIAN_FRONTEND=noninteractive
# Update package lists
RUN apt-get update
# Install Python 3 and pip
RUN apt-get install -y python3 python3-pip git
# Set Python 3 as the default python interpreter
RUN ln -s /usr/bin/python3 /usr/bin/python
# Verify installation
RUN python --version
RUN pip --version
RUN git --version
# Install requirements
RUN pip install transformers nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0 huggingface_hub[cli]
# Clone and install mxeval
RUN git clone https://github.com/amazon-science/mxeval.git
RUN pip install -e mxeval
# Get language dependencies
RUN apt install -y wget
# Ruby
RUN apt install -y curl libssl-dev libreadline-dev zlib1g-dev autoconf bison build-essential libyaml-dev libreadline-dev libncurses5-dev libffi-dev libgdbm-dev
# PHP
RUN apt install -y software-properties-common ca-certificates lsb-release apt-transport-https
RUN add-apt-repository ppa:ondrej/php
RUN apt-get update
RUN apt install -y php8.0
# RUN apt install -y php-{pear,cgi,pdo,common,curl,mbstring,gd,mysqlnd,gettext,bcmath,json,xml,fpm,intl,zip}
# JAVA
RUN apt-get install -y openjdk-8-jdk
# JAVASCRIPT
RUN apt install -y npm
# SCALA
RUN apt-get install -y scala
# C#
RUN apt-get install -y dotnet6
# Kotlin
RUN apt install -y zip unzip
SHELL ["/bin/bash", "-c"]
WORKDIR "/mxeval"
RUN sed -i 's/sudo//g' /mxeval/language_setup/ubuntu.sh
RUN sed -i 's/source/PS1=1 source/g' /mxeval/language_setup/ubuntu.sh # Need this to make sure that the "source ~/.bashrc" lines work correctly
RUN sed -i 's/npx tsc/tsc/g' /mxeval/mxeval/execution.py # npx tsc runs into permission issues
RUN PATH="$HOME/.rbenv/bin:$PATH" bash /mxeval/language_setup/ubuntu.sh
WORKDIR "/"
CMD bash
# Reference Implementation for Mixtral-8x7B-instruct-v0.1
**Basic implementation for Mixtral-8x7B-instruct-v0.1. A few noteworthy items:**
+ The dataset was constructed by randomly sampling 5K samples from the validation split of each of 3 datasets: open_orca_gpt4, GSM8K, and MBXP.
+ The streamer used for communicating with loadgen has quite some overhead. This implementation is only meant to be functional, not fast.
+ For custom/optimized implementations of this benchmark it is important to include the following:
- For the Server scenario, it is necessary to call `lg.FirstTokenComplete(response)` for each query (see the sketch after this list). This way the first token will be reported and its latency will be measured.
- For all scenarios, when calling `lg.QuerySamplesComplete(response)`, it is necessary that each element in `response` is a `lg.QuerySampleResponse` that contains the number of tokens (it can be created this way: `lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)`). The number of tokens reported should match the number of tokens in your answer; this is checked in [TEST06](../../compliance/nvidia/TEST06/).
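For reference, a minimal sketch (placeholders only, not part of the reference code) of reporting the first token in the Server scenario, mirroring `process_first_tokens` in the SUT implementation below:
```python
import array

import numpy as np
import mlperf_loadgen as lg

first_tokens = [42]  # placeholder: the first generated token ID
response_id = 0      # placeholder: the loadgen query sample id
response_data = array.array("B", np.array(first_tokens, np.int32).tobytes())
bi = response_data.buffer_info()  # (buffer address, length)
# Report the first token so its latency can be measured.
lg.FirstTokenComplete([lg.QuerySampleResponse(response_id, bi[0], bi[1])])
```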
Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/mixtral-8x7b) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker.
## Prepare environment
For a CPU-only run:
```
conda create -n Mixtral-8x7B python=3.9
conda activate Mixtral-8x7B
# Install packages
conda install pybind11==2.10.4 -c conda-forge -y
python -m pip install torch==2.2.0.dev20231006+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers==4.31.0 nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0
pip install git+https://github.com/amazon-science/mxeval.git@e09974f990eeaf0c0e8f2b5eaff4be66effb2c86
export CUR_DIR=${PWD}
cd <inference-repo-root>/loadgen
python -m pip install .
```
For a GPU-based run:
A dockerfile is provided, along with scripts to help launch it. First, add any docker volume mounts you want in
`launch.sh`. There is a section at the top of the file that looks like:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
```
For example if you have a raid space located at `/raid/data` on your local machine, you can add it to the same path in the container like so:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
/raid/data:/raid/data
)
```
Once you have added all your mounts, launch the container with `bash launch.sh`.
Inside the container, set up the environment with `bash build.sh`. This will install all the dependencies from the
CPU-only setup, as well as any GPU versions for applicable libraries like PyTorch.
## Model
| model | accuracy | model source | precision |
| ---- | ---- | ---- | ---- |
| Mixtral-8x7B-Instruct-v0.1 | [Accuracy target](#accuracy-target) | [Hugging Face](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | fp16 |
**Important Note:** Files and configurations of the model have changed, and might change in the future. If you are going to get the model from Hugging Face or any external source, use a version of the model that exactly matches the one in this [commit](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commit/a60832cb6c88d5cb6e507680d0e9996fbad77050). We strongly recommend getting the model by following the steps in the next section:
### Get Checkpoint
#### Using Rclone
To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the model checkpoint:
```
rclone copy mlc-inference:mlcommons-inference-wg-public/mixtral_8x7b/mixtral-8x7b-instruct-v0.1 ./mixtral-8x7b-instruct-v0.1 -P
```
## Get Dataset
### Preprocessed
#### Using Rclone
We make many of the MLPerf inference models and datasets available using Rclone. For compatibility, you can use Rclone to get the preprocessed dataset:
To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```bash
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, cd into the folder where you want to place the dataset and run:
```bash
rclone copyurl https://inference.mlcommons-storage.org/mixtral_8x7b/09292024_mixtral_15k_mintoken2_v1.pkl ./ -a -P
```
#### Using wget
Alternatively, you can simply cd into the folder where you want to place the dataset and run
```bash
wget https://inference.mlcommons-storage.org/mixtral_8x7b/09292024_mixtral_15k_mintoken2_v1.pkl
```
### Calibration dataset
#### Using Rclone
Once Rclone is installed, cd into the folder where you want to place the dataset and run:
```bash
rclone copyurl https://inference.mlcommons-storage.org/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl ./ -a -P
```
#### Using wget
Alternatively, you can simply cd into the folder where you want to place the dataset and run
```bash
wget https://inference.mlcommons-storage.org/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl
```
## Run Performance Benchmarks
### Offline
```
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--user-conf user.conf \
--total-sample-count 15000 \
--device cpu \
--dataset-path ${DATASET_PATH} \
--output-log-dir offline-logs
```
For a GPU-based run:
```
python3 -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--user-conf user.conf \
--total-sample-count 15000 \
--dataset-path ${DATASET_PATH} \
--output-log-dir offline-logs \
--dtype float32 \
--device cuda:0 2>&1 | tee offline_performance_log.log
```
### Server
```
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--user-conf user.conf \
--total-sample-count 15000 \
--device cpu \
--dataset-path ${DATASET_PATH} \
--output-log-dir server-logs
```
The ServerSUT was not tested for GPU runs.
## Run Accuracy Benchmarks
### Offline
```
OUTPUT_LOG_DIR=offline-accuracy-logs
mkdir -p "run_outputs" # The script will dump all the outputs to 'run_outputs'.
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--accuracy \
--user-conf user.conf \
--total-sample-count 15000 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--device cpu
ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
# Optional: Create a pickled pandas DataFrame that is the original dataset with extra columns with output data from the
# accuracy run. The following columns will be added:
# - "gen_output_tok_id": A list of ints representing the tokenized output sequence.
# - "gen_output_text": A str representing the untokenized output sequence.
# - "gen_output_tok_len": An int representing the number of output tokens.
# - "rouge1": The rouge1 score for this sample
# - "rouge2": The rouge2 score for this sample
# - "rougeL": The rougeL score for this sample
# This file will by default be saved to 'full_output.pkl'. You can modify this with --output-pkl-path.
python consolidate_results.py --dataset-path ${DATASET_PATH} --model-dir ${CHECKPOINT_PATH}
```
For the GPU run - The above steps have been automated in `run_accuracy.sh`. You can also modify this script to use
`--device cpu` to adapt it to a CPU-only run.
### Server
```
OUTPUT_LOG_DIR=server-accuracy-logs
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--accuracy \
--user-conf user.conf \
--total-sample-count 15000 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--device cpu
ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
```
The ServerSUT was not tested for GPU runs.
### Evaluation
Recreating the environment for evaluating the quality metrics can be quite tedious, so we provide a Dockerfile and recommend using Docker for this task.
1. Build the evaluation container
```bash
docker build . -f Dockerfile.eval -t evaluation
```
2. Run the container in interactive mode:
```bash
docker run -it --rm --net=host --runtime=nvidia --ipc=host -v $PWD:$PWD -w $PWD evaluation
```
3. Run the evaluation:
```bash
cd eval
python -u evaluate-accuracy.py --checkpoint-path [path_to_model_checkpoint] \
--mlperf-accuracy-file [path_to_mlperf_accuracy_file] \
--dataset-file [path_to_dataset] \
--n_workers 8
```
## Accuracy Target
Reference scores:
Open Orca:
```json
{'rouge1': 45.5989, 'rouge2': 23.3526, 'rougeL': 30.4608}
```
GSM8K:
```json
{'gsm8k': 73.66}
```
MBXP:
```json
{'mbxp': 60.16}
```
For official submissions, 99% of each reference score is enforced. Additionally, the generated tokens_per_sample (counting all non-EOS tokens) must fall within 90%-110% of the reference:
```json
{'tokens_per_sample': 144.84}
```
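As a worked example (a sketch of the stated rules, not part of the reference scripts), the acceptance checks amount to:
```python
# 99% floor on each quality metric; 90%-110% band on tokens_per_sample.
reference = {"rouge1": 45.5989, "rouge2": 23.3526, "rougeL": 30.4608,
             "gsm8k": 73.66, "mbxp": 60.16}
ref_tokens_per_sample = 144.84

def passes(result, tokens_per_sample):
    """result: dict of quality scores from the evaluation step above."""
    quality_ok = all(result[k] >= 0.99 * reference[k] for k in reference)
    tps_ok = (0.90 * ref_tokens_per_sample
              <= tokens_per_sample
              <= 1.10 * ref_tokens_per_sample)
    return quality_ok and tps_ok
```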
import os
import time
import numpy as np
import array
import torch
from torch.nn.functional import pad
from torch.utils.data import DataLoader
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
LogitsProcessor,
LogitsProcessorList,
)
from transformers.generation.streamers import BaseStreamer
import pickle
import time
import threading
import tqdm
import queue
import logging
from typing import TYPE_CHECKING, Optional, List
from pathlib import Path
import mlperf_loadgen as lg
from dataset import Dataset
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Mixtral-8x7B-Instruct-v0.1")
gen_kwargs = {
"early_stopping": True,
"max_new_tokens": 1024,
"min_new_tokens": 2,
"num_beams": 1,
"do_sample": False,
}
class StopAfterSequence(LogitsProcessor):
"""Logits processor (to use with HuggingFace `generate()` method :
https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/
text_generation#transformers.generation_utils.GenerationMixin).
This logits processor makes that when the model generates a specified
stopping sequence, it stops generating new tokens
Args:
stop_seq (List[int]): ID of the space token.
eos_token_id (int): ID of the EOS token.
device (str): Device that the model is running
"""
def __init__(
self,
eos_token_id: int,
stop_seq: List[int] = [13, 13940, 28832, 13],
device="cpu",
):
super().__init__()
assert len(stop_seq) >= 1
self.device = device
self.stop_seq = torch.tensor(stop_seq, dtype=torch.long).to(device)
self.stop_seq_length = len(stop_seq)
self.eos_token_id = eos_token_id
def check_stop_condition(self, input_ids: torch.LongTensor):
stop_condition_met = (
input_ids[:, -self.stop_seq_length:] == self.stop_seq
).all(dim=1)
return stop_condition_met
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor
) -> torch.FloatTensor:
if input_ids.size(1) > self.stop_seq_length:
forced_eos = torch.full(
(scores.size(1),), -float("inf")).to(self.device)
forced_eos[self.eos_token_id] = 0
scores[self.check_stop_condition(input_ids)] = forced_eos
return scores
class FirstTokenStreamer(BaseStreamer):
"""Streams first tokens to a 'holder'"""
def __init__(
self, first_token, tokens_cache=[], is_first_token=True, response_ids=[]
):
"""Response ids added to 'sign' the first token"""
self.first_token = first_token # Queue for first token
self.is_first_token = is_first_token
# Cache for subsequent generated tokens
self.tokens_cache = tokens_cache
self.response_ids = response_ids
# The first tokens sent to the streamer are actually the input prompts
self.is_prompt = True
def put(self, value):
"""Caches the tokens as they're generated. Assumes bs=1"""
# Prompts are streamed first so we need to skip the first time value
# that arrives
if self.is_prompt:
self.is_prompt = False
return
value = value.item()
if self.is_first_token:
# Add generated first token together with its query response_id to
# first tokens queue
self.first_token.put((value, self.response_ids[0]))
self.is_first_token = False
return
self.tokens_cache.append(value)
def end(self):
pass
def get_out_tokens(self):
return self.tokens_cache
class SUT:
def __init__(
self,
model_path=None,
dtype="bfloat16",
device="cpu",
batch_size=None,
total_sample_count=24576,
dataset_path=None,
use_cached_outputs=False,
# Set this to True *only for test accuracy runs* in case your
# prior session was killed partway through
workers=1,
):
self.model_path = model_path or "mistralai/Mixtral-8x7B-Instruct-v0.1"
self.device = device
if not batch_size:
if device == "cpu":
batch_size = 1
else:
batch_size = 32 # Reduce to 8 if using 4 GPUs, 16 for 8.
self.batch_size = batch_size
# dtype
if dtype == "bfloat16":
self.amp_enabled = True
self.amp_dtype = torch.bfloat16
elif dtype == "float16":
self.amp_enabled = True
self.amp_dtype = torch.float16
else:
self.amp_enabled = False
self.amp_dtype = torch.float32
if "cuda" in self.device:
assert torch.cuda.is_available(), "torch gpu is not available, exiting..."
self.dataset_path = dataset_path
self.data_object = Dataset(
self.model_path,
dataset_path=self.dataset_path,
total_sample_count=total_sample_count,
device=self.device,
)
self.qsl = lg.ConstructQSL(
self.data_object.total_sample_count,
self.data_object.perf_count,
self.data_object.LoadSamplesToRam,
self.data_object.UnloadSamplesFromRam,
)
self.load_model()
self.num_workers = workers
self.worker_threads = [None] * self.num_workers
self.query_queue = queue.Queue()
self.use_cached_outputs = use_cached_outputs
self.sample_counter = 0
self.sample_counter_lock = threading.Lock()
def start(self):
# Create worker threads
for j in range(self.num_workers):
worker = threading.Thread(target=self.process_queries)
worker.start()
self.worker_threads[j] = worker
def stop(self):
for _ in range(self.num_workers):
self.query_queue.put(None)
for worker in self.worker_threads:
worker.join()
def process_queries(self):
"""Processor of the queued queries. User may choose to add batching logic"""
while True:
qitem = self.query_queue.get()
if qitem is None:
break
query_ids = [q.index for q in qitem]
fname = "q" + "_".join([str(i) for i in query_ids])
fname = f"run_outputs/{fname}.pkl"
_p = Path(fname)
if self.use_cached_outputs and _p.exists():
# Read cache
with _p.open(mode="rb") as f:
d = pickle.load(f)
processed_output = d["outputs"]
tik1 = None
tik2 = None
tik3 = None
tok = None
else:
# Construct / collate batch
max_seq_len = 1024
tik1 = time.time()
input_ids_tensor = []
input_masks_tensor = []
input_len = []
input_dataset = []
for q in qitem:
input_ids_tensor.append(
pad(
self.data_object.input_ids[q.index],
(
max_seq_len -
self.data_object.input_lens[q.index],
0,
0,
0,
),
value=self.tokenizer.pad_token_id,
)
)
input_masks_tensor.append(
pad(
self.data_object.attention_masks[q.index],
(
max_seq_len -
self.data_object.input_lens[q.index],
0,
0,
0,
),
value=0,
)
)
input_len.append(self.data_object.input_lens[q.index])
# In case we predict code generation, we can specify an
# additional stop sequence
input_dataset.append(
self.data_object.dataset_names[q.index])
input_ids_tensor = torch.cat(input_ids_tensor)
input_masks_tensor = torch.cat(input_masks_tensor)
assert input_ids_tensor.shape == input_masks_tensor.shape
assert input_ids_tensor.shape[0] <= self.batch_size
tik2 = time.time()
logits_processor = LogitsProcessorList(
[StopAfterSequence(
self.tokenizer.eos_token_id, device=self.device)]
)
for i in range(len(input_ids_tensor)):
ids, masks, dataset = (
input_ids_tensor[i: i + 1],
input_masks_tensor[i: i + 1],
input_dataset[i],
)
pred_output_tokens = []
if dataset == "MBXP":
out = self.model.generate(
input_ids=ids,
attention_mask=masks,
pad_token_id=self.tokenizer.pad_token_id,
logits_processor=logits_processor,
**gen_kwargs,
)
else:
out = self.model.generate(
input_ids=ids,
attention_mask=masks,
pad_token_id=self.tokenizer.pad_token_id,
**gen_kwargs,
)
pred_output_tokens.append(out)
pred_output_tokens = torch.cat(pred_output_tokens)
tik3 = time.time()
processed_output = self.data_object.postProcess(
pred_output_tokens,
input_seq_lens=input_len,
query_id_list=query_ids,
)
for i in range(len(qitem)):
n_tokens = processed_output[i].shape[0]
response_array = array.array(
"B", processed_output[i].tobytes())
bi = response_array.buffer_info()
response = [
lg.QuerySampleResponse(
qitem[i].id,
bi[0],
bi[1],
n_tokens)]
lg.QuerySamplesComplete(response)
tok = time.time()
with self.sample_counter_lock:
self.sample_counter += len(qitem)
print(f"Samples run: {self.sample_counter}")
if tik1:
print(f"\tBatchMaker time: {tik2 - tik1}")
print(f"\tInference time: {tik3 - tik2}")
print(f"\tPostprocess time: {tok - tik3}")
print(f"\t==== Total time: {tok - tik1}")
else:
print(f"\tLoaded from cache: {_p}")
def load_model(self):
self.model = AutoModelForCausalLM.from_pretrained(
self.model_path,
device_map="auto",
low_cpu_mem_usage=True,
torch_dtype=self.amp_dtype,
)
print("Loaded model")
self.device = torch.device(self.device)
if self.device == "cpu":
# Force CPU if your system has GPU and you specifically want
# CPU-only run
self.model = self.model.to(self.device)
self.model.eval()
try: # for systems with low ram, the below command gives error as some part is offloaded to disk
self.model = self.model.to(memory_format=torch.channels_last)
except BaseException:
pass
self.tokenizer = AutoTokenizer.from_pretrained(
self.model_path,
model_max_length=1024,
padding_side="left",
use_fast=False,
)
self.tokenizer.pad_token = self.tokenizer.eos_token
print("Loaded tokenizer")
def get_sut(self):
self.sut = lg.ConstructSUT(self.issue_queries, self.flush_queries)
return self.sut
def get_qsl(self):
return self.qsl
def predict(self, **kwargs):
raise NotImplementedError
def issue_queries(self, query_samples):
"""Receives samples from loadgen and adds them to queue. Users may choose to batch here"""
list_prompts_tokens = []
list_prompts_attn_masks = []
print(f"IssueQuery started with {len(query_samples)} samples")
while len(query_samples) > 0:
self.query_queue.put(query_samples[: self.batch_size])
query_samples = query_samples[self.batch_size:]
print(f"IssueQuery done")
def flush_queries(self):
pass
def __del__(self):
pass
class SUTServer(SUT):
def __init__(
self,
model_path=None,
dtype="bfloat16",
device="cpu",
total_sample_count=24576,
dataset_path=None,
workers=1,
):
super().__init__(
model_path=model_path,
dtype=dtype,
device=device,
total_sample_count=total_sample_count,
dataset_path=dataset_path,
workers=workers,
)
self.first_token_queue = queue.Queue()
def start(self):
# Create worker threads
for j in range(self.num_workers):
worker = threading.Thread(target=self.process_queries)
worker.start()
self.worker_threads[j] = worker
# Create first token response thread
self.ft_response_thread = threading.Thread(
target=self.process_first_tokens)
self.ft_response_thread.start()
def process_first_tokens(self):
while True:
first_token_item = self.first_token_queue.get()
if first_token_item is None:
log.info("Exiting First token response thread")
break
first_tokens, response_id = first_token_item
response_data = array.array(
"B", np.array(
first_tokens, np.int32).tobytes())
bi = response_data.buffer_info()
response = [lg.QuerySampleResponse(response_id, bi[0], bi[1])]
lg.FirstTokenComplete(response)
def process_queries(self):
"""Processor of the queued queries. User may choose to add batching logic"""
while True:
qitem = self.query_queue.get()
if qitem is None:
break
input_ids_tensor = self.data_object.input_ids[qitem.index]
input_masks_tensor = self.data_object.attention_masks[qitem.index]
dataset = self.data_object.dataset_names[qitem.index]
# TODO: This PoC is super slow with significant overhead. Best to
# create a patch to `generate`
tokens_cache = []
tokens_streamer = FirstTokenStreamer(
self.first_token_queue,
tokens_cache=tokens_cache,
is_first_token=True,
response_ids=[qitem.id],
)
logits_processor = LogitsProcessorList(
[StopAfterSequence(
self.tokenizer.eos_token_id, device=self.device)]
)
if dataset == "MBXP":
_ = self.model.generate(
input_ids=input_ids_tensor,
attention_mask=input_masks_tensor,
pad_token_id=self.tokenizer.pad_token_id,
streamer=tokens_streamer,
logits_processor=logits_processor,
**gen_kwargs,
)
else:
_ = self.model.generate(
input_ids=input_ids_tensor,
attention_mask=input_masks_tensor,
pad_token_id=self.tokenizer.pad_token_id,
streamer=tokens_streamer,
**gen_kwargs,
)
output_tokens = tokens_streamer.get_out_tokens()
n_tokens = len(output_tokens)
response_array = array.array(
"B", np.array(output_tokens, np.int32).tobytes()
)
bi = response_array.buffer_info()
response = [
lg.QuerySampleResponse(
qitem.id,
bi[0],
bi[1],
n_tokens)]
lg.QuerySamplesComplete(response)
def issue_queries(self, query_samples):
self.query_queue.put(query_samples[0])
def stop(self):
for _ in range(self.num_workers):
self.query_queue.put(None)
for worker in self.worker_threads:
worker.join()
self.first_token_queue.put(None)
self.ft_response_thread.join()
set -e
conda install pybind11==2.10.4 -c conda-forge -y
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia
python -m pip install transformers==4.31.0 nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0
cd ../../loadgen && python3 -m pip install .
import random
import os
import time
import numpy as np
import torch
from datasets import load_dataset, load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn.functional import pad
from torch.utils.data import DataLoader
from typing import Optional, Dict, Sequence
import io
# import utils
import copy
import pickle
import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Llama-70B-Dataset")
class Dataset:
def __init__(
self,
model_name=None,
total_sample_count=15000,
perf_count_override=None,
dataset_path=None,
device="cpu",
):
self.model_name = model_name or "mistralai/Mixtral-8x7B-v0.1"
self.dataset_path = dataset_path
self.max_length = 1024
self.device = device
# self.total_sample_count = total_sample_count
self.load_tokenizer()
self.load_processed_dataset()
self.total_sample_count = min(len(self.input_ids), total_sample_count)
self.perf_count = perf_count_override or self.total_sample_count
def load_tokenizer(self):
"""Returns tokenizer"""
self.tokenizer = AutoTokenizer.from_pretrained(
self.model_name,
model_max_length=1024,
padding_side="left",
use_fast=False,
)
self.tokenizer.pad_token = self.tokenizer.eos_token
def load_processed_dataset(self):
if not os.path.isfile(self.dataset_path):
log.warning(
"Processed pickle file {} not found. Please check that the path is correct".format(
self.dataset_path
)
)
print("Loading dataset...")
import pandas as pd
processed_data = pd.read_pickle(self.dataset_path)
input_tokens = processed_data["tok_input"]
self.input_ids = []
self.input_lens = []
self.attention_masks = []
self.dataset_names = []
for ids in input_tokens:
input_ids = torch.tensor(ids, dtype=torch.int32).view(
1, -1).to(self.device)
attn_mask = torch.ones_like(input_ids)
self.input_ids.append(input_ids)
self.attention_masks.append(attn_mask)
self.input_lens.append(input_ids.shape[-1])
for dataset in processed_data["dataset"]:
self.dataset_names.append(dataset)
print("Finished loading dataset.")
def postProcess(
self,
out_tokens,
input_seq_lens=None,
query_id_list=None,
sample_index_list=None,
):
"""Postprocesses output prediction"""
# TODO: Create response object in postProcess(?)
"""
preds = []
for i in range(out_tokens.shape[0]):
#pred = out_tokens[i].reshape(-1).cpu().numpy() # Slice up to original input length as below?
input_len = input_seq_lens[i] if input_seq_lens else 0
pred = out_tokens[i, input_len:].reshape(-1).cpu().numpy()
preds.append(pred)
"""
# Everything is padded to max_len (1024), so prune the input and parse
# to numpy
output_seq = out_tokens[:, 1024:].cpu().numpy()
aux_seq = []
assert len(query_id_list) == output_seq.shape[0]
for i in range(len(output_seq)):
aux = output_seq[i]
# Track `aux` (which grows) rather than the fixed-length output_seq[i];
# otherwise a degenerate output would loop forever.
while len(aux) <= 1:
aux = np.append(aux, self.tokenizer.eos_token_id)
aux_seq.append(aux)
output_seq = np.stack(aux_seq)
# Save outputs
if not os.path.exists("run_outputs"):
os.makedirs("run_outputs")
fname = "q" + "_".join([str(i) for i in query_id_list])
fname = f"run_outputs/{fname}.pkl"
with open(fname, mode="wb") as f:
d = {"query_ids": query_id_list, "outputs": output_seq}
print(f"Saving outputs to {fname}")
pickle.dump(d, f)
return output_seq
def LoadSamplesToRam(self, sample_list):
pass
def UnloadSamplesFromRam(self, sample_list):
pass
def __del__(self):
pass