First add in 0524

Commit 5eaaba41 authored May 24, 2024 by Rayyyyy
20 changed files
--- a/docs/single_gpu.md
+++ b/docs/single_gpu.md
+# Fine-tuning with Single GPU
+
+To run fine-tuning on a single GPU, we will make use of two packages:
+
+1. [PEFT](https://huggingface.co/blog/peft) methods, specifically through the Hugging Face [PEFT](https://github.com/huggingface/peft) library.
+
+2. [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) int8 quantization.
+
+Given the combination of PEFT and int8 quantization, we can fine-tune a Meta Llama 3 8B model on a single consumer-grade GPU such as an A10.
+
+## Requirements
+To run the examples, make sure to install the llama-recipes package (See [README.md](../README.md) for details).
+
+**Please note that the llama-recipes package will install PyTorch 2.0.1. If you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
+
+## How to run it?
+
+Get access to a machine with one GPU or, if using a multi-GPU machine, make sure only one of them is visible using `export CUDA_VISIBLE_DEVICES=GPU:id` (where `GPU:id` is the index of the GPU to expose, for example `0`), then run the following. By default it runs with `samsum_dataset` for a summarization application.
+
+
+```bash
+
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization --use_fp16 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+
+```
+The args used in the command above are:
+
+* `--use_peft` boolean flag to enable PEFT methods in the script
+
+* `--peft_method` to specify the PEFT method; here we use `lora`, another option is `llama_adapter`.
+
+* `--quantization` boolean flag to enable int8 quantization
+
+
+## How to run with different datasets?
+
+Currently 4 datasets are supported; they can be found in the [Datasets config file](../src/llama_recipes/configs/datasets.py).
+
+* `grammar_dataset` : use this [notebook](../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the JFLEG and C4 200M datasets for grammar checking.
+
+* `alpaca_dataset` : to get this open source data, please download `alpaca_data.json` to the `src/llama_recipes/datasets` folder.
+
+```bash
+wget -P src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
+```
+
+* `samsum_dataset`
+
+To run with each of the datasets, set the `dataset` flag in the command as shown below:
+
+```bash
+# grammar_dataset
+
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+
+# alpaca_dataset
+
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+
+
+# samsum_dataset
+
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+
+```
+
+## Where to configure settings?
+
+* The [Training config file](../src/llama_recipes/configs/training.py) is the main config file that specifies the settings for our run.
+
+It lets us specify the training settings; everything from `model_name` to `dataset`, `batch_size_training`, etc. can be set here. Below is the list of supported settings:
+
+```python
+    model_name: str="PATH/to/Model"
+    tokenizer_name: str=None
+    enable_fsdp: bool=False
+    low_cpu_fsdp: bool=False
+    run_validation: bool=True
+    batch_size_training: int=4
+    batching_strategy: str="packing" #alternative: padding
+    context_length: int=4096
+    gradient_accumulation_steps: int=1
+    gradient_clipping: bool = False
+    gradient_clipping_threshold: float = 1.0
+    num_epochs: int=3
+    max_train_step: int=0
+    max_eval_step: int=0
+    num_workers_dataloader: int=1
+    lr: float=1e-4
+    weight_decay: float=0.0
+    gamma: float= 0.85
+    seed: int=42
+    use_fp16: bool=False
+    mixed_precision: bool=True
+    val_batch_size: int=1
+    dataset = "samsum_dataset"
+    peft_method: str = "lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
+    use_peft: bool=False
+    from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
+    output_dir: str = "PATH/to/save/PEFT/model"
+    freeze_layers: bool = False
+    num_freeze_layers: int = 1
+    quantization: bool = False
+    one_gpu: bool = False
+    save_model: bool = True
+    dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
+    dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
+    save_optimizer: bool=False # will be used if using FSDP
+    use_fast_kernels: bool = False # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and xFormers memory-efficient kernels
+    use_wandb: bool = False # Enable wandb for experiment tracking
+    save_metrics: bool = False # saves training metrics to a json file for later plotting
+    flop_counter: bool = False # Enable flop counter to measure model throughput, cannot be used with pytorch profiler at the same time.
+    flop_counter_start: int = 3 # The step to start profiling, default is 3, which means after 3 steps of warmup stage, the profiler will start to count flops.
+    use_profiler: bool = False # Enable pytorch profiler, cannot be used with flop counter at the same time.
+    profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
+
+```
+
+* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
+
+* [peft config file](../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we are adding support for counting FLOPS during fine-tuning. You can enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting begins. It is recommended to allow a warm-up stage before using the FLOPS counter.
+
+Similarly, you can set the `--use_profiler` flag and pass a profiling output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage. The current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler is helpful for debugging. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
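
The step arithmetic behind the profiler schedule (wait=1, warmup=2, active=3) can be sketched in plain Python. This is only an illustration of how the three phases add up to the 6 steps that one profiling cycle needs; `profiler_phase` and `steps_for_one_cycle` are hypothetical helpers written for this note, not functions from PyTorch or llama-recipes, and steps are assumed to be counted from zero.

```python
def profiler_phase(step: int, wait: int = 1, warmup: int = 2, active: int = 3) -> str:
    """Return the phase a zero-based training step falls into for one schedule cycle."""
    if step < wait:
        return "wait"      # profiler is idle
    if step < wait + warmup:
        return "warmup"    # profiler runs, but results are discarded
    if step < wait + warmup + active:
        return "active"    # profiler records these steps
    return "done"          # the single cycle has finished


def steps_for_one_cycle(wait: int = 1, warmup: int = 2, active: int = 3) -> int:
    """Number of training steps needed before one full cycle has been recorded."""
    return wait + warmup + active


# Phases for the first eight steps under the schedule quoted above.
phases = [profiler_phase(step) for step in range(8)]
print(phases)
print("steps needed:", steps_for_one_cycle())
```

With these defaults, steps 0 to 2 are spent on wait and warm-up, and steps 3 to 5 are recorded, which is why a run shorter than 6 steps would never produce a complete trace.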
--- a/pyproject.toml
+++ b/pyproject.toml
+[build-system]
+requires = ["hatchling", "hatch-requirements-txt"]
+build-backend = "hatchling.build"
+
+[project]
+name = "llama-recipes"
+version = "0.0.2"
+authors = [
+  { name="Hamid Shojanazeri", email="hamidnazeri@meta.com" },
+  { name="Matthias Reso", email="mreso@meta.com" },
+  { name="Geeta Chauhan", email="gchauhan@meta.com" },
+]
+description = "Llama-recipes is a companion project to the Llama 2 model. Its goal is to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. "
+readme = "README.md"
+requires-python = ">=3.8"
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: Other/Proprietary License",
+    "Operating System :: OS Independent",
+]
+dynamic = ["dependencies"]
+
+[project.optional-dependencies]
+vllm = ["vllm"]
+tests = ["pytest-mock"]
+auditnlg = ["auditnlg"]
+
+[project.urls]
+"Homepage" = "https://github.com/facebookresearch/llama-recipes/"
+"Bug Tracker" = "https://github.com/facebookresearch/llama-recipes/issues"
+
+[tool.hatch.build]
+exclude = [
+  "dist/*",
+]
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/llama_recipes"]
+
+[tool.hatch.metadata.hooks.requirements_txt]
+files = ["requirements.txt"]
+
+[tool.pytest.ini_options]
+markers = [
+    "skip_missing_tokenizer: skip tests when we can not access meta-llama/Llama-2-7b-hf on huggingface hub (Log in with `huggingface-cli login` to unskip).",
+]
--- a/recipes/README.md
+++ b/recipes/README.md
+This folder contains examples organized by topic:
+
+| Subfolder | Description |
+|---|---|
+[quickstart](./quickstart)|The "Hello World" of using Llama 3, start here if you are new to using Llama 3
+[multilingual](./multilingual)|Scripts to add a new language to Llama
+[finetuning](./finetuning)|Scripts to finetune Llama 3 on single-GPU and multi-GPU setups
+[inference](./inference)|Scripts to deploy Llama 3 for inference [locally](./inference/local_inference/), on mobile [Android](./inference/mobile_inference/android_inference/) and using [model servers](./inference/mobile_inference/)
+[use_cases](./use_cases)|Scripts showing common applications of Llama 3
+[responsible_ai](./responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs
+[llama_api_providers](./llama_api_providers)|Scripts to run inference on Llama via hosted endpoints
+[benchmarks](./benchmarks)|Scripts to benchmark Llama 3 models inference on various backends
+[code_llama](./code_llama)|Scripts to run inference with the Code Llama models
+[evaluation](./evaluation)|Scripts to evaluate fine-tuned Llama 3 models using `lm-evaluation-harness` from `EleutherAI`
--- a/recipes/benchmarks/fmbench/README.md
+++ b/recipes/benchmarks/fmbench/README.md
+# Benchmark Llama models on AWS
+
+The [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main) tool provides a quick and easy way to benchmark the Llama family of models for price and performance on any AWS service, including [`Amazon SageMaker`](https://aws.amazon.com/solutions/guidance/generative-ai-deployments-using-amazon-sagemaker-jumpstart/), [`Amazon Bedrock`](https://aws.amazon.com/bedrock/), or `Amazon EKS` or `Amazon EC2` as `Bring your own endpoint`.
+
+## The need for benchmarking
+
+<!-- markdown-link-check-disable -->
+Customers often wonder what is the best AWS service to run Llama models for _my specific use-case_ and _my specific price performance requirements_. While model evaluation metrics are available on several leaderboards ([`HELM`](https://crfm.stanford.edu/helm/lite/latest/#/leaderboard), [`LMSys`](https://chat.lmsys.org/?leaderboard)), price performance comparisons can be notoriously hard to find and even harder to trust. In such a scenario, we think it is best to be able to run performance benchmarking yourself, either on your own dataset or on a similar (task wise, prompt size wise) open-source dataset such as [`LongBench`](https://huggingface.co/datasets/THUDM/LongBench) or [`QMSum`](https://paperswithcode.com/dataset/qmsum). This is the problem that [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main) solves.
+<!-- markdown-link-check-enable -->
+
+## [`FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main): an open-source Python package for FM benchmarking on AWS
+
+`FMBench` runs inference requests against endpoints that are either deployed through `FMBench` itself (as in the case of SageMaker) or are available either as a fully-managed endpoint (as in the case of Bedrock) or as a bring-your-own endpoint. Metrics such as inference latency, transactions per minute, error rates and cost per transaction are captured and presented in the form of a Markdown report containing explanatory text, tables and figures. The figures and tables in the report provide insights into what might be the best serving stack (instance type, inference container and configuration parameters) for a given Llama model for a given use-case.
+
+The following figure gives an example of the price performance numbers, which include inference latency, transactions per minute and concurrency level, for running the `Llama2-13b` model on different instance types available on SageMaker, using prompts for a Q&A task created from the [`LongBench`](https://huggingface.co/datasets/THUDM/LongBench) dataset. These prompts are between 3000 and 3840 tokens in length. **_Note that the numbers are hidden in this figure but you would be able to see them when you run `FMBench` yourself_**.
+
+![`Llama2-13b` on different instance types ](./img/business_summary.png)
+
+The following table (also included in the report) provides information about the best available instance type for that experiment<sup>1</sup>.
+
+|Information	|Value	|
+|---	|---	|
+|experiment_name	|llama2-13b-inf2.24xlarge	|
+|payload_file	|payload_en_3000-3840.jsonl	|
+|instance_type	|ml.inf2.24xlarge	|
+|concurrency	|**	|
+|error_rate	|**	|
+|prompt_token_count_mean	|3394	|
+|prompt_token_throughput	|2400	|
+|completion_token_count_mean	|31	|
+|completion_token_throughput	|15	|
+|latency_mean	|**	|
+|latency_p50	|**	|
+|latency_p95	|**	|
+|latency_p99	|**	|
+|transactions_per_minute	|**	|
+|price_per_txn	|**	|
+
+<sup>1</sup> ** represents values hidden on purpose; these are available when you run the tool yourself.
+
+The report also includes latency vs. prompt size charts for different concurrency levels. As expected, inference latency increases as prompt size increases, but what is interesting to note is that the increase is much larger at higher concurrency levels (and this behavior varies with instance type).
+
+![Effect of prompt size on inference latency for different concurrency levels](./img/latency_vs_tokens.png)
+
+### How to get started with `FMBench`
+
+The following steps provide a [Quick start guide for `FMBench`](https://github.com/aws-samples/foundation-model-benchmarking-tool#quickstart). For a more detailed DIY version, please see the [`FMBench Readme`](https://github.com/aws-samples/foundation-model-benchmarking-tool?tab=readme-ov-file#the-diy-version-with-gory-details).
+
+1. Each `FMBench` run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical `FMBench` workflow involves either directly using an already provided config file from the [`configs`](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main/src/fmbench/configs) folder in the `FMBench` GitHub repo or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, or a different inference container etc.).
+
+    >A simple config file with key parameters annotated is included in this repo, see [`config.yml`](./config.yml). This file benchmarks performance of Llama2-7b on an `ml.g5.xlarge` instance and an `ml.g5.2xlarge` instance. You can use this provided config file as it is for this Quickstart.
+
+1. Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an Amazon IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run `FMBench`, and a write S3 bucket is created which will hold the metrics and reports generated by `FMBench`. The CloudFormation stack takes about 5 minutes to create.
+
+   |AWS Region                |     Link        |
+   |:------------------------:|:-----------:|
+   |us-east-1 (N. Virginia)    | [<img src="./img/CFT.png">](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=fmbench&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-FMBT/template.yml) |
+   |us-west-2 (Oregon)    | [<img src="./img/CFT.png">](https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/new?stackName=fmbench&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-FMBT/template.yml) |
+
+1. Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the `fmbench-notebook`.
+
+1. On the `fmbench-notebook` open a Terminal and run the following commands.
+
+    ```{.bash}
+    conda create --name fmbench_python311 -y python=3.11 ipykernel
+    source activate fmbench_python311;
+    pip install -U fmbench
+    ```
+
+1. Now you are ready to run `fmbench` with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.
+
+    1. We benchmark performance for the `Llama2-7b` model on a `ml.g5.xlarge` and a `ml.g5.2xlarge` instance type, using the `huggingface-pytorch-tgi-inference` inference container. This test would take about 30 minutes to complete and cost about $0.20.
+
+    1. It uses a simple relationship that 750 words equals 1000 tokens. To get a more accurate representation of token counts, use the `Llama2 tokenizer`. **_It is strongly recommended that for more accurate results on token throughput you use a tokenizer specific to the model you are testing rather than the default tokenizer. See instructions provided [here](https://github.com/aws-samples/foundation-model-benchmarking-tool/tree/main?tab=readme-ov-file#the-diy-version-with-gory-details) on how to use a custom tokenizer_**.
+
+        <!-- markdown-link-check-disable -->
+        ```{.bash}
+        account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
+        region=`aws configure get region`
+        fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/llama2/7b/config-llama2-7b-g5-quick.yml >> fmbench.log 2>&1
+        ```
+        <!-- markdown-link-check-enable -->
+
+    1. Open another terminal window and do a `tail -f` on the `fmbench.log` file to see all the traces being generated at runtime.
+
+        ```{.bash}
+        tail -f fmbench.log
+        ```
+
+1. The generated reports and metrics are available in the `sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id>` bucket. The metrics and report files are also downloaded locally to the `results` directory (created by `FMBench`), and the benchmarking report is available as a Markdown file called `report.md` in the `results` directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.
+
+## 🚨 Benchmarking Llama3 on Amazon Bedrock 🚨
+
+Llama3 is now available on Bedrock (read [blog post](https://aws.amazon.com/blogs/aws/metas-llama-3-models-are-now-available-in-amazon-bedrock/)), and you can now benchmark it using `FMBench`. Here is the config file for benchmarking `Llama3-8b-instruct` and `Llama3-70b-instruct` on Bedrock.
+
+<!-- markdown-link-check-disable -->
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/bedrock/config-bedrock-llama3.yml) for `Llama3-8b-instruct` and `Llama3-70b-instruct`.
+<!-- markdown-link-check-enable -->
+
+## 🚨 Benchmarking Llama3 on Amazon SageMaker 🚨
+
+Llama3 is now available on SageMaker (read [blog post](https://aws.amazon.com/blogs/machine-learning/meta-llama-3-models-are-now-available-in-amazon-sagemaker-jumpstart/)), and you can now benchmark it using `FMBench`. Here are the config files for benchmarking `Llama3-8b-instruct` and `Llama3-70b-instruct` on `ml.p4d.24xlarge`, `ml.inf2.24xlarge` and `ml.g5.12xlarge` instances.
+
+<!-- markdown-link-check-disable -->
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/llama3/8b/config-llama3-8b-instruct-g5-p4d.yml) for `Llama3-8b-instruct` on  `ml.p4d.24xlarge` and `ml.g5.12xlarge`.
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/llama3/70b/config-llama3-70b-instruct-g5-p4d.yml) for `Llama3-70b-instruct` on  `ml.p4d.24xlarge` and `ml.g5.48xlarge`.
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/llama3/8b/config-llama3-8b-inf2-g5.yml) for `Llama3-8b-instruct` on  `ml.inf2.24xlarge` and `ml.g5.12xlarge`.
+<!-- markdown-link-check-enable -->
+
+## Benchmarking Llama2 on Amazon SageMaker
+
+Llama2 models are available through SageMaker JumpStart as well as directly deployable from Hugging Face to a SageMaker endpoint. You can use `FMBench` to benchmark Llama2 on SageMaker for different combinations of instance types and inference containers.
+
+<!-- markdown-link-check-disable -->
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/llama2/7b/config-llama2-7b-g5-quick.yml) for `Llama2-7b` on `ml.g5.xlarge` and `ml.g5.2xlarge` instances, using the [Hugging Face TGI container](763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04).
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/llama2/7b/config-llama2-7b-g4dn-g5-trt.yml) for `Llama2-7b` on `ml.g4dn.12xlarge` instance using the [Deep Java Library DeepSpeed container](763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-deepspeed0.12.6-cu121).
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/llama2/13b/config-llama2-13b-inf2-g5-p4d.yml) for `Llama2-13b` on `ml.g5.12xlarge`, `ml.inf2.24xlarge` and `ml.p4d.24xlarge` instances using the [Hugging Face TGI container](763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04) and the [Deep Java Library & NeuronX container](763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-neuronx-sdk2.16.0).
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/llama2/70b/config-llama2-70b-g5-p4d-trt.yml) for `Llama2-70b` on `ml.p4d.24xlarge` instance using the [Deep Java Library TensorRT container](763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-tensorrtllm0.7.1-cu122).
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/llama2/70b/config-llama2-70b-inf2-g5.yml) for `Llama2-70b` on `ml.inf2.48xlarge` instance using the [HuggingFace TGI with Optimum NeuronX container](763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.17-neuronx-py310-ubuntu22.04).
+<!-- markdown-link-check-enable -->
+
+## Benchmarking Llama2 on Amazon Bedrock
+
+The Llama2-13b-chat and Llama2-70b-chat models are available on [Bedrock](https://aws.amazon.com/bedrock/llama/). You can use `FMBench` to benchmark Llama2 on Bedrock for both on-demand throughput and provisioned throughput inference options.
+
+<!-- markdown-link-check-disable -->
+- [Config file](https://github.com/aws-samples/foundation-model-benchmarking-tool/blob/main/src/fmbench/configs/bedrock/config-bedrock.yml) for `Llama2-13b-chat` and `Llama2-70b-chat` on Bedrock for on-demand throughput.
+<!-- markdown-link-check-enable -->
+
+- For testing provisioned throughput simply replace the `ep_name` parameter in `experiments` section of the config file with the ARN of your provisioned throughput.
+
+## More..
+
+For bug reports, enhancement requests and any questions please create a [GitHub issue](https://github.com/aws-samples/foundation-model-benchmarking-tool/issues) on the `FMBench` repo.
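
The quick-start notes in this README mention that token counts default to a simple relationship of 750 words per 1000 tokens. A minimal sketch of that rule of thumb follows, assuming words are separated by whitespace; `estimate_tokens` is a hypothetical helper written for illustration and is not part of the `FMBench` package, whose actual counting logic may differ.

```python
def estimate_tokens(text: str, words_per_1000_tokens: int = 750) -> int:
    """Estimate a token count from a word count using the 750 words ~ 1000 tokens rule."""
    word_count = len(text.split())  # whitespace-separated words (an assumption)
    return round(word_count * 1000 / words_per_1000_tokens)


# A prompt of 750 words maps to roughly 1000 tokens under this heuristic.
sample_prompt = " ".join(["word"] * 750)
print(estimate_tokens(sample_prompt))
```

Because the ratio is fixed, the estimate ignores how a real tokenizer splits rare words, numbers or code, which is why a model-specific tokenizer is recommended for token throughput figures.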
--- a/recipes/benchmarks/fmbench/config.yml
+++ b/recipes/benchmarks/fmbench/config.yml
+general:
+  name: "llama2-7b-v1"      
+  model_name: "Llama2-7b"
+  
+# AWS and SageMaker settings
+aws:
+  # AWS region, this parameter is templatized, no need to change
+  region: {region}
+  # SageMaker execution role used to run FMBench, this parameter is templatized, no need to change
+  sagemaker_execution_role: {role_arn}
+  # S3 bucket to which metrics, plots and reports would be written to
+  bucket: {write_bucket}
+
+# directory paths in the write bucket, no need to change these
+dir_paths:
+  data_prefix: data
+  prompts_prefix: prompts
+  all_prompts_file: all_prompts.csv
+  metrics_dir: metrics
+  models_dir: models
+  metadata_dir: metadata
+
+# S3 information for reading datasets, scripts and tokenizer
+s3_read_data:
+  # read bucket name, templatized, if left unchanged will default to sagemaker-fmbench-read-<region>-<account_id>
+  read_bucket: {read_bucket}
+  
+  # S3 prefix in the read bucket where deployment and inference scripts should be placed
+  scripts_prefix: scripts
+    
+  # deployment and inference script files to be downloaded are placed in this list
+  # only needed if you are creating a new deployment script or inference script
+  # your HuggingFace token does need to be in this list and should be called "hf_token.txt"
+  script_files:
+  - hf_token.txt
+
+  # configuration files (like this one) are placed in this prefix
+  configs_prefix: configs
+
+  # list of configuration files to download, for now only pricing.yml needs to be downloaded
+  config_files:
+  - pricing.yml
+
+  # S3 prefix for the dataset files
+  source_data_prefix: source_data
+  # list of dataset files, the list below is from the LongBench dataset https://huggingface.co/datasets/THUDM/LongBench
+  source_data_files:
+  - 2wikimqa_e.jsonl
+  - 2wikimqa.jsonl
+  - hotpotqa_e.jsonl
+  - hotpotqa.jsonl
+  - narrativeqa.jsonl
+  - triviaqa_e.jsonl
+  - triviaqa.jsonl
+  # S3 prefix for the tokenizer to be used with the models
+  # NOTE 1: the same tokenizer is used with all the models being tested through a config file
+  # NOTE 2: place your model specific tokenizers in a prefix named as <model_name>_tokenizer
+  #         so the mistral tokenizer goes in mistral_tokenizer, Llama2 tokenizer goes in llama2_tokenizer and so on and so forth.
+  tokenizer_prefix: tokenizer
+  
+  # S3 prefix for prompt templates
+  prompt_template_dir: prompt_template
+
+  # prompt template to use, NOTE: same prompt template gets used for all models being tested through a config file
+  # the FMBench repo already contains a bunch of prompt templates so review those first before creating a new one
+  prompt_template_file: prompt_template_llama2.txt
+
+# steps to run, usually all of these would be
+# set to yes so nothing needs to change here
+# you could, however, bypass some steps for example
+# set the 2_deploy_model.ipynb to no if you are re-running
+# the same config file and the model is already deployed
+run_steps:
+  0_setup.ipynb: yes
+  1_generate_data.ipynb: yes
+  2_deploy_model.ipynb: yes
+  3_run_inference.ipynb: yes
+  4_model_metric_analysis.ipynb: yes
+  5_cleanup.ipynb: yes
+
+
+datasets:
+  # Refer to the 1_generate_data.ipynb notebook
+  # the dataset you use is expected to have the 
+  # columns you put in prompt_template_keys list
+  # and your prompt template also needs to have
+  # the same placeholders (refer to the prompt template folder)
+  prompt_template_keys:
+  - input
+  - context
+  
+  # if your dataset has multiple languages and it has a language
+  # field then you could filter it for a language. Similarly,
+  # you can filter your dataset to only keep prompts between
+  # a certain token length limit (the token length is determined
+  # using the tokenizer you provide in the tokenizer_prefix prefix in the
+  # read S3 bucket). Each of the array entries below create a payload file
+  # containing prompts matching the language and token length criteria.
+  filters:
+  - language: en    
+    min_length_in_tokens: 1
+    max_length_in_tokens: 500
+    payload_file: payload_en_1-500.jsonl
+  - language: en
+    min_length_in_tokens: 500
+    max_length_in_tokens: 1000
+    payload_file: payload_en_500-1000.jsonl
+  - language: en
+    min_length_in_tokens: 1000
+    max_length_in_tokens: 2000
+    payload_file: payload_en_1000-2000.jsonl
+  - language: en
+    min_length_in_tokens: 2000
+    max_length_in_tokens: 3000
+    payload_file: payload_en_2000-3000.jsonl
+  - language: en
+    min_length_in_tokens: 3000
+    max_length_in_tokens: 3840
+    payload_file: payload_en_3000-3840.jsonl
+
+# The tests run on all the datasets
+# configured in the experiment entries below, but
+# the price:performance analysis is only done for 1
+# dataset, which is listed below as the dataset_of_interest
+metrics:
+  dataset_of_interest: en_2000-3000
+
+# all pricing information is in the pricing.yml file
+# this file is provided in the repo. You can add entries
+# to this file for new instance types and new Bedrock models
+pricing: pricing.yml
+
+# inference parameters, these are added to the payload
+# for each inference request. The list here is not static
+# any parameter supported by the inference container can be
+# added to the list. Put the sagemaker parameters in the sagemaker
+# section, bedrock parameters in the bedrock section (not shown here).
+# Use the section name (sagemaker in this example) in the inference_spec.parameter_set
+# section under experiments.
+inference_parameters:
+  sagemaker:
+    do_sample: yes
+    temperature: 0.1
+    top_p: 0.92
+    top_k: 120  
+    max_new_tokens: 100
+    return_full_text: False
+
+# Configuration for experiments to be run. The experiments section is an array
+# so more than one experiment can be added; these could belong to the same model
+# but different instance types, or different models, or even different hosting
+# options (such as one experiment on SageMaker and the other on Bedrock).
+experiments:
+  - name: llama2-7b-g5.xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0
+    # model_id is interpreted in conjunction with the deployment_script, so if you
+    # use a JumpStart model id then set the deployment_script to jumpstart.py.
+    # if deploying directly from HuggingFace this would be a HuggingFace model id
+    # see the DJL serving deployment script in the code repo for reference.
+    model_id: meta-textgeneration-llama-2-7b-f
+    model_version: "3.*"
+    model_name: llama2-7b-f
+    ep_name: llama-2-7b-g5xlarge
+    instance_type: "ml.g5.xlarge"
+    image_uri: '763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'
+    deploy: yes
+    instance_count: 1
+    # FMBench comes packaged with multiple deployment scripts, such as scripts for JumpStart
+    # scripts for deploying using DJL DeepSpeed, tensorRT etc. You can also add your own.
+    # See repo for details    
+    deployment_script: jumpstart.py
+    # FMBench comes packaged with multiple inference scripts, such as scripts for SageMaker
+    # and Bedrock. You can also add your own. See repo for details
+    inference_script: sagemaker_predictor.py
+    inference_spec:
+      # this should match one of the sections in the inference_parameters section above
+      parameter_set: sagemaker
+    # runs are done for each combination of payload file and concurrency level
+    payload_files:
+    - payload_en_1-500.jsonl
+    - payload_en_500-1000.jsonl
+    - payload_en_1000-2000.jsonl
+    - payload_en_2000-3000.jsonl
+    #- payload_en_3000-3840.jsonl
+    # concurrency level refers to number of requests sent in parallel to an endpoint
+    # the next set of requests is sent once responses for all concurrent requests have
+    # been received.
+    concurrency_levels:
+    - 1
+    - 2
+    - 4
+
+    accept_eula: true
+    # Environment variables to be passed to the container
+    # this is not a fixed list, you can add more parameters as applicable.
+    env:
+      SAGEMAKER_PROGRAM: "inference.py"
+      ENDPOINT_SERVER_TIMEOUT: "3600"
+      MODEL_CACHE_ROOT: "/opt/ml/model"
+      SAGEMAKER_ENV: "1"
+      HF_MODEL_ID: "/opt/ml/model"
+      MAX_INPUT_LENGTH: "4095"
+      MAX_TOTAL_TOKENS: "4096"
+      SM_NUM_GPUS: "1"
+      SAGEMAKER_MODEL_SERVER_WORKERS: "1"
+
+  - name: llama2-7b-g5.2xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0
+    # model_id is interpreted in conjunction with the deployment_script, so if you
+    # use a JumpStart model id then set the deployment_script to jumpstart.py.
+    # if deploying directly from HuggingFace this would be a HuggingFace model id
+    # see the DJL serving deployment script in the code repo for reference. 
+    model_id: meta-textgeneration-llama-2-7b-f
+    model_version: "3.*"
+    model_name: llama2-7b-f
+    ep_name: llama-2-7b-g5-2xlarge
+    instance_type: "ml.g5.2xlarge"
+    image_uri: '763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'
+    deploy: yes
+    instance_count: 1
+    # FMBench comes packaged with multiple deployment scripts, such as scripts for JumpStart
+    # and scripts for deploying with DJL DeepSpeed, TensorRT, etc. You can also add your own.
+    # See repo for details.
+    deployment_script: jumpstart.py
+    # FMBench comes packaged with multiple inference scripts, such as scripts for SageMaker
+    # and Bedrock. You can also add your own. See repo for details
+    inference_script: sagemaker_predictor.py
+    inference_spec:
+      # this should match one of the sections in the inference_parameters section above
+      parameter_set: sagemaker
+    # runs are done for each combination of payload file and concurrency level
+    payload_files:
+    - payload_en_1-500.jsonl
+    - payload_en_500-1000.jsonl
+    - payload_en_1000-2000.jsonl
+    - payload_en_2000-3000.jsonl
+    #- payload_en_3000-3840.jsonl
+    
+    # concurrency level refers to number of requests sent in parallel to an endpoint
+    # the next set of requests is sent once responses for all concurrent requests have
+    # been received.
+    concurrency_levels:
+    - 1
+    - 2
+    - 4
+    # Added for models that require accepting a EULA
+    accept_eula: true
+    # Environment variables to be passed to the container
+    # this is not a fixed list, you can add more parameters as applicable.
+    env:
+      SAGEMAKER_PROGRAM: "inference.py"
+      ENDPOINT_SERVER_TIMEOUT: "3600"
+      MODEL_CACHE_ROOT: "/opt/ml/model"
+      SAGEMAKER_ENV: "1"
+      HF_MODEL_ID: "/opt/ml/model"
+      MAX_INPUT_LENGTH: "4095"
+      MAX_TOTAL_TOKENS: "4096"
+      SM_NUM_GPUS: "1"
+      SAGEMAKER_MODEL_SERVER_WORKERS: "1"
+
+# parameters related to how the final report is generated
+report:
+  # constraints for latency, cost and error rate
+  # an experiment is considered successful or eligible for
+  # selection for a use-case if it satisfies all of the following
+  # constraints. Experiments are scored as per these criteria;
+  # a higher score is better (see 4_model_metric_analysis.ipynb score_run function)
+  latency_budget: 2
+  cost_per_10k_txn_budget: 20
+  error_rate_budget: 0
+  # other misc reporting parameters, see 4_model_metric_analysis.ipynb
+  # for more information
+  per_inference_request_file: per_inference_request_results.csv
+  all_metrics_file: all_metrics.csv
+  txn_count_for_showing_cost: 10000
+  v_shift_w_single_instance: 0.025
+  v_shift_w_gt_one_instance: 0.025
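The config above describes two rules: a run is performed for each combination of payload file and concurrency level, and an experiment is eligible only if it satisfies all report constraints. The sketch below illustrates both in Python. The names `run_grid` and `is_eligible`, the metric argument names, and the use of `<=` for the budget comparisons are assumptions made for illustration; the actual scoring lives in the `score_run` function of `4_model_metric_analysis.ipynb` and may differ.

```python
from itertools import product
from typing import Dict, List, Tuple

# Values copied from the config above.
PAYLOAD_FILES = [
    "payload_en_1-500.jsonl",
    "payload_en_500-1000.jsonl",
    "payload_en_1000-2000.jsonl",
    "payload_en_2000-3000.jsonl",
]
CONCURRENCY_LEVELS = [1, 2, 4]
REPORT = {"latency_budget": 2, "cost_per_10k_txn_budget": 20, "error_rate_budget": 0}


def run_grid(payload_files: List[str], concurrency_levels: List[int]) -> List[Tuple[str, int]]:
    """Return one (payload file, concurrency level) pair per benchmark run."""
    return list(product(payload_files, concurrency_levels))


def is_eligible(latency: float, cost_per_10k_txn: float, error_rate: float,
                report: Dict[str, float]) -> bool:
    """Hypothetical check: every metric must stay within its budget.

    Metrics are assumed to be expressed in the same units as the budgets.
    """
    return (
        latency <= report["latency_budget"]
        and cost_per_10k_txn <= report["cost_per_10k_txn_budget"]
        and error_rate <= report["error_rate_budget"]
    )


if __name__ == "__main__":
    for payload_file, concurrency in run_grid(PAYLOAD_FILES, CONCURRENCY_LEVELS):
        print(f"run: {payload_file} @ concurrency {concurrency}")
    print(is_eligible(latency=1.4, cost_per_10k_txn=12.5, error_rate=0.0, report=REPORT))
```

With four payload files and three concurrency levels, each experiment in this config produces twelve runs.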
--- a/recipes/benchmarks/fmbench/img/CFT.png
+++ b/recipes/benchmarks/fmbench/img/CFT.png
--- a/recipes/benchmarks/fmbench/img/business_summary.png
+++ b/recipes/benchmarks/fmbench/img/business_summary.png
--- a/recipes/benchmarks/fmbench/img/instances.png
+++ b/recipes/benchmarks/fmbench/img/instances.png
--- a/recipes/benchmarks/fmbench/img/latency_vs_tokens.png
+++ b/recipes/benchmarks/fmbench/img/latency_vs_tokens.png
--- a/recipes/benchmarks/inference_throughput/README.md
+++ b/recipes/benchmarks/inference_throughput/README.md
+# Inference Throughput Benchmarks
+In this folder we provide a series of benchmark scripts that analyze the inference throughput of Llama 2 models on various backends:
+* On-prem - Popular serving frameworks and containers (e.g. vLLM)
+* [**WIP**] Cloud API - Popular API services (e.g. Azure Model-as-a-Service)
+* [**WIP**] On-device - Popular on-device inference solutions on Android and iOS (e.g. mlc-llm, QNN)
+* [**WIP**] Optimization - Popular optimization solutions for faster inference and quantization (e.g. AutoAWQ)
+
+# Why
+There are three major reasons we want to run these benchmarks and share them with our Llama community:
+* Provide inference throughput analysis based on real-world situations to help you select the best service or deployment for your scenario
+* Provide a baseline measurement for validating various optimization solutions on different backends, so we can provide guidance on which solutions work best for your scenario
+* Encourage the community to develop benchmarks on top of our work, so we can better quantify the latest proposed solutions combined with current popular frameworks, especially in this fast-moving area
+
+# Parameters
+Here are the parameters (if applicable) that you can configure for running the benchmark:
+* **PROMPT** - Prompt sent in for inference (configure the length of prompt, choose from 5, 25, 50, 100, 500, 1k and 2k)
+* **MAX_NEW_TOKENS** - Max number of tokens generated
+* **CONCURRENT_LEVELS** - Max number of concurrent requests
+* **MODEL_PATH** - Model source
+* **MODEL_HEADERS** - Request headers
+* **SAFE_CHECK** - Content safety check (either Azure service or simulated latency)
+* **THRESHOLD_TPS** - Threshold TPS (threshold for tokens per second below which we deem the query to be slow)
+* **TOKENIZER_PATH** - Tokenizer source
+* **RANDOM_PROMPT_LENGTH** - Random prompt length (for pretrained models)
+* **NUM_GPU** - Number of GPUs for request dispatch among multiple containers
+* **TEMPERATURE** - Temperature for inference
+* **TOP_P** - Top_p for inference
+* **MODEL_ENDPOINTS** - Container endpoints
+* Model parallelism or model replicas - Load one model into multiple GPUs or multiple model replicas on one instance. More detail in the README files for specific containers.
+
+You can also configure other model hyperparameters as part of the request payload.  
+All these parameters are stored in ```parameters.json``` and real prompts are stored in ```input.jsonl```. Running the script will load these configurations.
+
+
+
+# Metrics
+The benchmark will report these metrics per instance:
+* Number of concurrent requests
+* P50 latency (ms)
+* P99 latency (ms)
+* Requests per second (RPS)
+* Output tokens per second
+* Output tokens per second per GPU
+* Input tokens per second
+* Input tokens per second per GPU
+* Average tokens per second per request
+
+We intend to add these metrics in the future:
+* Time to first token (TTFT)
+  
+The benchmark result will be displayed in the terminal output and saved as a CSV file (```performance_metrics.csv```) which you can export to spreadsheets.
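The metrics listed above can be derived from per-request latencies and token counts. The sketch below is a minimal, standard-library illustration of that arithmetic, not the benchmark's own implementation. The function names `percentile` and `summarize` and their arguments are hypothetical, and the percentile uses linear interpolation, which is also the default of `numpy.percentile`.

```python
import math
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(latencies_ms: List[float], output_tokens: List[int],
              input_tokens_per_request: int, wall_time_s: float,
              num_gpus: int = 1) -> Dict[str, float]:
    """Aggregate one batch of concurrent requests into the reported metrics."""
    n = len(latencies_ms)
    output_tps = sum(output_tokens) / wall_time_s
    input_tps = input_tokens_per_request * n / wall_time_s
    # Per-request throughput uses each request's own latency, converted to seconds.
    per_request_tps = [t / (lat / 1000) for t, lat in zip(output_tokens, latencies_ms)]
    return {
        "concurrent_requests": n,
        "p50_latency_ms": percentile(latencies_ms, 50),
        "p99_latency_ms": percentile(latencies_ms, 99),
        "rps": n / wall_time_s,
        "output_tokens_per_second": output_tps,
        "output_tokens_per_second_per_gpu": output_tps / num_gpus,
        "input_tokens_per_second": input_tps,
        "input_tokens_per_second_per_gpu": input_tps / num_gpus,
        "avg_tokens_per_second_per_request": sum(per_request_tps) / n,
    }


if __name__ == "__main__":
    metrics = summarize([100.0, 200.0, 300.0, 400.0], [10, 20, 30, 40],
                        input_tokens_per_request=25, wall_time_s=0.5, num_gpus=2)
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

Note that the overall tokens-per-second figures divide by the wall-clock time of the whole batch, while the per-request average divides each request's tokens by its own latency, so the two are not expected to match.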
+
+# Getting Started
+Please follow the ```README.md``` in each subfolder for instructions on how to set up and run these benchmarks.
+
--- a/recipes/benchmarks/inference_throughput/cloud-api/README.md
+++ b/recipes/benchmarks/inference_throughput/cloud-api/README.md
+# Llama-Cloud-API-Benchmark
+This folder contains code to run inference benchmarks for Llama 2 models on cloud APIs from popular cloud service providers. The benchmark focuses on overall inference **throughput** when querying the API endpoint for output generation at different levels of concurrent requests. Note that to send queries to the API endpoint, you need a subscription with the cloud service provider, and fees will apply.
+
+Disclaimer - The purpose of the code is to provide a configurable setup to measure inference throughput. It is not representative of the performance of these API services and we do not plan to make comparisons between different API providers.
+
+
+# Azure - Getting Started
+To get started, there are certain steps we need to take to deploy the models:
+
+<!-- markdown-link-check-disable -->
+* Register for a valid Azure account with subscription [here](https://azure.microsoft.com/en-us/free/search/?ef_id=_k_CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE_k_&OCID=AIDcmm5edswduu_SEM__k_CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE_k_&gad_source=1&gclid=CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE)
+<!-- markdown-link-check-enable -->
+* Take a quick look at what [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-studio/what-is-ai-studio?tabs=home) is and navigate to the website from the link in the article
+* Follow the demos in the article to create a project and [resource](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal) group, or follow the guide [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio)
+* Select Llama models from Model catalog
+* Deploy with "Pay-as-you-go"
+
+Once deployed successfully, you should be assigned an API endpoint and a security key for inference.
+For more information on model deployment and inference, consult Azure's official documentation [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio).
+
+Now, replace the endpoint URL and API key in ```azure/parameters.json```. For the `MODEL_ENDPOINTS` parameter, the suffix should be `v1/chat/completions` for chat models and `v1/completions` for pretrained models.
+Note that the API endpoint might enforce a rate limit on token generation within a certain period of time. If you encounter this error, try reducing `MAX_NEW_TOKEN` or start with smaller `CONCURRENT_LEVELS`.
+
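As a small illustration of the suffix rule for `MODEL_ENDPOINTS`, the sketch below builds the full endpoint URL from a base URL. The helper name `build_endpoint` and the example base URL are hypothetical; use the endpoint assigned to your own deployment.

```python
def build_endpoint(base_url: str, chat_model: bool) -> str:
    """Append the suffix that matches the model type to the deployment's base URL."""
    suffix = "v1/chat/completions" if chat_model else "v1/completions"
    # Strip any trailing slash so the result never contains a double slash.
    return base_url.rstrip("/") + "/" + suffix


if __name__ == "__main__":
    # Placeholder base URL, not a real deployment.
    base = "https://my-llama-endpoint.example.com/"
    print(build_endpoint(base, chat_model=True))
    print(build_endpoint(base, chat_model=False))
```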
+Once everything is configured, run the chat model benchmark:
+```python chat_azure_api_benchmark.py```
+
+To run the pretrained model benchmark:
+```python pretrained_azure_api_benchmark.py```
+
+Once finished, the result will be written to a CSV file in the same directory, which can later be imported into a dashboard of your choice.
--- a/recipes/benchmarks/inference_throughput/cloud-api/azure/chat_azure_api_benchmark.py
+++ b/recipes/benchmarks/inference_throughput/cloud-api/azure/chat_azure_api_benchmark.py
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+import csv
+import json
+import time
+import urllib.error
+import urllib.request
+import numpy as np
+import transformers
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from typing import Dict, Tuple, List
+
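+# Despite the .jsonl extension, this file holds a single JSON object keyed by prompt length
+with open('input.jsonl') as input_file:
+    prompt_data = json.load(input_file)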
+
+# Prompt data stored in json file. Choose from number of tokens - 5, 25, 50, 100, 500, 1k, 2k.
+PROMPT = prompt_data["25"] 
+
+with open('parameters.json') as parameters:
+    params = json.load(parameters)
+
+MAX_NEW_TOKEN = params["MAX_NEW_TOKEN"]
+CONCURRENT_LEVELS = params["CONCURRENT_LEVELS"]
+# Threshold for tokens per second below which we deem the query to be slow
+THRESHOLD_TPS = params["THRESHOLD_TPS"] 
+# Default Llama 2 tokenizer, replace with your own tokenizer 
+TOKENIZER_PATH = params["TOKENIZER_PATH"] 
+TEMPERATURE = params["TEMPERATURE"]
+TOP_P = params["TOP_P"]
+# Model endpoint provided with API provider 
+MODEL_ENDPOINTS = params["MODEL_ENDPOINTS"]
+API_KEY = params["API_KEY"]
+SYS_PROMPT = params["SYS_PROMPT"]
+
+
+# This tokenizer is downloaded from the Azure model catalog for each specific model. Its main purpose is to encode the responses so that output tokens can be counted
+tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_PATH)
+
+num_token_input_prompt = len(tokenizer.encode(PROMPT))
+print(f"Number of token for input prompt: {num_token_input_prompt}")
+
+
+def generate_text() -> Tuple[float, int]:
+
+    # Configure the payload sent to the API endpoint
+    payload = {"messages":[
+                {"role":"system", "content": SYS_PROMPT},
+                {"role":"user", "content": PROMPT}], 
+            "max_tokens": MAX_NEW_TOKEN,
+            "temperature": TEMPERATURE,
+            "top_p" : TOP_P,
+            # Boolean rather than the string "False": request a single non-streamed response
+            "stream": False
+    }
+    body = str.encode(json.dumps(payload))
+    url = MODEL_ENDPOINTS
+    api_key = API_KEY
+    if not api_key:
+        raise Exception("API Key is missing")
+    
+    headers = {'Content-Type':'application/json', 'Authorization':(api_key)}
+    req = urllib.request.Request(url, body, headers)
+    token_count = 0
+    output = ""
+    start_time = time.time()
+    # Send request
+    try:
+        response = urllib.request.urlopen(req)
+        result = response.read()
+        output = json.loads(result)["choices"][0]["message"]["content"]
+        
+    except urllib.error.HTTPError as error:
+        print("The request failed with status code: " + str(error.code))
+        # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
+        print(error.info())
+        print(error.read().decode("utf8", 'ignore'))
+
+    end_time = time.time()
+    # Convert to ms
+    latency = (end_time - start_time) * 1000  
+    token_count = len(tokenizer.encode(output))
+
+    return latency, token_count
+
+
+def evaluate_performance(concurrent_requests: int) -> Tuple[float, float, float, float, float, float, int]:
+    latencies = []
+    total_output_tokens = 0
+    output_tokens_per_second_each_request = []
+    start_time = time.time()
+
+    # Init multi-thread execution 
+    with ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
+        future_to_req = {executor.submit(generate_text): i for i in range(concurrent_requests)}
+        for future in as_completed(future_to_req):
+            latency, token_count = future.result()
+            latencies.append(latency)
+            total_output_tokens += token_count
+            # Calculate tokens per second for this request
+            tokens_per_sec = token_count / (latency / 1000)
+            output_tokens_per_second_each_request.append(tokens_per_sec)
+
+    end_time = time.time()
+    total_time = end_time - start_time
+    # RPS (requests per second)
+    rps = concurrent_requests / total_time  
+    # Overall tokens per second
+    output_tokens_per_second_overall = total_output_tokens / total_time  
+    input_tokens_per_second_overall = (num_token_input_prompt * concurrent_requests) / total_time
+    p50_latency = np.percentile(latencies, 50)
+    p99_latency = np.percentile(latencies, 99)
+
+    # Count the number of requests below the token-per-second threshold
+    below_threshold_count = sum(1 for tps in output_tokens_per_second_each_request if tps < THRESHOLD_TPS)
+    output_tokens_per_second_per_request = sum(output_tokens_per_second_each_request)/len(output_tokens_per_second_each_request)
+
+    return p50_latency, p99_latency, rps, output_tokens_per_second_overall, input_tokens_per_second_overall, output_tokens_per_second_per_request, below_threshold_count
+
+
+
+# Print markdown
+print("| Number of Concurrent Requests | P50 Latency (ms) | P99 Latency (ms) | RPS | Output Tokens per Second | Input Tokens per Second | Average Output Tokens per Second per Request | Number of Requests Below Threshold |")
+print("|-------------------------------|------------------|------------------|-----|--------------------------|-------------------------|----------------------------------------------|------------------------------------|")
+
+# Save to file
+csv_file = "performance_metrics.csv"
+with open(csv_file, "w", newline='') as f:
+    writer = csv.writer(f)
+    writer.writerow(["Number of Concurrent Requests", "P50 Latency (ms)", "P99 Latency (ms)", "RPS", "Output Tokens per Second", "Input Tokens per Second", "Average Output Tokens per Second per Request"])
+
+    for level in CONCURRENT_LEVELS:
+        p50_latency, p99_latency, rps, output_tokens_per_second_overall, input_tokens_per_second_overall, output_tokens_per_second_per_request, below_threshold_count = evaluate_performance(level)
+        print(f"| {level} | {p50_latency:.2f} | {p99_latency:.2f} | {rps:.2f} | {output_tokens_per_second_overall:.2f} | {input_tokens_per_second_overall:.2f} | {output_tokens_per_second_per_request:.2f} | {below_threshold_count:.2f} |")
+        writer.writerow([level, round(p50_latency, 2), round(p99_latency, 2), round(rps, 2), round(output_tokens_per_second_overall, 2), round(input_tokens_per_second_overall, 2), round(output_tokens_per_second_per_request, 2)])
--- a/recipes/benchmarks/inference_throughput/cloud-api/azure/input.jsonl
+++ b/recipes/benchmarks/inference_throughput/cloud-api/azure/input.jsonl
+{
+    "5" : "What is Deep Learning",
+    "25" : "How does Llama 2 improve text generation, offering coherent, relevant, and contextually appropriate content?",
+    "50" : "In the context of the rapid evolution of AI, how does the Llama 2 address issues of ethical concerns, bias reduction, and increased performance to generate text that is not only coherent but also culturally sensitive?",
+    "100" : "As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience?",
+    "500" : "In AI context as a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience?",
+    "1k" : "In the context of the AI evolution, as a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience?",
+    "2k" : "In the context of the evolution of AI, especially in the crazy LLM field, as a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience?"
+}
\ No newline at end of file
--- a/recipes/benchmarks/inference_throughput/cloud-api/azure/parameters.json
+++ b/recipes/benchmarks/inference_throughput/cloud-api/azure/parameters.json
+{
+    "MAX_NEW_TOKEN" : 256,
+    "CONCURRENT_LEVELS" : [1, 2, 4, 8, 16, 32, 64],
+    "THRESHOLD_TPS" : 7,
+    "TOKENIZER_PATH" : "../../tokenizer",
+    "RANDOM_PROMPT_LENGTH" : 1000,
+    "TEMPERATURE" : 0.6,
+    "TOP_P" : 0.9,
+    "MODEL_ENDPOINTS" : "https://your-endpoint.inference.ai.azure.com/v1/completions",
+    "API_KEY" : "your-auth-key",
+    "SYS_PROMPT" : "You are a helpful assistant."
+}
\ No newline at end of file
--- a/recipes/benchmarks/inference_throughput/cloud-api/azure/pretrained_azure_api_benchmark.py
+++ b/recipes/benchmarks/inference_throughput/cloud-api/azure/pretrained_azure_api_benchmark.py
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+import csv
+import json
+import time
+import random
+import urllib.request
+import numpy as np
+import transformers
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from typing import Dict, Tuple, List
+
+# Predefined inputs
+with open('input.jsonl') as input:
+    prompt_data = json.load(input)
+
+with open('parameters.json') as parameters:
+    params = json.load(parameters)
+
+MAX_NEW_TOKEN = params["MAX_NEW_TOKEN"]
+CONCURRENT_LEVELS = params["CONCURRENT_LEVELS"]
+# Threshold for tokens per second below which we deem the query to be slow
+THRESHOLD_TPS = params["THRESHOLD_TPS"] 
+# Default Llama 2 tokenizer, replace with your own tokenizer 
+TOKENIZER_PATH = params["TOKENIZER_PATH"]
+RANDOM_PROMPT_LENGTH = params["RANDOM_PROMPT_LENGTH"]
+TEMPERATURE = params["TEMPERATURE"]
+TOP_P = params["TOP_P"]
+# Model endpoint provided with API provider 
+MODEL_ENDPOINTS = params["MODEL_ENDPOINTS"]
+API_KEY = params["API_KEY"]
+
+
+# This tokenizer is downloaded from the Azure model catalog for each specific model. Its main purpose is to decode the responses for token counting.
+tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_PATH)
+
+# Select vocabulary entries longer than 2 characters (closer to real words) that are ASCII-only, as a rough proxy for English (not foolproof)
+vocab = [token for token in tokenizer.get_vocab().keys() if len(token) > 2 and all(ord(c) < 128 for c in token)]
+
+def generate_random_prompt(num_tokens):
+    generated_tokens_count = 0
+    selected_tokens = ""
+    while generated_tokens_count < num_tokens:
+        selected_tokens += random.choice(vocab)
+        selected_tokens += " "
+        generated_tokens_count = len(tokenizer.encode(selected_tokens))
+
+    return selected_tokens
+
+PROMPT = generate_random_prompt(RANDOM_PROMPT_LENGTH)
+num_token_input_prompt = len(tokenizer.encode(PROMPT))
+print(f"Number of token for input prompt: {num_token_input_prompt}")
+
+def generate_text() -> Tuple[float, int]:
+
+    #Configure payload data sending to API endpoint
+    payload = {"prompt": PROMPT, 
+               "max_tokens": MAX_NEW_TOKEN, 
+               "temperature": TEMPERATURE,
+               "top_p": TOP_P,      
+    }
+    body = str.encode(json.dumps(payload))
+    url = MODEL_ENDPOINTS
+    api_key = API_KEY
+    if not api_key:
+        raise Exception("API Key is missing")
+    
+    headers = {'Content-Type':'application/json', 'Authorization':(api_key)}
+    req = urllib.request.Request(url, body, headers)
+    token_count = 0
+    output = ""
+    start_time = time.time()
+    # Send request
+    try:
+        response = urllib.request.urlopen(req)
+        result = response.read()
+        output = json.loads(result)["choices"][0]["text"]
+        
+    except urllib.error.HTTPError as error:
+        print("The request failed with status code: " + str(error.code))
+        # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
+        print(error.info())
+        print(error.read().decode("utf8", 'ignore'))
+
+    end_time = time.time()
+    # Convert to ms
+    latency = (end_time - start_time) * 1000  
+    token_count = len(tokenizer.encode(output))
+
+    return latency, token_count
+
+
+def evaluate_performance(concurrent_requests: int) -> Tuple[float, float, float, float, float, float, int]:
+    latencies = []
+    total_output_tokens = 0
+    output_tokens_per_second_each_request = []
+    start_time = time.time()
+
+    # Init multi-thread execution 
+    with ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
+        future_to_req = {executor.submit(generate_text): i for i in range(concurrent_requests)}
+        for future in as_completed(future_to_req):
+            latency, token_count = future.result()
+            latencies.append(latency)
+            total_output_tokens += token_count
+            # Calculate tokens per second for this request
+            tokens_per_sec = token_count / (latency / 1000)
+            output_tokens_per_second_each_request.append(tokens_per_sec)
+
+    end_time = time.time()
+    total_time = end_time - start_time
+    # RPS (requests per second)
+    rps = concurrent_requests / total_time  
+    # Overall tokens per second
+    output_tokens_per_second_overall = total_output_tokens / total_time  
+    input_tokens_per_second_overall = (num_token_input_prompt * concurrent_requests) / total_time
+    p50_latency = np.percentile(latencies, 50)
+    p99_latency = np.percentile(latencies, 99)
+
+    # Count the number of requests below the token-per-second threshold
+    below_threshold_count = sum(1 for tps in output_tokens_per_second_each_request if tps < THRESHOLD_TPS)
+    output_tokens_per_second_per_request = sum(output_tokens_per_second_each_request)/len(output_tokens_per_second_each_request)
+
+    return p50_latency, p99_latency, rps, output_tokens_per_second_overall, input_tokens_per_second_overall, output_tokens_per_second_per_request, below_threshold_count
+
+
+
+# Print markdown
+print("| Number of Concurrent Requests | P50 Latency (ms) | P99 Latency (ms) | RPS | Output Tokens per Second | Input Tokens per Second | Average Output Tokens per Second per Request | Number of Requests Below Threshold |")
+print("|-------------------------------|------------------|------------------|-----|--------------------------|-------------------------|----------------------------------------------|------------------------------------|")
+
+# Save to file
+csv_file = "performance_metrics.csv"
+with open(csv_file, "w", newline='') as f:
+    writer = csv.writer(f)
+    writer.writerow(["Number of Concurrent Requests", "P50 Latency (ms)", "P99 Latency (ms)", "RPS", "Output Tokens per Second", "Input Tokens per Second", "Average Output Tokens per Second per Request"])
+
+    for level in CONCURRENT_LEVELS:
+        p50_latency, p99_latency, rps, output_tokens_per_second_overall, input_tokens_per_second_overall, output_tokens_per_second_per_request, below_threshold_count = evaluate_performance(level)
+        print(f"| {level} | {p50_latency:.2f} | {p99_latency:.2f} | {rps:.2f} | {output_tokens_per_second_overall:.2f} | {input_tokens_per_second_overall:.2f} | {output_tokens_per_second_per_request:.2f} | {below_threshold_count:.2f} |")
+        writer.writerow([level, round(p50_latency, 2), round(p99_latency, 2), round(rps, 2), round(output_tokens_per_second_overall, 2), round(input_tokens_per_second_overall, 2), round(output_tokens_per_second_per_request, 2)])
--- a/recipes/benchmarks/inference_throughput/on-prem/README.md
+++ b/recipes/benchmarks/inference_throughput/on-prem/README.md
+# Llama-On-Prem-Benchmark
+This folder contains code to run inference benchmarks for Meta Llama 3 models on-prem with popular serving frameworks.
+The benchmark focuses on overall inference **throughput** for running containers on one instance (single or multiple GPUs) that you can acquire from cloud service providers such as Azure and AWS. You can also run this benchmark on a local laptop or desktop.
+We support benchmarks on the following serving framework:
+* [vLLM](https://github.com/vllm-project/vllm)
+
+
+# vLLM - Getting Started
+
+To get started, we first need to deploy containers on-prem as an API host. Follow the guidance [here](../../../inference/model_servers/llama-on-prem.md#setting-up-vllm-with-llama-3) to deploy vLLM on-prem.
+
+Note that in the common scenario where overall throughput is what matters, we suggest you prioritize deploying as many model replicas as possible to reach higher overall throughput and requests per second (RPS), rather than deploying one model container across multiple GPUs for model parallelism. Additionally, when deploying multiple model replicas, a higher-level wrapper is needed to handle load balancing; the benchmark scripts simulate this.
+For example, we have an instance from Azure that has 8xA100 80G GPUs, and we want to deploy the Meta Llama 3 70B instruct model, which is around 140GB with FP16. So for deployment we can do:
+* 1x70B model parallelized across 8 GPUs: each GPU uses around 17.5GB of memory for model weights.
+* 2x70B models, each using 4 GPUs: each GPU uses around 35GB of memory for model weights.
+* 4x70B models, each using 2 GPUs: each GPU uses around 70GB of memory for model weights. (Preferred configuration for max overall throughput. Note that you will have 4 endpoints hosted on different ports, and the benchmark script will route requests to each model equally.)
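
The per-GPU figures in the list above follow from simple arithmetic. The sketch below reproduces them as a weights-only estimate; the constant and function names are illustrative and not part of the benchmark scripts, and the estimate ignores KV cache and activation memory.

```python
# Rough weights-only memory estimate per GPU for each replica layout.
# All names here are illustrative; they do not appear in the benchmark scripts.
PARAMS_BILLION = 70   # Meta Llama 3 70B
BYTES_PER_PARAM = 2   # FP16 stores each parameter in 2 bytes
TOTAL_GPUS = 8        # the 8xA100 80G instance from the example


def weights_gb_per_gpu(num_replicas: int) -> float:
    """Return the approximate GB of model weights held by each GPU."""
    gpus_per_replica = TOTAL_GPUS // num_replicas
    model_gb = PARAMS_BILLION * BYTES_PER_PARAM  # about 140 GB in FP16
    return model_gb / gpus_per_replica


for replicas in (1, 2, 4):
    print(f"{replicas}x70B: {weights_gb_per_gpu(replicas):.1f} GB of weights per GPU")
```

With 70GB of weights on an 80GB card, the 4-replica layout leaves only about 10GB per GPU for everything else, so it is worth confirming that your target context length and batch size fit before committing to it.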
+
+Here is an example of deploying two 70B chat model replicas across 8 GPUs with vLLM.
+```
+CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server  --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --disable-log-requests --port 8000
+CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server  --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --disable-log-requests --port 8001
+```
+Once you have finished deployment, you can use the command below to run benchmark scripts in a separate terminal.
+
+```
+python chat_vllm_benchmark.py
+```
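
The load balancing across replicas mentioned earlier can be pictured as a thread-safe round-robin over the deployed endpoints. This is a minimal sketch of the idea, not the benchmark script's own code; the endpoint URLs are assumptions that match the two ports used in the deployment example, and `next_endpoint` is a hypothetical helper.

```python
import itertools
import threading

# Assumed endpoints for the two vLLM servers started on ports 8000 and 8001.
ENDPOINTS = [
    "http://localhost:8000/v1/chat/completions",
    "http://localhost:8001/v1/chat/completions",
]

_cycle = itertools.cycle(ENDPOINTS)
_lock = threading.Lock()


def next_endpoint() -> str:
    """Return the next endpoint in round-robin order; safe to call from many threads."""
    with _lock:
        return next(_cycle)


# Four consecutive picks alternate between the two replicas.
picks = [next_endpoint() for _ in range(4)]
print(picks)
```

Holding the lock only while choosing the endpoint, and not while the HTTP request is in flight, keeps the dispatcher from serializing the requests themselves.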
+<!-- markdown-link-check-disable -->
+If you are going to use [Azure AI content check](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety), then you should install dependencies as shown below in your terminal:
+<!-- markdown-link-check-enable -->
+```
+pip install azure-ai-contentsafety azure-core
+```
+Besides chat models, we also provide benchmark scripts for running pretrained models on text completion tasks. To better simulate real traffic, we generate a configurable random-token prompt as input. In this process, we select vocabulary entries longer than 2 characters so that the generated words are closer to English words than to symbols.
+However, random token prompts can't be used for chat model benchmarks, since a chat model expects a valid question. When fed random prompts, chat models rarely produce answers long enough to reach our ```MAX_NEW_TOKEN``` limit, defeating the purpose of running throughput benchmarks. Hence, for chat models, the questions are repeated to form long inputs, such as the 2k and 4k inputs.
+To run the pretrained model benchmark, use the command below.
+```
+python pretrained_vllm_benchmark.py
+```
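
For chat benchmarks, the long inputs are described above as a question repeated until it reaches the desired length. The sketch below shows one way to do that; `build_long_prompt` and the whitespace-based token counter are hypothetical stand-ins, and in practice the counter would be replaced with the length of `tokenizer.encode(...)` for your model's tokenizer.

```python
from typing import Callable


def build_long_prompt(question: str, target_tokens: int,
                      count_tokens: Callable[[str], int]) -> str:
    """Repeat a (non-empty) question until the prompt reaches target_tokens."""
    prompt = question
    while count_tokens(prompt) < target_tokens:
        prompt += " " + question
    return prompt


def whitespace_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


question = "How does Llama 2 improve text generation?"
prompt = build_long_prompt(question, 50, whitespace_count)
print(whitespace_count(prompt))
```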
--- a/recipes/benchmarks/inference_throughput/on-prem/vllm/chat_vllm_benchmark.py
+++ b/recipes/benchmarks/inference_throughput/on-prem/vllm/chat_vllm_benchmark.py
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+import csv
+import json
+import time
+import random
+import threading
+import numpy as np
+import requests
+import transformers
+import torch
+
+# Imports for Azure content safety
+from azure.ai.contentsafety import ContentSafetyClient
+from azure.core.credentials import AzureKeyCredential
+from azure.core.exceptions import HttpResponseError
+from azure.ai.contentsafety.models import AnalyzeTextOptions
+
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from typing import Dict, Tuple, List
+
+
+
+with open('input.jsonl') as input:
+    prompt_data = json.load(input)
+
+# Prompt data stored in json file. Choose from number of tokens - 5, 25, 50, 100, 500, 1k, 2k.
+# You can also configure and add your own prompt in input.jsonl
+PROMPT = prompt_data["1k"] 
+
+with open('parameters.json') as parameters:
+    params = json.load(parameters)
+
+MAX_NEW_TOKENS = params["MAX_NEW_TOKENS"]
+CONCURRENT_LEVELS = params["CONCURRENT_LEVELS"]
+# Replace with your own deployment
+MODEL_PATH = params["MODEL_PATH"]
+MODEL_HEADERS = params["MODEL_HEADERS"]
+SAFE_CHECK = params["SAFE_CHECK"]
+# Threshold for tokens per second below which we deem the query to be slow
+THRESHOLD_TPS = params["THRESHOLD_TPS"] 
+TEMPERATURE = params["TEMPERATURE"]
+TOP_P = params["TOP_P"]
+# Add your model endpoints here and specify the port number. You can acquire the endpoint when creating an on-prem server like vLLM.
+# Group of model endpoints - Send balanced requests to each endpoint for batch maximization.  
+MODEL_ENDPOINTS = params["MODEL_ENDPOINTS"]
+
+# Get number of GPUs on this instance
+if torch.cuda.is_available():
+    NUM_GPU = torch.cuda.device_count()
+else:
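+    print("No available GPUs")
+    # Fall back to 1 so the per-GPU metrics computed later stay defined (they then equal the overall numbers)
+    NUM_GPU = 1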
+
+
+# This tokenizer is downloaded from HuggingFace based on the model path you set. Note that Llama 3 uses a different tokenizer than Llama 2
+tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
+
+num_token_input_prompt = len(tokenizer.encode(PROMPT))
+print(f"Number of token for input prompt: {num_token_input_prompt}")
+
+# Azure content safety analysis
+def analyze_prompt(input):
+    start_time = time.time()
+
+    # Obtain credentials
+    key = "" #Add your AZURE_CONTENT_SAFETY_KEY
+    endpoint = "" #Add your AZURE_CONTENT_SAFETY_ENDPOINT
+
+    # Create a content safety client
+    client = ContentSafetyClient(endpoint, AzureKeyCredential(key))
+
+    # Create request
+    request = AnalyzeTextOptions(text=input)
+
+    # Analyze prompt
+    try:
+        response = client.analyze_text(request)
+    except HttpResponseError as e:
+        print("prompt failed due to content safety filtering.")
+        if e.error:
+            print(f"Error code: {e.error.code}")
+            print(f"Error message: {e.error.message}")
+            raise
+        print(e)
+        raise
+
+    analyze_end_time = time.time()
+    # The round trip latency for using Azure content safety check
+    analyze_latency = (analyze_end_time - start_time) * 1000
+
+
+# Simple round-robin to dispatch requests into different containers
+executor_id = 0
+lock = threading.Lock()
+
+def generate_text() -> Tuple[float, int]:
+    headers = MODEL_HEADERS
+    payload = {
+        "model" : MODEL_PATH,
+        "messages" : [
+            {
+                "role": "user",
+                "content": PROMPT
+            }
+        ],
+        "stream" : False,
+        "temperature" : TEMPERATURE,
+        "top_p" : TOP_P,
+        "max_tokens" : MAX_NEW_TOKENS
+    }
+
+    start_time = time.time()
+
+    if(SAFE_CHECK):
+        # Function to send prompts for safety check. Add delays for request round-trip that count towards overall throughput measurement.
+        # Expect NO returns from calling this function. If you want to check the safety check results, print it out within the function itself.
+        analyze_prompt(PROMPT)
+        # Or add delay simulation if you don't want to use the Azure Content Safety check. The API round-trip for this check is around 0.3-0.4 seconds, depending on where you are located. You can use something like this: time.sleep(random.uniform(0.3, 0.4))
+
+    # Acquire lock to dispatch the request
+    lock.acquire()
+    global executor_id
+    if executor_id != len(MODEL_ENDPOINTS)-1:
+        executor_id += 1
+        endpoint_id = executor_id
+    else:
+        executor_id = 0
+        endpoint_id = executor_id
+    lock.release()
+
+    # Send request
+    response = requests.post(MODEL_ENDPOINTS[endpoint_id], headers=headers, json=payload)
+
+    if(SAFE_CHECK):
+        # Function to send prompts for safety check. Add delays for request round-trip that count towards overall throughput measurement.
+        # Expect NO returns from calling this function. If you want to check the safety check results, print it out within the function itself.
+        analyze_prompt(PROMPT)
+        # Or add delay simulation if you don't want to use the Azure Content Safety check. The API round-trip for this check is around 0.3-0.4 seconds, depending on where you are located. You can use something like this: time.sleep(random.uniform(0.3, 0.4))
+
+    end_time = time.time()
+    # Convert to ms
+    latency = (end_time - start_time) * 1000  
+
+    if response.status_code != 200:
+        raise ValueError(f"Error: {response.content}")
+    output = json.loads(response.content)["choices"][0]["message"]["content"]
+
+    token_count = len(tokenizer.encode(output))
+    return latency, token_count
+
+
+def evaluate_performance(concurrent_requests: int) -> Tuple[float, float, float, float, float, float, float, float, int]:
+    latencies = []
+    total_output_tokens = 0
+    output_tokens_per_second_each_request = []
+    start_time = time.time()
+
+    # Init multi-thread execution 
+    with ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
+        future_to_req = {executor.submit(generate_text): i for i in range(concurrent_requests)}
+        for future in as_completed(future_to_req):
+            latency, token_count = future.result()
+            latencies.append(latency)
+            total_output_tokens += token_count
+            # Calculate tokens per second for this request
+            tokens_per_sec = token_count / (latency / 1000)
+            output_tokens_per_second_each_request.append(tokens_per_sec)
+
+    end_time = time.time()
+    total_time = end_time - start_time
+    # RPS (requests per second)
+    rps = concurrent_requests / total_time  
+    # Overall tokens per second
+    output_tokens_per_second_overall = total_output_tokens / total_time  
+    input_tokens_per_second_overall = (num_token_input_prompt * concurrent_requests) / total_time
+    output_tokens_per_second_per_gpu = output_tokens_per_second_overall / NUM_GPU
+    input_tokens_per_second_per_gpu = input_tokens_per_second_overall / NUM_GPU
+    p50_latency = np.percentile(latencies, 50)
+    p99_latency = np.percentile(latencies, 99)
+
+    # Count the number of requests below the token-per-second threshold
+    below_threshold_count = sum(1 for tps in output_tokens_per_second_each_request if tps < THRESHOLD_TPS)
+    output_tokens_per_second_per_request = sum(output_tokens_per_second_each_request)/len(output_tokens_per_second_each_request)
+
+    return p50_latency, p99_latency, rps, output_tokens_per_second_overall, output_tokens_per_second_per_gpu, input_tokens_per_second_overall, input_tokens_per_second_per_gpu, output_tokens_per_second_per_request, below_threshold_count
+
+
+
+# Print markdown
+print("| Number of Concurrent Requests | P50 Latency (ms) | P99 Latency (ms) | RPS | Output Tokens per Second | Output Tokens per Second per GPU | Input Tokens per Second | Input Tokens per Second per GPU |Average Output Tokens per Second per Request | Number of Requests Below Threshold |")
+print("|-------------------------------|------------------|------------------|------------------|-------------------|---------------------------|---------------------|------------------------|-------------------------------------- | ---------------------------------- |")
+
+# Save to file
+csv_file = "performance_metrics.csv"
+with open(csv_file, "w", newline='') as f:
+    writer = csv.writer(f)
+    writer.writerow(["Number of Concurrent Requests", "P50 Latency (ms)", "P99 Latency (ms)", "RPS", "Output Tokens per Second", "Output Tokens per Second per GPU", "Input Tokens per Second", "Input Tokens per Second per GPU", "Average Output Tokens per Second per Request"])
+
+    for level in CONCURRENT_LEVELS:
+        p50_latency, p99_latency, rps, output_tokens_per_second_overall, output_tokens_per_second_per_gpu, input_tokens_per_second_overall, input_tokens_per_second_per_gpu, output_tokens_per_second_per_request, below_threshold_count = evaluate_performance(level)
+        print(f"| {level} | {p50_latency:.2f} | {p99_latency:.2f} | {rps:.2f} | {output_tokens_per_second_overall:.2f} | {output_tokens_per_second_per_gpu:.2f} | {input_tokens_per_second_overall:.2f} | {input_tokens_per_second_per_gpu:.2f} | {output_tokens_per_second_per_request:.2f} | {below_threshold_count:.2f} |")
+        writer.writerow([level, round(p50_latency, 2), round(p99_latency, 2), round(rps, 2), round(output_tokens_per_second_overall, 2), round(output_tokens_per_second_per_gpu, 2), round(input_tokens_per_second_overall, 2), round(input_tokens_per_second_per_gpu, 2), round(output_tokens_per_second_per_request, 2)])
--- a/recipes/benchmarks/inference_throughput/on-prem/vllm/input.jsonl
+++ b/recipes/benchmarks/inference_throughput/on-prem/vllm/input.jsonl
+{
+    "5" : "What is Deep Learning",
+    "25" : "How does Llama 2 improve text generation, offering coherent, relevant, and contextually appropriate content?",
+    "50" : "In the context of the rapid evolution of AI, how does the Llama 2 address issues of ethical concerns, bias reduction, and increased performance to generate text that is not only coherent but also culturally sensitive?",
+    "100" : "As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience?",
+    "500" : "In AI context as a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience?",
+    "1k" : "In the context of the AI evolution, as a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience?",
+    "2k" : "In the context of the evolution of AI, especially in the crazy LLM field, as a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? 
As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience? As a sophisticated large language model, how does Llama 2 balance the intricate dance of producing highly coherent, contextually rich text while navigating ethical considerations, biases, and the imperative need for inclusivity and diversity? Furthermore, how does it ensure that the generated content adheres to global communication standards and respects cultural sensitivities, offering tailored experiences that are both engaging and respectful to a diverse audience?"
+}
\ No newline at end of file
--- a/recipes/benchmarks/inference_throughput/on-prem/vllm/parameters.json
+++ b/recipes/benchmarks/inference_throughput/on-prem/vllm/parameters.json
+{
+    "MAX_NEW_TOKENS" : 256,
+    "CONCURRENT_LEVELS" : [1, 2, 4, 8, 16, 32, 64, 128, 256],
+    "MODEL_PATH" : "meta-llama/Meta-Llama-3-70B-Instruct",
+    "MODEL_HEADERS" : {"Content-Type": "application/json"},
+    "SAFE_CHECK" : true,
+    "THRESHOLD_TPS" : 7,
+    "RANDOM_PROMPT_LENGTH" : 1000,
+    "TEMPERATURE" : 0.6,
+    "TOP_P" : 0.9,
+    "MODEL_ENDPOINTS" : [
+        "http://localhost:8000/v1/chat/completions"
+    ]
+}
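
The benchmark script reads every key of `parameters.json` directly, so a missing or mistyped entry only surfaces as a `KeyError` or a failed request at run time. A minimal sketch of an up-front check follows; `validate_params` and `REQUIRED_KEYS` are hypothetical names that are not part of this commit, and the value ranges checked are assumptions rather than documented limits.

```python
import json

# Hypothetical helper, not part of the commit: the keys the benchmark script reads
# and the JSON types they are expected to have.
REQUIRED_KEYS = {
    "MAX_NEW_TOKENS": int,
    "CONCURRENT_LEVELS": list,
    "MODEL_PATH": str,
    "MODEL_HEADERS": dict,
    "SAFE_CHECK": bool,
    "THRESHOLD_TPS": (int, float),
    "RANDOM_PROMPT_LENGTH": int,
    "TEMPERATURE": (int, float),
    "TOP_P": (int, float),
    "MODEL_ENDPOINTS": list,
}


def validate_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in params:
            problems.append(f"missing key: {key}")
        elif not isinstance(params[key], expected):
            problems.append(f"wrong type for {key}: {type(params[key]).__name__}")
    if not problems:
        # Assumed sanity checks, only run once every key is present and well typed.
        if not params["MODEL_ENDPOINTS"]:
            problems.append("MODEL_ENDPOINTS must list at least one URL")
        if not 0.0 < params["TOP_P"] <= 1.0:
            problems.append("TOP_P should be in (0, 1]")
    return problems


example = json.loads("""
{
    "MAX_NEW_TOKENS": 256,
    "CONCURRENT_LEVELS": [1, 2, 4, 8, 16, 32, 64, 128, 256],
    "MODEL_PATH": "meta-llama/Meta-Llama-3-70B-Instruct",
    "MODEL_HEADERS": {"Content-Type": "application/json"},
    "SAFE_CHECK": true,
    "THRESHOLD_TPS": 7,
    "RANDOM_PROMPT_LENGTH": 1000,
    "TEMPERATURE": 0.6,
    "TOP_P": 0.9,
    "MODEL_ENDPOINTS": ["http://localhost:8000/v1/chat/completions"]
}
""")
print(validate_params(example))
```

Returning a list of problems instead of raising on the first one lets all configuration mistakes be reported in a single pass.
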
--- a/recipes/benchmarks/inference_throughput/on-prem/vllm/pretrained_vllm_benchmark.py
+++ b/recipes/benchmarks/inference_throughput/on-prem/vllm/pretrained_vllm_benchmark.py
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+import csv
+import json
+import time
+import random
+import threading
+import numpy as np
+import requests
+import transformers
+import torch
+
+#imports for Azure content safety
+from azure.ai.contentsafety import ContentSafetyClient
+from azure.core.credentials import AzureKeyCredential
+from azure.core.exceptions import HttpResponseError
+from azure.ai.contentsafety.models import AnalyzeTextOptions
+
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from typing import Dict, Tuple, List
+
+
+# Predefined inputs
+with open('input.jsonl') as input_file:
+    prompt_data = json.load(input_file)
+
+with open('parameters.json') as parameters:
+    params = json.load(parameters)
+
+MAX_NEW_TOKENS = params["MAX_NEW_TOKENS"]
+CONCURRENT_LEVELS = params["CONCURRENT_LEVELS"]
+# Replace with your own deployment
+MODEL_PATH = params["MODEL_PATH"]
+MODEL_HEADERS = params["MODEL_HEADERS"]
+SAFE_CHECK = params["SAFE_CHECK"]
+# Threshold for tokens per second below which we deem the query to be slow
+THRESHOLD_TPS = params["THRESHOLD_TPS"] 
+RANDOM_PROMPT_LENGTH = params["RANDOM_PROMPT_LENGTH"]
+TEMPERATURE = params["TEMPERATURE"]
+TOP_P = params["TOP_P"]
+# Add your model endpoints here and specify the port number. You can obtain the endpoint when creating an on-prem server such as vLLM.
+# Group of model endpoints - Send balanced requests to each endpoint for batch maximization.
+MODEL_ENDPOINTS = params["MODEL_ENDPOINTS"]
+
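+# Get the number of GPUs on this instance
+if torch.cuda.is_available():
+    NUM_GPU = torch.cuda.device_count()
+else:
+    # Fall back to 1 so the per-GPU metrics below stay defined instead of raising NameError
+    NUM_GPU = 1
+    print("No available GPUs; per-GPU metrics will assume NUM_GPU = 1")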
+
+
+# This tokenizer is downloaded from HuggingFace based on the model path you set. Note that Llama 3 uses a different tokenizer than Llama 2.
+tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
+
+# Select vocabulary entries that are longer than 2 characters (closer to real words) and ASCII-only, as a rough proxy for English (not foolproof)
+vocab = [token for token in tokenizer.get_vocab().keys() if len(token) > 2 and all(ord(c) < 128 for c in token)]
+
+def generate_random_prompt(num_tokens):
+    generated_tokens_count = 0
+    selected_tokens = ""
+    while generated_tokens_count < num_tokens:
+        selected_tokens += random.choice(vocab)
+        selected_tokens += " "
+        generated_tokens_count = len(tokenizer.encode(selected_tokens))
+
+    return selected_tokens
+
+PROMPT = generate_random_prompt(RANDOM_PROMPT_LENGTH)
+num_token_input_prompt = len(tokenizer.encode(PROMPT))
+print(f"Number of tokens in the input prompt: {num_token_input_prompt}")
+
+
+# Azure content safety analysis
+def analyze_prompt(text):
+    start_time = time.time()
+
+    # Obtain credentials
+    key = "" #Add your AZURE_CONTENT_SAFETY_KEY
+    endpoint = "" #Add your AZURE_CONTENT_SAFETY_ENDPOINT
+
+    # Create a content safety client
+    client = ContentSafetyClient(endpoint, AzureKeyCredential(key))
+
+    # Create request
+    request = AnalyzeTextOptions(text=text)
+
+    # Analyze prompt
+    try:
+        response = client.analyze_text(request)
+    except HttpResponseError as e:
+        print("Prompt failed due to content safety filtering.")
+        if e.error:
+            print(f"Error code: {e.error.code}")
+            print(f"Error message: {e.error.message}")
+        else:
+            print(e)
+        # Re-raise in both cases so the caller sees the failure
+        raise
+
+    analyze_end_time = time.time()
+    # The round trip latency for using Azure content safety check
+    analyze_latency = (analyze_end_time - start_time) * 1000
+
+
+# Simple round-robin to dispatch requests into different containers
+executor_id = 0
+lock = threading.Lock()
+
+def generate_text() -> Tuple[float, int]:
+    headers = MODEL_HEADERS
+    payload = {
+        "model" : MODEL_PATH,
+        "messages" : [
+            {
+                "role": "user",
+                "content": PROMPT
+            }
+        ],
+        "stream" : False,
+        "temperature" : TEMPERATURE,
+        "top_p" : TOP_P,
+        "max_tokens" : MAX_NEW_TOKENS
+    }
+
+    start_time = time.time()
+
+    if SAFE_CHECK:
+        # Send the prompt for a safety check. The request round trip adds delay that counts towards the overall throughput measurement.
+        # This function returns nothing. If you want to inspect the safety check results, print them inside the function itself.
+        analyze_prompt(PROMPT)
+        # Alternatively, simulate the delay if you don't want to use the Azure Content Safety check. The API round trip for this check takes around 0.3-0.4 seconds, depending on where you are located. You can use something like this: time.sleep(random.uniform(0.3, 0.4))
+
+    global executor_id
+    # Advance the round-robin index under the lock so concurrent threads do not race on it
+    with lock:
+        executor_id = (executor_id + 1) % len(MODEL_ENDPOINTS)
+        endpoint_id = executor_id
+
+    response = requests.post(MODEL_ENDPOINTS[endpoint_id], headers=headers, json=payload)
+
+    if SAFE_CHECK:
+        # Second safety-check round trip, issued after the model has responded. Note that it re-sends PROMPT, not the model output; its delay counts towards the overall throughput measurement.
+        # This function returns nothing. If you want to inspect the safety check results, print them inside the function itself.
+        analyze_prompt(PROMPT)
+        # Alternatively, simulate the delay if you don't want to use the Azure Content Safety check. The API round trip for this check takes around 0.3-0.4 seconds, depending on where you are located. You can use something like this: time.sleep(random.uniform(0.3, 0.4))
+
+    end_time = time.time()
+    # Convert to ms
+    latency = (end_time - start_time) * 1000 
+
+    if response.status_code != 200:
+        raise ValueError(f"Error: {response.content}")
+    output = json.loads(response.content)["choices"][0]["message"]["content"]
+
+    token_count = len(tokenizer.encode(output))
+    return latency, token_count
+
+
+def evaluate_performance(concurrent_requests: int) -> Tuple[float, float, float, float, float, float, float, float, int]:
+    latencies = []
+    total_output_tokens = 0
+    output_tokens_per_second_each_request = []
+    start_time = time.time()
+
+    # Init multi-thread execution 
+    with ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
+        future_to_req = {executor.submit(generate_text): i for i in range(concurrent_requests)}
+        for future in as_completed(future_to_req):
+            latency, token_count = future.result()
+            latencies.append(latency)
+            total_output_tokens += token_count
+            # Calculate tokens per second for this request
+            tokens_per_sec = token_count / (latency / 1000)
+            output_tokens_per_second_each_request.append(tokens_per_sec)
+
+    end_time = time.time()
+    total_time = end_time - start_time
+    # RPS (requests per second)
+    rps = concurrent_requests / total_time  
+    # Overall tokens per second
+    output_tokens_per_second_overall = total_output_tokens / total_time  
+    input_tokens_per_second_overall = (num_token_input_prompt * concurrent_requests) / total_time
+    output_tokens_per_second_per_gpu = output_tokens_per_second_overall / NUM_GPU
+    input_tokens_per_second_per_gpu = input_tokens_per_second_overall / NUM_GPU
+    p50_latency = np.percentile(latencies, 50)
+    p99_latency = np.percentile(latencies, 99)
+
+    # Count the number of requests below the token-per-second threshold
+    below_threshold_count = sum(1 for tps in output_tokens_per_second_each_request if tps < THRESHOLD_TPS)
+    output_tokens_per_second_per_request = sum(output_tokens_per_second_each_request)/len(output_tokens_per_second_each_request)
+
+    return p50_latency, p99_latency, rps, output_tokens_per_second_overall, output_tokens_per_second_per_gpu, input_tokens_per_second_overall, input_tokens_per_second_per_gpu, output_tokens_per_second_per_request, below_threshold_count
+
+
+
+# Print markdown
+print("| Number of Concurrent Requests | P50 Latency (ms) | P99 Latency (ms) | RPS | Output Tokens per Second | Output Tokens per Second per GPU | Input Tokens per Second | Input Tokens per Second per GPU | Average Output Tokens per Second per Request | Number of Requests Below Threshold |")
+print("|-------------------------------|------------------|------------------|------------------|-------------------|---------------------------|---------------------|------------------------|-------------------------------------- | ---------------------------------- |")
+
+# Save to file
+csv_file = "performance_metrics.csv"
+with open(csv_file, "w", newline='') as f:
+    writer = csv.writer(f)
+    writer.writerow(["Number of Concurrent Requests", "P50 Latency (ms)", "P99 Latency (ms)", "RPS", "Output Tokens per Second", "Output Tokens per Second per GPU", "Input Tokens per Second", "Input Tokens per Second per GPU", "Average Output Tokens per Second per Request", "Number of Requests Below Threshold"])
+
+    for level in CONCURRENT_LEVELS:
+        p50_latency, p99_latency, rps, output_tokens_per_second_overall, output_tokens_per_second_per_gpu, input_tokens_per_second_overall, input_tokens_per_second_per_gpu, output_tokens_per_second_per_request, below_threshold_count = evaluate_performance(level)
+        print(f"| {level} | {p50_latency:.2f} | {p99_latency:.2f} | {rps:.2f} | {output_tokens_per_second_overall:.2f} | {output_tokens_per_second_per_gpu:.2f} | {input_tokens_per_second_overall:.2f} | {input_tokens_per_second_per_gpu:.2f} | {output_tokens_per_second_per_request:.2f} | {below_threshold_count} |")
+        writer.writerow([level, round(p50_latency, 2), round(p99_latency, 2), round(rps, 2), round(output_tokens_per_second_overall, 2), round(output_tokens_per_second_per_gpu, 2), round(input_tokens_per_second_overall, 2), round(input_tokens_per_second_per_gpu, 2), round(output_tokens_per_second_per_request, 2), below_threshold_count])
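
The round-robin dispatch in `generate_text` above keeps a global index guarded by a lock. As an illustration only, the same idea can be sketched with `itertools.cycle`; the `RoundRobin` class and the second endpoint URL below are hypothetical and not part of this commit.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class RoundRobin:
    """Thread-safe round-robin picker over a fixed list of endpoints (hypothetical helper)."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(endpoints)
        self._lock = threading.Lock()

    def pick(self):
        # The lock serializes access to the shared iterator across worker threads.
        with self._lock:
            return next(self._cycle)


endpoints = [
    "http://localhost:8000/v1/chat/completions",
    "http://localhost:8001/v1/chat/completions",  # illustrative second endpoint
]
picker = RoundRobin(endpoints)

# Eight concurrent picks over two endpoints should split evenly.
with ThreadPoolExecutor(max_workers=4) as pool:
    picks = list(pool.map(lambda _: picker.pick(), range(8)))

counts = {url: picks.count(url) for url in endpoints}
print(counts)
```

Wrapping the state in a small object avoids the `global` statement and releases the lock even if an exception is raised while it is held.
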