Commit 5eaaba41 authored by Rayyyyy

First add in 0524

# Finetuning Llama
This folder contains instructions to fine-tune Meta Llama 3 on a
* [single-GPU setup](./singlegpu_finetuning.md)
* [multi-GPU setup](./multigpu_finetuning.md)
using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
If you are new to fine-tuning techniques, check out [an overview](./LLM_finetuning_overview.md).
> [!TIP]
> If you want to try finetuning Meta Llama 3 with Hugging Face's trainer, here is a Jupyter notebook with an [example](./huggingface_trainer/peft_finetuning.ipynb).
## How to configure finetuning settings?
> [!TIP]
> All the settings defined in the [config files](../../src/llama_recipes/configs/) can be passed as args through the CLI when running the script; there is no need to change the config files directly (see the programmatic example after this list).
* [Training config file](../../src/llama_recipes/configs/training.py) is the main config file that specifies the settings for our run; it can be found in the [configs folder](../../src/llama_recipes/configs/).
It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
```python
model_name: str="PATH/to/Model"
tokenizer_name: str=None
enable_fsdp: bool=False
low_cpu_fsdp: bool=False
run_validation: bool=True
batch_size_training: int=4
batching_strategy: str="packing" #alternative: padding
context_length: int=4096
gradient_accumulation_steps: int=1
gradient_clipping: bool = False
gradient_clipping_threshold: float = 1.0
num_epochs: int=3
max_train_step: int=0
max_eval_step: int=0
num_workers_dataloader: int=1
lr: float=1e-4
weight_decay: float=0.0
gamma: float= 0.85
seed: int=42
use_fp16: bool=False
mixed_precision: bool=True
val_batch_size: int=1
dataset = "samsum_dataset"
peft_method: str = "lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
use_peft: bool=False
from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
output_dir: str = "PATH/to/save/PEFT/model"
freeze_layers: bool = False
num_freeze_layers: int = 1
quantization: bool = False
one_gpu: bool = False
save_model: bool = True
dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
save_optimizer: bool=False # will be used if using FSDP
use_fast_kernels: bool = False # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and Xformer memory-efficient kernels
use_wandb: bool = False # Enable wandb for experiment tracking
save_metrics: bool = False # saves training metrics to a json file for later plotting
flop_counter: bool = False # Enable flop counter to measure model throughput, cannot be used with pytorch profiler at the same time.
flop_counter_start: int = 3 # The step to start profiling, default is 3, which means after 3 steps of warmup stage, the profiler will start to count flops.
use_profiler: bool = False # Enable pytorch profiler, cannot be used with flop counter at the same time.
profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
```
* [Datasets config file](../../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
* [peft config file](../../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified. We currently support LoRA and Llama-Adapter. Please note that LoRA is the only technique which is supported in combination with FSDP.
* [FSDP config file](../../src/llama_recipes/configs/fsdp.py) provides FSDP settings such as:
    * `mixed_precision` boolean flag to specify using mixed precision, defaults to true.
    * `use_fp16` boolean flag to specify using FP16 for mixed precision, defaults to False. We recommend not setting this flag and only setting `mixed_precision`, which will use `BF16`; this helps with speed and memory savings while avoiding the gradient scaler accuracy challenges that come with `FP16`.
    * `sharding_strategy` specifies the sharding strategy for FSDP; it can be:
        * `FULL_SHARD`, which shards model parameters, gradients and optimizer states and results in the most memory savings.
        * `SHARD_GRAD_OP`, which shards gradients and optimizer states and keeps the parameters after the first `all_gather`. This reduces communication overhead, which is especially beneficial on slower networks and in multi-node setups, at the trade-off of higher memory consumption.
        * `NO_SHARD`, which is equivalent to DDP; it does not shard model parameters, gradients or optimizer states and keeps the full parameters after the first `all_gather`.
        * `HYBRID_SHARD`, available on PyTorch nightlies. It applies FSDP within a node and DDP between nodes. It is meant for multi-node setups and is helpful on slower networks, provided your model fits into one node.
    * `checkpoint_type` specifies the state dict checkpoint type for saving the model. `FULL_STATE_DICT` streams the state_dict of each model shard from the ranks to CPU and assembles the full state_dict on CPU. `SHARDED_STATE_DICT` saves one checkpoint per rank and enables re-loading the model with a different world size.
    * `fsdp_activation_checkpointing` enables activation checkpointing for FSDP. This saves a significant amount of memory at the cost of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase throughput. We recommend using this option.
    * `pure_bf16` moves the model to `BFloat16`; if `optimizer` is set to `anyprecision`, the optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.
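Since the command-line entry point simply wraps `main` with `fire.Fire` (see `finetuning.py` further below), the same settings should also be overridable programmatically by importing `main` and passing keyword arguments, as the CLI flags are forwarded to `main` as kwargs. The values below are placeholders and this is only a minimal sketch, not a recommended configuration; FSDP settings can likewise be overridden with the `fsdp_config.` prefix (e.g. `--fsdp_config.pure_bf16` on the CLI).
```python
from llama_recipes.finetuning import main

# Keyword arguments mirror the config fields listed above; anything not passed
# keeps its default value. Paths are placeholders.
main(
    model_name="PATH/to/Model",
    use_peft=True,
    peft_method="lora",
    quantization=True,
    num_epochs=3,
    batch_size_training=4,
    output_dir="PATH/to/save/PEFT/model",
)
```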
## Weights & Biases Experiment Tracking
You can enable [W&B](https://wandb.ai/) experiment tracking by using the `--use_wandb` flag as below. You can change the project name, entity and other `wandb.init` arguments in `wandb_config`.
```bash
python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
```
You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below.
<div style="display: flex;">
<img src="../../docs/images/wandb_screenshot.png" alt="wandb screenshot" width="500" />
</div>
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we have added support for counting FLOPS during the fine-tuning process. You can enable it by passing `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose the step at which to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` has to be greater than 6. The profiler is helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
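For reference, that wait=1, warmup=2, active=3 schedule corresponds to a `torch.profiler` setup along the following lines. This is a sketch of the schedule itself, not the recipe's internal profiler code, and the output path stands in for whatever you pass via `--profiler_dir`:
```python
import torch

# Sketch of the profiling schedule described above: skip 1 step, warm up for 2,
# then record 3 active steps. Traces are written for TensorBoard.
profiler = torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=2, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("PATH/to/save/profiler/results"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
)
```
With this schedule the profiler needs wait + warmup + active = 6 steps to record a full cycle, which is why `--max_train_step` has to be greater than 6.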
# Datasets and Evaluation Metrics
The provided fine-tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the [`recipes/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py). Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.
## Batching Strategies
Llama-recipes supports two strategies to batch requests together.
The default setting is `packing` which concatenates the tokenized samples into long sequences filling up the context length of the model.
This is the most compute efficient variant as it avoids any padding and all sequences have the same length.
Samples at the boundary of the context length are truncated and the remainder of the cut sequence is used as the start of the next long sequence.
If the amount of training data is small, this procedure might introduce a lot of noise into the training data, which can hurt the prediction performance of the fine-tuned model.
Therefore, we also support a `padding` strategy which does not introduce the additional noise due to truncated sequences.
This strategy tries to minimize the efficiency loss by batching samples of similar length together so only minimal padding is necessary.
The batching strategy can be selected through the command line parameter `--batching_strategy [packing]/[padding]`.
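Conceptually, packing boils down to the following sketch (an illustration only, not the llama-recipes implementation): tokenized samples are concatenated into one stream and sliced into context-length chunks, so no padding tokens are needed.
```python
from itertools import chain

def pack_samples(tokenized_samples, context_length=4096):
    """Concatenate tokenized samples and slice the stream into fixed-length chunks."""
    stream = list(chain.from_iterable(tokenized_samples))
    # A sample crossing a chunk boundary is cut there and its remainder starts
    # the next chunk; the final incomplete chunk is dropped in this simplified sketch.
    return [
        stream[i:i + context_length]
        for i in range(0, len(stream) - context_length + 1, context_length)
    ]
```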
## Using custom datasets
The list of available datasets in llama-recipes is supposed to give users a quick start on training their Llama model.
To use a custom dataset there are two possible ways.
The first is to provide a function returning the dataset in a .py file which can be given to the command line tool.
This does not involve changing the source code of llama-recipes.
The second way is targeted at contributions which extend llama-recipes, as it involves changing the source code.
### Training on custom data
To supply a custom dataset you need to provide a single .py file which contains a function with the following signature:
```python
def get_custom_dataset(dataset_config, tokenizer, split: str):
```
For an example `get_custom_dataset` you can look at the provided datasets in llama_recipes.datasets or [examples/custom_dataset.py](custom_dataset.py).
The `dataset_config` in the above signature will be an instance of llama_recipes.configs.datasets.custom_dataset with the modifications made through the command line.
The split signals whether to return the training or validation dataset.
The default function name is `get_custom_dataset` but this can be changed as described below.
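As a minimal sketch of such a file (it loads the samsum dataset and uses its `dialogue`/`summary` columns purely for illustration; the prompt format and column names are assumptions, not part of llama-recipes):
```python
# custom_dataset_sketch.py -- minimal, illustrative custom dataset file.
import datasets

def get_custom_dataset(dataset_config, tokenizer, split: str):
    # "samsum" and its columns are used here only as an example dataset.
    dataset = datasets.load_dataset("samsum", split=split)

    def tokenize(sample):
        # Build a single training sequence from dialogue + summary.
        text = (
            f"Summarize this dialog:\n{sample['dialogue']}\n---\n"
            f"Summary:\n{sample['summary']}{tokenizer.eos_token}"
        )
        tokens = tokenizer(text)
        # For causal LM fine-tuning the labels mirror the input ids.
        return {
            "input_ids": tokens["input_ids"],
            "attention_mask": tokens["attention_mask"],
            "labels": list(tokens["input_ids"]),
        }

    return dataset.map(tokenize, remove_columns=list(dataset.features))
```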
In order to start a training with the custom dataset we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.
```bash
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
```
To change the function name that is used in the .py you can append the name following a `:` like this:
```bash
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
```
This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.
### Adding new dataset
Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
Additionally, there is a preprocessing function for each dataset in the [datasets](../../../src/llama_recipes/datasets) folder.
The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling ```model(**data)```.
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
To add a custom dataset the following steps need to be performed (a sketch of the pieces follows this list).
1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../../../src/llama_recipes/utils/dataset_utils.py)
4. Set the `dataset` field in the training config to the dataset name, or use the `--dataset` option of the `llama_recipes.finetuning` module or the `examples/finetuning.py` training script.
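The pieces from the steps above could look roughly like the following sketch; all names (`mydata_dataset`, `get_preprocessed_mydata`) are illustrative placeholders, and the files to edit are the ones linked in the steps.
```python
# 1. In configs/datasets.py: a dataclass describing the new dataset.
from dataclasses import dataclass

@dataclass
class mydata_dataset:
    dataset: str = "mydata_dataset"
    train_split: str = "train"
    test_split: str = "validation"

# 2. In the datasets folder: a preprocessing routine returning a PyTorch-style
#    dataset whose items can be consumed as model(**data), i.e. dictionaries
#    with "input_ids", "attention_mask" and "labels".
def get_preprocessed_mydata(dataset_config, tokenizer, split_name):
    ...

# 3. In utils/dataset_utils.py: register name and preprocessing function, e.g.
#    DATASET_PREPROC["mydata_dataset"] = get_preprocessed_mydata
```
After that, passing `--dataset mydata_dataset` selects the new dataset, as described in step 4.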
## Application
Below we list other datasets and their main use cases that can be used for fine-tuning.
### Q&A (these can be used for evaluation as well)
- [MMLU](https://huggingface.co/datasets/lukaemon/mmlu/viewer/astronomy/validation)
- [BoolQ](https://huggingface.co/datasets/boolq)
- [NarrativeQA](https://huggingface.co/datasets/narrativeqa)
- [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (closed-book)
- [NaturalQuestions](https://huggingface.co/datasets/openbookqa) (open-book)
- [QuAC](https://huggingface.co/datasets/quac)
- [HellaSwag](https://huggingface.co/datasets/hellaswag)
- [OpenbookQA](https://huggingface.co/datasets/openbookqa)
- [TruthfulQA](https://huggingface.co/datasets/truthful_qa) (can be helpful for fact checking / misinformation evaluation of the model)
### Instruction finetuning
- [Alpaca](https://huggingface.co/datasets/yahma/alpaca-cleaned) 52k instruction tuning
- [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) 15k instruction tuning
### Simple text generation for quick tests
- [English quotes](https://huggingface.co/datasets/Abirate/english_quotes) 2,508 quotes; multi-label text classification, text generation
### Reasoning (used mostly for evaluation of LLMs)
- [bAbI](https://research.facebook.com/downloads/babi/)
- [Dyck](https://huggingface.co/datasets/dyk)
- [GSM8K](https://huggingface.co/datasets/gsm8k)
- [MATH](https://github.com/hendrycks/math)
- [APPS](https://huggingface.co/datasets/codeparrot/apps)
- [HumanEval](https://huggingface.co/datasets/openai_humaneval)
- [LSAT](https://huggingface.co/datasets/dmayhem93/agieval-lsat-ar)
- [Entity matching](https://huggingface.co/datasets/lighteval/EntityMatching)
### Toxicity evaluation
- [Real_toxic_prompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts)
### Bias evaluation
- [Crows_pair](https://huggingface.co/datasets/crows_pairs) gender bias
- WinoGender gender bias
### Useful Links
More information on evaluation datasets can be found at [HELM](https://crfm.stanford.edu/helm/latest/).
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# For dataset details visit: https://huggingface.co/datasets/samsum
import copy
import datasets
import itertools
B_INST, E_INST = "[INST]", "[/INST]"
def tokenize_dialog(dialog, tokenizer):
    if tokenizer.vocab_size >= 128000:
        dialog_tokens = tokenizer.apply_chat_template(dialog)
        dialog_tokens = dialog_tokens[:-4] # Remove generation prompt <|start_header_id|>assistant<|end_header_id|>\n\n
        eot_indices = [i for i,n in enumerate(dialog_tokens) if n == 128009]
        labels = copy.copy(dialog_tokens)
        last_idx = 0
        for n, idx in enumerate(eot_indices):
            if n % 2 == 1:
                last_idx = idx
            else:
                labels[last_idx:idx+1] = [-100] * (idx-last_idx+1)
        dialog_tokens = [dialog_tokens]
        labels_tokens = [labels]
    else:
        prompt_tokens = [tokenizer.encode(f"{tokenizer.bos_token}{B_INST} {(prompt['content']).strip()} {E_INST}", add_special_tokens=False) for prompt in dialog[::2]]
        answer_tokens = [tokenizer.encode(f"{answer['content'].strip()} {tokenizer.eos_token}", add_special_tokens=False) for answer in dialog[1::2]]
        dialog_tokens = list(itertools.chain.from_iterable(zip(prompt_tokens, answer_tokens)))
        #Add labels, convert prompt token to -100 in order to ignore in loss function
        labels_tokens = [len(c)*[-100,] if i % 2 == 0 else c for i,c in enumerate(dialog_tokens)]

    combined_tokens = {
        "input_ids": list(itertools.chain(*(t for t in dialog_tokens))),
        "labels": list(itertools.chain(*(t for t in labels_tokens))),
    }

    return dict(combined_tokens, attention_mask=[1]*len(combined_tokens["input_ids"]))

def get_custom_dataset(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("OpenAssistant/oasst1", split=split)

    dataset = dataset.map(lambda sample: {
        "message_id": sample["message_id"],
        "parent_id": sample["parent_id"],
        "text": sample["text"],
        },
        batched=True,
        remove_columns=list(dataset.features),)

    nodes = {}
    messages = {}
    root_ids = []

    for data in dataset:
        if data["parent_id"]:
            nodes[data["parent_id"]] = nodes.get(data["parent_id"], []) + [data["message_id"]]
        else:
            root_ids.append(data["message_id"])
        messages[data["message_id"]]=data["text"]

    def follow(thread, current_id):
        thread = copy.copy(thread) + [messages[current_id]]
        if current_id in nodes:
            new_threads = []
            for next_id in nodes[current_id]:
                new_threads += follow(thread, next_id)
            return new_threads
        else:
            return [thread]

    def get_threads_from_root(root_id):
        all_threads = []
        thread = [messages[root_id]]
        for cid in nodes[root_id]:
            all_threads += follow(thread, cid)
        return all_threads

    dataset = dataset.filter(lambda x: x["message_id"] in root_ids)
    dataset = dataset.map(lambda x: {"thread": get_threads_from_root(x["message_id"])}, remove_columns=list(dataset.features))
    dataset = dataset.map(lambda x: {"thread": [i for row in x["thread"] for i in row]}, batched=True)

    def to_dialog(thread):
        dialog = []
        for i, content in enumerate(thread):
            dialog.append({
                "role": "user" if i % 2 == 0 else "assistant",
                "content": content,
            })
        return {"dialog": dialog}

    dataset = dataset.map(lambda x: to_dialog(x["thread"]), remove_columns=list(dataset.features))
    dataset = dataset.map(lambda x: tokenize_dialog(x["dialog"], tokenizer), remove_columns=list(dataset.features))

    return dataset
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import fire
from llama_recipes.finetuning import main
if __name__ == "__main__":
    fire.Fire(main)
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Meta Platforms, Inc. and affiliates.\n",
"This software may be used and distributed according to the terms of the Llama 2 Community License Agreement."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quick Start Notebook\n",
"\n",
"This notebook shows how to train a Llama 2 model on a single GPU (e.g. A10 with 24GB) using int8 quantization and LoRA.\n",
"\n",
"### Step 0: Install pre-requirements and convert checkpoint\n",
"\n",
"The example uses the Hugging Face trainer and model which means that the checkpoint has to be converted from its original format into the dedicated Hugging Face format.\n",
"The conversion can be achieved by running the `convert_llama_weights_to_hf.py` script provided with the transformer package.\n",
"Given that the original checkpoint resides under `models/7B` we can install all requirements and convert the checkpoint with:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# %%bash\n",
"# pip install llama-recipes transformers datasets accelerate sentencepiece protobuf==3.20 py7zr scipy peft bitsandbytes fire torch_tb_profiler ipywidgets\n",
"# TRANSFORM=`python -c \"import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')\"`\n",
"# python ${TRANSFORM} --input_dir models --model_size 7B --output_dir models_hf/7B"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: Load the model\n",
"\n",
"Point model_id to model weight folder"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"===================================BUG REPORT===================================\n",
"Welcome to bitsandbytes. For bug reports, please run\n",
"\n",
"python -m bitsandbytes\n",
"\n",
" and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues\n",
"================================================================================\n",
"bin /data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda112.so\n",
"CUDA SETUP: CUDA runtime path found: /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so\n",
"CUDA SETUP: Highest compute capability among GPUs detected: 8.0\n",
"CUDA SETUP: Detected CUDA version 112\n",
"CUDA SETUP: Loading binary /data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda112.so...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /data/home/mreso/miniconda3/envs/llama did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...\n",
" warn(msg)\n",
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/efa/lib')}\n",
" warn(msg)\n",
"The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.\n",
"Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.09s/it]\n"
]
}
],
"source": [
"import torch\n",
"from transformers import LlamaForCausalLM, LlamaTokenizer\n",
"\n",
"model_id=\"./models_hf/7B\"\n",
"\n",
"tokenizer = LlamaTokenizer.from_pretrained(model_id)\n",
"\n",
"model =LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto', torch_dtype=torch.float16)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Load the preprocessed dataset\n",
"\n",
"We load and preprocess the samsum dataset which consists of curated pairs of dialogs and their summarization:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset samsum (/data/home/mreso/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)\n",
"Loading cached processed dataset at /data/home/mreso/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-b14554a76c1c7ecd.arrow\n",
"Loading cached processed dataset at /data/home/mreso/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-e40e61e15ebeb527.arrow\n",
"Loading cached processed dataset at /data/home/mreso/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-e08ac9e1b792e7ba.arrow\n"
]
}
],
"source": [
"from llama_recipes.utils.dataset_utils import get_preprocessed_dataset\n",
"from llama_recipes.configs.datasets import samsum_dataset\n",
"\n",
"train_dataset = get_preprocessed_dataset(tokenizer, samsum_dataset, 'train')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Check base model\n",
"\n",
"Run the base model on an example input:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Summarize this dialog:\n",
"A: Hi Tom, are you busy tomorrow’s afternoon?\n",
"B: I’m pretty sure I am. What’s up?\n",
"A: Can you go with me to the animal shelter?.\n",
"B: What do you want to do?\n",
"A: I want to get a puppy for my son.\n",
"B: That will make him so happy.\n",
"A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
"B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) \n",
"A: I'll get him one of those little dogs.\n",
"B: One that won't grow up too big;-)\n",
"A: And eat too much;-))\n",
"B: Do you know which one he would like?\n",
"A: Oh, yes, I took him there last Monday. He showed me one that he really liked.\n",
"B: I bet you had to drag him away.\n",
"A: He wanted to take it home right away ;-).\n",
"B: I wonder what he'll name it.\n",
"A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))\n",
"---\n",
"Summary:\n",
"A: Hi Tom, are you busy tomorrow’s afternoon?\n",
"B: I’m pretty sure I am. What’s up?\n",
"A: Can you go with me to the animal shelter?.\n",
"B: What do you want to do?\n",
"A: I want to get a puppy for my son.\n",
"B: That will make him so happy.\n",
"A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
"B\n"
]
}
],
"source": [
"eval_prompt = \"\"\"\n",
"Summarize this dialog:\n",
"A: Hi Tom, are you busy tomorrow’s afternoon?\n",
"B: I’m pretty sure I am. What’s up?\n",
"A: Can you go with me to the animal shelter?.\n",
"B: What do you want to do?\n",
"A: I want to get a puppy for my son.\n",
"B: That will make him so happy.\n",
"A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
"B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) \n",
"A: I'll get him one of those little dogs.\n",
"B: One that won't grow up too big;-)\n",
"A: And eat too much;-))\n",
"B: Do you know which one he would like?\n",
"A: Oh, yes, I took him there last Monday. He showed me one that he really liked.\n",
"B: I bet you had to drag him away.\n",
"A: He wanted to take it home right away ;-).\n",
"B: I wonder what he'll name it.\n",
"A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))\n",
"---\n",
"Summary:\n",
"\"\"\"\n",
"\n",
"model_input = tokenizer(eval_prompt, return_tensors=\"pt\").to(\"cuda\")\n",
"\n",
"model.eval()\n",
"with torch.no_grad():\n",
" print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the base model only repeats the conversation.\n",
"\n",
"### Step 4: Prepare model for PEFT\n",
"\n",
"Let's prepare the model for Parameter Efficient Fine Tuning (PEFT):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199\n"
]
}
],
"source": [
"model.train()\n",
"\n",
"def create_peft_config(model):\n",
" from peft import (\n",
" get_peft_model,\n",
" LoraConfig,\n",
" TaskType,\n",
" prepare_model_for_kbit_training,\n",
" )\n",
"\n",
" peft_config = LoraConfig(\n",
" task_type=TaskType.CAUSAL_LM,\n",
" inference_mode=False,\n",
" r=8,\n",
" lora_alpha=32,\n",
" lora_dropout=0.05,\n",
" target_modules = [\"q_proj\", \"v_proj\"]\n",
" )\n",
"\n",
" # prepare int-8 model for training\n",
" model = prepare_model_for_kbit_training(model)\n",
" model = get_peft_model(model, peft_config)\n",
" model.print_trainable_parameters()\n",
" return model, peft_config\n",
"\n",
"# create peft config\n",
"model, lora_config = create_peft_config(model)\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"### Step 5: Define an optional profiler"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from transformers import TrainerCallback\n",
"from contextlib import nullcontext\n",
"enable_profiler = False\n",
"output_dir = \"tmp/llama-output\"\n",
"\n",
"config = {\n",
" 'lora_config': lora_config,\n",
" 'learning_rate': 1e-4,\n",
" 'num_train_epochs': 1,\n",
" 'gradient_accumulation_steps': 2,\n",
" 'per_device_train_batch_size': 2,\n",
" 'gradient_checkpointing': False,\n",
"}\n",
"\n",
"# Set up profiler\n",
"if enable_profiler:\n",
" wait, warmup, active, repeat = 1, 1, 2, 1\n",
" total_steps = (wait + warmup + active) * (1 + repeat)\n",
" schedule = torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=repeat)\n",
" profiler = torch.profiler.profile(\n",
" schedule=schedule,\n",
" on_trace_ready=torch.profiler.tensorboard_trace_handler(f\"{output_dir}/logs/tensorboard\"),\n",
" record_shapes=True,\n",
" profile_memory=True,\n",
" with_stack=True)\n",
" \n",
" class ProfilerCallback(TrainerCallback):\n",
" def __init__(self, profiler):\n",
" self.profiler = profiler\n",
" \n",
" def on_step_end(self, *args, **kwargs):\n",
" self.profiler.step()\n",
"\n",
" profiler_callback = ProfilerCallback(profiler)\n",
"else:\n",
" profiler = nullcontext()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 6: Fine tune the model\n",
"\n",
"Here, we fine tune the model for a single epoch which takes a bit more than an hour on a A100."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...\n",
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:321: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization\n",
" warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n",
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:321: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization\n",
" warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n"
]
},
{
"data": {
"text/html": [
"\n",
" <div>\n",
" \n",
" <progress value='389' max='389' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
" [389/389 1:12:06, Epoch 1/1]\n",
" </div>\n",
" <table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>Step</th>\n",
" <th>Training Loss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>10</td>\n",
" <td>1.965000</td>\n",
" </tr>\n",
" <tr>\n",
" <td>20</td>\n",
" <td>1.845600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>30</td>\n",
" <td>1.801100</td>\n",
" </tr>\n",
" <tr>\n",
" <td>40</td>\n",
" <td>1.780900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>50</td>\n",
" <td>1.715400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>60</td>\n",
" <td>1.697800</td>\n",
" </tr>\n",
" <tr>\n",
" <td>70</td>\n",
" <td>1.707600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>80</td>\n",
" <td>1.713300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>90</td>\n",
" <td>1.663900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>100</td>\n",
" <td>1.702700</td>\n",
" </tr>\n",
" <tr>\n",
" <td>110</td>\n",
" <td>1.658800</td>\n",
" </tr>\n",
" <tr>\n",
" <td>120</td>\n",
" <td>1.692400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>130</td>\n",
" <td>1.644900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>140</td>\n",
" <td>1.687900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>150</td>\n",
" <td>1.686600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>160</td>\n",
" <td>1.649600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>170</td>\n",
" <td>1.666900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>180</td>\n",
" <td>1.709200</td>\n",
" </tr>\n",
" <tr>\n",
" <td>190</td>\n",
" <td>1.670400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>200</td>\n",
" <td>1.662700</td>\n",
" </tr>\n",
" <tr>\n",
" <td>210</td>\n",
" <td>1.681300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>220</td>\n",
" <td>1.685500</td>\n",
" </tr>\n",
" <tr>\n",
" <td>230</td>\n",
" <td>1.663400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>240</td>\n",
" <td>1.638300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>250</td>\n",
" <td>1.627400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>260</td>\n",
" <td>1.654300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>270</td>\n",
" <td>1.640900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>280</td>\n",
" <td>1.674700</td>\n",
" </tr>\n",
" <tr>\n",
" <td>290</td>\n",
" <td>1.657300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>300</td>\n",
" <td>1.660200</td>\n",
" </tr>\n",
" <tr>\n",
" <td>310</td>\n",
" <td>1.666600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>320</td>\n",
" <td>1.674500</td>\n",
" </tr>\n",
" <tr>\n",
" <td>330</td>\n",
" <td>1.656200</td>\n",
" </tr>\n",
" <tr>\n",
" <td>340</td>\n",
" <td>1.684300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>350</td>\n",
" <td>1.667900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>360</td>\n",
" <td>1.661400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>370</td>\n",
" <td>1.676800</td>\n",
" </tr>\n",
" <tr>\n",
" <td>380</td>\n",
" <td>1.628100</td>\n",
" </tr>\n",
" </tbody>\n",
"</table><p>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from transformers import default_data_collator, Trainer, TrainingArguments\n",
"\n",
"\n",
"\n",
"# Define training args\n",
"training_args = TrainingArguments(\n",
" output_dir=output_dir,\n",
" overwrite_output_dir=True,\n",
" bf16=True, # Use BF16 if available\n",
" # logging strategies\n",
" logging_dir=f\"{output_dir}/logs\",\n",
" logging_strategy=\"steps\",\n",
" logging_steps=10,\n",
" save_strategy=\"no\",\n",
" optim=\"adamw_torch_fused\",\n",
" max_steps=total_steps if enable_profiler else -1,\n",
" **{k:v for k,v in config.items() if k != 'lora_config'}\n",
")\n",
"\n",
"with profiler:\n",
" # Create Trainer instance\n",
" trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=train_dataset,\n",
" data_collator=default_data_collator,\n",
" callbacks=[profiler_callback] if enable_profiler else [],\n",
" )\n",
"\n",
" # Start training\n",
" trainer.train()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 7:\n",
"Save model checkpoint"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"model.save_pretrained(output_dir)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 8:\n",
"Try the fine tuned model on the same example again to see the learning progress:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Summarize this dialog:\n",
"A: Hi Tom, are you busy tomorrow’s afternoon?\n",
"B: I’m pretty sure I am. What’s up?\n",
"A: Can you go with me to the animal shelter?.\n",
"B: What do you want to do?\n",
"A: I want to get a puppy for my son.\n",
"B: That will make him so happy.\n",
"A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
"B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) \n",
"A: I'll get him one of those little dogs.\n",
"B: One that won't grow up too big;-)\n",
"A: And eat too much;-))\n",
"B: Do you know which one he would like?\n",
"A: Oh, yes, I took him there last Monday. He showed me one that he really liked.\n",
"B: I bet you had to drag him away.\n",
"A: He wanted to take it home right away ;-).\n",
"B: I wonder what he'll name it.\n",
"A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))\n",
"---\n",
"Summary:\n",
"A wants to get a puppy for his son. He took him to the animal shelter last Monday. He showed him one that he really liked. A will name it after his dead hamster - Lemmy.\n"
]
}
],
"source": [
"model.eval()\n",
"with torch.no_grad():\n",
" print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the GNU General Public License version 3.
#!/bin/bash
#SBATCH --job-name=Nano-2d-trainer-20b-8nodes
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --gpus-per-task=4
#SBATCH --partition=train
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Enable for A100
export FI_PROVIDER="efa"
echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# debugging flags (optional)
export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=0
# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="ens"
export FI_EFA_USE_DEVICE_RDMA=1
srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora
# Fine-tuning with Multi GPU
This recipe steps you through how to finetune a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single node or across multiple nodes.
## Requirements
Ensure that you have installed the llama-recipes package ([details](../../README.md#installing)).
We will also need 2 packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).
> [!NOTE]
> The llama-recipes package will install PyTorch 2.0.1. In case you want to use FSDP with PEFT for multi GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies))
>
> INT8 quantization is not currently supported in FSDP
## How to run it
Get access to a machine with multiple GPUs (in this case we tested with 4 A100 and A10s).
### With FSDP + PEFT
<details open>
<summary>Single-node Multi-GPU</summary>
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
</details>
<details>
<summary>Multi-node Multi-GPU</summary>
Here we use a slurm script to schedule a job over multiple nodes.
# Change the number of nodes and GPUs per node in the script before running.
sbatch ./multi_node.slurm
</details>
We use `torchrun` to spawn multiple processes for FSDP.
The args used in the command above are:
* `--enable_fsdp` boolean flag to enable FSDP in the script
* `--use_peft` boolean flag to enable PEFT methods in the script
* `--peft_method` to specify the PEFT method, here we use `lora`; other options are `llama_adapter` and `prefix`.
### With only FSDP
If interested in running full parameter finetuning without making use of PEFT methods, please use the following command. Make sure to change the `nproc_per_node` to your available GPUs. This has been tested with `BF16` on 8xA100, 40GB GPUs.
```bash
torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
```
### Using less CPU memory (FSDP on 70B model)
If you are running full parameter fine-tuning on the 70B model, you can enable the `low_cpu_fsdp` mode with the following command. This option loads the model on rank 0 only before moving it to the devices to construct FSDP. This can dramatically save CPU memory when loading large models like the 70B (on an 8-GPU node, this reduces CPU memory from 2+ TB to 280 GB for the 70B model). This has been tested with `BF16` on 16x A100, 80GB GPUs.
```bash
torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
```
## Running with different datasets
Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `alpaca_dataset` : to get this open source data please download the `alpaca.json` to the `dataset` folder.
```bash
wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
* `samsum_dataset`
To run with each of the datasets set the `dataset` flag in the command as shown below:
```bash
# grammar_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# alpaca_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# samsum_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
```
## [TIP] Slow interconnect between nodes?
In case you are dealing with a slower interconnect network between nodes, you can make use of the `--hsdp` flag to reduce the communication overhead.
HSDP (Hybrid Sharded Data Parallel) defines a hybrid sharding strategy where you have FSDP within a `sharding_group_size`, which can be the minimum number of GPUs your model fits into, and DDP between the replicas of the model specified by `replica_group_size`.
This requires setting the sharding strategy in the [fsdp config](../../src/llama_recipes/configs/fsdp.py) to `ShardingStrategy.HYBRID_SHARD` and specifying two additional settings, `sharding_group_size` and `replica_group_size`: the former is the sharding group size, i.e. the number of GPUs your model fits into to form one replica of the model, and the latter is the replica group size, which is world_size/sharding_group_size.
```bash
torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n
```
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we have added support for counting FLOPS during the fine-tuning process. You can enable it by passing `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose the step at which to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` has to be greater than 6. The profiler is helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
# Fine-tuning with Single GPU
This recipe steps you through how to finetune a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on a single GPU.
These are the instructions for using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
## Requirements
Ensure that you have installed the llama-recipes package ([details](../../README.md#installing)).
To run fine-tuning on a single GPU, we will make use of two packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) for int8 quantization.
## How to run it?
```bash
python finetuning.py --use_peft --peft_method lora --quantization --use_fp16 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
```
The args used in the command above are:
* `--use_peft` boolean flag to enable PEFT methods in the script
* `--peft_method` to specify the PEFT method, here we use `lora`; other options are `llama_adapter` and `prefix`.
* `--quantization` boolean flag to enable int8 quantization
> [!NOTE]
> In case you are using a multi-GPU machine please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id`.
### How to run with different datasets?
Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `alpaca_dataset` : to get this open source data please download the `alpaca.json` to `dataset` folder.
```bash
wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
* `samsum_dataset`
To run with each of the datasets set the `dataset` flag in the command as shown below:
```bash
# grammar_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
# alpaca_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
# samsum_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
```
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we have added support for counting FLOPS during the fine-tuning process. You can enable it by passing `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose the step at which to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` has to be greater than 6. The profiler is helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
# Local Inference
For local inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training, the [inference script](inference.py) takes different arguments.
If all model parameters were fine-tuned, the output dir of the training has to be given as the `--model_name` argument.
In the case of a parameter-efficient method like LoRA, the base model has to be given as `--model_name` and the output dir of the training as the `--peft_model` argument.
Additionally, a prompt for the model has to be provided as a text file. The prompt file can either be piped through standard input or given as the `--prompt_file` parameter.
**Content Safety**
The inference script also supports safety checks for both user prompt and model outputs. In particular, we use two packages, [AuditNLG](https://github.com/salesforce/AuditNLG/tree/main) and [Azure content safety](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/).
**Note**
If using Azure content Safety, please make sure to get the endpoint and API key as described [here](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/) and add them as the following environment variables,`CONTENT_SAFETY_ENDPOINT` and `CONTENT_SAFETY_KEY`.
Examples:
```bash
# Full finetuning of all parameters
cat <test_prompt_file> | python inference.py --model_name <training_config.output_dir> --use_auditnlg
# PEFT method
cat <test_prompt_file> | python inference.py --model_name <training_config.model_name> --peft_model <training_config.output_dir> --use_auditnlg
# prompt as parameter
python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg
```
The folder contains test prompts for summarization use-case:
```
samsum_prompt.txt
...
```
**Note**
Currently, the pad token in the [Hugging Face Tokenizer is `None` by default](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). We add the padding token as a special token to the tokenizer, which in this case requires resizing the token_embeddings as shown below:
```python
tokenizer.add_special_tokens(
{
"pad_token": "<PAD>",
}
)
model.resize_token_embeddings(model.config.vocab_size + 1)
```
Padding would be required for batch inference. In this [example](inference.py), batch size = 1, so padding is essentially not required. However, we added the code pointer as an example in case of batch inference.
## Chat completion
The inference folder also includes a chat completion example, which adds built-in safety features for fine-tuned models to the prompt tokens. To run the example:
```bash
python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json --quantization --use_auditnlg
```
## Flash Attention and Xformer Memory Efficient Kernels
Setting `use_fast_kernels` will enable the use of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This speeds up inference when used for batched inputs. This has been enabled in the `optimum` library from Hugging Face as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
```bash
python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json --quantization --use_auditnlg --use_fast_kernels
python inference.py --model_name <training_config.output_dir> --peft_model <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg --use_fast_kernels
```
## Loading back FSDP checkpoints
In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../../../src/llama_recipes/configs/fsdp.py), you can use this converter script to convert the FSDP Sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above.
**To convert the checkpoint use the following command**:
This is helpful if you have fine-tuned your model using FSDP only, as follows:
```bash
torchrun --nnodes 1 --nproc_per_node 8 recipes/finetuning/finetuning.py --enable_fsdp --model_name /path_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16
```
Then convert your FSDP checkpoint to HuggingFace checkpoints using:
```bash
python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path PATH/to/FSDP/Checkpoints --consolidated_model_path PATH/to/save/checkpoints --HF_model_path_or_name PATH/or/HF/model_name
# --HF_model_path_or_name specifies the HF Llama model name or path where it has config.json and tokenizer.json
```
By default, training parameters are saved in `train_params.yaml` in the path where the FSDP checkpoints are saved. In the converter script we first try to find the Hugging Face model name used in the fine-tuning, to load the model with its configs from there; if it is not found, the user needs to provide it.
Then run inference using:
```bash
python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file>
```
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import fire
import os
import sys
import torch
from transformers import AutoTokenizer
from llama_recipes.inference.chat_utils import read_dialogs_from_file
from llama_recipes.inference.model_utils import load_model, load_peft_model
from llama_recipes.inference.safety_utils import get_safety_checker
from accelerate.utils import is_xpu_available
def main(
model_name,
peft_model: str=None,
quantization: bool=False,
max_new_tokens =256, #The maximum numbers of tokens to generate
min_new_tokens:int=0, #The minimum numbers of tokens to generate
prompt_file: str=None,
seed: int=42, #seed value for reproducibility
safety_score_threshold: float=0.5,
do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
use_cache: bool=True, #[optional] Whether or not the model should use the past last key/values attentions Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
enable_saleforce_content_safety: bool=True, # Enable safety check woth Saleforce safety flan t5
use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, make use Flash Attention and Xformer memory-efficient kernels
enable_llamaguard_content_safety: bool = False,
**kwargs
):
if prompt_file is not None:
assert os.path.exists(
prompt_file
), f"Provided Prompt file does not exist {prompt_file}"
dialogs= read_dialogs_from_file(prompt_file)
elif not sys.stdin.isatty():
        dialogs = json.loads(sys.stdin.read())  # stdin is expected to contain the same JSON dialog format as the prompt file
else:
print("No user prompt provided. Exiting.")
sys.exit(1)
print(f"User dialogs:\n{dialogs}")
print("\n==================================\n")
# Set the seeds for reproducibility
if is_xpu_available():
torch.xpu.manual_seed(seed)
else:
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
model = load_model(model_name, quantization, use_fast_kernels)
if peft_model:
model = load_peft_model(model, peft_model)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens(
{
"pad_token": "<PAD>",
}
)
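    # Apply the model's chat template to each dialog and tokenize it into input ids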
chats = tokenizer.apply_chat_template(dialogs)
with torch.no_grad():
for idx, chat in enumerate(chats):
safety_checker = get_safety_checker(enable_azure_content_safety,
enable_sensitive_topics,
enable_saleforce_content_safety,
enable_llamaguard_content_safety,
)
# Safety check of the user prompt
safety_results = [check(dialogs[idx][0]["content"]) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print(f"User prompt deemed safe.")
print("User prompt:\n", dialogs[idx][0]["content"])
print("\n==================================\n")
else:
print("User prompt deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
                print("Skipping the inference as the prompt is not safe.")
sys.exit(1) # Exit the program with an error status
tokens= torch.tensor(chat).long()
tokens= tokens.unsqueeze(0)
if is_xpu_available():
tokens= tokens.to("xpu:0")
else:
tokens= tokens.to("cuda:0")
outputs = model.generate(
input_ids=tokens,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
use_cache=use_cache,
top_k=top_k,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
**kwargs
)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Safety check of the model output
safety_results = [check(output_text) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print("User input and model output deemed safe.")
print(f"Model output:\n{output_text}")
print("\n==================================\n")
else:
print("Model output deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
if __name__ == "__main__":
fire.Fire(main)
[
[{"role": "user", "content": "what is the recipe of mayonnaise?"}],
[
{"role": "user", "content": "I am going to Paris, what should I see?"},
{
"role": "assistant",
"content": "Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city. 2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa. 3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."
},
{"role": "user", "content": "What is so great about #1?"}
],
[
{"role": "system", "content": "Always answer with Haiku"},
{"role": "user", "content": "I am going to Paris, what should I see?"}
],
[
{
"role": "system",
"content": "Always answer with emojis"
},
{"role": "user", "content": "How to go from Beijing to NY?"}
],
[
{
"role": "system",
"content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
},
{"role": "user", "content": "Write a brief birthday message to John"}
]
]
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import fire
import os
import sys
import time
import gradio as gr
import torch
from transformers import AutoTokenizer
from llama_recipes.inference.safety_utils import get_safety_checker, AgentType
from llama_recipes.inference.model_utils import load_model, load_peft_model
from accelerate.utils import is_xpu_available
def main(
model_name,
peft_model: str=None,
quantization: bool=False,
max_new_tokens =100, #The maximum numbers of tokens to generate
prompt_file: str=None,
seed: int=42, #seed value for reproducibility
do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
    use_cache: bool=True, #[optional] Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.
top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
enable_llamaguard_content_safety: bool=False,
max_padding_length: int=None, # the max padding length to be used with tokenizer padding the prompts.
    use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
**kwargs
):
def inference(user_prompt, temperature, top_p, top_k, max_new_tokens, **kwargs,):
safety_checker = get_safety_checker(enable_azure_content_safety,
enable_sensitive_topics,
enable_salesforce_content_safety,
enable_llamaguard_content_safety
)
# Safety check of the user prompt
safety_results = [check(user_prompt) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print("User prompt deemed safe.")
print(f"User prompt:\n{user_prompt}")
else:
print("User prompt deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
print("Skipping the inference as the prompt is not safe.")
sys.exit(1) # Exit the program with an error status
# Set the seeds for reproducibility
if is_xpu_available():
torch.xpu.manual_seed(seed)
else:
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
model = load_model(model_name, quantization, use_fast_kernels)
if peft_model:
model = load_peft_model(model, peft_model)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
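        # Tokenize the prompt with padding/truncation, then move the tensors to the available accelerator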
batch = tokenizer(user_prompt, padding='max_length', truncation=True, max_length=max_padding_length, return_tensors="pt")
if is_xpu_available():
batch = {k: v.to("xpu") for k, v in batch.items()}
else:
batch = {k: v.to("cuda") for k, v in batch.items()}
start = time.perf_counter()
with torch.no_grad():
outputs = model.generate(
**batch,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
min_length=min_length,
use_cache=use_cache,
top_k=top_k,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
**kwargs
)
e2e_inference_time = (time.perf_counter()-start)*1000
print(f"the inference time is {e2e_inference_time} ms")
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Safety check of the model output
safety_results = [check(output_text, agent_type=AgentType.AGENT, user_prompt=user_prompt) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print("User input and model output deemed safe.")
print(f"Model output:\n{output_text}")
else:
print("Model output deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
return output_text
if prompt_file is not None:
assert os.path.exists(
prompt_file
), f"Provided Prompt file does not exist {prompt_file}"
with open(prompt_file, "r") as f:
user_prompt = "\n".join(f.readlines())
inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
elif not sys.stdin.isatty():
user_prompt = "\n".join(sys.stdin.readlines())
inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
else:
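            # No prompt file or piped input: launch an interactive Gradio UI instead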
gr.Interface(
fn=inference,
inputs=[
gr.components.Textbox(
lines=9,
label="User Prompt",
placeholder="none",
),
gr.components.Slider(
minimum=0, maximum=1, value=1.0, label="Temperature"
),
gr.components.Slider(
minimum=0, maximum=1, value=1.0, label="Top p"
),
gr.components.Slider(
minimum=0, maximum=100, step=1, value=50, label="Top k"
),
gr.components.Slider(
minimum=1, maximum=2000, step=1, value=200, label="Max tokens"
),
],
outputs=[
gr.components.Textbox(
lines=5,
label="Output",
)
],
title="Meta Llama3 Playground",
description="https://github.com/facebookresearch/llama-recipes",
).queue().launch(server_name="0.0.0.0", share=True)
if __name__ == "__main__":
fire.Fire(main)
Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-)
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))
---
Summary:
# Running Llama3 8B Instruct on Android with MLC-LLM
Author: Thierry Moreau - tmoreau@octo.ai
# Overview
In this tutorial we'll learn how to deploy Llama3 8B Instruct on an Android-based phone using MLC-LLM.
Machine Learning Compilation for Large Language Models (MLC LLM) is a high-performance universal deployment solution that allows native deployment of any large language models with native APIs with compiler acceleration. The mission of this project is to enable everyone to develop, optimize and deploy AI models natively on everyone's devices with ML compilation techniques.
You can read more about MLC-LLM at the following [link](https://github.com/mlc-ai/mlc-llm).
MLC-LLM is also what powers the Llama3 inference APIs provided by [OctoAI](https://octo.ai/). You can use OctoAI for your Llama3 cloud-based inference needs by trying out the examples under the [following path](../../../llama_api_providers/OctoAI_API_examples/).
This tutorial was tested with the following setup:
* MacBook Pro 16 inch from 2021 with Apple M1 Max and 32GB of RAM running Sonoma 14.3.1
* OnePlus 12 Android Smartphone with a Snapdragon 8Gen3 SoC and 12GB of RAM, running OxygenOS 14.0
Running Llama3 on a phone will likely require a powerful chipset. We haven't extensively tested the range of chipsets that will support this use case. Feel free to update this README.md to specify what devices were successfully tested.
| Phone | Chipset | RAM | Status | Comments |
|------------|------------------|------|---------|----------|
| OnePlus 12 | Snapdragon 8Gen3 | 12GB | Success | None |
| | | | | |
This guide is heavily based on the [MLC Android Guide](https://llm.mlc.ai/docs/deploy/android.html), but several steps have been taken to streamline the instructions.
# Pre-requisites
## Python
Whether you're using conda or virtual env to manage your environment, we highly recommend starting from scratch with a clean new environment.
For instance with virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
```
Next you'll need to install the following packages:
```bash
python3 -m pip install -r requirements.txt
```
## Rust
[Rust](https://www.rust-lang.org/tools/install) is needed to cross-compile HuggingFace tokenizers to Android.
Make sure rustc, cargo, and rustup are available in $PATH.
## Android Studio
Install Android Studio from <!-- markdown-link-check-disable -->https://developer.android.com/studio<!-- markdown-link-check-enable --> with NDK and CMake.
To install NDK and CMake, in the Android Studio welcome page, click “Projects → SDK Manager → SDK Tools”. Set up the following environment variables:
* ANDROID_NDK so that $ANDROID_NDK/build/cmake/android.toolchain.cmake is available.
* TVM_NDK_CC that points to NDK's clang compiler.
For instance, the paths will look like the following on OSX for user `moreau`:
```bash
# Android + TVM setup
export ANDROID_NDK="/Users/moreau/Library/Android/sdk/ndk/26.1.10909125"
export TVM_NDK_CC="$ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang"
```
This tutorial was tested successfully on Android Studio Hedgehog | 2023.1.1 Patch 1.
## JDK
A JDK, such as OpenJDK >= 17, is required to compile the Java bindings of the TVM Unity runtime.
We strongly recommend setting JAVA_HOME to the JDK bundled with Android Studio. Using Android Studio's JBR bundle as recommended (<!-- markdown-link-check-disable -->https://developer.android.com/build/jdks<!-- markdown-link-check-enable -->) will reduce the chances of potential errors in JNI compilation.
For instance on macOS, you'll need to point JAVA_HOME to the following.
```bash
export JAVA_HOME=/Applications/Android\ Studio.app/Contents/jbr/Contents/Home
```
To make sure the java binary can be found, run `ls $JAVA_HOME/bin/java`.
## MLC-LLM
Let's clone mlc-llm from its repo in the directory of your choice:
```bash
cd /path/to/where/to/clone/repo
git clone https://github.com/mlc-ai/mlc-llm --recursive
export MLC_LLM_HOME=/path/to/mlc-llm
```
At the time of writing this README, we tested `mlc-llm` at the following sha: `21feb7010db02e0c2149489f5972d6a8a796b5a0`.
## Phone Setup
On your phone, enable USB debugging in the developer settings. Each phone manufacturer has its own approach to enabling debug mode, so a quick Google search should give you the steps for your device.
In addition, make sure to change your USB configuration from "Charging" to "MTP (Media Transfer Protocol)". This will allow us to connect to the device serially.
Connect your phone to your development machine. On OSX, you'll be prompted on the dev machine whether you want to allow the accessory to connect. Hit "Allow".
# Build Steps
## Building the Android Package with MLC
First, edit the file under `android/MLCChat/mlc-package-config.json`, replacing its contents with the [mlc-package-config.json](./mlc-package-config.json) provided in llama-recipes.
To understand what these JSON fields mean you can refer to this [documentation](https://llm.mlc.ai/docs/deploy/android.html#step-2-build-runtime-and-model-libraries).
From the `mlc-llm` project root directory:
```bash
cd $MLC_LLM_HOME
cd android/MLCChat
python3 -m mlc_llm package --package-config mlc-package-config.json --output dist
```
The command above will take a few minutes as it runs through the following steps:
* Compile the Llama 3 8B Instruct model specified in `mlc-package-config.json` into a binary model library.
* Build the `mlc-llm` runtime and tokenizer. In addition to the model itself, a lightweight runtime and tokenizer are required to actually run the LLM.
## Building and Running MLC Chat in Android Studio
Now let's launch Android Studio.
* On the "Welcome to Android Studio" page, hit "Open", and navigate to `$MLC_LLM_HOME/android/MLCChat`, then hit "Open"
* A window will pop up asking whether to "Trust and Open project 'MLCChat'" - hit "Trust Project"
* The project will now launch
* Under File -> Project Structure... -> Project, change the Gradle Version (the second drop-down from the top) to 8.5
Connect your phone to your development machine - assuming you've followed the setup steps in the pre-requisite section, you should be able to see the device.
Next you'll need to:
* Hit Build -> Make Project.
* Hit Run -> Run 'app'
The MLCChat app will launch on your phone. Now, on the phone:
* Under Model List you'll see the `Llama-3-8B-Instruct` LLM listed.
* The model's not quite ready to launch yet, because the weights need to be downloaded over Wifi first. Hit the Download button to the right of the model name to download the weights from HuggingFace.
Note that you can change the build settings to bundle the weights with the MLCChat app so you don't have to download the weights over wifi. To do so you can follow the instructions [here](https://llm.mlc.ai/docs/deploy/android.html#bundle-model-weights).
Once the model weights are downloaded, you can interact with Llama 3 locally on your Android phone!
{
"device": "android",
"model_list": [
{
"model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
"estimated_vram_bytes": 4348727787,
"model_id": "Llama-3-8B-Instruct",
"overrides": {
"context_window_size": 768,
"prefill_chunk_size": 256
}
}
]
}
--pre
--find-links https://mlc.ai/wheels
mlc-llm-nightly
mlc-ai-nightly
attrs
decorator
numpy
psutil
pydantic
requests
scipy
setuptools
torch
tqdm
## [Running Llama 3 On-Prem with vLLM and TGI](llama-on-prem.md)
This tutorial shows how to use Llama 3 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference) to build Llama 3 on-prem apps.
# Serving a fine-tuned Llama model with HuggingFace text-generation-inference server
This document shows how to serve a fine-tuned Llama model with HuggingFace's text-generation-inference server. This option is currently only available for models that were trained using the LoRA method or without using the `--use_peft` argument.
## Step 0: Merging the weights (Only required if LoRA method was used)
If the model was fine-tuned with the LoRA method, we need to merge the weights of the base model with the adapter weights. For this we can use the script `merge_lora_weights.py`, which is located in the same folder as this README file.
The script takes the base model, the PEFT weight folder, and an output directory as arguments:
```
python -m llama_recipes.inference.hf_text_generation_inference.merge_lora_weights --base_model llama-7B --peft_model ft_output --output_dir data/merged_model_output
```
## Step 1: Serving the model
The model can then be served using the docker container provided by [hf text-generation-inference](https://github.com/huggingface/text-generation-inference), started from the main directory of this repository:
```bash
model=/data/merged_model_output
num_shard=2
volume=$PWD/inference/hf-text-generation-inference/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard
```
The `num_shard` argument determines the number of GPUs the model is sharded across.
## Step 2: Running inference
After the model shards have finished loading, inference can be run using one of the following commands:
```bash
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'
# OR for streaming inference
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'
```
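If you prefer to query the server from Python rather than curl, here is a minimal sketch using the `requests` package, assuming the server from Step 1 is reachable on 127.0.0.1:8080 and returns the standard `generated_text` field:
```python
# Minimal sketch: query the text-generation-inference server from Python.
import requests

response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 17}},
    headers={"Content-Type": "application/json"},
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```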
Further information can be found in the documentation of the [hf text-generation-inference](https://github.com/huggingface/text-generation-inference) solution.
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import fire
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer
def main(base_model: str,
peft_model: str,
output_dir: str):
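    # Load the base model in fp16; device_map="auto" spreads the weights across available devices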
model = LlamaForCausalLM.from_pretrained(
base_model,
load_in_8bit=False,
torch_dtype=torch.float16,
device_map="auto",
offload_folder="tmp",
)
tokenizer = LlamaTokenizer.from_pretrained(
base_model
)
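    # Load the LoRA adapter weights on top of the base model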
model = PeftModel.from_pretrained(
model,
peft_model,
torch_dtype=torch.float16,
device_map="auto",
offload_folder="tmp",
)
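    # Merge the adapter into the base model weights and save a standalone HF checkpoint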
model = model.merge_and_unload()
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
if __name__ == "__main__":
fire.Fire(main)
# Llama 3 On-Prem Inference Using vLLM and TGI
Enterprise customers may prefer to deploy Llama 3 on-prem and run Llama in their own servers. This tutorial shows how to use Llama 3 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference), two leading open-source tools to deploy and serve LLMs, and how to create vLLM and TGI hosted Llama 3 instances with [LangChain](https://www.langchain.com/), an open-source LLM app development framework which we used for our other demo apps: [Getting to Know Llama](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Getting_to_know_Llama.ipynb), Running Llama 3 <!-- markdown-link-check-disable -->[locally](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb) <!-- markdown-link-check-disable --> and [in the cloud](https://github.com/meta-llama/llama-recipes/blob/main/recipes/use_cases/RAG/HelloLlamaCloud.ipynb). See [here](https://medium.com/@rohit.k/tgi-vs-vllm-making-informed-choices-for-llm-deployment-37c56d7ff705) for a detailed comparison of vLLM and TGI.
For [Ollama](https://ollama.com) based on-prem inference with Llama 3, see the Running Llama 3 locally notebook above.
We'll use an Amazon EC2 instance running Ubuntu with an A10G 24GB GPU as an example of running vLLM and TGI with Llama 3; you can replace this with your own server for an on-prem Llama 3 deployment.
The Colab notebook to connect via LangChain with Llama 3 hosted as the vLLM and TGI API services is [here](https://colab.research.google.com/drive/1rYWLdgTGIU1yCHmRpAOB2D-84fPzmOJg), also shown in the sections below.
This tutorial assumes that you have been granted access to Meta Llama 3 on Hugging Face - you can open a Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to confirm that you see "Gated model You have been granted access to this model"; if you see "You need to agree to share your contact information to access this model", simply complete and submit the form on the page.
You'll also need your Hugging Face access token which you can get at your Settings page [here](https://huggingface.co/settings/tokens).
## Setting up vLLM with Llama 3
On a terminal, run the following commands:
```
conda create -n llama3 python=3.11
conda activate llama3
pip install vllm
```
Then run `huggingface-cli login` and copy and paste your Hugging Face access token to complete the login.
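If you prefer to authenticate from Python instead of the CLI, here is a minimal sketch using the `huggingface_hub` package (the token string below is a placeholder):
```python
# Minimal sketch: log in to Hugging Face from Python instead of `huggingface-cli login`.
from huggingface_hub import login

login(token="hf_your_access_token_here")  # placeholder: use your real access token
```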
<!-- markdown-link-check-disable -->
There are two ways to deploy Llama 3 via vLLM, as a general API server or an OpenAI-compatible server (see [here](https://platform.openai.com/docs/api-reference/authentication) on how the OpenAI API authenticates, but you won't need to provide a real OpenAI API key when running Llama 3 via vLLM in the OpenAI-compatible mode).
<!-- markdown-link-check-enable -->
### Deploying Llama 3 as an API Server
Run the command below to deploy vLLM as a general Llama 3 service:
```
python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct
```
Then on another terminal you can run:
```
curl http://localhost:5000/generate -d '{
"prompt": "Who wrote the book Innovators dilemma?",
"max_tokens": 300,
"temperature": 0
}'
```
to send a query (prompt) to Llama 3 via vLLM and get Llama 3's response:
> Who wrote the book Innovators dilemma? The book "Innovator's Dilemma" was written by Clayton M. Christensen. It was first published in 1997 and has since become a classic in the field of business and innovation. In the book, Christensen argues that successful companies often struggle to adapt to disruptive technologies and new market entrants, and that this struggle can lead to their downfall. He also introduces the concept of the "innovator's dilemma," which refers to the paradoxical situation in which a company's efforts to improve its existing products or services can actually lead to its own decline.
Now, in your Llama 3 client app, you can make an HTTP request like the `curl` command above to send a query to Llama and parse the response.
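For example, here is a minimal Python sketch using the `requests` package, assuming the server runs locally on port 5000 and, as with the demo api_server at the time of writing, returns its result in a `text` field:
```python
# Minimal sketch: query the vLLM API server from Python (server assumed on localhost:5000).
import requests

payload = {
    "prompt": "Who wrote the book Innovators dilemma?",
    "max_tokens": 300,
    "temperature": 0,
}
response = requests.post("http://localhost:5000/generate", json=payload, timeout=120)
response.raise_for_status()
# The demo api_server returns a JSON object whose "text" field holds the generated string(s).
print(response.json()["text"])
```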
If you add port 5000 to your EC2 instance's security group's inbound rules with the TCP protocol, then you can run this from your Mac/Windows machine as a test:
```
curl http://<EC2_public_ip>:5000/generate -d '{
"prompt": "Who wrote the book godfather?",
"max_tokens": 300,
"temperature": 0
}'
```
Also, if you have multiple GPUs, you can add the `--tensor-parallel-size` argument when starting the server (see [here](https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html) for more info). For example, the command below runs the Llama 3 8b-instruct model on 4 GPUs:
```
git clone https://github.com/vllm-project/vllm
cd vllm/vllm/entrypoints
conda activate llama3
python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 4
```
With multiple GPUs, you can also run replicas of the model, as long as the model fits into the memory of each targeted GPU. For example, if you have two A10Gs with 24 GB of memory each, you can run two Llama 3 8B models at the same time. This can be done by launching two API servers, each pinned to a specific GPU via `CUDA_VISIBLE_DEVICES` and listening on a different port:
`CUDA_VISIBLE_DEVICES=0 python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct`
and
`CUDA_VISIBLE_DEVICES=1 python api_server.py --host 0.0.0.0 --port 5001 --model meta-llama/Meta-Llama-3-8B-Instruct`
The benefit is that you can balance incoming requests across both replicas, achieving higher overall throughput at the cost of some per-request generation latency.
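As a minimal client-side sketch of that balancing, assuming the two replicas above listen on ports 5000 and 5001 (a real deployment would more likely sit behind a proper load balancer):
```python
# Minimal sketch: round-robin requests across two vLLM replicas (ports assumed from the commands above).
import itertools
import requests

replicas = itertools.cycle(["http://localhost:5000", "http://localhost:5001"])

def generate(prompt, max_tokens=300):
    base_url = next(replicas)  # pick the next replica in round-robin order
    response = requests.post(
        f"{base_url}/generate",
        json={"prompt": prompt, "max_tokens": max_tokens, "temperature": 0},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()

for question in ["Who wrote the book godfather?", "Who wrote the book Innovators dilemma?"]:
    print(generate(question))
```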
### Deploying Llama 3 as OpenAI-Compatible Server
You can also deploy the vLLM hosted Llama 3 as an OpenAI-Compatible service to easily replace code using OpenAI API. First, run the command below:
```
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct
```
Then on another terminal, run:
```
curl http://localhost:5000/v1/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"prompt": "Who wrote the book Innovators dilemma?",
"max_tokens": 300,
"temperature": 0
}'
```
and you'll see the following result:
> The book "Innovator's Dilemma" was written by Clayton M. Christensen. It was first published in 1997 and has since become a classic in the field of business and innovation. In the book, Christensen argues that successful companies often struggle to adapt to disruptive technologies and new market entrants, and that this struggle can lead to their downfall. He also introduces the concept of the "innovator's dilemma," which refers to the paradoxical situation in which a company's efforts to improve its existing products or services can actually lead to its own decline.
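Because this endpoint speaks the OpenAI completions protocol, you can also call it from Python with the `openai` client (v1 or later); here is a minimal sketch, assuming the server runs locally on port 5000 (any placeholder API key is accepted):
```python
# Minimal sketch: query the OpenAI-compatible vLLM server with the openai Python client (v1+).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")  # placeholder key

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Who wrote the book Innovators dilemma?",
    max_tokens=300,
    temperature=0,
)
print(completion.choices[0].text)
```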
## Querying with Llama 3 via vLLM
On a Google Colab notebook, first install two packages:
```
!pip install langchain openai
```
Note that you only need to install the `openai` package and use an `EMPTY` OpenAI API key to complete the LangChain integration with the OpenAI-compatible vLLM deployment of Llama 3.
Then replace `<vllm_server_ip_address>` below and run the code:
```
from langchain.llms import VLLMOpenAI
llm = VLLMOpenAI(
openai_api_key="EMPTY",
openai_api_base="http://<vllm_server_ip_address>:5000/v1",
model_name="meta-llama/Meta-Llama-3-8B-Instruct",
)
print(llm("Who wrote the book godfather?"))
```
You'll see an answer like:
> The book "The Godfather" was written by Mario Puzo. It was first published in 1969 and has since become a classic of American literature. The book was later adapted into a successful film directed by Francis Ford Coppola, which was released in 1972.
You can now use the Llama 3 instance `llm` created this way in any of the demo apps, or in your own Llama 3 apps, to integrate seamlessly with LangChain and build powerful on-prem applications.
## Setting Up TGI with Llama 3
The easiest way to deploy Llama 3 with TGI is using its official docker image. First, replace `<your_hugging_face_access_token>` and set the three required shell variables below (you may replace the `model` value with another Llama 3 model):
```
model=meta-llama/Meta-Llama-3-8B-Instruct
volume=$PWD/data
token=<your_hugging_face_access_token>
```
Then run the command below to deploy the Llama 3 8B Instruct model with TGI:
```
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```
After this, you'll be able to run the command below on another terminal:
```
curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' -d '{
"inputs": "Who wrote the book innovators dilemma?",
"parameters": {
"max_new_tokens":200
}
}'
```
Or its stream version:
```
curl 127.0.0.1:8080/generate_stream -X POST -H 'Content-Type: application/json' -d '{
"inputs": "Who wrote the book innovators dilemma?",
"parameters": {
"max_new_tokens":200
}
}'
```
and see an answer generated by Llama 3 via TGI like the one below:
> The book "The Innovator's Dilemma" was written by Clayton Christensen, a professor at Harvard Business School. It was first published in 1997 and has since become a widely recognized and influential book on the topic of disruptive innovation.
## Querying with Llama 3 via TGI
Using LangChain to integrate with TGI-hosted Llama 3 is also straightforward. In the Colab above, first add a new code cell to install the Hugging Face `text_generation` package:
```
!pip install text_generation
```
Then add and run the code below:
```
from langchain_community.llms import HuggingFaceTextGenInference
llm = HuggingFaceTextGenInference(
inference_server_url="http://<tgi_server_ip_address>:8080/",
max_new_tokens=512,
top_k=10,
top_p=0.95,
typical_p=0.95,
temperature=0.01,
repetition_penalty=1.03,
)
llm("Who wrote the book innovators dilemma?")
```
With the Llama 3 instance `llm` created this way, you can integrate seamlessly with LangChain to build powerful on-prem Llama 3 apps.