Commit 5eaaba41 authored by Rayyyyy

First add in 0524
# Finetuning Llama
This folder contains instructions to fine-tune Meta Llama 3 on a
* [single-GPU setup](./singlegpu_finetuning.md)
* [multi-GPU setup](./multigpu_finetuning.md)
using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
If you are new to fine-tuning techniques, check out [this overview](./LLM_finetuning_overview.md).
> [!TIP]
> If you want to try finetuning Meta Llama 3 with Huggingface's trainer, here is a Jupyter notebook with an [example](./huggingface_trainer/peft_finetuning.ipynb)
## How to configure finetuning settings?
> [!TIP]
> All the settings defined in the [config files](../../src/llama_recipes/configs/) can be passed as CLI args when running the script; there is no need to modify the config files directly (see the example command at the end of this section).
* [Training config file](../../src/llama_recipes/configs/training.py) is the main config file that helps to specify the settings for our run and can be found in [configs folder](../../src/llama_recipes/configs/)
It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
```python
model_name: str="PATH/to/Model"
tokenizer_name: str=None
enable_fsdp: bool=False
low_cpu_fsdp: bool=False
run_validation: bool=True
batch_size_training: int=4
batching_strategy: str="packing" #alternative: padding
context_length: int=4096
gradient_accumulation_steps: int=1
gradient_clipping: bool = False
gradient_clipping_threshold: float = 1.0
num_epochs: int=3
max_train_step: int=0
max_eval_step: int=0
num_workers_dataloader: int=1
lr: float=1e-4
weight_decay: float=0.0
gamma: float= 0.85
seed: int=42
use_fp16: bool=False
mixed_precision: bool=True
val_batch_size: int=1
dataset = "samsum_dataset"
peft_method: str = "lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
use_peft: bool=False
from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
output_dir: str = "PATH/to/save/PEFT/model"
freeze_layers: bool = False
num_freeze_layers: int = 1
quantization: bool = False
one_gpu: bool = False
save_model: bool = True
dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
save_optimizer: bool=False # will be used if using FSDP
use_fast_kernels: bool = False # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
use_wandb: bool = False # Enable wandb for experiment tracking
save_metrics: bool = False # saves training metrics to a json file for later plotting
flop_counter: bool = False # Enable flop counter to measure model throughput, cannot be used with the PyTorch profiler at the same time.
flop_counter_start: int = 3 # The step to start profiling, default is 3, which means after 3 warm-up steps the profiler will start to count flops.
use_profiler: bool = False # Enable PyTorch profiler, cannot be used with the flop counter at the same time.
profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
```
* [Datasets config file](../../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
* [peft config file](../../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified. We currently support LoRA and Llama-Adapter. Please note that LoRA is the only technique which is supported in combination with FSDP.
* [FSDP config file](../../src/llama_recipes/configs/fsdp.py) provides FSDP settings such as:
    * `mixed_precision` boolean flag to specify using mixed precision, defaults to True.
    * `use_fp16` boolean flag to specify using FP16 for mixed precision, defaults to False. We recommend not setting this flag and only setting `mixed_precision`, which will use `BF16`; this helps with speed and memory savings while avoiding the challenges of scaler accuracy with `FP16`.
    * `sharding_strategy` this specifies the sharding strategy for FSDP, it can be:
        * `FULL_SHARD` that shards model parameters, gradients and optimizer states, results in the most memory savings.
        * `SHARD_GRAD_OP` that shards gradients and optimizer states and keeps the parameters after the first `all_gather`. This reduces communication overhead, especially if you are using slower networks, and is more specifically beneficial in multi-node cases. This comes with the trade-off of higher memory consumption.
        * `NO_SHARD` this is equivalent to DDP, it does not shard model parameters, gradients or optimizer states. It keeps the full parameter after the first `all_gather`.
        * `HYBRID_SHARD` available on PyTorch nightlies. It does FSDP within a node and DDP between nodes. It's for multi-node cases and helpful for slower networks, provided your model fits into one node.
    * `checkpoint_type` specifies the state dict checkpoint type for saving the model. `FULL_STATE_DICT` streams the state_dict of each model shard from a rank to CPU and assembles the full state_dict on CPU. `SHARDED_STATE_DICT` saves one checkpoint per rank and enables re-loading the model in a different world size.
    * `fsdp_activation_checkpointing` enables activation checkpointing for FSDP, this saves a significant amount of memory with the trade-off of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommend you use this option.
    * `pure_bf16` it moves the model to `BFloat16` and if `optimizer` is set to `anyprecision` then optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.
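As an example, the training settings above map directly to CLI flags. A minimal sketch, assuming placeholder model and output paths:
```bash
python -m llama_recipes.finetuning \
    --model_name /path/to/Llama-3-8B \
    --use_peft --peft_method lora \
    --batch_size_training 2 \
    --num_epochs 1 \
    --lr 1e-5 \
    --output_dir /path/to/save/PEFT/model
```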
## Weights & Biases Experiment Tracking
You can enable [W&B](https://wandb.ai/) experiment tracking by using `use_wandb` flag as below. You can change the project name, entity and other `wandb.init` arguments in `wandb_config`.
```bash
python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
```
You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below.
<div style="display: flex;">
<img src="../../docs/images/wandb_screenshot.png" alt="wandb screenshot" width="500" />
</div>
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we add support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose at which step to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler can be helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
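For instance, a sketch with placeholder paths showing the two modes (the flags are the ones listed in the training config above):
```bash
# FLOPS counting with a 3-step warm-up
python -m llama_recipes.finetuning --use_peft --peft_method lora \
    --model_name /path/to/model --output_dir /path/to/save/PEFT/model \
    --flop_counter --flop_counter_start 3 --max_train_step 10

# PyTorch profiler traces instead (cannot be combined with --flop_counter; max_train_step must exceed 6)
python -m llama_recipes.finetuning --use_peft --peft_method lora \
    --model_name /path/to/model --output_dir /path/to/save/PEFT/model \
    --use_profiler --profiler_dir /path/to/profiler/output --max_train_step 10
```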
# Datasets and Evaluation Metrics
The provided fine-tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the [`recipes/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py). Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.
## Batching Strategies
Llama-recipes supports two strategies to batch requests together.
The default setting is `packing` which concatenates the tokenized samples into long sequences filling up the context length of the model.
This is the most compute-efficient variant as it avoids any padding and all sequences have the same length.
Samples at the boundary of the context length are truncated and the remainder of the cut sequence is used as the start of the next long sequence.
If the amount of training data is small, this procedure might introduce a lot of noise into the training data, which can hurt the prediction performance of the fine-tuned model.
Therefore, we also support a `padding` strategy which does not introduce the additional noise due to truncated sequences.
The strategy tries to minimize the efficiency loss by batching samples of similar length together so only minimal padding is necessary.
The batching strategy can be selected through the command line parameter `--batching_strategy [packing]/[padding]`.
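For example (placeholder paths), a PEFT run switched from the default `packing` to `padding`:
```bash
python -m llama_recipes.finetuning --use_peft --peft_method lora \
    --model_name /path/to/model --output_dir /path/to/save/PEFT/model \
    --batching_strategy padding
```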
## Using custom datasets
The list of available datasets in llama-recipes is supposed to give users a quick start on training their Llama model.
There are two possible ways to use a custom dataset.
The first is to provide a function returning the dataset in a .py file which can be given to the command line tool.
This does not involve changing the source code of llama-recipes.
The second way targets contributions which extend llama-recipes, as it involves changing the source code.
### Training on custom data
To supply a custom dataset you need to provide a single .py file which contains a function with the following signature:
```python
def get_custom_dataset(dataset_config, tokenizer, split: str):
```
For an example `get_custom_dataset` you can look at the provided datasets in llama_recipes.datasets or [examples/custom_dataset.py](custom_dataset.py).
The `dataset_config` in the above signature will be an instance of llama_recipes.configs.dataset.custom_dataset with the modifications made through the command line.
The split signals whether to return the training or validation dataset.
The default function name is `get_custom_dataset` but this can be changed as described below.
In order to start a training with the custom dataset we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.
```bash
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
```
To change the function name that is used in the .py you can append the name following a `:` like this:
```bash
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
```
This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.
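As an illustration, a minimal custom dataset module might look like the sketch below; the file name `my_dataset.py` and the use of the samsum dataset are assumptions made for this example only, and the provided [custom_dataset.py](custom_dataset.py) remains the full reference:
```python
# my_dataset.py -- minimal sketch of a custom dataset module (hypothetical example)
import datasets

def get_custom_dataset(dataset_config, tokenizer, split: str):
    # Load any Hugging Face dataset; samsum is used here purely for illustration.
    dataset = datasets.load_dataset("samsum", split=split)

    def tokenize(sample):
        text = f"Summarize this dialog:\n{sample['dialogue']}\n---\nSummary:\n{sample['summary']}{tokenizer.eos_token}"
        tokens = tokenizer(text)
        # Train on the full sequence: labels mirror input_ids.
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    return dataset.map(tokenize, remove_columns=list(dataset.features))
```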
### Adding new dataset
Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
Additionally, there is a preprocessing function for each dataset in the [datasets](../../../src/llama_recipes/datasets) folder.
The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling ```model(**data)```.
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
To add a custom dataset, the following steps need to be performed (a minimal sketch of the step 1 configuration follows this list).
1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../../../src/llama_recipes/utils/dataset_utils.py)
4. Set dataset field in training config to dataset name or use --dataset option of the `llama_recipes.finetuning` module or examples/finetuning.py training script.
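A minimal sketch of the step 1 configuration; the field names are modeled on the existing configs in configs/datasets.py and the class name is hypothetical:
```python
# Hypothetical entry added to src/llama_recipes/configs/datasets.py
from dataclasses import dataclass

@dataclass
class my_new_dataset:
    dataset: str = "my_new_dataset"
    train_split: str = "train"
    test_split: str = "validation"
```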
## Application
Below we list other datasets and their main use cases that can be used for fine tuning.
### Q&A these can be used for evaluation as well
- [MMLU](https://huggingface.co/datasets/lukaemon/mmlu/viewer/astronomy/validation)
- [BoolQ](https://huggingface.co/datasets/boolq)
- [NarrativeQA](https://huggingface.co/datasets/narrativeqa)
- [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (closed-book)
- [NaturalQuestions](https://huggingface.co/datasets/openbookqa) (open-book)
- [QuAC](https://huggingface.co/datasets/quac)
- [HellaSwag](https://huggingface.co/datasets/hellaswag)
- [OpenbookQA](https://huggingface.co/datasets/openbookqa)
- [TruthfulQA](https://huggingface.co/datasets/truthful_qa) (can be helpful for fact checking / evaluating misinformation of the model)
### Instruction finetuning
- [Alpaca](https://huggingface.co/datasets/yahma/alpaca-cleaned) 52k instruction tuning
- [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) 15k instruction tuning
### Simple text generation for quick tests
- [English quotes](https://huggingface.co/datasets/Abirate/english_quotes) 2,508 quotes; multi-label text classification, text generation
### Reasoning (used mostly for evaluation of LLMs)
- [bAbI](https://research.facebook.com/downloads/babi/)
- [Dyck](https://huggingface.co/datasets/dyk)
- [GSM8K](https://huggingface.co/datasets/gsm8k)
- [MATH](https://github.com/hendrycks/math)
- [APPS](https://huggingface.co/datasets/codeparrot/apps)
- [HumanEval](https://huggingface.co/datasets/openai_humaneval)
- [LSAT](https://huggingface.co/datasets/dmayhem93/agieval-lsat-ar)
- [Entity matching](https://huggingface.co/datasets/lighteval/EntityMatching)
### Toxicity evaluation
- [Real_toxic_prompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts)
### Bias evaluation
- [Crows_pair](https://huggingface.co/datasets/crows_pairs) gender bias
- WinoGender gender bias
### Useful Links
More information on evaluation datasets can be found at [HELM](https://crfm.stanford.edu/helm/latest/).
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# For dataset details visit: https://huggingface.co/datasets/samsum
import copy
import datasets
import itertools
B_INST, E_INST = "[INST]", "[/INST]"
def tokenize_dialog(dialog, tokenizer):
    if tokenizer.vocab_size >= 128000:
        dialog_tokens = tokenizer.apply_chat_template(dialog)
        dialog_tokens = dialog_tokens[:-4] # Remove generation prompt <|start_header_id|>assistant<|end_header_id|>\n\n
        eot_indices = [i for i,n in enumerate(dialog_tokens) if n == 128009]
        labels = copy.copy(dialog_tokens)
        last_idx = 0
        for n, idx in enumerate(eot_indices):
            if n % 2 == 1:
                last_idx = idx
            else:
                labels[last_idx:idx+1] = [-100] * (idx-last_idx+1)
        dialog_tokens = [dialog_tokens]
        labels_tokens = [labels]
    else:
        prompt_tokens = [tokenizer.encode(f"{tokenizer.bos_token}{B_INST} {(prompt['content']).strip()} {E_INST}", add_special_tokens=False) for prompt in dialog[::2]]
        answer_tokens = [tokenizer.encode(f"{answer['content'].strip()} {tokenizer.eos_token}", add_special_tokens=False) for answer in dialog[1::2]]
        dialog_tokens = list(itertools.chain.from_iterable(zip(prompt_tokens, answer_tokens)))
        #Add labels, convert prompt token to -100 in order to ignore in loss function
        labels_tokens = [len(c)*[-100,] if i % 2 == 0 else c for i,c in enumerate(dialog_tokens)]
    combined_tokens = {
        "input_ids": list(itertools.chain(*(t for t in dialog_tokens))),
        "labels": list(itertools.chain(*(t for t in labels_tokens))),
    }
    return dict(combined_tokens, attention_mask=[1]*len(combined_tokens["input_ids"]))
def get_custom_dataset(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("OpenAssistant/oasst1", split=split)
    dataset = dataset.map(lambda sample: {
        "message_id": sample["message_id"],
        "parent_id": sample["parent_id"],
        "text": sample["text"],
        },
        batched=True,
        remove_columns=list(dataset.features),)
    nodes = {}
    messages = {}
    root_ids = []
    for data in dataset:
        if data["parent_id"]:
            nodes[data["parent_id"]] = nodes.get(data["parent_id"], []) + [data["message_id"]]
        else:
            root_ids.append(data["message_id"])
        messages[data["message_id"]]=data["text"]
    def follow(thread, current_id):
        thread = copy.copy(thread) + [messages[current_id]]
        if current_id in nodes:
            new_threads = []
            for next_id in nodes[current_id]:
                new_threads += follow(thread, next_id)
            return new_threads
        else:
            return [thread]
    def get_threads_from_root(root_id):
        all_threads = []
        thread = [messages[root_id]]
        for cid in nodes[root_id]:
            all_threads += follow(thread, cid)
        return all_threads
    dataset = dataset.filter(lambda x: x["message_id"] in root_ids)
    dataset = dataset.map(lambda x: {"thread": get_threads_from_root(x["message_id"])}, remove_columns=list(dataset.features))
    dataset = dataset.map(lambda x: {"thread": [i for row in x["thread"] for i in row]}, batched=True)
    def to_dialog(thread):
        dialog = []
        for i, content in enumerate(thread):
            dialog.append({
                "role": "user" if i % 2 == 0 else "assistant",
                "content": content,
            })
        return {"dialog": dialog}
    dataset = dataset.map(lambda x: to_dialog(x["thread"]), remove_columns=list(dataset.features))
    dataset = dataset.map(lambda x: tokenize_dialog(x["dialog"], tokenizer), remove_columns=list(dataset.features))
    return dataset
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import fire
from llama_recipes.finetuning import main
if __name__ == "__main__":
fire.Fire(main)
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the GNU General Public License version 3.
#SBATCH --job-name=Nano-2d-trainer-20b-8nodes
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --gpus-per-task=4
#SBATCH --partition=train
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Enable for A100
export FI_PROVIDER="efa"
echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# debugging flags (optional)
export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=0
# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="ens"
export FI_EFA_USE_DEVICE_RDMA=1
srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora
# Fine-tuning with Multi GPU
This recipe steps you through fine-tuning a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single node or across multiple nodes.
## Requirements
Ensure that you have installed the llama-recipes package ([details](../../README.md#installing)).
We will also need 2 packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).
> [!NOTE]
> The llama-recipes package will install PyTorch 2.0.1. In case you want to use FSDP with PEFT for multi-GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies))
>
> INT8 quantization is not currently supported in FSDP
## How to run it
Get access to a machine with multiple GPUs (in this case we tested with 4 A100s and A10s).
### With FSDP + PEFT
<details open>
<summary>Single-node Multi-GPU</summary>
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
</details>
<details>
<summary>Multi-node Multi-GPU</summary>
Here we use a slurm script to schedule a job with slurm over multiple nodes.
# Change the num nodes and GPU per nodes in the script before running.
sbatch ./multi_node.slurm
</details>
We use `torchrun` to spawn multiple processes for FSDP.
The args used in the command above are:
* `--enable_fsdp` boolean flag to enable FSDP in the script
* `--use_peft` boolean flag to enable PEFT methods in the script
* `--peft_method` to specify the PEFT method; here we use `lora`, other options are `llama_adapter` and `prefix`.
### With only FSDP
If interested in running full parameter finetuning without making use of PEFT methods, please use the following command. Make sure to change the `nproc_per_node` to your available GPUs. This has been tested with `BF16` on 8xA100, 40GB GPUs.
```bash
torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
```
### Using less CPU memory (FSDP on 70B model)
If you are running full parameter fine-tuning on the 70B model, you can enable the `low_cpu_fsdp` mode as in the following command. This option loads the model on rank 0 only before moving it to the devices to construct FSDP. This can dramatically save CPU memory when loading large models like the 70B (on an 8-GPU node, this reduces CPU memory from 2+TB to 280GB for the 70B model). This has been tested with `BF16` on 16xA100, 80GB GPUs.
```bash
torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
```
## Running with different datasets
Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `alpaca_dataset` : to get this open source data please download the `alpaca.json` to the `dataset` folder.
```bash
wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
* `samsum_dataset`
To run with each of the datasets set the `dataset` flag in the command as shown below:
```bash
# grammar_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# alpaca_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# samsum_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
```
## [TIP] Slow interconnect between nodes?
In case you are dealing with a slower interconnect network between nodes, you can make use of the `--hsdp` flag to reduce the communication overhead.
HSDP (Hybrid Sharded Data Parallel) defines a hybrid sharding strategy where you use FSDP within `sharding_group_size`, which can be the minimum number of GPUs your model fits into, and DDP between the replicas of the model specified by `replica_group_size`.
This requires setting the sharding strategy in the [fsdp config](../../src/llama_recipes/configs/fsdp.py) to `ShardingStrategy.HYBRID_SHARD` and specifying two additional settings, `sharding_group_size` and `replica_group_size`, where the former specifies the sharding group size (the number of GPUs your model fits into, forming one replica of the model) and the latter specifies the replica group size, which is world_size/sharding_group_size.
```bash
torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n
```
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we add support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose at which step to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler can be helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
# Fine-tuning with Single GPU
This recipe steps you through how to finetune a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on a single GPU.
These are the instructions for using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
## Requirements
Ensure that you have installed the llama-recipes package ([details](../../README.md#installing)).
To run fine-tuning on a single GPU, we will make use of two packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) for int8 quantization.
## How to run it?
```bash
python finetuning.py --use_peft --peft_method lora --quantization --use_fp16 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
```
The args used in the command above are:
* `--use_peft` boolean flag to enable PEFT methods in the script
* `--peft_method` to specify the PEFT method; here we use `lora`, other options are `llama_adapter` and `prefix`.
* `--quantization` boolean flag to enable int8 quantization
> [!NOTE]
> In case you are using a multi-GPU machine please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id`.
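For example, to expose only the first GPU to the script:
```bash
export CUDA_VISIBLE_DEVICES=0
```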
### How to run with different datasets?
Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `alpaca_dataset` : to get this open source data please download the `alpaca.json` to `dataset` folder.
```bash
wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
* `samsum_dataset`
To run with each of the datasets, set the `dataset` flag in the command as shown below:
```bash
# grammar_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
# alpaca_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
# samsum_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
```
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we add support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose at which step to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler can be helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
# Local Inference
For local inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training the [inference script](inference.py) takes different arguments.
If you fine-tuned all model parameters, the output dir of the training has to be given as the `--model_name` argument.
In the case of a parameter-efficient method like LoRA, the base model has to be given as `--model_name` and the output dir of the training has to be given as the `--peft_model` argument.
Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as the `--prompt_file` parameter.
**Content Safety**
The inference script also supports safety checks for both user prompt and model outputs. In particular, we use two packages, [AuditNLG](https://github.com/salesforce/AuditNLG/tree/main) and [Azure content safety](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/).
**Note**
If using Azure Content Safety, please make sure to get the endpoint and API key as described [here](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/) and add them as the following environment variables: `CONTENT_SAFETY_ENDPOINT` and `CONTENT_SAFETY_KEY`.
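For instance, they can be exported before running the script (the values below are placeholders for your own Azure resource):
```bash
export CONTENT_SAFETY_ENDPOINT="https://<your-resource-name>.cognitiveservices.azure.com/"
export CONTENT_SAFETY_KEY="<your-api-key>"
```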
Examples:
```bash
# Full finetuning of all parameters
cat <test_prompt_file> | python inference.py --model_name <training_config.output_dir> --use_auditnlg
# PEFT method
cat <test_prompt_file> | python inference.py --model_name <training_config.model_name> --peft_model <training_config.output_dir> --use_auditnlg
# prompt as parameter
python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg
```
The folder contains test prompts for summarization use-case:
```
samsum_prompt.txt
...
```
**Note**
Currently, the default pad token in the [HuggingFace Tokenizer is `None`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). We add the padding token as a special token to the tokenizer, which in this case requires resizing the token_embeddings as shown below:
```python
tokenizer.add_special_tokens(
{
"pad_token": "<PAD>",
}
)
model.resize_token_embeddings(model.config.vocab_size + 1)
```
Padding would be required for batch inference. In this [example](inference.py), batch size = 1, so padding is essentially not required. However, we added the code pointer as an example in case of batch inference.
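For reference, a minimal sketch of batched inference with padding, assuming `model` and `tokenizer` have already been loaded as in [inference.py](inference.py) and the `<PAD>` token has been added as shown above:
```python
import torch

# Placeholder prompts; left-padding is the usual choice for decoder-only generation.
tokenizer.padding_side = "left"
prompts = ["Summarize this dialog: ...", "Summarize this dialog: ..."]
batch = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**batch, max_new_tokens=100)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```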
## Chat completion
The inference folder also includes a chat completion example that adds built-in safety features in fine-tuned models to the prompt tokens. To run the example:
```bash
python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json --quantization --use_auditnlg
```
## Flash Attention and Xformer Memory Efficient Kernels
Setting `use_fast_kernels` will enable the use of Flash Attention or Xformers memory-efficient kernels based on the hardware being used. This can speed up inference when used for batched inputs. This has been enabled in the `optimum` library from HuggingFace as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
```bash
python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json --quantization --use_auditnlg --use_fast_kernels
python inference.py --model_name <training_config.output_dir> --peft_model <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg --use_fast_kernels
```
## Loading back FSDP checkpoints
In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../../../src/llama_recipes/configs/fsdp.py), you can use this converter script to convert the FSDP Sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above.
**To convert the checkpoint use the following command**:
This is helpful if you have fine-tuned your model using FSDP only as follows:
```bash
torchrun --nnodes 1 --nproc_per_node 8 recipes/finetuning/finetuning.py --enable_fsdp --model_name /path_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16
```
Then convert your FSDP checkpoint to HuggingFace checkpoints using:
```bash
python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path PATH/to/FSDP/Checkpoints --consolidated_model_path PATH/to/save/checkpoints --HF_model_path_or_name PATH/or/HF/model_name
# --HF_model_path_or_name specifies the HF Llama model name or path where it has config.json and tokenizer.json
```
By default, training parameters are saved in `train_params.yaml` in the path where FSDP checkpoints are saved. In the converter script we first try to find the HuggingFace model name used in the fine-tuning to load the model with its configs from there; if it is not found, the user needs to provide it.
Then run inference using:
```bash
python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file>
```
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import fire
import os
import sys
import torch
from transformers import AutoTokenizer
from llama_recipes.inference.chat_utils import read_dialogs_from_file
from llama_recipes.inference.model_utils import load_model, load_peft_model
from llama_recipes.inference.safety_utils import get_safety_checker
from accelerate.utils import is_xpu_available
def main(
    model_name,
    peft_model: str=None,
    quantization: bool=False,
    max_new_tokens =256, #The maximum number of tokens to generate
    min_new_tokens:int=0, #The minimum number of tokens to generate
    prompt_file: str=None,
    seed: int=42, #seed value for reproducibility
    safety_score_threshold: float=0.5,
    do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
    use_cache: bool=True, #[optional] Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
    top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
    temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
    top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
    repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
    length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
    use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
    enable_llamaguard_content_safety: bool = False,
    **kwargs
):
    if prompt_file is not None:
        assert os.path.exists(
            prompt_file
        ), f"Provided Prompt file does not exist {prompt_file}"
        dialogs= read_dialogs_from_file(prompt_file)
    elif not sys.stdin.isatty():
        dialogs = "\n".join(sys.stdin.readlines())
    else:
        print("No user prompt provided. Exiting.")
        sys.exit(1)
    print(f"User dialogs:\n{dialogs}")
    print("\n==================================\n")
    # Set the seeds for reproducibility
    if is_xpu_available():
        torch.xpu.manual_seed(seed)
    else:
        torch.cuda.manual_seed(seed)
    torch.manual_seed(seed)
    model = load_model(model_name, quantization, use_fast_kernels)
    if peft_model:
        model = load_peft_model(model, peft_model)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.add_special_tokens(
        {
            "pad_token": "<PAD>",
        }
    )
    chats = tokenizer.apply_chat_template(dialogs)
    with torch.no_grad():
        for idx, chat in enumerate(chats):
            safety_checker = get_safety_checker(enable_azure_content_safety,
                                                enable_sensitive_topics,
                                                enable_salesforce_content_safety,
                                                enable_llamaguard_content_safety,
                                                )
            # Safety check of the user prompt
            safety_results = [check(dialogs[idx][0]["content"]) for check in safety_checker]
            are_safe = all([r[1] for r in safety_results])
            if are_safe:
                print(f"User prompt deemed safe.")
                print("User prompt:\n", dialogs[idx][0]["content"])
                print("\n==================================\n")
            else:
                print("User prompt deemed unsafe.")
                for method, is_safe, report in safety_results:
                    if not is_safe:
                        print(method)
                        print(report)
                print("Skipping the inference as the prompt is not safe.")
                sys.exit(1)  # Exit the program with an error status
            tokens= torch.tensor(chat).long()
            tokens= tokens.unsqueeze(0)
            if is_xpu_available():
                tokens= tokens.to("xpu:0")
            else:
                tokens= tokens.to("cuda:0")
            outputs = model.generate(
                input_ids=tokens,
                max_new_tokens=max_new_tokens,
                do_sample=do_sample,
                top_p=top_p,
                temperature=temperature,
                use_cache=use_cache,
                top_k=top_k,
                repetition_penalty=repetition_penalty,
                length_penalty=length_penalty,
                **kwargs
            )
            output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
            # Safety check of the model output
            safety_results = [check(output_text) for check in safety_checker]
            are_safe = all([r[1] for r in safety_results])
            if are_safe:
                print("User input and model output deemed safe.")
                print(f"Model output:\n{output_text}")
                print("\n==================================\n")
            else:
                print("Model output deemed unsafe.")
                for method, is_safe, report in safety_results:
                    if not is_safe:
                        print(method)
                        print(report)

if __name__ == "__main__":
    fire.Fire(main)
[
[{"role": "user", "content": "what is the recipe of mayonnaise?"}],
[
{"role": "user", "content": "I am going to Paris, what should I see?"},
{
"role": "assistant",
"content": "Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city. 2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa. 3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."
},
{"role": "user", "content": "What is so great about #1?"}
],
[
{"role": "system", "content": "Always answer with Haiku"},
{"role": "user", "content": "I am going to Paris, what should I see?"}
],
[
{
"role": "system",
"content": "Always answer with emojis"
},
{"role": "user", "content": "How to go from Beijing to NY?"}
],
[
{
"role": "system",
"content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
},
{"role": "user", "content": "Write a brief birthday message to John"}
]
]
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import fire
import os
import sys
import time
import gradio as gr
import torch
from transformers import AutoTokenizer
from llama_recipes.inference.safety_utils import get_safety_checker, AgentType
from llama_recipes.inference.model_utils import load_model, load_peft_model
from accelerate.utils import is_xpu_available
def main(
    model_name,
    peft_model: str=None,
    quantization: bool=False,
    max_new_tokens =100, #The maximum number of tokens to generate
    prompt_file: str=None,
    seed: int=42, #seed value for reproducibility
    do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
    min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
    use_cache: bool=True, #[optional] Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
    top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
    temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
    top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
    repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
    length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
    enable_llamaguard_content_safety: bool=False,
    max_padding_length: int=None, # the max padding length to be used with tokenizer padding the prompts.
    use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
    **kwargs
):
    def inference(user_prompt, temperature, top_p, top_k, max_new_tokens, **kwargs,):
        safety_checker = get_safety_checker(enable_azure_content_safety,
                                            enable_sensitive_topics,
                                            enable_salesforce_content_safety,
                                            enable_llamaguard_content_safety
                                            )
        # Safety check of the user prompt
        safety_results = [check(user_prompt) for check in safety_checker]
        are_safe = all([r[1] for r in safety_results])
        if are_safe:
            print("User prompt deemed safe.")
            print(f"User prompt:\n{user_prompt}")
        else:
            print("User prompt deemed unsafe.")
            for method, is_safe, report in safety_results:
                if not is_safe:
                    print(method)
                    print(report)
            print("Skipping the inference as the prompt is not safe.")
            sys.exit(1)  # Exit the program with an error status
        # Set the seeds for reproducibility
        if is_xpu_available():
            torch.xpu.manual_seed(seed)
        else:
            torch.cuda.manual_seed(seed)
        torch.manual_seed(seed)
        model = load_model(model_name, quantization, use_fast_kernels)
        if peft_model:
            model = load_peft_model(model, peft_model)
        model.eval()
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.pad_token = tokenizer.eos_token
        batch = tokenizer(user_prompt, padding='max_length', truncation=True, max_length=max_padding_length, return_tensors="pt")
        if is_xpu_available():
            batch = {k: v.to("xpu") for k, v in batch.items()}
        else:
            batch = {k: v.to("cuda") for k, v in batch.items()}
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(
                **batch,
                max_new_tokens=max_new_tokens,
                do_sample=do_sample,
                top_p=top_p,
                temperature=temperature,
                min_length=min_length,
                use_cache=use_cache,
                top_k=top_k,
                repetition_penalty=repetition_penalty,
                length_penalty=length_penalty,
                **kwargs
            )
        e2e_inference_time = (time.perf_counter()-start)*1000
        print(f"the inference time is {e2e_inference_time} ms")
        output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Safety check of the model output
        safety_results = [check(output_text, agent_type=AgentType.AGENT, user_prompt=user_prompt) for check in safety_checker]
        are_safe = all([r[1] for r in safety_results])
        if are_safe:
            print("User input and model output deemed safe.")
            print(f"Model output:\n{output_text}")
        else:
            print("Model output deemed unsafe.")
            for method, is_safe, report in safety_results:
                if not is_safe:
                    print(method)
                    print(report)
        return output_text

    if prompt_file is not None:
        assert os.path.exists(
            prompt_file
        ), f"Provided Prompt file does not exist {prompt_file}"
        with open(prompt_file, "r") as f:
            user_prompt = "\n".join(f.readlines())
        inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
    elif not sys.stdin.isatty():
        user_prompt = "\n".join(sys.stdin.readlines())
        inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
    else:
        gr.Interface(
            fn=inference,
            inputs=[
                gr.components.Textbox(
                    lines=9,
                    label="User Prompt",
                    placeholder="none",
                ),
                gr.components.Slider(
                    minimum=0, maximum=1, value=1.0, label="Temperature"
                ),
                gr.components.Slider(
                    minimum=0, maximum=1, value=1.0, label="Top p"
                ),
                gr.components.Slider(
                    minimum=0, maximum=100, step=1, value=50, label="Top k"
                ),
                gr.components.Slider(
                    minimum=1, maximum=2000, step=1, value=200, label="Max tokens"
                ),
            ],
            outputs=[
                gr.components.Textbox(
                    lines=5,
                    label="Output",
                )
            ],
            title="Meta Llama3 Playground",
            description="https://github.com/facebookresearch/llama-recipes",
        ).queue().launch(server_name="0.0.0.0", share=True)

if __name__ == "__main__":
    fire.Fire(main)
Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-)
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))
---
Summary:
# Running Llama3 8B Instruct on Android with MLC-LLM
Author: Thierry Moreau - tmoreau@octo.ai
# Overview
In this tutorial we'll learn how to deploy Llama3 8B Instruct on an Android-based phone using MLC-LLM.
Machine Learning Compilation for Large Language Models (MLC LLM) is a high-performance universal deployment solution that allows native deployment of any large language models with native APIs with compiler acceleration. The mission of this project is to enable everyone to develop, optimize and deploy AI models natively on everyone's devices with ML compilation techniques.
You can read more about MLC-LLM at the following [link](https://github.com/mlc-ai/mlc-llm).
MLC-LLM is also what powers the Llama3 inference APIs provided by [OctoAI](https://octo.ai/). You can use OctoAI for your Llama3 cloud-based inference needs by trying out the examples under the [following path](../../../llama_api_providers/OctoAI_API_examples/).
This tutorial was tested with the following setup:
* MacBook Pro 16 inch from 2021 with Apple M1 Max and 32GB of RAM running Sonoma 14.3.1
* OnePlus 12 Android Smartphone with a Snapdragon 8Gen3 SoC and 12GB of RAM, running OxygenOS 14.0
Running Llama3 on a phone will likely require a powerful chipset. We haven't extensively tested the range of chipsets that will support this use case. Feel free to update this README.md to specify which devices were successfully tested.
| Phone | Chipset | RAM | Status | Comments |
|------------|------------------|------|---------|----------|
| OnePlus 12 | Snapdragon 8Gen3 | 12GB | Success | None |
| | | | | |
This guide is heavily based on the [MLC Android Guide](https://llm.mlc.ai/docs/deploy/android.html), but several steps have been taken to streamline the instructions.
# Pre-requisites
## Python
Whether you're using conda or virtual env to manage your environment, we highly recommend starting from scratch with a clean new environment.
For instance with virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
```
Next you'll need to install the following packages:
```bash
python3 -m pip install -r requirements.txt
```
## Rust
[Rust](https://www.rust-lang.org/tools/install) is needed to cross-compile HuggingFace tokenizers to Android.
Make sure rustc, cargo, and rustup are available in $PATH.
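If Rust is not installed yet, one common way is the official rustup installer:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Verify the toolchain is on $PATH
rustc --version && cargo --version && rustup --version
```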
## Android Studio
Install Android Studio from <!-- markdown-link-check-disable -->https://developer.android.com/studio<!-- markdown-link-check-enable --> with NDK and CMake.
To install NDK and CMake, in the Android Studio welcome page, click “Projects → SDK Manager → SDK Tools”. Set up the following environment variables:
* ANDROID_NDK so that $ANDROID_NDK/build/cmake/android.toolchain.cmake is available.
* TVM_NDK_CC that points to NDK's clang compiler.
For instance, the paths will look like the following on OSX for user `moreau`:
```bash
# Android + TVM setup
export ANDROID_NDK="/Users/moreau/Library/Android/sdk/ndk/26.1.10909125"
export TVM_NDK_CC="$ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang"
```
This tutorial was tested successfully on Android Studio Hedgehog | 2023.1.1 Patch 1.
## JDK
A JDK, such as OpenJDK >= 17, is required to compile the Java bindings of the TVM Unity runtime.
We strongly recommend setting the JAVA_HOME to the JDK bundled with Android Studio. Using Android Studio’s JBR bundle as recommended (<!-- markdown-link-check-disable -->https://developer.android.com/build/jdks<!-- markdown-link-check-enable -->) will reduce the chances of potential errors in JNI compilation.
For instance on macOS, you'll need to point JAVA_HOME to the following.
```bash
export JAVA_HOME=/Applications/Android\ Studio.app/Contents/jbr/Contents/Home
```
To make sure the java binary can be found, run `ls $JAVA_HOME/bin/java`.
## MLC-LLM
Let's clone mlc-llm from its repo in the directory of your choice:
```bash
cd /path/to/where/to/clone/repo
git clone https://github.com/mlc-ai/mlc-llm --recursive
export MLC_LLM_HOME=/path/to/mlc-llm
```
At the time of writing this README, we tested `mlc-llm` at the following sha: `21feb7010db02e0c2149489f5972d6a8a796b5a0`.
## Phone Setup
On your phone, enable debugging in your phone's developer settings. Each phone manufacturer has its own approach to enabling debug mode, so a simple Google search should equip you with the steps to do that on your phone.
In addition, make sure to change your USB configuration from "Charging" to "MTP (Media Transfer Protocol)". This will allow us to connect to the device serially.
Connect your phone to your development machine. On OSX, you'll be prompted on the dev machine whether you want to allow the accessory to connect. Hit "Allow".
# Build Steps
## Building the Android Package with MLC
First, replace the file under `android/MLCChat/mlc-package-config.json` with the [mlc-package-config.json](./mlc-package-config.json) provided in llama-recipes.
To understand what these JSON fields mean you can refer to this [documentation](https://llm.mlc.ai/docs/deploy/android.html#step-2-build-runtime-and-model-libraries).
From the `mlc-llm` project root directory:
```bash
cd $MLC_LLM_HOME
cd android/MLCChat
python3 -m mlc_llm package --package-config mlc-package-config.json --output dist
```
The command above will take a few minutes to run as it runs through the following steps:
* Compile the Llama 3 8B Instruct model specified in the `mlc-package-config.json` into a binary model library.
* Build the `mlc-llm` runtime and tokenizer. In addition to the model itself, a lightweight runtime and tokenizer are required to actually run the LLM.
## Building and Running MLC Chat in Android Studio
Now let's launch Android Studio.
* On the "Welcome to Android Studio" page, hit "Open", and navigate to `$MLC_LLM_HOME/android/MLCChat`, then hit "Open"
* A window will pop up asking whether to "Trust and Open project 'MLCChat'" - hit "Trust Project"
* The project will now launch
* Under File -> Project Structure... -> Project change the Gradle Version (second drop down from the top) to 8.5
Connect your phone to your development machine - assuming you've followed the setup steps in the pre-requisite section, you should be able to see the device.
Next you'll need to:
* Hit Build -> Make Project.
* Hit Run -> Run 'app'
The MLCChat app will launch on your phone. Now, on your phone:
* Under Model List you'll see the `Llama-3-8B-Instruct` LLM listed.
* The model isn't quite ready to launch yet, because the weights need to be downloaded over WiFi first. Hit the Download button to the right of the model name to download the weights from HuggingFace.
Note that you can change the build settings to bundle the weights with the MLCChat app so you don't have to download the weights over wifi. To do so you can follow the instructions [here](https://llm.mlc.ai/docs/deploy/android.html#bundle-model-weights).
Once the model weights are downloaded you can now interact with Llama 3 locally on your Android phone!
{
"device": "android",
"model_list": [
{
"model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
"estimated_vram_bytes": 4348727787,
"model_id": "Llama-3-8B-Instruct",
"overrides": {
"context_window_size": 768,
"prefill_chunk_size": 256
}
}
]
}
--pre
--find-links https://mlc.ai/wheels
mlc-llm-nightly
mlc-ai-nightly
attrs
decorator
numpy
psutil
pydantic
requests
scipy
setuptools
torch
tqdm
## [Running Llama 3 On-Prem with vLLM and TGI](llama-on-prem.md)
This tutorial shows how to use Llama 3 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference) to build Llama 3 on-prem apps.