Commit 5eaaba41 authored by Rayyyyy

First add in 0524

transformers
requests
azure-core
azure-ai-contentsafety
torch
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
{
"add_bos_token": true,
"add_eos_token": false,
"bos_token": {
"__type": "AddedToken",
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"clean_up_tokenization_spaces": false,
"eos_token": {
"__type": "AddedToken",
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"legacy": true,
"use_default_system_prompt": false,
"model_max_length": 1000000000000000019884624838656,
"pad_token": null,
"sp_model_kwargs": {},
"tokenizer_class": "LlamaTokenizerFast",
"unk_token": {
"__type": "AddedToken",
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
# Code Llama
Code Llama was recently released in three flavors: a base model that supports multiple programming languages, a Python fine-tuned model, and an instruction fine-tuned and aligned variation of Code Llama. Please read more [here](https://ai.meta.com/blog/code-llama-large-language-model-coding/). Also note that the Python fine-tuned models and the 34B models are not trained on the infilling objective, hence they cannot be used for the infilling use case.
This folder contains the scripts to run Code Llama, with two examples: code completion and infilling.
**Note** Please find the right model on the HF side [here](https://huggingface.co/codellama).
Make sure to install Transformers from source for now:
```bash
pip install git+https://github.com/huggingface/transformers
```
To run the code completion example:
```bash
python code_completion_example.py --model_name MODEL_NAME --prompt_file code_completion_prompt.txt --temperature 0.2 --top_p 0.9
```
To run the code infilling example:
```bash
python code_infilling_example.py --model_name MODEL_NAME --prompt_file code_infilling_prompt.txt --temperature 0.2 --top_p 0.9
```
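The infilling prompt marks the span to be generated with a `<FILL_ME>` sentinel, which the script replaces with the model output. For reference, a minimal prompt in the spirit of the included `code_infilling_prompt.txt` looks like this:
```python
# Minimal infilling prompt: the model generates the text that replaces <FILL_ME>.
def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
```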
To run the 70B Instruct model example, run the following (you'll be prompted to enter the system and user prompts to instruct the model):
```bash
python code_instruct_example.py --model_name codellama/CodeLlama-70b-Instruct-hf --temperature 0.2 --top_p 0.9
```
You can learn more about the chat prompt template [on HF](https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf#chat-prompt) and in the [original Code Llama repository](https://github.com/facebookresearch/codellama/blob/main/README.md#fine-tuned-instruction-models). The HF tokenizer already takes care of the chat template, as shown in this example.
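As a rough, minimal sketch of what the instruct example does (using the standard HF APIs rather than the helper functions in this repo; the prompts below are just placeholders), the chat template can be applied like this:
```python
# Minimal sketch of prompting a Code Llama Instruct model via the HF chat template.
# The full example, including safety checks, is in code_instruct_example.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-70b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [
    {"role": "system", "content": "Answer with a short Python snippet."},
    {"role": "user", "content": "Write a function that reverses a string."},
]
# apply_chat_template inserts the Code Llama Instruct chat formatting for us.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.2
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```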
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import fire
import os
import sys
import time
import torch
from transformers import AutoTokenizer
from llama_recipes.inference.safety_utils import get_safety_checker
from llama_recipes.inference.model_utils import load_model, load_peft_model
def main(
model_name,
peft_model: str=None,
quantization: bool=False,
    max_new_tokens: int=100, #The maximum number of tokens to generate
prompt_file: str=None,
seed: int=42, #seed value for reproducibility
do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
    use_cache: bool=True,  #[optional] Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.
top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: float=0.6, # [optional] The value used to modulate the next token probabilities.
top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard
    use_fast_kernels: bool = True, # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and Xformer memory-efficient kernels
**kwargs
):
if prompt_file is not None:
assert os.path.exists(
prompt_file
), f"Provided Prompt file does not exist {prompt_file}"
with open(prompt_file, "r") as f:
user_prompt = f.read()
else:
print("No user prompt provided. Exiting.")
sys.exit(1)
# Set the seeds for reproducibility
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
model = load_model(model_name, quantization, use_fast_kernels)
if peft_model:
model = load_peft_model(model, peft_model)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
safety_checker = get_safety_checker(enable_azure_content_safety,
enable_sensitive_topics,
enable_salesforce_content_safety,
enable_llamaguard_content_safety,
)
# Safety check of the user prompt
safety_results = [check(user_prompt) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print("User prompt deemed safe.")
print(f"User prompt:\n{user_prompt}")
else:
print("User prompt deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
print("Skipping the inference as the prompt is not safe.")
sys.exit(1) # Exit the program with an error status
batch = tokenizer(user_prompt, return_tensors="pt")
batch = {k: v.to("cuda") for k, v in batch.items()}
start = time.perf_counter()
with torch.no_grad():
outputs = model.generate(
**batch,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
min_length=min_length,
use_cache=use_cache,
top_k=top_k,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
**kwargs
)
e2e_inference_time = (time.perf_counter()-start)*1000
print(f"the inference time is {e2e_inference_time} ms")
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Safety check of the model output
safety_results = [check(output_text) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print("User input and model output deemed safe.")
print(f"Model output:\n{output_text}")
else:
print("Model output deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
if __name__ == "__main__":
fire.Fire(main)
import argparse
def main(string: str):
print(string)
print(string[::-1])
if __name__ == "__main__":
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import fire
import torch
import os
import sys
import time
from transformers import AutoTokenizer
from llama_recipes.inference.safety_utils import get_safety_checker
from llama_recipes.inference.model_utils import load_model, load_peft_model
def main(
model_name,
peft_model: str=None,
quantization: bool=False,
    max_new_tokens: int=100, #The maximum number of tokens to generate
prompt_file: str=None,
seed: int=42, #seed value for reproducibility
do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
    use_cache: bool=True,  #[optional] Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.
top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: float=0.6, # [optional] The value used to modulate the next token probabilities.
top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard
    use_fast_kernels: bool = True, # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and Xformer memory-efficient kernels
**kwargs
):
if prompt_file is not None:
assert os.path.exists(
prompt_file
), f"Provided Prompt file does not exist {prompt_file}"
with open(prompt_file, "r") as f:
user_prompt = f.read()
else:
print("No user prompt provided. Exiting.")
sys.exit(1)
# Set the seeds for reproducibility
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
model = load_model(model_name, quantization, use_fast_kernels)
model.config.tp_size=1
if peft_model:
model = load_peft_model(model, peft_model)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
safety_checker = get_safety_checker(enable_azure_content_safety,
enable_sensitive_topics,
enable_salesforce_content_safety,
enable_llamaguard_content_safety,
)
# Safety check of the user prompt
safety_results = [check(user_prompt) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print("User prompt deemed safe.")
print(f"User prompt:\n{user_prompt}")
else:
print("User prompt deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
print("Skipping the inference as the prompt is not safe.")
sys.exit(1) # Exit the program with an error status
batch = tokenizer(user_prompt, return_tensors="pt")
batch = {k: v.to("cuda") for k, v in batch.items()}
start = time.perf_counter()
with torch.no_grad():
outputs = model.generate(
**batch,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
min_length=min_length,
use_cache=use_cache,
top_k=top_k,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
**kwargs
)
e2e_inference_time = (time.perf_counter()-start)*1000
print(f"the inference time is {e2e_inference_time} ms")
filling = tokenizer.batch_decode(outputs[:, batch["input_ids"].shape[1]:], skip_special_tokens=True)[0]
# Safety check of the model output
safety_results = [check(filling) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print("User input and model output deemed safe.")
print(user_prompt.replace("<FILL_ME>", filling))
else:
print("Model output deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
if __name__ == "__main__":
fire.Fire(main)
def remove_non_ascii(s: str) -> str:
""" <FILL_ME>
return result
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import fire
import os
import sys
import time
import torch
from transformers import AutoTokenizer
from llama_recipes.inference.safety_utils import get_safety_checker
from llama_recipes.inference.model_utils import load_model, load_peft_model
def handle_safety_check(are_safe_user_prompt, user_prompt, safety_results_user_prompt, are_safe_system_prompt, system_prompt, safety_results_system_prompt):
"""
Handles the output based on the safety check of both user and system prompts.
Parameters:
- are_safe_user_prompt (bool): Indicates whether the user prompt is safe.
- user_prompt (str): The user prompt that was checked for safety.
- safety_results_user_prompt (list of tuples): A list of tuples for the user prompt containing the method, safety status, and safety report.
- are_safe_system_prompt (bool): Indicates whether the system prompt is safe.
- system_prompt (str): The system prompt that was checked for safety.
- safety_results_system_prompt (list of tuples): A list of tuples for the system prompt containing the method, safety status, and safety report.
"""
def print_safety_results(are_safe_prompt, prompt, safety_results, prompt_type="User"):
"""
Prints the safety results for a prompt.
Parameters:
- are_safe_prompt (bool): Indicates whether the prompt is safe.
- prompt (str): The prompt that was checked for safety.
- safety_results (list of tuples): A list of tuples containing the method, safety status, and safety report.
- prompt_type (str): The type of prompt (User/System).
"""
if are_safe_prompt:
print(f"{prompt_type} prompt deemed safe.")
print(f"{prompt_type} prompt:\n{prompt}")
else:
print(f"{prompt_type} prompt deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
print(f"Skipping the inference as the {prompt_type.lower()} prompt is not safe.")
sys.exit(1)
# Check user prompt
print_safety_results(are_safe_user_prompt, user_prompt, safety_results_user_prompt, "User")
# Check system prompt
print_safety_results(are_safe_system_prompt, system_prompt, safety_results_system_prompt, "System")
def main(
model_name,
peft_model: str=None,
quantization: bool=False,
    max_new_tokens: int=100, #The maximum number of tokens to generate
seed: int=42, #seed value for reproducibility
do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
    use_cache: bool=False,  #[optional] Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.
top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: float=0.6, # [optional] The value used to modulate the next token probabilities.
top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard
    use_fast_kernels: bool = True, # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and Xformer memory-efficient kernels
**kwargs
):
system_prompt = input("Please insert your system prompt: ")
user_prompt = input("Please insert your prompt: ")
chat = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
]
# Set the seeds for reproducibility
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
model = load_model(model_name, quantization, use_fast_kernels)
if peft_model:
model = load_peft_model(model, peft_model)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
safety_checker = get_safety_checker(enable_azure_content_safety,
enable_sensitive_topics,
enable_salesforce_content_safety,
enable_llamaguard_content_safety,
)
# Safety check of the user prompt
safety_results_user_prompt = [check(user_prompt) for check in safety_checker]
safety_results_system_prompt = [check(system_prompt) for check in safety_checker]
are_safe_user_prompt = all([r[1] for r in safety_results_user_prompt])
are_safe_system_prompt = all([r[1] for r in safety_results_system_prompt])
handle_safety_check(are_safe_user_prompt, user_prompt, safety_results_user_prompt, are_safe_system_prompt, system_prompt, safety_results_system_prompt)
inputs = tokenizer.apply_chat_template(chat, return_tensors="pt").to("cuda")
start = time.perf_counter()
with torch.no_grad():
outputs = model.generate(
input_ids=inputs,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
min_length=min_length,
use_cache=use_cache,
top_k=top_k,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
**kwargs
)
e2e_inference_time = (time.perf_counter()-start)*1000
print(f"the inference time is {e2e_inference_time} ms")
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Safety check of the model output
safety_results = [check(output_text) for check in safety_checker]
are_safe = all([r[1] for r in safety_results])
if are_safe:
print("User input and model output deemed safe.")
print(f"Model output:\n{output_text}")
else:
print("Model output deemed unsafe.")
for method, is_safe, report in safety_results:
if not is_safe:
print(method)
print(report)
if __name__ == "__main__":
fire.Fire(main)
# Llama Model Evaluation
Llama-Recipes makes use of `lm-evaluation-harness` for evaluating our fine-tuned Meta Llama 3 (or Llama 2) models. It can also serve as a tool to evaluate quantized models, to make sure quality is preserved at lower precision or with any other optimization applied to the model.
`lm-evaluation-harness` provides a wide range of [features](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#overview):
- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Support for models loaded via transformers (including quantization via AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface.
- Support for fast and memory-efficient inference with vLLM.
- Support for commercial APIs including OpenAI, and TextSynth.
- Support for evaluation on adapters (e.g. LoRA) supported in Hugging Face's PEFT library.
- Support for local models and benchmarks.
The Language Model Evaluation Harness is also the backend for 🤗 [Hugging Face's (HF) popular Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
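The harness can also be driven from Python. The `eval.py` script in this folder is essentially a thin wrapper around `lm_eval.simple_evaluate`; a minimal sketch (with illustrative arguments) looks like this:
```python
# Minimal sketch of running lm-evaluation-harness programmatically.
# eval.py below wraps the same call with argument parsing, logging and output handling.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=float",
    tasks=["hellaswag"],
    num_fewshot=10,
    batch_size=8,
    device="cuda:0",
    limit=100,  # evaluate a subset only; drop this for a full run
)
print(results["results"])
```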
## Setup
Before running the evaluation script, ensure you have all the necessary dependencies installed.
### Dependencies
- Python 3.8+
- Your language model's dependencies
### Installation
Clone the lm-evaluation-harness repository and install it:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```
### Quick Test
To run evaluation for the Hugging Face `Llama 3 8B` model on a single GPU, please run the following:
```bash
python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B --tasks hellaswag --device cuda:0 --batch_size 8
```
Multiple tasks can be selected by separating them with `,`, for example `--tasks hellaswag,arc`.
To set the number of shots for few-shot evaluation, use the `--num_fewshot` flag.
### PEFT Fine-tuned model Evaluation
In case you have fine-tuned your model using PEFT, you can set the path to the PEFT checkpoints with the `peft` key in `model_args`, as shown below:
```bash
python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8
```
### Limit the number of examples in benchmarks
There has been a study from [IBM on efficient benchmarking of LLMs](https://arxiv.org/pdf/2308.11696.pdf), with the main takeaway that, to identify whether a model is performing poorly, benchmarking on a wider range of tasks matters more than the number of examples in each task. This means you can run the evaluation harness with fewer examples per task to get an initial indication of whether performance has regressed from the baseline. The number of examples can be limited with the `--limit` flag, set to the desired number, but for a full assessment you would still need to run the complete evaluation. Please read more in the paper linked above.
```bash
python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100
```
### Reproducing Hugging Face Open-LLM-Leaderboard
Here, we provide a list of tasks from the `Open-LLM-Leaderboard`, which can be used by passing `--open_llm_leaderboard_tasks` instead of `--tasks` to `eval.py`.
**NOTE** Make sure to run the bash script below, which will set the `include` paths in the [config files](./open_llm_leaderboard/). The script will prompt you to enter the path to the cloned lm-evaluation-harness repo. **You only need this step the first time.**
```bash
bash open_llm_eval_prep.sh
```
Now we can run the eval benchmark:
```bash
python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="float",peft=../peft_output --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100 --open_llm_leaderboard_tasks
```
In the HF leaderboard, the [LLMs are evaluated on 6 benchmarks](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) from the Language Model Evaluation Harness, as described below:
- AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
- HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is in practice at minimum a 6-shot task, since 6 examples are systematically prepended even when the number of few-shot examples is set to 0.
- Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
For all these evaluations, a higher score is a better score. We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
### Customized Llama Model
In case you have customized the Llama model, for example a quantized version of the model that is loaded differently from a normal HF model, you can follow [this guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage) to add your model to `eval.py` and run the eval benchmarks.
You can also find the full task list [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).
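As a rough sketch of what that guide describes (the exact base class and method names should be checked against the lm-evaluation-harness version you have installed), a custom wrapper subclasses the harness's `LM` interface and is passed to `simple_evaluate` directly; the loader below is hypothetical:
```python
# Hedged sketch of plugging a customized (e.g. quantized) Llama model into the harness.
# Class and method names follow the external-library-usage guide linked above and may
# differ between lm-evaluation-harness versions; load_my_quantized_model is hypothetical.
import lm_eval
from lm_eval.api.model import LM


class MyQuantizedLlama(LM):
    def __init__(self, checkpoint_path: str):
        super().__init__()
        self.model = load_my_quantized_model(checkpoint_path)  # hypothetical loader

    def loglikelihood(self, requests):
        raise NotImplementedError  # return (logprob, is_greedy) per request

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError

    def generate_until(self, requests):
        raise NotImplementedError  # return one generated string per request


results = lm_eval.simple_evaluate(
    model=MyQuantizedLlama("path/to/quantized/checkpoint"),
    tasks=["hellaswag"],
)
```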
#### Multi-GPU Evaluation with Hugging Face `accelerate`
Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library can be used for multi-GPU evaluation as it is supported by `lm-evaluation-harness`.
To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), leverage the `accelerate` launcher as follows:
```bash
accelerate config
accelerate launch eval.py --model hf --model_args "pretrained=meta-llama/Meta-Llama-3-8B" --limit 100 --open_llm_leaderboard_tasks --output_path ./results.json --log_samples
```
In case your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.
**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.
If your model is *too large to fit on a single GPU*, run the library *outside of the `accelerate` launcher*, but pass `parallelize=True` to `--model_args` as follows:
```bash
python eval.py --model hf --model_args "pretrained=meta-llama/Meta-Llama-3-8B,parallelize=True" --limit 100 --open_llm_leaderboard_tasks --output_path ./results.json --log_samples
```
This means that your model's weights will be split across all available GPUs.
For more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well:
- `device_map_option`: How to split model weights across available GPUs. Defaults to "auto".
- `max_memory_per_gpu`: the max GPU memory to use per GPU in loading the model.
- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.
- `offload_folder`: a folder where model weights will be offloaded to disk if needed.
These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
### Tensor + Data Parallel and Optimized Inference with `vLLM`
`lm-evaluation-harness` also supports vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), which is especially useful when splitting a model across multiple GPUs. It covers single-GPU or multi-GPU inference with tensor parallelism, data parallelism, or a combination of both, for example:
```bash
python eval.py --model vllm --model_args "pretrained=meta-llama/Meta-Llama-3-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=2" --limit 100 --open_llm_leaderboard_tasks --output_path ./results.json --log_samples --batch_size auto
```
For a full list of supported vLLM configurations, please refer to [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/076372ee9ee81e25c4e2061256400570354a8d1a/lm_eval/models/vllm_causallms.py#L44-L62).
**Note from `lm-evaluation-harness`:** vLLM occasionally differs in output from Hugging Face. We treat Hugging Face as the reference implementation, and provide a [script](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vLLM results against HF.
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
import numpy as np
import lm_eval
from lm_eval import tasks
# from lm_eval.utils import make_table
from lm_eval.evaluator import make_table
def _handle_non_serializable(o):
if isinstance(o, np.int64) or isinstance(o, np.int32):
return int(o)
elif isinstance(o, set):
return list(o)
else:
return str(o)
def setup_logging(verbosity):
logging.basicConfig(
level=verbosity.upper(), format="%(asctime)s - %(levelname)s - %(message)s"
)
return logging.getLogger(__name__)
def handle_output(args, results, logger):
if not args.output_path:
if args.log_samples:
logger.error("Specify --output_path for logging samples.")
sys.exit(1)
logger.info(json.dumps(results, indent=2, default=_handle_non_serializable))
return
path = Path(args.output_path)
if path.is_file() or path.with_name("results.json").is_file():
logger.warning(f"File already exists at {path}. Results will be overwritten.")
output_dir = path.parent if path.suffix in (".json", ".jsonl") else path
output_dir.mkdir(parents=True, exist_ok=True)
results_str = json.dumps(results, indent=2, default=_handle_non_serializable)
if args.show_config:
logger.info(results_str)
file_path = os.path.join(args.output_path, "results.json")
with open(file_path , "w", encoding="utf-8") as f:
f.write(results_str)
if args.log_samples:
samples = results.pop("samples", {})
for task_name, _ in results.get("configs", {}).items():
output_name = re.sub(r"/|=", "__", args.model_args) + "_" + task_name
sample_file = output_dir.joinpath(f"{output_name}.jsonl")
sample_data = json.dumps(
samples.get(task_name, {}), indent=2, default=_handle_non_serializable
)
sample_file.write_text(sample_data, encoding="utf-8")
batch_sizes = ",".join(map(str, results.get("config", {}).get("batch_sizes", [])))
summary = f"{args.model} ({args.model_args}), gen_kwargs: ({args.gen_kwargs}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, batch_size: {args.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
logger.info(summary)
logger.info(make_table(results))
if "groups" in results:
logger.info(make_table(results, "groups"))
def load_tasks(args):
if args.open_llm_leaderboard_tasks:
current_dir = os.getcwd()
config_dir = os.path.join(current_dir, "open_llm_leaderboard")
task_manager = tasks.TaskManager(include_path=config_dir)
return task_manager, [
"arc_challenge_25_shot",
"hellaswag_10_shot",
"truthfulqa_mc2",
"winogrande_5_shot",
"gsm8k",
"mmlu",
]
return None, args.tasks.split(",") if args.tasks else []
def parse_eval_args():
parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument(
"--model", "-m", default="hf", help="Name of model, e.g., `hf`."
)
parser.add_argument(
"--tasks",
"-t",
default=None,
help="Comma-separated list of tasks, or 'list' to display available tasks.",
)
parser.add_argument(
"--model_args",
"-a",
default="",
help="Comma-separated string arguments for model, e.g., `pretrained=EleutherAI/pythia-160m`.",
)
parser.add_argument(
"--open_llm_leaderboard_tasks",
"-oplm",
action="store_true",
default=False,
help="Choose the list of tasks with specification in HF open LLM-leaderboard.",
)
parser.add_argument(
"--num_fewshot",
"-f",
type=int,
default=0,
help="Number of examples in few-shot context.",
)
parser.add_argument(
"--batch_size",
"-b",
default=1,
help="Batch size, can be 'auto', 'auto:N', or an integer.",
)
parser.add_argument(
"--max_batch_size",
type=int,
default=None,
help="Maximal batch size with 'auto' batch size.",
)
parser.add_argument(
"--device", default=None, help="Device for evaluation, e.g., 'cuda', 'cpu'."
)
parser.add_argument(
"--output_path", "-o", type=str, default=None, help="Path for saving results."
)
parser.add_argument(
"--limit",
"-L",
type=float,
default=None,
help="Limit number of examples per task.",
)
parser.add_argument(
"--use_cache", "-c", default=None, help="Path to cache db file, if used."
)
parser.add_argument(
"--verbosity",
"-v",
default="INFO",
help="Logging level: CRITICAL, ERROR, WARNING, INFO, DEBUG.",
)
parser.add_argument(
"--gen_kwargs",
default=None,
help="Generation kwargs for tasks that support it.",
)
parser.add_argument(
"--check_integrity",
action="store_true",
help="Whether to run the relevant part of the test suite for the tasks.",
)
parser.add_argument(
"--write_out",
"-w",
action="store_true",
default=False,
help="Prints the prompt for the first few documents.",
)
parser.add_argument(
"--log_samples",
"-s",
action="store_true",
default=False,
help="If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis.",
)
parser.add_argument(
"--show_config",
action="store_true",
default=False,
help="If True, shows the full config of all tasks at the end of the evaluation.",
)
parser.add_argument(
"--include_path",
type=str,
default=None,
help="Additional path to include if there are external tasks.",
)
return parser.parse_args()
def evaluate_model(args):
try:
task_manager, task_list = load_tasks(args)
# Customized model such as Quantized model etc.
# In case you are working with a custom model, you can use the following guide to add it here:
# https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage
# Evaluate
results = lm_eval.simple_evaluate(
model=args.model,
model_args=args.model_args,
tasks=task_list,
num_fewshot=args.num_fewshot,
batch_size=args.batch_size,
max_batch_size=args.max_batch_size,
device=args.device,
no_cache=args.use_cache,
limit=args.limit,
check_integrity=args.check_integrity,
write_out=args.write_out,
log_samples=args.log_samples,
gen_kwargs=args.gen_kwargs,
task_manager=task_manager,
)
handle_output(args, results, logger)
except Exception as e:
logger.error(f"An error occurred during evaluation: {e}")
sys.exit(1)
if __name__ == "__main__":
args = parse_eval_args()
logger = setup_logging(args.verbosity)
evaluate_model(args)
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# Prompt the user for the EVAL_PATH
read -p "Enter the asbolute path to the lm-evaluation-harness: " EVAL_PATH
conda activate
# Directory containing YAML files
DIR="open_llm_leaderboard"
# Check if the directory exists
if [ ! -d "$DIR" ]; then
echo "Error: Directory '$DIR' not found."
exit 1
fi
# Iterate over YAML files in the directory and update them
for YAML_FILE in "$DIR"/*.yaml
do
if [ -f "$YAML_FILE" ]; then
sed -i 's|{\$EVAL_PATH}|'"$EVAL_PATH"'|g' "$YAML_FILE"
echo "Updated $YAML_FILE with EVAL_PATH: $EVAL_PATH"
fi
done
include: {$EVAL_PATH}/lm_eval/tasks/arc/arc_challenge.yaml
task: arc_challenge_25_shot
task_alias: arc 25 shot
num_fewshot: 25
metric_list:
- metric: acc_norm
include: {$EVAL_PATH}/lm_eval/tasks/hellaswag/hellaswag.yaml
task: hellaswag_10_shot
task_alias: hellaswag 10 shot
num_fewshot: 10
metric_list:
- metric: acc_norm
import datasets
import re
def preprocess(text):
text = text.strip()
# NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
text = text.replace(" [title]", ". ")
text = re.sub("\\[.*?\\]", "", text)
    text = text.replace("  ", " ")  # collapse double spaces left by the bracket removal
return text
def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
def _process_doc(doc):
ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
out_doc = {
"query": preprocess(doc["activity_label"] + ": " + ctx),
"choices": [preprocess(ending) for ending in doc["endings"]],
"gold": int(doc["label"]),
}
return out_doc
return dataset.map(_process_doc)
include: {$EVAL_PATH}/lm_eval/tasks/mmlu/default/_mmlu.yaml
task:
- mmlu_stem
- mmlu_other
- mmlu_social_sciences
- mmlu_humanities
num_fewshot: 5
metric_list:
- metric: acc
include: {$EVAL_PATH}/lm_eval/tasks/winogrande/default.yaml
task: winogrande_5_shot
task_alias: winogrande 5 shot
num_fewshot: 5
metric_list:
- metric: acc
## LLM Fine-Tuning
Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
## 1. **Parameter Efficient Model Fine-Tuning**
This helps make the fine-tuning process more affordable even on a single consumer-grade GPU. These methods keep the whole model frozen and just add tiny learnable parameters/layers to it, so we train only a very small portion of the parameters. The most famous methods in this category are [LoRA](https://arxiv.org/pdf/2106.09685.pdf), Llama Adapter, and prefix-tuning.
These methods will address three aspects:
- **Cost of full fine-tuning** – these methods only train a small set of extra parameters instead of the full model, which makes it possible to run them on consumer GPUs.
- **Cost of deployment** – for each fine-tuned downstream task we would otherwise need to deploy a separate full model; with these methods, only a small set of parameters (a few MB instead of several GB) needs to be stored per task on top of the pretrained model. The pretrained model acts as a backbone and these extra parameters act as heads for the different tasks.
- **Catastrophic forgetting** – these methods also help mitigate the forgetting of previously learned capabilities that can happen during fine-tuning.
The HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods, which we make use of here. Please read more [here](https://huggingface.co/blog/peft).
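As a minimal sketch of how LoRA is attached with PEFT (the rank, scaling, and target modules below are illustrative defaults, not the exact values used by the recipes):
```python
# Minimal LoRA sketch with HF PEFT; hyperparameters here are illustrative.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the small adapter weights are trainable
```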
## 2. **Full/ Partial Parameter Fine-Tuning**
Full parameter fine-tuning has its own advantages; within this approach there are multiple strategies that can help:
- Keep the pretrained model frozen and only fine-tune the task head, for example the classifier head.
- Keep the pretrained model frozen and add a few fully connected layers on the top.
- Fine-tuning on all the layers.
You can also keep most of the layers frozen and only fine-tune a few layers. There are many different techniques to choose from to freeze/unfreeze layers based on different criteria.
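For example, a partial fine-tuning setup might freeze everything except the last few decoder blocks and the output head; a rough sketch, assuming the standard HF Llama module layout (`model.model.layers`, `model.lm_head`):
```python
# Rough sketch of partial fine-tuning: freeze the backbone, unfreeze the last two
# decoder blocks and the LM head. Assumes the standard HF Llama layout.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

for param in model.parameters():           # freeze everything first
    param.requires_grad = False

for block in model.model.layers[-2:]:      # unfreeze the last two transformer blocks
    for param in block.parameters():
        param.requires_grad = True

for param in model.lm_head.parameters():   # and the output head
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```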
<div style="display: flex;">
<img src="../../docs/images/feature-based_FN.png" alt="Image 1" width="250" />
<img src="../../docs/images/feature-based_FN_2.png" alt="Image 2" width="250" />
<img src="../../docs/images/full-param-FN.png" alt="Image 3" width="250" />
</div>
In this scenario, depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case, full fine-tuning of the Meta Llama 3 8B model won't fit into one GPU.
The way to think about it is that you need enough GPU memory to keep the model parameters, gradients, and optimizer states, where each of these, depending on the precision you are training in, takes up a multiple of your parameter count times the bytes per value (fp32 = 4 bytes, fp16 = 2 bytes, bf16 = 2 bytes).
For example, the AdamW optimizer keeps 2 states for each of your parameters, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training/unfreezing, your memory requirement can grow beyond one GPU (see the back-of-the-envelope estimate below).
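As a back-of-the-envelope example (ignoring activations and framework overhead), full fine-tuning of an 8B-parameter model in bf16 with AdamW states kept in fp32 needs roughly:
```python
# Back-of-the-envelope memory estimate for full fine-tuning; ignores activations,
# temporary buffers and framework overhead.
params = 8e9                       # ~8B parameters (e.g. Meta Llama 3 8B)
bytes_weights = params * 2         # bf16 weights: 2 bytes per parameter
bytes_grads = params * 2           # bf16 gradients: 2 bytes per parameter
bytes_optimizer = params * 2 * 4   # AdamW: 2 states per parameter, often kept in fp32

total_gb = (bytes_weights + bytes_grads + bytes_optimizer) / 1e9
print(f"~{total_gb:.0f} GB for weights, gradients and optimizer states alone")
# ~96 GB, well beyond a single GPU, which is what motivates FSDP below.
```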
**FSDP (Fully Sharded Data Parallel)**
PyTorch provides the FSDP package for training models that do not fit into one GPU. FSDP lets you train a much larger model with the same amount of resources. Prior to FSDP there was DDP (Distributed Data Parallel), where each GPU holds a full replica of the model and only the data is sharded; at the end of the backward pass the gradients are synchronized.
FSDP extends this idea by sharding not only the data but also the model parameters, gradients, and optimizer states, so each GPU keeps only one shard of the model. This results in huge memory savings that let us fit a much larger model into the same number of GPUs. As an example, with DDP the most you could fit into a GPU with 16GB of memory is a model of around 700M parameters; even with 4 such GPUs you still cannot scale beyond the model size that fits into one GPU. With FSDP, however, you can fit a 3B model into those same 4 GPUs, a more than 4x larger model.
Please read more on FSDP [here](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) & get started with FSDP [here](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html).
To boost the performance of fine-tuning with FSDP, we can make use of a number of features, such as:
- **Mixed Precision**, which in FSDP is much more flexible compared to Autocast. It gives the user control over setting the precision for model parameters, buffers, and gradients.
- **Activation Checkpointing**, a technique that saves memory by discarding the intermediate activations in the forward pass instead of keeping them in memory, at the cost of recomputing them in the backward pass. FSDP activation checkpointing is shard-aware, meaning we need to apply it after wrapping the model with FSDP; our script makes use of that.
- **auto_wrap_policy**, which is the way to specify how FSDP should partition the model; there is default support for a transformer wrapping policy. This lets FSDP form each FSDP unit (partition of the model) based on the transformer block class in the model. To identify this layer, look for the class that wraps both the attention layer and the MLP. Finer-grained FSDP units help optimize the communication cost. A short sketch combining these features follows the list.
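The sketch below combines these three features. It assumes the HF Llama implementation (so the transformer block class is `LlamaDecoderLayer`) and is an outline to be run under `torchrun`, not a complete training script:
```python
# Hedged sketch: wrap a Llama model with FSDP using mixed precision, a transformer
# auto-wrap policy, and shard-aware activation checkpointing (applied after wrapping).
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Mixed precision: parameters, gradient reduction and buffers in bf16.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Each LlamaDecoderLayer (attention + MLP) becomes one FSDP unit.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=mp_policy,
    device_id=torch.cuda.current_device(),
)

# Activation checkpointing is applied after FSDP wrapping, so it is shard aware.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```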