.DS_Store
__pycache__
.ipynb_checkpoints
wandb/
artifacts/
# Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.
This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <opensource-conduct@fb.com>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
# Contributing to llama-recipes
We want to make contributing to this project as easy and transparent as
possible.
## Pull Requests
We actively welcome your pull requests.
1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").
## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.
## License
By contributing to llama-recipes, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
## Tests
Llama-recipes currently comes with a basic set of unit tests (covering parts of the main training script and the training loop), but we strive to increase our test coverage in the future in order to mitigate silent errors.
When submitting a new feature PR please make sure to cover the newly added code with a unit test.
Run the tests locally to ensure the new feature does not break an old one.
We use **pytest** for our unit tests and to run them locally you need to install llama-recipes with optional [tests] dependencies enabled:
```
pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes[tests]
```
For development and contributing to llama-recipes please install from source with all optional dependencies:
```
pip install -U pip setuptools
pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .[tests,auditnlg,vllm]
```
The unit tests can be found in the [src/tests](./src/tests/) folder and you can run them from the main directory using:
```
python -m pytest src/tests/
```
To run all tests of a single file you can give the filename directly:
```
python -m pytest src/tests/test_finetuning.py
```
To run a specific test you can filter for its name with
```
python -m pytest src/tests/test_finetuning.py -k test_finetuning_peft
```
To add a new test simply create a new test file under the tests folder (filename has to start with `test_`).
Group tests spanning the same feature in the same file and create a subfolder if the tests are very extensive.
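As a minimal, hypothetical sketch of a new test file (the file and test names are placeholders; it assumes the training config dataclass is importable as `llama_recipes.configs.train_config`, with the defaults documented in this repo):
```python
# src/tests/test_my_feature.py -- hypothetical file name; it must start with `test_`
import pytest

from llama_recipes.configs import train_config  # assumes the training config dataclass is importable


def test_default_batch_size():
    # the default batch_size_training documented in the training config is 4
    cfg = train_config()
    assert cfg.batch_size_training == 4


@pytest.mark.parametrize("batch_size", [1, 2, 8])
def test_batch_size_override(batch_size):
    cfg = train_config(batch_size_training=batch_size)
    assert cfg.batch_size_training == batch_size
```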
# Llama Recipes: Examples to get started using the Llama models from Meta
<!-- markdown-link-check-disable -->
The 'llama-recipes' repository is a companion to the [Meta Llama 3](https://github.com/meta-llama/llama3) models. The goal of this repository is to provide a scalable library for fine-tuning Meta Llama models, along with some example scripts and notebooks to quickly get started with using the models in a variety of use cases, including fine-tuning for domain adaptation and building LLM-based applications with Meta Llama and other tools in the LLM ecosystem. The examples here showcase how to run Meta Llama locally, in the cloud, and on-prem. [Meta Llama 2](https://github.com/meta-llama/llama) is also supported in this repository. We highly recommend that everyone use [Meta Llama 3](https://github.com/meta-llama/llama3) due to its enhanced capabilities.
<!-- markdown-link-check-enable -->
> [!IMPORTANT]
> Meta Llama 3 has a new prompt template and special tokens (based on the tiktoken tokenizer).
> | Token | Description |
> |---|---|
> | `<\|begin_of_text\|>` | This is equivalent to the BOS token. |
> | `<\|end_of_text\|>` | This is equivalent to the EOS token. For multiturn conversations it is usually unused; instead, every message is terminated with `<\|eot_id\|>`. |
> | `<\|eot_id\|>` | This token signifies the end of a message in a turn, i.e. the end of a single message by a system, user or assistant role as shown below. |
> | `<\|start_header_id\|>{role}<\|end_header_id\|>` | These tokens enclose the role for a particular message. The possible roles are: system, user and assistant. |
>
> A multiturn-conversation with Meta Llama 3 follows this prompt template:
> ```
> <|begin_of_text|><|start_header_id|>system<|end_header_id|>
>
> {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
>
> {{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
>
> {{ model_answer_1 }}<|eot_id|><|start_header_id|>user<|end_header_id|>
>
> {{ user_message_2 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
> ```
> Each message is followed by an `<|eot_id|>` token before a new header starts, signaling a role change.
>
> More details on the new tokenizer and prompt template can be found [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3#special-tokens-used-with-meta-llama-3).

> [!NOTE]
> The llama-recipes repository was recently refactored to promote a better developer experience of using the examples. Some files have been moved to new locations. The `src/` folder has NOT been modified, so the functionality of this repo and package is not impacted.
>
> Make sure you update your local clone by running `git pull origin main`
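Returning to the Meta Llama 3 prompt template above: as a rough, non-authoritative sketch, the same format can be produced with the Hugging Face tokenizer's chat template (the model id below assumes you have access to the gated Meta Llama 3 Instruct checkpoint on the Hugging Face Hub):
```python
from transformers import AutoTokenizer

# assumes access to the gated Meta Llama 3 Instruct checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# renders <|begin_of_text|>, the <|start_header_id|>...<|end_header_id|> headers and the
# <|eot_id|> terminators, ending with the assistant header so the model continues the turn
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```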
## Table of Contents
- [Llama Recipes: Examples to get started using the Llama models from Meta](#llama-recipes-examples-to-get-started-using-the-llama-models-from-meta)
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [PyTorch Nightlies](#pytorch-nightlies)
- [Installing](#installing)
- [Install with pip](#install-with-pip)
- [Install with optional dependencies](#install-with-optional-dependencies)
- [Install from source](#install-from-source)
  - [Getting the Meta Llama models](#getting-the-meta-llama-models)
- [Model conversion to Hugging Face](#model-conversion-to-hugging-face)
- [Repository Organization](#repository-organization)
- [`recipes/`](#recipes)
- [`src/`](#src)
- [Contributing](#contributing)
- [License](#license)
## Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
### Prerequisites
#### PyTorch Nightlies
If you want to use PyTorch nightlies instead of the stable release, go to [this guide](https://pytorch.org/get-started/locally/) to retrieve the right `--extra-index-url URL` parameter for the `pip install` commands on your platform.
### Installing
Llama-recipes provides a pip distribution for easy install and usage in other projects. Alternatively, it can be installed from source.
> [!NOTE]
> Ensure you use the correct CUDA version (from `nvidia-smi`) when installing the PyTorch wheels. Here we are using 11.8 as `cu118`.
> H100 GPUs work better with CUDA >12.0
#### Install with pip
```
pip install llama-recipes
```
#### Install with optional dependencies
Llama-recipes offers the installation of optional packages. There are three optional dependency groups.
To run the unit tests we can install the required dependencies with:
```
pip install llama-recipes[tests]
```
For the vLLM example we need additional requirements that can be installed with:
```
pip install llama-recipes[vllm]
```
To use the sensitive topics safety checker install with:
```
pip install llama-recipes[auditnlg]
```
Optional dependencies can also be combined with [option1,option2].
#### Install from source
To install from source, e.g. for development, use the following commands. We use hatchling as our build backend, which requires up-to-date pip and setuptools packages.
```
git clone git@github.com:meta-llama/llama-recipes.git
cd llama-recipes
pip install -U pip setuptools
pip install -e .
```
For development and contributing to llama-recipes please install all optional dependencies:
```
git clone git@github.com:meta-llama/llama-recipes.git
cd llama-recipes
pip install -U pip setuptools
pip install -e .[tests,auditnlg,vllm]
```
### Getting the Meta Llama models
You can find Meta Llama models on Hugging Face hub [here](https://huggingface.co/meta-llama), **where models with `hf` in the name are already converted to Hugging Face checkpoints so no further conversion is needed**. The conversion step below is only for original model weights from Meta that are hosted on Hugging Face model hub as well.
#### Model conversion to Hugging Face
The recipes and notebooks in this folder use the Meta Llama model definition provided by Hugging Face's transformers library.
Given that the original checkpoint resides under `models/7B`, you can install all requirements and convert the checkpoint with:
```bash
## Install Hugging Face Transformers from source
pip freeze | grep transformers ## verify it is version 4.31.0 or higher
git clone git@github.com:huggingface/transformers.git
cd transformers
pip install protobuf
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
```
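Once converted, the checkpoint can be loaded with the standard transformers API; a minimal sketch (using the `/output/path` from the command above as the checkpoint location):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# /output/path is the --output_dir used in the conversion command above
model = AutoModelForCausalLM.from_pretrained("/output/path", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("/output/path")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```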
## Repository Organization
Most of the code dealing with Llama usage is organized across 2 main folders: `recipes/` and `src/`.
### `recipes/`
Contains examples organized in folders by topic:
| Subfolder | Description |
|---|---|
| [quickstart](./recipes/quickstart) | The "Hello World" of using Llama, start here if you are new to using Llama. |
| [finetuning](./recipes/finetuning) | Scripts to finetune Llama on single-GPU and multi-GPU setups |
| [inference](./recipes/inference) | Scripts to deploy Llama for inference locally and using model servers |
| [use_cases](./recipes/use_cases) | Scripts showing common applications of Meta Llama3 |
| [responsible_ai](./recipes/responsible_ai) | Scripts to use PurpleLlama for safeguarding model outputs |
| [llama_api_providers](./recipes/llama_api_providers) | Scripts to run inference on Llama via hosted endpoints |
| [benchmarks](./recipes/benchmarks) | Scripts to benchmark Llama models inference on various backends |
| [code_llama](./recipes/code_llama) | Scripts to run inference with the Code Llama models |
| [evaluation](./recipes/evaluation) | Scripts to evaluate fine-tuned Llama models using `lm-evaluation-harness` from `EleutherAI` |
### `src/`
Contains modules which support the example recipes:
| Subfolder | Description |
|---|---|
| [configs](src/llama_recipes/configs/) | Contains the configuration files for PEFT methods, FSDP, Datasets, Weights & Biases experiment tracking. |
| [datasets](src/llama_recipes/datasets/) | Contains individual scripts for each dataset to download and process it. |
| [inference](src/llama_recipes/inference/) | Includes modules for inference for the fine-tuned models. |
| [model_checkpointing](src/llama_recipes/model_checkpointing/) | Contains FSDP checkpoint handlers. |
| [policies](src/llama_recipes/policies/) | Contains FSDP scripts to provide different policies, such as mixed precision, transformer wrapping policy and activation checkpointing along with any precision optimizer (used for running FSDP with pure bf16 mode). |
| [utils](src/llama_recipes/utils/) | Utility files for:<br/> - `train_utils.py` provides training/eval loop and more train utils.<br/> - `dataset_utils.py` to get preprocessed datasets.<br/> - `config_utils.py` to override the configs received from CLI.<br/> - `fsdp_utils.py` provides FSDP wrapping policy for PEFT methods.<br/> - `memory_utils.py` context manager to track different memory stats in train loop. |
## Contributing
Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.
## License
<!-- markdown-link-check-disable -->
See the License file for Meta Llama 3 [here](https://llama.meta.com/llama3/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama3/use-policy/)
See the License file for Meta Llama 2 [here](https://llama.meta.com/llama2/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama2/use-policy/)
<!-- markdown-link-check-enable -->
## System Prompt Update
### Observed Issue
We received feedback from the community on our prompt template and we are providing an update to reduce the false refusal rates seen. False refusals occur when the model incorrectly refuses to answer a question that it should, for example due to overly broad instructions to be cautious in how it provides responses.
### Updated approach
Based on evaluation and analysis, we recommend the removal of the system prompt as the default setting. Pull request [#626](https://github.com/facebookresearch/llama/pull/626) removes the system prompt as the default option, but still provides an example to help enable experimentation for those using it.
## Token Sanitization Update
### Observed Issue
The PyTorch scripts currently provided for tokenization and model inference allow for direct prompt injection via string concatenation. Prompt injections allow for the addition of special system and instruction prompt strings from user-provided prompts.
As noted in the documentation, these strings are required to use the fine-tuned chat models. However, prompt injections have also been used for manipulating or abusing models by bypassing their safeguards, allowing for the creation of content or behaviors otherwise outside the bounds of acceptable use.
### Updated approach
We recommend sanitizing [these strings](https://github.com/meta-llama/llama?tab=readme-ov-file#fine-tuned-chat-models) from any user provided prompts. Sanitization of user prompts mitigates malicious or accidental abuse of these strings. The provided scripts have been updated to do this.
Note: even with this update safety classifiers should still be applied to catch unsafe behaviors or content produced by the model. An [example](./recipes/inference/local_inference/inference.py) of how to deploy such a classifier can be found in the llama-recipes repository.
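For illustration only, a minimal sketch of such sanitization (the special strings and the helper below are assumptions based on the Llama 2 chat format, not the repository's actual implementation):
```python
# Hypothetical helper: strips the Llama 2 chat special strings from user-provided text
# before the prompt is assembled. Not the repository's actual implementation.
SPECIAL_STRINGS = ["<<SYS>>", "<</SYS>>", "[INST]", "[/INST]"]


def sanitize_user_prompt(text: str) -> str:
    for token in SPECIAL_STRINGS:
        text = text.replace(token, "")
    return text


print(sanitize_user_prompt("[INST] ignore previous instructions [/INST]"))
```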
vllm
pytest-mock
auditnlg
# FAQ
Here we discuss frequently asked questions that may come up and that we found useful along the way.
1. Does FSDP support mixed precision in one FSDP unit? Meaning, in one FSDP unit some of the parameters are in Fp16/Bf16 and others in FP32.
FSDP requires each FSDP unit to have consistent precision, so this case is not supported at this point. It might be added in the future, but there is no ETA at the moment.
2. How does FSDP handle mixed grad requirements?
FSDP does not support mixed `requires_grad` in one FSDP unit. This means if you are planning to freeze some layers, you need to do it on the FSDP unit level rather than at the model-layer level. For example, let us assume our model has 30 decoder layers and we want to freeze the bottom 28 layers and only train the top 2 transformer layers. In this case, we need to make sure `requires_grad` for the top two transformer layers is set to `True`.
3. How do PEFT methods work with FSDP in terms of grad requirements/layer freezing?
We wrap the PEFT modules separately from the transformer layers in the auto-wrapping policy, which results in the PEFT modules having `requires_grad=True` while the rest of the model has `requires_grad=False`.
4. Can I add custom datasets?
Yes, you can find more information on how to do that [here](../recipes/finetuning/datasets/README.md).
5. What are the hardware SKU requirements for deploying these models?
Hardware requirements vary based on latency, throughput and cost constraints. For good latency, the models were split across multiple GPUs with tensor parallelism in a machine with NVIDIA A100s or H100s. But TPUs, other types of GPUs like A10G, T4, L4, or even commodity hardware can also be used to deploy these models (e.g. https://github.com/ggerganov/llama.cpp).
If working on a CPU, it is worth looking at this [blog post](https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html) from Intel for an idea of Llama 2's performance on a CPU.
6. What are the hardware SKU requirements for fine-tuning Llama pre-trained models?
Fine-tuning requirements vary based on amount of data, time to complete fine-tuning and cost constraints. To fine-tune these models we have generally used multiple NVIDIA A100 machines with data parallelism across nodes and a mix of data and tensor parallelism intra node. But using a single machine, or other GPU types like NVIDIA A10G or H100 are definitely possible (e.g. alpaca models are trained on a single RTX4090: https://github.com/tloen/alpaca-lora).
7. How to handle CUDA memory fragmentations during fine-tuning that may lead into an OOM?
In some cases you may experience that after model checkpointing, especially with FSDP (this usually does not happen with PEFT methods), the reserved and allocated CUDA memory has increased. This might be due to CUDA memory fragmentation. PyTorch recently added an environment variable that helps to better manage memory fragmentation (this feature is available on PyTorch nightlies at the time of writing this doc, July 30 2023). You can set this in your main training script as follows:
```python
import os

os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
```
We also added this environment variable in `setup_environ_flags` of the [train_utils.py](../src/llama_recipes/utils/train_utils.py); feel free to uncomment it if required.
8. Additional debugging flags?
The environment variable `TORCH_DISTRIBUTED_DEBUG` can be used to trigger additional useful logging and collective synchronization checks to ensure all ranks are synchronized appropriately. `TORCH_DISTRIBUTED_DEBUG` can be set to either OFF (default), INFO, or DETAIL depending on the debugging level required. Please note that the most verbose option, DETAIL, may impact application performance and should therefore only be used when debugging issues.
We also added this environment variable in `setup_environ_flags` of the [train_utils.py](../src/llama_recipes/utils/train_utils.py); feel free to uncomment it if required.
9. I am getting import errors when running inference.
Verify that the CUDA environment variables are set correctly on your machine. For example, for bitsandbytes you can generally set them as below to get things working on A100 80GB GPUs on AWS.
```bash
export CUDA_HOME="/usr/local/cuda-11.8"
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib:$CUDA_HOME/lib64:$CUDA_HOME/efa/lib:/opt/amazon/efa/lib:$LD_LIBRARY_PATH
```
## LLM Fine-Tuning
Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
## 1. **Parameter Efficient Model Fine-Tuning**
This approach helps make fine-tuning affordable even on a single consumer-grade GPU. These methods keep the whole model frozen and add only tiny learnable parameters/layers to it, so we train just a very small portion of the parameters. The best-known methods in this category are [LoRA](https://arxiv.org/pdf/2106.09685.pdf), Llama Adapter and prefix-tuning.
These methods will address three aspects:
- **Cost of full fine-tuning** – these methods only train a small set of extra parameters instead of the full model, which makes it possible to run them on consumer GPUs.
- **Cost of deployment** – with full fine-tuning we would need to deploy a separate model for each downstream task; with these methods, only a small set of extra parameters (a few MB instead of several GB) is needed per task. The pretrained model acts as a shared backbone and the extra parameters act as task-specific heads.
- **Catastrophic forgetting** – these methods also help mitigate the forgetting of previously learned capabilities that can happen during fine-tuning.
The Hugging Face [PEFT](https://github.com/huggingface/peft) library provides an easy way to use these methods, and we make use of it here; a sketch follows below. Please read more [here](https://huggingface.co/blog/peft).
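As a minimal sketch (the model id and hyperparameters are illustrative placeholders, not this repository's defaults), applying LoRA with the PEFT library looks roughly like this:
```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# placeholder model id; any causal LM checkpoint you have access to works
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the weights is trainable
```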
## 2. **Full/ Partial Parameter Fine-Tuning**
Full parameter fine-tuning has its own advantages; within this approach there are multiple strategies that can help:
- Keep the pretrained model frozen and only fine-tune the task head, for example a classifier.
- Keep the pretrained model frozen and add a few fully connected layers on top.
- Fine-tune all the layers.
You can also keep most of the layers frozen and only fine-tune a few layers. There are many different techniques to choose from to freeze/unfreeze layers based on different criteria.
<div style="display: flex;">
<img src="./images/feature-based_FN.png" alt="Image 1" width="250" />
<img src="./images/feature-based_FN_2.png" alt="Image 2" width="250" />
<img src="./images/full-param-FN.png" alt="Image 3" width="250" />
</div>
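A simple sketch of partial fine-tuning (assuming a Hugging Face Llama model, where the decoder layers live under `model.model.layers`; the model id is a placeholder): freeze everything, then unfreeze only the top layers.
```python
import torch
from transformers import AutoModelForCausalLM

# placeholder model id; any Llama-style causal LM works the same way
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# freeze everything first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze only the top two decoder layers and the LM head
for layer in model.model.layers[-2:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```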
In this scenario, depending on the model size, you might need to go beyond one GPU, especially if your model does not fit onto one GPU for training. For example, the Meta Llama 3 8B model will not fit onto a single GPU for full fine-tuning.
The way to think about it is that you need enough GPU memory to keep the model parameters, gradients and optimizer states, where each of these, depending on the precision you are training in, can take up multiples of your parameter count times the bytes per parameter (fp32 = 4 bytes, fp16 = 2 bytes, bf16 = 2 bytes).
For example, the AdamW optimizer keeps two extra states for each of your parameters, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training/unfreezing, your memory requirements can grow beyond one GPU; see the sketch below.
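A back-of-the-envelope sketch of this accounting (the byte counts are the only inputs; activations and buffers are ignored, so treat the result as a rough lower bound):
```python
def training_memory_gb(num_params: float,
                       param_bytes: int = 2,    # bf16 parameters
                       grad_bytes: int = 2,     # bf16 gradients
                       optim_bytes: int = 8):   # AdamW: two fp32 states per parameter
    """Rough lower bound for training memory; ignores activations, buffers and fragmentation."""
    return num_params * (param_bytes + grad_bytes + optim_bytes) / 1e9


# Meta Llama 3 8B: parameters + gradients + AdamW states alone already need ~96 GB
print(f"{training_memory_gb(8e9):.0f} GB")
```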
**FSDP (Fully Sharded Data Parallel)**
PyTorch provides the FSDP package for training models that do not fit onto one GPU. FSDP lets you train a much larger model with the same amount of resources. Before FSDP there was DDP (Distributed Data Parallel), where each GPU held a full replica of the model and only the data was sharded; at the end of the backward pass the gradients were synchronized.
FSDP extends this idea by sharding not only the data but also the model parameters, gradients and optimizer states. This means each GPU will only keep one shard of the model. This results in huge memory savings that enable us to fit a much larger model into the same number of GPUs. As an example, in DDP the most you could fit into a GPU with 16GB memory is a model of around 700M parameters. So even with access to 4 such GPUs, you still could not scale beyond the model size that fits onto one GPU. However, with FSDP you can fit a 3B model onto 4 GPUs, a more than 4x larger model.
Please read more on FSDP [here](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) & get started with FSDP [here](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html).
To boost the performance of fine-tuning with FSDP, we can make use of a number of features such as:
- **Mixed Precision**, which in FSDP is much more flexible compared to Autocast. It gives the user control over setting the precision for model parameters, buffers and gradients.
- **Activation Checkpointing**, which is a technique to save memory by discarding the intermediate activations in the forward pass instead of keeping them in memory, at the cost of recomputing them in the backward pass. FSDP activation checkpointing is shard-aware, meaning we need to apply it after wrapping the model with FSDP; our scripts make use of that.
- **auto_wrap_policy**, which is the way to specify how FSDP should partition the model; there is default support for a transformer wrapping policy. This allows FSDP to form each FSDP unit (a partition of the model) based on the transformer class in the model. To identify this layer in the model, you need to look at the layer that wraps both the attention layer and the MLP. This helps FSDP form more fine-grained units for communication, which helps optimize the communication cost.
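A condensed sketch of how these pieces fit together for a Hugging Face Llama model (it assumes `torchrun` has already initialized the process group and that you have GPU access; the recipes' own `fsdp_utils.py` and `policies/` modules build these objects for you):
```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# placeholder model id; assumes access to the gated checkpoint
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# each LlamaDecoderLayer (attention + MLP) becomes one FSDP unit
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

# bf16 for parameters, gradient reduction and buffers
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=bf16_policy,
    device_id=torch.cuda.current_device(),
)
```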
# Fine-tuning with Multi GPU
To run fine-tuning on multi-GPUs, we will make use of two packages:
1. [PEFT](https://huggingface.co/blog/peft) methods, in particular using the Hugging Face [PEFT](https://github.com/huggingface/peft) library.
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
With the combination of PEFT and FSDP, we are able to fine-tune a Meta Llama 3 8B model on multiple GPUs in one node or across multiple nodes.
## Requirements
To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
**Please note that the llama_recipes package will install PyTorch version 2.0.1. In case you want to run FSDP + PEFT, please make sure to install the PyTorch nightlies.**
## How to run it
Get access to a machine with multiple GPUs (in this case we tested with 4 A100s and A10s).
This runs with the `samsum_dataset` for the summarization application by default.
**Multiple GPUs one node**:
**NOTE** please make sure to use PyTorch nightlies for using PEFT+FSDP. Also, note that int8 quantization from bitsandbytes is currently not supported in FSDP.
```bash
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
```
The args used in the command above are:
* `--enable_fsdp` boolean flag to enable FSDP in the script
* `--use_peft` boolean flag to enable PEFT methods in the script
* `--peft_method` to specify the PEFT method; here we use `lora`, another option is `llama_adapter`.
We use `torchrun` here to spawn multiple processes for FSDP.
## Flash Attention and Xformer Memory Efficient Kernels
Setting `use_fast_kernels` will enable the use of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This speeds up the fine-tuning job. This has been enabled in the `optimum` library from Hugging Face as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
```bash
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model --use_fast_kernels
```
### Fine-tuning using FSDP Only
If interested in running full parameter finetuning without making use of PEFT methods, please use the following command. Make sure to change the `nproc_per_node` to your available GPUs. This has been tested with `BF16` on 8xA100, 40GB GPUs.
```bash
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
```
### Fine-tuning using FSDP on 70B Model
If you are interested in running full parameter fine-tuning on the 70B model, you can enable `low_cpu_fsdp` mode with the following command. This option will load the model on rank 0 only before moving it to the devices to construct FSDP. This can dramatically save CPU memory when loading large models like the 70B (on an 8-GPU node, this reduces CPU memory from over 2 TB to 280 GB for the 70B model). This has been tested with `BF16` on 16xA100, 80GB GPUs.
```bash
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
```
**Multi GPU multi node**:
Here we use a slurm script to schedule a job with slurm over multiple nodes.
```bash
sbatch examples/multi_node.slurm
# Change the number of nodes and GPUs per node in the script before running.
```
## How to run with different datasets?
Currently 4 datasets are supported that can be found in [Datasets config file](../src/llama_recipes/configs/datasets.py).
* `grammar_dataset` : use this [notebook](../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `alpaca_dataset` : to get this open source data, please download `alpaca_data.json` into the dataset folder:
```bash
wget -P src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
* `samsum_dataset`
To run with each of the datasets set the `dataset` flag in the command as shown below:
```bash
# grammar_dataset
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# alpaca_dataset
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# samsum_dataset
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
```
## Where to configure settings?
* [Training config file](../src/llama_recipes/configs/training.py) is the main config file that helps to specify the settings for our run and can be found in [configs folder](../src/llama_recipes/configs/)
It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
```python
model_name: str="PATH/to/Model"
tokenizer_name: str=None
enable_fsdp: bool=False
low_cpu_fsdp: bool=False
run_validation: bool=True
batch_size_training: int=4
batching_strategy: str="packing" #alternative: padding
context_length: int=4096
gradient_accumulation_steps: int=1
gradient_clipping: bool = False
gradient_clipping_threshold: float = 1.0
num_epochs: int=3
max_train_step: int=0
max_eval_step: int=0
num_workers_dataloader: int=1
lr: float=1e-4
weight_decay: float=0.0
gamma: float= 0.85
seed: int=42
use_fp16: bool=False
mixed_precision: bool=True
val_batch_size: int=1
dataset = "samsum_dataset"
peft_method: str = "lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
use_peft: bool=False
from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
output_dir: str = "PATH/to/save/PEFT/model"
freeze_layers: bool = False
num_freeze_layers: int = 1
quantization: bool = False
one_gpu: bool = False
save_model: bool = True
dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
save_optimizer: bool=False # will be used if using FSDP
use_fast_kernels: bool = False # Enable using SDPA from PyTorch Accelerated Transformers, makes use of Flash Attention and Xformer memory-efficient kernels
use_wandb: bool = False # Enable wandb for experiment tracking
save_metrics: bool = False # saves training metrics to a json file for later plotting
flop_counter: bool = False # Enable flop counter to measure model throughput, cannot be used with the PyTorch profiler at the same time.
flop_counter_start: int = 3 # The step to start counting flops, default is 3, which means the counter starts after 3 warm-up steps.
use_profiler: bool = False # Enable the PyTorch profiler, cannot be used with the flop counter at the same time.
profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
```
* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
* [peft config file](../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
* [FSDP config file](../src/llama_recipes/configs/fsdp.py) provides FSDP settings such as:
* `mixed_precision` boolean flag to specify using mixed precision, defaults to True.
* `use_fp16` boolean flag to specify using FP16 for mixed precision, defaults to False. We recommend not setting this flag and only setting `mixed_precision`, which will use `BF16`; this helps with speed and memory savings while avoiding the challenges of scaler accuracy with `FP16`.
* `sharding_strategy` this specifies the sharding strategy for FSDP, it can be:
* `FULL_SHARD` that shards model parameters, gradients and optimizer states, results in the most memory savings.
* `SHARD_GRAD_OP` that shards gradients and optimizer states and keeps the parameters after the first `all_gather`. This reduces communication overhead, especially on slower networks, and is particularly beneficial in multi-node cases. It comes with the trade-off of higher memory consumption.
* `NO_SHARD` this is equivalent to DDP; it does not shard model parameters, gradients or optimizer states. It keeps the full parameters after the first `all_gather`.
* `HYBRID_SHARD` available on PyTorch nightlies. It does FSDP within a node and DDP between nodes. It is meant for multi-node cases and is helpful on slower networks, given that your model fits into one node.
* `checkpoint_type` specifies the state dict checkpoint type for saving the model. `FULL_STATE_DICT` streams the state_dict of each model shard from every rank to the CPU and assembles the full state_dict on the CPU. `SHARDED_STATE_DICT` saves one checkpoint per rank and enables re-loading the model with a different world size.
* `fsdp_activation_checkpointing` enables activation checkpointing for FSDP; this saves a significant amount of memory with the trade-off of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommend you use this option.
* `pure_bf16` moves the model to `BFloat16`, and if `optimizer` is set to `anyprecision`, the optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.
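For reference, the sharding options above correspond to PyTorch's `ShardingStrategy` enum; a small sketch of how a config value could be mapped onto it (the dictionary is illustrative, not the repository's code):
```python
from torch.distributed.fsdp import ShardingStrategy

# illustrative mapping from the fsdp config's string value to the PyTorch enum
SHARDING_STRATEGIES = {
    "FULL_SHARD": ShardingStrategy.FULL_SHARD,        # shard params, grads and optimizer states
    "SHARD_GRAD_OP": ShardingStrategy.SHARD_GRAD_OP,  # keep params after the first all_gather
    "NO_SHARD": ShardingStrategy.NO_SHARD,            # equivalent to DDP
    "HYBRID_SHARD": ShardingStrategy.HYBRID_SHARD,    # FSDP within a node, DDP across nodes
}

print(SHARDING_STRATEGIES["FULL_SHARD"])
```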
## FLOPS Counting and PyTorch Profiling
To help with the benchmarking effort, we are adding support for counting the FLOPS during the fine-tuning process. You can achieve this by setting `--flop_counter` when launching your single/multi-GPU fine-tuning. Use `--flop_counter_start` to choose at which step to start counting the FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, in order to use the PyTorch profiler, `--max_train_step` has to be greater than 6. The PyTorch profiler can be helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
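For reference, the schedule described above corresponds to the following PyTorch profiler setup (a minimal sketch; the output path is a placeholder and the training step is elided):
```python
import torch

prof = torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=2, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("PATH/to/save/profiler/results"),
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    profile_memory=True,
)

with prof:
    for step in range(7):      # more than 6 steps: wait(1) + warmup(2) + active(3)
        # ... one training step would go here ...
        prof.step()            # advance the profiling schedule
```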