Commit 5eaaba41 authored by Rayyyyy

First add in 0524
# Finetuning Llama
This folder contains instructions to fine-tune Meta Llama 3 on a
* [single-GPU setup](./singlegpu_finetuning.md)
* [multi-GPU setup](./multigpu_finetuning.md)
using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
If you are new to fine-tuning techniques, check out [this overview](./LLM_finetuning_overview.md).
> [!TIP]
> If you want to try finetuning Meta Llama 3 with Huggingface's trainer, here is a Jupyter notebook with an [example](./huggingface_trainer/peft_finetuning.ipynb)
## How to configure finetuning settings?
> [!TIP]
> All the settings defined in the [config files](../../src/llama_recipes/configs/) can be passed as CLI args when running the script; there is no need to modify the config files directly (see the example command at the end of this section).
* [Training config file](../../src/llama_recipes/configs/training.py) is the main config file that helps to specify the settings for our run and can be found in [configs folder](../../src/llama_recipes/configs/)
It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
```python
model_name: str="PATH/to/Model"
tokenizer_name: str=None
enable_fsdp: bool=False
low_cpu_fsdp: bool=False
run_validation: bool=True
batch_size_training: int=4
batching_strategy: str="packing" #alternative: padding
context_length: int=4096
gradient_accumulation_steps: int=1
gradient_clipping: bool = False
gradient_clipping_threshold: float = 1.0
num_epochs: int=3
max_train_step: int=0
max_eval_step: int=0
num_workers_dataloader: int=1
lr: float=1e-4
weight_decay: float=0.0
gamma: float= 0.85
seed: int=42
use_fp16: bool=False
mixed_precision: bool=True
val_batch_size: int=1
dataset = "samsum_dataset"
peft_method: str = "lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
use_peft: bool=False
from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
output_dir: str = "PATH/to/save/PEFT/model"
freeze_layers: bool = False
num_freeze_layers: int = 1
quantization: bool = False
one_gpu: bool = False
save_model: bool = True
dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
save_optimizer: bool=False # will be used if using FSDP
use_fast_kernels: bool = False # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
use_wandb: bool = False # Enable wandb for experiment tracking
save_metrics: bool = False # saves training metrics to a json file for later plotting
flop_counter: bool = False # Enable flop counter to measure model throughput, cannot be used with the PyTorch profiler at the same time.
flop_counter_start: int = 3 # The step to start profiling, default is 3, which means after 3 warm-up steps the profiler will start to count flops.
use_profiler: bool = False # Enable PyTorch profiler, cannot be used with the flop counter at the same time.
profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
```
* [Datasets config file](../../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
* [peft config file](../../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified. We currently support LoRA and Llama-Adapter. Please note that LoRA is the only technique which is supported in combination with FSDP.
* [FSDP config file](../../src/llama_recipes/configs/fsdp.py) provides FSDP settings such as:
    * `mixed_precision` boolean flag to specify using mixed precision, defaults to True.
    * `use_fp16` boolean flag to specify using FP16 for mixed precision, defaults to False. We recommend not setting this flag and only setting `mixed_precision`, which will use `BF16`; this helps with speed and memory savings while avoiding the challenges of scaler accuracy with `FP16`.
    * `sharding_strategy` this specifies the sharding strategy for FSDP, it can be:
        * `FULL_SHARD` that shards model parameters, gradients and optimizer states, results in the most memory savings.
        * `SHARD_GRAD_OP` that shards gradients and optimizer states and keeps the parameters after the first `all_gather`. This reduces communication overhead, especially if you are using slower networks, and is more specifically beneficial in multi-node cases. This comes with the trade-off of higher memory consumption.
        * `NO_SHARD` this is equivalent to DDP, it does not shard model parameters, gradients or optimizer states. It keeps the full parameter after the first `all_gather`.
        * `HYBRID_SHARD` available on PyTorch nightlies. It does FSDP within a node and DDP between nodes. It's for multi-node cases and helpful for slower networks, provided your model fits into one node.
    * `checkpoint_type` specifies the state dict checkpoint type for saving the model. `FULL_STATE_DICT` streams the state_dict of each model shard from a rank to CPU and assembles the full state_dict on CPU. `SHARDED_STATE_DICT` saves one checkpoint per rank and enables re-loading the model in a different world size.
    * `fsdp_activation_checkpointing` enables activation checkpointing for FSDP, this saves a significant amount of memory with the trade-off of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommend you use this option.
    * `pure_bf16` it moves the model to `BFloat16` and if `optimizer` is set to `anyprecision` then optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.
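As an example, the training settings above map directly to CLI flags. A minimal sketch, assuming placeholder model and output paths:
```bash
python -m llama_recipes.finetuning \
    --model_name /path/to/Llama-3-8B \
    --use_peft --peft_method lora \
    --batch_size_training 2 \
    --num_epochs 1 \
    --lr 1e-5 \
    --output_dir /path/to/save/PEFT/model
```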
## Weights & Biases Experiment Tracking
You can enable [W&B](https://wandb.ai/) experiment tracking by using `use_wandb` flag as below. You can change the project name, entity and other `wandb.init` arguments in `wandb_config`.
```bash
python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
```
You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below.
<div style="display: flex;">
<img src="../../docs/images/wandb_screenshot.png" alt="wandb screenshot" width="500" />
</div>
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we add support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose at which step to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler can be helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
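For instance, a sketch with placeholder paths showing the two modes (the flags are the ones listed in the training config above):
```bash
# FLOPS counting with a 3-step warm-up
python -m llama_recipes.finetuning --use_peft --peft_method lora \
    --model_name /path/to/model --output_dir /path/to/save/PEFT/model \
    --flop_counter --flop_counter_start 3 --max_train_step 10

# PyTorch profiler traces instead (cannot be combined with --flop_counter; max_train_step must exceed 6)
python -m llama_recipes.finetuning --use_peft --peft_method lora \
    --model_name /path/to/model --output_dir /path/to/save/PEFT/model \
    --use_profiler --profiler_dir /path/to/profiler/output --max_train_step 10
```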
# Datasets and Evaluation Metrics
The provided fine-tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the [`recipes/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py). Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.
## Batching Strategies
Llama-recipes supports two strategies to batch requests together.
The default setting is `packing` which concatenates the tokenized samples into long sequences filling up the context length of the model.
This is the most compute-efficient variant as it avoids any padding and all sequences have the same length.
Samples at the boundary of the context length are truncated and the remainder of the cut sequence is used as the start of the next long sequence.
If the amount of training data is small, this procedure might introduce a lot of noise into the training data, which can hurt the prediction performance of the fine-tuned model.
Therefore, we also support a `padding` strategy which does not introduce the additional noise due to truncated sequences.
The strategy tries to minimize the efficiency loss by batching samples of similar length together so only minimal padding is necessary.
The batching strategy can be selected through the command line parameter `--batching_strategy [packing]/[padding]`.
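For example (placeholder paths), a PEFT run switched from the default `packing` to `padding`:
```bash
python -m llama_recipes.finetuning --use_peft --peft_method lora \
    --model_name /path/to/model --output_dir /path/to/save/PEFT/model \
    --batching_strategy padding
```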
## Using custom datasets
The list of available datasets in llama-recipes is supposed to give users a quick start on training their Llama model.
There are two possible ways to use a custom dataset.
The first is to provide a function returning the dataset in a .py file which can be given to the command line tool.
This does not involve changing the source code of llama-recipes.
The second way targets contributions which extend llama-recipes, as it involves changing the source code.
### Training on custom data
To supply a custom dataset you need to provide a single .py file which contains a function with the following signature:
```python
def get_custom_dataset(dataset_config, tokenizer, split: str):
```
For an example `get_custom_dataset` you can look at the provided datasets in llama_recipes.datasets or [examples/custom_dataset.py](custom_dataset.py).
The `dataset_config` in the above signature will be an instance of llama_recipes.configs.dataset.custom_dataset with the modifications made through the command line.
The split signals whether to return the training or validation dataset.
The default function name is `get_custom_dataset` but this can be changed as described below.
In order to start a training with the custom dataset we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.
```bash
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
```
To change the function name that is used in the .py you can append the name following a `:` like this:
```bash
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
```
This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.
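As an illustration, a minimal custom dataset module might look like the sketch below; the file name `my_dataset.py` and the use of the samsum dataset are assumptions made for this example only, and the provided [custom_dataset.py](custom_dataset.py) remains the full reference:
```python
# my_dataset.py -- minimal sketch of a custom dataset module (hypothetical example)
import datasets

def get_custom_dataset(dataset_config, tokenizer, split: str):
    # Load any Hugging Face dataset; samsum is used here purely for illustration.
    dataset = datasets.load_dataset("samsum", split=split)

    def tokenize(sample):
        text = f"Summarize this dialog:\n{sample['dialogue']}\n---\nSummary:\n{sample['summary']}{tokenizer.eos_token}"
        tokens = tokenizer(text)
        # Train on the full sequence: labels mirror input_ids.
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    return dataset.map(tokenize, remove_columns=list(dataset.features))
```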
### Adding new dataset
Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
Additionally, there is a preprocessing function for each dataset in the [datasets](../../../src/llama_recipes/datasets) folder.
The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling ```model(**data)```.
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
To add a custom dataset, the following steps need to be performed (a minimal sketch of the step 1 configuration follows this list).
1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../../../src/llama_recipes/utils/dataset_utils.py)
4. Set dataset field in training config to dataset name or use --dataset option of the `llama_recipes.finetuning` module or examples/finetuning.py training script.
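A minimal sketch of the step 1 configuration; the field names are modeled on the existing configs in configs/datasets.py and the class name is hypothetical:
```python
# Hypothetical entry added to src/llama_recipes/configs/datasets.py
from dataclasses import dataclass

@dataclass
class my_new_dataset:
    dataset: str = "my_new_dataset"
    train_split: str = "train"
    test_split: str = "validation"
```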
## Application
Below we list other datasets and their main use cases that can be used for fine tuning.
### Q&A these can be used for evaluation as well
- [MMLU](https://huggingface.co/datasets/lukaemon/mmlu/viewer/astronomy/validation)
- [BoolQ](https://huggingface.co/datasets/boolq)
- [NarrativeQA](https://huggingface.co/datasets/narrativeqa)
- [NaturalQuestions](https://huggingface.co/datasets/natural_questions) (closed-book)
- [NaturalQuestions](https://huggingface.co/datasets/openbookqa) (open-book)
- [QuAC](https://huggingface.co/datasets/quac)
- [HellaSwag](https://huggingface.co/datasets/hellaswag)
- [OpenbookQA](https://huggingface.co/datasets/openbookqa)
- [TruthfulQA](https://huggingface.co/datasets/truthful_qa) (can be helpful for fact checking / evaluating misinformation of the model)
### Instruction finetuning
- [Alpaca](https://huggingface.co/datasets/yahma/alpaca-cleaned) 52k instruction tuning
- [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) 15k instruction tuning
### Simple text generation for quick tests
- [English quotes](https://huggingface.co/datasets/Abirate/english_quotes) 2,508 quotes; multi-label text classification, text generation
### Reasoning (used mostly for evaluation of LLMs)
- [bAbI](https://research.facebook.com/downloads/babi/)
- [Dyck](https://huggingface.co/datasets/dyk)
- [GSM8K](https://huggingface.co/datasets/gsm8k)
- [MATH](https://github.com/hendrycks/math)
- [APPS](https://huggingface.co/datasets/codeparrot/apps)
- [HumanEval](https://huggingface.co/datasets/openai_humaneval)
- [LSAT](https://huggingface.co/datasets/dmayhem93/agieval-lsat-ar)
- [Entity matching](https://huggingface.co/datasets/lighteval/EntityMatching)
### Toxicity evaluation
- [Real_toxic_prompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts)
### Bias evaluation
- [Crows_pair](https://huggingface.co/datasets/crows_pairs) gender bias
- WinoGender gender bias
### Useful Links
More information on evaluation datasets can be found at [HELM](https://crfm.stanford.edu/helm/latest/).
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# For dataset details visit: https://huggingface.co/datasets/samsum
import copy
import datasets
import itertools
B_INST, E_INST = "[INST]", "[/INST]"
def tokenize_dialog(dialog, tokenizer):
    if tokenizer.vocab_size >= 128000:
        dialog_tokens = tokenizer.apply_chat_template(dialog)
        dialog_tokens = dialog_tokens[:-4] # Remove generation prompt <|start_header_id|>assistant<|end_header_id|>\n\n
        eot_indices = [i for i,n in enumerate(dialog_tokens) if n == 128009]
        labels = copy.copy(dialog_tokens)
        last_idx = 0
        for n, idx in enumerate(eot_indices):
            if n % 2 == 1:
                last_idx = idx
            else:
                labels[last_idx:idx+1] = [-100] * (idx-last_idx+1)
        dialog_tokens = [dialog_tokens]
        labels_tokens = [labels]
    else:
        prompt_tokens = [tokenizer.encode(f"{tokenizer.bos_token}{B_INST} {(prompt['content']).strip()} {E_INST}", add_special_tokens=False) for prompt in dialog[::2]]
        answer_tokens = [tokenizer.encode(f"{answer['content'].strip()} {tokenizer.eos_token}", add_special_tokens=False) for answer in dialog[1::2]]
        dialog_tokens = list(itertools.chain.from_iterable(zip(prompt_tokens, answer_tokens)))
        #Add labels, convert prompt token to -100 in order to ignore in loss function
        labels_tokens = [len(c)*[-100,] if i % 2 == 0 else c for i,c in enumerate(dialog_tokens)]
    combined_tokens = {
        "input_ids": list(itertools.chain(*(t for t in dialog_tokens))),
        "labels": list(itertools.chain(*(t for t in labels_tokens))),
    }
    return dict(combined_tokens, attention_mask=[1]*len(combined_tokens["input_ids"]))
def get_custom_dataset(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("OpenAssistant/oasst1", split=split)
    dataset = dataset.map(lambda sample: {
        "message_id": sample["message_id"],
        "parent_id": sample["parent_id"],
        "text": sample["text"],
        },
        batched=True,
        remove_columns=list(dataset.features),)
    nodes = {}
    messages = {}
    root_ids = []
    for data in dataset:
        if data["parent_id"]:
            nodes[data["parent_id"]] = nodes.get(data["parent_id"], []) + [data["message_id"]]
        else:
            root_ids.append(data["message_id"])
        messages[data["message_id"]]=data["text"]
    def follow(thread, current_id):
        thread = copy.copy(thread) + [messages[current_id]]
        if current_id in nodes:
            new_threads = []
            for next_id in nodes[current_id]:
                new_threads += follow(thread, next_id)
            return new_threads
        else:
            return [thread]
    def get_threads_from_root(root_id):
        all_threads = []
        thread = [messages[root_id]]
        for cid in nodes[root_id]:
            all_threads += follow(thread, cid)
        return all_threads
    dataset = dataset.filter(lambda x: x["message_id"] in root_ids)
    dataset = dataset.map(lambda x: {"thread": get_threads_from_root(x["message_id"])}, remove_columns=list(dataset.features))
    dataset = dataset.map(lambda x: {"thread": [i for row in x["thread"] for i in row]}, batched=True)
    def to_dialog(thread):
        dialog = []
        for i, content in enumerate(thread):
            dialog.append({
                "role": "user" if i % 2 == 0 else "assistant",
                "content": content,
            })
        return {"dialog": dialog}
    dataset = dataset.map(lambda x: to_dialog(x["thread"]), remove_columns=list(dataset.features))
    dataset = dataset.map(lambda x: tokenize_dialog(x["dialog"], tokenizer), remove_columns=list(dataset.features))
    return dataset
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import fire
from llama_recipes.finetuning import main
if __name__ == "__main__":
fire.Fire(main)
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the GNU General Public License version 3.
#SBATCH --job-name=Nano-2d-trainer-20b-8nodes
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --gpus-per-task=4
#SBATCH --partition=train
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Enable for A100
export FI_PROVIDER="efa"
echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# debugging flags (optional)
export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=0
# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="ens"
export FI_EFA_USE_DEVICE_RDMA=1
srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora
# Fine-tuning with Multi GPU
This recipe steps you through fine-tuning a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single node or across multiple nodes.
## Requirements
Ensure that you have installed the llama-recipes package ([details](../../README.md#installing)).
We will also need 2 packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).
> [!NOTE]
> The llama-recipes package will install PyTorch 2.0.1. In case you want to use FSDP with PEFT for multi-GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies))
>
> INT8 quantization is not currently supported in FSDP
## How to run it
Get access to a machine with multiple GPUs (in this case we tested with 4 A100s and A10s).
### With FSDP + PEFT
<details open>
<summary>Single-node Multi-GPU</summary>
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
</details>
<details>
<summary>Multi-node Multi-GPU</summary>
Here we use a slurm script to schedule a job with slurm over multiple nodes.
# Change the num nodes and GPU per nodes in the script before running.
sbatch ./multi_node.slurm
</details>
We use `torchrun` to spawn multiple processes for FSDP.
The args used in the command above are:
* `--enable_fsdp` boolean flag to enable FSDP in the script
* `--use_peft` boolean flag to enable PEFT methods in the script
* `--peft_method` to specify the PEFT method; here we use `lora`, other options are `llama_adapter` and `prefix`.
### With only FSDP
If interested in running full parameter finetuning without making use of PEFT methods, please use the following command. Make sure to change the `nproc_per_node` to your available GPUs. This has been tested with `BF16` on 8xA100, 40GB GPUs.
```bash
torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
```
### Using less CPU memory (FSDP on 70B model)
If you are running full parameter fine-tuning on the 70B model, you can enable the `low_cpu_fsdp` mode as in the following command. This option loads the model on rank 0 only before moving it to the devices to construct FSDP. This can dramatically save CPU memory when loading large models like the 70B (on an 8-GPU node, this reduces CPU memory from 2+TB to 280GB for the 70B model). This has been tested with `BF16` on 16xA100, 80GB GPUs.
```bash
torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
```
## Running with different datasets
Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `alpaca_dataset` : to get this open source data please download the `alpaca.json` to the `dataset` folder.
```bash
wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
* `samsum_dataset`
To run with each of the datasets set the `dataset` flag in the command as shown below:
```bash
# grammar_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# alpaca_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# samsum_dataset
torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
```
## [TIP] Slow interconnect between nodes?
In case you are dealing with a slower interconnect network between nodes, you can make use of the `--hsdp` flag to reduce the communication overhead.
HSDP (Hybrid Sharded Data Parallel) defines a hybrid sharding strategy where you use FSDP within `sharding_group_size`, which can be the minimum number of GPUs your model fits into, and DDP between the replicas of the model specified by `replica_group_size`.
This requires setting the sharding strategy in the [fsdp config](../../src/llama_recipes/configs/fsdp.py) to `ShardingStrategy.HYBRID_SHARD` and specifying two additional settings, `sharding_group_size` and `replica_group_size`, where the former specifies the sharding group size (the number of GPUs your model fits into, forming one replica of the model) and the latter specifies the replica group size, which is world_size/sharding_group_size.
```bash
torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n
```
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we add support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose at which step to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler can be helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
# Fine-tuning with Single GPU
This recipe steps you through how to finetune a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on a single GPU.
These are the instructions for using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
## Requirements
Ensure that you have installed the llama-recipes package ([details](../../README.md#installing)).
To run fine-tuning on a single GPU, we will make use of two packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) for int8 quantization.
## How to run it?
```bash
python finetuning.py --use_peft --peft_method lora --quantization --use_fp16 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
```
The args used in the command above are:
* `--use_peft` boolean flag to enable PEFT methods in the script
* `--peft_method` to specify the PEFT method; here we use `lora`, other options are `llama_adapter` and `prefix`.
* `--quantization` boolean flag to enable int8 quantization
> [!NOTE]
> In case you are using a multi-GPU machine please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id`.
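For example, to expose only the first GPU to the script:
```bash
export CUDA_VISIBLE_DEVICES=0
```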
### How to run with different datasets?
Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `alpaca_dataset` : to get this open source data please download the `alpaca.json` to `dataset` folder.
```bash
wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
* `samsum_dataset`
To run with each of the datasets, set the `dataset` flag in the command as shown below:
```bash
# grammar_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
# alpaca_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
# samsum_dataset
python finetuning.py --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
```
## FLOPS Counting and Pytorch Profiling
To help with benchmarking, we add support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose at which step to start counting FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler can be helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
# Local Inference
For local inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training the [inference script](inference.py) takes different arguments.
If you fine-tuned all model parameters, the output dir of the training has to be given as the `--model_name` argument.
In the case of a parameter-efficient method like LoRA, the base model has to be given as `--model_name` and the output dir of the training has to be given as the `--peft_model` argument.
Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as the `--prompt_file` parameter.
**Content Safety**
The inference script also supports safety checks for both user prompt and model outputs. In particular, we use two packages, [AuditNLG](https://github.com/salesforce/AuditNLG/tree/main) and [Azure content safety](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/).
**Note**
If using Azure Content Safety, please make sure to get the endpoint and API key as described [here](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/) and add them as the following environment variables: `CONTENT_SAFETY_ENDPOINT` and `CONTENT_SAFETY_KEY`.
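For instance, they can be exported before running the script (the values below are placeholders for your own Azure resource):
```bash
export CONTENT_SAFETY_ENDPOINT="https://<your-resource-name>.cognitiveservices.azure.com/"
export CONTENT_SAFETY_KEY="<your-api-key>"
```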
Examples:
```bash
# Full finetuning of all parameters
cat <test_prompt_file> | python inference.py --model_name <training_config.output_dir> --use_auditnlg
# PEFT method
cat <test_prompt_file> | python inference.py --model_name <training_config.model_name> --peft_model <training_config.output_dir> --use_auditnlg
# prompt as parameter
python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg
```
The folder contains test prompts for summarization use-case:
```
samsum_prompt.txt
...
```
**Note**
Currently, the default pad token in the [HuggingFace Tokenizer is `None`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). We add the padding token as a special token to the tokenizer, which in this case requires resizing the token_embeddings as shown below:
```python
tokenizer.add_special_tokens(
{
"pad_token": "<PAD>",
}
)
model.resize_token_embeddings(model.config.vocab_size + 1)
```
Padding would be required for batch inference. In this [example](inference.py), batch size = 1, so padding is essentially not required. However, we added the code pointer as an example in case of batch inference.
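For reference, a minimal sketch of batched inference with padding, assuming `model` and `tokenizer` have already been loaded as in [inference.py](inference.py) and the `<PAD>` token has been added as shown above:
```python
import torch

# Placeholder prompts; left-padding is the usual choice for decoder-only generation.
tokenizer.padding_side = "left"
prompts = ["Summarize this dialog: ...", "Summarize this dialog: ..."]
batch = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**batch, max_new_tokens=100)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```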
## Chat completion
The inference folder also includes a chat completion example that adds built-in safety features in fine-tuned models to the prompt tokens. To run the example:
```bash
python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json --quantization --use_auditnlg
```
## Flash Attention and Xformer Memory Efficient Kernels
Setting `use_fast_kernels` will enable the use of Flash Attention or Xformers memory-efficient kernels based on the hardware being used. This can speed up inference when used for batched inputs. This has been enabled in the `optimum` library from HuggingFace as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
```bash
python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json --quantization --use_auditnlg --use_fast_kernels
python inference.py --model_name <training_config.output_dir> --peft_model <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg --use_fast_kernels
```
## Loading back FSDP checkpoints
In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../../../src/llama_recipes/configs/fsdp.py), you can use this converter script to convert the FSDP Sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above.
**To convert the checkpoint use the following command**:
This is helpful if you have fine-tuned your model using FSDP only as follows:
```bash
torchrun --nnodes 1 --nproc_per_node 8 recipes/finetuning/finetuning.py --enable_fsdp --model_name /path_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16
```
Then convert your FSDP checkpoint to HuggingFace checkpoints using:
```bash
python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path PATH/to/FSDP/Checkpoints --consolidated_model_path PATH/to/save/checkpoints --HF_model_path_or_name PATH/or/HF/model_name
# --HF_model_path_or_name specifies the HF Llama model name or path where it has config.json and tokenizer.json
```
By default, training parameters are saved in `train_params.yaml` in the path where FSDP checkpoints are saved. In the converter script we first try to find the HuggingFace model name used in the fine-tuning to load the model with its configs from there; if it is not found, the user needs to provide it.
Then run inference using:
```bash
python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file>
```
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import fire
import os
import sys
import torch
from transformers import AutoTokenizer
from llama_recipes.inference.chat_utils import read_dialogs_from_file
from llama_recipes.inference.model_utils import load_model, load_peft_model
from llama_recipes.inference.safety_utils import get_safety_checker
from accelerate.utils import is_xpu_available
def main(
    model_name,
    peft_model: str=None,
    quantization: bool=False,
    max_new_tokens =256, #The maximum number of tokens to generate
    min_new_tokens:int=0, #The minimum number of tokens to generate
    prompt_file: str=None,
    seed: int=42, #seed value for reproducibility
    safety_score_threshold: float=0.5,
    do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
    use_cache: bool=True, #[optional] Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
    top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
    temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
    top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
    repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
    length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
    use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
    enable_llamaguard_content_safety: bool = False,
    **kwargs
):
    if prompt_file is not None:
        assert os.path.exists(
            prompt_file
        ), f"Provided Prompt file does not exist {prompt_file}"
        dialogs= read_dialogs_from_file(prompt_file)
    elif not sys.stdin.isatty():
        dialogs = "\n".join(sys.stdin.readlines())
    else:
        print("No user prompt provided. Exiting.")
        sys.exit(1)
    print(f"User dialogs:\n{dialogs}")
    print("\n==================================\n")
    # Set the seeds for reproducibility
    if is_xpu_available():
        torch.xpu.manual_seed(seed)
    else:
        torch.cuda.manual_seed(seed)
    torch.manual_seed(seed)
    model = load_model(model_name, quantization, use_fast_kernels)
    if peft_model:
        model = load_peft_model(model, peft_model)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.add_special_tokens(
        {
            "pad_token": "<PAD>",
        }
    )
    chats = tokenizer.apply_chat_template(dialogs)
    with torch.no_grad():
        for idx, chat in enumerate(chats):
            safety_checker = get_safety_checker(enable_azure_content_safety,
                                                enable_sensitive_topics,
                                                enable_salesforce_content_safety,
                                                enable_llamaguard_content_safety,
                                                )
            # Safety check of the user prompt
            safety_results = [check(dialogs[idx][0]["content"]) for check in safety_checker]
            are_safe = all([r[1] for r in safety_results])
            if are_safe:
                print(f"User prompt deemed safe.")
                print("User prompt:\n", dialogs[idx][0]["content"])
                print("\n==================================\n")
            else:
                print("User prompt deemed unsafe.")
                for method, is_safe, report in safety_results:
                    if not is_safe:
                        print(method)
                        print(report)
                print("Skipping the inference as the prompt is not safe.")
                sys.exit(1)  # Exit the program with an error status
            tokens= torch.tensor(chat).long()
            tokens= tokens.unsqueeze(0)
            if is_xpu_available():
                tokens= tokens.to("xpu:0")
            else:
                tokens= tokens.to("cuda:0")
            outputs = model.generate(
                input_ids=tokens,
                max_new_tokens=max_new_tokens,
                do_sample=do_sample,
                top_p=top_p,
                temperature=temperature,
                use_cache=use_cache,
                top_k=top_k,
                repetition_penalty=repetition_penalty,
                length_penalty=length_penalty,
                **kwargs
            )
            output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
            # Safety check of the model output
            safety_results = [check(output_text) for check in safety_checker]
            are_safe = all([r[1] for r in safety_results])
            if are_safe:
                print("User input and model output deemed safe.")
                print(f"Model output:\n{output_text}")
                print("\n==================================\n")
            else:
                print("Model output deemed unsafe.")
                for method, is_safe, report in safety_results:
                    if not is_safe:
                        print(method)
                        print(report)

if __name__ == "__main__":
    fire.Fire(main)
[
[{"role": "user", "content": "what is the recipe of mayonnaise?"}],
[
{"role": "user", "content": "I am going to Paris, what should I see?"},
{
"role": "assistant",
"content": "Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city. 2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa. 3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."
},
{"role": "user", "content": "What is so great about #1?"}
],
[
{"role": "system", "content": "Always answer with Haiku"},
{"role": "user", "content": "I am going to Paris, what should I see?"}
],
[
{
"role": "system",
"content": "Always answer with emojis"
},
{"role": "user", "content": "How to go from Beijing to NY?"}
],
[
{
"role": "system",
"content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
},
{"role": "user", "content": "Write a brief birthday message to John"}
]
]
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import fire
import os
import sys
import time
import gradio as gr
import torch
from transformers import AutoTokenizer
from llama_recipes.inference.safety_utils import get_safety_checker, AgentType
from llama_recipes.inference.model_utils import load_model, load_peft_model
from accelerate.utils import is_xpu_available
def main(
    model_name,
    peft_model: str=None,
    quantization: bool=False,
    max_new_tokens =100, #The maximum number of tokens to generate
    prompt_file: str=None,
    seed: int=42, #seed value for reproducibility
    do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
    min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
    use_cache: bool=True, #[optional] Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
    top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
    temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
    top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
    repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
    length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
    enable_llamaguard_content_safety: bool=False,
    max_padding_length: int=None, # the max padding length to be used with tokenizer padding the prompts.
    use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
    **kwargs
):
    def inference(user_prompt, temperature, top_p, top_k, max_new_tokens, **kwargs,):
        safety_checker = get_safety_checker(enable_azure_content_safety,
                                            enable_sensitive_topics,
                                            enable_salesforce_content_safety,
                                            enable_llamaguard_content_safety
                                            )
        # Safety check of the user prompt
        safety_results = [check(user_prompt) for check in safety_checker]
        are_safe = all([r[1] for r in safety_results])
        if are_safe:
            print("User prompt deemed safe.")
            print(f"User prompt:\n{user_prompt}")
        else:
            print("User prompt deemed unsafe.")
            for method, is_safe, report in safety_results:
                if not is_safe:
                    print(method)
                    print(report)
            print("Skipping the inference as the prompt is not safe.")
            sys.exit(1)  # Exit the program with an error status
        # Set the seeds for reproducibility
        if is_xpu_available():
            torch.xpu.manual_seed(seed)
        else:
            torch.cuda.manual_seed(seed)
        torch.manual_seed(seed)
        model = load_model(model_name, quantization, use_fast_kernels)
        if peft_model:
            model = load_peft_model(model, peft_model)
        model.eval()
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.pad_token = tokenizer.eos_token
        batch = tokenizer(user_prompt, padding='max_length', truncation=True, max_length=max_padding_length, return_tensors="pt")
        if is_xpu_available():
            batch = {k: v.to("xpu") for k, v in batch.items()}
        else:
            batch = {k: v.to("cuda") for k, v in batch.items()}
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(
                **batch,
                max_new_tokens=max_new_tokens,
                do_sample=do_sample,
                top_p=top_p,
                temperature=temperature,
                min_length=min_length,
                use_cache=use_cache,
                top_k=top_k,
                repetition_penalty=repetition_penalty,
                length_penalty=length_penalty,
                **kwargs
            )
        e2e_inference_time = (time.perf_counter()-start)*1000
        print(f"the inference time is {e2e_inference_time} ms")
        output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Safety check of the model output
        safety_results = [check(output_text, agent_type=AgentType.AGENT, user_prompt=user_prompt) for check in safety_checker]
        are_safe = all([r[1] for r in safety_results])
        if are_safe:
            print("User input and model output deemed safe.")
            print(f"Model output:\n{output_text}")
        else:
            print("Model output deemed unsafe.")
            for method, is_safe, report in safety_results:
                if not is_safe:
                    print(method)
                    print(report)
        return output_text

    if prompt_file is not None:
        assert os.path.exists(
            prompt_file
        ), f"Provided Prompt file does not exist {prompt_file}"
        with open(prompt_file, "r") as f:
            user_prompt = "\n".join(f.readlines())
        inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
    elif not sys.stdin.isatty():
        user_prompt = "\n".join(sys.stdin.readlines())
        inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
    else:
        gr.Interface(
            fn=inference,
            inputs=[
                gr.components.Textbox(
                    lines=9,
                    label="User Prompt",
                    placeholder="none",
                ),
                gr.components.Slider(
                    minimum=0, maximum=1, value=1.0, label="Temperature"
                ),
                gr.components.Slider(
                    minimum=0, maximum=1, value=1.0, label="Top p"
                ),
                gr.components.Slider(
                    minimum=0, maximum=100, step=1, value=50, label="Top k"
                ),
                gr.components.Slider(
                    minimum=1, maximum=2000, step=1, value=200, label="Max tokens"
                ),
            ],
            outputs=[
                gr.components.Textbox(
                    lines=5,
                    label="Output",
                )
            ],
            title="Meta Llama3 Playground",
            description="https://github.com/facebookresearch/llama-recipes",
        ).queue().launch(server_name="0.0.0.0", share=True)

if __name__ == "__main__":
    fire.Fire(main)
Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-)
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))
---
Summary:
# Running Llama3 8B Instruct on Android with MLC-LLM
Author: Thierry Moreau - tmoreau@octo.ai
# Overview
In this tutorial we'll learn how to deploy Llama3 8B Instruct on an Android-based phone using MLC-LLM.
Machine Learning Compilation for Large Language Models (MLC LLM) is a high-performance universal deployment solution that allows native deployment of any large language models with native APIs with compiler acceleration. The mission of this project is to enable everyone to develop, optimize and deploy AI models natively on everyone's devices with ML compilation techniques.
You can read more about MLC-LLM at the following [link](https://github.com/mlc-ai/mlc-llm).
MLC-LLM is also what powers the Llama3 inference APIs provided by [OctoAI](https://octo.ai/). You can use OctoAI for your Llama3 cloud-based inference needs by trying out the examples under the [following path](../../../llama_api_providers/OctoAI_API_examples/).
This tutorial was tested with the following setup:
* MacBook Pro 16 inch from 2021 with Apple M1 Max and 32GB of RAM running Sonoma 14.3.1
* OnePlus 12 Android Smartphone with a Snapdragon 8Gen3 SoC and 12GB of RAM, running OxygenOS 14.0
Running Llama3 on a phone will likely require a powerful chipset. We haven't extensively tested the range of chipsets that will support this use case. Feel free to update this README.md to specify which devices were successfully tested.
| Phone | Chipset | RAM | Status | Comments |
|------------|------------------|------|---------|----------|
| OnePlus 12 | Snapdragon 8Gen3 | 12GB | Success | None |
| | | | | |
This guide is heavily based on the [MLC Android Guide](https://llm.mlc.ai/docs/deploy/android.html), but several steps have been taken to streamline the instructions.
# Pre-requisites
## Python
Whether you're using conda or virtual env to manage your environment, we highly recommend starting from scratch with a clean new environment.
For instance with virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
```
Next you'll need to install the following packages:
```bash
python3 -m pip install -r requirements.txt
```
## Rust
[Rust](https://www.rust-lang.org/tools/install) is needed to cross-compile HuggingFace tokenizers to Android.
Make sure rustc, cargo, and rustup are available in $PATH.
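If Rust is not installed yet, one common way is the official rustup installer:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Verify the toolchain is on $PATH
rustc --version && cargo --version && rustup --version
```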
## Android Studio
Install Android Studio from <!-- markdown-link-check-disable -->https://developer.android.com/studio<!-- markdown-link-check-enable --> with NDK and CMake.
To install NDK and CMake, in the Android Studio welcome page, click “Projects → SDK Manager → SDK Tools”. Set up the following environment variables:
* ANDROID_NDK so that $ANDROID_NDK/build/cmake/android.toolchain.cmake is available.
* TVM_NDK_CC that points to NDK's clang compiler.
For instance, the paths will look like the following on OSX for user `moreau`:
```bash
# Android + TVM setup
export ANDROID_NDK="/Users/moreau/Library/Android/sdk/ndk/26.1.10909125"
export TVM_NDK_CC="$ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang"
```
This tutorial was tested successfully on Android Studio Hedgehog | 2023.1.1 Patch 1.
## JDK
A JDK, such as OpenJDK >= 17, is required to compile the Java bindings of the TVM Unity runtime.
We strongly recommend setting the JAVA_HOME to the JDK bundled with Android Studio. Using Android Studio’s JBR bundle as recommended (<!-- markdown-link-check-disable -->https://developer.android.com/build/jdks<!-- markdown-link-check-enable -->) will reduce the chances of potential errors in JNI compilation.
For instance on macOS, you'll need to point JAVA_HOME to the following.
```bash
export JAVA_HOME=/Applications/Android\ Studio.app/Contents/jbr/Contents/Home
```
To make sure the java binary can be found, run `ls $JAVA_HOME/bin/java`.
## MLC-LLM
Let's clone mlc-llm from its repo in the directory of your choice:
```bash
cd /path/to/where/to/clone/repo
git clone https://github.com/mlc-ai/mlc-llm --recursive
export MLC_LLM_HOME=/path/to/mlc-llm
```
At the time of writing this README, we tested `mlc-llm` at the following sha: `21feb7010db02e0c2149489f5972d6a8a796b5a0`.
## Phone Setup
On your phone, enable debugging in your phone's developer settings. Each phone manufacturer has its own approach to enabling debug mode, so a simple Google search should equip you with the steps to do that on your phone.
In addition, make sure to change your USB configuration from "Charging" to "MTP (Media Transfer Protocol)". This will allow us to connect to the device serially.
Connect your phone to your development machine. On OSX, you'll be prompted on the dev machine whether you want to allow the accessory to connect. Hit "Allow".
# Build Steps
## Building the Android Package with MLC
First, replace the file under `android/MLCChat/mlc-package-config.json` with the [mlc-package-config.json](./mlc-package-config.json) provided in llama-recipes.
To understand what these JSON fields mean you can refer to this [documentation](https://llm.mlc.ai/docs/deploy/android.html#step-2-build-runtime-and-model-libraries).
From the `mlc-llm` project root directory:
```bash
cd $MLC_LLM_HOME
cd android/MLCChat
python3 -m mlc_llm package --package-config mlc-package-config.json --output dist
```
The command above will take a few minutes to run as it runs through the following steps:
* Compile the Llama 3 8B Instruct model specified in the `mlc-package-config.json` into a binary model library.
* Build the `mlc-llm` runtime and tokenizer. In addition to the model itself, a lightweight runtime and tokenizer are required to actually run the LLM.
## Building and Running MLC Chat in Android Studio
Now let's launch Android Studio.
* On the "Welcome to Android Studio" page, hit "Open", and navigate to `$MLC_LLM_HOME/android/MLCChat`, then hit "Open"
* A window will pop up asking whether to "Trust and Open project 'MLCChat'" - hit "Trust Project"
* The project will now launch
* Under File -> Project Structure... -> Project change the Gradle Version (second drop down from the top) to 8.5
Connect your phone to your development machine - assuming you've followed the setup steps in the pre-requisite section, you should be able to see the device.
Next you'll need to:
* Hit Build -> Make Project.
* Hit Run -> Run 'app'
The MLCChat app will launch on your phone. Now, on your phone:
* Under Model List you'll see the `Llama-3-8B-Instruct` LLM listed.
* The model isn't quite ready to launch yet, because the weights need to be downloaded over WiFi first. Hit the Download button to the right of the model name to download the weights from HuggingFace.
Note that you can change the build settings to bundle the weights with the MLCChat app so you don't have to download the weights over wifi. To do so you can follow the instructions [here](https://llm.mlc.ai/docs/deploy/android.html#bundle-model-weights).
Once the model weights are downloaded you can now interact with Llama 3 locally on your Android phone!
{
"device": "android",
"model_list": [
{
"model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
"estimated_vram_bytes": 4348727787,
"model_id": "Llama-3-8B-Instruct",
"overrides": {
"context_window_size": 768,
"prefill_chunk_size": 256
}
}
]
}
--pre
--find-links https://mlc.ai/wheels
mlc-llm-nightly
mlc-ai-nightly
attrs
decorator
numpy
psutil
pydantic
requests
scipy
setuptools
torch
tqdm
## [Running Llama 3 On-Prem with vLLM and TGI](llama-on-prem.md)
This tutorial shows how to use Llama 3 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference) to build Llama 3 on-prem apps.