LLM post-training: full fine-tuning, LoRA, QLoRA, and more, for Llama, Mistral, Gemma, and other models.

# Configuration Options

This document outlines all available configuration options for training models. The configuration can be provided as a JSON request.

## Usage

You can use these configuration options:

1. As a JSON request body:

   ```json
   {
     "input": {
       "user_id": "user",
       "model_id": "model-name",
       "run_id": "run-id",
       "credentials": {
         "wandb_api_key": "", // add your Weights & Biases key. TODO: you will be able to set this in environment variables.
         "hf_token": ""       // add your HF token. TODO: you will be able to set this in environment variables.
       },
       "args": {
         "base_model": "NousResearch/Llama-3.2-1B"
         // ... other options
       }
     }
   }
   ```

## Configuration Options

### Model Configuration

| Option              | Description                                                                                    | Default              |
| ------------------- | ---------------------------------------------------------------------------------------------- | -------------------- |
| `base_model`        | Path to the base model (local or HuggingFace)                                                   | Required             |
| `base_model_config` | Configuration path for the base model                                                           | Same as base_model   |
| `revision_of_model` | Specific model revision from the HuggingFace Hub                                                | Latest               |
| `tokenizer_config`  | Custom tokenizer configuration path                                                              | Optional             |
| `model_type`        | Type of model to load                                                                            | AutoModelForCausalLM |
| `tokenizer_type`    | Type of tokenizer to use                                                                          | AutoTokenizer        |
| `hub_model_id`      | Repository ID where the model will be pushed on the Hugging Face Hub (format: username/repo-name) | Optional             |

### Model Family Identification

| Option                     | Default | Description                    |
| -------------------------- | ------- | ------------------------------ |
| `is_falcon_derived_model`  | `false` | Whether model is Falcon-based  |
| `is_llama_derived_model`   | `false` | Whether model is LLaMA-based   |
| `is_qwen_derived_model`    | `false` | Whether model is Qwen-based    |
| `is_mistral_derived_model` | `false` | Whether model is Mistral-based |

### Model Configuration Overrides

| Option                                           | Default    | Description                        |
| ------------------------------------------------ | ---------- | ---------------------------------- |
| `overrides_of_model_config.rope_scaling.type`    | `"linear"` | RoPE scaling type (linear/dynamic) |
| `overrides_of_model_config.rope_scaling.factor`  | `1.0`      | RoPE scaling factor                |

### Model Loading Options

| Option         | Description                   | Default |
| -------------- | ----------------------------- | ------- |
| `load_in_8bit` | Load model in 8-bit precision | false   |
| `load_in_4bit` | Load model in 4-bit precision | false   |
| `bf16`         | Use bfloat16 precision        | false   |
| `fp16`         | Use float16 precision         | false   |
| `tf32`         | Use TensorFloat-32 precision  | false   |

### Memory and Device Settings

| Option             | Default   | Description             |
| ------------------ | --------- | ----------------------- |
| `gpu_memory_limit` | `"20GiB"` | GPU memory limit        |
| `lora_on_cpu`      | `false`   | Load LoRA on CPU        |
| `device_map`       | `"auto"`  | Device mapping strategy |
| `max_memory`       | `null`    | Max memory per device   |

### Training Hyperparameters

| Option                        | Default   | Description                 |
| ----------------------------- | --------- | --------------------------- |
| `gradient_accumulation_steps` | `1`       | Gradient accumulation steps |
| `micro_batch_size`            | `2`       | Batch size per GPU          |
| `eval_batch_size`             | `null`    | Evaluation batch size       |
| `num_epochs`                  | `4`       | Number of training epochs   |
| `warmup_steps`                | `100`     | Warmup steps                |
| `warmup_ratio`                | `0.05`    | Warmup ratio                |
| `learning_rate`               | `0.00003` | Learning rate               |
| `lr_quadratic_warmup`         | `false`   | Quadratic warmup            |
| `logging_steps`               | `null`    | Logging frequency           |
| `eval_steps`                  | `null`    | Evaluation frequency        |
| `evals_per_epoch`             | `null`    | Evaluations per epoch       |
| `save_strategy`               | `"epoch"` | Checkpoint saving strategy  |
| `save_steps`                  | `null`    | Saving frequency            |
| `saves_per_epoch`             | `null`    | Saves per epoch             |
| `save_total_limit`            | `null`    | Maximum checkpoints to keep |
| `max_steps`                   | `null`    | Maximum training steps      |
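Since `micro_batch_size` is per GPU, the effective batch size per optimizer step is `micro_batch_size × gradient_accumulation_steps × number of GPUs`. A minimal, illustrative sketch of the relevant `args` (the values below are examples, not project defaults):

```json
{
  "args": {
    "micro_batch_size": 2,             // 2 samples per GPU per forward pass
    "gradient_accumulation_steps": 8   // 2 × 8 = 16 samples per optimizer step on a single GPU
  }
}
```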
### Dataset Configuration

```yaml
datasets:
  - path: vicgalle/alpaca-gpt4 # HuggingFace dataset. TODO: you will be able to add a local path.
    type: alpaca               # Format type (alpaca, gpteacher, oasst, etc.)
    ds_type: json              # Dataset type
    data_files: path/to/data   # Source data files
    train_on_split: train      # Dataset split to use
```

### Chat Template Settings

| Option                   | Default                          | Description            |
| ------------------------ | -------------------------------- | ---------------------- |
| `chat_template`          | `"tokenizer_default"`            | Chat template type     |
| `chat_template_jinja`    | `null`                           | Custom Jinja template  |
| `default_system_message` | `"You are a helpful assistant."` | Default system message |

### Dataset Processing

| Option                        | Default                    | Description                       |
| ----------------------------- | -------------------------- | --------------------------------- |
| `dataset_prepared_path`       | `"data/last_run_prepared"` | Path for prepared dataset         |
| `push_dataset_to_hub`         | `""`                       | Push dataset to HF hub            |
| `dataset_processes`           | `4`                        | Number of preprocessing processes |
| `dataset_keep_in_memory`      | `false`                    | Keep dataset in memory            |
| `shuffle_merged_datasets`     | `true`                     | Shuffle merged datasets           |
| `dataset_exact_deduplication` | `true`                     | Deduplicate datasets              |

### LoRA Configuration

| Option                     | Default                | Description                    |
| -------------------------- | ---------------------- | ------------------------------ |
| `adapter`                  | `"lora"`               | Adapter type (lora/qlora)      |
| `lora_model_dir`           | `""`                   | Directory with pretrained LoRA |
| `lora_r`                   | `8`                    | LoRA attention dimension       |
| `lora_alpha`               | `16`                   | LoRA alpha parameter           |
| `lora_dropout`             | `0.05`                 | LoRA dropout                   |
| `lora_target_modules`      | `["q_proj", "v_proj"]` | Modules to apply LoRA to       |
| `lora_target_linear`       | `false`                | Target all linear modules      |
| `peft_layers_to_transform` | `[]`                   | Layers to transform            |
| `lora_modules_to_save`     | `[]`                   | Modules to save                |
| `lora_fan_in_fan_out`      | `false`                | Fan in/out structure           |
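To switch from LoRA to QLoRA, set `adapter` to `"qlora"` and load the base model in 4-bit. A minimal sketch of the relevant `args` using only the options listed above (values are illustrative, not project defaults):

```json
{
  "args": {
    "base_model": "NousResearch/Llama-3.2-1B",
    "load_in_4bit": true,       // QLoRA trains adapters on top of a 4-bit quantized base model
    "adapter": "qlora",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "lora_target_linear": true  // apply LoRA to all linear modules instead of listing them individually
  }
}
```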
### Optimization Settings

| Option                    | Default | Description                |
| ------------------------- | ------- | -------------------------- |
| `train_on_inputs`         | `false` | Train on input prompts     |
| `group_by_length`         | `false` | Group by sequence length   |
| `gradient_checkpointing`  | `false` | Use gradient checkpointing |
| `early_stopping_patience` | `3`     | Early stopping patience    |

### Learning Rate Scheduling

| Option                     | Default    | Description          |
| -------------------------- | ---------- | -------------------- |
| `lr_scheduler`             | `"cosine"` | Scheduler type       |
| `lr_scheduler_kwargs`      | `{}`       | Scheduler parameters |
| `cosine_min_lr_ratio`      | `null`     | Minimum LR ratio     |
| `cosine_constant_lr_ratio` | `null`     | Constant LR ratio    |
| `lr_div_factor`            | `null`     | LR division factor   |

### Optimizer Settings

| Option                 | Default      | Description         |
| ---------------------- | ------------ | ------------------- |
| `optimizer`            | `"adamw_hf"` | Optimizer choice    |
| `optim_args`           | `{}`         | Optimizer arguments |
| `optim_target_modules` | `[]`         | Target modules      |
| `weight_decay`         | `null`       | Weight decay        |
| `adam_beta1`           | `null`       | Adam beta1          |
| `adam_beta2`           | `null`       | Adam beta2          |
| `adam_epsilon`         | `null`       | Adam epsilon        |
| `max_grad_norm`        | `null`       | Gradient clipping   |

### Attention Implementations

| Option                     | Default | Description                      |
| -------------------------- | ------- | -------------------------------- |
| `flash_optimum`            | `false` | Use BetterTransformer            |
| `xformers_attention`       | `false` | Use xFormers attention           |
| `flash_attention`          | `false` | Use Flash Attention              |
| `flash_attn_cross_entropy` | `false` | Flash Attention cross entropy    |
| `flash_attn_rms_norm`      | `false` | Flash Attention RMS norm         |
| `flash_attn_fuse_qkv`      | `false` | Fuse QKV operations              |
| `flash_attn_fuse_mlp`      | `false` | Fuse MLP operations              |
| `sdp_attention`            | `false` | Use scaled dot product attention |
| `s2_attention`             | `false` | Use shifted sparse attention     |

### Tokenizer Modifications

| Option           | Default | Description                  |
| ---------------- | ------- | ---------------------------- |
| `special_tokens` | -       | Special tokens to add/modify |
| `tokens`         | `[]`    | Additional tokens            |

### Distributed Training

| Option                  | Default | Description           |
| ----------------------- | ------- | --------------------- |
| `fsdp`                  | `null`  | FSDP configuration    |
| `fsdp_config`           | `null`  | FSDP config options   |
| `deepspeed`             | `null`  | DeepSpeed config path |
| `ddp_timeout`           | `null`  | DDP timeout           |
| `ddp_bucket_cap_mb`     | `null`  | DDP bucket capacity   |
| `ddp_broadcast_buffers` | `null`  | DDP broadcast buffers |
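For multi-GPU runs, DeepSpeed and DDP are configured through the options above. A minimal, hedged sketch; the config path is a placeholder you would supply yourself, and the timeout value is only illustrative (Hugging Face's `TrainingArguments` interprets it in seconds):

```json
{
  "args": {
    "deepspeed": "configs/deepspeed_zero2.json", // placeholder path to your own DeepSpeed config
    "ddp_timeout": 18000                         // illustrative DDP timeout, useful when dataset preprocessing is slow
  }
}
```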

## Example Configuration Request

Here's a complete example for fine-tuning a LLaMA model using LoRA:

```json
{
  "input": {
    "user_id": "user",
    "model_id": "llama-test",
    "run_id": "test-run",
    "credentials": {
      "wandb_api_key": "",
      "hf_token": ""
    },
    "args": {
      "base_model": "NousResearch/Llama-3.2-1B",
      "load_in_8bit": false,
      "load_in_4bit": false,
      "strict": false,
      "datasets": [
        {
          "path": "teknium/GPT4-LLM-Cleaned",
          "type": "alpaca"
        }
      ],
      "dataset_prepared_path": "last_run_prepared",
      "val_set_size": 0.1,
      "output_dir": "./outputs/lora-out",
      "adapter": "lora",
      "sequence_len": 2048,
      "sample_packing": true,
      "eval_sample_packing": true,
      "pad_to_sequence_len": true,
      "lora_r": 16,
      "lora_alpha": 32,
      "lora_dropout": 0.05,
      "lora_target_modules": [
        "gate_proj",
        "down_proj",
        "up_proj",
        "q_proj",
        "v_proj",
        "k_proj",
        "o_proj"
      ],
      "gradient_accumulation_steps": 2,
      "micro_batch_size": 2,
      "num_epochs": 1,
      "optimizer": "adamw_8bit",
      "lr_scheduler": "cosine",
      "learning_rate": 0.0002,
      "train_on_inputs": false,
      "group_by_length": false,
      "bf16": "auto",
      "tf32": false,
      "gradient_checkpointing": true,
      "logging_steps": 1,
      "flash_attention": true,
      "loss_watchdog_threshold": 5,
      "loss_watchdog_patience": 3,
      "warmup_steps": 10,
      "evals_per_epoch": 4,
      "saves_per_epoch": 1,
      "weight_decay": 0,
      "hub_model_id": "runpod/llama-fr-lora",
      "wandb_name": "test-run-1",
      "wandb_project": "test-run-1",
      "wandb_entity": "axo-test",
      "special_tokens": {
        "pad_token": "<|end_of_text|>"
      }
    }
  }
}
```
### Advanced Features

#### Wandb Integration

- `wandb_project`: Project name for Weights & Biases
- `wandb_entity`: Team name in W&B
- `wandb_watch`: Monitor model with W&B
- `wandb_name`: Name of the W&B run
- `wandb_run_id`: ID for the W&B run

#### Performance Optimization

- `sample_packing`: Enable efficient sequence packing
- `eval_sample_packing`: Use sequence packing during evaluation
- `torch_compile`: Enable PyTorch 2.0 compilation
- `flash_attention`: Use Flash Attention implementation
- `xformers_attention`: Use xFormers attention implementation

### Available Optimizers

The following optimizers are supported:

- `adamw_hf`: HuggingFace's AdamW implementation
- `adamw_torch`: PyTorch's AdamW
- `adamw_torch_fused`: Fused AdamW implementation
- `adamw_torch_xla`: XLA-optimized AdamW
- `adamw_apex_fused`: NVIDIA Apex fused AdamW
- `adafactor`: Adafactor optimizer
- `adamw_anyprecision`: Anyprecision AdamW
- `adamw_bnb_8bit`: 8-bit AdamW from bitsandbytes
- `lion_8bit`: 8-bit Lion optimizer
- `lion_32bit`: 32-bit Lion optimizer
- `sgd`: Stochastic Gradient Descent
- `adagrad`: Adagrad optimizer

## Notes

- Set `load_in_8bit: true` or `load_in_4bit: true` for memory-efficient training
- Enable `flash_attention: true` for faster training on modern GPUs
- Use `gradient_checkpointing: true` to reduce memory usage
- Adjust `micro_batch_size` and `gradient_accumulation_steps` based on your GPU memory

For more detailed information, please refer to the [documentation](https://axolotl-ai-cloud.github.io/axolotl/docs/config.html).

### Errors

- If you face any issues with Flash Attention 2, delete your worker and restart it.