"...git@developer.sourcefind.cn:chenpangpang/open-webui.git" did not exist on "72354e06a759075024d6be6bc6a8e717ec29d823"
Commit e75bc9be authored by chenzk's avatar chenzk
Browse files

v1.0

parents
# Fine-tuning
## SmolLM2 Instruct
We build the SmolLM2 Instruct family by fine-tuning the base 1.7B model on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) and the base 360M and 135M models on [Smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) using `TRL` and the alignment handbook, and then doing DPO on [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback). You can find the scripts and instructions here: https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2#instructions-to-train-smollm2-17b-instruct
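For orientation, the DPO stage with TRL's `DPOTrainer` looks roughly like the sketch below. This is a minimal illustration rather than the exact recipe: the binarized dataset name, the hyperparameters, and the `processing_class` argument (recent TRL releases) are assumptions; the alignment-handbook link above has the authoritative configuration.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from your SFT checkpoint (the released Instruct model is a placeholder here)
model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs with "prompt"/"chosen"/"rejected" columns
# (a binarized UltraFeedback variant; the handbook recipe pins the exact dataset)
prefs = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

args = DPOConfig(output_dir="smollm2-dpo", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(model=model, args=args, train_dataset=prefs, processing_class=tokenizer)
trainer.train()
```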
## Custom script
Here, we provide a simple script for fine-tuning SmolLM2. In this case, we fine-tune the base 1.7B model on Python code.
### Setup
Install `pytorch` (see the [documentation](https://pytorch.org/)), then install the requirements:
```bash
pip install -r requirements.txt
```
Before you run any of the scripts, make sure you are logged in to `wandb` and the Hugging Face Hub so you can push the checkpoints, and that you have `accelerate` configured:
```bash
wandb login
huggingface-cli login
accelerate config
```
Once the setup is done, clone the repository and change into the fine-tuning directory:
```bash
git clone https://github.com/huggingface/smollm
cd smollm/finetune
```
### Training
To fine-tune efficiently at a low cost, we use the [PEFT](https://github.com/huggingface/peft) library for Low-Rank Adaptation (LoRA) training, together with the `SFTTrainer` from [TRL](https://github.com/huggingface/trl).
For this example, we will fine-tune SmolLM2-1.7B on the `Python` subset of [the-stack-smol](https://huggingface.co/datasets/bigcode/the-stack-smol). This is just for illustration purposes.
To launch the training:
```bash
accelerate launch train.py \
--model_id "HuggingFaceTB/SmolLM2-1.7B" \
--dataset_name "bigcode/the-stack-smol" \
--subset "data/python" \
--dataset_text_field "content" \
--split "train" \
--max_seq_length 2048 \
--max_steps 5000 \
--micro_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 3e-4 \
--warmup_steps 100 \
--num_proc "$(nproc)"
```
If you want to fine-tune on other text datasets, change the `dataset_text_field` argument to the name of the column containing the code/text you want to train on.
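After training, `train.py` saves the LoRA adapter under `<output_dir>/final_checkpoint/` and, with the default `--save_merged_model`, also writes a merged model to `<output_dir>/final_merged_checkpoint/`. A minimal sketch for sanity-checking the merged checkpoint, assuming the script's default `--output_dir` of `finetune_smollm2_python`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# train.py does not save the tokenizer, so load it from the base model
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
model = AutoModelForCausalLM.from_pretrained(
    "finetune_smollm2_python/final_merged_checkpoint",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```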
---
annotations_creators: []
language_creators:
- crowdsourced
language: ["code"]
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets: []
task_categories:
- text-generation
task_ids:
- language-modeling
extra_gated_prompt: |-
## Terms of Use for The Stack
The Stack dataset is a collection of 3.1 TB of source code in 30 programming languages. We ask that you read and acknowledge the following points before using the dataset:
1. The Stack is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack must abide by the terms of the original licenses, including attribution clauses when relevant. We facilitate this by providing provenance information for each data point.
2. The Stack is regularly updated to enact validated data removal requests. By clicking on "Access repository", you agree to update your own version of The Stack to the most recent usable version specified by the maintainers in [the following thread](https://huggingface.co/datasets/bigcode/the-stack/discussions/7). If you have questions about dataset versions and allowed uses, please also ask them in the dataset’s [community discussions](https://huggingface.co/datasets/bigcode/the-stack/discussions/new). We will also notify users via email when the latest usable version changes.
3. To host, share, or otherwise provide access to The Stack dataset, you must include [these Terms of Use](https://huggingface.co/datasets/bigcode/the-stack#terms-of-use-for-the-stack) and require users to agree to it.
By clicking on "Access repository" below, you accept that your contact information (email address and username) can be shared with the dataset maintainers as well.
extra_gated_fields:
Email: text
I have read the License and agree with its terms: checkbox
---
## Dataset Description
![Smol](https://huggingface.co/datasets/bigcode/admin/resolve/main/smol.png)
A small subset (~0.1%) of [the-stack](https://huggingface.co/datasets/bigcode/the-stack) dataset: each programming language has 10,000 random samples from the original dataset. In total, the dataset contains 2.6 GB of text (code).
## Languages
The dataset contains 30 programming languages:
```
"assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java",
"javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust",
"scala", "shell", "sql", "tex", "typescript", "visual-basic"
```
## Dataset Structure
```python
from datasets import load_dataset
load_dataset("bigcode/the-stack-smol")
DatasetDict({
train: Dataset({
features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
num_rows: 300000
})
})
```
### How to use it
You can either load the whole dataset as above, or load a specific language such as Python by specifying its data directory:
```python
load_dataset("bigcode/the-stack-smol", data_dir="data/python")
DatasetDict({
train: Dataset({
features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
num_rows: 10000
})
})
```
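For quick inspection, load a split and index into it; `content` holds the source code and the remaining fields are metadata:
```python
from datasets import load_dataset

# Load only the Python subset (10,000 samples) rather than all 30 languages
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

sample = ds[0]
print(sample["lang"], sample["path"], sample["size"])  # metadata fields
print(sample["content"][:200])                         # first 200 characters of the file
```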
{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"bias": "none",
"fan_in_fan_out": false,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 32,
"lora_dropout": 0.05,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 16,
"rank_pattern": {},
"revision": null,
"target_modules": [
"v_proj",
"q_proj"
],
"task_type": "CAUSAL_LM",
"use_dora": false,
"use_rslora": false
}
{
"_name_or_path": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.1,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 24,
"num_key_value_heads": 32,
"pad_token_id": 2,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 130000,
"tie_word_embeddings": true,
"torch_dtype": "float32",
"transformers.js_config": {
"kv_cache_dtype": {
"fp16": "float16",
"q4f16": "float16"
}
},
"transformers_version": "4.46.2",
"use_cache": true,
"vocab_size": 49152
}
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 2,
"transformers_version": "4.46.2"
}
{
"add_prefix_space": false,
"added_tokens_decoder": {
"0": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<repo_name>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"4": {
"content": "<reponame>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"5": {
"content": "<file_sep>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"6": {
"content": "<filename>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<gh_stars>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"8": {
"content": "<issue_start>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"9": {
"content": "<issue_comment>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"10": {
"content": "<issue_closed>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"11": {
"content": "<jupyter_start>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"12": {
"content": "<jupyter_text>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"13": {
"content": "<jupyter_code>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"14": {
"content": "<jupyter_output>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"15": {
"content": "<jupyter_script>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"16": {
"content": "<empty_output>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>"
],
"bos_token": "<|im_start|>",
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"model_max_length": 2048,
"pad_token": "<|im_end|>",
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<|endoftext|>",
"vocab_size": 49152
}
{
"_name_or_path": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 24,
"num_key_value_heads": 32,
"pad_token_id": 2,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 130000,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers.js_config": {
"kv_cache_dtype": {
"fp16": "float16",
"q4f16": "float16"
}
},
"transformers_version": "4.46.2",
"use_cache": true,
"vocab_size": 49152
}
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 2,
"transformers_version": "4.46.2"
}
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
messages = [{"role": "user", "content": "Write a 100-word article on 'Benefits of Open-Source in AI research"}]
input_text=tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))
transformers
trl
peft
accelerate
datasets
scipy
wandb  # optional: disable logging with `wandb offline` or `wandb disabled`
bitsandbytes
# Code adapted from https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/supervised_finetuning.py
# and https://huggingface.co/blog/gemma-peft
import argparse
import multiprocessing
import os
import torch
import transformers
from accelerate import PartialState
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig
from transformers import (
AutoModelForCausalLM,
BitsAndBytesConfig,
is_torch_npu_available,
is_torch_xpu_available,
logging,
set_seed,
)
from trl import SFTTrainer
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("--model_id", type=str, default="HuggingFaceTB/SmolLM2-1.7B")
parser.add_argument("--dataset_name", type=str, default="bigcode/the-stack-smol")
parser.add_argument("--subset", type=str, default="data/python")
parser.add_argument("--split", type=str, default="train")
parser.add_argument("--dataset_text_field", type=str, default="content")
parser.add_argument("--max_seq_length", type=int, default=2048)
parser.add_argument("--max_steps", type=int, default=1000)
parser.add_argument("--micro_batch_size", type=int, default=1)
parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
parser.add_argument("--weight_decay", type=float, default=0.01)
parser.add_argument("--bf16", type=bool, default=True)
parser.add_argument("--use_bnb", type=bool, default=False)
parser.add_argument("--attention_dropout", type=float, default=0.1)
parser.add_argument("--learning_rate", type=float, default=2e-4)
parser.add_argument("--lr_scheduler_type", type=str, default="cosine")
parser.add_argument("--warmup_steps", type=int, default=100)
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--output_dir", type=str, default="finetune_smollm2_python")
parser.add_argument("--num_proc", type=int, default=None)
parser.add_argument("--save_merged_model", type=bool, default=True)
parser.add_argument("--push_to_hub", type=bool, default=True)
parser.add_argument("--repo_id", type=str, default="SmolLM2-1.7B-finetune")
return parser.parse_args()
def main(args):
# config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj"],
bias="none",
task_type="CAUSAL_LM",
)
bnb_config = None
if args.use_bnb:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
# load model and dataset
token = os.environ.get("HF_TOKEN", None)
model = AutoModelForCausalLM.from_pretrained(
args.model_id,
quantization_config=bnb_config,
device_map={"": PartialState().process_index},
attention_dropout=args.attention_dropout,
)
data = load_dataset(
args.dataset_name,
data_dir=args.subset,
split=args.split,
token=token,
num_proc=args.num_proc if args.num_proc else multiprocessing.cpu_count(),
)
# setup the trainer
trainer = SFTTrainer(
model=model,
train_dataset=data,
max_seq_length=args.max_seq_length,
args=transformers.TrainingArguments(
per_device_train_batch_size=args.micro_batch_size,
gradient_accumulation_steps=args.gradient_accumulation_steps,
warmup_steps=args.warmup_steps,
max_steps=args.max_steps,
learning_rate=args.learning_rate,
lr_scheduler_type=args.lr_scheduler_type,
weight_decay=args.weight_decay,
bf16=args.bf16,
logging_strategy="steps",
logging_steps=10,
output_dir=args.output_dir,
optim="paged_adamw_8bit",
seed=args.seed,
run_name=f"train-{args.model_id.split('/')[-1]}",
report_to="wandb",
),
peft_config=lora_config,
dataset_text_field=args.dataset_text_field,
)
# launch
print("Training...")
trainer.train()
print("Saving the last checkpoint of the model")
model.save_pretrained(os.path.join(args.output_dir, "final_checkpoint/"))
if args.save_merged_model:
# Free memory for merging weights
del model
if is_torch_xpu_available():
torch.xpu.empty_cache()
elif is_torch_npu_available():
torch.npu.empty_cache()
else:
torch.cuda.empty_cache()
model = AutoPeftModelForCausalLM.from_pretrained(os.path.join(args.output_dir, "final_checkpoint/"), device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()
output_merged_dir = os.path.join(args.output_dir, "final_merged_checkpoint")
model.save_pretrained(output_merged_dir, safe_serialization=True)
if args.push_to_hub:
model.push_to_hub(args.repo_id, commit_message="Upload model")
print("Training Done! 💥")
if __name__ == "__main__":
args = get_args()
set_seed(args.seed)
os.makedirs(args.output_dir, exist_ok=True)
logging.set_verbosity_error()
main(args)
accelerate launch train.py \
--model_id "HuggingFaceTB/SmolLM2-1.7B-Instruct" \
--dataset_name "bigcode/the-stack-smol" \
--subset "data/python" \
--dataset_text_field "content" \
--split "train" \
--max_seq_length 2048 \
--max_steps 5000 \
--micro_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 3e-4 \
--warmup_steps 100 \
--num_proc "$(nproc)"
import torch
from transformers import AutoProcessor, Idefics3ForConditionalGeneration
from PIL import Image
import cv2
import numpy as np
from typing import List
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class VideoFrameExtractor:
def __init__(self, max_frames: int = 50):
self.max_frames = max_frames
def resize_and_center_crop(self, image: Image.Image, target_size: int) -> Image.Image:
# Get current dimensions
width, height = image.size
# Calculate new dimensions keeping aspect ratio
if width < height:
new_width = target_size
new_height = int(height * (target_size / width))
else:
new_height = target_size
new_width = int(width * (target_size / height))
# Resize
image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)
# Center crop
left = (new_width - target_size) // 2
top = (new_height - target_size) // 2
right = left + target_size
bottom = top + target_size
return image.crop((left, top, right, bottom))
def extract_frames(self, video_path: str) -> List[Image.Image]:
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
raise ValueError(f"Could not open video: {video_path}")
# Get video properties
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
fps = int(cap.get(cv2.CAP_PROP_FPS))
# Calculate frame indices to extract (1fps)
frame_indices = list(range(0, total_frames, fps))
# If we have more frames than max_frames, sample evenly
if len(frame_indices) > self.max_frames:
indices = np.linspace(0, len(frame_indices) - 1, self.max_frames, dtype=int)
frame_indices = [frame_indices[i] for i in indices]
frames = []
for frame_idx in frame_indices:
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
ret, frame = cap.read()
if ret:
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
pil_image = Image.fromarray(frame)
pil_image = self.resize_and_center_crop(pil_image, 384)
frames.append(pil_image)
cap.release()
return frames
def load_model(checkpoint_path: str, base_model_id: str = "HuggingFaceTB/SmolVLM-Instruct", device: str = "cuda"):
# Load processor from original model
processor = AutoProcessor.from_pretrained(base_model_id)
if checkpoint_path:
# Load fine-tuned model from checkpoint
model = Idefics3ForConditionalGeneration.from_pretrained(
checkpoint_path,
torch_dtype=torch.bfloat16,
device_map=device
)
else:
model = Idefics3ForConditionalGeneration.from_pretrained(
base_model_id,
torch_dtype=torch.bfloat16,
device_map=device
)
# Configure processor for video frames
processor.image_processor.size = (384, 384)
processor.image_processor.do_resize = False
processor.image_processor.do_image_splitting = False
return model, processor
def generate_response(model, processor, video_path: str, question: str, max_frames: int = 50):
# Extract frames
frame_extractor = VideoFrameExtractor(max_frames)
frames = frame_extractor.extract_frames(video_path)
logger.info(f"Extracted {len(frames)} frames from video")
# Create prompt with frames
image_tokens = [{"type": "image"} for _ in range(len(frames))]
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "Answer briefly."},
*image_tokens,
{"type": "text", "text": question}
]
}
]
# Process inputs
inputs = processor(
text=processor.apply_chat_template(messages, add_generation_prompt=True),
images=[img for img in frames],
return_tensors="pt"
).to(model.device)
# Generate response
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
temperature=0.7,
do_sample=True,
use_cache=True
)
# Decode response
response = processor.decode(outputs[0], skip_special_tokens=True)
return response
def main():
# Configuration
#checkpoint_path = "/path/to/your/checkpoint"
checkpoint_path = None
base_model_id = "HuggingFaceTB/SmolVLM-Instruct"
video_path = "/path/to/video.mp4"
question = "Describe the video"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load model
logger.info("Loading model...")
model, processor = load_model(checkpoint_path, base_model_id, device)
# Generate response
logger.info("Generating response...")
response = generate_response(model, processor, video_path, question)
# Print results
print("Question:", question)
print("Response:", response)
if __name__ == "__main__":
main()
# Local inference
You can use SmolLM2 models locally with frameworks like Transformers.js, llama.cpp, MLX and MLC.
Here you can find the code for running SmolLM locally using each of these libraries. You can also find the conversions of SmolLM & SmolLM2 in these collections: [SmolLM1](https://huggingface.co/collections/HuggingFaceTB/local-smollms-66c0f3b2a15b4eed7fb198d0) and [SmolLM2](https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9).
Please first install each library by following its documentation:
- [Transformers.js](https://github.com/huggingface/transformers.js)
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- [MLX](https://github.com/ml-explore/mlx)
- [MLC](https://github.com/mlc-ai/web-llm)
## Demos
Below are some demos we built for running SmolLM models in the browser:
- [WebGPU demo](https://huggingface.co/spaces/HuggingFaceTB/SmolLM2-1.7B-Instruct-WebGPU) of SmolLM2 1.7B Instruct powered by Transformers.js and ONNX Runtime Web.
- [Bunny B1](https://github.com/dottxt-ai/demos/tree/main/its-a-smol-world) mapping natural language requests to local application calls using function calling and structured generation by [outlines](https://github.com/dottxt-ai/outlines).
- [Instant SmolLM](https://huggingface.co/spaces/HuggingFaceTB/instant-smollm) powered by MLC for real-time generation with SmolLM-360M-Instruct.
The models are also available on [Ollama](https://ollama.com/library/smollm2) and [PocketPal-AI](https://github.com/a-ghorbani/pocketpal-ai).
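If you are using MLX, a minimal Python sketch with the `mlx_lm` package is shown below; the converted repo id is an assumption, so check the mlx-community organization on the Hub for the exact SmolLM2 conversions (including quantized variants).
```python
from mlx_lm import load, generate

# Repo id is an assumption; see mlx-community on the Hub for available conversions
model, tokenizer = load("mlx-community/SmolLM2-1.7B-Instruct")

prompt = "What is the capital of France?"
response = generate(model, tokenizer, prompt=prompt, verbose=True)
```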
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF",
filename="*q4_k_m.gguf",
verbose=False
)
output = llm(
"Q: Name the planets in the solar system? A: ", # Prompt
max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
from mlc_llm import MLCEngine
# Create engine
model = "HF://mlc-ai/SmolLM2-1.7B-Instruct-q0f16-MLC"
engine = MLCEngine(model)
# Run chat completion in the OpenAI API style.
for response in engine.chat.completions.create(
messages=[{"role": "user", "content": "What is the meaning of life?"}],
model=model,
stream=True,
):
for choice in response.choices:
print(choice.delta.content, end="", flush=True)
print("\n")
engine.terminate()