"...git@developer.sourcefind.cn:chenpangpang/open-webui.git" did not exist on "72354e06a759075024d6be6bc6a8e717ec29d823"
Commit e75bc9be authored by chenzk's avatar chenzk
Browse files

v1.0

parents
# Fine-tuning
## SmolLM2 Instruct
We build the SmolLM2 Instruct family by fine-tuning the base 1.7B model on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) and the base 360M and 135M models on [Smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) using `TRL` and the alignment handbook, and then doing DPO on [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback). You can find the scripts and instructions here: https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2#instructions-to-train-smollm2-17b-instruct
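For orientation, the DPO stage with TRL's `DPOTrainer` looks roughly like the sketch below. This is a minimal illustration rather than the exact recipe: the binarized dataset name, the hyperparameters, and the `processing_class` argument (recent TRL releases) are assumptions; the alignment-handbook link above has the authoritative configuration.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from your SFT checkpoint (the released Instruct model is a placeholder here)
model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs with "prompt"/"chosen"/"rejected" columns
# (a binarized UltraFeedback variant; the handbook recipe pins the exact dataset)
prefs = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

args = DPOConfig(output_dir="smollm2-dpo", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(model=model, args=args, train_dataset=prefs, processing_class=tokenizer)
trainer.train()
```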
## Custom script
Here, we provide a simple script for fine-tuning SmolLM2. In this case, we fine-tune the base 1.7B model on Python code.
### Setup
Install `pytorch` (see the [documentation](https://pytorch.org/)), then install the requirements:
```bash
pip install -r requirements.txt
```
Before you run any of the scripts, make sure you are logged in to `wandb` and the Hugging Face Hub so you can push the checkpoints, and that you have `accelerate` configured:
```bash
wandb login
huggingface-cli login
accelerate config
```
Once the setup is done, clone the repository and change into the fine-tuning directory:
```bash
git clone https://github.com/huggingface/smollm
cd smollm/finetune
```
### Training
To fine-tune efficiently at a low cost, we use the [PEFT](https://github.com/huggingface/peft) library for Low-Rank Adaptation (LoRA) training, together with the `SFTTrainer` from [TRL](https://github.com/huggingface/trl).
For this example, we will fine-tune SmolLM2-1.7B on the `Python` subset of [the-stack-smol](https://huggingface.co/datasets/bigcode/the-stack-smol). This is just for illustration purposes.
To launch the training:
```bash
accelerate launch train.py \
--model_id "HuggingFaceTB/SmolLM2-1.7B" \
--dataset_name "bigcode/the-stack-smol" \
--subset "data/python" \
--dataset_text_field "content" \
--split "train" \
--max_seq_length 2048 \
--max_steps 5000 \
--micro_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 3e-4 \
--warmup_steps 100 \
--num_proc "$(nproc)"
```
If you want to fine-tune on other text datasets, change the `dataset_text_field` argument to the name of the column containing the code/text you want to train on.
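After training, `train.py` saves the LoRA adapter under `<output_dir>/final_checkpoint/` and, with the default `--save_merged_model`, also writes a merged model to `<output_dir>/final_merged_checkpoint/`. A minimal sketch for sanity-checking the merged checkpoint, assuming the script's default `--output_dir` of `finetune_smollm2_python`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# train.py does not save the tokenizer, so load it from the base model
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
model = AutoModelForCausalLM.from_pretrained(
    "finetune_smollm2_python/final_merged_checkpoint",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```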
---
annotations_creators: []
language_creators:
- crowdsourced
language: ["code"]
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets: []
task_categories:
- text-generation
task_ids:
- language-modeling
extra_gated_prompt: |-
## Terms of Use for The Stack
The Stack dataset is a collection of 3.1 TB of source code in 30 programming languages. We ask that you read and acknowledge the following points before using the dataset:
1. The Stack is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack must abide by the terms of the original licenses, including attribution clauses when relevant. We facilitate this by providing provenance information for each data point.
2. The Stack is regularly updated to enact validated data removal requests. By clicking on "Access repository", you agree to update your own version of The Stack to the most recent usable version specified by the maintainers in [the following thread](https://huggingface.co/datasets/bigcode/the-stack/discussions/7). If you have questions about dataset versions and allowed uses, please also ask them in the dataset’s [community discussions](https://huggingface.co/datasets/bigcode/the-stack/discussions/new). We will also notify users via email when the latest usable version changes.
3. To host, share, or otherwise provide access to The Stack dataset, you must include [these Terms of Use](https://huggingface.co/datasets/bigcode/the-stack#terms-of-use-for-the-stack) and require users to agree to it.
By clicking on "Access repository" below, you accept that your contact information (email address and username) can be shared with the dataset maintainers as well.
extra_gated_fields:
Email: text
I have read the License and agree with its terms: checkbox
---
## Dataset Description
![Smol](https://huggingface.co/datasets/bigcode/admin/resolve/main/smol.png)
A small subset (~0.1%) of [the-stack](https://huggingface.co/datasets/bigcode/the-stack) dataset: each programming language has 10,000 random samples from the original dataset. In total, the dataset contains 2.6 GB of text (code).
## Languages
The dataset contains 30 programming languages:
```
"assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java",
"javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust",
"scala", "shell", "sql", "tex", "typescript", "visual-basic"
```
## Dataset Structure
```python
from datasets import load_dataset
load_dataset("bigcode/the-stack-smol")
DatasetDict({
train: Dataset({
features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
num_rows: 300000
})
})
```
### How to use it
You can either load the whole dataset as above, or load a specific language such as Python by specifying its data directory:
```python
load_dataset("bigcode/the-stack-smol", data_dir="data/python")
DatasetDict({
train: Dataset({
features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
num_rows: 10000
})
})
```
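For quick inspection, load a split and index into it; `content` holds the source code and the remaining fields are metadata:
```python
from datasets import load_dataset

# Load only the Python subset (10,000 samples) rather than all 30 languages
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

sample = ds[0]
print(sample["lang"], sample["path"], sample["size"])  # metadata fields
print(sample["content"][:200])                         # first 200 characters of the file
```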
{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"bias": "none",
"fan_in_fan_out": false,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 32,
"lora_dropout": 0.05,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 16,
"rank_pattern": {},
"revision": null,
"target_modules": [
"v_proj",
"q_proj"
],
"task_type": "CAUSAL_LM",
"use_dora": false,
"use_rslora": false
}
{
"_name_or_path": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.1,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 24,
"num_key_value_heads": 32,
"pad_token_id": 2,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 130000,
"tie_word_embeddings": true,
"torch_dtype": "float32",
"transformers.js_config": {
"kv_cache_dtype": {
"fp16": "float16",
"q4f16": "float16"
}
},
"transformers_version": "4.46.2",
"use_cache": true,
"vocab_size": 49152
}
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 2,
"transformers_version": "4.46.2"
}
{
"add_prefix_space": false,
"added_tokens_decoder": {
"0": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<repo_name>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"4": {
"content": "<reponame>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"5": {
"content": "<file_sep>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"6": {
"content": "<filename>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<gh_stars>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"8": {
"content": "<issue_start>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"9": {
"content": "<issue_comment>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"10": {
"content": "<issue_closed>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"11": {
"content": "<jupyter_start>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"12": {
"content": "<jupyter_text>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"13": {
"content": "<jupyter_code>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"14": {
"content": "<jupyter_output>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"15": {
"content": "<jupyter_script>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"16": {
"content": "<empty_output>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>"
],
"bos_token": "<|im_start|>",
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"model_max_length": 2048,
"pad_token": "<|im_end|>",
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<|endoftext|>",
"vocab_size": 49152
}
{
"_name_or_path": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 24,
"num_key_value_heads": 32,
"pad_token_id": 2,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 130000,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers.js_config": {
"kv_cache_dtype": {
"fp16": "float16",
"q4f16": "float16"
}
},
"transformers_version": "4.46.2",
"use_cache": true,
"vocab_size": 49152
}
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 2,
"transformers_version": "4.46.2"
}
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
messages = [{"role": "user", "content": "Write a 100-word article on 'Benefits of Open-Source in AI research"}]
input_text=tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))
transformers
trl
peft
accelerate
datasets
scipy
wandb  # optional: disable logging with `wandb offline` or `wandb disabled`
bitsandbytes
# Code adapted from https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/supervised_finetuning.py
# and https://huggingface.co/blog/gemma-peft
import argparse
import multiprocessing
import os
import torch
import transformers
from accelerate import PartialState
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig
from transformers import (
AutoModelForCausalLM,
BitsAndBytesConfig,
is_torch_npu_available,
is_torch_xpu_available,
logging,
set_seed,
)
from trl import SFTTrainer
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("--model_id", type=str, default="HuggingFaceTB/SmolLM2-1.7B")
parser.add_argument("--dataset_name", type=str, default="bigcode/the-stack-smol")
parser.add_argument("--subset", type=str, default="data/python")
parser.add_argument("--split", type=str, default="train")
parser.add_argument("--dataset_text_field", type=str, default="content")
parser.add_argument("--max_seq_length", type=int, default=2048)
parser.add_argument("--max_steps", type=int, default=1000)
parser.add_argument("--micro_batch_size", type=int, default=1)
parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
parser.add_argument("--weight_decay", type=float, default=0.01)
parser.add_argument("--bf16", type=bool, default=True)
parser.add_argument("--use_bnb", type=bool, default=False)
parser.add_argument("--attention_dropout", type=float, default=0.1)
parser.add_argument("--learning_rate", type=float, default=2e-4)
parser.add_argument("--lr_scheduler_type", type=str, default="cosine")
parser.add_argument("--warmup_steps", type=int, default=100)
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--output_dir", type=str, default="finetune_smollm2_python")
parser.add_argument("--num_proc", type=int, default=None)
parser.add_argument("--save_merged_model", type=bool, default=True)
parser.add_argument("--push_to_hub", type=bool, default=True)
parser.add_argument("--repo_id", type=str, default="SmolLM2-1.7B-finetune")
return parser.parse_args()
def main(args):
# config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj"],
bias="none",
task_type="CAUSAL_LM",
)
bnb_config = None
if args.use_bnb:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
# load model and dataset
token = os.environ.get("HF_TOKEN", None)
model = AutoModelForCausalLM.from_pretrained(
args.model_id,
quantization_config=bnb_config,
device_map={"": PartialState().process_index},
attention_dropout=args.attention_dropout,
)
data = load_dataset(
args.dataset_name,
data_dir=args.subset,
split=args.split,
token=token,
num_proc=args.num_proc if args.num_proc else multiprocessing.cpu_count(),
)
# setup the trainer
trainer = SFTTrainer(
model=model,
train_dataset=data,
max_seq_length=args.max_seq_length,
args=transformers.TrainingArguments(
per_device_train_batch_size=args.micro_batch_size,
gradient_accumulation_steps=args.gradient_accumulation_steps,
warmup_steps=args.warmup_steps,
max_steps=args.max_steps,
learning_rate=args.learning_rate,
lr_scheduler_type=args.lr_scheduler_type,
weight_decay=args.weight_decay,
bf16=args.bf16,
logging_strategy="steps",
logging_steps=10,
output_dir=args.output_dir,
optim="paged_adamw_8bit",
seed=args.seed,
run_name=f"train-{args.model_id.split('/')[-1]}",
report_to="wandb",
),
peft_config=lora_config,
dataset_text_field=args.dataset_text_field,
)
# launch
print("Training...")
trainer.train()
print("Saving the last checkpoint of the model")
model.save_pretrained(os.path.join(args.output_dir, "final_checkpoint/"))
if args.save_merged_model:
# Free memory for merging weights
del model
if is_torch_xpu_available():
torch.xpu.empty_cache()
elif is_torch_npu_available():
torch.npu.empty_cache()
else:
torch.cuda.empty_cache()
model = AutoPeftModelForCausalLM.from_pretrained(os.path.join(args.output_dir, "final_checkpoint/"), device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()
output_merged_dir = os.path.join(args.output_dir, "final_merged_checkpoint")
model.save_pretrained(output_merged_dir, safe_serialization=True)
if args.push_to_hub:
model.push_to_hub(args.repo_id, commit_message="Upload model")
print("Training Done! 💥")
if __name__ == "__main__":
args = get_args()
set_seed(args.seed)
os.makedirs(args.output_dir, exist_ok=True)
logging.set_verbosity_error()
main(args)
accelerate launch train.py \
--model_id "HuggingFaceTB/SmolLM2-1.7B-Instruct" \
--dataset_name "bigcode/the-stack-smol" \
--subset "data/python" \
--dataset_text_field "content" \
--split "train" \
--max_seq_length 2048 \
--max_steps 5000 \
--micro_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 3e-4 \
--warmup_steps 100 \
--num_proc "$(nproc)"
import torch
from transformers import AutoProcessor, Idefics3ForConditionalGeneration
from PIL import Image
import cv2
import numpy as np
from typing import List
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class VideoFrameExtractor:
def __init__(self, max_frames: int = 50):
self.max_frames = max_frames
def resize_and_center_crop(self, image: Image.Image, target_size: int) -> Image.Image:
# Get current dimensions
width, height = image.size
# Calculate new dimensions keeping aspect ratio
if width < height:
new_width = target_size
new_height = int(height * (target_size / width))
else:
new_height = target_size
new_width = int(width * (target_size / height))
# Resize
image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)
# Center crop
left = (new_width - target_size) // 2
top = (new_height - target_size) // 2
right = left + target_size
bottom = top + target_size
return image.crop((left, top, right, bottom))
def extract_frames(self, video_path: str) -> List[Image.Image]:
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
raise ValueError(f"Could not open video: {video_path}")
# Get video properties
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
fps = int(cap.get(cv2.CAP_PROP_FPS))
# Calculate frame indices to extract (1fps)
frame_indices = list(range(0, total_frames, fps))
# If we have more frames than max_frames, sample evenly
if len(frame_indices) > self.max_frames:
indices = np.linspace(0, len(frame_indices) - 1, self.max_frames, dtype=int)
frame_indices = [frame_indices[i] for i in indices]
frames = []
for frame_idx in frame_indices:
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
ret, frame = cap.read()
if ret:
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
pil_image = Image.fromarray(frame)
pil_image = self.resize_and_center_crop(pil_image, 384)
frames.append(pil_image)
cap.release()
return frames
def load_model(checkpoint_path: str, base_model_id: str = "HuggingFaceTB/SmolVLM-Instruct", device: str = "cuda"):
# Load processor from original model
processor = AutoProcessor.from_pretrained(base_model_id)
if checkpoint_path:
# Load fine-tuned model from checkpoint
model = Idefics3ForConditionalGeneration.from_pretrained(
checkpoint_path,
torch_dtype=torch.bfloat16,
device_map=device
)
else:
model = Idefics3ForConditionalGeneration.from_pretrained(
base_model_id,
torch_dtype=torch.bfloat16,
device_map=device
)
# Configure processor for video frames
processor.image_processor.size = (384, 384)
processor.image_processor.do_resize = False
processor.image_processor.do_image_splitting = False
return model, processor
def generate_response(model, processor, video_path: str, question: str, max_frames: int = 50):
# Extract frames
frame_extractor = VideoFrameExtractor(max_frames)
frames = frame_extractor.extract_frames(video_path)
logger.info(f"Extracted {len(frames)} frames from video")
# Create prompt with frames
image_tokens = [{"type": "image"} for _ in range(len(frames))]
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "Answer briefly."},
*image_tokens,
{"type": "text", "text": question}
]
}
]
# Process inputs
inputs = processor(
text=processor.apply_chat_template(messages, add_generation_prompt=True),
images=[img for img in frames],
return_tensors="pt"
).to(model.device)
# Generate response
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
temperature=0.7,
do_sample=True,
use_cache=True
)
# Decode response
response = processor.decode(outputs[0], skip_special_tokens=True)
return response
def main():
# Configuration
#checkpoint_path = "/path/to/your/checkpoint"
checkpoint_path = None
base_model_id = "HuggingFaceTB/SmolVLM-Instruct"
video_path = "/path/to/video.mp4"
question = "Describe the video"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load model
logger.info("Loading model...")
model, processor = load_model(checkpoint_path, base_model_id, device)
# Generate response
logger.info("Generating response...")
response = generate_response(model, processor, video_path, question)
# Print results
print("Question:", question)
print("Response:", response)
if __name__ == "__main__":
main()
# Local inference
You can use SmolLM2 models locally with frameworks like Transformers.js, llama.cpp, MLX and MLC.
Here you can find the code for running SmolLM locally using each of these libraries. You can also find the conversions of SmolLM & SmolLM2 in these collections: [SmolLM1](https://huggingface.co/collections/HuggingFaceTB/local-smollms-66c0f3b2a15b4eed7fb198d0) and [SmolLM2](https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9).
Please first install each library by following its documentation:
- [Transformers.js](https://github.com/huggingface/transformers.js)
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- [MLX](https://github.com/ml-explore/mlx)
- [MLC](https://github.com/mlc-ai/web-llm)
## Demos
Below are some demos we built for running SmolLM models in the browser:
- [WebGPU demo](https://huggingface.co/spaces/HuggingFaceTB/SmolLM2-1.7B-Instruct-WebGPU) of SmolLM2 1.7B Instruct powered by Transformers.js and ONNX Runtime Web.
- [Bunny B1](https://github.com/dottxt-ai/demos/tree/main/its-a-smol-world) mapping natural language requests to local application calls using function calling and structured generation by [outlines](https://github.com/dottxt-ai/outlines).
- [Instant SmolLM](https://huggingface.co/spaces/HuggingFaceTB/instant-smollm) powered by MLC for real-time generation with SmolLM-360M-Instruct.
The models are also available on [Ollama](https://ollama.com/library/smollm2) and [PocketPal-AI](https://github.com/a-ghorbani/pocketpal-ai).
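If you are using MLX, a minimal Python sketch with the `mlx_lm` package is shown below; the converted repo id is an assumption, so check the mlx-community organization on the Hub for the exact SmolLM2 conversions (including quantized variants).
```python
from mlx_lm import load, generate

# Repo id is an assumption; see mlx-community on the Hub for available conversions
model, tokenizer = load("mlx-community/SmolLM2-1.7B-Instruct")

prompt = "What is the capital of France?"
response = generate(model, tokenizer, prompt=prompt, verbose=True)
```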
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF",
filename="*q4_k_m.gguf",
verbose=False
)
output = llm(
"Q: Name the planets in the solar system? A: ", # Prompt
max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
from mlc_llm import MLCEngine
# Create engine
model = "HF://mlc-ai/SmolLM2-1.7B-Instruct-q0f16-MLC"
engine = MLCEngine(model)
# Run chat completion in the OpenAI API style.
for response in engine.chat.completions.create(
messages=[{"role": "user", "content": "What is the meaning of life?"}],
model=model,
stream=True,
):
for choice in response.choices:
print(choice.delta.content, end="", flush=True)
print("\n")
engine.terminate()