# LLM Inference Documentation
If you want to use vLLM for inference acceleration, see [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md#inference-acceleration).

## Table of Contents
- [Environment Preparation](#environment-preparation)
- [Inference](#inference)
- [Web-UI](#web-ui)

## Environment Preparation
Supported GPU devices include the A10, 3090, V100, and A100.
```bash
# Install ms-swift
pip install 'ms-swift[llm]' -U

# Required only if you want to run inference with auto_gptq-quantized models.
# Models using auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/LLM/Supported Models and Datasets.md#Models`
# The auto_gptq version must match your CUDA version; select it according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation`
pip install auto_gptq -U

# Environment alignment (usually unnecessary; run the commands below only if you encounter errors, since the repository is tested against the latest environment)
pip install -r requirements/framework.txt  -U
pip install -r requirements/llm.txt  -U
```
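As a quick sanity check after installation, you can list which of the required packages are present. This is a hypothetical helper, not part of ms-swift; it only queries installed package metadata:

```python
from importlib.metadata import version, PackageNotFoundError

def report(pkgs):
    """Map each package name to its installed version, or None if missing."""
    out = {}
    for name in pkgs:
        try:
            out[name] = version(name)
        except PackageNotFoundError:
            out[name] = None
    return out

# Packages the examples below depend on (auto_gptq is optional).
print(report(['ms-swift', 'auto_gptq', 'torch']))
```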

## Inference
### qwen-7b-chat
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: qwen


kwargs = {}
# kwargs['use_flash_attn'] = True  # use flash_attn

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}, **kwargs)
# modify max_new_tokens
model.generation_config.max_new_tokens = 128

template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Where is the capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'What are some famous foods there?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')

"""Out[0]
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang province is Hangzhou.
query: What are some famous foods there?
response: Hangzhou has many famous local foods, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Pork Ribs, Spicy Beef, etc. In addition, there are also Hangzhou specialties like Osmanthus Cake, Lotus Seed Pastry, Ai Wo Wo, and more.
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang province is Hangzhou.'), ('What are some famous foods there?', 'Hangzhou has many famous local foods, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Pork Ribs, Spicy Beef, etc. In addition, there are also Hangzhou specialties like Osmanthus Cake, Lotus Seed Pastry, Ai Wo Wo, and more.')]
"""

# Stream the output (verbose=True also prints the full chat template)
inference(model, template, 'What was the first question?', history, verbose=True, stream=True)
"""Out[1]
[PROMPT]<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Where is the capital of Zhejiang?<|im_end|>
<|im_start|>assistant
The capital of Zhejiang province is Hangzhou.<|im_end|>
<|im_start|>user
What are some famous foods there?<|im_end|>
<|im_start|>assistant
Hangzhou has many famous local foods, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Pork Ribs, Spicy Beef, etc. In addition, there are also Hangzhou specialties like Osmanthus Cake, Lotus Seed Pastry, Ai Wo Wo, and more.<|im_end|>
<|im_start|>user
What was the first question<|im_end|>
<|im_start|>assistant
[OUTPUT]Your first question was "Where is the capital of Zhejiang?"<|im_end|>
"""
```

### qwen-7b-chat-int4
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.qwen_7b_chat_int4
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: qwen

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})

template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Where is the capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'What are some famous foods there?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')

"""Out[0]
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang province is Hangzhou.
query: What are some famous foods there?
response: Hangzhou has many famous local delicacies, such as West Lake Vinegar Fish, Dongpo Pork, Song Sao Fish Soup, Beggar's Chicken, etc. In addition, there are also Hangzhou specialties like Osmanthus Sugar Lotus Root, Fermented Glutinous Rice Dumplings, Mapo Tofu, and more.
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang province is Hangzhou.'), ('What are some famous foods there?', "Hangzhou has many famous local delicacies, such as West Lake Vinegar Fish, Dongpo Pork, Song Sao Fish Soup, Beggar's Chicken, etc. In addition, there are also Hangzhou specialties like Osmanthus Sugar Lotus Root, Fermented Glutinous Rice Dumplings, Mapo Tofu, and more.")]
"""
```

### qwen-7b
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.qwen_7b
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: default-generation

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 64
template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Zhejiang -> Hangzhou\nAnhui -> Hefei\nSichuan ->'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
"""Out[0]
query: Zhejiang -> Hangzhou
Anhui -> Hefei
Sichuan ->
response:  Chengdu
Shandong -> Jinan
Fujian -> Fuzhou
Chongqing -> Chongqing
Guangdong -> Guangzhou
Beijing -> Beijing
Zhejiang -> Hangzhou
Anhui -> Hefei
Sichuan -> Chengdu
Shandong -> Jinan
Fujian -> Fuzhou
Chongqing
"""
```

### Stream Output
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference_stream, ModelType, get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: qwen

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})

template = get_template(template_type, tokenizer)
seed_everything(42)

query = 'What is the capital of Zhejiang Province?'
gen = inference_stream(model, template, query)
print(f'query: {query}')
for response, history in gen:
    pass
print(f'response: {response}')

# Method 1: print the full response at each step
query = 'What is there to eat?'
old_history = history
gen = inference_stream(model, template, query, old_history)
print(f'query: {query}')
for response, history in gen:
    print(f'response: {response}')
print(f'history: {history}')

# Method 2: print only the newly generated delta at each step
query = 'What is there to eat?'
gen = inference_stream(model, template, query, old_history)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print(f'\nhistory: {history}')

"""Out[0]
query: What is the capital of Zhejiang Province?
response: The capital of Zhejiang Province is Hangzhou.
query: What is there to eat?
response: Zhejiang
response: Zhejiang cuisine,
response: Zhejiang cuisine,
response: Zhejiang cuisine, also
...
response: Zhejiang cuisine, also known as "Hangzhou cuisine", is one of the eight traditional Chinese cuisines and is famous for its delicate taste, light fragrance, and natural appearance. It has a long history and is influenced by various cultures, including Huaiyang cuisine, Jiangnan cuisine, and Cantonese cuisine. Some popular dishes include West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea-Scented Chicken, Braised Preserved Bamboo Shoots with Shredded Pork, and Steamed Stuffed Buns. There are many other delicious dishes that you can try when visiting Zhejiang.
history: [['What is the capital of Zhejiang Province?', 'The capital of Zhejiang Province is Hangzhou.'], ['What is there to eat?', 'Zhejiang cuisine, also known as "Hangzhou cuisine", is one of the eight traditional Chinese cuisines and is famous for its delicate taste, light fragrance, and natural appearance. It has a long history and is influenced by various cultures, including Huaiyang cuisine, Jiangnan cuisine, and Cantonese cuisine. Some popular dishes include West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea-Scented Chicken, Braised Preserved Bamboo Shoots with Shredded Pork, and Steamed Stuffed Buns. There are many other delicious dishes that you can try when visiting Zhejiang.']]
query: What is there to eat?
response: There are many delicious foods to try in Hangzhou, such as West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea Pancakes, and XiHu-style Mandarin Duck. Additionally, Hangzhou is famous for its snacks like xiaolongbao (soup dumplings), qingtuan (green tea cakes), and huoguoliangzi (cold barley noodles).
history: [['What is the capital of Zhejiang Province?', 'The capital of Zhejiang Province is Hangzhou.'], ['What is there to eat?', 'There are many delicious foods to try in Hangzhou, such as West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea Pancakes, and XiHu-style Mandarin Duck. Additionally, Hangzhou is famous for its snacks like xiaolongbao (soup dumplings), qingtuan (green tea cakes), and huoguoliangzi (cold barley noodles).']]
"""
```
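The delta-printing loop in method 2 can be factored into a small reusable helper. This is a sketch, not part of swift; it assumes the generator yields `(response, history)` pairs where `response` grows monotonically, as `inference_stream` does above:

```python
def stream_print(gen, print_idx=0):
    """Print only the newly generated suffix of each streamed response.

    Consumes a generator of (response, history) pairs and prints the
    delta between consecutive responses, then returns the final pair.
    """
    response, history = '', None
    for response, history in gen:
        print(response[print_idx:], end='', flush=True)
        print_idx = len(response)
    print()
    return response, history
```

Usage then reduces to `response, history = stream_print(inference_stream(model, template, query, history))`.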

### qwen-vl-chat
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.qwen_vl_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: qwen

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})

template = get_template(template_type, tokenizer)
seed_everything(42)
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'What is this?'},
])
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'Output the bounding box for the high-five'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
image = tokenizer.draw_bbox_on_latest_picture(response, history)
image.save('output_chat.jpg')
"""
query: Picture 1:<img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>
What is this
response: The image shows a woman playing with a dog, which appears to be a Labrador Retriever, on a beach.
query: Output the bounding box for the high-five
response: <ref>High-five</ref><box>(523,513),(584,605)</box>
history: [('Picture 1:<img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\nWhat is this', 'The image shows a woman playing with a dog, which appears to be a Labrador Retriever, on a beach.'), ('Output the bounding box for the high-five', '<ref>High-five</ref><box>(523,513),(584,605)</box>')]
"""
```

### qwen-audio-chat
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.qwen_audio_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: qwen

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})

template = get_template(template_type, tokenizer)

seed_everything(42)
query = tokenizer.from_list_format([
    {'audio': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac'},
    {'text': 'what does the person say?'},
])
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'Find the start time and end time of the word "middle classes'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
"""Out[0]
query: Audio 1:<audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac</audio>
what does the person say?
response: The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".
query: Find the start time and end time of the word "middle classes
response: The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.
history: [('Audio 1:<audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac</audio>\nwhat does the person say?', 'The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".'), ('Find the start time and end time of the word "middle classes', 'The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.')]
"""
```

### chatglm3
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.chatglm3_6b
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: chatglm3

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 128

template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Where is the capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'What are some famous foods there?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')

"""Out[0]
response: Zhejiang has many delicious foods, here are some famous ones:

1. Hangzhou Xiaolongbao: This is a famous traditional snack in Hangzhou, with a thin, elastic skin and juicy, delicious filling.

2. West Lake Vinegar Fish: This is one of Hangzhou's famous dishes, made by cooking grass carp and pouring over a specially made paste and vinegar, giving it a delicious flavor.

3. Zhejiang Stewed Chicken: This is one of the traditional famous dishes of Zhejiang province, made by slowly stewing chicken with ginger, green onion, soy sauce and other seasonings, resulting in a rich flavor.

4. Youpodouci: This is a traditional Zhejiang pastry, with a crispy exterior and sweet filling
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang is Hangzhou.'), ('What are some famous foods there?', 'Zhejiang has many delicious foods, here are some famous ones:\n\n1. Hangzhou Xiaolongbao: This is a famous traditional snack in Hangzhou, with a thin, elastic skin and juicy, delicious filling. \n\n2. West Lake Vinegar Fish: This is one of Hangzhou's famous dishes, made by cooking grass carp and pouring over a specially made paste and vinegar, giving it a delicious flavor.\n\n3. Zhejiang Stewed Chicken: This is one of the traditional famous dishes of Zhejiang province, made by slowly stewing chicken with ginger, green onion, soy sauce and other seasonings, resulting in a rich flavor. \n\n4. Youpodouci: This is a traditional Zhejiang pastry, with a crispy exterior and sweet filling')]
"""
```


### BitsAndBytes Quantization
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
from modelscope import BitsAndBytesConfig
import torch

model_type = ModelType.chatglm3_6b
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: chatglm3

torch_dtype = torch.bfloat16
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_use_double_quant=True)
model, tokenizer = get_model_tokenizer(model_type, torch_dtype, {'device_map': 'auto',
                                      'quantization_config': quantization_config})
model.generation_config.max_new_tokens = 128
template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Where is the capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'What are some famous foods there?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')

"""Out[0]
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang is Hangzhou.
query: What are some famous foods there?
response: Zhejiang has many delicious foods, here are some famous ones:

1. Hangzhou Xiaolongbao: This is a famous traditional snack in Hangzhou, with a thin, elastic skin and juicy, delicious filling.

2. Zhejiang Zongzi: Zhejiang zongzi come in many flavors, such as salted egg yolk pork zongzi, red bean paste zongzi, etc., with Hangzhou zongzi being the most famous.

3. Oil Fried Shrimp: This is one of the most representative seafood dishes in Zhejiang, made by stir-frying shrimp in hot oil until crispy and tender.

4. Salt and Pepper Shredded Potato: This is a traditional Zhejiang vegetable dish, made by stir-frying shredded potato with salt and pepper, resulting in a crisp and refreshing taste.

history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang is Hangzhou.'), ('What are some famous foods there?', 'Zhejiang has many delicious foods, here are some famous ones:\n\n1. Hangzhou Xiaolongbao: This is a famous traditional snack in Hangzhou, with a thin, elastic skin and juicy, delicious filling.\n\n2. Zhejiang Zongzi: Zhejiang zongzi come in many flavors, such as salted egg yolk pork zongzi, red bean paste zongzi, etc., with Hangzhou zongzi being the most famous. \n\n3. Oil Fried Shrimp: This is one of the most representative seafood dishes in Zhejiang, made by stir-frying shrimp in hot oil until crispy and tender.\n\n4. Salt and Pepper Shredded Potato: This is a traditional Zhejiang vegetable dish, made by stir-frying shredded potato with salt and pepper, resulting in a crisp and refreshing taste.\n')]
"""
```

### Using CLI
```bash
# qwen
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-7b-chat
# yi
CUDA_VISIBLE_DEVICES=0 swift infer --model_type yi-6b-chat
```

### Fine-tuned Models
If you want to perform inference with a fine-tuned model, see the [LLM Fine-tuning Documentation](LLM-fine-tuning.md#Fine-tuned-Model).


## Web-UI
### qwen-7b-chat
Using CLI:
```bash
CUDA_VISIBLE_DEVICES=0 swift app-ui --model_type qwen-7b-chat
```

Using Python:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import AppUIArguments, ModelType, app_ui_main

app_ui_args = AppUIArguments(model_type=ModelType.qwen_7b_chat)
app_ui_main(app_ui_args)
```

Using bnb quantization:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import AppUIArguments, ModelType, app_ui_main

app_ui_args = AppUIArguments(model_type=ModelType.qwen_7b_chat, quantization_bit=4)
app_ui_main(app_ui_args)
```

### qwen-7b
Using CLI:
```bash
CUDA_VISIBLE_DEVICES=0 swift app-ui --model_type qwen-7b
```

Using Python:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import AppUIArguments, ModelType, app_ui_main

app_ui_args = AppUIArguments(model_type=ModelType.qwen_7b)
app_ui_main(app_ui_args)
```

### Fine-tuned Models
To use the web-ui with a fine-tuned model, see the [LLM Fine-tuning Documentation](LLM-fine-tuning.md#fine-tuned-model).