---
title: "Inference and Merging"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
execute:
  enabled: false
---

This guide covers how to use your trained models for inference, including model loading, interactive testing, merging adapters, and common troubleshooting steps.

## Quick Start {#sec-quickstart}

::: {.callout-tip}
Use the same config file for inference and merging that you used for training.
:::

### Basic Inference {#sec-basic}

::: {.panel-tabset}

## LoRA Models

```{.bash}
axolotl inference your_config.yml --lora-model-dir="./lora-output-dir"
```

## Full Fine-tuned Models

```{.bash}
axolotl inference your_config.yml --base-model="./completed-model"
```

:::

## Advanced Usage {#sec-advanced}

### Gradio Interface {#sec-gradio}

Launch an interactive web interface:

```{.bash}
axolotl inference your_config.yml --gradio
```

### File-based Prompts {#sec-file-prompts}

Process prompts from a text file:

```{.bash}
cat /tmp/prompt.txt | axolotl inference your_config.yml \
  --base-model="./completed-model" --prompter=None
```

### Memory Optimization {#sec-memory}

For large models or GPUs with limited memory, enable 8-bit loading:

```{.bash}
axolotl inference your_config.yml --load-in-8bit=True
```

## Merging LoRA Weights {#sec-merging}

Merge LoRA adapters with the base model:

```{.bash}
axolotl merge-lora your_config.yml --lora-model-dir="./completed-model"
```

### Memory Management for Merging {#sec-memory-management}

::: {.panel-tabset}

## Configuration Options

```{.yaml}
gpu_memory_limit: 20GiB  # Adjust based on your GPU
lora_on_cpu: true        # Process on CPU if needed
```

## Force CPU Merging

```{.bash}
CUDA_VISIBLE_DEVICES="" axolotl merge-lora ...
```

:::
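Note that `20GiB` is a binary unit (GiB, not GB). As an illustrative sketch only (`mem_limit_bytes` is a hypothetical helper, not part of axolotl), converting such a string to bytes makes it easy to compare against your GPU's free memory, e.g. from `torch.cuda.mem_get_info()`:

```python
# Binary-unit suffixes, matching the GiB/MiB notation used in the config.
UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

def mem_limit_bytes(limit: str) -> int:
    """Convert a size string such as '20GiB' into a byte count."""
    for suffix, factor in UNITS.items():
        if limit.endswith(suffix):
            return int(float(limit[: -len(suffix)]) * factor)
    return int(limit)  # assume a bare byte count

print(mem_limit_bytes("20GiB"))  # -> 21474836480
```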

## Tokenization {#sec-tokenization}

### Common Issues {#sec-tokenization-issues}

::: {.callout-warning}
Tokenization mismatches between training and inference are a common source of problems.
:::

To debug:

1. Check training tokenization:
```{.bash}
axolotl preprocess your_config.yml --debug
```

2. Verify inference tokenization by decoding tokens before model input

3. Compare token IDs between training and inference
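Step 3 can be automated with a small helper (a minimal sketch; `first_divergence` is a hypothetical name, not an axolotl API). Feed it the token IDs logged by `preprocess --debug` and the IDs your inference tokenizer produces for the same text:

```python
def first_divergence(train_ids, infer_ids):
    """Return the first index at which two token-ID sequences differ,
    or None if they are identical."""
    for i, (a, b) in enumerate(zip(train_ids, infer_ids)):
        if a != b:
            return i
    if len(train_ids) != len(infer_ids):
        # One sequence is a strict prefix of the other.
        return min(len(train_ids), len(infer_ids))
    return None

# Example: a missing BOS token at inference shows up at position 0.
print(first_divergence([1, 32001, 15043], [32001, 15043]))  # -> 0
```

A divergence at position 0 typically points at BOS handling; one near a template boundary points at prompt formatting.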

### Special Tokens {#sec-special-tokens}

Configure special tokens in your YAML:

```{.yaml}
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"
```
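A quick sanity check after training is that every configured token exists as a single vocabulary entry; a token missing from the vocab gets split into sub-pieces at inference time. A minimal sketch, with a toy dict standing in for `tokenizer.get_vocab()` (`missing_from_vocab` is an illustrative helper, not an axolotl or transformers API):

```python
def missing_from_vocab(vocab, configured_tokens):
    """Return the configured tokens that the tokenizer vocabulary does not
    contain as single entries."""
    return [t for t in configured_tokens if t not in vocab]

# Toy vocab in place of tokenizer.get_vocab().
vocab = {"<s>": 1, "</s>": 2, "<unk>": 0, "<|im_start|>": 32001}
print(missing_from_vocab(vocab, ["<s>", "<|im_start|>", "<|im_end|>"]))
# -> ['<|im_end|>']
```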

## Troubleshooting {#sec-troubleshooting}

### Common Problems {#sec-common-problems}

::: {.panel-tabset}

## Memory Issues

- Use 8-bit loading
- Reduce batch sizes
- Try CPU offloading

## Token Issues

- Verify special tokens
- Check tokenizer settings
- Compare training and inference preprocessing

## Performance Issues

- Verify model loading
- Check prompt formatting
- Review temperature and sampling settings

:::

For more details, see our [debugging guide](debugging.qmd).