# New Tokenizer System

## Key Differences from the Old Tokenizer System

### 1. Hugging Face–style API

We now have a `MegatronTokenizer` class that provides a familiar, simple API similar to Hugging Face’s:

 - `.from_pretrained()` – Load a tokenizer from a directory or file, automatically detecting the type and settings.
 - `.write_metadata()` – Save the tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.

This eliminates the need for long initialization arguments and hard-coded settings in training scripts.
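
In its simplest form, setup and loading become a pair of calls. A minimal sketch, using the same placeholder paths as the Usage examples below:

```python
from megatron.core.tokenizers import MegatronTokenizer

# One-time: record the tokenizer configuration next to the model file.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
)

# From then on, a single call restores the fully configured tokenizer.
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)
```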

### 2. Tokenizer Metadata

A metadata file (JSON) now stores all essential tokenizer configuration in one place:
 - Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken)
 - Chat templates
 - Tokenizer class

Benefits:
 - You only need to set these parameters once.
 - No more passing multiple CLI arguments for tokenizer settings.
 - Easy sharing — just copy the tokenizer directory with its metadata file.
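
The dict form accepted by `from_pretrained` in the Usage section below (e.g., `{"library": "megatron"}`) suggests the shape of this file. A plausible sketch, where only the `library` key is confirmed by this document and the other key names are assumptions:

```python
# A plausible tokenizer_metadata.json, expressed as a Python dict.
# Only "library" is confirmed by the examples in this document;
# the other key names are illustrative assumptions.
metadata = {
    "library": "sentencepiece",              # tokenizer library
    "class": "MegatronTokenizerText",        # tokenizer class (assumed key name)
    "chat_template": "{{ messages }}",       # Jinja chat template (assumed key name)
}
```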

### 3. Library Classes Are Now Internal

In the old system, you had to know which tokenizer library to use (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) and instantiate it manually.

In the new system:
 - The library is automatically detected from the metadata.
 - The correct tokenizer implementation is chosen under the hood.
 - Users don’t need to manually manage tokenizer classes.
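
A before-and-after sketch; the old-style instantiation is schematic, since the exact legacy import paths are not shown in this document:

```python
# Old system (schematic): the caller picked the library-specific class.
# tokenizer = SentencePieceTokenizer("/path/to/tokenizer.model")

# New system: the library is read from the metadata and resolved internally.
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)
```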

### 4. Support for Model-specific Tokenizer Classes

The system now supports:
 - Built-in LLM-specific tokenizers.
 - Custom tokenizers: you can create your own tokenizer class by inheriting from `MegatronTokenizerText` and specifying it in the `tokenizer_class` field of the metadata file, as sketched below.

This allows advanced customization while keeping defaults simple for most users.
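
A sketch of what such a subclass might look like; the overridden method names (`tokenize`, `detokenize`) are assumptions about the `MegatronTokenizerText` interface, not a confirmed contract:

```python
from megatron.core.tokenizers.text import MegatronTokenizerText

class LowercasingTokenizer(MegatronTokenizerText):
    """Hypothetical tokenizer that lowercases text before encoding."""

    def tokenize(self, text):  # method name is an assumption
        return super().tokenize(text.lower())

    def detokenize(self, token_ids):  # method name is an assumption
        return super().detokenize(token_ids)
```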

### 5. Usage

**Creating and Saving Metadata**

```python
from megatron.core.tokenizers import MegatronTokenizer

# The metadata will be stored as a file named tokenizer_metadata.json inside the tokenizer’s directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use a custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    ...

MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save the metadata to a different directory
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)

```

**Restoring the Tokenizer**

```python
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If the metadata file is not in the tokenizer's directory
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass the metadata as a dict
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional parameters
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
tokenizer = MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)

```
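
The restored object can then be used directly. A minimal sketch, assuming Megatron's usual `tokenize`/`detokenize` method names:

```python
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

token_ids = tokenizer.tokenize("Hello, world!")  # text -> token ids (assumed method name)
text = tokenizer.detokenize(token_ids)           # token ids -> text (assumed method name)
```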

### 6. Megatron-LM Pretraining Compatibility

The new tokenizer system is compatible with the Megatron-LM pretraining script. If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically.

```bash
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json

```

The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, add the `--legacy-tokenizer` flag, as in the sketch below.
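
A sketch mirroring the commands above:

```bash
# Legacy tokenizer system
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --legacy-tokenizer
```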