# New Tokenizer System

## Key Differences from the Old Tokenizer System

### 1. Hugging Face–style API

We now have a `MegatronTokenizer` class that provides a familiar, simple API similar to Hugging Face's:

- `.from_pretrained()`: Load a tokenizer from a directory or file, automatically detecting the type and settings.
- `.write_metadata()`: Save the tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.

This eliminates the need for long initialization arguments and hard-coded settings in training scripts.

### 2. Tokenizer Metadata

A metadata file (JSON) now stores all essential tokenizer configuration in one place:

- Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken)
- Chat templates
- Tokenizer class

Benefits:

- You only need to set these parameters once.
- No more passing multiple CLI arguments for tokenizer settings.
- Easy sharing: just copy the tokenizer directory with its metadata file.

### 3. Library Classes Are Now Internal

In the old system, you had to know which tokenizer library to use (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) and instantiate it manually. In the new system:

- The library is automatically detected from the metadata.
- The correct tokenizer implementation is chosen under the hood.
- Users don't need to manually manage tokenizer classes.

### 4. Support for Model-specific Tokenizer Classes

The system now supports:

- Built-in LLM-specific tokenizers.
- Custom tokenizers: you can create your own tokenizer class by inheriting from `MegatronTokenizerText` and specifying it in the `tokenizer_class` field of the metadata file.

This allows advanced customization while keeping the defaults simple for most users.

### 5. Usage

**Creating and Saving Metadata**

```python
from megatron.core.tokenizers import MegatronTokenizer

# The metadata will be stored as a file named tokenizer_metadata.json
# inside the tokenizer's directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use a custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    ...

MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save the metadata to another location
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)
```

**Restoring the Tokenizer**

```python
from megatron.core.tokenizers import MegatronTokenizer

MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If the metadata is not in the tokenizer's directory
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass the metadata as a dict
MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional parameters
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
```

### 6. Megatron-LM Pretraining Compatibility

The new tokenizer system is compatible with the Megatron-LM pretraining script. If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically.

```bash
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json
```

The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, add the `--legacy-tokenizer` flag.
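
**Illustrative sketch: how metadata-driven selection could work**

To make the metadata mechanism from the "Tokenizer Metadata" and "Library Classes Are Now Internal" sections more concrete, here is a minimal, standard-library-only sketch of the idea. It is not the Megatron implementation: the helper names (`write_metadata_sketch`, `load_metadata_sketch`, `pick_backend`), the toy `BACKENDS` registry, and the `chat_template` JSON key are assumptions made for illustration. Only the `tokenizer_metadata.json` file name, the `library` key, and the `tokenizer_class` field come from the text above, and the real schema may differ.

```python
import json
import tempfile
from pathlib import Path

# File name taken from the usage section; the JSON layout below is an assumption.
METADATA_FILENAME = "tokenizer_metadata.json"


def write_metadata_sketch(tokenizer_path, library, chat_template=None, tokenizer_class=None):
    """Write a small JSON metadata file into the tokenizer's directory (hypothetical helper)."""
    metadata = {"library": library}
    if chat_template is not None:
        metadata["chat_template"] = chat_template  # assumed key name
    if tokenizer_class is not None:
        metadata["tokenizer_class"] = tokenizer_class
    metadata_path = Path(tokenizer_path).parent / METADATA_FILENAME
    metadata_path.write_text(json.dumps(metadata, indent=2))
    return metadata_path


def load_metadata_sketch(tokenizer_path, metadata=None):
    """Resolve metadata from a dict, an explicit file path, or the tokenizer's directory."""
    if isinstance(metadata, dict):
        return dict(metadata)
    if metadata is not None:
        path = Path(metadata)
    else:
        path = Path(tokenizer_path).parent / METADATA_FILENAME
    return json.loads(path.read_text())


# Toy registry standing in for the internal library classes (names are made up).
BACKENDS = {
    "sentencepiece": "SentencePieceBackend",
    "huggingface": "HuggingFaceBackend",
    "null": "NullBackend",
}


def pick_backend(metadata):
    """Choose a backend from the `library` entry, so callers never name a class directly."""
    try:
        return BACKENDS[metadata["library"]]
    except KeyError as err:
        raise ValueError(f"Unsupported or missing tokenizer library: {err}") from None


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "tokenizer.model"
    model_path.touch()  # placeholder file; no real tokenizer model is needed here

    saved_path = write_metadata_sketch(model_path, "sentencepiece", chat_template="{{ messages }}")
    restored = load_metadata_sketch(model_path)
    backend_from_file = pick_backend(restored)
    backend_from_dict = pick_backend(load_metadata_sketch(None, {"library": "null"}))
```

The point of the design is that the caller supplies only a path (or a small dict), and the choice of implementation follows from the stored `library` value rather than from an explicitly instantiated class.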