Unverified commit 458c0219, authored by Simo Lin, committed by GitHub

[router] simplify tokenizer dev doc (#10895)

parent a73eb8cd
# Tokenizer Architecture
## 1. Executive Summary
### High-Level Overview
The SGL Router tokenizer layer provides a unified interface for text tokenization and detokenization, supporting multiple tokenizer backends (HuggingFace, Tiktoken, Mock) with sophisticated streaming capabilities and stop sequence detection. The architecture follows a trait-based design pattern enabling pluggable tokenizer implementations while maintaining consistent APIs across the router.
**Key Components:**
- **Factory Pattern**: Auto-detection and creation of appropriate tokenizer types from files or model names
- **HuggingFace Hub Integration**: Automatic downloading of tokenizer files from HuggingFace Hub for model IDs
- **Trait System**: `Encoder`, `Decoder`, and `Tokenizer` traits for implementation flexibility
- **Streaming**: Incremental decoding with UTF-8 boundary handling and buffering
- **Stop Sequences**: Complex pattern matching for stop tokens and sequences with "jail" buffering
- **Sequence Management**: Stateful token sequence tracking with incremental text generation
- **Chat Templates**: Jinja2-based conversation formatting with HuggingFace compatibility
- **Metrics Integration**: Comprehensive performance and error tracking across all operations
**Data Flow:**
1. Request → Factory (type detection/HF download) → Concrete Tokenizer Creation
2. Encode: Text → Tokenizer → Encoding (token IDs)
3. Stream: Token IDs → DecodeStream → Incremental Text Chunks
4. Stop Detection: Tokens → StopSequenceDecoder → Text/Held/Stopped
5. Sequence: Tokens → Sequence → Incremental Decoding → Text Output
### Architecture Highlights
- **Extended Backend Support**: HuggingFace, Tiktoken (GPT models), and Mock for testing
- **HuggingFace Hub Integration**: Automatic tokenizer downloads with caching
- **Comprehensive Metrics**: Full TokenizerMetrics integration for observability
- **Unified Dependencies**: All tokenizer backends included by default (no feature gates)
- **Stop Sequence Detection**: Sophisticated partial matching with jail buffer
- **Chat Template Support**: Full Jinja2 rendering with HuggingFace compatibility
- **Thread Safety**: Arc-based sharing with Send + Sync guarantees
## 2. Mermaid Diagrams
### Component Flow Diagram
```mermaid
graph TB
subgraph Input
R[Request] --> F[Factory]
end
subgraph Factory Layer
F --> FD[File Detection]
F --> MD[Model Detection]
FD --> HF[HuggingFace]
FD --> TK[Tiktoken]
MD --> TK
FD --> MK[Mock]
end
subgraph Tokenizer Implementations
HF --> T[Tokenizer Wrapper]
TK --> T
MK --> T
end
subgraph Processing
T --> E[Encode]
T --> D[Decode]
T --> DS[DecodeStream]
T --> SQ[Sequence]
T --> SD[StopSequenceDecoder]
end
subgraph Output
E --> ENC[Encoding]
D --> TXT[Text]
DS --> STRM[Stream Chunks]
SQ --> ITXT[Incremental Text]
SD --> SO[Stop Output]
end
subgraph Metrics
M[TokenizerMetrics]
E -.-> M
D -.-> M
DS -.-> M
SD -.-> M
end
```
### Sequence Flow Diagram
```mermaid
sequenceDiagram
participant C as Client
participant F as Factory
participant T as Tokenizer
participant DS as DecodeStream
participant SD as StopDecoder
participant M as Metrics
C->>F: create_tokenizer(path_or_model_id)
F->>F: detect_type()
alt local file
F->>T: new HF/Tiktoken/Mock
else HuggingFace model ID
F->>F: download_tokenizer_from_hf()
F->>T: new from downloaded files
end
F->>M: record_factory_load()
F-->>C: Arc<dyn Tokenizer>
C->>T: encode(text)
T->>M: record_encode_request()
T->>T: tokenize
T->>M: record_tokens_per_encode()
T-->>C: Encoding
C->>DS: new(tokenizer, tokens)
loop streaming
C->>DS: step(token_id)
DS->>T: decode(partial)
DS->>DS: check UTF-8 boundary
alt complete char
DS->>M: record_stream_token()
DS-->>C: Some(text)
else incomplete
DS->>M: record_incomplete_utf8()
DS-->>C: None
end
end
C->>SD: process_token(id)
SD->>SD: check stop conditions
alt stop token
SD->>M: record_stop_detected()
SD-->>C: Stopped
else partial match
SD->>M: record_partial_match()
SD-->>C: Held
else no match
SD->>T: decode incremental
SD-->>C: Text(output)
end
```
### Class/Type Diagram
```mermaid
classDiagram
class Encoder {
<<trait>>
+encode(input: &str) Result~Encoding~
+encode_batch(inputs: &[&str]) Result~Vec~Encoding~~
}
class Decoder {
<<trait>>
+decode(token_ids: &[u32], skip_special: bool) Result~String~
}
class TokenizerTrait {
<<trait>>
+vocab_size() usize
+get_special_tokens() &SpecialTokens
+token_to_id(token: &str) Option~u32~
+id_to_token(id: u32) Option~String~
}
class Tokenizer {
-Arc~dyn TokenizerTrait~
+from_file(path: &str) Result~Tokenizer~
+from_arc(Arc~dyn TokenizerTrait~) Self
+decode_stream(&[u32], bool) DecodeStream
+encode(&str) Result~Encoding~
+decode(&[u32], bool) Result~String~
}
class Encoding {
<<enum>>
Hf(Box~HfEncoding~)
Sp(Vec~u32~)
Tiktoken(Vec~usize~)
+token_ids() Vec~u32~
+token_ids_ref() &[u32]
}
class HuggingFaceTokenizer {
-tokenizer: HfTokenizer
-special_tokens: SpecialTokens
-vocab: HashMap~String, u32~
-reverse_vocab: HashMap~u32, String~
+from_file(path: &str) Result~Self~
+apply_chat_template(&[ChatMessage]) Result~String~
}
class TiktokenTokenizer {
-tokenizer: CoreBPE
-model: TiktokenModel
-special_tokens: SpecialTokens
-vocab_size: usize
+new(model: TiktokenModel) Result~Self~
+from_model_name(name: &str) Result~Self~
}
class MockTokenizer {
-vocab: HashMap~String, u32~
-reverse_vocab: HashMap~u32, String~
-special_tokens: SpecialTokens
+new() Self
}
class DecodeStream {
-tokenizer: Arc~dyn TokenizerTrait~
-all_token_ids: Vec~u32~
-prefix_offset: usize
-read_offset: usize
-skip_special_tokens: bool
+new(tokenizer, &[u32], bool) Self
+step(u32) Result~Option~String~~
+flush() Result~Option~String~~
}
class Sequence {
-tokenizer: Arc~dyn TokenizerTrait~
-token_ids: Vec~u32~
-prefix_offset: usize
-read_offset: usize
+append_text(&str) Result~()~
+append_token(u32) Result~String~
+text() Result~String~
}
class StopSequenceDecoder {
-tokenizer: Arc~dyn TokenizerTrait~
-config: StopSequenceConfig
-jail_buffer: String
-token_buffer: Vec~u32~
-stopped: bool
+process_token(u32) Result~SequenceDecoderOutput~
+flush() SequenceDecoderOutput
+reset()
}
Encoder <|.. HuggingFaceTokenizer
Encoder <|.. TiktokenTokenizer
Encoder <|.. MockTokenizer
Decoder <|.. HuggingFaceTokenizer
Decoder <|.. TiktokenTokenizer
Decoder <|.. MockTokenizer
TokenizerTrait <|.. HuggingFaceTokenizer
TokenizerTrait <|.. TiktokenTokenizer
TokenizerTrait <|.. MockTokenizer
TokenizerTrait --|> Encoder
TokenizerTrait --|> Decoder
Tokenizer o-- TokenizerTrait
DecodeStream o-- TokenizerTrait
Sequence o-- TokenizerTrait
StopSequenceDecoder o-- TokenizerTrait
```
## 3. Module-by-Module Deep Dive
### 3.1 mod.rs (Main Module)
**Location**: `src/tokenizer/mod.rs`
**Public API:**
```rust
pub struct Tokenizer(Arc<dyn traits::Tokenizer>);

impl Tokenizer {
    pub fn from_file(file_path: &str) -> Result<Tokenizer>
    pub fn from_file_with_chat_template(
        file_path: &str,
        chat_template_path: Option<&str>,
    ) -> Result<Tokenizer>
    pub fn from_arc(tokenizer: Arc<dyn traits::Tokenizer>) -> Self
    pub fn decode_stream(&self, prompt_token_ids: &[u32], skip_special_tokens: bool) -> DecodeStream
    pub fn encode(&self, input: &str) -> Result<Encoding>
    pub fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>
    pub fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>
    pub fn vocab_size(&self) -> usize
    pub fn get_special_tokens(&self) -> &SpecialTokens
    pub fn token_to_id(&self, token: &str) -> Option<u32>
    pub fn id_to_token(&self, id: u32) -> Option<String>
}
```
**Key Responsibilities:**
- Main wrapper providing unified interface (mod.rs:36-93)
- Arc-based shared ownership for thread safety
- Delegates to concrete implementations via trait object
- Factory method integration for creation
**State Management:**
- Single field: `Arc<dyn traits::Tokenizer>` for polymorphic dispatch
- Immutable after creation, Clone via Arc
**Re-exports** (mod.rs:26-43):
- Factory functions: `create_tokenizer`, `create_tokenizer_async`, `create_tokenizer_from_file`, `create_tokenizer_with_chat_template`
- Types: `Sequence`, `StopSequenceConfig`, `DecodeStream`, `Encoding`, `TokenizerType`
- Chat template: `ChatMessage`
- Tokenizer implementations: `HuggingFaceTokenizer`, `TiktokenTokenizer`
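A minimal usage sketch of the facade (the file path is illustrative, and errors are assumed to convert into `anyhow::Error`):
```rust
use sglang_router_rs::tokenizer::Tokenizer;

fn main() -> anyhow::Result<()> {
    // Load a HuggingFace-format tokenizer.json from disk.
    let tokenizer = Tokenizer::from_file("/models/llama/tokenizer.json")?;

    // Encode, then decode the IDs back to text.
    let encoding = tokenizer.encode("Hello, world!")?;
    let text = tokenizer.decode(&encoding.token_ids(), true)?;
    println!("round-trip: {text}");

    // Clones are cheap Arc bumps, so the handle can cross threads freely.
    let shared = tokenizer.clone();
    let handle = std::thread::spawn(move || shared.encode("works across threads"));
    let _encoding = handle.join().unwrap()?;
    Ok(())
}
```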
### 3.2 traits.rs (Trait Definitions)
**Location**: `src/tokenizer/traits.rs`
**Core Traits:**
```rust
pub trait Encoder: Send + Sync {
    fn encode(&self, input: &str) -> Result<Encoding>;
    fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>;
}

pub trait Decoder: Send + Sync {
    fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>;
}

pub trait Tokenizer: Encoder + Decoder {
    fn vocab_size(&self) -> usize;
    fn get_special_tokens(&self) -> &SpecialTokens;
    fn token_to_id(&self, token: &str) -> Option<u32>;
    fn id_to_token(&self, id: u32) -> Option<String>;
}
```
**Encoding Enum** (traits.rs:24-53):
```rust
pub enum Encoding {
    Hf(Box<tokenizers::tokenizer::Encoding>), // HuggingFace
    Sp(Vec<u32>),                             // SentencePiece
    Tiktoken(Vec<usize>),                     // GPT models
}
```
**Key Design Decisions:**
- Separation of Encoder/Decoder allows partial implementations
- Send + Sync for thread safety
- Encoding enum handles different backend representations
- `token_ids()` returns `Vec<u32>` for compatibility (traits.rs:34-40)
- `token_ids_ref()` has limitation for Tiktoken (returns empty slice)
**SpecialTokens Struct** (traits.rs:55-65):
- Standard tokens: bos, eos, unk, sep, pad, cls, mask
- Additional tokens vector for custom special tokens
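To make the enum's role concrete, here is a sketch of how `token_ids()` can normalize the three representations (not the verbatim traits.rs implementation):
```rust
pub enum Encoding {
    Hf(Box<tokenizers::tokenizer::Encoding>),
    Sp(Vec<u32>),
    Tiktoken(Vec<usize>),
}

impl Encoding {
    /// Normalize every backend to `Vec<u32>`.
    pub fn token_ids(&self) -> Vec<u32> {
        match self {
            // The HuggingFace encoding already stores u32 IDs.
            Encoding::Hf(e) => e.get_ids().to_vec(),
            Encoding::Sp(ids) => ids.clone(),
            // Tiktoken yields usize ranks; narrow them for the uniform API.
            Encoding::Tiktoken(ids) => ids.iter().map(|&id| id as u32).collect(),
        }
    }
}
```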
### 3.3 factory.rs (Tokenizer Creation)
**Location**: `src/tokenizer/factory.rs`
**Public Functions:**
```rust
pub fn create_tokenizer_from_file(file_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub fn create_tokenizer_with_chat_template(
    file_path: &str,
    chat_template_path: Option<&str>,
) -> Result<Arc<dyn traits::Tokenizer>>
pub fn create_tokenizer(model_name_or_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub async fn create_tokenizer_async(model_name_or_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub fn get_tokenizer_info(file_path: &str) -> Result<TokenizerType>
```
**Auto-Detection Logic** (factory.rs:94-132):
1. Read first 512 bytes of file
2. Check for JSON format (HuggingFace)
3. Check for GGUF magic bytes
4. Check for SentencePiece patterns
**File Type Detection** (factory.rs:135-161):
- JSON detection: Skip BOM, find `{` or `[`
- SentencePiece: Check for specific byte patterns
- GGUF: Check magic number "GGUF"
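An illustrative re-creation of these heuristics (the real factory also probes SentencePiece byte patterns, omitted here):
```rust
use std::{fs::File, io::Read};

fn sniff_format(path: &str) -> std::io::Result<&'static str> {
    // Read at most the first 512 bytes, per the detection logic above.
    let mut head = [0u8; 512];
    let n = File::open(path)?.read(&mut head)?;
    let head = &head[..n];

    // JSON (HuggingFace): skip a UTF-8 BOM, then look for '{' or '['.
    let body = head.strip_prefix(&[0xEF, 0xBB, 0xBF]).unwrap_or(head);
    let first = body.iter().find(|b| !b.is_ascii_whitespace());
    if matches!(first, Some(&b'{') | Some(&b'[')) {
        return Ok("json");
    }

    // GGUF files start with the ASCII magic "GGUF".
    if head.starts_with(b"GGUF") {
        return Ok("gguf");
    }

    Ok("unknown")
}
```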
**Model Name Routing** (factory.rs:145-193):
- GPT models → Tiktoken (gpt-4, gpt-3.5, davinci, curie, etc.)
- File paths → file-based creation
- HuggingFace model IDs → Automatic download from Hub
**HuggingFace Hub Integration**:
- Downloads tokenizer files (tokenizer.json, tokenizer_config.json, etc.)
- Respects HF_TOKEN environment variable for private models
- Caches downloaded files using hf-hub crate
- Async and blocking versions available
**Metrics Integration:**
- Records factory load/error events (factory.rs:56-57, 82-83)
- Tracks vocab size on successful load
- Measures load duration
### 3.4 huggingface.rs (HuggingFace Implementation)
**Location**: `src/tokenizer/huggingface.rs`
**Public API:**
```rust
pub struct HuggingFaceTokenizer {
    tokenizer: HfTokenizer,
    special_tokens: SpecialTokens,
    vocab: HashMap<String, u32>,
    reverse_vocab: HashMap<u32, String>,
}

impl HuggingFaceTokenizer {
    pub fn from_file(file_path: &str) -> Result<Self>
    pub fn from_tokenizer(tokenizer: HfTokenizer) -> Self
    pub fn apply_chat_template(&self, messages: &[ChatMessage]) -> Result<String>
}
```
**Special Token Extraction** (huggingface.rs:58-82):
- Searches for common patterns: `<s>`, `</s>`, `<unk>`, `[CLS]`, etc.
- Falls back to None if not found
**Vocab Management:**
- Builds forward and reverse mappings on creation (huggingface.rs:26-30)
- Used for token↔ID conversions
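A sketch of that construction step using the `tokenizers` crate (field names follow the struct above):
```rust
use std::collections::HashMap;
use tokenizers::Tokenizer as HfTokenizer;

// Build both lookup directions once so later token<->ID queries are O(1)
// map hits instead of calls into the underlying tokenizer.
fn build_vocab_maps(tokenizer: &HfTokenizer) -> (HashMap<String, u32>, HashMap<u32, String>) {
    let vocab = tokenizer.get_vocab(true); // include added/special tokens
    let reverse_vocab = vocab.iter().map(|(tok, &id)| (id, tok.clone())).collect();
    (vocab, reverse_vocab)
}
```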
**Metrics** (huggingface.rs:97-111, 136-150):
- Tracks encode/decode requests, durations
- Records character/token counts
- Reports errors with context
**Chat Template Integration** (huggingface.rs:21-144):
- Automatic loading from tokenizer_config.json
- Custom template loading from .jinja files
- Runtime template modification via `set_chat_template()`
- Full Jinja2 rendering via minijinja
- See section 3.11 for the detailed chat template architecture
### 3.5 sequence.rs (Sequence Management)
**Location**: `src/tokenizer/sequence.rs`
**Core Structure:**
```rust
pub struct Sequence {
    tokenizer: Arc<dyn TokenizerTrait>,
    token_ids: Vec<u32>,
    prefix_offset: usize, // Start of prefix window
    read_offset: usize,   // End of processed tokens
}
```
**Key Methods:**
```rust
impl Sequence {
    pub fn new(tokenizer: Arc<dyn TokenizerTrait>) -> Self
    pub fn with_tokens(tokenizer: Arc<dyn TokenizerTrait>, token_ids: Vec<u32>) -> Self
    pub fn append_text(&mut self, input: &str) -> Result<()>
    pub fn append_token(&mut self, token_id: u32) -> Result<String>
    pub fn text(&self) -> Result<String>
}
```
**Incremental Decoding Algorithm** (sequence.rs:93-142):
1. Store old read_offset before adding token
2. Push new token, update read_offset
3. Decode prefix window (prefix_offset..old_read_offset)
4. Decode full window (prefix_offset..current)
5. Check for UTF-8 boundary issues
6. Extract only new text portion
7. Handle incomplete UTF-8 (�) by returning empty
**State Variables:**
- `token_ids`: Complete sequence of tokens
- `prefix_offset`: Where last decode started
- `read_offset`: Current position in sequence
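A condensed, self-contained sketch of this algorithm; the local `Decode` trait stands in for the crate's tokenizer trait so the snippet compiles on its own, and errors go through `anyhow`:
```rust
use std::sync::Arc;

// Stand-in for the Decoder side of the real trait (sequence.rs uses the
// crate's `Tokenizer` trait object instead).
trait Decode: Send + Sync {
    fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> anyhow::Result<String>;
}

struct Sequence {
    tokenizer: Arc<dyn Decode>,
    token_ids: Vec<u32>,
    prefix_offset: usize,
    read_offset: usize,
}

impl Sequence {
    // Steps 1-7 above, condensed.
    fn append_token(&mut self, token_id: u32) -> anyhow::Result<String> {
        let old_read_offset = self.read_offset;
        self.token_ids.push(token_id);
        self.read_offset = self.token_ids.len();

        // Decode the previous window, then the window extended by the new token.
        let prefix = self
            .tokenizer
            .decode(&self.token_ids[self.prefix_offset..old_read_offset], false)?;
        let full = self
            .tokenizer
            .decode(&self.token_ids[self.prefix_offset..self.read_offset], false)?;

        // An incomplete multi-byte character decodes to U+FFFD (�): emit
        // nothing and retry once the next token arrives.
        if full.len() <= prefix.len() || full.ends_with('\u{FFFD}') {
            return Ok(String::new());
        }

        // Emit only the new suffix and slide the prefix window forward.
        let new_text = full[prefix.len()..].to_string();
        self.prefix_offset = old_read_offset;
        Ok(new_text)
    }
}
```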
### 3.6 stop.rs (Stop Sequence Detection)
**Location**: `src/tokenizer/stop.rs`
**Core Components:**
```rust
pub enum SequenceDecoderOutput {
    Text(String),            // Normal output
    Held,                    // Partial match, holding text
    Stopped,                 // Stop matched (hidden)
    StoppedWithText(String), // Stop matched (visible)
}

pub struct StopSequenceConfig {
    pub stop_tokens: HashSet<u32>,
    pub stop_sequences: Vec<String>,
    pub visible_stop_tokens: HashSet<u32>,
    pub visible_stop_sequences: Vec<String>,
}

pub struct StopSequenceDecoder {
    tokenizer: Arc<dyn traits::Tokenizer>,
    config: StopSequenceConfig,
    jail_buffer: String,    // Held text for partial matches
    token_buffer: Vec<u32>, // All tokens
    prefix_offset: usize,
    read_offset: usize,
    stopped: bool,
    skip_special_tokens: bool,
}
```
**Stop Detection Algorithm** (stop.rs:97-252):
1. **Token-level checks** (stop.rs:104-132):
- Check stop_tokens → return Stopped
- Check visible_stop_tokens → return StoppedWithText
2. **Incremental decode** (stop.rs:136-166):
- Decode previous context
- Decode including new token
- Check for incomplete UTF-8
3. **String matching** (stop.rs:169-202):
- Combine jail_buffer + new_text
- Check for complete matches
- Check visible sequences
4. **Partial match detection** (stop.rs:204-239):
- Check all suffixes as potential prefixes
- Split safe text vs potential match
- Jail potential match text
**Critical Fix** (stop.rs:385-424):
- Ensures no repeated/accumulated output
- Only outputs NEW text, not full buffer
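The heart of step 4 is the suffix-as-prefix check: the longest suffix of the pending text that could still grow into a stop sequence must be held back. A self-contained sketch (illustrative names, not the stop.rs internals):
```rust
/// Split `candidate` into text that is safe to emit now and a suffix that
/// might still complete one of `stop_sequences`.
fn split_safe_and_jailed<'a>(candidate: &'a str, stop_sequences: &[String]) -> (&'a str, &'a str) {
    // Walk suffixes longest-first; the first that prefixes a stop sequence
    // is "jailed" until more text resolves the ambiguity.
    for start in 0..candidate.len() {
        if !candidate.is_char_boundary(start) {
            continue;
        }
        let suffix = &candidate[start..];
        if stop_sequences.iter().any(|s| s.starts_with(suffix)) {
            return (&candidate[..start], suffix);
        }
    }
    (candidate, "")
}

fn main() {
    let stops = vec!["\nHuman:".to_string()];
    // "\nHum" could be the start of "\nHuman:", so it is held, not emitted.
    let (safe, jailed) = split_safe_and_jailed("Sure thing\nHum", &stops);
    assert_eq!(safe, "Sure thing");
    assert_eq!(jailed, "\nHum");
}
```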
### 3.7 stream.rs (Streaming Decode)
**Location**: `src/tokenizer/stream.rs`
**Structure:**
```rust
pub struct DecodeStream {
    tokenizer: Arc<dyn traits::Tokenizer>,
    skip_special_tokens: bool,
    all_token_ids: Vec<u32>,
    prefix_offset: usize,
    read_offset: usize,
}
```
**Constants:**
- `INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET: usize = 5` (stream.rs:9)
- Matches the convention used by HuggingFace TGI and vLLM
**Key Methods:**
```rust
impl DecodeStream {
    pub fn new(tokenizer, prompt_token_ids: &[u32], skip_special: bool) -> Self
    pub fn step(&mut self, id: u32) -> Result<Option<String>>
    pub fn step_batch(&mut self, token_ids: &[u32]) -> Result<Vec<String>>
    pub fn flush(&mut self) -> Result<Option<String>>
    pub fn tokens(&self) -> &[u32]
}
```
**Streaming Algorithm** (stream.rs:47-82):
1. Append token to buffer
2. Decode prefix window for context
3. Decode full window
4. Check for incomplete UTF-8 (�)
5. Extract new text if complete
6. Update offsets for next iteration
**Metrics:**
- Records stream tokens, incomplete UTF-8, step duration
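A minimal sketch of driving the stream (assumes `sglang_router_rs` on the dependency path and `anyhow`-compatible errors):
```rust
use sglang_router_rs::tokenizer::Tokenizer;

fn stream_to_text(tokenizer: &Tokenizer, generated: &[u32]) -> anyhow::Result<String> {
    // No prompt tokens seed the window here; skip special tokens on output.
    let mut stream = tokenizer.decode_stream(&[], true);
    let mut out = String::new();
    for &id in generated {
        // `step` yields None while a multi-byte character is still incomplete.
        if let Some(chunk) = stream.step(id)? {
            out.push_str(&chunk);
        }
    }
    // Drain whatever is still buffered at end-of-generation.
    if let Some(rest) = stream.flush()? {
        out.push_str(&rest);
    }
    Ok(out)
}
```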
### 3.8 tiktoken.rs (Tiktoken Implementation)
**Location**: `src/tokenizer/tiktoken.rs`
**Public API:**
```rust
pub struct TiktokenTokenizer {
    tokenizer: CoreBPE,
    model: TiktokenModel,
    special_tokens: SpecialTokens,
    vocab_size: usize,
}

pub enum TiktokenModel {
    Cl100kBase, // GPT-4, GPT-3.5-turbo
    P50kBase,   // Codex, text-davinci-002/003
    P50kEdit,   // Edit models
    R50kBase,   // GPT-3 (davinci, curie, etc.)
}
```
**Model Detection** (tiktoken.rs:67-81):
- GPT-4, GPT-3.5, turbo → Cl100kBase
- davinci-002/003, codex → P50kBase
- edit models → P50kEdit
- davinci, curie, babbage, ada → R50kBase
**Vocab Sizes** (tiktoken.rs:46-50):
- Cl100kBase: 100,256 tokens
- P50k variants: 50,281 tokens
- R50kBase: 50,257 tokens
**Special Tokens** (tiktoken.rs:84-114):
- All models use `<|endoftext|>` for BOS/EOS/PAD
- Cl100k adds FIM tokens for code completion
**Limitations:**
- No token↔ID mapping (returns None) (tiktoken.rs:151-161)
- Requires Vec<u32> → Vec<usize> conversion
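For reference, the mapping above can be reproduced with the `tiktoken-rs` constructors directly (a sketch only; the real `from_model_name` also covers the `p50k_edit` family and richer name patterns):
```rust
use tiktoken_rs::{cl100k_base, p50k_base, r50k_base, CoreBPE};

fn bpe_for_model(name: &str) -> anyhow::Result<CoreBPE> {
    let n = name.to_lowercase();
    if n.starts_with("gpt-4") || n.starts_with("gpt-3.5") {
        cl100k_base()
    } else if n.contains("davinci-002") || n.contains("davinci-003") || n.contains("codex") {
        p50k_base()
    } else if ["davinci", "curie", "babbage", "ada"].iter().any(|m| n.contains(m)) {
        // Checked after the newer davinci variants so they are not shadowed.
        r50k_base()
    } else {
        anyhow::bail!("no tiktoken encoding known for model '{name}'")
    }
}
```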
### 3.9 mock.rs (Testing Implementation)
**Location**: `src/tokenizer/mock.rs`
**Purpose:** Simple tokenizer for unit testing
**Vocabulary:**
- 8 predefined tokens: "Hello"→1, "world"→2, "test"→3, etc.
- Special tokens: `<eos>`→999, `<bos>`→1000
**Behavior:**
- Encode: Split on whitespace, lookup tokens
- Decode: Join tokens with spaces
- Skips special tokens when requested
### 3.10 hub.rs (HuggingFace Hub Download)
**Location**: `src/tokenizer/hub.rs`
**Purpose:** Download tokenizer files from HuggingFace Hub when given a model ID.
**Key Functions:**
```rust
pub async fn download_tokenizer_from_hf(model_id: impl AsRef<Path>) -> Result<PathBuf>
pub async fn from_hf(name: impl AsRef<Path>, ignore_weights: bool) -> Result<PathBuf>
```
**Features:**
- Downloads only tokenizer-related files by default
- Filters out model weights, images, and documentation
- Uses HF_TOKEN environment variable for authentication
- Returns cached directory path for subsequent use
- Progress indication during download
**File Detection:**
- Tokenizer files: tokenizer.json, tokenizer_config.json, special_tokens_map.json
- Vocabulary files: vocab.json, merges.txt
- SentencePiece models: *.model files
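A minimal end-to-end sketch of the Hub path (the model ID is illustrative; requires a `tokio` runtime, and `HF_TOKEN` for private or gated repos):
```rust
use sglang_router_rs::tokenizer::create_tokenizer_async;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // A model ID (not a file path) triggers the Hub download on first use;
    // later calls hit the hf-hub cache directory.
    let tokenizer = create_tokenizer_async("meta-llama/Llama-3.1-8B-Instruct").await?;
    println!("vocab size: {}", tokenizer.vocab_size());
    Ok(())
}
```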
### 3.11 chat_template.rs (Chat Template Support)
**Location**: `src/tokenizer/chat_template.rs`
**Purpose:** Jinja2-based chat template rendering for conversation formatting, matching HuggingFace transformers' `apply_chat_template` functionality.
**Core Components:**
```rust
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

pub struct ChatTemplateProcessor {
    template: String,
    bos_token: Option<String>,
    eos_token: Option<String>,
}
```
**Key Features:**
1. **Jinja2 Template Rendering** (chat_template.rs:63-102):
- Uses minijinja crate for Jinja2 compatibility
- Supports full Jinja2 syntax (loops, conditionals, variables)
- Compatible with HuggingFace chat templates
2. **Template Loading Sources:**
- **tokenizer_config.json** (automatic): Default behavior when creating tokenizer
- **.jinja files** (explicit): Custom templates that override built-in
- **Programmatic** (runtime): `set_chat_template()` method
3. **Loading Priority:**
```rust
// Priority order:
// 1. Explicit .jinja file (if provided) - OVERRIDES all
// 2. tokenizer_config.json (if exists)
// 3. Fallback to simple formatting
```
**Template Variables:**
- `messages`: Array of chat messages with role and content
- `add_generation_prompt`: Boolean for assistant prompt
- `bos_token`: Beginning of sequence token
- `eos_token`: End of sequence token
**API Methods:**
```rust
// Factory level - create with custom template
pub fn create_tokenizer_with_chat_template(
    tokenizer_path: &str,
    chat_template_path: Option<&str>,
) -> Result<Arc<dyn Tokenizer>>

// HuggingFace tokenizer methods
impl HuggingFaceTokenizer {
    // Load with custom template (overrides built-in)
    pub fn from_file_with_chat_template(
        file_path: &str,
        chat_template_path: Option<&str>,
    ) -> Result<Self>

    // Set template after creation (mimics Python)
    pub fn set_chat_template(&mut self, template: String)

    // Apply template to messages
    pub fn apply_chat_template(
        &self,
        messages: &[ChatMessage],
        add_generation_prompt: bool,
    ) -> Result<String>
}
```
**Template Examples:**
1. **Llama-style Template:**
```jinja
{%- for message in messages %}
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{{- message['content'] + '<|eot_id|>' }}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
```
2. **ChatML Format:**
```jinja
{%- for message in messages %}
{{- '<|im_start|>' + message['role'] + '\n' }}
{{- message['content'] + '<|im_end|>\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
```
**Integration with HuggingFace Tokenizer:**
1. **Automatic Loading** (huggingface.rs:108-124):
- Searches for tokenizer_config.json in same directory
- Extracts `chat_template` field if present
- Stores template for use in apply_chat_template
2. **Override Mechanism** (huggingface.rs:28-50):
- If chat_template_path provided, loads from .jinja file
- Replaces any existing template from tokenizer_config.json
- Matches Python's behavior: custom templates always override
3. **Runtime Modification** (huggingface.rs:140-144):
- `set_chat_template()` allows changing template after creation
- Equivalent to Python's `tokenizer.chat_template = template`
**Testing Coverage:**
- Template rendering with various formats (Llama, ChatML, custom)
- Loading from .jinja files
- Override behavior verification
- Runtime template modification
- Special token handling
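Underneath `apply_chat_template`, rendering reduces to a minijinja environment populated with the variables listed above. A self-contained sketch with an illustrative ChatML-style template:
```rust
use minijinja::{context, Environment};

fn render_chat(template: &str, messages: &serde_json::Value) -> Result<String, minijinja::Error> {
    let mut env = Environment::new();
    env.add_template("chat", template)?;
    env.get_template("chat")?.render(context! {
        messages => messages,
        add_generation_prompt => true,
        bos_token => "<s>",
        eos_token => "</s>",
    })
}

fn main() -> anyhow::Result<()> {
    let template = r#"{%- for message in messages -%}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{%- endfor -%}"#;
    let messages = serde_json::json!([{ "role": "user", "content": "Hello!" }]);
    println!("{}", render_chat(template, &messages)?);
    Ok(())
}
```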
## 4. Traits & Contracts
### Core Trait Hierarchy
1. **Encoder** (traits.rs:4-7)
- Contract: Convert text to token IDs
- Requirements: Send + Sync for thread safety
- Error handling via Result
2. **Decoder** (traits.rs:10-12)
- Contract: Convert token IDs to text
- `skip_special_tokens` parameter for filtering
3. **Tokenizer** (traits.rs:15-20)
- Extends both Encoder and Decoder
- Adds vocab introspection
- Token↔ID bidirectional mapping
### Encoding Contract
The `Encoding` enum must:
- Provide `token_ids()` returning Vec<u32>
- Support multiple backend representations
- Handle type conversions (usize→u32 for Tiktoken)
### Special Token Guarantees
- BOS/EOS tokens for sequence boundaries
- UNK for out-of-vocabulary handling
- Optional tokens may be None
- Additional tokens for custom use cases
## 5. Tokenizer Implementations
### HuggingFace Adapter
**Normalization/Pretokenization:**
- Handled by underlying `tokenizers` crate
- Configurable via JSON tokenizer files
- BPE, WordPiece, Unigram models supported
**API Mapping:**
- `encode(input, add_special_tokens=false)` → Encoding::Hf
- Batch encoding supported natively
- Vocab extraction for lookups
### Tiktoken Adapter
**Model Families:**
- cl100k_base: Modern GPT models (GPT-4, GPT-3.5)
- p50k_base: Codex and davinci-002/003
- p50k_edit: Edit-specific models
- r50k_base: Classic GPT-3
**Byte-Level Behavior:**
- Direct byte-pair encoding without pretokenization
- No subword regularization
- Deterministic encoding
### Sequence/Stop Modules
**Algorithms:**
1. **Substring Matching:**
- Exact match for stop sequences
- Prefix detection for partial matches
2. **Streaming Matcher:**
- Incremental text accumulation
- Jail buffer for uncertain text
- Release on divergence
3. **Overlap Handling:**
- Token boundaries respected
- UTF-8 boundary checking
- Multi-byte character safety
**Window Sizes:**
- Initial offset: 5 tokens (standard)
- Prefix window: Variable based on decoding
- Jail buffer: Unbounded (cleared on match/divergence)
## 6. Streaming Integration
### Pipeline Position
1. **Tokenization Phase:**
- Runs during request preprocessing
- Caches prompt encodings
2. **Decoding Phase:**
- Runs per-token during generation
- Maintains streaming state
### Buffering Policy
- **Token Buffer:** Complete sequence retained
- **Prefix Window:** Sliding window for context
- **Partial Detokenization:** Hold incomplete UTF-8
- **Chunk Boundaries:** Char-aligned output
### Emission Rules
- **Intermediate:** Emit on complete characters
- **Final:** Flush any remaining text
- **Stop Conditions:** Immediate termination
- **Errors:** Propagate with context
## 7. Testing & Benchmarking
### Test Coverage Summary
**Unit Tests (38 tests across 7 modules):**
- `factory.rs`: 4 tests - JSON detection, file types, model routing
- `huggingface.rs`: 1 test - Chat template handling
- `sequence.rs`: 5 tests - Append operations, incremental decode
- `stop.rs`: 9 tests - Stop detection, partial matches, jail buffer
- `tiktoken.rs`: 7 tests - Model detection, encode/decode roundtrip
- `chat_template.rs`: 3 tests - Template rendering, loading
- `tests.rs`: 9 tests - Cross-module integration
**Integration Tests (10 tests in tokenizer_integration.rs):**
- HuggingFace tokenizer hash verification
- Encode/decode lifecycle testing
- Sequence operations with real tokenizers
- Decode streaming with prefill
- Stop sequence detection scenarios
- Factory creation patterns
- Batch encoding verification
- Special token handling
- Thread safety validation
### Benchmark Suite (tokenizer_benchmark.rs)
**Performance Benchmarks (12 benchmark groups):**
1. **Encode Throughput**: Single-threaded encoding performance
2. **Batch Encode**: Batch vs individual encoding comparison
3. **Concurrent Encode**: Multi-request concurrent encoding
4. **Decode Performance**: Various decode scenarios
5. **Streaming Decode**: 100K token streaming performance
6. **Latency Distribution**: P50/P90/P99 latency metrics
7. **Concurrent Streaming**: Multi-stream performance
8. **Stop Sequences**: Stop detection overhead
9. **Multithreaded Encode**: Thread scaling characteristics
10. **Multithreaded Decode**: Decode thread scaling
11. **Memory Efficiency**: Memory usage patterns
12. **Scaling Characteristics**: Performance vs input size
**Test Prompts:**
- Short: 30 chars ("What is the capital of France?")
- Medium: 201 chars (Quantum computing explanation)
- Long: 638 chars (Software engineering review)
## 8. Operational Concerns
### Configuration
**Environment Variables:**
- `HF_TOKEN`: HuggingFace authentication token for private models
**Dependencies:**
- All tokenizer backends included by default
- No feature flags required
**Model Mapping:**
- Hardcoded in factory.rs
- TODO: Make configurable
### Metrics
**Metric Names (via TokenizerMetrics):**
- `sgl_tokenizer_encode_duration_seconds`
- `sgl_tokenizer_decode_duration_seconds`
- `sgl_tokenizer_tokens_per_encode`
- `sgl_tokenizer_chars_per_encode`
- `sgl_tokenizer_factory_load_duration_seconds`
- `sgl_tokenizer_stop_sequence_detected`
- `sgl_tokenizer_stream_incomplete_utf8_total`
**Labels:**
- `tokenizer_type`: huggingface, tiktoken, mock
- `operation`: encode, decode, factory_load
- `error_type`: Various error conditions
### Failure Modes
1. **File Not Found:** Clear error with path
2. **Unsupported Format:** Lists supported types
3. **Feature Disabled:** Suggests enabling feature
4. **Decode Errors:** Context in error message
5. **Incomplete UTF-8:** Handled gracefully
### Dynamic Batching Analysis
**Note**: Dynamic batching implementation was explored but found to have significant overhead:
- Channel communication adds ~3-4ms latency per request
- Single requests are ~300x slower with dynamic batching
- Even concurrent requests show 50-100% performance regression
- The async/channel overhead outweighs tokenization benefits
**Recommendation**: Use thread-local tokenizer pools or direct `encode_batch()` instead of dynamic batching for this use case.
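Given that the wrapper is an `Arc`, a per-thread clone is just a reference-count bump, so a thread-local pool mostly avoids channel hops rather than construction cost. A sketch of the recommendation (assumes the crate's errors convert into `anyhow::Error`):
```rust
use std::cell::RefCell;

use sglang_router_rs::tokenizer::Tokenizer;

thread_local! {
    // One tokenizer handle per worker thread, initialized lazily.
    static LOCAL_TOKENIZER: RefCell<Option<Tokenizer>> = RefCell::new(None);
}

fn encode_on_this_thread(prototype: &Tokenizer, input: &str) -> anyhow::Result<Vec<u32>> {
    LOCAL_TOKENIZER.with(|slot| {
        let mut slot = slot.borrow_mut();
        let tokenizer = slot.get_or_insert_with(|| prototype.clone());
        Ok(tokenizer.encode(input)?.token_ids())
    })
}
```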
## 9. Glossary
- **BPE (Byte-Pair Encoding):** Subword tokenization merging frequent pairs
- **Tokenizer Family:** Related tokenizers sharing vocabulary (GPT, BERT, etc.)
- **Stop Sequence:** Text pattern triggering generation termination
- **Detokenization:** Converting token IDs back to text
- **Jail Buffer:** Temporary hold for potentially matching stop sequences
- **Prefix Offset:** Starting position for incremental decoding window
- **Read Offset:** Current position in token sequence
- **Special Tokens:** Reserved tokens (BOS, EOS, PAD, etc.)
- **Vocab Size:** Total number of unique tokens
- **Chat Template:** Format for converting messages to model input
## 10. TODO
1. **TODO:** Implement `Encoding::get_hash()` for caching support
- File: `src/tokenizer/traits.rs`
- Symbol: `impl Encoding`
2. **TODO:** Add character offset tracking
- File: `src/tokenizer/traits.rs`
- Symbol: `pub type Offsets = (usize, usize)`
3. **TODO:** Support SentencePiece models
- File: `src/tokenizer/factory.rs:69-72`
- Symbol: Extension match arm for "model"
4. **TODO:** Support GGUF format
- File: `src/tokenizer/factory.rs:74-78`
- Symbol: Extension match arm for "gguf"
5. **TODO:** Add token↔ID mapping for Tiktoken
- File: `src/tokenizer/tiktoken.rs:151-161`
- Symbol: `token_to_id()` and `id_to_token()` methods
6. **TODO:** Fix `token_ids_ref()` for Tiktoken
- File: `src/tokenizer/traits.rs:46-50`
- Symbol: `Encoding::Tiktoken` match arm
7. **TODO:** Make model→tokenizer mapping configurable
- File: `src/tokenizer/factory.rs:174-184`
- Symbol: GPT model detection logic
# Tokenizer Module
## Overview
The `sgl-router` tokenizer subsystem exposes a single `Tokenizer` facade around multiple backends (Hugging Face JSON tokenizers, OpenAI/tiktoken models, and an in-memory mock). It packages the shared behaviours needed by the router (encoding user text, incrementally decoding streamed tokens, tracking per-request state, and detecting stop conditions) behind trait objects so the rest of the router can remain backend-agnostic.
Key capabilities:
- trait-based split between `Encoder`, `Decoder`, and `Tokenizer` for shared APIs across backends
- Hugging Face tokenizer loading (with optional chat templates) and HF Hub downloads
- heuristic selection of OpenAI/tiktoken encodings for GPT model names
- incremental decoding utilities (`DecodeStream`, `Sequence`) that handle UTF-8 boundaries
- stop sequence handling via `StopSequenceDecoder` with token-level and string-level triggers
- optional Jinja2 chat-template rendering that matches Hugging Face semantics
The implementation deliberately keeps the surface area small: metrics, batching, and SentencePiece support mentioned in earlier drafts do **not** exist today. This document reflects the actual code as of `sgl-router/src/tokenizer/*`.
## Source Map
- `mod.rs` – module exports and the `Tokenizer` wrapper around `Arc<dyn Tokenizer>`
- `traits.rs` – shared traits and the `Encoding`/`SpecialTokens` helper types
- `factory.rs` – backend discovery, file/model heuristics, and tokio-aware creation helpers
- `hub.rs` – Hugging Face Hub downloads via `hf_hub`
- `huggingface.rs` – wrapper over `tokenizers::Tokenizer`, chat template loading, vocab access
- `tiktoken.rs` – wrapper over `tiktoken-rs` encoders for OpenAI model families
- `chat_template.rs` – AST-driven Jinja template inspection and rendering utilities
- `sequence.rs` – stateful incremental decoding helper used by router sequences
- `stream.rs` – stateless streaming decoder that yields textual chunks from token streams
- `stop.rs` – stop-sequence detection with "jail" buffering and a builder API
- `mock.rs` – lightweight tokenizer used by unit tests
- `tests.rs` – smoke tests covering the trait facade and helpers (largely with the mock backend)
## Core Traits and Types (`traits.rs`)
- `Encoder`, `Decoder`, and `Tokenizer` traits stay `Send + Sync` so instances can be shared across threads. Concrete backends implement the minimal methods: `encode`, `encode_batch`, `decode`, `vocab_size`, special-token lookup, and optional token↔id conversions.
- `Encoding` wraps backend-specific results: `Hf` holds the Hugging Face encoding object, `Sp` is a plain ID vector reserved for future SentencePiece support, and `Tiktoken` stores u32 IDs from `tiktoken-rs`. `Encoding::token_ids()` is the zero-copy accessor used everywhere.
- `SpecialTokens` collects optional BOS/EOS/etc. markers so upstream code can make backend-agnostic decisions.
- `Tokenizer` (in `mod.rs`) is a thin `Arc<dyn Tokenizer>` newtype that exposes convenience methods (`encode`, `decode`, `decode_stream`, etc.) while keeping cloning cheap.
## Backend Implementations
### HuggingFaceTokenizer (`huggingface.rs`)
- Loads `tokenizer.json` (or similar) using `tokenizers::Tokenizer::from_file`.
- Caches vocab forward and reverse maps for `token_to_id`/`id_to_token` support.
- Extracts special tokens using common patterns (e.g. `<s>`, `[CLS]`).
- Supports optional chat templates: either auto-discovered next to the tokenizer via `tokenizer_config.json` or overridable with an explicit template path.
- Exposes `apply_chat_template`, which renders a minijinja template given JSON message payloads and template parameters.
### TiktokenTokenizer (`tiktoken.rs`)
- Wraps the `tiktoken-rs` `CoreBPE` builders (`cl100k_base`, `p50k_base`, `p50k_edit`, `r50k_base`).
- `from_model_name` heuristically maps OpenAI model IDs (e.g. `gpt-4`, `text-davinci-003`) to those bases. Unknown model names return an error rather than silently defaulting.
- Implements encode/decode operations; batch encode simply iterates sequentially.
- Provides approximate vocab sizes and common GPT special tokens. Direct token↔id lookup is not implemented; the underlying library does not expose that mapping.
### MockTokenizer (`mock.rs`)
- Purely for tests; hard-codes a tiny vocabulary and simple whitespace tokenization.
- Implements the same trait surface so helpers can be exercised without pulling real tokenizer data.
## Factory and Backend Discovery (`factory.rs`)
- `create_tokenizer{,_async}` accept either a filesystem path or a model identifier. Logic:
  1. Paths are loaded directly; the file extension (or JSON autodetection) selects the backend.
  2. Strings that look like OpenAI model names (`gpt-*`, `davinci`, `curie`, `babbage`, `ada`) use `TiktokenTokenizer`.
  3. Everything else attempts a Hugging Face Hub download via `download_tokenizer_from_hf`.
- Chat templates can be injected with `create_tokenizer_with_chat_template`.
- Async creation uses `tokio` for network access. The blocking variant reuses or spins up a runtime when called from synchronous contexts.
- SentencePiece (`.model`) and GGUF files are detected but currently return a clear `not supported` error.
## Hugging Face Hub Integration (`hub.rs`)
- Uses the async `hf_hub` API to list and download tokenizer-related files (`tokenizer.json`, `merges.txt`, `.model`, etc.), filtering out weights and docs.
- The helper returns the HF cache directory containing the fetched files; the factory then loads from disk using standard file paths.
- Honours the `HF_TOKEN` environment variable for private or rate-limited models. Without it the download may fail with an authorization error.
## Chat Template Support (`chat_template.rs`)
- Detects whether a template expects raw string content or the structured OpenAI-style `content` list by walking the minijinja AST. This matches the Python-side detection logic used elsewhere in SGLang.
- `ChatTemplateProcessor` (constructed per call) renders templates against JSON `messages` and `ChatTemplateParams` (system prompt, tools, EOS token handling, etc.). Errors surface as `anyhow::Error`, keeping parity with Hugging Face error messages.
- The tokenizer wrapper stores both the template string and its detected content format so callers can pre-transform message content correctly.
## Streaming and Stateful Helpers
### `DecodeStream` (`stream.rs`)
- Maintains a sliding window (`prefix_offset`, `read_offset`) over accumulated token IDs.
- Each `step` decodes the known prefix and the new slice; when the new slice produces additional UTF-8 text (and does not end in the replacement character `�`), it returns the incremental chunk and updates offsets. Otherwise it returns `None` and waits for more tokens.
- `step_batch` and `flush` offer convenience for batching and draining remaining text.
### `Sequence` (`sequence.rs`)
- Holds per-request decoding state: accumulated IDs plus offsets mirroring `DecodeStream`.
- `append_text` encodes extra prompt text; `append_token` decodes incremental output while respecting UTF-8 boundaries and replacing stray `�` characters.
- Designed for integration with router sequence management where decoded text must be replayed.
### `StopSequenceDecoder` (`stop.rs`)
- Extends the incremental decoding approach with a "jail" buffer that holds potential partial matches against configured stop sequences.
- Supports both token-level stops (visible or hidden) and arbitrary string sequences. When a string stop is configured, the decoder emits only the safe prefix and keeps a suffix jailed until it can decide whether it completes a stop sequence.
- Provides `StopSequenceDecoderBuilder` for ergonomic configuration and exposes `process_token`, `process_tokens`, `flush`, `reset`, and `is_stopped` helpers.
## Testing
- Unit tests cover the mock tokenizer, the `Tokenizer` wrapper, incremental decoding helpers, and stop-sequence behaviour (`tests.rs`, `sequence.rs`, `stop.rs`, `tiktoken.rs`, `factory.rs`, `hub.rs`). Network-dependent Hugging Face downloads are exercised behind a best-effort async test that skips in CI without credentials.
- Use `cargo test -p sgl-router tokenizer` to run the module's test suite.
## Known Limitations & Future Work
- SentencePiece (`.model`) and GGUF tokenizers are detected but deliberately unimplemented.
- `Encoding::Sp` exists for future SentencePiece support but currently behaves as a simple `Vec<u32>`.
- `TiktokenTokenizer` cannot map individual tokens/IDs; the underlying library would need to expose its vocabulary to implement `token_to_id`/`id_to_token`.
- There is no metrics or batching layer inside this module; the router records metrics elsewhere.
- Dynamic batching / sequence pooling code that earlier READMEs mentioned never landed in Rust.
## Usage Examples
```rust
use std::sync::Arc;

use sglang_router_rs::tokenizer::{
    create_tokenizer, SequenceDecoderOutput, StopSequenceDecoderBuilder, Tokenizer,
};

// Load a tokenizer from disk (Hugging Face JSON)
let tokenizer = Tokenizer::from_file("/path/to/tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!")?;
assert!(!encoding.token_ids().is_empty());

// Auto-detect OpenAI GPT tokenizer
let openai = create_tokenizer("gpt-4")?;
let text = openai.decode(&[1, 2, 3], true)?;

// Incremental decoding with stop sequences
let mut stream = tokenizer.decode_stream(&[], true);
let mut stop = StopSequenceDecoderBuilder::new(Arc::clone(&tokenizer))
    .stop_sequence("\nHuman:")
    .build();
for &token in encoding.token_ids().iter() {
    if stream.step(token)?.is_some() {
        match stop.process_token(token)? {
            SequenceDecoderOutput::Text(t) => println!("{}", t),
            SequenceDecoderOutput::StoppedWithText(t) => {
                println!("{}", t);
                break;
            }
            SequenceDecoderOutput::Held | SequenceDecoderOutput::Stopped => {}
        }
    }
}
```
```rust
// Apply a chat template when one is bundled with the tokenizer
use sglang_router_rs::tokenizer::{chat_template::ChatTemplateParams, HuggingFaceTokenizer};

let mut hf = HuggingFaceTokenizer::from_file_with_chat_template(
    "./tokenizer.json",
    Some("./chat_template.jinja"),
)?;
let messages = vec![
    serde_json::json!({"role": "system", "content": "You are concise."}),
    serde_json::json!({"role": "user", "content": "Summarise Rust traits."}),
];
let prompt = hf.apply_chat_template(
    &messages,
    ChatTemplateParams {
        add_generation_prompt: true,
        continue_final_message: false,
        tools: None,
        documents: None,
        template_kwargs: None,
    },
)?;
```
Set `HF_TOKEN` in the environment if you need to download private models from the Hugging Face Hub.