# Tokenizer Module

## Overview
The `sgl-router` tokenizer subsystem exposes a single `Tokenizer` facade around multiple backends
(Hugging Face JSON tokenizers, OpenAI/tiktoken models, and an in-memory mock). It packages the
shared behaviours the router needs (encoding user text, incrementally decoding streamed tokens,
tracking per-request state, and detecting stop conditions) behind trait objects so the rest of the
router can remain backend-agnostic.

Key capabilities:
- trait-based split between `Encoder`, `Decoder`, and `Tokenizer` for shared APIs across backends
- Hugging Face tokenizer loading (with optional chat templates) and HF Hub downloads
- heuristic selection of OpenAI/tiktoken encodings for GPT model names
- incremental decoding utilities (`DecodeStream`, `Sequence`) that handle UTF-8 boundaries
- stop sequence handling via `StopSequenceDecoder` with token-level and string-level triggers
- optional Jinja2 chat-template rendering that matches Hugging Face semantics

The implementation deliberately keeps the surface area small: metrics, batching, and SentencePiece
support mentioned in earlier drafts do **not** exist today.  This document reflects the actual code
as of `sgl-router/src/tokenizer/*`.

## Source Map
- `mod.rs` – module exports and the `Tokenizer` wrapper around `Arc<dyn Tokenizer>`
- `traits.rs` – shared traits and the `Encoding`/`SpecialTokens` helper types
- `factory.rs` – backend discovery, file/model heuristics, and tokio-aware creation helpers
- `hub.rs` – Hugging Face Hub downloads via `hf_hub`
- `huggingface.rs` – wrapper over `tokenizers::Tokenizer`, chat template loading, vocab access
- `tiktoken.rs` – wrapper over `tiktoken-rs` encoders for OpenAI model families
- `chat_template.rs` – AST-driven Jinja template inspection and rendering utilities
- `sequence.rs` – stateful incremental decoding helper used by router sequences
- `stream.rs` – stateless streaming decoder that yields textual chunks from token streams
- `stop.rs` – stop-sequence detection with "jail" buffering and a builder API
- `mock.rs` – lightweight tokenizer used by unit tests
- `tests.rs` – smoke tests covering the trait facade and helpers (largely with the mock backend)

## Core Traits and Types (`traits.rs`)
- `Encoder`, `Decoder`, and `Tokenizer` traits stay `Send + Sync` so instances can be shared across
  threads.  Concrete backends implement the minimal methods: `encode`, `encode_batch`, `decode`,
  `vocab_size`, special-token lookup, and optional token↔id conversions.
- `Encoding` wraps backend-specific results: `Hf` holds the Hugging Face encoding object,
  `Sp` is a plain ID vector reserved for future SentencePiece support, and `Tiktoken` stores u32 IDs
  from `tiktoken-rs`.  `Encoding::token_ids()` is the zero-copy accessor used everywhere.
- `SpecialTokens` collects optional BOS/EOS/etc. markers so upstream code can make backend-agnostic
  decisions.
- `Tokenizer` (in `mod.rs`) is a thin `Arc<dyn Tokenizer>` newtype that exposes convenience methods
  (`encode`, `decode`, `decode_stream`, etc.) while keeping cloning cheap.
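
The trait split can be pictured roughly as follows. This is a hedged sketch, not the real
definitions: the signatures are illustrative, and the actual traits in `traits.rs` return the
backend-specific `Encoding` enum rather than bare `Vec<u32>`.

```rust
use std::sync::Arc;

// Illustrative trait shapes only; see traits.rs for the real definitions.
pub trait Encoder: Send + Sync {
    fn encode(&self, input: &str) -> anyhow::Result<Vec<u32>>;
    fn encode_batch(&self, inputs: &[&str]) -> anyhow::Result<Vec<Vec<u32>>>;
}

pub trait Decoder: Send + Sync {
    fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> anyhow::Result<String>;
}

pub trait TokenizerTrait: Encoder + Decoder {
    fn vocab_size(&self) -> usize;
    fn token_to_id(&self, token: &str) -> Option<u32>;
    fn id_to_token(&self, id: u32) -> Option<String>;
}

// The facade clones cheaply because it only clones an Arc.
#[derive(Clone)]
pub struct Tokenizer(Arc<dyn TokenizerTrait>);
```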

## Backend Implementations
### HuggingFaceTokenizer (`huggingface.rs`)
- Loads `tokenizer.json` (or similar) using `tokenizers::Tokenizer::from_file`.
- Caches vocab forward and reverse maps for `token_to_id`/`id_to_token` support.
- Extracts special tokens using common patterns (e.g. `<s>`, `[CLS]`).
- Supports optional chat templates: either auto-discovered next to the tokenizer via
  `tokenizer_config.json` or overridable with an explicit template path.
- Exposes `apply_chat_template` which renders a minijinja template given JSON message payloads and
  template parameters.

### TiktokenTokenizer (`tiktoken.rs`)
- Wraps the `tiktoken-rs` `CoreBPE` builders (`cl100k_base`, `p50k_base`, `p50k_edit`, `r50k_base`).
- `from_model_name` heuristically maps OpenAI model IDs (e.g. `gpt-4`, `text-davinci-003`) to those
  bases. Unknown model names return an error rather than silently defaulting.
- Implements encode/decode operations; batch encode simply iterates sequentially.
- Provides approximate vocab sizes and common GPT special tokens.  Direct token↔id lookup is not
  implemented—the underlying library does not expose that mapping.
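
A hedged sketch of the model-name heuristic described above; the exact prefixes and mappings live
in `tiktoken.rs`, so treat this as illustrative:

```rust
use tiktoken_rs::{cl100k_base, p50k_base, r50k_base, CoreBPE};

// Illustrative mapping; the real heuristic may recognise more model families.
fn bpe_for_model(model: &str) -> anyhow::Result<CoreBPE> {
    if model.starts_with("gpt-4") || model.starts_with("gpt-3.5") {
        cl100k_base()
    } else if model.starts_with("text-davinci-003") || model.starts_with("text-davinci-002") {
        p50k_base()
    } else if ["davinci", "curie", "babbage", "ada"].iter().any(|m| model.starts_with(m)) {
        r50k_base()
    } else {
        // Unknown names fail loudly instead of silently defaulting.
        anyhow::bail!("unsupported model name: {model}")
    }
}
```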

### MockTokenizer (`mock.rs`)
- Purely for tests; hard-codes a tiny vocabulary and simple whitespace tokenization.
- Implements the same trait surface so helpers can be exercised without pulling real tokenizer data.

## Factory and Backend Discovery (`factory.rs`)
- `create_tokenizer{,_async}` accept either a filesystem path or a model identifier.  Logic:
   1. Paths are loaded directly; the file extension (or JSON autodetection) selects the backend.
   2. Strings that look like OpenAI model names (`gpt-*`, `davinci`, `curie`, `babbage`, `ada`) use
      `TiktokenTokenizer`.
   3. Everything else attempts a Hugging Face Hub download via `download_tokenizer_from_hf`.
- Chat templates can be injected with `create_tokenizer_with_chat_template`.
- Async creation uses `tokio` for network access. The blocking variant reuses or spins up a runtime
  when called from synchronous contexts.
- SentencePiece (`.model`) and GGUF files are detected but currently return a clear `not supported`
  error.
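
The discovery order can be sketched as below; the helper names are hypothetical and only the shape
of the dispatch mirrors `factory.rs`:

```rust
use std::path::Path;

// Hypothetical helper: rough OpenAI model-name check.
fn looks_like_openai_model(name: &str) -> bool {
    name.starts_with("gpt-")
        || ["davinci", "curie", "babbage", "ada"].iter().any(|m| name.contains(m))
}

// Hypothetical dispatcher returning a backend label for illustration.
fn select_backend(name_or_path: &str) -> &'static str {
    let path = Path::new(name_or_path);
    if path.exists() {
        match path.extension().and_then(|e| e.to_str()) {
            Some("json") => "huggingface",
            Some("model") | Some("gguf") => "unsupported", // clear error today
            _ => "huggingface", // JSON autodetection handles extension-less files
        }
    } else if looks_like_openai_model(name_or_path) {
        "tiktoken"
    } else {
        "hf-hub" // download_tokenizer_from_hf, then load from disk
    }
}
```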

## Hugging Face Hub Integration (`hub.rs`)
- Uses the async `hf_hub` API to list and download tokenizer-related files
  (`tokenizer.json`, `merges.txt`, `.model`, etc.), filtering out weights and docs.
- The helper returns the HF cache directory containing the fetched files; the factory then loads
  from disk using standard file paths.
- Honours the `HF_TOKEN` environment variable for private or rate-limited models; without it the
  download may fail with an authorization error.
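
A usage sketch under stated assumptions: the signature of `download_tokenizer_from_hf` is assumed
here (a single model-ID argument, returning the cache directory, called from an async context), and
the model ID is an arbitrary example.

```rust
// Assumed signature: async fn download_tokenizer_from_hf(&str) -> anyhow::Result<PathBuf>.
// Export HF_TOKEN first when fetching private or gated models.
let cache_dir = download_tokenizer_from_hf("Qwen/Qwen2.5-7B-Instruct").await?;
println!("tokenizer files cached under {}", cache_dir.display());
```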

## Chat Template Support (`chat_template.rs`)
- Detects whether a template expects raw string content or the structured OpenAI-style `content`
  list by walking the minijinja AST.  This matches the Python-side detection logic used elsewhere in
  SGLang.
- `ChatTemplateProcessor` (constructed per call) renders templates against JSON `messages` and
  `ChatTemplateParams` (system prompt, tools, EOS token handling, etc.).  Errors surface as
  `anyhow::Error`, keeping parity with Hugging Face error messages.
- The tokenizer wrapper stores both the template string and its detected content format so callers
  can pre-transform message content correctly.
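
For illustration, the two message shapes the detector distinguishes: plain string content versus
the structured OpenAI-style content list:

```rust
// Classic templates expect plain string content.
let string_style = serde_json::json!({
    "role": "user",
    "content": "What is a trait object?"
});

// Structured templates iterate over a list of typed parts instead.
let list_style = serde_json::json!({
    "role": "user",
    "content": [
        { "type": "text", "text": "What is a trait object?" }
    ]
});
```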

## Streaming and Stateful Helpers
### `DecodeStream` (`stream.rs`)
- Maintains a sliding window (`prefix_offset`, `read_offset`) over accumulated token IDs.
- Each `step` decodes the known prefix and the new slice; when the new slice produces additional
  UTF-8 text (and does not end in the replacement character `�`), it returns the incremental chunk
  and updates offsets.  Otherwise it returns `None` and waits for more tokens.
- `step_batch` and `flush` offer convenience for batching and draining remaining text.
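
The windowed step can be sketched like this; the names and free-function shape are illustrative
(the real method lives on `DecodeStream`), but the UTF-8 guard mirrors the description above:

```rust
// Illustrative stand-in for DecodeStream::step; `decode` abstracts the backend.
fn step(
    ids: &mut Vec<u32>,
    prefix_offset: &mut usize,
    read_offset: &mut usize,
    new_id: u32,
    decode: impl Fn(&[u32]) -> String,
) -> Option<String> {
    ids.push(new_id);
    // Decode the known prefix and the extended window separately.
    let prefix = decode(&ids[*prefix_offset..*read_offset]);
    let full = decode(&ids[*prefix_offset..]);
    // Emit only when new, complete UTF-8 text appeared (no trailing U+FFFD).
    if full.len() > prefix.len() && !full.ends_with('\u{FFFD}') {
        let chunk = full[prefix.len()..].to_string();
        *prefix_offset = *read_offset;
        *read_offset = ids.len();
        Some(chunk)
    } else {
        None
    }
}
```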

### `Sequence` (`sequence.rs`)
- Holds per-request decoding state: accumulated IDs plus offsets mirroring `DecodeStream`.
- `append_text` encodes extra prompt text; `append_token` decodes incremental output while
  respecting UTF-8 boundaries and replacing stray `�` characters.
- Designed for integration with router sequence management where decoded text must be replayed.

### `StopSequenceDecoder` (`stop.rs`)
- Extends the incremental decoding approach with a "jail" buffer that holds potential partial
  matches against configured stop sequences.
- Supports both token-level stops (visible or hidden) and arbitrary string sequences.  When a string
  stop is configured, the decoder emits only the safe prefix and keeps a suffix jailed until it can
  decide whether it completes a stop sequence.
- Provides `StopSequenceDecoderBuilder` for ergonomic configuration and exposes `process_token`,
  `process_tokens`, `flush`, `reset`, and `is_stopped` helpers.
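
The jail decision for string stops reduces to a split like the sketch below (names are
illustrative; `stop.rs` tracks more state, such as hidden token-level stops):

```rust
// Decide what part of newly decoded text is safe to emit.
// Returns (emit_now, jailed_suffix, hit_stop).
fn split_on_stops<'a>(text: &'a str, stops: &[&str]) -> (&'a str, &'a str, bool) {
    // A completed stop sequence: emit only the text before it and stop.
    for stop in stops {
        if let Some(pos) = text.find(stop) {
            return (&text[..pos], "", true);
        }
    }
    // Otherwise jail the longest suffix that could still grow into a stop.
    for (start, _) in text.char_indices() {
        let suffix = &text[start..];
        if stops.iter().any(|s| s.starts_with(suffix)) {
            return (&text[..start], suffix, false);
        }
    }
    (text, "", false)
}
```

The jailed suffix is released later if subsequent tokens rule out every configured stop sequence.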

## Testing
- Unit tests cover the mock tokenizer, the `Tokenizer` wrapper, incremental decoding helpers, and
  stop-sequence behaviour (`tests.rs`, `sequence.rs`, `stop.rs`, `tiktoken.rs`, `factory.rs`,
  `hub.rs`).  Network-dependent Hugging Face downloads are exercised behind a best-effort async test
  that skips in CI without credentials.
- Use `cargo test -p sgl-router tokenizer` to run the module’s test suite.

## Known Limitations & Future Work
- SentencePiece (`.model`) and GGUF tokenizers are detected but deliberately unimplemented.
- `Encoding::Sp` exists for future SentencePiece support but currently behaves as a simple `Vec<u32>`.
- `TiktokenTokenizer` cannot map individual tokens/IDs; the underlying library would need to expose
  its vocabulary to implement `token_to_id`/`id_to_token`.
- There is no metrics or batching layer inside this module; the router records metrics elsewhere.
- Dynamic batching / sequence pooling code that earlier READMEs mentioned never landed in Rust.

## Usage Examples
```rust
use sglang_router_rs::tokenizer::{
    create_tokenizer, SequenceDecoderOutput, StopSequenceDecoderBuilder, Tokenizer,
};

// Load a tokenizer from disk (Hugging Face JSON)
let tokenizer = Tokenizer::from_file("/path/to/tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!")?;
assert!(!encoding.token_ids().is_empty());

// Auto-detect OpenAI GPT tokenizer
let openai = create_tokenizer("gpt-4")?;
let text = openai.decode(&[1, 2, 3], true)?;

// Incremental decoding with stop sequences
let mut stream = tokenizer.decode_stream(&[], true);
let mut stop = StopSequenceDecoderBuilder::new(tokenizer.clone())
    .stop_sequence("\nHuman:")
    .build();
for &token in encoding.token_ids() {
    // DecodeStream yields raw incremental text; the stop decoder decides what is
    // safe to surface, so every token is fed to both.
    let _raw_chunk = stream.step(token)?;
    match stop.process_token(token)? {
        SequenceDecoderOutput::Text(t) => println!("{}", t),
        SequenceDecoderOutput::StoppedWithText(t) => {
            println!("{}", t);
            break;
        }
        SequenceDecoderOutput::Held | SequenceDecoderOutput::Stopped => {}
    }
}
```

```rust
// Apply a chat template when one is bundled with the tokenizer
use sglang_router_rs::tokenizer::{chat_template::ChatTemplateParams, HuggingFaceTokenizer};

let mut hf = HuggingFaceTokenizer::from_file_with_chat_template(
    "./tokenizer.json",
    Some("./chat_template.jinja"),
)?;
let messages = vec![
    serde_json::json!({"role": "system", "content": "You are concise."}),
    serde_json::json!({"role": "user", "content": "Summarise Rust traits."}),
];
let prompt = hf.apply_chat_template(
    &messages,
    ChatTemplateParams {
        add_generation_prompt: true,
        continue_final_message: false,
        tools: None,
        documents: None,
        template_kwargs: None,
    },
)?;
```

Set `HF_TOKEN` in the environment if you need to download private models from the Hugging Face Hub.