TESTING.md 15.3 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
# Tool-Call / Reasoning Parser Corner Cases

Reference taxonomy for unit testing tool-call and reasoning parsers. Each parser
added under `src/tool_calling/` or `src/reasoning/` should cover the generic
`CASE.<n>` categories; family-specific parsers also cover their respective
`CASE.xml<n>` / `CASE.harmony<n>` categories. `N/A` should be called out
explicitly in the test file rather than silently omitted.

Category layout:
- **`CASE.1`–`CASE.16`****Generic**. Apply to every parser regardless of grammar.
- **`CASE.xml1`–`CASE.xml2`** — XML-family only (hermes, glm47, qwen3_coder, minimax_m2, kimi_k2).
- **`CASE.harmony1`** — Harmony only (gpt-oss).

Per-model gap tracking lives elsewhere (not in this repo).

## Quick reference

### Generic (all parsers)

- **`CASE.1`** Single tool call — happy path (one complete, well-formed call). Ex: `xml::test_parse_simple_tool_call`.
- **`CASE.2`** Multiple tool calls — sequential or parallel (2+ in one response). Ex: `tool_choice::test_streaming_required_tool_parallel`.
- **`CASE.3`** No tool call (response is text only). Ex: `xml::test_parse_no_tool_calls`.
- **`CASE.4`** Malformed / partial JSON args (truncated, missing close brace, invalid syntax). Ex: `deepseek_v3::test_parse_..._with_invalid_json`, `xml::test_parse_missing_..._closing_tag`.
- **`CASE.5`** Missing end-token recovery (recover calls when `section_end` is absent due to max_tokens / EOS). Ex: `kimi_k2::test_parse_malformed_no_section_end`.
- **`CASE.6`** Empty args (`arguments={}` / no-arg call). Ex: `kimi_k2::test_parse_no_arg_call`.
- **`CASE.7`** Complex arg types (nested objects, arrays, bool, number, Unicode / newlines in values). Ex: `kimi_k2::test_parse_complex_json_arguments`.
- **`CASE.8`** Streaming — token-by-token assembly + chunk-boundary splits. Ex: `basic::test_buffer_state_persistence_across_calls`, `basic::test_partial_token_matching_closing_tag`, `basic::test_kimi_k2_one_shot_split`.
- **`CASE.9`** Paired reasoning + tool in same response. Ex: `test_reasoning_parser::test_nemotron_with_reasoning_and_tool_calls`.
- **`CASE.10`** Reasoning only (think tags, no tool call). Ex: `basic::test_detect_and_parse_reasoning_reasoning`.
- **`CASE.11`** `tool_choice` = auto / required / named / none. Ex: `tool_choice::test_named_tool_choice_parses_json` (**hermes only today**).
- **`CASE.12`** `finish_reason` semantics (`stop` / `tool_calls` / `length` mapping). Ex: `tool_choice_finish_reasons::*` (hermes only); `test_streaming_tool_parsers::test_qwen_finish_reason_length_vllm`.
- **`CASE.13`** Normal text interleaved with tool calls. Ex: `xml::test_parse_with_normal_text`.
- **`CASE.14`** Empty content / empty `tool_calls` array / null response. Ex: `parallel_tool_call_integration::test_empty_tool_calls`.
- **`CASE.15`** Duplicate tool calls (same name twice). No test anywhere in the repo; universal gap.
- **`CASE.16`** Regression for a specific customer bug (ticket ID referenced in test name or body). Ex: `kimi_k2::test_parse_malformed_no_section_end`.

### XML-family (`CASE.xml*`)

- **`CASE.xml1`** XML entity / HTML unescape handling (`&lt;`, `&amp;`, `&quot;` in parameter values). Ex: `xml::test_html_unescape`, `glm47::test_xml_entity_decoding`.
- **`CASE.xml2`** Schema-aware type coercion (string → number/bool/array based on declared parameter schema). Ex: `xml::test_schema_aware_type_conversion`, `glm47::test_type_coercion_*`.

### Harmony (`CASE.harmony*`)

- **`CASE.harmony1`** Channel / recipient parsing (analysis / commentary / final channels). Ex: `harmony_parser::test_parse_tool_calls_harmony_*`.

### Universal gaps (no test anywhere, not promoted to numbered categories)

- Unicode in function names (non-ASCII tool names, emoji).
- Numeric overflow in args (very large int / float outside JSON spec range).
- Empty function name (`"name": ""`).
- Concurrent parallel requests (process-level contention during parse).
- Guided-decoding ↔ tool-call interaction (constrained generation emits malformed args).
- Extremely long output (≥10 KB tool-call JSON in a single call).
- Mid-stream error injection / interruption (worker kill, network drop mid-parse).
- Schema arg-count mismatch (model emits extra or missing args vs declared schema).

---

## `CASE.1` — Single tool call, happy path

One complete, well-formed call in the response.

- Applies to every tool-call parser.
- Baseline correctness check. If `CASE.1` fails, nothing else below matters.
- Example: `dsml/parser.rs::test_parse_single_tool_call_string_param`.

## `CASE.2` — Multiple tool calls (sequential or parallel)

Two or more calls in one response, in the same block or back-to-back.

- Applies to every tool-call parser.
- Some grammars emit parallel calls in one block (DSML, XML); others emit
  sequential top-level sentinels (JSON dialects). Either way, extract all.
- Example: `dsml/parser.rs::test_parse_multiple_tool_calls`,
  `tool_choice::test_streaming_required_tool_parallel`.

## `CASE.3` — No tool call

Response is plain text, no tool-call grammar present.

- Applies to every tool-call parser.
- Must return empty `Vec<ToolCall>` and the input as `normal_text`. Zero false
  positives.
- Example: `dsml/parser.rs::test_parse_no_tool_calls`.

## `CASE.4` — Malformed / partial JSON args

Truncated JSON, missing close brace, invalid syntax inside the arguments
payload.

- Applies to every tool-call parser. For parsers whose grammar never embeds
  JSON (none today — all top-N families embed JSON somewhere), mark explicit
  `N/A`.
- Behavior must be documented: either graceful fallback to string (DSML's
  current behavior via `serde_json::from_str(...).unwrap_or_else(|_| String(...))`)
  or explicit error. Silent drop is the failure mode.
- Example: `dsml/parser.rs::test_parse_deepseek_v4_malformed_json_value_falls_back_to_string`,
  `deepseek_v3_parser.rs::test_parse_tool_calls_deepseek_v3_with_invalid_json`.

## `CASE.5` — Missing end-token recovery

The model's response is truncated before the closing fence arrives
(`<|tool_calls_section_end|>` for Kimi, `</|DSML|tool_calls>` for DeepSeek
DSML, etc.) — typically because the engine hit `max_tokens` or the model
emitted EOS mid-generation.

- Applies to every tool-call parser with paired start/end fences.
- Customer-facing bug class: silent drop of the in-flight call looks like a
  successful HTTP 200 with no tool_calls and no error.
- Two acceptable resolutions: (a) recover completed invokes even without the
  outer close fence (Kimi K2 does this post-fix), or (b) return an explicit
  error. Either way, pin the behavior with a test so a future change is
  intentional.
- Example (post-recovery): `kimi_k2_parser.rs::test_parse_malformed_no_section_end`.
- Example (behavior-pinning, pre-recovery): `dsml/parser.rs::test_parse_deepseek_v4_missing_end_token`.

## `CASE.6` — Empty args

Tool call with `arguments={}`, or a no-parameter invoke.

- Applies to every tool-call parser.
- Must still return the call — empty args is a valid call, not a missing one.
- Example: `kimi_k2_parser.rs::test_parse_no_arg_call`,
  `dsml/parser.rs::test_parse_deepseek_v4_no_parameters`.

## `CASE.7` — Complex argument types

Nested objects, arrays, booleans, numbers, mixed types, Unicode values, and
newlines inside argument values.

- Applies to every tool-call parser.
- For grammars that carry type hints (DSML's `string="true|false"`), verify
  JSON round-tripping. For XML grammars without hints, the type-coercion
  half of the test is covered under `CASE.xml2` instead — here just verify
  that complex values make it through without truncation or escape bugs.
- Example: `dsml/parser.rs::test_parse_mixed_types_realistic`,
  `kimi_k2_parser.rs::test_parse_complex_json_arguments`.

## `CASE.8` — Streaming

Chunked input arriving over SSE. Covers two concerns that tend to fail
together:

1. **Token-by-token assembly** — the parser incrementally reconstructs the
   tool-call structure across many small chunks.
2. **Chunk-boundary splits** — start fence, end fence, or parameter name /
   value straddles a chunk boundary. Partial-token matching must return
   `true` (keep buffering, don't flush as plain text) and complete the
   match on the next chunk.

- Applies to every tool-call parser. Dominant production path.
- Example: `basic::test_buffer_state_persistence_across_calls`,
  `basic::test_partial_token_matching_closing_tag`,
  `basic::test_kimi_k2_one_shot_split`,
  `test_streaming_tool_parsers::test_deepseek_v4_e2e_fragmented_tokens_vllm`.

## `CASE.9` — Paired reasoning + tool in same response

Model emits `<think>...</think>` (or analog) followed by a tool call. Both
must be extracted: `reasoning_content` populated AND `tool_calls` populated.

- Applies to every (tool, reasoning) parser pair.
- Watch for the "unclosed think-tag swallows tool call" bug — if the reasoning
  parser is greedy it may eat the tool-call content that follows.
- Example: `test_reasoning_parser::test_nemotron_with_reasoning_and_tool_calls`,
  `test_reasoning_parser::test_kimi_k25_with_reasoning_and_tool_calls`.

## `CASE.10` — Reasoning only

`<think>...</think>` or analog present, no tool call. Parser must populate
`reasoning_content` and leave `tool_calls` empty.

- Applies to every reasoning parser.
- Example: `reasoning/base_parser.rs::test_detect_and_parse_reasoning_reasoning`,
  `reasoning/mod.rs::test_deepseek_v4_detect_and_parse`.

## `CASE.11` — `tool_choice` = auto / required / named / none

Each of the four OpenAI `tool_choice` modes exercised per parser.

- Applies to every tool-call parser.
- Cross-parser suites at `lib/llm/tests/tool_choice.rs` /
  `parallel_tool_call_integration.rs` / `tool_choice_finish_reasons.rs`
  run `hermes` only today. Adding a new parser requires parametrizing those
  suites or adding a per-parser equivalent.
- Universal gap across most parsers in the repo as of 2026-04.
- Example: `tool_choice::test_named_tool_choice_parses_json`,
  `tool_choice::test_required_tool_choice_parses_json_array`.

## `CASE.12` — `finish_reason` semantics

`stop` vs `tool_calls` vs `length` mapping, in both streaming and
non-streaming paths.

- Applies to every tool-call parser.
- When a tool call lands, `finish_reason` must become `tool_calls`. When
  `max_tokens` truncates mid-stream, `length` must propagate — this is
  often the signal that should trigger `CASE.5` recovery on the parser side.
- Example: `tool_choice_finish_reasons::test_named_tool_choice_normal_stop_becomes_tool_calls`,
  `test_streaming_tool_parsers::test_qwen_finish_reason_length_vllm`.

## `CASE.13` — Normal text interleaved with tool calls

Model emits narration text before / after / between tool-call blocks. Parser
must split content correctly: text → `normal_content`, calls → `tool_calls`.

- Applies to every tool-call parser.
- Example: `dsml/parser.rs::test_parse_with_normal_text`,
  `test_streaming_tool_parsers.rs::test_deepseek_v4_e2e_content_before_tool_vllm`.

## `CASE.14` — Empty content / empty `tool_calls` array / null response

Engine emits a chunk with `delta.content = ""`, or a final response with
`tool_calls: []`, or `null` values inside arguments.

- Applies to every tool-call parser.
- Null-value handling inside parameters is parser-level (`parse_parameters`
  in DSML handles it via `serde_json::Value::Null`). Empty-choices /
  empty-stream handling is typically at the e2e integration layer.
- Example: `dsml/parser.rs::test_parse_null_parameter`,
  `parallel_tool_call_integration::test_empty_tool_calls`.

## `CASE.15` — Duplicate tool calls (same name twice)

Two calls to the same function name in one response, possibly with the same
arguments.

- Applies to every tool-call parser.
- **Zero coverage across the entire repo as of 2026-04.** Universal gap.
- Expected behavior: both calls must appear in `tool_calls` with distinct
  IDs. (The runtime / client is responsible for deciding whether duplicate
  invocation is intended.)

## `CASE.16` — Regression for a specific customer bug

Test named after (or containing) a ticket reference, pinning the fix for
a customer-reported failure.

- Applies per-incident. Not a category every parser needs to cover in
  advance; populated as bugs are reported and fixed.
- Existing example: `kimi_k2_parser.rs::test_parse_malformed_no_section_end`.

---

## `CASE.xml1` — XML entity / HTML unescape handling

Parameter values contain XML-encoded entities (`&lt;`, `&amp;`, `&quot;`,
`&apos;`, numeric entities like `&#38;`) that must be decoded before the
value is surfaced to the client.

- Applies only to XML-family tool-call parsers: `hermes`, `glm47`,
  `qwen3_coder`, `minimax_m2`, `kimi_k2` (despite its special-token outer
  fence, the inner parameter payload is XML-ish).
- **N/A for DSML** — the `string="true|false"` attribute tells the parser
  whether to JSON-decode or pass through verbatim; no entity decoding pass.
- **N/A for JSON-family and Harmony** — JSON has its own escape semantics
  handled by `serde_json`.
- Example: `xml/parser.rs::test_html_unescape`,
  `glm47_parser.rs::test_xml_entity_decoding`.

## `CASE.xml2` — Schema-aware type coercion

Parser uses the declared tool schema to coerce string args to
number / bool / array based on the declared parameter type.

- Applies only to XML-family parsers without explicit type annotations in
  the wire format. `xml/parser.rs`, `glm47_parser.rs` do this.
- **N/A for DSML** — the `string="true|false"` attribute carries the type
  intent per parameter, so no schema lookup is needed.
- **N/A for JSON-family** — JSON has native types.
- **N/A for Harmony** — payload is JSON inside the channel envelope.
- Example: `xml/parser.rs::test_schema_aware_type_conversion`,
  `glm47_parser.rs::test_type_coercion_array_comma_separated`.

---

## `CASE.harmony1` — Channel / recipient parsing

OpenAI Harmony's token stream carries channel metadata
(`<|channel|>analysis|commentary|final<|message|>`) and recipient targets
(`to=functions.foo`). Parser must route the `commentary` channel content
into tool-call extraction while surfacing `analysis` as reasoning and
`final` as the user-visible output.

- **Harmony only.** N/A for every other family.
- Example: `harmony/harmony_parser.rs::test_parse_tool_calls_harmony_with_multi_args`,
  `harmony/harmony_parser.rs::test_parse_tool_calls_harmony_with_normal_text`.

---

## Applicability summary

| Category block | Parsers | Notes |
| -- | -- | -- |
| `CASE.1``CASE.16` (generic) | All | Required contract for every parser |
| `CASE.xml1``CASE.xml2` | XML-family only | Entity decoding + schema-aware coercion |
| `CASE.harmony1` | Harmony only | Channel routing |

## Adding a new parser: what you must include

Minimum viable set for a new tool-call parser:

1. `CASE.1`, `CASE.2`, `CASE.3` — baseline correctness.
2. `CASE.4` or explicit N/A justification — handle or refuse malformed input.
3. `CASE.5` — pin behavior when the outer fence is missing. Silent drop is a
   regression waiting to happen.
4. `CASE.6`, `CASE.7` — empty and complex args.
5. `CASE.8` — streaming. Essentially non-negotiable for any parser that sits
   behind a streaming frontend.
6. `CASE.13` — interleaved text.
7. `CASE.15` — document whether duplicate calls are supported. Flat gap
   today; landing a test with the parser establishes the contract.
8. Family-specific categories where applicable: `CASE.xml1` / `CASE.xml2`
   for XML grammars, `CASE.harmony1` for Harmony.

For reasoning parsers, replace `CASE.4` / `CASE.5` / `CASE.8`-assembly with
`CASE.8`-partial-close-tag and `CASE.10` (reasoning-only).