docs: Update Request Migration test instructions (#5754)

Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>

docs: Update Request Migration test instructions (#5754)
Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
fa1ea1d5 · Jacky · GitHub · 5a00a7d6 · fa1ea1d5
Unverified Commit fa1ea1d5 authored Jan 28, 2026 by Jacky Committed by GitHub Jan 28, 2026
Show whitespace changes
Inline Side-by-side

Showing with 39 additions and 21 deletions

tests/fault_tolerance/README.md tests/fault_tolerance/README.md +39 -21

No files found.
--- a/tests/fault_tolerance/README.md
+++ b/tests/fault_tolerance/README.md
@@ -2,38 +2,56 @@

 ## Migration Tests

-The migration directory contains tests for worker fault tolerance with migration support.
+The migration directory contains tests for worker fault tolerance with migration support across multiple backends (vLLM, SGLang, TRT-LLM) in both aggregated and disaggregated modes.
+
+### Test Parameterization
+
+All migration tests are parameterized with the following dimensions:
+
+| Parameter IDs | Description |
+|---------------|-------------|
+| `migration_enabled`, `migration_disabled` | Controls whether migration is allowed |
+| `worker_failure` (SIGKILL), `graceful_shutdown` (SIGTERM) | Worker termination method |
+| `chat`, `completion` (skipped) | API endpoint to test |
+| `stream`, `unary` (skipped) | Streaming vs unary responses |
+| `nats`, `tcp` | Request plane transport |

 ### Test Matrix

-| Test | Shutdown Method | Migration Enabled | Expected Result | Verification |
-|------|----------------|-------------------|-----------------|--------------|
-| `test_request_migration_vllm_worker_failure` | SIGKILL (immediate) | Yes (default) | Request succeeds | "Stream disconnected... recreating stream..." in logs |
-| `test_request_migration_vllm_graceful_shutdown` | SIGTERM (10s timeout) | Yes (default) | Request succeeds | "Stream disconnected... recreating stream..." in logs |
-| `test_no_request_migration_vllm_worker_failure` | SIGKILL (immediate) | No (migration_limit=0) | Request fails (500) | "Migration limit exhausted" in logs |
-| `test_no_request_migration_vllm_graceful_shutdown` | SIGTERM (10s timeout) | No (migration_limit=0) | Request fails (500) | "Migration limit exhausted" in logs |
+Each backend (vLLM, SGLang, TRT-LLM) has the following test types:

-### Common Test Flow
+| Test | Mode | Setup |
+|------|------|-------|
+| `test_request_migration_{backend}_aggregated` | Aggregated | 2 workers |
+| `test_request_migration_{backend}_prefill` | Disaggregated | 1 decode + 2 prefill |
+| `test_request_migration_{backend}_kv_transfer` | Disaggregated | 1 prefill + 2 decode |
+| `test_request_migration_{backend}_decode` | Disaggregated | 1 prefill + 2 decode |

-All migration tests follow this pattern:
+Where `{backend}` is one of: `vllm`, `sglang`, `trtllm`
+
+### Common Test Flow

 1. Start a Dynamo frontend with round-robin routing
-2. Start 2 vLLM workers sequentially
-3. Send a long completion request (max_tokens=8192) in a separate daemon thread
-4. Use parallel polling to determine which worker received the request (checks for "New Request ID:" in logs)
-5. Terminate the worker processing the request (method varies by test)
-6. Validate the request outcome (success or failure based on migration setting)
-7. Verify migration behavior in frontend logs
+2. Start workers (configuration varies by mode: aggregated or disaggregated)
+3. Send a request (chat/completion, streaming/unary) in a background thread
+4. Determine which worker received the request via log polling
+5. For decode tests: wait for initial responses before termination
+6. Terminate the worker processing the request (SIGKILL or SIGTERM)
+7. Validate the request outcome based on `migration_limit`:
+   - `migration_limit > 0`: Request succeeds, verify TTFT/TPOT if streaming and migration metrics
+   - `migration_limit = 0`: Request fails with expected error
+8. Verify migration behavior in frontend logs

 **Run examples:**
 ```bash
-# With migration enabled
-pytest tests/fault_tolerance/migration/test_vllm.py::test_request_migration_vllm_worker_failure -v -s
-pytest tests/fault_tolerance/migration/test_vllm.py::test_request_migration_vllm_graceful_shutdown -v -s
+# Run all vLLM migration tests
+pytest tests/fault_tolerance/migration -m vllm -v -s
+
+# Run aggregated or decode tests for SGLang
+pytest tests/fault_tolerance/migration -m sglang -k "aggregated or decode" -v -s

-# With migration disabled
-pytest tests/fault_tolerance/migration/test_vllm.py::test_no_request_migration_vllm_worker_failure -v -s
-pytest tests/fault_tolerance/migration/test_vllm.py::test_no_request_migration_vllm_graceful_shutdown -v -s
+# Run specific parameter combination
+pytest tests/fault_tolerance/migration -m trtllm -k "aggregated and nats and stream and chat and worker_failure and migration_enabled" -v -s
 ```

 ## Cancellation Tests