"content":"In a quiet meadow tucked between rolling hills, a plump gray rabbit nibbled on clover beneath the shade of a gnarled oak tree. Its ears twitched at the faint rustle of leaves, but it remained calm, confident in the safety of its burrow just a few hops away. The late afternoon sun warmed its fur, and tiny dust motes danced in the golden light as bees hummed lazily nearby. Though the rabbit lived a simple life, every day was an adventure of scents, shadows, and snacks—an endless search for the tastiest patch of greens and the softest spot to nap.",
}
],
"stream": True,
"max_tokens": 10,
}
# Shared TRT-LLM configuration for all tests
# free_gpu_memory_fraction limits actual VRAM allocation (required for multi-worker on same GPU)
...
@@ -67,7 +44,7 @@ TRTLLM_ARGS: Dict[str, Any] = {
}
-class TRTLLMProcess:
+class TRTLLMProcess(ManagedEngineProcessMixin):
"""Manages TRT-LLM workers using dynamo.trtllm (HTTP API + KV events).
This is a drop-in replacement for MockerProcess that uses real TRT-LLM workers.
...
@@ -223,97 +200,8 @@ class TRTLLMProcess:
f"with endpoint: {self.endpoint}"
)
+process_name = "TRT-LLM worker"
+cleanup_name = "TRT-LLM worker resources"
-def __enter__(self):
-"""Start all TRT-LLM worker processes with sequential initialization.
-Workers are started sequentially with a delay between each to avoid
-resource contention during initialization. This prevents
-MPI initialization conflicts when multiple workers
"content":"In a quiet meadow tucked between rolling hills, a plump gray rabbit nibbled on clover beneath the shade of a gnarled oak tree. Its ears twitched at the faint rustle of leaves, but it remained calm, confident in the safety of its burrow just a few hops away. The late afternoon sun warmed its fur, and tiny dust motes danced in the golden light as bees hummed lazily nearby. Though the rabbit lived a simple life, every day was an adventure of scents, shadows, and snacks—an endless search for the tastiest patch of greens and the softest spot to nap.",
}
],
"stream": True,
"max_tokens": 10,
}
# Shared vLLM configuration for all tests
# gpu_memory_utilization limits actual VRAM allocation (required for multi-worker on same GPU)
VLLM_ARGS: Dict[str, Any] = {
...
@@ -70,7 +47,7 @@ VLLM_ARGS: Dict[str, Any] = {
}
-class VLLMProcess:
+class VLLMProcess(ManagedEngineProcessMixin):
"""Manages vLLM workers using dynamo.vllm (HTTP API + KV events).
This is a drop-in replacement for MockerProcess that uses real vLLM workers.
...
@@ -271,95 +248,9 @@ class VLLMProcess:
f"with endpoint: {self.endpoint}"
)
+process_name = "vLLM worker"
+cleanup_name = "vLLM worker resources"
+init_delay_reason = "initialize NIXL before starting next worker"
-def __enter__(self):
-"""Start all vLLM worker processes with sequential initialization.
-Workers are started sequentially with a delay between each to avoid
-NIXL/UCX resource contention during initialization. This prevents
-UCX shared memory handle allocation failures when multiple workers