[Bugfix] Fix render server crash for quantized models on CPU-only hosts (#37215)

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

Commit 0fefd00e (unverified), authored Mar 16, 2026 by Sage; committed by GitHub Mar 16, 2026

Showing 1 changed file with 5 additions and 0 deletions

vllm/entrypoints/cli/launch.py +5 -0

--- a/vllm/entrypoints/cli/launch.py
+++ b/vllm/entrypoints/cli/launch.py
@@ -116,6 +116,11 @@ async def run_launch_fastapi(args: argparse.Namespace) -> None:
    # 2. Build and serve the API server
    engine_args = AsyncEngineArgs.from_cli_args(args)
    model_config = engine_args.create_model_config()
+    # Render servers preprocess data only — no inference, no quantized kernels.
+    # Clear quantization so VllmConfig skips quant dtype/capability validation.
+    model_config.quantization = None
    vllm_config = VllmConfig(model_config=model_config)
    shutdown_task = await build_and_serve_renderer(
        vllm_config, listen_address, sock, args
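
The effect of the change can be illustrated with a minimal, self-contained sketch. It does not use vLLM: `ToyModelConfig`, `ToyVllmConfig`, `SUPPORTED_QUANT`, and `build_render_config` are hypothetical stand-ins, and the validation rule is an assumption made only to show why clearing `quantization` before building the top-level config avoids a capability check that a CPU-only host would fail.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table of quantization methods each device type can run.
# The real validation in vLLM is more involved; this is only an illustration.
SUPPORTED_QUANT = {"cpu": set(), "cuda": {"awq", "gptq"}}


@dataclass
class ToyModelConfig:
    """Stand-in for a model config carrying an optional quantization method."""
    model: str
    quantization: Optional[str] = None


class ToyVllmConfig:
    """Stand-in for a top-level config that validates quantization on construction."""

    def __init__(self, model_config: ToyModelConfig, device: str = "cpu") -> None:
        quant = model_config.quantization
        # Validation runs only when a quantization method is set.
        if quant is not None and quant not in SUPPORTED_QUANT[device]:
            raise ValueError(f"quantization {quant!r} is not supported on {device}")
        self.model_config = model_config
        self.device = device


def build_render_config(model_config: ToyModelConfig, device: str = "cpu") -> ToyVllmConfig:
    """Build a config for a preprocessing-only server.

    The server never runs quantized kernels, so the quantization setting is
    cleared before the validating constructor sees it.
    """
    model_config.quantization = None
    return ToyVllmConfig(model_config, device)
```

In this sketch, constructing `ToyVllmConfig` directly from a quantized model config on `"cpu"` raises `ValueError`, while `build_render_config` succeeds because the check is skipped once `quantization` is `None`.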