[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498)

Signed-off-by: Andrew Xia <axia@meta.com>

[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498)
Signed-off-by: Andrew Xia <axia@meta.com>
9ace378a · Andrew Xia · GitHub · 27d5ee3e · 9ace378a · 9ace378a
Unverified Commit 9ace378a authored Mar 23, 2026 by Andrew Xia Committed by GitHub Mar 23, 2026
Hide whitespace changes
Inline Side-by-side

Showing with 4 additions and 1 deletion

docs/design/metrics.md docs/design/metrics.md +2 -1

vllm/entrypoints/openai/responses/serving.py vllm/entrypoints/openai/responses/serving.py +2 -0

No files found.
--- a/docs/design/metrics.md
+++ b/docs/design/metrics.md
@@ -244,6 +244,7 @@ statistics relating to that iteration:
  prefill in this iteration. However, we calculate this interval
  relative to when the request was first received by the frontend
  (`arrival_time`) in order to account for input processing time.
+  Currently `arrival_time` starts when tokenization begins.

 For any requests that were completed in a given iteration, we also
 record:
@@ -587,7 +588,7 @@ see:
 - [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
 - [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
 - <https://github.com/vllm-project/vllm/issues/5041> and <https://github.com/vllm-project/vllm/pull/12726>.
-  
+
 This is a non-trivial topic. Consider this comment from Rob:

 > I think this metric should focus on trying to estimate what the max

--- a/vllm/entrypoints/openai/responses/serving.py
+++ b/vllm/entrypoints/openai/responses/serving.py
@@ -710,9 +710,11 @@ class OpenAIServingResponses(OpenAIServing):
                "Only 'auto' tool_choice is supported in response API with Harmony"
            )

+        arrival_time = time.time()
        messages = self._construct_input_messages_with_harmony(request, prev_response)
        prompt_token_ids = render_for_completion(messages)
        engine_prompt = token_inputs(prompt_token_ids)
+        engine_prompt["arrival_time"] = arrival_time

        # Add cache_salt if provided in the request
        if request.cache_salt is not None: