[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498)

Signed-off-by: Andrew Xia <axia@meta.com>

Commit 9ace378a, authored Mar 23, 2026 by Andrew Xia; committed by GitHub Mar 23, 2026

Showing 2 changed files with 4 additions and 1 deletion

docs/design/metrics.md +2 -1

vllm/entrypoints/openai/responses/serving.py +2 -0

--- a/docs/design/metrics.md
+++ b/docs/design/metrics.md
@@ -244,6 +244,7 @@ statistics relating to that iteration:
  prefill in this iteration. However, we calculate this interval
  relative to when the request was first received by the frontend
  (`arrival_time`) in order to account for input processing time.
+  Currently `arrival_time` starts when tokenization begins.
 For any requests that were completed in a given iteration, we also
 record:

--- a/vllm/entrypoints/openai/responses/serving.py
+++ b/vllm/entrypoints/openai/responses/serving.py
@@ -710,9 +710,11 @@ class OpenAIServingResponses(OpenAIServing):
                "Only 'auto' tool_choice is supported in response API with Harmony"
            )
+        arrival_time = time.time()
        messages = self._construct_input_messages_with_harmony(request, prev_response)
        prompt_token_ids = render_for_completion(messages)
        engine_prompt = token_inputs(prompt_token_ids)
+        engine_prompt["arrival_time"] = arrival_time
        # Add cache_salt if provided in the request
        if request.cache_salt is not None:
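
Note: the pattern in the `serving.py` hunk above (stamp the arrival time before input processing, then attach it to the prompt) can be illustrated with a minimal, standalone sketch. This is not vLLM code: `tokenize`, `build_engine_prompt`, and `ttft_seconds` are hypothetical names, the plain dict stands in for the real engine prompt, and the sleep is only a stand-in for message construction and tokenization cost.

```python
import time


def tokenize(text: str, delay_s: float = 0.05) -> list[int]:
    """Hypothetical stand-in for message construction plus tokenization."""
    time.sleep(delay_s)  # simulate input processing cost
    return [ord(ch) for ch in text]


def build_engine_prompt(text: str, delay_s: float = 0.05) -> dict:
    """Build a prompt dict that carries the time the request arrived."""
    # Stamp arrival before any input processing, so that an interval
    # measured from arrival_time includes tokenization time.
    arrival_time = time.time()
    prompt = {"prompt_token_ids": tokenize(text, delay_s)}
    prompt["arrival_time"] = arrival_time
    return prompt


def ttft_seconds(prompt: dict, first_token_time: float) -> float:
    """Time to first token, measured from the stamped arrival time."""
    return first_token_time - prompt["arrival_time"]


if __name__ == "__main__":
    prompt = build_engine_prompt("hello")
    first_token_time = time.time()  # pretend the first token is emitted now
    print(f"TTFT including input processing: {ttft_seconds(prompt, first_token_time):.3f}s")
```

If the timestamp were instead taken after `tokenize` returned, the simulated input processing delay would drop out of the measured interval, which is the kind of undercount the commit title describes.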