OpenDAS / vllm_cscc
Commit 9ace378a (Unverified)
Authored Mar 23, 2026 by Andrew Xia
Committed by GitHub, Mar 23, 2026

[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498)

Signed-off-by: Andrew Xia <axia@meta.com>

Parent: 27d5ee3e

Showing 2 changed files with 4 additions and 1 deletion

docs/design/metrics.md (+2 -1)
vllm/entrypoints/openai/responses/serving.py (+2 -0)

docs/design/metrics.md
@@ -244,6 +244,7 @@ statistics relating to that iteration:
 prefill in this iteration. However, we calculate this interval
 relative to when the request was first received by the frontend
 (`arrival_time`) in order to account for input processing time.
+Currently `arrival_time` starts when tokenization begins.
 For any requests that were completed in a given iteration, we also
 record:
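
To make the documented behaviour concrete, here is a minimal arithmetic sketch; it is not vLLM code. The helper `ttft_seconds` and all timestamps are hypothetical. The only things taken from the text above are that the interval is measured from `arrival_time` and that `arrival_time` starts when tokenization begins.

```python
def ttft_seconds(arrival_time: float, first_token_time: float) -> float:
    """Time to first token, measured from when the request arrived."""
    return first_token_time - arrival_time


# Hypothetical timeline, in seconds on a single clock.
tokenization_start = 100.00  # frontend begins input processing
tokenization_end = 100.25    # prompt token ids are ready
first_token_time = 100.75    # first output token is produced

# arrival_time taken when tokenization begins: input processing is included.
ttft_including_input = ttft_seconds(tokenization_start, first_token_time)

# arrival_time taken only after tokenization: input processing is missed.
ttft_excluding_input = ttft_seconds(tokenization_end, first_token_time)

print(ttft_including_input, ttft_excluding_input)
```

In this made-up timeline the two measurements differ by exactly the tokenization time, which is the gap the documentation sentence is describing.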

vllm/entrypoints/openai/responses/serving.py
@@ -710,9 +710,11 @@ class OpenAIServingResponses(OpenAIServing):
                 "Only 'auto' tool_choice is supported in response API with Harmony"
             )
+        arrival_time = time.time()
         messages = self._construct_input_messages_with_harmony(request, prev_response)
         prompt_token_ids = render_for_completion(messages)
         engine_prompt = token_inputs(prompt_token_ids)
+        engine_prompt["arrival_time"] = arrival_time
         # Add cache_salt if provided in the request
         if request.cache_salt is not None:
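
The ordering in this hunk (timestamp first, input processing second, then attaching the timestamp to the prompt) can be sketched in isolation. This is an illustration under stated assumptions, not the vLLM implementation: `build_engine_prompt`, the injectable `clock` parameter, and the character-code "tokenizer" are all hypothetical, and a plain dict stands in for the value returned by `token_inputs`.

```python
import time
from typing import Callable


def build_engine_prompt(
    text: str,
    clock: Callable[[], float] = time.time,
) -> dict:
    """Hypothetical stand-in for the request path touched by the diff."""
    # Record the timestamp first, so that input processing is counted.
    arrival_time = clock()

    # Stand-in for message construction and rendering to token ids.
    prompt_token_ids = [ord(ch) for ch in text]

    # A plain dict stands in for the engine prompt; attach the timestamp.
    engine_prompt = {"prompt_token_ids": prompt_token_ids}
    engine_prompt["arrival_time"] = arrival_time
    return engine_prompt


# Usage with a fake clock, so the recorded value is deterministic.
ticks = iter([10.0, 11.0])
prompt = build_engine_prompt("hi", clock=lambda: next(ticks))
print(prompt)
```

Passing the clock in as a parameter is only a convenience for testing the sketch; the diff itself calls `time.time()` directly.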