profiling_index.rst 1.82 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Profiling vLLM 
=================================

We support tracing vLLM workers using the ``torch.profiler`` module. You can enable tracing by setting the ``VLLM_TORCH_PROFILER_DIR`` environment variable to the directory where you want to save the traces: ``VLLM_TORCH_PROFILER_DIR=/mnt/traces/``

The OpenAI server also needs to be started with the ``VLLM_TORCH_PROFILER_DIR`` environment variable set.

When using ``benchmarks/benchmark_serving.py``, you can enable profiling by passing the ``--profile`` flag.

.. warning::

   Only enable profiling in a development environment. 


Traces can be visualized using https://ui.perfetto.dev/.

.. tip::

   Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
20
21
22
23

.. tip::

   To stop the profiler - it flushes out all the profile trace files to the directory. This takes time, for example for about 100 requests worth of data for a llama 70b, it takes about 10 minutes to flush out on a H100.
24
25
   Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the server. Say something like 30 minutes.
   ``export VLLM_RPC_TIMEOUT=1800000``
26
27
28
29
30
31
32
33
34
  
Example commands and usage:
===========================

Offline Inference:
------------------

Refer to `examples/offline_inference_with_profiler.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_with_profiler.py>`_ for an example.

35
36

OpenAI Server:
37
--------------
38
39
40

.. code-block:: bash

41
    VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B 
42
43
44
45
46
47

benchmark_serving.py:

.. code-block:: bash

    python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2