@@ -28,6 +28,8 @@ Please refer to the above pages for more details about each API.
...
[API Reference](/api/offline_inference/index)
:::
(configuration-options)=
## Configuration Options
This section lists the most common options for running the vLLM engine.
...
@@ -59,6 +61,8 @@ model = LLM(
...
Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
(reducing-memory-usage)=
### Reducing memory usage
Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
...
@@ -81,6 +85,12 @@ before initializing vLLM. Otherwise, you may run into an error like `RuntimeErro
...
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
:::
:::{note}
With tensor parallelism enabled, each process reads the whole model and then splits it into chunks, so the disk read time grows in proportion to the tensor parallel size.
To avoid this, you can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion might take some time, but afterwards the sharded checkpoint loads much faster, and the model loading time should remain constant regardless of the tensor parallel size.
:::
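The effect described in the note can be put in rough numbers. The sketch below is a simplified cost model, not a measurement and not part of vLLM: it assumes that without sharding every tensor-parallel rank reads the full checkpoint, and that with a sharded checkpoint each rank reads only an equal-sized slice. The function names and the 140 GB checkpoint size are illustrative assumptions.

```python
def per_rank_read_gb(checkpoint_gb: float, tp_size: int, sharded: bool) -> float:
    """Data one tensor-parallel rank reads from disk (simplified model)."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    # Assumption: a sharded checkpoint is split evenly across ranks.
    return checkpoint_gb / tp_size if sharded else checkpoint_gb


def total_read_gb(checkpoint_gb: float, tp_size: int, sharded: bool) -> float:
    """Data read from disk summed over all ranks."""
    return per_rank_read_gb(checkpoint_gb, tp_size, sharded) * tp_size


checkpoint_gb = 140.0  # illustrative size, not a measurement
for tp_size in (1, 2, 4, 8):
    print(
        tp_size,
        total_read_gb(checkpoint_gb, tp_size, sharded=False),
        total_read_gb(checkpoint_gb, tp_size, sharded=True),
    )
```

Under these assumptions the unsharded total grows linearly with the tensor parallel size, while the sharded total stays equal to the checkpoint size.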
#### Quantization
Quantized models take less memory at the cost of lower precision.
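As a rough illustration of the memory saving, the following sketch estimates the space taken by the weights alone at different bit widths. It is back-of-the-envelope arithmetic under stated assumptions: it ignores the KV cache, activations, and any per-group scales or zero points that a real quantization scheme stores. The helper name and the 7-billion-parameter figure are hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


num_params = 7e9  # hypothetical model size
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(num_params, bits):.1f} GiB")
```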
vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
You can pass these parameters to vLLM through the OpenAI client's `extra_body` request parameter, for example `extra_body={"top_k": 50}` for `top_k`.
:::
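A minimal sketch of how `extra_body` might be used with the OpenAI Python client is shown below. The model name `my-model`, the `max_tokens` value, and the helper function names are placeholders chosen for illustration, and the server URL assumes a vLLM server listening on `http://localhost:8000/v1`; adjust them to your deployment. The request itself is wrapped in a function that is defined but not called, so the block runs without a server.

```python
def build_completion_kwargs(model: str, prompt: str, top_k: int) -> dict:
    """Assemble keyword arguments for `client.completions.create`."""
    return {
        "model": model,
        "prompt": prompt,
        # Standard OpenAI parameters are passed as usual.
        "max_tokens": 32,
        # Parameters that OpenAI does not define go inside `extra_body`.
        "extra_body": {"top_k": top_k},
    }


def send_completion(kwargs: dict, base_url: str = "http://localhost:8000/v1"):
    """Send the request to a running server (requires the `openai` package)."""
    from openai import OpenAI  # imported lazily so the sketch runs without it

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    return client.completions.create(**kwargs)


kwargs = build_completion_kwargs("my-model", "Hello, my name is", top_k=50)
print(kwargs["extra_body"])
```

With a server running, calling `send_completion(kwargs)` would issue the request with `top_k` forwarded in the request body.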
:::{important}
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server.
# The model has an audio-specific lora directly in its model dir;
# it should be enabled whenever you pass audio inputs to the model.
speech_lora_path = model_name
audio_placeholder = "<|audio|>" * audio_count
prompts = f"<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>{audio_placeholder}{question}<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>"  # noqa: E501