[OpenVINO] Updated documentation (#7687)

Commit 398521ad, authored Aug 20, 2024 by Ilya Lavrenov; committed by GitHub Aug 20, 2024

Showing with 1 addition and 3 deletions

docs/source/getting_started/openvino-installation.rst +1 -3

--- a/docs/source/getting_started/openvino-installation.rst
+++ b/docs/source/getting_started/openvino-installation.rst
@@ -70,7 +70,7 @@ vLLM OpenVINO backend uses the following environment variables to control behavi
 - ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off.
+- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off. You can also export model with different compression techniques using `optimum-cli` and pass exported folder as `<model_id>`
 To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (``--enable-chunked-prefill``). Based on the experiments, the recommended batch size is ``256`` (``--max-num-batched-tokens``)
@@ -91,5 +91,3 @@ Limitations
 - Only LLM models are currently supported. LLaVa and encoder-decoder models are not currently enabled in vLLM OpenVINO integration.
 - Tensor and pipeline parallelism are not currently enabled in vLLM integration.
- Speculative sampling is not tested within vLLM integration.
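
Reviewer note: the settings documented in the first hunk can be combined into a single launch configuration. The sketch below is illustrative only and is not part of this commit. The helper name `build_openvino_launch` is hypothetical, and the server entry point `vllm.entrypoints.openai.api_server` is an assumption about how the server is started; the environment variables and flags are the ones named in the documentation above. The function only builds the values and does not start anything.

```python
def build_openvino_launch(model, kv_cache_precision="u8", quantize_weights=True,
                          chunked_prefill=True, max_num_batched_tokens=256):
    """Build (env_overrides, argv) for a vLLM server on the OpenVINO backend.

    Hypothetical helper: nothing is launched here, the caller decides how to
    run the command.
    """
    env_overrides = {}
    if kv_cache_precision is not None:
        # KV cache precision; the docs say FP16 / BF16 is the platform default.
        env_overrides["VLLM_OPENVINO_CPU_KV_CACHE_PRECISION"] = kv_cache_precision
    if quantize_weights:
        # U8 weight compression at model load time (off by default).
        env_overrides["VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS"] = "ON"

    # Assumed entry point; `model` may be a model id or a folder that was
    # exported beforehand with optimum-cli.
    argv = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model]
    if chunked_prefill:
        # Chunked prefill with the batch size recommended in the docs (256).
        argv += ["--enable-chunked-prefill",
                 "--max-num-batched-tokens", str(max_num_batched_tokens)]
    return env_overrides, argv
```

To run it, merge the overrides into the current environment, for example `subprocess.run(argv, env={**os.environ, **env_overrides})`. When `model` points to a folder already compressed during export, passing `quantize_weights=False` avoids requesting load-time compression a second time.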