Unverified Commit c3649e4f authored by Lukas Geiger's avatar Lukas Geiger Committed by GitHub
Browse files

[Docs] Fix syntax highlighting of shell commands (#19870)


Signed-off-by: default avatarLukas Geiger <lukas.geiger94@gmail.com>
parent 53243e5c
...@@ -16,7 +16,7 @@ Please download the visualization scripts in the post ...@@ -16,7 +16,7 @@ Please download the visualization scripts in the post
- Download `nightly-benchmarks.zip`. - Download `nightly-benchmarks.zip`.
- In the same folder, run the following code: - In the same folder, run the following code:
```console ```bash
export HF_TOKEN=<your HF token> export HF_TOKEN=<your HF token>
apt update apt update
apt install -y git apt install -y git
......
...@@ -10,7 +10,7 @@ title: Using Docker ...@@ -10,7 +10,7 @@ title: Using Docker
vLLM offers an official Docker image for deployment. vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags). The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
```console ```bash
docker run --runtime nvidia --gpus all \ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \ --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
...@@ -22,7 +22,7 @@ docker run --runtime nvidia --gpus all \ ...@@ -22,7 +22,7 @@ docker run --runtime nvidia --gpus all \
This image can also be used with other container engines such as [Podman](https://podman.io/). This image can also be used with other container engines such as [Podman](https://podman.io/).
```console ```bash
podman run --gpus all \ podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
...@@ -71,7 +71,7 @@ You can add any other [engine-args][engine-args] you need after the image tag (` ...@@ -71,7 +71,7 @@ You can add any other [engine-args][engine-args] you need after the image tag (`
You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM: You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM:
```console ```bash
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2 # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . \ DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \ --target vllm-openai \
...@@ -99,7 +99,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- ...@@ -99,7 +99,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
??? Command ??? Command
```console ```bash
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB) # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
python3 use_existing_torch.py python3 use_existing_torch.py
DOCKER_BUILDKIT=1 docker build . \ DOCKER_BUILDKIT=1 docker build . \
...@@ -118,7 +118,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- ...@@ -118,7 +118,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
Run the following command on your host machine to register QEMU user static handlers: Run the following command on your host machine to register QEMU user static handlers:
```console ```bash
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
``` ```
...@@ -128,7 +128,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- ...@@ -128,7 +128,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
To run vLLM with the custom-built Docker image: To run vLLM with the custom-built Docker image:
```console ```bash
docker run --runtime nvidia --gpus all \ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \ -p 8000:8000 \
......
...@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac ...@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
``` ```
......
...@@ -11,7 +11,7 @@ title: AutoGen ...@@ -11,7 +11,7 @@ title: AutoGen
- Setup [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment - Setup [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment
```console ```bash
pip install vllm pip install vllm
# Install AgentChat and OpenAI client from Extensions # Install AgentChat and OpenAI client from Extensions
...@@ -23,7 +23,7 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]" ...@@ -23,7 +23,7 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]"
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
python -m vllm.entrypoints.openai.api_server \ python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 --model mistralai/Mistral-7B-Instruct-v0.2
``` ```
......
...@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr ...@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr
To install the Cerebrium client, run: To install the Cerebrium client, run:
```console ```bash
pip install cerebrium pip install cerebrium
cerebrium login cerebrium login
``` ```
Next, create your Cerebrium project, run: Next, create your Cerebrium project, run:
```console ```bash
cerebrium init vllm-project cerebrium init vllm-project
``` ```
...@@ -58,7 +58,7 @@ Next, let us add our code to handle inference for the LLM of your choice (`mistr ...@@ -58,7 +58,7 @@ Next, let us add our code to handle inference for the LLM of your choice (`mistr
Then, run the following code to deploy it to the cloud: Then, run the following code to deploy it to the cloud:
```console ```bash
cerebrium deploy cerebrium deploy
``` ```
......
...@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac ...@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve qwen/Qwen1.5-0.5B-Chat vllm serve qwen/Qwen1.5-0.5B-Chat
``` ```
......
...@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend. ...@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend.
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve Qwen/Qwen1.5-7B-Chat vllm serve Qwen/Qwen1.5-7B-Chat
``` ```
- Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)): - Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):
```console ```bash
git clone https://github.com/langgenius/dify.git git clone https://github.com/langgenius/dify.git
cd dify cd dify
cd docker cd docker
......
...@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/), ...@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),
To install dstack client, run: To install dstack client, run:
```console ```bash
pip install "dstack[all] pip install "dstack[all]
dstack server dstack server
``` ```
Next, to configure your dstack project, run: Next, to configure your dstack project, run:
```console ```bash
mkdir -p vllm-dstack mkdir -p vllm-dstack
cd vllm-dstack cd vllm-dstack
dstack init dstack init
......
...@@ -13,7 +13,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac ...@@ -13,7 +13,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
- Setup vLLM and Haystack environment - Setup vLLM and Haystack environment
```console ```bash
pip install vllm haystack-ai pip install vllm haystack-ai
``` ```
...@@ -21,7 +21,7 @@ pip install vllm haystack-ai ...@@ -21,7 +21,7 @@ pip install vllm haystack-ai
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.1 vllm serve mistralai/Mistral-7B-Instruct-v0.1
``` ```
......
...@@ -22,7 +22,7 @@ Before you begin, ensure that you have the following: ...@@ -22,7 +22,7 @@ Before you begin, ensure that you have the following:
To install the chart with the release name `test-vllm`: To install the chart with the release name `test-vllm`:
```console ```bash
helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
``` ```
...@@ -30,7 +30,7 @@ helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f val ...@@ -30,7 +30,7 @@ helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f val
To uninstall the `test-vllm` deployment: To uninstall the `test-vllm` deployment:
```console ```bash
helm uninstall test-vllm --namespace=ns-vllm helm uninstall test-vllm --namespace=ns-vllm
``` ```
......
...@@ -18,7 +18,7 @@ And LiteLLM supports all models on VLLM. ...@@ -18,7 +18,7 @@ And LiteLLM supports all models on VLLM.
- Setup vLLM and litellm environment - Setup vLLM and litellm environment
```console ```bash
pip install vllm litellm pip install vllm litellm
``` ```
...@@ -28,7 +28,7 @@ pip install vllm litellm ...@@ -28,7 +28,7 @@ pip install vllm litellm
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve qwen/Qwen1.5-0.5B-Chat vllm serve qwen/Qwen1.5-0.5B-Chat
``` ```
...@@ -56,7 +56,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat ...@@ -56,7 +56,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
- Start the vLLM server with the supported embedding model, e.g. - Start the vLLM server with the supported embedding model, e.g.
```console ```bash
vllm serve BAAI/bge-base-en-v1.5 vllm serve BAAI/bge-base-en-v1.5
``` ```
......
...@@ -7,13 +7,13 @@ title: Open WebUI ...@@ -7,13 +7,13 @@ title: Open WebUI
2. Start the vLLM server with the supported chat completion model, e.g. 2. Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve qwen/Qwen1.5-0.5B-Chat vllm serve qwen/Qwen1.5-0.5B-Chat
``` ```
1. Start the [Open WebUI](https://github.com/open-webui/open-webui) docker container (replace the vllm serve host and vllm serve port): 1. Start the [Open WebUI](https://github.com/open-webui/open-webui) docker container (replace the vllm serve host and vllm serve port):
```console ```bash
docker run -d -p 3000:8080 \ docker run -d -p 3000:8080 \
--name open-webui \ --name open-webui \
-v open-webui:/app/backend/data \ -v open-webui:/app/backend/data \
......
...@@ -15,7 +15,7 @@ Here are the integrations: ...@@ -15,7 +15,7 @@ Here are the integrations:
- Setup vLLM and langchain environment - Setup vLLM and langchain environment
```console ```bash
pip install -U vllm \ pip install -U vllm \
langchain_milvus langchain_openai \ langchain_milvus langchain_openai \
langchain_community beautifulsoup4 \ langchain_community beautifulsoup4 \
...@@ -26,14 +26,14 @@ pip install -U vllm \ ...@@ -26,14 +26,14 @@ pip install -U vllm \
- Start the vLLM server with the supported embedding model, e.g. - Start the vLLM server with the supported embedding model, e.g.
```console ```bash
# Start embedding service (port 8000) # Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base vllm serve ssmits/Qwen2-7B-Instruct-embed-base
``` ```
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
# Start chat service (port 8001) # Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001 vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
``` ```
...@@ -52,7 +52,7 @@ python retrieval_augmented_generation_with_langchain.py ...@@ -52,7 +52,7 @@ python retrieval_augmented_generation_with_langchain.py
- Setup vLLM and llamaindex environment - Setup vLLM and llamaindex environment
```console ```bash
pip install vllm \ pip install vllm \
llama-index llama-index-readers-web \ llama-index llama-index-readers-web \
llama-index-llms-openai-like \ llama-index-llms-openai-like \
...@@ -64,14 +64,14 @@ pip install vllm \ ...@@ -64,14 +64,14 @@ pip install vllm \
- Start the vLLM server with the supported embedding model, e.g. - Start the vLLM server with the supported embedding model, e.g.
```console ```bash
# Start embedding service (port 8000) # Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base vllm serve ssmits/Qwen2-7B-Instruct-embed-base
``` ```
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
# Start chat service (port 8001) # Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001 vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
``` ```
......
...@@ -15,7 +15,7 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet ...@@ -15,7 +15,7 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)). - Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that `sky check` shows clouds or Kubernetes are enabled. - Check that `sky check` shows clouds or Kubernetes are enabled.
```console ```bash
pip install skypilot-nightly pip install skypilot-nightly
sky check sky check
``` ```
...@@ -71,7 +71,7 @@ See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypil ...@@ -71,7 +71,7 @@ See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypil
Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...): Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
```console ```bash
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
``` ```
...@@ -83,7 +83,7 @@ Check the output of the command. There will be a shareable gradio link (like the ...@@ -83,7 +83,7 @@ Check the output of the command. There will be a shareable gradio link (like the
**Optional**: Serve the 70B model instead of the default 8B and use more GPU: **Optional**: Serve the 70B model instead of the default 8B and use more GPU:
```console ```bash
HF_TOKEN="your-huggingface-token" \ HF_TOKEN="your-huggingface-token" \
sky launch serving.yaml \ sky launch serving.yaml \
--gpus A100:8 \ --gpus A100:8 \
...@@ -159,7 +159,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut ...@@ -159,7 +159,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
Start the serving the Llama-3 8B model on multiple replicas: Start the serving the Llama-3 8B model on multiple replicas:
```console ```bash
HF_TOKEN="your-huggingface-token" \ HF_TOKEN="your-huggingface-token" \
sky serve up -n vllm serving.yaml \ sky serve up -n vllm serving.yaml \
--env HF_TOKEN --env HF_TOKEN
...@@ -167,7 +167,7 @@ HF_TOKEN="your-huggingface-token" \ ...@@ -167,7 +167,7 @@ HF_TOKEN="your-huggingface-token" \
Wait until the service is ready: Wait until the service is ready:
```console ```bash
watch -n10 sky serve status vllm watch -n10 sky serve status vllm
``` ```
...@@ -271,13 +271,13 @@ This will scale the service up to when the QPS exceeds 2 for each replica. ...@@ -271,13 +271,13 @@ This will scale the service up to when the QPS exceeds 2 for each replica.
To update the service with the new config: To update the service with the new config:
```console ```bash
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
``` ```
To stop the service: To stop the service:
```console ```bash
sky serve down vllm sky serve down vllm
``` ```
...@@ -317,7 +317,7 @@ It is also possible to access the Llama-3 service with a separate GUI frontend, ...@@ -317,7 +317,7 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
1. Start the chat web UI: 1. Start the chat web UI:
```console ```bash
sky launch \ sky launch \
-c gui ./gui.yaml \ -c gui ./gui.yaml \
--env ENDPOINT=$(sky serve status --endpoint vllm) --env ENDPOINT=$(sky serve status --endpoint vllm)
......
...@@ -15,13 +15,13 @@ It can be quickly integrated with vLLM as a backend API server, enabling powerfu ...@@ -15,13 +15,13 @@ It can be quickly integrated with vLLM as a backend API server, enabling powerfu
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve qwen/Qwen1.5-0.5B-Chat vllm serve qwen/Qwen1.5-0.5B-Chat
``` ```
- Install streamlit and openai: - Install streamlit and openai:
```console ```bash
pip install streamlit openai pip install streamlit openai
``` ```
...@@ -29,7 +29,7 @@ pip install streamlit openai ...@@ -29,7 +29,7 @@ pip install streamlit openai
- Start the streamlit web UI and start to chat: - Start the streamlit web UI and start to chat:
```console ```bash
streamlit run streamlit_openai_chatbot_webserver.py streamlit run streamlit_openai_chatbot_webserver.py
# or specify the VLLM_API_BASE or VLLM_API_KEY # or specify the VLLM_API_BASE or VLLM_API_KEY
......
...@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta ...@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta
To install Llama Stack, run To install Llama Stack, run
```console ```bash
pip install llama-stack -q pip install llama-stack -q
``` ```
......
...@@ -115,7 +115,7 @@ Next, start the vLLM server as a Kubernetes Deployment and Service: ...@@ -115,7 +115,7 @@ Next, start the vLLM server as a Kubernetes Deployment and Service:
We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model): We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):
```console ```bash
kubectl logs -l app.kubernetes.io/name=vllm kubectl logs -l app.kubernetes.io/name=vllm
... ...
INFO: Started server process [1] INFO: Started server process [1]
...@@ -358,14 +358,14 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) ...@@ -358,14 +358,14 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Apply the deployment and service configurations using `kubectl apply -f <filename>`: Apply the deployment and service configurations using `kubectl apply -f <filename>`:
```console ```bash
kubectl apply -f deployment.yaml kubectl apply -f deployment.yaml
kubectl apply -f service.yaml kubectl apply -f service.yaml
``` ```
To test the deployment, run the following `curl` command: To test the deployment, run the following `curl` command:
```console ```bash
curl http://mistral-7b.default.svc.cluster.local/v1/completions \ curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
......
...@@ -11,13 +11,13 @@ This document shows how to launch multiple vLLM serving containers and use Nginx ...@@ -11,13 +11,13 @@ This document shows how to launch multiple vLLM serving containers and use Nginx
This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory. This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
```console ```bash
export vllm_root=`pwd` export vllm_root=`pwd`
``` ```
Create a file named `Dockerfile.nginx`: Create a file named `Dockerfile.nginx`:
```console ```dockerfile
FROM nginx:latest FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80 EXPOSE 80
...@@ -26,7 +26,7 @@ CMD ["nginx", "-g", "daemon off;"] ...@@ -26,7 +26,7 @@ CMD ["nginx", "-g", "daemon off;"]
Build the container: Build the container:
```console ```bash
docker build . -f Dockerfile.nginx --tag nginx-lb docker build . -f Dockerfile.nginx --tag nginx-lb
``` ```
...@@ -60,14 +60,14 @@ Create a file named `nginx_conf/nginx.conf`. Note that you can add as many serve ...@@ -60,14 +60,14 @@ Create a file named `nginx_conf/nginx.conf`. Note that you can add as many serve
## Build vLLM Container ## Build vLLM Container
```console ```bash
cd $vllm_root cd $vllm_root
docker build -f docker/Dockerfile . --tag vllm docker build -f docker/Dockerfile . --tag vllm
``` ```
If you are behind proxy, you can pass the proxy settings to the docker build command as shown below: If you are behind proxy, you can pass the proxy settings to the docker build command as shown below:
```console ```bash
cd $vllm_root cd $vllm_root
docker build \ docker build \
-f docker/Dockerfile . \ -f docker/Dockerfile . \
...@@ -80,7 +80,7 @@ docker build \ ...@@ -80,7 +80,7 @@ docker build \
## Create Docker Network ## Create Docker Network
```console ```bash
docker network create vllm_nginx docker network create vllm_nginx
``` ```
...@@ -129,7 +129,7 @@ Notes: ...@@ -129,7 +129,7 @@ Notes:
## Launch Nginx ## Launch Nginx
```console ```bash
docker run \ docker run \
-itd \ -itd \
-p 8000:80 \ -p 8000:80 \
...@@ -142,7 +142,7 @@ docker run \ ...@@ -142,7 +142,7 @@ docker run \
## Verify That vLLM Servers Are Ready ## Verify That vLLM Servers Are Ready
```console ```bash
docker logs vllm0 | grep Uvicorn docker logs vllm0 | grep Uvicorn
docker logs vllm1 | grep Uvicorn docker logs vllm1 | grep Uvicorn
``` ```
......
...@@ -307,7 +307,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for ...@@ -307,7 +307,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for
By default, the timeout for fetching images through HTTP URL is `5` seconds. By default, the timeout for fetching images through HTTP URL is `5` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```bash
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout> export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
``` ```
...@@ -370,7 +370,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for ...@@ -370,7 +370,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for
By default, the timeout for fetching videos through HTTP URL is `30` seconds. By default, the timeout for fetching videos through HTTP URL is `30` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```bash
export VLLM_VIDEO_FETCH_TIMEOUT=<timeout> export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
``` ```
...@@ -476,7 +476,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for ...@@ -476,7 +476,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for
By default, the timeout for fetching audios through HTTP URL is `10` seconds. By default, the timeout for fetching audios through HTTP URL is `10` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```bash
export VLLM_AUDIO_FETCH_TIMEOUT=<timeout> export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
``` ```
......
...@@ -9,7 +9,7 @@ The main benefits are lower latency and memory usage. ...@@ -9,7 +9,7 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq). You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq).
```console ```bash
pip install autoawq pip install autoawq
``` ```
...@@ -43,7 +43,7 @@ After installing AutoAWQ, you are ready to quantize a model. Please refer to the ...@@ -43,7 +43,7 @@ After installing AutoAWQ, you are ready to quantize a model. Please refer to the
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command: To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console ```bash
python examples/offline_inference/llm_engine_example.py \ python examples/offline_inference/llm_engine_example.py \
--model TheBloke/Llama-2-7b-Chat-AWQ \ --model TheBloke/Llama-2-7b-Chat-AWQ \
--quantization awq --quantization awq
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment