Unverified Commit b4bab816 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

Remove unnecessary explicit title anchors and use relative links instead (#20620)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent b91cb3fa
---
title: Offline Inference
---
[](){ #offline-inference }
Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
......@@ -18,8 +17,8 @@ llm = LLM(model="facebook/opt-125m")
After initializing the `LLM` instance, use the available APIs to perform model inference.
The available APIs depend on the model type:
- [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
- [Pooling models][pooling-models] output their hidden states directly.
- [Generative models](../models/generative_models.md) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](../models/pooling_models.md) output their hidden states directly.
!!! info
[API Reference][offline-inference-api]
......
---
title: OpenAI-Compatible Server
---
[](){ #serving-openai-compatible-server }
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`](../configuration/serve_args.md) command. (You can also use our [Docker](../deployment/docker.md) image.)
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
......@@ -208,7 +207,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs][multimodal-inputs] guide for more information.
see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information.
- *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
......
---
title: Frequently Asked Questions
---
[](){ #faq }
> Q: How can I serve multiple models on a single port using the OpenAI API?
......@@ -12,7 +11,7 @@ A: Assuming that you're referring to using OpenAI compatible server to serve mul
> Q: Which model to use for offline inference embedding?
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
more are listed [here][supported-models].
more are listed [here](../models/supported_models.md).
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
......
......@@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI compatible API server.
You can start the server using Python, or using [Docker][deployment-docker]:
You can start the server using Python, or using [Docker](../deployment/docker.md):
```bash
vllm serve unsloth/Llama-3.2-1B-Instruct
......
---
title: Troubleshooting
---
[](){ #troubleshooting }
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
......@@ -267,7 +266,7 @@ or:
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
```
But you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps](../configuration/model_resolution.md) to explicitly specify the vLLM implementation for the model.
But you are sure that the model is in the [list of supported models](../models/supported_models.md), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](../configuration/model_resolution.md) to explicitly specify the vLLM implementation for the model.
## Failed to infer device type
......
......@@ -90,7 +90,7 @@ vLLM V1 currently excludes model architectures with the `SupportsV0Only` protoco
!!! tip
This corresponds to the V1 column in our [list of supported models][supported-models].
This corresponds to the V1 column in our [list of supported models](../models/supported_models.md).
See below for the status of models that are not yet supported or have more features planned in V1.
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment