# Context Extension

!!! note
    The `--rope-scaling` parameter used in older versions of vLLM is no longer supported. Please use the `--hf-overrides` method with `rope_parameters` instead.
This directory contains examples for extending the context length of models using vLLM.

## Offline Inference Example

The [`context_extension.py`](../../examples/offline_inference/context_extension) script demonstrates how to extend the context length of a Qwen model using the YARN method (rope_parameters) and run a simple chat example.

### Usage

```bash
python examples/offline_inference/context_extension.py
```

## OpenAI Online Method

You can also use vLLM's OpenAI-compatible API to serve models with extended context length.

### Usage

Run the vLLM server with the following command to extend the context length using YARN:

```bash
vllm serve Qwen/Qwen3-0.6B \
  --hf-overrides '{"rope_parameters": {"factor": 4.0, "original_max_position_embeddings": 32768, "rope_theta": 1000000, "rope_type": "yarn"}}' \
  --max-model-len 131072
```

### Client Example

After starting the server, you can use the OpenAI Python client to interact with it:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"  # Dummy API key, required by the client
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hello"}
    ],
    max_tokens=128,
    temperature=0.8,
    top_p=0.95
)

print(response.choices[0].message.content)
```

### Key Parameters

The available parameters depend on the `rope_type` you choose. For detailed information about all supported RoPE types and their specific parameters, please refer to the [Hugging Face Transformers RoPE documentation](https://huggingface.co/docs/transformers/main/en/internal/rope_utils#transformers.RopeParameters).

Common parameters include:

- `rope_type`: The type of RoPE implementation (e.g., "yarn", "linear", "dynamic")
- `factor`: The factor by which to extend the context length
- `original_max_position_embeddings`: The original maximum position embeddings of the model

The following parameters are specific to vLLM:

- `max_model_len`: The new maximum sequence length after extension (original * factor).
  Used for KV cache pre‑allocation and request limit at serving time.