Unverified commit a0f44bb6, authored by Harry Mellor and committed by GitHub

Allow `markdownlint` to run locally (#36398)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent fde4771b
@@ -596,7 +596,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan
 #### Client → Server Events
 | Event | Description |
-|-------|-------------|
+| ----- | ----------- |
 | `input_audio_buffer.append` | Send base64-encoded audio chunk: `{"type": "input_audio_buffer.append", "audio": "<base64>"}` |
 | `input_audio_buffer.commit` | Trigger transcription processing or end: `{"type": "input_audio_buffer.commit", "final": bool}` |
 | `session.update` | Configure session: `{"type": "session.update", "model": "model-name"}` |
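For readers unfamiliar with the event payloads in the table above, here is a minimal sketch of how a client might serialize them. The helper names are hypothetical, and only the JSON shapes shown in the table are used. Sending the strings over a WebSocket is left out.

```python
import base64
import json


def append_event(pcm16_chunk: bytes) -> str:
    """Serialize one `input_audio_buffer.append` event for a raw PCM16 chunk."""
    audio_b64 = base64.b64encode(pcm16_chunk).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64})


def commit_event(final: bool) -> str:
    """Serialize an `input_audio_buffer.commit` event."""
    return json.dumps({"type": "input_audio_buffer.commit", "final": final})


def session_update_event(model: str) -> str:
    """Serialize a `session.update` event that selects a model by name."""
    return json.dumps({"type": "session.update", "model": model})
```

Each helper returns a JSON string, so the same functions can be used with whichever WebSocket client library is at hand.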
@@ -604,7 +604,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan
 #### Server → Client Events
 | Event | Description |
-|-------|-------------|
+| ----- | ----------- |
 | `session.created` | Connection established with session ID and timestamp |
 | `transcription.delta` | Incremental transcription text: `{"type": "transcription.delta", "delta": "text"}` |
 | `transcription.done` | Final transcription with usage stats |
......
@@ -83,13 +83,13 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
 ### Hardware
 | Hardware | Status |
-|------------------|-----------------------------------------------|
+| --------------| --------------- |
 | **NVIDIA** | <nobr>🟢</nobr> |
 | **AMD** | <nobr>🟢</nobr> |
 | **INTEL GPU** | <nobr>🟢</nobr> |
 | **TPU** | <nobr>🟢</nobr> |
 | **CPU** | <nobr>🟢</nobr> |
 !!! note
@@ -104,13 +104,13 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
 ### Models
 | Model Type | Status |
-|-----------------------------|-------------------------------------------------------------------------|
+| -------------------------- | --------------------------------------- |
 | **Decoder-only Models** | <nobr>🟢</nobr> |
 | **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others) </nobr> |
 | **Pooling Models** | <nobr>🟢</nobr> |
 | **Mamba Models** | <nobr>🟢</nobr> |
 | **Multimodal Models** | <nobr>🟢</nobr> |
 See below for the status of models that are not yet supported or have more features planned in V1.
@@ -145,7 +145,7 @@ following a similar pattern by implementing support through the [plugin system](
 ### Features
 | Feature | Status |
-|---------------------------------------------|-----------------------------------------------------------------------------------|
+| ------------------------------------------- | --------------------------------------------------------------------------------- |
 | **Prefix Caching** | <nobr>🟢 Functional</nobr> |
 | **Chunked Prefill** | <nobr>🟢 Functional</nobr> |
 | **LoRA** | <nobr>🟢 Functional</nobr> |
......
@@ -34,7 +34,7 @@ deployment methods:
 Both platforms provide equivalent monitoring capabilities:
 | Dashboard | Description |
-|-----------|-------------|
+| --------- | ----------- |
 | **Performance Statistics** | Tracks latency, throughput, and performance metrics |
 | **Query Statistics** | Monitors request volume, query performance, and KPIs |
......
@@ -95,7 +95,7 @@ If you enable prefill instance (`--prefill-servers-urls` not disabled), you will
 ## Proxy Instance Flags (`disagg_epd_proxy.py`)
 | Flag | Description |
-|------|-------------|
+| ---- | ----------- |
 | `--encode-servers-urls` | Comma-separated list of encoder endpoints. Every multimodal item extracted from the request is fanned out to one of these URLs in a round-robin fashion. |
 | `--prefill-servers-urls` | Comma-separated list of prefill endpoints. Set to `disable`, `none`, or `""` to skip the dedicated prefill phase and run E+PD (encoder + combined prefill/decode). |
 | `--decode-servers-urls` | Comma-separated list of decode endpoints. Non-stream and stream paths both round-robin over this list. |
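The flag semantics in the table above (comma-separated lists, a "disabled" sentinel, round-robin selection) can be sketched as follows. This is an illustration only and is not taken from `disagg_epd_proxy.py`: the helper names are hypothetical, and treating the sentinel values case-insensitively is an assumption.

```python
import itertools
from typing import Iterator, Optional

# Values that the table lists as disabling the dedicated prefill phase.
DISABLED_VALUES = {"disable", "none", ""}


def parse_server_urls(value: str) -> Optional[list[str]]:
    """Split a comma-separated flag value; return None if the phase is disabled."""
    # Assumption: matching is case-insensitive and ignores surrounding whitespace.
    if value.strip().lower() in DISABLED_VALUES:
        return None
    return [url.strip() for url in value.split(",") if url.strip()]


def round_robin(urls: list[str]) -> Iterator[str]:
    """Yield endpoints in order, wrapping around indefinitely."""
    return itertools.cycle(urls)
```

Calling `next()` on the iterator returned by `round_robin` picks the endpoint for each new request or multimodal item.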
......
@@ -34,7 +34,7 @@ python client.py
 ## 📁 Files
 | File | Description |
-|------|-------------|
+| ---- | ----------- |
 | `service.sh` | Server startup script with chunked processing enabled |
 | `client.py` | Comprehensive test client for long text embedding |
@@ -61,7 +61,7 @@ The key parameters for chunked processing are in the `--pooler-config`:
 Chunked processing uses **MEAN aggregation** for cross-chunk combination when input exceeds the model's native maximum length:
 | Component | Behavior | Description |
-|-----------|----------|-------------|
+| --------- | -------- | ----------- |
 | **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy |
 | **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts |
 | **Performance** | Optimal | All chunks processed for complete semantic coverage |
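The "weighted averaging based on chunk token counts" row can be made concrete with a small sketch. This is not the server's implementation. It is a plain-Python illustration that assumes each chunk has already been pooled into one vector, and the function name is hypothetical.

```python
from typing import Sequence


def weighted_mean_embedding(
    chunk_embeddings: Sequence[Sequence[float]],
    chunk_token_counts: Sequence[int],
) -> list[float]:
    """Combine per-chunk embeddings, weighting each chunk by its token count."""
    if not chunk_embeddings or len(chunk_embeddings) != len(chunk_token_counts):
        raise ValueError("need one token count per chunk embedding")
    total_tokens = sum(chunk_token_counts)
    if total_tokens <= 0:
        raise ValueError("total token count must be positive")

    dim = len(chunk_embeddings[0])
    combined = [0.0] * dim
    for embedding, n_tokens in zip(chunk_embeddings, chunk_token_counts):
        for i, value in enumerate(embedding):
            combined[i] += value * n_tokens
    return [value / total_tokens for value in combined]
```

With two chunks of 3 and 1 tokens, the first chunk contributes three quarters of the result, so longer chunks dominate the combined embedding.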
@@ -69,7 +69,7 @@ Chunked processing uses **MEAN aggregation** for cross-chunk combination when in
 ### Environment Variables
 | Variable | Default | Description |
-|----------|---------|-------------|
+| -------- | ------- | ----------- |
 | `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
 | `PORT` | `31090` | Server port |
 | `GPU_COUNT` | `1` | Number of GPUs to use |
@@ -106,7 +106,7 @@ With `MAX_EMBED_LEN=3072000`, you can process:
 ### Chunked Processing Performance
 | Aspect | Behavior | Performance |
-|--------|----------|-------------|
+| ------ | -------- | ----------- |
 | **Chunk Processing** | All chunks processed with native pooling | Consistent with input length |
 | **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead |
 | **Memory Usage** | Proportional to number of chunks | Moderate, scalable |
......
@@ -1153,11 +1153,11 @@ def _render_table(
 ) -> list[str]:
     """Render a markdown table from column specs and backend data."""
     header = "| " + " | ".join(name for name, _ in columns) + " |"
-    sep = "|" + "|".join("-" * (len(name) + 2) for name, _ in columns) + "|"
+    sep = "| " + " | ".join("-" * len(name) for name, _ in columns) + " |"
     lines = [header, sep]
     for info in sorted(backends, key=_sort_key):
         row = "| " + " | ".join(fmt(info) for _, fmt in columns) + " |"
-        lines.append(row)
+        lines.append(row.replace("  ", " "))
     return lines
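The separator change in this hunk can be reproduced in isolation. The sketch below copies the two `sep` expressions from the diff into standalone functions. The function names and the plain list of column names are illustrative and are not part of the script.

```python
def old_separator(names: list[str]) -> str:
    """Pre-change style: dashes fill the whole cell, no padding spaces."""
    return "|" + "|".join("-" * (len(name) + 2) for name in names) + "|"


def new_separator(names: list[str]) -> str:
    """Post-change style: dashes match the header text, padded by one space."""
    return "| " + " | ".join("-" * len(name) for name in names) + " |"


if __name__ == "__main__":
    columns = ["Priority", "Backend"]
    print(old_separator(columns))
    print(new_separator(columns))
```

Both forms have the same overall width. The new one puts a single space on each side of the dashes in every cell, which is the form the rest of this commit converts the hand-written tables to.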
@@ -1268,7 +1268,7 @@ def _priority_table(title: str, backends: list[str]) -> list[str]:
         f"**{title}:**",
         "",
         "| Priority | Backend |",
-        "|----------|---------|",
+        "| -------- | ------- |",
         *[f"| {i} | `{b}` |" for i, b in enumerate(backends, 1)],
         "",
     ]
@@ -1317,7 +1317,7 @@ def generate_legend() -> str:
     return """## Legend
 | Column | Description |
-|--------|-------------|
+| ------ | ----------- |
 | **Dtypes** | Supported model data types (fp16, bf16, fp32) |
 | **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
 | **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
@@ -1348,7 +1348,7 @@ def generate_mla_section(
         "configuration.",
         "",
         "| Backend | Description | Compute Cap. | Enable | Disable | Notes |",
-        "|---------|-------------|--------------|--------|---------|-------|",
+        "| ------- | ----------- | ------------ | ------ | ------- | ----- |",
     ]
     for backend in prefill_backends:
@@ -1360,7 +1360,7 @@ def generate_mla_section(
             backend["disable"],
             backend.get("notes", ""),
         )
-        lines.append(row)
+        lines.append(row.replace("  ", " "))
     lines.extend(
         [
......
@@ -43,14 +43,14 @@ Multi-lora shrink/expand Triton kernel tuning follows a similar methodology from
 ### File Naming
 | Kernel Type | File Name Template | Example |
-|---------------------------|--------------------------------------------|---------------------------------------------|
+| ------------------------- | ------------------------------------------- | -------------------------------------------- |
 | shrink | `{gpu_name}_SHRINK.json` | `NVIDIA_H200_SHRINK.json` |
 | expand | `{gpu_name}_EXPAND_{add_input}.json` | `NVIDIA_H200_EXPAND_TRUE.json` |
 | fused_moe_lora_w13_shrink | `{gpu_name}_FUSED_MOE_LORA_W13_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_SHRINK.json` |
 | fused_moe_lora_w13_expand | `{gpu_name}_FUSED_MOE_LORA_W13_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_EXPAND.json` |
 | fused_moe_lora_w2_shrink | `{gpu_name}_FUSED_MOE_LORA_W2_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_SHRINK.json` |
 | fused_moe_lora_w2_expand | `{gpu_name}_FUSED_MOE_LORA_W2_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_EXPAND.json` |
 The `gpu_name` can be automatically detected by calling `torch.cuda.get_device_name()`.
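As a rough illustration of the naming templates above, the sketch below builds a file name from a device name and kernel type. The function is hypothetical. The examples in the table suggest that spaces in the device name become underscores, but that mapping is an assumption here, as is rendering `add_input` as `TRUE`/`FALSE`.

```python
from typing import Optional


def lora_config_file_name(
    gpu_name: str, kernel_type: str, add_input: Optional[bool] = None
) -> str:
    """Build a tuned-config file name following the templates in the table."""
    # Assumption: a device name such as "NVIDIA H200" maps to "NVIDIA_H200".
    prefix = gpu_name.replace(" ", "_")
    kernel = kernel_type.upper()
    if kernel_type == "expand":
        if add_input is None:
            raise ValueError("the expand template needs an add_input value")
        # Assumption: the boolean is written in upper case, as in the example.
        return f"{prefix}_{kernel}_{str(add_input).upper()}.json"
    return f"{prefix}_{kernel}.json"


# On a CUDA machine the name could come from torch.cuda.get_device_name();
# a literal is used here so the sketch runs without a GPU.
print(lora_config_file_name("NVIDIA H200", "shrink"))
```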
......