Unverified Commit a0f44bb6 authored by Harry Mellor, committed by GitHub

Allow `markdownlint` to run locally (#36398)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent fde4771b
...@@ -596,7 +596,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan ...@@ -596,7 +596,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan
#### Client → Server Events #### Client → Server Events
| Event | Description | | Event | Description |
|-------|-------------| | ----- | ----------- |
| `input_audio_buffer.append` | Send base64-encoded audio chunk: `{"type": "input_audio_buffer.append", "audio": "<base64>"}` | | `input_audio_buffer.append` | Send base64-encoded audio chunk: `{"type": "input_audio_buffer.append", "audio": "<base64>"}` |
| `input_audio_buffer.commit` | Trigger transcription processing or end: `{"type": "input_audio_buffer.commit", "final": bool}` | | `input_audio_buffer.commit` | Trigger transcription processing or end: `{"type": "input_audio_buffer.commit", "final": bool}` |
| `session.update` | Configure session: `{"type": "session.update", "model": "model-name"}` | | `session.update` | Configure session: `{"type": "session.update", "model": "model-name"}` |
...@@ -604,7 +604,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan ...@@ -604,7 +604,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan
#### Server → Client Events #### Server → Client Events
| Event | Description | | Event | Description |
|-------|-------------| | ----- | ----------- |
| `session.created` | Connection established with session ID and timestamp | | `session.created` | Connection established with session ID and timestamp |
| `transcription.delta` | Incremental transcription text: `{"type": "transcription.delta", "delta": "text"}` | | `transcription.delta` | Incremental transcription text: `{"type": "transcription.delta", "delta": "text"}` |
| `transcription.done` | Final transcription with usage stats | | `transcription.done` | Final transcription with usage stats |
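For reference, the event shapes listed in the two tables above can be sketched in a few lines of Python. This is only an illustration of the JSON payloads: the helper names (`append_event`, `commit_event`, `collect_deltas`) are hypothetical and not part of vLLM, and the transport (the WebSocket connection itself) is left out.

```python
import base64
import json


def append_event(pcm16_chunk: bytes) -> str:
    """Serialize one raw PCM16 chunk as an `input_audio_buffer.append` event."""
    audio_b64 = base64.b64encode(pcm16_chunk).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64})


def commit_event(final: bool) -> str:
    """Serialize an `input_audio_buffer.commit` event."""
    return json.dumps({"type": "input_audio_buffer.commit", "final": final})


def collect_deltas(raw_messages: list[str]) -> str:
    """Join the text of all `transcription.delta` events, skipping other types."""
    parts = []
    for raw in raw_messages:
        event = json.loads(raw)
        if event.get("type") == "transcription.delta":
            parts.append(event["delta"])
    return "".join(parts)
```

A client would send the strings returned by `append_event` and `commit_event` over its connection and feed the received messages to `collect_deltas`.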
......
...@@ -84,7 +84,7 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the ...@@ -84,7 +84,7 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
### Hardware ### Hardware
| Hardware | Status | | Hardware | Status |
|------------------|-----------------------------------------------| | --------------| --------------- |
| **NVIDIA** | <nobr>🟢</nobr> | | **NVIDIA** | <nobr>🟢</nobr> |
| **AMD** | <nobr>🟢</nobr> | | **AMD** | <nobr>🟢</nobr> |
| **INTEL GPU** | <nobr>🟢</nobr> | | **INTEL GPU** | <nobr>🟢</nobr> |
...@@ -105,7 +105,7 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the ...@@ -105,7 +105,7 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
### Models ### Models
| Model Type | Status | | Model Type | Status |
|-----------------------------|-------------------------------------------------------------------------| | -------------------------- | --------------------------------------- |
| **Decoder-only Models** | <nobr>🟢</nobr> | | **Decoder-only Models** | <nobr>🟢</nobr> |
| **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others) </nobr> | | **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others) </nobr> |
| **Pooling Models** | <nobr>🟢</nobr> | | **Pooling Models** | <nobr>🟢</nobr> |
...@@ -145,7 +145,7 @@ following a similar pattern by implementing support through the [plugin system]( ...@@ -145,7 +145,7 @@ following a similar pattern by implementing support through the [plugin system](
### Features ### Features
| Feature | Status | | Feature | Status |
|---------------------------------------------|-----------------------------------------------------------------------------------| | ------------------------------------------- | --------------------------------------------------------------------------------- |
| **Prefix Caching** | <nobr>🟢 Functional</nobr> | | **Prefix Caching** | <nobr>🟢 Functional</nobr> |
| **Chunked Prefill** | <nobr>🟢 Functional</nobr> | | **Chunked Prefill** | <nobr>🟢 Functional</nobr> |
| **LoRA** | <nobr>🟢 Functional</nobr> | | **LoRA** | <nobr>🟢 Functional</nobr> |
......
...@@ -34,7 +34,7 @@ deployment methods: ...@@ -34,7 +34,7 @@ deployment methods:
Both platforms provide equivalent monitoring capabilities: Both platforms provide equivalent monitoring capabilities:
| Dashboard | Description | | Dashboard | Description |
|-----------|-------------| | --------- | ----------- |
| **Performance Statistics** | Tracks latency, throughput, and performance metrics | | **Performance Statistics** | Tracks latency, throughput, and performance metrics |
| **Query Statistics** | Monitors request volume, query performance, and KPIs | | **Query Statistics** | Monitors request volume, query performance, and KPIs |
......
...@@ -95,7 +95,7 @@ If you enable prefill instance (`--prefill-servers-urls` not disabled), you will ...@@ -95,7 +95,7 @@ If you enable prefill instance (`--prefill-servers-urls` not disabled), you will
## Proxy Instance Flags (`disagg_epd_proxy.py`) ## Proxy Instance Flags (`disagg_epd_proxy.py`)
| Flag | Description | | Flag | Description |
|------|-------------| | ---- | ----------- |
| `--encode-servers-urls` | Comma-separated list of encoder endpoints. Every multimodal item extracted from the request is fanned out to one of these URLs in a round-robin fashion. | | `--encode-servers-urls` | Comma-separated list of encoder endpoints. Every multimodal item extracted from the request is fanned out to one of these URLs in a round-robin fashion. |
| `--prefill-servers-urls` | Comma-separated list of prefill endpoints. Set to `disable`, `none`, or `""` to skip the dedicated prefill phase and run E+PD (encoder + combined prefill/decode). | | `--prefill-servers-urls` | Comma-separated list of prefill endpoints. Set to `disable`, `none`, or `""` to skip the dedicated prefill phase and run E+PD (encoder + combined prefill/decode). |
| `--decode-servers-urls` | Comma-separated list of decode endpoints. Non-stream and stream paths both round-robin over this list. | | `--decode-servers-urls` | Comma-separated list of decode endpoints. Non-stream and stream paths both round-robin over this list. |
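The flag semantics described above (comma-separated lists, round-robin selection, and the `disable` / `none` / `""` values for the prefill list) could be handled roughly as follows. This is a sketch, not the code in `disagg_epd_proxy.py`: `parse_urls` and `round_robin` are hypothetical helpers, and stripping whitespace around each entry is an assumption.

```python
import itertools
from collections.abc import Iterator

# Values the flag description lists as "skip the dedicated prefill phase".
DISABLED_VALUES = {"disable", "none", ""}


def parse_urls(value: str) -> list[str]:
    """Split a comma-separated flag value; return [] when the phase is disabled."""
    if value.strip() in DISABLED_VALUES:
        return []
    # Assumption: surrounding whitespace and empty entries are ignored.
    return [url.strip() for url in value.split(",") if url.strip()]


def round_robin(urls: list[str]) -> Iterator[str]:
    """Yield endpoints in order, wrapping around indefinitely."""
    return itertools.cycle(urls)
```

With an empty prefill list, a proxy written this way would fall back to the E+PD path described in the table.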
......
...@@ -34,7 +34,7 @@ python client.py ...@@ -34,7 +34,7 @@ python client.py
## 📁 Files ## 📁 Files
| File | Description | | File | Description |
|------|-------------| | ---- | ----------- |
| `service.sh` | Server startup script with chunked processing enabled | | `service.sh` | Server startup script with chunked processing enabled |
| `client.py` | Comprehensive test client for long text embedding | | `client.py` | Comprehensive test client for long text embedding |
...@@ -61,7 +61,7 @@ The key parameters for chunked processing are in the `--pooler-config`: ...@@ -61,7 +61,7 @@ The key parameters for chunked processing are in the `--pooler-config`:
Chunked processing uses **MEAN aggregation** for cross-chunk combination when input exceeds the model's native maximum length: Chunked processing uses **MEAN aggregation** for cross-chunk combination when input exceeds the model's native maximum length:
| Component | Behavior | Description | | Component | Behavior | Description |
|-----------|----------|-------------| | --------- | -------- | ----------- |
| **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy | | **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy |
| **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts | | **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts |
| **Performance** | Optimal | All chunks processed for complete semantic coverage | | **Performance** | Optimal | All chunks processed for complete semantic coverage |
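The cross-chunk aggregation row above describes a mean weighted by chunk token counts. A minimal pure-Python sketch of that arithmetic, assuming each chunk has already been pooled to a fixed-size vector (the function name `weighted_mean_embedding` is illustrative, not a vLLM API):

```python
def weighted_mean_embedding(
    chunk_embeddings: list[list[float]],
    chunk_token_counts: list[int],
) -> list[float]:
    """Combine per-chunk embeddings into one vector, weighting by token count."""
    if len(chunk_embeddings) != len(chunk_token_counts):
        raise ValueError("need exactly one token count per chunk")
    total_tokens = sum(chunk_token_counts)
    if total_tokens <= 0:
        raise ValueError("total token count must be positive")

    dim = len(chunk_embeddings[0])
    combined = [0.0] * dim
    for embedding, count in zip(chunk_embeddings, chunk_token_counts):
        for i, value in enumerate(embedding):
            combined[i] += value * count
    return [value / total_tokens for value in combined]
```

For example, two chunks with 3 and 1 tokens contribute with weights 0.75 and 0.25 respectively.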
...@@ -69,7 +69,7 @@ Chunked processing uses **MEAN aggregation** for cross-chunk combination when in ...@@ -69,7 +69,7 @@ Chunked processing uses **MEAN aggregation** for cross-chunk combination when in
### Environment Variables ### Environment Variables
| Variable | Default | Description | | Variable | Default | Description |
|----------|---------|-------------| | -------- | ------- | ----------- |
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) | | `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port | | `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use | | `GPU_COUNT` | `1` | Number of GPUs to use |
...@@ -106,7 +106,7 @@ With `MAX_EMBED_LEN=3072000`, you can process: ...@@ -106,7 +106,7 @@ With `MAX_EMBED_LEN=3072000`, you can process:
### Chunked Processing Performance ### Chunked Processing Performance
| Aspect | Behavior | Performance | | Aspect | Behavior | Performance |
|--------|----------|-------------| | ------ | -------- | ----------- |
| **Chunk Processing** | All chunks processed with native pooling | Consistent with input length | | **Chunk Processing** | All chunks processed with native pooling | Consistent with input length |
| **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead | | **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead |
| **Memory Usage** | Proportional to number of chunks | Moderate, scalable | | **Memory Usage** | Proportional to number of chunks | Moderate, scalable |
......
...@@ -1153,11 +1153,11 @@ def _render_table( ...@@ -1153,11 +1153,11 @@ def _render_table(
) -> list[str]: ) -> list[str]:
"""Render a markdown table from column specs and backend data.""" """Render a markdown table from column specs and backend data."""
header = "| " + " | ".join(name for name, _ in columns) + " |" header = "| " + " | ".join(name for name, _ in columns) + " |"
sep = "|" + "|".join("-" * (len(name) + 2) for name, _ in columns) + "|" sep = "| " + " | ".join("-" * len(name) for name, _ in columns) + " |"
lines = [header, sep] lines = [header, sep]
for info in sorted(backends, key=_sort_key): for info in sorted(backends, key=_sort_key):
row = "| " + " | ".join(fmt(info) for _, fmt in columns) + " |" row = "| " + " | ".join(fmt(info) for _, fmt in columns) + " |"
lines.append(row) lines.append(row.replace("  ", " "))
return lines return lines
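The separator change in `_render_table` can be checked in isolation. The two functions below are a standalone sketch that copies only the old and new `sep` expressions from the hunk above; the function names and the plain list of column names are assumptions made for the example.

```python
def old_separator(column_names: list[str]) -> str:
    """Compact style used before this commit: dashes fill the padded cell."""
    return "|" + "|".join("-" * (len(name) + 2) for name in column_names) + "|"


def new_separator(column_names: list[str]) -> str:
    """Spaced style used after this commit: dashes match the header text width."""
    return "| " + " | ".join("-" * len(name) for name in column_names) + " |"


columns = ["Priority", "Backend"]
print(old_separator(columns))  # |----------|---------|
print(new_separator(columns))  # | -------- | ------- |
```

Both forms produce lines of equal length for a given header; only the placement of the spaces around the pipes differs.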
...@@ -1268,7 +1268,7 @@ def _priority_table(title: str, backends: list[str]) -> list[str]: ...@@ -1268,7 +1268,7 @@ def _priority_table(title: str, backends: list[str]) -> list[str]:
f"**{title}:**", f"**{title}:**",
"", "",
"| Priority | Backend |", "| Priority | Backend |",
"|----------|---------|", "| -------- | ------- |",
*[f"| {i} | `{b}` |" for i, b in enumerate(backends, 1)], *[f"| {i} | `{b}` |" for i, b in enumerate(backends, 1)],
"", "",
] ]
...@@ -1317,7 +1317,7 @@ def generate_legend() -> str: ...@@ -1317,7 +1317,7 @@ def generate_legend() -> str:
return """## Legend return """## Legend
| Column | Description | | Column | Description |
|--------|-------------| | ------ | ----------- |
| **Dtypes** | Supported model data types (fp16, bf16, fp32) | | **Dtypes** | Supported model data types (fp16, bf16, fp32) |
| **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) | | **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
| **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) | | **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
...@@ -1348,7 +1348,7 @@ def generate_mla_section( ...@@ -1348,7 +1348,7 @@ def generate_mla_section(
"configuration.", "configuration.",
"", "",
"| Backend | Description | Compute Cap. | Enable | Disable | Notes |", "| Backend | Description | Compute Cap. | Enable | Disable | Notes |",
"|---------|-------------|--------------|--------|---------|-------|", "| ------- | ----------- | ------------ | ------ | ------- | ----- |",
] ]
for backend in prefill_backends: for backend in prefill_backends:
...@@ -1360,7 +1360,7 @@ def generate_mla_section( ...@@ -1360,7 +1360,7 @@ def generate_mla_section(
backend["disable"], backend["disable"],
backend.get("notes", ""), backend.get("notes", ""),
) )
lines.append(row) lines.append(row.replace("  ", " "))
lines.extend( lines.extend(
[ [
......
...@@ -44,7 +44,7 @@ Multi-lora shrink/expand Triton kernel tuning follows a similar methodology from ...@@ -44,7 +44,7 @@ Multi-lora shrink/expand Triton kernel tuning follows a similar methodology from
### File Naming ### File Naming
| Kernel Type | File Name Template | Example | | Kernel Type | File Name Template | Example |
|---------------------------|--------------------------------------------|---------------------------------------------| | ------------------------- | ------------------------------------------- | -------------------------------------------- |
| shrink | `{gpu_name}_SHRINK.json` | `NVIDIA_H200_SHRINK.json` | | shrink | `{gpu_name}_SHRINK.json` | `NVIDIA_H200_SHRINK.json` |
| expand | `{gpu_name}_EXPAND_{add_input}.json` | `NVIDIA_H200_EXPAND_TRUE.json` | | expand | `{gpu_name}_EXPAND_{add_input}.json` | `NVIDIA_H200_EXPAND_TRUE.json` |
| fused_moe_lora_w13_shrink | `{gpu_name}_FUSED_MOE_LORA_W13_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_SHRINK.json` | | fused_moe_lora_w13_shrink | `{gpu_name}_FUSED_MOE_LORA_W13_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_SHRINK.json` |
......