Commit a0f44bb6 authored by Harry Mellor, committed by GitHub

Allow `markdownlint` to run locally (#36398)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent fde4771b
......@@ -596,7 +596,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan
#### Client → Server Events
| Event | Description |
-|-------|-------------|
+| ----- | ----------- |
| `input_audio_buffer.append` | Send base64-encoded audio chunk: `{"type": "input_audio_buffer.append", "audio": "<base64>"}` |
| `input_audio_buffer.commit` | Trigger transcription processing or end: `{"type": "input_audio_buffer.commit", "final": bool}` |
| `session.update` | Configure session: `{"type": "session.update", "model": "model-name"}` |
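For illustration, the client events in the table above could be serialized as follows. This is a minimal sketch: the helper names (`append_event`, `commit_event`, `session_update_event`) are hypothetical, and only the message shapes shown in the table come from the documentation. Opening the WebSocket connection and sending the strings is left out.

```python
import base64
import json


def append_event(pcm16_chunk: bytes) -> str:
    """Build an `input_audio_buffer.append` message for one raw PCM16 chunk."""
    # The audio payload is sent as base64 text, as the table describes.
    audio_b64 = base64.b64encode(pcm16_chunk).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64})


def commit_event(final: bool) -> str:
    """Build an `input_audio_buffer.commit` message."""
    return json.dumps({"type": "input_audio_buffer.commit", "final": final})


def session_update_event(model: str) -> str:
    """Build a `session.update` message selecting a model by name."""
    return json.dumps({"type": "session.update", "model": model})
```

Each helper returns a JSON string that a WebSocket client of your choice could send as a text frame.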
......@@ -604,7 +604,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan
#### Server → Client Events
| Event | Description |
-|-------|-------------|
+| ----- | ----------- |
| `session.created` | Connection established with session ID and timestamp |
| `transcription.delta` | Incremental transcription text: `{"type": "transcription.delta", "delta": "text"}` |
| `transcription.done` | Final transcription with usage stats |
......
......@@ -83,13 +83,13 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
### Hardware
-| Hardware | Status |
-|------------------|-----------------------------------------------|
-| **NVIDIA** | <nobr>🟢</nobr> |
-| **AMD** | <nobr>🟢</nobr> |
-| **INTEL GPU** | <nobr>🟢</nobr> |
-| **TPU** | <nobr>🟢</nobr> |
-| **CPU** | <nobr>🟢</nobr> |
+| Hardware | Status |
+| --------------| --------------- |
+| **NVIDIA** | <nobr>🟢</nobr> |
+| **AMD** | <nobr>🟢</nobr> |
+| **INTEL GPU** | <nobr>🟢</nobr> |
+| **TPU** | <nobr>🟢</nobr> |
+| **CPU** | <nobr>🟢</nobr> |
!!! note
......@@ -104,13 +104,13 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
### Models
-| Model Type | Status |
-|-----------------------------|-------------------------------------------------------------------------|
-| **Decoder-only Models** | <nobr>🟢</nobr> |
-| **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others) </nobr> |
-| **Pooling Models** | <nobr>🟢</nobr> |
-| **Mamba Models** | <nobr>🟢</nobr> |
-| **Multimodal Models** | <nobr>🟢</nobr> |
+| Model Type | Status |
+| -------------------------- | --------------------------------------- |
+| **Decoder-only Models** | <nobr>🟢</nobr> |
+| **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others) </nobr> |
+| **Pooling Models** | <nobr>🟢</nobr> |
+| **Mamba Models** | <nobr>🟢</nobr> |
+| **Multimodal Models** | <nobr>🟢</nobr> |
See below for the status of models that are not yet supported or have more features planned in V1.
......@@ -145,7 +145,7 @@ following a similar pattern by implementing support through the [plugin system](
### Features
| Feature | Status |
-|---------------------------------------------|-----------------------------------------------------------------------------------|
+| ------------------------------------------- | --------------------------------------------------------------------------------- |
| **Prefix Caching** | <nobr>🟢 Functional</nobr> |
| **Chunked Prefill** | <nobr>🟢 Functional</nobr> |
| **LoRA** | <nobr>🟢 Functional</nobr> |
......
......@@ -34,7 +34,7 @@ deployment methods:
Both platforms provide equivalent monitoring capabilities:
| Dashboard | Description |
-|-----------|-------------|
+| --------- | ----------- |
| **Performance Statistics** | Tracks latency, throughput, and performance metrics |
| **Query Statistics** | Monitors request volume, query performance, and KPIs |
......
......@@ -95,7 +95,7 @@ If you enable prefill instance (`--prefill-servers-urls` not disabled), you will
## Proxy Instance Flags (`disagg_epd_proxy.py`)
| Flag | Description |
-|------|-------------|
+| ---- | ----------- |
| `--encode-servers-urls` | Comma-separated list of encoder endpoints. Every multimodal item extracted from the request is fanned out to one of these URLs in a round-robin fashion. |
| `--prefill-servers-urls` | Comma-separated list of prefill endpoints. Set to `disable`, `none`, or `""` to skip the dedicated prefill phase and run E+PD (encoder + combined prefill/decode). |
| `--decode-servers-urls` | Comma-separated list of decode endpoints. Non-stream and stream paths both round-robin over this list. |
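The flag semantics above can be sketched in a few lines. This is a minimal illustration, not the proxy's actual code: `parse_urls` and `prefill_enabled` are hypothetical helper names, the example URLs are placeholders, and `itertools.cycle` is just one simple way to get round-robin selection.

```python
from itertools import cycle


def parse_urls(value: str) -> list[str]:
    """Split a comma-separated flag value into a list of non-empty URLs."""
    return [url.strip() for url in value.split(",") if url.strip()]


def prefill_enabled(value: str) -> bool:
    """Documented rule: `disable`, `none`, or an empty string skips prefill."""
    return value not in {"disable", "none", ""}


# Placeholder endpoints, for illustration only.
encode_urls = parse_urls("http://e1:8001, http://e2:8002")

# cycle() yields the endpoints in order and wraps around, i.e. round-robin.
next_encoder = cycle(encode_urls)
first_three = [next(next_encoder) for _ in range(3)]
```

With two endpoints, the third pick wraps back to the first one.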
......
......@@ -34,7 +34,7 @@ python client.py
## 📁 Files
| File | Description |
-|------|-------------|
+| ---- | ----------- |
| `service.sh` | Server startup script with chunked processing enabled |
| `client.py` | Comprehensive test client for long text embedding |
......@@ -61,7 +61,7 @@ The key parameters for chunked processing are in the `--pooler-config`:
Chunked processing uses **MEAN aggregation** for cross-chunk combination when input exceeds the model's native maximum length:
| Component | Behavior | Description |
-|-----------|----------|-------------|
+| --------- | -------- | ----------- |
| **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy |
| **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts |
| **Performance** | Optimal | All chunks processed for complete semantic coverage |
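As a rough illustration of the cross-chunk step, a token-count-weighted mean could look like the sketch below. The function name `aggregate_chunks` and the plain-list representation are assumptions made for this example; the server's actual implementation is not shown in this diff.

```python
def aggregate_chunks(
    chunk_embeddings: list[list[float]], token_counts: list[int]
) -> list[float]:
    """Combine per-chunk embeddings into one vector, weighted by token count."""
    if not chunk_embeddings or len(chunk_embeddings) != len(token_counts):
        raise ValueError("need one token count per chunk embedding")
    total_tokens = sum(token_counts)
    if total_tokens <= 0:
        raise ValueError("total token count must be positive")
    dim = len(chunk_embeddings[0])
    # Each output component is sum(weight * value) / sum(weights).
    return [
        sum(emb[i] * n for emb, n in zip(chunk_embeddings, token_counts))
        / total_tokens
        for i in range(dim)
    ]
```

A chunk with three times as many tokens contributes three times as much to the result, so short trailing chunks do not dominate the final embedding.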
......@@ -69,7 +69,7 @@ Chunked processing uses **MEAN aggregation** for cross-chunk combination when in
### Environment Variables
| Variable | Default | Description |
-|----------|---------|-------------|
+| -------- | ------- | ----------- |
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
......@@ -106,7 +106,7 @@ With `MAX_EMBED_LEN=3072000`, you can process:
### Chunked Processing Performance
| Aspect | Behavior | Performance |
-|--------|----------|-------------|
+| ------ | -------- | ----------- |
| **Chunk Processing** | All chunks processed with native pooling | Consistent with input length |
| **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead |
| **Memory Usage** | Proportional to number of chunks | Moderate, scalable |
......
......@@ -1153,11 +1153,11 @@ def _render_table(
 ) -> list[str]:
     """Render a markdown table from column specs and backend data."""
     header = "| " + " | ".join(name for name, _ in columns) + " |"
-    sep = "|" + "|".join("-" * (len(name) + 2) for name, _ in columns) + "|"
+    sep = "| " + " | ".join("-" * len(name) for name, _ in columns) + " |"
     lines = [header, sep]
     for info in sorted(backends, key=_sort_key):
         row = "| " + " | ".join(fmt(info) for _, fmt in columns) + " |"
-        lines.append(row)
+        lines.append(row.replace("  ", " "))
     return lines
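The new separator and row handling can be tried in isolation. The sketch below is a simplified stand-in, not the function from the diff: `render_table` is a hypothetical name, it takes plain strings instead of column specs and backend data, and it assumes the intent of the `replace` call is to collapse the double space that an empty cell leaves between two pipes. The backend name used in the example is only sample data.

```python
def render_table(headers: list[str], rows: list[list[str]]) -> list[str]:
    """Render a markdown table whose separator cells are padded with spaces."""
    header = "| " + " | ".join(headers) + " |"
    # One dash per header character, with a space on each side of every pipe.
    sep = "| " + " | ".join("-" * len(h) for h in headers) + " |"
    lines = [header, sep]
    for row in rows:
        line = "| " + " | ".join(row) + " |"
        # An empty cell would render as "|  |"; collapse it to "| |".
        lines.append(line.replace("  ", " "))
    return lines


table = render_table(["Backend", "Notes"], [["FLASH_ATTN", ""]])
```

Note that collapsing double spaces also affects any cell whose text itself contains two consecutive spaces.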
......@@ -1268,7 +1268,7 @@ def _priority_table(title: str, backends: list[str]) -> list[str]:
f"**{title}:**",
"",
"| Priority | Backend |",
-"|----------|---------|",
+"| -------- | ------- |",
*[f"| {i} | `{b}` |" for i, b in enumerate(backends, 1)],
"",
]
......@@ -1317,7 +1317,7 @@ def generate_legend() -> str:
return """## Legend
| Column | Description |
-|--------|-------------|
+| ------ | ----------- |
| **Dtypes** | Supported model data types (fp16, bf16, fp32) |
| **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
| **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
......@@ -1348,7 +1348,7 @@ def generate_mla_section(
"configuration.",
"",
"| Backend | Description | Compute Cap. | Enable | Disable | Notes |",
-"|---------|-------------|--------------|--------|---------|-------|",
+"| ------- | ----------- | ------------ | ------ | ------- | ----- |",
]
for backend in prefill_backends:
......@@ -1360,7 +1360,7 @@ def generate_mla_section(
backend["disable"],
backend.get("notes", ""),
)
-lines.append(row)
+lines.append(row.replace("  ", " "))
lines.extend(
[
......
......@@ -43,14 +43,14 @@ Multi-lora shrink/expand Triton kernel tuning follows a similar methodology from
### File Naming
-| Kernel Type | File Name Template | Example |
-|---------------------------|--------------------------------------------|---------------------------------------------|
-| shrink | `{gpu_name}_SHRINK.json` | `NVIDIA_H200_SHRINK.json` |
-| expand | `{gpu_name}_EXPAND_{add_input}.json` | `NVIDIA_H200_EXPAND_TRUE.json` |
+| Kernel Type | File Name Template | Example |
+| ------------------------- | ------------------------------------------- | -------------------------------------------- |
+| shrink | `{gpu_name}_SHRINK.json` | `NVIDIA_H200_SHRINK.json` |
+| expand | `{gpu_name}_EXPAND_{add_input}.json` | `NVIDIA_H200_EXPAND_TRUE.json` |
| fused_moe_lora_w13_shrink | `{gpu_name}_FUSED_MOE_LORA_W13_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_SHRINK.json` |
| fused_moe_lora_w13_expand | `{gpu_name}_FUSED_MOE_LORA_W13_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_EXPAND.json` |
-| fused_moe_lora_w2_shrink | `{gpu_name}_FUSED_MOE_LORA_W2_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_SHRINK.json` |
-| fused_moe_lora_w2_expand | `{gpu_name}_FUSED_MOE_LORA_W2_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_EXPAND.json` |
+| fused_moe_lora_w2_shrink | `{gpu_name}_FUSED_MOE_LORA_W2_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_SHRINK.json` |
+| fused_moe_lora_w2_expand | `{gpu_name}_FUSED_MOE_LORA_W2_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_EXPAND.json` |
The `gpu_name` can be automatically detected by calling `torch.cuda.get_device_name()`.
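The naming templates in the table can be expressed as a small helper. This is a sketch under assumptions: `lora_config_file_name` is a hypothetical function, and replacing spaces in the device name with underscores is inferred from the `NVIDIA_H200` examples rather than confirmed by the source. The device name is passed in as a string so the snippet runs without a GPU.

```python
def lora_config_file_name(
    gpu_name: str, kernel_type: str, add_input: bool = False
) -> str:
    """Build a tuning config file name following the templates in the table."""
    # Assumption: spaces in the device name become underscores.
    prefix = gpu_name.replace(" ", "_")
    if kernel_type == "expand":
        # Only the plain expand kernel encodes add_input in its file name.
        return f"{prefix}_EXPAND_{str(add_input).upper()}.json"
    return f"{prefix}_{kernel_type.upper()}.json"


shrink_name = lora_config_file_name("NVIDIA H200", "shrink")
expand_name = lora_config_file_name("NVIDIA H200", "expand", add_input=True)
```

On a machine with a GPU, the first argument could come from `torch.cuda.get_device_name()`.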
......