
# TensorRT-LLM Engine Configurations

This directory contains TensorRT-LLM engine configuration files for various model deployments.


## Usage

These YAML configuration files can be passed to TensorRT-LLM workers using the `--extra-engine-args` parameter:

```bash
python3 -m dynamo.trtllm \
    --extra-engine-args "${ENGINE_ARGS}" \
    ...
```

Here, `ENGINE_ARGS` is the path to one of the configuration files in this directory.
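A mistyped path is easy to miss until the worker starts, so it can help to resolve and check the file first. The sketch below is one way to do that; `resolve_engine_args` is a hypothetical helper (not part of Dynamo or TensorRT-LLM), and `your_config.yaml` is a placeholder file name.

```bash
#!/usr/bin/env bash

# Hypothetical helper: check that a config file exists and print its absolute path.
resolve_engine_args() {
    local config="$1"
    if [ ! -f "$config" ]; then
        echo "engine config not found: $config" >&2
        return 1
    fi
    # Resolve to an absolute path so the worker can be launched from any directory.
    echo "$(cd "$(dirname "$config")" && pwd)/$(basename "$config")"
}

# Usage (placeholder path, substitute a real file from this directory):
# ENGINE_ARGS="$(resolve_engine_args agg/simple/your_config.yaml)" || exit 1
# export ENGINE_ARGS
```

With `ENGINE_ARGS` exported this way, the launch command above can be used unchanged.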

## Configuration Types

### Aggregated (agg/)
Single-node configurations that combine prefill and decode operations:
- **simple/**: Basic aggregated setup
- **mtp/**: Multi-token prediction configurations
- **wide_ep/**: Wide expert parallelism (WideEP) configurations

### Disaggregated (disagg/)
Separate configurations for prefill and decode workers:
- **simple/**: Basic prefill/decode split
- **mtp/**: Multi-token prediction with separate prefill/decode
- **wide_ep/**: Wide expert parallelism (WideEP) with an expert load balancer

## Key Configuration Parameters

- **Parallelism**: `tensor_parallel_size`, `moe_expert_parallel_size`, `pipeline_parallel_size`
- **Memory**: `kv_cache_config.free_gpu_memory_fraction`, `kv_cache_config.dtype`
- **Batching**: `max_batch_size`, `max_num_tokens`, `max_seq_len`
- **Scheduling**: `disable_overlap_scheduler`, `cuda_graph_config`
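To show how these parameters nest in a config file, the sketch below writes an illustrative YAML file to a temporary location. Every value is a placeholder chosen for illustration, not a tuned recommendation, and the `max_batch_size` sub-field under `cuda_graph_config` is an assumption about that section's layout. `EXAMPLE_CONFIG` is a hypothetical variable name.

```bash
#!/usr/bin/env bash

# Write an illustrative engine config to a temporary file.
# All values below are placeholders, not recommendations.
EXAMPLE_CONFIG="$(mktemp)"
cat > "$EXAMPLE_CONFIG" <<'EOF'
# Parallelism
tensor_parallel_size: 4
moe_expert_parallel_size: 4
pipeline_parallel_size: 1

# Batching
max_batch_size: 16
max_num_tokens: 8192
max_seq_len: 8192

# Memory (nested under kv_cache_config)
kv_cache_config:
  free_gpu_memory_fraction: 0.85
  dtype: fp8

# Scheduling
disable_overlap_scheduler: false
cuda_graph_config:
  max_batch_size: 16   # assumed sub-field, check the configs in this directory
EOF

echo "wrote example config to $EXAMPLE_CONFIG"
```

Note that the two memory settings are children of `kv_cache_config`, while the parallelism and batching settings are top-level keys.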

## Notes

- For disaggregated setups, ensure that `kv_cache_config.dtype` is identical in the prefill and decode configs
- WideEP configurations require an expert load balancer config (`eplb.yaml`)
- Adjust `kv_cache_config.free_gpu_memory_fraction` based on your workload and attention data parallelism (DP) settings
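The first note can be checked mechanically before launching a disaggregated deployment. The following is a minimal sketch, assuming block-style YAML in which `kv_cache_config` is a top-level key with indented children. `kv_cache_dtype` and `check_kv_dtype_match` are hypothetical helpers, and the `awk` parsing is deliberately simple rather than a full YAML parser.

```bash
#!/usr/bin/env bash

# Print the dtype value nested under a top-level kv_cache_config key.
# Prints nothing if the key is absent.
kv_cache_dtype() {
    awk '
        /^kv_cache_config:/ { in_block = 1; next }
        # Any new top-level key ends the kv_cache_config block.
        in_block && /^[^ \t#]/ { in_block = 0 }
        in_block && /^[ \t]+dtype:/ {
            sub(/^[ \t]+dtype:[ \t]*/, "")   # strip the key
            sub(/[ \t]*(#.*)?$/, "")         # strip trailing spaces and comments
            print
            exit
        }
    ' "$1"
}

# Return non-zero if the prefill and decode configs disagree on the dtype.
# Two configs that both omit dtype are treated as matching.
check_kv_dtype_match() {
    local prefill_dtype decode_dtype
    prefill_dtype="$(kv_cache_dtype "$1")"
    decode_dtype="$(kv_cache_dtype "$2")"
    if [ "$prefill_dtype" != "$decode_dtype" ]; then
        echo "kv_cache_config.dtype mismatch: prefill='$prefill_dtype' decode='$decode_dtype'" >&2
        return 1
    fi
}

# Usage (placeholder paths):
# check_kv_dtype_match disagg/simple/your_prefill.yaml disagg/simple/your_decode.yaml || exit 1
```

The comparison is textual, so a quoted value such as `"fp8"` in one file and an unquoted `fp8` in the other would be reported as a mismatch.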