add tgi2.4.0

Commit 81a882ad authored Nov 21, 2024 by jixx
20 changed files
--- a/docs/source/basic_tutorials/launcher.md
+++ b/docs/source/basic_tutorials/launcher.md
@@ -55,7 +55,9 @@ Options:
 ## QUANTIZE
 ```shell
      --quantize <QUANTIZE>
-          Whether you want the model to be quantized
+          Quantization method to use for the model. It is not necessary to specify this option for pre-quantized models, since the quantization method is read from the model configuration.
+          
+          Marlin kernels will be used automatically for GPTQ/AWQ models.
          
          [env: QUANTIZE=]

@@ -87,6 +89,15 @@ Options:
          [env: DTYPE=]
          [possible values: float16, bfloat16]

+```
+## KV_CACHE_DTYPE
+```shell
+      --kv-cache-dtype <KV_CACHE_DTYPE>
+          Specify the dtype for the key-value cache. When this option is not provided, the dtype of the model is used (typically `float16` or `bfloat16`). Currently the only supported values are `fp8_e4m3fn` and `fp8_e5m2`, on CUDA.
+          
+          [env: KV_CACHE_DTYPE=]
+          [possible values: fp8_e4m3fn, fp8_e5m2]
+
 ```
 ## TRUST_REMOTE_CODE
 ```shell
@@ -349,6 +360,12 @@ Options:
      --cors-allow-origin <CORS_ALLOW_ORIGIN>
          [env: CORS_ALLOW_ORIGIN=]

+```
+## API_KEY
+```shell
+      --api-key <API_KEY>
+          [env: API_KEY=]
+
 ```
 ## WATERMARK_GAMMA
 ```shell
@@ -424,6 +441,20 @@ Options:
          
          [env: LORA_ADAPTERS=]

+```
+## USAGE_STATS
+```shell
+      --usage-stats <USAGE_STATS>
+          Control whether anonymous usage stats are collected. Options are "on", "off" and "no-stack". Default is "on"
+          
+          [env: USAGE_STATS=]
+          [default: on]
+
+          Possible values:
+          - on:       Default option, usage statistics are collected anonymously
+          - off:      Disables all collection of usage statistics
+          - no-stack: Doesn't send the error stack trace or error type, but allows sending a crash event
+
 ```
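The flags added in this file can be combined on one launcher command line. The following is a minimal sketch, not part of the commit: `build_launcher_args` is a hypothetical helper, the model id is a placeholder, and `--model-id` is assumed from the launcher's existing options rather than shown in this diff. The accepted values are copied from the help text above.

```python
import shlex

# Accepted values as listed in the help text of this diff.
KV_CACHE_DTYPES = ("fp8_e4m3fn", "fp8_e5m2")
USAGE_STATS_MODES = ("on", "off", "no-stack")


def build_launcher_args(model_id, kv_cache_dtype=None, usage_stats=None):
    """Build an argv list for text-generation-launcher (hypothetical helper)."""
    args = ["text-generation-launcher", "--model-id", model_id]
    if kv_cache_dtype is not None:
        if kv_cache_dtype not in KV_CACHE_DTYPES:
            raise ValueError(f"unsupported KV cache dtype: {kv_cache_dtype}")
        args += ["--kv-cache-dtype", kv_cache_dtype]
    if usage_stats is not None:
        if usage_stats not in USAGE_STATS_MODES:
            raise ValueError(f"unsupported usage stats mode: {usage_stats}")
        args += ["--usage-stats", usage_stats]
    return args


if __name__ == "__main__":
    # Placeholder model id; substitute a real one before running the command.
    argv = build_launcher_args(
        "my-org/my-model", kv_cache_dtype="fp8_e5m2", usage_stats="no-stack"
    )
    print(shlex.join(argv))
```

Leaving `kv_cache_dtype` as `None` omits the flag, which matches the documented fallback to the model's own dtype.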
 ## HELP
 ```shell

--- a/docs/source/reference/metrics.md
+++ b/docs/source/reference/metrics.md
+# Metrics
+
+TGI exposes multiple metrics that can be collected via the `/metrics` Prometheus endpoint.
+These metrics can be used to monitor the performance of TGI, to autoscale deployments, and to help identify bottlenecks.
+
+The following metrics are exposed:
+
+| Metric Name                                | Description                                                                              | Type      | Unit    |
+|--------------------------------------------|------------------------------------------------------------------------------------------|-----------|---------|
+| `tgi_batch_current_max_tokens`             | Maximum tokens for the current batch                                                     | Gauge     | Count   |
+| `tgi_batch_current_size`                   | Current batch size                                                                       | Gauge     | Count   |
+| `tgi_batch_decode_duration`                | Time spent decoding a batch per method (prefill or decode)                               | Histogram | Seconds |
+| `tgi_batch_filter_duration`                | Time spent filtering batches and sending generated tokens per method (prefill or decode) | Histogram | Seconds |
+| `tgi_batch_forward_duration`               | Batch forward duration per method (prefill or decode)                                    | Histogram | Seconds |
+| `tgi_batch_inference_count`                | Inference calls per method (prefill or decode)                                           | Counter   | Count   |
+| `tgi_batch_inference_duration`             | Batch inference duration                                                                 | Histogram | Seconds |
+| `tgi_batch_inference_success`              | Number of successful inference calls per method (prefill or decode)                      | Counter   | Count   |
+| `tgi_batch_next_size`                      | Batch size of the next batch                                                             | Histogram | Count   |
+| `tgi_queue_size`                           | Current queue size                                                                       | Gauge     | Count   |
+| `tgi_request_count`                        | Total number of requests                                                                 | Counter   | Count   |
+| `tgi_request_duration`                     | Total time spent processing the request (e2e latency)                                    | Histogram | Seconds |
+| `tgi_request_generated_tokens`             | Generated tokens per request                                                             | Histogram | Count   |
+| `tgi_request_inference_duration`           | Request inference duration                                                               | Histogram | Seconds |
+| `tgi_request_input_length`                 | Input token length per request                                                           | Histogram | Count   |
+| `tgi_request_max_new_tokens`               | Maximum new tokens per request                                                           | Histogram | Count   |
+| `tgi_request_mean_time_per_token_duration` | Mean time per token per request (inter-token latency)                                    | Histogram | Seconds |
+| `tgi_request_queue_duration`               | Time spent in the queue per request                                                      | Histogram | Seconds |
+| `tgi_request_skipped_tokens`               | Speculated tokens per request                                                            | Histogram | Count   |
+| `tgi_request_success`                      | Number of successful requests                                                            | Counter   | Count   |
+| `tgi_request_validation_duration`          | Time spent validating the request                                                        | Histogram | Seconds |
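As a rough illustration of consuming the table above, the sketch below parses Prometheus text-format output such as the `/metrics` endpoint would return. It is not part of the commit. The sample text and its values are invented for illustration, the `method` label name is an assumption based on the "per method" wording, and the `_sum`/`_count` suffixes follow the general Prometheus histogram convention rather than anything stated in this diff. The parser also assumes samples carry no trailing timestamp.

```python
# Invented sample in Prometheus text exposition format (not real TGI output).
SAMPLE = """\
# HELP tgi_queue_size Current queue size
# TYPE tgi_queue_size gauge
tgi_queue_size 3
# TYPE tgi_batch_inference_count counter
tgi_batch_inference_count{method="prefill"} 40
tgi_batch_inference_count{method="decode"} 900
# TYPE tgi_request_duration histogram
tgi_request_duration_sum 64.0
tgi_request_duration_count 128
"""


def parse_metrics(text):
    """Map each sample (name plus labels) to its float value.

    Assumes no trailing timestamps, so the value is the last field.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def histogram_mean(samples, name):
    """Mean observation of a histogram, or None if nothing was observed."""
    count = samples.get(f"{name}_count", 0.0)
    if count == 0:
        return None
    return samples[f"{name}_sum"] / count


if __name__ == "__main__":
    samples = parse_metrics(SAMPLE)
    print("queue size:", samples["tgi_queue_size"])
    print("mean request duration (s):", histogram_mean(samples, "tgi_request_duration"))
```

For production monitoring a Prometheus server or an official client library is the more robust choice; the hand-rolled parser is only meant to show the shape of the data.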
--- a/docs/source/supported_models.md
+++ b/docs/source/supported_models.md

-# Supported Models and Hardware
+# Supported Models

-Text Generation Inference enables serving optimized models on specific hardware for the highest performance. The following sections list which models are hardware are supported.
-
-## Supported Models
+Text Generation Inference enables serving optimized models. The following sections list which models (VLMs & LLMs) are supported.

+- [Deepseek V2](https://huggingface.co/deepseek-ai/DeepSeek-V2)
 - [Idefics 2](https://huggingface.co/HuggingFaceM4/idefics2-8b) (Multimodal)
 - [Llava Next (1.6)](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) (Multimodal)
- [Llama](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
+- [Llama](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f)
 - [Phi 3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
+- [Granite](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct)
 - [Gemma](https://huggingface.co/google/gemma-7b)
- [Gemma2](https://huggingface.co/google/gemma2-9b)
+- [PaliGemma](https://huggingface.co/google/paligemma-3b-pt-224)
+- [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)
 - [Cohere](https://huggingface.co/CohereForAI/c4ai-command-r-plus)
 - [Dbrx](https://huggingface.co/databricks/dbrx-instruct)
 - [Mamba](https://huggingface.co/state-spaces/mamba-2.8b-slimpj)
- [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
+- [Mistral](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
 - [Mixtral](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)
 - [Gpt Bigcode](https://huggingface.co/bigcode/gpt_bigcode-santacoder)
 - [Phi](https://huggingface.co/microsoft/phi-1_5)
+- [PhiMoe](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct)
 - [Baichuan](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)
 - [Falcon](https://huggingface.co/tiiuae/falcon-7b-instruct)
 - [StarCoder 2](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1)
@@ -30,7 +32,10 @@ Text Generation Inference enables serving optimized models on specific hardware
 - [Mpt](https://huggingface.co/mosaicml/mpt-7b-instruct)
 - [Gpt2](https://huggingface.co/openai-community/gpt2)
 - [Gpt Neox](https://huggingface.co/EleutherAI/gpt-neox-20b)
+- [Gptj](https://huggingface.co/EleutherAI/gpt-j-6b)
 - [Idefics](https://huggingface.co/HuggingFaceM4/idefics-9b) (Multimodal)
+- [Mllama](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) (Multimodal)
+


 If the above list lacks the model you would like to serve, depending on the model's pipeline type, you can try to initialize and serve the model anyways to see how well it performs, but performance isn't guaranteed for non-optimized models:

--- a/docs/source/usage_statistics.md
+++ b/docs/source/usage_statistics.md
+
+# Collection of Usage Statistics
+
+Text Generation Inference collects anonymous usage statistics to help us improve the service. The collected data is used to improve TGI and to understand what causes failures. The data is collected transparently and any sensitive information is omitted.
+
+Data is sent twice: once on server startup and once when the server stops. Usage statistics are only enabled when TGI is running in Docker, to avoid collecting data when TGI runs directly on the host machine.
+
+## What data is collected
+
+The code that collects the data is available [here](https://github.com/huggingface/text-generation-inference/blob/main/router/src/usage_stats.rs).
+As of release 2.1.2, this is an example of the data collected:
+
+- From the TGI configuration:
+```json
+{
+  "event_type": "start",
+  "disable_grammar_support": false,
+  "max_batch_prefill_tokens": 4096,
+  "max_batch_size": null,
+  "max_batch_total_tokens": null,
+  "max_best_of": 2,
+  "max_client_batch_size": 4,
+  "max_concurrent_requests": 128,
+  "max_input_tokens": 1024,
+  "max_stop_sequences": 4,
+  "max_top_n_tokens": 5,
+  "max_total_tokens": 2048,
+  "max_waiting_tokens": 20,
+  "model_config": {
+    "model_type": "Bloom"
+  },
+  "revision": null,
+  "tokenizer_class": "BloomTokenizerFast",
+  "validation_workers": 2,
+  "waiting_served_ratio": 1.2,
+  "docker_label": "latest",
+  "git_sha": "cfc118704880453d29bcbe4fbbd91dda501cf5fe",
+  "nvidia_env": {
+    "name": "NVIDIA A10G",
+    "pci_bus_id": "00000000:00:1E.0",
+    "driver_version": "535.183.01",
+    "pstate": "P8",
+    "pcie_link_gen_max": "4",
+    "pcie_link_gen_current": "1",
+    "temperature_gpu": "31",
+    "utilization_gpu": "0 %",
+    "utilization_memory": "0 %",
+    "memory_total": "23028 MiB",
+    "memory_free": "22515 MiB",
+    "memory_used": "0 MiB",
+    "reset_status_reset_required": "No",
+    "reset_status_drain_and_reset_recommended": "No",
+    "compute_cap": "8.6",
+    "ecc_errors_corrected_volatile_total": "0",
+    "mig_mode_current": "[N/A]",
+    "power_draw_instant": "10.86 W",
+    "power_limit": "300.00 W"
+  },
+  "system_env": {
+    "cpu_count": 16,
+    "cpu_type": "AMD EPYC 7R32",
+    "total_memory": 66681196544,
+    "architecture": "x86_64",
+    "platform": "linux-unix-x86_64"
+  }
+}
+
+```
+
+## How to opt-out
+
+By passing `--usage-stats` to the text-generation-launcher, you can control how much usage data is collected:
+
+- `--usage-stats=no-stack` omits error stack traces and error types, but still sends start and stop events.
+- `--usage-stats=off` disables collection entirely.
--- a/flake.lock
+++ b/flake.lock
+{
+  "nodes": {
+    "cachix": {
+      "inputs": {
+        "devenv": [
+          "crate2nix"
+        ],
+        "flake-compat": [
+          "crate2nix"
+        ],
+        "nixpkgs": "nixpkgs",
+        "pre-commit-hooks": [
+          "crate2nix"
+        ]
+      },
+      "locked": {
+        "lastModified": 1709700175,
+        "narHash": "sha256-A0/6ZjLmT9qdYzKHmevnEIC7G+GiZ4UCr8v0poRPzds=",
+        "owner": "cachix",
+        "repo": "cachix",
+        "rev": "be97b37989f11b724197b5f4c7ffd78f12c8c4bf",
+        "type": "github"
+      },
+      "original": {
+        "owner": "cachix",
+        "ref": "latest",
+        "repo": "cachix",
+        "type": "github"
+      }
+    },
+    "cachix_2": {
+      "inputs": {
+        "devenv": [
+          "crate2nix",
+          "crate2nix_stable"
+        ],
+        "flake-compat": [
+          "crate2nix",
+          "crate2nix_stable"
+        ],
+        "nixpkgs": "nixpkgs_2",
+        "pre-commit-hooks": [
+          "crate2nix",
+          "crate2nix_stable"
+        ]
+      },
+      "locked": {
+        "lastModified": 1716549461,
+        "narHash": "sha256-lHy5kgx6J8uD+16SO47dPrbob98sh+W1tf4ceSqPVK4=",
+        "owner": "cachix",
+        "repo": "cachix",
+        "rev": "e2bb269fb8c0828d5d4d2d7b8d09ea85abcacbd4",
+        "type": "github"
+      },
+      "original": {
+        "owner": "cachix",
+        "ref": "latest",
+        "repo": "cachix",
+        "type": "github"
+      }
+    },
+    "cachix_3": {
+      "inputs": {
+        "devenv": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable"
+        ],
+        "flake-compat": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable"
+        ],
+        "nixpkgs": "nixpkgs_3",
+        "pre-commit-hooks": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable"
+        ]
+      },
+      "locked": {
+        "lastModified": 1716549461,
+        "narHash": "sha256-lHy5kgx6J8uD+16SO47dPrbob98sh+W1tf4ceSqPVK4=",
+        "owner": "cachix",
+        "repo": "cachix",
+        "rev": "e2bb269fb8c0828d5d4d2d7b8d09ea85abcacbd4",
+        "type": "github"
+      },
+      "original": {
+        "owner": "cachix",
+        "ref": "latest",
+        "repo": "cachix",
+        "type": "github"
+      }
+    },
+    "crate2nix": {
+      "inputs": {
+        "cachix": "cachix",
+        "crate2nix_stable": "crate2nix_stable",
+        "devshell": "devshell_3",
+        "flake-compat": "flake-compat_3",
+        "flake-parts": "flake-parts_3",
+        "nix-test-runner": "nix-test-runner_3",
+        "nixpkgs": [
+          "tgi-nix",
+          "nixpkgs"
+        ],
+        "pre-commit-hooks": "pre-commit-hooks_3"
+      },
+      "locked": {
+        "lastModified": 1723311214,
+        "narHash": "sha256-xdGZQBEa1AC2us/sY3igS/CucWY6jErXsAvCFRhB2LI=",
+        "owner": "nix-community",
+        "repo": "crate2nix",
+        "rev": "236f6addfd452a48be805819e3216af79e988fd5",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-community",
+        "repo": "crate2nix",
+        "type": "github"
+      }
+    },
+    "crate2nix_stable": {
+      "inputs": {
+        "cachix": "cachix_2",
+        "crate2nix_stable": "crate2nix_stable_2",
+        "devshell": "devshell_2",
+        "flake-compat": "flake-compat_2",
+        "flake-parts": "flake-parts_2",
+        "nix-test-runner": "nix-test-runner_2",
+        "nixpkgs": "nixpkgs_5",
+        "pre-commit-hooks": "pre-commit-hooks_2"
+      },
+      "locked": {
+        "lastModified": 1719760004,
+        "narHash": "sha256-esWhRnt7FhiYq0CcIxw9pvH+ybOQmWBfHYMtleaMhBE=",
+        "owner": "nix-community",
+        "repo": "crate2nix",
+        "rev": "1dee214bb20855fa3e1e7bb98d28922ddaff8c57",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-community",
+        "ref": "0.14.1",
+        "repo": "crate2nix",
+        "type": "github"
+      }
+    },
+    "crate2nix_stable_2": {
+      "inputs": {
+        "cachix": "cachix_3",
+        "crate2nix_stable": "crate2nix_stable_3",
+        "devshell": "devshell",
+        "flake-compat": "flake-compat",
+        "flake-parts": "flake-parts",
+        "nix-test-runner": "nix-test-runner",
+        "nixpkgs": "nixpkgs_4",
+        "pre-commit-hooks": "pre-commit-hooks"
+      },
+      "locked": {
+        "lastModified": 1712821484,
+        "narHash": "sha256-rGT3CW64cJS9nlnWPFWSc1iEa3dNZecVVuPVGzcsHe8=",
+        "owner": "nix-community",
+        "repo": "crate2nix",
+        "rev": "42883afcad3823fa5811e967fb7bff54bc3c9d6d",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-community",
+        "ref": "0.14.0",
+        "repo": "crate2nix",
+        "type": "github"
+      }
+    },
+    "crate2nix_stable_3": {
+      "inputs": {
+        "flake-utils": "flake-utils"
+      },
+      "locked": {
+        "lastModified": 1702842982,
+        "narHash": "sha256-A9AowkHIjsy1a4LuiPiVP88FMxyCWK41flZEZOUuwQM=",
+        "owner": "nix-community",
+        "repo": "crate2nix",
+        "rev": "75ac2973affa6b9b4f661a7b592cba6e4f51d426",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-community",
+        "ref": "0.12.0",
+        "repo": "crate2nix",
+        "type": "github"
+      }
+    },
+    "devshell": {
+      "inputs": {
+        "flake-utils": "flake-utils_2",
+        "nixpkgs": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1717408969,
+        "narHash": "sha256-Q0OEFqe35fZbbRPPRdrjTUUChKVhhWXz3T9ZSKmaoVY=",
+        "owner": "numtide",
+        "repo": "devshell",
+        "rev": "1ebbe68d57457c8cae98145410b164b5477761f4",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "devshell",
+        "type": "github"
+      }
+    },
+    "devshell_2": {
+      "inputs": {
+        "flake-utils": "flake-utils_3",
+        "nixpkgs": [
+          "crate2nix",
+          "crate2nix_stable",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1717408969,
+        "narHash": "sha256-Q0OEFqe35fZbbRPPRdrjTUUChKVhhWXz3T9ZSKmaoVY=",
+        "owner": "numtide",
+        "repo": "devshell",
+        "rev": "1ebbe68d57457c8cae98145410b164b5477761f4",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "devshell",
+        "type": "github"
+      }
+    },
+    "devshell_3": {
+      "inputs": {
+        "flake-utils": "flake-utils_4",
+        "nixpkgs": [
+          "crate2nix",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1711099426,
+        "narHash": "sha256-HzpgM/wc3aqpnHJJ2oDqPBkNsqWbW0WfWUO8lKu8nGk=",
+        "owner": "numtide",
+        "repo": "devshell",
+        "rev": "2d45b54ca4a183f2fdcf4b19c895b64fbf620ee8",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "devshell",
+        "type": "github"
+      }
+    },
+    "flake-compat": {
+      "locked": {
+        "lastModified": 1696426674,
+        "narHash": "sha256-kvjfFW7WAETZlt09AgDn1MrtKzP7t90Vf7vypd3OL1U=",
+        "rev": "0f9255e01c2351cc7d116c072cb317785dd33b33",
+        "revCount": 57,
+        "type": "tarball",
+        "url": "https://api.flakehub.com/f/pinned/edolstra/flake-compat/1.0.1/018afb31-abd1-7bff-a5e4-cff7e18efb7a/source.tar.gz"
+      },
+      "original": {
+        "type": "tarball",
+        "url": "https://flakehub.com/f/edolstra/flake-compat/1.tar.gz"
+      }
+    },
+    "flake-compat_2": {
+      "locked": {
+        "lastModified": 1696426674,
+        "narHash": "sha256-kvjfFW7WAETZlt09AgDn1MrtKzP7t90Vf7vypd3OL1U=",
+        "rev": "0f9255e01c2351cc7d116c072cb317785dd33b33",
+        "revCount": 57,
+        "type": "tarball",
+        "url": "https://api.flakehub.com/f/pinned/edolstra/flake-compat/1.0.1/018afb31-abd1-7bff-a5e4-cff7e18efb7a/source.tar.gz"
+      },
+      "original": {
+        "type": "tarball",
+        "url": "https://flakehub.com/f/edolstra/flake-compat/1.tar.gz"
+      }
+    },
+    "flake-compat_3": {
+      "locked": {
+        "lastModified": 1696426674,
+        "narHash": "sha256-kvjfFW7WAETZlt09AgDn1MrtKzP7t90Vf7vypd3OL1U=",
+        "rev": "0f9255e01c2351cc7d116c072cb317785dd33b33",
+        "revCount": 57,
+        "type": "tarball",
+        "url": "https://api.flakehub.com/f/pinned/edolstra/flake-compat/1.0.1/018afb31-abd1-7bff-a5e4-cff7e18efb7a/source.tar.gz"
+      },
+      "original": {
+        "type": "tarball",
+        "url": "https://flakehub.com/f/edolstra/flake-compat/1.tar.gz"
+      }
+    },
+    "flake-compat_4": {
+      "locked": {
+        "lastModified": 1696426674,
+        "narHash": "sha256-kvjfFW7WAETZlt09AgDn1MrtKzP7t90Vf7vypd3OL1U=",
+        "owner": "edolstra",
+        "repo": "flake-compat",
+        "rev": "0f9255e01c2351cc7d116c072cb317785dd33b33",
+        "type": "github"
+      },
+      "original": {
+        "owner": "edolstra",
+        "repo": "flake-compat",
+        "type": "github"
+      }
+    },
+    "flake-parts": {
+      "inputs": {
+        "nixpkgs-lib": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1719745305,
+        "narHash": "sha256-xwgjVUpqSviudEkpQnioeez1Uo2wzrsMaJKJClh+Bls=",
+        "owner": "hercules-ci",
+        "repo": "flake-parts",
+        "rev": "c3c5ecc05edc7dafba779c6c1a61cd08ac6583e9",
+        "type": "github"
+      },
+      "original": {
+        "owner": "hercules-ci",
+        "repo": "flake-parts",
+        "type": "github"
+      }
+    },
+    "flake-parts_2": {
+      "inputs": {
+        "nixpkgs-lib": [
+          "crate2nix",
+          "crate2nix_stable",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1719745305,
+        "narHash": "sha256-xwgjVUpqSviudEkpQnioeez1Uo2wzrsMaJKJClh+Bls=",
+        "owner": "hercules-ci",
+        "repo": "flake-parts",
+        "rev": "c3c5ecc05edc7dafba779c6c1a61cd08ac6583e9",
+        "type": "github"
+      },
+      "original": {
+        "owner": "hercules-ci",
+        "repo": "flake-parts",
+        "type": "github"
+      }
+    },
+    "flake-parts_3": {
+      "inputs": {
+        "nixpkgs-lib": [
+          "crate2nix",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1712014858,
+        "narHash": "sha256-sB4SWl2lX95bExY2gMFG5HIzvva5AVMJd4Igm+GpZNw=",
+        "owner": "hercules-ci",
+        "repo": "flake-parts",
+        "rev": "9126214d0a59633752a136528f5f3b9aa8565b7d",
+        "type": "github"
+      },
+      "original": {
+        "owner": "hercules-ci",
+        "repo": "flake-parts",
+        "type": "github"
+      }
+    },
+    "flake-utils": {
+      "inputs": {
+        "systems": "systems"
+      },
+      "locked": {
+        "lastModified": 1694529238,
+        "narHash": "sha256-zsNZZGTGnMOf9YpHKJqMSsa0dXbfmxeoJ7xHlrt+xmY=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "ff7b65b44d01cf9ba6a71320833626af21126384",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "type": "github"
+      }
+    },
+    "flake-utils_2": {
+      "inputs": {
+        "systems": "systems_2"
+      },
+      "locked": {
+        "lastModified": 1701680307,
+        "narHash": "sha256-kAuep2h5ajznlPMD9rnQyffWG8EM/C73lejGofXvdM8=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "4022d587cbbfd70fe950c1e2083a02621806a725",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "type": "github"
+      }
+    },
+    "flake-utils_3": {
+      "inputs": {
+        "systems": "systems_3"
+      },
+      "locked": {
+        "lastModified": 1701680307,
+        "narHash": "sha256-kAuep2h5ajznlPMD9rnQyffWG8EM/C73lejGofXvdM8=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "4022d587cbbfd70fe950c1e2083a02621806a725",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "type": "github"
+      }
+    },
+    "flake-utils_4": {
+      "inputs": {
+        "systems": "systems_4"
+      },
+      "locked": {
+        "lastModified": 1701680307,
+        "narHash": "sha256-kAuep2h5ajznlPMD9rnQyffWG8EM/C73lejGofXvdM8=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "4022d587cbbfd70fe950c1e2083a02621806a725",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "type": "github"
+      }
+    },
+    "flake-utils_5": {
+      "inputs": {
+        "systems": "systems_5"
+      },
+      "locked": {
+        "lastModified": 1710146030,
+        "narHash": "sha256-SZ5L6eA7HJ/nmkzGG7/ISclqe6oZdOZTNoesiInkXPQ=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "b1d9ab70662946ef0850d488da1c9019f3a9752a",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "type": "github"
+      }
+    },
+    "flake-utils_6": {
+      "inputs": {
+        "systems": "systems_6"
+      },
+      "locked": {
+        "lastModified": 1726560853,
+        "narHash": "sha256-X6rJYSESBVr3hBoH0WbKE5KvhPU5bloyZ2L4K60/fPQ=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "c1dfcf08411b08f6b8615f7d8971a2bfa81d5e8a",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "type": "github"
+      }
+    },
+    "flake-utils_7": {
+      "inputs": {
+        "systems": "systems_7"
+      },
+      "locked": {
+        "lastModified": 1726560853,
+        "narHash": "sha256-X6rJYSESBVr3hBoH0WbKE5KvhPU5bloyZ2L4K60/fPQ=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "c1dfcf08411b08f6b8615f7d8971a2bfa81d5e8a",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "type": "github"
+      }
+    },
+    "gitignore": {
+      "inputs": {
+        "nixpkgs": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable",
+          "pre-commit-hooks",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1709087332,
+        "narHash": "sha256-HG2cCnktfHsKV0s4XW83gU3F57gaTljL9KNSuG6bnQs=",
+        "owner": "hercules-ci",
+        "repo": "gitignore.nix",
+        "rev": "637db329424fd7e46cf4185293b9cc8c88c95394",
+        "type": "github"
+      },
+      "original": {
+        "owner": "hercules-ci",
+        "repo": "gitignore.nix",
+        "type": "github"
+      }
+    },
+    "gitignore_2": {
+      "inputs": {
+        "nixpkgs": [
+          "crate2nix",
+          "crate2nix_stable",
+          "pre-commit-hooks",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1709087332,
+        "narHash": "sha256-HG2cCnktfHsKV0s4XW83gU3F57gaTljL9KNSuG6bnQs=",
+        "owner": "hercules-ci",
+        "repo": "gitignore.nix",
+        "rev": "637db329424fd7e46cf4185293b9cc8c88c95394",
+        "type": "github"
+      },
+      "original": {
+        "owner": "hercules-ci",
+        "repo": "gitignore.nix",
+        "type": "github"
+      }
+    },
+    "gitignore_3": {
+      "inputs": {
+        "nixpkgs": [
+          "crate2nix",
+          "pre-commit-hooks",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1709087332,
+        "narHash": "sha256-HG2cCnktfHsKV0s4XW83gU3F57gaTljL9KNSuG6bnQs=",
+        "owner": "hercules-ci",
+        "repo": "gitignore.nix",
+        "rev": "637db329424fd7e46cf4185293b9cc8c88c95394",
+        "type": "github"
+      },
+      "original": {
+        "owner": "hercules-ci",
+        "repo": "gitignore.nix",
+        "type": "github"
+      }
+    },
+    "nix-filter": {
+      "locked": {
+        "lastModified": 1710156097,
+        "narHash": "sha256-1Wvk8UP7PXdf8bCCaEoMnOT1qe5/Duqgj+rL8sRQsSM=",
+        "owner": "numtide",
+        "repo": "nix-filter",
+        "rev": "3342559a24e85fc164b295c3444e8a139924675b",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "nix-filter",
+        "type": "github"
+      }
+    },
+    "nix-test-runner": {
+      "flake": false,
+      "locked": {
+        "lastModified": 1588761593,
+        "narHash": "sha256-FKJykltAN/g3eIceJl4SfDnnyuH2jHImhMrXS2KvGIs=",
+        "owner": "stoeffel",
+        "repo": "nix-test-runner",
+        "rev": "c45d45b11ecef3eb9d834c3b6304c05c49b06ca2",
+        "type": "github"
+      },
+      "original": {
+        "owner": "stoeffel",
+        "repo": "nix-test-runner",
+        "type": "github"
+      }
+    },
+    "nix-test-runner_2": {
+      "flake": false,
+      "locked": {
+        "lastModified": 1588761593,
+        "narHash": "sha256-FKJykltAN/g3eIceJl4SfDnnyuH2jHImhMrXS2KvGIs=",
+        "owner": "stoeffel",
+        "repo": "nix-test-runner",
+        "rev": "c45d45b11ecef3eb9d834c3b6304c05c49b06ca2",
+        "type": "github"
+      },
+      "original": {
+        "owner": "stoeffel",
+        "repo": "nix-test-runner",
+        "type": "github"
+      }
+    },
+    "nix-test-runner_3": {
+      "flake": false,
+      "locked": {
+        "lastModified": 1588761593,
+        "narHash": "sha256-FKJykltAN/g3eIceJl4SfDnnyuH2jHImhMrXS2KvGIs=",
+        "owner": "stoeffel",
+        "repo": "nix-test-runner",
+        "rev": "c45d45b11ecef3eb9d834c3b6304c05c49b06ca2",
+        "type": "github"
+      },
+      "original": {
+        "owner": "stoeffel",
+        "repo": "nix-test-runner",
+        "type": "github"
+      }
+    },
+    "nixpkgs": {
+      "locked": {
+        "lastModified": 1700612854,
+        "narHash": "sha256-yrQ8osMD+vDLGFX7pcwsY/Qr5PUd6OmDMYJZzZi0+zc=",
+        "owner": "NixOS",
+        "repo": "nixpkgs",
+        "rev": "19cbff58383a4ae384dea4d1d0c823d72b49d614",
+        "type": "github"
+      },
+      "original": {
+        "owner": "NixOS",
+        "ref": "nixos-unstable",
+        "repo": "nixpkgs",
+        "type": "github"
+      }
+    },
+    "nixpkgs_2": {
+      "locked": {
+        "lastModified": 1715534503,
+        "narHash": "sha256-5ZSVkFadZbFP1THataCaSf0JH2cAH3S29hU9rrxTEqk=",
+        "owner": "NixOS",
+        "repo": "nixpkgs",
+        "rev": "2057814051972fa1453ddfb0d98badbea9b83c06",
+        "type": "github"
+      },
+      "original": {
+        "owner": "NixOS",
+        "ref": "nixos-unstable",
+        "repo": "nixpkgs",
+        "type": "github"
+      }
+    },
+    "nixpkgs_3": {
+      "locked": {
+        "lastModified": 1715534503,
+        "narHash": "sha256-5ZSVkFadZbFP1THataCaSf0JH2cAH3S29hU9rrxTEqk=",
+        "owner": "NixOS",
+        "repo": "nixpkgs",
+        "rev": "2057814051972fa1453ddfb0d98badbea9b83c06",
+        "type": "github"
+      },
+      "original": {
+        "owner": "NixOS",
+        "ref": "nixos-unstable",
+        "repo": "nixpkgs",
+        "type": "github"
+      }
+    },
+    "nixpkgs_4": {
+      "locked": {
+        "lastModified": 1719506693,
+        "narHash": "sha256-C8e9S7RzshSdHB7L+v9I51af1gDM5unhJ2xO1ywxNH8=",
+        "path": "/nix/store/4p0avw1s3vf27hspgqsrqs37gxk4i83i-source",
+        "rev": "b2852eb9365c6de48ffb0dc2c9562591f652242a",
+        "type": "path"
+      },
+      "original": {
+        "id": "nixpkgs",
+        "type": "indirect"
+      }
+    },
+    "nixpkgs_5": {
+      "locked": {
+        "lastModified": 1719506693,
+        "narHash": "sha256-C8e9S7RzshSdHB7L+v9I51af1gDM5unhJ2xO1ywxNH8=",
+        "path": "/nix/store/4p0avw1s3vf27hspgqsrqs37gxk4i83i-source",
+        "rev": "b2852eb9365c6de48ffb0dc2c9562591f652242a",
+        "type": "path"
+      },
+      "original": {
+        "id": "nixpkgs",
+        "type": "indirect"
+      }
+    },
+    "nixpkgs_6": {
+      "locked": {
+        "lastModified": 1727675176,
+        "narHash": "sha256-xIjBFMYldWvj+g8ahxMPofsj+OqxvKJN6YylNHQ7gn4=",
+        "owner": "nixos",
+        "repo": "nixpkgs",
+        "rev": "a6d0207fea9212d28cd3d487efe6bc699663b93a",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nixos",
+        "ref": "nixos-unstable-small",
+        "repo": "nixpkgs",
+        "type": "github"
+      }
+    },
+    "pre-commit-hooks": {
+      "inputs": {
+        "flake-compat": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable",
+          "flake-compat"
+        ],
+        "gitignore": "gitignore",
+        "nixpkgs": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable",
+          "nixpkgs"
+        ],
+        "nixpkgs-stable": [
+          "crate2nix",
+          "crate2nix_stable",
+          "crate2nix_stable",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1719259945,
+        "narHash": "sha256-F1h+XIsGKT9TkGO3omxDLEb/9jOOsI6NnzsXFsZhry4=",
+        "owner": "cachix",
+        "repo": "pre-commit-hooks.nix",
+        "rev": "0ff4381bbb8f7a52ca4a851660fc7a437a4c6e07",
+        "type": "github"
+      },
+      "original": {
+        "owner": "cachix",
+        "repo": "pre-commit-hooks.nix",
+        "type": "github"
+      }
+    },
+    "pre-commit-hooks_2": {
+      "inputs": {
+        "flake-compat": [
+          "crate2nix",
+          "crate2nix_stable",
+          "flake-compat"
+        ],
+        "gitignore": "gitignore_2",
+        "nixpkgs": [
+          "crate2nix",
+          "crate2nix_stable",
+          "nixpkgs"
+        ],
+        "nixpkgs-stable": [
+          "crate2nix",
+          "crate2nix_stable",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1719259945,
+        "narHash": "sha256-F1h+XIsGKT9TkGO3omxDLEb/9jOOsI6NnzsXFsZhry4=",
+        "owner": "cachix",
+        "repo": "pre-commit-hooks.nix",
+        "rev": "0ff4381bbb8f7a52ca4a851660fc7a437a4c6e07",
+        "type": "github"
+      },
+      "original": {
+        "owner": "cachix",
+        "repo": "pre-commit-hooks.nix",
+        "type": "github"
+      }
+    },
+    "pre-commit-hooks_3": {
+      "inputs": {
+        "flake-compat": [
+          "crate2nix",
+          "flake-compat"
+        ],
+        "flake-utils": "flake-utils_5",
+        "gitignore": "gitignore_3",
+        "nixpkgs": [
+          "crate2nix",
+          "nixpkgs"
+        ],
+        "nixpkgs-stable": [
+          "crate2nix",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1712055707,
+        "narHash": "sha256-4XLvuSIDZJGS17xEwSrNuJLL7UjDYKGJSbK1WWX2AK8=",
+        "owner": "cachix",
+        "repo": "pre-commit-hooks.nix",
+        "rev": "e35aed5fda3cc79f88ed7f1795021e559582093a",
+        "type": "github"
+      },
+      "original": {
+        "owner": "cachix",
+        "repo": "pre-commit-hooks.nix",
+        "type": "github"
+      }
+    },
+    "root": {
+      "inputs": {
+        "crate2nix": "crate2nix",
+        "flake-utils": "flake-utils_6",
+        "nix-filter": "nix-filter",
+        "nixpkgs": [
+          "tgi-nix",
+          "nixpkgs"
+        ],
+        "rust-overlay": "rust-overlay",
+        "tgi-nix": "tgi-nix"
+      }
+    },
+    "rust-overlay": {
+      "inputs": {
+        "nixpkgs": [
+          "tgi-nix",
+          "nixpkgs"
+        ]
+      },
+      "locked": {
+        "lastModified": 1727836133,
+        "narHash": "sha256-JE0zciM5IGWvK8J/pE2VldNBf7oyMH5WrU8tZArefbg=",
+        "owner": "oxalica",
+        "repo": "rust-overlay",
+        "rev": "02321540b0c8000b36889b1b974d1fec585b25a4",
+        "type": "github"
+      },
+      "original": {
+        "owner": "oxalica",
+        "repo": "rust-overlay",
+        "type": "github"
+      }
+    },
+    "systems": {
+      "locked": {
+        "lastModified": 1681028828,
+        "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+        "owner": "nix-systems",
+        "repo": "default",
+        "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-systems",
+        "repo": "default",
+        "type": "github"
+      }
+    },
+    "systems_2": {
+      "locked": {
+        "lastModified": 1681028828,
+        "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+        "owner": "nix-systems",
+        "repo": "default",
+        "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-systems",
+        "repo": "default",
+        "type": "github"
+      }
+    },
+    "systems_3": {
+      "locked": {
+        "lastModified": 1681028828,
+        "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+        "owner": "nix-systems",
+        "repo": "default",
+        "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-systems",
+        "repo": "default",
+        "type": "github"
+      }
+    },
+    "systems_4": {
+      "locked": {
+        "lastModified": 1681028828,
+        "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+        "owner": "nix-systems",
+        "repo": "default",
+        "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-systems",
+        "repo": "default",
+        "type": "github"
+      }
+    },
+    "systems_5": {
+      "locked": {
+        "lastModified": 1681028828,
+        "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+        "owner": "nix-systems",
+        "repo": "default",
+        "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-systems",
+        "repo": "default",
+        "type": "github"
+      }
+    },
+    "systems_6": {
+      "locked": {
+        "lastModified": 1681028828,
+        "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+        "owner": "nix-systems",
+        "repo": "default",
+        "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-systems",
+        "repo": "default",
+        "type": "github"
+      }
+    },
+    "systems_7": {
+      "locked": {
+        "lastModified": 1681028828,
+        "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+        "owner": "nix-systems",
+        "repo": "default",
+        "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
+        "type": "github"
+      },
+      "original": {
+        "owner": "nix-systems",
+        "repo": "default",
+        "type": "github"
+      }
+    },
+    "tgi-nix": {
+      "inputs": {
+        "flake-compat": "flake-compat_4",
+        "flake-utils": "flake-utils_7",
+        "nixpkgs": "nixpkgs_6"
+      },
+      "locked": {
+        "lastModified": 1729761651,
+        "narHash": "sha256-GYykQ9Fxji2EuXCGcPn0dx8Qx8VQBJTkRdcCytp4A/k=",
+        "owner": "huggingface",
+        "repo": "text-generation-inference-nix",
+        "rev": "f7e3c4fa67d70590ed9ee47feeab645bd9ba81b1",
+        "type": "github"
+      },
+      "original": {
+        "owner": "huggingface",
+        "ref": "marlin-kernels-0.3.1",
+        "repo": "text-generation-inference-nix",
+        "type": "github"
+      }
+    }
+  },
+  "root": "root",
+  "version": 7
+}
--- a/flake.nix
+++ b/flake.nix
+{
+  inputs = {
+    crate2nix = {
+      url = "github:nix-community/crate2nix";
+      inputs.nixpkgs.follows = "tgi-nix/nixpkgs";
+    };
+    nix-filter.url = "github:numtide/nix-filter";
+    tgi-nix.url = "github:huggingface/text-generation-inference-nix/marlin-kernels-0.3.1";
+    nixpkgs.follows = "tgi-nix/nixpkgs";
+    flake-utils.url = "github:numtide/flake-utils";
+    rust-overlay = {
+      url = "github:oxalica/rust-overlay";
+      inputs.nixpkgs.follows = "tgi-nix/nixpkgs";
+    };
+  };
+  outputs =
+    {
+      self,
+      crate2nix,
+      nix-filter,
+      nixpkgs,
+      flake-utils,
+      rust-overlay,
+      tgi-nix,
+    }:
+    flake-utils.lib.eachDefaultSystem (
+      system:
+      let
+        cargoNix = crate2nix.tools.${system}.appliedCargoNix {
+          name = "tgi";
+          src = ./.;
+          additionalCargoNixArgs = [ "--all-features" ];
+        };
+        pkgs = import nixpkgs {
+          inherit system;
+          inherit (tgi-nix.lib) config;
+          overlays = [
+            rust-overlay.overlays.default
+            tgi-nix.overlays.default
+            (import nix/overlay.nix)
+          ];
+        };
+        crateOverrides = import ./nix/crate-overrides.nix { inherit pkgs nix-filter; };
+        benchmark = cargoNix.workspaceMembers.text-generation-benchmark.build.override {
+          inherit crateOverrides;
+        };
+        launcher = cargoNix.workspaceMembers.text-generation-launcher.build.override {
+          inherit crateOverrides;
+        };
+        router =
+          let
+            routerUnwrapped = cargoNix.workspaceMembers.text-generation-router-v3.build.override {
+              inherit crateOverrides;
+            };
+            packagePath =
+              with pkgs.python3.pkgs;
+              makePythonPath [
+                protobuf
+                sentencepiece
+                torch
+                transformers
+              ];
+          in
+          pkgs.writeShellApplication {
+            name = "text-generation-router";
+            text = ''
+              PYTHONPATH="${packagePath}" ${routerUnwrapped}/bin/text-generation-router "$@"
+            '';
+          };
+        server = pkgs.python3.pkgs.callPackage ./nix/server.nix { inherit nix-filter; };
+        client = pkgs.python3.pkgs.callPackage ./nix/client.nix { };
+      in
+      {
+        checks = {
+          rust =
+            with pkgs;
+            rustPlatform.buildRustPackage {
+              name = "rust-checks";
+              src = ./.;
+              cargoLock = {
+                lockFile = ./Cargo.lock;
+              };
+              buildInputs = [ openssl.dev ];
+              nativeBuildInputs = [
+                clippy
+                pkg-config
+                protobuf
+                python3
+                rustfmt
+              ];
+              buildPhase = ''
+                cargo check
+              '';
+              checkPhase = ''
+                cargo fmt -- --check
+                cargo test -j $NIX_BUILD_CORES
+                cargo clippy
+              '';
+              installPhase = "touch $out";
+            };
+        };
+        formatter = pkgs.nixfmt-rfc-style;
+        devShells = with pkgs; rec {
+          default = pure;
+
+          pure = mkShell {
+            buildInputs = [
+              benchmark
+              launcher
+              router
+              server
+            ];
+          };
+          test = mkShell {
+            buildInputs =
+              [
+                benchmark
+                launcher
+                router
+                server
+                client
+                openssl.dev
+                pkg-config
+                cargo
+                rustfmt
+                clippy
+              ]
+              ++ (with python3.pkgs; [
+                docker
+                pytest
+                pytest-asyncio
+                syrupy
+                pre-commit
+                ruff
+              ]);
+          };
+
+          impure = callPackage ./nix/impure-shell.nix { inherit server; };
+
+          impureWithCuda = callPackage ./nix/impure-shell.nix {
+            inherit server;
+            withCuda = true;
+          };
+
+          impure-flash-attn-v1 = callPackage ./nix/impure-shell.nix {
+            server = server.override { flash-attn = python3.pkgs.flash-attn-v1; };
+          };
+        };
+
+        packages = rec {
+          default = pkgs.writeShellApplication {
+            name = "text-generation-inference";
+            runtimeInputs = [
+              server
+              router
+            ];
+            text = ''
+              ${launcher}/bin/text-generation-launcher "$@"
+            '';
+          };
+
+          dockerImage = pkgs.callPackage nix/docker.nix {
+            text-generation-inference = default;
+          };
+
+          dockerImageStreamed = pkgs.callPackage nix/docker.nix {
+            text-generation-inference = default;
+            stream = true;
+          };
+        };
+      }
+    );
+}
--- a/integration-tests/conftest.py
+++ b/integration-tests/conftest.py
@@ -4,22 +4,25 @@ import json
 import math
 import os
 import random
-import re
 import shutil
 import subprocess
 import sys
 import tempfile
 import time
-from typing import Dict, List, Optional
-
 import docker
 import pytest
+import base64
+
+from pathlib import Path
+from typing import Dict, List, Optional
 from aiohttp import ClientConnectorError, ClientOSError, ServerDisconnectedError
 from docker.errors import NotFound
 from syrupy.extensions.json import JSONSnapshotExtension
+
 from text_generation import AsyncClient
 from text_generation.types import (
    BestOfSequence,
+    Message,
    ChatComplete,
    ChatCompletionChunk,
    ChatCompletionComplete,
@@ -65,6 +68,7 @@ class ResponseComparator(JSONSnapshotExtension):
        self,
        data,
        *,
+        include=None,
        exclude=None,
        matcher=None,
    ):
@@ -80,7 +84,12 @@ class ResponseComparator(JSONSnapshotExtension):
            data = [d.model_dump() for d in data]

        data = self._filter(
-            data=data, depth=0, path=(), exclude=exclude, matcher=matcher
+            data=data,
+            depth=0,
+            path=(),
+            exclude=exclude,
+            include=include,
+            matcher=matcher,
        )
        return json.dumps(data, indent=2, ensure_ascii=False, sort_keys=False) + "\n"

@@ -92,25 +101,25 @@ class ResponseComparator(JSONSnapshotExtension):
    ) -> bool:
        def convert_data(data):
            data = json.loads(data)
-            if isinstance(data, Dict) and "choices" in data:
-                choices = data["choices"]
-                if isinstance(choices, List) and len(choices) >= 1:
-                    if "delta" in choices[0]:
-                        return ChatCompletionChunk(**data)
-                    if "text" in choices[0]:
-                        return Completion(**data)
-                return ChatComplete(**data)
+            return _convert_data(data)

+        def _convert_data(data):
            if isinstance(data, Dict):
-                return Response(**data)
+                if "choices" in data:
+                    data["choices"] = list(
+                        sorted(data["choices"], key=lambda x: x["index"])
+                    )
+                    choices = data["choices"]
+                    if isinstance(choices, List) and len(choices) >= 1:
+                        if "delta" in choices[0]:
+                            return ChatCompletionChunk(**data)
+                        if "text" in choices[0]:
+                            return Completion(**data)
+                    return ChatComplete(**data)
+                else:
+                    return Response(**data)
            if isinstance(data, List):
-                if (
-                    len(data) > 0
-                    and "object" in data[0]
-                    and data[0]["object"] == "text_completion"
-                ):
-                    return [Completion(**d) for d in data]
-                return [Response(**d) for d in data]
+                return [_convert_data(d) for d in data]
            raise NotImplementedError

        def eq_token(token: Token, other: Token) -> bool:
@@ -119,6 +128,7 @@ class ResponseComparator(JSONSnapshotExtension):
                and token.text == other.text
                and (
                    self.ignore_logprob
+                    or (token.logprob == other.logprob and token.logprob is None)
                    or math.isclose(token.logprob, other.logprob, rel_tol=self.rtol)
                )
                and token.special == other.special
@@ -257,7 +267,7 @@ class IgnoreLogProbResponseComparator(ResponseComparator):

 class LauncherHandle:
    def __init__(self, port: int):
-        self.client = AsyncClient(f"http://localhost:{port}")
+        self.client = AsyncClient(f"http://localhost:{port}", timeout=30)

    def _inner_health(self):
        raise NotImplementedError
@@ -271,7 +281,7 @@ class LauncherHandle:
            try:
                await self.client.generate("test")
                return
-            except (ClientConnectorError, ClientOSError, ServerDisconnectedError) as e:
+            except (ClientConnectorError, ClientOSError, ServerDisconnectedError):
                time.sleep(1)
        raise RuntimeError("Health check failed")

@@ -329,10 +339,14 @@ def launcher(event_loop):
        use_flash_attention: bool = True,
        disable_grammar_support: bool = False,
        dtype: Optional[str] = None,
+        kv_cache_dtype: Optional[str] = None,
        revision: Optional[str] = None,
        max_input_length: Optional[int] = None,
        max_batch_prefill_tokens: Optional[int] = None,
        max_total_tokens: Optional[int] = None,
+        lora_adapters: Optional[List[str]] = None,
+        cuda_graphs: Optional[List[int]] = None,
+        attention: Optional[str] = None,
    ):
        port = random.randint(8000, 10_000)
        master_port = random.randint(10_000, 20_000)
@@ -365,6 +379,9 @@ def launcher(event_loop):
        if dtype is not None:
            args.append("--dtype")
            args.append(dtype)
+        if kv_cache_dtype is not None:
+            args.append("--kv-cache-dtype")
+            args.append(kv_cache_dtype)
        if revision is not None:
            args.append("--revision")
            args.append(revision)
@@ -379,11 +396,22 @@ def launcher(event_loop):
        if max_total_tokens:
            args.append("--max-total-tokens")
            args.append(str(max_total_tokens))
+        if lora_adapters:
+            args.append("--lora-adapters")
+            args.append(",".join(lora_adapters))
+        if cuda_graphs:
+            args.append("--cuda-graphs")
+            args.append(",".join(map(str, cuda_graphs)))
+
+        print(" ".join(args), file=sys.stderr)

        env["LOG_LEVEL"] = "info,text_generation_router=debug"
+        env["PREFILL_CHUNKING"] = "1"

        if not use_flash_attention:
            env["USE_FLASH_ATTENTION"] = "false"
+        if attention is not None:
+            env["ATTENTION"] = attention

        with tempfile.TemporaryFile("w+") as tmp:
            # We'll output stdout/stderr to a temporary file. Using a pipe
@@ -414,10 +442,14 @@ def launcher(event_loop):
        use_flash_attention: bool = True,
        disable_grammar_support: bool = False,
        dtype: Optional[str] = None,
+        kv_cache_dtype: Optional[str] = None,
        revision: Optional[str] = None,
        max_input_length: Optional[int] = None,
        max_batch_prefill_tokens: Optional[int] = None,
        max_total_tokens: Optional[int] = None,
+        lora_adapters: Optional[List[str]] = None,
+        cuda_graphs: Optional[List[int]] = None,
+        attention: Optional[str] = None,
    ):
        port = random.randint(8000, 10_000)

@@ -433,6 +465,9 @@ def launcher(event_loop):
        if dtype is not None:
            args.append("--dtype")
            args.append(dtype)
+        if kv_cache_dtype is not None:
+            args.append("--kv-cache-dtype")
+            args.append(kv_cache_dtype)
        if revision is not None:
            args.append("--revision")
            args.append(revision)
@@ -447,6 +482,12 @@ def launcher(event_loop):
        if max_total_tokens:
            args.append("--max-total-tokens")
            args.append(str(max_total_tokens))
+        if lora_adapters:
+            args.append("--lora-adapters")
+            args.append(",".join(lora_adapters))
+        if cuda_graphs:
+            args.append("--cuda-graphs")
+            args.append(",".join(map(str, cuda_graphs)))

        client = docker.from_env()

@@ -455,6 +496,7 @@ def launcher(event_loop):
        try:
            container = client.containers.get(container_name)
            container.stop()
+            container.remove()
            container.wait()
        except NotFound:
            pass
@@ -463,9 +505,12 @@ def launcher(event_loop):

        env = {
            "LOG_LEVEL": "info,text_generation_router=debug",
+            "PREFILL_CHUNKING": "1",
        }
        if not use_flash_attention:
            env["USE_FLASH_ATTENTION"] = "false"
+        if attention is not None:
+            env["ATTENTION"] = attention

        if HF_TOKEN is not None:
            env["HF_TOKEN"] = HF_TOKEN
@@ -475,13 +520,28 @@ def launcher(event_loop):
            volumes = [f"{DOCKER_VOLUME}:/data"]

        if DOCKER_DEVICES:
-            devices = DOCKER_DEVICES.split(",")
+            if DOCKER_DEVICES.lower() == "none":
+                devices = []
+            else:
+                devices = DOCKER_DEVICES.strip().split(",")
            visible = os.getenv("ROCR_VISIBLE_DEVICES")
            if visible:
                env["ROCR_VISIBLE_DEVICES"] = visible
            device_requests = []
+            if not devices:
+                devices = None
+            elif devices == ["nvidia.com/gpu=all"]:
+                devices = None
+                device_requests = [
+                    docker.types.DeviceRequest(
+                        driver="cdi",
+                        # count=gpu_count,
+                        device_ids=[f"nvidia.com/gpu={i}"],
+                    )
+                    for i in range(gpu_count)
+                ]
        else:
-            devices = []
+            devices = None
            device_requests = [
                docker.types.DeviceRequest(count=gpu_count, capabilities=[["gpu"]])
            ]
@@ -497,24 +557,30 @@ def launcher(event_loop):
            devices=devices,
            volumes=volumes,
            ports={"80/tcp": port},
+            healthcheck={"timeout": int(10 * 1e9)},
            shm_size="1G",
        )

-        yield ContainerLauncherHandle(client, container.name, port)
+        try:
+            yield ContainerLauncherHandle(client, container.name, port)

-        if not use_flash_attention:
-            del env["USE_FLASH_ATTENTION"]
+            if not use_flash_attention:
+                del env["USE_FLASH_ATTENTION"]

-        try:
-            container.stop()
-            container.wait()
-        except NotFound:
-            pass
+            try:
+                container.stop()
+                container.wait()
+            except NotFound:
+                pass

-        container_output = container.logs().decode("utf-8")
-        print(container_output, file=sys.stderr)
+            container_output = container.logs().decode("utf-8")
+            print(container_output, file=sys.stderr)

-        container.remove()
+        finally:
+            try:
+                container.remove()
+            except Exception:
+                pass

    if DOCKER_IMAGE is not None:
        return docker_launcher
@@ -547,3 +613,56 @@ def generate_load():
        return await asyncio.gather(*futures)

    return generate_load_inner
+
+
+@pytest.fixture(scope="module")
+def generate_multi():
+    async def generate_load_inner(
+        client: AsyncClient,
+        prompts: List[str],
+        max_new_tokens: int,
+        seed: Optional[int] = None,
+    ) -> List[Response]:
+        import numpy as np
+
+        arange = np.arange(len(prompts))
+        perm = np.random.permutation(arange)
+        rperm = [-1] * len(perm)
+        for i, p in enumerate(perm):
+            rperm[p] = i
+
+        shuffled_prompts = [prompts[p] for p in perm]
+        futures = [
+            client.chat(
+                messages=[Message(role="user", content=prompt)],
+                max_tokens=max_new_tokens,
+                temperature=0,
+                seed=seed,
+            )
+            for prompt in shuffled_prompts
+        ]
+
+        shuffled_responses = await asyncio.gather(*futures)
+        responses = [shuffled_responses[p] for p in rperm]
+        return responses
+
+    return generate_load_inner
+
+
+# TODO fix the server parser to count inline image tokens correctly
+@pytest.fixture
+def chicken():
+    path = Path(__file__).parent / "images" / "chicken_on_money.png"
+
+    with open(path, "rb") as image_file:
+        encoded_string = base64.b64encode(image_file.read())
+    return f"data:image/png;base64,{encoded_string.decode('utf-8')}"
+
+
+@pytest.fixture
+def cow_beach():
+    path = Path(__file__).parent / "images" / "cow_beach.png"
+
+    with open(path, "rb") as image_file:
+        encoded_string = base64.b64encode(image_file.read())
+    return f"data:image/png;base64,{encoded_string.decode('utf-8')}"
--- a/integration-tests/models/__snapshots__/test_bloom_560m/test_bloom_560m.json
+++ b/integration-tests/models/__snapshots__/test_bloom_560m/test_bloom_560m.json
@@ -11,42 +11,42 @@
      },
      {
        "id": 49833,
-        "logprob": -10.5625,
+        "logprob": -10.5703125,
        "text": " dég"
      },
      {
        "id": 21543,
-        "logprob": -0.14770508,
+        "logprob": -0.14746094,
        "text": "uster"
      },
      {
        "id": 447,
-        "logprob": -1.9287109,
+        "logprob": -1.9277344,
        "text": " un"
      },
      {
        "id": 46341,
-        "logprob": -15.4609375,
+        "logprob": -15.421875,
        "text": " ort"
      },
      {
        "id": 35567,
-        "logprob": -7.5585938,
+        "logprob": -7.5820312,
        "text": "olan"
      },
      {
        "id": 15,
-        "logprob": -1.4003906,
+        "logprob": -1.4013672,
        "text": ","
      },
      {
        "id": 1669,
-        "logprob": -1.5673828,
+        "logprob": -1.5595703,
        "text": " il"
      },
      {
        "id": 11580,
-        "logprob": -0.94628906,
+        "logprob": -0.9428711,
        "text": " faut"
      },
      {
@@ -56,7 +56,7 @@
      },
      {
        "id": 39261,
-        "logprob": -1.5732422,
+        "logprob": -1.7763672,
        "text": " d'abord"
      }
    ],
@@ -64,65 +64,66 @@
    "tokens": [
      {
        "id": 578,
-        "logprob": -1.6591797,
+        "logprob": -1.7822266,
        "special": false,
        "text": " le"
      },
      {
        "id": 5608,
-        "logprob": -2.4492188,
+        "logprob": -2.4882812,
        "special": false,
        "text": " faire"
      },
      {
-        "id": 159570,
-        "logprob": -6.6835938,
+        "id": 7735,
+        "logprob": -2.4199219,
        "special": false,
-        "text": " réch"
+        "text": " fond"
      },
      {
-        "id": 810,
+        "id": 289,
        "logprob": 0.0,
        "special": false,
-        "text": "au"
+        "text": "re"
      },
      {
-        "id": 12736,
-        "logprob": 0.0,
+        "id": 693,
+        "logprob": -2.4628906,
        "special": false,
-        "text": "ffer"
+        "text": " à"
      },
      {
-        "id": 1742,
-        "logprob": -2.5175781,
+        "id": 366,
+        "logprob": -1.1308594,
        "special": false,
-        "text": " au"
+        "text": " la"
      },
      {
-        "id": 6105,
-        "logprob": -2.0078125,
+        "id": 48844,
+        "logprob": -1.7900391,
        "special": false,
-        "text": " bain"
+        "text": " cass"
      },
      {
-        "id": 88254,
-        "logprob": -0.12695312,
+        "id": 1744,
+        "logprob": 0.0,
        "special": false,
-        "text": "-mar"
+        "text": "ero"
      },
      {
-        "id": 641,
+        "id": 327,
        "logprob": 0.0,
        "special": false,
-        "text": "ie"
+        "text": "le"
      },
      {
        "id": 2940,
-        "logprob": -3.5175781,
+        "logprob": -1.9306641,
        "special": false,
        "text": " avec"
      }
-    ]
+    ],
+    "top_tokens": null
  },
-  "generated_text": " le faire réchauffer au bain-marie avec"
+  "generated_text": " le faire fondre à la casserole avec"
 }
--- a/integration-tests/models/__snapshots__/test_bloom_560m/test_bloom_560m_all_params.json
+++ b/integration-tests/models/__snapshots__/test_bloom_560m/test_bloom_560m_all_params.json
@@ -11,7 +11,7 @@
      },
      {
        "id": 1669,
-        "logprob": -5.4414062,
+        "logprob": -5.4453125,
        "text": " il"
      },
      {
@@ -21,12 +21,12 @@
      },
      {
        "id": 3913,
-        "logprob": -4.3554688,
+        "logprob": -4.3320312,
        "text": " tout"
      },
      {
        "id": 39261,
-        "logprob": -2.9238281,
+        "logprob": -2.9160156,
        "text": " d'abord"
      }
    ],
@@ -34,65 +34,66 @@
    "tokens": [
      {
        "id": 408,
-        "logprob": -0.07891846,
+        "logprob": -0.16687012,
        "special": false,
        "text": " que"
      },
      {
        "id": 366,
-        "logprob": -1.2939453,
+        "logprob": -1.5517578,
        "special": false,
        "text": " la"
      },
      {
        "id": 8769,
-        "logprob": -0.3708496,
+        "logprob": -0.16687012,
        "special": false,
        "text": " personne"
      },
      {
        "id": 1479,
-        "logprob": -2.2871094,
+        "logprob": -2.1035156,
        "special": false,
        "text": " qui"
      },
      {
-        "id": 2997,
-        "logprob": -0.8671875,
+        "id": 143926,
+        "logprob": -2.8671875,
        "special": false,
-        "text": " vous"
+        "text": " réalise"
      },
      {
-        "id": 35977,
-        "logprob": -1.5097656,
+        "id": 578,
+        "logprob": 0.0,
        "special": false,
-        "text": " suit"
+        "text": " le"
      },
      {
-        "id": 21558,
-        "logprob": -0.07891846,
+        "id": 8138,
+        "logprob": -0.66748047,
        "special": false,
-        "text": " ait"
+        "text": " projet"
      },
      {
-        "id": 447,
-        "logprob": -0.12695312,
+        "id": 795,
+        "logprob": -1.6279297,
        "special": false,
-        "text": " un"
+        "text": " ne"
      },
      {
-        "id": 78606,
-        "logprob": -2.21875,
+        "id": 9802,
+        "logprob": -0.47875977,
        "special": false,
-        "text": " profil"
+        "text": " soit"
      },
      {
-        "id": 3899,
-        "logprob": -1.3535156,
+        "id": 1230,
+        "logprob": 0.0,
        "special": false,
-        "text": " bien"
+        "text": " pas"
      }
-    ]
+    ],
+    "top_tokens": null
  },
-  "generated_text": "Pour déguster un ortolan, il faut tout d'abord que la personne qui vous suit ait un profil bien"
+  "generated_text": "Pour déguster un ortolan, il faut tout d'abord que la personne qui réalise le projet ne soit pas"
 }
--- a/integration-tests/models/__snapshots__/test_bloom_560m_sharded/test_bloom_560m_sharded.json
+++ b/integration-tests/models/__snapshots__/test_bloom_560m_sharded/test_bloom_560m_sharded.json
@@ -11,52 +11,52 @@
      },
      {
        "id": 49833,
-        "logprob": -10.5390625,
+        "logprob": -10.546875,
        "text": " dég"
      },
      {
        "id": 21543,
-        "logprob": -0.14758301,
+        "logprob": -0.14819336,
        "text": "uster"
      },
      {
        "id": 447,
-        "logprob": -1.9296875,
+        "logprob": -1.9257812,
        "text": " un"
      },
      {
        "id": 46341,
-        "logprob": -15.4453125,
+        "logprob": -15.4296875,
        "text": " ort"
      },
      {
        "id": 35567,
-        "logprob": -7.59375,
+        "logprob": -7.5625,
        "text": "olan"
      },
      {
        "id": 15,
-        "logprob": -1.3994141,
+        "logprob": -1.4199219,
        "text": ","
      },
      {
        "id": 1669,
-        "logprob": -1.578125,
+        "logprob": -1.5634766,
        "text": " il"
      },
      {
        "id": 11580,
-        "logprob": -0.9453125,
+        "logprob": -0.9458008,
        "text": " faut"
      },
      {
        "id": 3913,
-        "logprob": -3.7011719,
+        "logprob": -3.6816406,
        "text": " tout"
      },
      {
        "id": 39261,
-        "logprob": -1.5732422,
+        "logprob": -1.7753906,
        "text": " d'abord"
      }
    ],
@@ -64,65 +64,66 @@
    "tokens": [
      {
        "id": 578,
-        "logprob": -1.6474609,
+        "logprob": -1.828125,
        "special": false,
        "text": " le"
      },
      {
        "id": 5608,
-        "logprob": -2.5097656,
+        "logprob": -2.5546875,
        "special": false,
        "text": " faire"
      },
      {
-        "id": 159570,
-        "logprob": -6.65625,
+        "id": 7735,
+        "logprob": -2.4277344,
        "special": false,
-        "text": " réch"
+        "text": " fond"
      },
      {
-        "id": 810,
+        "id": 289,
        "logprob": 0.0,
        "special": false,
-        "text": "au"
+        "text": "re"
      },
      {
-        "id": 12736,
-        "logprob": 0.0,
+        "id": 693,
+        "logprob": -2.4472656,
        "special": false,
-        "text": "ffer"
+        "text": " à"
      },
      {
-        "id": 1742,
-        "logprob": -2.5859375,
+        "id": 366,
+        "logprob": -1.1494141,
        "special": false,
-        "text": " au"
+        "text": " la"
      },
      {
-        "id": 6105,
-        "logprob": -2.03125,
+        "id": 48844,
+        "logprob": -1.7939453,
        "special": false,
-        "text": " bain"
+        "text": " cass"
      },
      {
-        "id": 88254,
-        "logprob": -0.12695312,
+        "id": 1744,
+        "logprob": 0.0,
        "special": false,
-        "text": "-mar"
+        "text": "ero"
      },
      {
-        "id": 641,
+        "id": 327,
        "logprob": 0.0,
        "special": false,
-        "text": "ie"
+        "text": "le"
      },
      {
        "id": 2940,
-        "logprob": -3.5175781,
+        "logprob": -1.9013672,
        "special": false,
        "text": " avec"
      }
-    ]
+    ],
+    "top_tokens": null
  },
-  "generated_text": " le faire réchauffer au bain-marie avec"
+  "generated_text": " le faire fondre à la casserole avec"
 }
--- a/integration-tests/models/__snapshots__/test_chat_llama/test_flash_llama_simple.json
+++ b/integration-tests/models/__snapshots__/test_chat_llama/test_flash_llama_simple.json
@@ -5,7 +5,7 @@
      "index": 0,
      "logprobs": null,
      "message": {
-        "content": "As of your last question, the weather in Brooklyn, New York, is typically hot and humid throughout the year. The suburbs around New York City are jealously sheltered, and at least in the Lower Bronx, there are very few outdoor environments to explore in the middle of urban confines. In fact, typical times for humidity levels in Brooklyn include:\n\n- Early morning: 80-85% humidity, with occas",
+        "content": "As of your last question, the weather in Brooklyn, New York, is typically hot and humid throughout the year. The suburbs around New York City are jealously sheltered, and at least in the Lower Bronx, there are very few outdoor environments to appreciate nature.\n\nIn terms of temperature, the warmest times of the year are from June to August, when average high temperatures typically range from around 73°F or 23°C",
        "name": null,
        "role": "assistant",
        "tool_calls": null
@@ -13,14 +13,14 @@
      "usage": null
    }
  ],
-  "created": 1716553098,
+  "created": 1724792495,
  "id": "",
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-  "object": "text_completion",
-  "system_fingerprint": "2.0.5-dev0-native",
+  "object": "chat.completion",
+  "system_fingerprint": "2.2.1-dev0-native",
  "usage": {
    "completion_tokens": 100,
-    "prompt_tokens": 62,
-    "total_tokens": 162
+    "prompt_tokens": 61,
+    "total_tokens": 161
  }
 }
--- a/integration-tests/models/__snapshots__/test_completion_prompts/test_flash_llama_completion_many_prompts.json
+++ b/integration-tests/models/__snapshots__/test_completion_prompts/test_flash_llama_completion_many_prompts.json
 {
  "choices": [
    {
-      "finish_reason": "eos_token",
-      "index": 1,
+      "finish_reason": "length",
+      "index": 0,
      "logprobs": null,
-      "text": " PR for more information?"
+      "text": " A Beginner’s Guide\nDeep learning is a subset"
    },
    {
      "finish_reason": "length",
-      "index": 0,
+      "index": 1,
      "logprobs": null,
-      "text": "le Business Incubator is providing a workspace"
+      "text": " This is a question that has puzzled many people for"
    },
    {
      "finish_reason": "length",
-      "index": 2,
+      "index": 3,
      "logprobs": null,
-      "text": " severely flawed and often has a substandard"
+      "text": "usculas_minusculas(s):\n    \"\"\"\n"
    },
    {
      "finish_reason": "length",
-      "index": 3,
+      "index": 2,
      "logprobs": null,
-      "text": "hd20220811-"
+      "text": " Paris\nWhat is the capital of France?\nThe"
    }
  ],
-  "created": 1713284455,
+  "created": 1725877154,
  "id": "",
-  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "object": "text_completion",
-  "system_fingerprint": "2.0.1-native",
+  "system_fingerprint": "2.2.1-dev0-native",
  "usage": {
-    "completion_tokens": 36,
-    "prompt_tokens": 8,
-    "total_tokens": 44
+    "completion_tokens": 40,
+    "prompt_tokens": 22,
+    "total_tokens": 62
  }
 }
--- a/integration-tests/models/__snapshots__/test_completion_prompts/test_flash_llama_completion_many_prompts_stream.json
+++ b/integration-tests/models/__snapshots__/test_completion_prompts/test_flash_llama_completion_many_prompts_stream.json
@@ -5,14 +5,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": "\n"
+        "text": " A"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -20,14 +20,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": "\n"
+        "text": " This"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -35,14 +35,14 @@
        "finish_reason": "",
        "index": 2,
        "logprobs": null,
-        "text": "\n"
+        "text": " Paris"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -50,14 +50,14 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": "hd"
+        "text": "us"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -65,14 +65,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": "\n"
+        "text": " Beginner"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -80,14 +80,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": "\n"
+        "text": " is"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -98,11 +98,11 @@
        "text": "\n"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -110,14 +110,14 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": "aho"
+        "text": "cul"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -125,14 +125,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": "2"
+        "text": "’s"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -140,14 +140,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": "2"
+        "text": " a"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -155,14 +155,14 @@
        "finish_reason": "",
        "index": 2,
        "logprobs": null,
-        "text": "2"
+        "text": "What"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -170,14 +170,14 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": "ima"
+        "text": "as"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -185,14 +185,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": "."
+        "text": " Guide"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -200,14 +200,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": "."
+        "text": " question"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -215,14 +215,14 @@
        "finish_reason": "",
        "index": 2,
        "logprobs": null,
-        "text": "."
+        "text": " is"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -230,14 +230,14 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": "\n"
+        "text": "_minus"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -245,14 +245,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": " Sarah"
+        "text": "\n"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -260,14 +260,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": " Yes"
+        "text": " that"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -275,14 +275,14 @@
        "finish_reason": "",
        "index": 2,
        "logprobs": null,
-        "text": " And"
+        "text": " the"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -290,14 +290,14 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": "i"
+        "text": "cul"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -305,14 +305,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": "'"
+        "text": "Deep"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -320,14 +320,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": ","
+        "text": " has"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -335,14 +335,14 @@
        "finish_reason": "",
        "index": 2,
        "logprobs": null,
-        "text": " what"
+        "text": " capital"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -350,14 +350,14 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": "'"
+        "text": "as"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -365,14 +365,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": "s"
+        "text": " learning"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -380,14 +380,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": " Moh"
+        "text": " puzzled"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -395,14 +395,14 @@
        "finish_reason": "",
        "index": 2,
        "logprobs": null,
-        "text": " is"
+        "text": " of"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -410,14 +410,14 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": "m"
+        "text": "(s"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -425,14 +425,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": " Room"
+        "text": " is"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -440,14 +440,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": "s"
+        "text": " many"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -455,14 +455,14 @@
        "finish_reason": "",
        "index": 2,
        "logprobs": null,
-        "text": " the"
+        "text": " France"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -470,14 +470,14 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": " tired"
+        "text": "):\n"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -485,14 +485,14 @@
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
-        "text": ":"
+        "text": " a"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -500,14 +500,14 @@
        "finish_reason": "",
        "index": 1,
        "logprobs": null,
-        "text": "'"
+        "text": " people"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -515,14 +515,14 @@
        "finish_reason": "",
        "index": 2,
        "logprobs": null,
-        "text": " capital"
+        "text": "?\n"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
@@ -530,73 +530,73 @@
        "finish_reason": "",
        "index": 3,
        "logprobs": null,
-        "text": " of"
+        "text": "   "
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
      {
-        "finish_reason": "",
+        "finish_reason": "length",
        "index": 0,
        "logprobs": null,
-        "text": " She"
+        "text": " subset"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
      {
-        "finish_reason": "",
+        "finish_reason": "length",
        "index": 1,
        "logprobs": null,
-        "text": " scale"
+        "text": " for"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
      {
-        "finish_reason": "",
+        "finish_reason": "length",
        "index": 2,
        "logprobs": null,
-        "text": " of"
+        "text": "The"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  },
  {
    "choices": [
      {
-        "finish_reason": "",
+        "finish_reason": "length",
        "index": 3,
        "logprobs": null,
-        "text": " being"
+        "text": " \"\"\"\n"
      }
    ],
-    "created": 1713284431,
+    "created": 1725883643,
    "id": "",
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "text_completion",
-    "system_fingerprint": "2.0.1-native"
+    "system_fingerprint": "2.2.1-dev0-native"
  }
 ]
--- a/integration-tests/models/__snapshots__/test_completion_prompts/test_flash_llama_completion_single_prompt.json
+++ b/integration-tests/models/__snapshots__/test_completion_prompts/test_flash_llama_completion_single_prompt.json
@@ -4,17 +4,17 @@
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
-      "text": " PR for flake8"
+      "text": " A Beginner’s Guide\nDeep learning is a subset"
    }
  ],
-  "created": 1713284454,
+  "created": 1725876621,
  "id": "",
-  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "object": "text_completion",
-  "system_fingerprint": "2.0.1-native",
+  "system_fingerprint": "2.2.1-dev0-native",
  "usage": {
-    "completion_tokens": 5,
+    "completion_tokens": 10,
    "prompt_tokens": 6,
-    "total_tokens": 11
+    "total_tokens": 16
  }
 }
--- a/integration-tests/models/__snapshots__/test_completion_prompts/test_flash_llama_completion_stream_usage.json
+++ b/integration-tests/models/__snapshots__/test_completion_prompts/test_flash_llama_completion_stream_usage.json
+[
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": "**",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656043,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": "Deep",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656043,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": " Learning",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656043,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": ":",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656043,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": " An",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656043,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": " Overview",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656043,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": "**\n",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656044,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": "================================",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656044,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": "=====",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": null,
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656044,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": null
+  },
+  {
+    "choices": [
+      {
+        "delta": {
+          "content": "\n\n",
+          "role": "assistant",
+          "tool_calls": null
+        },
+        "finish_reason": "length",
+        "index": 0,
+        "logprobs": null
+      }
+    ],
+    "created": 1726656044,
+    "id": "",
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "object": "chat.completion.chunk",
+    "system_fingerprint": "2.2.1-dev0-native",
+    "usage": {
+      "completion_tokens": 10,
+      "prompt_tokens": 40,
+      "total_tokens": 50
+    }
+  }
+]
--- a/integration-tests/models/__snapshots__/test_flash_deepseek_v2/test_flash_deepseek_v2.json
+++ b/integration-tests/models/__snapshots__/test_flash_deepseek_v2/test_flash_deepseek_v2.json
+{
+  "details": {
+    "best_of_sequences": null,
+    "finish_reason": "length",
+    "generated_tokens": 10,
+    "prefill": [
+      {
+        "id": 100000,
+        "logprob": null,
+        "text": "<｜begin▁of▁sentence｜>"
+      },
+      {
+        "id": 3533,
+        "logprob": -9.625,
+        "text": "Test"
+      },
+      {
+        "id": 3102,
+        "logprob": -11.25,
+        "text": " request"
+      }
+    ],
+    "seed": null,
+    "tokens": [
+      {
+        "id": 185,
+        "logprob": -1.546875,
+        "special": false,
+        "text": "\n"
+      },
+      {
+        "id": 549,
+        "logprob": -2.859375,
+        "special": false,
+        "text": "The"
+      },
+      {
+        "id": 1727,
+        "logprob": -2.484375,
+        "special": false,
+        "text": " test"
+      },
+      {
+        "id": 3102,
+        "logprob": -0.83203125,
+        "special": false,
+        "text": " request"
+      },
+      {
+        "id": 317,
+        "logprob": -1.1484375,
+        "special": false,
+        "text": " is"
+      },
+      {
+        "id": 245,
+        "logprob": -1.578125,
+        "special": false,
+        "text": " a"
+      },
+      {
+        "id": 3412,
+        "logprob": -2.578125,
+        "special": false,
+        "text": " document"
+      },
+      {
+        "id": 344,
+        "logprob": -1.125,
+        "special": false,
+        "text": " that"
+      },
+      {
+        "id": 317,
+        "logprob": -1.6953125,
+        "special": false,
+        "text": " is"
+      },
+      {
+        "id": 1222,
+        "logprob": -1.71875,
+        "special": false,
+        "text": " used"
+      }
+    ],
+    "top_tokens": null
+  },
+  "generated_text": "\nThe test request is a document that is used"
+}
--- a/integration-tests/models/__snapshots__/test_flash_deepseek_v2/test_flash_deepseek_v2_all_params.json
+++ b/integration-tests/models/__snapshots__/test_flash_deepseek_v2/test_flash_deepseek_v2_all_params.json
+{
+  "details": {
+    "best_of_sequences": null,
+    "finish_reason": "eos_token",
+    "generated_tokens": 4,
+    "prefill": [
+      {
+        "id": 100000,
+        "logprob": null,
+        "text": "<｜begin▁of▁sentence｜>"
+      },
+      {
+        "id": 3533,
+        "logprob": -9.625,
+        "text": "Test"
+      },
+      {
+        "id": 3102,
+        "logprob": -11.25,
+        "text": " request"
+      }
+    ],
+    "seed": 0,
+    "tokens": [
+      {
+        "id": 2143,
+        "logprob": -1.828125,
+        "special": false,
+        "text": " sent"
+      },
+      {
+        "id": 10081,
+        "logprob": -0.41210938,
+        "special": false,
+        "text": " successfully"
+      },
+      {
+        "id": 13,
+        "logprob": 0.0,
+        "special": false,
+        "text": "."
+      },
+      {
+        "id": 100001,
+        "logprob": -0.16015625,
+        "special": true,
+        "text": "<｜end▁of▁sentence｜>"
+      }
+    ],
+    "top_tokens": null
+  },
+  "generated_text": "Test request sent successfully."
+}
--- a/integration-tests/models/__snapshots__/test_flash_deepseek_v2/test_flash_deepseek_v2_load.json
+++ b/integration-tests/models/__snapshots__/test_flash_deepseek_v2/test_flash_deepseek_v2_load.json
+[
+  {
+    "details": {
+      "best_of_sequences": null,
+      "finish_reason": "length",
+      "generated_tokens": 10,
+      "prefill": [
+        {
+          "id": 100000,
+          "logprob": null,
+          "text": "<｜begin▁of▁sentence｜>"
+        },
+        {
+          "id": 3533,
+          "logprob": -9.625,
+          "text": "Test"
+        },
+        {
+          "id": 3102,
+          "logprob": -11.25,
+          "text": " request"
+        }
+      ],
+      "seed": null,
+      "tokens": [
+        {
+          "id": 185,
+          "logprob": -1.546875,
+          "special": false,
+          "text": "\n"
+        },
+        {
+          "id": 549,
+          "logprob": -2.859375,
+          "special": false,
+          "text": "The"
+        },
+        {
+          "id": 1727,
+          "logprob": -2.4375,
+          "special": false,
+          "text": " test"
+        },
+        {
+          "id": 3102,
+          "logprob": -0.83984375,
+          "special": false,
+          "text": " request"
+        },
+        {
+          "id": 317,
+          "logprob": -1.1328125,
+          "special": false,
+          "text": " is"
+        },
+        {
+          "id": 254,
+          "logprob": -1.515625,
+          "special": false,
+          "text": " the"
+        },
+        {
+          "id": 1022,
+          "logprob": -1.15625,
+          "special": false,
+          "text": " first"
+        },
+        {
+          "id": 3458,
+          "logprob": -0.3671875,
+          "special": false,
+          "text": " step"
+        },
+        {
+          "id": 279,
+          "logprob": -0.88671875,
+          "special": false,
+          "text": " in"
+        },
+        {
+          "id": 254,
+          "logprob": -0.69140625,
+          "special": false,
+          "text": " the"
+        }
+      ],
+      "top_tokens": null
+    },
+    "generated_text": "\nThe test request is the first step in the"
+  },
+  {
+    "details": {
+      "best_of_sequences": null,
+      "finish_reason": "length",
+      "generated_tokens": 10,
+      "prefill": [
+        {
+          "id": 100000,
+          "logprob": null,
+          "text": "<｜begin▁of▁sentence｜>"
+        },
+        {
+          "id": 3533,
+          "logprob": -9.625,
+          "text": "Test"
+        },
+        {
+          "id": 3102,
+          "logprob": -11.25,
+          "text": " request"
+        }
+      ],
+      "seed": null,
+      "tokens": [
+        {
+          "id": 185,
+          "logprob": -1.546875,
+          "special": false,
+          "text": "\n"
+        },
+        {
+          "id": 549,
+          "logprob": -2.859375,
+          "special": false,
+          "text": "The"
+        },
+        {
+          "id": 1727,
+          "logprob": -2.4375,
+          "special": false,
+          "text": " test"
+        },
+        {
+          "id": 3102,
+          "logprob": -0.83984375,
+          "special": false,
+          "text": " request"
+        },
+        {
+          "id": 317,
+          "logprob": -1.1328125,
+          "special": false,
+          "text": " is"
+        },
+        {
+          "id": 254,
+          "logprob": -1.515625,
+          "special": false,
+          "text": " the"
+        },
+        {
+          "id": 1022,
+          "logprob": -1.15625,
+          "special": false,
+          "text": " first"
+        },
+        {
+          "id": 3458,
+          "logprob": -0.3671875,
+          "special": false,
+          "text": " step"
+        },
+        {
+          "id": 279,
+          "logprob": -0.88671875,
+          "special": false,
+          "text": " in"
+        },
+        {
+          "id": 254,
+          "logprob": -0.69140625,
+          "special": false,
+          "text": " the"
+        }
+      ],
+      "top_tokens": null
+    },
+    "generated_text": "\nThe test request is the first step in the"
+  },
+  {
+    "details": {
+      "best_of_sequences": null,
+      "finish_reason": "length",
+      "generated_tokens": 10,
+      "prefill": [
+        {
+          "id": 100000,
+          "logprob": null,
+          "text": "<｜begin▁of▁sentence｜>"
+        },
+        {
+          "id": 3533,
+          "logprob": -9.625,
+          "text": "Test"
+        },
+        {
+          "id": 3102,
+          "logprob": -11.25,
+          "text": " request"
+        }
+      ],
+      "seed": null,
+      "tokens": [
+        {
+          "id": 185,
+          "logprob": -1.546875,
+          "special": false,
+          "text": "\n"
+        },
+        {
+          "id": 549,
+          "logprob": -2.859375,
+          "special": false,
+          "text": "The"
+        },
+        {
+          "id": 1727,
+          "logprob": -2.4375,
+          "special": false,
+          "text": " test"
+        },
+        {
+          "id": 3102,
+          "logprob": -0.83984375,
+          "special": false,
+          "text": " request"
+        },
+        {
+          "id": 317,
+          "logprob": -1.1328125,
+          "special": false,
+          "text": " is"
+        },
+        {
+          "id": 254,
+          "logprob": -1.515625,
+          "special": false,
+          "text": " the"
+        },
+        {
+          "id": 1022,
+          "logprob": -1.15625,
+          "special": false,
+          "text": " first"
+        },
+        {
+          "id": 3458,
+          "logprob": -0.3671875,
+          "special": false,
+          "text": " step"
+        },
+        {
+          "id": 279,
+          "logprob": -0.88671875,
+          "special": false,
+          "text": " in"
+        },
+        {
+          "id": 254,
+          "logprob": -0.69140625,
+          "special": false,
+          "text": " the"
+        }
+      ],
+      "top_tokens": null
+    },
+    "generated_text": "\nThe test request is the first step in the"
+  },
+  {
+    "details": {
+      "best_of_sequences": null,
+      "finish_reason": "length",
+      "generated_tokens": 10,
+      "prefill": [
+        {
+          "id": 100000,
+          "logprob": null,
+          "text": "<｜begin▁of▁sentence｜>"
+        },
+        {
+          "id": 3533,
+          "logprob": -9.625,
+          "text": "Test"
+        },
+        {
+          "id": 3102,
+          "logprob": -11.25,
+          "text": " request"
+        }
+      ],
+      "seed": null,
+      "tokens": [
+        {
+          "id": 185,
+          "logprob": -1.546875,
+          "special": false,
+          "text": "\n"
+        },
+        {
+          "id": 549,
+          "logprob": -2.859375,
+          "special": false,
+          "text": "The"
+        },
+        {
+          "id": 1727,
+          "logprob": -2.4375,
+          "special": false,
+          "text": " test"
+        },
+        {
+          "id": 3102,
+          "logprob": -0.83984375,
+          "special": false,
+          "text": " request"
+        },
+        {
+          "id": 317,
+          "logprob": -1.1328125,
+          "special": false,
+          "text": " is"
+        },
+        {
+          "id": 254,
+          "logprob": -1.515625,
+          "special": false,
+          "text": " the"
+        },
+        {
+          "id": 1022,
+          "logprob": -1.15625,
+          "special": false,
+          "text": " first"
+        },
+        {
+          "id": 3458,
+          "logprob": -0.3671875,
+          "special": false,
+          "text": " step"
+        },
+        {
+          "id": 279,
+          "logprob": -0.88671875,
+          "special": false,
+          "text": " in"
+        },
+        {
+          "id": 254,
+          "logprob": -0.69140625,
+          "special": false,
+          "text": " the"
+        }
+      ],
+      "top_tokens": null
+    },
+    "generated_text": "\nThe test request is the first step in the"
+  }
+]
--- a/integration-tests/models/__snapshots__/test_flash_gemma/test_flash_gemma_all_params.json
+++ b/integration-tests/models/__snapshots__/test_flash_gemma/test_flash_gemma_all_params.json
@@ -11,12 +11,12 @@
      },
      {
        "id": 2015,
-        "logprob": -10.0,
+        "logprob": -10.0625,
        "text": "Test"
      },
      {
        "id": 3853,
-        "logprob": -10.875,
+        "logprob": -11.0,
        "text": " request"
      }
    ],
@@ -24,7 +24,7 @@
    "tokens": [
      {
        "id": 7539,
-        "logprob": -0.73046875,
+        "logprob": -0.609375,
        "special": false,
        "text": " forms"
      },
@@ -36,7 +36,7 @@
      },
      {
        "id": 671,
-        "logprob": -1.703125,
+        "logprob": -1.5546875,
        "special": false,
        "text": " an"
      },
@@ -66,24 +66,24 @@
      },
      {
        "id": 11859,
-        "logprob": -1.6953125,
+        "logprob": -1.953125,
        "special": false,
        "text": " lab"
      },
      {
        "id": 2185,
-        "logprob": -1.3125,
+        "logprob": -1.7734375,
        "special": false,
        "text": " process"
      },
      {
-        "id": 578,
-        "logprob": -1.5,
+        "id": 235265,
+        "logprob": 0.0,
        "special": false,
-        "text": " and"
+        "text": "."
      }
    ],
    "top_tokens": null
  },
-  "generated_text": "Test request forms are an essential part of the lab process and"
+  "generated_text": "Test request forms are an essential part of the lab process."
 }
--- a/integration-tests/models/__snapshots__/test_flash_gemma/test_flash_gemma.json
+++ b/integration-tests/models/__snapshots__/test_flash_gemma/test_flash_gemma.json
@@ -11,12 +11,12 @@
      },
      {
        "id": 2015,
-        "logprob": -10.0,
+        "logprob": -10.0625,
        "text": "Test"
      },
      {
        "id": 3853,
-        "logprob": -10.875,
+        "logprob": -11.0,
        "text": " request"
      }
    ],
@@ -24,13 +24,13 @@
    "tokens": [
      {
        "id": 1736,
-        "logprob": -2.09375,
+        "logprob": -2.109375,
        "special": false,
        "text": " form"
      },
      {
        "id": 109,
-        "logprob": -1.8671875,
+        "logprob": -1.90625,
        "special": false,
        "text": "\n\n"
      },
@@ -42,43 +42,43 @@
      },
      {
        "id": 2121,
-        "logprob": -1.8203125,
+        "logprob": -1.796875,
        "special": false,
        "text": " test"
      },
      {
        "id": 3853,
-        "logprob": -0.23242188,
+        "logprob": -0.24511719,
        "special": false,
        "text": " request"
      },
      {
        "id": 1736,
-        "logprob": -0.08544922,
+        "logprob": -0.09326172,
        "special": false,
        "text": " form"
      },
      {
        "id": 603,
-        "logprob": -0.9375,
+        "logprob": -0.95703125,
        "special": false,
        "text": " is"
      },
      {
        "id": 1671,
-        "logprob": -1.671875,
+        "logprob": -1.5859375,
        "special": false,
        "text": " used"
      },
      {
        "id": 577,
-        "logprob": -0.40429688,
+        "logprob": -0.39257812,
        "special": false,
        "text": " to"
      },
      {
        "id": 3853,
-        "logprob": -1.1875,
+        "logprob": -1.25,
        "special": false,
        "text": " request"
      }