Unverified commit e415b690 authored by Nicolas Patry, committed by GitHub

Lots of improvements (Still 2 allocators) (#2449)



* Making prefix/flashinfer the default and running the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and use FD everywhere.

* Update cargo lock?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for fewer errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input (since it's super important with the prefix caching now)

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle cases where the common prefix is
smaller.

* Apply suggestions from code review
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
parent 4e821c00
......@@ -35,7 +35,7 @@ jobs:
with:
# Released on: 02 May, 2024
# https://releases.rs/docs/1.78.0/
toolchain: 1.79.0
toolchain: 1.80.0
override: true
components: rustfmt, clippy
- name: Install Protoc
......
# Rust builder
FROM lukemathwalker/cargo-chef:latest-rust-1.79 AS chef
FROM lukemathwalker/cargo-chef:latest-rust-1.80 AS chef
WORKDIR /usr/src
ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse
......@@ -184,6 +184,12 @@ WORKDIR /usr/src
COPY server/Makefile-selective-scan Makefile
RUN make build-all
# Build flashinfer
FROM kernel-builder AS flashinfer-builder
WORKDIR /usr/src
COPY server/Makefile-flashinfer Makefile
RUN make install-flashinfer
# Text Generation Inference base image
FROM nvidia/cuda:12.1.0-base-ubuntu22.04 AS base
......@@ -236,6 +242,7 @@ COPY --from=vllm-builder /usr/src/vllm/build/lib.linux-x86_64-cpython-310 /opt/c
# Copy build artifacts from mamba builder
COPY --from=mamba-builder /usr/src/mamba/build/lib.linux-x86_64-cpython-310/ /opt/conda/lib/python3.10/site-packages
COPY --from=mamba-builder /usr/src/causal-conv1d/build/lib.linux-x86_64-cpython-310/ /opt/conda/lib/python3.10/site-packages
COPY --from=flashinfer-builder /opt/conda/lib/python3.10/site-packages/flashinfer/ /opt/conda/lib/python3.10/site-packages/flashinfer/
# Install flash-attention dependencies
RUN pip install einops --no-cache-dir
......
# Rust builder
FROM lukemathwalker/cargo-chef:latest-rust-1.79 AS chef
FROM lukemathwalker/cargo-chef:latest-rust-1.80 AS chef
WORKDIR /usr/src
ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse
......
ARG PLATFORM=xpu
FROM lukemathwalker/cargo-chef:latest-rust-1.79 AS chef
FROM lukemathwalker/cargo-chef:latest-rust-1.80 AS chef
WORKDIR /usr/src
ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse
......
......@@ -153,6 +153,8 @@ impl Client {
}),
// We truncate the input on the server side to be sure that it has the correct size
truncate,
// Most requests will have that
add_special_tokens: true,
// Blocks and slots will be set on the server side if we use paged attention
blocks: vec![],
slots: vec![],
......
......@@ -221,6 +221,7 @@ impl Health for ShardedClient {
chunks: vec![Chunk::Text("liveness".into()).into()],
}),
truncate: 10,
add_special_tokens: true,
prefill_logprobs: false,
parameters: Some(NextTokenChooserParameters {
temperature: 1.0,
......
......@@ -35,27 +35,15 @@ impl BackendV3 {
window_size: Option<u32>,
speculate: u32,
) -> Self {
let prefix_caching = if let Ok(prefix_caching) = std::env::var("USE_PREFIX_CACHING") {
matches!(prefix_caching.as_str(), "true" | "1")
} else {
false
};
let attention = if let Ok(attention) = std::env::var("ATTENTION") {
attention
.parse()
.unwrap_or_else(|_| panic!("Invalid attention was specified :`{attention}`"))
} else if prefix_caching {
Attention::FlashInfer
} else {
Attention::Paged
};
let block_size = if attention == Attention::FlashDecoding {
256
} else if attention == Attention::FlashInfer {
1
} else {
16
};
let prefix_caching =
std::env::var("USE_PREFIX_CACHING").expect("Expect prefix caching env var");
let prefix_caching = matches!(prefix_caching.as_str(), "true" | "1");
let attention: String = std::env::var("ATTENTION").expect("attention env var");
let attention: Attention = attention
.parse()
.unwrap_or_else(|_| panic!("Invalid attention was specified :`{attention}`"));
let block_size = attention.block_size();
let queue = Queue::new(
requires_padding,
......
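For reference, a stand-alone sketch of the block-size mapping that the removed inline branch encoded; the enum and `block_size()` helper below are illustrative stand-ins, assuming the new `Attention::block_size()` mirrors the old values (FlashInfer 1, FlashDecoding 256, Paged 16), not the crate's actual types.

// Hypothetical stand-in for the block-size mapping implied by the removed branch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Attention {
    Paged,
    FlashDecoding,
    FlashInfer,
}

impl Attention {
    fn block_size(&self) -> u32 {
        match self {
            Attention::FlashDecoding => 256,
            Attention::FlashInfer => 1,
            Attention::Paged => 16,
        }
    }
}

fn main() {
    assert_eq!(Attention::FlashInfer.block_size(), 1);
    assert_eq!(Attention::FlashDecoding.block_size(), 256);
    assert_eq!(Attention::Paged.block_size(), 16);
}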
use std::{cmp::min, sync::Arc};
use std::sync::Arc;
use tokio::sync::{mpsc, oneshot};
use crate::radix::RadixAllocator;
......@@ -137,7 +137,6 @@ pub trait Allocator {
fn free(&mut self, blocks: Vec<u32>, allocation_id: u64);
}
pub struct SimpleAllocator {
free_blocks: Vec<u32>,
block_size: u32,
......@@ -167,7 +166,7 @@ impl Allocator for SimpleAllocator {
None => (tokens, 1),
Some(window_size) => {
let repeats = (tokens + window_size - 1) / window_size;
let tokens = min(tokens, window_size);
let tokens = core::cmp::min(tokens, window_size);
(tokens, repeats as usize)
}
};
......
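A minimal sketch of the sliding-window math used by SimpleAllocator above; the helper name and the example numbers are illustrative, not taken from a real request.

// Same ceiling-division logic as the allocate path above, extracted for illustration.
fn window_repeats(tokens: u32, window_size: Option<u32>) -> (u32, usize) {
    match window_size {
        None => (tokens, 1),
        Some(window_size) => {
            // Ceiling division: how many times the window has to be repeated.
            let repeats = (tokens + window_size - 1) / window_size;
            let tokens = core::cmp::min(tokens, window_size);
            (tokens, repeats as usize)
        }
    }
}

fn main() {
    // 1000 tokens with a 256-token sliding window: 256 tokens per repeat, 4 repeats.
    assert_eq!(window_repeats(1000, Some(256)), (256, 4));
    // No window: everything fits in a single stretch.
    assert_eq!(window_repeats(1000, None), (1000, 1));
}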
......@@ -149,6 +149,7 @@ impl Client {
requests.push(Request {
id: 0,
inputs,
add_special_tokens: true,
input_chunks: Some(Input {
chunks: input_chunks,
}),
......
......@@ -222,6 +222,7 @@ impl Health for ShardedClient {
chunks: vec![Chunk::Text("liveness".into()).into()],
}),
truncate: 10,
add_special_tokens: true,
prefill_logprobs: false,
parameters: Some(NextTokenChooserParameters {
temperature: 1.0,
......
......@@ -383,6 +383,7 @@ impl State {
}),
inputs: entry.request.inputs.chunks_to_string(),
truncate: entry.request.truncate,
add_special_tokens: entry.request.add_special_tokens,
parameters: Some(NextTokenChooserParameters::from(
entry.request.parameters.clone(),
)),
......@@ -517,6 +518,7 @@ mod tests {
inputs: vec![],
input_ids: Some(Arc::new(vec![])),
input_length: 0,
add_special_tokens: true,
truncate: 0,
decoder_input_details: false,
parameters: ValidParameters {
......
use crate::block_allocator::{Allocator, BlockAllocation};
use slotmap::{DefaultKey, SlotMap};
use std::{
collections::{BTreeSet, HashMap},
sync::Arc,
};
use slotmap::{DefaultKey, SlotMap};
use crate::block_allocator::{Allocator, BlockAllocation};
pub struct RadixAllocator {
allocation_id: u64,
......@@ -16,26 +14,26 @@ pub struct RadixAllocator {
/// Blocks that are immediately available for allocation.
free_blocks: Vec<u32>,
#[allow(dead_code)]
// This isn't used because the prefix needs to match without the windowing
// mechanism. At worst this over-allocates, it is not necessarily wrong.
window_size: Option<u32>,
block_size: u32,
}
impl RadixAllocator {
pub fn new(block_size: u32, n_blocks: u32, window_size: Option<u32>) -> Self {
assert_eq!(
block_size, 1,
"Radix tree allocator only works with block_size=1, was: {}",
block_size
);
if window_size.is_some() {
unimplemented!("Window size not supported in the prefix-caching block allocator yet");
}
RadixAllocator {
allocation_id: 0,
allocations: HashMap::new(),
cache_blocks: RadixTrie::new(),
cache_blocks: RadixTrie::new(block_size as usize),
// Block 0 is reserved for health checks.
free_blocks: (1..n_blocks).collect(),
window_size,
block_size,
}
}
......@@ -63,6 +61,7 @@ impl RadixAllocator {
}
}
// Allocator trait
impl Allocator for RadixAllocator {
fn allocate(
&mut self,
......@@ -86,10 +85,12 @@ impl Allocator for RadixAllocator {
.incref(prefix_node)
.expect("Failed to increment refcount");
let prefix_len = blocks.len();
let prefix_len = blocks.len() * self.block_size as usize;
let suffix_len = tokens - prefix_len as u32;
match self.alloc_or_reclaim(suffix_len as usize) {
let suffix_blocks = (suffix_len + self.block_size - 1) / self.block_size;
match self.alloc_or_reclaim(suffix_blocks as usize) {
Some(suffix_blocks) => blocks.extend(suffix_blocks),
None => {
self.cache_blocks
......@@ -100,7 +101,20 @@ impl Allocator for RadixAllocator {
}
// 1:1 mapping of blocks and slots.
let slots = blocks.clone();
let slots = if self.block_size == 1 {
blocks.clone()
} else {
let mut slots = Vec::with_capacity(blocks.len() * self.block_size as usize);
'slots: for block_id in &blocks {
for s in (block_id * self.block_size)..((block_id + 1) * self.block_size) {
slots.push(s);
if slots.len() as u32 == tokens {
break 'slots;
}
}
}
slots
};
let allocation = RadixAllocation {
prefix_node,
......@@ -108,6 +122,8 @@ impl Allocator for RadixAllocator {
prefill_tokens: prefill_tokens.clone(),
};
tracing::debug!("Blocks {blocks:?}");
self.allocation_id += 1;
self.allocations.insert(self.allocation_id, allocation);
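A stand-alone sketch of the block-to-slot expansion introduced above: each block id fans out into `block_size` consecutive slots, truncated at the token count. The function name is illustrative; the values match the `allocator_block_size_non_aligned` test further down.

// Illustrative extraction of the slot computation added in the allocate path.
fn blocks_to_slots(blocks: &[u32], block_size: u32, tokens: u32) -> Vec<u32> {
    if block_size == 1 {
        return blocks.to_vec();
    }
    let mut slots = Vec::with_capacity(blocks.len() * block_size as usize);
    'slots: for block_id in blocks {
        for s in (block_id * block_size)..((block_id + 1) * block_size) {
            slots.push(s);
            if slots.len() as u32 == tokens {
                break 'slots;
            }
        }
    }
    slots
}

fn main() {
    // Blocks [8, 9, 10, 11] with block_size = 2 and 7 tokens -> slots 16..=22.
    assert_eq!(
        blocks_to_slots(&[8, 9, 10, 11], 2, 7),
        vec![16, 17, 18, 19, 20, 21, 22]
    );
}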
......@@ -136,27 +152,38 @@ impl Allocator for RadixAllocator {
// If there are prefill tokens that did not come from the cache,
// add them to the cache.
if prefill_tokens.len() > allocation.cached_prefix_len {
let prefix_len = self
.cache_blocks
.insert(prefill_tokens, &blocks[..prefill_tokens.len()])
// Unwrap, failing is a programming error.
.expect("Failed to store prefill tokens");
// We can have a prefill with the following structure:
//
// |---| From the prefix cache.
// A B C D E F G
//|--------| Found in the trie during insertion.
//
// This means that while processing this request there was a
// partially overlapping request that had A..=E in its
// prefill. In this case we need to free the blocks D E.
self.free_blocks
.extend(&blocks[allocation.cached_prefix_len..prefix_len]);
let aligned =
(prefill_tokens.len() / self.block_size as usize) * self.block_size as usize;
if aligned > 0 {
let prefix_len = self
.cache_blocks
.insert(
&prefill_tokens[..aligned],
&blocks[..aligned / self.block_size as usize],
)
// Unwrap, failing is a programming error.
.expect("Failed to store prefill tokens");
// We can have a prefill with the following structure:
//
// |---| From the prefix cache.
// A B C D E F G
//|--------| Found in the trie during insertion.
//
// This means that while processing this request there was a
// partially overlapping request that had A..=E in its
// prefill. In this case we need to free the blocks D E.
if prefix_len > allocation.cached_prefix_len {
self.free_blocks.extend(
&blocks[allocation.cached_prefix_len / self.block_size as usize
..prefix_len / self.block_size as usize],
);
}
}
}
// Free non-prefill blocks.
self.free_blocks.extend(&blocks[prefill_tokens.len()..]);
self.free_blocks
.extend(&blocks[prefill_tokens.len() / self.block_size as usize..]);
} else {
self.free_blocks.extend(blocks);
}
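Illustrative arithmetic for the block-aligned free path above, assuming block_size = 2, 7 prefill tokens and 2 of them already served from the prefix cache; the concrete numbers are made up for the example.

fn main() {
    let block_size: usize = 2;
    let prefill_tokens: usize = 7;
    let cached_prefix_len: usize = 2;

    // Only whole blocks of prefill tokens can be stored in the trie.
    let aligned = (prefill_tokens / block_size) * block_size;
    assert_eq!(aligned, 6);

    // Suppose the trie already contained the first 4 tokens when we inserted:
    // the blocks covering tokens 2..4 become redundant and are freed.
    let prefix_len: usize = 4;
    let first_free = cached_prefix_len / block_size; // 1
    let last_free = prefix_len / block_size; // 2
    assert_eq!(last_free - first_free, 1); // exactly one block goes back to the free list
}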
......@@ -204,17 +231,14 @@ pub struct RadixTrie {
/// Time as a monotonically increasing counter to avoid the system
/// call that a real time lookup would require.
time: u64,
}
impl Default for RadixTrie {
fn default() -> Self {
Self::new()
}
/// All blocks need to be aligned with this
block_size: usize,
}
impl RadixTrie {
/// Construct a new radix trie.
pub fn new() -> Self {
pub fn new(block_size: usize) -> Self {
let root = TrieNode::new(vec![], vec![], 0, None);
let mut nodes = SlotMap::new();
let root = nodes.insert(root);
......@@ -223,13 +247,14 @@ impl RadixTrie {
nodes,
root,
time: 0,
block_size,
}
}
/// Find the prefix of the given tokens.
///
/// The blocks corresponding to the part of the prefix that could be found
/// are writteng to `blocks`. The number of blocks is in `0..=tokens.len()`.
/// are written to `blocks`. The number of blocks is in `0..=tokens.len()`.
/// Returns the identifier of the trie node that contains the longest
/// prefix. The node identifier can be used by callers to e.g. increase its
/// reference count.
......@@ -247,8 +272,9 @@ impl RadixTrie {
if let Some(&child_id) = node.children.get(&key[0]) {
self.update_access_time(child_id);
let child = self.nodes.get(child_id).expect("Invalid child identifier");
let shared_prefix_len = child.key.shared_prefix_len(key);
blocks.extend(&child.blocks[..shared_prefix_len]);
let shared_prefix_len = shared_prefix(&child.key, key, self.block_size);
assert_eq!(shared_prefix_len % self.block_size, 0);
blocks.extend(&child.blocks[..shared_prefix_len / self.block_size]);
let key = &key[shared_prefix_len..];
if !key.is_empty() {
......@@ -349,7 +375,8 @@ impl RadixTrie {
/// the first 10 elements of the tree **the blocks are not updated**.
pub fn insert(&mut self, tokens: &[u32], blocks: &[u32]) -> Result<usize, TrieError> {
self.time += 1;
self.insert_(self.root, tokens, blocks)
let common = self.insert_(self.root, tokens, blocks)?;
Ok(common)
}
/// Insertion worker.
......@@ -363,7 +390,7 @@ impl RadixTrie {
// the part of the prefix that is already in the trie to detect
// mismatches.
if tokens.len() != blocks.len() {
if tokens.len() != blocks.len() * self.block_size {
return Err(TrieError::BlockTokenCountMismatch);
}
......@@ -374,10 +401,10 @@ impl RadixTrie {
.get_mut(child_id)
// Unwrap here, since failure is a bug.
.expect("Child node does not exist");
let shared_prefix_len = child.key.shared_prefix_len(tokens);
let shared_prefix_len = shared_prefix(&child.key, tokens, self.block_size);
// We are done, the prefix is already in the trie.
if shared_prefix_len == tokens.len() {
if shared_prefix_len == tokens.len() || shared_prefix_len == 0 {
return Ok(shared_prefix_len);
}
......@@ -387,7 +414,7 @@ impl RadixTrie {
+ self.insert_(
child_id,
&tokens[shared_prefix_len..],
&blocks[shared_prefix_len..],
&blocks[shared_prefix_len / self.block_size..],
)?);
}
......@@ -396,7 +423,7 @@ impl RadixTrie {
// remainder of the prefix into the node again
let child_id = self.split_node(child_id, shared_prefix_len);
let key = &tokens[shared_prefix_len..];
let blocks = &blocks[shared_prefix_len..];
let blocks = &blocks[shared_prefix_len / self.block_size..];
Ok(shared_prefix_len + self.insert_(child_id, key, blocks)?)
} else {
self.add_node(node_id, tokens, blocks);
......@@ -550,34 +577,53 @@ impl TrieNode {
}
}
/// Helper trait to get the length of the shared prefix of two sequences.
trait SharedPrefixLen {
fn shared_prefix_len(&self, other: &Self) -> usize;
}
impl<T> SharedPrefixLen for [T]
where
T: PartialEq,
{
fn shared_prefix_len(&self, other: &Self) -> usize {
self.iter().zip(other).take_while(|(a, b)| a == b).count()
}
fn shared_prefix(left: &[u32], right: &[u32], block_size: usize) -> usize {
let full = left.iter().zip(right).take_while(|(a, b)| a == b).count();
(full / block_size) * block_size
}
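A small usage sketch of `shared_prefix` above (the function body is copied verbatim so the example is self-contained): the raw element-wise match is rounded down to a multiple of the block size.

fn shared_prefix(left: &[u32], right: &[u32], block_size: usize) -> usize {
    let full = left.iter().zip(right).take_while(|(a, b)| a == b).count();
    (full / block_size) * block_size
}

fn main() {
    // Three tokens match, but with block_size = 2 only one full block is shared.
    assert_eq!(shared_prefix(&[0, 1, 2, 3], &[0, 1, 2, 9], 2), 2);
    // With block_size = 1 the full match is kept.
    assert_eq!(shared_prefix(&[0, 1, 2, 3], &[0, 1, 2, 9], 1), 3);
}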
#[cfg(test)]
mod tests {
use std::sync::Arc;
use crate::block_allocator::Allocator;
use super::*;
use super::RadixAllocator;
#[test]
fn allocator_block_size() {
let mut cache = RadixAllocator::new(2, 12, None);
let allocation = cache.allocate(8, Some(Arc::new(vec![0, 1, 2, 3]))).unwrap();
assert_eq!(allocation.blocks, vec![8, 9, 10, 11]);
assert_eq!(allocation.slots, vec![16, 17, 18, 19, 20, 21, 22, 23]);
assert_eq!(allocation.prefix_len, 0);
cache.free(allocation.blocks.clone(), allocation.allocation_id);
let allocation = cache.allocate(8, Some(Arc::new(vec![0, 1, 2, 3]))).unwrap();
assert_eq!(allocation.blocks, vec![8, 9, 10, 11]);
assert_eq!(allocation.slots, vec![16, 17, 18, 19, 20, 21, 22, 23]);
assert_eq!(allocation.prefix_len, 4);
}
#[test]
fn allocator_block_size_non_aligned() {
let mut cache = RadixAllocator::new(2, 12, None);
let allocation = cache.allocate(7, Some(Arc::new(vec![0, 1, 2]))).unwrap();
assert_eq!(allocation.blocks, vec![8, 9, 10, 11]);
assert_eq!(allocation.slots, vec![16, 17, 18, 19, 20, 21, 22]);
assert_eq!(allocation.prefix_len, 0);
cache.free(allocation.blocks.clone(), allocation.allocation_id);
let allocation = cache.allocate(7, Some(Arc::new(vec![0, 1, 2]))).unwrap();
assert_eq!(allocation.blocks, vec![8, 9, 10, 11]);
assert_eq!(allocation.slots, vec![16, 17, 18, 19, 20, 21, 22]);
assert_eq!(allocation.prefix_len, 2);
}
#[test]
fn allocator_reuses_prefixes() {
let mut cache = RadixAllocator::new(1, 12, None);
let allocation = cache.allocate(8, Some(Arc::new(vec![0, 1, 2, 3]))).unwrap();
assert_eq!(allocation.blocks, vec![4, 5, 6, 7, 8, 9, 10, 11]);
assert_eq!(allocation.slots, allocation.slots);
assert_eq!(allocation.blocks, allocation.slots);
assert_eq!(allocation.prefix_len, 0);
cache.free(allocation.blocks.clone(), allocation.allocation_id);
......@@ -666,7 +712,7 @@ mod tests {
#[test]
fn trie_insertions_have_correct_prefix_len() {
let mut trie = super::RadixTrie::new();
let mut trie = RadixTrie::new(1);
assert_eq!(trie.insert(&[0, 1, 2], &[0, 1, 2]).unwrap(), 0);
......@@ -687,9 +733,33 @@ mod tests {
);
}
#[test]
fn trie_insertions_block_size() {
let mut trie = RadixTrie::new(2);
assert_eq!(trie.insert(&[0, 1, 2, 3], &[0, 1]).unwrap(), 0);
// Already exists.
// But needs to be block_size aligned
assert_eq!(trie.insert(&[0, 1, 2, 3], &[0, 1]).unwrap(), 4);
// Completely new at root-level
assert_eq!(trie.insert(&[1, 2, 3, 4], &[1, 2]).unwrap(), 0);
// Contains full prefix, but longer.
assert_eq!(trie.insert(&[0, 1, 2, 3, 4, 5], &[0, 1, 2]).unwrap(), 4);
// Shares partial prefix, we need a split.
assert_eq!(
trie.insert(&[0, 1, 3, 4, 5, 6, 7, 8], &[0, 1, 2, 3])
.unwrap(),
2
);
}
#[test]
fn trie_get_returns_correct_blocks() {
let mut trie = super::RadixTrie::new();
let mut trie = RadixTrie::new(1);
trie.insert(&[0, 1, 2], &[0, 1, 2]).unwrap();
trie.insert(&[1, 2, 3], &[1, 2, 3]).unwrap();
trie.insert(&[0, 1, 2, 3, 4], &[0, 1, 2, 3, 4]).unwrap();
......@@ -723,7 +793,7 @@ mod tests {
#[test]
fn trie_evict_removes_correct_blocks() {
let mut trie = super::RadixTrie::new();
let mut trie = RadixTrie::new(1);
trie.insert(&[0, 1, 2], &[0, 1, 2]).unwrap();
trie.insert(&[0, 1, 2, 3, 5, 6, 7], &[0, 1, 2, 3, 5, 6, 7])
.unwrap();
......
......@@ -148,6 +148,7 @@ async fn prefill(
}),
inputs: sequence.clone(),
truncate: sequence_length,
add_special_tokens: true,
parameters: Some(parameters.clone()),
stopping_parameters: Some(StoppingCriteriaParameters {
max_new_tokens: decode_length,
......
......@@ -835,11 +835,11 @@
]
},
"locked": {
"lastModified": 1724206841,
"narHash": "sha256-L8dKaX4T3k+TR2fEHCfGbH4UXdspovz/pj87iai9qmc=",
"lastModified": 1724638882,
"narHash": "sha256-ap2jIQi/FuUHR6HCht6ASWhoz8EiB99XmI8Esot38VE=",
"owner": "oxalica",
"repo": "rust-overlay",
"rev": "45e98fbd62c32e5927e952d2833fa1ba4fb35a61",
"rev": "19b70f147b9c67a759e35824b241f1ed92e46694",
"type": "github"
},
"original": {
......
......@@ -5,7 +5,7 @@
"index": 0,
"logprobs": null,
"message": {
"content": "As of your last question, the weather in Brooklyn, New York, is typically hot and humid throughout the year. The suburbs around New York City are jealously sheltered, and at least in the Lower Bronx, there are very few outdoor environments to explore in the middle of urban confines. In fact, typical times for humidity levels in Brooklyn include:\n\n- Early morning: 80-85% humidity, with occas",
"content": "As of your last question, the weather in Brooklyn, New York, is typically hot and humid throughout the year. The suburbs around New York City are jealously sheltered, and at least in the Lower Bronx, there are very few outdoor environments to appreciate nature.\n\nIn terms of temperature, the warmest times of the year are from June to August, when average high temperatures typically range from around 73°F or 23°C",
"name": null,
"role": "assistant",
"tool_calls": null
......@@ -13,14 +13,14 @@
"usage": null
}
],
"created": 1716553098,
"created": 1724792495,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.5-dev0-native",
"object": "chat.completion",
"system_fingerprint": "2.2.1-dev0-native",
"usage": {
"completion_tokens": 100,
"prompt_tokens": 62,
"total_tokens": 162
"prompt_tokens": 61,
"total_tokens": 161
}
}
......@@ -8,11 +8,11 @@
"text": "\n"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -23,11 +23,11 @@
"text": "\n"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -38,11 +38,11 @@
"text": "\n"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -53,11 +53,11 @@
"text": "hd"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -68,11 +68,11 @@
"text": "\n"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -83,11 +83,11 @@
"text": "\n"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -98,11 +98,11 @@
"text": "\n"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -113,11 +113,11 @@
"text": "aho"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -128,11 +128,11 @@
"text": "2"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -143,11 +143,11 @@
"text": "2"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -158,11 +158,11 @@
"text": "2"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -173,11 +173,11 @@
"text": "ima"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -188,11 +188,11 @@
"text": "."
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -203,11 +203,11 @@
"text": "."
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -218,11 +218,11 @@
"text": "."
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -233,11 +233,11 @@
"text": "\n"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -248,11 +248,11 @@
"text": " Sarah"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -263,11 +263,11 @@
"text": " Yes"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -278,11 +278,11 @@
"text": " And"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -293,11 +293,11 @@
"text": "i"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -308,11 +308,11 @@
"text": "'"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -323,11 +323,11 @@
"text": ","
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -338,11 +338,11 @@
"text": " what"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -353,11 +353,11 @@
"text": "'"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -368,11 +368,11 @@
"text": "s"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -383,11 +383,11 @@
"text": " Moh"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -398,11 +398,11 @@
"text": " is"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -413,11 +413,11 @@
"text": "m"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -428,11 +428,11 @@
"text": " Room"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -443,11 +443,11 @@
"text": "s"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -458,11 +458,11 @@
"text": " the"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -473,11 +473,11 @@
"text": " tired"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -488,11 +488,11 @@
"text": ":"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -503,11 +503,11 @@
"text": "'"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -518,11 +518,11 @@
"text": " capital"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
......@@ -530,73 +530,73 @@
"finish_reason": "",
"index": 3,
"logprobs": null,
"text": " of"
"text": ","
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
{
"finish_reason": "",
"finish_reason": "length",
"index": 0,
"logprobs": null,
"text": " She"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
{
"finish_reason": "",
"finish_reason": "length",
"index": 1,
"logprobs": null,
"text": " scale"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
{
"finish_reason": "",
"finish_reason": "length",
"index": 2,
"logprobs": null,
"text": " of"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
},
{
"choices": [
{
"finish_reason": "",
"finish_reason": "length",
"index": 3,
"logprobs": null,
"text": " being"
"text": " its"
}
],
"created": 1713284431,
"created": 1724833943,
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "2.0.1-native"
"system_fingerprint": "2.2.1-dev0-native"
}
]
......@@ -16,7 +16,7 @@
},
{
"id": 3102,
"logprob": -11.1875,
"logprob": -11.25,
"text": " request"
}
],
......@@ -24,66 +24,66 @@
"tokens": [
{
"id": 185,
"logprob": -1.5546875,
"logprob": -1.546875,
"special": false,
"text": "\n"
},
{
"id": 549,
"logprob": -2.84375,
"logprob": -2.859375,
"special": false,
"text": "The"
},
{
"id": 1727,
"logprob": -2.34375,
"logprob": -2.484375,
"special": false,
"text": " test"
},
{
"id": 3102,
"logprob": -0.8359375,
"logprob": -0.83203125,
"special": false,
"text": " request"
},
{
"id": 317,
"logprob": -1.0859375,
"logprob": -1.1484375,
"special": false,
"text": " is"
},
{
"id": 254,
"logprob": -1.5390625,
"id": 245,
"logprob": -1.578125,
"special": false,
"text": " the"
"text": " a"
},
{
"id": 1022,
"logprob": -1.1875,
"id": 3412,
"logprob": -2.578125,
"special": false,
"text": " first"
"text": " document"
},
{
"id": 3458,
"logprob": -0.35546875,
"id": 344,
"logprob": -1.125,
"special": false,
"text": " step"
"text": " that"
},
{
"id": 279,
"logprob": -0.8828125,
"id": 317,
"logprob": -1.6953125,
"special": false,
"text": " in"
"text": " is"
},
{
"id": 254,
"logprob": -0.71484375,
"id": 1222,
"logprob": -1.71875,
"special": false,
"text": " the"
"text": " used"
}
],
"top_tokens": null
},
"generated_text": "\nThe test request is the first step in the"
"generated_text": "\nThe test request is a document that is used"
}
......@@ -37,56 +37,56 @@
},
{
"id": 1727,
"logprob": -2.359375,
"logprob": -2.4375,
"special": false,
"text": " test"
},
{
"id": 3102,
"logprob": -0.83203125,
"logprob": -0.83984375,
"special": false,
"text": " request"
},
{
"id": 317,
"logprob": -1.125,
"logprob": -1.1328125,
"special": false,
"text": " is"
},
{
"id": 245,
"logprob": -1.5703125,
"id": 254,
"logprob": -1.515625,
"special": false,
"text": " a"
"text": " the"
},
{
"id": 3412,
"logprob": -2.578125,
"id": 1022,
"logprob": -1.15625,
"special": false,
"text": " document"
"text": " first"
},
{
"id": 344,
"logprob": -1.125,
"id": 3458,
"logprob": -0.3671875,
"special": false,
"text": " that"
"text": " step"
},
{
"id": 317,
"logprob": -1.6953125,
"id": 279,
"logprob": -0.88671875,
"special": false,
"text": " is"
"text": " in"
},
{
"id": 1222,
"logprob": -1.75,
"id": 254,
"logprob": -0.69140625,
"special": false,
"text": " used"
"text": " the"
}
],
"top_tokens": null
},
"generated_text": "\nThe test request is a document that is used"
"generated_text": "\nThe test request is the first step in the"
},
{
"details": {
......@@ -126,56 +126,56 @@
},
{
"id": 1727,
"logprob": -2.359375,
"logprob": -2.4375,
"special": false,
"text": " test"
},
{
"id": 3102,
"logprob": -0.83203125,
"logprob": -0.83984375,
"special": false,
"text": " request"
},
{
"id": 317,
"logprob": -1.125,
"logprob": -1.1328125,
"special": false,
"text": " is"
},
{
"id": 245,
"logprob": -1.5703125,
"id": 254,
"logprob": -1.515625,
"special": false,
"text": " a"
"text": " the"
},
{
"id": 3412,
"logprob": -2.578125,
"id": 1022,
"logprob": -1.15625,
"special": false,
"text": " document"
"text": " first"
},
{
"id": 344,
"logprob": -1.125,
"id": 3458,
"logprob": -0.3671875,
"special": false,
"text": " that"
"text": " step"
},
{
"id": 317,
"logprob": -1.6953125,
"id": 279,
"logprob": -0.88671875,
"special": false,
"text": " is"
"text": " in"
},
{
"id": 1222,
"logprob": -1.75,
"id": 254,
"logprob": -0.69140625,
"special": false,
"text": " used"
"text": " the"
}
],
"top_tokens": null
},
"generated_text": "\nThe test request is a document that is used"
"generated_text": "\nThe test request is the first step in the"
},
{
"details": {
......@@ -215,56 +215,56 @@
},
{
"id": 1727,
"logprob": -2.359375,
"logprob": -2.4375,
"special": false,
"text": " test"
},
{
"id": 3102,
"logprob": -0.83203125,
"logprob": -0.83984375,
"special": false,
"text": " request"
},
{
"id": 317,
"logprob": -1.125,
"logprob": -1.1328125,
"special": false,
"text": " is"
},
{
"id": 245,
"logprob": -1.5703125,
"id": 254,
"logprob": -1.515625,
"special": false,
"text": " a"
"text": " the"
},
{
"id": 3412,
"logprob": -2.578125,
"id": 1022,
"logprob": -1.15625,
"special": false,
"text": " document"
"text": " first"
},
{
"id": 344,
"logprob": -1.125,
"id": 3458,
"logprob": -0.3671875,
"special": false,
"text": " that"
"text": " step"
},
{
"id": 317,
"logprob": -1.6953125,
"id": 279,
"logprob": -0.88671875,
"special": false,
"text": " is"
"text": " in"
},
{
"id": 1222,
"logprob": -1.75,
"id": 254,
"logprob": -0.69140625,
"special": false,
"text": " used"
"text": " the"
}
],
"top_tokens": null
},
"generated_text": "\nThe test request is a document that is used"
"generated_text": "\nThe test request is the first step in the"
},
{
"details": {
......@@ -304,55 +304,55 @@
},
{
"id": 1727,
"logprob": -2.359375,
"logprob": -2.4375,
"special": false,
"text": " test"
},
{
"id": 3102,
"logprob": -0.83203125,
"logprob": -0.83984375,
"special": false,
"text": " request"
},
{
"id": 317,
"logprob": -1.125,
"logprob": -1.1328125,
"special": false,
"text": " is"
},
{
"id": 245,
"logprob": -1.5703125,
"id": 254,
"logprob": -1.515625,
"special": false,
"text": " a"
"text": " the"
},
{
"id": 3412,
"logprob": -2.578125,
"id": 1022,
"logprob": -1.15625,
"special": false,
"text": " document"
"text": " first"
},
{
"id": 344,
"logprob": -1.125,
"id": 3458,
"logprob": -0.3671875,
"special": false,
"text": " that"
"text": " step"
},
{
"id": 317,
"logprob": -1.6953125,
"id": 279,
"logprob": -0.88671875,
"special": false,
"text": " is"
"text": " in"
},
{
"id": 1222,
"logprob": -1.75,
"id": 254,
"logprob": -0.69140625,
"special": false,
"text": " used"
"text": " the"
}
],
"top_tokens": null
},
"generated_text": "\nThe test request is a document that is used"
"generated_text": "\nThe test request is the first step in the"
}
]
{
"details": {
"best_of_sequences": null,
"finish_reason": "length",
"generated_tokens": 10,
"finish_reason": "stop_sequence",
"generated_tokens": 5,
"prefill": [
{
"id": 128000,
......@@ -16,7 +16,7 @@
},
{
"id": 1715,
"logprob": -10.375,
"logprob": -10.4375,
"text": " request"
}
],
......@@ -29,61 +29,31 @@
"text": ":"
},
{
"id": 2209,
"logprob": -2.78125,
"id": 923,
"logprob": -2.84375,
"special": false,
"text": " Is"
"text": " add"
},
{
"id": 279,
"logprob": -0.6328125,
"id": 264,
"logprob": 0.0,
"special": false,
"text": " the"
},
{
"id": 734,
"logprob": -2.703125,
"special": false,
"text": " function"
"text": " a"
},
{
"id": 330,
"logprob": -0.34179688,
"logprob": -0.31640625,
"special": false,
"text": " \""
},
{
"id": 4110,
"logprob": -2.359375,
"special": false,
"text": "Create"
},
{
"id": 7575,
"logprob": -2.1875,
"special": false,
"text": "Process"
},
{
"id": 1,
"logprob": -0.07910156,
"special": false,
"text": "\""
},
{
"id": 304,
"logprob": -0.83203125,
"special": false,
"text": " in"
},
{
"id": 12468,
"logprob": -1.8203125,
"id": 1985,
"logprob": 0.0,
"special": false,
"text": " Win"
"text": "test"
}
],
"top_tokens": null
},
"generated_text": "Test request: Is the function \"CreateProcess\" in Win"
"generated_text": "Test request: add a \"test"
}