1. 19 Mar, 2025 4 commits
    • feat: enable LTO and codegen-units = 1 optimizations (#279) · af8ee9db
      Alexander Zaitsev authored
      #### Overview:
      
      This PR enables more aggressive compiler optimizations for the project, which should lead to better performance and smaller binaries.
      
      In this PR, I chose Fat LTO over ThinLTO because it provides a higher level of optimization.
      
      I ran some quick tests (AMD Ryzen 5900X, Fedora 41, Rust 1.85.1, the latest version of the project at the time, built with `cargo build --release`). Here are the resulting binary sizes:
      
      | Build mode \ Binary | dynamo-run | libdynamo_llm_capi.so | http | llmctl | metrics | mock_worker |
      | --- | --- | --- | --- | --- | --- | --- |
      | Release | 55 MiB | 14 MiB | 19 MiB | 14 MiB | 21 MiB | 14 MiB |
      | Release + `codegen-units = 1` + ThinLTO | 43 MiB | 11 MiB | 15 MiB | 11 MiB | 17 MiB | 11 MiB |
      | Release + `codegen-units = 1` + Fat LTO | 38 MiB | 9.2 MiB | 13 MiB | 9.6 MiB | 15 MiB | 9.6 MiB |
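
      To put these sizes in relative terms, the reduction for a binary is its plain release size minus its optimized size, divided by the plain release size. Worked for `dynamo-run` and `libdynamo_llm_capi.so` with Fat LTO, using the reported sizes (which are rounded, so the percentages are approximate):

      $$
      \Delta = \frac{S_{\text{release}} - S_{\text{opt}}}{S_{\text{release}}}, \qquad
      \Delta_{\texttt{dynamo-run}} = \frac{55 - 38}{55} \approx 30.9\%, \qquad
      \Delta_{\texttt{libdynamo\_llm\_capi.so}} = \frac{14 - 9.2}{14} \approx 34.3\%
      $$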
      
      #### Details:
      
      Enable `codegen-units = 1` and Fat LTO for better optimizations.
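
      A minimal sketch of the kind of release-profile settings described here, assuming they live in the workspace-root `Cargo.toml`; the actual diff in #279 may differ (for example, it could use `lto = true`, which Cargo also treats as fat LTO):

      ```toml
      # Hypothetical sketch; the exact contents of #279 may differ.
      [profile.release]
      # Fat LTO: optimize across the whole crate graph as a single unit
      lto = "fat"
      # One codegen unit per crate gives LLVM more room to optimize,
      # at the cost of longer, less parallel compile times
      codegen-units = 1
      ```

      Cargo only honors profile settings in the workspace-root manifest, so they belong there rather than in member crates. To reproduce the ThinLTO row of the table, swap in `lto = "thin"`.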
      
      #### Where should the reviewer start?
      
      Just check the `Cargo.toml` file ;)
      
      #### Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
      
      - closes GitHub issue: #278 
    • fix(mistralrs): Disable paged attention (#234) · fd95f37b
      Graham King authored
      Under load, it sometimes drops a request: the request is added to the batch (sequence) and immediately gets a FinishReason of Stop. The cause is not yet clear. This does not happen with the default scheduler (non-paged attention), so switch to that for now.
    • mohammedabdulwahhab
    • Graham King
  2. 18 Mar, 2025 20 commits
  3. 17 Mar, 2025 16 commits