1. 10 Oct, 2024 1 commit
  2. 08 Oct, 2024 1 commit
  3. 04 Oct, 2024 1 commit
    • Daniël de Kok's avatar
      Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. The support is
      enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
      uses this type for the KV cache. However support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
      2358c2bb
  4. 03 Oct, 2024 2 commits
  5. 02 Oct, 2024 3 commits
    • drbh's avatar
      Unroll notify error into generate response (#2597) · d22b0c1f
      drbh authored
      * feat: unroll notify_error if no tool is choosen
      
      * fix: expect simple message when no tool is selected
      
      * fix: improve test to avoid notify_error
      
      * fix: improve docs and indicate change in expected response
      
      * fix: adjust linting in test file
      d22b0c1f
    • drbh's avatar
      CI (2592): Allow LoRA adapter revision in server launcher (#2602) · 23354595
      drbh authored
      
      
      allow revision for lora adapters from launcher
      Co-authored-by: default avatarSida <sida@kulamind.com>
      Co-authored-by: default avatarteamclouday <teamclouday@gmail.com>
      23354595
    • Nicolas Patry's avatar
      Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Ugrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integrations tests for mllama (cutting to 10 tokens because there seems'
      to be instability after (meaning size of the batch matters.
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
      d18ed5cf
  6. 30 Sep, 2024 3 commits
    • drbh's avatar
      feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: default avatarDaniël de Kok <me@danieldk.eu>
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      93a7042d
    • Mohit Sharma's avatar
      Update ROCM libs and improvements (#2579) · f9e561ec
      Mohit Sharma authored
      * style
      
      * update torch
      
      * ix issues
      
      * fix clone
      
      * revert mkl
      
      * added custom PA
      
      * style
      
      * fix style
      
      * style
      
      * hide env vart
      
      * fix mixtral model
      
      * add skinny kernel and merge fixes
      
      * fixed style
      
      * fix issue for sliding window models
      
      * addressed review comments
      
      * fix import
      
      * improved error messag
      
      * updated default value
      
      * remove import
      
      * fix imports after rebase
      
      * float16 dep
      
      * improve dockerfile
      
      * cleaned dockerfile
      f9e561ec
    • Ikram Ul Haq's avatar
      Update architecture.md (#2577) · e790cfc0
      Ikram Ul Haq authored
      e790cfc0
  7. 24 Sep, 2024 2 commits
  8. 20 Sep, 2024 1 commit
  9. 19 Sep, 2024 2 commits
  10. 06 Sep, 2024 1 commit
  11. 05 Sep, 2024 1 commit
  12. 29 Aug, 2024 2 commits
  13. 28 Aug, 2024 1 commit
  14. 27 Aug, 2024 1 commit
    • drbh's avatar
      Pr 2451 ci branch (#2454) · cfa73b5c
      drbh authored
      
      
      * fix[router]: Fix tools not passed in chat template
      Signed-off-by: default avatarGitHub <noreply@github.com>
      
      * feat: improve default tool serialization and lints
      
      * feat: refactor tool logic to include notify_error in prompt and adjust typing
      
      * fix: adjust non tool template apply
      
      * fix: simplify tool grammar logic and improve schema
      
      * feat: avoid skip tool test and avoid empty tool prompts
      
      * fix: increase test client timeout for grammar compilation tests
      
      ---------
      Signed-off-by: default avatarGitHub <noreply@github.com>
      Co-authored-by: default avatarSimone Rossi <simone.rossi.93@gmail.com>
      cfa73b5c
  15. 16 Aug, 2024 3 commits
  16. 12 Aug, 2024 1 commit
  17. 09 Aug, 2024 3 commits
  18. 08 Aug, 2024 2 commits
  19. 05 Aug, 2024 1 commit
  20. 31 Jul, 2024 2 commits
    • Erik Kaunismäki's avatar
      refactor usage stats (#2339) · 7451041e
      Erik Kaunismäki authored
      
      
      * refactor usage stats
      
      * Update docs/source/usage_statistics.md
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      
      * Update router/src/server.rs
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      
      * changes based on feedback
      
      * run python3 udpate_doc.py
      
      * fix pre-commit
      
      * Update router/src/server.rs
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      
      * delete option around usage stats arg
      
      ---------
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      7451041e
    • Nicolas Patry's avatar
      Rebase TRT-llm (#2331) · 2b19d671
      Nicolas Patry authored
      * wip
      
      wip
      
      refacto
      
      refacto
      
      Initial setup for CXX binding to TRTLLM
      
      Working FFI call for TGI and TRTLLM backend
      
      Remove unused parameters annd force tokenizer name to be set
      
      Overall build TRTLLM and deps through CMake build system
      
      Enable end to end CMake build
      
      First version loading engines and making it ready for inference
      
      Remembering to check how we can detect support for chunked context
      
      Move to latest TensorRT-LLM version
      
      Specify which default log level to use depending on CMake build type
      
      make leader executor mode working
      
      unconditionally call InitializeBackend on the FFI layer
      
      bind to CUDA::nvml to retrieve compute capabilities at runtime
      
      updated logic and comment to detect cuda compute capabilities
      
      implement the Stream method to send new tokens through a callback
      
      use spdlog release 1.14.1 moving forward
      
      update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
      
      correctly tell cmake to build dependent tensorrt-llm required libraries
      
      create cmake install target to put everything relevant in installation folder
      
      add auth_token CLI argument to provide hf hub authentification token
      
      allow converting huggingface::tokenizers error to TensorRtLlmBackendError
      
      use correct include for spdlog
      
      include guard to build example in cmakelists
      
      working setup of the ffi layer
      
      remove fmt import
      
      use external fmt lib
      
      end to end ffi flow working
      
      make sure to track include/ffi.h to trigger rebuild from cargo
      
      impl the rust backend which currently cannot move the actual computation in background thread
      
      expose shutdown function at ffi layer
      
      impl RwLock scenario for TensorRtLllmBackend
      
      oops missing c++ backend definitions
      
      compute the number of maximum new tokens for each request independently
      
      make sure the context is not dropped in the middle of the async decoding.
      
      remove unnecessary log
      
      add all the necessary plumbery to return the generated content
      
      update invalid doc in cpp file
      
      correctly forward back the log probabilities
      
      remove unneeded scope variable for now
      
      refactor Stream impl for Generation to factorise code
      
      expose the internal missing start/queue timestamp
      
      forward tgi parameters rep/freq penalty
      
      add some more validation about grammar not supported
      
      define a shared struct to hold the result of a decoding step
      
      expose information about potential error happening while decoding
      
      remove logging
      
      add logging in case of decoding error
      
      make sure executor_worker is provided
      
      add initial Dockerfile for TRTLLM backend
      
      add some more information in CMakeLists.txt to correctly install executorWorker
      
      add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
      
      simplify prebuilt trtllm libraries name definition
      
      do the same name definition stuff for tensorrt_llm_executor_static
      
      leverage pkg-config to probe libraries paths and reuse new install structure from cmake
      
      fix bad copy/past missing nvinfer linkage direction
      
      align all the linker search dependency
      
      add missing pkgconfig folder for MPI in Dockerfile
      
      correctly setup linking search path for runtime layer
      
      fix missing / before tgi lib path
      
      adding missing ld_library_path for cuda stubs in Dockerfile
      
      update tgi entrypoint
      
      commenting out Python part for TensorRT installation
      
      refactored docker image
      
      move to TensorRT-LLM v0.11.0
      
      make docker linter happy with same capitalization rule
      
      fix typo
      
      refactor the compute capabilities detection along with num gpus
      
      update TensorRT-LLM to latest version
      
      update TensorRT install script to latest
      
      update build.rs to link to cuda 12.5
      
      add missing dependant libraries for linking
      
      clean up a bit
      
      install to decoder_attention target
      
      add some custom stuff for nccl linkage
      
      fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
      
      use std::env::const::ARCH
      
      make sure variable live long enough...
      
      look for cuda 12.5
      
      add some more basic info in README.md
      
      * Rebase.
      
      * Fix autodocs.
      
      * Let's try to enable trtllm backend.
      
      * Ignore backends/v3 by default.
      
      * Fixing client.
      
      * Fix makefile + autodocs.
      
      * Updating the schema thing + redocly.
      
      * Fix trtllm lint.
      
      * Adding pb files ?
      
      * Remove cargo fmt temporarily.
      
      * ?
      
      * Tmp.
      
      * Remove both check + clippy  ?
      
      * Backporting telemetry.
      
      * Backporting 457fb0a1
      
      
      
      * Remove PB from git.
      
      * Fixing PB with default member backends/client
      
      * update TensorRT-LLM to latest version
      
      * provided None for api_key
      
      * link against libtensorrt_llm and not libtensorrt-llm
      
      ---------
      Co-authored-by: default avatarOlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
      Co-authored-by: default avatarMorgan Funtowicz <morgan@huggingface.co>
      2b19d671
  21. 29 Jul, 2024 1 commit
    • Erik Kaunismäki's avatar
      Run ci api key (#2315) · 583d37a2
      Erik Kaunismäki authored
      
      
      * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.
      
      * change name to info routes
      
      * Fix comment
      
      * convert strings to lowercase for case insensitive comparison
      
      * convert header to string
      
      * fixes and update docs
      
      * update docs again
      
      * revert wrong update
      
      ---------
      Co-authored-by: default avatarKevin Duffy <kevin.duffy94@gmail.com>
      583d37a2
  22. 23 Jul, 2024 1 commit
  23. 19 Jul, 2024 4 commits
    • Daniël de Kok's avatar
      Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant variations
      compared to other models:
      
      - Grouped top-K in expert selection.
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
        So, we need weight loads that supports quantized weights. To this
        end `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads with size 192, needs an extension to our paged attention
        fork and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
      e52be9bb
    • drbh's avatar
      fix: adjust default tool choice (#2244) · 68a9685f
      drbh authored
      * fix: adjust default tool choice
      
      * feat: improve tool choice syntax and response parsing/errors
      
      * fix: remove dev tests
      
      * feat: add ToolChoice to docs
      68a9685f
    • Erik Kaunismäki's avatar
      add usage stats to toctree (#2260) · 40f5dc3e
      Erik Kaunismäki authored
      quick fix
      40f5dc3e
    • Erik Kaunismäki's avatar
      usage stats and crash reports (#2220) · 4c19593a
      Erik Kaunismäki authored
      
      
      * draft of usage stats
      
      * fix wrong link
      
      * launcher doesn't need sysinfo dep
      
      * only tokenizer class instead of hole struct
      
      * unused import
      
      * fix clippy errors
      
      * update openAPI doc
      
      * cargo fmt
      
      * fix error in passing flags to router
      
      * try again to update docs
      
      * run pre-commit locally
      
      * Update router/src/main.rs
      Co-authored-by: default avatarHugo Larcher <hugo.larcher@huggingface.co>
      
      * Update router/src/main.rs
      Co-authored-by: default avatarHugo Larcher <hugo.larcher@huggingface.co>
      
      * on crash use anonymous error event
      
      * delete json_output and ngrok
      
      * more robust way of checking if is in container
      
      * more robust nvidia smi
      
      * parse xpu more robustly
      
      * fix errors
      
      * add nvidia-smi details in docs
      
      * cargo fmt
      
      * fix clippy
      
      * should make docs check pass
      
      * Update router/src/usage_stats.rs
      Co-authored-by: default avatarHugo Larcher <hugo.larcher@huggingface.co>
      
      * error reason can't be in nested json
      
      * cargo fmt
      
      ---------
      Co-authored-by: default avatarHugo Larcher <hugo.larcher@huggingface.co>
      Co-authored-by: default avatarErik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
      4c19593a