1. 20 Sep, 2024 1 commit
  2. 19 Sep, 2024 2 commits
  3. 06 Sep, 2024 1 commit
  4. 05 Sep, 2024 1 commit
  5. 29 Aug, 2024 2 commits
  6. 28 Aug, 2024 1 commit
  7. 27 Aug, 2024 1 commit
    • Pr 2451 ci branch (#2454) · cfa73b5c
      drbh authored
      
      
      * fix[router]: Fix tools not passed in chat template
      Signed-off-by: GitHub <noreply@github.com>
      
      * feat: improve default tool serialization and lints
      
      * feat: refactor tool logic to include notify_error in prompt and adjust typing
      
      * fix: adjust non tool template apply
      
      * fix: simplify tool grammar logic and improve schema
      
      * feat: avoid skip tool test and avoid empty tool prompts
      
      * fix: increase test client timeout for grammar compilation tests
      
      ---------
      Signed-off-by: GitHub <noreply@github.com>
      Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
  8. 16 Aug, 2024 3 commits
  9. 12 Aug, 2024 1 commit
  10. 09 Aug, 2024 3 commits
  11. 08 Aug, 2024 2 commits
  12. 05 Aug, 2024 1 commit
  13. 31 Jul, 2024 2 commits
    • refactor usage stats (#2339) · 7451041e
      Erik Kaunismäki authored
      
      
      * refactor usage stats
      
      * Update docs/source/usage_statistics.md
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * Update router/src/server.rs
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * changes based on feedback
      
      * run python3 update_doc.py
      
      * fix pre-commit
      
      * Update router/src/server.rs
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * delete option around usage stats arg
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Rebase TRT-llm (#2331) · 2b19d671
      Nicolas Patry authored
      * wip
      
      wip
      
      refacto
      
      refacto
      
      Initial setup for CXX binding to TRTLLM
      
      Working FFI call for TGI and TRTLLM backend
      
      Remove unused parameters and force tokenizer name to be set
      
      Overall build TRTLLM and deps through CMake build system
      
      Enable end to end CMake build
      
      First version loading engines and making it ready for inference
      
      Remembering to check how we can detect support for chunked context
      
      Move to latest TensorRT-LLM version
      
      Specify which default log level to use depending on CMake build type
      
      make leader executor mode working
      
      unconditionally call InitializeBackend on the FFI layer
      
      bind to CUDA::nvml to retrieve compute capabilities at runtime
      
      updated logic and comment to detect cuda compute capabilities
      
      implement the Stream method to send new tokens through a callback
      
      use spdlog release 1.14.1 moving forward
      
      update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
      
      correctly tell cmake to build dependent tensorrt-llm required libraries
      
      create cmake install target to put everything relevant in installation folder
      
      add auth_token CLI argument to provide hf hub authentication token
      
      allow converting huggingface::tokenizers error to TensorRtLlmBackendError
      
      use correct include for spdlog
      
      include guard to build example in cmakelists
      
      working setup of the ffi layer
      
      remove fmt import
      
      use external fmt lib
      
      end to end ffi flow working
      
      make sure to track include/ffi.h to trigger rebuild from cargo
      
      impl the rust backend which currently cannot move the actual computation in background thread
      
      expose shutdown function at ffi layer
      
      impl RwLock scenario for TensorRtLllmBackend
      
      oops missing c++ backend definitions
      
      compute the number of maximum new tokens for each request independently
      
      make sure the context is not dropped in the middle of the async decoding.
      
      remove unnecessary log
      
      add all the necessary plumbing to return the generated content
      
      update invalid doc in cpp file
      
      correctly forward back the log probabilities
      
      remove unneeded scope variable for now
      
      refactor Stream impl for Generation to factorise code
      
      expose the internal missing start/queue timestamp
      
      forward tgi parameters rep/freq penalty
      
      add some more validation about grammar not supported
      
      define a shared struct to hold the result of a decoding step
      
      expose information about potential error happening while decoding
      
      remove logging
      
      add logging in case of decoding error
      
      make sure executor_worker is provided
      
      add initial Dockerfile for TRTLLM backend
      
      add some more information in CMakeLists.txt to correctly install executorWorker
      
      add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
      
      simplify prebuilt trtllm libraries name definition
      
      do the same name definition stuff for tensorrt_llm_executor_static
      
      leverage pkg-config to probe libraries paths and reuse new install structure from cmake
      
      fix bad copy/paste missing nvinfer linkage direction
      
      align all the linker search dependency
      
      add missing pkgconfig folder for MPI in Dockerfile
      
      correctly setup linking search path for runtime layer
      
      fix missing / before tgi lib path
      
      adding missing ld_library_path for cuda stubs in Dockerfile
      
      update tgi entrypoint
      
      commenting out Python part for TensorRT installation
      
      refactored docker image
      
      move to TensorRT-LLM v0.11.0
      
      make docker linter happy with same capitalization rule
      
      fix typo
      
      refactor the compute capabilities detection along with num gpus
      
      update TensorRT-LLM to latest version
      
      update TensorRT install script to latest
      
      update build.rs to link to cuda 12.5
      
      add missing dependent libraries for linking
      
      clean up a bit
      
      install to decoder_attention target
      
      add some custom stuff for nccl linkage
      
      fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
      
      use std::env::consts::ARCH
      
      make sure variable lives long enough...
      
      look for cuda 12.5
      
      add some more basic info in README.md
      
      * Rebase.
      
      * Fix autodocs.
      
      * Let's try to enable trtllm backend.
      
      * Ignore backends/v3 by default.
      
      * Fixing client.
      
      * Fix makefile + autodocs.
      
      * Updating the schema thing + redocly.
      
      * Fix trtllm lint.
      
      * Adding pb files ?
      
      * Remove cargo fmt temporarily.
      
      * ?
      
      * Tmp.
      
      * Remove both check + clippy  ?
      
      * Backporting telemetry.
      
      * Backporting 457fb0a1
      
      
      
      * Remove PB from git.
      
      * Fixing PB with default member backends/client
      
      * update TensorRT-LLM to latest version
      
      * provided None for api_key
      
      * link against libtensorrt_llm and not libtensorrt-llm
      
      ---------
      Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
      Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
  14. 29 Jul, 2024 1 commit
    • Run ci api key (#2315) · 583d37a2
      Erik Kaunismäki authored
      
      
      * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.
      
      * change name to info routes
      
      * Fix comment
      
      * convert strings to lowercase for case insensitive comparison
      
      * convert header to string
      
      * fixes and update docs
      
      * update docs again
      
      * revert wrong update
      
      ---------
      Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
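
The behaviour described in this commit (an API key enforced on every route except the info/health ones, with a case-insensitive comparison) can be sketched roughly as follows. This is a minimal Python illustration and not the router's actual Rust code: the route set `INFO_ROUTES`, the function `is_authorized`, and the `Bearer` header scheme are all assumptions made for the example.

```python
import hmac
from typing import Mapping, Optional

# Hypothetical set of routes that stay open; the real router's list may differ.
INFO_ROUTES = {"/", "/info", "/health"}


def is_authorized(path: str, headers: Mapping[str, str], api_key: Optional[str]) -> bool:
    """Return True if the request may proceed.

    Assumes an `Authorization: Bearer <key>` scheme (an assumption, not
    confirmed by the commit message).
    """
    # No key configured, or an info/health route: nothing to check.
    if api_key is None or path in INFO_ROUTES:
        return True

    # HTTP header names are case-insensitive, so look the header up that way.
    auth = next((v for k, v in headers.items() if k.lower() == "authorization"), None)
    if auth is None:
        return False

    # Lowercase both sides for a case-insensitive comparison, as the commit
    # message describes; compare_digest avoids leaking timing information.
    expected = f"Bearer {api_key}".lower()
    return hmac.compare_digest(auth.lower().encode(), expected.encode())
```

For instance, `is_authorized("/health", {}, "secret")` passes without a header, while `is_authorized("/generate", {}, "secret")` is rejected.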
  15. 23 Jul, 2024 1 commit
  16. 19 Jul, 2024 4 commits
    • Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant variations
      compared to other models:
      
      - Grouped top-K in expert selection.
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
        so we need weight loading that supports quantized weights. To this
        end `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads of size 192 need an extension to our paged attention
        fork, and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
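
As an illustration of the first item, grouped top-K expert selection can be sketched in plain Python for a single token. This is a simplified sketch and not the code added in the commit: the function name `grouped_top_k` and its parameters `n_group`, `topk_group` and `top_k` are names chosen for this example, and scoring each group by its best expert is assumed as the group ranking rule.

```python
from typing import List, Sequence, Tuple


def grouped_top_k(
    scores: Sequence[float], n_group: int, topk_group: int, top_k: int
) -> List[Tuple[int, float]]:
    """Pick `top_k` experts, restricted to the `topk_group` best groups.

    `scores` holds one router score per expert for a single token.
    Experts are split into `n_group` contiguous, equally sized groups.
    """
    n_experts = len(scores)
    if n_experts % n_group != 0:
        raise ValueError("number of experts must be divisible by n_group")
    group_size = n_experts // n_group

    # Rank each group by its best expert (assumed ranking rule).
    group_scores = [
        max(scores[g * group_size : (g + 1) * group_size]) for g in range(n_group)
    ]
    kept_groups = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[
        :topk_group
    ]

    # Only experts inside the kept groups remain candidates.
    candidates = [
        e
        for g in sorted(kept_groups)
        for e in range(g * group_size, (g + 1) * group_size)
    ]

    # Ordinary top-k over the surviving candidates.
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return [(e, scores[e]) for e in chosen]


scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.7, 0.05, 0.6]
# With one kept group, experts 4 and 5 are excluded despite their high scores.
print(grouped_top_k(scores, n_group=4, topk_group=1, top_k=2))  # [(1, 0.9), (0, 0.1)]
```

The difference from a plain top-K is visible in the example: an unrestricted top-2 would pick experts 1 and 4, whereas restricting to the single best group forces the second choice to come from the same group as expert 1.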
    • fix: adjust default tool choice (#2244) · 68a9685f
      drbh authored
      * fix: adjust default tool choice
      
      * feat: improve tool choice syntax and response parsing/errors
      
      * fix: remove dev tests
      
      * feat: add ToolChoice to docs
    • add usage stats to toctree (#2260) · 40f5dc3e
      Erik Kaunismäki authored
      quick fix
    • usage stats and crash reports (#2220) · 4c19593a
      Erik Kaunismäki authored
      
      
      * draft of usage stats
      
      * fix wrong link
      
      * launcher doesn't need sysinfo dep
      
      * only tokenizer class instead of whole struct
      
      * unused import
      
      * fix clippy errors
      
      * update openAPI doc
      
      * cargo fmt
      
      * fix error in passing flags to router
      
      * try again to update docs
      
      * run pre-commit locally
      
      * Update router/src/main.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * Update router/src/main.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * on crash use anonymous error event
      
      * delete json_output and ngrok
      
      * more robust way of checking if running in a container
      
      * more robust nvidia smi
      
      * parse xpu more robustly
      
      * fix errors
      
      * add nvidia-smi details in docs
      
      * cargo fmt
      
      * fix clippy
      
      * should make docs check pass
      
      * Update router/src/usage_stats.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * error reason can't be in nested json
      
      * cargo fmt
      
      ---------
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
  17. 09 Jul, 2024 2 commits
  18. 08 Jul, 2024 1 commit
  19. 05 Jul, 2024 1 commit
    • Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2
      Nicolas Patry authored
      * Refactor dead code.
      
      * First working step.
      
      * Remove a lot of duplicated code.
      
      * More dead code.
      
      * More cleanup.
      
      * Fix Santacoder test.
      
      * Fixing the simple tests.
      
      * Fixing sharding.
      
      * Fixes for VLM.
      
      * Fixing santacoder (num_kv_heads hardcoded).
      
      * Removing more dead code.
      
      * Fixing `config.n_head`.
      
      * Stopping earlier because of `<end_of_utterance>` in idefics2.
      
      * Addresses comments.
      
      * Removing the dead code.
      
      * Fuse back mistral into FlashCausalLM.
      
      * Finish removal.
      
      * Fixing docs + causal_lm `batch_class`.
      
      * Fixing docs + causal.lm.
      
      * Add default to Gemma Causality.
      
      * Default value for gemma/gemma2.
      
      * Wrong default.
  20. 04 Jul, 2024 1 commit
  21. 03 Jul, 2024 4 commits
  22. 27 Jun, 2024 2 commits
  23. 25 Jun, 2024 2 commits
    • Enable multiple LoRa adapters (#2010) · 04e1af94
      drbh authored
      
      
      * feat: first draft load multiple lora
      
      * feat: load weights within layer and refactor lora pass
      
      * fix: refactor and reduce lora math
      
      * feat: baseline impl single request multi lora support
      
      * feat: prefer lorax implementation and port loading logic
      
      * fix: prefer adapter_data and refactors
      
      * feat: prefer lorax's custom punica kernels and add mlp loras
      
      * fix: adjust batch for bgmv
      
      * fix: adjust adapter_segments logic when in batch
      
      * fix: refactor and move changes to v3 proto
      
      * fix: pass model_id for all flash causal lms
      
      * fix: pass model_id for all causal and seq2seq lms
      
      * fix: add model_id to model test
      
      * feat: add lora support to mistral and refactors
      
      * feat: prefer model id in request
      
      * fix: include rust code for adapter id
      
      * feat: bump launcher and add new lora docs
      
      * feat: support base model generation and refactors
      
      * fix: rename doc to retry ci build
      
      * feat: support if vlm models
      
      * fix: add adapter_data param and avoid missing layers
      
      * fix: add adapter_data param to phi and neox
      
      * fix: update all models forwards to include adapter_data
      
      * fix: add model_id to IdeficsCausalLM
      
      * Update lora.md
      
      Fixed a typo
      
      * Update lora.md
      
      Fixing spam image
      
      * fix: add lora kernel to dockerfile, support running without kernels and refactors
      
      * fix: avoid dockerfile conflict
      
      * fix: refactors and adjust flash llama lora logic
      
      * fix: skip llama test due to CI issue (temp)
      
      * fix: skip llama test CI (temp) 2
      
      * fix: revert skips and prefer updated ci token for tests
      
      * fix: refactors and helpful comments
      
      * fix: add noop in TensorParallelAdapterRowLinear too
      
      * fix: refactor and move shard_lora_weights logic
      
      * fix: exit early if no adapter_data
      
      ---------
      Co-authored-by: Derek <datavistics@gmail.com>
    • Add OTLP Service Name Environment Variable (#2076) · 1869ee2f
      KevinDuffy94 authored
      * Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069
      
      * Update Docs
      
      * Update README.md
      
      * Update Launcher Docs
      
      * Update Launcher Docs
      Removing Option