1. 26 Jul, 2024 2 commits
    • drbh's avatar
      feat: add ruff and resolve issue (#2262) · bab02ff2
      drbh authored
      * feat: add ruff and resolve issue
      
      * fix: update client exports and adjust after rebase
      
      * fix: adjust syntax to avoid circular import
      
      * fix: adjust client ruff settings
      
      * fix: lint and refactor import check and avoid model enum as global names
      
      * fix: improve fbgemm_gpu check and lints
      
      * fix: update lints
      
      * fix: prefer comparing model enum over str
      
      * fix: adjust lints and ignore specific rules
      
      * fix: avoid unneeded quantize check
      bab02ff2
    • Daniël de Kok's avatar
  2. 24 Jul, 2024 2 commits
  3. 23 Jul, 2024 2 commits
  4. 22 Jul, 2024 2 commits
  5. 21 Jul, 2024 1 commit
  6. 20 Jul, 2024 1 commit
    • OlivierDehaene's avatar
      feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248) · 53ec0b79
      OlivierDehaene authored
      * feat(fp8): add support for fbgemm
      
      * allow loading fp8 weights directly
      
      * update outlines
      
      * fix makefile
      
      * build fbgemm
      
      * avoid circular import and fix dockerfile
      
      * add default dtype
      
      * refactored weights loader
      
      * fix auto conversion
      
      * fix quantization config parsing
      
      * force new nccl on install
      
      * missing get_weights implementation
      
      * increase timeout
      53ec0b79
  7. 19 Jul, 2024 5 commits
    • Daniël de Kok's avatar
      Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant variations
      compared to other models:
      
      - Grouped top-K in expert selection.
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
        So, we need weight loads that supports quantized weights. To this
        end `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads with size 192, needs an extension to our paged attention
        fork and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
      e52be9bb
    • Daniël de Kok's avatar
      3b41e93a
    • Daniël de Kok's avatar
      18db78f2
    • Daniël de Kok's avatar
    • Daniël de Kok's avatar
      Improve the handling of quantized weights (#2250) · ba291dad
      Daniël de Kok authored
      * Improve the handling of quantized weights
      
      Handling of quantized weights was split between two mechanisms:
      
      - For quantized checkpoints, we used the new weight loader
        infrastructure.
      - For quantization while loading (EETQ, FP8, bitsandbytes) we
        instead relied on conditional in `get_linear`.
      
      Weight loaders support context managers to selectively load
      particular layers with different weight loaders, which is useful
      for models like Idefics2 AWQ, which uses a quantized text model,
      but unquantized vision and connector models. However, the context
      manager would be overrided by `get_linear`, which string-checks
      `quantizer`. Also, the context manager would not work with
      EETQ, FP8, and bitsandbytes.
      
      This change migrates all quantizers to the weight loader infrastructure.
      This has several benefits:
      
      - We can use context managers with all quantizers.
      - All the implementation details move down to the quantizer layers,
        `get_linear` does not need to know how to handle quantizer linear
        layers.
      - All quantizer weights are strongly typed, we don't pass around
        raw tensors.
      - We don't have to pass around the `quantizer` string everywhere.
      
      * Exclude non-MLP layers when using FP8 quantization with Llama
      ba291dad
  8. 18 Jul, 2024 1 commit
  9. 16 Jul, 2024 1 commit
  10. 09 Jul, 2024 1 commit
    • Daniël de Kok's avatar
      Move quantized weight handling out of the `Weights` class (#2194) · 8511669c
      Daniël de Kok authored
      Quantized weights were loaded in the `Weights` class, but this was
      getting quite unwieldy, where every higher level method to load weights
      was a long conditional to cover all the different quantizers.
      
      This change moves loading of quantized weights out of the `Weights`
      class. This is done by defining a simple `WeightsLoader` interface
      that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
      and `MarlinWeightsLoader`. These implementations are in the quantizers'
      respective modules. The `Weights` class provides the low-level load
      operations (such as loading tensors or sharded tensors), but delegates
      loads that need quantizer-specific weight processing to a loader. The
      loaders still use the low-level functionality provided by `Weights`.
      
      I initially tried making a hierarchy where a class like `GPTQWeights`
      would inherit from `Weights`. But it is not very flexible (e.g. does
      not work well with the new weight storage mock used in tests) and
      the implicit indirections made the code harder to follow.
      8511669c
  11. 08 Jul, 2024 2 commits
  12. 05 Jul, 2024 4 commits
  13. 02 Jul, 2024 2 commits
  14. 01 Jul, 2024 3 commits
    • Nicolas Patry's avatar
      [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940) · 4327210e
      Nicolas Patry authored
      * Using flash decoding
      
      Conditional flashdecoding.
      
      Fix max_q.
      
      Working kvcache
      
      Working version with flash decoding.
      
      Make it work for mistral.
      
      Fix after rebase..
      
      Less intrusive.
      
      REvert changes in modeling.
      
      Speedup flashdecoding.
      
      HHachweew
      Hack to make other models work.
      
      Fixing non flash decoding llama path.
      
      Router logic knows about page size.
      
      Missing 2 models.
      
      Missing cohere.
      
      Fixing cohere flash decoding.
      
      Revamped all this architecture.
      
      Fix cohere.
      
      Fixing falcon.
      
      Enabling custom block size schedule.
      
      Update router/src/infer.rs
      
      Not sending preallocated output.
      
      * Making it work on non flash decoding.
      
      * Fix Cohere.
      
      * Fix non decoding paths.
      
      * Rebased.
      
      * No need for cache_manager anymore.
      
      * Update?
      
      * "ipex" -> "cpu"
      
      * These do not belong.
      
      * Factoring cu_seqlen_qk for better abstracting over every model.
      
      * Fixing non flash tests/imports.
      
      * Changing return everywhere.
      
      * Update mistral past.
      
      * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
      
      * Fixup mistral clamping (had issues with cuda graphs).
      
      * No need to recreate anything actually.
      4327210e
    • Nicolas Patry's avatar
      Fixing baichuan override. (#2158) · 4f55f158
      Nicolas Patry authored
      4f55f158
    • drbh's avatar
      fix: use weights from base_layer (#2141) · 25f57e2e
      drbh authored
      25f57e2e
  15. 27 Jun, 2024 2 commits
  16. 25 Jun, 2024 3 commits
    • drbh's avatar
      Enable multiple LoRa adapters (#2010) · 04e1af94
      drbh authored
      
      
      * feat: first draft load multiple lora
      
      * feat: load weights within layer and refactor lora pass
      
      * fix: refactor and reduce lora math
      
      * feat: baseline impl single request multi lora support
      
      * feat: prefer lorax implementation and port loading logic
      
      * fix: prefer adapter_data and refactors
      
      * feat: perfer loraxs custom punica kernels and add mlp loras
      
      * fix: adjust batch for bgmv
      
      * fix: adjust adapter_segments logic when in batch
      
      * fix: refactor and move changes to v3 proto
      
      * fix: pass model_id for all flash causal lms
      
      * fix: pass model_id for all causal and seq2seq lms
      
      * fix: add model_id to model test
      
      * feat: add lora support to mistral and refactors
      
      * feat: prefer model id in request
      
      * fix: include rust code for adapter id
      
      * feat: bump launcher and add new lora docs
      
      * feat: support base model generation and refactors
      
      * fix: rename doc to retry ci build
      
      * feat: support if vlm models
      
      * fix: add adapter_data param and avoid missing layers
      
      * fix: add adapter_data param to phi and neox
      
      * fix: update all models forwards to include adapter_data
      
      * fix: add model_id to IdeficsCausalLM
      
      * Update lora.md
      
      Fixed a typo
      
      * Update lora.md
      
      Fixing spam image
      
      * fix: add lora kernel to dockerfile, support running without kernels and refactors
      
      * fix: avoid dockerfile conflict
      
      * fix: refactors and adjust flash llama lora logic
      
      * fix: skip llama test due to CI issue (temp)
      
      * fix: skip llama test CI (temp) 2
      
      * fix: revert skips and prefer updated ci token for tests
      
      * fix: refactors and helpful comments
      
      * fix: add noop in TensorParallelAdapterRowLinear too
      
      * fix: refactor and move shard_lora_weights logic
      
      * fix: exit early if no adapter_data
      
      ---------
      Co-authored-by: default avatarDerek <datavistics@gmail.com>
      04e1af94
    • Nicolas Patry's avatar
      Removing IPEX_AVAIL. (#2115) · 9e2fdf57
      Nicolas Patry authored
      * Removing IPEX_AVAIL.
      
      Chose to unify CPU and XPU under `ipex`. Most code is exactly similar
      except for a very few spots.
      
      The biggest number of spots is the kv-cache layout and the flash_xxx.py
      files.
      Since those files should be removed soon and factored away, we should
      not need them.
      
      * Forgot a few places.
      
      * Unrelated change.
      
      * Fixing HF_TOKEN.
      
      * HF_TOKEN
      9e2fdf57
    • Wang, Yi's avatar
      Cpu tgi (#1936) · b64c70c9
      Wang, Yi authored
      
      
      * add CPU tgi support
      Signed-off-by: default avatarWang, Yi A <yi.a.wang@intel.com>
      
      * ipex distributed ops support
      Signed-off-by: default avatarWang, Yi A <yi.a.wang@intel.com>
      
      ---------
      Signed-off-by: default avatarWang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: default avatarFuntowicz Morgan <mfuntowicz@users.noreply.github.com>
      b64c70c9
  17. 20 Jun, 2024 1 commit
  18. 14 Jun, 2024 1 commit
    • Daniël de Kok's avatar
      Add support for GPTQ Marlin (#2052) · 093a27c5
      Daniël de Kok authored
      Add support for GPTQ Marlin kernels
      
      GPTQ Marlin extends the Marlin kernels to support common GPTQ
      configurations:
      
      - bits: 4 or 8
      - groupsize: -1, 32, 64, or 128
      - desc_act: true/false
      
      Using the GPTQ Marlin kernels requires repacking the parameters in the
      Marlin quantizer format.
      
      The kernels were contributed by Neural Magic to VLLM. We vendor them
      here for convenience.
      093a27c5
  19. 12 Jun, 2024 1 commit
  20. 10 Jun, 2024 1 commit
    • Daniël de Kok's avatar
      Add Phi-3 medium support (#2039) · 85dfc392
      Daniël de Kok authored
      Add support for Phi-3-medium
      
      The main difference between the medium and mini models is that medium
      uses grouped query attention with a packed QKV matrix. This change adds
      support for GQA with packed matrixes to `Weights.get_weights_col_packed`
      and uses it for Phi-3. This also allows us to remove the custom
      implementation of GQA from dbrx attention loading.
      85dfc392
  21. 06 Jun, 2024 1 commit
    • Daniël de Kok's avatar
      Add support for Marlin-quantized models · 4594e6fa
      Daniël de Kok authored
      This change adds support for Marlin-quantized models. Marlin is an
      FP16xINT4 matmul kernel, which provides good speedups decoding batches
      of 16-32 tokens. It supports quantized models with symmetric
      quantization, groupsize -1 or 128, and 4-bit.
      
      Tested with:
      
      - Llama 2
      - Llama 3
      - Phi 3
      4594e6fa
  22. 05 Jun, 2024 1 commit