    Add support for Deepseek V2 (#2224) · e52be9bb
    Daniël de Kok authored
    Deepseek V2 is a MoE model from Deepseek. Relevant variations
    compared to other models:
    
    - Grouped top-K in expert selection (routing sketch below).
    - The YaRN mscale is calculated using the `mscale` and `mscale_all_dim`
      configuration options.
    - `mscale_all_dim` is also used to scale the attention softmax (YaRN
      sketch below).
    - Permutation of the query/key representations before applying rotary
      embeddings (permutation sketch below).
    - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
      so we need weight loading that supports quantized weights. To this
      end, `{Weights,WeightLoader}.get_weight` was added (toy sketch below).
    - The query/key head dimensionality differs from that of the value,
      so we need to pad during attention (padding sketch below).
    - Heads of size 192 need an extension to our paged attention fork, and
      we need to ensure that the KV cache is allocated with the correct head
      size.
    - Shared experts (shared-experts sketch below).
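
    Routing sketch: a minimal illustration of grouped top-K selection, assuming
    router scores of shape (num_tokens, n_experts); experts are split into
    groups, the best `topk_group` groups are kept per token, and the usual
    top-k is then taken over the experts in those groups. Names such as
    `grouped_topk` and `n_expert_group` are illustrative, not the exact ones in
    the model code.

    ```python
    import torch


    def grouped_topk(
        scores: torch.Tensor,  # (num_tokens, n_experts) router probabilities
        n_expert_group: int,
        topk_group: int,
        top_k: int,
    ):
        num_tokens, n_experts = scores.shape
        # Score each group by its best expert.
        group_scores = scores.view(num_tokens, n_expert_group, -1).max(dim=-1).values
        # Keep only the `topk_group` best groups per token.
        group_idx = torch.topk(group_scores, k=topk_group, dim=-1).indices
        group_mask = torch.zeros_like(group_scores)
        group_mask.scatter_(1, group_idx, 1.0)
        # Expand the group mask back to per-expert granularity and mask the rest.
        expert_mask = (
            group_mask.unsqueeze(-1)
            .expand(num_tokens, n_expert_group, n_experts // n_expert_group)
            .reshape(num_tokens, n_experts)
        )
        masked_scores = scores.masked_fill(expert_mask == 0, 0.0)
        # Ordinary top-k over the experts that survived group selection.
        return torch.topk(masked_scores, k=top_k, dim=-1)
    ```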
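
    YaRN sketch: a minimal illustration of the mscale handling, following the
    formulas of the DeepSeek V2 reference implementation; the variable names
    and example values are illustrative.

    ```python
    import math


    def get_mscale(scale: float, mscale: float) -> float:
        if scale <= 1.0:
            return 1.0
        return 0.1 * mscale * math.log(scale) + 1.0


    scaling_factor = 40.0  # rope scaling factor (example value)
    mscale = 1.0           # `mscale` from the config
    mscale_all_dim = 1.0   # `mscale_all_dim` from the config
    qk_head_dim = 192      # nope + rope query/key head dimensions

    # The rotary magnitude scale is the ratio of the two mscale terms.
    rotary_mscale = get_mscale(scaling_factor, mscale) / get_mscale(
        scaling_factor, mscale_all_dim
    )

    # `mscale_all_dim` also enters the attention softmax scale.
    softmax_scale = qk_head_dim**-0.5
    if mscale_all_dim:
        m = get_mscale(scaling_factor, mscale_all_dim)
        softmax_scale = softmax_scale * m * m
    ```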
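
    Permutation sketch: a minimal illustration, assuming the rotary dimensions
    of the query/key arrive in an interleaved layout `[x0, y0, x1, y1, ...]`
    and are reordered into the half-split layout `[x0, x1, ..., y0, y1, ...]`
    expected by the rotary kernel. Shapes and names are illustrative.

    ```python
    import torch


    def permute_for_rotary(x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, num_heads, rope_dim), interleaved rotary layout.
        num_tokens, num_heads, rope_dim = x.shape
        return (
            x.view(num_tokens, num_heads, rope_dim // 2, 2)
            .transpose(2, 3)
            .reshape(num_tokens, num_heads, rope_dim)
        )
    ```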
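
    Toy sketch: why `get_weight` is needed. The class and method below are
    hypothetical, not the actual `Weights`/`WeightLoader` API; the point is
    that `q_a_proj` and `kv_a_proj_with_mqa` are loaded whole on every rank
    instead of being split column- or row-wise, yet may still be stored in a
    quantized format, so the lookup has to go through the weight loader rather
    than returning a raw tensor.

    ```python
    from dataclasses import dataclass

    import torch


    @dataclass
    class ToyWeightLoader:
        quantized: bool

        def get_weight(self, tensors: dict, prefix: str):
            weight = tensors[f"{prefix}.weight"]
            if not self.quantized:
                return weight
            # A real loader would return a quantized-weight wrapper (qweight,
            # scales, zero points, ...) that the linear layer knows how to use.
            return weight, tensors[f"{prefix}.scales"]


    # Usage: the whole (unsharded) weight is loaded on every rank (example shapes).
    tensors = {
        "q_a_proj.weight": torch.zeros(1536, 5120, dtype=torch.int8),
        "q_a_proj.scales": torch.ones(1536),
    }
    q_a_proj = ToyWeightLoader(quantized=True).get_weight(tensors, "q_a_proj")
    ```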
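
    Padding sketch: a minimal illustration, assuming an attention kernel that
    requires query, key and value to share one head size. Query/key heads have
    192 dimensions while value heads have 128, so the value is zero-padded to
    192 before attention (and the KV cache is likewise allocated for head size
    192), and the padding is sliced off the output afterwards. The kernel call
    is shown as a placeholder.

    ```python
    import torch
    import torch.nn.functional as F

    qk_head_dim, value_head_dim = 192, 128
    num_tokens, num_heads = 4, 16

    query = torch.randn(num_tokens, num_heads, qk_head_dim)
    key = torch.randn(num_tokens, num_heads, qk_head_dim)
    value = torch.randn(num_tokens, num_heads, value_head_dim)

    # Pad the last dimension of the value heads: 128 -> 192.
    value = F.pad(value, (0, qk_head_dim - value_head_dim), value=0.0)

    # attn_output = paged_attention(query, key, value, kv_cache, ...)  # head size 192
    attn_output = torch.zeros(num_tokens, num_heads, qk_head_dim)  # placeholder

    # Strip the padding so downstream projections see 128-dim value heads again.
    attn_output = attn_output[..., :value_head_dim]
    ```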
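
    Shared-experts sketch: a minimal illustration of the usual DeepSeek V2
    layout, in which a dense "shared experts" MLP is applied to every token and
    its output is added to the routed (top-K) output. Module names are
    illustrative, and the gated MLP of the real model is simplified here.

    ```python
    import torch
    import torch.nn as nn


    class ToySharedExpertMoE(nn.Module):
        def __init__(
            self,
            hidden_size: int,
            moe_intermediate_size: int,
            n_shared_experts: int,
            routed_moe: nn.Module,  # grouped top-K mixture over the routed experts
        ):
            super().__init__()
            self.routed_moe = routed_moe
            # The shared experts act as one MLP with a widened intermediate size.
            shared_size = moe_intermediate_size * n_shared_experts
            self.shared_experts = nn.Sequential(
                nn.Linear(hidden_size, shared_size, bias=False),
                nn.SiLU(),
                nn.Linear(shared_size, hidden_size, bias=False),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Every token goes through the shared experts; routing only affects
            # the routed part.
            return self.routed_moe(x) + self.shared_experts(x)
    ```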