    feat(server): Implements sharding for non divisible `vocab_size`. (#583)
    Nicolas Patry authored
    - The code is relatively easy (just disable the divisibility checks on the
      Embedding and the Head).
    
    This cannot be done in the same easy fashion for hidden_dim/head_dim.
    It's relatively easy for some models (classic MHA), but it would make the
    other models (MQA) much more complex, and would turn GPTQ quantization into
    another quite hairy piece of code.
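
    Below is a minimal, hedged sketch (not the actual TGI code) of how an
    embedding can be sharded over a `vocab_size` that does not divide evenly by
    the number of shards: each rank takes a ceil-sized contiguous block, the
    last rank simply ends up with fewer rows, and token ids outside the local
    block are zeroed before an all-reduce. The class name and structure here are
    illustrative assumptions, not the library's API.

    ```python
    import math
    import torch
    import torch.distributed as dist


    class ShardedEmbedding(torch.nn.Module):
        """Illustrative vocab-parallel embedding; names are assumptions."""

        def __init__(self, vocab_size: int, hidden_dim: int, rank: int, world_size: int):
            super().__init__()
            # Ceil division instead of asserting vocab_size % world_size == 0.
            block_size = math.ceil(vocab_size / world_size)
            self.min_id = rank * block_size
            self.max_id = min(vocab_size, (rank + 1) * block_size)
            # The local shard stores only its own rows; the last shard may hold fewer.
            self.weight = torch.nn.Parameter(
                torch.empty(self.max_id - self.min_id, hidden_dim)
            )

        def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
            # Ids outside this shard's block are clamped to a valid local row,
            # then their output is zeroed; the all-reduce sums the single
            # non-zero contribution per token across shards.
            outside = (input_ids < self.min_id) | (input_ids >= self.max_id)
            local_ids = (input_ids - self.min_id).clamp(0, self.weight.shape[0] - 1)
            out = torch.nn.functional.embedding(local_ids, self.weight)
            out = out.masked_fill(outside.unsqueeze(-1), 0.0)
            if dist.is_available() and dist.is_initialized():
                dist.all_reduce(out)
            return out
    ```

    On the Head side, one would similarly split the output projection along the
    vocab dimension and gather the per-shard logits, truncating any padding back
    to `vocab_size` before returning them.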