    feat(server): Implements sharding for non divisible `vocab_size`. (#583)
    Nicolas Patry authored
    - The code is relatively easy (just disable the divisibility checks on the
      Embedding and the Head).
    
    This cannot be done in the same easy fashion for hidden_dim/head_dim.
    It's relatively easy for some models (classic MHA), but it would make the
    other models (MQA) much more complex, and would turn GPTQ quantization into
    another quite hairy piece of code.
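
    Below is a minimal, hedged sketch (not the actual TGI code) of how an
    embedding can be sharded over a `vocab_size` that does not divide evenly by
    the number of shards: each rank takes a ceil-sized contiguous block, the
    last rank simply ends up with fewer rows, and token ids outside the local
    block are zeroed before an all-reduce. The class name and structure here are
    illustrative assumptions, not the library's API.

    ```python
    import math
    import torch
    import torch.distributed as dist


    class ShardedEmbedding(torch.nn.Module):
        """Illustrative vocab-parallel embedding; names are assumptions."""

        def __init__(self, vocab_size: int, hidden_dim: int, rank: int, world_size: int):
            super().__init__()
            # Ceil division instead of asserting vocab_size % world_size == 0.
            block_size = math.ceil(vocab_size / world_size)
            self.min_id = rank * block_size
            self.max_id = min(vocab_size, (rank + 1) * block_size)
            # The local shard stores only its own rows; the last shard may hold fewer.
            self.weight = torch.nn.Parameter(
                torch.empty(self.max_id - self.min_id, hidden_dim)
            )

        def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
            # Ids outside this shard's block are clamped to a valid local row,
            # then their output is zeroed; the all-reduce sums the single
            # non-zero contribution per token across shards.
            outside = (input_ids < self.min_id) | (input_ids >= self.max_id)
            local_ids = (input_ids - self.min_id).clamp(0, self.weight.shape[0] - 1)
            out = torch.nn.functional.embedding(local_ids, self.weight)
            out = out.masked_fill(outside.unsqueeze(-1), 0.0)
            if dist.is_available() and dist.is_initialized():
                dist.all_reduce(out)
            return out
    ```

    On the Head side, one would similarly split the output projection along the
    vocab dimension and gather the per-shard logits, truncating any padding back
    to `vocab_size` before returning them.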