feat(server): Implements sharding for non divisible `vocab_size`. (#583)
- The code is relatively easy (just disable the checks on Embedding and Head) This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization another quite hairy piece of code.
Showing
Please register or sign in to comment