"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "906b638efadc2680fe1d4eb7f79e39cd459e462e"
Unverified commit 38611086 authored by Aaron Jimenez, committed by GitHub

[docs] Fix mistral link in mixtral.md (#28143)

Fix mistral link in mixtral.md
parent 23f8e4db
@@ -42,7 +42,7 @@ Mixtral-45B is a decoder-based LM with the following architectural choices:
* Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of 45B parameters, but the compute required is the same as for a 14B model. This is because even though every expert has to be loaded in RAM (a 70B-like RAM requirement), each token of the hidden states is dispatched twice (top-2 routing), and thus the compute (the operations required at each forward pass) is just 2 × sequence_length.
-The following implementation details are shared with Mistral AI's first model [mistral](~models/doc/mistral):
+The following implementation details are shared with Mistral AI's first model [mistral](mistral):
* Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
* GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
* Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.
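The top-2 routing claim in the first bullet (compute of roughly 2 × sequence_length) can be illustrated with a minimal sketch. This is not the actual Mixtral implementation in transformers; the layer sizes, the per-expert linear layers, and the router below are illustrative assumptions. The point is that only the two experts selected for each token are evaluated, so per-token compute scales with 2, not with the total number of experts.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; real Mixtral uses 8 experts with much larger MLPs.
num_experts, hidden, seq_len = 8, 16, 4
experts = torch.nn.ModuleList(torch.nn.Linear(hidden, hidden) for _ in range(num_experts))
router = torch.nn.Linear(hidden, num_experts)

x = torch.randn(seq_len, hidden)                   # token hidden states
gate_logits = router(x)                            # (seq_len, num_experts)
weights, chosen = torch.topk(gate_logits, k=2, dim=-1)
weights = F.softmax(weights, dim=-1)               # normalize the two gate weights

out = torch.zeros_like(x)
for t in range(seq_len):
    for slot in range(2):                          # each token is dispatched to exactly 2 experts
        e = chosen[t, slot].item()
        out[t] += weights[t, slot] * experts[e](x[t])
```

Only two of the eight expert MLPs run per token, which is why the per-token FLOPs of the 45B-parameter model are comparable to a 14B dense model even though all experts must be resident in memory.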