@@ -16,6 +16,26 @@ In training case the mean/variance need to store out (TBD, not supported yet)
Since [prenorm/postnorm](https://arxiv.org/pdf/1906.01787) is quite common in LLM blocks, this example supports this feature through kernel fusion. Note that `prenorm`/`postnorm` always needs an elementwise add of a `shortcut` before the actual layernorm computation, and can optionally store the added result out to global memory. You can use `-fadd=1` to test `pre-add+store`, or `-fadd=2` to test `pre-add` without storing out.
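For reference, below is a minimal unfused PyTorch sketch of what the fused path computes; the function name, the `gamma`/`beta`/`eps` arguments and the `store_pre_add` flag are illustrative only and not the example's actual API:

```
import torch
import torch.nn.functional as F

def prenorm_add_layernorm(x, shortcut, gamma, beta, eps=1e-5, store_pre_add=True):
    # elementwise add of the shortcut before layernorm (the "pre-add")
    pre_add = x + shortcut
    # layernorm over the last (hidden) dimension of the [m, n] input
    y = F.layer_norm(pre_add.float(), (pre_add.shape[-1],),
                     gamma.float(), beta.float(), eps).to(x.dtype)
    # store_pre_add=True  mimics -fadd=1 (pre-add result is also written out)
    # store_pre_add=False mimics -fadd=2 (pre-add kept internal, not stored)
    return (y, pre_add) if store_pre_add else (y, None)
```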
## dynamic quantization
We support dynamic quantization for `int8` output by setting `-fsweep=1` and `-prec_o=int8`. In this case the output goes through a row-wise dynamic quantization like below:
```
# assume output is int8, hidden_states has shape [m, n] in fp16/bf16