# This is because it instantiates its attention layer from torch.nn.MultiheadAttention, which calls
# `torch.nn.functional.multi_head_attention_forward` directly with the weights and bias. Since the hook is
# never triggered by a forward pass call, the weights stay on the CPU. There are more cases where we skip
# this test because of MHA (for example, HunyuanDiT because of its AttentionPooling layer).
pass
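# A minimal illustrative sketch of the bypass (not part of this test; the module sizes below are made-up
# values): torch.nn.MultiheadAttention hands the out_proj weight straight to the functional API, so a
# forward pre-hook registered on the out_proj submodule never fires and never gets a chance to move it.
#
#     import torch
#
#     mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2)
#     mha.out_proj.register_forward_pre_hook(lambda module, args: print("hook fired"))
#     x = torch.randn(4, 1, 8)  # (seq_len, batch, embed_dim)
#     mha(x, x, x)  # prints nothing: out_proj.forward is never called; its weight goes
#                   # directly into torch.nn.functional.multi_head_attention_forward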
# TODO(aryan): Create a dummy gemma model with smol vocab size
@unittest.skip(
"A very small vocab size is used for fast tests. So, any kind of prompt other than the empty default used in other tests will lead to a embedding lookup error. This test uses a long prompt that causes the error."
"A very small vocab size is used for fast tests. So, any kind of prompt other than the empty default used in other tests will lead to a embedding lookup error. This test uses a long prompt that causes the error."