# This is because it instantiates its attention layer from torch.nn.MultiheadAttention, which passes the
# weights and bias directly to `torch.nn.functional.multi_head_attention_forward`. Since the hook is never
# triggered by a forward call, the weights stay on the CPU. There are more models where we skip this test
# because of MHA (for example, HunyuanDiT because of its AttentionPooling layer).
pass
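# A minimal sketch (not part of the test suite) of the behavior described above, assuming offloading
# works via forward pre-hooks on submodules: nn.MultiheadAttention passes `in_proj_weight`,
# `in_proj_bias`, and `out_proj.weight` straight into the functional call, so a hook attached to the
# `out_proj` child module never fires. The function name and sizes below are illustrative only.
def _mha_bypasses_submodule_forward_sketch():
    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
    calls = []
    # Stand-in for a CPU-offloading pre-forward hook that would move the weights to the accelerator.
    mha.out_proj.register_forward_pre_hook(lambda module, args: calls.append("out_proj"))

    x = torch.randn(2, 4, 8)  # (seq_len, batch, embed_dim) with the default batch_first=False
    mha(x, x, x)

    # The hook never ran: the output projection is applied functionally inside
    # torch.nn.functional.multi_head_attention_forward, bypassing out_proj.forward.
    assert calls == []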
# TODO(aryan): Create a dummy Gemma model with a small vocab size
@unittest.skip(
"A very small vocab size is used for fast tests. So, any kind of prompt other than the empty default used in other tests will lead to a embedding lookup error. This test uses a long prompt that causes the error."