"Let's implement a simplified version of the Graph Attention Network (GAT) layer.\n",
"\n",
"A GAT layer has two inputs: the adjacency matrix $A$ and the node input features $X$. The idea of GAT layer is to update each node's representation with a weighted average of the node's own representation and its neighbors' representations. In particular, when computing the output for node $i$, the GAT layer does the following:\n",
"1. Compute the scores $S_{ij}$ representing the attention logit from neighbor $j$ to node $i$. $S_{ij}$ is a function of $i$ and $j$'s input features $X_i$ and $X_j$: $$S_{ij} = LeakyReLU(X_i^\\top v_1 + X_j^\\top v_2)$$, where $v_1$ and $v_2$ are trainable vectors.\n",
"2. Compute a softmax attention $R_{ij} = \\exp S_{ij} / \\left( \\sum_{j' \\in \\mathcal{N}_i} s_{ij'} \\right)$, where $\\mathcal{N}_j$ means the neighbors of $j$. This means that $R$ is a row-wise softmax attention of $S$.\n",
"3. Compute the weighted average $H_i = \\sum_{j' : j' \\in \\mathcal{N}_i} R_{j'} X_{j'} W$, where $W$ is a trainable matrix.\n",
"\n",
"The following code defined all the parameters you need but only completes step 1. Could you implement step 2 and step 3?"