:rocket: Gated Graph Transformers
Gated Graph Transformers for graph-level property prediction, i.e. graph classification and regression.
Associated article: Transformers are Graph Neural Networks, by Chaitanya K. Joshi, published with The Gradient.
This repository is a continuously updated personal project to build intuitions about and track progress in Graph Representation Learning research. I aim to develop the most universal and powerful model which unifies state-of-the-art architectures from Graph Neural Networks and Transformers, without incorporating domain-specific tricks.
Key Architectural Ideas
:robot: Deep, Residual Transformer Backbone
 As the backbone architecture, I borrow the two-sub-layered, pre-normalization variant of Transformer encoders that has emerged as the standard in the NLP community, e.g. GPT-3. Each Transformer block consists of a message-passing sub-layer followed by a node-wise feed-forward sub-layer. The graph convolution used for message passing is described later.
 The feed-forward sub-layer projects node embeddings to an absurdly large dimension, passes them through a non-linear activation function, applies dropout, and projects back to the original embedding dimension.
 The Transformer backbone enables training very deep and extremely overparameterized models. Overparameterization is important for performance in NLP and other combinatorially large domains, but was previously not possible for GNNs trained on small graph classification datasets. Coupled with unique node positional encodings (described later) and the feed-forward sub-layer, overparameterization ensures that our GNN is Turing Universal (based on A. Loukas's recent insightful work, including this paper).
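The block structure described above can be sketched in a few lines. This is a minimal NumPy illustration rather than the repository's PyTorch/DGL code: the function names are hypothetical, the normalization is assumed to be a parameter-free layer norm, and a plain neighbor mean stands in for the graph convolution that is described later.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each node embedding (row) to zero mean and unit variance."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def mean_aggregate(h, adj):
    """Placeholder message passing: average the embeddings of each node's neighbors."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    return adj @ h / deg

def feed_forward(h, w_in, w_out, drop_prob=0.0, rng=None):
    """Node-wise MLP: expand, apply ReLU, apply dropout, project back."""
    hidden = np.maximum(h @ w_in, 0.0)
    if drop_prob > 0.0 and rng is not None:
        keep = rng.random(hidden.shape) >= drop_prob
        hidden = hidden * keep / (1.0 - drop_prob)
    return hidden @ w_out

def transformer_block(h, adj, w_in, w_out, drop_prob=0.0, rng=None):
    # Sub-layer 1: pre-norm -> message passing -> residual connection
    h = h + mean_aggregate(layer_norm(h), adj)
    # Sub-layer 2: pre-norm -> node-wise feed-forward -> residual connection
    h = h + feed_forward(layer_norm(h), w_in, w_out, drop_prob, rng)
    return h

rng = np.random.default_rng(0)
num_nodes, dim, expansion = 4, 8, 4  # illustrative sizes, not the repo's settings
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)  # path graph on 4 nodes
h = rng.normal(size=(num_nodes, dim))
w_in = rng.normal(size=(dim, expansion * dim)) * 0.1
w_out = rng.normal(size=(expansion * dim, dim)) * 0.1
out = transformer_block(h, adj, w_in, w_out)
print(out.shape)
```

Because normalization sits inside each residual branch, the identity path from input to output is never rescaled, which is the property usually credited with making deep stacks of such blocks trainable.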
:envelope: Anisotropic Graph Convolutions
Source: 'Deep Parametric Continuous Convolutional Neural Networks', Wang et al., 2018
 As the graph convolution layer, I use the Gated Graph Convolution with dense attention mechanism, which we found to be the best-performing graph convolution in Benchmarking GNNs. Intuitively, Gated GraphConv generalizes directional CNN filters for 2D images to arbitrary graphs by learning weighted aggregations over the local neighbors of each node. It upgrades the node-to-node attention mechanism from GATs and MoNet (i.e. one attention weight per node pair) to consider dense feature-to-feature attention (i.e. d attention weights for each pair of d-dimensional node embeddings).
 Another intuitive motivation for the Gated GraphConv is as a learnable directional diffusion process over the graph, or as a coupled PDE over node and edge features in the graph. Gated GraphConv makes the diffusion process/neighborhood aggregation anisotropic or directional, countering over-smoothing/over-squashing of features and enabling deeper models.
 This graph convolution was originally proposed as a sentence encoder for NLP and further developed at NTU for molecule generation and combinatorial optimization. Evidently, I am partial to this idea. At the same time, it is worth noting that anisotropic local aggregations and generalizations of directed CNN filters have demonstrated strong performance across a myriad of applications, including 3D point clouds, drug discovery, material science, and programming languages.
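To make the "d attention weights per node pair" idea concrete, here is a simplified NumPy sketch of a gated aggregation. It is an illustration under assumptions, not the layer used in this repository: the function name and the weight matrices `U`, `V`, `A`, `B` are hypothetical, explicit edge features and normalization layers are omitted, and gates are computed from the two endpoint embeddings only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_graph_conv(h, edges, U, V, A, B, eps=1e-6):
    """
    h:     (num_nodes, d) node embeddings.
    edges: list of directed (src, dst) pairs; messages flow src -> dst.
    """
    src = np.array([s for s, _ in edges])
    dst = np.array([t for _, t in edges])
    # One d-dimensional gate per edge (dense, feature-wise attention)
    gates = sigmoid(h[dst] @ A + h[src] @ B)   # (num_edges, d)
    messages = gates * (h[src] @ V)            # (num_edges, d)
    # Scatter-add messages and gates onto their destination nodes
    agg = np.zeros_like(h)
    gate_sum = np.zeros_like(h)
    np.add.at(agg, dst, messages)
    np.add.at(gate_sum, dst, gates)
    # Normalize per feature so the gates act like attention weights
    agg = agg / (gate_sum + eps)
    # Residual connection around the gated update
    return h + np.maximum(h @ U + agg, 0.0)

rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=(3, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # undirected path 0 - 1 - 2
U, V, A, B = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
out = gated_graph_conv(h, edges, U, V, A, B)
print(out.shape)
```

A GAT-style layer would compute a single scalar per edge at the `gates` line; here each of the d feature channels gets its own weight, so different channels can attend to different neighbors.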
:arrows_counterclockwise: Graph Positional Encodings
Source: 'Geometric Deep Learning: Going beyond Euclidean Data', Bronstein et al., 2017
 I use the top-k non-trivial Laplacian Eigenvectors as unique node identifiers to inject structural/positional priors into the Transformer backbone. Laplacian Eigenvectors are a generalization of sinusoidal positional encodings from the original Transformer, and were concurrently proposed in the Benchmarking GNNs, EigenGNNs, and GCC papers.
 Randomly flipping the sign of Laplacian Eigenvectors during training (valid due to symmetry: an eigenvector and its negation are equally good solutions) can be seen as an additional data augmentation or regularization technique, helping delay overfitting to training patterns. Going further, the Directional Graph Networks paper presents a more principled approach for using Laplacian Eigenvectors.
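The two steps above, computing the encodings and flipping their signs, might look roughly as follows. This NumPy sketch assumes the symmetric normalized Laplacian and a connected graph with no isolated nodes; the function names are hypothetical and the repository's actual preprocessing may differ.

```python
import numpy as np

def laplacian_positional_encoding(adj, k):
    """Return the k eigenvectors with the smallest non-trivial eigenvalues.

    Assumes a connected, undirected graph in which every node has degree >= 1.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}
    lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(lap)
    # Skip the first (trivial, eigenvalue 0) eigenvector
    return eigvecs[:, 1:k + 1]

def random_sign_flip(pe, rng):
    """Flip the sign of each eigenvector (column) independently."""
    signs = rng.choice([-1.0, 1.0], size=(1, pe.shape[1]))
    return pe * signs

# Path graph on 5 nodes
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

rng = np.random.default_rng(0)
pe = laplacian_positional_encoding(adj, k=2)
pe_augmented = random_sign_flip(pe, rng)  # would be re-sampled at every training step
print(pe.shape)
```

The sign flip is applied per eigenvector rather than per entry, so each column stays a valid eigenvector and only the arbitrary choice of orientation changes between training steps.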
Some ideas still in the pipeline include:

Graph-specific Normalization: Graph-specific normalization was first motivated in Benchmarking GNNs as 'graph size normalization'. Several subsequent techniques, such as GraphNorm and MessageNorm, aim to replace or augment standard Batch Normalization. Intuitively, there is room for improvement because BatchNorm flattens mini-batches of graphs instead of accounting for the underlying graph structure.

Theoretically Expressive Aggregation: There are several exciting ideas aiming to bridge the gap between theoretical expressive power, computational feasibility, and generalization capacity for GNNs: PNA-style multi-head aggregation and scaling, generalized aggregators from DeeperGCNs, pre-computing structural motifs as in GSN, etc.

Virtual Node and Low Rank Global Attention: After the message-passing step, the virtual node trick adds messages to-and-fro a virtual/super node connected to all graph nodes. LRGA comes with additional theoretical motivations but does something similar. Intuitively, these techniques enable modelling long-range or latent interactions in graphs and counter the over-squashing problem with deeper networks.

General Purpose Pre-training: It isn't truly a Transformer unless it's pre-trained on hundreds of GPUs for thousands of hours...but general-purpose pre-training for graph representation learning remains an open question!
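Of the ideas listed above, the virtual node trick is the easiest to illustrate. The NumPy sketch below is one simple variant and not code from this repository: the function name and the weight matrix `W` are hypothetical, it handles a single graph rather than a batch, and a one-layer ReLU update is assumed for the virtual node.

```python
import numpy as np

def virtual_node_update(h, v, W):
    """
    h: (num_nodes, d) node embeddings after a message-passing step.
    v: (d,) embedding of the virtual node connected to every graph node.
    W: (d, d) weight matrix for the virtual node update.
    """
    # Nodes -> virtual node: pool all node embeddings into the virtual node
    v_new = v + np.maximum(h.sum(axis=0) @ W, 0.0)
    # Virtual node -> nodes: broadcast the global summary back to every node
    h_new = h + v_new
    return h_new, v_new

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))
v = np.zeros(4)
W = rng.normal(size=(4, 4)) * 0.1
h_new, v_new = virtual_node_update(h, v, W)
print(h_new.shape, v_new.shape)
```

After one such update, every node has received information from every other node regardless of graph distance, which is why the trick helps with long-range interactions.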
Installation and Usage
# Create new Anaconda environment
conda create -n newenv python=3.7
conda activate newenv
# Install PyTorch 1.6 for CUDA 10.x
conda install pytorch=1.6 cudatoolkit=10.x -c pytorch
# Install DGL for CUDA 10.x
conda install -c dglteam dgl-cuda10.x
# Install other dependencies
conda install tqdm scikit-learn pandas urllib3 tensorboard
pip install -U ogb
# Train GNNs on ogbg-mol* datasets
python main_mol.py --dataset [ogbg-molhiv/ogbg-molpcba] --gnn [gated-gcn/gcn/mlp]
# Prepare submission for OGB leaderboards
bash scripts/ogbg-mol*.sh
# Collate results for submission
python submit.py --dataset [ogbg-molhiv/ogbg-molpcba] --expt [path-to-logs]
Note: The code was tested on Ubuntu 16.04, using Python 3.6, PyTorch 1.6 and CUDA 10.1.
Citation
@article{joshi2020transformers,
author = {Joshi, Chaitanya K},
title = {Transformers are Graph Neural Networks},
journal = {The Gradient},
year = {2020},
howpublished = {\url{https://thegradient.pub/transformers-are-graph-neural-networks/}},
}