Published on Dec 11, 2021 (edited on Dec 13, 2021)
Representation learning, also known as feature learning, is a machine learning specialisation focused on learning representations of data. These representations aim to capture intrinsic factors useful for building predictive models. That is,
"learning representations of the data that make it easier to extract useful information when building classifiers or other predictors." [source]
In this article, we are interested in representation learning for graph data with a focus on scalability, that is, methods applicable to large graphs. We consider a graph large if it has tens of millions of nodes or more. As an example, the Bitcoin transaction network has billions of nodes corresponding to transactions between cryptocurrency holders. Commonly, the number of edges is a multiple of the number of nodes in the graph.
In order to explain the main concepts, we are going to use a citation network as a working example. Let us consider the problem of predicting the topic of a research paper using information from the paper's text content and citation links. Cora (see Figure 1) is a well-studied citation network where the nodes represent papers, each associated with a feature vector indicating the presence or absence of words from a small vocabulary. Edges between the nodes represent citation relationships. Each paper is labelled with one of seven subjects. The dimensionality of the node feature vectors is 1433, and the total numbers of nodes and edges are approximately 2700 and 5400 respectively. Strictly speaking, this is not a large dataset, but it is easy to understand and visualise and has been frequently used as a benchmark in the machine learning literature, making it a good choice for this article. Our aim is to train a multi-class classification model for predicting the subject of a paper.
Figure 1: Visualisation of the Cora citation network. Node colours indicate the paper's subject (one of seven) and node size the degree (number of neighbours).
There are several data structures for representing a graph, each with its own advantages and disadvantages. We represent a graph using its adjacency matrix, A. For a graph with N nodes and E edges, a binary adjacency matrix has size NxN (2700x2700 for Cora) and, if the graph is undirected and each edge is stored in both directions, 2E non-zero entries. Real-world graphs tend to be sparse: the number of edges (5400 for Cora) is much smaller than the maximum possible. A fully connected, undirected graph would have N(N-1)/2 edges (approximately 3.6 million for Cora). Because real graphs are sparse, A can be stored in memory very efficiently using a sparse matrix representation.
The node features can be stored in a dense matrix of size Nxd, where d is the dimensionality of the node feature vectors. For Cora, the feature matrix X has size 2700x1433. Note that this is a dense matrix, so it cannot be stored more efficiently using a sparse matrix representation.
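To make the storage discussion concrete, here is a minimal sketch (using scipy and numpy, with a hypothetical 5-node toy graph standing in for Cora) of the two matrices: a sparse adjacency matrix that stores only the non-zero entries, and a dense feature matrix that stores every value:

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-in for a citation graph: 5 nodes, 4 undirected edges.
# (Cora itself has ~2700 nodes and ~5400 edges.)
edges = [(0, 1), (0, 2), (1, 3), (2, 4)]
n_nodes, n_feats = 5, 8  # feature dimensionality is hypothetical

# Sparse adjacency matrix A: store each undirected edge in both directions.
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n_nodes, n_nodes))

# Dense node-feature matrix X of size N x d (binary bag-of-words style).
X = np.random.default_rng(0).integers(0, 2, size=(n_nodes, n_feats)).astype(float)

# A stores only the 2E non-zero entries; X stores all N*d values.
print(A.nnz, X.size)  # 8 stored adjacency entries vs 40 dense feature values
```

For a graph the size of Cora the contrast is much starker: tens of thousands of stored adjacency entries versus millions of dense feature values.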
We should keep in mind the above storage requirements and the sizes of the data matrices as we will need them to calculate the storage and computational requirements of different graph representation learning algorithms.
Graph Neural Networks (GNNs) are the preferred method for representation learning on graph-structured data. The GNN literature is large and fast-growing. We begin our exploration of scalable graph representation learning with an overview of Graph Convolutional Networks (GCNs), the method that kick-started the recent interest in GNNs. We then demonstrate why GCN is not scalable and survey several extensions that are.
GCN was proposed as a scalable version of ChebNet. We will not discuss the relationship between GCN and prior work further, since this would be out of scope for this article. First, we present the core idea behind a graph convolutional layer, and second, we look into its memory and computational requirements.
A graph convolutional layer is defined using the following straightforward equation,
Z = s(A'XW)
where Z is the matrix of node representations output by the layer; X is the matrix of node features; A' is a normalised version of the graph adjacency matrix with self-loops added; W is a matrix of learnable parameters; and s is a non-linear function, e.g., sigmoid, applied element-wise. The normalised adjacency matrix is given by,
A' = D^(-1/2) (A + I) D^(-1/2)
where I is the identity matrix (the added self-loops) and D is the degree matrix of A + I; the degree matrix is a square diagonal matrix whose diagonal stores the degree (number of neighbours) of each node in the graph. A close examination of the equation for calculating Z reveals that a node's representation is the weighted sum of the representations of its neighbours (the nodes it shares an edge with) plus its own representation, where the weights are functions of the node degrees.
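As a sketch of the above, the layer fits in a few lines of numpy. The toy 3-node path graph, the feature values, and the tanh activation are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def gcn_layer(A, X, W, activation=np.tanh):
    """One graph convolutional layer: Z = s(A' X W), where A' is the
    symmetrically normalised adjacency matrix with self-loops added."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^(-1/2)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # A' = D^(-1/2)(A + I)D^(-1/2)
    return activation(A_norm @ X @ W)

# Toy graph: a path 0-1-2 with 2-dimensional node features.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.random.default_rng(0).normal(size=(2, 2))
Z = gcn_layer(A, X, W)
print(Z.shape)  # (3, 2): one representation per node
```

Each row of Z mixes a node's own features with those of its immediate neighbours, weighted by the inverse square roots of the degrees.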
A graph convolutional network can be created by stacking one or more of the above layers. The output of the last layer can be used to train a node classifier in a semi-supervised way. Usually, only a small number of nodes have labels, so we define the loss as a function of this subset of labelled nodes; however, as is evident from the above equations, all the nodes in the graph, labelled and unlabelled, are involved in calculating the node representations required to evaluate the loss.
Let us consider the limitations of training a GCN model. We focus the discussion on a 3-layer model such that the output node representations at the 3rd layer are given by,
Z^(3) = s(A' s(A' s(A' X W^(1)) W^(2)) W^(3))
The i-th layer produces a set of node embeddings denoted by Z^(i), where Z^(0) is equal to X, the node features.
We can train our 3-layer model using full-batch gradient descent; that is, every pass uses all the node information to calculate the loss and gradients. This is the first limitation of GCN: full-batch gradient descent converges slowly, requiring training for a large number of epochs. Training is also computationally expensive, requiring multiplication of the node feature, adjacency, and weight matrices at every layer. Lastly, the storage requirements are high, since we need to store dense matrices for the node features and for the embeddings output by each graph convolutional layer. Luckily, for sparse graphs, the adjacency matrix can be stored efficiently. To recap, GCN scales poorly for the following three reasons: slow convergence of full-batch gradient descent; expensive matrix multiplications at every layer; and high memory requirements for the node features and intermediate embeddings.
Clearly, if we want GCN-like models to scale to larger graphs, then we have to tackle the above challenges. We need to extend GCN so that it can be trained using mini-batch gradient descent, reduce the number of multiplications per layer, reduce the model size, and minimise the amount of data needed to perform the calculations for each mini-batch. Next, we present models that extend GCN to tackle these issues.
Let us first consider a solution to all the above problems by adapting GCN to mini-batch training. If we can train using mini-batch gradient descent, then training will converge faster and we won't have to store the entire graph in GPU memory all at once, potentially scaling to larger graphs.
We can achieve these goals if we consider the number of nodes in the graph that influence the learnt representation for an output node. Let Nc be the set of nodes for which we have labels. Normally, the set of nodes with labels is going to be much smaller than the total number of nodes in the graph.
Let us consider just one of the nodes in Nc and work our way backwards from the last graph convolutional layer in a 3-layer graph convolutional network. We will refer to the node of interest as the centre node.
We can see that at the first layer, the nodes that directly influence the output representation for the centre node are its one-hop neighbours. The nodes that influence the output representation at the second graph convolutional layer are the one-hop plus the two-hop neighbours and so on with the output of the third layer. In other words, the set of nodes that need to be considered in estimating the centre node's representation grows exponentially as a function of the network depth (number of graph convolutional layers). This is commonly referred to in the literature as the neighbourhood explosion problem.
However, calculating the loss for the centre node does not involve all the nodes in the graph, only a subset: the set of nodes in its receptive field, i.e., the k-hop neighbourhood where k is the number of graph convolutional layers in our GNN. This hints at a solution for efficient mini-batch training.
We can eliminate many unnecessary calculations (matrix multiplications) for nodes that are not involved in estimating the loss for the nodes in a mini-batch, where a mini-batch is a subset of the labelled nodes.
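The receptive-field argument can be checked with a short breadth-first search; the dictionary-based adjacency structure and the toy path graph below are hypothetical:

```python
from collections import deque

def k_hop_neighbourhood(adj, start, k):
    """Nodes within k hops of `start` — the receptive field of a k-layer GCN.
    `adj` maps each node to the list of its neighbours."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

# Toy path graph 0-1-2-3-4: a 2-layer GCN on node 0 only needs nodes {0, 1, 2}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(k_hop_neighbourhood(adj, 0, 2))  # {0, 1, 2}
```

Only this subset of nodes needs to be loaded and multiplied when computing the loss for node 0, which is the basis of mini-batch training.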
Let us consider one more issue to be resolved. What if a node's neighbourhood includes a hub node? Hub nodes are nodes that are well connected, that is have lots of neighbours in the graph. Such nodes are common in real-world graphs that exhibit power law degree distributions; node degrees tend to be small for the majority of nodes in the graph except for a small number of nodes that are well connected and have high degree. The latter are the hub nodes. If a hub node is in the k-hop neighbourhood of a node in the mini-batch, then a large portion of the nodes in the graph influence that node's learnt representation. In other words, for real-world graphs and in the worst case, for each node in the mini-batch we need to access data from all (or most) nodes in the graph. Back to square one for us.
Figure 2: The node degree distribution for the Cora citation network. We can see that most nodes have degree less than 20. Circles indicate 3 hub nodes with degrees 74, 78, and 168.
Researchers have proposed an elegant solution to this problem, one that works well in practice and scales nicely to very large graphs. The method is known as Graph Sample and Aggregate (GraphSAGE).
The core idea is that instead of aggregating information from all of a node's neighbours, we sample a fixed number of them, e.g., 10. We sample with replacement so that if a node has fewer than the required number of neighbours, we can still draw the correct number, although some neighbours will be sampled more than once. When we come across a hub node, we need not worry about its large number of neighbours. Best of all, the number of nodes involved in updating the representation of a single mini-batch node is now capped at a maximum that is easily computed.
If we sample m neighbours for each node and we have a k-layer graph neural network, then we need at most m^k nodes at the deepest layer. For a mini-batch of size b, the total number of nodes becomes at most b*m^k. We can select values for b, m, and k to fully utilise GPU memory. As long as we have a routine to quickly sample sub-graphs and send them to the GPU for training, we have a fast and very scalable algorithm. With GraphSAGE's clever node sampling approach, the size of the original graph no longer matters.
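A sketch of the sampling idea follows; it is simplified relative to the actual GraphSAGE implementation, and the toy graph, m, and k values are hypothetical:

```python
import random

def sample_neighbours(adj, node, m, rng):
    """Sample exactly m neighbours with replacement (GraphSAGE-style),
    so even low-degree nodes yield a fixed-size sample."""
    nbrs = adj[node]
    return [rng.choice(nbrs) for _ in range(m)]

def sampled_receptive_field(adj, node, m, k, rng):
    """Nodes touched when building a k-layer representation for `node`
    with m sampled neighbours per node: layer i holds m^i samples."""
    layers = [[node]]
    for _ in range(k):
        layers.append([s for v in layers[-1]
                       for s in sample_neighbours(adj, v, m, rng)])
    return layers

# Toy graph with a small hub at node 0.
adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
rng = random.Random(42)
layers = sampled_receptive_field(adj, 0, m=2, k=2, rng=rng)
print([len(l) for l in layers])  # [1, 2, 4]: capped regardless of graph size
```

Whatever the degree of node 0, the work per centre node is bounded by the geometric series 1 + m + ... + m^k.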
Node sampling certainly paved the way for scalable GNNs. There are alternative approaches that achieve similar performance (in terms of model accuracy) and in some cases are even more computationally efficient because they allow the sampling to be performed as a pre-processing step. These approaches focus on graph sampling as opposed to node sampling as the core mechanism for scaling GNNs to large graphs.
Node sampling is a brilliant approach to the problem of neighbourhood explosion. Another approach for avoiding this problem is to keep the size of the graph small in the first place. We know that GCN can easily handle small graphs. So, what if we take our big graph and partition it into a set of smaller graphs that GCN can easily handle? If we treat each of these sub-graphs as a mini-batch (assuming that each sub-graph includes a subset of labelled nodes), then during one training epoch we update the GCN parameters once for each sub-graph.
We will treat graph partitioning as a pre-processing step that can be performed once and potentially implemented as a distributed algorithm to reduce its computational cost. Then, we can train GCN models efficiently using the described method. What remains is to specify how to partition the graph, i.e., what makes a good partition, and what effect this choice has on our trained model (think bias and variance).
Let us consider Cluster-GCN as the first approach implementing scalable GNNs via graph sampling. In the paper, the authors clearly show Cluster-GCN's advantages over GCN. Cluster-GCN is certainly a scalable algorithm that can handle any size graph as long as said graph can be efficiently partitioned into a set of sub-graphs.
The authors considered two different partitioning algorithms. The first was random: each subgraph consists of a uniformly random subset of nodes plus all the edges between those nodes. They show that random partitioning performs poorly. They then considered the METIS graph partitioning algorithm, which generates subgraphs that are tightly connected internally while minimising the number of edges between subgraphs. METIS partitioning vastly outperformed random partitioning.
However, the loss of the edges between sub-graphs (the edges connecting nodes assigned to different sub-graphs) induces a strong bias that hinders the model's performance. So, Cluster-GCN extends the basic algorithm such that a mini-batch is constructed by combining a random set of subgraphs (two subgraphs suffice), re-introducing the between-graph edges in the process. The sets of subgraphs combined differ for each epoch and each mini-batch, so every edge in the original graph has a non-zero chance of being used to propagate information. This enhancement produced state-of-the-art results on a number of standard benchmark datasets.
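A minimal sketch of this mini-batch construction, using a uniformly random node partition as a cheap stand-in for METIS (the toy cycle graph and all parameter values are hypothetical; Cluster-GCN's actual implementation differs):

```python
import random

def random_partition(nodes, n_parts, rng):
    """Random node partition — a simple stand-in for METIS, which the
    Cluster-GCN paper shows works much better in practice."""
    nodes = list(nodes)
    rng.shuffle(nodes)
    return [nodes[i::n_parts] for i in range(n_parts)]

def make_minibatch(edges, parts, q, rng):
    """Cluster-GCN mini-batch: union of q randomly chosen partitions,
    keeping all edges among the combined node set. This re-introduces
    the between-partition edges a single partition would lose."""
    chosen = rng.sample(parts, q)
    batch_nodes = set().union(*map(set, chosen))
    batch_edges = [(u, v) for u, v in edges
                   if u in batch_nodes and v in batch_nodes]
    return batch_nodes, batch_edges

# Toy cycle graph on 6 nodes, split into 3 partitions, combined 2 at a time.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
rng = random.Random(0)
parts = random_partition(range(6), n_parts=3, rng=rng)
nodes, kept = make_minibatch(edges, parts, q=2, rng=rng)
print(len(nodes), len(kept))
```

Over many epochs, different partition pairs are combined, so every edge of the cycle eventually participates in training.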
The only issue with Cluster-GCN is that using METIS for graph partitioning can be a very expensive pre-processing step. Unfortunately, this limits the applicability of Cluster-GCN to medium-sized graphs. Furthermore, the paper introducing Cluster-GCN did not perform a careful analysis of the model's bias and variance. These issues are rectified by the Graph Sampling Based Inductive Learning Method (GraphSAINT).
GraphSAINT follows the basic methodology of graph sampling and mini-batch training we have already described. It differs from Cluster-GCN in that it uses a more efficient graph sampling procedure and tackles the bias and variance issues more carefully.
With regard to sampling, GraphSAINT uses random walks, which are computationally efficient, trivial to implement, and easily parallelisable. Starting from a given node, a random walk repeatedly follows a randomly selected outgoing edge, emitting the identifier of every node it visits, for a specified number of steps (a hyper-parameter that needs to be tuned). At the conclusion of a walk, we have a set of node identifiers; these nodes and the edges between them constitute a sub-graph. Normally, a sub-graph is the outcome of several random walks starting at different nodes. This pre-processing step is therefore much cheaper than using a partitioning algorithm such as METIS.
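A sketch of such a random-walk sampler (the toy graph, number of walks, and walk length are hypothetical; GraphSAINT's actual samplers are more elaborate):

```python
import random

def random_walk_subgraph(adj, n_walks, walk_len, rng):
    """GraphSAINT-style sampler: run several short random walks from
    random start nodes; the visited nodes (together with the edges
    among them) form one training sub-graph."""
    visited = set()
    starts = rng.choices(list(adj), k=n_walks)
    for node in starts:
        visited.add(node)
        for _ in range(walk_len):
            node = rng.choice(adj[node])  # follow a random outgoing edge
            visited.add(node)
    return visited

# Hypothetical 5-node toy graph.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
rng = random.Random(7)
sub = random_walk_subgraph(adj, n_walks=2, walk_len=3, rng=rng)
print(sub)  # a small node set defining one sub-graph/mini-batch
```

Each walk is independent of the others, which is what makes this step easy to parallelise.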
In order to tackle the bias issue, GraphSAINT proposes that the node representations calculated by the GCN layers be multiplied by a factor that corrects for the bias induced by the sampling procedure. The loss is also corrected by an easy-to-compute factor. These factors are functions of the probabilities of sampling each node and edge, which can be estimated during the sampling pre-processing step, killing two birds with one stone. The variance can be controlled by sampling sub-graphs using a heuristic estimate of the edge distribution (the probability that a given edge is sampled). This heuristic is a simple function of the degrees of the two nodes an edge connects, so it is very easy to calculate; it penalises hub nodes and gives preference to nodes with fewer connections. The heuristic is supported by the empirical evaluation results.
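The degree-based heuristic can be sketched directly. Here the weight of edge (u, v) is taken proportional to 1/deg(u) + 1/deg(v), and the toy graph (containing one hub node) is hypothetical:

```python
def edge_sampling_weights(edges, degree):
    """Degree-based edge weights in the spirit of GraphSAINT's edge
    sampler: weight(u, v) = 1/deg(u) + 1/deg(v). Edges touching hub
    nodes get low weight; edges between low-degree nodes are preferred."""
    return {(u, v): 1.0 / degree[u] + 1.0 / degree[v] for u, v in edges}

# Hypothetical graph: node 0 is a hub; nodes 1 and 2 share an extra edge.
degree = {0: 3, 1: 2, 2: 2, 3: 1}
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
w = edge_sampling_weights(edges, degree)
print(w[(1, 2)], w[(0, 1)])  # the hub-free edge gets the larger weight
```

Normalising these weights over all edges yields the sampling probabilities used to build sub-graphs and the correction factors.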
GraphSAINT scales to larger graphs than Cluster-GCN and achieves state-of-the-art performance. GraphSAINT's pre-processing step is more computationally efficient, and the total time for pre-processing plus model training is much smaller than for Cluster-GCN.
Node and graph sampling are not the only two approaches for scalable GNNs. FastGCN is another sampling approach that focuses on layer-wise sampling within the graph convolutional layers. FastGCN scales to larger graphs than GCN but suffers from high variance. VR-GCN improves on FastGCN by controlling the variance; the authors show that using their variance reduction technique as few as a single edge/node can be sampled while still achieving high classification accuracy.
Another class of methods simplifies GNNs by decoupling feature propagation from learning, avoiding the repeated multiplication of the adjacency matrix with the node feature matrix at every layer. Simplifying GCN (SGCN) calculates a power of the adjacency matrix and multiplies it with the node feature matrix once; effectively, this operation smooths the node features over the graph structure. The smoothed node features are then used as input to an MLP. SGCN matches GCN's performance while being considerably more computationally efficient. The single matrix multiplication step can be performed once, as a pre-processing step, potentially in a distributed system; training the MLP can then be performed efficiently on any GPU.
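A sketch of the SGCN-style pre-processing, reusing the normalised adjacency with self-loops defined earlier (the toy graph and the number of propagation steps k are hypothetical, and a real pipeline would use sparse matrices):

```python
import numpy as np

def sgc_features(A, X, k):
    """SGCN-style pre-processing: smooth node features by k applications
    of the normalised adjacency with self-loops. The result A'^k X is
    computed once, then fed to a plain MLP or logistic regression."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * np.outer(d_inv_sqrt, d_inv_sqrt)  # D^(-1/2)(A+I)D^(-1/2)
    Z = X
    for _ in range(k):
        Z = A_norm @ Z  # one smoothing step per "layer"
    return Z

# Toy path graph 0-1-2 with 2-dimensional features.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.array([[1., 0.], [0., 1.], [0., 0.]])
Z = sgc_features(A, X, k=2)
print(Z.shape)  # same shape as X; features smoothed over 2-hop neighbourhoods
```

Because no learnable parameters are involved, this loop can run once, offline, over the whole graph.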
PPNP and APPNP do something similar, except that in this case an MLP is first trained on the node features (no SGCN-style smoothing is applied at this stage). The MLP predictions are then propagated/diffused across the graph using Personalised PageRank (PPR) scores to determine how much information a node passes on. Since exact PPR can be expensive to compute for large graphs, APPNP employs an approximation to PPR, easily calculated via a suitable recurrence relation. PPNP/APPNP models can be trained end-to-end, or propagation can be applied only at inference time.
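The APPNP recurrence can be sketched as follows, where H stands in for the MLP predictions and alpha is the teleport probability; the toy graph, H, and parameter values are hypothetical:

```python
import numpy as np

def appnp_propagate(A_norm, H, alpha, n_steps):
    """APPNP propagation: iterate Z <- (1 - alpha) * A' Z + alpha * H,
    an approximation to Personalised PageRank diffusion of the MLP
    predictions H, with teleport probability alpha."""
    Z = H.copy()
    for _ in range(n_steps):
        Z = (1 - alpha) * (A_norm @ Z) + alpha * H
    return Z

# Normalised adjacency (with self-loops) of a 3-node path graph.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
A_hat = A + np.eye(3)
d = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * np.outer(d, d)

H = np.array([[1., 0.], [0., 1.], [0., 0.]])  # hypothetical MLP outputs
Z = appnp_propagate(A_norm, H, alpha=0.1, n_steps=10)
print(Z.shape)  # (3, 2): diffused class scores for each node
```

After a few iterations, node 2, whose MLP prediction was all zeros, picks up signal from its neighbour, which is exactly the diffusion behaviour PPR provides.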
In this article we introduced GCN-based GNNs and explained why such models do not scale to large graphs. We then presented several approaches for scaling such GNNs. These include node sampling, graph sampling, and layer sampling. Lastly we considered approaches that simplify GNNs by decoupling the graph structure propagation from the neural network training achieving improved computational efficiency while maintaining model performance.
Our list of techniques is certainly not exhaustive but should provide a good starting point for anyone interested in this line of work.