When most people hear the term “graph,” they picture a pair of perpendicular axes with a line, a curve, or bars of various heights describing the relation between two quantities. In data science and machine learning, however, a graph is a two-part data structure: vertices and edges, or G = (V, E). Here, V denotes the set of vertices, while E denotes the edges connecting them. Today, graphs are used to represent and examine relationships in data, including social networks, finance, transportation, energy grids, learning molecular fingerprints, predicting protein interfaces, and disease classification. Graph neural networks (GNNs) are a subfield of machine learning focused on building neural networks for graph data as efficiently as possible. They help data scientists work with data in non-tabular formats.
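As a concrete illustration, G = (V, E) can be stored as an adjacency list; the tiny graph below is hypothetical, purely for demonstration:

```python
# A minimal sketch of the G = (V, E) structure as a Python adjacency list.
from collections import defaultdict

def build_graph(vertices, edges):
    """Return an adjacency-list representation of an undirected graph."""
    adj = defaultdict(set)
    for v in vertices:
        adj[v]  # ensure isolated vertices still appear in the structure
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

# Hypothetical example graph: V is the vertex set, E the edge set.
V = ["alice", "bob", "carol", "dave"]
E = [("alice", "bob"), ("bob", "carol"), ("carol", "alice")]
G = build_graph(V, E)
print(sorted(G["bob"]))  # → ['alice', 'carol']
```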
In contrast to regular neural networks, building mini-batches in graph neural networks is extremely computationally expensive because the number of multi-hop neighbors grows exponentially with the number of network layers. This makes it difficult to improve the training and inference performance of graph neural networks. To solve these problems, MIT researchers worked with IBM Research to create a novel approach dubbed SALIENT (SAmpling, sLIcing, and data movemeNT). Their technique drastically reduces the runtime of graph neural networks on large datasets, such as those with 100 million nodes and 1 billion edges, by addressing three primary bottlenecks. The newly developed method also scales well as computational capacity is increased from one to sixteen graphics processing units (GPUs).
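The exponential growth of multi-hop neighborhoods can be seen in a toy experiment; the random graph, node count, and degree below are hypothetical, purely for illustration:

```python
# A toy illustration (not SALIENT code) of why mini-batch construction is
# expensive: a node's k-hop neighborhood can grow exponentially with the
# number of GNN layers.
import random

random.seed(0)
N, DEG = 10_000, 10
# Hypothetical random graph: each node gets DEG randomly chosen neighbors.
adj = {v: random.sample(range(N), DEG) for v in range(N)}

def k_hop_neighborhood(seed, k):
    """Return every node reachable from `seed` within k hops."""
    frontier, seen = {seed}, {seed}
    for _ in range(k):
        frontier = {u for v in frontier for u in adj[v]} - seen
        seen |= frontier
    return seen

sizes = [len(k_hop_neighborhood(0, k)) for k in range(1, 4)]
print(sizes)  # roughly multiplies by DEG with each extra hop
```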
The research was presented at the Fifth Conference on Machine Learning and Systems. It was supported by the U.S. Air Force Research Laboratory and the U.S. Air Force Artificial Intelligence Accelerator, as well as by the MIT-IBM Watson AI Lab.
The need for SALIENT became even more evident when researchers began investigating the challenges that current systems face when scaling cutting-edge machine learning algorithms for graphs to enormous datasets, on the order of billions of edges. Most existing technology achieves adequate performance only on smaller datasets that fit easily into GPU memory.
According to co-author Jie Chen, a senior research scientist at IBM Research and the manager of the MIT-IBM Watson AI Lab, large datasets refer to scales like the whole Bitcoin network, where certain patterns and data links might indicate trends or criminal activity. The blockchain contains around one billion Bitcoin transactions, so researchers who wish to spot illegal activity inside such a vast network must deal with a graph of this size. The team’s main objective is to build a system that can handle graphs large enough to represent the entire Bitcoin network. To keep up with the rate at which new data is created almost daily, they also want the system to be as efficient and streamlined as possible.
Read More: How Google’s GraphWorld solves Bottlenecks in Graph Neural Network Benchmarking?
The team built SALIENT with a systems-oriented approach, applying basic optimization techniques to components that fit into pre-existing machine-learning frameworks such as PyTorch Geometric (a popular machine-learning library for GNNs) and the Deep Graph Library (DGL), which are interfaces for creating machine-learning models. The key objective of developing a technique that could easily be plugged into current GNN architectures was to make it intuitive for domain experts to apply this work to their specialized fields, speeding up model training and extracting insights faster during inference. The team designed its architecture to keep all available hardware, such as GPUs, data links, and CPUs, continually busy: while the CPU samples the graph and prepares mini-batches of data to be sent across the data link, the GPU either trains the machine-learning model or performs inference.
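This CPU/GPU overlap can be sketched as a producer/consumer pipeline. The sketch below is not SALIENT's implementation; the queue size, sleep, and squaring "work" are hypothetical stand-ins for sampling, slicing, and GPU computation:

```python
# Toy producer/consumer pipeline: the CPU thread prepares mini-batches while
# the consumer thread (standing in for the GPU) processes them concurrently.
import queue
import threading
import time

batch_queue = queue.Queue(maxsize=2)
results = []

def prepare_batches(n):
    """CPU side: sample the graph and slice features into mini-batches."""
    for i in range(n):
        time.sleep(0.01)       # stand-in for sampling + slicing work
        batch_queue.put(i)
    batch_queue.put(None)      # sentinel: no more batches

def consume_batches():
    """GPU side (simulated): train or run inference on each mini-batch."""
    while (batch := batch_queue.get()) is not None:
        results.append(batch * batch)  # stand-in for a forward/backward pass

producer = threading.Thread(target=prepare_batches, args=(5,))
consumer = threading.Thread(target=consume_batches)
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)  # → [0, 1, 4, 9, 16]
```

The bounded queue is the key design choice: it lets the producer run ahead of the consumer without unbounded memory growth.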
The researchers started by examining the performance of PyTorch Geometric, which revealed an astonishingly low usage of the available GPU resources. With minor modifications, they increased GPU utilization from 10% to 30%, yielding a 1.4 to 2 times performance improvement over open-source benchmark codes. With this fast baseline code, the algorithm could run one full pass (or “epoch”) over a large training dataset in 50.4 seconds. Suspecting that even better results were possible, the MIT research team set out to examine the bottlenecks at the beginning of the data pipeline: the algorithms for graph sampling and mini-batch preparation.
In contrast to conventional neural networks, GNNs carry out a neighborhood aggregation operation, which computes information about a node using input from neighboring nodes in the graph, such as information from friends of friends of a user in a social network graph. The number of nodes a GNN must reach also grows with the number of layers, which can occasionally strain a computer’s capabilities. Although some neighborhood sampling methods employ randomization to boost efficiency marginally, this is insufficient because the methods were developed when contemporary GPUs were still in their infancy.
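One round of neighborhood aggregation can be sketched as follows; the mean aggregator, tiny graph, and scalar features here are illustrative choices, not the specific operator of any particular GNN:

```python
# Toy neighborhood aggregation: each node's new value mixes its own value
# with the average of its neighbors' values. Graph and features are
# hypothetical; real GNNs use learned weights and vector features.
adj = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}
feats = {"a": 1.0, "b": 3.0, "c": 5.0}

def aggregate(feats, adj):
    """One GNN-style layer: average each node's feature with its neighbors' mean."""
    out = {}
    for v, nbrs in adj.items():
        mean_nbr = sum(feats[u] for u in nbrs) / len(nbrs)
        out[v] = 0.5 * feats[v] + 0.5 * mean_nbr
    return out

out = aggregate(feats, adj)
print(out)  # → {'a': 2.5, 'b': 2.0, 'c': 3.0}
```

Stacking such layers is what makes information from friends of friends (two hops away) reach a node, and why the set of inputs grows with depth.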
To overcome this, the team came up with a combination of data structures and algorithmic improvements that increased sampling performance. As a result, the sampling operation alone sped up by approximately three times, decreasing the runtime per epoch from 50.4 to 34.6 seconds. The team adds that they uncovered a previously overlooked fact: sampling can also be performed during inference at an appropriate rate, boosting overall energy efficiency and performance.
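The idea behind randomized neighborhood sampling can be sketched as fanout-limited expansion; the toy graph and fanouts below are hypothetical, and this is not SALIENT's optimized implementation:

```python
# Toy fanout-limited neighbor sampling: instead of expanding every neighbor,
# each hop draws at most `fanout` neighbors per node, capping the growth of
# the mini-batch neighborhood.
import random

random.seed(1)
# Hypothetical toy graph on 50 nodes, 10 neighbors each.
adj = {v: [(v + i) % 50 for i in range(1, 11)] for v in range(50)}

def sample_neighborhood(seeds, fanouts):
    """Return the sampled multi-hop node set for a mini-batch of seed nodes."""
    nodes = set(seeds)
    frontier = set(seeds)
    for fanout in fanouts:
        nxt = set()
        for v in frontier:
            nbrs = adj[v]
            nxt.update(random.sample(nbrs, min(fanout, len(nbrs))))
        frontier = nxt - nodes
        nodes |= frontier
    return nodes

sampled = sample_neighborhood([0], fanouts=[3, 3])
print(len(sampled))  # bounded by 1 + 3 + 3*3 = 13 nodes
```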
According to MIT, earlier systems used a multi-process strategy for this sampling stage, which introduced extra copies of the data and unnecessary transfers between processes. The researchers made their SALIENT technique leaner by building a single process with lightweight threads that keep the data on the CPU in shared memory. The researchers also point out that SALIENT utilizes the shared memory of the CPU core cache to parallelize feature slicing, which collects relevant data from the nodes of interest and their surrounding neighbors and edges. As a result, the overall runtime per epoch dropped from 34.6 to 27.8 seconds.
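The feature-slicing step can be sketched as a parallel gather over a shared feature matrix. The Python version below only illustrates the structure: all names are hypothetical, and CPython's global interpreter lock limits real speedup for pure-Python loops, whereas a native-threaded system avoids that cost:

```python
# Toy parallel feature slicing: gather the feature rows for the sampled
# node IDs, splitting the gather across threads in ONE process so the
# feature matrix is shared rather than copied between workers.
from concurrent.futures import ThreadPoolExecutor

NUM_FEATS = 4
# Hypothetical feature matrix, shared read-only by all threads.
features = [[float(v)] * NUM_FEATS for v in range(1000)]

def slice_features(node_ids, workers=4):
    """Gather feature rows for `node_ids`, splitting the work across threads."""
    def gather(chunk):
        return [features[v] for v in chunk]
    chunks = [node_ids[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(gather, chunks))
    return [row for part in parts for row in part]

batch = slice_features([5, 17, 42, 9])
print(len(batch), batch[0][0])  # → 4 5.0
```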
To tackle the last bottleneck, the team introduced a prefetching step that streamlines the exchange of mini-batch data between the CPU and GPU, preparing data just before it is needed. The researchers expected this to use all of the available bandwidth on the data link and bring utilization up to 100%, but they observed only about 90%. They also discovered and fixed a performance issue in a well-known PyTorch library that caused redundant round-trip communications between the CPU and GPU, bringing SALIENT down to a runtime of 16.5 seconds per epoch.
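Prefetching can be sketched as double buffering: the next batch's host-to-device copy starts while the current batch is still being computed. This is a toy illustration, not SALIENT's code; `transfer` and `compute` are hypothetical stand-ins:

```python
# Toy double-buffered prefetching: keep one transfer in flight on a
# background thread while the "GPU" works on the batch that already arrived.
from concurrent.futures import ThreadPoolExecutor
import time

def transfer(batch):
    """Stand-in for a CPU-to-GPU copy over the data link."""
    time.sleep(0.01)
    return batch

def compute(batch):
    """Stand-in for GPU work on an already-transferred batch."""
    return sum(batch)

def run(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(transfer, batches[0])   # prefetch the first batch
        for nxt in batches[1:] + [None]:
            current = future.result()                # wait for in-flight copy
            if nxt is not None:
                future = pool.submit(transfer, nxt)  # start next copy early
            results.append(compute(current))         # overlaps with transfer
    return results

out = run([[1, 2], [3, 4], [5, 6]])
print(out)  # → [3, 7, 11]
```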
The team believes they were able to achieve such outstanding results because of their painstaking attention to detail. In the future, they hope to apply the graph neural network training system to existing algorithms that classify or predict the properties of each node, as well as to focus on identifying deeper subgraph patterns. The latter could help identify financial crimes.