MIMD GPUs for Scalable Graph Analysis: The Next Computer Hardware Revolution?
The human brain is a massively parallel processor — each of its 100 billion-plus neurons is processing data all the time, and responding if appropriate. By contrast, today’s default computer architecture is highly serial — our processors operate, by and large, one step at a time. Certain types of operations that are slow for brain cells are extremely fast for today’s computer processors, enabling our computers to do a lot of amazing things in spite of their serial architecture.
But certain types of processing seem to beg for parallel computing. Graphics is one of these, which is why specialized Graphics Processing Units (GPUs) have become prevalent in recent years, embodying focused parallel processing aimed at rapid 3D computer graphics. And Artificial Intelligence (AI) is another, which is why AI researchers have repeatedly toyed with experimental massively parallel hardware — though without the dramatic practical success that GPUs have found. The vast majority of AI algorithms these days would work more naturally on appropriate parallel hardware (with suitable algorithmic tweaks), but are run in practice on standard serial hardware. GPUs tailored for graphics are usable for certain aspects of AI implementation but are far from ideal for AI purposes.
Nvidia’s Tesla GPU – Graphics-focused SIMD parallel processing thoroughly and successfully commoditized.
As an AI researcher, I find this situation frustrating. My own AI architectures practically cry out for massive parallelism, but since no suitable parallel hardware exists, I run them on the same standard Linux boxes as everybody else. I find myself thinking from time to time about how it might be possible to transform the computer hardware scene in such a way as to lead to the prevalence of parallel hardware suitable for the sort of AI-related work I do. Hence this article, in which I forecast a path by which a certain class of highly valuable practical applications (graph analysis) might lead to the commoditization of the particular sort of parallel hardware (distributed-memory MIMD) that I think would be most useful for advanced AI.
Varieties of Parallelism
One sort of limited parallelism, multicore computing, has been gaining a lot of steam lately; dual and quad-core laptops and PCs are now standard and leading chipmakers are actively planning future chips with dozens or hundreds of cores. However, as one moves beyond dozens of cores, the currently popular approaches to multicore meet various technical challenges. In response, chipmakers have proposed various “network-on-a-chip” designs.
A further challenge here is that, while graphics and AI are naturals for multicore computing, many other practical computing problems are difficult to implement in a way that exploits multicore or other parallel hardware fully — using current computer science knowledge. As hardware becomes increasingly parallel, computer science theory and software design must struggle to keep up. The history of parallel computing is full of examples of wonderful hardware that ended up largely unexploited due to the difficulty of writing effective software for it. Sony’s PlayStation 3 (PS3) is a recent example. Few if any of the popular games written for the PS3 make full use of the innovative IBM Cell processor inside, with its novel eight-processor architecture.
An important distinction in parallel processing theory is between SIMD (single instruction, multiple datastream) and MIMD (multiple instruction, multiple datastream) architectures. Today’s GPUs are SIMD, which means that they apply one computational operation at a time, but they can apply it to many pieces of data at the same time. This is fine for graphics, in which one often wants to perform the same operation across all the different parts of a visual scene. On the other hand, the Connection Machine CM5 parallel architecture that Danny Hillis’ company Thinking Machines built in the 1990s was MIMD. Each of the Connection Machine’s processors could do something different, with all operating at the same time and all exchanging information among each other.
A finer distinction, also important, is between shared-memory and distributed-memory MIMD architectures. In a distributed-memory MIMD architecture, each processor has its own local memory, which other processors can’t access directly. This approach seems to be necessary if one wants to efficiently scale up to a large number of processors, and it’s the approach CM5 took. The PS3’s Cell processor was moving in this direction, with a separate RAM memory unit for each of its eight processors, but current multicore chips are shared-memory.
Graphical depiction of distributed-memory MIMD parallel hardware architecture. From http://www.hpcc.ecs.soton.ac.uk/EandT/courseware/IntroHPC/architectures.html
Some AI approaches can make good use of SIMD parallel hardware like GPUs. Most neural net architectures fit this category — they involve a large number of neurons, all operating according to the same basic equation: single operation (neural firing and adaptation), multiple datastream (multiple neurons). On the other hand, many AI approaches require MIMD because their infrastructures are more heterogeneous and require different items of memory to undergo different sorts of processes at the same time. And given the scale of parallelism needed by real-world AI systems, what they require is more specifically distributed-memory MIMD.
There are clever workarounds for bypassing the SIMD/MIMD distinction and techniques for implementing MIMD operations on SIMD GPUs (see http://gpucomputing.net/?q=node/2446). But while this is a promising direction, for really intensive applications naturally matching the distributed-memory MIMD model it brings a large performance hit. Also, GPU makers are gradually incorporating more and more MIMD ideas into their designs but in a fairly limited and specialized way, again not making life easy for algorithms that are naturally distributed-memory MIMD in nature.
My own AI work currently makes some use of SIMD GPU cards for operations like sensory data processing but would require MIMD for a full parallelization, so I’ve long nursed an interest in MIMD hardware. I enjoyed programming the CM5 back in the 90s, but that line of hardware was long ago discontinued for economic reasons and the old Connection Machines are now dinosaurs; tens of thousands of 1990s-era processors networked together are no match for a single mid-range modern server. What we need to accelerate AI progress is distributed-memory MIMD hardware making use of all the wonderful advances in computer hardware that have been made by following the serial paradigm over the last few decades.
Graph Analysis: MIMD’s Killer App?
But what, practically, are the forces that would be likely to make a new wave of distributed-memory MIMD hardware come about?
General purpose programming for distributed-memory MIMD hardware seems very difficult for programmers to wrap their brains around, and I don’t see this problem going away anytime soon. And the software world tends to make things easier and easier for programmers so that people can embark on programming careers with less and less computer science knowledge — a trend which seems like it would be tough to reverse. So I somewhat doubt general purpose MIMD hardware will take off in the near future. I suspect this will wait until we have fairly advanced programming-capable AIs that can handle the adaptation of algorithms and data structures to new hardware architectures better than humans can.
But if not general purpose MIMD, then what specialized domain might drive distributed-memory MIMD development? Graphics has driven SIMD development and in a way that has ultimately benefited many applications besides graphics like scientific computing and various aspects of AI such as vision processing. But graphics doesn’t appear to have desperate need of distributed-memory MIMD parallelism.
What about AI? AI would benefit tremendously from MIMD hardware, but AI is not yet such a huge business that it would motivate chipmakers to embark on an expensive new direction like this.
My best guess is that what will drive the development of specialized distributed-memory MIMD parallel hardware, analogous to SIMD GPUs, is graph analysis. A “graph,” in this context, means a collection of nodes with links going between them (not a coordinate graph). This is something that consumers don’t know much about. However, they rely on it taking place server-side for delivery of the web services they use every day. A graph is the tool used inside computer software to represent a large number of entities that all relate to each other in complex ways. If the relationships are strictly hierarchical in nature, then one can use specialized graphs called “trees,” but most of the time real-world relatedness is more complex than this and one has general, tangled-up graphs (or even more complex beasts called “hypergraphs,” which can be reduced to graphs by the appropriate mathematics).
If you have a tech background and are eager for more details, what I’m calling “graph analysis” very broadly is the asking of queries based on graph structure (“find a subgraph that looks like this;” “find all nodes that are reachable from this particular node via paths of length K with certain properties”), the transformation of graphs into other graphs with more convenient structure, the learning of graphs from other graphs (which is a very general model of inference) and the enablement of dynamical systems on graphs.
Snapshot of the Internet graph, where nodes indicate Web pages and links indicate hyperlinks. From http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/WebImages/i5.jpg
Facebook network graph, indicating a small subnetwork of Facebook’s social network. From http://www.flickr.com/photos/greenem/11696663/
Graph visualization showing a tiny cross-section of the interconnectivity of the international economy. From http://www.financialnetworkanalysis.com/
Graph depicting interconnectivity of macaque brain regions. From Modha and Singh’s 2010 paper, which I discussed at: http://multiverseaccordingtoben.blogspot.com/2011/06/unraveling-modha-singhs-map-of-macaque.html
Semantic graph summarizing a portion of a news article, produced via AI software (From http://emeraldinsight.com). Now imagine a more accurate semantic graph produced by stronger AI, one spanning all the information on the Web. Such graphs can now be produced with moderate accuracy using current hardware and software; doing anything useful with them is very expensive.
Google uses graph analysis to answer search queries and place ads. Facebook uses it to deliver information based on your social network. Government agencies use it to study the vast masses of data they gather about matters of interest around the globe. Financial institutions will need to use it more and more in future to make decisions in accordance with the increasingly interconnected world. Biomedical scientists will need to use it more and more to discover and analyze therapeutics in the context of complex biological networks.
The example graph diagrams shown in this article indicate a fraction of the scope of the applicability of graph representation and analysis, but fail to indicate the scale and complexity of real-world graphs. This is because these aspects are simply incomprehensible to the human eye and to the untrained human mind. Graph visualization is a powerful tool, but it’s limited to the detailed visualization of small graphs and the rough visualization of large graphs. Graph analysis and search software can do better and can analyze large graphs in detail, but this requires heavy computational firepower as well as sophisticated algorithms. Mass-produced, economical MIMD hardware could help a lot here.
Graph analysis matches naturally to distributed-memory MIMD parallelism because many times one wants each node of a graph to be able to do something different, based on the specific data that it contains. In a full-on MIMD approach, each graph node could get its own processor and memory to use as it wishes. An alternative, nearly as good, is for each processor/memory unit to handle a cluster of interconnected graph nodes. This is particularly effective in the (common) case where graph manipulation processes need to deal with a set of closely connected nodes all together. There is a lot of algorithmic subtlety involved here, the specifics of which are dependent not only on the graph algorithms but also on the particulars of the hardware model. But this is a fairly constrained sort of problem and we already have a large amount of theory useful for addressing it.
If someone were to build a Graph Processing Unit (a different kind of GPU, this time MIMD rather than SIMD-parallel), it would find a massive variety of uses server-side, across nearly all areas of commerce, government and research. Graphs are a flexible and convenient representation for nearly any kind of data and computer science has developed a rich pool of algorithms for searching and manipulating them in diverse ways. Furthermore, there is already a lot of knowledge in the academic research literature regarding ways of parallelizing various graph algorithms. This is a much smaller and better-understood problem than MIMD parallelizing algorithms in general. Certainly many mathematical and practical problems remain in effectively implementing the full scope of graph algorithms on commodity MIMD processors, but this is a tractable domain of research that the computer science community knows well how to attack.
Granted, every advanced AI architecture can’t be cast in terms of “graph analysis,” even as broadly as I’ve construed it here. Surely I’m biased a bit by the fact that my own AI approach is very naturally framed in this way. But I have a strong suspicion that, just as a variety of scientific computing problems unrelated to graphics have been fruitfully implemented on current GPUs, similarly a huge variety of AI algorithms and approaches would benefit from distributed-memory MIMD GPUs customized for graph analysis.
And for the programmers in the audience, it’s worth noting that the hairy MIMD-parallel algorithmic voodoo could mainly be hidden from ordinary programmers by creating graph manipulation “library functions” that could be invoked via scripting languages. Recently in our OpenCog AI project, we had to make some of our code use multiple threads so as to better exploit multicore architectures; it was surprisingly simple because the code used the C++ Standard Template Library (STL) for most of its data manipulations, while others had already written multi-threaded versions of the STL features we used. Similar things could be done for MIMD-parallel GPUs. For instance, parallel graph algorithms could be wrapped inside the Boost Graph Library, commonly used by C++ programmers for graph manipulation. (Sorry non-coders for all the tech talk!)
Prediction Plus A Proactive Approach
I’m going to make a prediction: I think someone will introduce specialized distributed-memory MIMD hardware for large-scale graph manipulation, probably sometime in the next decade or two. And I’m also being proactive and trying to nudge things along a bit, hoping that by putting the idea out there explicitly, I may perhaps play a small role in making it happen sooner rather than later.
I could use MIMD GPUs for my AI work and the world could use it for a host of purposes. Come on, hardware makers — let’s get moving!