An Advanced Graph Processor Prototype


An Advanced Graph Processor Prototype Vitaliy Gleyzer GraphEx 2016 DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited.


1 An Advanced Graph Processor Prototype
Vitaliy Gleyzer
GraphEx 2016

DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited.

This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA C-0002 and/or FA D

Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.

Massachusetts Institute of Technology.

Delivered to the US Government with Unlimited Rights, as defined in DFARS Part or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS or DFARS as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

2 Graph Analysis at Scale

Interested in enabling advanced data analysis of large graphs in the embedded and data center environments.

[Figure: deployment environments: Aerial Support, Command Center, Tactical Support, Ground Station, Data Center]

Graph Processor - 2

3 Mathematical Foundation

Graphs capture relationship information between entities:
- Molecular forces
- Social interactions
- Semantic concepts
- Vehicle tracks

Graphs can be fully expressed in the language of linear algebra:
- Represented as sparse matrices
- Enable a mathematical foundation for data analysis
- Leverage existing linear algebra techniques and methods
- Define a small set of well-defined mathematical operations

[Figure: graph representations: adjacency matrix (N x N, vertices by vertices) and incidence matrix (N x M, vertices by edges)]
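The two representations named on this slide can be sketched in a few lines. The following is a minimal illustration, not the processor's storage format: the dictionary-of-dictionaries layout, the function names, and the sign convention for the incidence matrix (-1 at the source vertex, +1 at the destination) are all assumptions made for the example.

```python
from collections import defaultdict


def adjacency(edges):
    """Sparse N x N adjacency matrix stored as {source: {destination: edge_count}}."""
    A = defaultdict(dict)
    for src, dst in edges:
        A[src][dst] = A[src].get(dst, 0) + 1
    return dict(A)


def incidence(edges):
    """Sparse N x M incidence matrix stored as {vertex: {edge_index: value}}.

    Assumed convention: -1 where the edge leaves a vertex, +1 where it
    arrives. Self-loops are not handled in this sketch.
    """
    E = defaultdict(dict)
    for j, (src, dst) in enumerate(edges):
        E[src][j] = -1
        E[dst][j] = 1
    return dict(E)


# Hypothetical directed graph with N = 3 vertices and M = 4 edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = adjacency(edges)   # rows and columns are both vertices
E = incidence(edges)   # rows are vertices, columns are edges
```

Only non-zero entries are stored, which is the point of a sparse representation: memory grows with the number of edges rather than with N squared.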


4 Graph Structure

Structured graphs:
- Contain inherent connectivity patterns
- Edges constrained via some physical phenomenology
- Can be processed efficiently via careful hand tuning and mapping

Unstructured graphs:
- No inherent structure
- Random distribution of edges
- No clear optimization for processing

Unstructured graphs are inherently difficult to process.


5 Unstructured Graphs of Interest

Cross-domain examples (domain: dataset; scale):
- ISR: intelligence information; ~1K to 1M entities and connections
- Social: relationships between individuals; ~10M to 10B individuals and interactions
- Cyber: network patterns; ~1M to 1B network events
- Bio: connectivity between brain regions; ~1B to 1T regions and connections

Graphs of interest are large, unstructured, and often follow a power-law distribution.


6 Graph Analysis Application Stack

- Applications: Threat Detection, Sentiment Analysis, Recommender Engine
- Graph Analysis Kernels: Community Detection, Classification, Centrality Analysis
- API: GraphBLAS (Semi-ring Linear Algebra API)
- Hardware: x86-based, GPU-based, other

Applications are composed of analytics, which are implemented on top of a standard API. The standard API enables hardware diversity.

Hardware acceleration of a small number of well-defined mathematical operations enables an extensive analytic ecosystem.
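To illustrate how a graph kernel can sit on top of a small linear algebra primitive, here is a hedged sketch of breadth-first search written as repeated vector-matrix products over a Boolean (OR, AND) semiring. It is pure Python rather than a GraphBLAS binding; the function name `bfs_levels` and the `{row: {col: value}}` matrix layout are assumptions for the example.

```python
def bfs_levels(A, source):
    """Return {vertex: hop_count} for every vertex reachable from `source`.

    A is a sparse adjacency matrix stored as {row: {col: value}}.
    """
    levels = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        # frontier (OR.AND) A: a column is reached if any frontier row
        # has a stored entry in that column.
        reached = set()
        for v in frontier:
            reached.update(A.get(v, {}))
        # Mask out vertices that already have a level assigned.
        frontier = reached.difference(levels)
        for v in frontier:
            levels[v] = depth
    return levels


# Hypothetical 4-vertex graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
graph = {0: {1: 1, 2: 1}, 1: {3: 1}, 2: {3: 1}, 3: {}}
levels = bfs_levels(graph, 0)
```

The kernel only ever calls one operation (a masked vector-matrix product), so any backend that implements that operation efficiently can run it unchanged.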

7 Commercial HPC* Solutions

[Figure: two plots of FLOP/sec against number of non-zeros per column/row: "Graph Processing Single-core Performance" and "Graph Processing Performance on Commercial Parallel Multiprocessors", each annotated with a 1000x gap]

Graph algorithms run orders of magnitude slower on conventional processors.

* High Performance Computing (HPC)

8 Commercial HPC System Limitations

General-purpose processor architecture:
- Cache-based memory architecture
- Vector-unit processing
- Lack of application specialization

Communication architectures:
- Insufficient cross-sectional bandwidth
- End-to-end oriented reliable communication paradigm
- Inefficient network utilization

9 Commercial HPC Performance vs. Power

[Figure: Traversed Edges Per Second (TEPS) against Watts, log-log. Data points: Cray XK7 Titan (Measured), Cray XT4 Franklin (Measured). Target regions: Embedded Applications (1M-10M entities), Data Center Applications (100M-10B entities)]

Insufficient performance for important DoD and commercial applications.

10 Graph Processor Requirements

- Scalable architecture to enable graph analysis applications
- Size, Weight and Power (SWaP)
- Provide computational throughput required for real-world graph applications
- Native support for all GraphBLAS primitives
- Access to expert analytic community

11 Novel Graph Processor Enabling Technologies

High Bandwidth Communication Network:
- Multidimensional reliable toroid interconnect
- Randomized routing (US Patent No. 8,819,272)

Data/Algorithm Dependent Multiprocessor Mapping:
- Efficient load balancing and memory usage (US Patent No. 8,751,556)

Cacheless Memory:
- Optimized for sparse matrix processing access patterns

Graph BLAS-Based Instruction Set:
- Sparse matrix-based architecture

Accelerator-Based Architecture:
- Dedicated VLSI computation modules (US Patent No. 8,751,556)
- Systolic sorting technology (US Patent No. 8,190,943)

Custom Low-Power Circuits:
- Full custom design for critical circuitry

GRAPH PROCESSOR:
- Up to 1M nodes
- >100x throughput
- >100x power efficiency

12 Graph Processor Performance Projections

[Figure: Traversed Edges Per Second (TEPS) against Watts, log-log. Series: ASIC Graph Processor (Projected), FPGA Graph Processor (Measured), Cray XK7 Titan (Measured), Cray XT4 Franklin (Measured). Configurations marked: 8 nodes (mini-chassis), 64 nodes (chassis), 256 nodes (rack), 1024 nodes (4 racks), 4k nodes (16 racks), 16k nodes (64 racks). Target regions: Embedded Applications (1M-10M entities), Data Center Applications (100M-10B entities)]

Architectures under development provide >100x performance improvement while scaling to DoD problems of interest.

13 Supported Sparse Matrix Operations

Operation: comments
- C = A +.* B: Matrix multiply operation is the throughput driver for many important benchmark graph algorithms. Processor architecture highly optimized for this operation.
- C = A .± B, C = A .* B, C = A ./ B: Dot operations performed within local memory.
- B = op(k,A): Operation with matrix and constant. Can also be used to redistribute matrix and sum columns or rows.

The +, -, *, and / operations can be replaced with any arithmetic or logical operators, e.g. max, min, AND, OR, XOR, ...

Instruction set can efficiently support most graph algorithms.
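The operator substitution described on this slide can be sketched as a sparse matrix multiply that takes its "add" and "multiply" as parameters. This is an illustrative software sketch only, not the processor's instruction set: the function names `semiring_matmul` and `apply_scalar`, and the `{row: {col: value}}` layout, are assumptions. Absent entries are treated as the additive identity of whichever semiring is in use (0 for plus-times, infinity for min-plus).

```python
import operator


def semiring_matmul(A, B, add=operator.add, mul=operator.mul):
    """C = A add.mul B for sparse matrices stored as {row: {col: value}}."""
    C = {}
    for i, row in A.items():
        out = {}
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                term = mul(a_ik, b_kj)
                # Accumulate with the semiring "add"; the first term is stored as-is.
                out[j] = add(out[j], term) if j in out else term
        if out:
            C[i] = out
    return C


def apply_scalar(k, A, op=operator.mul):
    """B = op(k, A): apply `op` with constant k to every stored entry."""
    return {i: {j: op(k, v) for j, v in row.items()} for i, row in A.items()}


# Hypothetical weighted graph with two 2-hop routes from vertex 0 to vertex 3.
W = {0: {1: 2, 2: 5}, 1: {3: 1}, 2: {3: 1}}

plus_times = semiring_matmul(W, W)                           # sums products over routes
min_plus = semiring_matmul(W, W, add=min, mul=operator.add)  # cheapest 2-hop route
scaled = apply_scalar(10, W)
```

With plus-times the (0, 3) entry combines both routes (2*1 + 5*1), while min-plus keeps only the cheaper one (min(2+1, 5+1)), which is the building block of shortest-path algorithms.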

14 Graph Processor Node Architecture

Key attributes:
- Accelerator-based reconfigurable architecture
- High-performance optimized hardware for all instructions
- Flexible memory arbitration for all modules
- Ability to pipeline multiple accelerators together
- Optimizes external memory access
- Native hardware support for sparse matrix formats
- Simple FIFO-based network interface

15 Early Concept Demonstration System

4-board COTS PCIe system, 320 MTEPS.

Supports:
- Up to 8 processing nodes
- 1D toroidal interconnect (can be expanded to 2D)
- Parallel sparse matrix-matrix operations (including multiplication and element-wise operations)


16 Large-Scale High-Performance FPGA Board System Development

Scalable OpenVPX-based FPGA system, up to 40 TTEPS.

Board specifications:
- 4 nodes
- 32GB of SDRAM
- 960 Gb/s I/O network bandwidth

Supports:
- Up to 256K boards and 1M processing nodes
- Up to 6D network topology
- Full GraphBLAS API

[Figure: system hardware: Processing Board, Control Network, Rear Transition Module (RTM), Backplane, Data plane]
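As a rough illustration of what a multidimensional toroidal topology means for node connectivity, the sketch below lists the one-hop neighbors of a node in a torus of arbitrary dimension and checks the board and node counts quoted above. The function name `torus_neighbors` and the example dimension sizes are hypothetical; the slides do not state the per-dimension sizes or the addressing scheme, and the node-count check assumes K means 1024.

```python
def torus_neighbors(coord, dims):
    """Return the coordinates one hop from `coord` in a torus.

    `dims` gives the number of nodes along each dimension; links wrap
    around at the ends of every dimension.
    """
    coord = tuple(coord)
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % size
            nxt = tuple(nxt)
            # Skip self-links (size 1) and duplicates (size 2).
            if nxt != coord and nxt not in neighbors:
                neighbors.append(nxt)
    return neighbors


ring = torus_neighbors((0,), (8,))           # 1D ring of 8 nodes
grid = torus_neighbors((0, 0), (4, 2))       # 2D torus, 4 x 2
six_d = torus_neighbors((0,) * 6, (4,) * 6)  # 6D torus, 4 nodes per dimension

# 256K boards with 4 nodes each, taking K = 1024.
total_nodes = 256 * 1024 * 4
```

Each added dimension contributes at most two links per node, so a 6D torus gives every node up to 12 direct neighbors, and 256K boards of 4 nodes each is consistent with the stated 1M processing nodes.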

17 Technology Development and Demonstration Plan

[Figure: roadmap timeline spanning FY16, FY17, FY18, and FY19-21. Milestones shown: COTS-based FPGA Prototype, Custom FPGA Processor, Custom FPGA Board, Custom FPGA Rack, LL Grid, MGHPCC**, SWaP-Optimized ASIC, ASIC Processor. Tracks for FY19-21: Data Center and Embedded. Performance labels shown: 320M TEPS*, 2,560M TEPS, 10G TEPS, 100G TEPS, 5T TEPS, 100T TEPS]

* Traversed Edges Per Second (TEPS)
** MA Green High Performance Computing Center (MGHPCC)

18 Summary

- Graph processing is critical to many commercial, DoD, and intelligence community applications
- Conventional processors perform poorly on graph algorithms
  - Architecture is poorly matched to computational flow
- MIT LL has developed a novel sparse matrix processor architecture optimized for graph processing
  - Numerous innovations enable highly efficient graph computing
  - Orders of magnitude higher performance projected versus conventional supercomputers
- MIT LL is developing a Graph Processor Prototype using FPGA technology
  - Future ASIC version expected to deliver significantly higher performance and power efficiency to enable ultra large scale applications
