The Future of Advanced (Secure) Computing
DataSToRM: Data Science and Technology Research Environment
Mr. Vitaliy Gleyzer
MIT Lincoln Laboratory
5 March 2018

This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.
Distribution Statement A: Approved for public release: distribution unlimited.
© 2018 Massachusetts Institute of Technology.
Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

Advancing the State of Big Data Analytics: Raw Data to Insight
Big data stack (diagram layers): application, presentation, data, analytics, technologies
Questions posed across the stack:
  How can the insight be presented to improve human cognition?
  Can different data help improve human understanding?
  Are there new application areas enabled by new analytics?
  What new information should be collected?
  What new insight can be gained from the data?
  How can data be stored/collected more efficiently?
  How can existing analytics be accelerated?
  Can analytics be modified to allow efficient implementation?

Large-Scale Graph Applications Today
Applications: Drug Discovery, Personalized Healthcare, Fraud Detection, Recommender Systems, Functional Brain Mapping, Cyber Data Analysis
Development Environment:
  Graph Analysis Frameworks and Databases: D4M
  Algorithms: PageRank, Centrality, Walktrap, InfoMap, K-Truss, BFS, MST
  Visualization Toolkits
Hardware Platform: Lincoln Laboratory Supercomputing Center
Diverse, quickly evolving ecosystem

Advancing the State of Big Data Analytics: Challenges
Technology moves quickly
  New algorithms and analytic techniques
  New storage solutions
  New processing technologies
  New database technologies and frameworks
  New applications
New framework adoption is a serious investment
How to leverage new technologies?
How to enable co-design opportunities?
How to integrate disparate communities to enable co-design?
Keeping up with big data technology is challenging

Example Standardization Efforts
Applications: Drug Discovery, Personalized Healthcare, Fraud Detection, Recommender Systems, Functional Brain Mapping, Cyber Data Analysis
Development Environment:
  Graph Analysis Frameworks and Databases: GraphBLAS, D4M, Apache Gremlin
  Algorithms: PageRank, Centrality, Walktrap, InfoMap, K-Truss, BFS, MST
  Visualization Toolkits
Hardware Platform: Lincoln Laboratory Supercomputing Center
Different communities are attempting to unify and standardize interfaces and languages in the ecosystem

Example Standardization Efforts (continued)
Applications: Drug Discovery, Personalized Healthcare, Fraud Detection, Recommender Systems, Functional Brain Mapping, Cyber Data Analysis
Development Environment:
  Graph Analysis Frameworks and Databases: GraphBLAS, D4M, Apache Gremlin
  Algorithms: PageRank, Centrality, Walktrap, InfoMap, K-Truss, BFS, MST
  Visualization Toolkits
Hardware Platform: Lincoln Laboratory Supercomputing Center
Backend interface standardization enables hardware innovation!
Callouts on the diagram:
  Parallel in-memory database
  100-1000x graph edge traversal speed
  Simple linear algebra API
Different communities are attempting to unify and standardize interfaces and languages in the ecosystem

Unifying Principles for Big Data Graphs
Graphs capture relationship information between entities
  Molecular forces
  Social interactions
  Semantic concepts
  Vehicle tracks
Graphs can be fully expressed in the language of linear algebra
  Represented as sparse matrices
  Enable a mathematical foundation for data analysis
  Leverage existing linear algebra techniques and methods
  Define a small set of well-defined mathematical operations
Graph representations (for N vertices and M edges)
  Adjacency matrix (N × N), indexed by vertices on both axes
  Incidence matrix (N × M), indexed by vertices and edges

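The two representations named on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the presentation: the edge list, the function names, and the choice to split the incidence matrix into separate "out" and "in" matrices are all assumptions made for the example. Dense lists are used only for readability; at scale these matrices would be stored sparsely. The last helper checks the standard identity that, for a directed graph, the adjacency matrix equals E_out times the transpose of E_in.

```python
def adjacency_matrix(num_vertices, edges):
    """N x N adjacency matrix: A[i][j] counts the edges from vertex i to vertex j."""
    A = [[0] * num_vertices for _ in range(num_vertices)]
    for src, dst in edges:
        A[src][dst] += 1
    return A


def incidence_matrices(num_vertices, edges):
    """Two N x M incidence matrices (assumed convention: rows are vertices).

    E_out[v][e] = 1 if edge e leaves vertex v.
    E_in[v][e]  = 1 if edge e enters vertex v.
    """
    num_edges = len(edges)
    E_out = [[0] * num_edges for _ in range(num_vertices)]
    E_in = [[0] * num_edges for _ in range(num_vertices)]
    for e, (src, dst) in enumerate(edges):
        E_out[src][e] = 1
        E_in[dst][e] = 1
    return E_out, E_in


def matmul_transpose(X, Y):
    """Return X * Y^T for dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_x, row_y)) for row_y in Y] for row_x in X]


# Hypothetical 4-vertex directed graph; the edge 0 -> 1 appears twice (a multi-edge).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 1)]
A = adjacency_matrix(4, EDGES)
E_out, E_in = incidence_matrices(4, EDGES)
```

The incidence form keeps one column per edge, so it can represent multi-edges (and, with more than two nonzeros per column, hyperedges) that the adjacency form collapses into a single count.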
DataSToRM: Data Science and Technology Research Environment
Layers (top to bottom): Applications, Graph Analysis Kernels, API, Hardware
Applications and kernels shown: Threat Detection, Community Detection, Sentiment Analysis, Classification, Recommender Engine, Centrality Analysis
API: GraphBLAS (semiring linear algebra API)
Callouts on the diagram:
  Composed of analytics
  Implemented on top of a standard API
  Enables hardware diversity
Hardware acceleration of a small number of well-defined mathematical operations enables an extensive analytic ecosystem

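The layering on this slide (analytics written against one API, with interchangeable implementations underneath) can be illustrated with a small sketch. Everything here is hypothetical: the two backend classes, the `vxm` method name (borrowed from the GraphBLAS function list but not the real GraphBLAS signature), and the example graph are assumptions made for illustration only. The point is that the analytics at the bottom never look at how the matrix is stored.

```python
class DenseBackend:
    """Hypothetical backend that stores the adjacency matrix as a list of rows."""

    def __init__(self, n, edges):
        self.n = n
        self.rows = [[0.0] * n for _ in range(n)]
        for i, j in edges:
            self.rows[i][j] = 1.0

    def vxm(self, x):
        """Row vector times matrix: y[j] = sum over i of x[i] * A[i][j]."""
        return [sum(x[i] * self.rows[i][j] for i in range(self.n)) for j in range(self.n)]


class SparseBackend:
    """Hypothetical backend that stores only the nonzero entries."""

    def __init__(self, n, edges):
        self.n = n
        self.entries = {(i, j): 1.0 for i, j in edges}

    def vxm(self, x):
        y = [0.0] * self.n
        for (i, j), a in self.entries.items():
            y[j] += x[i] * a
        return y


def in_degree(backend):
    """Analytic written only against vxm: in-degree is the all-ones vector times A."""
    return backend.vxm([1.0] * backend.n)


def two_hop_path_counts(backend, source):
    """Number of length-2 paths from `source` to each vertex: (e_source * A) * A."""
    x = [0.0] * backend.n
    x[source] = 1.0
    return backend.vxm(backend.vxm(x))


# Hypothetical 4-vertex directed graph used for both backends.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3)]
dense = DenseBackend(4, EDGES)
sparse = SparseBackend(4, EDGES)
```

Because both analytics call only `vxm`, swapping `dense` for `sparse` (or, by analogy, for an accelerator-backed implementation) requires no change to the analytic code.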
GraphBLAS Overview
Five key operations
  A = NxM(i,j,v)        (build a matrix from tuples)
  (i,j,v) = A           (extract tuples from a matrix)
  C = A ⊕ B             (element-wise addition)
  C = A ⊗ C             (element-wise multiplication)
  C = A B = A ⊕.⊗ B     (matrix multiplication over a semiring)
Can be used to build 12 GraphBLAS standard functions
  buildmatrix, extracttuples, Transpose, mxm, mxv, vxm, extract, assign, ewiseadd, ewisemult, apply, reduce
Can be used to build a variety of graph utility functions
  Tril(), Triu(), Degree Filtered BFS, ...
Can be used to build a variety of graph algorithms
  K-Truss, Jaccard Coefficient, Non-Negative Matrix Factorization, ...
That work on a wide range of graphs
  Hyper, multi-directed, multi-weighted, multi-partite, multi-edge
Unifying interface for backend graph processing

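The five operations listed above can be sketched in plain Python over a dictionary-based sparse matrix. This is an illustrative sketch under stated assumptions, not the GraphBLAS C API: the function names, the `{(i, j): v}` storage, and the example graph are all hypothetical. The `add` and `mult` parameters stand in for the semiring's ⊕ and ⊗, so the same `mxm` routine yields path weights under plus-times, shortest two-hop distances under min-plus, and a BFS frontier expansion under or-and.

```python
import operator


def build_matrix(tuples, add=operator.add):
    """A = (i, j, v): build a sparse matrix {(i, j): v}; duplicates combine with `add`."""
    A = {}
    for i, j, v in tuples:
        A[(i, j)] = add(A[(i, j)], v) if (i, j) in A else v
    return A


def extract_tuples(A):
    """(i, j, v) = A: return the stored entries as a sorted list of tuples."""
    return sorted((i, j, v) for (i, j), v in A.items())


def ewise_add(A, B, add=operator.add):
    """C = A (+) B: union of the two sparsity patterns."""
    C = dict(A)
    for key, v in B.items():
        C[key] = add(C[key], v) if key in C else v
    return C


def ewise_mult(A, B, mult=operator.mul):
    """C = A (x) B: intersection of the two sparsity patterns."""
    return {key: mult(v, B[key]) for key, v in A.items() if key in B}


def mxm(A, B, add=operator.add, mult=operator.mul):
    """C = A (+).(x) B: sparse matrix multiply over the semiring (add, mult)."""
    rows_of_B = {}
    for (k, j), b in B.items():
        rows_of_B.setdefault(k, []).append((j, b))
    C = {}
    for (i, k), a in A.items():
        for j, b in rows_of_B.get(k, ()):
            p = mult(a, b)
            C[(i, j)] = add(C[(i, j)], p) if (i, j) in C else p
    return C


# Hypothetical weighted directed graph: 0->1 (1), 0->2 (4), 1->2 (2), 2->3 (1).
A = build_matrix([(0, 1, 1), (0, 2, 4), (1, 2, 2), (2, 3, 1)])

# Plus-times semiring: sum over two-hop paths of the product of edge weights.
two_hop_products = mxm(A, A)

# Min-plus semiring: length of the shortest two-hop path.
two_hop_shortest = mxm(A, A, add=min, mult=operator.add)

# Shortest path using at most two hops: element-wise min of one- and two-hop results.
at_most_two_hops = ewise_add(A, two_hop_shortest, add=min)

# One BFS step from vertex 0 using the or-and semiring on the boolean pattern of A.
pattern = {key: True for key in A}
frontier = {(0, 0): True}  # 1 x N row vector with vertex 0 active
next_frontier = mxm(frontier, pattern, add=operator.or_, mult=operator.and_)
```

Swapping the semiring rather than rewriting the loop is the design idea the slide points at: one multiply kernel, many graph algorithms.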
Lincoln Laboratory Technologies Targeting Large-Scale Graph Analytics
Technology stack (diagram, top to bottom): Graph Algorithms, D4M, GraphBLAS, Graph Processor, LLSC
Dynamic Distributed Dimensional Data Model (D4M)
  Data analysis framework based on associative array algebra
  Concise language for complex graph analytics
  Mathematical closure
  Linear algebra underpinning
GraphBLAS
  Standard API for graph analytics using sparse linear algebra primitives
  Simple hardware-agnostic API
Graph Processor
  Novel graph processing architecture
  Scalable graph processing hardware architecture
  Unprecedented performance
  Native linear algebra instruction set
Lincoln Laboratory Supercomputing Center (LLSC)
  State-of-the-art supercomputing environment
  Heterogeneous processing capabilities
  Ideal technology integration environment

Graph Processor Matrix Multiply Performance
[Figure: traversed edges per second versus power in watts, both on logarithmic axes. Series: ASIC Graph Processor (projected), FPGA Graph Processor (measured), Cray XK7 Titan (measured), Cray XT4 Franklin (measured). Labels on the plot: 2017 1-chassis system; 2 nodes; 1, 4, 16, and 64 racks; ASIC; FPGA; today's state of the art; graph processor performance 100x; embedded applications; data center applications.]
Highly efficient graph processing technology that is 100s to 1000s of times more efficient than traditional architectures

Collaboration Opportunities
Developing graph algorithms in the language of linear algebra
  Community detection, subgraph isomorphism, subgraph matching, etc.
Developing graph algorithms that can scale to datasets with billions to trillions of vertices
  Sparsity-aware, distributed-memory algorithms
Identifying or developing new technologies to leverage the linear algebra abstraction
  Compilers, optimizers, hardware accelerators, etc.
Integrating a GraphBLAS backend as part of popular frameworks
  e.g., Apache TinkerPop, Neo4j, ElasticSearch, etc.