An Advanced Graph Processor Prototype
Transcription
1 An Advanced Graph Processor Prototype
Vitaliy Gleyzer
GraphEx 2016
DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited.
This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA C-0002 and/or FA D. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.
Massachusetts Institute of Technology.
Delivered to the US Government with Unlimited Rights, as defined in DFARS Part or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS or DFARS as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.
2 Graph Analysis at Scale
Interested in enabling advanced data analysis of large graphs in embedded and data center environments
[Figure: deployment environments: Aerial Support, Command Center, Tactical Support, Ground Station, Data Center]
Graph Processor - 2
3 Mathematical Foundation
- Graphs capture relationship information between entities
  - Molecular forces
  - Social interactions
  - Semantic concepts
  - Vehicle tracks
- Graphs can be fully expressed in the language of linear algebra
  - Represented as sparse matrices
  - Enable a mathematical foundation for data analysis
  - Leverage existing linear algebra techniques and methods
  - Define a small set of well-defined mathematical operations
[Figure: Graph Representations: adjacency matrix (N x N, vertices by vertices) and incidence matrix (N x M, vertices by edges)]
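The two representations named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the processor's storage format: the four-vertex edge list is made up, the dictionary-of-coordinates layout is just one simple sparse encoding, and the sign convention for the incidence matrix (-1 at the source, +1 at the destination) is an assumption, since the slide does not state one.

```python
# Illustrative only: a tiny directed graph stored as sparse matrices.
# The edge list below is hypothetical.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # (source, destination) pairs
num_vertices = 4                            # N
num_edges = len(edges)                      # M

# Adjacency matrix (N x N), stored sparsely as {(row, col): value}.
adjacency = {(src, dst): 1 for src, dst in edges}

# Incidence matrix (N x M): one column per edge.
# Assumed convention: -1 at the source vertex, +1 at the destination vertex.
incidence = {}
for e, (src, dst) in enumerate(edges):
    incidence[(src, e)] = -1
    incidence[(dst, e)] = 1

def to_dense(sparse, rows, cols):
    """Expand a {(row, col): value} matrix into a list of rows for display."""
    return [[sparse.get((r, c), 0) for c in range(cols)] for r in range(rows)]

adjacency_dense = to_dense(adjacency, num_vertices, num_vertices)
incidence_dense = to_dense(incidence, num_vertices, num_edges)
```

Only the nonzero entries are stored, which is what makes the matrix view practical for large graphs with few edges per vertex.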
4 Graph Structure
Structured Graphs
- Contain inherent connectivity patterns
- Edges constrained via some physical phenomenology
- Can be processed efficiently via careful hand tuning and mapping
Unstructured Graphs
- No inherent structure
- Random distribution of edges
- No clear optimization for processing
[Figure: adjacency matrix plots (vertices by vertices) for a structured and an unstructured graph]
Unstructured graphs are inherently difficult to process
5 Unstructured Graphs of Interest
Cross-domain Dataset Examples (domain: content; scale)
- ISR: intelligence information; ~1K to 1M entities and connections
- Social: relationships between individuals; ~10M to 10B individuals and interactions
- Cyber: network patterns; ~1M to 1B network events
- Bio: connectivity between brain regions; ~1B to 1T regions and connections
Graphs of interest are large, unstructured, and often follow a power-law distribution
6 Graph Analysis Application Stack
- Applications: Threat Detection, Sentiment Analysis, Recommender Engine (composed of analytics)
- Graph Analysis Kernels: Community Detection, Classification, Centrality Analysis (implemented on top of a standard API)
- API: GraphBLAS (Semi-ring Linear Algebra API) (enables hardware diversity)
- Hardware: x86-based/GPU-based/other
Hardware acceleration of a small number of well-defined mathematical operations enables an extensive analytic ecosystem
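To make the layering concrete, here is a hedged sketch of how one kernel could be composed from a single linear-algebra primitive. Breadth-first search is used as the example, although the slide does not list it. Each step multiplies the frontier vector by the adjacency matrix over a Boolean (OR, AND) semiring and then masks out vertices already visited. The function name `bfs_levels` and the dictionary-of-sets matrix layout are hypothetical and are not part of the GraphBLAS API.

```python
def bfs_levels(adjacency, source):
    """Breadth-first search written as repeated vector-matrix products.

    adjacency maps each vertex to the set of its out-neighbors, which is
    a sparse row of the adjacency matrix. Returns {vertex: BFS depth}.
    """
    levels = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        # frontier^T * A over the (OR, AND) semiring:
        # the union of the sparse rows selected by the frontier.
        reached = set()
        for v in frontier:
            reached |= adjacency.get(v, set())
        # Mask: keep only vertices that have not been assigned a level yet.
        frontier = {v for v in reached if v not in levels}
        for v in frontier:
            levels[v] = depth
    return levels

# Hypothetical four-vertex graph.
example_graph = {0: {1, 2}, 1: {2}, 2: {3}}
example_levels = bfs_levels(example_graph, 0)
```

The kernel never touches graph-specific data structures; everything is expressed through the one product, which is the part a hardware back end would accelerate.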
7 Commercial HPC* Solutions
[Figure: two plots of FLOP/sec versus number of non-zeros per column/row, "Graph Processing Single-core Performance" and "Graph Processing Performance on Commercial Parallel Multiprocessors", annotated "1000x"]
Graph algorithms run orders of magnitude slower on conventional processors
* High Performance Computing (HPC)
8 Commercial HPC System Limitations
- General-purpose processor architecture
  - Cache-based memory architecture
  - Vector-unit processing
  - Lack of application specialization
- Communication architectures
  - Insufficient cross-sectional bandwidth
  - End-to-end oriented reliable communication paradigm
  - Inefficient network utilization
9 Commercial HPC Performance vs. Power
[Figure: Traversed Edges Per Second (TEPS) versus Watts. Measured points: Cray XK7 Titan, Cray XT4 Franklin. Target regions: Embedded Applications (1M-10M entities), Data Center Applications (100M-10B entities)]
Insufficient performance for important DoD and commercial applications
10 Graph Processor Requirements
- Scalable architecture to enable graph analysis applications
- Size, Weight and Power (SWaP)
- Provide computational throughput required for real-world graph applications
- Native support for all GraphBLAS primitives
- Access to expert analytic community
11 Novel Graph Processor Enabling Technologies
- High Bandwidth Communication Network
  - Multidimensional reliable toroid interconnect
  - Randomized routing (US Patent No. 8,819,272)
- Cacheless Memory
  - Optimized for sparse matrix processing access patterns
- Accelerator-Based Architecture
  - Dedicated VLSI computation modules (US Patent No. 8,751,556)
  - Systolic sorting technology (US Patent No. 8,190,943)
- Data/Algorithm Dependent Multiprocessor Mapping
  - Efficient load balancing and memory usage (US Patent No. 8,751,556)
- GraphBLAS-Based Instruction Set
  - Sparse matrix-based architecture
- Custom Low-Power Circuits
  - Full custom design for critical circuitry
GRAPH PROCESSOR: up to 1M nodes, >100x throughput, >100x power efficiency
12 Graph Processor Performance Projections
[Figure: Traversed Edges Per Second (TEPS) versus Watts. Series: ASIC Graph Processor (Projected), FPGA Graph Processor (Measured), Cray XK7 Titan (Measured), Cray XT4 Franklin (Measured). Configurations: 8 nodes (mini-chassis), 64 nodes (chassis), 256 nodes (rack), 1024 nodes (4 racks), 4k nodes (16 racks), 16k nodes (64 racks). Target regions: Embedded Applications (1M-10M entities), Data Center Applications (100M-10B entities)]
Architectures under development provide >100x performance improvement while scaling to DoD problems of interest
13 Supported Sparse Matrix Operations
Operation: Comments
- C = A +.* B: Matrix multiply operation is the throughput driver for many important benchmark graph algorithms. Processor architecture highly optimized for this operation.
- C = A .± B, C = A .* B, C = A ./ B: Dot operations performed within local memory.
- B = op(k, A): Operation with matrix and constant. Can also be used to redistribute matrix and sum columns or rows.
The +, -, *, and / operations can be replaced with any arithmetic or logical operators, e.g. max, min, AND, OR, XOR.
Instruction set can efficiently support most graph algorithms
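The operator-substitution idea can be illustrated with a small software sketch. It shows the mathematics only and says nothing about how the hardware implements it. The names `semiring_mxm` and `elementwise_mult`, the `{(row, col): value}` storage, and the sample matrices are all hypothetical. Entries that are not stored are treated as the additive identity of whichever semiring is in use.

```python
import operator

def semiring_mxm(A, B, add=operator.add, mul=operator.mul):
    """C = A add.mul B for sparse matrices stored as {(row, col): value}."""
    # Index B by row so each stored A[i, k] meets only stored B[k, j].
    b_rows = {}
    for (k, j), b in B.items():
        b_rows.setdefault(k, []).append((j, b))
    C = {}
    for (i, k), a in A.items():
        for j, b in b_rows.get(k, []):
            product = mul(a, b)
            # Accumulate with the semiring's "add"; first hit just stores.
            C[(i, j)] = add(C[(i, j)], product) if (i, j) in C else product
    return C

def elementwise_mult(A, B, mul=operator.mul):
    """C = A .* B: only positions stored in both operands survive."""
    return {key: mul(a, B[key]) for key, a in A.items() if key in B}

# Hypothetical weighted edges: A holds first hops, B holds second hops.
A = {(0, 1): 2, (0, 2): 5, (1, 2): 1}
B = {(1, 3): 4, (2, 3): 1}

plus_times = semiring_mxm(A, B)                            # ordinary +.*
min_plus = semiring_mxm(A, B, add=min, mul=operator.add)   # min.+ variant
```

With the ordinary operators the product sums path weights multiplied along two hops; swapping in `min` and `+` makes the same routine return the cheapest two-hop path between each pair of vertices.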
14 Graph Processor Node Architecture
Key attributes:
- Accelerator-based reconfigurable architecture
- High-performance optimized hardware for all instructions
- Flexible memory arbitration for all modules
- Ability to pipeline multiple accelerators together
- Optimizes external memory access
- Native hardware support for sparse matrix formats
- Simple FIFO-based network interface
15 Early Concept Demonstration System
4-board COTS PCIe system
320 MTEPS
Supports:
- Up to 8 processing nodes
- 1D toroidal interconnect (can be expanded to 2D)
- Parallel sparse matrix-matrix operations (including multiplication and element-wise operations)
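As a rough illustration of what a toroidal interconnect means for node addressing, the sketch below computes a node's neighbors with wraparound in each dimension. The function name `torus_neighbors` and the 4 x 2 layout for the 2D case are assumptions made for the example; the slide gives only the node count and says the 1D ring can be expanded to 2D.

```python
def torus_neighbors(coord, dims):
    """Return the distinct neighbors of `coord` in a torus of shape `dims`."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (coord[axis] + step) % size   # wrap at the edges
            candidate = tuple(shifted)
            # In a dimension of size 2 both directions reach the same node,
            # and in a dimension of size 1 a node would reach itself.
            if candidate != coord and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors

ring = torus_neighbors((0,), (8,))        # 1D ring of 8 nodes
grid = torus_neighbors((0, 0), (4, 2))    # hypothetical 2D torus, 4 x 2 nodes
```

Here `ring` is `[(7,), (1,)]`, showing node 0 linked to node 7 through the wraparound, and `grid` is `[(3, 0), (1, 0), (0, 1)]`.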
16 Large-Scale High-Performance FPGA Board System Development
Scalable OpenVPX-based FPGA system
Up to 40 TTEPS
Board specifications:
- 4 nodes
- 32GB of SDRAM
- 960 Gb/s I/O network bandwidth
Supports:
- Up to 256K boards and 1M processing nodes
- Up to 6D network topology
- Full GraphBLAS API
[Figure: system components: Processing Board, Control Network, Rear Transition Module (RTM), Backplane, Data plane]
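The board and system figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes that "K" means 1024 and that a board's SDRAM is shared evenly among its four nodes; neither is stated on the slide. The aggregate memory figure is derived here and is not a number the presentation quotes.

```python
KI = 1024                      # assumption: "K" on the slide means 2**10

boards = 256 * KI              # "Up to 256K boards"
nodes_per_board = 4            # board specification
sdram_per_board_gb = 32        # board specification

# 256K boards x 4 nodes per board gives the "1M processing nodes" figure.
total_nodes = boards * nodes_per_board

# Derived values, not quoted in the presentation.
total_sdram_gb = boards * sdram_per_board_gb
sdram_per_node_gb = sdram_per_board_gb / nodes_per_board   # assumes an even split
```

Under these assumptions the node count comes out to exactly 2**20, which matches the "1M processing nodes" figure.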
17 Technology Development and Demonstration Plan
[Figure: development roadmap across FY16, FY17, FY18 and FY19-21. Stages: COTS-based FPGA Prototype, Custom FPGA Processor, Custom FPGA Board, Custom FPGA Rack, LL Grid MGHPCC**, SWaP-Optimized ASIC, ASIC Processor. Tracks: Data Center: FY19-21; Embedded: FY19-21. Performance labels: 320M TEPS*, 2,560M TEPS, 10G TEPS, 100G TEPS, 5T TEPS, 100T TEPS]
* Traversed Edges Per Second (TEPS)
** MA Green High Performance Computing Center (MGHPCC)
18 Summary
- Graph processing is critical to many commercial, DoD, and intelligence community applications
- Conventional processors perform poorly on graph algorithms
  - Architecture is poorly matched to computational flow
- MIT LL has developed a novel sparse matrix processor architecture optimized for graph processing
  - Numerous innovations enable highly efficient graph computing
  - Orders of magnitude higher performance projected versus conventional supercomputers
- MIT LL is developing a Graph Processor Prototype using FPGA technology
  - Future ASIC version expected to deliver significantly higher performance and power efficiency to enable ultra large scale applications
DataSToRM: Data Science and Technology Research Environment
The Future of Advanced (Secure) Computing DataSToRM: Data Science and Technology Research Environment This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering
More informationAnalysis and Mapping of Sparse Matrix Computations
Analysis and Mapping of Sparse Matrix Computations Nadya Bliss & Sanjeev Mohindra Varun Aggarwal & Una-May O Reilly MIT Computer Science and AI Laboratory September 19th, 2007 HPEC2007-1 This work is sponsored
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationSecure Multi-Party Computation of Probabilistic Threat Propagation
Secure Multi-Party Computation of Probabilistic Threat Propagation Emily Shen Nabil Schear, Ellen Vitercik, Arkady Yerukhimovich Graph Exploitation Symposium 216 DISTRIBUTION STATEMENT A. Approved for
More informationHigh-Performance Linear Algebra Processor using FPGA
High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationLLMORE: Mapping and Optimization Framework
LORE: Mapping and Optimization Framework Michael Wolf, MIT Lincoln Laboratory 11 September 2012 This work is sponsored by Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA8721-05-C-0002.
More informationBoosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid
More informationAutomated Code Generation for High-Performance, Future-Compatible Graph Libraries
Research Review 2017 Automated Code Generation for High-Performance, Future-Compatible Graph Libraries Dr. Scott McMillan, Senior Research Scientist CMU PI: Prof. Franz Franchetti, ECE 2017 Carnegie Mellon
More informationGraphBLAS: A Programming Specification for Graph Analysis
GraphBLAS: A Programming Specification for Graph Analysis Scott McMillan, PhD GraphBLAS: A Programming A Specification for Graph Analysis 2016 October Carnegie 26, Mellon 2016University 1 Copyright 2016
More informationLeveraging Data Provenance to Enhance Cyber Resilience
Leveraging Data Provenance to Enhance Cyber Resilience Thomas Moyer Karishma Chadha, Robert Cunningham, Nabil Schear, Warren Smith, Adam Bates, Kevin Butler, Frank Capobianco, Trent Jaeger, and Patrick
More informationPost-K Development and Introducing DLU. Copyright 2017 FUJITSU LIMITED
Post-K Development and Introducing DLU 0 Fujitsu s HPC Development Timeline K computer The K computer is still competitive in various fields; from advanced research to manufacturing. Deep Learning Unit
More informationLACORE: A RISC-V BASED LINEAR ALGEBRA ACCELERATOR FOR SOC DESIGNS
1 LACORE: A RISC-V BASED LINEAR ALGEBRA ACCELERATOR FOR SOC DESIGNS Samuel Steffl and Sherief Reda Brown University, Department of Computer Engineering Partially funded by NSF grant 1438958 Published as
More informationThe S6000 Family of Processors
The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which
More informationTools and Primitives for High Performance Graph Computation
Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World
More informationGen-Z Overview. 1. Introduction. 2. Background. 3. A better way to access data. 4. Why a memory-semantic fabric
Gen-Z Overview 1. Introduction Gen-Z is a new data access technology that will allow business and technology leaders, to overcome current challenges with the existing computer architecture and provide
More informationParallel Combinatorial BLAS and Applications in Graph Computations
Parallel Combinatorial BLAS and Applications in Graph Computations Aydın Buluç John R. Gilbert University of California, Santa Barbara SIAM ANNUAL MEETING 2009 July 8, 2009 1 Primitives for Graph Computations
More informationMathematics. 2.1 Introduction: Graphs as Matrices Adjacency Matrix: Undirected Graphs, Directed Graphs, Weighted Graphs CHAPTER 2
CHAPTER Mathematics 8 9 10 11 1 1 1 1 1 1 18 19 0 1.1 Introduction: Graphs as Matrices This chapter describes the mathematics in the GraphBLAS standard. The GraphBLAS define a narrow set of mathematical
More informationThe Evaluation of GPU-Based Programming Environments for Knowledge Discovery
The Evaluation of GPU-Based Programming Environments for Knowledge Discovery John Johnson, Randall Frank, and Sheila Vaidya Lawrence Livermore National Labs Phone: 925-424-4092 Email Addresses: {jjohnson,
More informationMetropolitan Road Traffic Simulation on FPGAs
Metropolitan Road Traffic Simulation on FPGAs Justin L. Tripp, Henning S. Mortveit, Anders Å. Hansson, Maya Gokhale Los Alamos National Laboratory Los Alamos, NM 85745 Overview Background Goals Using the
More informationOCP Engineering Workshop - Telco
OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationFoundations for Summarizing and Learning Latent Structure in Video
Foundations for Summarizing and Learning Latent Structure in Video Presenter: Kevin Pitstick, MTS Engineer PI: Ed Morris, MTS Senior Engineer Copyright 2017 Carnegie Mellon University. All Rights Reserved.
More informationChallenges for Future Interconnection Networks Hot Interconnects Panel August 24, Dennis Abts Sr. Principal Engineer
Challenges for Future Interconnection Networks Hot Interconnects Panel August 24, 2006 Sr. Principal Engineer Panel Questions How do we build scalable networks that balance power, reliability and performance
More informationExploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center
Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating
More informationSparse Linear Solver for Power System Analyis using FPGA
Sparse Linear Solver for Power System Analyis using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract Load flow computation and contingency analysis is the foundation of power system analysis.
More informationRapidIO.org Update.
RapidIO.org Update rickoco@rapidio.org June 2015 2015 RapidIO.org 1 Outline RapidIO Overview Benefits Interconnect Comparison Ecosystem System Challenges RapidIO Markets Data Center & HPC Communications
More informationDataflow Supercomputers
Dataflow Supercomputers Michael J. Flynn Maxeler Technologies and Stanford University Outline History Dataflow as a supercomputer technology openspl: generalizing the dataflow programming model Optimizing
More informationGraphBLAS Mathematics - Provisional Release 1.0 -
GraphBLAS Mathematics - Provisional Release 1.0 - Jeremy Kepner Generated on April 26, 2017 Contents 1 Introduction: Graphs as Matrices........................... 1 1.1 Adjacency Matrix: Undirected Graphs,
More informationRUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch
RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,
More informationSupercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?
Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA
More informationFPGA Provides Speedy Data Compression for Hyperspectral Imagery
FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with
More information6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP
LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationTowards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers
Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationPerformance Modeling of Pipelined Linear Algebra Architectures on FPGAs
Performance Modeling of Pipelined Linear Algebra Architectures on FPGAs Sam Skalicky, Sonia López, Marcin Łukowiak, James Letendre, and Matthew Ryan Rochester Institute of Technology, Rochester NY 14623,
More informationANNUAL REPORT Visit us at project.eu Supported by. Mission
Mission ANNUAL REPORT 2011 The Web has proved to be an unprecedented success for facilitating the publication, use and exchange of information, at planetary scale, on virtually every topic, and representing
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationInterconnection Network
Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network
More informationProviding Information Superiority to Small Tactical Units
Providing Information Superiority to Small Tactical Units Jeff Boleng, PhD Principal Member of the Technical Staff Software Solutions Conference 2015 November 16 18, 2015 Copyright 2015 Carnegie Mellon
More informationMultipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox
More informationOPERA. Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications
OPERA Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications Co-funded by the Horizon 2020 Framework Programme of the
More informationIssues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance
More informationIndustry Collaboration and Innovation
Industry Collaboration and Innovation OpenCAPI Topics Industry Background Technology Overview Design Enablement OpenCAPI Consortium Industry Landscape Key changes occurring in our industry Historical microprocessor
More informationCenter Extreme Scale CS Research
Center Extreme Scale CS Research Center for Compressible Multiphase Turbulence University of Florida Sanjay Ranka Herman Lam Outline 10 6 10 7 10 8 10 9 cores Parallelization and UQ of Rocfun and CMT-Nek
More informationA GPU Enhanced Linux Cluster for Accelerated FMS
A GPU Enhanced Linux Cluster for Accelerated FMS Computational Sciences 21 June 07 Gene Wagenbreth genew@isi.edu (310)448-8213 Background Computational Sciences Division of ISI works with clusters, compilers
More informationCryogenic Computing Complexity (C3) Marc Manheimer December 9, 2015 IEEE Rebooting Computing Summit 4
Cryogenic Computing Complexity (C3) Marc Manheimer marc.manheimer@iarpa.gov December 9, 2015 IEEE Rebooting Computing Summit 4 C3 for the Workshop Review of the C3 program Motivation Technical challenges
More informationRequirements for Scalable Application Specific Processing in Commercial HPEC
Requirements for Scalable Application Specific Processing in Commercial HPEC Steven Miller Silicon Graphics, Inc. Phone: 650-933-1899 Email Address: scm@sgi.com Abstract: More and more High Performance
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationHow Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC
How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC Three Consortia Formed in Oct 2016 Gen-Z Open CAPI CCIX complex to rack scale memory fabric Cache coherent accelerator
More informationOrganic Computing. Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design
Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design 1 Reconfigurable Computing Platforms 2 The Von Neumann Computer Principle In 1945, the
More informationIndustry Collaboration and Innovation
Industry Collaboration and Innovation Industry Landscape Key changes occurring in our industry Historical microprocessor technology continues to deliver far less than the historical rate of cost/performance
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationFPGA Acceleration of 3D Component Matching using OpenCL
FPGA Acceleration of 3D Component Introduction 2D component matching, blob extraction or region extraction, is commonly used in computer vision for detecting connected regions that meet pre-determined
More informationRapidIO.org Update. Mar RapidIO.org 1
RapidIO.org Update rickoco@rapidio.org Mar 2015 2015 RapidIO.org 1 Outline RapidIO Overview & Markets Data Center & HPC Communications Infrastructure Industrial Automation Military & Aerospace RapidIO.org
More informationCOMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory
COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy
More informationRIVYERA S6-LX150 DATASHEET. 128 FPGA Next Generation Reconfigurable Computer RIVYERA S6-LX150
DATASHEET RIVYERA S6-LX150 128 FPGA Next Generation Reconfigurable Computer RIVYERA S6-LX150 Products shown in this data sheet may be subjected to any change without prior notice. Although all data reported
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More information«Computer Science» Requirements for applicants by Innopolis University
«Computer Science» Requirements for applicants by Innopolis University Contents Architecture and Organization... 2 Digital Logic and Digital Systems... 2 Machine Level Representation of Data... 2 Assembly
More informationThe Impact of Optics on HPC System Interconnects
The Impact of Optics on HPC System Interconnects Mike Parker and Steve Scott Hot Interconnects 2009 Manhattan, NYC Will cost-effective optics fundamentally change the landscape of networking? Yes. Changes
More informationFast Flexible FPGA-Tuned Networks-on-Chip
This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Fast Flexible FPGA-Tuned Networks-on-Chip Michael K. Papamichael, James C. Hoe
More informationA KASSPER Real-Time Signal Processor Testbed
A KASSPER Real-Time Signal Processor Testbed Glenn Schrader 244 Wood St. exington MA, 02420 Phone: (781)981-2579 Fax: (781)981-5255 gschrad@ll.mit.edu The Knowledge Aided Sensor Signal Processing and Expert
More informationAutomatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S.
Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Vazhkudai Instance of Large-Scale HPC Systems ORNL s TITAN (World
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationGraph Streaming Processor
Graph Streaming Processor A Next-Generation Computing Architecture Val G. Cook Chief Software Architect Satyaki Koneru Chief Technology Officer Ke Yin Chief Scientist Dinakar Munagala Chief Executive Officer
More informationComposite Metrics for System Throughput in HPC
Composite Metrics for System Throughput in HPC John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2003 Phoenix, AZ November 18, 2003 Overview The HPC Challenge Benchmark was announced last
More informationAn update on Scalable Implementation of Primitives for Homomorphic EncRyption FPGA implementation using Simulink Abstract
An update on Scalable Implementation of Primitives for Homomorphic EncRyption FPGA implementation using Simulink David Bruce Cousins, Kurt Rohloff, Chris Peikert, Rick Schantz Raytheon BBN Technologies,
More informationFast Hardware For AI
Fast Hardware For AI Karl Freund karl@moorinsightsstrategy.com Sr. Analyst, AI and HPC Moor Insights & Strategy Follow my blogs covering Machine Learning Hardware on Forbes: http://www.forbes.com/sites/moorinsights
More informationGedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort
Technology White Paper The CHAMP-AV6 VPX-REDI Digital Signal Processing Card Maximizing Performance with Minimal Porting Effort Introduction The Curtiss-Wright Controls Embedded Computing CHAMP-AV6 is
More informationUNCLASSIFIED. R-1 ITEM NOMENCLATURE PE D8Z: Data to Decisions Advanced Technology FY 2012 OCO
Exhibit R-2, RDT&E Budget Item Justification: PB 2012 Office of Secretary Of Defense DATE: February 2011 BA 3: Advanced Development (ATD) COST ($ in Millions) FY 2010 FY 2011 Base OCO Total FY 2013 FY
More information4. Networks. in parallel computers. Advances in Computer Architecture
4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors
More informationTactical Microgrid Standardization Update to the EGSA Government Relations Committee
Tactical Microgrid Standards Consortium Tactical Microgrid Standardization Update to the EGSA Government Relations Committee Current as of 15 September 2017 US Army Engineer R&D Center (ERDC) US Army Communications-Electronics
More informationReconfigurable Advanced Rapid-prototyping Environment (RARE):
Reconfigurable Advanced Rapid-prototyping Environment (RARE): A Computing Technology for Challenging Form Factors SBIR DATA RIGHTS Prepared by Michael J. Bonato Colorado Engineering, Inc. for HPEC 2012
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationIBM Power AC922 Server
IBM Power AC922 Server The Best Server for Enterprise AI Highlights More accuracy - GPUs access system RAM for larger models Faster insights - significant deep learning speedups Rapid deployment - integrated
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationWhite paper Advanced Technologies of the Supercomputer PRIMEHPC FX10
White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10 Next Generation Technical Computing Unit Fujitsu Limited Contents Overview of the PRIMEHPC FX10 Supercomputer 2 SPARC64 TM IXfx: Fujitsu-Developed
More informationBlueDBM: An Appliance for Big Data Analytics*
BlueDBM: An Appliance for Big Data Analytics* Arvind *[ISCA, 2015] Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT) and Jamey Hicks, John Ankcorn, Myron King(Quanta) BigData@CSAIL Annual Meeting
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationTHE NEXT-GENERATION INTEROPERABILITY STANDARD
THE NEXT-GENERATION INTEROPERABILITY STANDARD ABOUT OPEN VPX TM Open VPX TM is the next-generation interoperability standard for system-level defense and aerospace applications. It is ideal for rugged
More informationDesign and Implementation of the GraphBLAS Template Library (GBTL)
Design and Implementation of the GraphBLAS Template Library (GBTL) Scott McMillan, Samantha Misurda Marcin Zalewski, Peter Zhang, Andrew Lumsdaine Software Engineering Institute Carnegie Mellon University
Design and Architecture of Dell Acceleration Appliances for Database (DAAD): A Practical Approach with High Availability Guaranteed
Design and Architecture of Dell Acceleration Appliances for Database (DAAD): A Practical Approach with High Availability Guaranteed Kai Yu, Yuxiang Gao Dell Global Solutions Engineering Group Peng Zhang,
CHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization
CHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization Robert N.M. Watson *, Jonathan Woodruff *, Peter G. Neumann, Simon W. Moore *, Jonathan Anderson, David Chisnall
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
OpenCAPI Technology. Myron Slota Speaker name, Title OpenCAPI Consortium Company/Organization Name. Join the Conversation #OpenPOWERSummit
OpenCAPI Technology Myron Slota Speaker name, Title OpenCAPI Consortium Company/Organization Name Join the Conversation #OpenPOWERSummit Industry Collaboration and Innovation OpenCAPI Topics Computation
NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV Joe Eaton Ph.D.
NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Agenda Accelerated Computing nvgraph New Features Coming Soon Dynamic Graphs GraphBLAS 2 ACCELERATED COMPUTING 10x Performance
HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA
HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS
Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs
This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs
TUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
FPGA-Based Embedded Systems for Testing and Rapid Prototyping
FPGA-Based Embedded Systems for Testing and Rapid Prototyping Martin Panevsky Embedded System Applications Manager Embedded Control Systems Department The Aerospace Corporation Flight Software Workshop
Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical
Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or
ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research
ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations
High-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers
High-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers July 14, 1997 J Daniel S. Katz (Daniel.S.Katz@jpl.nasa.gov) Jet Propulsion Laboratory California Institute of Technology
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
Algorithms and Architecture. William D. Gropp Mathematics and Computer Science
Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?
Digital Design Methodology (Revisited) Design Methodology: Big Picture
Digital Design Methodology (Revisited) Design Methodology Design Specification Verification Synthesis Technology Options Full Custom VLSI Standard Cell ASIC FPGA CS 150 Fall 2005 - Lec #25 Design Methodology
HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D
HPC with GPU and its applications from Inspur Haibo Xie, Ph.D xiehb@inspur.com 2 Agenda I. HPC with GPU II. YITIAN solution and application 3 New Moore's Law 4 HPC? HPC stands for High Heterogeneous Performance
EEL 4783: HDL in Digital System Design
EEL 4783: HDL in Digital System Design Lecture 10: Synthesis Optimization Prof. Mingjie Lin 1 What Can We Do? Trade-offs with speed versus area. Resource sharing for area optimization. Pipelining, retiming,
Implementing Flexible Interconnect Topologies for Machine Learning Acceleration
Implementing Flexible Interconnect for Machine Learning Acceleration ARM TECH SYMPOSIA OCT 2018 WILLIAM TSENG Mem Controller 20 mm Mem Controller Machine Learning / AI SoC New Challenges
GPU Sparse Graph Traversal
GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex