Investigating Resilient HPRC with Minimally-Invasive System Monitoring

Size: px
Start display at page:

Download "Investigating Resilient HPRC with Minimally-Invasive System Monitoring"

Transcription

1 Investigating Resilient HPRC with Minimally-Invasive System Monitoring Bin Huang Andrew G. Schmidt Ashwin A. Mendon Ron Sass Reconfigurable Computing Systems Lab UNC Charlotte

2 Agenda Exascale systems are expected to fault frequently What can we do with FPGAs? Can we tell if there is a failure in the system? Results and analysis 1

3 Exascale systems are expected to fault frequently Two main reasons behind this belief: Ever-increasing number of components MTTF of these components are not expected to improve enough to compensate Failure record by Los Alamos National Lab 22 HPC systems in production use Accumulated downtime from 1996 to

4 Downtime by hardware failures in LANL ( ) ~907 days of work loss by just memory DIMM failures 3

5 Downtime by software failures in LANL ( ) ~387 days of work loss by just cluster file system failures 4

6 State of the art Checkpoint/restart - periodically stop program and write data to non-volatile memory library (Berkeley Lab Checkpoint/Restart) Often, ad hoc by application programer Checkpoint can take 30 min. each time in HPC systems (Franck Cappello's EuroPVM/MPI in 2008) Detection Human (programmer) realizes application is not finished or producing results Some ad hoc scripts to determine if log files are growing In short, detection is an open question 5

7 Experience with Spirit all-fpga cluster Processor crashed randomly Take hours to re-run the application Hardware cores would finish task 6

8 Big question Supposed we can build HPC systems with FPGAs, will HPRC systems be more resilient? 7

9 Towards a resilient HPRC system Hundreds of FPGAs each as system-on-a-chip (SOC) Automated system-monitoring OS Application Processor Accelerator core Autonomous restart Questions Can tell if there is a failure in the system? Can we identify why it failed? Can we recover from the failure? 8

10 Related work and techniques Debugging tools (e.g. ChipScope, SignalTap) Monitor a hardware core's status Additional JTAG and USB ports Triple Modular Redundancy Used by mission-critical system Limited power budget Owl-system monitoring framework (Schulz, et al.) Snoop system transaction for CPU FPGA is not first-class element Other performance analysis framework proposed source level (HDL) instrumentation (Koehler, et al. and Lancaster, et al.) 9

11 System-level monitoring framework Side-band Data Network Primary Data Network 10

12 Monitor Core Monitor target's registers or finite-state machines Similar to ChipScope but different sampling rate and duration States saved for checkpoint/restart 11

13 System Monitor Hub Collect status of local components from HW monitor cores Interface with side-band data network Receive requests from Head Node Apply TMR if high availability required 12

14 On-chip and off-chip high-speed network Side-band Data Network Latency impacts mean time to discovery Bandwidth impacts mean time to recovery (large amount of data will be transfered for checkpoint/restart) 13

15 Head Node Side-band Data Network Interpret status information Recover the system from failures autonomously Tell programmer why the node has failed 14

16 Experimental setup : A simple ring network One Head Node 32 Worker Nodes 15

17 Example : Issue request Head node issues the request to worker node 0 Other worker nodes are waiting for the request 16

18 Example : Append health information Worker node 0 appends its status information to the end of the packet Packet continues to flow to next worker node 17

19 Example : Detect failure Packet travels back to the head node with failure information on worker node 1 18

20 Initial results Fault Type How to emulate Detected? App crash Physical disruption OS crash Physical disruption Network failure Unplug cable Network crash Disable on-chip router Accel. core failure Fabricated 19

21 Analysis Encouraging initial result Architecture Independent Reconfigurable Network (AIREN) Support eight 4.0 Gb/s bi-directional channels 0.8 µs latency between nodes Head node can scan 32 work nodes every µs Significant reduction in mean time to discovery Even if we scale to 40,000 FPGAs, we can still scan all nodes every 32.8 ms 20

22 Conclusion We will design HPRC system with expectation of failures (and work losses) We conceptualize a resilient HPRC system with an open system-monitoring framework Initial results from a 32-node test on Spirit cluster prove this concept This framework will back up other current research (most likely long-running jobs) to prevent work loss Testbed for resilience research 21

23 Thank you Bin Huang Andrew G. Schmidt Ashwin A. Mendon Ron Sass Reconfigurable Computing Systems Lab UNC Charlotte 22

24 Statistics of downtime in LANL ( ) 3,387 days system downtime out of 22 clusters 23

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

In-Network Computing. Paving the Road to Exascale. June 2017

In-Network Computing. Paving the Road to Exascale. June 2017 In-Network Computing Paving the Road to Exascale June 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect -Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

More information

Arlington, VA Phone: (703)

Arlington, VA Phone: (703) Andrew G. Schmidt Computer Scientist Information Sciences Institute University of Southern California 3811 N. Fairfax Dr. Suite 200 Email: aschmidt@isi.edu Arlington, VA 22203 Phone: (703) 248-6170 Education

More information

Rollback-Recovery Protocols for Send-Deterministic Applications. Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello

Rollback-Recovery Protocols for Send-Deterministic Applications. Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello Rollback-Recovery Protocols for Send-Deterministic Applications Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello Fault Tolerance in HPC Systems is Mandatory Resiliency is

More information

Reflections on Failure in Post-Terascale Parallel Computing

Reflections on Failure in Post-Terascale Parallel Computing Reflections on Failure in Post-Terascale Parallel Computing 2007 Int. Conf. on Parallel Processing, Xi An China Garth Gibson Carnegie Mellon University and Panasas Inc. DOE SciDAC Petascale Data Storage

More information

A Large-Scale Study of Soft- Errors on GPUs in the Field

A Large-Scale Study of Soft- Errors on GPUs in the Field A Large-Scale Study of Soft- Errors on GPUs in the Field Bin Nie*, Devesh Tiwari +, Saurabh Gupta +, Evgenia Smirni*, and James H. Rogers + *College of William and Mary + Oak Ridge National Laboratory

More information

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011 The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities

More information

High Availability and Redundant Operation

High Availability and Redundant Operation This chapter describes the high availability and redundancy features of the Cisco ASR 9000 Series Routers. Features Overview, page 1 High Availability Router Operations, page 1 Power Supply Redundancy,

More information

Fault tolerance in Grid and Grid 5000

Fault tolerance in Grid and Grid 5000 Fault tolerance in Grid and Grid 5000 Franck Cappello INRIA Director of Grid 5000 fci@lri.fr Fault tolerance in Grid Grid 5000 Applications requiring Fault tolerance in Grid Domains (grid applications

More information

Windows Azure Services - At Different Levels

Windows Azure Services - At Different Levels Windows Azure Windows Azure Services - At Different Levels SaaS eg : MS Office 365 Paas eg : Azure SQL Database, Azure websites, Azure Content Delivery Network (CDN), Azure BizTalk Services, and Azure

More information

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan PHX: Memory Speed HPC I/O with NVM Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan Node Local Persistent I/O? Node local checkpoint/ restart - Recover from transient failures ( node restart)

More information

FPGA-Based Embedded Systems for Testing and Rapid Prototyping

FPGA-Based Embedded Systems for Testing and Rapid Prototyping FPGA-Based Embedded Systems for Testing and Rapid Prototyping Martin Panevsky Embedded System Applications Manager Embedded Control Systems Department The Aerospace Corporation Flight Software Workshop

More information

OCP Engineering Workshop - Telco

OCP Engineering Workshop - Telco OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,

More information

CEC 450 Real-Time Systems

CEC 450 Real-Time Systems CEC 450 Real-Time Systems Lecture 13 High Availability and Reliability for Mission Critical Systems November 9, 2015 Sam Siewert RASM Reliability High Quality Components (Unit Test) Redundancy Dual String

More information

HPE ProLiant Gen10. Franz Weberberger Presales Consultant Server

HPE ProLiant Gen10. Franz Weberberger Presales Consultant Server HPE ProLiant Gen10 Franz Weberberger Presales Consultant Server Introducing a new generation compute experience from HPE Agility A better way to deliver business results Security A better way to protect

More information

High Speed Fault Injection Tool (FITO) Implemented With VHDL on FPGA For Testing Fault Tolerant Designs

High Speed Fault Injection Tool (FITO) Implemented With VHDL on FPGA For Testing Fault Tolerant Designs Vol. 3, Issue. 5, Sep - Oct. 2013 pp-2894-2900 ISSN: 2249-6645 High Speed Fault Injection Tool (FITO) Implemented With VHDL on FPGA For Testing Fault Tolerant Designs M. Reddy Sekhar Reddy, R.Sudheer Babu

More information

Cycle Sharing Systems

Cycle Sharing Systems Cycle Sharing Systems Jagadeesh Dyaberi Dependable Computing Systems Lab Purdue University 10/31/2005 1 Introduction Design of Program Security Communication Architecture Implementation Conclusion Outline

More information

Chapter 17: Distributed Systems (DS)

Chapter 17: Distributed Systems (DS) Chapter 17: Distributed Systems (DS) Silberschatz, Galvin and Gagne 2013 Chapter 17: Distributed Systems Advantages of Distributed Systems Types of Network-Based Operating Systems Network Structure Communication

More information

EXASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC

EXASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC EASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC What are we talking about? 100M cores 12 cores/node Power Challenges Exascale Technology Roadmap Meeting San Diego California, December 2009. $1M per

More information

Research Article HwPMI: An Extensible Performance Monitoring Infrastructure for Improving Hardware Design and Productivity on FPGAs

Research Article HwPMI: An Extensible Performance Monitoring Infrastructure for Improving Hardware Design and Productivity on FPGAs Hindawi Publishing Corporation International Journal of Reconfigurable Computing Volume 2012, Article ID 162404, 12 pages doi:10.1155/2012/162404 Research Article HwPMI: An Extensible Performance Monitoring

More information

Combining Arm & RISC-V in Heterogeneous Designs

Combining Arm & RISC-V in Heterogeneous Designs Combining Arm & RISC-V in Heterogeneous Designs Gajinder Panesar, CTO, UltraSoC gajinder.panesar@ultrasoc.com RISC-V Summit 3 5 December 2018 Santa Clara, USA Problem statement Deterministic multi-core

More information

Autosave for Research Where to Start with Checkpoint/Restart

Autosave for Research Where to Start with Checkpoint/Restart Autosave for Research Where to Start with Checkpoint/Restart Brandon Barker Computational Scientist Cornell University Center for Advanced Computing (CAC) brandon.barker@cornell.edu Workshop: High Performance

More information

Veloce2 the Enterprise Verification Platform. Simon Chen Emulation Business Development Director Mentor Graphics

Veloce2 the Enterprise Verification Platform. Simon Chen Emulation Business Development Director Mentor Graphics Veloce2 the Enterprise Verification Platform Simon Chen Emulation Business Development Director Mentor Graphics Agenda Emulation Use Modes Veloce Overview ARM case study Conclusion 2 Veloce Emulation Use

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

Introduction to the Service Availability Forum

Introduction to the Service Availability Forum . Introduction to the Service Availability Forum Contents Introduction Quick AIS Specification overview AIS Dependability services AIS Communication services Programming model DEMO Design of dependable

More information

GEN-Z AN OVERVIEW AND USE CASES

GEN-Z AN OVERVIEW AND USE CASES 13 th ANNUAL WORKSHOP 2017 GEN-Z AN OVERVIEW AND USE CASES Greg Casey, Senior Architect and Strategist Server CTO Team DellEMC March, 2017 WHY PROPOSE A NEW BUS? System memory is flat or shrinking Memory

More information

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments Swen Böhm 1,2, Christian Engelmann 2, and Stephen L. Scott 2 1 Department of Computer

More information

escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows

escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Jie Li1, Deb Agarwal2, Azure Marty Platform Humphrey1, Keith Jackson2, Catharine van Ingen3, Youngryel Ryu4

More information

MPI History. MPI versions MPI-2 MPICH2

MPI History. MPI versions MPI-2 MPICH2 MPI versions MPI History Standardization started (1992) MPI-1 completed (1.0) (May 1994) Clarifications (1.1) (June 1995) MPI-2 (started: 1995, finished: 1997) MPI-2 book 1999 MPICH 1.2.4 partial implemention

More information

Cray XD1 Supercomputer Release 1.3 CRAY XD1 DATASHEET

Cray XD1 Supercomputer Release 1.3 CRAY XD1 DATASHEET CRAY XD1 DATASHEET Cray XD1 Supercomputer Release 1.3 Purpose-built for HPC delivers exceptional application performance Affordable power designed for a broad range of HPC workloads and budgets Linux,

More information

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security Bringing OpenStack to the Enterprise An enterprise-class solution ensures you get the required performance, reliability, and security INTRODUCTION Organizations today frequently need to quickly get systems

More information

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

HP NonStop Database Solution

HP NonStop Database Solution CHOICE - CONFIDENCE - CONSISTENCY HP NonStop Database Solution Marco Sansoni, HP NonStop Business Critical Systems 9 ottobre 2012 Agenda Introduction to HP NonStop platform HP NonStop SQL database solution

More information

Designing with ALTERA SoC Hardware

Designing with ALTERA SoC Hardware Designing with ALTERA SoC Hardware Course Description This course provides all theoretical and practical know-how to design ALTERA SoC devices under Quartus II software. The course combines 60% theory

More information

Microsoft Office SharePoint Server 2007

Microsoft Office SharePoint Server 2007 Microsoft Office SharePoint Server 2007 Enabled by EMC Celerra Unified Storage and Microsoft Hyper-V Reference Architecture Copyright 2010 EMC Corporation. All rights reserved. Published May, 2010 EMC

More information

Ultra Depedable VLSI by Collaboration of Formal Verifications and Architectural Technologies

Ultra Depedable VLSI by Collaboration of Formal Verifications and Architectural Technologies Ultra Depedable VLSI by Collaboration of Formal Verifications and Architectural Technologies CREST-DVLSI - Fundamental Technologies for Dependable VLSI Systems - Masahiro Fujita Shuichi Sakai Masahiro

More information

RiceNIC. A Reconfigurable Network Interface for Experimental Research and Education. Jeffrey Shafer Scott Rixner

RiceNIC. A Reconfigurable Network Interface for Experimental Research and Education. Jeffrey Shafer Scott Rixner RiceNIC A Reconfigurable Network Interface for Experimental Research and Education Jeffrey Shafer Scott Rixner Introduction Networking is critical to modern computer systems Role of the network interface

More information

An FPGA-based High Speed Parallel Signal Processing System for Adaptive Optics Testbed

An FPGA-based High Speed Parallel Signal Processing System for Adaptive Optics Testbed An FPGA-based High Speed Parallel Signal Processing System for Adaptive Optics Testbed Hong Bong Kim 1 Hanwha Thales. Co., Ltd. Republic of Korea Young Soo Choi and Yu Kyung Yang Agency for Defense Development,

More information

Highly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture

Highly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture A Cost Effective,, High g Performance,, Highly Scalable, Non-RDMA NVMe Fabric Bob Hansen,, VP System Architecture bob@apeirondata.com Storage Developers Conference, September 2015 Agenda 3 rd Platform

More information

SQL Server 2017 for your Mission Critical applications

SQL Server 2017 for your Mission Critical applications SQL Server 2017 for your Mission Critical applications Turn your critical data into real-time business insights September 2018 Turn data into insights with advanced data and analytics platforms Digitization

More information

Feature Comparison Summary

Feature Comparison Summary Feature Comparison Summary, and The cloud-ready operating system Thanks to cloud technology, the rate of change is faster than ever before, putting more pressure on IT. Organizations demand increased security,

More information

Providers of a Comprehensive Portfolio of Solutions for Reliable Ethernet and Synchronization in the Energy Market. Industrial

Providers of a Comprehensive Portfolio of Solutions for Reliable Ethernet and Synchronization in the Energy Market. Industrial Industrial Providers of a Comprehensive Portfolio of Solutions for Reliable Ethernet and Synchronization in the Energy Market System-on-Chip engineering The Need for Redundant Data Communications In Automated

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

A Hardware Filesystem Implementation for High-Speed Secondary Storage

A Hardware Filesystem Implementation for High-Speed Secondary Storage A Hardware Filesystem Implementation for High-Speed Secondary Storage Dr.Ashwin A. Mendon, Dr.Ron Sass Electrical & Computer Engineering Department University of North Carolina at Charlotte Presented by:

More information

Distributed Systems 27. Process Migration & Allocation

Distributed Systems 27. Process Migration & Allocation Distributed Systems 27. Process Migration & Allocation Paul Krzyzanowski pxk@cs.rutgers.edu 12/16/2011 1 Processor allocation Easy with multiprocessor systems Every processor has access to the same memory

More information

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,

More information

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme FUT3040BU Storage at Memory Speed: Finally, Nonvolatile Memory Is Here Rajesh Venkatasubramanian, VMware, Inc Richard A Brunner, VMware, Inc #VMworld #FUT3040BU Disclaimer This presentation may contain

More information

MPI versions. MPI History

MPI versions. MPI History MPI versions MPI History Standardization started (1992) MPI-1 completed (1.0) (May 1994) Clarifications (1.1) (June 1995) MPI-2 (started: 1995, finished: 1997) MPI-2 book 1999 MPICH 1.2.4 partial implemention

More information

A Seamless Tool Access Architecture from ESL to End Product. Albrecht Mayer (Infineon Microcontrollers) S4D Conference Sophia Antipolis, Sept.

A Seamless Tool Access Architecture from ESL to End Product. Albrecht Mayer (Infineon Microcontrollers) S4D Conference Sophia Antipolis, Sept. A Seamless Tool Access Architecture from ESL to End Product Albrecht Mayer (Infineon Microcontrollers) S4D Conference Sophia Antipolis, Sept. 2009 Tool Access Architecture (TAA) Tool to Device TAA = Abstraction

More information

Challenges and opportunities of debugging FPGAs with embedded CPUs. Kris Chaplin Embedded Technology Specialist Altera Northern Europe

Challenges and opportunities of debugging FPGAs with embedded CPUs. Kris Chaplin Embedded Technology Specialist Altera Northern Europe Challenges and opportunities of debugging FPGAs with embedded CPUs Kris Chaplin Embedded Technology Specialist Altera Northern Europe Agenda How system bring-up has got more complicated Board Bring up

More information

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering

More information

PACS and Its Components

PACS and Its Components Agenda PACS and Its Components Department of Radiological Technology Faculty of Medical Technology Mahidol University Hardware & Software Components Understand Virtualization Technology Fault Tolerance

More information

Gen-Z Memory-Driven Computing

Gen-Z Memory-Driven Computing Gen-Z Memory-Driven Computing Our vision for the future of computing Patrick Demichel Distinguished Technologist Explosive growth of data More Data Need answers FAST! Value of Analyzed Data 2005 0.1ZB

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

Why FPGAs will win the Accelerator Battle: Building Computers that Minimize Data Movement

Why FPGAs will win the Accelerator Battle: Building Computers that Minimize Data Movement Why FPGAs will win the Accelerator Battle: Building Computers that Minimize Data Movement Allan Cantle President & Founder www.nallatech.com Overview» Commercial Realities For HPC» A View From Berkeley»

More information

Developing Microsoft Azure Solutions: Course Agenda

Developing Microsoft Azure Solutions: Course Agenda Developing Microsoft Azure Solutions: 70-532 Course Agenda Module 1: Overview of the Microsoft Azure Platform Microsoft Azure provides a collection of services that you can use as building blocks for your

More information

PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 4 th, 2014 NEC Corporation

PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 4 th, 2014 NEC Corporation PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 4 th, 2014 NEC Corporation 1. Overview of NEC PCIe SSD Appliance for Microsoft SQL Server Page 2 NEC Corporation

More information

Hybrid Memory Platform

Hybrid Memory Platform Hybrid Memory Platform Kenneth Wright, Sr. Driector Rambus / Emerging Solutions Division Join the Conversation #OpenPOWERSummit 1 Outline The problem / The opportunity Project goals Roadmap - Sub-projects/Tracks

More information

Frequently Asked Questions (FAQ)

Frequently Asked Questions (FAQ) Frequently Asked Questions (FAQ) Embedded Instrumentation: The future of advanced design validation, test and debug Why Embedded Instruments? The necessities that are driving the invention of embedded

More information

HPC Downtime Budgets: Moving SRE Practice to the Rest of the World

HPC Downtime Budgets: Moving SRE Practice to the Rest of the World LA-UR-16-24361 HPC Downtime Budgets: Moving SRE Practice to the Rest of the World SREcon Europe 2016 Cory Lueninghoener July 12, 2016 Operated by Los Alamos National Security, LLC for the U.S. Department

More information

Persistent Memory. High Speed and Low Latency. White Paper M-WP006

Persistent Memory. High Speed and Low Latency. White Paper M-WP006 Persistent Memory High Speed and Low Latency White Paper M-WP6 Corporate Headquarters: 3987 Eureka Dr., Newark, CA 9456, USA Tel: (51) 623-1231 Fax: (51) 623-1434 E-mail: info@smartm.com Customer Service:

More information

NVMe SSDs Becoming Norm for All Flash Storage

NVMe SSDs Becoming Norm for All Flash Storage SSDs Becoming Norm for All Flash Storage Storage media has improved by leaps and bounds over the last several years. Capacity and performance are both improving at rather rapid rates as popular vendors

More information

Adaptive Runtime Support

Adaptive Runtime Support Scalable Fault Tolerance Schemes using Adaptive Runtime Support Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at

More information

SpaceComRTOS. A distributed formal RTOS adapted to SpaceWire enabled systems

SpaceComRTOS. A distributed formal RTOS adapted to SpaceWire enabled systems SpaceWire Working group 11th Nov 2004 SpaceComRTOS A distributed formal RTOS adapted to SpaceWire enabled systems Eric.Verhulst@OpenLicenseSociety.org www.openlicensesociety.org 1 The target Ground section

More information

Oracle Exadata X7. Uwe Kirchhoff Oracle ACS - Delivery Senior Principal Service Delivery Engineer

Oracle Exadata X7. Uwe Kirchhoff Oracle ACS - Delivery Senior Principal Service Delivery Engineer Oracle Exadata X7 Uwe Kirchhoff Oracle ACS - Delivery Senior Principal Service Delivery Engineer 05.12.2017 Oracle Engineered Systems ZFS Backup Appliance Zero Data Loss Recovery Appliance Exadata Database

More information

Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization

Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Jeffrey Young, Sudhakar Yalamanchili School of Electrical and Computer Engineering, Georgia Institute of Technology Talk

More information

#techsummitch

#techsummitch www.thomasmaurer.ch #techsummitch Justin Incarnato Justin Incarnato Microsoft Principal PM - Azure Stack Hyper-scale Hybrid Power of Azure in your datacenter Azure Stack Enterprise-proven On-premises

More information

Optimize and Accelerate Your Mission- Critical Applications across the WAN

Optimize and Accelerate Your Mission- Critical Applications across the WAN BIG IP WAN Optimization Module DATASHEET What s Inside: 1 Key Benefits 2 BIG-IP WAN Optimization Infrastructure 3 Data Optimization Across the WAN 4 TCP Optimization 4 Application Protocol Optimization

More information

3D WiNoC Architectures

3D WiNoC Architectures Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",

More information

Introduction to the Qsys System Integration Tool

Introduction to the Qsys System Integration Tool Introduction to the Qsys System Integration Tool Course Description This course will teach you how to quickly build designs for Altera FPGAs using Altera s Qsys system-level integration tool. You will

More information

Lenovo ThinkSystem Mission Critical Customer Deck

Lenovo ThinkSystem Mission Critical Customer Deck Lenovo ThinkSystem Mission Critical Customer Deck Patrick Moakley, Director of Marketing June, 2017 Agenda Customer Challenges The ThinkSystem Mission Critical Portfolio Mission Critical Value Prop Story

More information

FPGAhammer: Remote Voltage Fault Attacks on Shared FPGAs, suitable for DFA on AES

FPGAhammer: Remote Voltage Fault Attacks on Shared FPGAs, suitable for DFA on AES , suitable for DFA on AES Jonas Krautter, Dennis R.E. Gnad, Mehdi B. Tahoori 10.09.2018 INSTITUTE OF COMPUTER ENGINEERING CHAIR OF DEPENDABLE NANO COMPUTING KIT Die Forschungsuniversität in der Helmholtz-Gemeinschaft

More information

In-Network Computing. Paving the Road to Exascale. 5th Annual MVAPICH User Group (MUG) Meeting, August 2017

In-Network Computing. Paving the Road to Exascale. 5th Annual MVAPICH User Group (MUG) Meeting, August 2017 In-Network Computing Paving the Road to Exascale 5th Annual MVAPICH User Group (MUG) Meeting, August 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric

More information

HPC Storage Use Cases & Future Trends

HPC Storage Use Cases & Future Trends Oct, 2014 HPC Storage Use Cases & Future Trends Massively-Scalable Platforms and Solutions Engineered for the Big Data and Cloud Era Atul Vidwansa Email: atul@ DDN About Us DDN is a Leader in Massively

More information

Realizing the Next Generation of Exabyte-scale Persistent Memory-Centric Architectures and Memory Fabrics

Realizing the Next Generation of Exabyte-scale Persistent Memory-Centric Architectures and Memory Fabrics Realizing the Next Generation of Exabyte-scale Persistent Memory-Centric Architectures and Memory Fabrics Zvonimir Z. Bandic, Sr. Director, Next Generation Platform Technologies Western Digital Corporation

More information

Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work

Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work The Salishan Conference on High-Speed Computing April 26, 2016 Adam Moody

More information

Resilient IP Backbones. Debanjan Saha Tellium, Inc.

Resilient IP Backbones. Debanjan Saha Tellium, Inc. Resilient IP Backbones Debanjan Saha Tellium, Inc. dsaha@tellium.com 1 Outline Industry overview IP backbone alternatives IP-over-DWDM IP-over-OTN Traffic routing & planning Network case studies Research

More information

Looking ahead with IBM i. 10+ year roadmap

Looking ahead with IBM i. 10+ year roadmap Looking ahead with IBM i 10+ year roadmap 1 Enterprises Trust IBM Power 80 of Fortune 100 have IBM Power Systems The top 10 banking firms have IBM Power Systems 9 of top 10 insurance companies have IBM

More information

EOS: An Extensible Operating System

EOS: An Extensible Operating System EOS: An Extensible Operating System Executive Summary Performance and stability of the network is a business requirement for datacenter networking, where a single scalable fabric carries both network and

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

MultiChipSat: an Innovative Spacecraft Bus Architecture. Alvar Saenz-Otero

MultiChipSat: an Innovative Spacecraft Bus Architecture. Alvar Saenz-Otero MultiChipSat: an Innovative Spacecraft Bus Architecture Alvar Saenz-Otero 29-11-6 Motivation Objectives Architecture Overview Other architectures Hardware architecture Software architecture Challenges

More information

Intel Research mote. Ralph Kling Intel Corporation Research Santa Clara, CA

Intel Research mote. Ralph Kling Intel Corporation Research Santa Clara, CA Intel Research mote Ralph Kling Intel Corporation Research Santa Clara, CA Overview Intel mote project goals Project status and direction Intel mote hardware Intel mote software Summary and outlook Intel

More information

Scalable and Fault Tolerant Failure Detection and Consensus

Scalable and Fault Tolerant Failure Detection and Consensus EuroMPI'15, Bordeaux, France, September 21-23, 2015 Scalable and Fault Tolerant Failure Detection and Consensus Amogh Katti, Giuseppe Di Fatta, University of Reading, UK Thomas Naughton, Christian Engelmann

More information

Streaming, made simple. FPGA Manager. Streaming made simple

Streaming, made simple. FPGA Manager. Streaming made simple Streaming, made simple. FPGA Manager Streaming made simple Agenda Enclustra company profile Reasons for linking a FPGA to a high level language Applications Types of interaction Requirements when linking

More information

ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION

ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION Vignesh Adhinarayanan Ph.D. (CS) Student Synergy Lab, Virginia Tech INTRODUCTION Supercomputers are constrained by power Power

More information

ALTERA FPGAs Architecture & Design

ALTERA FPGAs Architecture & Design ALTERA FPGAs Architecture & Design Course Description This course provides all theoretical and practical know-how to design programmable devices of ALTERA with QUARTUS-II design software. The course combines

More information

VMware Join the Virtual Revolution! Brian McNeil VMware National Partner Business Manager

VMware Join the Virtual Revolution! Brian McNeil VMware National Partner Business Manager VMware Join the Virtual Revolution! Brian McNeil VMware National Partner Business Manager 1 VMware By the Numbers Year Founded Employees R&D Engineers with Advanced Degrees Technology Partners Channel

More information

Using FPGAs as Microservices

Using FPGAs as Microservices Using FPGAs as Microservices David Ojika, Ann Gordon-Ross, Herman Lam, Bhavesh Patel, Gaurav Kaul, Jayson Strayer (University of Florida, DELL EMC, Intel Corporation) The 9 th Workshop on Big Data Benchmarks,

More information

RouteBricks: Exploiting Parallelism To Scale Software Routers

RouteBricks: Exploiting Parallelism To Scale Software Routers outebricks: Exploiting Parallelism To Scale Software outers Mihai Dobrescu & Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia atnasamy

More information

Authenticated Storage Using Small Trusted Hardware Hsin-Jung Yang, Victor Costan, Nickolai Zeldovich, and Srini Devadas

Authenticated Storage Using Small Trusted Hardware Hsin-Jung Yang, Victor Costan, Nickolai Zeldovich, and Srini Devadas Authenticated Storage Using Small Trusted Hardware Hsin-Jung Yang, Victor Costan, Nickolai Zeldovich, and Srini Devadas Massachusetts Institute of Technology November 8th, CCSW 2013 Cloud Storage Model

More information

Fast Forward Storage & I/O. Jeff Layton (Eric Barton)

Fast Forward Storage & I/O. Jeff Layton (Eric Barton) Fast Forward & I/O Jeff Layton (Eric Barton) DOE Fast Forward IO and Exascale R&D sponsored by 7 leading US national labs Solutions to currently intractable problems of Exascale required to meet the 2020

More information

Distributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello

Distributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello Distributed recovery for senddeterministic HPC applications Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello 1 Fault-tolerance in HPC applications Number of cores on one CPU and

More information

The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook)

The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook) Workshop on New Visions for Large-Scale Networks: Research & Applications Vienna, VA, USA, March 12-14, 2001 The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook) Wu-chun Feng feng@lanl.gov

More information

Toward a Memory-centric Architecture

Toward a Memory-centric Architecture Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains

More information

Scalable In-memory Checkpoint with Automatic Restart on Failures

Scalable In-memory Checkpoint with Automatic Restart on Failures Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th

More information

Branch offices and SMBs: choosing the right hyperconverged solution

Branch offices and SMBs: choosing the right hyperconverged solution Branch offices and SMBs: choosing the right hyperconverged solution Presenters: Howard Marks Chief Scientist, DeepStorage.net Luke Pruen Director of Technical Services, StorMagic Infrastructures for the

More information

Dependable VLSI Platform using Robust Fabrics

Dependable VLSI Platform using Robust Fabrics Dependable VLSI Platform using Robust Fabrics Director H. Onodera, Kyoto Univ. Principal Researchers T. Onoye, Y. Mitsuyama, K. Kobayashi, H. Shimada, H. Kanbara, K. Wakabayasi Background: Overall Design

More information

Proactive Process-Level Live Migration in HPC Environments

Proactive Process-Level Live Migration in HPC Environments Proactive Process-Level Live Migration in HPC Environments Chao Wang, Frank Mueller North Carolina State University Christian Engelmann, Stephen L. Scott Oak Ridge National Laboratory SC 08 Nov. 20 Austin,

More information

System Design of Kepler Based HPC Solutions. Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering.

System Design of Kepler Based HPC Solutions. Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering. System Design of Kepler Based HPC Solutions Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering. Introduction The System Level View K20 GPU is a powerful parallel processor! K20 has

More information