A Global Operating System for HPC Clusters

A Global Operating System. Emiliano Betti (1), Marco Cesati (1), Roberto Gioiosa (2), Francesco Piermaria (1). (1) System Programming Research Group, University of Rome Tor Vergata. (2) BlueGene Software Division, IBM TJ Watson Research Center. IEEE Cluster 2009

Many... running Linux! Many modern supercomputers are built as clusters of thousands of independent nodes. The OS commonly used in these supercomputers runs a version of the Linux kernel (more than 80% in the latest Top500). Each node runs an independent OS kernel, so there is no temporal synchronization among the nodes' kernels.

Do we need synchronization? A strict temporal synchronization would allow or simplify activities like performance analysis, application debugging, data checkpointing... Moreover, the performance and scalability of an HPC application could be considerably improved just by synchronizing the system activities among the nodes, in order to avoid the effects of operating system noise and noise resonance.

In our paper we introduce the Cluster Advanced Operating System, a prototype HPC operating system based on Linux. Main idea: synchronize the system activities of all cluster nodes.

Presentation outline. In the following slides we: present the motivations for our proposal, reviewing the concepts of OS noise and noise resonance; introduce the Cluster Advanced Operating System, its general idea and its prototype components; show the experimental results.

HPC applications. Most HPC applications are Single Program, Multiple Data (SPMD), implemented using MPI. MPI applications can be characterized by a cyclic alternation of two phases. Computing phase: each process performs some computation on its local portion of data. Synchronization phase: processes communicate by exchanging data among themselves. Crucial for performance is ensuring that all the nodes start and terminate each phase at the same moment.

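A minimal sketch of this compute/synchronize cycle in C with MPI (the workload and the use of MPI_Allreduce are illustrative assumptions, not code from the paper): any OS noise during the computing phase delays that process, and the collective cannot complete until the slowest process arrives.

    /* Illustrative SPMD skeleton: alternating computing and synchronization
     * phases. Names and workload are assumptions, not the paper's code.
     * Build with an MPI C compiler, e.g. mpicc. */
    #include <mpi.h>

    #define N 1000000

    static double local_data[N];

    static void compute_on_local_data(void)
    {
        for (int i = 0; i < N; i++)              /* placeholder local work */
            local_data[i] = local_data[i] * 1.000001 + 1.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        for (int iter = 0; iter < 100; iter++) {
            double local_sum = 0.0, global_sum = 0.0;

            compute_on_local_data();             /* computing phase */
            for (int i = 0; i < N; i++)
                local_sum += local_data[i];

            /* synchronization phase: every process waits for all the others */
            MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }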

Operating System noise. Ensuring phase synchronization is not easy, mainly because of the so-called Operating System noise. In other words, OS activities like the timer interrupt, disk cache flushes, and so on slow down the computing phases by consuming CPU cycles. Noise resonance: since the nodes' kernels are not synchronized, the combined noise effects in a large-scale cluster may considerably decrease performance and scalability.

Noise resonance. The OS noise on a single node has been analyzed and measured in the past and amounts to about 1-2% on average. However, if on each computational phase all the processes have to wait for one slow process, the whole parallel job is delayed.

Synchronization with the Cluster Advanced Operating System. On large clusters the probability that, during each computational phase, at least one process is delayed by OS noise approaches 1. A possible solution is forcing all the cluster's nodes to perform system activities at the same time: this is one of the main goals of our system.
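
As a rough illustration of why this probability approaches 1 (assuming independent noise events across nodes; the per-phase delay probability p = 2% and the node count N = 1000 are assumed values, chosen only to match the 1-2% single-node noise figure quoted on the previous slide):

    P(at least one process delayed) = 1 - (1 - p)^N
                                    = 1 - 0.98^1000 ~ 1 - 2e-9 ~ 1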

Global Operating System services. Besides improved performance, with a Global OS we could: simplify performance analysis; realize data checkpointing for OS-level fault tolerance; improve application debugging. Example: the capability to stop the parallel application at the same moment on all the nodes, obtaining a coherent global snapshot.

Cluster Advanced Operating System. The Cluster Advanced Operating System is a global operating system for clusters, based on Linux. Its core layer essentially consists of two main agents: the Global conductor and the Local instrumentalist. In our prototype we realized two main components: HPCSched, a new scheduler designed for clusters, and NetTick, the global heartbeat mechanism.

Global conductor. The Global conductor (GC) runs on a selected master node and is in charge of scheduling cluster activities. It provides a common time source (heartbeat) to the cluster nodes (NetTick, already implemented) and notifies global and local system activities. Example: the GC may command the nodes to check for decayed software timers or to reclaim pages from the disk caches. The Global conductor could be implemented as a user-mode program running on the master node, e.g. by modifying an available job scheduler.

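A hypothetical user-mode sketch of such a Global conductor loop (the message layout, port, subnet and 100 Hz beat are assumptions, not the paper's implementation; in the paper's design the heartbeat itself is provided by NetTick, and UDP is used here only to keep the sketch self-contained):

    /* Hypothetical Global conductor loop: every beat it broadcasts a small
     * message telling the Local instrumentalists which deferred system
     * activities to run. All names and values are assumptions. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    struct gc_msg {
        uint64_t seq;        /* heartbeat sequence number */
        uint32_t commands;   /* bitmask of scheduled system activities */
    };

    enum { GC_CHECK_TIMERS = 1 << 0, GC_RECLAIM_CACHE = 1 << 1 };

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));

        struct sockaddr_in dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(9999);                          /* assumed port */
        inet_pton(AF_INET, "192.168.1.255", &dst.sin_addr);  /* assumed subnet */

        struct gc_msg msg = { 0, 0 };
        for (;;) {
            msg.seq++;
            msg.commands = GC_CHECK_TIMERS;                  /* every beat */
            if (msg.seq % 100 == 0)
                msg.commands |= GC_RECLAIM_CACHE;            /* once per second */
            sendto(sock, &msg, sizeof(msg), 0,
                   (struct sockaddr *)&dst, sizeof(dst));
            usleep(10000);                                   /* 10 ms beat */
        }
    }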

Local instrumentalist. The Local instrumentalist (LI) runs on each cluster node and performs the operations scheduled by the Global conductor; when needed, it can notify the GC once the operations have been completed. Main goal: do not interfere with the computing phase of the HPC application. Main component: a kernel task scheduler (our prototype currently uses HPCSched).

HPCSched. HPCSched is a new scheduler designed for HPC clusters, presented at ACM/IEEE Supercomputing 2008. Main features: balancing MPI applications and minimizing the noise caused by other user and kernel threads. It uses the hardware resource allocation mechanism provided by the underlying processor (such as the IBM POWER6) and heuristic functions to select the amount of hardware resources to assign to each task.

HPCSched introduces a new scheduling class and a new policy to select and balance the tasks: many kernel activities should not be executed if an HPC application is ready to run. Main result: HPCSched is able to execute the system activities during the communication phase of the MPI application.
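
A hypothetical sketch of that selection policy in plain C (this is not the real Linux sched_class interface, and all names are assumptions): system and background tasks get the CPU only while the HPC task is blocked, e.g. waiting in an MPI communication phase.

    #include <stdbool.h>
    #include <stddef.h>

    struct task {
        bool is_hpc;        /* tagged as belonging to the HPC application */
        bool runnable;      /* ready to run on this CPU */
    };

    struct task *pick_next(struct task *tasks, size_t n)
    {
        /* 1. An HPC task that is ready to run always wins. */
        for (size_t i = 0; i < n; i++)
            if (tasks[i].is_hpc && tasks[i].runnable)
                return &tasks[i];

        /* 2. Otherwise (HPC task blocked on communication), deferred system
         *    activities and other threads may use the CPU. */
        for (size_t i = 0; i < n; i++)
            if (tasks[i].runnable)
                return &tasks[i];

        return NULL;   /* idle */
    }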

NetTick. NetTick is our prototype implementation of the heartbeat mechanism. It is implemented as a patch for the Linux kernel v. 2.6.24. Main idea: substitute the nodes' local tick device with a common net tick device. All the nodes execute the timer interrupt at the same moment and the same number of times, so they share the same perception of the flow of time.

Local tick device vs. common tick device. Why do we need a common tick device? First of all, the Network Time Protocol (NTP) is NOT a solution: it synchronizes only the wall clock time. Instead we need to synchronize the system tick. Normally each node uses its own timer device to measure time and raise a periodic interrupt (e.g. the Local APIC timer on Intel/AMD, the Decrementer on PowerPC). Even timer devices of the same type oscillate at slightly different frequencies, so the tick period is slightly different from node to node and the machines have a different perception of the flow of time.

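As a back-of-the-envelope example (the ±50 ppm crystal tolerance is an assumed typical value, not a figure from the slides): with a 1000 Hz tick, two nodes whose oscillators differ by 100 ppm drift apart by about 100 µs every second, i.e. by a full 1 ms tick roughly every 10 seconds, so after a few minutes of runtime their jiffies counters can disagree by tens of ticks.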

NetTick implementation. Patch for Linux v. 2.6.24: no changes in drivers, only the core has been patched. A very small part of the patch, needed only to support SMP systems, is architecture-dependent; currently we support Intel IA32, Intel 64 and PowerPC64. A sysfs interface activates/deactivates NetTick. NetTick is based on an Ethernet communication channel: our test bed used Gigabit Ethernet. According to the Top500 list, more than 56% of the top-ranking clusters are based on Gigabit Ethernet; if NetTick performs well on Gigabit Ethernet, it can presumably work with any other network architecture offering similar or higher performance.

NetTick: global heartbeat. The master node sends periodic frames, with the capability to dynamically change the nodes' tick frequency.
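
A user-space approximation of the master's role (the real mechanism sits inside the kernel; the EtherType, frame layout, interface name and 100 Hz period are assumptions, not the paper's values): broadcast one raw Ethernet frame per global tick.

    /* Broadcast a heartbeat frame per tick over a raw socket (needs
     * CAP_NET_RAW). Sketch only; all values are assumptions. */
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define TICK_ETHERTYPE 0x88B5        /* IEEE "local experimental" EtherType */

    int main(void)
    {
        int sock = socket(AF_PACKET, SOCK_RAW, htons(TICK_ETHERTYPE));

        struct sockaddr_ll addr;
        memset(&addr, 0, sizeof(addr));
        addr.sll_family   = AF_PACKET;
        addr.sll_protocol = htons(TICK_ETHERTYPE);
        addr.sll_ifindex  = if_nametoindex("eth0");   /* assumed interface */
        addr.sll_halen    = ETH_ALEN;
        memset(addr.sll_addr, 0xff, ETH_ALEN);        /* broadcast MAC */

        unsigned char frame[ETH_ZLEN] = { 0 };        /* src MAC left zero */
        memset(frame, 0xff, ETH_ALEN);                /* dst: ff:ff:ff:ff:ff:ff */
        frame[12] = TICK_ETHERTYPE >> 8;              /* EtherType field */
        frame[13] = TICK_ETHERTYPE & 0xff;

        for (uint64_t seq = 0; ; seq++) {
            memcpy(frame + ETH_HLEN, &seq, sizeof(seq));   /* tick number */
            sendto(sock, frame, sizeof(frame), 0,
                   (struct sockaddr *)&addr, sizeof(addr));
            usleep(10000);                            /* 10 ms: 100 Hz heartbeat */
        }
    }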

NetTick: timer interrupt emulation.
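
A conceptual kernel-side sketch of the emulation as we understand it from the slides (the function and variable names are assumptions; only struct clock_event_device and its event_handler callback are real Linux 2.6.24 interfaces, and this is not the actual patch):

    /* When a NetTick heartbeat frame is received, run the same handler that
     * the local timer interrupt would have run, so jiffies, the timer wheel
     * and the scheduler tick advance on the master's beat instead of on the
     * local oscillator. */
    #include <linux/clockchips.h>
    #include <linux/types.h>

    static struct clock_event_device *nettick_evtdev;  /* emulated tick device */

    /* Invoked from the network RX path for each heartbeat frame. */
    static void nettick_rx(u64 seq)
    {
        (void)seq;   /* the sequence number could be used to detect lost beats */
        if (nettick_evtdev && nettick_evtdev->event_handler)
            nettick_evtdev->event_handler(nettick_evtdev);
    }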

We ran three different sets of experiments: noise reduction on a single node, measured with FTQ (Fixed Time Quanta), for both timer interrupt noise and scheduling noise; and scaling tests on a cluster (using the NAS Parallel Benchmarks), showing that deferring system activities does not impact the performance of parallel applications.

Test bed. A 24-node Apple Xserve cluster, generously offered by the Italian Defence's General Staff. Each node is equipped with two dual-core Intel Xeon 5150 2.66 GHz processors (96 cores overall), 4 GB of RAM and two Gigabit Ethernet network interface cards, and ran the HPCSched and NetTick prototypes. One of the 24 nodes was employed as the master node.

Noise measured with FTQ. We executed Fixed Time Quanta (FTQ) on a single node to analyze the noise introduced by the timer interrupt. Test conditions, needed to isolate the timer interrupt: all the CPUs idle (only FTQ running); reduced number of kernel threads and user daemons. FTQ measures the amount of work done in a fixed time quantum (1 ms) in terms of basic operations. The difference between the maximum number of basic operations (N_max) and the number of basic operations in a sample (N_i) can be considered noise: noise_i = N_max - N_i.

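A simplified FTQ-style loop in C, sketching how such numbers are obtained (an illustration only, not the actual FTQ benchmark code): count how many basic operations fit in each fixed 1 ms quantum; quanta partially stolen by the OS yield fewer operations.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static inline uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    int main(void)
    {
        enum { SAMPLES = 1000, QUANTUM_NS = 1000000 };   /* 1 ms quanta */
        uint64_t count[SAMPLES], max = 0;
        volatile uint64_t sink = 0;     /* keeps the work loop from being optimized out */

        for (int i = 0; i < SAMPLES; i++) {
            uint64_t end = now_ns() + QUANTUM_NS, ops = 0;
            while (now_ns() < end) {    /* basic operation: one loop iteration */
                sink += ops;
                ops++;
            }
            count[i] = ops;
            if (ops > max)
                max = ops;
        }

        /* noise_i = N_max - N_i, as defined above */
        for (int i = 0; i < SAMPLES; i++)
            printf("%d %llu\n", i, (unsigned long long)(max - count[i]));
        return 0;
    }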

Timer interrupt noise measured with FTQ (figure: standard system).

Scheduling noise measured with FTQ. In this test we introduced a periodic user daemon for cluster activities (figure: standard system and HPCSched).

A cluster using NetTick. Curves of execution times of NAS EP varying the number of cores, comparing the standard system tick (100 Hz) with NetTick (10 Hz, 50 Hz, 100 Hz). We also ran NAS BT and obtained the same evidence: no delay introduced by NetTick.

We have proposed the Cluster Advanced Operating System, an extension to the Linux kernel for HPC applications in cluster-based supercomputers. Main goals: schedule OS activities in order to reduce the noise, and simplify activities like debugging or checkpointing. In the prototype we developed two major components: HPCSched, a local task scheduler for HPC applications, and NetTick, a global heartbeat mechanism. We showed that both components reduce the noise on a single node, and that NetTick does not introduce overhead that impacts the performance of HPC applications.

Thanks! A Global Operating System for HPC clusters. System Programming Research Group, University of Rome Tor Vergata, http://www.sprg.uniroma2.it. BlueGene Software Division, IBM TJ Watson Research Center. Contacts: {betti,cesati}@sprg.uniroma2.it, rgioios@us.ibm.com, francesco.piermaria@gmail.com. We gratefully acknowledge the support of this project by the Italian Defence's General Staff.