Ian Foster, An Overview of Distributed Systems

Similar documents
Arguably one of the most fundamental discipline that touches all other disciplines and people

Typically applied in clusters and grids Loosely-coupled applications with sequential jobs Large amounts of computing for long periods of times

Support for resource sharing Openness Concurrency Scalability Fault tolerance (reliability) Transparence. CS550: Advanced Operating Systems 2

Synonymous with supercomputing Tightly-coupled applications Implemented using Message Passing Interface (MPI) Large of amounts of computing for short

Ian Foster, CS554: Data-Intensive Computing

Today (2010): Multicore Computing 80. Near future (~2018): Manycore Computing Number of Cores Processing

Extreme scripting and other adventures in data-intensive computing

Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago

NFS, GPFS, PVFS, Lustre Batch-scheduled systems: Clusters, Grids, and Supercomputers Programming paradigm: HPC, MTC, and HTC

NFS, GPFS, PVFS, Lustre Batch-scheduled systems: Clusters, Grids, and Supercomputers Programming paradigm: HPC, MTC, and HTC

Workflow languages and systems

Grid, cloud, and science: Accelerating discovery. A View and Practice from University of Chicago

Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago

Assistant Professor at Illinois Institute of Technology (CS) Director of the Data-Intensive Distributed Systems Laboratory (DataSys)

MOHA: Many-Task Computing Framework on Hadoop

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

Assistant Professor at Illinois Institute of Technology (CS) Director of the Data-Intensive Distributed Systems Laboratory (DataSys)

The Red Storm System: Architecture, System Update and Performance Analysis

Managing and Executing Loosely-Coupled Large-Scale Applications on Clusters, Grids, and Supercomputers

Overview Past Work Future Work. Motivation Proposal. Work-in-Progress

Ioan Raicu. Distributed Systems Laboratory Computer Science Department University of Chicago

A Data Diffusion Approach to Large Scale Scientific Exploration

High Performance Computing Course Notes HPC Fundamentals

Introduction to Grid Computing

The Stampede is Coming: A New Petascale Resource for the Open Science Community

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

IBM Power AC922 Server

Chapter 1: Introduction to Parallel Computing

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments

The Fusion Distributed File System

Real Parallel Computers

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Advanced School in High Performance and GRID Computing November Introduction to Grid computing.

SAP HANA. Jake Klein/ SVP SAP HANA June, 2013

Ioan Raicu. Everyone else. More information at: Background? What do you want to get out of this course?

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

Real Parallel Computers

Headline in Arial Bold 30pt. Visualisation using the Grid Jeff Adie Principal Systems Engineer, SAPK July 2008

Building NVLink for Developers

Lecture 9: MIMD Architecture

Overview of HPC at LONI

The Digitising European Industry strategy & H2020 calls related to Cyber-Physical Systems

Chapter 2 Parallel Hardware

Intel Many Integrated Core (MIC) Architecture

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc

System Software for Big Data and Post Petascale Computing

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster

Introduction to High-Performance Computing

Outline. Definition of a Distributed System Goals of a Distributed System Types of Distributed Systems

Advances of parallel computing. Kirill Bogachev May 2016

The Use of Cloud Computing Resources in an HPC Environment

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2

Exascale: Parallelism gone wild!

Intro to Software as a Service (SaaS) and Cloud Computing

Distributed Systems. Thoai Nam Faculty of Computer Science and Engineering HCMC University of Technology

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

The Future of Interconnect Technology

Clustering Documents. Case Study 2: Document Retrieval

Improved Solutions for I/O Provisioning and Application Acceleration

Accelerating Implicit LS-DYNA with GPU

Chapter 5. The MapReduce Programming Model and Implementation

GROMACS Performance Benchmark and Profiling. August 2011

Forming an ad-hoc nearby storage, based on IKAROS and social networking services

IBM CORAL HPC System Solution

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

Distributed Systems. Edited by. Ghada Ahmed, PhD. Fall (3rd Edition) Maarten van Steen and Tanenbaum

XBS Application Development Platform

Lecture 9: MIMD Architectures

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

SwiftStack and python-swiftclient

Distributed Systems Principles and Paradigms

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems

IBM Power Advanced Compute (AC) AC922 Server

GPUs and Emerging Architectures

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

The Future of High Performance Computing

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center

A Comparative Study of Various Computing Environments-Cluster, Grid and Cloud

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

HPC Capabilities at Research Intensive Universities

<Insert Picture Here> Enterprise Data Management using Grid Technology

A Cloud-based Dynamic Workflow for Mass Spectrometry Data Analysis

An Overview of Fujitsu s Lustre Based File System

Scaling a Global File System to the Greatest Possible Extent, Performance, Capacity, and Number of Users

Pervasive DataRush TM

Introduction to High-Performance Computing

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

Distributed Operating Systems Fall Prashant Shenoy UMass Computer Science. CS677: Distributed OS

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows

Multitasking and Multithreading on a Multiprocessor With Virtual Shared Memory. Created By:- Name: Yasein Abdualam Maafa. Reg.

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

Maximizing heterogeneous system performance with ARM interconnect and CCIX

IBM Power Systems HPC Cluster

Transcription:

The advent of computation can be compared, in terms of the breadth and depth of its impact on research and scholarship, to the invention of writing and the development of modern mathematics. Ian Foster, 2006 An Overview of Distributed Systems

An Overview of Distributed Systems

Computational thinking will be a fundamental skill used by everyone in the world by the middle of the 21st Century. Jeanette Wing, 2006 An Overview of Distributed Systems

An Overview of Distributed Systems

An Overview of Distributed Systems

What is a distributed system? A collection of independent computers that appears to its users as a single coherent system -A. Tanenbaum An Overview of Distributed Systems

A distributed system organized as middleware. The middleware layer extends over multiple machines, and offers each application the same interface. An Overview of Distributed Systems

An Overview of Distributed Systems [GCE08] Cloud Computing and Grid Computing 360-Degree Compared

Computer clusters using commodity processors, network interconnects, and operating systems. An Overview of Distributed Systems

Rack 32 Node Cards Supercomputing ~ HPC Cabled 8x8x16 32 Racks Baseline System Chip 4 processors Compute Card 1 chip, 1x1x1 Node Card (32 chips 4x4x2) 32 compute, 0-4 IO cards 435 GF/s 64 GB 14 TF/s 2 TB 500TF/s 64 TB Highly-tuned computer clusters using commodity 13.6 GF/s 13.6 GF/s processors 2 GB DDR combined with custom network 8 MB EDRAM interconnects and An Overview customized of Distributed Systems operating system

Grids tend to be composed of multiple clusters, and are typically loosely coupled, heterogeneous, and geographically dispersed NCAR UC/ANL NCSA PU IU PSC Tennessee 2008 (~1PF) ORNL SDSC Tommy Minyard, TACC TACC 2007 (504TF) LONI/LSU An Overview of Distributed Systems Grids ~ Federation Computational Resources (size approximate - not to scale)

A large-scale distributed computing paradigm driven by: 1. economies of scale 2. virtualization 3. dynamically-scalable resources 4. delivered on demand over the Internet Clouds ~ hosting An Overview of Distributed Systems

Electronic health care systems Monitoring a person in a pervasive electronic health care system, using (a) a local hub or (b) a continuous wireless connection. An Overview of Distributed Systems

Sensor systems Organizing a sensor network database, while storing and processing data (a) only at the operator s site or. An Overview of Distributed Systems

Economics Microprocessors have better price/performance than mainframes Speed Collective power of large number of systems Geographic and responsibility distribution Reliability One machine s failure need not bring down the system Extensibility Computers and software can be added incrementally An Overview of Distributed Systems

Software Little software exists compared to PCs Networking Still slow and can cause other problems (e.g. when disconnected) Security Data may be accessed by unauthorized users An Overview of Distributed Systems

Tasks Completed Number of Processors Throughput (tasks/sec) CPU Cores: 130816 Tasks: 1048576 Elapsed time: 2483 secs CPU Years: 9.3 1000000 800000 600000 Processors Active Tasks Tasks Completed Throughput (tasks/sec) Speedup: 115168X (ideal 130816) Efficiency: 88% 5000 4500 4000 3500 3000 2500 400000 200000 2000 1500 1000 500 0 0 Time (sec) [SC08] Towards Loosely-Coupled Programming on Petascale Systems An Overview

PDB protein descriptions ZINC 3-D structures 1 protein (1MB) 6 GB 2M structures (6 GB) FRED Manually prep DOCK6 rec file start DOCK6 Receptor (1 per protein: defines pocket to bind to) DOCK6 Manually prep FRED rec file FRED Receptor (1 per protein: defines pocket to bind to) ~4M x 60s x 1 cpu ~60K cpu-hrs NAB Script Template BuildNABScript NAB Script NAB script parameters (defines flexible residues, #MDsteps) Amber prep: 2. AmberizeReceptor 4. perl: gen nabscript report Select best ~5K Select best ~5K Amber Select best ~500 GCMC end ligands [SC08] Towards Loosely-Coupled Programming on Petascale Systems complexes ~10K x 20m x 1 cpu ~3K cpu-hrs ~500 x 10hr x 100 cpu ~500K cpu-hrs Amber Score: 1. AmberizeLigand 3. AmberizeComplex 5. RunNABScript For 1 target: 4 million tasks 500,000 cpu-hrs (50 cpu-years) An Overview

CPU cores: 118784 Tasks: 934803 Elapsed time: 2.01 hours Compute time: 21.43 CPU years Average task time: 667 sec Relative Efficiency: 99.7% (from 16 to 32 racks) Utilization: Sustained: 99.6% Overall: 78.3% [SC08] Towards Loosely-Coupled Programming on Petascale Systems An Overview

Purpose On-demand stacks of random locations within ~10TB dataset Challenge Processing Costs: O(100ms) per object Data Intensive: 40MB:1sec Rapid access to 10-10K random files Time-varying load [DADC08] Accelerating Large-scale Data Exploration through Data Diffusion [TG06] AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis AP + + + + + + + = Sloan Data Locality Number of Objects Number of Files 1 111700 111700 1.38 154345 111699 2 97999 49000 3 88857 29620 4 76575 19145 5 60590 12120 10 46480 4650 20 40460 2025 An Overview of Distributed Systems 30 23695 790

Research Focus Emphasize designing, implementing, and evaluating systems, protocols, and middleware with the goal of supporting data-intensive applications on extreme scale distributed systems, from many-core systems, clusters, grids, clouds, and supercomputers People 1 Faculty member 5 PhD students 3 MS students 1 UG student More information

Number of Cores Manufacturing Process 300 250 200 150 100 50 0 2004 2006 2008 2010 2012 2014 2016 2018 100 Today (2011): Multicore Computing 80 1~16 cores commodity architectures 70 60 80 cores proprietary architectures 50 512 GPU cores 40 Near future (~2018): Manycore Computing Number of Cores Processing ~1000 cores commodity architectures Pat Helland, Microsoft, The Irresistible Forces Meet the Movable Objects, November 9 th, 2007 90 30 20 10 0 23

Top500 Core-Max (# of cores) Top500 Core-Sum (# of cores) 300,000 250,000 Top500 Core-Max Top500 Core-Sum 5,000,000 4,000,000 200,000 150,000 100,000 50,000 3,000,000 Today (2011): Petascale Computing 2,000,000 5K~50K nodes 1,000,000 50K~300K processor-cores 0 0 Near future (~2018): Exascale Computing http://www.top500.org/ ~1M nodes (20X~200X) ~1B processor-cores/threads (3000X~20000X) Top500 Projected Development, http://www.top500.org/lists/2009/11/performance_development 24

Relatively new paradigm 3 years old Amazon in 2009 40K servers split over 6 zones 320K-cores, 320K disks $100M costs + $12M/year in energy costs Revenues about $250M/year Amazon in 2018 Will likely look similar to exascale computing 100K~1M nodes, ~1B-cores, ~1M disks $100M~$200M costs + $10M~$20M/year in energy Revenues 100X~1000X of what they are today 25

Power efficiency Will limit the number of cores on a chip (Manycore) Will limit the number of nodes in cluster (Exascale and Cloud) Will dictate a significant part of the cost of ownership Programming models/languages Automatic parallelization Threads, MPI, workflow systems, etc Functional, imperative Languages vs. Middlewares 26

Bottlenecks in scarce resources Storage (Exascale and Clouds) Memory (Manycore) Reliability How to keep systems operational in face of failures Checkpointing (Exascale) Node-level replication enabled by virtualization (Exascale and Clouds) Hardware redundancy and hardware error correction (Manycore) 27

Everything about science is changing because of the impact of information technology Experimental, theoretical, and computational science are all being affected by the data deluge, and a fourth, dataintensive science paradigm is emerging. Goal A world in which all of the science literature is online all of the science data is online They interoperate with each other Computing and storage will increase in scale exponentially over the next decade Data will play central role in future computing systems Lots of new tools are An needed Overview of Distributed to Systems make this happen!

More information: http://www.cs.iit.edu/~iraicu/ http://datasys.cs.iit.edu/ Contact: iraicu@cs.iit.edu Questions? An Overview of Distributed Systems