The MOSIX Scalable Cluster Computing for Linux. mosix.org


The MOSIX Scalable Cluster Computing for Linux. Prof. Amnon Barak, Computer Science, The Hebrew University. http://www.mosix.org 1

Presentation overview Part I: Why computing clusters (slides 3-7) Part II: What is MOSIX (8-20) Part III: Parallel file systems (21-26) Part IV: Tools (27-28) Part V: Experience / parallel applications (29-32) Part VI: Summary / current and future projects (33-36) Color code: good properties are highlighted in blue, not-so-good properties in red 2

Computing clusters are not just for High Performance Computing (HPC) Many organizations need powerful systems to run demanding applications and for high availability. Some examples: Internet: web servers, ISP, ASP, GRID computing. Telephony: scalable switches, billing systems. Image/signal processing: DVD, rendering, movies. Databases: OLTP, decision support, data warehousing. Commercial applications, financial modeling. Real-time (RT), fault-tolerant (FT) and intensive I/O jobs, large compilations,... 3

Outcome Low-cost computing clusters are replacing traditional supercomputers and mainframes, because they can provide good solutions for many demanding applications. Computing clusters are especially suitable for small and medium organizations that cannot afford expensive systems: they are made of commodity components and allow gradual growth, tuned to budget and needs 4

Advantages and disadvantages of clusters Advantage: the ability to do parallel processing, giving increased performance and higher availability. Disadvantage: more complex to use and maintain. Question: is an SMP (NUMA) a good alternative? Answer: classify parallel systems using 4 criteria: ease (complexity) of use, overall performance (resource utilization), cost (affordability) and scalability 5

2 alternatives: SMP vs. CC SMP (NUMA): easy to use, all the processors work in harmony, efficient multiprocessing and IPC, supports shared memory, good resource allocation, transparent to applications, expensive when scaled up. Clusters: low cost, unlimited scalability, but more difficult to use: insufficient bonds between nodes (each node works independently), many OS services are locally confined, limited provisions for remote services, inefficient resource utilization, may require modifications to applications 6

SMP vs. Clusters, by property (the better option varies):
Ease of use: SMP/NUMA excellent, Cluster inconvenient
Utilization: SMP/NUMA excellent, Cluster tedious
Cost: SMP/NUMA high, Cluster excellent
Scalability: SMP/NUMA limited, Cluster unlimited
7

Goal of MOSIX Make cluster computing as efficient and easy to use as an SMP 8

Without MOSIX: user-level control. Not transparent to applications; rigid management, lower performance. [Diagram: parallel and sequential applications run over PVM/RSH or MPI/RSH on top of independent Linux machines] 9

What is MOSIX A kernel layer that pools together the cluster-wide resources to provide users' processes with the illusion of running on one big machine. Model: single system (like an SMP). Ease of use - transparent to applications - no need to modify applications. Maximal overall performance - adaptive resource management, load-balancing by process migration. Overhead-free scalability (unlike SMP) 10

MOSIX is a unifying kernel layer. [Diagram: parallel and sequential applications, an optional PVM/MPI/RSH layer, the transparent unifying MOSIX kernel layer, and independent Linux machines at the bottom] 11

A two-tier technology 1. Information gathering and dissemination Provides each node with sufficient cluster information Supports scalable configurations Fixed overhead, whether 16 or 6000 nodes 2. Preemptive process migration Can migrate any process, anywhere, anytime Transparent to applications - no change to the user interface Supervised by adaptive algorithms that respond to global resource availability 12

Tier 1: Information exchange All the nodes gather and disseminate information about relevant resources: CPU speed, load, free memory, local/remote I/O, IPC. Information is exchanged in a random fashion (to support scalable configurations and overcome failures). Applicable to high-volume transaction processing: scalable web servers, telephony, billing; scalable storage area cluster (SAN + cluster) for intensive, parallel access to shared files 13
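The random information exchange described above can be pictured as a small gossip protocol: each node periodically refreshes its own record and pushes its freshest records to one randomly chosen peer. The following user-space sketch only illustrates the idea; the real algorithm runs inside the kernel, and the record layout, window size and aging rule used here are assumptions.

```c
/* Gossip-style load dissemination: a user-space sketch of the idea only.
 * The real MOSIX algorithm runs inside the kernel; the record layout,
 * window size and aging rule below are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES  16   /* cluster size; per-node overhead stays fixed as this grows */
#define WINDOW  4   /* how many fresh records each message carries */

struct load_info {
    int node;       /* originating node */
    int load;       /* normalized CPU load */
    int free_mem;   /* free memory in MB */
    int age;        /* gossip rounds since the record was produced */
};

/* each node keeps a small vector of the freshest records it has heard */
static struct load_info vec[NODES][NODES];

/* one round on node `me`: refresh own record, push freshest records to a random peer */
static void gossip_round(int me, int my_load, int my_free_mem)
{
    int target = rand() % NODES;   /* random peer: scalable and failure-tolerant */
    int sent = 0;

    for (int i = 0; i < NODES; i++)              /* everything heard earlier ages by one round */
        vec[me][i].age++;
    vec[me][me] = (struct load_info){ me, my_load, my_free_mem, 0 };

    if (target == me)
        return;
    for (int i = 0; i < NODES && sent < WINDOW; i++) {
        if (vec[me][i].age <= 1) {               /* forward only the freshest records */
            struct load_info msg = vec[me][i];
            msg.age++;
            if (msg.age < vec[target][msg.node].age)
                vec[target][msg.node] = msg;     /* the receiver keeps the fresher copy */
            sent++;
        }
    }
}

int main(void)
{
    srand((unsigned)time(NULL));
    for (int i = 0; i < NODES; i++)              /* initially nothing has been heard */
        for (int j = 0; j < NODES; j++)
            vec[i][j] = (struct load_info){ j, 0, 0, 1000 };

    for (int round = 0; round < 100; round++)
        for (int n = 0; n < NODES; n++)
            gossip_round(n, rand() % 100, rand() % 2048);

    printf("node 0's view of node %d: load=%d, age=%d rounds\n",
           NODES - 1, vec[0][NODES - 1].load, vec[0][NODES - 1].age);
    return 0;
}
```

Because each node talks to a single random peer per round, the per-node overhead stays fixed as the cluster grows, which is what the previous slide means by the same overhead for 16 or 6000 nodes.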

Storage Area Cluster. [Diagram: the cluster nodes are connected by a LAN switch and attached to a Fibre Channel storage area network] Problem: how to preserve file consistency 14

Example: parallel make at EMC² Assign the next compilation to the least loaded node A cluster of ~320 CPUs (4-way Xeon nodes) Runs 100-150 builds, with millions of lines of code, concurrently Serial time ~30 hours, cluster time ~3 hours Much better performance and unlimited scalability vs. a commercial package, for less cost 15
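The dispatch rule in this example, assigning the next compilation to the least loaded node, is just an argmin over per-node loads. A minimal sketch with made-up load values; in practice the loads would come from the Tier-1 information vector:

```c
/* "Assign the next compilation to the least loaded node": a minimal
 * sketch of the dispatch rule from the parallel-make example.
 * The load values are made up; in practice they would come from the
 * Tier-1 information vector. */
#include <stdio.h>

static int least_loaded(const double *load, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}

int main(void)
{
    double load[4] = { 2.5, 0.3, 1.1, 4.0 };   /* hypothetical per-node loads */
    printf("send the next compile job to node %d\n", least_loaded(load, 4));
    return 0;
}
```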

Tier 2: Process migration, used for: Load balancing: to improve the overall performance by responding to uneven load distribution. Memory ushering: to prevent disk paging by migrating processes when RAM is exhausted. Parallel file operations: bring the process to the data. Speed of migration (Fast Ethernet): 88.8 Mb/sec 16
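A simplified sketch of the two migration triggers named above, load balancing and memory ushering. The thresholds and the migration-cost constant are illustrative assumptions, not the actual MOSIX heuristics:

```c
/* Migration decision sketch for the two triggers named on the slide:
 * load balancing and memory ushering. The thresholds and migration-cost
 * constant are illustrative assumptions, not the actual MOSIX heuristics. */
#include <stdio.h>

struct node_state {
    double load;         /* normalized CPU load */
    long   free_mem_mb;  /* free RAM in MB */
};

/* Should a process of size proc_mb migrate from `local` to `remote`? */
static int should_migrate(const struct node_state *local,
                          const struct node_state *remote,
                          long proc_mb)
{
    const double migration_cost = 0.5;   /* assumed cost of moving the process, in load units */

    /* memory ushering: local RAM is exhausted and the remote node can hold the process */
    if (local->free_mem_mb < proc_mb && remote->free_mem_mb >= proc_mb)
        return 1;

    /* load balancing: migrate only if the expected gain exceeds the migration cost */
    if (local->load - remote->load > migration_cost && remote->free_mem_mb >= proc_mb)
        return 1;

    return 0;
}

int main(void)
{
    struct node_state local  = { .load = 3.2, .free_mem_mb = 64 };
    struct node_state remote = { .load = 0.8, .free_mem_mb = 1024 };
    printf("migrate a 128MB process? %s\n",
           should_migrate(&local, &remote, 128) ? "yes" : "no");
    return 0;
}
```

In the real system this decision is made continuously by the adaptive kernel algorithms, using the information disseminated by Tier 1.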

Intelligent load-balancing. [Diagram: round-robin placement puts processes A and D on node 1 and B and C on node 2, so the communicating pairs span the nodes and incur IPC overhead; the optimal assignment co-locates A and B on node 1 and C and D on node 2, with no IPC overhead] 17

Single system image model Users can login to any node in the cluster; that node is the home-node for all the user's processes. Migrated processes always seem to run at the home node, e.g., ps shows all your processes, even if they run elsewhere. Migrated processes use local resources (at the remote nodes), while interacting with the home-node to access their environment, e.g. to perform I/O. Drawback: extra overhead for file and I/O operations 18

Splitting the Linux process. [Diagram: the process is split between node 1 (home), where the deputy remains in the kernel, and node 2, where the migrated user-space context runs; the two halves are joined by the MOSIX link] The process context (remote) is site-independent and may migrate; the system context (deputy) is site-dependent and must stay at home. The two are connected by an exclusive link used for both synchronous (syscalls) and asynchronous (signals, MOSIX events) interactions 19
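The deputy/remote split can be thought of as an RPC-like loop: the migrated (remote) part executes user code on the current node and ships each site-dependent system call over the private link to its deputy at the home node, which performs the call in the home environment and returns the result. The message layout and tiny dispatch below are a hypothetical illustration, not the real MOSIX link protocol:

```c
/* Deputy/remote syscall forwarding: a hypothetical illustration of the
 * division of labor, not the real MOSIX link protocol or message layout. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* what the remote (migrated) side would ship over the exclusive link */
struct remote_syscall {
    int  nr;          /* which operation to perform at home (toy numbering) */
    char path[256];   /* site-dependent argument: a file name in the home environment */
    int  flags;
};

/* deputy side (home node): perform the call in the home environment */
static long deputy_execute(const struct remote_syscall *req)
{
    switch (req->nr) {
    case 1:  /* "open" in this toy protocol */
        return open(req->path, req->flags);
    default:
        return -1;
    }
}

int main(void)
{
    /* the remote side builds the request ... */
    struct remote_syscall req = { .nr = 1, .flags = O_RDONLY };
    strncpy(req.path, "/etc/hostname", sizeof(req.path) - 1);

    /* ... in the real system it would travel over the MOSIX link to the deputy;
     * here the deputy is called directly just to show where the call executes */
    long fd = deputy_execute(&req);
    printf("deputy returned fd=%ld (the open was performed at the home node)\n", fd);
    if (fd >= 0)
        close((int)fd);
    return 0;
}
```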

Example: MOSIX vs. PVM. Fixed number of processes per node; random process sizes with an average of 8MB. MOSIX is scalable; PVM is not scalable 20

Direct File System Access (DFSA) I/O access through the home node incurs high overhead. Direct File System Access (DFSA) allows processes to perform file operations (directly) on the current node - not via the home node. Available operations: all common file and I/O system-calls on conforming file systems. Conforming FS: GFS, the MOSIX File System (MFS) 21

DFSA Requirements The FS (and symbolic links) are identically mounted on the same-named mount points (/mfs on all nodes). File consistency: an operation completed on one node is seen on any other node. This is required because a MOSIX process may perform consecutive syscalls from different nodes. Time-stamp consistency: if file A is modified after B, A's time stamp must be greater than B's time stamp 22

The MOSIX File System (MFS) Provides a unified view of all files and all mounted FSs on all nodes, as if they were within a single FS. Makes all directories and regular files throughout a MOSIX cluster available from all the nodes. Provides file consistency from different nodes by maintaining one cache at the server (disk) node. Parallel file access by process migration 23
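From an application's point of view, MFS simply makes remote files reachable through ordinary path names under the /mfs mount point named on the DFSA requirements slide. The per-node directory layout and the node number used below are assumptions for illustration:

```c
/* Reading a file that physically resides on another node through MFS.
 * A sketch assuming the /mfs mount point named on the DFSA slide and a
 * per-node directory layout (/mfs/<node>/...); the node number and the
 * path below are hypothetical. */
#include <stdio.h>

int main(void)
{
    /* to the application this is an ordinary open; MFS makes node 2's
     * file system visible under /mfs/2 on every node of the cluster */
    FILE *f = fopen("/mfs/2/etc/hostname", "r");
    char line[128];

    if (!f) {
        perror("open via MFS");
        return 1;
    }
    if (fgets(line, sizeof(line), f))
        printf("hostname of node 2: %s", line);
    fclose(f);
    return 0;
}
```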

Global File System (GFS) with DFSA Same as MFS, but with local caching over the cluster using a unique locking mechanism. GFS + process migration combines the advantages of load-balancing with direct disk access from any node - for parallel file operations. GFS for Linux 2.2 includes support for DFSA; it has not yet been written for Linux 2.4 24

Postmark (heavy FS load), client-server performance. Elapsed time in seconds for each access method and data-transfer block size:

Access method          64B     512B    1KB     2KB     4KB     8KB     16KB
Local (in the server)  102.6   102.1   100.0   102.2   100.2   100.2   101.0
MFS                    104.8   104.0   103.9   104.1   104.9   105.5   104.4
NFS                    184.3   169.1   158.0   161.3   156.0   159.5   157.5

25

Parallel read from a quad server (4KB block size). [Chart: time in seconds (0-300) vs. number of nodes (1, 2, 4, 8, 16) for Local, MFS and NFS access] 26

Tools Linux kernel debugger - on-line access to the kernel memory, process properties, stack, etc. Cluster installation, administration and partitioning. Kernel monitor (mon) - displays load, speed, total and used memory and CPU utilization. Java monitor - displays cluster properties: CPU load, speed, utilization, memory utilization, CPU temperature, fan speed; access via the Internet 27

Java Monitor displays: LEDs, dials, radar, pie, bars, matrix graphs, fan, tables 28

MOSIX is best for Multi-user platforms - where many users share the cluster resources. High-performance, demanding applications. Parallel applications - HPC. Non-uniform clusters - with machines of different speeds, different memory sizes, etc. 29

API and implementation API: no new system-calls - everything is done via /proc. MOSIX 0.9 for Linux 2.2 (since May 1999): 80 new files (40K lines), 109 files modified (7K lines), 3K lines in load-balancing algorithms. MOSIX 1 for Linux 2.4 (since May 2001): 45 new files (35K lines), 124 files modified (4K lines), 48 user-level files (10K lines) 30
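Since the API is exposed through /proc rather than through new system calls, controlling MOSIX from a program reduces to reading and writing small text files. The entry names below are hypothetical placeholders, not the documented paths:

```c
/* Driving MOSIX through /proc: a sketch only. The slide states that the
 * API is exposed via /proc with no new system calls; the entry names used
 * below are hypothetical placeholders, not the documented paths. */
#include <stdio.h>

int main(void)
{
    char buf[64];

    /* read a per-node value, e.g. the local load (hypothetical path) */
    FILE *f = fopen("/proc/mosix/load", "r");
    if (f) {
        if (fgets(buf, sizeof(buf), f))
            printf("local load: %s", buf);
        fclose(f);
    }

    /* ask that the current process migrate to node 3 (hypothetical path) */
    f = fopen("/proc/self/goto", "w");
    if (f) {
        fprintf(f, "3\n");
        fclose(f);
    }
    return 0;
}
```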

Our scalable configuration 31

Examples of HPC Applications Programming environments: PVM, MPI. Examples: Protein classification - 20 days on 32 nodes. High-energy molecular dynamics. Quantum dynamics simulations - 100 days non-stop. Weather forecasting (MM5). Computational fluid dynamics (CFD). Car crash simulations (parallel Autodyn) 32

Summary MOSIX brings a new dimension to Cluster and GRID Computing Ease of use - like an SMP Overhead-free scalability Near optimal performance May be used over LAN and WAN (GRID) Production quality - Large production clusters installed 33

Ongoing projects DSM - strict consistency model High availability - recovery model Migratable sockets - for IPC optimization Scalable web server - cluster IP Network (cluster) RAM - bring process to data Parallel File System 34

Future projects * Port MOSIX to: New platforms, e.g. Itanium, PPC Other OS, e.g. FreeBSD, Mac OS X Shared disk file system, e.g. GFS Parallel DBMS, e.g. billing systems Tune to specific parallel applications, e.g. rendering * subject to external funding 35

The MOSIX group Small R&D group - 7 people. Over 20 years of experience in the Unix kernel. Development of 7 generations of MOSIX. We would like to work with a large company on R&D of cluster technology - not only in Linux. Possible directions: HPC, telephony, high-volume transaction/web servers, rendering, movies, etc. 36

MOSIX Thank you for your attention Questions? For further details please contact: amnon@cs.huji.ac.il 37