Practice of Software Development: Dynamic scheduler for scientific simulations


SimLab EA Teilchen
Steinbuch Centre for Computing (SCC), KIT, University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
www.kit.edu

The Scheduler is a user-friendly middleware layer between an application and the cluster's scheduling system, with graphical input and output interfaces. It is applicable to scientific applications that run a large number of identical, independent tasks within one computational job, and it can be integrated into particular scientific applications:
- CORSIKA (COsmic Ray SImulations for KAscade), a code for simulating extensive air showers initiated by high-energy cosmic-ray particles;
- a simulation code for atomic clusters;
- PolGrawAllSky, a code searching for periodic gravitational-wave signals.

Scheduler modules (module diagram: communicator, executer interface, code interface, scheduler with Scheduling and Data Mining submodules, queue, bookkeeping, statistics, visualization system).

Possible system events:
- executer interface: send "execute task"; receive "task finished";
- code interface: receive "new task".

Coupling types: task data; schedule information; communication.

Here a task is a pair of:
- a Data Type (e.g. structure, array);
- Input Data of the specified type.

Scheduler modules, advanced version: the same modules plus a GUI.

GUI: a graphical user interface for entering input data and generating a batch script for a particular batch system. It can be based on the X-Win API and a batch-system API, e.g. the MOAB API.

System events:

"New task" signal:
1. receive a new task from the code interface;
2. send the task to the Data Mining module;
3. analyse the task according to the collected statistics;
4. send the estimated task weight to the scheduler;
5. schedule the task (place it in the queue).

"Task finished" signal:
1. receive the finished task from the executer interface;
2. store the task information in the database;
3. take a new task from the queue;
4. send an "execute task" signal to the executer interface.

Advanced: if the local queue is empty, check other schedulers for a task.

Scheduling module and communicator.

Scheduling module:
- contains a variety of scheduling algorithms and strategies;
- queries the Data Mining module for how many resources a task needs;
- decides how to place a task in the schedule (queue it, or run it immediately);
- does bookkeeping.

Communicator:
- has to organize multi-level communication among multiple parallel schedulers (one computation, several processes, multiple tasks);
- the communication can be organized, for example, by:
  - submitting new multiprocessor jobs;
  - changing the size of your parallel world;
  - fixing the size of the parallel world and distributing tasks among the reserved processes.

Ordering resources when required:
- submit new multiprocessor jobs using shell-script generation or a cluster submission-system API (e.g. the Moab API): see "Batching jobs @ bwHPC clusters" and "Interfacing with Moab (APIs)";
- use the MPI 3.x dynamic-process mechanism to change the size of your parallel world (the number of processes used for the computation) according to current requirements: see the MPI 3.1 standard, MPI-dynamicprocesses.pdf, and "Advanced topics in MPI programming" @ etutorials, chapter 9.

Fixed amount of resources (fixed size of the MPI world):

Single-queue, master-managed scheduling: one process is the master and does the scheduling; the others are slaves and do the work.
+ relatively easy to implement;
- the master is a bottleneck and can serve only one slave at a time.

Multiple-queue scheduling (preferable): each process has its own queue of tasks, and tasks can be redistributed among queues.
+ no bottlenecks: all processes run the same algorithm and do equal work;
- synchronisation is difficult to implement: many-to-many communication has to be organized.

Single-queue, master-managed scheduling: each process can ask the master for access to the task queue (p0 is the master holding the queue; p1, p2, p3, ..., pn are slaves).

The queueing rule can be:
- LIFO, FIFO;
- long-first, short-first;
- round-robin;
- statistics-based.

See more about scheduling strategies in the following doctoral theses: chapter 2.2 of shed2.pdf (a good basic overview); chapters 2 and 3 of shed1.pdf; chapters 2.1-2.4 of shed3.pdf.

Multiple-queue scheduling: each process has its own queue and runs the same algorithm (processes p0, p1, p2, p3, ..., pn, each with its own task queue).

Queue management can be one of the following:
- random or rule-based task distribution;
- statistics-based task distribution;
- task stealing (see the next two slides).

Task-stealing algorithm (the same for all processes):
1. Check your own queue (a blocking operation on the own queue): if there is a task, take it and remove it from the queue.
2. If there is no task, then for each process p among the other processes, check p's queue (a blocking operation on p's queue): if there is a task, take it and remove it from the queue.
3. If a task was found, run the work cycle on it and start over; if all processes are idle, the algorithm ends.

Work cycle: the execute function of the executer interface is called here, with input parameters taken from the task. During the work cycle, one or several new tasks may be added to the own queue.

Task stealing, intermediate state: some processes are working, some are stealing. In the diagram, pi is a working process issuing a request to its own queue, which still contains tasks; pj is an idle process issuing a request to another process's queue, its own queue slots being empty.

How to organize stealing on distributed-memory systems is described in:
- the Cornell Virtual Workshop on MPI one-sided communication;
- the article "One-Sided Communications with MPI-2" in Linux Magazine.

Data Mining module:
- is called from the Scheduling module;
- analyses the statistics to determine the resource requirements of a task;
- keeps statistics for completed tasks;
- accumulates statistics of equal computations (same scientific code) in a single database.

Advanced: polynomial approximation of the collected statistics (e.g. once, at the end of a computation), so that a single function can be used instead of on-line database access and data analysis at each step.

Database module:
- is used to keep statistics, bookkeeping data, and the local schedule;
- can either be accessed in parallel or be replicated among processes, with synchronisation during the computation or accumulation at its end.

Bookkeeping is done for the following events:
- task appears (timestamp, task parameters, parent process);
- task started (timestamp, task parameters, hosting process);
- task finished (timestamp, task parameters, hosting process);
- interconnection between processes (timestamp, process ranks, result).

The statistic collected for each task is the runtime of the task as a function of the task parameters.

Visualization system module: used to visually analyse the collected data. It either:
- exports the obtained results into a selected format (e.g. Gnuplot format, .vtk format for ParaView, or any other format used by preferred open-source visualization software); or
- works directly with a visualization system's API (e.g. the Gnuplot API).

Code and executer interfaces. A task is a pair of:
- a Data_Type (e.g. a structure or an array);
- the Data of this type.

Within one computation the Data_Type is fixed; only the Data differs. However, interfaces to different simulation software have different task types, so the use of template programming is highly recommended. A simple example can be found at http://www.learncpp.com/cpp-tutorial/143-template-classes/.