
CMPE 655 - MULTIPLE PROCESSOR SYSTEMS: OVERHEADS ENHANCEMENT IN MULTIPLE PROCESSING SYSTEMS. BY ANURAG REDDY GANKAT AND KARTHIK REDDY AKKATI

What is MULTIPROCESSING? Multiprocessing is the coordinated processing of programs by more than one processor. It can mean the dynamic assignment of a program to one of two or more computers working in tandem, or it can involve multiple processors working on the same program at the same time (in parallel).

WHY MULTITASKING? A single processor cannot do all the work alone, and tasks need to be executed as soon as possible. A task is therefore divided among the processors in a system so that it completes in much less time.

Advantages of Multiprocessing: Increased throughput. Economy of scale. Increased reliability.

What is an Overhead? It is the length of time for which a processor is engaged in the transmission or reception of a message, during which it cannot perform other operations.

ENERGY AND TRANSITION OVERHEADS AWARE TASK SCHEDULING Many embedded and portable devices must run real-time applications, so designers use multicore processors in these devices. The energy usage of a device grows in direct proportion to the number of processors, and the real-time tasks themselves consume considerable runtime energy on a multicore processor. We focus on a technique that reduces the energy consumption of a device while also accounting for the overheads this causes. Benefits of lowering voltage: Lower overall power consumption. The system becomes less expensive. Portable devices run longer on battery. Less heat is dissipated, so the system can be more compact. And since the system runs cooler, the processor can be clocked faster.
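As a point of reference, the standard CMOS dynamic-power model (our addition; the slides do not state it explicitly) makes the benefit of lowering voltage concrete:

```latex
% Standard CMOS dynamic-power model (assumption: not from the slides themselves).
%   \alpha = activity factor, C = switched capacitance,
%   V_{dd} = supply voltage, f = clock frequency, t = execution time.
P_{\mathrm{dyn}} = \alpha \, C \, V_{dd}^{2} \, f,
\qquad
E \approx P_{\mathrm{dyn}} \cdot t
```

Because the achievable clock frequency scales roughly linearly with the supply voltage, lowering the voltage reduces the energy per operation roughly quadratically, which is exactly what DVS exploits.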

Conventional scheduling algorithms assume that each core can operate at different voltage levels, but they have not considered the effects of voltage transition overheads, which may defeat the benefit of task scheduling. Tasks can be periodic, where jobs arrive sequentially with a fixed period, or aperiodic, where the arrival time and period are unknown. Previously, many studies addressed the scheduling of periodic tasks on single-core or multicore processors.

DYNAMIC VOLTAGE SCALING A periodic task consists of many jobs that must be completed before a deadline and that arrive sequentially with a fixed period. To reason about whether the work will get done, we calculate the task utilization: the ratio of the time the task spends running to its period. If we allocate a task to a core, its utilization adds to the total workload on that core. Hence, by evaluating the utilization of all the tasks, we can predict whether there will be idle time on that core.
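In the usual real-time notation (the symbols are our addition, but the bound is the classic EDF schedulability result), this reads:

```latex
% Utilization of task i with worst-case execution time C_i and period T_i:
U_i = \frac{C_i}{T_i}
% A single core meets all deadlines under EDF iff the total utilization
% does not exceed one; idle time exists whenever the sum is strictly below it:
\sum_i U_i \le 1
```

Any slack, 1 - \sum_i U_i, is the idle time that DVS can convert into lower-voltage execution.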

If a job is predicted to finish well before its deadline, leaving some idle time, we can run the next task on that core at a lower voltage to achieve more power-saving execution. This is called the dynamic voltage scaling (DVS) technique. However, this switching of voltage between tasks can introduce a lot of overhead, which can erode the benefit of task scheduling. We therefore discuss in detail how to account for these overheads while saving energy at the same time.

BASIC PROCESS The transition overheads include the additional transition energy and the transition time, of which the transition energy affects the system the most. As processor feature sizes shrink (Moore's law), the number of available voltage levels increases as well, so the effects of transition overheads have become significant and can no longer be ignored. These cases are discussed only for periodic tasks; the concern is even worse for aperiodic tasks, because they are event driven and can be released at any time, making the arrival time and the status of the cores at that moment unpredictable (e.g., user interaction requests, exceptional interrupts). The only information at hand is then the task's workload and its deadline. We now explain the basic process of assigning the tasks.

We consider the Intel XScale DVS model, which provides multiple voltage levels. The supply voltages, power consumption, extended execution times, release times, deadlines, and worst-case execution times are given. We also know that the supply voltage is inversely proportional to the execution time. To schedule the tasks we compare three policies: Earliest Deadline First (EDF); EDF+DVS; and EDF+DVS considering voltage transition overheads.

Earliest Deadline First Policy Here the tasks are scheduled according to their deadlines, and the procedure does not involve DVS, which rules out voltage transition overheads in scheduling. This analysis is done only as a baseline for comparison with the next approaches. The task energy consumption was observed to be 21 J.
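A minimal sketch of plain EDF dispatching on one core, with no DVS (the task tuples and names are illustrative, not values from the slides; this version is non-preemptive for brevity):

```python
import heapq

def edf_schedule(tasks):
    """Dispatch tasks in earliest-deadline-first order on one core.

    tasks: list of (name, release_time, deadline, wcet) tuples.
    Returns a list of (name, start, finish) in execution order.
    """
    tasks = sorted(tasks, key=lambda t: t[1])  # sort by release time
    ready, schedule = [], []
    clock, i = 0, 0
    while i < len(tasks) or ready:
        # Admit every task released by the current time.
        while i < len(tasks) and tasks[i][1] <= clock:
            name, rel, dl, wcet = tasks[i]
            heapq.heappush(ready, (dl, name, wcet))
            i += 1
        if not ready:                              # idle until the next release
            clock = tasks[i][1]
            continue
        dl, name, wcet = heapq.heappop(ready)      # earliest deadline first
        start, clock = clock, clock + wcet         # run to completion
        schedule.append((name, start, clock))
    return schedule

# k2 has the earliest deadline, so it runs first:
print(edf_schedule([("k1", 0, 10, 3), ("k2", 0, 6, 2), ("k3", 1, 8, 2)]))
```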

Earliest Deadline First and Dynamic Voltage Scaling In this approach we combine DVS with EDF, but the transition overheads are not considered in scheduling. Each core is only interested in reducing the supply voltage of each individual task as far as possible; the other aspect to consider is the voltage transition, which requires one time unit.
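A minimal sketch of the per-task voltage choice under EDF+DVS, assuming an illustrative XScale-like level table (the numbers are placeholders, not the paper's values): each task simply takes the lowest voltage level that still meets its deadline, ignoring what the previous task used.

```python
# Illustrative (voltage, relative_speed) levels, highest first.
# Placeholder values; the real Intel XScale table differs.
LEVELS = [(1.8, 1.00), (1.5, 0.80), (1.2, 0.60), (0.9, 0.40)]

def lowest_feasible_voltage(wcet_at_vmax, slack_until_deadline):
    """Pick the lowest voltage whose stretched execution time still fits."""
    for volt, speed in sorted(LEVELS):             # try lowest voltage first
        if wcet_at_vmax / speed <= slack_until_deadline:
            return volt
    return LEVELS[0][0]                            # fall back to full voltage

print(lowest_feasible_voltage(wcet_at_vmax=4, slack_until_deadline=8))  # -> 1.2
```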

We can also observe that this approach incurs more transition overheads. The experimental task energy consumption was observed to be approximately 24 J (including the transition energy). The reason for the high transition energy in Fig. 2 is that on core 1, task k5 has a higher supply voltage than the previous task k3, so core 1 must consume significant transition energy to scale the supply voltage up to execute k5. The same situation occurs between tasks k1 and k3, and between tasks k2 and k4.

Earliest Deadline First and Dynamic Voltage Scaling With Transition Overhead Consideration

In this approach we consider the voltage transition overheads. The method selects for each task a low supply voltage that is close to the supply voltage of the previous task. In this way all the tasks can still meet their deadlines, and the voltage transition takes one time unit in this approach as well. The results show that both the total energy consumption and the overheads decrease; the energy sums to approximately 19 J. From these observations we can deduce that this approach suits real-time applications, reducing both energy consumption and transition overheads.
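A hedged sketch of the transition-aware variant: among the levels that still meet the deadline, pick the one closest to the previous task's voltage, trading a slightly higher supply voltage for a cheaper transition. The level table and the tie-breaking rule are our assumptions.

```python
# Same placeholder (voltage, relative_speed) table as in the previous sketch.
LEVELS = [(1.8, 1.00), (1.5, 0.80), (1.2, 0.60), (0.9, 0.40)]

def transition_aware_voltage(prev_volt, wcet_at_vmax, slack):
    """Among deadline-feasible levels, minimize distance to prev_volt."""
    feasible = [(v, s) for v, s in LEVELS if wcet_at_vmax / s <= slack]
    # Prefer the level nearest the previous voltage; break ties downward.
    return min(feasible, key=lambda vs: (abs(vs[0] - prev_volt), vs[0]))[0]

# The previous task ran at 1.5 V: stay at 1.5 V instead of dropping to 1.2 V,
# avoiding an expensive scale-up before the next high-voltage task.
print(transition_aware_voltage(prev_volt=1.5, wcet_at_vmax=4, slack=8))  # -> 1.5
```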

Heuristic Algorithm This algorithm considers the total energy saving and the total transition overheads simultaneously, and thanks to its reduced complexity it can be applied in a runtime environment. The heuristic solutions can be classified into two groups: Partitioned scheduling algorithm: requires each task to finish on a single, unique processor. Global scheduling algorithm: allows the execution of a task to be split across different processors. We use the partitioned scheduling algorithm for our purpose, to avoid unnecessary energy overheads.

This algorithm runs at every scheduling point to assign tasks to the cores; its major steps are picking a task from the queue and determining the core that will perform the execution at a selected voltage. We need a minimum supply voltage to finish the execution of a task, and as discussed above, a transition time overhead occurs whenever two consecutive tasks use different supply voltages. This calculated voltage is treated as a lower bound: the voltage of a task scheduled on a core must not be lower than this bound. The next condition is that the task with the highest required voltage should be executed first, so that it does not miss its deadline. We also maintain a common voltage level in order to avoid overheads while reducing energy at the same time; hence all the task supply voltages should stay around this value. If a real-time task is about to be processed, we raise the supply voltage to get the task done, which reduces gaps, although here we sacrifice some energy consumption.

Thus we summarize the heuristic algorithm as follows; the final pseudocode appeared as a figure in the original slides.
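The sketch below is our reconstruction from the prose above, not the authors' pseudocode: the queue ordering (most voltage-hungry task first), the lower-bound rule, and the pull toward a common level are as described, while the level table, data layout, and tie-breaking are assumptions.

```python
# Placeholder (voltage, relative_speed) table, as in the sketches above.
LEVELS = [(1.8, 1.00), (1.5, 0.80), (1.2, 0.60), (0.9, 0.40)]

def min_voltage_bound(wcet_at_vmax, slack):
    """Lower bound: the lowest level that can still meet the deadline."""
    for volt, speed in sorted(LEVELS):
        if wcet_at_vmax / speed <= slack:
            return volt
    return LEVELS[0][0]

def heuristic_step(queue, cores, common_volt):
    """One scheduling point: dispatch the most voltage-hungry task first.

    queue:  list of dicts with 'name', 'wcet', 'slack'.
    cores:  dict core_id -> voltage currently set on that core.
    common_volt: the common level all task voltages should stay near.
    Returns (task_name, core_id, chosen_voltage).
    """
    # Rule 1: the task needing the highest voltage runs first,
    # so that it does not miss its deadline.
    task = max(queue, key=lambda t: min_voltage_bound(t["wcet"], t["slack"]))
    queue.remove(task)
    bound = min_voltage_bound(task["wcet"], task["slack"])
    # Rule 2: never go below the bound. Rule 3: stay near the common level
    # to avoid transition overheads between consecutive tasks.
    candidates = [v for v, _ in LEVELS if v >= bound]
    volt = min(candidates, key=lambda v: abs(v - common_volt))
    # Place on the core whose current voltage is closest (cheapest transition).
    core = min(cores, key=lambda c: abs(cores[c] - volt))
    cores[core] = volt
    return task["name"], core, volt

cores = {0: 1.2, 1: 1.8}
queue = [{"name": "k1", "wcet": 4, "slack": 8},
         {"name": "k2", "wcet": 2, "slack": 10}]
print(heuristic_step(queue, cores, common_volt=1.2))  # -> ('k1', 0, 1.2)
```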

The experimental results comprise all the task and transition energies, and we notice that as the number of processors increases, the energy consumption of the heuristic algorithm improves steadily.

Limitations We could have used the adaptive voltage scaling (AVS) approach, which is beneficial compared to DVS: it is a closed-loop dynamic power minimization technique that reduces power based on the actual operating conditions of the chip, i.e., the power consumption is continuously adjusted on chip. As discussed previously, the global scheduling algorithm balances the cores by migrating tasks from highly loaded cores to lightly loaded cores, but we stick to the partitioned scheduling algorithm, which avoids unnecessary energy overheads; those migration overheads can also be managed. The algorithm is restricted to homogeneous multicore systems, though it was later extended to heterogeneous systems as well.

GPU offload latency overhead in CPU-GPU communication. What is a GPU? - Graphics Processing Unit (visual processing unit) - popularized by Nvidia (1999) - Purpose: to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display - used in embedded systems, mobile phones, personal computers, etc. - Computation kernels are offloaded from the CPU to the GPU to improve runtime and throughput.

Importance of GPUs Advantages: - GPUs are widely used for general-purpose computation due to their excellent performance on highly parallel, throughput-oriented applications. - Three of the world's largest supercomputers use GPUs, or GPU-like Intel MICs, in their clusters. Disadvantages: - Latency overheads. - Overheads in offloading from CPU to GPU.

Execution time of CPU and GPU

A primary reason GPUs are not always faster than CPUs is that kernel launch and data transfer introduce delays and variability that are added regardless of the size of the kernel. These fixed costs dominate small kernels, while for larger workloads they are amortized by the computation (though the data-transfer time itself grows with the data size). In existing systems, the memory space is logically shared between the CPU and the GPU but physically separate, which causes the overhead of copying between the spaces every time data is transferred.
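A minimal sketch of measuring this fixed offload cost, assuming CuPy on a CUDA-capable machine (the library choice and array sizes are ours; the presentation names no toolkit). The same elementwise computation is timed on the CPU and through a full copy-in/compute/copy-out round trip on the GPU:

```python
import time
import numpy as np
import cupy as cp  # assumes a CUDA-capable machine with CuPy installed

def gpu_roundtrip_seconds(x_host):
    """Host->device copy, kernel launch, device->host copy: the full offload cost."""
    t0 = time.perf_counter()
    x_dev = cp.asarray(x_host)            # host -> device transfer
    y_dev = cp.sqrt(x_dev) * 2.0 + 1.0    # elementwise kernel launches
    y = cp.asnumpy(y_dev)                 # device -> host transfer
    cp.cuda.Stream.null.synchronize()     # make sure all GPU work has finished
    return time.perf_counter() - t0

def cpu_seconds(x_host):
    t0 = time.perf_counter()
    _ = np.sqrt(x_host) * 2.0 + 1.0
    return time.perf_counter() - t0

for n in (1_000, 10_000_000):             # tiny vs large workload
    x = np.random.rand(n)
    gpu_roundtrip_seconds(x)              # warm-up: exclude one-time CUDA init
    print(f"n={n:>10}: cpu={cpu_seconds(x):.6f}s gpu={gpu_roundtrip_seconds(x):.6f}s")
```

For the tiny array the GPU round trip is typically dominated by launch and copy latency, while for the large array the parallel kernel amortizes those fixed costs, matching the claim above.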

Memory Copy latency between CPU and GPU

Causes of latency: 1. The kernel launch command takes time to execute. 2. There is a delay between when data physically arrives at its destination and when it is ready to use. 3. API and driver overheads introduce delays with each call made. Removing latency overheads: 1. Waiting for an entire data block to be ready is currently the only safe way for code to execute, but this often results in a performance slowdown. 2. Full/empty (F/E) bits applied to GPU memory allow finer-grained readiness tracking.

MPI SEND and RECEIVE Overhead Latency and bandwidth are two important requirements in parallel processing. Scaling efficiency can be improved by the ability to overlap communication with computation (the C-C ratio). This is done using non-blocking MPI.

Calculating overhead time The post-work wait loop uses MPI non-blocking send and receive: MPI_Isend() and MPI_Irecv(). Each iteration of the post-work loop is overlapped with the message transfer. When the work interval is longer than the message transfer time: 1) the message transfer completes first; 2) loop time = the time to perform the work + the time required by the host processor to complete the transfer; 3) overhead time = the overlapped loop time minus the time to perform the same amount of work without a message transfer.
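A hedged mpi4py rendering of the post-work wait measurement described above, for two ranks (the buffer size, iteration count, and work function are placeholders):

```python
# Run with: mpiexec -n 2 python overhead.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1_000_000)               # placeholder message size

def work():
    """Stand-in compute kernel overlapped with the transfer."""
    return np.sum(np.sqrt(np.arange(50_000.0)))

# Baseline: the same amount of work with no message in flight.
t0 = MPI.Wtime()
for _ in range(10):
    work()
base = MPI.Wtime() - t0

if rank == 0:
    req = comm.Isend([buf, MPI.DOUBLE], dest=1, tag=0)    # non-blocking send
else:
    req = comm.Irecv([buf, MPI.DOUBLE], source=0, tag=0)  # non-blocking receive

t0 = MPI.Wtime()
for _ in range(10):                     # post-work loop, overlapped with transfer
    work()
req.Wait()                              # complete the transfer
loop = MPI.Wtime() - t0

# Host-side overhead: extra time beyond the pure compute.
print(f"rank {rank}: loop={loop:.6f}s work={base:.6f}s overhead={loop - base:.6f}s")
```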

CONCLUSION The basic aim of any system is to reduce power and energy consumption while enhancing performance. Many algorithms have been developed around these criteria. Most of them were designed to overcome certain limitations rather than to generalize across all requirements. The approaches we have discussed are likewise restricted to certain conditions, and overheads grow as the number of processors on the motherboard increases. Thus, we can develop algorithms targeted at specific problems.

Thank you, professor, for giving us this opportunity to expand our knowledge in this particular subject. We are extremely sorry for the late submission.

THANK YOU