Dynamic Fine-Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Overview
- Introduction
- Motivation
- Scheduling Approaches
- GRAMPS Scheduling Method
- Evaluation
- Conclusion

Introduction
This presentation covers:
- The General Runtime Architecture for Multicore Parallel Systems (GRAMPS) programming model
- How a GRAMPS scheduler dynamically and effectively schedules pipelined computations
- Its advantages over existing scheduling approaches

Motivation
- Static scheduling techniques lead to high load imbalance when the algorithm is irregular or unpredictable.
- Task stealing cannot guarantee memory footprint bounds and cannot handle complex control flow efficiently.
- Breadth-first techniques have high memory footprint and bandwidth requirements.

Previous Scheduling Techniques
Static:
- The application is divided into a group of stages.
- Analysis of dependencies leads to better data locality and task mapping.
- Offline scheduling eliminates overheads at runtime.
- Works well when all stage nodes have uniform work.

Task Stealing:
- Treats tasks as independent work units that can run concurrently (a minimal per-worker task queue is sketched below).
- Imposes low overheads on task creation and execution.
- Data dependencies and priorities are not taken into consideration while scheduling, leading to poor data locality.
- Memory footprint bounds cannot be guaranteed and grow easily with problem size.
- No concrete guidelines on implementation.
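To make the task-stealing model concrete, here is a minimal sketch of the kind of per-worker queue such runtimes typically build on. The names are illustrative and not taken from the paper, and a production runtime would use a lock-free deque rather than a mutex.

```cpp
// Illustrative sketch of a task-stealing work queue (not the paper's code):
// the owning worker pushes and pops tasks LIFO at the head, while idle
// workers steal from the tail, which tends to take older, coarser tasks.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;

class TaskDeque {                           // hypothetical helper class
    std::deque<Task> tasks_;
    std::mutex m_;                          // real runtimes use lock-free deques
public:
    void push(Task t) {                     // owner: new tasks go to the head
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_front(std::move(t));
    }
    std::optional<Task> pop() {             // owner: LIFO pop from the head
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
    std::optional<Task> steal() {           // thief: take from the tail
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
};
```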

Breadth First:
- The application is divided into a group of data-parallel stages.
- Each stage runs a number of instances in parallel, but only one stage can execute at a time (see the sketch below).
- Needs barrier synchronization after each stage.
- Large execution contexts lead to spillage of intermediate outputs to main memory.
- Poor performance when there are many stages with limited parallelism per stage.
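The breadth-first policy can be pictured as the loop below. This is a simplified sketch under assumed types (Stage, run_instance), not the authors' implementation.

```cpp
// Simplified sketch of breadth-first scheduling (not the paper's code):
// each data-parallel stage runs all of its instances to completion before
// the next stage starts, so every stage boundary acts as a barrier and all
// intermediate outputs must be buffered between stages.
#include <functional>
#include <vector>

struct Stage {                                   // hypothetical stage descriptor
    std::function<void(int)> run_instance;       // work for one parallel instance
    int num_instances;
};

void run_breadth_first(const std::vector<Stage>& pipeline) {
    for (const Stage& s : pipeline) {            // only one stage is active at a time
        #pragma omp parallel for schedule(dynamic)   // instances load-balanced within the stage
        for (int i = 0; i < s.num_instances; ++i)
            s.run_instance(i);
        // The implicit barrier at the end of the parallel loop means the next
        // stage cannot start until every instance of this stage has finished.
    }
}
```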

Fig. 1 (left): Stages of a fixed-pipeline GPU.
Fig. 2 (right): Unified-pipeline GPU architecture.

Overview of GRAMPS
- A programming model for future GPU architectures.
- Handles complex rendering applications, such as ray tracing, more efficiently.
- Aims to optimize the implementation of complex algorithms, offer a broad application scope, and effectively abstract the underlying architecture to provide portability.

GRAMPS Design
- An application is defined as a set of computational stages.
- The stages execute in parallel and communicate asynchronously via queues.
- Each stage is implemented on a stream multiprocessor (SMP).
- Assumes the number of stages in the computation is much smaller than the number of processors in the system.

Overview of the Constituents of a GRAMPS Design
- Execution graph: a custom-designed pipeline for the application, specified by the program.
- Stage: a part of the pipeline that performs a computation; it can be a shader, a thread, or a fixed-function unit.
- Queues: used between stages for communication. The minimum unit of data that can be communicated is called a work packet. Queues use a reserve-commit protocol (sketched below).
- Buffers: bound to stages to support communication.
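A minimal sketch of what a reserve-commit packet queue could look like, assuming a single producer and a single consumer with one outstanding reservation each. The names and capacity handling here are illustrative, not GRAMPS's actual API.

```cpp
// Illustrative reserve-commit packet queue (not GRAMPS's real implementation).
// A producer reserves a packet slot, fills it in place, and then commits it,
// at which point the consumer can acquire and later release it. The fixed
// capacity is what lets the runtime apply backpressure to producers.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

struct WorkPacket { std::vector<float> items; };  // illustrative packet payload

class PacketQueue {                               // single producer, single consumer
    std::vector<WorkPacket> slots_;
    size_t head_ = 0, committed_ = 0, reserved_ = 0;
    std::mutex m_;
    std::condition_variable cv_;
public:
    explicit PacketQueue(size_t capacity) : slots_(capacity) {}

    WorkPacket* reserve() {                       // producer: claim space for a packet
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return reserved_ - head_ < slots_.size(); });  // backpressure
        return &slots_[reserved_++ % slots_.size()];
    }
    void commit() {                               // producer: make the packet visible
        std::lock_guard<std::mutex> lk(m_);
        ++committed_;
        cv_.notify_all();
    }
    WorkPacket* acquire() {                       // consumer: take the oldest committed packet
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return head_ < committed_; });
        return &slots_[head_ % slots_.size()];
    }
    void release() {                              // consumer: free the slot for reuse
        std::lock_guard<std::mutex> lk(m_);
        ++head_;
        cv_.notify_all();
    }
};
```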

Advantages of GRAMPS over Other Scheduling Techniques
- Because all stages run concurrently, limited parallelism per stage does not hurt performance.
- Concrete guidelines for implementing queues, and techniques such as backpressure, ensure strict memory footprint bounds (see the sketch below).
- Takes producer-consumer relationships into consideration while scheduling to exploit locality.
- Exploits both data parallelism and functional parallelism to improve performance.
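One way to see how backpressure and producer-consumer awareness fit together is the stage-selection sketch below. The priority rule (prefer the furthest-downstream runnable stage, skip stages whose output queues are full) is a common policy for this kind of scheduler; the bookkeeping types are illustrative rather than taken from the paper.

```cpp
// Illustrative stage-selection policy for a GRAMPS-style worker (not the
// paper's exact algorithm): a stage is runnable only while its output queue
// is under its bound (backpressure), and among runnable stages the worker
// prefers the one furthest downstream, draining consumers before producers.
#include <cstddef>
#include <vector>

struct StageState {                        // hypothetical per-stage bookkeeping
    size_t input_occupancy  = 0;           // committed packets waiting to be consumed
    size_t output_occupancy = 0;           // packets sitting in the output queue
    size_t output_bound     = 0;           // footprint bound for the output queue (0 = none)

    bool has_work() const { return input_occupancy > 0; }
    bool backpressured() const {
        return output_bound != 0 && output_occupancy >= output_bound;
    }
};

// Stages are assumed to be ordered from producers (index 0) to consumers.
int pick_stage(const std::vector<StageState>& stages) {
    for (int i = static_cast<int>(stages.size()) - 1; i >= 0; --i)
        if (stages[i].has_work() && !stages[i].backpressured())
            return i;                      // run the furthest-downstream runnable stage
    return -1;                             // nothing runnable: steal or wait
}
```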

Other Scheduling Approaches: Implementation
- Task Stealing: a single LIFO task queue rather than per-stage queues, with unbounded data queues. When a task is stolen, it is taken from the tail of the queue.
- Breadth First: executes one stage at a time; a stage can run once all of its producers are done. Achieves load balancing with a work-stealing deque that does not overflow.
- Static: computes a partitioning of the graph that minimizes the communication-to-computation ratio while balancing the computation within each partition. The Static scheduler performs no load balancing; each worker has a producer-consumer queue, and a worker that runs out of work waits until it receives more.

Evaluation
System setup:
- 2-socket system with hex-core 2.93 GHz Intel Xeon X5670 processors.
- 2-way simultaneous multithreading: 12 cores, 24 hardware threads in total.
- Memory: 256 KB L2 cache per core, 12 MB L3 cache per processor, 48 GB DDR3 with 21 GB/s peak bandwidth.
- Sockets communicate through a 6.4 GB/s QPI interconnect.

GRAMPS Scheduler Performance
- Speedup grows close to linearly up to 12 threads.
- The most effective scheduler overall, with minimal time spent without work across all applications.
- Both the buffer manager and the scheduler impose very low overheads.

Task Stealing Scheduler Performance
- Performs best when the application has a simple graph.
- fm, tde, and fft2 have complex graphs and therefore perform worse than under GRAMPS.
- These applications suffer long stall times because the simple LIFO queue has trouble reordering incoming packets.

Breadth First Scheduler Performance
- The only applications that perform similarly to the GRAMPS scheduler are histogram, lr, pca, srad, and rg, which contain no pipeline parallelism.
- The footprints created are fairly large for all applications, which in turn puts pressure on the memory system and the buffer manager.

Static Scheduler Performance
- Performs worst when the application is irregular; mergesort suffers most because of its highly irregular packet rates.
- All applications' speedups drop off beyond 12 threads: SMT is used on some cores, so threads run at different speeds, and the poor load balancing cannot compensate.
- Optimized locality gives lower per-application execution time than GRAMPS, but poor load balancing negates this advantage.

Buffer Management Results
- The best approach is the packet-stealing buffer manager: small overheads as well as good locality (sketched below).
- The per-queue approach has lower overheads than dynamic memory, but they can still be much larger than with packet stealing, and locality is also worse.
- The dynamic-memory scheme is the worst of the three because frequent malloc/free calls bottleneck at the memory allocator.
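A rough sketch of the packet-stealing idea, under the assumption of fixed-size packets preallocated into per-thread free lists. The class names are illustrative, and the real buffer manager in the paper is more involved.

```cpp
// Illustrative packet-stealing buffer manager (not the paper's code): packets
// are preallocated once into per-thread pools, a thread allocates from its own
// pool first and steals a free packet from another pool only when its own is
// empty, so the hot path never touches malloc/free.
#include <cstddef>
#include <mutex>
#include <vector>

struct WorkPacket { unsigned char bytes[256]; };  // illustrative fixed-size packet

class PacketPool {                                // one free list per worker thread
    std::vector<WorkPacket*> free_;
    std::mutex m_;
public:
    void put(WorkPacket* p) {
        std::lock_guard<std::mutex> g(m_);
        free_.push_back(p);
    }
    WorkPacket* take() {
        std::lock_guard<std::mutex> g(m_);
        if (free_.empty()) return nullptr;
        WorkPacket* p = free_.back();
        free_.pop_back();
        return p;
    }
};

class PacketStealingAllocator {                   // hypothetical name
    std::vector<PacketPool> pools_;
    std::vector<WorkPacket> storage_;             // all packets preallocated up front
public:
    PacketStealingAllocator(size_t threads, size_t packets_per_thread)
        : pools_(threads), storage_(threads * packets_per_thread) {
        for (size_t t = 0; t < threads; ++t)
            for (size_t i = 0; i < packets_per_thread; ++i)
                pools_[t].put(&storage_[t * packets_per_thread + i]);
    }
    WorkPacket* alloc(size_t self) {
        if (WorkPacket* p = pools_[self].take()) return p;   // fast path: local pool
        for (size_t t = 0; t < pools_.size(); ++t)           // slow path: steal a packet
            if (t != self)
                if (WorkPacket* p = pools_[t].take()) return p;
        return nullptr;                                       // caller must back off / wait
    }
    void dealloc(size_t self, WorkPacket* p) {
        pools_[self].put(p);                                  // freed packets return locally
    }
};
```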

Conclusion
- The GRAMPS scheduler outperformed all the other schedulers across the variety of applications and scaled well with an increasing number of threads.
- It was able to load balance dynamically and efficiently when the programs were irregular.
- GRAMPS preserved locality and bounded memory footprints even when the application contained a complex graph.

Questions?