The Legion Mapping Interface


The Legion Mapping Interface
Mike Bauer

Philosophy
- Decouple specification from mapping
  - Performance portability
- Expose all mapping (performance) decisions to the Legion user
  - Guessing is bad! Don't want to fight Legion for performance
- Propagate mapping control up through the layers of abstraction (App, Autotuner, DSL, Legion, Machine, Hardware)
- Dynamic mapping
  - React to machine changes
  - React to app changes

Machine Model
- The machine is a graph of processors and memories, exposed through the Machine object interface
- Nodes contain attributes
  - Processors: kind, speed
  - Memories: kind, capacity, hardness (e.g. DRAM, NUMA, zero-copy, GPU framebuffer, NVRAM, disk)
- Edges describe relationships
  - Processor-memory affinity
  - Memory-memory affinity
  - Bandwidth, latency

Mapper Model
- Create a separate mapper for each processor
  - Mappers can specialize themselves
  - Avoids bottlenecks on mapping queries
- Support an arbitrary number of mappers per application
  - Compose applications and libraries with different mappers
- Initialized at start-up of the Legion runtime via the set_registration_callback function
  - Add mappers with the add_mapper function
- The default mapper is always given MapperID 0
  - Can be replaced with the replace_default_mapper function
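The registration flow above can be sketched as follows. This is a simplified stand-in, not the real Legion API: the names add_mapper and replace_default_mapper come from the slide, but the Runtime and Mapper types here are minimal illustrations (the real callback also receives the machine model and the set of local processors).

```cpp
#include <cassert>
#include <map>
#include <memory>

// Minimal stand-ins for the registration flow described above.
struct Mapper { virtual ~Mapper() = default; };
struct MyMapper : Mapper {};

using MapperID = int;

struct Runtime {
  // In a real runtime there is one mapper table per processor; flattened here.
  std::map<MapperID, std::unique_ptr<Mapper>> mappers;
  void add_mapper(MapperID id, std::unique_ptr<Mapper> m) {
    mappers[id] = std::move(m);
  }
  void replace_default_mapper(std::unique_ptr<Mapper> m) {
    mappers[0] = std::move(m); // MapperID 0 is the default mapper
  }
};

// Callback invoked once at runtime start-up, analogous to the function
// passed to set_registration_callback.
void registration_callback(Runtime &rt) {
  rt.add_mapper(1, std::make_unique<MyMapper>());          // custom mapper
  rt.replace_default_mapper(std::make_unique<MyMapper>()); // swap out ID 0
}
```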

Mapper API
- The mapping API is a pure virtual interface
  - Easy to extend existing mappers
- Methods are invoked by the runtime as queries
- At most one invocation per mapper at a time
  - No need for locks
- Mappers can be stateful
  - Memoize information
  - State is distributed

class Mapper {
public:
  virtual void select_task_options(Task *task) = 0;
  virtual void pre_map_task(Task *task) = 0;
  virtual void map_task(Task *task) = 0;
  virtual void post_map_task(Task *task) = 0;
};
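Because the runtime guarantees at most one invocation per mapper at a time, a mapper can keep memoized state without any locking. A sketch under that assumption (Task and the interface are simplified stand-ins for illustration):

```cpp
#include <cassert>
#include <map>
#include <string>

struct Task { std::string name; };

// Simplified version of the pure-virtual mapper interface from the slide.
class Mapper {
public:
  virtual ~Mapper() = default;
  virtual void select_task_options(Task *task) = 0;
  virtual void pre_map_task(Task *task) = 0;
  virtual void map_task(Task *task) = 0;
  virtual void post_map_task(Task *task) = 0;
};

// A stateful mapper: memoizes how often each task name has been queried.
// No locks needed, since the runtime serializes calls per mapper.
class CountingMapper : public Mapper {
public:
  void select_task_options(Task *task) override { ++calls[task->name]; }
  void pre_map_task(Task *) override {}
  void map_task(Task *) override {}
  void post_map_task(Task *) override {}
  int times_seen(const std::string &name) const {
    auto it = calls.find(name);
    return it == calls.end() ? 0 : it->second;
  }
private:
  std::map<std::string, int> calls; // memoized per-task-name state
};
```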

Default Mapper
- Legion comes with a default mapper
- Implement custom mappers that inherit from the default mapper
  - Only need to customize specific mapper calls
  - Leverages the open/closed principle of software engineering
- Lends itself to a natural performance-tuning loop: profile application execution, determine performance bottlenecks, continue customizing the mapper; repeat for each application + architecture
  - Easy to autotune

The Lifetime of a Task
- Mapper calls for tasks follow the task pipeline: Select Options and Pre-Map (always done locally), then Map, Select Variant, and Post-Map (can be done locally or remotely)
- Not all calls are handled by the same mapper object
- Tasks can map both locally and remotely
  - Guaranteed to be handled by mappers of the same ID

Mapping Locally vs. Remotely
- Three important processors are associated with a task:
  - Origin processor (orig_proc): where the task was launched
  - Current processor (current_proc): owner of the task
  - Target processor (target_proc): current mapping target
- Tasks can be mapped locally or remotely
  - Locally: current_proc == orig_proc; mapping queries are handled on the origin processor, then meta-data moves with the launch to the target processor
  - Remotely: current_proc == target_proc (!= orig_proc); mapping queries are handled on the target processor
- Mapping meta-data >> launch meta-data, so remote mapping -> more parallelism

Select Options
virtual void select_task_options(Task *task)
- Currently decorates fields of the Task object
  - Planned: a structure describing the options
- Assign the following fields:
  - target_proc: pick the first owner processor
  - inline_task: inline using the parent task's physical regions
  - spawn_task: make eligible for stealing
  - map_locally: map the task locally or remotely
  - profile_task: capture profiling information
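A minimal sketch of decorating these fields. The field names come from the slide; the Task struct, Processor type, and the specific policy chosen are illustrative assumptions, not Legion's declarations.

```cpp
#include <cassert>

using Processor = int;

// Stand-in for the decorated Task object from the slide.
struct Task {
  Processor target_proc = 0;
  bool inline_task = false;
  bool spawn_task = false;
  bool map_locally = true;
  bool profile_task = false;
};

// A mapper's select_task_options fills in the option fields in place.
void select_task_options(Task *task) {
  task->target_proc = 0;     // pick the first owner processor
  task->spawn_task = true;   // make eligible for stealing
  task->map_locally = false; // map remotely for more parallelism
  task->profile_task = true; // capture profiling information
}
```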

Mapping (Part 1)
- Tasks always have an owner processor
  - The owner can be changed until the task is mapped
  - Once a task is mapped it will run on its owner processor
- Mapping a task consists of three decisions:
  - Fixing the owner processor
  - Selecting memories for physical instances of each region
  - Determining layout constraints for physical instances

Mapping (Part 2)
virtual bool map_task(Task *task)
- Choose a memory ordering for each region requirement
  - Return true to be notified of the mapping result
- The Task has a vector of application-specified regions
  - Represented by region requirements, called regions
- Legion provides the list of current memories with data
  - Called current_instances
  - A boolean indicates whether a memory contains valid data for all fields
- The mapper ranks target memories in target_ranking
  - The runtime tries to find or create an instance in each memory
  - Will issue necessary copies and synchronization
- Choose the layout by selecting blocking_factor

Mapping (Part 3)
- Legion automatically computes copies based on mapping decisions
- Sometimes there might be multiple valid sources (e.g. DRAM, GASNet memory, framebuffer)
  - Never guess! (Legion knows what to do if there is only one source)
virtual void rank_copy_sources( )
- Inputs: the set of possible source memories and the memory containing the destination physical instance
- Populate a vector ranking the source memories by preference
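One plausible policy is to rank sources by bandwidth to the destination, using the machine model's memory-memory affinity. The sketch below assumes a toy bandwidth function; Memory and the signature are simplified stand-ins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <set>
#include <vector>

using Memory = int;

// Hypothetical bandwidth between a source memory and the destination;
// higher is better. A real mapper would query the machine model instead.
int bandwidth_to_dest(Memory src, Memory dst) {
  return 10 - std::abs(src - dst); // toy affinity: "closer" ids are faster
}

// Rank the possible sources best-first by bandwidth to the destination.
void rank_copy_sources(const std::set<Memory> &sources, Memory dst,
                       std::vector<Memory> &ranking) {
  ranking.assign(sources.begin(), sources.end());
  std::sort(ranking.begin(), ranking.end(), [dst](Memory a, Memory b) {
    return bandwidth_to_dest(a, dst) > bandwidth_to_dest(b, dst);
  });
}
```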

Mapping (Future)
- New mapping API in progress
- Switch from memory-centric to physical-instance-centric
  - Be field aware
  - Support more data layout formats
- map_task will not change much
  - Legion will provide information about physical instances: layout, field sizes, which fields are valid
  - Mappers provide a ranking of physical instances
- Physical instances specified as a set of constraints
  - Order of index space dimensions + field interleaving
  - Constraints on specific pointers and offsets

Selecting Variants
- The task variant is selected based on mapping decisions
- Legion examines all constraints and picks a variant
  - Processor kind, physical instance memories and layouts
- If there are multiple valid variants, query the mapper:
virtual void select_task_variant(Task *task)
- Might require many variants. Is there a better way?
  - Yes! Generator functions (using meta-programming)

Dealing with Failed Mappings
- Mappings can fail for many reasons:
  - Resource utilization
  - Memories not visible from the target processor
  - No registered task variant satisfying the constraints
virtual void notify_mapping_failed( )
- Region requirements are annotated by the mapping_failed field
- Failed tasks are automatically ready to map again
  - Mappers can try mapping them again later
  - Watch out for repeated failures (looks like livelock)
- Future work: establish conditions for re-mapping
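To avoid the livelock-like behavior of retrying a failing mapping forever, a mapper can track failure counts and change its decisions after a threshold. The TaskID type, retry limit, and bookkeeping below are illustrative assumptions, not Legion API.

```cpp
#include <cassert>
#include <map>

using TaskID = int;

class FailureTracker {
public:
  // Record a failure. Returns false once the task has failed more than
  // max_retries times, signaling the mapper to change its mapping
  // decisions (e.g. pick different memories) instead of retrying blindly.
  bool notify_mapping_failed(TaskID id, int max_retries = 3) {
    return ++failures_[id] <= max_retries;
  }
  int failures(TaskID id) const {
    auto it = failures_.find(id);
    return it == failures_.end() ? 0 : it->second;
  }
private:
  std::map<TaskID, int> failures_; // per-task failure counts
};
```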

Pre-Map (Part 1)
virtual bool pre_map_task(Task *task)
- Can early-map region requirements to physical instances
  - Performed on the origin processor before the task can be moved
  - Return true to be notified of the pre-mapping result
- Handles some special cases
  - e.g. read-write coherence on an index space task region

Pre-Map (Part 2)
- The runtime performs close operations as part of pre-mapping a task
  - Handles translation between different views
- Two options: a concrete instance or a composite instance
  - Composite instances memoize intersection tests to amortize cost
virtual bool rank_copy_targets( )
- Return true for a composite instance

Post-Map (In Progress)
- Create optional checkpoints of logical regions
- Generate physical instances in hardened memories
  - Copies are automatically issued by the Legion runtime
- Control which logical regions and fields are saved

Mapping Other Operations
- Legion maps many operations other than tasks
  - Inline mappings
  - Explicit region-to-region copies
- Similar mapping calls, all on the origin processor:
virtual void map_copy(Copy *copy)
virtual void map_inline(Inline *inline_op)
- Map region requirements the same way as for tasks

Managing Deferred Execution
- Legion is an out-of-order task processor
- How far do we run ahead (into the future)?
  - Machine and application dependent -> a mapper decision
- Diagram: analysis for iteration i+2 runs on a utility core while tasks from iteration i execute on AMD Interlagos cores and an NVIDIA Kepler K20

Managing Deferred Execution (2)
- Two components of managing run-ahead:
  - How many sub-tasks outstanding per task?
  - When should tasks begin the mapping process?
- Control max outstanding sub-tasks with a window size
virtual void configure_context(Task *task)
  - Set max_window_size (default 1024)
  - Can be unbounded (any negative value)
  - Trades off parallelism discovery against memory usage
- Control max outstanding sub-tasks with frames
  - Call issue_frame at the start of each iteration
  - Set max_outstanding_frames
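A sketch of the window-size control. The field names max_window_size and max_outstanding_frames come from the slide; the Context/Task structs and the memory-hungry heuristic are illustrative assumptions.

```cpp
#include <cassert>

// Stand-in for the per-task execution context being configured.
struct Context {
  int max_window_size = 1024;      // default from the slide
  int max_outstanding_frames = -1; // only used if frames are issued
};

struct Task { bool memory_hungry = false; };

// Bound run-ahead tightly for memory-hungry tasks; leave it unbounded
// (any negative value) otherwise to maximize parallelism discovery.
void configure_context(const Task &task, Context &ctx) {
  if (task.memory_hungry)
    ctx.max_window_size = 64;  // limit outstanding sub-tasks
  else
    ctx.max_window_size = -1;  // unbounded window
}
```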

Managing Deferred Execution (3)
- Legion maintains a ready queue for each mapper
  - Contains tasks that are ready to map
- Mappers decide when to start mapping tasks
virtual void select_tasks_to_schedule(list<Task*>)
- Open question: when to stop invocation?
  - Right now: when enough tasks are outstanding (-hl:sched)
- Can perform one of three operations for each task:
  - Start mapping (set the schedule field of the Task* to true)
  - Change current_proc to a new processor to send it remotely
  - Do nothing: important for load balancing
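The three per-task choices can be sketched as below. The Task fields and the policy (map up to two small tasks here, push big ones to another processor, leave the rest) are illustrative assumptions.

```cpp
#include <cassert>
#include <list>

using Processor = int;

struct Task {
  bool schedule = false;      // set true to start mapping
  Processor current_proc = 0; // change to send the task remotely
  bool big = false;           // hypothetical size hint
};

void select_tasks_to_schedule(std::list<Task *> &ready, Processor remote) {
  int started = 0;
  for (Task *t : ready) {
    if (t->big) {
      t->current_proc = remote; // operation 2: push to another node
    } else if (started < 2) {
      t->schedule = true;       // operation 1: start mapping here
      ++started;
    }
    // operation 3: do nothing, keep the task for later (load balancing)
  }
}
```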

Inter-Node Load Balancing
- Legion supports both push and pull load balancing
  - Push tasks to other nodes
  - Pull work from other nodes (e.g. stealing)
- Two forms of push:
  - Change current_proc in select_tasks_to_schedule
virtual void slice_domain( )
  - Decompose an index space into subsets and distribute them
  - Recursively slice subsets, specifying a target processor for each
- Index space tasks: slice into subsets of points
  - Look at is_index_space to determine whether it is a slice or a single task
  - index_domain gives the bounds of the slice
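A sketch of slicing an index space of task points across processors, in the spirit of slice_domain. The 1-D domain, the Slice struct, and the contiguous-block assignment are illustrative assumptions; real slices can also be recursively re-sliced on the receiving node.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Processor = int;

struct Slice {
  int lo, hi;       // inclusive bounds of this subset of points
  Processor target; // processor the slice is sent to
};

// Split the inclusive point range [lo, hi] into one contiguous slice
// per processor, balancing the number of points per slice.
std::vector<Slice> slice_domain(int lo, int hi,
                                const std::vector<Processor> &procs) {
  std::vector<Slice> slices;
  int n = hi - lo + 1;
  int per = (n + (int)procs.size() - 1) / (int)procs.size();
  for (size_t i = 0; i < procs.size() && lo <= hi; ++i) {
    int end = std::min(hi, lo + per - 1);
    slices.push_back({lo, end, procs[i]});
    lo = end + 1;
  }
  return slices;
}
```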

Stealing
- Legion also supports pull load balancing via stealing
- Stealing is totally under the control of mappers
  - Mappers can only steal tasks from mappers of the same kind
- Stealing in Legion is two-phase:
  - Send requests: virtual void target_task_steal( )
    - Choose targets for stealing (no guessing by Legion)
  - Approve requests: virtual void permit_task_steal( )
- Tasks can only be stolen from ready queues
  - Cannot steal already-mapped tasks

Intra-Node Load Balancing
- Stealing is inefficient within a node
- Support mapping tasks onto multiple processors in a node
  - Can assign additional_procs in map_task
  - Must be of the same kind as target_proc and on the same node
  - Must be able to access all physical instances
- Legion automatically checks these properties
  - Will ignore bad processors and issue a warning
- Creates internal queues for running these tasks (e.g. per-NUMA-domain queues and an all-CPU queue)
  - Processors pull the highest-priority task from all their queues

Program Introspection
- Mappers can (immutably) introspect runtime data structures:
  - Region tree forest: index space trees, logical region trees
  - Task variant collections
  - Semantic tags describing tasks and regions
  - Dynamic dependence graph (computed by the runtime)
- Mappers can profile task execution
  - Set profile_task to true in select_task_options
virtual void notify_profiling_info(Task *task)
  - Currently profiles basic properties (e.g. execution time)
  - What else do we need?

Other Mapping Features
- Tunable variables
  - Abstract variables that depend on the machine (e.g. # of partitions)
virtual int get_tunable_variable( )
- Virtual mappings
  - Some tasks only need privileges, not a physical instance
  - Virtually map a region requirement by setting virtual_map
  - Child task mapping flows back into the parent context
- Controlling speculation
  - The mapper controls speculation on predicated tasks
virtual void speculate_on_predicate( )
  - Don't speculate for now; available soon
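A tunable variable answers a machine-dependent question from the machine model rather than hard-coding it in the application. In this sketch the TunableID enumeration, the Machine struct, and the one-partition-per-CPU policy are illustrative assumptions.

```cpp
#include <cassert>

enum TunableID { TUNABLE_NUM_PARTITIONS };

// Stand-in for the machine model queried by the mapper.
struct Machine {
  int num_nodes;
  int cpus_per_node;
};

// Answer a tunable query from the machine model, so the application
// stays machine-independent.
int get_tunable_variable(TunableID id, const Machine &m) {
  switch (id) {
    case TUNABLE_NUM_PARTITIONS:
      return m.num_nodes * m.cpus_per_node; // one partition per CPU
  }
  return -1; // unknown tunable
}
```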

Avoiding Resource Deadlocks
- Sibling tasks with a dependence cannot map until all children have mapped
  - Enforced automatically by the runtime
- Necessary to avoid resource deadlock
- Is there a better way?

Open Mapping Questions
- Resource-constrained mapping
  - Right now we map one task at a time
  - Map multiple tasks together to optimize resource usage
  - Trade off parallelism against resource usage
- Task fusion + mutation of the dynamic dependence graph
  - Fuse operations to support better data re-use
  - More on this in the meta-programming talk later today
  - Other manipulations of the dynamic dependence graph?
- Task replication
  - Why move data when you can compute it multiple times?
  - Replicate tasks to reduce overall data movement costs