Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architectures

University of Virginia Dept. of Computer Science Technical Report #CS-2011-09

Jeremy W. Sheaffer and Kevin Skadron
Department of Computer Science, University of Virginia
{jws9c,skadron}@cs.virginia.edu

December 22, 2011

Abstract

We present Fractal, an API for programming parallel workloads on multithreaded and multicore architectures. Fractal schedules threads on processors with the cache configuration in mind in order to minimize cache misses. Fractal achieves speedups of nearly 2× on the Sun NIAGARA, while processors with less hyperthreading see less impressive performance improvements.

1 Fractal

Fractal is an API for parallel workloads which takes physical processor configuration into account and utilizes processor affinity to schedule work with the goal of minimizing L1 cache misses. The API employs a variety of schedulers, designed to minimize cache contention in different ways and for different data access patterns. With hyperthreaded processors executing one thread per virtual core, our schedulers lead to modest speedups of 10%-20%, while our system, executing on a Sun NIAGARA T1000, an in-order multicore processor with 4 virtual cores per L1 cache, demonstrates nearly 2× performance improvements.

Fractal provides a write-once, run-anywhere abstraction which, while currently targeting only multicore or hyperthreaded CPUs, is designed with heterogeneous targets in mind. The Fractal framework supports the notion of heterogeneity both in a multinode cluster and, more importantly in the short term, in an environment that supports one or more GPUs, and it can easily be extended to work in a networked, heterogeneous cluster. Fractal's source code is available for download at http://www.cs.virginia.edu/~jws9c/fractal/.

1.1 Fractal API

Fractal provides a minimalist API to parallelize independent iterations of loop bodies. In this section we discuss the entry points and their implementations.

1.2 Components

Table 1 contains a complete listing of Fractal API entry points. We discuss the functionality-critical members below.

Function Prototype                                            Description
------------------------------------------------------------  ------------------------------------------------
new(handle_t *handle)                                         Create a new Fractal context
kernel?d(handle_t handle, kernel?d_t kernel)                  Register a ?-d kernel, where ? is 1, 2, or 3
scheduler_function(handle_t handle, scheduler_t scheduler)    Select a scheduler
loop(handle_t handle, dimension_t dimension,                  Describe a loop control
     intptr_t initial, intptr_t less, intptr_t stride)
delete(handle_t handle)                                       Free resources related to a context
launch(handle_t handle)                                       Begin a calculation
get_error(void)                                               Check error state
clear_error(void)                                             Reset error state
barrier(handle_t handle)                                      Global barrier
partition_barrier(handle_t handle)                            Per-L1 barrier
finish(handle_t handle)                                       Wait for calculation to complete
print_error(FILE *stream)                                     Display error information
override_cpu(handle_t handle, char *cpu_name)                 Select a different CPU from the DB

Table 1: A listing, with brief descriptions, of all of the entry points into the Fractal API. All entry points and type names are prefixed with fractal_ (not shown), as are non-standard types. All entry points return a fractal_error_t value, except for fractal_print_error(). intptr_t is a signed integer type (from <stdint.h> in C99) with sufficient width to store a pointer on the host architecture.

fractal_new() and fractal_delete():

fractal_new() creates a new Fractal context and returns an opaque index in handle. A valid index is required by all other Fractal API entry points (except the error handlers), so fractal_new() must be called before any other Fractal function. Multiple Fractal contexts may exist concurrently; however, launching simultaneous Fractal computations would be unwise, as they would compete for resources. This functionality exists primarily to allow future extension of the API. fractal_delete() releases the resources associated with handle. It is an error to attempt further operations relative to handle after calling fractal_delete().

fractal_kernel?d():

fractal_kernel1d(), fractal_kernel2d(), and fractal_kernel3d() are used to associate callback compute kernels with handle. The types fractal_kernel?d_t are function pointer types which take one, two, or three intptr_t (see the caption of Table 1) arguments, respectively.

fractal_scheduler_function():

The user calls fractal_scheduler_function() to associate a scheduler, scheduler, with handle. Currently fractal_scheduler_t is opaque; however, we envision the possibility of exporting the definition to allow users to implement custom schedulers. There are currently four schedulers available; details follow in Section 1.3.3.
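The kernel types above can be made concrete with a short sketch. The paper leaves fractal_kernel?d_t opaque, so the typedef bodies below (including the void return type) and the example kernel are our assumptions about plausible shapes, not definitions from the Fractal sources:

    #include <stdint.h>

    /* Plausible shapes for the fractal_kernel?d_t function-pointer
     * types: one, two, or three intptr_t arguments, as the text
     * states; the void return type is an assumption. */
    typedef void (*fractal_kernel1d_t)(intptr_t i);
    typedef void (*fractal_kernel2d_t)(intptr_t i, intptr_t j);
    typedef void (*fractal_kernel3d_t)(intptr_t i, intptr_t j, intptr_t k);

    /* A kernel the 2-d type would accept: Fractal calls it once per
     * innermost (i, j) iteration. The array is invented for the example. */
    static float grid[256][256];

    static void scale_kernel(intptr_t i, intptr_t j)
    {
        grid[i][j] *= 2.0f;
    }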

fractal_loop():

fractal_loop() is used to describe for-loop-style loop control data and associate it with handle. Fractal currently supports up to three loop dimensions (specified with dimension). Extension to higher dimensionality is possible, should user response signal it is needed. Work is partitioned by the scheduler according to the loop specifications. The associated kernel is called once per innermost-loop iteration, as if it were the body of that loop.

fractal_launch() and fractal_finish():

fractal_launch() signals the Fractal runtime that the context associated with handle is fully specified and the computation is ready to begin. The runtime does final initialization, work scheduling, and thread creation (see Section 1.3.2). The main thread continues asynchronously. fractal_finish() synchronizes on the termination of all associated threads. A program should not call fractal_delete() or attempt to terminate without first calling fractal_finish().

fractal_barrier() and fractal_partition_barrier():

These functions are special in that they can and must be called from a Fractal kernel. fractal_barrier() creates a global barrier: no thread may proceed past the barrier until all threads have reached it. fractal_partition_barrier() creates a barrier that is local to the set of threads that share a physical CPU (or an L1 cache); threads local to a different CPU may proceed without consequence.

Because Fractal is based on POSIX threads, all synchronization and mutual exclusion primitives and operations associated with pthreads are available within Fractal, including POSIX semaphores and pthread mutexes and condition variables. Indeed, the Fractal barriers are built on these primitives. It is safe to mix primitives as needed; however, users desiring more control via pthreads-based, lower-level synchronization primitives will need to take care to avoid deadlock, and these constructs will not necessarily work with all targets.
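To make the whole call sequence concrete, the sketch below strings the Table 1 entry points together for a small 2-d computation. It is a minimal sketch under stated assumptions: the <fractal.h> header name, the use of small integers for dimension_t, and 0 as the "no error" value of fractal_error_t are ours, not from the paper.

    /* Minimal end-to-end sketch of the Table 1 call sequence,
     * computing C = A * B with a 2-d kernel. */
    #include <stdint.h>
    #include <stdio.h>
    #include <fractal.h>                 /* assumed header name */

    #define N 240
    static float a[N][N], b[N][N], c[N][N];

    /* Called once per innermost (i, j) iteration, as if it were
     * the loop body. */
    static void matmul_kernel(intptr_t i, intptr_t j)
    {
        float sum = 0.0f;
        for (intptr_t k = 0; k < N; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

    int main(void)
    {
        fractal_handle_t h;

        fractal_new(&h);                     /* context must come first */
        fractal_kernel2d(h, matmul_kernel);  /* register the 2-d kernel */

        /* Loop control: for (i = 0; i < N; i += 1)
         *                 for (j = 0; j < N; j += 1)
         * Passing 0 and 1 for dimension_t is an assumption. */
        fractal_loop(h, 0, 0, N, 1);         /* dimension 0: i */
        fractal_loop(h, 1, 0, N, 1);         /* dimension 1: j */

        fractal_launch(h);                   /* returns asynchronously  */
        fractal_finish(h);                   /* join all worker threads */

        if (fractal_get_error() != 0)        /* assumed success value   */
            fractal_print_error(stderr);

        fractal_delete(h);                   /* only after finish       */
        return 0;
    }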
1.3 Implementation

In this section we discuss some of the details and challenges of our Fractal API implementation.

1.3.1 System Configuration

We faced a number of issues in implementing the Fractal API. Most important among these are unreliable system-level tools for obtaining architecture-level hardware specifications.

Linux provides the /proc and /sys filesystems, which provide information about the currently running system. In this work, we are concerned with the physical configuration of the underlying processor architecture, as well as the mappings of virtual to physical processors, of virtual and physical processors to L1 caches, and of processors and caches to their system-level abstractions. The information provided by the Linux system-level interfaces on these topics is usually false. For example, all Intel systems used in our development and testing list the ht tag, supposedly indicating that the processor supports hyperthreading. Of these systems, only our Core i7 based machines are actually hyperthreaded; all others falsely report this functionality. Similarly, our Core i7 systems report eight physical ids (0-7), one core id (0), one sibling (ostensibly a count of virtual cores per physical core for hyperthreading), and the existence of one core, while our Core2 Quad systems report one physical id (0), four core ids (0-3), four siblings, and the existence of four cores. In reality, the Core i7s have four physical cores with 2-way hyperthreading on each core, for eight virtual cores, and the Core2 Quads have four physical cores with no hyperthreading (thus four virtual cores). There is no way to determine this from the data in /proc/cpuinfo, nor from any other data contained under /proc or /sys. Linux also provides the sysconf() interface to query system state, but it is limited in functionality and actually reads /proc/cpuinfo for its data.

To get around these issues, and also to allow the use of our API on systems that do not provide the same system information interfaces as Linux (our Solaris port, for example), we have developed a CPU database as part of our implementation. Our database is indexed first by hostname. If that lookup fails, the system attempts to read /proc/cpuinfo, extracts the CPU model name recorded there (this information seems to be correct), and uses it to index the database a second time. If the second lookup also fails, the system falls back on a default configuration which assumes a single physical core with two virtual cores.
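The three-stage fallback just described can be summarized in a short C sketch. The types and helper functions here (cpu_db_find_by_host(), cpu_db_find_by_model(), read_model_name(), and so on) are invented for illustration; the Fractal sources may structure this differently.

    #include <stddef.h>
    #include <unistd.h>                  /* gethostname() */

    /* Hypothetical types and helpers, for illustration only. */
    typedef struct cpu_spec cpu_spec_t;
    typedef struct cpu_db   cpu_db_t;
    const cpu_spec_t *cpu_db_find_by_host(const cpu_db_t *db, const char *host);
    const cpu_spec_t *cpu_db_find_by_model(const cpu_db_t *db, const char *model);
    const cpu_spec_t *cpu_db_default(const cpu_db_t *db);
    int read_model_name(const char *cpuinfo_path, char *buf, size_t len);

    /* The three-stage database lookup described above. */
    static const cpu_spec_t *lookup_cpu(const cpu_db_t *db)
    {
        char host[256], model[256];
        const cpu_spec_t *spec;

        /* 1. Index the database by hostname. */
        if (gethostname(host, sizeof host) == 0 &&
            (spec = cpu_db_find_by_host(db, host)) != NULL)
            return spec;

        /* 2. Fall back to the "model name" field of /proc/cpuinfo,
         *    which, unlike the topology fields, appears trustworthy. */
        if (read_model_name("/proc/cpuinfo", model, sizeof model) == 0 &&
            (spec = cpu_db_find_by_model(db, model)) != NULL)
            return spec;

        /* 3. Last resort: the default single-physical-core,
         *    two-virtual-core configuration. */
        return cpu_db_default(db);
    }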

Our CPU database allows users to specify all of the desired data described above, as well as a number of other details about caches. Having discovered the issues with /proc/cpuinfo, we were able to get actual device specifications from marketing materials; however, that is still not sufficient to map a particular virtual processor to a particular L1 cache. Intel provides a tool (http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/) which was able to simplify this step on the Intel-based Linux systems. Under Solaris, we were forced to microbenchmark.

The need for a CPU database provided an unforeseen benefit: we were able to easily extend our API with an entry into the processor database that allows the CPU configuration state to be overridden. This is useful in its ability to allow a debug CPU. Our debug configuration defines only a single virtual processor. This is not generally useful for API development, as the difficulty in developing the API is primarily related to the nondeterminism and locking involved in multithreaded and reentrant code. For application developers, however, this allows debugging of their applications in a single-threaded environment without hacking anything.

1.3.2 Initialization and Execution

At API initialization, the processor database is queried for a processor specification matching the current system. With this information in hand, the Fractal runtime can allocate its control structures and create threads. In order to enforce the desired scheduling and sharing (Section 1.3.3), the system explicitly assigns processor affinity when launching threads. Each virtual processor is assigned one thread, except in the case where there is a one-to-one mapping between virtual and physical processors, in which case two threads are assigned per processor.
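On Linux, explicit affinity assignment of this sort can be expressed with the GNU extension pthread_setaffinity_np() (a Solaris port would use processor_bind() instead). The sketch below shows a plausible shape for the launch step; the function name and error handling are ours, not Fractal's actual code.

    #define _GNU_SOURCE              /* for pthread_setaffinity_np() */
    #include <pthread.h>
    #include <sched.h>

    /* Create one worker and pin it to virtual processor `cpu`, where
     * `cpu` comes from the CPU database's virtual-to-physical mapping. */
    static int launch_pinned(pthread_t *tid, int cpu,
                             void *(*worker)(void *), void *arg)
    {
        cpu_set_t set;
        int err;

        if ((err = pthread_create(tid, NULL, worker, arg)) != 0)
            return err;

        CPU_ZERO(&set);              /* affinity mask naming only `cpu` */
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(*tid, sizeof set, &set);
    }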

1.3.3 Work Scheduling

Before launch, the master thread partitions the work into chunks. The chunks are created in sets of n equally sized work units, where n is related to the number of processors (virtual or physical, depending on the exact work scheduler selected). The work partitioner begins by evenly partitioning half the work among the n processors, proceeding recursively on the remainder until a minimum-size work unit is allocated; the last two passes produce units of the same size. The exponential decay in work unit size is designed to reduce the likelihood of a long-tail effect, wherein one or a few cores continue to process more compute-intensive workloads long after the majority of cores have become idle. In an ideal situation, every core will process the same slice of work units. There is no mechanism for work stealing built into the API at this time.
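One way to read this partitioning scheme is sketched below: each pass hands each of the n processors an equal unit drawn from half of the remaining work, and the final remainder is spread across 2n units so that the last two passes come out the same size. The function names, the min_unit threshold, and the emit callback are illustrative assumptions, not Fractal's actual internals.

    #include <stdio.h>

    /* Sketch of the exponential-decay work partitioner described above. */
    static void partition_work(long total, int n, long min_unit,
                               void (*emit)(long size))
    {
        long remaining = total;

        while ((remaining / 2) / n > min_unit) {
            long unit = (remaining / 2) / n;  /* this pass's unit size */
            for (int i = 0; i < n; i++)
                emit(unit);
            remaining -= (long)n * unit;      /* recurse on the rest   */
        }

        /* Last two passes: 2n equal, minimum-size units cover what is
         * left, spreading any rounding remainder one iteration at a time. */
        long unit  = remaining / (2 * n);
        long extra = remaining % (2 * n);
        for (int i = 0; i < 2 * n; i++)
            emit(unit + (i < extra ? 1 : 0));
    }

    /* Example: print the unit sizes for 1024 iterations on 4 processors;
     * passes of 128, 64, 32, and 16 precede eight final units of 8. */
    static void print_unit(long size) { printf("%ld\n", size); }

    int main(void)
    {
        partition_work(1024, 4, 8, print_unit);
        return 0;
    }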
Fractal schedules work according to a selection among a number of possible scheduling patterns. The Fractal schedulers assume neighborhood-based memory access patterns, with tight spatial and temporal locality. Schedulers with more special-purpose access patterns, say for a specific implementation of FFT, are feasible. The available scheduling patterns include naïve, which divides memory into 2n contiguous chunks (where n is the number of available cores and L1 caches); parallel z, in which subsequent rows of data are processed simultaneously by threads sharing a core and cache; and staggered x, in which threads sharing a core and cache stagger accesses to the same matrix row. A morton scheduler is under development, which will schedule work based on a Morton space-filling-curve-shaped access pattern (Figure 1).

Figure 1: The Morton space-filling curve, with four levels of recursion (first through fourth order) shown.

The parallel z scheduler achieves speedups of 1.79× on NIAGARA on a 240×240 matrix multiply and 1.81× on a 3×3×3 Gaussian blur over a 256×256×256 domain, and more modest speedups on hyperthreaded Intel architectures (1.12× and 1.01× on the same workloads on a Core i7), all compared with the naïve scheduler. Architectures without hyperthreading are unable to achieve speedups at all. None of the other schedulers is able to beat naïve. We have not fully explored the reasons for the poor performance of these schedulers, though we believe it can be attributed to multiple factors, including: out-of-order processing, the absence of which on the in-order NIAGARA seems to explain that machine's better results; and the large overhead of OS scheduling, which allows the first of ostensibly cooperative threads to run well ahead before the second thread has a chance to begin. Attempts to enforce closer synchronization to ameliorate this problem increased overhead sufficiently to outweigh the benefit.

2 Conclusions and Future Work

Fractal provides an API that allows users to easily and portably target multicore platforms with many threads of control. Fractal can achieve impressive performance improvements on hyperthreaded architectures. Our results comparing NIAGARA, Core i7, and Core 2 suggest that increased hyperthreading leads to better results for Fractal; we would like to explore this more fully. In addition, our results suggest that OS schedulers interfere with the Fractal schedulers by allowing threads that Fractal intended to run together to run separately, producing a competitive situation where a cooperative one was intended. By modifying the Linux scheduler to be Fractal-aware, we believe we could improve Fractal's performance on Linux and show how OS schedulers could be better designed for high-performance computing.

Attempting to target GPUs with Fractal proved difficult, as Fractal attempts to do fine-grained scheduling of threads, and GPU programming APIs (and indeed GPU architectures) do not allow that. We did not find a satisfactory solution to this problem. Furthermore, the Fractal model of fine-grained work scheduling does not map well to GPU architectures, which automatically schedule threads en masse. Even given a satisfactory method of compiling GPU kernels within a Fractal environment, current GPU targets are unlikely to see performance improvements from Fractal scheduling.