A Streaming Multi-Threaded Model

Extended Abstract

Eylon Caspi, André DeHon, John Wawrzynek
September 30, 2001

Summary. We present SCORE, a multi-threaded model that relies on streams to expose thread parallelism and to enable efficient scheduling, low-overhead communication, and scalability. We present work to date on SCORE for scalable reconfigurable logic, as well as implementation ideas for SCORE for processor architectures. We demonstrate that streams can be exposed as a clean architectural feature that supports forward compatibility to larger, more parallel hardware.

For the past several decades, the predominant architectural abstraction for programmable computation systems has been the instruction set architecture (ISA). An ISA defines an instruction set and semantics for executing it. The longevity of the ISA model owes largely to the fact that those semantics decouple software from hardware development, so that hardware can improve without sacrificing the existing software of an architectural family (e.g. the now 23-year-old Intel x86 family, or the IBM 360). Increasingly, however, ISA uniprocessors are running out of headroom for performance improvement, due primarily to the rising costs of extracting and exploiting ILP. Today's state-of-the-art processors expend much of their area and power on hardware features to enhance ILP and tolerate latency, including caches, branch prediction units, and instruction reorder buffers. Recently, new architectures have emerged to exploit other forms of parallelism, including explicit instruction parallelism (VLIW), data-level parallelism (vector, MMX), and thread-level parallelism (chip multiprocessors, multi-threaded processors).

These architectures typically sacrifice some of the desirable properties of the single-threaded ISA model, be it ease of programming, compiler analyzability (e.g. obscured inter-thread communication patterns), or forward compatibility to larger hardware (e.g. VLIW).

In this paper we present SCORE, a scalable, multi-threaded computation model and associated architectural abstraction that maintain some of the best properties of the ISA model. The model is highly parallel and efficient, and it supports forward compatibility of executable programs to larger hardware. At the heart of SCORE is the premise that streams (inter-thread communication channels with FIFO discipline) should be a fundamental abstraction, exposed both in the programming model and in the architectural model. We rely on streams because: (1) streams expose inter-thread communication dependencies (data-flow), allowing efficient scheduling; (2) streams admit data batching (data blocking) to amortize the cost of context swaps and inter-thread communication; in particular, the per-message cost of communication can be made negligible by setting up and reusing a stream, and this reuse can be fully automated as part of thread scheduling; (3) streams can be exposed as a clean architectural feature, resembling a memory interface, that admits hardware acceleration, protection, and forward compatibility.

SCORE draws heavily on the prior work of numerous parallel compute models and system implementations. Due to space constraints, we refer the reader to [1] for a more complete treatment of related work and implementation details. In the remainder of this paper we discuss: (a) the basic primitives and properties of the SCORE model; (b) a binding of SCORE for reconfigurable logic, including a hardware model, programming model, scheduler, and simulation results for a media processing application; and (c) ideas for a binding of SCORE for processor architectures.

The SCORE Model. A SCORE program is a collection of threads that communicate via streams. A thread here has the usual meaning of a process with local state, but there is no global state shared among threads (such as shared memory). The only mechanism for inter-thread communication and synchronization is the stream, an inter-thread communication channel with FIFO (first-in, first-out) order, blocking reads, non-blocking writes, and conceptually unbounded capacity. In this respect, SCORE closely resembles Kahn process networks [2] or, as we shall see later, dataflow process networks [3]. A processor is the basic hardware unit that executes threads, one at a time, in a time-multiplexed manner when there are more threads than processors. Since threads interact with each other only through streams, it is possible to execute them in any order allowed by the stream data-flow, with deterministic results. (The blocking read and FIFO order of streams guarantee determinism under any schedule. Nondeterminism can be added to the model by allowing non-blocking stream reads, in which case thread execution becomes sensitive to scheduling and communication timing.) The data-flow graph induced by stream connections exposes inter-thread dependencies, which reflect actual inter-thread parallelism. Those dependencies can be used to construct schedules that are more efficient for a particular hardware target (i.e. minimize idle cycles) and that can be pre-computed. In contrast, other threading models tend to obscure inter-thread dependencies (e.g. pointer aliasing in shared-memory threads) and restrict scheduling using explicit synchronization (e.g. semaphores, barriers). Such restricted schedules cannot take advantage of larger hardware and thus undermine forward compatibility.
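To make these semantics concrete, below is a minimal sketch of such a stream in C with POSIX threads. It is our illustration, not the paper's implementation; the names (stream_t, stream_read, stream_write) are hypothetical, chosen to echo the model's operations. FIFO order, blocking reads, and non-blocking writes follow the model; the conceptually unbounded capacity is approximated by a heap-allocated list, so writes never block.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One stream: FIFO order, blocking read, non-blocking write. */
    typedef struct node { int value; struct node *next; } node_t;

    typedef struct {
        node_t *head, *tail;
        pthread_mutex_t lock;
        pthread_cond_t nonempty;   /* signaled when a token arrives */
    } stream_t;

    void stream_init(stream_t *s) {
        s->head = s->tail = NULL;
        pthread_mutex_init(&s->lock, NULL);
        pthread_cond_init(&s->nonempty, NULL);
    }

    /* Non-blocking write: enqueue and wake any blocked reader. */
    void stream_write(stream_t *s, int v) {
        node_t *n = malloc(sizeof *n);
        n->value = v;
        n->next = NULL;
        pthread_mutex_lock(&s->lock);
        if (s->tail) s->tail->next = n; else s->head = n;
        s->tail = n;
        pthread_cond_signal(&s->nonempty);
        pthread_mutex_unlock(&s->lock);
    }

    /* Blocking read: wait until a token is present, then dequeue it. */
    int stream_read(stream_t *s) {
        pthread_mutex_lock(&s->lock);
        while (s->head == NULL)
            pthread_cond_wait(&s->nonempty, &s->lock);
        node_t *n = s->head;
        s->head = n->next;
        if (s->head == NULL) s->tail = NULL;
        pthread_mutex_unlock(&s->lock);
        int v = n->value;
        free(n);
        return v;
    }

    /* Two threads connected by one stream form a tiny Kahn network. */
    static stream_t chan;

    static void *producer(void *arg) {
        (void)arg;
        for (int i = 1; i <= 8; i++)
            stream_write(&chan, i * i);
        stream_write(&chan, -1);   /* end-of-stream marker */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        int v;
        while ((v = stream_read(&chan)) != -1)
            printf("received %d\n", v);
        return NULL;
    }

    int main(void) {
        stream_init(&chan);
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

Note that the consumer's output is fully determined by the stream contents, regardless of how the two threads are interleaved, which is exactly the determinism property discussed above.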

Threads can be time-sliced to batch-process a large sequence of inputs. Batching is useful for amortizing the run-time costs of context swapping and of setting up stream connections and buffers. This batching mechanism is similar in spirit to traditional data blocking for better cache behavior, but in our case the blocking factor is determined in scheduling and can be tailored to the available hardware (namely to available buffer sizes) as late as at run-time. This late binding of the blocking factor contributes to forward compatibility and performance scaling on larger hardware. In contrast, data blocking in non-streaming models is fixed at compile time and cannot improve performance on larger hardware.

SCORE for Reconfigurable Architectures. Our first target for SCORE is as a model for scalable reconfigurable systems. Reconfigurable logic (e.g. field-programmable gate arrays, FPGAs) is a promising performance platform because it combines massive, fine-grained, spatial parallelism with programmability. Programmability, and in particular dynamic reconfiguration, aids performance because: (1) it can improve hardware utilization by giving otherwise idle logic a meaningful task, and (2) it allows specializing a computation around its data, later than compile time. Although reconfigurable systems (FPGAs, CPLDs) have been available commercially for some time, they have no ISA-like abstraction to decouple hardware from software. Resource types and capacities (e.g. LUT count) are exposed to the programmer, forcing him or her to bind algorithmic decisions and reconfiguration times to particular devices, thereby undermining the possibility of forward compatibility to larger hardware.

SCORE hides the size of hardware from the programmer and the executable using a notion of hardware paging. Programs and hardware are sliced into fixed-size compute pages that, in analogy to virtual memory pages, are automatically swapped into available hardware at run-time by operating system support. A paged computation can run on any number of compute pages and will, without recompilation, see performance improvement on more pages. Inter-page communication uses a streaming discipline, which is a natural abstraction of synchronous wire communication, but which is independent of wire timing (a page stalls in the absence of stream input) and which allows data batching. Batching is critical to amortizing the typically high cost of reconfiguration.
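To quantify the amortization argument (our illustration; these symbols do not appear in the paper): suppose a context swap plus stream setup costs $C_{\mathrm{swap}}$ cycles, and a resident thread then processes a batch of $B$ tokens at $c$ cycles per token. The effective per-token cost is

    $c_{\mathrm{eff}} = c + C_{\mathrm{swap}} / B$

so overhead falls linearly in the blocking factor $B$, which the scheduler can raise at run time up to the available buffer capacity. For instance, a swap cost of $10^5$ cycles amortized over a batch of $10^6$ tokens adds only 0.1 cycles per token.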

[Figure 1: Hypothetical, single-chip SCORE system: a microprocessor (with instruction and data caches, MMU, and an L2 cache) connected to an array of compute pages (CPs) and configurable memory blocks (CMBs). Figure 2: Run times for JPEG Encode: makespan (millions of cycles, 0.0-2.0) versus hardware size (4-15 CPs) for the exhaustive, topological, and min-cut schedulers.]

With respect to the abstract SCORE model, a compute page serves the role of a processor, whereas a page configuration serves the role of a thread. The basic SCORE hardware model is shown in Figure 1. A compute page (CP) is a fixed-size slice of reconfigurable fabric (typically logic and registers) with a stream network interface. The fabric kind, size, and number of I/O channels (streams) are architecturally defined and are identical for all pages (we presently target a page of 512 4-LUTs). The number of pages can vary among architecturally compatible devices. A page's network interface includes stream flow control (data presence and back-pressure to stall the page) and a buffer of several words per stream (for reducing stalls and draining in-flight communication). A configurable memory block (CMB) is a memory block with a stream network interface and a sequential address generator. The CMB memory capacity is architecturally defined (we presently target a CMB of 2 Mbit). A CMB holds stream buffers, user data, and page configurations, under OS-controlled memory management. A conventional microprocessor is used for page scheduling and OS support. The common streaming protocol linking these components allows a page thread to be completely oblivious to whether its streams are connected to another page, to a buffer in a CMB, or even to the microprocessor.

The actual network transport is less important, though it should ideally have high bandwidth. We presently assume a scalable interconnect modeled after the Berkeley HSRA [4] that is circuit-switched, fat-tree structured, and pipelined for high-frequency operation.

We have designed a language (TDF) for specifying page threads and inter-page data-flow, as well as a compiler (tdfc) to compile threads into page simulation code or into POSIX threads for microprocessor execution. Microprocessor threads interact with simulated hardware using a stream API and can dynamically spawn and connect page threads. A TDF page thread is defined as a streaming finite state machine (SFSM), essentially an extended FSM with streaming I/O. The execution semantics specify that on entry into a state, the SFSM issues blocking reads to those streams specified in the state's input signature. When input is available on those streams, the SFSM fires, executing a state-specific action described in a C-like syntax. At present, SFSMs must be explicitly written to match page hardware constraints (area, I/Os, timing), but we are working on an automatic partitioner to decompose arbitrarily large SFSMs into groups of stream-connected, page-size SFSMs. (Thread sizing to match a page is an artifact of the reconfigurable hardware target. It may be unnecessary or optional with other hardware targets and should, in general, not be exposed to the programmer. Partitioning SFSMs into reconfigurable pages is in some ways analogous to restructuring microprocessor code to improve VM page locality.) We have written a number of media processing applications in TDF, including wavelet, JPEG, and MPEG coding.

We have developed and tested several page schedulers. Each such scheduler is a privileged microprocessor task responsible for time-multiplexing a collection of page threads on available physical hardware. We use a time-sliced model where, in each time slice, the scheduler chooses a set of page threads to execute in hardware and manages stream buffers to communicate with suspended pages. Schedules are quasi-static, with a page loading order determined once from the stream data-flow topology and applied repeatedly. Schedules strive to run communicating pages together to reduce communication latency (especially under inter-page feedback) and stream buffering.
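To illustrate the SFSM execution semantics described above, the sketch below renders a page thread in C (not in TDF, whose concrete syntax we do not reproduce here), reusing the hypothetical stream_t helpers from the earlier sketch. The run-summing behavior is an invented example: state HEADER's input signature is {in}, and firing it reads a count; state BODY accumulates that many values and emits their sum.

    /* Hypothetical SFSM rendered in C (not TDF syntax), reusing the
     * stream_t sketch above. Each state's input signature is noted in
     * comments; the thread blocks reading those streams, then fires. */
    enum sfsm_state { HEADER, BODY };

    typedef struct { stream_t *in, *out; } sfsm_io_t;

    void *sfsm_sum_runs(void *arg) {
        sfsm_io_t *io = arg;
        enum sfsm_state st = HEADER;
        int remaining = 0, sum = 0;
        for (;;) {
            switch (st) {
            case HEADER:                          /* input signature: {in} */
                remaining = stream_read(io->in);  /* blocking read */
                sum = 0;
                st = (remaining > 0) ? BODY : HEADER;
                break;
            case BODY:                            /* input signature: {in} */
                sum += stream_read(io->in);
                if (--remaining == 0) {
                    stream_write(io->out, sum);   /* non-blocking write */
                    st = HEADER;
                }
                break;
            }
        }
        return NULL;  /* not reached; a real thread would also handle end-of-stream */
    }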

Figure 2 shows performance results for running JPEG encode (an application with 15 page threads) on simulated hardware of varying page counts, using three quasi-static schedulers. First, these results demonstrate that application performance scales predictably across hardware sizes: more hardware means shorter run times. Second, these results demonstrate that time-multiplexing is efficient, in the sense that the application can run on substantially fewer compute pages than there are page threads, with negligible performance loss. The application, as implemented, has limited parallelism that requires some threads to be idle some of the time; a time-multiplexed schedule can avoid loading such threads into idle hardware. Third, these results show that simple, heuristic schedulers based on stream data-flow topology (topological: topological order; min-cut: minimize buffered streams) perform almost as well as an exhaustive search to minimize idle cycles (exhaustive).

SCORE for Processor Architectures. The SCORE architecture developed above can be easily extended with additional processor types. We have already defined three processor types: CP, CMB, and microprocessor. An extended architecture might use specialized function units (ALUs, FFT units, etc.) or additional microprocessors. Each processor must have a stream network interface with the ability to stall the processor. The network fabric may vary.

Streams can be added to a microprocessor as a clean architectural feature, resembling a memory interface. The instruction set is extended with load/store-like operations, stream read(register, index) and stream write(register, index), where index enumerates the space of streams (like a memory address) and register denotes a value register. Stream reads must be able to stall the processor (like memory wait states), while stream writes can resume immediately (like memory stores to a write buffer).
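Anticipating the memory-mapped emulation option raised below, here is a hedged sketch of how such stream operations might look on a conventional microprocessor. The address map and helper names are hypothetical; an external controller is assumed to route loads and stores in this window to the stream network or to buffer memory, stalling loads with bus wait states until a token arrives.

    /* Hedged sketch: emulating stream read/write via a memory-mapped
     * stream window. Base address, stride, and names are hypothetical. */
    #include <stdint.h>

    #define STREAM_BASE   0x40000000u   /* hypothetical MMIO window */
    #define STREAM_STRIDE 0x4u          /* one word-wide port per stream index */

    static inline volatile uint32_t *stream_port(uint32_t index) {
        return (volatile uint32_t *)(uintptr_t)(STREAM_BASE + index * STREAM_STRIDE);
    }

    /* stream read(register, index): the load stalls until data arrives. */
    static inline uint32_t stream_read_mmio(uint32_t index) {
        return *stream_port(index);
    }

    /* stream write(register, index): the store retires into a write buffer. */
    static inline void stream_write_mmio(uint32_t index, uint32_t value) {
        *stream_port(index) = value;
    }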

Stream reads and writes can be handled by stream access units, resembling memory load/store units, even with out-of-order issue. These units must transparently route stream accesses either to a network interface or to memory buffers. Stream state ("live" or "buffered") and buffer locations can be stored in a TLB-like stream table, backed by a larger table in memory (like a VM page table). Access protection can be added using a process ID in the stream table. Finally, stalled stream accesses should time out after a while, allowing a scheduler to swap out blocked threads. Many of these features could be emulated on a conventional microprocessor using memory mapping or a co-processor interface, plus an external controller to route stream accesses to a network or to buffer memory.

The scheduling policies for reconfigurable hardware are easily adapted to multiprocessor and multi-threaded processor architectures, provided I/O operations are as fast as memory operations. A chip multiprocessor would be scheduled like a SCORE architecture with only microprocessors. A multi-threaded processor would be scheduled similarly, treating each hardware thread as a separate SCORE processor. Communication between threads loaded in hardware is live through registers, whereas communication to threads swapped out to memory is buffered in memory. In either case, the scheduler prefers to co-schedule communicating threads.

References

[1] E. Caspi, M. Chu, R. Huang, J. Yeh, Y. Markovskiy, J. Wawrzynek, and A. DeHon. Stream computations organized for reconfigurable execution (SCORE): Introduction and tutorial. http://brass.cs.berkeley.edu/documents/score_tutorial.pdf, August 2000.

[2] G. Kahn. Semantics of a simple language for parallel programming. In Information Processing 74, pages 471-475, August 1974.

[3] E. A. Lee and T. M. Parks. Dataflow process networks. Proceedings of the IEEE, 83(5):773-801, May 1995.

[4] W. Tsu, K. Macy, A. Joshi, R. Huang, N. Walker, T. Tung, O. Rowhani, V. George, J. Wawrzynek, and A. DeHon. HSRA: High-speed, hierarchical synchronous reconfigurable array. In Proc. International Symposium on Field Programmable Gate Arrays (FPGA '99), pages 125-134, February 1999.