Streaming as a pattern. Peter Mattson, Richard Lethin Reservoir Labs

Similar documents
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

Types of Parallel Computers

Register Organization and Raw Hardware. 1 Register Organization for Media Processing

Comparing Memory Systems for Chip Multiprocessors

Chapter 1 Computer System Overview

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

A Streaming Virtual Machine for GPUs

A Streaming Multi-Threaded Model

Multiprocessors - Flynn s Taxonomy (1966)

Evaluation of Stream Virtual Machine on Raw Processor

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

CS4961 Parallel Programming. Lecture 10: Data Locality, cont. Writing/Debugging Parallel Code 09/23/2010

Modern Processor Architectures. L25: Modern Compiler Design

CMSC 611: Advanced. Parallel Systems

Issues in Multiprocessors

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Data Parallel Architectures

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Issues in Multiprocessors


High Performance Computing. University questions with solution

Compiling for GPUs. Adarsh Yoga Madhav Ramesh

implementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot

Cache Memories. From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6.

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

FLYNN S TAXONOMY OF COMPUTER ARCHITECTURE

Higher Level Programming Abstractions for FPGAs using OpenCL

UNIT I (Two Marks Questions & Answers)

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization

Mission-Critical Space Software For Multi- Core Processors

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

Programmation Concurrente (SE205)

Programmable Graphics Hardware (GPU) A Primer

Software Synthesis Trade-offs in Dataflow Representations of DSP Applications

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

Introduction to GPU programming with CUDA

University of Malaga. Image Template Matching on Distributed Memory and Vector Multiprocessors

I/O Buffering and Streaming

CS 1013 Advance Computer Architecture UNIT I

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015

WHY PARALLEL PROCESSING? (CE-401)

CSC 553 Operating Systems

Flynn s Taxonomy of Parallel Architectures

PGI Accelerator Programming Model for Fortran & C

Data Parallel Execution Model

Why Multiprocessors?

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization

Chapter 12: Multiprocessor Architectures. Lesson 12: Message passing Systems and Comparison of Message Passing and Sharing

Architectural Styles. Software Architecture Lecture 5. Copyright Richard N. Taylor, Nenad Medvidovic, and Eric M. Dashofy. All rights reserved.

The basic operations defined on a symbol table include: free to remove all entries and free the storage of a symbol table

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:

SMD149 - Operating Systems - Multiprocessing

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

At the End of the Last Decade The Next Decade in Compiler Technology

ECE 669 Parallel Computer Architecture

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Introduction II. Overview

Chapter 1 Computer System Overview

Cache Aware Optimization of Stream Programs

SC12 HPC Educators session: Unveiling parallelization strategies at undergraduate level

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.

Parallel Processors. Session 1 Introduction

18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013

TDT 4260 lecture 3 spring semester 2015

Basics of Performance Engineering

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 5: Caching (2) Welcome!

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 4: Caching (2) Welcome!

6.189 IAP Lecture 12. StreamIt Parallelizing Compiler. Prof. Saman Amarasinghe, MIT IAP 2007 MIT

Instruction Level Parallelism (ILP)

Ultra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc.

Rule partitioning versus task sharing in parallel processing of universal production systems

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction Architecture overview. Multi-cluster architecture Addressing modes. Single-cluster Pipeline. architecture Instruction folding

Parallel Computing: Parallel Architectures Jin, Hai

Chapter 8 : Multiprocessors

LAB 1: Combinational Logic: Designing and Simulation of Arithmetic Logic Unit ALU using VHDL

Programming Models. Rebecca Schultz Paul Lee Varun Malhotra

Copyright 2012, Elsevier Inc. All rights reserved.

Chí Cao Minh 28 May 2008

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

Parallel Architectures

AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization

A Stream Compiler for Communication-Exposed Architectures

Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor

CUDA Parallelism Model

Compiling for Advanced Architectures

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

Logical Diagram of a Set-associative Cache Accessing a Cache

Patterns for! Parallel Programming!

Introduction to Parallel Processing

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9

OpenMP Optimization and its Translation to OpenGL

Transcription:

Streaming as a pattern Peter Mattson, Richard Lethin Reservoir Labs

Streaming as a pattern Streaming is a pattern in efficient implementations of computation- and data-intensive applications Pattern has three key characteristics: Processing is largely parallel Data access patterns are apparent Control is high-level, steady, simple

Causes of the pattern Applications process, simulate, or render physical systems Architectural limits Application qualities Processing is largely parallel Data access patterns are apparent Independence of spatially and/or temporally distributed data Continuous, N- dimensional spatial/temporal data VLIW, SIMD, multiprocessor demand for parallelism Small local memories Streaming Architecture limits Control is high-level, steady, simple Few tasks per application, few unpredictable events Handling unpredictability in hardware is expensive

Streaming is not only 1D Multidimensional data, data rearrangement Multidimensional architecture topologies Time is favored dimension, yielding 1D tendency Unmapped, coarse-grain streams of whole-array elements moving between actors with whole-loopnest invocations Serialization for transmission, especially address stream to memory to get data stream to processor

Using streaming Streaming applications expose maximum opportunities, information to Streaming architectures expose maximum resources, control to the Streaming languages should enforce the pattern Force programmer thought to reach pattern Guarantee pattern to Streaming s should exploit the pattern Expand scope of application optimization Expand scope of resource choreography

Streaming languages Enforce similar, but not identical patterns StreamIt Think structured synchronous data flow Single stream graph Streams are infinite length Static rates Filters can have state, may require sequential processing Designed by people, clean but more constrained Brook/StreamC Think pointer-less C, with embedded dataflow graphs instead of loop nests Multiple stream graphs, surrounded by C-subset Streams are finite length Dynamic rates Kernels must be state-less, allow parallel processing Designed by application and architecture people, rough but more expressive Applications qualities Architecture limits

Streaming s Expand scope of optimization Application is transparent to Access patterns, aliasing, etc. completely known Top-down, not just bottom-up optimization Exploit task parallelism Expand scope of choreography Compiler lays out all computation, data, and communication Closely model architecture VLIW scheduling writ large Difficult! Many variables, phase ordering Whole program Multiple loop nests Single loop nest CSE etc. ALUs, registers Cache Local memory Multiprocessor, routing

R-Stream Reservoir Labs is developing the R-Stream high-level Goal is portable, consistently high-performance compilation High-level machine model Processor, memory, interprocessor communication mapping VLIW scheduling, precise routing Brook StreamIt R-Stream high-level Streaming Virtual Machine (SVM) code Architecturespecific low-level Binary executable

R-Stream Reservoir Labs is developing the R-Stream high-level Goal is portable, consistently high-performance compilation High-level machine model Streaming IR Top-down optimization Unified mapping of computation and data Brook StreamIt R-Stream high-level Streaming Virtual Machine (SVM) code Architecturespecific low-level Binary executable

Streaming Pattern Characterized by parallelism, apparent data access, high-level etc. control Driven by application class, architectural limits Enforced by languages, exploited by s Used by R-Stream; goal of portable, consistently high-performance compilation