LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

Size: px
Start display at page:

Download "LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs"

Transcription

1 LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang*, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili* *Computer Architecture and System Lab, Georgia Institute of Technology NVIDIA Research Sponsors: 1

2 Executive Summary Motivation: New memory reference locality relationship in dynamic parallelism Problem: State-of-the-art GPU schedulers are unaware of this new relationship Designed for non-dynamic parallelism settings Do not exploit parent-child locality in and L2 cache Proposed: LaPerm Locality-aware thread block scheduler Three scheduling decisions 2

3 Dynamic Parallelism on GPU Launch workload on demand from GPU CUDA Dynamic Parallelism (CDP) OpenCL device-side enqueue Dynamic Thread Block Launch [1] (DTBL) Benefits 0 Apply to fine-grained parallelism in irregular applications Increase execution efficiency and productivity Graph Traversal Launch new kernels Launch new thread blocks Dynamically Launched Child Kernels/TBs Launch when sufficient parallelism (e.g. 4 threads) Reduced Workload Imbalance Parent Kernel Workload Imbalance Launched by t1 Launched by t2 Launched by t3 Uniform control flow and more memory coalescing Launched by t4 [1] Wang, Rubin, Sidelnik and Yalamanchili, Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on gpus, ISCA

4 Memory Locality in Dynamic Parallelism New data reference locality relationship in parent-child launching Potential Locality: Parent-child and childsibling data sharing /L2 locality Read/Write Graph Data Parent Kernel Launches Cache Cache Cache L2 Cache GPU Global Memory Reuse Graph Data Cache Child Kernels/TBs Average shared footprint ratio for 8 benchmarks: Parent-child: 38.4% Child-sibling: 30.5% 4

5 Executive Summary Motivation: New memory reference locality relationship in dynamic parallelism Problem: State-of-the-art GPU schedulers are unaware of this new relationship Designed for non-dynamic parallelism settings Do not exploit parent-child locality in and L2 cache Proposed: LaPerm Locality-aware thread block scheduler Three scheduling decisions 5

6 Current GPU Scheduler First-Come-First-Serve kernel scheduler Round-Robin TB Scheduler Kernel 1 TB0 TB1 First Come First Serve Kernel Distributor TB2 TB3 Round Robin Scheduler Kernel 2 TB4 TB5 TB0 TB1 TB2 TB3 TB4 TB5 6

7 Current GPU Scheduler (continue) For non-dynamic Parallelism scenario Fairness and efficiency For Dynamic Parallelism scenario Child TBs are scheduled after parent TBs Parent TBs First Come First Serve Kernel Distributor P0 P1 P2 P3 P4 P5 P6 P7 Round Robin Scheduler C2 C4 C3 C0 C5 Child TBs C1 P0 P1 P2 P3 P4 P5 P6 P7 C0 C1 C2 C3 C4 C5 7

8 Current GPU Scheduler: Issues Issue 1: Child TBs are executed far later than the parent TBs, decreasing L2 locality First Come First Serve Kernel Distributor GPU Global Memory L2 Cache Round Robin Scheduler P0 P1 P2 P3 P4 P5 P6 P7 Potential L2 pollution C0 C1 C2Child TBs C3are far away from parent TBs C4 C5 8

9 Current GPU Scheduler: Issues (continue) Issue 1: Child TBs are executed far later than the parent TBs, decreasing L2 locality Issue 2: Child TBs are executed on a different than its parent, decreasing locality Fails to exploit parent-child or child-sibling locality Exacerbated in real applications GPU Global Memory which generally have many TBs. L2 Cache P0 P1 P2 P3 P4 P5 P6 P7 C0 Child TBs are not able C1 C2 C3 to utilize on the C4 parent s C5 9

10 Executive Summary Motivation: New memory reference locality relationship in dynamic parallelism Problem: State-of-the-art GPU schedulers are unaware of this new relationship Designed for non-dynamic parallelism settings Do not exploit parent-child locality in and L2 cache Proposed: LaPerm Locality-aware thread block scheduler Three scheduling decisions 10

11 LaPerm: Locality-Aware Scheduler for Dynamic Parallelism Leverage locality between parent and child TBs Three scheduling decisions Accommodate different forms of locality Goal: improve memory efficiency and overall performance 11

12 Scheduling Decision 1: TB Prioritizing Prioritize child TBs to be executed immediately after direct parent TBs Reuse parent data and avoid L2 cache pollution GPU Global Memory L2 Cache Parent TBs P0 P1 P2 P3 P4 P5 P6 P7 P0 P1 P2 P3 C2 C3 C0 C1 C0 C1 P4 P5 C4 C5 C2 C3 C4 C5 Child TBs P6 P7 Still no locality! 12

13 Scheduling Decision 2: Prioritized Binding Bind the prioritized child TBs on the same as the parent TBs Utilize cache for data reuse GPU Global Memory L2 Cache Parent TBs P0 P1 P2 P3 P4 P5 P6 P7 P0 P1 P2 P3 P4 P5 C0 P6 C2 C3 C0 C1 C2 P7 C1 C4 C5 C3 Child TBs C4 load balancing issue! C5 13

14 Scheduling Decision 3: Adaptive Prioritized Binding Adaptively bind child TBs on available s Avoid load balancing caused by binding GPU Global Memory L2 Cache Parent TBs P0 P1 P2 P3 P4 P5 P6 P7 P0 P1 P2 P3 P4 P5 C0 P6 C2 C3 C0 C1 C2 P7 C1 C3 C4 C5 C4 C5 Child TBs 14

15 Architecture Support Multi-level priority queues Used to store new TB/Kernel information Ordered using priority value Divided among multiple s Off-chip overflow priority queues Kernel Management Unit First Come First Serve Kernel Distributor DRAM Scheduler New Scheduler Logic On-chip priority queues 15

16 Benchmark and Experimental Environment Benchmark implemented with dynamic parallelism Simulated on GPGPU-Sim Benchmark Input Notation Adaptive Mesh Refinement Energy data set amr Barnes Hut Tree Random data set bht Breadth-First Search Citation network bfs_citation USA road network Cage15 sparse matrix bfs_usa_road bfs_cage15 Graph Coloring Citation network clr_citation Graph 500 Cage15 sparse matrix clr_graph500 clr_cage15 Regular Expression Match Darpa network regx_darpa Random string collection Product Recommendation MovieLens pre regx_string Relational Join Uniform distribution synthetic join_uniform Gaussian distribution synthetic join_gaussian Single-Source Shortest Path Citation network sssp_citation Fight network Physics Simulation Cage15 sparse matrix Tree and Graph Applications Machine Learning Relational Database sssp_flight sssp_cage15 16

17 Performance of LaPerm: L2 Cache L2 Cache Hit Rate RR TB-Pri -Bind Adaptive-Bind 17

18 Performance of LaPerm: L2 Cache L2 Cache Hit Rate RR TB-Pri -Bind Adaptive-Bind L2 Cache hit rate benefits mainly from prioritizing child TBs 18

19 Performance of LaPerm: L2 Cache L2 Cache Hit Rate RR TB-Pri -Bind Adaptive-Bind L2 Cache hit rate benefits mainly from prioritizing child TBs Graph applications with cage15 input have more parent-child data reuse 19

20 Performance of LaPerm: Cache 100 RR TB-Pri -Bind Adaptive-Bind Cache Hit Rate

21 Performance of LaPerm: Cache 100 RR TB-Pri -Bind Adaptive-Bind Cache Hit Rate Cache hit rate benefits mainly from binding child TBs to parents s 21

22 Performance of LaPerm: Cache 100 RR TB-Pri -Bind Adaptive-Bind Cache Hit Rate Cache hit rate benefits mainly from binding child TBs to parents s Product recommendation pre has good child-sibling data locality 22

23 Performance of LaPerm: IPC 1.7 TB-Pri -Bind Adaptive-Bind Normalized IPC Normalized to RR 23

24 Performance of LaPerm: IPC 1.7 TB-Pri -Bind Adaptive-Bind Normalized IPC Normalized to RR TB-Pri: IPC (1.13x) increases due to higher L2 cache hit rate -Bind: IPC decreases (1.08x) from TB-Pri due to higher cache hit rate but load balancing Adaptive-Bind: Overall IPC (1.27x) increases because of both memory efficiency and load balancing 24

25 Conclusion LaPerm: Locality-aware thread block scheduler for dynamic parallelism Exploit new memory reference locality in the parent-child launching Three scheduling decisions Increase /L2 cache locality while maintaining load balance Achieve overall memory system efficiency and performance improvement 25

26 THANK YOU! QUESTIONS? 26

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA Research Email: {jin.wang,sudha}@gatech.edu,

More information

Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs

Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Rachata Ausavarungnirun Saugata Ghose, Onur Kayiran, Gabriel H. Loh Chita Das, Mahmut Kandemir, Onur Mutlu Overview of This Talk Problem:

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

DEEP DIVE INTO DYNAMIC PARALLELISM

DEEP DIVE INTO DYNAMIC PARALLELISM April 4-7, 2016 Silicon Valley DEEP DIVE INTO DYNAMIC PARALLELISM SHANKARA RAO THEJASWI NANDITALE, NVIDIA CHRISTOPH ANGERER, NVIDIA 1 OVERVIEW AND INTRODUCTION 2 WHAT IS DYNAMIC PARALLELISM? The ability

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs

Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs Hsien-Kai Kuo, Ta-Kan Yen, Bo-Cheng Charles Lai and Jing-Yang Jou Department of Electronics Engineering National Chiao

More information

Lecture 6: Input Compaction and Further Studies

Lecture 6: Input Compaction and Further Studies PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Evaluating GPGPU Memory Performance Through the C-AMAT Model

Evaluating GPGPU Memory Performance Through the C-AMAT Model Evaluating GPGPU Memory Performance Through the C-AMAT Model Ning Zhang Illinois Institute of Technology Chicago, IL 60616 nzhang23@hawk.iit.edu Chuntao Jiang Foshan University Foshan, Guangdong 510000

More information

Register File Organization

Register File Organization Register File Organization Sudhakar Yalamanchili unless otherwise noted (1) To understand the organization of large register files used in GPUs Objective Identify the performance bottlenecks and opportunities

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Orchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

Orchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das Parallelize your code! Launch more threads! Multi- threading

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering

More information

Exploring Dynamic Parallelism on OpenMP

Exploring Dynamic Parallelism on OpenMP www.bsc.es Exploring Dynamic Parallelism on OpenMP Guray Ozen, Eduard Ayguadé, Jesús Labarta WACCPD @ SC 15 Guray Ozen - Exploring Dynamic Parallelism in OpenMP Austin, Texas 2015 MACC: MACC: Introduction

More information

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification

More information

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis Abstract: Lower upper (LU) factorization for sparse matrices is the most important computing step for circuit simulation

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

A Framework for Modeling GPUs Power Consumption

A Framework for Modeling GPUs Power Consumption A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)

More information

CAWA: Coordinated Warp Scheduling and Cache Priori6za6on for Cri6cal Warp Accelera6on of GPGPU Workloads

CAWA: Coordinated Warp Scheduling and Cache Priori6za6on for Cri6cal Warp Accelera6on of GPGPU Workloads 2015 InternaDonal Symposium on Computer Architecture (ISCA- 42) CAWA: Coordinated Warp Scheduling and Cache Priori6za6on for Cri6cal Warp Accelera6on of GPGPU Workloads Shin- Ying Lee Akhil Arunkumar Carole-

More information

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH

More information

An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader

An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin, Jason Riedy, and David A. Bader Problem Data is unstructured, heterogeneous, and vast Serious opportunities for

More information

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory

More information

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on

More information

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen 1, Ari B. Hayes 1, Chi Zhang 2, Timothy Salmon 1, Eddy Z. Zhang 1 1. Rutgers University 2. University of Pittsburgh Sparse

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

A Detailed GPU Cache Model Based on Reuse Distance Theory

A Detailed GPU Cache Model Based on Reuse Distance Theory A Detailed GPU Cache Model Based on Reuse Distance Theory Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal Eindhoven University of Technology (Netherlands) Henri Bal Vrije Universiteit Amsterdam

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Flexible Batched Sparse Matrix-Vector Product on GPUs

Flexible Batched Sparse Matrix-Vector Product on GPUs ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Adrian M. Caulfield Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson Non-Volatile Systems

More information

Dynamic Load Balancing on Single- and Multi-GPU Systems

Dynamic Load Balancing on Single- and Multi-GPU Systems Dynamic Load Balancing on Single- and Multi-GPU Systems Long Chen, Oreste Villa, Sriram Krishnamoorthy, Guang R. Gao Department of Electrical & Computer Engineering High Performance Computing University

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas

Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas 2 Increasing number of transistors on chip Power and energy limited Single- thread performance limited => parallelism Many opeons: heavy mulecore,

More information

Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement

Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement Ping Xiang, Yi Yang*, Mike Mantor #, Norm Rubin #, Lisa R. Hsu #, Huiyang Zhou

More information

Evaluating the Effect of Last-Level Cache Sharing on Integrated GPU-CPU Systems with Heterogeneous Applications

Evaluating the Effect of Last-Level Cache Sharing on Integrated GPU-CPU Systems with Heterogeneous Applications Evaluating the Effect of Last-Level Cache Sharing on Integrated GPU-CPU Systems with Heterogeneous Applications Víctor García 1,2 Juan Gómez-Luna 3 Thomas Grass 1,2 Alejandro Rico 4 Eduard Ayguade 1,2

More information

R13. II B. Tech I Semester Supplementary Examinations, May/June DATA STRUCTURES (Com. to ECE, CSE, EIE, IT, ECC)

R13. II B. Tech I Semester Supplementary Examinations, May/June DATA STRUCTURES (Com. to ECE, CSE, EIE, IT, ECC) SET - 1 II B. Tech I Semester Supplementary Examinations, May/June - 2016 PART A 1. a) Write a procedure for the Tower of Hanoi problem? b) What you mean by enqueue and dequeue operations in a queue? c)

More information

Analyzing CUDA Workloads Using a Detailed GPU Simulator

Analyzing CUDA Workloads Using a Detailed GPU Simulator CS 3580 - Advanced Topics in Parallel Computing Analyzing CUDA Workloads Using a Detailed GPU Simulator Mohammad Hasanzadeh Mofrad University of Pittsburgh November 14, 2017 1 Article information Title:

More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Beyond Programmable Shading. Scheduling the Graphics Pipeline

Beyond Programmable Shading. Scheduling the Graphics Pipeline Beyond Programmable Shading Scheduling the Graphics Pipeline Jonathan Ragan-Kelley, MIT CSAIL 9 August 2011 Mike s just showed how shaders can use large, coherent batches of work to achieve high throughput.

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

Introduction to Parallel Programming Models

Introduction to Parallel Programming Models Introduction to Parallel Programming Models Tim Foley Stanford University Beyond Programmable Shading 1 Overview Introduce three kinds of parallelism Used in visual computing Targeting throughput architectures

More information

EFFICIENT SYNCHRONIZATION FOR GPGPU. by Jiwei Liu B.S., Zhejiang University, 2010

EFFICIENT SYNCHRONIZATION FOR GPGPU. by Jiwei Liu B.S., Zhejiang University, 2010 EFFICIENT SYNCHRONIZATION FOR GPGPU by Jiwei Liu B.S., Zhejiang University, 2010 Submitted to the Graduate Faculty of the Swanson School of Engineering in partial fulfillment of the requirements for the

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Sample Questions. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Sample Questions. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Sample Questions Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Sample Questions 1393/8/10 1 / 29 Question 1 Suppose a thread

More information

Operating System Supports for SCM as Main Memory Systems (Focusing on ibuddy)

Operating System Supports for SCM as Main Memory Systems (Focusing on ibuddy) 2011 NVRAMOS Operating System Supports for SCM as Main Memory Systems (Focusing on ibuddy) 2011. 4. 19 Jongmoo Choi http://embedded.dankook.ac.kr/~choijm Contents Overview Motivation Observations Proposal:

More information

MRPB: Memory Request Priori1za1on for Massively Parallel Processors

MRPB: Memory Request Priori1za1on for Massively Parallel Processors MRPB: Memory Request Priori1za1on for Massively Parallel Processors Wenhao Jia, Princeton University Kelly A. Shaw, University of Richmond Margaret Martonosi, Princeton University Benefits of GPU Caches

More information

Characterizing the Performance Bottlenecks of Irregular GPU Kernels

Characterizing the Performance Bottlenecks of Irregular GPU Kernels Characterizing the Performance Bottlenecks of Irregular GPU Kernels Molly A. O Neil M.S. Candidate, Computer Science Thesis Defense April 2, 2015 Committee Members: Dr. Martin Burtscher Dr. Apan Qasem

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

{jtan, zhili,

{jtan, zhili, Cost-Effective Soft-Error Protection for SRAM-Based Structures in GPGPUs Jingweijia Tan, Zhi Li, Xin Fu Department of Electrical Engineering and Computer Science University of Kansas Lawrence, KS 66045,

More information

Comp 310 Computer Systems and Organization

Comp 310 Computer Systems and Organization Comp 310 Computer Systems and Organization Lecture #9 Process Management (CPU Scheduling) 1 Prof. Joseph Vybihal Announcements Oct 16 Midterm exam (in class) In class review Oct 14 (½ class review) Ass#2

More information

Orchestrated Scheduling and Prefetching for GPGPUs

Orchestrated Scheduling and Prefetching for GPGPUs Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog Onur Kayiran Asit K. Mishra Mahmut T. Kandemir Onur Mutlu Ravishankar Iyer Chita R. Das The Pennsylvania State University Carnegie Mellon University

More information

CS 326: Operating Systems. CPU Scheduling. Lecture 6

CS 326: Operating Systems. CPU Scheduling. Lecture 6 CS 326: Operating Systems CPU Scheduling Lecture 6 Today s Schedule Agenda? Context Switches and Interrupts Basic Scheduling Algorithms Scheduling with I/O Symmetric multiprocessing 2/7/18 CS 326: Operating

More information

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip Codesign for Tiled Manycore Systems Mingyu Wang and Zhaolin Li Institute of Microelectronics, Tsinghua University, Beijing 100084,

More information

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and A Novel Way to Improve TLP

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and A Novel Way to Improve TLP 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 GPU Performance vs. Thread-Level Parallelism: Scalability Analysis

More information

Scheduling, part 2. Don Porter CSE 506

Scheduling, part 2. Don Porter CSE 506 Scheduling, part 2 Don Porter CSE 506 Logical Diagram Binary Memory Formats Allocators Threads Today s Lecture Switching System to CPU Calls RCU scheduling File System Networking Sync User Kernel Memory

More information

Scheduling the Graphics Pipeline on a GPU

Scheduling the Graphics Pipeline on a GPU Lecture 20: Scheduling the Graphics Pipeline on a GPU Visual Computing Systems Today Real-time 3D graphics workload metrics Scheduling the graphics pipeline on a modern GPU Quick aside: tessellation Triangle

More information

Advanced CUDA Optimization 1. Introduction

Advanced CUDA Optimization 1. Introduction Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines

More information

8: Scheduling. Scheduling. Mark Handley

8: Scheduling. Scheduling. Mark Handley 8: Scheduling Mark Handley Scheduling On a multiprocessing system, more than one process may be available to run. The task of deciding which process to run next is called scheduling, and is performed by

More information

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs 1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong

More information

Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)

Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain) Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain) 4th IEEE International Workshop of High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information

Lecture Notes: Unleashing MAYHEM on Binary Code

Lecture Notes: Unleashing MAYHEM on Binary Code Lecture Notes: Unleashing MAYHEM on Binary Code Rui Zhang February 22, 2017 1 Finding Exploitable Bugs 1.1 Main Challenge in Exploit Generation Exploring enough of the state space of an application to

More information

GViM: GPU-accelerated Virtual Machines

GViM: GPU-accelerated Virtual Machines GViM: GPU-accelerated Virtual Machines Vishakha Gupta, Ada Gavrilovska, Karsten Schwan, Harshvardhan Kharche @ Georgia Tech Niraj Tolia, Vanish Talwar, Partha Ranganathan @ HP Labs Trends in Processor

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

Phase Aware Warp Scheduling: Mitigating Effects of Phase Behavior in GPGPU Applications

Phase Aware Warp Scheduling: Mitigating Effects of Phase Behavior in GPGPU Applications Phase Aware Warp Scheduling: Mitigating Effects of Phase Behavior in GPGPU Applications Mihir Awatramani, Xian Zhu, Joseph Zambreno and Diane Rover Department of Electrical and Computer Engineering Iowa

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

CS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel

CS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel CS 267 Applications of Parallel Computers Lecture 23: Load Balancing and Scheduling James Demmel http://www.cs.berkeley.edu/~demmel/cs267_spr99 CS267 L23 Load Balancing and Scheduling.1 Demmel Sp 1999

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Motivation Memory is a shared resource Core Core Core Core

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels Characterizing Performance and Power towards Efficient Synchronization of GPU s Islam Harb Department of Computer Science Virginia Tech Blacksburg, Virginia, USA Electronic Research Institute, Egypt iharb@vt.edu

More information

FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations

FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for oc Modeling in Full-System Simulations Michael K. Papamichael, James C. Hoe, Onur Mutlu papamix@cs.cmu.edu, jhoe@ece.cmu.edu, onur@cmu.edu

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information