Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
September 19, 2007
Markus Levy, EEMBC and Multicore Association

Enabling the Multicore Ecosystem
Multicore Association
- Initial engagement began in May 2005
- Industry-wide participation
- Current efforts: communications APIs, debug API
Embedded Microprocessor Benchmark Consortium (EEMBC)
- Industry benchmarks since 1997
- Evaluating current and future MP platforms
- Uncovering MP bottlenecks
- Comparing multiprocessor vs. uniprocessor

What Is Multicore's Value?
Is the whole less than, equal to, or greater than the sum of its parts?
Answer: it depends on the application

Multicore Issues to Solve
- Communications, synchronization, and resource management between/among cores
- Debugging: connectivity, synchronization
- Distributed power management
- Concurrent programming
- OS virtualization
- Modeling and simulation
- Load balancing
- Algorithm partitioning
- Performance analysis

Perspective on COTS
- Ready-made solutions help overcome some of these challenges
- Extracting the full benefits requires effort
- COTS = Careful Optimizations To Succeed

Multicore for Multiple Reasons
Increase compute density
- Centralization of distributed processing
- All cores can run entirely different things; for example, four machines running in one package (diagram: App1, App2, App3, App4, one per core)

Multicore for Multiple Reasons
Functional partitioning
- While possible with multiprocessors, benefits from proximity and possible data sharing
- Core 1 runs security, Core 2 runs the routing algorithm, Core 3 enforces policy, etc. (diagram: Security, Routing, Control, QoS)

Multicore for Multiple Reasons
Asymmetric multiprocessing (AMP)
- Minimizes overhead and synchronization issues
- Core 1 runs a legacy OS, Core 2 runs an RTOS, others do a variety of processing tasks (i.e., where applications can be optimized) (diagram: Linux, RTOS, Video Compress, Security)

Multicore for Multiple Reasons
Parallel pipelining
- Taking advantage of proximity
- The performance opportunity (diagram: an APPLICATION decomposed into Thread1, Thread2, Thread3, Thread4, ... ThreadN)

Benchmarking Multicore: What's Important?
- Scalability where contexts exceed resources
- Single versus multiprocessor
- Memory and I/O bandwidth
- Inter-core communications
- OS scheduling support
- Efficiency of synchronization
- System-level functionality

Bird's Eye View of Multicore Performance Analysis
Shared memory
- Semantics of a thread-based API
- Supporting coherency
- Applicable to MPC8641D, Intel x86
Message based
- Heterogeneous MP solutions lack a common programming paradigm
- Distributed memory architectures
- Applicable to i.MX devices, many others
Scenario driven ("black box" approach)
- Performance based on real implementations
- Define methodology, inputs, expected outputs

Phase 1 Methodology for Benchmarking Multicore
- Software assumes homogeneity across processing elements
- Scalability of general-purpose processing, as opposed to accelerator offloading
Contexts
- Standard technique to express concurrency
- Each context has symmetric memory visibility
Abstraction
- Test harness abstracts the hardware into a basic software threading model
- EEMBC has a patent pending on the methodology and implementation
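
As a rough illustration of the abstraction idea above (not the actual EEMBC harness API, which is not shown here), a harness might map its "context" concept onto POSIX threads, so that every context shares one address space and therefore has symmetric memory visibility. The al_context names below are hypothetical:

/* Hypothetical sketch of a harness abstraction layer: one "context"
 * maps onto one POSIX thread, giving symmetric memory visibility. */
#include <pthread.h>

typedef struct {
    pthread_t handle;           /* underlying thread for this context */
} al_context;

/* Start one execution context running entry(arg). */
int al_context_create(al_context *ctx, void *(*entry)(void *), void *arg)
{
    return pthread_create(&ctx->handle, NULL, entry, arg);
}

/* Wait for a context to finish. */
int al_context_join(al_context *ctx)
{
    return pthread_join(ctx->handle, NULL);
}

The harness would then create one such context per concurrent work item, regardless of the underlying hardware.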

Multicore Benchmarking Primer
- Multicore performance analysis is a multi-faceted problem
- More cores does not automatically mean more performance
- A major hardware architecture and programming issue

Workloads, Work Items, Concurrency
- A workload can contain 1 or more work items, each executed <N> times
- Up to <N> work items can execute concurrently
- Performance is limited by OS scheduling, number of cores, and memory bandwidth
- Similar to SPECrate
(Diagram: Workload = Work Item A, executed N times)
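
A minimal sketch of that structure on a POSIX-threads platform (illustrative only, not EEMBC code): one work item is launched <N> times at once, and the OS scheduler maps the resulting contexts onto the available cores, SPECrate-style.

/* Illustrative workload driver: launch N_ITEMS copies of one work item
 * concurrently and let the OS schedule them across the cores. */
#include <pthread.h>
#include <stdio.h>

#define N_ITEMS 16                        /* <N>: copies of work item A */

static void *work_item_a(void *arg)
{
    long id = (long)arg;
    volatile double acc = 0.0;
    for (long i = 1; i <= 1000000; i++)   /* stand-in compute kernel */
        acc += 1.0 / (double)i;
    printf("item %ld done\n", id);
    return NULL;
}

int main(void)
{
    pthread_t items[N_ITEMS];

    for (long i = 0; i < N_ITEMS; i++)    /* launch all items at once */
        pthread_create(&items[i], NULL, work_item_a, (void *)i);
    for (int i = 0; i < N_ITEMS; i++)
        pthread_join(items[i], NULL);
    return 0;
}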

Avoid Oversubscription
(Chart: speedup vs. number of concurrent items, Repeats=1, x-axis from 1 to 64; one curve per kernel: aifftr, aifirf, aiifft, autcor, bitmnp, canrdr, conven, fft, iirflt, pntrch, puwmod, rspeed, tblook, and viterb templates)
- Linear speedup until reaching 16 concurrent items = number of cores
- Effects vary based on algorithm and platform
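
One way a harness might guard against oversubscription on Linux-like systems is to cap the number of concurrent work items at the number of online cores. A sketch; sysconf(_SC_NPROCESSORS_ONLN) is a common query but not available on every embedded OS:

/* Cap the requested concurrency at the number of online cores,
 * which is where the speedup curves above stop being linear. */
#include <unistd.h>

long max_concurrent_items(long requested)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);   /* cores visible to the OS */
    if (cores < 1)
        cores = 1;
    return requested < cores ? requested : cores;
}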

Workloads Can Contain Multiple Contexts
- Split a work item into multiple contexts
- Increases parallelism within algorithms
- Does it always improve performance?
(Diagram: Workload = Work Item A, executed N times, each item split into contexts)

Synchronization Overhead and Parallelization Side Effects
- template2: 1 instance of rgbhpg (rate)
- template4: 1 instance of rgbhpg, dual context
- template6: 1 instance of rgbhpg, quad context
(Chart: speedup vs. number of contexts used, Repeats=100; curves for template2, template4, template6)
- Splitting the job in 2 and using 6 concurrent items (a total of 12 software contexts) is more efficient than using 12 concurrent items, or than splitting in 4 and using 3 concurrent items

Increasing Workload Complexity and Realism
- A workload can contain a mixture of items
- Example: Item A applied to 2 different datasets; Item B explicitly using 4 execution contexts
(Diagram: Workload = Work Item A0, Work Item A1, Work Item B0)

Cache Effects Hit Early
Combined filter items:
- 1 instance of rgbyiq using one context
- 1 instance of rgbhpg using one context
- 2 instances of rgbhpg using 2 contexts each (RGB-to-HPG conversion, splitting the image processing among 2 contexts)
(Chart: speedup vs. number of contexts, Repeats=1, x-axis from 1 to 64; speedup maxes out at 13x, with a break in linearity before the system becomes oversubscribed)

Memory Limitations
Test workload for an H.264 encoder:
- 2 contexts on each stream
- 6 streams, 2 instances of each
(Chart: x264 speedup per item vs. number of concurrent streams, x-axis from 1 to 64)
- The number of software contexts used is double the number of concurrent items (2 contexts per stream). Still, the overall benefit over the single-core case is 9x rather than 16x on a 16-core system.
- NOTE: memory limited after 4 items in parallel

What Does This Picture Have in Common With Multicore?

Rotate Benchmark Stresses Multicore in Many Ways
A few points describing this benchmark:
- Rotates a greyscale picture 90 degrees counter-clockwise
- Data shown is specifically for a 4-Mpixel picture
- Benchmark parameters allow different slice sizes, concurrent items, and numbers of workers
- Workers = number of software contexts working on rotating a single picture concurrently
- Items = number of pictures to process
- Slice size = number of lines in a horizontal slice of the photo for each worker to process each time
- Note: when the total number of SW contexts <= number of HW contexts, there is no difference
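
The sketch below captures the structure such a rotate kernel might have (illustrative, not the EEMBC source): WORKERS contexts rotate one 2048x2048 (4-Mpixel) greyscale image 90 degrees counter-clockwise, each repeatedly claiming the next horizontal slice of SLICE lines under a lock. That claim is where the synchronization cost grows as slices get finer, which is exactly the trade-off the following charts explore.

/* Sketch of slice-based decomposition of a 90-degree CCW rotate. */
#include <pthread.h>
#include <stdlib.h>

#define W 2048
#define H 2048
#define SLICE 4                      /* lines per claimed slice */
#define WORKERS 4                    /* contexts per work item  */

static unsigned char *src, *dst;     /* src is HxW, dst is WxH  */
static int next_line;                /* shared slice counter    */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);   /* claim the next slice; finer
                                        slices mean more sync overhead */
        int first = next_line;
        next_line += SLICE;
        pthread_mutex_unlock(&lock);
        if (first >= H)
            return NULL;
        int last = first + SLICE < H ? first + SLICE : H;
        for (int y = first; y < last; y++)
            for (int x = 0; x < W; x++)   /* src (y,x) -> dst (W-1-x, y) */
                dst[(size_t)(W - 1 - x) * H + y] = src[(size_t)y * W + x];
    }
}

int main(void)
{
    src = malloc((size_t)W * H);
    dst = malloc((size_t)W * H);
    if (!src || !dst)
        return 1;
    pthread_t t[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(t[i], NULL);
    free(src);
    free(dst);
    return 0;
}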

Balance Data Usage and Synchronization
(Chart: iterations/sec vs. number of workers per item (decomposition level), Slice=1 (high sync granularity); different lines indicate the number of concurrent items: 1, 2, 4, 8, 16)
- Note the sharp drop in performance as the number of workers climbs, due to sync overhead
- Also note that running 2 items concurrently, each with 4 workers processing the image, achieves the highest score on this platform

Lots of Data Slams Cache
(Chart: iterations/sec vs. number of workers per item (decomposition level), Slice=64 (medium sync granularity); different lines indicate the number of concurrent items: 1, 2, 4, 8, 16, 64)
- The highest performance point for this level of granularity is achieved by running one item, with 5 workers processing the image
- Although a slice size of 64 means less synchronization is required, it also means more impact on the cache

Critical to Optimize for Scheduling
(Chart: iterations/sec vs. number of concurrent items, Workers=1; different lines indicate slice size, affecting sync granularity: 1, 2, 4, 16, 64)
- With only one worker, sync granularity is irrelevant; the number of items running concurrently determines the speedup, up to the point where the system and scheduling combine to drop performance sharply as concurrent items compete for machine resources

Exposing Memory Bottlenecks
(Chart: iterations/sec vs. number of concurrent items, Workers=8; different lines indicate slice size, affecting sync granularity: 1, 2, 4, 16, 64)
- With 8 workers processing the image, the granularity of synchronization is a major factor determining performance
- Slice sizes of 2, 4, and 16 are closest in performance; interestingly, slice = 4 is optimal

Less Is More: The Effects of Rotating One Item
(Chart: iterations/sec vs. number of workers (decomposition level), Concurrent Items=1; different lines indicate slice size: 1, 2, 4, 16, 64)
- With only one instance of the kernel to process the data, a slice size of 4 is optimal

More Is Less
(Chart: iterations/sec vs. number of workers (decomposition level), Concurrent Items=16; different lines indicate slice size: 1, 2, 4, 16, 64)
- With 16 concurrent items, maximum performance is achieved with a slice size of 16, but even that maximum is less than 0.6 iterations/sec, while the best configuration for the system gives over 2.6 iterations/sec

Communications API (MCAPI) Target Domain
- Multiple cores on a chip and multiple chips on a board
- Closely distributed and/or tightly coupled systems
- Heterogeneous and homogeneous systems: cores, interconnects, memory architectures, OSes
- Scalable: 2 to 1000s of cores
- Assumes the cost of messaging/routing < the computation of a node

CAPI Features and Performance Considerations
- Very small runtime footprint
- Low-level messaging API
- Low-level streaming API
- High efficiency
- Seeking reviewers for the pre-release version
(Diagram: applications sit on top of CAPI, either directly or through further OS/middleware abstraction; CAPI spans the interconnect architecture and topology down to the CPU/DSP cores and hardware accelerators. Courtesy of PolyCore Software)

Benchmark Implementation on a Message-Based Platform
- Uses CAPI as the underlying infrastructure and test harness
- Portable implementation, even with a heterogeneous platform
- Divide applications and/or algorithms into the smallest possible functional modules
- Each module is compiled separately
(Diagram: EEMBC benchmark modules sit on top of CAPI, directly or through further OS/middleware abstraction, over the interconnect and the CPU/DSP/accelerator hardware)
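
Because the CAPI interfaces were still pre-release, the sketch below only models the style of such an implementation: two functional modules that exchange data purely through messages, stood in for here by an in-process bounded queue between two POSIX threads. On a real heterogeneous platform the modules would be compiled separately and msg_send/msg_recv would be replaced by the CAPI messaging layer running over the actual interconnect.

/* Two modules communicating only by messages; the queue stands in
 * for a CAPI-style messaging layer in this single-process sketch. */
#include <pthread.h>
#include <stdio.h>

#define QSIZE 8

static int q[QSIZE];
static int q_head, q_count;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t q_not_full  = PTHREAD_COND_INITIALIZER;

static void msg_send(int value)                  /* stand-in for a message send */
{
    pthread_mutex_lock(&q_lock);
    while (q_count == QSIZE)
        pthread_cond_wait(&q_not_full, &q_lock);
    q[(q_head + q_count) % QSIZE] = value;
    q_count++;
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_lock);
}

static int msg_recv(void)                        /* stand-in for a message receive */
{
    pthread_mutex_lock(&q_lock);
    while (q_count == 0)
        pthread_cond_wait(&q_not_empty, &q_lock);
    int value = q[q_head];
    q_head = (q_head + 1) % QSIZE;
    q_count--;
    pthread_cond_signal(&q_not_full);
    pthread_mutex_unlock(&q_lock);
    return value;
}

static void *filter_module(void *arg)            /* second functional module */
{
    (void)arg;
    for (int i = 0; i < 4; i++)
        printf("filtered sample %d\n", msg_recv());
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, filter_module, NULL);
    for (int i = 0; i < 4; i++)                  /* first module produces data */
        msg_send(i * 10);
    pthread_join(t, NULL);
    return 0;
}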

Multicore Opportunities and Challenges
- Multicore technology is inevitable; it's time to begin implementing
- The Multicore Association will help ease the transition
- EEMBC will help analyze the performance benefits
- Patent pending on the new multicore benchmark technique