Politecnico di Milano
1 Automatic parallelization of sequential specifications for symmetric MPSoCs
Fabrizio Ferrandi, Luca Fossati, Marco Lattuada, Gianluca Palermo, Donatella Sciuto, Antonino Tumeo
Politecnico di Milano
IESS '07, Irvine, California, USA, Thursday, May 31
2 Outline
- Introduction
- Related Work
- Target Architecture: CerberO
- Parallelization
- Partitioning
- Task start conditions
- Experimental Results
- Conclusions
3 Introduction
- On-chip multiprocessors are gaining momentum
- The development of good parallel applications is highly dependent on software tools
- Developers must contend with several problems not encountered in sequential programming: non-determinism, communication, synchronization, data partitioning and distribution, load balancing, heterogeneity, shared or distributed memory, deadlocks, race conditions
- This work proposes an approach for automatic parallelization of sequential programs
4 Objectives
- This work focuses on a complete design flow, from the high-level sequential C description of the application to its deployment on a multiprocessor system-on-chip prototype
- The sequential code is partitioned into tasks with a specific clustering algorithm
- The resulting task graph is then optimized and the parallel C code is generated (C-to-C transformation)
- The generated tasks are dynamically schedulable
  - Boolean conditions are evaluated at run time
  - A task is started as soon as its real data dependences, and only those, are satisfied
5 Related Work
- Several strategies for partial automation of the parallelization process have been proposed:
  - Problem-solving environments that generate parallel programs from high-level sequential descriptions
  - Machine-independent code annotations
- The parallelization process:
  - The initial specification is parsed into an intermediate graph representation
  - The intermediate representation is partitioned
  - An initial task graph is obtained; tasks then need to be allocated to processors through clustering and cluster scheduling (merging)
6 Related Work (Partitioning)
- [Girkar et al.]:
  - A Hierarchical Task Graph (HTG) (we do not use hierarchy)
  - Simplification of the conditions for execution of task nodes
- [Luis et al.]: extend Girkar's work by using a Petri net model to represent parallel code
- [Newburn and Shen]: the PEDIGREE compiler, a flow for automatic parallelization (it is not a C-to-C tool)
  - The Program Dependence Graph (PDG) is analysed, searching for control-equivalent regions
  - The regions are then partitioned bottom-up
7
8 Parallelization: the flow
- The sequential C code is compiled with a slightly modified version of GCC 4.0
- The internal structures are dumped
- The C-to-C partitioning algorithm works on a modified system dependence graph (SDG)
- Code is generated for both OpenMP and CerberO
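As a rough illustration of the OpenMP side of this flow, two clusters with no mutual data dependences could be emitted as sections of one parallel region. This is only a sketch: `task0`, `task1` and `run_tasks` are hypothetical names, and the slides do not show the code the tool actually generates.

```c
#include <assert.h>

/* Hypothetical task bodies: each one stands for a cluster of statements. */
static int task0(int x) { return x * 2; }
static int task1(int y) { return y + 3; }

/* Runs the two independent tasks. With OpenMP enabled (e.g. gcc -fopenmp)
 * each section may run on its own thread; without it the pragmas are
 * ignored and the sections simply run in order. */
void run_tasks(int x, int y, int *out0, int *out1) {
    assert(out0 && out1);
    int r0 = 0, r1 = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        r0 = task0(x);

        #pragma omp section
        r1 = task1(y);
    }
    /* The region ends with an implicit barrier, so both results are ready. */
    *out0 = r0;
    *out1 = r1;
}
```

Because the result is the same with and without `-fopenmp`, a build without OpenMP can serve as the sequential reference during functional validation.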
9 FSDG Creation
- Vertices: statements or predicate expressions
- Grey solid edges: data dependences
- Black edges: control dependences
- Black dashed edges: both
- Grey dashed edges: feedback edges (loops)
- All loops are converted into do-while loops
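The loop conversion can be pictured with a small C example. The rewrite below (one guard followed by a do-while) is a common way to perform it and is shown as an assumption; the slides do not state which exact transformation the tool applies. The function names are illustrative.

```c
#include <assert.h>

/* Original form: the predicate is tested before every iteration. */
int sum_while(int n) {
    int i = 0, acc = 0;
    while (i < n) {
        acc += i;
        i++;
    }
    return acc;
}

/* Converted form: a single guard, then a do-while whose predicate sits at
 * the bottom of the body, where the feedback edge of the loop originates. */
int sum_do_while(int n) {
    int i = 0, acc = 0;
    if (i < n) {
        do {
            acc += i;
            i++;
        } while (i < n);
    }
    return acc;
}

/* Both forms must compute the same value for any n. */
void check_loops_agree(int n) {
    assert(sum_while(n) == sum_do_while(n));
}
```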
10 Partitioning
- First step: feedback edge analysis
  - A partition for each loop
  - A partition for nodes not in loops
- Second step: control edge analysis
  - Recognition of control-equivalent (CE) regions
  - Statement nodes descending from the same branch condition (TRUE or FALSE) of a predicate node are grouped together
  - Each region presents potential parallelism
- Third step: data dependence analysis of CE regions
  - Depth-first exploration. A node is added to a cluster if:
    - it depends on one and only one node
    - all its predecessors have already been added
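The third step can be sketched as follows. This is one literal reading of the two rules on the slide, not the authors' implementation: a node with exactly one predecessor joins that predecessor's cluster, any other node opens a new cluster, and a node is visited only after all its predecessors have been placed. The `DepGraph` type and function names are hypothetical, and the region is assumed to be acyclic (feedback edges are handled in the first step).

```c
#include <assert.h>

#define MAX_NODES 16

/* dep[u][v] != 0 means node v has a data dependence on node u (edge u -> v). */
typedef struct {
    int n;
    int dep[MAX_NODES][MAX_NODES];
} DepGraph;

static int pred_count(const DepGraph *g, int v) {
    int count = 0;
    for (int u = 0; u < g->n; u++)
        count += (g->dep[u][v] != 0);
    return count;
}

static int all_preds_assigned(const DepGraph *g, int v, const int *cluster) {
    for (int u = 0; u < g->n; u++)
        if (g->dep[u][v] && cluster[u] < 0)
            return 0;
    return 1;
}

/* Depth-first visit: place v, then try each of its successors. */
static void visit(const DepGraph *g, int v, int *cluster, int *next_id) {
    if (cluster[v] >= 0 || !all_preds_assigned(g, v, cluster))
        return;
    if (pred_count(g, v) == 1) {
        for (int u = 0; u < g->n; u++)
            if (g->dep[u][v])
                cluster[v] = cluster[u];   /* join the only predecessor */
    } else {
        cluster[v] = (*next_id)++;         /* root or join node: new cluster */
    }
    for (int w = 0; w < g->n; w++)
        if (g->dep[v][w])
            visit(g, w, cluster, next_id);
}

/* Fills cluster[0..n-1] with cluster ids and returns the number of clusters. */
int cluster_region(const DepGraph *g, int *cluster) {
    assert(g && cluster && g->n >= 0 && g->n <= MAX_NODES);
    int next_id = 0;
    for (int v = 0; v < g->n; v++)
        cluster[v] = -1;
    for (int v = 0; v < g->n; v++)
        if (pred_count(g, v) == 0)
            visit(g, v, cluster, &next_id);
    return next_id;
}
```

For two independent chains 0 -> 1 and 2 -> 3 that both feed node 4, this yields three clusters: the two chains, which can run in parallel, and the join node.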
11 Partitioning
12 Optimizations
- The partitioning phase tends to produce too many small clusters
- Task management overhead could eliminate all the advantages of parallel execution
- Two types of optimizations
- Optimizations on control structures:
  - Control predicates are executed before the statements beneath them
  - The control predicate and the instructions that depend on it are grouped in the same task
  - Then and Else clauses are grouped in the same task, since they are mutually exclusive
  - If control structures are not optimized, they are replicated in order to remove any control dependences among tasks
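Replication of a control structure can be illustrated with a minimal example. The sketch assumes a single predicate `c` guarding two independent assignments; the function names are hypothetical. After replication each task re-tests `c`, so `c` becomes an ordinary data input and no control dependence crosses the task boundary.

```c
#include <assert.h>

/* Sequential form: both assignments are control dependent on c. */
void guarded_sequential(int c, int *x, int *y) {
    assert(x && y);
    if (c) {
        *x = 1;
        *y = 2;
    }
}

/* Replicated form: each task carries its own copy of the test on c. */
void task_x(int c, int *x) {
    assert(x);
    if (c)
        *x = 1;
}

void task_y(int c, int *y) {
    assert(y);
    if (c)
        *y = 2;
}
```

The two tasks can now be scheduled independently once the value of `c` is available.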
13 Optimizations (2)
- Optimizations on data dependences:
  - Data-dependent clusters can be joined together to form a bigger cluster
  - Candidates are those clusters containing fewer than N instructions (N is a predetermined number)
  - The algorithm tries to join a task with its successors
  - Two clusters are grouped if all the data dependence edges exiting from one cluster have the same target cluster
  - The process is repeated until no more clusters are joined or no clusters smaller than N remain
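A minimal sketch of this merging pass, under stated assumptions: clusters and their dependences are held in a small adjacency matrix, a merged-away cluster is marked by size 0, and the `ClusterGraph` type and function names are hypothetical. It follows the rules above but is not the authors' code.

```c
#include <assert.h>

#define MAX_CLUSTERS 16

typedef struct {
    int n;
    int size[MAX_CLUSTERS];                 /* instruction count; 0 = merged away */
    int edge[MAX_CLUSTERS][MAX_CLUSTERS];   /* edge[a][b] != 0: b depends on a */
} ClusterGraph;

/* Returns the only cluster that c's outgoing edges reach, or -1 if c has
 * no successor or more than one. */
static int sole_successor(const ClusterGraph *g, int c) {
    int target = -1;
    for (int t = 0; t < g->n; t++) {
        if (t == c || !g->edge[c][t])
            continue;
        if (target >= 0)
            return -1;
        target = t;
    }
    return target;
}

/* Folds cluster c into cluster t and redirects c's incoming edges to t. */
static void merge_into(ClusterGraph *g, int c, int t) {
    g->size[t] += g->size[c];
    g->size[c] = 0;
    for (int k = 0; k < g->n; k++) {
        if (g->edge[k][c]) {
            if (k != t)
                g->edge[k][t] = 1;
            g->edge[k][c] = 0;
        }
        g->edge[c][k] = 0;
    }
}

/* Repeats until no cluster smaller than min_size can be joined with its
 * single successor. Returns the number of merges performed. */
int merge_small_clusters(ClusterGraph *g, int min_size) {
    assert(g && g->n >= 0 && g->n <= MAX_CLUSTERS);
    int merges = 0, changed;
    do {
        changed = 0;
        for (int c = 0; c < g->n; c++) {
            if (g->size[c] == 0 || g->size[c] >= min_size)
                continue;
            int t = sole_successor(g, c);
            if (t < 0)
                continue;
            merge_into(g, c, t);
            merges++;
            changed = 1;
        }
    } while (changed);
    return merges;
}
```

Here `min_size` plays the role of N. Each merge removes one cluster, so the loop terminates after at most n merges.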
14 Task Creation
- The final clustered FSDG must be translated into specific data structures that effectively represent the tasks
- First step: identification of the task variables
  - Edges coming from the ENTRY node are global variables
  - Edges entering a cluster represent the input parameters
  - Edges leaving a cluster represent the output parameters
  - Edges whose source and destination nodes are both contained in the same cluster are the local variables of the task
  - Edges that represent the same variable: a single variable is instantiated
- Second step: computation of the start conditions
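The edge classification of the first step can be written as a small function. This is a sketch under assumptions: clusters are identified by integer ids, the ENTRY node is given the reserved id `ENTRY_CLUSTER`, and the enum and function names are hypothetical.

```c
#include <assert.h>

#define ENTRY_CLUSTER (-1)   /* assumed id for the ENTRY node */

typedef enum {
    VAR_GLOBAL,      /* edge leaves the ENTRY node */
    VAR_INPUT,       /* edge enters the task's cluster */
    VAR_OUTPUT,      /* edge leaves the task's cluster */
    VAR_LOCAL,       /* both endpoints lie inside the cluster */
    VAR_UNRELATED    /* edge does not touch this task */
} VarRole;

/* Classifies the variable carried by a data edge, seen from one task. */
VarRole classify_edge(int src_cluster, int dst_cluster, int task) {
    assert(task != ENTRY_CLUSTER);
    if (src_cluster == ENTRY_CLUSTER)
        return VAR_GLOBAL;
    if (src_cluster == task && dst_cluster == task)
        return VAR_LOCAL;
    if (dst_cluster == task)
        return VAR_INPUT;
    if (src_cluster == task)
        return VAR_OUTPUT;
    return VAR_UNRELATED;
}
```

Edges that carry the same variable would then map to a single declaration in the generated task, as the last rule of the first step requires.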
15 Start Conditions
- If C1 = TRUE, then a and b must be ready before the task starts
- Else, only c must be ready before the task starts
- When C1 is determined, the branch outcome is known
- If a, b and c are produced by different tasks, there is no need to wait until all three are produced
16 Start Conditions
- A start condition is valid when:
  - A task is started only once (only one of the predecessors of the task can start it)
    - Problem: tracking whether a task has already been started
    - Solution: a boolean variable set to TRUE when the task starts
  - The parameters necessary for a correct execution of the task have already been computed when the start condition evaluates to TRUE
    - Problem: we do not want to wait for all the variables, only for the parameters of the actual execution path
    - Solution: an algorithm that generates start conditions depending on the execution path
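For the C1 / a, b, c case described earlier, a start condition with the "started" flag might look like the sketch below. The `TaskState` structure and function names are hypothetical. The check-and-set in `try_start` is not atomic as written, so on a real multiprocessor it would have to run under a lock; that detail is an assumption, not something the slides spell out.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool c1_known;   /* predicate C1 has been evaluated */
    bool c1;         /* outcome of C1, valid only if c1_known */
    bool a_ready, b_ready, c_ready;
    bool started;    /* set to TRUE when the task starts */
} TaskState;

/* (C1 and a and b) or (not C1 and c), once C1 is determined. */
bool start_condition(const TaskState *s) {
    assert(s);
    if (!s->c1_known)
        return false;
    return (s->c1 && s->a_ready && s->b_ready) || (!s->c1 && s->c_ready);
}

/* Called at the end of each predecessor task. Returns true for exactly one
 * caller: the first one that finds the start condition satisfied. */
bool try_start(TaskState *s) {
    assert(s);
    if (s->started || !start_condition(s))
        return false;
    s->started = true;
    return true;
}
```

When C1 is FALSE the task starts as soon as c is ready, without waiting for a and b.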
17 Start Conditions Algorithm
- First step: explore the input parameters of a task
  - Parameters used in the same control region (i.e. all in the true branch or all in the false branch) are joined by an and operator
  - They must all be ready if the region is going to be executed
  - All the resulting and expressions (one for each control region) are joined by an or operator
- Second step: explore the preceding tasks
  - Search for where the input parameters of the task to start are written
  - If a parameter is written from more than one control flow, all the corresponding paths are joined in an or expression
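The and/or structure built in the first step can be evaluated generically, as in this sketch. It assumes each control region records whether it is going to execute and which parameters it needs; `Region`, `MAX_PARAMS` and `eval_start_condition` are hypothetical names, and the second step (tracing where each parameter is written) is reduced here to a plain `ready` flag per parameter.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PARAMS 8

typedef struct {
    bool active;              /* the region is going to be executed */
    int nparams;
    int params[MAX_PARAMS];   /* indices into the ready[] array */
} Region;

/* OR over regions of (region active AND all of its parameters ready). */
bool eval_start_condition(const Region *regions, int nregions,
                          const bool *ready) {
    assert(regions && ready);
    for (int r = 0; r < nregions; r++) {
        if (!regions[r].active)
            continue;
        bool all_ready = true;
        for (int p = 0; p < regions[r].nparams; p++)
            all_ready = all_ready && ready[regions[r].params[p]];
        if (all_ready)
            return true;
    }
    return false;
}
```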
18 Example
- Tx: TRUE if task x ended
- Cx: if predicate x
- (x): all the possible paths that compute x in the preceding tasks
  - (a) = C0 T0 + C1 T1
  - (b) = C0 T0
  - (c) = C1 T1
  - (d) = T1
- Start condition: C2 (b) + C2 [C3 ((a) (d)) + C3 (c)]
- After substitution: C2 (C0 T0) + C2 [C3 ((C0 T0 + C1 T1) (T1)) + C3 (C1 T1)]
19 Backend
- The condition is inserted at the end of both Task0 and Task1
- When it evaluates to TRUE, Task2 is launched
- Long conditions: BDDs (Binary Decision Diagrams) can be used to reduce the complexity
- A C Writer backend produces the final parallel C:
  - OpenMP-compliant code for functional validation
  - Code compliant with the CerberO platform threading API
20 Experimental Setup
- CerberO architectures with 2 to 6 processors
- The CerberO OS layer is thin, but the thread management routines of the architecture have an overhead
  - Shared memory accesses to store the thread and processor tables
- The sequential programs have been run on a single-processor architecture with:
  - CerberO-like memory mapping
  - CerberO-like thread management
- They have been hand-modified to account for these aspects and to allow a fair comparison
21 Experimental Results (ADPCM)
- At most 4 parallel threads
- Maximum speedup with 4 processors: 70%
- More processors: more synchronization/threading overhead
22 Experimental Results (JPEG)
- RGB-to-YUV and 2D-DCT have been parallelized (70% of the sequential JPEG execution time)
- For the whole JPEG algorithm, the maximum speedup reached is 42%
- At most 4 parallel threads
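To put the JPEG figure in context, Amdahl's law gives an upper bound on the speedup when only part of a program is parallelized. The sketch below is an aside, not part of the original results: with 70% of the execution time parallelized over at most 4 threads, the bound is 1 / (0.3 + 0.7 / 4), about 2.1x, assuming zero threading overhead. Reading the reported 42% as a 1.42x speedup is an assumption about how the slides express speedup. The function name is illustrative.

```c
#include <assert.h>

/* Amdahl's law: ideal speedup when a fraction of the run time is spread
 * evenly over the given number of processors and the rest stays serial. */
double amdahl_speedup(double parallel_fraction, int processors) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(processors >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors);
}
```

Calling `amdahl_speedup(0.7, 4)` gives the ideal bound for this case; thread management and synchronization overhead, which the Experimental Setup slide describes, account for part of the gap between that bound and the measured result.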
23 Conclusions
- Main contributions:
  - A complete design flow from sequential to parallel C code executable on CerberO, a homogeneous multiprocessor system on FPGA
  - A partitioning algorithm that extracts parallelism and transforms all control dependences into data dependences
  - An algorithm that generates start conditions for dynamic thread scheduling without requiring complex operating system support from the target architecture
- The flow has been applied to several standard applications
  - The ADPCM and JPEG algorithms show speedups of up to 70% and 42% respectively
24 Thank you for your attention! Questions?
25 Related Work (Clustering & Merging)
- Clustering algorithms:
  - Dominant sequence clustering (DSC) by [Yang and Gerasoulis]
  - Linear clustering by [Kim and Browne]
  - [Sarkar]'s internalization algorithm (SIA)
- Cluster scheduling:
  - [Hou, Wang]: evolutionary algorithms
  - [Kianzad and Bhattacharyya]: single-step evolutionary approach for both the clustering and cluster-scheduling aspects
26 CerberO details
- A symmetric shared-memory multiprocessor system-on-chip (MPSoC) prototype on FPGA
- Multiple Xilinx MicroBlazes, shared IPs and the controller for the shared external memory reside on the shared bus
- The addressing space of each processor is partitioned into two parts: a private part and a shared part
- Instructions are cached by each processor; data are moved from the shared memory to the fast private memory
- A Synchronization Engine (SE) provides hardware locks and barriers
- A thin operating system layer dynamically schedules and allocates threads
- More details in the GLSVLSI '07 paper by [Tumeo et al.]
CS 470 Spring 2017 Mike Lam, Professor Advanced OpenMP Atomics OpenMP provides access to highly-efficient hardware synchronization mechanisms Use the atomic pragma to annotate a single statement Statement
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationMain Points of the Computer Organization and System Software Module
Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a
More informationLecture 16: Recapitulations. Lecture 16: Recapitulations p. 1
Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently
More informationCS 5220: Shared memory programming. David Bindel
CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationSyCERS: a SystemC design exploration framework for SoC reconfigurable architecture
SyCERS: a SystemC design exploration framework for SoC reconfigurable architecture Carlo Amicucci Fabrizio Ferrandi Marco Santambrogio Donatella Sciuto Politecnico di Milano Dipartimento di Elettronica
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationOS and Hardware Tuning
OS and Hardware Tuning Tuning Considerations OS Threads Thread Switching Priorities Virtual Memory DB buffer size File System Disk layout and access Hardware Storage subsystem Configuring the disk array
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve OPENMP Standard multicore API for scientific computing Based on fork-join model: fork many threads, join and resume sequential thread Uses pragma:#pragma omp parallel Shared/private
More informationJoe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago.
Joe Hummel, PhD Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago email: joe@joehummel.net stuff: http://www.joehummel.net/downloads.html Async programming:
More informationChapter 18 Parallel Processing
Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD
More informationConcurrency, Thread. Dongkun Shin, SKKU
Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point
More informationUvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP
Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel
More informationNative Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization
Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Mian-Muhammad Hamayun, Frédéric Pétrot and Nicolas Fournel System Level Synthesis
More information