The Concurrent Collections (CnC) Parallel Programming Model. Kathleen Knobe Intel.
1 The Concurrent Collections (CnC) Parallel Programming Model Kathleen Knobe Intel *Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners
2 Cholesky Performance Aparna Chandramowlishwaran Rich Vuduc (Georgia Tech) Intel 2-socket x 4-core 2.8 GHz + Intel MKL
3 Eigensolver Performance ~2x speedup over MKL Intel 2-socket x 4-core 2.8 GHz + Intel MKL
4 Why is Concurrent Collections interesting for autotuners?
Goal of autotuning: find the best options within the constraints of the application.
Goal of Concurrent Collections: separation of concerns.
  Domain expert: semantic requirements only
  Tuning expert: maximal flexibility (minimal constraints) for tuning
What is a tuning expert?
  The same person as the domain expert at a different time
  A different person
  A naïve runtime
  A sophisticated static analyzer
  An autotuner
Concurrent Collections minimizes the constraints for the autotuner.
5 Intel Concurrent Collections
Goal: serious separation of concerns.
The application problem (the work of the domain expert):
  Semantic correctness
  Constraints required by the application
  The domain expert does not need to know about parallelism.
Concurrent Collections spec
Mapping to target platform (the work of the tuning expert):
  Architecture
  Actual parallelism
  Locality
  Overhead
  Load balancing
  Distribution among processors
  Scheduling within a processor
  The tuning expert does not need to know about the domain.
6 The problem
Serial languages over-constrain orderings:
  Require arbitrary serialization
  Allow for overwriting of data
  The decisions of if and when to execute are bound together
Parallel programming languages embedded within serial languages:
  Inherit the problems of serial languages
  Are too specific with respect to the type of parallelism in the application and the type of target architecture
7 Raise the level of the programming model just enough to avoid over-constraints
Concurrent Collections (only semantically required constraints)
Explicitly serial languages (over-constrained)
Explicitly parallel languages (over-constrained)
8 Outline
Language concepts
Runtime
  Abstract execution model
  Actual systems
2 examples of flexibility
  Checkpointing
  Interesting target architectures
9 Outline
Language concepts
Runtime
Examples of flexibility
10 The Big Idea
Don't specify what operations run in parallel: difficult and depends on target.
Specify only the semantic ordering requirements: easier and depends only on application.
11 Exactly two sources of ordering requirements
Producer / Consumer (data dependence): producer must execute before consumer.
Controller / Controllee (control dependence): controller must execute before controllee.
12 Notation
                   White board   Textual   Slideware
Computation step   foo           (foo)     (foo)
Data item          x             [x]       [x]
Control tag        T             <T>       <T>
13 Exactly two sources of ordering requirements
Producer / Consumer (data dependence): producer must execute before consumer.
  Producer - consumer: (step1) [item] (step2)
Controller / Controllee (control dependence): controller must execute before controllee.
  Controller - controllee: (step1) <t2> (step2)
14 Tag collections: new concept
Tags are:
  A generalization of iteration space
  Typically related to the semantics of the application
(step1) <t2> (step2)
Loop nests: the components of tag <t2> are k and j.
(step2)
  loop k =
    loop j =
      loop i =
        Z(i, j, k) =
        = A(k, j)
      end
    end
  end
Loop iteration space is a sequence <1,1,1>, <1,1,2>, ...
  The body of the loop has access to its indices.
  As each tuple in the sequence is created, the associated body is executed.
The related tag collection is a set {<1,1>, <1,2>, ...}
  The step code has access to its tags.
  The time the associated body is executed is yet to be determined.
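The contrast between a loop iteration space and a tag collection can be sketched in C++. This is only an illustration and not the CnC API: `Tag`, `run_loop_nest`, and `run_tag_collection` are hypothetical names, and the "runtime" here simply picks reverse order to stand in for an execution time that is yet to be determined.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <set>
#include <utility>
#include <vector>

using Tag = std::pair<int, int>;  // hypothetical tag with components <k, j>

// Serial loop nest: the iteration space is a sequence, and the body runs
// as soon as each tuple is created.
std::vector<Tag> run_loop_nest(int nk, int nj,
                               const std::function<void(Tag)>& body) {
    assert(nk >= 0 && nj >= 0);
    std::vector<Tag> order;
    for (int k = 1; k <= nk; ++k) {
        for (int j = 1; j <= nj; ++j) {
            Tag t{k, j};
            body(t);             // executed immediately, in sequence order
            order.push_back(t);
        }
    }
    return order;
}

// Tag collection: the tags form an unordered set; when each step instance
// runs is decided later. Reverse order is an arbitrary stand-in choice.
std::vector<Tag> run_tag_collection(const std::set<Tag>& tags,
                                    const std::function<void(Tag)>& step) {
    std::vector<Tag> order(tags.begin(), tags.end());
    std::reverse(order.begin(), order.end());
    for (const Tag& t : order) step(t);  // the step code has access to its tag
    return order;
}
```

Both functions return the order in which instances ran. When the step body depends only on its tag, the two orders differ while the computed results agree.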
15 Tag collections: new concept
Tag collections are like iteration spaces but more general:
  Not just loop indices
    Trees (e.g., recursion)
    Graphs (e.g., irregular mesh)
    Sets (employees, molecules, frames of images)
  Not sequential, but rather unordered
  Named, to facilitate reuse
16 Collections of dynamic instances
Static: <t> (s) (q)
Dynamic:
  <t: 1> <t: 2> <t: 3> <t: 4>
  (s: 1) (s: 2) (s: 3) (s: 4)
  (q: 1) (q: 2) (q: 3) (q: 4)
17 Collections of dynamic instances
Static: (s1) [i] (s2)
Dynamic: [i: 6] (s1: 5) [i: 7] (s2: 7) (s2: 4) [i: 13] (s1: 12) [i: 4] (s1: 3) [i: 5] (s2: 13)
18 An Application (simple in the extreme)
Break up an input string into sequences of repeated single characters.
Filter, allowing only sequences of odd length.
Input string: aaaffqqqmmmmmmm
Sequences of repeated characters: aaa ff qqq mmmmmmm
Filtered sequences: aaa qqq mmmmmmm
19 How people think about their application: the white board drawing
What are the high-level operations?
What are the chunks of data?
What are the producer/consumer relationships?
What are the inputs and outputs?
[input: j] (createspan: j) [span: j, s] (processspan: j, s) [results: j, s]
[input] = aaaffqqqmmmmmmm
[span: 1] = aaa
[span: 2] = ff
[span: 3] = qqq
[span: 4] = mmmmmmm
[results: 1] = aaa
[results: 3] = qqq
[results: 4] = mmmmmmm
20 Make it precise enough to execute
How do we distinguish among the instances?
Here the tag values are arbitrary. Often the tags have semantic meaning in the application.
<spantag: j, s>: <spantag: 1> <spantag: 2> <spantag: 3> <spantag: 4>
[input: j] (createspan: j) [span: j, s] (processspan: j, s) [results: j, s]
[input] = aaaffqqqmmmmmmm
[span: 1] = aaa
[span: 2] = ff
[span: 3] = qqq
[span: 4] = mmmmmmm
[results: 1] = aaa
[results: 3] = qqq
[results: 4] = mmmmmmm
21 CnC specification
Graph: <spantag> [span] (processspan) [results]
  <spantag: j,s> :: (processspan: j,s);
  [span: j,s] -> (processspan: j,s);
  (processspan: j,s) -> [results: j,s];
Original scalar routine
  string processspanorig(string instr) { /* unchanged */ }
Glue code
  Step processspan(Graph_t &g, Tag_t t) {
      string outstr = processspanorig(g.span.get(t));
      // outstr is a string object, so test its length directly (strlen expects a C string),
      // and put the surviving span into [results] under the same tag.
      if (outstr.length() > 0) g.results.put(t, outstr);
  }
22 Make it precise enough to execute
How do we distinguish among the instances?
<stringtag: j>: <stringtag: 1> <stringtag: 2>
<spantag: j, s>: <spantag: 1, 1> <spantag: 1, 2> <spantag: 1, 3> <spantag: 1, 4>
[input: j] (createspan: j) [span: j, s] (processspan: j, s) [results: j, s]
[input: 1] = aaaffqqqmmmmmmm
[input: 2] = rrhhhhxxx
[span: 1, 1] = aaa
[span: 1, 2] = ff
[span: 1, 3] = qqq
[span: 1, 4] = mmmmmmm
[results: 1, 1] = aaa
[results: 1, 3] = qqq
[results: 1, 4] = mmmmmmm
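The span example can be mimicked with ordinary C++ containers. The sketch below is not the CnC runtime or its API: `SpanTag`, `ItemCollection`, `create_span`, `process_span`, and `run_span_graph` are hypothetical names, a `std::map` stands in for an item collection, and a `std::vector` of tags stands in for the `<spantag>` collection.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using SpanTag = std::pair<int, int>;                    // <j, s>: input j, span s
using ItemCollection = std::map<SpanTag, std::string>;  // stand-in for [span], [results]

// (createspan: j): split input j into runs of one repeated character,
// putting each run into [span] and its tag into <spantag>.
void create_span(int j, const std::string& input, ItemCollection& span,
                 std::vector<SpanTag>& span_tags) {
    int s = 0;
    std::size_t start = 0;
    while (start < input.size()) {
        std::size_t end = start;
        while (end < input.size() && input[end] == input[start]) ++end;
        SpanTag tag(j, ++s);
        span[tag] = input.substr(start, end - start);
        span_tags.push_back(tag);
        start = end;
    }
}

// (processspan: j, s): keep only spans of odd length.
void process_span(const SpanTag& tag, const ItemCollection& span,
                  ItemCollection& results) {
    assert(span.count(tag) == 1);  // the input item must already be available
    const std::string& text = span.at(tag);
    if (text.size() % 2 == 1) results[tag] = text;
}

ItemCollection run_span_graph(const std::map<int, std::string>& inputs) {
    ItemCollection span, results;
    std::vector<SpanTag> span_tags;
    for (const auto& in : inputs) create_span(in.first, in.second, span, span_tags);
    // Each (processspan) instance touches only items with its own tag, so any
    // order gives the same results; reverse order is used to make that visible.
    for (auto it = span_tags.rbegin(); it != span_tags.rend(); ++it)
        process_span(*it, span, results);
    return results;
}
```

Calling `run_span_graph` with the two input strings from the slide yields the `[results]` items listed there, plus whatever odd-length spans the second input contains.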
23 No thinking about parallelism
Only domain/application knowledge.
The result is:
  Parallel
  Deterministic (with respect to results)
  Race-free
24 Experience with graph design questions
Tag collections: <K>, <Cell tags Initial: K, cell>, <Cell tags Final: K, cell>
Step collections: (Cell detector: K), (Cell tracker: K), (Arbitrator initial: K), (Correction filter: K, cell), (Arbitrator final: K), (Prediction filter: K, cell)
Item collections: [Input Image: K], [Histograms: K], [Cell candidates: K], [Labeled cells initial: K], [State measurements: K], [Motion corrections: K, cell], [Labeled cells final: K], [Final states: K], [Predicted states: K, cell]
25 Outline
Language concepts
Runtime
Examples of flexibility
26 Semantic model: instances acquire attributes
Instances shown: <T: 3, 28> <T: 3, 29> [X: 2, 28] [X: 3, 28] [X: 4, 28] (FOO: 3, 28) [Y: 3]
Annotations:
  When this tag is available
  When these items are available
  This step will execute (sometime)
  and this tag is available
  This tag becomes available
  When this step executes
  this step becomes enabled
27 Execution model
Red: prescribed
Blue: inputs available
Red and blue: enabled
Yellow: executing
Gray: done and dead
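The attributes in the execution model (prescribed, inputs available, enabled, done) can be illustrated with a toy serial runtime. This is a sketch under stated assumptions rather than any actual CnC implementation: `StepInstance` and `TinyRuntime` are hypothetical, there is a single step collection whose instance n reads one item [x: n] and writes [y: n], and the step body (doubling the input) is made up for the example.

```cpp
#include <cassert>
#include <map>

// Attributes a step instance acquires over time.
struct StepInstance {
    bool prescribed = false;        // its control tag is available
    bool inputs_available = false;  // the item it reads is available
    bool done = false;              // it has executed
    bool enabled() const { return prescribed && inputs_available && !done; }
};

class TinyRuntime {
public:
    // Putting a tag prescribes the step instance with that tag.
    void put_tag(int n) { steps_[n].prescribed = true; }

    // Putting an item makes the input of the matching instance available.
    void put_item(int n, int value) {
        x_[n] = value;
        steps_[n].inputs_available = true;
    }

    // Execute every currently enabled instance once; return how many ran.
    int run_enabled() {
        int ran = 0;
        for (auto& entry : steps_) {
            StepInstance& s = entry.second;
            if (!s.enabled()) continue;
            assert(x_.count(entry.first) == 1);
            y_[entry.first] = 2 * x_.at(entry.first);  // made-up step body
            s.done = true;
            ++ran;
        }
        return ran;
    }

    bool has_output(int n) const { return y_.count(n) != 0; }
    int output(int n) const { return y_.at(n); }

private:
    std::map<int, StepInstance> steps_;
    std::map<int, int> x_, y_;
};
```

An instance that is only prescribed, or that only has its inputs, does not run; it runs once both attributes hold, whichever arrived first.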
28 CnC supports not only different runtimes but a wide range of runtime styles
Runtime       Memory        Grain     Distribution   Schedule
HP TStreams   distributed   static    static         static
HP TStreams   distributed   static    static         dynamic
Intel CnC     shared        static    dynamic        dynamic *
Rice CnC      shared        static    dynamic        dynamic
GaTech CnC    shared        dynamic   dynamic        dynamic
* Soon: the ability to plug in an application-specific or a domain-specific scheduler.
29 Outline
Language concepts
Runtime
Examples of flexibility
30 Checkpoint: save execution frontier
As objects become part of the execution frontier, we can save them on disk.
We maintain an independent execution frontier on disk.
31 Restart: from disk
Assume the executing program dies.
We can restart from the execution frontier on disk.
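A minimal sketch of the checkpoint/restart idea follows. The slides do not say how the frontier is represented, so this assumes it is simply the set of available tags plus the available items. `Frontier`, `save_frontier`, `load_frontier`, and `round_trip` are hypothetical names, the whole frontier is written at once rather than incrementally, and a `std::stringstream` stands in for the file on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <set>
#include <sstream>

// Assumed shape of an execution frontier: available tags and items.
struct Frontier {
    std::set<int> tags;
    std::map<int, int> items;  // item tag -> item value
};

// Checkpoint: write the frontier as whitespace-separated integers.
void save_frontier(const Frontier& f, std::ostream& out) {
    out << f.tags.size();
    for (int t : f.tags) out << ' ' << t;
    out << ' ' << f.items.size();
    for (const auto& kv : f.items) out << ' ' << kv.first << ' ' << kv.second;
    out << '\n';
}

// Restart: rebuild the frontier from what was saved.
Frontier load_frontier(std::istream& in) {
    Frontier f;
    std::size_t n = 0;
    in >> n;
    for (std::size_t i = 0; i < n; ++i) {
        int t = 0;
        in >> t;
        f.tags.insert(t);
    }
    in >> n;
    for (std::size_t i = 0; i < n; ++i) {
        int key = 0, value = 0;
        in >> key >> value;
        f.items[key] = value;
    }
    assert(!in.fail());  // the saved frontier must be complete
    return f;
}

// Save and immediately reload, using an in-memory stream as the "disk".
Frontier round_trip(const Frontier& f) {
    std::stringstream disk;
    save_frontier(f, disk);
    return load_frontier(disk);
}
```

After a restart, a runtime would re-derive which step instances are enabled from the restored tags and items and continue from there.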
32 Checkpoint/continue
Checkpoint/restart implies a full stop on failure.
Checkpoint/continue would continue around the failed component.
33 Dynamic GPU/vector operations
Step collections: red step collection, blue step collection.
Single bucket: all enabled steps.
Bucket per collection: enabled blue steps; enabled red steps.
34 ILP
Step fusion yields additional parallelism.
Red step collection and blue step collection, with tags <b> and <i, j>: compiled independently for ILP.
Red-blue step collection, with tag <b, i, j>: for ILP, compiled together assuming independence between red and blue.
35 ILP
Pick an enabled blue step, an enabled red step, or one of each.
Buckets: enabled blue steps; enabled red steps.
36 The Intel community; the HP community
DPD: Shin Lee, Steve Rose, Leo Treggiari, Ilya Cherny, Frank Schlimbach, Ganesh Rao, Nikolay Kurtov
SPI: Kath Knobe, Geoff Lowney, Mark Hampton, Ryan Newton
Others: Steve Lang, John Pieper, Carl Offner, Alex Nelson
The academic community
Rice University: Vivek Sarkar, Zoran Budimlic, Sagnak Tasirlar, David Peixotto
Georgia Tech: Rich Vuduc, Aparna Chandramowlishwaran
Colorado State: Michelle Strout
Novosibirsk: Nikolay Kurtov
37 whatif.intel.com
Administrative L23: Parallel Programming Retrospective Schedule for the rest of the semester - Midterm Quiz = long homework - Return by Dec. 15 - Projects - 1 page status report due TODAY handin cs4961
More informationIntroduction to High Performance Parallel I/O
Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationLesson 1 1 Introduction
Lesson 1 1 Introduction The Multithreaded DAG Model DAG = Directed Acyclic Graph : a collection of vertices and directed edges (lines with arrows). Each edge connects two vertices. The final result of
More informationAffine Loop Optimization using Modulo Unrolling in CHAPEL
Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers
More informationIntel Array Building Blocks (Intel ArBB) Technical Presentation
Intel Array Building Blocks (Intel ArBB) Technical Presentation Copyright 2010, Intel Corporation. All rights reserved. 1 Noah Clemons Software And Services Group Developer Products Division Performance
More information1 Motivation for Improving Matrix Multiplication
CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n
More informationCompsci 590.3: Introduction to Parallel Computing
Compsci 590.3: Introduction to Parallel Computing Alvin R. Lebeck Slides based on this from the University of Oregon Admin Logistics Homework #3 Use script Project Proposals Document: see web site» Due
More informationHint-based Static Analysis for Automatic Program Parallelization. Steve Simon Joseph Fernandez CSCI 5535 University of Colorado, Boulder
Hint-based Static Analysis for Automatic Program Parallelization Steve Simon Joseph Fernandez CSCI 5535 University of Colorado, Boulder Background Year Hardware Software (mostly) 1998 Single Core Single
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationRelational Data Model
Relational Data Model 1. Relational data model Information models try to put the real-world information complexity in a framework that can be easily understood. Data models must capture data structure
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationImproving Error Checking and Unsafe Optimizations using Software Speculation. Kirk Kelsey and Chen Ding University of Rochester
Improving Error Checking and Unsafe Optimizations using Software Speculation Kirk Kelsey and Chen Ding University of Rochester Outline Motivation Brief problem statement How speculation can help Our software
More informationModeling and Simulating Discrete Event Systems in Metropolis
Modeling and Simulating Discrete Event Systems in Metropolis Guang Yang EECS 290N Report December 15, 2004 University of California at Berkeley Berkeley, CA, 94720, USA guyang@eecs.berkeley.edu Abstract
More informationMemory Allocation. Static Allocation. Dynamic Allocation. Dynamic Storage Allocation. CS 414: Operating Systems Spring 2008
Dynamic Storage Allocation CS 44: Operating Systems Spring 2 Memory Allocation Static Allocation (fixed in size) Sometimes we create data structures that are fixed and don t need to grow or shrink. Dynamic
More information