Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison
1 Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi, University of Wisconsin-Madison
2 Outline. VLIW. OOO Superscalar. Enhancing Superscalar. And the future.
3 Beyond pipelining to ILP. Late 1980s to mid 1990s: search for a post-RISC architecture (more accurately, an instruction processing model). Desire to do more than one instruction per cycle, i.e., exploit ILP. Two approaches: VLIW/EPIC and out-of-order (OOO) superscalar.
4 VLIW/EPIC School. Descendants of HPC computing experience (array processors). Search for independence (by compiler). Express independence in the static program. Take program/algorithm parallelism and mold it to a given execution schedule for exploiting parallelism. Strive for efficiency: static scheduling to saturate a resource.
5 VLIW/EPIC School. Creating effective parallel representations (statically) introduces several problems: predication, statically scheduling loads, exception handling, recovery code. Lots of research addressing these problems.
6 OOO Superscalar. Not from the HPC school. Non-scientific influence important (e.g., branch prediction). No static search/representation for independence. Arguably, statically representing dependent operations (e.g., accumulator) would help; improvements to superscalar were in this direction.
7 OOO Superscalar. Create dynamic parallel execution from a sequential static representation: dynamic dependence information is accurate; execution schedule is flexible. Parallelism in the application/algorithm is required, but the program representation is sequential. None of the problems associated with trying to create a parallel representation statically.
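The idea of deriving a parallel schedule dynamically from a sequential instruction stream can be sketched as follows. This is a minimal illustration, not the hardware mechanism itself: it assumes unit latency, unlimited functional units, and perfect register renaming (so only true dependences constrain the schedule). The `Instr` type, the register names, and `dataflow_schedule` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One instruction of a sequential program (hypothetical representation)."""
    dest: str      # register written
    srcs: tuple    # registers read


def dataflow_schedule(program):
    """Return the earliest cycle in which each instruction could execute.

    Assumptions (illustrative only): unit latency, unlimited resources,
    and register renaming, so only true (read-after-write) dependences matter.
    """
    ready = {}     # register -> cycle at which its latest value is available
    cycles = []
    for ins in program:  # walk the program in sequential order
        cycle = max((ready.get(r, 0) for r in ins.srcs), default=0)
        cycles.append(cycle)
        ready[ins.dest] = cycle + 1
    return cycles


# Two independent dependence chains written one after the other,
# followed by an instruction that joins them.
program = [
    Instr("r1", ("a",)),
    Instr("r2", ("r1",)),
    Instr("r3", ("b",)),
    Instr("r4", ("r3",)),
    Instr("r5", ("r2", "r4")),
]
schedule = dataflow_schedule(program)
print(schedule)  # [0, 1, 0, 1, 2]
```

The five sequentially written instructions fit in three cycles because the two chains overlap, even though nothing in the program text marks them as independent.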
8 VLIW/Superscalar. Superscalar is not as efficient as VLIW: it has all this extra hardware for dynamic dataflow execution, and it is hard to saturate a resource the way VLIW does. But it provides a natural (sequential) interface for the program generator and is much more adaptable to run-time uncertainties, e.g., resources changing dynamically.
9 Lessons from VLIW/Superscalar. Wisdom from HPC is natural to apply, but is this a good idea? Hardware efficiency may not be the best notion of efficiency. Parallel execution can be achieved even without a parallel representation.
10 Lessons from VLIW/Superscalar. Dataflow execution is much more flexible and adaptable than control flow execution: e.g., new forms of speculation are easily added; e.g., the resource architecture is easily changed. (Hardware) overheads for achieving dynamic dataflow execution are worth the effort.
11 Enhancing Superscalar. How can we build a very large window superscalar processor which can process lots of instructions per cycle? Understand the characteristics of the dynamic dataflow graph and exploit patterns/localities: values tend to be used locally; most parallelism is non-local. Break up centralized hardware into clusters: most value communication stays within a cluster, which reduces the load on the inter-cluster communication network.
12 Lessons from Beyond Superscalar. Parallelism derived from a sequential program (instruction stream) has patterns. Local groups of instructions are dependent; most value communication happens here. Independence is in non-local groups; few values are passed between independent groups. Altering this natural pattern requires a lot of storage. Dependence relationships are highly stable and thus predictable when unknown.
13 Lessons from Beyond Superscalar. For good parallelism exploitation, dependent operations should be packed together (the opposite of conventional wisdom): most value communication is local within a core, and demand on the inter-core communication network is reduced. To get parallelism, focus on dependence, not independence.
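The effect of packing dependent operations together can be illustrated with a small counting experiment. The sketch below is an assumption-laden toy, not a model from the talk: it represents each instruction as a hypothetical `(dest, srcs)` pair, assigns instructions to clusters under two made-up policies, and counts how many values must cross a cluster boundary.

```python
def count_inter_cluster(program, assign):
    """Count producer-to-consumer values that cross a cluster boundary."""
    producer_cluster = {}   # value name -> cluster that produced it
    crossings = 0
    for index, (dest, srcs) in enumerate(program):
        cluster = assign(index, srcs, producer_cluster)
        crossings += sum(
            1 for s in srcs
            if s in producer_cluster and producer_cluster[s] != cluster
        )
        producer_cluster[dest] = cluster
    return crossings


def round_robin(num_clusters):
    """Policy that spreads consecutive instructions across clusters."""
    def assign(index, srcs, producer_cluster):
        return index % num_clusters
    return assign


def follow_producer(num_clusters):
    """Policy that places an instruction with the producer of its first input."""
    next_free = [0]
    def assign(index, srcs, producer_cluster):
        for s in srcs:
            if s in producer_cluster:
                return producer_cluster[s]
        # Start of a new dependence chain: pick the next cluster in turn.
        cluster = next_free[0] % num_clusters
        next_free[0] += 1
        return cluster
    return assign


# Two independent four-instruction dependence chains.
program = [
    ("a0", ()), ("a1", ("a0",)), ("a2", ("a1",)), ("a3", ("a2",)),
    ("b0", ()), ("b1", ("b0",)), ("b2", ("b1",)), ("b3", ("b2",)),
]
scattered = count_inter_cluster(program, round_robin(2))
packed = count_inter_cluster(program, follow_producer(2))
print(scattered, packed)  # 6 0
```

In this toy example both policies keep two clusters busy, but grouping by dependence removes all inter-cluster traffic, while spreading neighbouring instructions forces every value across the interconnect.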
14 More Lessons. For good parallelism exploitation, need lots of temporary storage. Fast access to large storage necessitates creating and exploiting localities. Creating localities for value communication implies grouping together dependent operations. Is sequential (or pipelined) a preferred way of representing parallel computation?
15 Summarizing Lessons. HPC experience may not be the best for small-scale parallelism. Hardware efficiency may not be the best notion of efficiency. Dataflow execution works nicely to unwind available parallelism from a sequential program; overheads may be worth paying.
16 Summarizing Lessons. Statically representing dependence may be better than representing independence: easier dataflow execution, optimizing value communication, minimal resource assumptions, etc. Statically representing parallelism may create problems that are hard to deal with, e.g., exceptions in VLIW, I/O in transactions, and many others.
17 The Multicore Generation. How to achieve parallel execution on multiple processors? Over four decades of conventional wisdom in parallel processing, mostly in the scientific application/HPC arena. Use this as a basis: create a program with parallelism expressed statically.
18 Hardware Going Forward. Multiple general-purpose processing cores. Some special-purpose hardware: GPUs, specialized units, etc. Pool of available (i.e., powered-on) resources might change frequently. Need to optimize storage of values (caches) and value communication (interconnect). Use of software (e.g., VM) to hide hardware detail.
19 How to use future hardware? Program in parallel. Teach students about parallel programming. Transactional memory. Etc. Do we really believe this?
20 Going Forward. Programmers are going to continue to express computation in familiar ways: sequential, object-oriented programs. May use a parallel algorithm, but it likely won't be a statically parallel program. How are we going to make it work?
21 Abstraction, Sequential, Dataflow. Then: abstraction is a friend of software. Now: abstraction is going to help us use future hardware. The program is going to be a sequential representation of abstractions of computations, which are going to be executed on a heterogeneous pool of hardware resources in a dataflow manner.
22 Lessons Applied. OO programming naturally creates groups of dependent operations. Can optimize value communication. Can optimize storage for values (internal/external). Can optimize traditional cache operations for many values.
23 Lessons Applied. Can process methods (chunks of dynamic instructions) in a dataflow manner: don't care how the internals of a method are implemented; don't care when, where, and how it is executed; only data flow matters. Dataflow execution can easily be unwound from a sequential representation, given the right granularity of methods.
24 Lessons Applied. Dynamic dataflow execution is probably not as efficient as a bare-bones parallel program, but other efficiencies are probably more important. Achieving dataflow execution from a sequential program is probably going to have software (or hardware) overhead, but it is likely worth it.
25 Dynamic Serialization: What? Data-driven parallel execution from a sequential program. Data-centric (dynamic) expression of dependence. Determinate, race-free execution. No locks and no explicit synchronization. Easier to write, debug, and maintain. No speculation a la TLS or TM. Comparable or better performance than conventional parallel models.
26 How? Big Picture. Write the program in a well-structured object-oriented style: a method operates on the data of its associated object (ver. 1). Identify parts of the program for potential parallel execution: make suitable annotations as needed; don't impose how parallelism is executed. Dynamically determine the data object touched by the selected code: identify dependence. The program thread assigns selected code to bins in a determined (sequential) order.
27 How? Big Picture. Serialize computations to the same object (enforce dependence): assign them to the same bin; a delegate thread executes computations in the same bin sequentially. Do not look for/represent independence: it falls out as an effect of enforcing dependence; computations in different bins execute in parallel. Updates to a given state occur in the same order as in the sequential program: determinism, no races. If the sequential program is correct, the parallel execution is correct (same input).
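The binning scheme described on these two slides can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Prometheus implementation (which the talk says is in C++): the `Serializer` class, its `delegate` and `join` methods, the `Account` class, and the `serial_id` field used to pick a bin are all hypothetical names invented for this example.

```python
import queue
import threading


class Serializer:
    """Toy dynamic serializer: one FIFO bin and one delegate thread per bin."""

    def __init__(self, num_bins):
        self.bins = [queue.Queue() for _ in range(num_bins)]
        self.delegates = [
            threading.Thread(target=self._run, args=(b,)) for b in self.bins
        ]
        for d in self.delegates:
            d.start()

    def _run(self, bin_queue):
        # A delegate thread executes the computations of its bin sequentially.
        while True:
            task = bin_queue.get()
            if task is None:          # sentinel: no more work for this bin
                return
            method, obj, args = task
            method(obj, *args)

    def delegate(self, obj, method, *args):
        # The object a computation touches decides its bin, so all
        # computations on the same object land in the same bin, in the
        # order in which the program thread issued them.
        self.bins[obj.serial_id % len(self.bins)].put((method, obj, args))

    def join(self):
        for b in self.bins:
            b.put(None)
        for d in self.delegates:
            d.join()


class Account:
    """Example object; serial_id is an assumed per-object identifier."""

    def __init__(self, serial_id):
        self.serial_id = serial_id
        self.balance = 0
        self.history = []

    def deposit(self, amount):
        self.balance += amount
        self.history.append(amount)


# The program thread runs ordinary sequential code and only delegates calls.
accounts = [Account(i) for i in range(4)]
serializer = Serializer(num_bins=2)
for step in range(100):
    serializer.delegate(accounts[step % 4], Account.deposit, step)
serializer.join()
print(sum(a.balance for a in accounts))  # 4950
```

No locks appear in `Account`: each object is only ever touched by the one delegate thread that owns its bin, and each bin is filled by a single program thread, so updates to an object follow the sequential program order while different bins proceed in parallel.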
28 Methodology. Study existing parallel programs. Empirical comparison of multithreaded vs. Prometheus versions: x86-64 multi-core and ccNUMA servers; 64-bit binaries, maximum optimization. Prometheus implementation: convert the benchmark to an idiomatic OOP program in C++ (use objects, inheritance, STL containers); parallelize the same operations using Prometheus. The Prometheus version may be more fine-grained. Some unavoidable differences due to locks, shared data.
29 Hardware configurations. Table columns by μ-arch: AMD Barcelona (Phenom 9850, Opteron 8350, Opteron 8356) and Intel Nehalem (Core i7 965, Xeon X5550). Table rows: Sockets, Cores, Threads, Total contexts, Clock (GHz), Memory (GB).
30 Benchmarks
Program        Source    Language  Synchronization             Description
barnes-hut     Lonestar  C++       barrier                     N-body simulation
black-scholes  PARSEC    C         barrier                     financial analysis
bzip2          pbzip2    C         mutex, condition variables  compression
canneal        PARSEC    C++       atomic, optimistic          VLSI CAD
dedup          PARSEC    C         mutex, condition variables  enterprise storage
histogram      Phoenix   C         barrier                     image analysis
reverse index  Phoenix   C         mutex                       web indexing
word count     Phoenix   C         barrier                     text analysis
31 Micro-benchmark results
32 Multicore results
33 Multi-socket results
34 Conclusions. Lessons from the past 25 years are going to be important for the future. Think parallel, use parallel algorithms, but program sequentially! Focus on dependence, not independence. Techniques like dynamic serialization can do as well as or better than parallel programming techniques for achieving parallel execution!
35 Questions?
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationAdministration. Prerequisites. CS 395T: Topics in Multicore Programming. Why study parallel programming? Instructors: TA:
CS 395T: Topics in Multicore Programming Administration Instructors: Keshav Pingali (CS,ICES) 4.126A ACES Email: pingali@cs.utexas.edu TA: Aditya Rawal Email: 83.aditya.rawal@gmail.com University of Texas,
More informationAdvanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University
Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationSummary of Student Projects from CSC 580, Multiprocessor Programming in Java
Summary of Student Projects from CSC 580, Multiprocessor Programming in Java Dale E. Parson, http://faculty.kutztown.edu/parson, January 2012, course was held spring 2011 This course in Multiprocessor
More informationSpeculation and Future-Generation Computer Architecture
Speculation and Future-Generation Computer Architecture University of Wisconsin Madison URL: http://www.cs.wisc.edu/~sohi Outline Computer architecture and speculation control, dependence, value speculation
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationConvergence of Parallel Architecture
Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty
More informationExploiting Distributed Software Transactional Memory
Exploiting Distributed Software Transactional Memory Christos Kotselidis Research Fellow Advanced Processor Technologies Group The University of Manchester Outline Transactional Memory Distributed Transactional
More informationThe future is parallel but it may not be easy
The future is parallel but it may not be easy Michael J. Flynn Maxeler and Stanford University M. J. Flynn 1 HiPC Dec 07 Outline I The big technology tradeoffs: area, time, power HPC: What s new at the
More informationParallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor
Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationSpeculative Multithreaded Processors
Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads
More informationObjective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.
CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes
More information