Graphite Introduction and Overview: Goals, Architecture, and Performance
1 Graphite Introduction and Overview: Goals, Architecture, and Performance
2 The Future of Multicore. Computing has moved aggressively to multicore: up to 72 cores are available now (MIT Raw, Intel SCC, Sun UltraSPARC T2, IBM PowerXCell 8i), and chips with far more cores are expected by 2018 if trends continue. [Chart: number of cores vs. time]
3 Simulation in Multicore Research. Simulation is vital for exploring future architectures: experiment with new designs and technologies, abstract away details and focus on key elements, rapidly explore the design space, and enable early software development for upcoming architectures. The future of multicore simulation: the need to simulate 100s to 1000s of cores; massive quantities of computation; high-level architecture becoming more important than microarchitecture (on-chip networks, memory hierarchies, DRAM access, cache coherence).
4 Graphite At-a-Glance. A fast, high-level simulator for large-scale multicores. Application-level simulation, where application threads are mapped to target cores. Multi-machine distribution: leverages additional compute and memory, is invisible to the application, and runs off-the-shelf pthread apps. Relaxed synchronization scheme: trades some timing accuracy for performance while guaranteeing functional correctness. Integrated power models. [Diagram: application threads mapped to cores across multicore host machines]
5 Graphite Performance. Graphite performance on 8 host machines (64 cores total): min 3 MIPS, max 81 MIPS, median 14 MIPS. Typical slowdown for existing sequential simulators: 10,000x to 100,000x. Results from SPLASH-2 benchmarks on a 32-core target processor.
6 Graphite Trades Accuracy for Performance. Simulator performance is a major limiting factor: it limits the depth and breadth of studies and the size of benchmarks; too much detail slows simulation; and sequential simulators cannot simulate 1000s of cores. Most simulators are sequential; Graphite is parallel. Typical performance today: 10,000x to 100,000x slowdown per core. Our target performance: 20 MIPS (a 100x to 1000x slowdown). Performance vs. accuracy: cycle-accurate simulators are very accurate but slow; high-level simulators trade some accuracy for performance. For next year's chips you need cycle accuracy; for chips 5-10 years out you need performance.
7 Outline: Introduction; Graphite Architecture Overview; Multi-machine Distribution; Clock Synchronization; Results; Conclusions.
8 Graphite Overview. An application-level simulator based on dynamic binary translation (uses Intel's Pin). The app runs natively except for new features and modeled events; on a trap, Graphite models functionality, timing, and energy. A simulation consists of running an application on a target architecture specified by swappable models and runtime parameters: different architectures, accuracy vs. performance. Results: the application's output, simulated time to completion, statistics about processor events, and energy and power for various components.
9 Graphite Architecture. [Diagram: application threads running as host threads inside Graphite host processes, on host cores across several host machines] Application threads are mapped to target tiles; on a trap, the correct target tile's models are used. Tiles are distributed among host processes, and processes can be distributed across multiple host machines.
10 Simulated Architecture. [Diagram: per-tile processor core, cache hierarchy, DRAM controller and DRAM, and network switch, connected by an interconnection network] Swappable models for the processor, network, and memory-hierarchy components: explore different architectures, trade accuracy for performance. Cores may be homogeneous or heterogeneous.
11 Key Simulator Components. [Diagram: each host machine runs application threads plus LCP service threads, with one MCP; each thread's stack consists of the messaging API and memory system, the network model, the transport layer, and the physical transport]
12 Communication Stack. [Stack: application thread, then messaging API / memory system, then network model, then transport layer] Graphite implements a layered communication stack. The application thread communicates with other threads via messages, using either the Graphite messaging API or simulated shared memory. Messages are routed and timed by the target architecture's network model. The transport layer delivers messages to the destination target core: via host shared memory within the same host process, or TCP/IP between different host processes.
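The layering above can be illustrated with a small sketch. This is a conceptual model, not Graphite's actual code: the class and parameter names (`NetworkModel`, `TransportLayer`, `hop_latency`) are invented for illustration, and a simple in-process queue stands in for the shared-memory/TCP transport.

```python
from collections import deque

class NetworkModel:
    """Routes and times messages according to the target architecture
    (here a toy 1-D distance stands in for a mesh route)."""
    def __init__(self, hop_latency=5):
        self.hop_latency = hop_latency  # cycles per hop (assumed value)

    def route(self, src, dst, payload, send_time):
        hops = abs(src - dst)
        arrival = send_time + hops * self.hop_latency  # modeled delivery time
        return (arrival, payload)

class TransportLayer:
    """Delivers timed messages to the destination core's inbox."""
    def __init__(self, num_cores):
        self.inboxes = [deque() for _ in range(num_cores)]

    def deliver(self, dst, message):
        self.inboxes[dst].append(message)

net = NetworkModel()
transport = TransportLayer(num_cores=4)
msg = net.route(src=0, dst=3, payload="hello", send_time=100)
transport.deliver(3, msg)
print(transport.inboxes[3][0])  # (115, 'hello'): 3 hops * 5 cycles after cycle 100
```

The key point the sketch captures is the separation of concerns: the network model decides *when* a message arrives in simulated time, while the transport layer only decides *how* it physically gets to the destination host.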
13 Power Modeling. Static and dynamic power/energy modeling for the cores, caches, and network. Simulation-driven energy modeling (not trace-based). On-line availability of energy/power estimates enables dynamic power management in hardware and software, such as DVFS, which affects both power and performance. Uses third-party tools: McPAT and DSENT.
14 Outline: Introduction; Graphite Architecture Overview; Multi-machine Distribution; Clock Synchronization; Results; Conclusions.
15 Parallel Distribution Challenges. We wanted to support the standard pthreads model (allowing use of off-the-shelf apps) and to simulate coherent-shared-memory architectures. Graphite must therefore provide the illusion that all threads are running in a single process on a single machine: a single shared address space, thread spawning, and system calls.
16 Single Shared Address Space. All application threads run in a single simulated address space. The memory subsystem provides functionality as well as modeling: functionality is implemented as part of the target memory models, which eliminates redundant work and tests the correctness of the memory models. [Diagram: one simulated application address space spanning several host address spaces]
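The idea that the memory models supply both timing and data can be sketched as follows. This is a hypothetical illustration (the class name, latency value, and interface are invented, not Graphite's): every load and store goes through the target memory model, which returns the data *and* advances the requesting core's clock.

```python
class SimulatedMemory:
    """Toy target memory model: one backing store for the whole
    simulated address space, separate from any host address space."""
    def __init__(self):
        self.mem = {}            # simulated address -> value
        self.access_latency = 3  # assumed cycles per access

    def store(self, addr, value, clock):
        self.mem[addr] = value
        return clock + self.access_latency          # timing side effect

    def load(self, addr, clock):
        return self.mem.get(addr, 0), clock + self.access_latency

mem = SimulatedMemory()
clock = mem.store(0x1000, 42, clock=0)
value, clock = mem.load(0x1000, clock)
print(value, clock)  # 42 6
```

Because the model is also the source of truth for the data, a bug in the modeled cache hierarchy shows up as wrong application output rather than just wrong statistics, which is exactly the "test correctness of memory models" benefit the slide mentions.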
17 Thread Distribution. Graphite runs application threads across several host machines. Each host process must be initialized correctly; threads are automatically distributed by trapping threading calls. [Diagram: application threads mapped onto cores across multicore host machines]
18 System Calls. Many system calls need to be handled specially: those that pass memory operands to the kernel, synchronization and communication between threads, allocating and deallocating dynamic memory, file I/O operations, and those that reflect target architectural state (e.g., time). Other system calls can simply be allowed to fall through to the host.
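The handled-vs-fall-through split can be pictured as a dispatch table. The handler names and behavior below are illustrative assumptions, not Graphite's actual handlers; the point is only the control flow: listed syscalls are emulated against target state, everything else executes on the host.

```python
# Hypothetical handlers: these stand in for real emulation logic.
def handle_mmap(args):
    return "allocated in simulated address space"

def handle_clock_gettime(args):
    return "returned simulated time"

SPECIAL_SYSCALLS = {
    "mmap": handle_mmap,                    # dynamic memory allocation
    "clock_gettime": handle_clock_gettime,  # reflect target time, not host time
}

def on_syscall(name, args, host_execute):
    handler = SPECIAL_SYSCALLS.get(name)
    if handler:
        return handler(args)         # special handling against target state
    return host_execute(name, args)  # fall through to the host OS

result = on_syscall("getpid", (), lambda n, a: f"host handled {n}")
print(result)  # host handled getpid
```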
19 Lite Mode. Runs without the complexity needed for multi-machine simulations: it relies on the host system for correctness, with no special handling of system calls. Advantages: better compatibility with off-the-shelf applications, and faster simulations in some situations. Disadvantages: it can only run on a single machine, does not help debug target memory-system models, and may not work well with very large numbers of target cores. The power and performance models of the target architecture remain the same, including the core, memory subsystem, and network models.
20 Outline: Introduction; Graphite Architecture Overview; Multi-machine Distribution; Clock Synchronization; Results; Conclusions.
21 Clock Synchronization. Cores only interact through messages, and clocks are updated with message timestamps. [Diagram: messages exchanged between Core 1 and Core 2]
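The timestamp update rule the slide describes is essentially a Lamport-style clock advance; a minimal sketch (the function name is mine, not Graphite's):

```python
def on_message_receive(local_clock, msg_timestamp):
    """On receiving a message, a core's clock jumps forward to at least
    the message's timestamp: the receiver cannot logically observe the
    message before it was sent in simulated time."""
    return max(local_clock, msg_timestamp)

print(on_message_receive(local_clock=900, msg_timestamp=1000))   # 1000: lagging core catches up
print(on_message_receive(local_clock=1200, msg_timestamp=1000))  # 1200: a leading core never moves backward
```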
22 Clock Synchronization. Threads may run at different speeds, causing their clocks to deviate. Clocks are only used for timing; functional correctness is always preserved. Clocks must be synchronized on explicit interaction, but may differ on implicit interaction, which causes timing inaccuracy. Here, synchronization means managing the skew of the different target core clocks; this is not application synchronization! Graphite supports three synchronization schemes with different accuracy and performance trade-offs.
23 Synchronization Schemes. Lax: relies exclusively on application synchronization events to synchronize the tiles' local clocks; functionally, events may occur out of order with respect to simulated time; best performance, worst accuracy. LaxP2P: based on the observation that timing inaccuracy is due to a few outliers, every N cycles each target core randomly pairs with another, and if their cycle counts differ by too much, the core that is ahead in simulated time goes to sleep; good performance, good accuracy. LaxBar: every N cycles, all target cores wait on a barrier; this keeps the cores tightly synchronized and imitates cycle accuracy; worst performance, best accuracy.
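A sketch of the LaxP2P check, under stated assumptions: per-core clocks live in a list, the skew threshold (`slack`) and function name are invented, and Graphite's actual pairing policy may differ in detail.

```python
import random

def laxp2p_check(clocks, me, slack):
    """Core `me` pairs with a random peer; returns True if `me` has run
    so far ahead in simulated time that it should go to sleep."""
    peer = random.choice([c for c in range(len(clocks)) if c != me])
    return clocks[me] - clocks[peer] > slack

clocks = [50_000, 12_000, 11_500, 13_000]   # core 0 is the outlier
print(laxp2p_check(clocks, me=0, slack=10_000))  # True: core 0 sleeps, whichever peer is drawn
print(laxp2p_check(clocks, me=2, slack=10_000))  # False: core 2 is not ahead of anyone by > slack
```

Note why random pairing works well here: since inaccuracy comes from a few outliers, a runaway core is ahead of almost every possible peer, so it is caught quickly without any global coordination.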
24 Example Simulation (Lax). [Timeline: real time vs. simulated time for Cores 1-3 under Lax; clocks drift freely between application synchronization points until app exit]
25 Example Simulation (LaxP2P). [Timeline: same run under LaxP2P; periodic P2P checks put cores that have run ahead to sleep, keeping clocks closer together between application synchronization points]
26 Example Simulation (LaxBar). [Timeline: same run under LaxBar; periodic barriers keep all cores' clocks tightly aligned throughout]
27 Clock Skew Measurements. [Graphs: local clock value over time for Lax, LaxP2P, and LaxBar] The graphs show approximate clock skew for each scheme on the fmm benchmark; clock skew is the spread between the minimum and maximum clocks at any given point. (Spikes on the graphs are due to errors in the measurement method.) Lax has the largest skew (~2,000,000 cycles): application synchronization events are clearly visible, and fine-grained thread interactions can be missed or misrepresented. LaxP2P has much lower skew (~30,000 cycles, with an interval of 10,000 cycles); application synchronization events are slightly visible. LaxBar has low, constant skew (~4,000 cycles).
28 Outline: Introduction; Graphite Architecture; Results (experimental methodology, simulator performance and scaling, validation against cycle-level simulation); Conclusions.
29 Experimental Methodology. Target architecture: 64 / 1024 cores; private 32 KB L1-I/L1-D caches per tile; private 512 KB L2 cache per tile; full-map directory-based cache coherence; 2-D mesh interconnection network. Benchmarks: SPLASH-2 and PARSEC suites. All experimental results were collected on 8-core Xeon host machines running Linux.
30 Performance Scaling (64 Cores). [Chart: normalized simulator speed-up] Graphite scales if the application scales; even non-ideal speedup still reduces latency and design iteration time.
31 Performance Summary. [Table: Graphite simulator performance in MIPS (min, max, mean, median) for sequential (1 core), 1 host* (8 cores), and 8 hosts* (64 cores); *host machines are 8-core servers] Sequential simulator performance is unacceptable. Parallel simulator performance reaches as high as 81 MIPS and would continue to increase with larger targets and more hosts. Simulator overhead depends heavily on application characteristics, and there is still more room for optimization.
32 Performance Scaling (1024 Cores). [Chart: normalized simulated MIPS for radix, lu_contiguous, and ocean_contiguous] Performance increased by 4x on average going from 1 to 8 host machines. Each 8-core host has 2 sockets (Intel 4-core Xeon CPU X5460).
33 Cycle-Level Simulation. To verify Graphite's accuracy, a cycle-level mode synchronizes at cycle boundaries (the default Graphite operating modes synchronize at instruction boundaries). It simulates all architectural models on a cycle-by-cycle basis, modeling exact contention and synchronization delays. Events are globally ordered by cycle time and processed in order of occurrence: they are stored in priority queues and dequeued in the order of their timestamps.
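The priority-queue event ordering described above is the standard discrete-event pattern; a minimal sketch using Python's heapq (the event names are invented examples):

```python
import heapq

# Events are (cycle, description) pairs; the heap keeps them ordered by
# cycle time regardless of the order in which they were scheduled.
events = []
heapq.heappush(events, (150, "cache miss completes"))
heapq.heappush(events, (100, "network flit arrives"))
heapq.heappush(events, (125, "DRAM access issues"))

order = []
while events:
    cycle, what = heapq.heappop(events)  # always dequeues the earliest event
    order.append((cycle, what))

print(order)
# [(100, 'network flit arrives'), (125, 'DRAM access issues'), (150, 'cache miss completes')]
```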
34 Cycle-Level Simulation: Graphite Validation. [Chart: deviation (%)] Graphite (LaxP2P) is only 6.4% off from cycle-level simulation on average.
35 Summary. Graphite accelerates multicore simulation using multi-machine parallel distribution: it enables simulation of 1000s of cores, is invisible to the application, runs off-the-shelf pthread apps, and provides simultaneous performance and energy estimation. Graphite delivers fast, scalable performance: as high as 81 MIPS simulator performance, and up to 34x speedup on 64 host cores (across 8 machines).
Graphite: A Distributed Parallel Simulator for Multicores, by Jason E. Miller, Harshad Kasture, George Kurian, Charles Gruenwald III, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal, Massachusetts Institute of Technology.
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationParallelism Marco Serafini
Parallelism Marco Serafini COMPSCI 590S Lecture 3 Announcements Reviews First paper posted on website Review due by this Wednesday 11 PM (hard deadline) Data Science Career Mixer (save the date!) November
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part I Lecture 4, Jan 25, 2012 Majd F. Sakr and Mohammad Hammoud Today Last 3 sessions Administrivia and Introduction to Cloud Computing Introduction to Cloud
More informationNikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris
Early Experiences on Accelerating Dijkstra s Algorithm Using Transactional Memory Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris Computing Systems Laboratory School of Electrical
More informationArchitectural Support for Operating Systems
Architectural Support for Operating Systems Today Computer system overview Next time OS components & structure Computer architecture and OS OS is intimately tied to the hardware it runs on The OS design
More informationParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser
ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationAgenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2
Lecture 3: Processes Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Process in General 3.3 Process Concept Process is an active program in execution; process
More informationIncorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems. Scott Marshall and Stephen Twigg
Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems Scott Marshall and Stephen Twigg 2 Problems with Shared Memory I/O Fairness Memory bandwidth worthless without memory
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More informationBuffered Co-scheduling: A New Methodology for Multitasking Parallel Jobs on Distributed Systems
National Alamos Los Laboratory Buffered Co-scheduling: A New Methodology for Multitasking Parallel Jobs on Distributed Systems Fabrizio Petrini and Wu-chun Feng {fabrizio,feng}@lanl.gov Los Alamos National
More informationOS Design Approaches. Roadmap. OS Design Approaches. Tevfik Koşar. Operating System Design and Implementation
CSE 421/521 - Operating Systems Fall 2012 Lecture - II OS Structures Roadmap OS Design and Implementation Different Design Approaches Major OS Components!! Memory management! CPU Scheduling! I/O Management
More informationHigh Performance Java Remote Method Invocation for Parallel Computing on Clusters
High Performance Java Remote Method Invocation for Parallel Computing on Clusters Guillermo L. Taboada*, Carlos Teijeiro, Juan Touriño taboada@udc.es UNIVERSIDADE DA CORUÑA SPAIN IEEE Symposium on Computers
More informationCSEP 524: Parallel Computa3on (week 6) Brad Chamberlain Tuesdays 6:30 9:20 MGH 231
CSEP 524: Parallel Computa3on (week 6) Brad Chamberlain Tuesdays 6:30 9:20 MGH 231 Adding OpenMP to Our Categoriza3on (part 1) degree of voodoo level of abstracdon C+Pthreads Chapel OpenMP less voodoo
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationHPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh
HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight
More informationDistributed Systems. Peer- to- Peer. Rik Sarkar. University of Edinburgh Fall 2014
Distributed Systems Peer- to- Peer Rik Sarkar University of Edinburgh Fall 2014 Peer to Peer The common percepdon A system for distribudng (sharing?) files Using the computers of common users (instead
More informationECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017
ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 The Operating System (OS) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke)
More information