Parallel SimOS: Scalability and Performance for Large System Simulation
1 Parallel SimOS: Scalability and Performance for Large System Simulation Ph.D. Oral Defense Robert E. Lantz Computer Systems Laboratory Stanford University 1
2 Overview This work develops methods to simulate large computer systems with practical performance We use smaller machines to simulate larger machines We extend the capabilities of computer system simulation by an order of magnitude, to systems of more than 1000 processors 2
3 Outline Background and Motivation Parallel SimOS Investigation Design Issues and Experiences Performance Evaluation Usability Evaluation Related Work Future Work and Conclusions 3
4 Why large systems? Large applications! Biology, chemistry, physics, engineering: from large systems (e.g. Earth's climate) to small systems (e.g. cells, DNA). Web applications, search, databases. Simulation, visualization (and games!)
5 Why simulate large systems? Compare alternative designs Verify a system before building it Predict behavior and performance Debug a system during bring-up Write software when the system is not available (or before it exists!) Avoid expensive mistakes 5
6 The SimOS System. Complete machine simulator developed in the CSL. Simulates the complete hardware of a computer system: CPU, memory, devices. Enough speed and detail to run a full operating system, system software, and application programs. Multiple CPU and memory models for fast or detailed performance and behavioral modeling. [Diagram: the target workload and target OS run on SimOS's simulated hardware (CPU model, memory model with processors and memories, and device models for disk, network, and other devices), all hosted on the host OS and host hardware]
7 Using SimOS. [Diagram: a disk image (OS, system software, user applications), config/control scripts, application data, and external I/O feed into SimOS, which produces modeled performance and event statistics, program output, and simulator statistics]
8 Performance Terminology Execution time is the most meaningful measurement of simulator performance Slowdown = Real Time/Simulated Time Slowdown tells you how much longer it will take to simulate a workload compared to running it on actual hardware Self-relative slowdown compares a simulator with the machine it is running on 8
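As a quick illustration of these definitions (the numbers below are hypothetical, chosen only for arithmetic clarity):

```python
# Hypothetical numbers for illustration: a 60-second workload that takes
# 10 minutes of wall-clock time to simulate.
real_seconds = 600.0       # wall-clock (real) time the simulator ran
simulated_seconds = 60.0   # virtual time covered by the workload

slowdown = real_seconds / simulated_seconds   # Slowdown = Real/Simulated
print(slowdown)   # 10.0, i.e. a "10x" simulator
```

A self-relative slowdown of 10 means the simulated machine runs its workload ten times slower than the host it is simulated on.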
9 Speed/Detail Trade-off: SimOS CPU and Memory Models
- MXS: dynamic, superscalar microarchitecture model; non-blocking memory system
- Mipsy: sequential interpreter; blocking memory system
- Embra w/caches: single-cycle CPU model; simplified cache model
- Embra: single-cycle CPU and memory model; ~10x self-relative slowdown
(The slide's table also gave approximate KIPS on a 225 MHz R10000 and self-relative slowdowns for each mode.)
10 Benefits of fast simulation. Makes it possible to simulate complex workloads of many billions of cycles. Allows software development and debugging with interactive usability. Enables exploration of a large design space: a real OS, system software, and large applications. Provides positioning before more detailed simulation: a rough estimate of performance and trends.
11 SimOS Applications Used in design, development, debugging of Stanford FLASH multiprocessor throughout its life cycle Enabled numerous studies of OS and application performance Research platform for operating systems, virtual machines, visualization 11
12 SimOS Limitations. As we simulate larger machines, slowdown increases. [Figure: slowdown (real time/simulated time) vs. number of simulated processors for Barnes, FFT, Radix, and LU, climbing toward 15,000x]
13 SimOS Limitations....resulting in longer simulation times. [Figure: time (minutes) to simulate one minute of virtual time vs. number of simulated processors, growing from about 10 minutes at small sizes through hours to more than a week at the largest sizes]
14 Problem: Simulator Slowdown. What causes simulator slowdown? Intrinsic slowdown, resource exhaustion, and linear slowdown, which combine multiplicatively: Simulation Time = Workload Time * (Intrinsic Slowdown + Resource Exhaustion Penalty) * Linear Slowdown
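The multiplicative model above can be sketched numerically; all values here are hypothetical, chosen only to show how the three terms compound:

```python
# Sketch of the slide's multiplicative slowdown model. The specific
# numbers below are hypothetical, not measurements from SimOS.
def simulation_time(workload_time, intrinsic, resource_penalty, linear):
    # Simulation Time =
    #   Workload Time * (Intrinsic + Resource Exhaustion Penalty) * Linear
    return workload_time * (intrinsic + resource_penalty) * linear

# 60 s workload, 10x intrinsic slowdown, 5x resource-exhaustion penalty,
# 32-way CPU multiplexing (the linear term):
seconds = simulation_time(60, 10, 5, 32)
print(seconds / 3600)   # 8.0 hours to simulate one minute
```

Because the terms multiply, shrinking any one of them (as Parallel SimOS does for the resource and linear terms) cuts total simulation time proportionally.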
15 Solution: Parallel SimOS Use increased capacity of shared-memory multiprocessors to address resource exhaustion and linear slowdown Extend speed/detail trade-off with fast, parallel mode of simulation Goal: eliminate slowdown due to parallelism and increase scalability to enable large system simulation with practical performance 15
16 Outline Background and Motivation Parallel SimOS Investigation Design Issues and Experiences Embra background Parallel Embra Design Performance Evaluation Usability Evaluation Related Work Future Work and Conclusions 16
17 Embra: SimOS's fastest simulation mode. A binary-translation CPU and memory simulator built around a Translation Cache (TC), with callouts to handle events, MMU operations, exceptions, and annotations; simulates multiple CPUs by multiplexing; ~10x base slowdown. [Diagram: Embra internals: kernel and user Translation Caches with a TC index, decoder and translator, callout and exception handlers, event handlers, MMU cache and MMU handler, statistics reporting, and the SimOS interface]
18 Embra: sources of slowdown Binary translation overhead Multiplexing overhead Resource Exhaustion ST = WT * (Slowdown(I) + Slowdown(R)) * M 18
19 Binary translation overhead. The decoder and translator expand each simulated instruction into several host instructions, which are stored in the Translation Cache (TC), indexed by simulated PC. For example, the simulated sequence

    lw r1, (r2)
    lw r3, (r4)
    add r5, r1, r3

is translated into

    lw SIM_T1, R2(cpu_base)
    jal mem_read_addr
    lw SIM_T2, (SIM_T1)
    sw SIM_T2, R1(cpu_base)
    lw SIM_T1, R4(cpu_base)
    jal mem_read_addr
    lw SIM_T3, (SIM_T1)
    sw SIM_T3, R3(cpu_base)
    add.w SIM_T1, SIM_T2, SIM_T3
    sw SIM_T1, R5(cpu_base)
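The translate-once, execute-many structure that makes binary translation pay off can be sketched as follows; the class and names are hypothetical stand-ins, not Embra's actual code:

```python
# Minimal sketch of a PC-indexed translation cache: each simulated PC is
# translated once, then the cached translation is reused on every later
# execution, amortizing the translation overhead.
class TranslationCache:
    def __init__(self, translate):
        self.tc = {}               # index: simulated PC -> translated code
        self.translate = translate # stand-in for the decoder/translator
        self.misses = 0

    def lookup(self, pc):
        if pc not in self.tc:      # TC miss: run the decoder/translator
            self.misses += 1
            self.tc[pc] = self.translate(pc)
        return self.tc[pc]         # TC hit: reuse the stored translation

tc = TranslationCache(lambda pc: f"host code for {pc:#x}")
for _ in range(1000):
    tc.lookup(0x1000)              # translated once, executed 1000 times
```

In a hot loop the translation cost is paid once, so the steady-state slowdown is dominated by the expanded host-instruction count rather than by translation itself.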
20 CPU multiplexing. Each simulated CPU's state (registers, FPU, MMU, and other state) lives in a CPU state array, and Embra context-switches among the simulated CPUs with a variable timeslice: a large timeslice gives low overhead, a small one gives better responsiveness, and the minimal timeslice corresponds to MPinUP mode.
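A minimal sketch of this multiplexing loop, with an illustrative interface (not Embra's):

```python
# Sketch of CPU multiplexing: one host CPU round-robins over the simulated
# CPUs' state, running each for a timeslice of simulated cycles before
# context-switching to the next.
def multiplex(num_cpus, timeslice, cycles_per_cpu):
    executed = [0] * num_cpus
    switches = 0
    while min(executed) < cycles_per_cpu:
        for cpu in range(num_cpus):          # context switch to next CPU
            run = min(timeslice, cycles_per_cpu - executed[cpu])
            executed[cpu] += run             # simulate `run` cycles here
            switches += 1
    return executed, switches

# A large timeslice means fewer context switches (lower overhead); a small
# one keeps the simulated CPUs closer together in virtual time.
done, switches = multiplex(num_cpus=4, timeslice=100, cycles_per_cpu=1000)
```

With a timeslice of 100 cycles, each of the 4 CPUs needs 10 slices, so the loop performs 40 context switches; doubling the timeslice would halve that overhead at the cost of coarser interleaving.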
21 A new, faster mode: Parallel Embra. Uses the parallelism and memory system of a shared-memory multiprocessor, taking a decimation-in-space approach: simulated nodes are distributed across simulator threads. Parallelism and increased memory bandwidth reduce linear slowdown and resource exhaustion: ST = WT * (Slowdown(I) + Slowdown(R)) * M
22 Design Evolution We started with a baseline design and evolved it to achieve scalable performance Baseline: thread-based parallelism, shared memory Critical design features: Mirroring hardware in software Replication, fine-grained parallelism Unsynchronized execution speed 22
23 Design: Software should mirror Hardware. A shared Translation Cache to reduce overhead? Problem: contention and serialization from chaining and cache conflicts; sharing fuses what the hardware keeps separate and breaks parallelism. Solution: mirror hardware in software with replicated Translation Caches.
24 Design: Software should mirror Hardware. A shared event queue for global ordering? It seems cheap because events are rare, but event frequency increases with parallelism. Solution: replicated event queues, again mirroring hardware in software.
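A per-node event queue along these lines might look like the following sketch; the class and method names are hypothetical:

```python
# Sketch of replicated (per-node) event queues, mirroring hardware in
# software: each simulator thread posts to and drains its own private
# queue, so no global lock or shared ordering structure is needed.
import heapq

class NodeEventQueue:
    def __init__(self):
        self.q = []                          # private to one simulator thread

    def post(self, fire_time, event):
        heapq.heappush(self.q, (fire_time, event))

    def drain_until(self, now):
        """Fire every event whose time has arrived on this node."""
        fired = []
        while self.q and self.q[0][0] <= now:
            fired.append(heapq.heappop(self.q)[1])
        return fired

q = NodeEventQueue()
q.post(50, "timer")
q.post(10, "disk-done")
ready = q.drain_until(20)        # only "disk-done" has fired by cycle 20
```

Because each queue is touched by exactly one thread, event handling scales with the number of simulated nodes instead of serializing on one shared queue.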
25 Design: Software should mirror Hardware. 90% of time is spent in the TC, so why not parallelize only the TC? Problem: Amdahl's law limits the achievable speedup, and frequent callouts create contention everywhere. Result: critical region expansion and serialization.
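Amdahl's law makes the ceiling concrete: if only the ~90% of time spent in the TC is parallelized, speedup can never exceed 10x no matter how many host CPUs are used. A quick calculation (the 0.9 fraction is from the slide; the CPU counts are illustrative):

```python
# Amdahl's law: speedup with a fraction p of execution parallelized
# across n processors, the rest remaining serial.
def amdahl(p, n):
    return 1.0 / ((1 - p) + p / n)

print(round(amdahl(0.9, 32), 1))      # only ~7.8x on 32 host CPUs
print(round(amdahl(0.9, 10**9), 1))   # approaches the 10x ceiling
```

This is why Parallel Embra parallelizes callouts, event handling, and the rest of the simulator as well, rather than the TC alone.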
26 Critical Region Expansion. [Diagram: over time, contention and descheduling expand critical regions, leading to serialization]
27 Design: Software should mirror Hardware. Solution: mirror hardware in software with fine-grained parallelism throughout Parallel Embra. The OS and applications require parallel callouts from the Translation Cache. Parallel statistics reporting is also a good idea, but it happens infrequently.
28 Design: flexible virtual time synchronization. Problem: cycle skew between fast and slow processors. Solution: configurable barrier synchronization in which fast processors wait for slow processors, with a variable interval for flexibility: fine-grained (like MPinUP mode) or loose-grained (to reduce synchronization overhead).
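A sketch of interval-based barrier synchronization, using Python's threading.Barrier as a stand-in for Embra's actual mechanism (all parameters illustrative):

```python
# Sketch of configurable virtual-time synchronization: each simulator
# thread runs freely for `interval` simulated cycles, then waits at a
# barrier, so no CPU can outrun another by more than one interval.
import threading

def simulate_cpu(cpu_id, barrier, interval, total_cycles, finished):
    cycle = 0
    while cycle < total_cycles:
        cycle += interval          # execute one interval of cycles
        barrier.wait()             # fast processors wait for slow ones
    finished.append(cpu_id)

finished = []
barrier = threading.Barrier(2)     # two simulated CPUs
threads = [threading.Thread(target=simulate_cpu,
                            args=(i, barrier, 1000, 4000, finished))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Growing `interval` reduces barrier overhead at the cost of more cycle skew; in the limit (an interval longer than the workload) the CPUs never synchronize, which is the unsynchronized mode discussed on the next slides.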
29 Design: synchronization causes slowdown. [Figure: 32-processor slowdown vs. synchronization interval in cycles for Barnes, FFT, LU, MP3D, Ocean, Raytrace, Radix, and Water; slowdown falls as the interval grows]
30 Design: unsynchronized execution For performance, the best synchronization interval is longer than the workload, i.e. never synchronize We were surprised to find that both the OS and parallel benchmarks ran correctly with unlimited time skew This is because every thread sees a consistent ordering of memory and synchronization events 30
31 Design conclusions
- Parallelism increases contention for callouts, the event system, the TC, the clock, the MMU, interrupt controllers, and any shared subsystem
- Contention cascades, resulting in critical region expansion and serialization
- Mirroring hardware in software preserves parallelism and avoids contention effects
- Fine-grained synchronization is required to permit correct and highly parallel access to simulator data
- Time synchronization across processors is unnecessary for correctness and undesirable for speed
- Performance depends on the combination of all parallel performance features
32 Outline Background and Motivation Parallel SimOS Investigation Design Issues and Experiences Performance Evaluation Usability Evaluation Related Work Future Work Conclusions 32
33 Performance: Test Configuration. Machine: Stanford FLASH Multiprocessor, 64 nodes, MIPS R10000 at 225 MHz, 220 MB DRAM/node (14 GB total); configurations flash1, flash32, flash64, etc. Workloads:
- Barnes: hierarchical Barnes-Hut method for the N-body problem
- FFT: Fast Fourier Transform
- LU: lower/upper matrix factorization
- MP3D: particle-based hypersonic wind tunnel simulation
- Radix: integer radix sort
- Raytrace: ray tracer
- Ocean: ocean currents simulation
- Water: water molecule simulation
- pmake: compile phase of the Modified Andrew Benchmark
- ptest: simple benchmark for sanity check/peak performance
34 Performance: Peak and actual MIPS. [Figure: simulation MIPS over time for Flash32 running ptest and the SPLASH-2 suite, peaking near 1600 MIPS]. Overall result: > 1000 MIPS in simulation, ~10x slowdown compared to hardware.
35 Performance: Hardware self-relative slowdown. [Figure: self-relative slowdown vs. simulated machine size for Barnes, FFT, LU, MP3D, Ocean, Radix, Raytrace, Water, pmake, LU-big, and Radix-big]. ~10x slowdown regardless of machine size.
36 Performance: benchmark phases. [Figures: per-phase behavior of Barnes and LU on Flash32]
37 Performance: benchmark phases (continued). [Figure: per-phase behavior of MP3D on Flash32]
38 Large Scale Performance 38
39 Large Scale Performance. [Figure: slowdown (real time/simulated time) for serial SimOS vs. Parallel SimOS on Radix/Flash32 (10,323x serial) and LU/Flash64 (9,409x serial), with Parallel SimOS far lower]. Hours or days rather than weeks.
40 Speed/Detail Trade-off, revisited: Parallel SimOS CPU and Memory Models
- MXS: dynamic, superscalar microarchitecture model; non-blocking memory system
- Mipsy: sequential interpreter; blocking memory system
- Embra w/caches: single-cycle CPU model; simplified cache model
- Embra: single-cycle CPU and memory model; ~10x self-relative slowdown
- Parallel Embra: non-deterministic, single-cycle CPU and memory model; > 1,000,000 KIPS at ~10x self-relative slowdown
(The slide's table gave approximate KIPS relative to a 225 MHz R10K.)
41 Performance Conclusions. Parallel SimOS achieves peak and actual MIPS far beyond serial SimOS. Parallel SimOS simulates a multiprocessor with performance comparable to serial SimOS simulating a uniprocessor. Parallel SimOS extends the scalability of complete machine simulation to 1024-processor systems.
42 Usability Study. A study of a large, complex parallel program: Parallel SimOS itself. Self-hosting exercises the orthogonal capabilities of the simulators, serving both as performance debugging of Parallel SimOS and as a test of functionality and usability. Self-hosting stack, top to bottom: benchmark (Radix); inner Irix 6.5; inner SimOS; outer Irix 6.5; outer SimOS; Irix 6.5; hardware (SGI Origin).
43 Phase profile. [Figures: CPU time per computation interval for self-hosted Radix under serial SimOS and under Parallel SimOS]. Bugs found: excessive TLB misses, interrupt storms. Limitation: system imbalance effects.
44 Usability Conclusions. Parallel SimOS worked correctly on itself and revealed bugs and limitations of Parallel SimOS. The speed/detail trade-off was enabled with checkpoints. Detailed mode was too slow; we ended up scaling down the workload. There is a need for faster detailed simulation modes.
45 Limitations. Virtual time depends on real time (but checkpoints can help). System imbalance effects. Memory limits. Need for a fast detailed mode. Loss of determinism and repeatability (future work).
46 Related Work Parallel SimOS uses shared-memory multiprocessors and decimation in space Other approaches to improving performance using parallelism include: Decimation in time Cluster-based simulation 46
47 Related Work: Decimation in Time. An initial serial execution drops checkpoints that divide the run into segments; the segments are then re-executed in parallel from their checkpoints (with some overlap), and the results are serially reconstructed. ST = WT * (Slowdown(I) + Slowdown(R)) * N
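The checkpoint-then-reexecute structure can be sketched as follows; the thread pool and function names are illustrative, not any particular system's implementation:

```python
# Sketch of decimation in time: a fast serial pass drops checkpoints, then
# each segment is re-simulated in detail in parallel, starting from its
# checkpoint, and the per-segment results are stitched back together.
from concurrent.futures import ThreadPoolExecutor

def detailed_resimulation(checkpoint_cycle):
    # Stand-in for detailed re-execution of one segment from a checkpoint.
    return f"detailed stats from cycle {checkpoint_cycle}"

checkpoints = [0, 1_000_000, 2_000_000, 3_000_000]  # from the serial pass
with ThreadPoolExecutor(max_workers=4) as pool:
    segment_stats = list(pool.map(detailed_resimulation, checkpoints))
```

Note the trade-off the summary slide draws: the serial pass must finish before parallel re-execution starts, which is why this approach costs interactivity even as it gains speedup.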
48 Parallel SimOS: Decimation in Space. Simulated nodes are distributed across simulator threads. ST = WT * (Slowdown(I) + Slowdown(R)) * M
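A minimal sketch of the decimation-in-space partitioning, with an illustrative ceiling-division scheme (the real assignment policy is not specified here):

```python
# Sketch of decimation in space: the simulated machine's CPUs are
# statically partitioned across the simulator threads, each thread
# owning one contiguous slice of the machine.
def partition(simulated_cpus, simulator_threads):
    per = -(-simulated_cpus // simulator_threads)   # ceiling division
    return [list(range(lo, min(lo + per, simulated_cpus)))
            for lo in range(0, simulated_cpus, per)]

slices = partition(simulated_cpus=8, simulator_threads=3)
# slices -> [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Each thread then runs a full Embra-style simulation loop over only its own slice, which is what lets parallelism attack the linear (M) term in the slowdown equation.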
49 Related Work: Cluster-based Simulation. The most common means of parallel simulation (Shaman, BigSim, others). But even a "fast" LAN means high-latency communication, software-based shared memory means low performance, and flexibility is reduced.
50 Parallel SimOS: Flexible Simulation. Supports tightly and loosely coupled machines, from workstation clusters (e.g. Sweet Hall) to NUMA shared-memory multiprocessors (the Stanford FLASH machine) and everything in between, including parallelism across multiprocessor nodes and multi-level multiprocessor clusters. [Diagram: a networked workstation cluster; a NUMA multiprocessor with per-node CPU, cache, and memory controller on a bus/interconnect; and a multiprocessor cluster with per-node network interfaces]
51 Related Work Summary. Decimation in time achieves good speedup at the expense of interactivity; it is synergistic with Parallel SimOS. Cluster-based simulation addresses the needs of loosely-coupled systems, generally without shared memory. The Parallel SimOS approach achieves programmability and performance for a larger design space that includes tightly-coupled and hybrid systems.
52 Future Work Faster detailed simulation Parallel detailed mode with flexible memory, pipeline models Try to recapture determinism Faster less-detailed simulation Global memory ordering in virtual time Revisit direct execution, using virtual machine monitors, user-mode OS, etc. 52
53 Conclusion: Thesis Contributions. Developed the design and implementation of scalable, parallel complete machine simulation. Eliminated slowdown due to resource exhaustion and multiplexing. Scaled complete machine simulation up by an order of magnitude, to 1024-processor machines, on our hardware. Developed a flexible simulator capable of simulating large, tightly-coupled systems with interactive performance.
More informationMulticast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood
Multicast Snooping: A New Coherence Method Using A Multicast Address Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, Daniel Sorin, Mark Hill & David Wood Computer Sciences Department University of Wisconsin
More informationcsci 3411: Operating Systems
csci 3411: Operating Systems Memory Management II Gabriel Parmer Slides adapted from Silberschatz and West Each Process has its Own Little World Virtual Address Space Picture from The
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationSorting. Overview. External sorting. Warm up: in memory sorting. Purpose. Overview. Sort benchmarks
15-823 Advanced Topics in Database Systems Performance Sorting Shimin Chen School of Computer Science Carnegie Mellon University 22 March 2001 Sort benchmarks A base case: AlphaSort Improving Sort Performance
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More informationFinal Lecture. A few minutes to wrap up and add some perspective
Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection
More informationPerformance and Power Impact of Issuewidth in Chip-Multiprocessor Cores
Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions
More informationChapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!
Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and
More informationExample Networks on chip Freescale: MPC Telematics chip
Lecture 22: Interconnects & I/O Administration Take QUIZ 16 over P&H 6.6-10, 6.12-14 before 11:59pm Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Exams in ACES
More informationDisco. CS380L: Mike Dahlin. September 13, This week: Disco and Exokernel. One lesson: If at first you don t succeed, try try again.
Disco CS380L: Mike Dahlin September 13, 2007 Disco: A bad idea from the 70 s, and it s back! Mendel Rosenblum (tongue in cheek) 1 Preliminaries 1.1 Review 1.2 Outline 1.3 Preview This week: Disco and Exokernel.
More informationJackson Marusarz Intel Corporation
Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits
More informationReadings. Storage Hierarchy III: I/O System. I/O (Disk) Performance. I/O Device Characteristics. often boring, but still quite important
Storage Hierarchy III: I/O System Readings reg I$ D$ L2 L3 memory disk (swap) often boring, but still quite important ostensibly about general I/O, mainly about disks performance: latency & throughput
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationLight64: Ligh support for data ra. Darko Marinov, Josep Torrellas. a.cs.uiuc.edu
: Ligh htweight hardware support for data ra ce detection ec during systematic testing Adrian Nistor, Darko Marinov, Josep Torrellas University of Illinois, Urbana Champaign http://iacoma a.cs.uiuc.edu
More informationProtoFlex: FPGA-Accelerated Hybrid Simulator
ProtoFlex: FPGA-Accelerated Hybrid Simulator Eric S. Chung, Eriko Nurvitadhi James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at Multiprocessor Simulation Simulating one processor in software
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More informationDEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level
More informationLecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1
Lecture - 4 Measurement Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1 Acknowledgements David Patterson Dr. Roger Kieckhafer 9/29/2009 2 Computer Architecture is Design and Analysis
More informationMultiprocessor and Real-Time Scheduling. Chapter 10
Multiprocessor and Real-Time Scheduling Chapter 10 1 Roadmap Multiprocessor Scheduling Real-Time Scheduling Linux Scheduling Unix SVR4 Scheduling Windows Scheduling Classifications of Multiprocessor Systems
More informationSDSM Progression. Implementing Shared Memory on Distributed Systems. Software Distributed Shared Memory. Why a Distributed Shared Memory (DSM) System?
SDSM Progression Implementing Shared Memory on Distributed Systems Sandhya Dwarkadas University of Rochester TreadMarks shared memory for networks of workstations Cashmere-2L - 2-level shared memory system
More informationChapter 6. Storage and Other I/O Topics
Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationPersistent Storage - Datastructures and Algorithms
Persistent Storage - Datastructures and Algorithms 1 / 21 L 03: Virtual Memory and Caches 2 / 21 Questions How to access data, when sequential access is too slow? Direct access (random access) file, how
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationLecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter
Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)
More informationComputer Architecture
Computer Architecture Pipelined and Parallel Processor Design Michael J. Flynn Stanford University Technische Universrtat Darmstadt FACHBEREICH INFORMATIK BIBLIOTHEK lnventar-nr.: Sachgebiete: Standort:
More informationChapter 8: Main Memory. Operating System Concepts 9 th Edition
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationBuilding High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye
Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink Robert Kaye 1 Agenda Once upon a time ARM designed systems Compute trends Bringing it all together with CoreLink 400
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationVirtual Memory: From Address Translation to Demand Paging
Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More information