An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization
1 An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization
Chung Lee and Peter Strazdins, Computer Systems Group, Research School of Computer Science, The Australian National University (slides available from Peter.Strazdins/seminars)
PDSEC 2018: The 19th Workshop on Parallel and Distributed Scientific and Engineering Computing, Vancouver, Canada, 25 May 2018
2 Talk Overview (IPDPS/PDSEC-18)
- motivation
- background: network device virtualization using Xen
- our approach (small side-core to offload driver domain I/O) and aims
- methodology: overall, AMP simulation framework
- derivation of side-core design parameters: core execution units, TLBs, L1 and L2 caches
- overall side-core parameters and comparison of 4 side-cores
- conclusions
3 Motivation
- virtualization is an attractive technology for HPC
  - allows for a user-defined, deploy-anywhere OS / software stack
  - dynamically migratable, e.g. to/from a local supercomputer to a public cloud, or away from nodes developing faults
- it suffers, however, from poor (network) I/O performance
  - techniques like direct device assignment lose the virtualization benefits
  - others, like SR-IOV, require special hardware
- the side-core approach: devote a core to offloading I/O virtualization
  - has neither of these drawbacks
  - but it is wasteful to use a large core: most I/O offloading is very simple
- can we use a single instruction set architecture asymmetric multiprocessor (AMP): the large cores for the application, with a small side-core for the I/O? How Small Can It Be?
4 Background: Network Device Virtualization using Xen
- Xen allows guest OSs (domUs) to perform I/O through a privileged OS known as the driver domain (or dom0)
- the TCP/IP network protocol stack is split into:
  - a top-half, netfront, on domU
  - a bottom-half, netback, on dom0
- these communicate through VMM data transport and event notification mechanisms
[figure: user domains (netfront) connect via I/O rings over Xen to the driver domain (bridge, netback, VIFs, physical device driver) and the device]
- provides security & live migration, but incurs considerable I/O overheads!
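The split-driver transport above can be pictured as a shared request/response ring. A toy Python sketch follows (the class and its method names are invented for illustration; Xen's real rings are fixed-size shared-memory arrays with producer/consumer indices and event-channel notifications, not Python queues):

```python
from collections import deque

class IORing:
    """Toy model of a Xen-style split-driver I/O ring: the frontend
    (netfront, in domU) posts requests, the backend (netback, in dom0)
    consumes them and posts responses. Illustrative only."""

    def __init__(self, size=256):
        self.size = size          # a real ring has a fixed number of slots
        self.requests = deque()
        self.responses = deque()

    def post_request(self, desc):
        """Frontend side: queue a packet descriptor for the backend."""
        if len(self.requests) >= self.size:
            raise BufferError("ring full")
        self.requests.append(desc)

    def backend_poll(self, handle):
        """Backend side: drain pending requests, producing responses."""
        while self.requests:
            req = self.requests.popleft()
            self.responses.append(handle(req))

    def reap_responses(self):
        """Frontend side: collect completed responses."""
        out = list(self.responses)
        self.responses.clear()
        return out

ring = IORing()
ring.post_request({"op": "tx", "len": 1500})
ring.backend_poll(lambda req: {"op": req["op"], "status": "ok"})
print(ring.reap_responses())  # [{'op': 'tx', 'status': 'ok'}]
```

The point of the structure is that the two halves run asynchronously, which is exactly what lets the side-core execute the backend half in parallel with the guest.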
5 Approach and Aims
- our approach: use small side-cores to offload dom0, and the large cores for the domUs
  - no modification to Xen required
  - allows parallelization of top- and bottom-half I/O
- such small side-cores have also been shown useful for offloading other services, e.g. Java VM JIT and garbage collection
- our previous work (How Small Can It Be?, HPCC 15) used AMP emulation to design (derive architectural parameters for) such a side-core
  - limited to existing physical hardware (e.g. the AMD K10 and Intel Atom)
  - some params. were just a guess based on the Atom's, e.g. L2-TLB & I$ size
  - considered only performance (delay)
- contribution of this work: verify/refine this design using:
  - a methodology combining full-machine functional and power simulation
  - a systematic evaluation of the trade-offs on performance, area & energy
6 Simulation Methodology
- use CACTI to estimate the latency (cycles) of caches and TLBs
- conduct full-machine simulation of the AMP system (SimNow! and COTSon)
- evaluate the detailed area and energy profiles using McPAT
[figure: Processor Description feeds CACTI, Timing Simulation (timing and architectural statistics) and Power Simulation (power profiles, area), combined into the energy-delay product]
- metrics: ED²P = Energy × Delay² and ED²AP = Energy × Delay² × Area
- where Energy = Σ_{i=1..n} (P_i(access, miss) + S_i) × T
  - P_i / S_i: dynamic/static power of the i-th level component (cache, TLB)
  - T: total execution time, as measured by full-machine simulation
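The metrics above are straightforward to compute from the simulator outputs. A minimal sketch (the component powers, time and area below are made-up numbers, not values from the paper):

```python
def energy(components, T):
    """Energy = sum_i (P_i(access, miss) + S_i) * T, where each component i
    contributes dynamic power P_i and static power S_i (watts) over the
    total execution time T (seconds) measured by full-machine simulation."""
    return sum(p_dyn + p_static for p_dyn, p_static in components) * T

def ed2p(E, delay):
    """ED^2P = Energy * Delay^2 (lower is better)."""
    return E * delay ** 2

def ed2ap(E, delay, area):
    """ED^2AP = Energy * Delay^2 * Area (lower is better)."""
    return E * delay ** 2 * area

# hypothetical (dynamic, static) power in W for L1 I$, L1 D$, L2$, TLBs
comps = [(0.05, 0.01), (0.06, 0.01), (0.10, 0.03), (0.02, 0.005)]
T = 1.5        # total execution time (s), from the timing simulation
A = 3.2        # side-core area (mm^2), from McPAT

E = energy(comps, T)
print(ed2p(E, T), ed2ap(E, T, A))
```

Because delay is squared (and area enters ED²AP linearly), the metric penalises slow designs more heavily than large ones, which is why small-but-adequate caches can win.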
7 Simulation Framework (I)
[figure: two simulation nodes joined by a network mediator; each node runs a functional simulator hosting dom0 and two domUs on SimCores, with memory, functional TLBs and I/O devices (disk, network, etc.) feeding a timing model; the timing model comprises a side-core and two big cores, each with private I/D TLBs and L2$, a shared L3$, and a bus to memory and I/O devices]
8 Simulation Framework (II)
- model an AMP module consisting of two big cores and a side-core
  - a larger AMP processor can be made from several modules
  - all cores share the L3$; the L1/L2$ and TLBs are private to cores
- each functional core (SimCore) in SimNow! maps to a small or big core
- timing model in COTSon (extended for x86 table-walk, HW prefetch & superscalar)
- simulate both intra-node (2 domUs on the same node and module) and inter-node (2 × 2 domUs on different nodes) communication
- chose NPB IS.A and FT.W: they have message sizes of 2MB and 16MB (intra-node), larger than the candidate side-core cache sizes / TLB spans
  - note: communication-intensive workloads produce much the same traffic at the IP level as seen at dom0
- warm up TLBs, caches etc. over an initial interval, then collect statistics over a following interval of the same length
  - an unused nop is injected into the benchmarks to notify COTSon of this
- a 10Gb network interface & switch in COTSon's network model
9 Deriving Design Parameters: Core Execution Units
[chart: utilisation of single/dual/triple/quad execution units for asymmetric 2-way, symmetric 2-way, 3-way and 4-way designs]
- utilisation of multiple execution units on dom0 (results essentially identical for FT.W/IS.A and inter-/intra-node)
- the asymmetric 2-way design (2nd unit: integer instructions only, so it does not need a dual-ported L1$) is within 7% of the 4-way
10 Deriving Design Parameters: Translation Lookaside Buffers
[chart: I/D TLB miss rate (%) vs number of entries, intra- and inter-node]
- fully-associative single-level I/D TLB miss rates vs size (4KB page size, LRU replacement); FT/IS essentially identical
- performance is insufficient at 24 entries & saturates by 256 entries
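The miss-rate curves come from full-machine simulation, but the structure being measured is simple to model. A sketch of a fully-associative, LRU-replacement TLB (the access stream below is synthetic, invented to show the capacity effect, not the paper's dom0 trace):

```python
from collections import OrderedDict

def tlb_miss_rate(accesses, entries, page_size=4096):
    """Simulate a fully-associative TLB with LRU replacement and
    return its miss rate over a stream of virtual addresses."""
    tlb, misses = OrderedDict(), 0
    for addr in accesses:
        vpn = addr // page_size              # virtual page number
        if vpn in tlb:
            tlb.move_to_end(vpn)             # hit: mark most recently used
        else:
            misses += 1
            if len(tlb) >= entries:
                tlb.popitem(last=False)      # evict the least recently used
            tlb[vpn] = True
    return misses / len(accesses)

# a stream cycling over 64 pages: a 24-entry TLB thrashes under LRU,
# while a 256-entry TLB takes only the 64 compulsory misses
stream = [4096 * (i % 64) for i in range(10000)]
print(tlb_miss_rate(stream, 24), tlb_miss_rate(stream, 256))  # 1.0 0.0064
```

The cliff between the two sizes mirrors the slide's observation: once the TLB spans the working set, the miss rate saturates and extra entries buy nothing.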
11 Deriving Design Parameters: Level-2 TLBs
[chart: I/D L2 TLB miss rate vs size and associativity, intra- and inter-node]
- L2 TLB miss rate (% L2 misses / L1 accesses) vs size × associativity
- performance seems to saturate from the 256-entry 2/4-way or 512-entry 2-way configurations
12 ED²AP Analysis of 2-level TLB Configurations
[charts: normalized ED²AP vs L1 I-TLB and D-TLB size, inter-node]
- normalized ED²AP vs L1-TLB size (entries) for the three minimal-ED²AP L2 I/D-TLB settings (128-entry 4-way, 256-entry 4-way, and 512-entry 2-way)
- fully-associative 32/48-entry L1 I/D-TLBs with 4-way 256-entry L2 I/D-TLBs are optimal
- a similar analysis for intra-node communication shows fully-associative 24-entry L1 I/D-TLBs with 128-entry 4-way L2 I/D-TLBs are optimal
13 Deriving Design Parameters: Level-1 Caches
[charts: I-cache and D-cache miss rates (ft/is, intra-/inter-node) vs configuration]
- miss rates of I/D-caches vs capacity (KB) × associativity
- I-cache: 3 plateaus, at 16-4, 32-4 and 64-4; inter-node rates are higher
- D-cache: plateaus not as clear; intra-node rates are higher, as only 2 processes are used
- D-cache inter-node scalability is worse, as cold and coherency misses are inherent in data access
14 Level-1 Instruction/Data Cache Parameters
[table: size (KB), associativity (way), access time (cycles), area (mm²), runtime power (mW) and leakage power (mW) for the candidate L1 I/D cache configurations]
- from CACTI 6.5 and McPAT, at a 3 GHz clock and 45nm lithography
- note the changes in access time due to size and associativity
15 Deriving Design Parameters: Level-1 Caches
[charts: normalised average access cycles vs normalised area for I-cache and D-cache configurations]
- average memory access times vs area of I/D-caches (capacity × associativity) for inter-node communication (intra-node was almost identical)
- a 512 KB 4-way L2$ was used (other L2$s exhibit the same trend)
- I$: the Pareto-efficient frontier is 16-2, 16-4, 32-2, 32-4, 64-2 and 64-4
- D$: the Pareto-efficient frontier is 16-2 and 32-2: a clear choice!
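Pareto-efficient frontiers like these can be extracted mechanically from the (access time, area) pairs. A sketch, using placeholder numbers rather than the paper's measured values:

```python
def pareto_frontier(points):
    """Return the configurations not dominated in (access_time, area).
    A point is dominated if some other point is <= in both metrics
    and strictly < in at least one."""
    frontier = []
    for name, t, a in points:
        dominated = any(
            (t2 <= t and a2 <= a) and (t2 < t or a2 < a)
            for _, t2, a2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (config "capacityKB-assoc", normalised access cycles, normalised area);
# illustrative numbers only
configs = [("16-2", 1.30, 1.0), ("16-4", 1.25, 1.2), ("32-2", 1.10, 1.9),
           ("32-8", 1.12, 2.6), ("64-2", 1.05, 3.8)]
print(pareto_frontier(configs))  # ['16-2', '16-4', '32-2', '64-2']
```

Here 32-8 drops out because 32-2 is both faster and smaller; every surviving configuration trades area for access time, which is what the slide's frontiers express.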
16 Deriving Design Parameters: Level-2 Cache
[chart: L2 cache miss rate (%) vs cache configuration, intra- and inter-node]
- L2 cache miss rate vs capacity (KB) × associativity
- the ratio of L2 misses to total L1 cache read/write accesses is used
- plateaus from 256-4
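Note that this ratio is a global miss rate (L2 misses over all L1 accesses), not the local one (L2 misses over L2 accesses); the distinction matters when optimising a 2-level hierarchy, as the conclusions note. A small sketch with invented counter values:

```python
def l2_rates(l1_accesses, l1_misses, l2_misses):
    """Local rate:  L2 misses / L2 accesses (L2 is accessed on L1 misses).
    Global rate (used on this slide): L2 misses / total L1 accesses."""
    local_rate = l2_misses / l1_misses
    global_rate = l2_misses / l1_accesses
    return local_rate, global_rate

# hypothetical counters from one simulation run
local_r, global_r = l2_rates(l1_accesses=1_000_000,
                             l1_misses=50_000,
                             l2_misses=5_000)
print(f"local {local_r:.1%}, global {global_r:.2%}")  # local 10.0%, global 0.50%
```

A cache with an ugly 10% local miss rate can still contribute a negligible global rate if the L1s filter most traffic, which is why the small 128KB L2$ remains competitive.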
17 Level-2 Unified Cache Parameters
[table: size (KB), associativity (way), access time (cycles), area (mm²) and leakage power (mW) for the candidate L2 cache configurations]
- from CACTI 6.5 and McPAT, at a 3 GHz clock and 45nm lithography
- note again the changes in access time due to size and associativity
18 ED²P and ED²AP Analysis of 2-level Cache Configurations
[chart: normalized ED²P and ED²AP for the 2-level cache configurations, inter-node]
- for inter-node communication: the small design (16KB 4/2-way I/D-$ and a 128KB 4-way L2$) is ED²AP optimal!
- from other results, it is also optimal for intra-node, for both ED²AP and ED²P
19 Overall Results: Processor Parameters

                  Big            Atom           SS-e           SS-s
  Core type       Out-of-order   In-order       In-order       In-order
  Pipelines       Symm. 4        Asymm. 2       Asymm. 2       Asymm. 2
  Instrn. cache   64KB/2-way     32KB(8)        64KB(2)        16KB(4)
  Data cache      64KB(2)        24KB(6)        24KB(6)        16KB(2)
  L2 cache        512KB(16)      512KB(8)       256KB(16)      128KB(4)
  L1 I/D TLBs     64(f)/64(f)    32(f)/16(f)    32(f)/48(f)    32(f)/48(f)
  L2 I/D TLBs     512(4)/512(4)  None/64(4)     512(4)/512(4)  256(4)/256(4)
  Area (mm²)
  Avg. Power (W)

- cores: Big (the AMD K10), Atom (the Intel Atom), SS-e (from the HPCC 15 paper), SS-s (this paper)
- SS-e and SS-s have no FPU, a simple 2-level branch predictor and no hardware prefetch (emulation was sufficient to determine these)
20 Comparison of 3-Core Modules with Various Side-Cores
[chart: normalised power, area, performance (inverse) and energy for the Big, Atom, SS-e and SS-s side-cores]
- inter-node simulation results (FT.W.4) for a module of two big (AMD K10) cores and 1 side-core (Big, Atom, SS-e, or SS-s)
- figures normalised to the smallest of each category; smaller is better
21 Conclusions
- our methodology to design a small side-core in an AMP system for efficient driver domain (dom0) I/O offloading:
  - based on coupled full-machine, area and power simulations
  - validated on the Atom: the side-core model's perf. counter metrics were within 11%
  - permitted a systematic exploration of the design space
  - takes into account the relationship of energy, delay and area, e.g. the effect of cache size/associativity on access latency
- the 2-level memory hierarchy made ED²AP optimization a challenge!
  - need to calculate the L2 miss rate relative to L1 accesses
  - Pareto frontiers and/or exhaustive analysis for the optimal parameters
- of interest is only the TCP/IP traffic-generated workload on dom0
  - largely insensitive to the particular communication-intensive domU workload
  - inter-node communication workloads are more demanding than intra-node
22 Conclusions (II)
- a 2-way asymmetric in-order execution pipeline is optimal, due to the tight instruction dependence and short basic block size in dom0 workloads
- the 2-level TLB and cache parameters of the side-core are critical, with surprisingly small values being optimal for ED²AP
  - L1 I/D TLBs fully-associative, but 2-4 way associativity elsewhere
  - the 16KB/2-way D-cache was within 3% of the 32KB/2-way, which was optimal for delay
- broadly validated the How Small Can It Be? (HPCC 15) study, which was based on emulation on existing H/W
  - however with more confidence, from the more systematic methodology
  - can get energy and area savings with the new design, with almost identical performance
- It Can Be Even Smaller!
23 Thank You!!... Questions???
[recap figures: the Xen network virtualization architecture, the simulation methodology flow, and the normalised ED²P/ED²AP and power/area/performance/energy comparison charts]
More informationCache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics
More informationTransistor: Digital Building Blocks
Final Exam Review Transistor: Digital Building Blocks Logically, each transistor acts as a switch Combined to implement logic functions (gates) AND, OR, NOT Combined to build higher-level structures Multiplexer,
More informationReadings. Storage Hierarchy III: I/O System. I/O (Disk) Performance. I/O Device Characteristics. often boring, but still quite important
Storage Hierarchy III: I/O System Readings reg I$ D$ L2 L3 memory disk (swap) often boring, but still quite important ostensibly about general I/O, mainly about disks performance: latency & throughput
More informationHardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance
More informationVirtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018
Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018 Today s Papers Disco: Running Commodity Operating Systems on Scalable Multiprocessors, Edouard
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationHow much energy can you save with a multicore computer for web applications?
How much energy can you save with a multicore computer for web applications? Peter Strazdins Computer Systems Group, Department of Computer Science, The Australian National University seminar at Green
More informationComputer Architecture. Memory Hierarchy. Lynn Choi Korea University
Computer Architecture Memory Hierarchy Lynn Choi Korea University Memory Hierarchy Motivated by Principles of Locality Speed vs. Size vs. Cost tradeoff Locality principle Temporal Locality: reference to
More informationMemory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row. Static RAM may be used for main memory
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 14 Caches III Asst. Proflecturer SOE Miki Garcia WHEN FIBER OPTICS IS TOO SLOW 07/16/2014: Wall Street Buys NATO Microwave Towers in
More informationAbstract. Testing Parameters. Introduction. Hardware Platform. Native System
Abstract In this paper, we address the latency issue in RT- XEN virtual machines that are available in Xen 4.5. Despite the advantages of applying virtualization to systems, the default credit scheduler
More informationLecture 24: Memory, VM, Multiproc
Lecture 24: Memory, VM, Multiproc Today s topics: Security wrap-up Off-chip Memory Virtual memory Multiprocessors, cache coherence 1 Spectre: Variant 1 x is controlled by attacker Thanks to bpred, x can
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationVirtualization and memory hierarchy
Virtualization and memory hierarchy Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 16: Virtual Machine Monitors Geoffrey M. Voelker Virtual Machine Monitors 2 Virtual Machine Monitors Virtual Machine Monitors (VMMs) are a hot
More informationCSE 560 Computer Systems Architecture
This Unit: CSE 560 Computer Systems Architecture App App App System software Mem I/O The operating system () A super-application Hardware support for an Page tables and address translation s and hierarchy
More informationCaches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first
Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy. Jiang Jiang
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Jiang Jiang jiangjiang@ic.sjtu.edu.cn [Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2008, MK] Chapter 5 Large
More informationOptimising Multicore JVMs. Khaled Alnowaiser
Optimising Multicore JVMs Khaled Alnowaiser Outline JVM structure and overhead analysis Multithreaded JVM services JVM on multicore An observational study Potential JVM optimisations Basic JVM Services
More informationMemory. From Chapter 3 of High Performance Computing. c R. Leduc
Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor
More informationOperating Systems. Operating Systems Sina Meraji U of T
Operating Systems Operating Systems Sina Meraji U of T Recap Last time we looked at memory management techniques Fixed partitioning Dynamic partitioning Paging Example Address Translation Suppose addresses
More informationCS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures CS61C L22 Caches II (1) CPS today! Lecture #22 Caches II 2005-11-16 There is one handout today at the front and back of the room! Lecturer PSOE,
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationFast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names
Fast access ===> use map to find object HW == SW ===> map is in HW or SW or combo Extend range ===> longer, hierarchical names How is map embodied: --- L1? --- Memory? The Environment ---- Long Latency
More informationChapter 8 Main Memory
COP 4610: Introduction to Operating Systems (Spring 2014) Chapter 8 Main Memory Zhi Wang Florida State University Contents Background Swapping Contiguous memory allocation Paging Segmentation OS examples
More informationCIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018
CIS 3207 - Operating Systems Memory Management Cache and Demand Paging Professor Qiang Zeng Spring 2018 Process switch Upon process switch what is updated in order to assist address translation? Contiguous
More informationCPE300: Digital System Architecture and Design
CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Virtual Memory 11282011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review Cache Virtual Memory Projects 3 Memory
More informationCS533 Concepts of Operating Systems. Jonathan Walpole
CS533 Concepts of Operating Systems Jonathan Walpole Disco : Running Commodity Operating Systems on Scalable Multiprocessors Outline Goal Problems and solutions Virtual Machine Monitors(VMM) Disco architecture
More information