An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization

1 An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization
Chung Lee and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University
(slides available from Peter.Strazdins/seminars)
PDSEC 2018: The 19th Workshop on Parallel and Distributed Scientific and Engineering Computing, Vancouver, Canada, 25 May 2018

2 IPDPS/PDSEC-18 An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization
Talk Overview
- motivation
- background: network device virtualization using Xen
- our approach (small side-core to offload driver domain I/O) and aims
- methodology: overall, AMP simulation framework
- derivation of side-core design parameters
  - core execution units
  - TLBs
  - L1 and L2 caches
- overall side-core parameters and comparison of 4 side-cores
- conclusions

3 Motivation
- virtualization is an attractive technology for HPC
  - allows for a user-defined, deploy-anywhere OS / software stack
  - dynamically migratable, e.g. to/from local supercomputer to public cloud, from nodes developing faults
- suffers however from poor (network) I/O performance
  - techniques like direct device assignment lose virtualization benefits
  - others, like SR-IOV, require special hardware
- the sidecore approach: devote a core to offloading I/O virtualization
  - has neither of these drawbacks
  - but wasteful to use a large core: most I/O offloading is very simple
- can we use a single instruction set architecture asymmetric multiprocessor (AMP)?
  - use the large cores for the application, with a small side-core for the I/O
- How Small Can It Be?

4 Background: Network Device Virtualization using Xen
- Xen allows guest OSs (domUs) to perform I/O through a privileged OS known as the driver domain (or dom0)
- the TCP/IP network protocol stack is split into:
  - a top-half, netfront, on domU
  - a bottom-half, netback, on dom0
- these communicate through VMM data transport and event notification mechanisms
- provides security & live migration but incurs considerable I/O overheads!
[Figure: Xen with a driver domain (bridge, netback, VIFs, physical device driver), three user domains (netfront), the I/O ring and the device]

5 Approach and Aims
- our approach: use the small side-cores to offload dom0, and the large cores for domUs
  - no modification to Xen required
  - allows parallelization of top- and bottom-half I/O
  - such small side-cores have also been shown useful for offloading other services, e.g. Java VM JIT and garbage collection
- our previous work (How Small Can It Be? HP 15) used AMP emulation to design (derive architectural parameters for) such a side-core
  - limited to existing physical hardware (e.g. AMD-K80 and Intel Atom)
  - some parameters were just a guess based on the Atom's, e.g. L2-TLB & I$ size
  - considered only performance (delay)
- contribution of this work: verify/refine this design using:
  - a methodology using full-machine functional and power simulation
  - systematic evaluation of the trade-offs on performance, area & energy

6 Simulation Methodology
- use CACTI to estimate latency (cycles) of caches, TLBs
- conduct full-machine simulation of AMP system (SimNow! and COTSon)
- evaluate the detailed area and energy profiles using McPAT
[Figure: tool flow linking Processor Description, CACTI, Timing Simulation (timing and architectural statistics), Power Simulation (power profiles, area) and the Energy-delay product]
- metrics: $ED^2P = \text{Energy} \times \text{Delay}^2$ and $ED^2AP = \text{Energy} \times \text{Delay}^2 \times \text{Area}$, where
  $\text{Energy} = \sum_{i=1}^{n} \left( P_i(\text{access}, \text{miss}) + S_i \right) \times T$
  - $P_i$ / $S_i$: dynamic/static power of an i-th level component (cache, TLB)
  - $T$: total execution time, as measured by full-machine simulation
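The two metrics on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Component` class, its field names and the numbers in the usage note are hypothetical, and the dynamic power of each component is assumed to have been computed already from its access and miss counts (the slides do not give that step).

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One cache or TLB level; all fields are hypothetical inputs."""
    name: str
    dynamic_power_w: float  # P_i, assumed to already reflect access/miss activity
    static_power_w: float   # S_i, leakage power


def energy_j(components, exec_time_s):
    """Energy = sum_i (P_i + S_i) * T."""
    total_power = sum(c.dynamic_power_w + c.static_power_w for c in components)
    return total_power * exec_time_s


def ed2p(components, exec_time_s):
    """Energy x Delay^2, with Delay taken as the total execution time T."""
    return energy_j(components, exec_time_s) * exec_time_s ** 2


def ed2ap(components, exec_time_s, area_mm2):
    """Energy x Delay^2 x Area."""
    return ed2p(components, exec_time_s) * area_mm2
```

For example, two components drawing 0.75 W and 1.25 W in total over a 2 s run give 4 J of energy, an ED2P of 16 and, for a 3 mm² area, an ED2AP of 48 (made-up numbers, not results from the talk).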

7 Simulation Framework (I)
[Figure: two simulation nodes joined by a network mediator. Each node has a functional simulator running Dom0 and two DomUs on SimCores, with memory, functional TLB and I/O devices (disk, network, etc.), and a timing model of one side-core and two big cores, each with its own I, D, TLBs and L2, sharing an L3, bus, memory and I/O devices]

8 Simulation Framework (II)
- model an AMP module consisting of two big cores and a side-core
  - a larger AMP processor can be made from several modules
  - all cores share L3$; L1/L2$ and TLBs private to cores
- each functional core (SimCore) in SimNow! maps to a small or big core timing model in COTSon (extended for x86 table-walk, HW prefetch & superscalar)
- simulate both intra-node (2 domUs on the same node and module) and inter-node (2 × 2 domUs on different nodes) communication
- choose NPB IS.A and FT.W: have message sizes of 2MB and 16MB (intra-node)
  - larger than candidate side-core cache sizes / TLB spans
  - Note: communication-intensive workloads produce much the same traffic on the IP level as seen at dom0
- warm up TLBs, caches etc. on 1st 2 iterations, collect statistics on next 2
  - an unused nop is injected into benchmarks to notify COTSon of this
- a 10Gb network interface & switch in COTSon's network model

9 Deriving Design Parameters: Core Execution Units
[Figure: utilisation (%) of Single, Dual, Triple and Quad execution units for the asymmetric 2-way, symmetric 2-way, 3-way and 4-way designs]
- utilisation of multiple execution units on dom0 (results essentially identical for FT.W/IS.A and inter-/intra-node)
- asymmetric 2-way design (2nd unit: only integer instructions, does not need dual-ported L1$) is within 7% of the 4-way

10 Deriving Design Parameters: Translation Lookaside Buffers
[Figure: miss rate (%) vs number of entries for DTLB-intra, ITLB-intra, DTLB-inter and ITLB-inter]
- fully-associative single-level I/D TLB miss rates vs size (4KB page size, LRU replacement); FT/IS essentially identical
- (note ITLB-inter, entries)
- performance insufficient for 24 & saturates for 256 entries
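A miss-rate-versus-size curve of this kind can be reproduced in miniature with a trace-driven model. The sketch below assumes a fully-associative TLB with LRU replacement and 4KB pages, as stated on the slide; the function name and the synthetic trace in the usage note are illustrative and are not part of the SimNow!/COTSon setup.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4KB pages, as on the slide


def tlb_miss_rate(addresses, entries, page_size=PAGE_SIZE):
    """Miss rate of a fully-associative LRU TLB over a virtual-address trace."""
    tlb = OrderedDict()  # virtual page number -> placeholder, oldest first
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        vpn = addr // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)         # hit: mark as most recently used
        else:
            misses += 1
            tlb[vpn] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used entry
    return misses / total if total else 0.0
```

Sweeping `entries` over the candidate sizes for a recorded address trace gives one point per size. A trace that cycles over four pages, for instance, misses on every access with 2 entries but only on the first touch of each page with 4.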

11 Deriving Design Parameters: Level-2 TLBs
[Figure: L2 TLB miss rate for DTLB-intra, ITLB-intra, DTLB-inter and ITLB-inter]
- L2 TLB miss rate (% L2 misses / L1 accesses) vs size × associativity
- performance seems to saturate from 256-2/4 or 512-2

12 $ED^2AP$ Analysis of 2-level TLB Configurations
[Figure: two panels (I-TLB, inter-node; D-TLB, inter-node) of normalized $ED^2AP$]
- normalized $ED^2AP$ vs L1-TLB size (entries) for three minimal-$ED^2AP$ L2 I/D-TLB settings (128-4, 256-4, and 512-2)
- fully-associative 32/48-entry L1 I/D-TLBs with 4-way 256-entry L2 I/D-TLBs are optimal
- similar analysis for intra-node communication shows fully-associative 24-entry L1 I/D-TLBs with 128-entry 4-way L2 I/D-TLBs are optimal

13 Deriving Design Parameters: Level-1 Caches
[Figure: two panels (I-cache, D-cache) of miss rate (%) vs instruction/data cache configuration for ft-intra, is-intra, ft-inter and is-inter]
- miss rates of I/D-caches vs capacity (KB) × associativity
- I-cache: 3 plateaus at 16-4, 32-4 and 64-4, inter-node rates higher
- D-cache: plateaus not as clear, intra-node rates higher as only 2 processes used
- D-cache inter-node scalability is worse as cold and coherency misses are inherent in data access

14 Level-1 Instruction/Data Cache Parameters
[Table: Size (KB) | Assoc. (way) | Access time (cycles) | Area (mm²) | Runtime (mW) | Leakage (mW); values not recoverable]
- from CACTI 6.5 and McPAT, on a 3 GHz clock and 45nm lithography
- note changes in access time due to size and associativity

15 Deriving Design Parameters: Level-1 Caches
[Figure: two panels (I-cache, D-cache) of normalised average access cycles vs normalised area]
- average memory access times vs area of I/D-caches (capacity × associativity) for inter-node communication (intra-node was almost identical)
- a 512 KB 4-way L2$ was used (other L2$s exhibit the same trend)
- I$: Pareto-efficient frontiers are 16-2, 16-4, 32-2, 32-4, 64-2 and 64-4
- D$: Pareto-efficient frontiers are 16-2 and 32-2: a clear choice!
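Picking the Pareto-efficient configurations from an access-time-versus-area plot amounts to discarding every point that another point beats on both axes. The following is a minimal sketch of that filter; the function name and the configuration labels and values in the usage note are hypothetical and do not come from the talk's data.

```python
def pareto_frontier(points):
    """Return the names of non-dominated configurations.

    points maps a configuration name to (area, access_cycles); smaller is
    better on both axes. A point is dominated if some other point is no
    worse on both axes and strictly better on at least one.
    """
    frontier = []
    for name, (area, cycles) in points.items():
        dominated = any(
            (a <= area and c <= cycles) and (a < area or c < cycles)
            for other, (a, c) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Order the frontier by increasing area for readability.
    return sorted(frontier, key=lambda n: points[n])
```

With made-up inputs `{"cfg-a": (1.0, 3.0), "cfg-b": (2.0, 2.0), "cfg-c": (2.5, 2.5), "cfg-d": (4.0, 1.0)}`, `cfg-c` is dropped because `cfg-b` is both smaller and faster, leaving `cfg-a`, `cfg-b` and `cfg-d` on the frontier.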

16 Deriving Design Parameters: Level-2 Cache
[Figure: miss rate (%) vs cache configuration for INTRA and INTER]
- L2 cache miss rate vs capacity (KB) × associativity
- ratio of L2 misses to total L1 cache read/write accesses is used
- plateaus at 256-4, and
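Expressing the L2 miss rate relative to L1 accesses (often called the global miss rate) lets it plug directly into the standard textbook formula for two-level average memory access time. The sketch below shows that formula; it is an assumption that the talk's average-access-cycle figures were computed this way, and the function names and latencies in the usage note are illustrative only.

```python
def global_l2_miss_rate(l2_misses, l1_accesses):
    """L2 misses divided by total L1 read/write accesses."""
    return l2_misses / l1_accesses


def average_access_cycles(l1_hit_cycles, l1_miss_rate,
                          l2_hit_cycles, l2_global_miss_rate,
                          memory_cycles):
    """Two-level AMAT using the global L2 miss rate.

    Every access pays the L1 hit time, L1 misses additionally pay the L2
    hit time, and L2 misses (counted per L1 access) pay the memory latency.
    """
    return (l1_hit_cycles
            + l1_miss_rate * l2_hit_cycles
            + l2_global_miss_rate * memory_cycles)
```

For instance, 50 L2 misses out of 400 L1 accesses is a global L2 miss rate of 0.125. Combined with a 2-cycle L1, a 25% L1 miss rate, a 12-cycle L2 and a 200-cycle memory (all hypothetical values), the average comes to 30 cycles.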

17 Level-2 Unified Cache Parameters
[Table: Size (KB) | Assoc. (way) | Access time (cycles) | Area (mm²) | Leakage (mW); values not recoverable]
- from CACTI 6.5 and McPAT, on a 3 GHz clock and 45nm lithography
- note again changes in access time due to size and associativity

18 $ED^2P$ and $ED^2AP$ Analysis of 2-level Cache Configurations
[Figure: normalized ratio of ED2P and ED2AP per cache configuration]
- for inter-node communication: small design (16KB 4/2-way I/D-$, and a 128KB 4-way L2$) is $ED^2AP$ optimal!
- from other results, also optimal for intra-node for both $ED^2AP$ and $ED^2P$
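An exhaustive search over two-level configurations can be sketched as a loop over every L1/L2 pairing that keeps the one with the lowest Energy × Delay² × Area. Everything here is hypothetical scaffolding: `evaluate` stands in for whatever simulation and power model produces the three quantities, and the toy model in the usage note exists only to exercise the loop.

```python
from itertools import product


def best_configuration(l1_options, l2_options, evaluate):
    """Return (score, l1, l2) minimising Energy x Delay^2 x Area.

    evaluate(l1, l2) is a caller-supplied (hypothetical) function that
    returns an (energy, delay, area) tuple for the given pairing.
    """
    best = None
    for l1, l2 in product(l1_options, l2_options):
        energy, delay, area = evaluate(l1, l2)
        score = energy * delay ** 2 * area
        if best is None or score < best[0]:
            best = (score, l1, l2)
    return best
```

With a toy model in which energy grows with total capacity and delay doubles for the smallest L1, the search settles on the middle L1 size and the smaller L2. The real trade-off depends on measured miss rates, latencies and power.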

19 Overall Results: Processor Parameters
                 Big             Atom          SS-e            SS-s
Core type        Out-Of-Order    In-order      In-order        In-order
Pipelines        Symm. 4         Asymm. 2      Asymm. 2        Asymm. 2
Instrn. cache    64KB/2-way      32KB(8)       64KB(2)         16KB(4)
Data cache       64KB(2)         24KB(6)       24KB(6)         16KB(2)
L2 cache         512KB(16)       512KB(8)      256KB(16)       128KB(4)
L1 I/D TLBs      64(f)/64(f)     32(f)/16(f)   32(f)/48(f)     32(f)/48(f)
L2 I/D TLBs      512(4)/512(4)   None/64(4)    512(4)/512(4)   256(4)/256(4)
Area (mm²)       (values not recoverable)
Avg. Power (W)   (values not recoverable)
- cores: Big (the AMD K10), Atom (the Intel Atom), SS-e (from HP 15 paper), SS-s (this paper)
- SS-e and SS-s have no FPU, a simple 2-level branch predictor and no hardware prefetch (emulation was sufficient to determine these)

20 Comparison of 3-Core Modules with Various Side-Cores
[Figure: normalised ratio of Power, Area, Performance (inverse) and Energy for Big, Atom, SS-e and SS-s]
- inter-node simulation results (FT.W.4) for a module of two big (AMD K10) cores and 1 side-core (Big, Atom, SS-e, and SS-s)
- figures normalised to the smallest of each category; smaller is better
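Normalising each category to its smallest value, as in this comparison, can be sketched as follows. The function name, the design labels and the numbers in the usage note are placeholders and are not results from the talk.

```python
def normalise_to_smallest(results):
    """Divide every value by the smallest value in its category.

    results maps a category (e.g. "power") to {design: value}; the best
    design in each category ends up at 1.0 and smaller remains better.
    """
    normalised = {}
    for category, per_design in results.items():
        smallest = min(per_design.values())
        normalised[category] = {
            design: value / smallest for design, value in per_design.items()
        }
    return normalised
```

Given `{"power": {"design-a": 2.0, "design-b": 4.0}, "area": {"design-a": 9.0, "design-b": 3.0}}`, the output puts `design-a` at 1.0 for power and 3.0 for area, and `design-b` at 2.0 and 1.0 respectively.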

21 Conclusions
- our methodology to design a small side-core in an AMP system for efficient driver domain (dom0) I/O offloading
  - based on coupled full-machine, area and power simulations
  - validated on the Atom, side-core model's perf. counter metrics being within 11%
  - permitted a systematic exploration of the design space
  - takes into account the relationship of energy, delay and area, e.g. the effect of cache size/associativity on access latency
- 2-level memory hierarchy for $ED^2AP$ optimization a challenge!
  - need to calculate L2 miss rate on L1 accesses
  - Pareto frontiers and/or exhaustive analysis for optimal parameters
- of interest is only the TCP/IP traffic-generated workload on dom0
  - largely insensitive to any communication-intensive domU workload
  - inter-node communication workloads more demanding than intra-node

22 Conclusions (II)
- 2-way asymmetric in-order execution pipeline optimal
  - due to tight instruction dependence and the short basic block size in dom0 workloads
- the 2-level TLB and cache parameters of the side-core are critical, with surprisingly small values being optimal for $ED^2AP$
  - L1 I/D TLBs fully-associative, but 2-4 way associativity elsewhere
  - 16KB/2-way D-Cache was within 3% of the 32KB/2-way, which was optimal for Delay
- broadly validated the How Small Can It Be? (HP 15) study, which was based on emulation on existing H/W
  - however with more confidence from the more systematic methodology
  - can get energy and area savings with new design, with almost identical performance
- It Can Be Even Smaller!

23 Thank You!!... Questions???


More information

Disco: Running Commodity Operating Systems on Scalable Multiprocessors

Disco: Running Commodity Operating Systems on Scalable Multiprocessors Disco: Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine, Kinskuk Govil and Mendel Rosenblum Stanford University Presented by : Long Zhang Overiew Background

More information

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Opteron example Cache performance Six basic optimizations Virtual memory Processor DRAM gap (latency) Four issue superscalar

More information

Approaches to Performance Evaluation On Shared Memory and Cluster Architectures

Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Peter Strazdins (and the CC-NUMA Team), CC-NUMA Project, Department of Computer Science, The Australian National University

More information

Virtual Memory - Objectives

Virtual Memory - Objectives ECE232: Hardware Organization and Design Part 16: Virtual Memory Chapter 7 http://www.ecs.umass.edu/ece/ece232/ Adapted from Computer Organization and Design, Patterson & Hennessy Virtual Memory - Objectives

More information

ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design ECE232: Hardware Organization and Design Lecture 28: More Virtual Memory Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Virtual memory used to protect applications from

More information

I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班

I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班 I/O virtualization Jiang, Yunhong Yang, Xiaowei 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016 Xen and the Art of Virtualization CSE-291 (Cloud Computing) Fall 2016 Why Virtualization? Share resources among many uses Allow heterogeneity in environments Allow differences in host and guest Provide

More information

Memory hierarchy review. ECE 154B Dmitri Strukov

Memory hierarchy review. ECE 154B Dmitri Strukov Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Six basic optimizations Virtual memory Cache performance Opteron example Processor-DRAM gap in latency Q1. How to deal

More information

Main Memory (Fig. 7.13) Main Memory

Main Memory (Fig. 7.13) Main Memory Main Memory (Fig. 7.13) CPU CPU CPU Cache Multiplexor Cache Cache Bus Bus Bus Memory Memory bank 0 Memory bank 1 Memory bank 2 Memory bank 3 Memory b. Wide memory organization c. Interleaved memory organization

More information

Virtual memory why? Virtual memory parameters Compared to first-level cache Parameter First-level cache Virtual memory. Virtual memory concepts

Virtual memory why? Virtual memory parameters Compared to first-level cache Parameter First-level cache Virtual memory. Virtual memory concepts Lecture 16 Virtual memory why? Virtual memory: Virtual memory concepts (5.10) Protection (5.11) The memory hierarchy of Alpha 21064 (5.13) Virtual address space proc 0? s space proc 1 Physical memory Virtual

More information

Introduction to Cloud Computing and Virtualization. Mayank Mishra Sujesha Sudevalayam PhD Students CSE, IIT Bombay

Introduction to Cloud Computing and Virtualization. Mayank Mishra Sujesha Sudevalayam PhD Students CSE, IIT Bombay Introduction to Cloud Computing and Virtualization By Mayank Mishra Sujesha Sudevalayam PhD Students CSE, IIT Bombay Talk Layout Cloud Computing Need Features Feasibility Virtualization of Machines What

More information

CS/ECE 3330 Computer Architecture. Chapter 5 Memory

CS/ECE 3330 Computer Architecture. Chapter 5 Memory CS/ECE 3330 Computer Architecture Chapter 5 Memory Last Chapter n Focused exclusively on processor itself n Made a lot of simplifying assumptions IF ID EX MEM WB n Reality: The Memory Wall 10 6 Relative

More information

COSC3330 Computer Architecture Lecture 20. Virtual Memory

COSC3330 Computer Architecture Lecture 20. Virtual Memory COSC3330 Computer Architecture Lecture 20. Virtual Memory Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Virtual Memory Topics Reducing Cache Miss Penalty (#2) Use

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 32 Caches III 2008-04-16 Lecturer SOE Dan Garcia Hi to Chin Han from U Penn! Prem Kumar of Northwestern has created a quantum inverter

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Transistor: Digital Building Blocks

Transistor: Digital Building Blocks Final Exam Review Transistor: Digital Building Blocks Logically, each transistor acts as a switch Combined to implement logic functions (gates) AND, OR, NOT Combined to build higher-level structures Multiplexer,

More information

Readings. Storage Hierarchy III: I/O System. I/O (Disk) Performance. I/O Device Characteristics. often boring, but still quite important

Readings. Storage Hierarchy III: I/O System. I/O (Disk) Performance. I/O Device Characteristics. often boring, but still quite important Storage Hierarchy III: I/O System Readings reg I$ D$ L2 L3 memory disk (swap) often boring, but still quite important ostensibly about general I/O, mainly about disks performance: latency & throughput

More information

Hardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.

Hardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018 Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018 Today s Papers Disco: Running Commodity Operating Systems on Scalable Multiprocessors, Edouard

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

How much energy can you save with a multicore computer for web applications?

How much energy can you save with a multicore computer for web applications? How much energy can you save with a multicore computer for web applications? Peter Strazdins Computer Systems Group, Department of Computer Science, The Australian National University seminar at Green

More information

Computer Architecture. Memory Hierarchy. Lynn Choi Korea University

Computer Architecture. Memory Hierarchy. Lynn Choi Korea University Computer Architecture Memory Hierarchy Lynn Choi Korea University Memory Hierarchy Motivated by Principles of Locality Speed vs. Size vs. Cost tradeoff Locality principle Temporal Locality: reference to

More information

Memory latency: Affects cache miss penalty. Measured by:

Memory latency: Affects cache miss penalty. Measured by: Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row. Static RAM may be used for main memory

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 14 Caches III Asst. Proflecturer SOE Miki Garcia WHEN FIBER OPTICS IS TOO SLOW 07/16/2014: Wall Street Buys NATO Microwave Towers in

More information

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System Abstract In this paper, we address the latency issue in RT- XEN virtual machines that are available in Xen 4.5. Despite the advantages of applying virtualization to systems, the default credit scheduler

More information

Lecture 24: Memory, VM, Multiproc

Lecture 24: Memory, VM, Multiproc Lecture 24: Memory, VM, Multiproc Today s topics: Security wrap-up Off-chip Memory Virtual memory Multiprocessors, cache coherence 1 Spectre: Variant 1 x is controlled by attacker Thanks to bpred, x can

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

Virtualization and memory hierarchy

Virtualization and memory hierarchy Virtualization and memory hierarchy Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 16: Virtual Machine Monitors Geoffrey M. Voelker Virtual Machine Monitors 2 Virtual Machine Monitors Virtual Machine Monitors (VMMs) are a hot

More information

CSE 560 Computer Systems Architecture

CSE 560 Computer Systems Architecture This Unit: CSE 560 Computer Systems Architecture App App App System software Mem I/O The operating system () A super-application Hardware support for an Page tables and address translation s and hierarchy

More information

Caches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first

Caches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy. Jiang Jiang

Chapter 5. Large and Fast: Exploiting Memory Hierarchy. Jiang Jiang Chapter 5 Large and Fast: Exploiting Memory Hierarchy Jiang Jiang jiangjiang@ic.sjtu.edu.cn [Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2008, MK] Chapter 5 Large

More information

Optimising Multicore JVMs. Khaled Alnowaiser

Optimising Multicore JVMs. Khaled Alnowaiser Optimising Multicore JVMs Khaled Alnowaiser Outline JVM structure and overhead analysis Multithreaded JVM services JVM on multicore An observational study Potential JVM optimisations Basic JVM Services

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

Operating Systems. Operating Systems Sina Meraji U of T

Operating Systems. Operating Systems Sina Meraji U of T Operating Systems Operating Systems Sina Meraji U of T Recap Last time we looked at memory management techniques Fixed partitioning Dynamic partitioning Paging Example Address Translation Suppose addresses

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures CS61C L22 Caches II (1) CPS today! Lecture #22 Caches II 2005-11-16 There is one handout today at the front and back of the room! Lecturer PSOE,

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Fast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names

Fast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names Fast access ===> use map to find object HW == SW ===> map is in HW or SW or combo Extend range ===> longer, hierarchical names How is map embodied: --- L1? --- Memory? The Environment ---- Long Latency

More information

Chapter 8 Main Memory

Chapter 8 Main Memory COP 4610: Introduction to Operating Systems (Spring 2014) Chapter 8 Main Memory Zhi Wang Florida State University Contents Background Swapping Contiguous memory allocation Paging Segmentation OS examples

More information

CIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018

CIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018 CIS 3207 - Operating Systems Memory Management Cache and Demand Paging Professor Qiang Zeng Spring 2018 Process switch Upon process switch what is updated in order to assist address translation? Contiguous

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Virtual Memory 11282011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review Cache Virtual Memory Projects 3 Memory

More information

CS533 Concepts of Operating Systems. Jonathan Walpole

CS533 Concepts of Operating Systems. Jonathan Walpole CS533 Concepts of Operating Systems Jonathan Walpole Disco : Running Commodity Operating Systems on Scalable Multiprocessors Outline Goal Problems and solutions Virtual Machine Monitors(VMM) Disco architecture

More information