An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization
1 An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization
Chung Lee and Peter Strazdins, Computer Systems Group, Research School of Computer Science, The Australian National University (slides available from Peter.Strazdins/seminars)
PDSEC 2018: The 19th Workshop on Parallel and Distributed Scientific and Engineering Computing, Vancouver, Canada, 25 May 2018
2 Talk Overview (IPDPS/PDSEC-18)
- motivation
- background: network device virtualization using Xen
- our approach (small side-core to offload driver domain I/O) and aims
- methodology: overall, AMP simulation framework
- derivation of side-core design parameters: core execution units, TLBs, L1 and L2 caches
- overall side-core parameters and comparison of 4 side-cores
- conclusions
3 Motivation
- virtualization is an attractive technology for HPC
  - allows for a user-defined, deploy-anywhere OS / software stack
  - dynamically migratable, e.g. to/from a local supercomputer to a public cloud, or away from nodes developing faults
- it suffers, however, from poor (network) I/O performance
  - techniques like direct device assignment lose the virtualization benefits
  - others, like SR-IOV, require special hardware
- the side-core approach: devote a core to offloading I/O virtualization
  - has neither of these drawbacks
  - but it is wasteful to use a large core: most I/O offloading is very simple
- can we use a single instruction set architecture asymmetric multiprocessor (AMP): the large cores for the application, with a small side-core for the I/O? How Small Can It Be?
4 Background: Network Device Virtualization using Xen
- Xen allows guest OSs (domUs) to perform I/O through a privileged OS known as the driver domain (or dom0)
- the TCP/IP network protocol stack is split into:
  - a top-half, netfront, on domU
  - a bottom-half, netback, on dom0
- these communicate through VMM data transport and event notification mechanisms
[figure: user domains (netfront) connect via I/O rings over Xen to the driver domain (bridge, netback, VIFs, physical device driver) and the device]
- provides security & live migration, but incurs considerable I/O overheads!
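The split-driver transport above can be pictured as a shared request/response ring. A toy Python sketch follows (the class and its method names are invented for illustration; Xen's real rings are fixed-size shared-memory arrays with producer/consumer indices and event-channel notifications, not Python queues):

```python
from collections import deque

class IORing:
    """Toy model of a Xen-style split-driver I/O ring: the frontend
    (netfront, in domU) posts requests, the backend (netback, in dom0)
    consumes them and posts responses. Illustrative only."""

    def __init__(self, size=256):
        self.size = size          # a real ring has a fixed number of slots
        self.requests = deque()
        self.responses = deque()

    def post_request(self, desc):
        """Frontend side: queue a packet descriptor for the backend."""
        if len(self.requests) >= self.size:
            raise BufferError("ring full")
        self.requests.append(desc)

    def backend_poll(self, handle):
        """Backend side: drain pending requests, producing responses."""
        while self.requests:
            req = self.requests.popleft()
            self.responses.append(handle(req))

    def reap_responses(self):
        """Frontend side: collect completed responses."""
        out = list(self.responses)
        self.responses.clear()
        return out

ring = IORing()
ring.post_request({"op": "tx", "len": 1500})
ring.backend_poll(lambda req: {"op": req["op"], "status": "ok"})
print(ring.reap_responses())  # [{'op': 'tx', 'status': 'ok'}]
```

The point of the structure is that the two halves run asynchronously, which is exactly what lets the side-core execute the backend half in parallel with the guest.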
5 Approach and Aims
- our approach: use small side-cores to offload dom0, and the large cores for the domUs
  - no modification to Xen required
  - allows parallelization of top- and bottom-half I/O
- such small side-cores have also been shown useful for offloading other services, e.g. Java VM JIT and garbage collection
- our previous work (How Small Can It Be?, HPCC 15) used AMP emulation to design (derive architectural parameters for) such a side-core
  - limited to existing physical hardware (e.g. the AMD K10 and Intel Atom)
  - some params. were just a guess based on the Atom's, e.g. L2-TLB & I$ size
  - considered only performance (delay)
- contribution of this work: verify/refine this design using:
  - a methodology combining full-machine functional and power simulation
  - a systematic evaluation of the trade-offs on performance, area & energy
6 Simulation Methodology
- use CACTI to estimate the latency (cycles) of caches and TLBs
- conduct full-machine simulation of the AMP system (SimNow! and COTSon)
- evaluate the detailed area and energy profiles using McPAT
[figure: Processor Description feeds CACTI, Timing Simulation (timing and architectural statistics) and Power Simulation (power profiles, area), combined into the energy-delay product]
- metrics: ED²P = Energy × Delay² and ED²AP = Energy × Delay² × Area
- where Energy = Σ_{i=1..n} (P_i(access, miss) + S_i) × T
  - P_i / S_i: dynamic/static power of the i-th level component (cache, TLB)
  - T: total execution time, as measured by full-machine simulation
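The metrics above are straightforward to compute from the simulator outputs. A minimal sketch (the component powers, time and area below are made-up numbers, not values from the paper):

```python
def energy(components, T):
    """Energy = sum_i (P_i(access, miss) + S_i) * T, where each component i
    contributes dynamic power P_i and static power S_i (watts) over the
    total execution time T (seconds) measured by full-machine simulation."""
    return sum(p_dyn + p_static for p_dyn, p_static in components) * T

def ed2p(E, delay):
    """ED^2P = Energy * Delay^2 (lower is better)."""
    return E * delay ** 2

def ed2ap(E, delay, area):
    """ED^2AP = Energy * Delay^2 * Area (lower is better)."""
    return E * delay ** 2 * area

# hypothetical (dynamic, static) power in W for L1 I$, L1 D$, L2$, TLBs
comps = [(0.05, 0.01), (0.06, 0.01), (0.10, 0.03), (0.02, 0.005)]
T = 1.5        # total execution time (s), from the timing simulation
A = 3.2        # side-core area (mm^2), from McPAT

E = energy(comps, T)
print(ed2p(E, T), ed2ap(E, T, A))
```

Because delay is squared (and area enters ED²AP linearly), the metric penalises slow designs more heavily than large ones, which is why small-but-adequate caches can win.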
7 Simulation Framework (I)
[figure: two simulation nodes joined by a network mediator; each node runs a functional simulator hosting dom0 and two domUs on SimCores, with memory, functional TLBs and I/O devices (disk, network, etc.) feeding a timing model; the timing model comprises a side-core and two big cores, each with private I/D TLBs and L2$, a shared L3$, and a bus to memory and I/O devices]
8 Simulation Framework (II)
- model an AMP module consisting of two big cores and a side-core
  - a larger AMP processor can be made from several modules
  - all cores share the L3$; the L1/L2$ and TLBs are private to cores
- each functional core (SimCore) in SimNow! maps to a small or big core
- timing model in COTSon (extended for x86 table-walk, HW prefetch & superscalar)
- simulate both intra-node (2 domUs on the same node and module) and inter-node (2 × 2 domUs on different nodes) communication
- chose NPB IS.A and FT.W: they have message sizes of 2MB and 16MB (intra-node), larger than the candidate side-core cache sizes / TLB spans
  - note: communication-intensive workloads produce much the same traffic at the IP level as seen at dom0
- warm up TLBs, caches etc. over an initial interval, then collect statistics over a following interval of the same length
  - an unused nop is injected into the benchmarks to notify COTSon of this
- a 10Gb network interface & switch in COTSon's network model
9 Deriving Design Parameters: Core Execution Units
[chart: utilisation of single/dual/triple/quad execution units for asymmetric 2-way, symmetric 2-way, 3-way and 4-way designs]
- utilisation of multiple execution units on dom0 (results essentially identical for FT.W/IS.A and inter-/intra-node)
- the asymmetric 2-way design (2nd unit: integer instructions only, so it does not need a dual-ported L1$) is within 7% of the 4-way
10 Deriving Design Parameters: Translation Lookaside Buffers
[chart: I/D TLB miss rate (%) vs number of entries, intra- and inter-node]
- fully-associative single-level I/D TLB miss rates vs size (4KB page size, LRU replacement); FT/IS essentially identical
- performance is insufficient at 24 entries & saturates by 256 entries
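The miss-rate curves come from full-machine simulation, but the structure being measured is simple to model. A sketch of a fully-associative, LRU-replacement TLB (the access stream below is synthetic, invented to show the capacity effect, not the paper's dom0 trace):

```python
from collections import OrderedDict

def tlb_miss_rate(accesses, entries, page_size=4096):
    """Simulate a fully-associative TLB with LRU replacement and
    return its miss rate over a stream of virtual addresses."""
    tlb, misses = OrderedDict(), 0
    for addr in accesses:
        vpn = addr // page_size              # virtual page number
        if vpn in tlb:
            tlb.move_to_end(vpn)             # hit: mark most recently used
        else:
            misses += 1
            if len(tlb) >= entries:
                tlb.popitem(last=False)      # evict the least recently used
            tlb[vpn] = True
    return misses / len(accesses)

# a stream cycling over 64 pages: a 24-entry TLB thrashes under LRU,
# while a 256-entry TLB takes only the 64 compulsory misses
stream = [4096 * (i % 64) for i in range(10000)]
print(tlb_miss_rate(stream, 24), tlb_miss_rate(stream, 256))  # 1.0 0.0064
```

The cliff between the two sizes mirrors the slide's observation: once the TLB spans the working set, the miss rate saturates and extra entries buy nothing.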
11 Deriving Design Parameters: Level-2 TLBs
[chart: I/D L2 TLB miss rate vs size and associativity, intra- and inter-node]
- L2 TLB miss rate (% L2 misses / L1 accesses) vs size × associativity
- performance seems to saturate from the 256-entry 2/4-way or 512-entry 2-way configurations
12 ED²AP Analysis of 2-level TLB Configurations
[charts: normalized ED²AP vs L1 I-TLB and D-TLB size, inter-node]
- normalized ED²AP vs L1-TLB size (entries) for the three minimal-ED²AP L2 I/D-TLB settings (128-entry 4-way, 256-entry 4-way, and 512-entry 2-way)
- fully-associative 32/48-entry L1 I/D-TLBs with 4-way 256-entry L2 I/D-TLBs are optimal
- a similar analysis for intra-node communication shows fully-associative 24-entry L1 I/D-TLBs with 128-entry 4-way L2 I/D-TLBs are optimal
13 Deriving Design Parameters: Level-1 Caches
[charts: I-cache and D-cache miss rates (ft/is, intra-/inter-node) vs configuration]
- miss rates of I/D-caches vs capacity (KB) × associativity
- I-cache: 3 plateaus, at 16-4, 32-4 and 64-4; inter-node rates are higher
- D-cache: plateaus not as clear; intra-node rates are higher, as only 2 processes are used
- D-cache inter-node scalability is worse, as cold and coherency misses are inherent in data access
14 Level-1 Instruction/Data Cache Parameters
[table: size (KB), associativity (way), access time (cycles), area (mm²), runtime power (mW) and leakage power (mW) for the candidate L1 I/D cache configurations]
- from CACTI 6.5 and McPAT, at a 3 GHz clock and 45nm lithography
- note the changes in access time due to size and associativity
15 Deriving Design Parameters: Level-1 Caches
[charts: normalised average access cycles vs normalised area for I-cache and D-cache configurations]
- average memory access times vs area of I/D-caches (capacity × associativity) for inter-node communication (intra-node was almost identical)
- a 512 KB 4-way L2$ was used (other L2$s exhibit the same trend)
- I$: the Pareto-efficient frontier is 16-2, 16-4, 32-2, 32-4, 64-2 and 64-4
- D$: the Pareto-efficient frontier is 16-2 and 32-2: a clear choice!
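Pareto-efficient frontiers like these can be extracted mechanically from the (access time, area) pairs. A sketch, using placeholder numbers rather than the paper's measured values:

```python
def pareto_frontier(points):
    """Return the configurations not dominated in (access_time, area).
    A point is dominated if some other point is <= in both metrics
    and strictly < in at least one."""
    frontier = []
    for name, t, a in points:
        dominated = any(
            (t2 <= t and a2 <= a) and (t2 < t or a2 < a)
            for _, t2, a2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (config "capacityKB-assoc", normalised access cycles, normalised area);
# illustrative numbers only
configs = [("16-2", 1.30, 1.0), ("16-4", 1.25, 1.2), ("32-2", 1.10, 1.9),
           ("32-8", 1.12, 2.6), ("64-2", 1.05, 3.8)]
print(pareto_frontier(configs))  # ['16-2', '16-4', '32-2', '64-2']
```

Here 32-8 drops out because 32-2 is both faster and smaller; every surviving configuration trades area for access time, which is what the slide's frontiers express.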
16 Deriving Design Parameters: Level-2 Cache
[chart: L2 cache miss rate (%) vs cache configuration, intra- and inter-node]
- L2 cache miss rate vs capacity (KB) × associativity
- the ratio of L2 misses to total L1 cache read/write accesses is used
- plateaus from 256-4
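Note that this ratio is a global miss rate (L2 misses over all L1 accesses), not the local one (L2 misses over L2 accesses); the distinction matters when optimising a 2-level hierarchy, as the conclusions note. A small sketch with invented counter values:

```python
def l2_rates(l1_accesses, l1_misses, l2_misses):
    """Local rate:  L2 misses / L2 accesses (L2 is accessed on L1 misses).
    Global rate (used on this slide): L2 misses / total L1 accesses."""
    local_rate = l2_misses / l1_misses
    global_rate = l2_misses / l1_accesses
    return local_rate, global_rate

# hypothetical counters from one simulation run
local_r, global_r = l2_rates(l1_accesses=1_000_000,
                             l1_misses=50_000,
                             l2_misses=5_000)
print(f"local {local_r:.1%}, global {global_r:.2%}")  # local 10.0%, global 0.50%
```

A cache with an ugly 10% local miss rate can still contribute a negligible global rate if the L1s filter most traffic, which is why the small 128KB L2$ remains competitive.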
17 Level-2 Unified Cache Parameters
[table: size (KB), associativity (way), access time (cycles), area (mm²) and leakage power (mW) for the candidate L2 cache configurations]
- from CACTI 6.5 and McPAT, at a 3 GHz clock and 45nm lithography
- note again the changes in access time due to size and associativity
18 ED²P and ED²AP Analysis of 2-level Cache Configurations
[chart: normalized ED²P and ED²AP for the 2-level cache configurations, inter-node]
- for inter-node communication: the small design (16KB 4/2-way I/D-$ and a 128KB 4-way L2$) is ED²AP optimal!
- from other results, it is also optimal for intra-node, for both ED²AP and ED²P
19 Overall Results: Processor Parameters

                  Big            Atom           SS-e           SS-s
  Core type       Out-of-order   In-order       In-order       In-order
  Pipelines       Symm. 4        Asymm. 2       Asymm. 2       Asymm. 2
  Instrn. cache   64KB/2-way     32KB(8)        64KB(2)        16KB(4)
  Data cache      64KB(2)        24KB(6)        24KB(6)        16KB(2)
  L2 cache        512KB(16)      512KB(8)       256KB(16)      128KB(4)
  L1 I/D TLBs     64(f)/64(f)    32(f)/16(f)    32(f)/48(f)    32(f)/48(f)
  L2 I/D TLBs     512(4)/512(4)  None/64(4)     512(4)/512(4)  256(4)/256(4)
  Area (mm²)
  Avg. Power (W)

- cores: Big (the AMD K10), Atom (the Intel Atom), SS-e (from the HPCC 15 paper), SS-s (this paper)
- SS-e and SS-s have no FPU, a simple 2-level branch predictor and no hardware prefetch (emulation was sufficient to determine these)
20 Comparison of 3-Core Modules with Various Side-Cores
[chart: normalised power, area, performance (inverse) and energy for the Big, Atom, SS-e and SS-s side-cores]
- inter-node simulation results (FT.W.4) for a module of two big (AMD K10) cores and 1 side-core (Big, Atom, SS-e, or SS-s)
- figures normalised to the smallest of each category; smaller is better
21 Conclusions
- our methodology to design a small side-core in an AMP system for efficient driver domain (dom0) I/O offloading:
  - based on coupled full-machine, area and power simulations
  - validated on the Atom: the side-core model's perf. counter metrics were within 11%
  - permitted a systematic exploration of the design space
  - takes into account the relationship of energy, delay and area, e.g. the effect of cache size/associativity on access latency
- the 2-level memory hierarchy made ED²AP optimization a challenge!
  - need to calculate the L2 miss rate relative to L1 accesses
  - Pareto frontiers and/or exhaustive analysis for the optimal parameters
- of interest is only the TCP/IP traffic-generated workload on dom0
  - largely insensitive to the particular communication-intensive domU workload
  - inter-node communication workloads are more demanding than intra-node
22 Conclusions (II)
- a 2-way asymmetric in-order execution pipeline is optimal, due to the tight instruction dependence and short basic block size in dom0 workloads
- the 2-level TLB and cache parameters of the side-core are critical, with surprisingly small values being optimal for ED²AP
  - L1 I/D TLBs fully-associative, but 2-4 way associativity elsewhere
  - the 16KB/2-way D-cache was within 3% of the 32KB/2-way, which was optimal for delay
- broadly validated the How Small Can It Be? (HPCC 15) study, which was based on emulation on existing H/W
  - however with more confidence, from the more systematic methodology
  - can get energy and area savings with the new design, with almost identical performance
- It Can Be Even Smaller!
23 Thank You!!... Questions???
[recap figures: the Xen network virtualization architecture, the simulation methodology flow, and the normalised ED²P/ED²AP and power/area/performance/energy comparison charts]
More informationCache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics
More informationTransistor: Digital Building Blocks
Final Exam Review Transistor: Digital Building Blocks Logically, each transistor acts as a switch Combined to implement logic functions (gates) AND, OR, NOT Combined to build higher-level structures Multiplexer,
More informationReadings. Storage Hierarchy III: I/O System. I/O (Disk) Performance. I/O Device Characteristics. often boring, but still quite important
Storage Hierarchy III: I/O System Readings reg I$ D$ L2 L3 memory disk (swap) often boring, but still quite important ostensibly about general I/O, mainly about disks performance: latency & throughput
More informationHardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance
More informationVirtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018
Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018 Today s Papers Disco: Running Commodity Operating Systems on Scalable Multiprocessors, Edouard
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationHow much energy can you save with a multicore computer for web applications?
How much energy can you save with a multicore computer for web applications? Peter Strazdins Computer Systems Group, Department of Computer Science, The Australian National University seminar at Green
More informationComputer Architecture. Memory Hierarchy. Lynn Choi Korea University
Computer Architecture Memory Hierarchy Lynn Choi Korea University Memory Hierarchy Motivated by Principles of Locality Speed vs. Size vs. Cost tradeoff Locality principle Temporal Locality: reference to
More informationMemory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row. Static RAM may be used for main memory
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 14 Caches III Asst. Proflecturer SOE Miki Garcia WHEN FIBER OPTICS IS TOO SLOW 07/16/2014: Wall Street Buys NATO Microwave Towers in
More informationAbstract. Testing Parameters. Introduction. Hardware Platform. Native System
Abstract In this paper, we address the latency issue in RT- XEN virtual machines that are available in Xen 4.5. Despite the advantages of applying virtualization to systems, the default credit scheduler
More informationLecture 24: Memory, VM, Multiproc
Lecture 24: Memory, VM, Multiproc Today s topics: Security wrap-up Off-chip Memory Virtual memory Multiprocessors, cache coherence 1 Spectre: Variant 1 x is controlled by attacker Thanks to bpred, x can
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationVirtualization and memory hierarchy
Virtualization and memory hierarchy Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 16: Virtual Machine Monitors Geoffrey M. Voelker Virtual Machine Monitors 2 Virtual Machine Monitors Virtual Machine Monitors (VMMs) are a hot
More informationCSE 560 Computer Systems Architecture
This Unit: CSE 560 Computer Systems Architecture App App App System software Mem I/O The operating system () A super-application Hardware support for an Page tables and address translation s and hierarchy
More informationCaches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first
Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy. Jiang Jiang
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Jiang Jiang jiangjiang@ic.sjtu.edu.cn [Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2008, MK] Chapter 5 Large
More informationOptimising Multicore JVMs. Khaled Alnowaiser
Optimising Multicore JVMs Khaled Alnowaiser Outline JVM structure and overhead analysis Multithreaded JVM services JVM on multicore An observational study Potential JVM optimisations Basic JVM Services
More informationMemory. From Chapter 3 of High Performance Computing. c R. Leduc
Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor
More informationOperating Systems. Operating Systems Sina Meraji U of T
Operating Systems Operating Systems Sina Meraji U of T Recap Last time we looked at memory management techniques Fixed partitioning Dynamic partitioning Paging Example Address Translation Suppose addresses
More informationCS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures CS61C L22 Caches II (1) CPS today! Lecture #22 Caches II 2005-11-16 There is one handout today at the front and back of the room! Lecturer PSOE,
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationFast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names
Fast access ===> use map to find object HW == SW ===> map is in HW or SW or combo Extend range ===> longer, hierarchical names How is map embodied: --- L1? --- Memory? The Environment ---- Long Latency
More informationChapter 8 Main Memory
COP 4610: Introduction to Operating Systems (Spring 2014) Chapter 8 Main Memory Zhi Wang Florida State University Contents Background Swapping Contiguous memory allocation Paging Segmentation OS examples
More informationCIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018
CIS 3207 - Operating Systems Memory Management Cache and Demand Paging Professor Qiang Zeng Spring 2018 Process switch Upon process switch what is updated in order to assist address translation? Contiguous
More informationCPE300: Digital System Architecture and Design
CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Virtual Memory 11282011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review Cache Virtual Memory Projects 3 Memory
More informationCS533 Concepts of Operating Systems. Jonathan Walpole
CS533 Concepts of Operating Systems Jonathan Walpole Disco : Running Commodity Operating Systems on Scalable Multiprocessors Outline Goal Problems and solutions Virtual Machine Monitors(VMM) Disco architecture
More information