Cross-Layer Memory Management to Reduce DRAM Power Consumption


Cross-Layer Memory Management to Reduce DRAM Power Consumption. Michael Jantz, Assistant Professor, University of Tennessee, Knoxville. 1

Introduction. Assistant Professor at UT since August 2014. Before UT: PhD in Computer Science at KU (July 2014); intern at Intel Corporation (2012-2013). Research interests: Compilers (optimization, phase ordering), Operating Systems (kernel instrumentation, memory and power management), Runtime Systems (dynamic compilation, object mgmt.). Courses taught: Compilers (COSC 461), Discrete Structures (COSC 311). 2

Outline: Compiler Optimization Phase Ordering; Dynamic Compilation; Cross-Layer Memory Management: Motivation, Design, Experimental Evaluation; Future Directions; Conclusions. 3

Compiler Optimization Phase Ordering 4

Phase Ordering. Compiler optimizations operate in phases, and phases interact with each other. Phase ordering: different phase orderings produce different quality code. Problem: finding the best ordering for each function or program takes a very long time; iterative search is the most common technique. 5

Exploiting Phase Interactions. Our approach: identify and exploit phase interactions during the search. Major contributions: reduce exhaustive phase ordering search time; increase the applicability and effectiveness of individual optimization phases; improve phase ordering heuristics. Publications: LCTES '10 [1], CASES '10 [2], CASES '13 [3], S:P&E (Jan. '13) [4]. 6

Dynamic Compilation 7

Tradeoffs in Dynamic Compilation. Managed language applications (e.g., Java) are distributed as machine-independent code and require compilation at runtime. Dynamic compilation policies involve tradeoffs and can potentially slow down overall performance. Several factors must be considered when setting policy: compiling speed and quality of compiled code, execution frequency of individual methods, and availability of compilation resources. 8

Dynamic Compilation Strategies. Conducted multiple studies on how, when, and whether to compile program methods, using an industrial-grade Java VM (HotSpot). Major studies: performance potential of phase selection in dynamic compilers (VEE '13-A [6]); dynamic compilation strategy on modern machines (TACO, Dec. '13 [5]). 9

Cross-Layer Memory Management 10

A Collaborative Approach to Memory Management. Memory has become a significant player in power and performance, yet memory power management is challenging. We propose a collaborative approach between applications, the operating system, and hardware: applications communicate memory usage intent to the OS; the OS re-architects memory management to interpret application intent and manage memory over hardware units; the hardware communicates its layout to the OS to guide memory management decisions. 11

A Collaborative Approach to Memory Management. Implemented the framework by re-architecting a recent Linux kernel and conducted an experimental evaluation. Publications: VEE '13-B [7], Linux Symposium '14 [8], manuscript in submission [9]. 12

Why CPU and memory are the most significant players for power and performance. In servers, memory power can account for roughly 40% of total power [10]. Applications can direct CPU usage: threads may be affinitized to individual cores or migrated between cores, threads may be prioritized for task deadlines (with nice), and individual cores may be turned off when unused. Surprisingly, much of this flexibility does not exist for controlling memory. 13

Example Scenario. Consider a system running a database workload with 512 GB of DRAM. All memory is in use, but only 2% of pages (roughly 10 GB) are accessed frequently, and CPU utilization is low. How can power consumption be reduced? 14

Challenges in Managing Memory Power. Memory references have temporal and spatial variation. There are at least two levels of virtualization: virtual memory abstracts away application-level information, and physical memory is viewed as a single, contiguous array of storage. There is no way for agents to cooperate with the OS and with each other, and there is no established tuning methodology. 15

A Collaborative Approach. Our approach: enable applications to guide memory management. This requires collaboration between the application, OS, and hardware: an interface for communicating application intent to the OS, and the ability to keep track of which memory modules host which physical pages during memory management. To achieve this, we propose the following abstractions: colors and trays. 16

Communicating Application Intent with Colors. [Diagram: software intent → color → tray → memory allocation and freeing.] A color is a hint for how pages will be used. Colors are applied to sets of virtual pages that are alike, and attributes associated with each color express different types of distinctions: hot and cold pages (frequency of access), or pages belonging to data structures with different usage patterns. Colors allow applications to remain agnostic to lower-level details of memory management. 17

Power-Manageable Units Represented as Trays. [Diagram: software intent → color → tray → memory allocation and freeing.] A tray is a software structure containing sets of pages that constitute a power-manageable unit. This requires a mapping from physical addresses to power-manageable units; ACPI 5.0 defines the memory power state table (MPST) to expose this mapping. We re-architected a recent Linux kernel to perform memory management over trays. 18

[Diagram: the application colors virtual pages (V1 ... VN) to indicate which ranges will be hot, cold, or sequentially accessed; during physical memory allocation and recycling (pages P1 ... PN), the OS looks up the attribute associated with each virtual page's color; the memory topology (trays T0-T7 over modules M0-M7, behind memory controllers with channels CH0/CH1 on NUMA nodes 0 and 1) is represented in the OS using trays.] 19

Experimental Evaluation. Emulating NUMA APIs; memory prioritization for applications; reducing DRAM power consumption; power-saving potential of containerized memory management; localized allocation and recycling; exploiting generational garbage collection. 20

Automatic Cross-Layer Memory Management. Limitations of application guidance: there is little understanding of which colors or coloring hints will be most useful for existing workloads, and all colors and hints must be manually inserted. Our approach: integrate with profiling and analysis to automatically provide power / bandwidth management. Implemented using the HotSpot JVM: instrumentation and analysis to build a memory profile, and partitioning of live objects into separately colored regions. 21
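
As a rough illustration of the partitioning step described above, the sketch below shows one way allocation sites could be divided into hot and cold sets from a memory profile. This is a minimal sketch, not the actual HotSpot implementation: the class names, fields, and the accesses-per-byte metric are assumptions made only for this example.

import java.util.*;

// Hypothetical sketch of offline hot/cold partitioning of allocation sites.
// A "site" here is just (method, bytecode index) with profiled counters.
final class AllocationSite {
    final String method;        // e.g., a fully qualified method name
    final int bci;              // bytecode index of the allocation
    long bytesAllocated;        // from the profiling run
    long accessesObserved;      // sampled/instrumented accesses to its objects

    AllocationSite(String method, int bci, long bytes, long accesses) {
        this.method = method;
        this.bci = bci;
        this.bytesAllocated = bytes;
        this.accessesObserved = accesses;
    }

    // Accesses per allocated byte: a simple "hotness" metric for this sketch.
    double hotness() {
        return bytesAllocated == 0 ? 0.0 : (double) accessesObserved / bytesAllocated;
    }
}

final class OfflinePartitioner {
    // Marks the hottest sites (by accesses per byte) as "hot" until the chosen
    // fraction of total allocated bytes is covered; the remaining sites are cold.
    static Set<AllocationSite> partitionHot(List<AllocationSite> sites, double hotByteFraction) {
        long totalBytes = sites.stream().mapToLong(s -> s.bytesAllocated).sum();
        long budget = (long) (hotByteFraction * totalBytes);

        List<AllocationSite> sorted = new ArrayList<>(sites);
        sorted.sort(Comparator.comparingDouble(AllocationSite::hotness).reversed());

        Set<AllocationSite> hot = new HashSet<>();
        long covered = 0;
        for (AllocationSite s : sorted) {
            if (covered >= budget) break;
            hot.add(s);                    // objects from these sites go to hot heap regions
            covered += s.bytesAllocated;
        }
        return hot;
    }
}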

[Diagram: HotSpot architecture with the execution engine, JIT compiler, garbage collector, and object profiling and analysis feeding an application heap whose young generation (eden and survivor spaces) and tenured generation are each split into hot and cold spaces.] We employ the default HotSpot configuration for server-class applications and divide the survivor / tenured spaces into spaces for hot / cold objects. 22

[Diagram: same heap organization as the previous slide.] Color spaces on creation or resize; partition allocation sites and objects into hot / cold sets. 23

Potential of the JVM Framework. Our goal: evaluate the power-saving potential when hot / cold objects are known statically. MemBench is a Java benchmark that uses different object types, HotObject and ColdObject, for hot / cold memory; both contain memory resources (an array of integers) and implement different functions for accessing that memory. 24
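
The slides name the two object types but not their structure. Purely as an illustration, here is a minimal sketch of what such classes might look like; the interface, method names, and field layout below are assumptions, not the actual MemBench source.

// Sketch of MemBench-style object types (the names HotObject/ColdObject come
// from the slides; the shared interface and array field are assumptions).
interface MemObject {
    long touch();   // read the object's memory; only hot objects are touched in the benchmark
}

final class HotObject implements MemObject {
    private final int[] data;

    HotObject(int sizeInts) { this.data = new int[sizeInts]; }

    @Override
    public long touch() {
        long sum = 0;
        for (int v : data) sum += v;   // stream through the backing array
        return sum;
    }
}

final class ColdObject implements MemObject {
    private final int[] data;          // allocated but never read during the access phase

    ColdObject(int sizeInts) { this.data = new int[sizeInts]; }

    @Override
    public long touch() { return 0; }  // cold objects are not accessed after allocation
}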

Experimental Platform. Hardware: a single node of a 2-socket server machine; processor: Intel Xeon E5-2620 (12 threads @ 2.1 GHz); memory: 32 GB DDR3 (four DIMMs, each connected to its own channel). Operating system: CentOS 6.5 with Linux 2.6.32. JVM: HotSpot v. 1.6.0_24, 64-bit, in the default configuration for server-class applications. 25

The MemBench Benchmark. Object allocation: creates HotObject and ColdObject objects in a large in-memory array; hot objects are outnumbered by cold objects (~15% of all objects are hot); the object array occupies most (~90%) of system memory. Multi-threaded object access: the object array is divided into 12 separate parts, each passed to its own thread; threads iterate over the object array, accessing only the hot objects; an optional delay parameter controls the time between accesses. 26

MemBench Configurations. Three configurations: default; tray-based kernel (custom kernel, default HotSpot); hot/cold organize (custom kernel, custom HotSpot). The delay is varied from "no delay" to 1000 ns; with no delay, there are 85 ns between memory accesses. 27
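
The access phase and the delay knob can be sketched as below, reusing the MemObject / HotObject types from the previous sketch. This is a simplified, hypothetical re-creation of the workload described above, not the actual MemBench code; the thread-pool partitioning and the busy-wait used to realize the delay are assumptions for illustration.

import java.util.concurrent.*;

// Simplified sketch of the MemBench access phase described in the slides:
// split the object array across worker threads, touch only hot objects,
// and optionally busy-wait between accesses.
final class MemBenchAccess {
    static void run(MemObject[] objects, int numThreads, long delayNanos)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        int chunk = (objects.length + numThreads - 1) / numThreads;

        for (int t = 0; t < numThreads; t++) {
            final int begin = t * chunk;
            final int end = Math.min(objects.length, begin + chunk);
            pool.submit(() -> {
                long sink = 0;
                for (int i = begin; i < end; i++) {
                    if (objects[i] instanceof HotObject) {     // only hot objects are read
                        sink += objects[i].touch();
                        if (delayNanos > 0) spin(delayNanos);  // optional delay between accesses
                    }
                }
                return sink;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Busy-wait so the configured delay does not involve the scheduler.
    private static void spin(long nanos) {
        long end = System.nanoTime() + nanos;
        while (System.nanoTime() < end) { /* spin */ }
    }
}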

MemBench Performance. [Chart: runtime relative to default, P(X)/P(DEF), and bandwidth (GB/s) vs. time between memory accesses (85 ns to 1000 ns) for the default, tray-based kernel, and hot/cold organize configurations.] The tray-based kernel has about the same performance as the default; hot/cold organize exhibits poor performance with low delay. 28

MemBench Bandwidth. [Chart: same axes and configurations as the previous slide.] The default and tray-based kernel configurations produce high memory bandwidth when the delay is low; placement of hot objects across multiple channels enables higher bandwidth. 29

MemBench Bandwidth. [Chart: same axes and configurations as the previous slide.] Hot/cold organize co-locates hot objects on a single channel; increasing the delay reduces the bandwidth requirements of the workload. 30

MemBench Energy. [Chart: energy consumed relative to default, J(X)/J(DEF), vs. time between memory accesses for tray-based kernel (DRAM only), tray-based kernel (CPU+DRAM), hot/cold organize (DRAM only), and hot/cold organize (CPU+DRAM).] Hot/cold organize consumes much less power with low delay; even when bandwidth requirements are reduced, hot/cold organize consumes less power than the other configurations. 31

MemBench Energy. [Chart: same axes and configurations as the previous slide.] Significant energy-savings potential with the custom JVM: maximum DRAM energy savings of ~39% and maximum CPU+DRAM energy savings of ~15%. 32

Results Summary. Object partitioning strategies: an offline approach partitions allocation points, and an online approach uses sampling to predict object access patterns. Evaluated with standard sets of benchmarks (DaCapo, SciMark): achieve 10% average DRAM energy savings and a 2.8% CPU+DRAM reduction, with performance overheads of 2.2% for offline and 5% for online. 33
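
The online approach is only summarized above. The sketch below shows one plausible shape for it: sample a small fraction of object accesses, then periodically re-classify allocation sites whose sampled counts cross a threshold. The class, method names, and thresholding scheme are assumptions for illustration, not the actual HotSpot implementation.

import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of online sampling: record sampled accesses per
// allocation site and periodically promote/demote sites between hot and cold.
final class OnlineSampler {
    private final Map<String, Long> sampledAccesses = new ConcurrentHashMap<>();
    private final Set<String> hotSites = ConcurrentHashMap.newKeySet();

    // Called from a small, sampled subset of object accesses.
    void recordSample(String allocationSite) {
        sampledAccesses.merge(allocationSite, 1L, Long::sum);
    }

    // Called at the end of each sampling interval (e.g., at GC points).
    void reclassify(long hotThreshold) {
        for (Map.Entry<String, Long> e : sampledAccesses.entrySet()) {
            if (e.getValue() >= hotThreshold) {
                hotSites.add(e.getKey());      // future allocations from this site go to hot spaces
            } else {
                hotSites.remove(e.getKey());   // demote sites that have cooled off
            }
        }
        sampledAccesses.clear();               // start a fresh interval
    }

    boolean isHot(String allocationSite) {
        return hotSites.contains(allocationSite);
    }
}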

Current and Future Projects in Cross-Layer Memory Management. Immediate future work: address the performance losses of our current approach and improve the online sampling. Other directions: automatic bandwidth management, applications for heterogeneous memory architectures, and exploiting data object placement within each page to improve efficiency. 34

Conclusions. My research focuses on software systems: compilers, operating systems, and runtime systems. Cross-layer memory management: achieving power/performance efficiency in memory requires a cross-layer approach. This is the first framework to use the usage patterns of application objects to steer low-level memory management; the approach shows promise for reducing DRAM energy and opens several avenues for future research in collaborative memory management. 35

Questions? 36

References
1. Prasad Kulkarni, Michael Jantz, and David Whalley. Improving Both the Performance Benefits and Speed of Optimization Phase Sequence Searches. In the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '10), April 2010.
2. Michael Jantz and Prasad Kulkarni. Eliminating False Phase Interactions to Reduce Optimization Phase Order Search Space. In the ACM/IEEE International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES '10), October 24-29, 2010.
3. Michael Jantz and Prasad Kulkarni. Exploiting Phase Inter-Dependencies for Faster Iterative Compiler Optimization Phase Order Searches. In the ACM/IEEE International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES '13), September 29 - October 4, 2013.
4. Michael Jantz and Prasad Kulkarni. Analyzing and Addressing False Phase Interactions During Compiler Optimization Phase Ordering. In Software: Practice and Experience, January 2013.
5. Michael Jantz and Prasad Kulkarni. Exploring Single and Multi-Level JIT Compilation Policy for Modern Machines. In ACM Transactions on Architecture and Code Optimization (TACO), December 2013.
6. Michael Jantz and Prasad Kulkarni. Performance Potential of Optimization Phase Selection During Dynamic JIT Compilation. In the ACM SIGPLAN Conference on Virtual Execution Environments (VEE '13), March 16-17, 2013. 37

References
7. Michael Jantz, Carl Strickland, Karthik Kumar, Martin Dimitrov, and Kshitij A. Doshi. A Framework for Application Guidance in Virtual Memory Systems. In the ACM SIGPLAN Conference on Virtual Execution Environments (VEE '13), March 16-17, 2013.
8. Michael Jantz, Kshitij Doshi, Prasad Kulkarni, and Heechul Yun. Leveraging MPST in Linux with Application Guidance to Achieve Power-Performance Goals. In the Linux Symposium, Ottawa, Canada, May 2014.
9. Michael Jantz, Forrest Robinson, Prasad Kulkarni, and Kshitij Doshi. Cross-Layer Memory Management for Managed Language Applications. In submission, July 2015.
10. C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller. Energy Management for Commercial Servers. Computer, 36(12):39-48, December 2003. 38