EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

1 EECS750: Advanced Operating Systems, 2/24/2014, Heechul Yun

2 Administrative
- Project
  - Feedback on your proposal will be sent by Wednesday
  - Midterm report due on Apr. 2
  - 3 pages: include intro, related work, progress, and plan
- Summary assignment
  - Subject line: [EECS750] Summary: Paper title
  - Deadline: 11:59 p.m., the day before the class
- Class presentation
  - Subject line: [EECS750] Presentation: Paper title
  - Deadline: 5:00 p.m., the day before the class
  - You don't need to write a summary for the paper you present

3 Operating System Level Shared Memory Management in the Multicore Era. University of Kansas Heechul Yun

4 Multicore for Embedded Systems
- Benefits of multicore processors
  - Reduce number of computers and cost
  - Save space, weight, power (SWaP) w/o performance loss
- But performance isolation is difficult

5 Challenges: Shared Resources
[Figure: unicore (tasks T1 and T2 on one CPU) versus multicore (tasks T1 to T8 on Core 1 to Core 4), each on top of a memory hierarchy; caption: Performance Impact]

6 Memory Performance Isolation
[Figure: Part 1 to Part 4 on Core1 to Core4, sharing the memory controller and DRAM]
- Q. How to guarantee worst-case performance?
- Q. How to reduce memory contention?

7 Case Study
- HRT
  - Synthetic real-time video rec. & analysis
  - P=20, D=13ms
  - Cache-insensitive
- X-server
  - Scrolling text on a gnome-terminal
- Hardware platform
  - Intel Xeon 3530
  - 8MB shared L3 cache
  - 4GB DDR3 1333MHz DIMM (1ch)
- CPU cores are isolated
[Figure: a desktop PC (Intel Xeon 3530) with HRT and Xsrv. on Core1 and Core2, sharing the L3 (8MB) and DRAM]

8 HRT Time Distribution
- Solo: 99pct 10.2ms
- w/ Xserver: 99pct 14.3ms
- 28% deadline violations
- Due to contention in the memory hierarchy

9 Inter-Core DRAM Interference
[Figure: runtime slowdown ratio due to interference. Foreground: X-axis benchmarks (437.leslie3d, 462.libquantum, 410.bwaves, 471.omnetpp); background: 470.lbm. Bandwidth labels in the figure: 1.6GB/s, 1.5GB/s, 1.5GB/s, 1.4GB/s. Platform: Intel Core2, with two cores, two L2 caches, and shared DRAM]
- Significant slowdown (up to 2x on 2 cores)
- Slowdown is not proportional to memory bandwidth usage

10 Outline
- Motivation
- Background & Problems
- DRAM Bandwidth Mgmt. (Time)
- DRAM Allocation Mgmt. (Space)
- Conclusion

11 Background: DRAM Organization
[Figure: Core1 to Core4 share the L3, the memory controller (MC), and the DRAM DIMM]
- A DRAM DIMM has multiple banks
- Different banks can be accessed in parallel

12 Best-case
[Figure: Core1 to Core4, L3, memory controller (MC), DRAM DIMM; accesses are fast]
- Peak = 10.6 GB/s (DDR3 1333 MHz)

13 Best-case
[Figure: Core1 to Core4, L3, memory controller (MC), DRAM DIMM; accesses are fast]
- Peak = 10.6 GB/s (DDR3 1333 MHz)
- Out-of-order processors
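
The peak figure on this slide can be reproduced with a short calculation. The sketch below assumes the standard DDR3 DIMM data bus width of 64 bits (8 bytes per transfer) and the single channel named in the case-study slide; the function name is made up for illustration.

```python
def peak_bandwidth_gbs(mega_transfers_per_sec: int,
                       bus_bytes: int = 8,
                       channels: int = 1) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_bytes=8 assumes the usual 64-bit DIMM data bus.
    """
    bytes_per_sec = mega_transfers_per_sec * 1_000_000 * bus_bytes * channels
    return bytes_per_sec / 1e9


# DDR3-1333: 1333 million transfers per second on one channel
peak = peak_bandwidth_gbs(1333)
print(f"peak = {peak:.3f} GB/s")
```

The result is the roughly 10.6 GB/s quoted on the slide. It is only reachable when every transfer slot on the bus is used, which is why the later slides ask how far the worst case falls below it.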

14 Most-cases
[Figure: Core1 to Core4, L3, memory controller (MC), DRAM DIMM; accesses are a mess]
- Performance = ??

15 Worst-case
[Figure: Core1 to Core4, L3, memory controller (MC), DRAM DIMM; accesses are slow]
- 1 bank b/w
- Less than peak b/w
- How much?

16 Background: DRAM Operation
[Figure: a bank with Row 1 to Row 5 and a row buffer; READ (1, Row 3, Col 7) goes through precharge, activate, and read/write]
- Stateful per-bank access time
  - Row miss: 19 cycles
  - Row hit: 9 cycles
(*) PC6400-DDR2 with (RAS-CAS-CL latency setting)
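
To make "stateful per-bank access time" concrete, here is a minimal sketch of a single bank with a row buffer. It uses the two latencies from the slide (9 cycles for a row hit, 19 for a row miss). The `Bank` class is illustrative only, and it makes the simplifying assumption that any access to a row other than the open one, including the very first access, costs a full row miss.

```python
ROW_HIT_CYCLES = 9    # from the slide
ROW_MISS_CYCLES = 19  # from the slide (precharge + activate + read/write)


class Bank:
    """One DRAM bank; remembers which row is in the row buffer."""

    def __init__(self) -> None:
        self.open_row = None

    def access(self, row: int) -> int:
        """Return the access latency in cycles and update the row buffer."""
        cycles = ROW_HIT_CYCLES if row == self.open_row else ROW_MISS_CYCLES
        self.open_row = row
        return cycles


bank = Bank()
latencies = [bank.access(row) for row in (3, 3, 3, 4, 3)]
print(latencies, sum(latencies))
```

The same five requests cost different amounts depending on what the bank served just before, so the latency a core sees depends on what the other cores did to the same bank.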

17 Real Worst-case
[Figure: over time, Core 1 to Core 4 issue requests 1 to 4 in order through the L3 and memory controller (MC) to the DRAM DIMM; the requests target Row 1, Row 2, Row 3, Row 4, Row 1, Row 2]
- 1 bank & always row miss: ~1.2GB/s
- Each core = ¼ x 1.2GB/s = 300MB/s?
(*) Intel 64 and IA-32 Architectures Optimization Reference Manual

18 Background: Memory Controller (MC)
[Figure: Bruce Jacob et al., Memory Systems: Cache, DRAM, Disk, Fig]
- Request queue(s)
  - Not fair (FR-FCFS: open-row first re-ordering)
  - Unpredictable queuing delay
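
The unfairness of open-row-first re-ordering can be seen in a small sketch. The code below models one bank and one request queue under an assumed, simplified FR-FCFS rule: serve the oldest request that hits the open row, and fall back to the oldest request overall when nothing hits. The `Request` type and function name are hypothetical, and real controllers add many details that are omitted here.

```python
from collections import namedtuple

Request = namedtuple("Request", ["core", "row"])


def fr_fcfs_order(queue, open_row=None):
    """Return the service order of `queue` for a single bank."""
    pending = list(queue)
    served = []
    while pending:
        # First-ready: the oldest request that hits the open row, if any.
        hit = next((r for r in pending if r.row == open_row), None)
        req = hit if hit is not None else pending[0]  # else plain FCFS
        pending.remove(req)
        open_row = req.row
        served.append(req)
    return served


queue = [Request("A", 1), Request("B", 2), Request("A", 1), Request("A", 1)]
order = [r.core for r in fr_fcfs_order(queue)]
print(order)
```

Core B's request arrived second but is served last, because core A keeps hitting the open row. B's queuing delay therefore depends on A's access pattern, which is the unpredictability the slide points out.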

19 Challenges for Performance Isolation
- Multiple parallel resources (banks)
- Stateful bank access latency
- Unpredictable queuing delay
- Unpredictable memory performance

20 Related Work
- Hardware approaches
  - Predictable DRAM controllers
    - Hard: [Paolieri 09], [Akesson 08], [Reineke 11], [Goossen 13], [Zheng 13]
    - Soft: [Mutlu 08, 07], [Ebrahimi 10]
  - Cons: can't apply to commodity systems
- Software approaches
  - Contention-aware scheduling
    - Try to find minimally interfering task->core mapping
    - [Fedorova 07], [Merkel 10]; survey paper: [Zhuravlev 12]
  - Cons: still unpredictable, no guarantees

21 Outline
- Motivation
- Background & Problems
- DRAM Bandwidth Mgmt. (Time)
  - MemGuard [RTAS 13]
- DRAM Allocation Mgmt. (Space)
- Conclusion

22 MemGuard [RTAS 13]
[Figure: inside the operating system, MemGuard consists of a reclaim manager and one b/w regulator per core (0.6GB/s, 0.2GB/s, 0.2GB/s, 0.2GB/s); each regulator uses the PMC of its core (Core1 to Core4) in the multicore processor, which sits above the memory controller and DRAM DIMM]
- Goal: guarantee minimum memory b/w for each core
- How: b/w reservation + best effort sharing

23 Reservation
- Idea
  - Scheduler regulates per-core memory b/w using h/w counters
  - Period = 1 scheduler tick (e.g., 1ms)
[Figure: budget and core activity (computation, memory fetch) over time from 0 to 2ms; annotations: "Schedule a RT idle task" and "Suspend the RT idle task"]
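
The reservation idea can be sketched as a per-core budget that is refilled every period and drained by memory accesses. The code below is a user-space simulation under stated assumptions, not the kernel mechanism itself: it assumes each LLC miss fetches one 64-byte cache line, takes 1GB/s to mean 10^9 bytes per second, and uses hypothetical names (`budget_in_misses`, `BandwidthRegulator`). In the slides the counting is done by hardware counters and throttling by scheduling a real-time idle task; here a `throttled` flag stands in for both.

```python
CACHE_LINE_BYTES = 64  # assumption: one LLC miss fetches one 64-byte line


def budget_in_misses(bw_bytes_per_sec: int, period_us: int = 1000) -> int:
    """Convert a bandwidth reservation into LLC misses allowed per period."""
    return bw_bytes_per_sec * period_us // (1_000_000 * CACHE_LINE_BYTES)


class BandwidthRegulator:
    """Per-core regulator: a budget of misses that is refilled every period."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.used = 0
        self.throttled = False

    def on_period_start(self) -> None:
        """Scheduler tick: refill the budget and let the core run again."""
        self.used = 0
        self.throttled = False

    def on_llc_misses(self, n: int) -> int:
        """Account for n requested misses; return how many may proceed."""
        if self.throttled:
            return 0
        allowed = min(n, self.budget - self.used)
        self.used += allowed
        if self.used >= self.budget:
            self.throttled = True  # stands in for scheduling the RT idle task
        return allowed


reg = BandwidthRegulator(budget_in_misses(1_000_000_000))  # 1GB/s, 1ms period
print(reg.budget, reg.on_llc_misses(10_000), reg.on_llc_misses(10_000))
```

With a 1ms period, a 1GB/s reservation corresponds to 15625 cache-line fetches per period under these assumptions. Once they are used, the core issues no further memory traffic until the next tick.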

24 Impact of Reservation
[Figure: LLC misses/ms over time (ms), w/o MemGuard and with MemGuard (1GB/s)]

25 Reservation
- Key insight
  - Worst-case bandwidth can be guaranteed.
  - Total reserved bandwidth must not exceed the worst-case DRAM bandwidth ($r_{min}$)
- System-wide reservation rule
  $$\sum_{i=0}^{m} B_i \le r_{min}$$
  - $m$: number of cores
  - $B_i$: Core $i$'s b/w reservation
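
The system-wide rule amounts to a one-line admission check. The sketch below uses integer MB/s to avoid floating-point rounding and takes r_min = 1200 MB/s from the "~1.2GB/s" worst-case figure on the earlier slide. The function name is hypothetical.

```python
R_MIN_MBS = 1200  # worst-case DRAM bandwidth from the slides (~1.2GB/s)


def admissible(reservations_mbs, r_min_mbs=R_MIN_MBS) -> bool:
    """True if the per-core reservations fit within the guaranteed bandwidth."""
    return sum(reservations_mbs) <= r_min_mbs


# Per-core reservations shown in the MemGuard overview figure
print(admissible([600, 200, 200, 200]))
# The case-study split between HRT and the X-server
print(admissible([900, 300]))
# An over-committed configuration is rejected
print(admissible([900, 600]))
```

Both configurations from the slides reserve exactly 1200 MB/s in total, so they use the whole guaranteed bandwidth without exceeding it.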

26 Best-Effort Sharing
[Figure: timeline in ms (0 to 2) for Core0 (900MB/s) and Core1 (300MB/s), showing guaranteed b/w, best-effort b/w, and the points where a core is throttled and rescheduled]
- Spare Sharing [RTAS 13]
- Proportional Sharing [Under submission to TC]

27 Case Study
- HRT
  - Synthetic real-time video capture
  - P=20, D=13ms
  - Cache-insensitive
- X-server
  - Scrolling text on a gnome-terminal
- Hardware platform
  - Intel Xeon 3530
  - 8MB shared cache
  - 4GB DDR3 1333MHz DIMM
[Figure: a desktop PC (Intel Xeon 3530) with HRT and Xsrv. on Core1 and Core2, sharing the L3 (8MB) and DRAM]

28 w/o MemGuard
- HRT (solo): HRT's 99pct: 10.2ms
- HRT (w/ Xserver): HRT's 99pct: 14.3ms, X's CPU util: 78%

29 MemGuard reserve only (HRT=900MB/s, X=300MB/s)
- HRT (solo): HRT's 99pct: 10.7ms
- HRT (w/ Xserver): HRT's 99pct: 11.2ms, X's CPU util: 4%

30 MemGuard reserve (HRT=900MB/s, X=300MB/s) + best-effort sharing
- HRT (solo): HRT's 99pct: 10.7ms
- HRT (w/ Xserver): HRT's 99pct: 10.7ms, X's CPU util: 48%

31 MemGuard reserve (HRT=600MB/s, X=600MB/s) + best-effort sharing
- HRT (solo): HRT's 99pct: 10.9ms
- HRT (w/ Xserver): HRT's 99pct: 12.1ms, X's CPU util: 61%

32 Real-Time Performance Improvement
[Figure: HRT and X-server results]
- Using MemGuard, we can achieve
  - No deadline miss for HRT
  - Good X-server performance

33 Evaluation Results
[Figure: normalized IPC of the foreground (462.libquantum) run-alone and co-run with background memory hogs, for w/o MemGuard, MemGuard (reservation only), and MemGuard (reclaim+share). Platform: Intel Core, with C0 and C2, two L2 caches, and shared memory; bandwidth labels in the figure: 1GB/s and .2GB/s]
- Guaranteed performance
- Reclaiming and sharing maximize performance

34 Summary of MemGuard
- Unpredictable memory performance
  - Multiple resources (banks), per-bank state, queueing delay
- MemGuard
  - Guarantee minimum memory bandwidth for each core
  - b/w reservation (guaranteed part) + best-effort sharing
- Case-study
  - On Intel Xeon multicore platform, using HRT + X-server
  - MemGuard can improve real-time performance efficiently
- Limitations
  - Coarse grain (an OS tick) enforcement
  - Small guaranteed b/w due to potential bank conflict
    - DRAM bank partitioning (next part)

35 Outline
- Motivation
- Background & Problems
- DRAM Bandwidth Mgmt. (Time)
- DRAM Allocation Mgmt. (Space)
  - PALLOC [RTAS 14]
- Conclusion & Future Work

36 Best-case: No Inter-Core Conflict
[Figure: Core1 to Core4 make parallel accesses through the L3 and memory controller (MC) to the DRAM DIMM]
- Fast
- But

37 Problem
[Figure: an SMP OS on Core1 to Core4, which share the L3, the memory controller (MC), and the DRAM DIMM; the banks in use are marked "????"]
- OS does NOT know DRAM banks
- OS memory pages are spread all over multiple banks
- Unpredictable conflict

38 PALLOC
[Figure: SMP OS, Core1 to Core4, CPC, memory controller (MC), DRAM DIMM]
- OS is aware of DRAM mapping
- Each page can be allocated to a desired DRAM bank
- Flexible allocation policy

39 PALLOC
[Figure: Core1 to Core4, L3, memory controller (MC), DRAM DIMM]
- Private banking
  - Allocate pages on certain exclusively assigned banks
- Better performance isolation

40 PALLOC
- Modified Linux kernel's buddy allocator
  - Lowest level allocator, mother of all allocations
- Can specify <banks> for each CGROUP
  # echo 4-7 > /sys/fs/cgroup/corun/palloc.dram_bank
  - Allows bank 4, 5, 6, or 7 for the corun cgroup
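
The effect of the cgroup setting can be illustrated with a toy allocator. This is a sketch under loud assumptions, not the kernel code: the physical-address-to-bank mapping is platform specific (the slides defer it to the paper), so `bank_of` uses a made-up mapping of page frame number modulo 16. Accepting comma-separated lists in `parse_bank_list` is also an assumption, since the slide only shows the range form `4-7`.

```python
from typing import List, Optional, Set

NUM_BANKS = 16  # Platform #1 in the slides has 16 banks


def parse_bank_list(spec: str) -> Set[int]:
    """Parse a bank list such as "4-7" (comma-separated parts are assumed)."""
    banks: Set[int] = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        banks.update(range(int(lo), int(hi or lo) + 1))
    return banks


def bank_of(pfn: int) -> int:
    """HYPOTHETICAL mapping from page frame number to DRAM bank."""
    return pfn % NUM_BANKS


def alloc_page(free_pfns: List[int], allowed: Set[int]) -> Optional[int]:
    """Take the first free page that lies in an allowed bank, if any."""
    for i, pfn in enumerate(free_pfns):
        if bank_of(pfn) in allowed:
            return free_pfns.pop(i)
    return None


allowed = parse_bank_list("4-7")
free = list(range(32))
pages = [alloc_page(free, allowed) for _ in range(5)]
print(sorted(allowed), pages)
```

Every page handed to the cgroup falls in banks 4 to 7, so tasks in other cgroups that are given disjoint bank sets never share a bank with it.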

41 Simplified Pseudocode 41

42 Evaluation Platforms
- Platform #1: Intel Xeon 3530
  - X86-64, 4 cores, 8MB shared L3 cache
  - 1 x 4GB DDR3 DRAM module (16 banks)
  - Modified Linux
- Platform #2: Freescale P4080
  - PowerPC, 8 cores, 2MB shared LLC
  - 2 x 2GB DDR3 DRAM module (32 banks)
  - Modified Linux

43 Identifying Memory Mapping
[Figure: address maps marking cache-sets and banks for the Intel Xeon 3530 with a 4 GiB DDR3 DIMM (16 banks), and cache-sets, banks, and channel for the Freescale P4080 with 2x2 GiB DDR3 DIMMs (32 banks)]
- See paper for details

44 Samebank vs. Diffbank Diffbank: Core0 0, Core Samebank: All cores 0 44

45 Real-Time Performance Buddy(solo) Buddy PALLOC(diffbank) Setup: HRT Core0, X-server Core1 Buddy: no bank control (use all 0-15) Diffbank: Core0 0-7, Core

46 SPEC2006
- Use 15 high-medium memory intensive benchmarks

47 Performance Impact on Unicore
[Figure: normalized IPC with 4 banks, 8 banks, 16 banks, and buddy]
- More banks -> more MLP (better performance, e.g., 470.lbm)
- For most benchmarks, the impact is not significant

48 Performance Isolation on 4 Cores
[Figure: slowdown ratio for buddy, PB, and PB+PC]
- Setup: Core0: X-axis, Core1-3: 470.lbm x 3 (interference)
- PB: DRAM bank partitioning only
- PB+PC: DRAM bank and cache partitioning

49 Summary of PALLOC
- PALLOC
  - DRAM bank aware kernel level memory allocator
  - Can eliminate inter-core bank conflicts
- Evaluation
  - Private banking policy improves performance isolation without significant single-thread performance reduction
  - But far from ideal isolation, because the memory bus is still a bottleneck
    - Need for b/w control (MemGuard) (TBA)

50 Outline
- Motivation
- Background & Problems
- DRAM Bandwidth Mgmt. (Time)
- DRAM Allocation Mgmt. (Space)
- Conclusion & Future Work

51 Conclusion
- Problems
  - The shared memory hierarchy in multicore
  - Today's OSes do not manage it.
  - Resulting in high performance variations and poor QoS guarantees
- Proposed solutions
  - MemGuard: DRAM bandwidth (time) control
  - PALLOC: DRAM bank (space) control
  - Improved performance isolation
- Funding agencies

52 Future Work
- T1: Better integration of MemGuard and PALLOC
  - PALLOC assignments affect reservable memory bandwidth (i.e., PALLOC MemGuard)
- T2: Smarter policies
  - Optimal assignments for known workloads
  - Dynamic adaptation schemes for unknown workloads
- T3: Fine-grained QoS guarantees in the Cloud
  - Improve QoS guarantees in public cloud systems
  - A vision for real-time cloud systems (NSF CNS/CSR Highlights)
