STATS AND CONFIGURATION

1 ZSim Tutorial MICRO 2015 STATS AND CONFIGURATION Core Models Po-An Tsai

5 Outline
ZSim core simulation techniques
ZSim core structure
    Simple IPC1 core
    Timing core
    OOO core
Coding examples with demo
    Branch predictor
    Westmere to Silvermont

8 Core Simulation Techniques
ZSim simulates the system using Pin, which leverages dynamic binary translation.
ZSim mainly uses 4 types of analysis routines to cover the simulated program:
    Basic block
    Load and Store
    Branch

17 Core Simulation Technique
A basic block (BBL) from Pin:
    mov (%rbp),%rcx
    add %rax,%rbx
    mov %rdx,(%rbp)
    ja 40530a
1. Simulate core activities with a BBL descriptor that contains most of the static information.
Decode the BBL into a BBL descriptor and instrument the block with a BasicBlock(BblDescriptor) call:
    BblDescriptor:
        numInstructions = 4
        numBytes = 4
        uop[]

18 Core Simulation Technique
Decode x86 instructions into uops, each with its own latency, src/dst register pairs, and functional unit ports.
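
A uop record of this kind could be sketched as follows. The struct layout, the register enum, the port bitmask encoding, and the latencies are illustrative assumptions, not zsim's actual decoder output.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical register ids, for illustration only.
enum Reg : uint16_t { REG_NONE = 0, REG_RAX, REG_RBX, REG_RCX, REG_RDX, REG_RBP, REG_FLAGS };

// Hypothetical uop record: static information produced once at decode time.
struct Uop {
    uint16_t src[2];   // source register ids (REG_NONE = unused)
    uint16_t dst[2];   // destination register ids (REG_NONE = unused)
    uint8_t  latency;  // execution latency in cycles
    uint8_t  portMask; // bit i set => uop may execute on port i
};

// Decode "add %src,%dst" (AT&T syntax: dst += src) into a single ALU uop.
inline std::vector<Uop> decodeAddRegReg(Reg srcReg, Reg dstReg) {
    // Assumed: 1-cycle latency, executable on ports 0, 1 and 5 (mask 0x23).
    Uop alu{{srcReg, dstReg}, {dstReg, REG_FLAGS}, 1, 0x23};
    return {alu};
}

// Decode "mov (%base),%dst" into a single load uop.
inline std::vector<Uop> decodeLoad(Reg base, Reg dstReg) {
    // Assumed: load port 2 (mask 0x04); the memory latency itself comes from the cache model.
    Uop ld{{base, REG_NONE}, {dstReg, REG_NONE}, 1, 0x04};
    return {ld};
}
```

Because everything in the record is static, it can be computed once per basic block and reused every time the block executes.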

23 Core Simulation Technique
2. Simulate memory system operations with addresses.
    mov (%rbp),%rcx      Load(%rbp)
    add %rax,%rbx
    mov %rdx,(%rbp)      Store(%rbp)
    ja 40530a
    BasicBlock(BblDescriptor)

    Load(Address addr) { L1D->load(addr); }
    Store(Address addr) { L1D->store(addr); }

27 Core Simulation Technique
Instruction-driven core activity (basic block) simulation
    Simulates multiple stages for a single instruction at once
    Each stage maintains a separate clock

    BasicBlock(BblDescriptor) {
        foreach uop {
            simulateFetch(uop);
            simulateDecode(uop);
            simulateIssue(uop);
            simulateExecute(uop);
            simulateCommit(uop);
        }
    }

    simulateIssue(uop) {
        addUopToRob(curRobCycle, uop);
        if (rob.isFull()) {
            nextRobAvailCycle = rob.advance();
        }
    }
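
The idea of one clock per stage can be sketched in a few lines. This is a minimal illustration under stated assumptions (one uop per stage per cycle, a one-cycle hop between stages, and a ROB modeled as a FIFO of commit cycles); the class TinyCore and its members are hypothetical and not taken from zsim.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

struct Uop { uint32_t latency; };  // execution latency in cycles

class TinyCore {
  public:
    explicit TinyCore(std::size_t robSize) : robSize_(robSize) {}

    // Pushes one uop through every stage at once and returns its commit cycle.
    uint64_t simulate(const Uop& uop) {
        fetchCycle_ += 1;                                             // one uop fetched per cycle
        decodeCycle_ = std::max(decodeCycle_ + 1, fetchCycle_ + 1);   // cannot decode before fetch
        issueCycle_  = std::max(issueCycle_ + 1, decodeCycle_ + 1);   // cannot issue before decode
        if (rob_.size() == robSize_) {
            // ROB full: issue stalls until the oldest in-flight uop has committed.
            issueCycle_ = std::max(issueCycle_, rob_.front() + 1);
            rob_.pop_front();
        }
        uint64_t done = issueCycle_ + uop.latency;                    // execute
        commitCycle_ = std::max(commitCycle_ + 1, done + 1);          // in-order, one commit per cycle
        rob_.push_back(commitCycle_);
        return commitCycle_;
    }

  private:
    std::size_t robSize_;
    std::deque<uint64_t> rob_;  // commit cycles of in-flight uops, oldest first
    uint64_t fetchCycle_ = 0, decodeCycle_ = 0, issueCycle_ = 0, commitCycle_ = 0;
};
```

No global cycle loop is needed: each call only advances the per-stage clocks, which is what makes the approach cheap enough to run inside an analysis routine.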

35 Core Simulation Technique
Event-driven uncore activity simulation (weave phase)
Event chain for one access: Request from core -> Cache Tag -> Cache Miss -> Mem Data -> Cache Data -> Response to core
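
A generic event-driven loop for such a chain could look like the sketch below. It is a plain priority-queue simulation written for illustration; the EventQueue class, the scheduleMiss helper, and every latency value are assumptions, and zsim's weave-phase implementation is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <vector>

struct Event {
    uint64_t cycle;
    std::string name;
    bool operator>(const Event& o) const { return cycle > o.cycle; }
};

class EventQueue {
  public:
    void schedule(uint64_t cycle, const std::string& name) { q_.push(Event{cycle, name}); }

    // Fires events in cycle order and returns their names in firing order.
    std::vector<std::string> run() {
        std::vector<std::string> order;
        while (!q_.empty()) {
            order.push_back(q_.top().name);
            q_.pop();
        }
        return order;
    }

  private:
    // Min-heap on the event cycle.
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> q_;
};

// Schedules the events of one missing access; all latencies are made-up values.
inline uint64_t scheduleMiss(EventQueue& eq, uint64_t reqCycle) {
    const uint64_t tagLat = 4, missLat = 1, memLat = 100, fillLat = 2, respLat = 1;
    uint64_t tag  = reqCycle + tagLat;
    uint64_t miss = tag + missLat;
    uint64_t mem  = miss + memLat;
    uint64_t fill = mem + fillLat;
    uint64_t resp = fill + respLat;
    // Inserted out of order on purpose: the queue restores cycle order.
    eq.schedule(resp, "Response to core");
    eq.schedule(mem, "Mem Data");
    eq.schedule(reqCycle, "Request from core");
    eq.schedule(fill, "Cache Data");
    eq.schedule(tag, "Cache Tag");
    eq.schedule(miss, "Cache Miss");
    return resp;  // cycle at which the core sees the response
}
```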

38 General Core Structure
ZSim simulates a core with 4 functions using Pin's APIs:
    BblFunc
    LoadFunc
    StoreFunc
    BranchFunc
Currently supported core types:
    Simple IPC1 core
    Timing core
    OOO core (Westmere-like)

45 Simple IPC1 Core
Figure: an IPC1 core attached to the L1I and L1D caches. Current cycle = 0 at the start.
    mov (%rbp),%rcx      Load(%rbp)                   Current cycle = l1d->load(curCycle)
    add %rax,%rbx
    mov %rdx,(%rbp)      Store(%rbp)                  Current cycle = l1d->store(curCycle)
    ja 40530a
                         BasicBlock(BblDescriptor)    Current cycle += 4
The final step adds one cycle per instruction in the BBL (4 instructions at IPC = 1).
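
The walkthrough above can be condensed into a small class with the four per-core functions. This is a sketch under assumptions: the cache is a fixed-latency stand-in, and the names Ipc1Core, FixedLatencyCache, and the method signatures are hypothetical rather than copied from simple_core.h.

```cpp
#include <cassert>
#include <cstdint>

using Address = uint64_t;

// Stand-in for the L1D: every access takes a fixed latency (an assumption)
// and the call returns the cycle at which the access completes.
struct FixedLatencyCache {
    uint64_t latency;
    uint64_t load(Address, uint64_t curCycle)  { return curCycle + latency; }
    uint64_t store(Address, uint64_t curCycle) { return curCycle + latency; }
};

struct BblDescriptor {
    uint32_t numInstructions;
    uint32_t numBytes;
};

class Ipc1Core {
  public:
    explicit Ipc1Core(FixedLatencyCache* l1d) : l1d_(l1d) {}

    // Memory accesses block the core until the cache responds.
    void load(Address addr)  { curCycle_ = l1d_->load(addr, curCycle_); }
    void store(Address addr) { curCycle_ = l1d_->store(addr, curCycle_); }

    // One cycle per instruction in the block (IPC = 1).
    void bbl(const BblDescriptor& d) { curCycle_ += d.numInstructions; }

    // Branches cost nothing extra in this simple model.
    void branch(Address /*pc*/, bool /*taken*/) {}

    uint64_t curCycle() const { return curCycle_; }

  private:
    FixedLatencyCache* l1d_;
    uint64_t curCycle_ = 0;
};
```

With a 4-cycle cache, the example block (one load, one store, four instructions) ends at cycle 4 + 4 + 4 = 12.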

55 Timing Core
Figure: a core attached to the L1I and L1D caches. Current cycle = 0 at the start.
    mov (%rbp),%rcx      Load(%rbp)                   Current cycle = l1d->load(curCycle)
    add %rax,%rbx
    mov %rdx,(%rbp)      Store(%rbp)                  Current cycle = l1d->store(curCycle)
    ja 40530a
                         BasicBlock(BblDescriptor)    Current cycle += 4
Uncore event graph (figure labels):
    First chain, shown with the load: Request from core, Tag Acc, Miss, Write back, Mem Data Read, Data Write, Response to core
    Second chain, shown with the store: Request from core, Tag Acc, Data Write

75 OOO Core - BBL
Simulate all stages at once. The animation walks the uops Load A, Exec, Store A, Exec through the stages Fetch, Decode, Issue, OOO Execute, and Commit. Steps per stage, shown for Load A:
Fetch
    Misprediction: fetch wrong instructions
    Ins Fetch: fetch whole BBL
    Adjust fetch clock
Decode
    uop Queue: check next available cycle
    Adjust decode clock
Issue
    Reg Scoreboard: check src available cycle
    Rob: check next avail cycle
    Issue width, RegFile width
    Adjust issue clock
OOO Execute
    Ins Window: schedule uop in the next cycle in which the needed ports are available
    LS Unit (only for load/store): issue load/store
Commit
    Reg Scoreboard: set dst available cycle
    Rob: retire uop considering ROB width
    Adjust retire clock
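
The scoreboard and port checks in the Issue, OOO Execute, and Commit steps can be illustrated with a compact sketch. It assumes fully pipelined ports (busy for one cycle per uop), one candidate port per uop, and small fixed register and port counts; MiniOoo and its members are hypothetical names, not zsim code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Register id -1 means "unused". One candidate port per uop keeps the sketch short.
struct Uop {
    int src[2];
    int dst;
    uint32_t latency;
    int port;
};

class MiniOoo {
  public:
    // Returns the cycle at which the uop's result becomes available.
    uint64_t dispatch(const Uop& u, uint64_t issueCycle) {
        uint64_t ready = issueCycle;
        for (int r : u.src) {
            // Scoreboard: wait until every source register has been produced.
            if (r >= 0) ready = std::max(ready, regReady_[r]);
        }
        // Port: wait until the needed execution port is free.
        ready = std::max(ready, portFree_[u.port]);
        portFree_[u.port] = ready + 1;  // pipelined port: occupied for one cycle
        uint64_t done = ready + u.latency;
        // Scoreboard: publish when the destination becomes available.
        if (u.dst >= 0) regReady_[u.dst] = done;
        return done;
    }

  private:
    std::array<uint64_t, 16> regReady_{};  // cycle each register becomes available
    std::array<uint64_t, 6>  portFree_{};  // next free cycle of each port
};
```

A dependent uop is pushed back by its producer's completion cycle, while an independent uop is delayed only if it needs a port that is already taken in that cycle.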

79 OOO Core Load/Store
Simulate MLP
Figure: timelines of two overlapping loads.
    Load A: Issue 30, Cache 50, Mem 90
    Load B: Issue, Cache 70, Mem 110
In the weave phase, request B will not be delayed due to contention for A.
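
The benefit of memory-level parallelism can be checked with simple arithmetic. The sketch below compares two loads that overlap with the same two loads forced to run back to back. The completion cycles 90 and 110 follow the figure; the issue cycle of 50 for Load B and the 60-cycle latency are assumptions chosen to match those numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct LoadTiming {
    uint64_t issue;    // cycle the load is issued
    uint64_t latency;  // cycles until its data returns
};

// Completion cycle when the second load must wait for the first (no MLP).
inline uint64_t serialized(LoadTiming a, LoadTiming b) {
    uint64_t aDone = a.issue + a.latency;
    return std::max(aDone, b.issue) + b.latency;
}

// Completion cycle when both loads are in flight together (MLP).
inline uint64_t overlapped(LoadTiming a, LoadTiming b) {
    return std::max(a.issue + a.latency, b.issue + b.latency);
}
```

With Load A = {30, 60} and Load B = {50, 60}, the overlapped case finishes at cycle 110 and the serialized case at cycle 150.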

81 Simulation Speed for Different Core Types
Measured on the SPEC CPU2006 suite.
~3X difference between IPC1 and OOO-C in harmonic mean (Hmean).

85 Not Modeled Core Behaviors
Wrong-path execution
    Hard to simulate with Pin
    Okay to skip for Westmere
Fine-grained message passing
    Needs significant changes
TLBs and SMT
    Not supported yet

88 Coding Examples
Implement a branch predictor for the OOO core
Change the OOO core type
    From Westmere to Silvermont

99 Implement Branch Predictors
Have a new branch predictor class and implement the predict method. The demo version simply predicts that a branch repeats the last outcome seen:
    class GShareBranchPredictor {
      private:
        bool lastSeen = false;
      public:
        // Predicts and updates; returns false if mispredicted
        inline bool predict(Address branchPc, bool taken) {
            bool prediction = (taken == lastSeen);
            lastSeen = taken;
            return prediction; // true if the last seen outcome matched
        }
    };
Replace the branch predictor in ooo_core.h:
    //BranchPredictorPAg<11, 18, 14> branchPred;
    GShareBranchPredictor branchPred;
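
The class on the slide carries the gshare name but only tracks a single last outcome. For comparison, a minimal gshare-style predictor with the same predict() contract might look like the sketch below. The class name GShareSketch, the HistBits template parameter, and the table of 2-bit saturating counters are assumptions for illustration; this is not code from zsim.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Address = uint64_t;

template <unsigned HistBits>
class GShareSketch {
  public:
    // All counters start at 1 (weakly not-taken); global history starts empty.
    GShareSketch() : counters_(1u << HistBits, 1), history_(0) {}

    // Predicts and updates; returns false if mispredicted
    bool predict(Address branchPc, bool taken) {
        const uint32_t mask = (1u << HistBits) - 1;
        // gshare index: branch PC XOR global history.
        uint32_t idx = (static_cast<uint32_t>(branchPc) ^ history_) & mask;
        bool predictedTaken = counters_[idx] >= 2;
        // Train the 2-bit saturating counter toward the real outcome.
        if (taken && counters_[idx] < 3) counters_[idx]++;
        if (!taken && counters_[idx] > 0) counters_[idx]--;
        // Shift the outcome into the global history register.
        history_ = ((history_ << 1) | (taken ? 1u : 0u)) & mask;
        return predictedTaken == taken;
    }

  private:
    std::vector<uint8_t> counters_;
    uint32_t history_;
};
```

Keeping the table size a template parameter follows the same compile-time sizing style as BranchPredictorPAg<11, 18, 14>.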

100 Demo

105 Different OOO Micro-architecture
The original zsim assumes a Westmere OOO core, but what if I want to simulate a Silvermont/Haswell OOO core?
Step 1: Obtain the important OOO core parameters
Step 2: Change the core parameters in ooo_core.h/cpp
Step 3: Verify the model against a real system

106 Obtain Important Core Parameters
                    Westmere[1]    Silvermont[2]
    Issue width     4              2
    F/D/I/E stages  1/4/7/13       1/3/5/8
    Fetch width     16B            8B
    RF read width   3              2
    ROB size        128            32
    Ins window      1K * 36        1K * 16
    Issue queue     28             8
[1] [2]

111 Change OOO Core Parameters
Change sizes of hardware structures in ooo_core.h:
    CycleQueue<28> uopQueue      ->  CycleQueue<8> uopQueue
    ReorderBuffer<128, 4> rob    ->  ReorderBuffer<32, 2> rob
Change the OOO core parameters in ooo_core.cpp:
    #define FETCH_STAGE 1              ->  1
    #define DECODE_STAGE 4             ->  3
    #define ISSUE_STAGE 7              ->  5
    #define DISPATCH_STAGE 13          ->  8
    #define FETCH_BYTES_PER_CYCLE 16   ->  8
    #define ISSUES_PER_CYCLE 4         ->  2
    #define RF_READS_PER_CYCLE 3       ->  2
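
One way to keep both parameter sets side by side, instead of editing the #define lines in place, is a compile-time struct. This is only a sketch: the OooParams type, its field names, and the frontEndDepth helper are hypothetical, and the values are the ones listed above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the OOO core parameters from the slides.
struct OooParams {
    uint32_t fetchStage, decodeStage, issueStage, dispatchStage;
    uint32_t fetchBytesPerCycle, issuesPerCycle, rfReadsPerCycle;
    uint32_t uopQueueSize, robSize, robWidth;
};

constexpr OooParams kWestmere   = {1, 4, 7, 13, 16, 4, 3, 28, 128, 4};
constexpr OooParams kSilvermont = {1, 3, 5,  8,  8, 2, 2,  8,  32, 2};

// Number of cycles between the fetch and dispatch stages.
constexpr uint32_t frontEndDepth(const OooParams& p) {
    return p.dispatchStage - p.fetchStage;
}

// Checked at compile time, so a bad edit fails the build instead of skewing results.
static_assert(frontEndDepth(kWestmere) == 12, "Westmere front-end depth");
static_assert(frontEndDepth(kSilvermont) == 7, "Silvermont front-end depth");
```

Because the values are constexpr, they can still feed template arguments such as the queue and ROB sizes, so nothing moves to run time.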

112 Demo

113 Verify It Against Real System
Figure: IPC traces for Westmere and Silvermont.
    Westmere (6% performance difference)
    Silvermont (9% performance difference)

117 Summary
ZSim uses instruction-driven simulation for core activities and event-driven simulation for uncore activities.
ZSim currently supports 3 types of core:
    Simple IPC1 core (simple_core.h)
    Timing core (timing_core.h)
    Westmere-like OOO core (ooo_core.h)
Extending the zsim core model is straightforward:
    Modify the 4 basic analysis routines
    Substitute the hardware structures with your implementation
    Change the parameters in the OOO core

118 Hacking Advice
As is common in Pin programming, functions in the core are called very frequently in zsim.
You should be aware of performance when coding.
This is the main reason why zsim statically allocates hardware structures and sets OOO parameters at compile time.
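
A statically sized structure in this spirit could be sketched as a template whose capacity is fixed at compile time, in the style of CycleQueue<28>. StaticFifo below is a hypothetical example, not a zsim class.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-capacity FIFO: the size is a template parameter, so storage lives
// inside the object and no heap allocation happens on the hot path.
template <std::size_t N>
class StaticFifo {
  public:
    bool full() const  { return count_ == N; }
    bool empty() const { return count_ == 0; }

    void push(uint64_t v) {
        assert(!full());
        buf_[(head_ + count_) % N] = v;  // write at the tail, wrapping around
        ++count_;
    }

    uint64_t pop() {
        assert(!empty());
        uint64_t v = buf_[head_];
        head_ = (head_ + 1) % N;         // advance the head, wrapping around
        --count_;
        return v;
    }

  private:
    std::array<uint64_t, N> buf_{};
    std::size_t head_ = 0, count_ = 0;
};
```

Since N is known to the compiler, the modulo and bounds checks can be optimized for the exact size, which matters for code called on every simulated instruction.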

119 Thank you! Any questions?

120 Break / Q&A
Try zsim now!

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200