Diploma Thesis. Instruction Timing Analysis for Linux/x86-based Embedded and Desktop Systems


Faculty of Electrical Engineering & Information Technology
Chair of Process Automation

Diploma Thesis

Instruction Timing Analysis for Linux/x86-based Embedded and Desktop Systems

Author: Tobias John
Supervisors: Prof. Peter Protze, Dr. Robert Baumgart
Date of Submission: 22nd September 2005

John, Tobias: Instruction Timing Analysis for Linux/x86-based Embedded and Desktop Systems. Diploma Thesis, Faculty of Electrical Engineering & Information Technology, Chemnitz University of Technology, 2005

Declaration of Authorship

I hereby declare that the whole of this diploma thesis is my own work, except where explicitly stated otherwise in the text or in the bibliography. This work is submitted to Chemnitz University of Technology as a requirement for being awarded a diploma in Electrical Engineering - Automation Engineering. I declare that it has not been submitted in whole, or in part, for any other degree.

Chemnitz, 22nd September 2005
Tobias John


Conceptual formulation (1)

Objective: Real-time aspects are increasingly relevant in standard PC environments. At the same time, x86-based processors are used more and more often in embedded systems. For the construction of real-time systems standard techniques are available, which unfortunately reduce the efficiency of the systems. Therefore specific system parameters often come without a guarantee, and overdimensioned hardware is used instead. The intention of this work is to obtain quantitative statements about the temporal behaviour of current x86-based processor architectures under the Linux operating system. Relevant aspects are as follows:

* comparison of the types Pentium 4 and AMD Elan SC 410
* outline and abstract of micro-benchmarks which address certain units (ALU, FPU, MMX, ISSE) specifically
* assessment of typical operating system services (often used system calls) on both architectures
* assessment of given real-time applications (for example current multimedia codes)

This work will be supervised by the Chair of Process Automation of the Faculty of Electrical Engineering & Information Technology (Prof. Protze) and by the junior professorship Real-time Systems of the computing faculty (Dr. Baumgart).

1: Appendix A holds a copy of the original (German) conceptual formulation.


Abstract

As the conceptual formulation expresses, this work should compare x86-based general purpose processors with embedded ones intended to be used in real-time systems. The focus was directed especially at the jitter of execution times. However, analysing the execution of instructions on a microprocessor in such a way that the completion is one time as fast as possible and the other time as slow as in the worst case requires that the architecture is known in detail. It is necessary to know which aspects, and in which order, influence the execution flow. This information is often not published, not detailed enough or even wrong. (2) As a consequence the first step has to be the analysis of the underlying hardware. That this step would take most of the time could not be known in advance, and therefore the results of this work are a collection of microbenchmarks to explore the caching and branch prediction architecture and obtain the missing information. Nevertheless, the gained results are interpreted in a way that stresses the best and worst case in execution timing.

2: E.g. [LH] and [Sea00] declare that the PII uses a strict LRU replacement strategy, although a pseudo technique is actually applied to L1!

Acknowledgements

Although this work has been created by myself, several friends have had great influence on it, which is why I want to mention them: I would like to thank Jette & Tim for proofreading, my family for the patience they had with me, Chris for reminding me of the duties of a student and for always being open to my questions. Special thanks goes to Brigitte for awakening the last energy resources and to Cari, who showed me that there is something worth fighting for.

Contents

1 Introduction
2 Background knowledge
  2.1 Hardware
  2.2 Performance monitoring
  2.3 Branch prediction
  2.4 CPU caches
    2.4.1 INTEL's Cache Architecture
3 Analysis concepts
  3.1 Performance Monitoring - IA-32 architecture
    3.1.1 P6 family
    3.1.2 PIV, Xeon
  3.2 Branch prediction
    3.2.1 Branch History (shift-)register (BHR)
    3.2.2 Branch Target Buffer (BTB)
  3.3 Caching
    3.3.1 Adjacent cache line prefetch - PIV
    3.3.2 Caching strategy of the L1 cache
    3.3.3 Replacement strategy
4 Worst case on caching
  4.1 Cache flooding I
    4.1.1 Double Purge Scenario
    4.1.2 Cache flooder as described in [LH]
  4.2 Cache flooding II
    4.2.1 The algorithm
    4.2.2 Conditions on the cache architecture
    4.2.3 Filling vs Flooding
5 Results
  5.1 Branch prediction
    5.1.1 BHR - Branch History Register
    5.1.2 BTB - Branch Target Buffer
  5.2 Caching
    5.2.1 Adjacent cache line prefetch - PIV
    5.2.2 L1 - caching strategy
    5.2.3 Replacement strategy
    5.2.4 Cache flooding
6 Conclusions
A Conceptual formulation


1 Introduction

Real-time aspects are becoming more important in standard desktop PC environments, and x86-based processors are being utilized in embedded systems more often. While these processors were not created for use in hard real-time systems, they are fast and inexpensive and can be used if it is possible to determine the worst case execution time. Information on CPU caches (L1, L2) and branch prediction architecture is necessary to simulate best and worst cases in execution timing, but is often not detailed enough and sometimes not published at all. This document describes how the underlying hardware can be analysed to obtain this information.

This document is structured as follows: The following section (sec. 2) gives background information on the covered topics: performance monitoring, branch prediction and caching. With this general knowledge it should be no problem to understand the exploration techniques presented in section 3. These pages cover the ideas behind the benchmarks and how they have to be implemented. Section 4 describes the worst case on caching. Two possibilities are explained on how to achieve it; the first one is based on [LH] and the second has been developed by myself. These theoretical concepts have been implemented and tested on Intel Pentium II, III and IV processors, belonging to the architectures P6 and Netburst. The results obtained are presented in section 5. Finally a summary and a list of open questions and remaining problems (sec. 6) follows.

2 Background knowledge

2.1 Hardware

The tests described in the following sections were executed on different Intel architectures: P6 (Pentium II, III) and Netburst (Pentium IV). The PII is a Klamath at 233 MHz with a 66 MHz system bus frequency and a 512 KB second level cache running at half the core clock. The PIII, from the same family of processors, has a CPU frequency of 500 MHz, a system bus rated at 100 MHz and an L2 cache of the same size as the PII that runs at half core speed, too. Its codename is Katmai. Both processors have the MMX instruction set but only the PIII utilizes SSE. The representative of the Netburst microarchitecture, the PIV, is a Northwood processor. It does not feature Hyper-Threading, has a clock frequency of 2.66 GHz and a 512 KB second level cache.

2.2 Performance monitoring

If the execution time of a program varies, this parameter is not usable for making comparisons or assertions anymore. Therefore counting the events corresponding to the analysed topic, such as cache or branch behaviour, is the only way to draw exact conclusions. Many processors therefore provide model specific registers (MSR) that serve this purpose, the so called performance monitoring registers. There exist several utilities to access these registers, such as VTune for the Intel processors, PAPI, OProfile and many others. VTune is limited to Intel processors and to some special Linux distributions only. PAPI, the Performance API, is an interface to the performance counter hardware for different platforms. It mainly consists of a library that needs to be linked to your software, wherefore it cannot be used in kernel modules. OProfile is a profiler under the GNU GPL that even supports analysis of kernel modules. As the name suggests, it is a profiler that is not called directly to read the counters. Instead it is called through an interrupt released by an overflow of a performance counter.

Under some circumstances it is unavoidable to work in kernel mode, for example if physical addresses are needed (cache flooder) or if you intend to analyse an RTAI module. That is why libraries such as PAPI cannot be used. Profiling software is either too inexact or produces too much overhead (if the counter is set up to overflow at a minimal count), wherefore I decided to implement my own functions to access the performance monitoring hardware - for the time being, limited to the Pentium II/III and IV. [Int04] describes how to configure the MSRs to count certain events. Some helpful functions such as setting up registers and starting, stopping and resetting counters were written as inline assembly macros in a header file. So it is possible to count for example cache misses and branch (mis)predictions within a kernel module without installing additional software or patching the kernel (as is necessary for PAPI). Direct manipulation of the MSRs is allowed in privilege level 0 only, so our performance monitoring macros are limited to kernel code, but because our intention was to analyse kernel modules (in conjunction with RTAI they are the simplest possibility to run code without interruption) this does not pose a problem.

2.3 Branch prediction

In order to get more instructions completed faster, modern microprocessors are deeply pipelined. That means that instructions do not wait for the previous ones to complete before their execution begins. A problem with this approach arises due to conditional branches. If a conditional branch is encountered and the result of the condition has not yet been calculated, the microprocessor does not know whether to take the branch or not. The applied solution is branch prediction - the processor decides whether to take the branch or not and starts executing the instructions at the predicted branch target. Finally, when the result of the branch condition is known, it is obvious whether the branch has been predicted correctly or not. In the latter case the already executed instructions of the wrong (mispredicted) path have to be thrown out (flushing the pipeline), which is particularly expensive with deeply pipelined processors. The delay for a mispredicted branch is usually equivalent to the pipeline depth.

There are two main types of branch prediction: static and dynamic. Static prediction assumes that the majority of backwards pointing branches occur in the context of repetitive loops, where the condition is used to determine whether the loop is to be repeated or not. Therefore backward branches are predicted to be taken, whereas forward pointing branches are predicted not to be taken.

Figure 1: bimodal counter - used for dynamic prediction (states st, wt, wnt, snt; transitions on actually taken / actually not taken)

The ability to dynamically predict the direction and the target of branches is based on the branch instruction's linear address, using the branch target buffer (BTB). If there is no valid entry in the BTB for the recent branch, then static prediction will be used to decide which path to take. A widely used scheme is the following: There is a branch history register (BHR) with a width of N_H bits that stores the outcomes of the last N_H conditional branches. Either there is one global BHR, based on the correlations between subsequent branches in the whole program flow, or several local ones that are based on the correlation between subsequent executions of the same branch. Some bits of the BHR together with the branch instruction's address index a table of n-bit saturating counters (usually n = 2: strongly taken (st), weakly taken (wt), weakly not taken (wnt), strongly not taken (snt)) that are updated when a jump condition is evaluated and predict the branch outcome.

Figure 1 shows such a bimodal counter and figure 2 is a scheme of the described architecture [Sto01], [MMK04].

Figure 2: usual branch prediction architecture (branch address, BHR with global and local components, index function f, BPT, BTB with target address / next sequential address)

Unfortunately, some processor manufacturers provide almost no information on the exact predictor implementation, although there are several pieces of advice on how to optimize your code ([And]).
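To make the behaviour of such an n = 2 bit saturating counter concrete, the following short C sketch simulates a single bimodal counter as described above. It is only an illustration; the numeric state encoding (0 = snt ... 3 = st), the example outcome sequence and all names are chosen by me and do not claim to match any particular hardware implementation.

#include <stdio.h>

/* 2-bit saturating counter: 0 = strongly not taken (snt), 1 = weakly not
 * taken (wnt), 2 = weakly taken (wt), 3 = strongly taken (st). */
static int predict(int state)           { return state >= 2; }   /* 1 = predict taken */
static int update(int state, int taken)
{
    return taken ? (state < 3 ? state + 1 : 3)
                 : (state > 0 ? state - 1 : 0);
}

int main(void)
{
    int state = 1;                        /* start weakly not taken   */
    int outcomes[] = {1, 1, 1, 0, 1, 1};  /* 1 = branch actually taken */
    int mispredictions = 0;

    for (unsigned i = 0; i < sizeof(outcomes) / sizeof(outcomes[0]); i++) {
        if (predict(state) != outcomes[i])
            mispredictions++;
        state = update(state, outcomes[i]);
    }
    printf("mispredictions: %d\n", mispredictions);
    return 0;
}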

2.4 CPU caches

Most common CPU cache architectures use several levels of set-associative caches to reduce the number of cycles the CPU has to wait for data from memory. A line is the smallest unit that can be transferred between a cache and main memory. If a line can be stored in any place in a cache, the cache is called fully associative, as opposed to direct mapped, where each line can only be stored in a specific place. Set associativity is the compromise where a line can be cached in W different locations. Those W lines form a set, and because the data of one line can be stored in any of those W places, the address only indexes the set and not the line within the set. The appropriate line within a set is found through comparison.

Figure 3: simplified structure of a cache (sets 0 ... S-1, ways 0 ... W-1, lines mapped from main memory)

If data that is accessed by the CPU is found in a cache, it is called a cache hit, otherwise a miss. Several strategies exist to decide what to do with an accessed line. The most common are:

Table 1: common caching strategies

  line in cache?   strategy           description
  hit              write through      the cache is updated and the next level of memory too
                                      (either another (slower) cache or main mem)
  hit              write back         only the cache is updated (the line is marked dirty and
                                      written back later)
  miss             write allocate     the next level of memory is updated and the line is
                                      fetched into the cache
  miss             write no-allocate  the next level of memory is updated, the line is not fetched

When data is going to be cached and no free space is available (that does not mean that the whole cache is filled! Because a line can be stored in only W places of one set, it is of no help if there are other free sets available), another line has to be purged out of the cache - which one, the replacement policy decides. Based on [Mi04] the most common replacement policies are Random, Round-Robin, LRU (Least Recently Used) and plru (pseudo LRU). Among the plru algorithms are plrut (pseudo LRU tree based) and plrum (pseudo LRU

based on MRU (Most Recently Used) bits). Strict LRU is quite costly and complex because the whole history of the W cache lines in a set has to be saved and updated on an access. Therefore the pseudo LRU mechanisms try to reduce the number of bits needed to store the plru information and the time to manage them, through approximation of the real algorithm.

The tree based plru policy (plrut) uses a binary tree to point to the assumed least recently used cache line. When accessing a line and thereby making it the most recent one, all the tree bits that lie on the path to that line are updated to point to the opposing direction (0 <-> 1). The left column of figure 4 shows an example of a 4-way cache that utilises the plrut strategy. To keep it simple, only one set of the cache is shown and the tree bits hold r / l for right / left instead of 0 / 1. The black arrows denote the path down the tree that points to the (pseudo) least recently used cache line. The contents of a cache line are marked through lowercase alphabetic characters and bold letters symbolize modified or updated entries. Each of the three pictures in a column of figure 4 shows the cache lines and the plru bits after the given instruction (e.g. read(b)) has been executed. The red boxed bits are those that have been updated. With respect to the tree bits in step 0 of figure 4 the third cache line is the least recently used. When data b is read, which is already in the cache, all tree bits that are on the path down to that line have to be updated to point to the opposite direction. In step 2 a new line is read and another entry has to be freed to store k in the cache. Because the tree bits point to the second way (data c), this line is replaced and after updating the tree, data d is the LRU line.

Figure 4: tree based LRU and MRU based LRU replacement policies (steps: initial state, read(b), read(k))
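The plrut update rule just described can also be written down as a small simulation. The following C sketch models a single 4-way set with three tree bits; the concrete bit encoding, the initial state and all names are assumptions made for this illustration only and are not taken from any processor documentation, so the replaced way may differ from the one shown in figure 4.

#include <stdio.h>

/* One 4-way set with a tree-based pseudo-LRU (plrut) policy.
 * b[0] selects the half that holds the pseudo-LRU line (0 = ways 0/1,
 * 1 = ways 2/3); b[1] and b[2] select the line within the chosen half.
 * Note: which way gets replaced depends on the (assumed) initial tree bits. */
struct set4 { int tag[4]; int b[3]; };

static void touch(struct set4 *s, int way)   /* point all bits on the path away from 'way' */
{
    if (way < 2) { s->b[0] = 1; s->b[1] = (way == 0); }
    else         { s->b[0] = 0; s->b[2] = (way == 2); }
}

static int victim(const struct set4 *s)      /* walk the tree down to the pseudo-LRU way */
{
    if (s->b[0] == 0) return s->b[1] ? 1 : 0;
    return s->b[2] ? 3 : 2;
}

static void access_line(struct set4 *s, int tag)
{
    for (int w = 0; w < 4; w++)
        if (s->tag[w] == tag) { touch(s, w); return; }   /* hit: only update the tree  */
    int w = victim(s);                                   /* miss: replace pseudo-LRU way */
    s->tag[w] = tag;
    touch(s, w);
}

int main(void)
{
    struct set4 s = { {'e', 'c', 'b', 'd'}, {0, 0, 0} };
    access_line(&s, 'b');                 /* hit on b  */
    access_line(&s, 'k');                 /* miss on k: one pseudo-LRU way is replaced */
    for (int w = 0; w < 4; w++) printf("%c ", s.tag[w]);
    printf("\n");
    return 0;
}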

Another approximation of the LRU policy is the MRU bits based (plrum) method. Every line in a set has an MRU bit that shows whether that line has recently been used ('n': new) or not ('o': old). An accessed line is marked new and only old ones are replaced. If the last old line of a set is replaced and marked as new, all W - 1 other MRU bits of the set are updated to old; otherwise all bits would mark their corresponding lines as new and none could be replaced. Step 2 in figure 4 shows that speciality. The other steps are self-explanatory.

Necessary information to describe a cache is:

  name                                 data     parameter
  associativity (the number of ways)   2^w_i    w_i
  cache size                           2^s_i    s_i
  line length                          2^l_i    l_i

i = {1, 2} and refers to L1, L2. It is common, although not necessary, that the line length is identical for both cache levels, therefore this document covers the case where l_1 = l_2 = l.

A set-associative cache has 2^(s-w-l) sets:

  size / (associativity * line length) = 2^s / (2^w * 2^l) = 2^(s-w-l)

That means it has an address width a_i = s_i - w_i - l. The l least significant bits of an address are used to determine the corresponding byte within a cache line, and the following a_1 bits are used to index the corresponding set of L1. When data is fetched from memory, always a whole cache line is read. That is why the rightmost l bits are of no importance in indexing a cache. Often reduced addresses are used to ease understanding, and so the l least significant bits are neglected. An example of a 2-way cache of 32 B size and 4 B line size is shown in figure 5:

Figure 5: example of a 2-way cache (2 ways, 4 sets, 4 B lines A-D, I-L, Q-T)
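Using the parameters just introduced, the split of an address into set index and byte offset can be expressed directly in code. The following sketch uses the values of the exemplary 2-way cache of figure 5 (its parameters are listed in table 2 below); the macro and function names are mine.

#include <stdio.h>
#include <stdint.h>

/* Parameters of the exemplary 2-way cache: 32 B size, 2 ways, 4 B lines. */
#define S 5                      /* cache size    = 2^S = 32 B            */
#define W 1                      /* associativity = 2^W = 2 ways          */
#define L 2                      /* line length   = 2^L = 4 B             */
#define A (S - W - L)            /* address width a = s - w - l = 2       */

static unsigned byte_in_line(unsigned addr) { return addr & ((1u << L) - 1); }
static unsigned set_index(unsigned addr)    { return (addr >> L) & ((1u << A) - 1); }

int main(void)
{
    unsigned addr = 0x2d;        /* arbitrary example address */
    printf("addr 0x%02x -> set %u, byte %u within the line\n",
           addr, set_index(addr), byte_in_line(addr));
    return 0;
}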

Table 2: parameters of the exemplary 2-way cache

  ways         2      w = 1
  line size    4 B    l = 2
  cache size   32 B   s = 5
  addr width          a = s - w - l = 2

Because only a_i bits of an address are used to index the corresponding cache entry, an address can be denoted as follows:

Figure 6: highlighting the bits used to index a cache (bits 31 ... 0, split into parts x, y, z; z is a_1 bits wide, yz is a_2 bits wide, the l rightmost bits are unused)

The z part of an address (which is a_1 bits wide) indexes L1 and the part yz (a_2 bits) indexes L2.

2.4.1 INTEL's Cache Architecture

This section describes the cache architecture of INTEL's PII/III and the Pentium IV. The PII/III belong to the same family, called the P6 family, and therefore share the same architecture. Some of the given parameters vary among different editions of a processor; the values of the examined processors are among those listed (both have a 512 KB second level cache).

Table 3: overview of Intel's caching architecture

                    PII/III                                PIV
  L1 Data           16 KB, 4 ways, 32 B line length        8 KB, 4 ways, 64 B line length
  L1 Instruction    8/16 KB, 4 ways, 32 B line length      Trace Cache, 12 Kµops, 8 ways
  L2 unified        128/256/512/1024/2048 KB, 4 ways,      256/512 KB, 8 ways,
                    32 B line length                       64 B line length, 128 B sector

The information about size, associativity and line length can be gathered by evaluating the bits returned by the cpuid instruction. This is what the tool cpuid from [sou] does. [Hay] states that the P6 family has an 8-way second level cache, but according to [Int04] the associativity is only 4, which is acknowledged by the cpuid information! A lot of detailed information about processors can be found on [san], however some points are

missing or even incorrect, e.g. the statement that the L1 cache of the PII/III uses an LRU replacement strategy (see section 3.3.3). A general description of the caching architecture and its configuration is given by Intel's System Programming Guide ([Int04]).

The Netburst microarchitecture ([H+01]) of the P4 features a 128 B sectored cache that fetches 2 adjacent cache lines on a miss from memory, and a hardware prefetcher that monitors access patterns and prefetches data automatically. Both features can be disabled through the IA32_MISC_ENABLE MSR (bit 9 and/or 19 on address 0x1a0). When enabled, one cache miss initiates two 64 B memory reads to fill two adjacent cache lines (sector based read); however, writes are always line based and only write the modified 64 B back into main memory. [Int] states that L1 uses a write through policy and that "All caches use a pseudo-LRU replacement algorithm". Yet which plru algorithm is used is not mentioned!
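As an illustration of how the two prefetch features mentioned above could be switched off from privilege level 0 (their disabling is exercised later in section 3.3.1), a macro in the style of the performance monitoring macros of section 3.1 might look as follows. The macro name is mine; only the MSR address (0x1a0) and the bit numbers (9 and 19) stated above are assumed, and the exact semantics should be verified against [Int04] before use.

/* Disable the hardware prefetcher and the adjacent cache line prefetch of the
 * Pentium 4 by setting bits 9 and 19 of the IA32_MISC_ENABLE MSR (0x1a0).
 * Must run in privilege level 0, e.g. inside a kernel module. */
#define disable_p4_prefetch()\
    asm volatile (\
        "mov $0x1a0, %%ecx\n\t"\
        "rdmsr\n\t"\
        "bts $9, %%eax\n\t"\
        "bts $19, %%eax\n\t"\
        "wrmsr"\
        :\
        :\
        : "eax", "ecx", "edx")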

3 Analysis concepts

3.1 Performance Monitoring - IA-32 architecture

Model Specific Registers (MSRs) can be read and written in privilege level 0 only, using the rdmsr and wrmsr instructions, where registers EDX:EAX hold the content that is either read from or written to the MSR addressed by ECX. The Time Stamp Counter, available since the Pentium processor, is incremented every CPU cycle and can be read using the rdtsc instruction. Here again, the 64 bit content is available in EDX:EAX. The RDTSC instruction is not serializing or ordered with other instructions. Thus, it does not necessarily wait until all previous instructions have been executed before reading the counter ([Int04, p. 15-26]). The instructions rdmsr and wrmsr, however, are serializing and can be executed before the TSC is read.

3.1.1 P6 family

The P6 family - to which the PII and PIII belong - utilizes two 40 bit counters with corresponding event select and controlling registers:

Table 4: performance monitoring MSRs - P6 family

  Name          Address   Meaning
  PerfEvtSel0   0x186     event selection MSR 0
  PerfEvtSel1   0x187     event selection MSR 1
  PerfCtr0      0xC1      counter 0
  PerfCtr1      0xC2      counter 1

Because the event selection registers are 32 bit wide, it is enough to modify only EAX when writing to these MSRs:

// set the low part of PerfEvtSel MSR to val
#define set_esr(msr, val)\
    asm volatile (\
        "xor %%edx, %%edx\n\t"\
        "wrmsr"\
        :\
        : "c" (msr), "a" (val)\
        : "edx")

The counters can be started and stopped by setting / clearing the ENABLE flag in the PerfEvtSel0 register:

// start counting
#define start_counting()\
    asm volatile (\
        "mov $0x186, %%ecx\n\t"\
        "rdmsr\n\t"\
        "bts $22, %%eax\n\t"\
        "wrmsr"\
        :\
        :\
        : "eax", "ecx", "edx")

// stop counting
#define stop_counting()\
    asm volatile (\
        "mov $0x186, %%ecx\n\t"\
        "rdmsr\n\t"\
        "btr $22, %%eax\n\t"\
        "wrmsr"\
        :\
        :\
        : "eax", "ecx", "edx")

Appendix A.3 of [Int04] lists the performance monitoring events of the P6 family. These are simple to configure and are not covered here.
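Taken together, the macros above could frame a measured code section inside a kernel module roughly as follows. This is only a usage sketch, not code from the thesis sources: the value written to PerfEvtSel0 is an explicit placeholder (the real event number, unit mask and flags have to be taken from appendix A.3 of [Int04]), and read_ctr0() is an additional helper assumed here that simply reads PerfCtr0 (0xC1) with rdmsr.

// placeholder: real event + unit mask + flags from [Int04], appendix A.3
#define EVENT_SELECT_PLACEHOLDER 0x00000000

// read the 40 bit counter PerfCtr0 (0xC1) into hi:lo
#define read_ctr0(lo, hi)\
    asm volatile ("rdmsr"\
        : "=a" (lo), "=d" (hi)\
        : "c" (0xC1))

static void measure_section(void)
{
    unsigned int lo, hi;

    set_esr(0x186, EVENT_SELECT_PLACEHOLDER);   /* program PerfEvtSel0           */
    start_counting();                           /* set the ENABLE flag (bit 22)  */

    /* ... code under test ... */

    stop_counting();
    read_ctr0(lo, hi);                          /* 40 bit counter value in hi:lo */
}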

3.1.2 PIV, Xeon

The Pentium IV features 18 performance counters and configuration registers. Table 15-2 of [Int04] shows the association of counters, CCCRs (Counter Configuration Control Registers) and ESCRs (Event Selection Control Registers), and appendix A.1 gives information on countable events. Table 5 lists only a few of them. The columns ESCR, CCCR, Counter show the addresses of the MSRs, whereas EvSel and CSel give the values that have to be in the EVENTSELECT field of the ESCR and in the ESCR-SELECT field of the CCCR register. If several CCCR and counter addresses are given, then one of them has to be chosen.

The configuration and control registers are 64 bit wide, however the upper 32 bit are reserved, so that only the lower part (EAX) is modified. Bits 16 and 17 shall always be set, so this is done when setting up a CCCR:

// set the low part of CounterConfigurationControl MSR to val
// set bit 16,17 (must be set)
#define set_cccr(cccr, val)\
    asm volatile (\
        "xor %%edx, %%edx\n\t"\
        "or $(0b11<<16), %%eax\n\t"\
        "wrmsr"\
        :\
        : "c" (cccr), "a" (val)\
        : "edx")

Counters are started and stopped through bit 12 of the corresponding CCCR.

Counting non-sleep CPU cycles

[Int04] describes how to count non-sleep clock ticks: "Non-Sleep Clockticks - Measures clock cycles in which the specified physical processor is not in a sleep mode or in a power-saving state."

1. select one of the 18 counters and its corresponding ESCR, CCCR (see table 15-2 of [Int04])
2. set EvSel to anything other than "no event": 0x01
3. enable threshold comparison: set the compare bit in the CCCR (bit 18)
4. set the threshold (bits 20-23) to 15 and set the complement flag in the CCCR (bit 19)

// cnt non-sleep cycles
#define CYCLE_ESCR 0x3a6
#define CYCLE_CCCR 0x368
#define CYCLE_CNTR 0x308

// ---- set up counting of non-sleep cycles ----
set_escr(CYCLE_ESCR, ESCR_OS | (0x01<<ESCR_EVS_SHIFT));
set_cccr(CYCLE_CCCR, CCCR_CMP | CCCR_CPL | 0xf<<CCCR_THRESH_SHIFT);
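The start/stop operation itself is not shown above; by analogy with the P6 macros it could be implemented as follows. The macro names and the use of bts/btr are mine, only the enable bit 12 of the CCCR mentioned in the text is assumed.

/* Start/stop a Pentium IV counter by setting/clearing bit 12 (enable) of the
 * corresponding CCCR, analogous to start_counting()/stop_counting() of the
 * P6 family; e.g. start_counting_p4(CYCLE_CCCR). */
#define start_counting_p4(cccr)\
    asm volatile (\
        "rdmsr\n\t"\
        "bts $12, %%eax\n\t"\
        "wrmsr"\
        :\
        : "c" (cccr)\
        : "eax", "edx")

#define stop_counting_p4(cccr)\
    asm volatile (\
        "rdmsr\n\t"\
        "btr $12, %%eax\n\t"\
        "wrmsr"\
        :\
        : "c" (cccr)\
        : "eax", "edx")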

Table 5: PIV - MSR configuration of selected performance monitoring events

predicted / mispredicted branches:
  ESCR 0x3cc, 0x3cd; CCCR 0x36c, 0x36d, 0x370 or 0x36e, 0x36f, 0x371;
  Counter 0x30c, 0x30d, 0x310 or 0x30e, 0x30f, 0x311;
  Event Select 0x06; ESCR Select 0x;
  Event Mask bits TM, TP, NM, NP (T: Taken, N: Not-Taken, P: Predicted, M: Mispredicted)

cache misses:
  ESCR 0x3cc, 0x3cd; CCCR 0x36c, 0x36d, 0x370 or 0x36e, 0x36f, 0x371;
  Counter 0x30c, 0x30d, 0x310 or 0x30e, 0x30f, 0x311;
  Event Select 0x09; ESCR Select 0x05;
  additionally MSR 0xef1: set bits 24, 0 (L1 miss), 1 (L2 miss); MSR 0x3f2: set bit 0

µops retired:
  ESCR 0x3b8, 0x3b9; CCCR 0x36c, 0x36d, 0x370 or 0x36e, 0x36f, 0x371;
  Counter 0x30c, 0x30d, 0x310 or 0x30e, 0x30f, 0x311;
  Event Select 0x01; ESCR Select 0x04; Event Mask bit 0: bogus, bit 1: non-bogus

FSB data activity:
  ESCR 0x3a2, 0x3a3; CCCR 0x360, 0x361, 0x362, 0x363; Counter 0x300, 0x301, 0x302, 0x303;
  Event Select 0x17; ESCR Select 0x06;
  Event Mask bit 0: drive data onto bus, 1: read data, 2: other processors; reset bits 3, 4, 5

IOQ allocation:
  ESCR 0x3a2, 0x3a3; CCCR 0x360, 0x361, 0x362, 0x363; Counter 0x300, 0x301, 0x302, 0x303;
  Event Select 0x03; ESCR Select 0x06;
  Event Mask bits 0-4: 0b: read, 6: write, 7: UC, 8: WC, 9: WT, 10: WP, 11: WB,
  13: own, 14: other proc, 14: prefetch

3.2 Branch prediction

[MMK04] describes possibilities to determine the organization of branch predictors, e.g. to explore the width N_H of the BHR and the number of bits that are used to index the branch target buffer (BTB). With this information it is easy to achieve both the best and the worst case in a branch prediction benchmark. The strategies presented in [MMK04] were adapted to run as RTAI kernel modules and were extended by a test whether a local or a global history component is used, which seemed to be simpler and easier to understand than the one given in the paper. As well, the algorithms were extended to take the cache behaviour into account (see sec. 5.1.2, p. 40).

3.2.1 Branch History (shift-)register (BHR)

If the BHR has a width of N_H bits, it can store the last N_H outcomes (T: Taken / N: Not taken). Any further branches will override previous ones, and the index function will select a wrong entry in the BTB - a misprediction is likely to occur. The outcomes of a branch that is taken only every mod-th iteration will fit in the BHR as long as mod < mod*, and almost no mispredictions will be counted. Yet for mod >= mod* the MPR (MisPrediction Ratio) will rise, because every mod-th branch is mispredicted.

        mov $ITER, %ecx      # number of iterations (outer loop)
        mov $MOD, %ebx       # modulo parameter
again:
        xor %edx, %edx       # clear edx
        mov %ecx, %eax
        div %ebx             # eax=(int)(eax/ebx), edx=modulo(eax,ebx)
        test %edx, %edx
        jz l0                # spy branch
        clc                  # do sth
l0:     dec %ecx
        jnz again

Figure 7: BHR benchmark

At this point it has to be distinguished between local and global BHRs. For a local component the number of history bits is N_H = mod* - 2, because the history refers to the spy branch only, but a global register saves the outcomes of the surrounding loop too, so N_H = 2 (mod* - 2). The reason for subtracting 2 is given in an example with a global BHR of width N_H = 6: N_H = 6 means that the last 6 outcomes can be saved, and because it is a global register only every second bit of the register is reserved for the spy branch (the other one is for the outer loop jnz again). Therefore a history pattern of length 3 should fit in the register and one of length 4 should not fit and cause mispredictions:

Figure 8: BHR pattern of length 4 (the register contents before the taken spy branches are unique and therefore map to their own BTB entries)

As can be seen in figure 8, a history pattern of length 4 can still be predicted correctly, because the address of the spy branch instruction together with the history register content is unique for every taken spy branch and therefore refers to the same BTB index. However, a pattern of length 1/2 N_H + 2 = mod* = 5 causes every 5th branch to be mispredicted, due to a non-unique BHR content for the taken spy branches. Through a variation of mod it is possible to find mod*, which is an indication of either a local (mod* - 2) bit BHR or a global 2 (mod* - 2) bit BHR. The modulo parameter mod can be given as mod=<num> option when loading the kernel module that executes the code given above (fig. 7).

Figure 9: BHR pattern of length 5 (the register contents before the taken spy branches are no longer unique; the corresponding counters toggle between wnt and snt)

To verify which kind of register - local or global - is used, I decided to execute a test with two of these modulo-branches, with the same modulo parameter. If local history registers are used, then both branches have their own and the MPR will rise at the same modulo parameter as in the test with just one spy branch. However, if a global register is used, then both branches have to share it and influence each other, wherefore the MPR rises at a lower modulo parameter (at a shorter history pattern).
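For readability, the kernel-module benchmark of figure 7 corresponds roughly to the following C loops; the second function sketches the two-spy-branch variant used for the local/global test. This is only a structural illustration with names chosen by me - the real measurement uses the assembly version so that the number and position of the conditional branches stay under control.

/* Structural C equivalent of the BHR benchmark (fig. 7): the condition of the
 * spy branch changes its outcome only every mod-th iteration. */
static volatile unsigned long sink;

static void bhr_benchmark(unsigned long iter, unsigned long mod)
{
    for (unsigned long i = iter; i > 0; i--) {
        if (i % mod == 0)        /* spy branch */
            sink++;              /* do sth     */
    }
}

/* Variant for the local/global test: two spy branches with the same modulo
 * parameter inside the same loop. */
static void bhr_benchmark_two(unsigned long iter, unsigned long mod)
{
    for (unsigned long i = iter; i > 0; i--) {
        if (i % mod == 0)        /* first spy branch  */
            sink++;
        if (i % mod == 0)        /* second spy branch */
            sink++;
    }
}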

3.2.2 Branch Target Buffer (BTB)

When describing how to analyse the branch prediction architecture, [MMK04] assumes the number of BTB entries is known. Yet it is not difficult to obtain that information through some tests based on the algorithm explained in the following section, therefore the hints on how to measure N_BTB are given afterwards (p. 19).

As described when introducing caches (sec. 2.4, p. 5), addresses are split up into different parts to indicate which bits are used for what purpose. Because only one buffer is available, only one part, namely the z part, is used to address it, and the y and x portions are combined to y. The following figure shows an address used to index the BTB:

Figure 10: address used to index the BTB (bits 31 ... 0, split into a y part and a z part of a_1 bits; the l least significant bits are unused)

Only the z part of the address is used to index the buffer. If 2^l consecutive addresses map to the same set, then the l least significant bits of an address are not used to index the entry.

Starting with an example of a BTB whose parameters are all known: It is a buffer with 512 entries, 4 ways (that means it has 128 sets) and an unused part that is l = 4 bits wide. Because 128 sets have to be indexed, the z part has to hold a_1 = 7 bits. Addresses y_i z_j 0 to y_i z_j 15 map to the same set, because the l least significant bits are not used for indexing.

When executing 512 branch instructions while varying the distance between consecutive branch instructions' addresses, several phenomena can be observed (512 branch addresses should fit in a buffer with 512 entries):

For a distance of 2 bytes, 8 consecutive branch addresses (y_i z_j 0, y_i z_j 2, ..., y_i z_j 12, y_i z_j 14) map to the same set. Yet there are just 4 ways to store them, so the last 4 branch addresses will override the previous 4. In the next iteration of the outer loop the addresses cannot be found in the BTB and dynamic prediction is unavailable, therefore every branch will be statically predicted. If the branches are coded to be conditional forward-jumps that are always taken, then static prediction will always fail.

For a distance of 4 bytes, 4 consecutive addresses (y_i z_j 0, y_i z_j 4, y_i z_j 8, y_i z_j 12) map to the same set, so that every way is filled and all 512 addresses are stored in the BTB. Therefore no mispredictions will occur (see fig. 11).

Figure 11: conditional branches at a distance of 4 bytes (the addresses y_i 0, y_i 4, y_i 8, y_i 12 fill exactly the 4 ways of each of the 128 sets)

A distance of 8 bytes will also fit in the buffer, because a group of addresses y_i z_j 0, y_i z_j 8 needs only 2 ways, and a 16 byte distance is no problem either (see fig. 12).

Figure 12: conditional branches at a distance of 16 bytes (every branch address occupies one way of its set; all 128 sets are used)

However, addresses with distances greater than 16 will not fit in the BTB because some sets remain unused (marked as free in figure 13):

Figure 13: conditional branches at a distance of 32 bytes (every second set remains free, so the branch addresses spill over into already occupied sets)

If one address is y_i z_j 0, the next following address for a distance of 32 bytes would be y_i z_j+2 0, and the set with the index z_j+1 will remain unused, so that the addresses cannot fit into the buffer. So for distances greater than 16 the misprediction ratio (MPR) will always be high.

The conclusion is that the unused l = 4 bits of an address lead to a maximum fitting distance of D = 16, and that the fact that there are 3 fitting distances results from the 4-way associativity. The 4 bits of l do not refer to an associativity of 4! If l = 3 then there would be 3 fitting distances, too, yet the highest possible distance would not be 16 but 8. From this example it can be generally derived that:

1. F fitting distances lead to an associativity of W = 2^(F-1), i.e. w = F - 1
2. the highest fitting distance D leads to l = log2(D)
3. N_BTB entries at an associativity of W form S = N_BTB / W sets, so that the z part must be able to index S sets, that means: a_1 = log2(S) = log2(N_BTB / W)

The benchmark allows varying the distance between subsequent conditional jump instructions (jcc) as well as their number. Also, every conditional branch is directed forward, so that static prediction will mispredict them. A problem is posed by the surrounding loop that allows repetition of the whole jump scenario, so that the influence of the inaccuracy of the performance counters can be reduced.

        xor %ecx, %ecx
        mov $10, %eax
again:  cmp $1000, %ecx
        jl l0
        jmp fin
        ...                  # distance D
l0:     cmp $15, %eax
        jl l1
        ...                  # distance D
l1:     jl l2
        ...
        (<number of branches> conditional jumps in total, <distance> D apart)
        inc %ecx
        jmp again
fin:

Figure 14: structure of the BTB benchmark

The ellipses symbolize arbitrary assembler instructions that are used to fill the distance D. The size of the unconditional jmp instructions depends on the jump distance: for distances less than 128 B no more than 2 B are needed, instead of 5 B for greater distances. However, the distance cannot easily be calculated, because the distance itself depends on the size of the unconditional jump. To circumvent the problem, avoid such combinations, or at least be aware that the first distance D might not be of correct size.

Figure 15: code snippet of the BTB microbenchmark with forward branches

asm volatile (
    "xor %%ecx, %%ecx\n\t"
    "mov $10, %%eax\n\t"
    "again: cmp $100000, %%ecx\n\t"
    "jl l0\n\t"
    "jmp fin\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "l0: cmp $15, %%eax\n\t"
    "jl l1\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "l1: jl l2\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "l2: inc %%ecx\n\t"
    "jmp again\n\t"
    "fin:"
    ::: "eax", "ecx"
);

Figure 16: code snippet of the BTB microbenchmark with backward branches

asm volatile (
    "xor %%ecx, %%ecx\n\t"
    "mov $10, %%eax\n\t"
    "jmp again\n\t"
    "l0: inc %%ecx\n\t"
    "jmp again\n\t"
    "l1: jl l0\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "clc\n\t"
    "clc\n\t"
    "l2: cmp $15, %%eax\n\t"
    "jl l1\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "mov $10, %%eax\n\t"
    "clc\n\t"
    "clc\n\t"
    "again: cmp $100000, %%ecx\n\t"
    "jl l2"
    ::: "eax", "ecx"
);

Figures 15 and 16 show a code example for a microbenchmark with a distance D = 32 and a number of B = 3 conditional branches. The instructions really executed in both examples are the same, except that figure 15 uses forward and figure 16 backward branches.

Number of BTB entries

To use that benchmark it is necessary to know the number of BTB entries. Either this information can be found in some documentation, or it is gained with the benchmark itself: Start at the smallest possible distance D = 2 and a small number of branches, e.g. 32.

A: If the MPR is low, increase the number of branches until the MPR is high. The highest number of branches that did not cause a high MPR is the number of BTB entries N_BTB.

B: If the MPR is high, then increase the distance D until the MPR gets low (leave the number of branches unchanged!). When a distance with a low MPR is found, continue with step A.
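Once the number of BTB entries, the number of fitting distances and the highest fitting distance have been measured, the three rules given above can be evaluated mechanically. A small helper with names of my own choosing might look like this; it simply prints the derived organisation for the example values of the 512-entry BTB discussed before.

#include <stdio.h>
#include <math.h>

/* Derive the BTB organisation from the measured values, following the rules
 * W = 2^(F-1), l = log2(D) and a1 = log2(N_BTB / W). */
static void derive_btb(int fitting_distances, int highest_distance, int entries)
{
    int ways       = 1 << (fitting_distances - 1);   /* W  = 2^(F-1)   */
    int unused     = (int)log2(highest_distance);    /* l  = log2(D)   */
    int sets       = entries / ways;                 /* S  = N_BTB / W */
    int index_bits = (int)log2(sets);                /* a1 = log2(S)   */

    printf("W = %d ways, l = %d unused bits, S = %d sets, a1 = %d index bits\n",
           ways, unused, sets, index_bits);
}

int main(void)
{
    derive_btb(3, 16, 512);   /* the example from the text: F = 3, D = 16, N_BTB = 512 */
    return 0;
}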

3.3 Caching

Legend of used variables:

  {a, b, ...}        cache line
  {A_1, B_1, ...}    address range that fills one L1 cache way
  {A_2, B_2, ...}    address range that fills one L2 cache way
  {A^0, A^1, ...}    an address range that fills L1/L2 completely (here the number does not refer
                     to L1/L2, it just distinguishes different address ranges); A^0 = [A_i, ..., D_i]

Figure 17: cache variables (level i cache, i = {1, 2}, with ways A_i, B_i, C_i, D_i holding lines a, b, c, d)

3.3.1 Adjacent cache line prefetch - PIV

To test the sectored cache line fill and its disabling, I wrote a short program that allocates a continuous memory block as big as the second level cache (512 KB) and accessed (read) only every second line within this block. If the adjacent cache line prefetch is enabled (default), all addresses of the memory area should be cached and cause no L2 misses on a second read, but if this feature is disabled only half the addresses will be available in L2. Because it is not known at which address the memory block starts - at the first line of a sector or at the second - the test is executed twice. The first time the memory is accessed starting at an offset of 0 B, the second time at an offset of 1 B.

3.3.2 Caching strategy of the L1 cache

write-allocate/write-no-allocate

Simple to test is the usage of write-allocate/write-no-allocate, because if L1 holds data that has been written to, the cache utilizes a write-allocate policy. So the following steps have to be performed:

1. invalidate caches (at least L1), so that no data is cached: wbinvd
2. write to as many addresses as fit into L1: write(A^0)
3. read these addresses and count the L1 misses (n_miss): read(A^0) -> n_miss
   - if the misses are approximately as high as the number of read addresses, write-no-allocate is used

   - if there are just a few misses, the data has been stored in L1 => write-allocate

write-through/write-back

Write-through/write-back are cache hit policies, so the data the test is working on has to be in L1. If write-through is used, then any write is performed in L1 and in L2 too, as opposed to write-back, where only the data in L1 is updated. If L1 is filled with data that has been written to and new addresses are loaded, then the modified data has to be purged out of L1. In the case of a write-through L1 cache the old, modified data is already resident in L2 and can immediately be overwritten, whereas a write-back L1 cache first has to write the data back into L2 before the new data can be loaded. The time needed for loading the new data should be longer in the latter case. The structure of the benchmark is as follows:

1. fill L1 twice through reading the addresses A^0 for the first fill and A^1 for the second: read(A^0), read(A^1)
2. write to the addresses A^1 already in L1 (we are examining a write hit policy); if L1 uses write-back, then L2 is not updated: write(A^1)
3. read addresses A^0 (these have to be loaded from L2 and push out the modified data A^1), take the time (t_mod) and count the L2 accesses (n_mod): read(A^0) -> t_mod, n_mod
4. repeat step 1 (fill L1 twice); as a result addresses A^1 are cached in L1: read(A^0), read(A^1)
5. repeat step 3 (read addresses A^0): time t_unmod, L2 accesses n_unmod; because the data has not been modified this time, it can be purged regardless of the applied policy, and t_unmod, n_unmod correspond to t_w-through, n_w-through: read(A^0) -> t_unmod, n_unmod

If the time is longer and the count is bigger in the case of the modified data, then a write-back policy is used for L1:

  t_mod > t_unmod = t_w-through  and  n_mod > n_unmod = n_w-through  =>  write-back
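For illustration, the structure of this write-hit-policy benchmark could be coded as follows. A^0 and A^1 are represented by two buffers of L1 size, and the actual time and L2-access measurements of steps 3 and 5 would use rdtsc and the performance-monitoring macros of section 3.1, which are left out here. Sizes, names and the plain-array representation of the address ranges are my own simplifications, not the thesis implementation.

/* Sketch of the L1 write-hit-policy probe.  a0 and a1 are two buffers that
 * each fill the whole L1 cache; reading them with a stride of one cache line
 * touches every line once. */
#define L1_SIZE   (16 * 1024)   /* PII/III L1 data cache */
#define LINE_SIZE 32

static volatile unsigned char a0[L1_SIZE], a1[L1_SIZE];
static volatile unsigned char byte_sink;

static void read_range(volatile unsigned char *p)
{
    for (int i = 0; i < L1_SIZE; i += LINE_SIZE) byte_sink = p[i];
}

static void write_range(volatile unsigned char *p)
{
    for (int i = 0; i < L1_SIZE; i += LINE_SIZE) p[i] = 1;
}

static void write_policy_probe(void)
{
    read_range(a0); read_range(a1);   /* 1: fill L1 twice, A^1 ends up cached  */
    write_range(a1);                  /* 2: dirty A^1 (write hit)              */
    read_range(a0);                   /* 3: measure t_mod / n_mod here         */

    read_range(a0); read_range(a1);   /* 4: refill, A^1 cached but unmodified  */
    read_range(a0);                   /* 5: measure t_unmod / n_unmod here     */
}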

3.3.3 Replacement strategy

While there were some documents with information on which to base my findings about the caching strategy, there is almost no (and if so, not extensive) information on replacement policies. As far as I found out, the Intel manuals only cover the Netburst microarchitecture when stating "All caches use a pseudo-LRU (least recently used) replacement algorithm". However, it is never mentioned which plru strategy is applied. The statement of [Sea00] about the Pentium II is clearer because it is simpler [Int, p. 1-19]:

  "L1 uses a 4-way set associative mapping which divides the 512 lines into 128 sets of 4 cache lines. Each of these sets is really a least recently used (LRU) list."

Because it seems that Intel makes use of LRU based strategies, the benchmarks are aimed in that direction. As can be seen in figure 4, the plrum algorithm has some disadvantages compared to plrut:

1. it needs more plru bits, namely one per way in each set: N_bits,m = W * S, whereas plrut needs 1 bit less per set: N_bits,t = (W - 1) * S
2. the moment the last old entry of a set is replaced, marked as new, and all other entries are marked old, the history of these lines is lost (see figure 18)

Figure 18: plrum history loss (access sequence read(d), read(c), read(b), read(a) on a set initially holding a, b, c; shown are the MRU bits and the corresponding strict LRU order)

Figure 18 shows quite clearly that the MRU-based pseudo LRU policy is an approximation of the strict LRU algorithm. After lines a, b, c, d have been read, to be loaded into the L1 cache, they are read in the reverse order (c, b, a) to make a the most and d the least recently used entry (step 4: the LRU stack is d, c, b, a). That means, when replacing lines, d should be the first, c and b the next and a the last. However, we assumed that the cache ways are filled from left to right (steps 0, 1), therefore the old lines will be replaced in the order b, c, d!

Following these thoughts, my assumption was that if a pseudo LRU strategy was used, the tree based algorithm would be preferred. To analyse the underlying policy one could fill the cache, reload some ways to make these addresses the most recent ones and afterwards load any new way to purge an old one - which one is purged shall be the indication of the used algorithm. The detailed structure looks like:

1. read as many addresses as fit into L1; PII/III/IV have a 4-way L1 cache, so addresses A_1, B_1, C_1, D_1 have to be loaded: read(A_1) ... read(D_1)
2. reload (read) some old ways to make them the MRU ones: read(some of {A_1, ..., D_1})
3. load (read) a new way which overrides another one: read(E_1)
4. read the old addresses and count the L1 misses for each way: read(A_1) -> n_A1, ..., read(D_1) -> n_D1;
   the way with the most misses is the one that has been replaced

If the second level cache is so large that one of its ways can hold more lines than fit into L1, it is sure that the reloading of one L2 way really reads from and updates the lines in L2, not only in L1:

  a_2 >= a_1 + w_1

If this condition is met, the presented algorithm can be used to analyse the replacement strategy of L2, too.
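A structural sketch of this probe, again with the measurement code left out and with one 4 KB buffer standing in for each L1-way-sized address range, could look as follows; the sizes and names are my own simplifications for the PII/III L1 and the reloaded subset is just one of the combinations discussed on the next page.

/* Sketch of the replacement-strategy probe: fill the four L1 ways, re-read a
 * chosen subset of the old ways to make them most recently used, load one new
 * way and then determine which old way was evicted (in the real benchmark by
 * counting L1 misses per way with the performance counters). */
#define WAY_SIZE   (4 * 1024)   /* one way of the 16 KB, 4-way PII/III L1 */
#define L1_LINE    32

static volatile unsigned char way_a[WAY_SIZE], way_b[WAY_SIZE], way_c[WAY_SIZE],
                              way_d[WAY_SIZE], way_e[WAY_SIZE];
static volatile unsigned char line_sink;

static void read_way(volatile unsigned char *p)
{
    for (int i = 0; i < WAY_SIZE; i += L1_LINE) line_sink = p[i];
}

static void replacement_probe(void)
{
    read_way(way_a); read_way(way_b);     /* 1: fill all four ways A_1 .. D_1    */
    read_way(way_c); read_way(way_d);

    read_way(way_a); read_way(way_b);     /* 2: reload a subset, here A_1, B_1   */

    read_way(way_e);                      /* 3: new way E_1 evicts one old way   */

    read_way(way_a); read_way(way_b);     /* 4: re-read A_1 .. D_1; the way with */
    read_way(way_c); read_way(way_d);     /*    the most L1 misses was evicted   */
}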

Figure 19: sizing relations L1 - L2, based on the PII (one L2 way holds 32 L1-way-sized blocks of 128 lines each; L1 only holds the most recently loaded ones)

If ways have been loaded in the order A_2, B_2, C_2, D_2 and afterwards way C_2 shall be reloaded, it is sure that none of the addresses C0,0 - C31,127 still reside in L1, because they have been purged when loading D_2.

To obtain several hints on the applied replacement policy, I decided to vary the reloading of the 3 oldest ways A_i, B_i, C_i. The resulting possibilities are: % (none) / A_i / B_i / A_i, B_i / C_i / A_i, C_i / B_i, C_i / A_i, B_i, C_i.

After carrying out the first experiments, I realized that there are different possibilities how a cache with a plrut policy can be loaded: Either the tree bits are used and the empty cache is filled in tree order, or, more simply, the cache fills the first (e.g. from left to right) free entry and only uses the tree bits when all entries of a set are filled. Figure 20 shows a 4-way cache set that uses a plrut policy with tree based filling of empty entries, whereas fig. 21 presents a filling of the first free entry. Once again, updated tree bits are boxed red and black arrows symbolize the path through the tree. Every step shows the cache set after the given instruction has been executed. Step 0 shows the fresh set after the cache has been invalidated. Steps 1 to 4 present the filling, and steps 5, 6 explain how the pseudo-LRU algorithm can be tricked into replacing an entry other than the least recently used.

Figure 20: tree based fill    Figure 21: sequential fill
(both show one 4-way set for the access sequence read(a), read(b), read(c), read(d), then a reload - read(c) in figure 20, read(b) in figure 21 - and finally read(e); after read(e) the set holds a, c, e, d in figure 20 and a, b, e, d in figure 21)

In steps 0 to 3 of drawing 21 the tree bits do not have arrows indicating the path to the LRU line, to stress that this path is not evaluated as long as there are free entries.

Table 6 compares the results of varying the ways to be reloaded between tree based and sequential filling. The leftmost column gives the combinations of the 3 oldest ways to be reloaded. Columns 2 to 5 show the cache ways that should be replaced. Light blue backgrounded rows are those which have a different result depending on the filling method.

Table 6: cache ways expected to be replaced under a plrut strategy

                       4 ways                            8 ways
  reload        tree based fill  sequential fill  tree based fill  sequential fill
  none (%)      A                A                A                A
  A             B                C                B                E
  B             A                C                A                E
  A, B          C                C                C                E
  C             B                A                B                E
  A, C          B                B                B                E
  B, C          D                A                D                E
  A, B, C       D                A                D                E

How the pseudo LRU tree based replacement strategy is applied to an 8-way associative cache is illustrated in figure 22. The drawing is to be understood like the others of this kind before.

This benchmark analysed which ways of a cache are being replaced when reading a new way that is not yet cached. To obtain several hints on the underlying policy, the number and order of the reloaded ways A_i, ..., C_i could be changed to influence the plru history. Yet there is still another simple method to test the replacement strategy: First the cache is filled with addresses A^0. Afterwards one new way is read and it is checked which old way it replaces. Then the cache is filled again with A^0, but this time two new ways are loaded and it is noted which two old ways are replaced by them. This method is repeated until as many new ways are loaded as fit into the cache, that means until all old ways have been purged. With that incremental replacing of one to W ways of a cache it can be observed which ways the pseudo LRU algorithm selects as least recently used, and that helps to identify the used algorithm. The structure of this benchmark is as follows:

1. read as many addresses as fit into L1; PII/III/IV have a 4-way L1 cache, so addresses A_1, B_1, C_1, D_1 have to be loaded: read(A_1) ... read(D_1)
2. load (read) one new way which overrides another one: read(E_1)
3. read the old addresses and count the L1 misses for each way: read(A_1) -> n_A1, ..., read(D_1) -> n_D1

Figure 22: example of an 8-way cache with a plrut policy and tree based filling (steps: empty set, read(a), read(b), read(c), read(d), read(e), read(f, g, h); the final contents are a, e, c, g, b, f, d, h)

   the way with the most misses is the one that has been replaced: X_1

4. repeat step 1: read(A_1) ... read(D_1)
5. load (read) two new ways which override another two
6. repeat step 3: read(A_1) -> n_A1, ..., read(D_1) -> n_D1;
   the ways with the most misses are the ones being replaced: X_1, X_2. One of the two replaced ways is the one replaced in step 3, so the other one holds new information.
   ...
9. repeat step 3;
   the ways with the most misses are those being replaced: X_1, X_2, X_3
   ...
12. repeat step 3;
   the ways with the most misses are those being replaced: X_1, X_2, X_3, X_4

X_k in {A_1, B_1, C_1, D_1}, k = {1, 2, 3, 4}

As explained before (see fig. 19, p. 24), this test can be applied to L2 too. Only the address ranges have to be adapted: A_2, ..., D_2 / A_2, ..., H_2 (3) have to be read to fill L2, and E_2, ..., H_2 / I_2, ..., P_2 (3) are the new ways to purge the old ones.

3: 4-way/8-way L2

4 Worst case on caching

Because access to caches is much faster than access to main memory, caches may greatly improve performance. However, if data is not found in a cache it has to be retrieved from memory, and - even worse - if the cache is already filled with modified data, which has to be written back to memory before new data can be loaded into it, this needs much more time than a single load from main memory.

4.1 Cache flooding I

[LH] describes the worst case when working with a 2-level memory cache architecture and how to achieve that case. Conditions for that so called double purge configuration are:

* a 2-level cache architecture
* at least a 2-way L1
* a write-back strategy
* a strict LRU cache line substitution mechanism
* L2 is at least a_1 a_2 times bigger than L1 (a_1, a_2 are the address widths of L1/L2)

4.1.1 Double Purge Scenario

Figure 23: cache configuration; solid arrows indicate mapping relations, access to D results in the double purge case (memory lines x_a y_m z_i (A), x_b y_m z_i (B), x_c y_n z_i (C), x_d y_n z_i (D) map to the L2 entries y_m z_i, y_n z_i and to the L1 entry z_i)

The uppercase alphabetic characters (A, B, C, D) in figure 23 denote a whole cache line, and the solid arrows indicate the mapping relations from memory to L2 and from L2 to L1. The cache lines A, B, C contain modified data. Cache line D maps to the entry z_i of L1, which is already filled with A, so for caching D in L1, line A has to be purged to L2. However, L2 cannot hold A because the modified line B occupies the corresponding entry, so A is written through to memory.


CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Advanced Memory Management

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Advanced Memory Management CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Advanced Memory Management Advanced Functionaity Now we re going to ook at some advanced functionaity that the OS can provide appications using

More information

3.1 The cin Object. Expressions & I/O. Console Input. Example program using cin. Unit 2. Sections 2.14, , 5.1, CS 1428 Spring 2018

3.1 The cin Object. Expressions & I/O. Console Input. Example program using cin. Unit 2. Sections 2.14, , 5.1, CS 1428 Spring 2018 Expressions & I/O Unit 2 Sections 2.14, 3.1-10, 5.1, 5.11 CS 1428 Spring 2018 Ji Seaman 1 3.1 The cin Object cin: short for consoe input a stream object: represents the contents of the screen that are

More information

Arithmetic Coding. Prof. Ja-Ling Wu. Department of Computer Science and Information Engineering National Taiwan University

Arithmetic Coding. Prof. Ja-Ling Wu. Department of Computer Science and Information Engineering National Taiwan University Arithmetic Coding Prof. Ja-Ling Wu Department of Computer Science and Information Engineering Nationa Taiwan University F(X) Shannon-Fano-Eias Coding W..o.g. we can take X={,,,m}. Assume p()>0 for a. The

More information

Windows NT, Terminal Server and Citrix MetaFrame Terminal Server Architecture

Windows NT, Terminal Server and Citrix MetaFrame Terminal Server Architecture Windows NT, Termina Server and Citrix MetaFrame - CH 3 - Termina Server Architect.. Page 1 of 13 [Figures are not incuded in this sampe chapter] Windows NT, Termina Server and Citrix MetaFrame - 3 - Termina

More information

Concurrent programming: From theory to practice. Concurrent Algorithms 2016 Tudor David

Concurrent programming: From theory to practice. Concurrent Algorithms 2016 Tudor David oncurrent programming: From theory to practice oncurrent Agorithms 2016 Tudor David From theory to practice Theoretica (design) Practica (design) Practica (impementation) 2 From theory to practice Theoretica

More information

CSE120 Principles of Operating Systems. Architecture Support for OS

CSE120 Principles of Operating Systems. Architecture Support for OS CSE120 Principes of Operating Systems Architecture Support for OS Why are you sti here? You shoud run away from my CSE120! 2 CSE 120 Architectura Support Announcement Have you visited the web page? http://cseweb.ucsd.edu/casses/fa18/cse120-a/

More information

Functions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7

Functions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7 Functions Unit 6 Gaddis: 6.1-5,7-10,13,15-16 and 7.7 CS 1428 Spring 2018 Ji Seaman 6.1 Moduar Programming Moduar programming: breaking a program up into smaer, manageabe components (modues) Function: a

More information

Dynamic Symbolic Execution of Distributed Concurrent Objects

Dynamic Symbolic Execution of Distributed Concurrent Objects Dynamic Symboic Execution of Distributed Concurrent Objects Andreas Griesmayer 1, Bernhard Aichernig 1,2, Einar Broch Johnsen 3, and Rudof Schatte 1,2 1 Internationa Institute for Software Technoogy, United

More information

Searching, Sorting & Analysis

Searching, Sorting & Analysis Searching, Sorting & Anaysis Unit 2 Chapter 8 CS 2308 Fa 2018 Ji Seaman 1 Definitions of Search and Sort Search: find a given item in an array, return the index of the item, or -1 if not found. Sort: rearrange

More information

Outerjoins, Constraints, Triggers

Outerjoins, Constraints, Triggers Outerjoins, Constraints, Triggers Lecture #13 Autumn, 2001 Fa, 2001, LRX #13 Outerjoins, Constraints, Triggers HUST,Wuhan,China 358 Outerjoin R S = R S with danging tupes padded with nus and incuded in

More information

Navigating and searching theweb

Navigating and searching theweb Navigating and searching theweb Contents Introduction 3 1 The Word Wide Web 3 2 Navigating the web 4 3 Hyperinks 5 4 Searching the web 7 5 Improving your searches 8 6 Activities 9 6.1 Navigating the web

More information

Modelling and Performance Evaluation of Router Transparent Web cache Mode

Modelling and Performance Evaluation of Router Transparent Web cache Mode Emad Hassan A-Hemiary IJCSET Juy 2012 Vo 2, Issue 7,1316-1320 Modeing and Performance Evauation of Transparent cache Mode Emad Hassan A-Hemiary Network Engineering Department, Coege of Information Engineering,

More information

MCSE TestPrep SQL Server 6.5 Design & Implementation - 3- Data Definition

MCSE TestPrep SQL Server 6.5 Design & Implementation - 3- Data Definition MCSE TestPrep SQL Server 6.5 Design & Impementation - Data Definition Page 1 of 38 [Figures are not incuded in this sampe chapter] MCSE TestPrep SQL Server 6.5 Design & Impementation - 3- Data Definition

More information

The Big Picture WELCOME TO ESIGNAL

The Big Picture WELCOME TO ESIGNAL 2 The Big Picture HERE S SOME GOOD NEWS. You don t have to be a rocket scientist to harness the power of esigna. That s exciting because we re certain that most of you view your PC and esigna as toos for

More information

IBC DOCUMENT PROG007. SA/STA SERIES User's Guide V7.0

IBC DOCUMENT PROG007. SA/STA SERIES User's Guide V7.0 IBC DOCUMENT SA/STA SERIES User's Guide V7.0 Page 2 New Features for Version 7.0 Mutipe Schedues This version of the SA/STA firmware supports mutipe schedues for empoyees. The mutipe schedues are impemented

More information

Real-Time Image Generation with Simultaneous Video Memory Read/Write Access and Fast Physical Addressing

Real-Time Image Generation with Simultaneous Video Memory Read/Write Access and Fast Physical Addressing Rea-Time Image Generation with Simutaneous Video Memory Read/rite Access and Fast Physica Addressing Mountassar Maamoun 1, Bouaem Laichi 2, Abdehaim Benbekacem 3, Daoud Berkani 4 1 Department of Eectronic,

More information

An Introduction to Design Patterns

An Introduction to Design Patterns An Introduction to Design Patterns 1 Definitions A pattern is a recurring soution to a standard probem, in a context. Christopher Aexander, a professor of architecture Why woud what a prof of architecture

More information

Data Management Updates

Data Management Updates Data Management Updates Jenny Darcy Data Management Aiance CRP Meeting, Thursday, November 1st, 2018 Presentation Objectives New staff Update on Ingres (JCCS) conversion project Fina IRB cosure at study

More information

Language Identification for Texts Written in Transliteration

Language Identification for Texts Written in Transliteration Language Identification for Texts Written in Transiteration Andrey Chepovskiy, Sergey Gusev, Margarita Kurbatova Higher Schoo of Economics, Data Anaysis and Artificia Inteigence Department, Pokrovskiy

More information

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Lecture 4: Threads

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Lecture 4: Threads CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Lecture 4: Threads Announcement Project 0 Due Project 1 out Homework 1 due on Thursday Submit it to Gradescope onine 2 Processes Reca that

More information

BEA WebLogic Server. Release Notes for WebLogic Tuxedo Connector 1.0

BEA WebLogic Server. Release Notes for WebLogic Tuxedo Connector 1.0 BEA WebLogic Server Reease Notes for WebLogic Tuxedo Connector 1.0 BEA WebLogic Tuxedo Connector Reease 1.0 Document Date: June 29, 2001 Copyright Copyright 2001 BEA Systems, Inc. A Rights Reserved. Restricted

More information

Directives & Memory Spaces. Dr. Farid Farahmand Updated: 2/18/2019

Directives & Memory Spaces. Dr. Farid Farahmand Updated: 2/18/2019 Directives & Memory Spaces Dr. Farid Farahmand Updated: 2/18/2019 Memory Types Program Memory Data Memory Stack Interna PIC18 Architecture Data Memory I/O Ports 8 wires 31 x 21 Stack Memory Timers 21 wires

More information

RDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002*

RDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002* RDF Objects 1 Aex Barne Information Infrastructure Laboratory HP Laboratories Bristo HPL-2002-315 November 27 th, 2002* E-mai: Andy_Seaborne@hp.hp.com RDF, semantic web, ontoogy, object-oriented datastructures

More information

Guardian 365 Pro App Guide. For more exciting new products please visit our website: Australia: OWNER S MANUAL

Guardian 365 Pro App Guide. For more exciting new products please visit our website: Australia:   OWNER S MANUAL Guardian 365 Pro App Guide For more exciting new products pease visit our website: Austraia: www.uniden.com.au OWNER S MANUAL Privacy Protection Notice As the device user or data controer, you might coect

More information

Insert the power cord into the AC input socket of your projector, as shown in Figure 1. Connect the other end of the power cord to an AC outlet.

Insert the power cord into the AC input socket of your projector, as shown in Figure 1. Connect the other end of the power cord to an AC outlet. Getting Started This chapter wi expain the set-up and connection procedures for your projector, incuding information pertaining to basic adjustments and interfacing with periphera equipment. Powering Up

More information

Operating Avaya Aura Conferencing

Operating Avaya Aura Conferencing Operating Avaya Aura Conferencing Reease 6.0 June 2011 04-603510 Issue 1 2010 Avaya Inc. A Rights Reserved. Notice Whie reasonabe efforts were made to ensure that the information in this document was compete

More information

May 13, Mark Lutz Boulder, Colorado (303) [work] (303) [home]

May 13, Mark Lutz Boulder, Colorado (303) [work] (303) [home] "Using Python": a Book Preview May 13, 1995 Mark Lutz Bouder, Coorado utz@kapre.com (303) 546-8848 [work] (303) 684-9565 [home] Introduction. This paper is a brief overview of the upcoming Python O'Reiy

More information

Community-Aware Opportunistic Routing in Mobile Social Networks

Community-Aware Opportunistic Routing in Mobile Social Networks IEEE TRANSACTIONS ON COMPUTERS VOL:PP NO:99 YEAR 213 Community-Aware Opportunistic Routing in Mobie Socia Networks Mingjun Xiao, Member, IEEE Jie Wu, Feow, IEEE, and Liusheng Huang, Member, IEEE Abstract

More information

lnput/output (I/O) AND INTERFACING

lnput/output (I/O) AND INTERFACING CHAPTER 7 NPUT/OUTPUT (I/O) AND INTERFACING INTRODUCTION The input/output section, under the contro of the CPU s contro section, aows the computer to communicate with and/or contro other computers, periphera

More information

User s Guide. Eaton Bypass Power Module (BPM) For use with the following: Eaton 9155 UPS (8 15 kva)

User s Guide. Eaton Bypass Power Module (BPM) For use with the following: Eaton 9155 UPS (8 15 kva) Eaton Bypass Power Modue (BPM) User s Guide For use with the foowing: Eaton 9155 UPS (8 15 kva) Eaton 9170+ UPS (3 18 kva) Eaton 9PX Spit-Phase UPS (6 10 kva) Specia Symbos The foowing are exampes of symbos

More information

Chapter 3: Introduction to the Flash Workspace

Chapter 3: Introduction to the Flash Workspace Chapter 3: Introduction to the Fash Workspace Page 1 of 10 Chapter 3: Introduction to the Fash Workspace In This Chapter Features and Functionaity of the Timeine Features and Functionaity of the Stage

More information

Chapter 5: Transactions in Federated Databases

Chapter 5: Transactions in Federated Databases Federated Databases Chapter 5: in Federated Databases Saes R&D Human Resources Kemens Böhm Distributed Data Management: in Federated Databases 1 Kemens Böhm Distributed Data Management: in Federated Databases

More information

Hiding secrete data in compressed images using histogram analysis

Hiding secrete data in compressed images using histogram analysis University of Woongong Research Onine University of Woongong in Dubai - Papers University of Woongong in Dubai 2 iding secrete data in compressed images using histogram anaysis Farhad Keissarian University

More information

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion.

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion. Lecture outine 433-324 Graphics and Interaction Scan Converting Poygons and Lines Department of Computer Science and Software Engineering The Introduction Scan conversion Scan-ine agorithm Edge coherence

More information

On-Chip CNN Accelerator for Image Super-Resolution

On-Chip CNN Accelerator for Image Super-Resolution On-Chip CNN Acceerator for Image Super-Resoution Jung-Woo Chang and Suk-Ju Kang Dept. of Eectronic Engineering, Sogang University, Seou, South Korea {zwzang91, sjkang}@sogang.ac.kr ABSTRACT To impement

More information

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line American J. of Engineering and Appied Sciences 3 (): 5-24, 200 ISSN 94-7020 200 Science Pubications Appication of Inteigence Based Genetic Agorithm for Job Sequencing Probem on Parae Mixed-Mode Assemby

More information

A Fast Block Matching Algorithm Based on the Winner-Update Strategy

A Fast Block Matching Algorithm Based on the Winner-Update Strategy In Proceedings of the Fourth Asian Conference on Computer Vision, Taipei, Taiwan, Jan. 000, Voume, pages 977 98 A Fast Bock Matching Agorithm Based on the Winner-Update Strategy Yong-Sheng Chenyz Yi-Ping

More information

Lecture Notes for Chapter 4 Part III. Introduction to Data Mining

Lecture Notes for Chapter 4 Part III. Introduction to Data Mining Data Mining Cassification: Basic Concepts, Decision Trees, and Mode Evauation Lecture Notes for Chapter 4 Part III Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

More information

Special Edition Using Microsoft Office Sharing Documents Within a Workgroup

Special Edition Using Microsoft Office Sharing Documents Within a Workgroup Specia Edition Using Microsoft Office 2000 - Chapter 7 - Sharing Documents Within a.. Page 1 of 8 [Figures are not incuded in this sampe chapter] Specia Edition Using Microsoft Office 2000-7 - Sharing

More information

DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS

DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS Pave Tchesmedjiev, Peter Vassiev Centre for Biomedica Engineering,

More information

Authorization of a QoS Path based on Generic AAA. Leon Gommans, Cees de Laat, Bas van Oudenaarde, Arie Taal

Authorization of a QoS Path based on Generic AAA. Leon Gommans, Cees de Laat, Bas van Oudenaarde, Arie Taal Abstract Authorization of a QoS Path based on Generic Leon Gommans, Cees de Laat, Bas van Oudenaarde, Arie Taa Advanced Internet Research Group, Department of Computer Science, University of Amsterdam.

More information

For Review Only. CFP: Cooperative Fast Protection. Bin Wu, Pin-Han Ho, Kwan L. Yeung, János Tapolcai and Hussein T. Mouftah

For Review Only. CFP: Cooperative Fast Protection. Bin Wu, Pin-Han Ho, Kwan L. Yeung, János Tapolcai and Hussein T. Mouftah Journa of Lightwave Technoogy Page of CFP: Cooperative Fast Protection Bin Wu, Pin-Han Ho, Kwan L. Yeung, János Tapocai and Hussein T. Mouftah Abstract We introduce a nove protection scheme, caed Cooperative

More information

understood as processors that match AST patterns of the source language and translate them into patterns in the target language.

understood as processors that match AST patterns of the source language and translate them into patterns in the target language. A Basic Compier At a fundamenta eve compiers can be understood as processors that match AST patterns of the source anguage and transate them into patterns in the target anguage. Here we wi ook at a basic

More information

Space-Time Trade-offs.

Space-Time Trade-offs. Space-Time Trade-offs. Chethan Kamath 03.07.2017 1 Motivation An important question in the study of computation is how to best use the registers in a CPU. In most cases, the amount of registers avaiabe

More information

Portable Compiler Optimisation Across Embedded Programs and Microarchitectures using Machine Learning

Portable Compiler Optimisation Across Embedded Programs and Microarchitectures using Machine Learning Portabe Compier Optimisation Across Embedded Programs and Microarchitectures using Machine Learning Christophe Dubach, Timothy M. Jones, Edwin V. Bonia Members of HiPEAC Schoo of Informatics University

More information

Computer Networks. College of Computing. Copyleft 2003~2018

Computer Networks. College of Computing.   Copyleft 2003~2018 Computer Networks Computer Networks Prof. Lin Weiguo Coege of Computing Copyeft 2003~2018 inwei@cuc.edu.cn http://icourse.cuc.edu.cn/computernetworks/ http://tc.cuc.edu.cn Attention The materias beow are

More information

Topology-aware Key Management Schemes for Wireless Multicast

Topology-aware Key Management Schemes for Wireless Multicast Topoogy-aware Key Management Schemes for Wireess Muticast Yan Sun, Wade Trappe,andK.J.RayLiu Department of Eectrica and Computer Engineering, University of Maryand, Coege Park Emai: ysun, kjriu@gue.umd.edu

More information

Priority Queueing for Packets with Two Characteristics

Priority Queueing for Packets with Two Characteristics 1 Priority Queueing for Packets with Two Characteristics Pave Chuprikov, Sergey I. Nikoenko, Aex Davydow, Kiri Kogan Abstract Modern network eements are increasingy required to dea with heterogeneous traffic.

More information

Nearest Neighbor Learning

Nearest Neighbor Learning Nearest Neighbor Learning Cassify based on oca simiarity Ranges from simpe nearest neighbor to case-based and anaogica reasoning Use oca information near the current query instance to decide the cassification

More information

THE PERCENTAGE OCCUPANCY HIT OR MISS TRANSFORM

THE PERCENTAGE OCCUPANCY HIT OR MISS TRANSFORM 17th European Signa Processing Conference (EUSIPCO 2009) Gasgow, Scotand, August 24-28, 2009 THE PERCENTAGE OCCUPANCY HIT OR MISS TRANSFORM P. Murray 1, S. Marsha 1, and E.Buinger 2 1 Dept. of Eectronic

More information

A Novel Congestion Control Scheme for Elastic Flows in Network-on-Chip Based on Sum-Rate Optimization

A Novel Congestion Control Scheme for Elastic Flows in Network-on-Chip Based on Sum-Rate Optimization A Nove Congestion Contro Scheme for Eastic Fows in Network-on-Chip Based on Sum-Rate Optimization Mohammad S. Taebi 1, Fahimeh Jafari 1,3, Ahmad Khonsari 2,1, and Mohammad H. Yaghmae 3 1 IPM, Schoo of

More information

Introducing a Target-Based Approach to Rapid Prototyping of ECUs

Introducing a Target-Based Approach to Rapid Prototyping of ECUs Introducing a Target-Based Approach to Rapid Prototyping of ECUs FEBRUARY, 1997 Abstract This paper presents a target-based approach to Rapid Prototyping of Eectronic Contro Units (ECUs). With this approach,

More information

PL/SQL, Embedded SQL. Lecture #14 Autumn, Fall, 2001, LRX

PL/SQL, Embedded SQL. Lecture #14 Autumn, Fall, 2001, LRX PL/SQL, Embedded SQL Lecture #14 Autumn, 2001 Fa, 2001, LRX #14 PL/SQL,Embedded SQL HUST,Wuhan,China 402 PL/SQL Found ony in the Orace SQL processor (sqpus). A compromise between competey procedura programming

More information

Quality of Service Evaluations of Multicast Streaming Protocols *

Quality of Service Evaluations of Multicast Streaming Protocols * Quaity of Service Evauations of Muticast Streaming Protocos Haonan Tan Derek L. Eager Mary. Vernon Hongfei Guo omputer Sciences Department University of Wisconsin-Madison, USA {haonan, vernon, guo}@cs.wisc.edu

More information

Replication of Virtual Network Functions: Optimizing Link Utilization and Resource Costs

Replication of Virtual Network Functions: Optimizing Link Utilization and Resource Costs Repication of Virtua Network Functions: Optimizing Link Utiization and Resource Costs Francisco Carpio, Wogang Bziuk and Admea Jukan Technische Universität Braunschweig, Germany Emai:{f.carpio, w.bziuk,

More information

AgreeYa Solutions. Site Administrator for SharePoint User Guide

AgreeYa Solutions. Site Administrator for SharePoint User Guide AgreeYa Soutions Site Administrator for SharePoint 5.2.4 User Guide 2017 2017 AgreeYa Soutions Inc. A rights reserved. This product is protected by U.S. and internationa copyright and inteectua property

More information

Joint Optimization of Intra- and Inter-Autonomous System Traffic Engineering

Joint Optimization of Intra- and Inter-Autonomous System Traffic Engineering Joint Optimization of Intra- and Inter-Autonomous System Traffic Engineering Kin-Hon Ho, Michae Howarth, Ning Wang, George Pavou and Styianos Georgouas Centre for Communication Systems Research, University

More information

Backing-up Fuzzy Control of a Truck-trailer Equipped with a Kingpin Sliding Mechanism

Backing-up Fuzzy Control of a Truck-trailer Equipped with a Kingpin Sliding Mechanism Backing-up Fuzzy Contro of a Truck-traier Equipped with a Kingpin Siding Mechanism G. Siamantas and S. Manesis Eectrica & Computer Engineering Dept., University of Patras, Patras, Greece gsiama@upatras.gr;stam.manesis@ece.upatras.gr

More information

Quick Start Instructions

Quick Start Instructions Eaton Power Xpert Gateway Minisot (PXGMS) UPS Card Quick Start Instructions Ethernet 10/100 Status DHCP EMP + - CMN 100 Act Ident Power PXGMS UPS Restart TX Setup RX Package Contents Power Xpert Gateway

More information

NCH Software Spin 3D Mesh Converter

NCH Software Spin 3D Mesh Converter NCH Software Spin 3D Mesh Converter This user guide has been created for use with Spin 3D Mesh Converter Version 1.xx NCH Software Technica Support If you have difficuties using Spin 3D Mesh Converter

More information

Chapter 3: KDE Page 1 of 31. Put icons on the desktop to mount and unmount removable disks, such as floppies.

Chapter 3: KDE Page 1 of 31. Put icons on the desktop to mount and unmount removable disks, such as floppies. Chapter 3: KDE Page 1 of 31 Chapter 3: KDE In This Chapter What Is KDE? Instaing KDE Seecting KDE Basic Desktop Eements Running Programs Stopping KDE KDE Capabiities Configuring KDE with the Contro Center

More information

An Adaptive Two-Copy Delayed SR-ARQ for Satellite Channels with Shadowing

An Adaptive Two-Copy Delayed SR-ARQ for Satellite Channels with Shadowing An Adaptive Two-Copy Deayed SR-ARQ for Sateite Channes with Shadowing Jing Zhu, Sumit Roy zhuj@ee.washington.edu Department of Eectrica Engineering, University of Washington Abstract- The paper focuses

More information

Analysis and parallelization strategies for Ruge-Stüben AMG on many-core processors

Analysis and parallelization strategies for Ruge-Stüben AMG on many-core processors Anaysis and paraeization strategies for Ruge-Stüben AMG on many-core processors P. Zaspe Departement Mathematik und Informatik Preprint No. 217-6 Fachbereich Mathematik June 217 Universität Base CH-451

More information

Simba MongoDB ODBC Driver with SQL Connector. Installation and Configuration Guide. Simba Technologies Inc.

Simba MongoDB ODBC Driver with SQL Connector. Installation and Configuration Guide. Simba Technologies Inc. Simba MongoDB ODBC Driver with SQL Instaation and Configuration Guide Simba Technoogies Inc. Version 2.0.1 February 16, 2016 Instaation and Configuration Guide Copyright 2016 Simba Technoogies Inc. A Rights

More information

UnixWare 7 System Administration UnixWare 7 System Configuration

UnixWare 7 System Administration UnixWare 7 System Configuration UnixWare 7 System Administration - CH 3 - UnixWare 7 System Configuration Page 1 of 8 [Figures are not incuded in this sampe chapter] UnixWare 7 System Administration - 3 - UnixWare 7 System Configuration

More information

Proceedings of the International Conference on Systolic Arrays, San Diego, California, U.S.A., May 25-27, 1988 AN EFFICIENT ASYNCHRONOUS MULTIPLIER!

Proceedings of the International Conference on Systolic Arrays, San Diego, California, U.S.A., May 25-27, 1988 AN EFFICIENT ASYNCHRONOUS MULTIPLIER! [1,2] have, in theory, revoutionized cryptography. Unfortunatey, athough offer many advantages over conventiona and authentication), such cock synchronization in this appication due to the arge operand

More information

Resource Optimization to Provision a Virtual Private Network Using the Hose Model

Resource Optimization to Provision a Virtual Private Network Using the Hose Model Resource Optimization to Provision a Virtua Private Network Using the Hose Mode Monia Ghobadi, Sudhakar Ganti, Ghoamai C. Shoja University of Victoria, Victoria C, Canada V8W 3P6 e-mai: {monia, sganti,

More information

Testing Whether a Set of Code Words Satisfies a Given Set of Constraints *

Testing Whether a Set of Code Words Satisfies a Given Set of Constraints * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 6, 333-346 (010) Testing Whether a Set of Code Words Satisfies a Given Set of Constraints * HSIN-WEN WEI, WAN-CHEN LU, PEI-CHI HUANG, WEI-KUAN SHIH AND MING-YANG

More information

Lecture 3. Jamshaid Yousaf Department of Computer Sciences Cristian college of Business, Arts and Technology Gujranwala.

Lecture 3. Jamshaid Yousaf Department of Computer Sciences Cristian college of Business, Arts and Technology Gujranwala. Lecture 3 Jamshaid Yousaf jamshaid.yousaf@ccbat.com.pk Department of Computer Sciences Cristian coege of Business, Arts and Technoogy Gujranwaa. Overview Importance of text in a mutimedia presentation.

More information

NCH Software Express Delegate

NCH Software Express Delegate NCH Software Express Deegate This user guide has been created for use with Express Deegate Version 4.xx NCH Software Technica Support If you have difficuties using Express Deegate pease read the appicabe

More information

Hour 3: Linux Basics Page 1 of 16

Hour 3: Linux Basics Page 1 of 16 Hour 3: Linux Basics Page 1 of 16 Hour 3: Linux Basics Now that you ve instaed Red Hat Linux, you might wonder what to do next. Whether you re the kind of person who earns by jumping right in and starting

More information

A New Supervised Clustering Algorithm Based on Min-Max Modular Network with Gaussian-Zero-Crossing Functions

A New Supervised Clustering Algorithm Based on Min-Max Modular Network with Gaussian-Zero-Crossing Functions 2006 Internationa Joint Conference on Neura Networks Sheraton Vancouver Wa Centre Hote, Vancouver, BC, Canada Juy 16-21, 2006 A New Supervised Custering Agorithm Based on Min-Max Moduar Network with Gaussian-Zero-Crossing

More information

Layout Conscious Approach and Bus Architecture Synthesis for Hardware-Software Co-Design of Systems on Chip Optimized for Speed

Layout Conscious Approach and Bus Architecture Synthesis for Hardware-Software Co-Design of Systems on Chip Optimized for Speed Layout Conscious Approach and Bus Architecture Synthesis for Hardware-Software Co-Design of Systems on Chip Optimized for Speed Nattawut Thepayasuwan, Member, IEEE and Aex Doboi, Member, IEEE Abstract

More information

ECE544: Communication Networks-II, Spring Transport Layer Protocols Sumathi Gopal March 31 st 2006

ECE544: Communication Networks-II, Spring Transport Layer Protocols Sumathi Gopal March 31 st 2006 ECE544: Communication Networks-II, Spring 2006 Transport Layer Protocos Sumathi Gopa March 31 st 2006 Lecture Outine Introduction to end-to-end protocos UDP RTP TCP Programming detais 2 End-To-End Protocos

More information