ABSTRACT

PANCHAMUKHI, SHRINIVAS ANAND. Providing Task Isolation via TLB Coloring. (Under the direction of Dr. Frank Mueller.)

The translation lookaside buffer (TLB) improves system performance by caching virtual page to physical frame mappings, but TLBs present a source of unpredictability for real-time systems. Standard heap allocators provide no guarantee on the TLB set that will hold a particular page translation. This unpredictability can lead to TLB misses with a penalty of thousands of cycles and, consequently, to inter-task interference resulting in loose bounds on the worst-case execution time (WCET). In this thesis, we design and implement a new heap allocator that guarantees the TLB set that will hold a particular page translation. The allocator is based on the concept of page coloring: virtual pages are colored such that two pages of different colors cannot map to the same TLB set. Our experimental evaluations confirm the unpredictability associated with standard heap allocation. Using a set of synthetic and standard benchmarks, we show that our allocator provides task isolation for real-time tasks. To the best of our knowledge, such TLB isolation is unprecedented; it increases TLB predictability and can facilitate WCET analysis.

Copyright 2014 by Shrinivas Anand Panchamukhi
All Rights Reserved

Providing Task Isolation via TLB Coloring

by
Shrinivas Anand Panchamukhi

A thesis submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Master of Science

Computer Science

Raleigh, North Carolina
2014

APPROVED BY:

Dr. William Enck        Dr. Alexander Dean

Dr. Frank Mueller
Chair of Advisory Committee

DEDICATION

To my parents and my sister Archana.

BIOGRAPHY

Shrinivas Anand Panchamukhi was born in Dharwad, a small town in the state of Karnataka in India. His family is settled in the state of Goa, India. He did his schooling in Goa and went to the National Institute of Technology (NIT), Rourkela, Orissa, India for his B.Tech in Computer Science. He joined Sapient Consulting as a technology associate and worked there for two years. He came to NC State in Fall 2012 as a Master's student in the Department of Computer Science. He has been working under Dr. Frank Mueller as a Research Assistant since January.

ACKNOWLEDGEMENTS

This work would not have been possible without the collective effort of a lot of people. First and foremost, I would like to thank my advisor Dr. Frank Mueller for showing confidence in me and giving me the opportunity to work on this project. His guidance and feedback put me on the right track from time to time. I am thankful to Dr. William Enck and Dr. Alexander Dean for serving on my committee. Lastly, I would like to thank my labmates in the System Research Lab, Payal Godhani, my family and friends for their support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1  Introduction
Chapter 2  Motivation, Hypothesis and Contribution
  2.1  Motivation
  2.2  Hypothesis
  2.3  Contributions
Chapter 3  Design
Chapter 4  Implementation
  4.1  Notations
  4.2  Data structures
  4.3  Generic algorithms
  4.4  Intel Xeon E-2650
Chapter 5  Experimental framework
Chapter 6  Results
  6.0.5  Predictability
  6.0.6  Identifying noisy DTLB sets
  6.0.7  L1/L2 DTLB miss penalty
  6.0.8  Synthetic benchmarks
  6.0.9  Mälardalen benchmarks
  6.0.10 MiBench benchmarks
Chapter 7  Related Work
Chapter 8  Future Work and Conclusion
References
Appendix
  Introduction
Appendix A  Detailed Experimental Results
  A.1  malloc vs. tlb_malloc experiment
  A.2  Timing results

LIST OF TABLES

Table 6.1  Task characteristics

LIST OF FIGURES

Figure 1.1   Virtual page to physical page translation using TLB
Figure 3.1   TLB coloring - max contiguous allocation = page size
Figure 3.2   TLB coloring - max contiguous allocation = page size × R
Figure 4.1   tlb_malloc internals
Figure 4.2   tlb_free internals
Figure 4.3   Data structures
Figure 6.1   DTLB set utilization
Figure 6.2   Same set vs. diff set using tlb_malloc
Figure 6.3   Jobs vs. execution cycles for T1
Figure 6.4   Malloc: Jobs vs. DTLB misses for T1
Figure 6.5   tlb_malloc: Jobs vs. DTLB misses for T1
Figure 6.6   Malloc: Jobs vs. DTLB misses for T2
Figure 6.7   tlb_malloc: Jobs vs. DTLB misses for T2
Figure 6.8   Jobs vs. execution cycles for T1
Figure 6.9   T1
Figure 6.10  T2
Figure 6.11  T3
Figure 6.12  T4
Figure 6.13  T1 and T2 use fft and adpcm benchmarks, respectively
Figure 6.14  Combination of Mälardalen and MiBench benchmarks
Figure A.1   Malloc: Jobs vs. DTLB misses for T
Figure A.2   tlb_malloc: Jobs vs. DTLB misses for T
Figure A.3   Malloc: Jobs vs. DTLB misses for T
Figure A.4   tlb_malloc: Jobs vs. DTLB misses for T
Figure A.5   Jobs vs. execution cycles for T
Figure A.6   Jobs vs. execution cycles for T
Figure A.7   Jobs vs. execution cycles for T
Figure A.8   Jobs vs. execution cycles for T
Figure A.9   Jobs vs. execution cycles for T
Figure A.10  Jobs vs. execution cycles for T
Figure A.11  Jobs vs. execution cycles for T
Figure A.12  Jobs vs. execution cycles for T

Chapter 1
Introduction

The translation lookaside buffer (TLB) is a hardware cache that sits close to the processor and caches virtual page to physical frame mappings. Today's computer architectures support multiple levels of TLBs. Most processors from Intel [9], AMD [2] and ARM [3] feature a separate instruction TLB (ITLB) and a data TLB (DTLB) at the first level. In some processors [9], the second-level TLB is shared between data and instructions, while in others [2], the DTLB is shared between different cores. Typically, a virtual address is translated to a physical address using a hierarchy of translation tables known as paging structures. Figure 1.1 shows a simple example of how a virtual page is translated to a physical frame using a TLB. In this example, on a TLB miss, the page table is consulted directly; in general, the highest-level paging structure will be consulted.

Figure 1.1: Virtual page to physical page translation using TLB
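
To make the set indexing used throughout this thesis concrete, the following minimal C sketch models how a set-associative TLB selects a set from a virtual address. The 4 KB page size and the assumption that the low-order bits of the virtual page number index the set are ours here; Chapter 4 describes how the indexing bits are determined empirically.

#include <stdint.h>
#include <stdio.h>

/* Model of set selection in a set-associative TLB (a sketch, not code
 * from this thesis). Assumes 4 KB pages and that the low-order bits of
 * the virtual page number (VPN) select the set. */
#define PAGE_SHIFT 12   /* 4 KB pages */
#define DTLB_SETS  16   /* e.g., 64 entries, 4-way set associative */

static unsigned dtlb_set(uintptr_t vaddr)
{
    uintptr_t vpn = vaddr >> PAGE_SHIFT;   /* strip the page offset */
    return (unsigned)(vpn % DTLB_SETS);    /* low VPN bits pick the set */
}

int main(void)
{
    /* Pages 0, 16 and 32 collide in set 0 under this model. */
    for (unsigned p = 0; p <= 32; p += 16)
        printf("page %2u -> set %u\n", p,
               dtlb_set((uintptr_t)p << PAGE_SHIFT));
    return 0;
}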

Static timing analysis tools [19] analyze the source code of a task and compute the worst-case execution time (WCET) of the task. Mueller [13] (among others) has proposed abstract cache states to analyze instruction caches to achieve tight bounds on the WCET. White et al. [18], Ramaprasad et al. [15] [16] and Ferdinand et al. [7] have proposed bounding the WCET for systems with data caches. To the best of our knowledge, TLBs have not been considered in the context of achieving tight bounds on the WCET of a task.

In this thesis, we describe the design and implementation of a new heap allocator that we refer to as tlb_malloc. This allocator utilizes the concept of page coloring. We assign colors to virtual pages in such a way that virtual pages with different colors will not map to the same DTLB set. When tasks dynamically allocate memory using tlb_malloc, in addition to specifying the size of the allocation, they also specify the color of the memory region. By ensuring that each task allocates memory regions of a unique color, we can provide isolation between tasks and also enable static timing analysis to potentially calculate significantly tighter bounds on the WCET. Our experimental evaluations reveal the unpredictability associated with standard heap allocation. Using a set of synthetic and standard benchmarks, we show that tasks can be isolated from each other, i.e., no inter-task DTLB misses are incurred.

Past research has utilized page coloring to guarantee task isolation with respect to caches and DRAM. Ward et al. [17] have proposed coloring physical frames such that two frames with different colors will not cause last-level cache conflicts. Yun et al. [20] have proposed a DRAM bank-aware memory allocator, which ensures that concurrently running applications on different cores do not access memory that maps to the same DRAM bank.

The rest of the thesis is organized as follows: Chapter 2 describes the motivation, hypothesis and our contributions. A generic design of our allocator is presented in Chapter 3. Chapter 4 describes the implementation details of tlb_malloc. Chapter 5 discusses the experimental framework. Chapter 6 presents the results for both synthetic and standard benchmarks. Chapter 7 discusses related work. Chapter 8 summarizes the contributions and discusses open problems.
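
As a preview, the intended usage pattern looks like the following sketch. The exact prototypes are hypothetical here (the thesis describes the interface informally in Chapter 4), but the key point is the extra color argument:

/* Hypothetical usage sketch; the color argument is the only addition
 * over the standard heap API. Each task sticks to its own color, so the
 * two tasks' translations land in disjoint DTLB sets. */
void *a = tlb_malloc(4096, /* color = */ 1);  /* used only by task T1 */
void *b = tlb_malloc(4096, /* color = */ 2);  /* used only by task T2 */
/* ... real-time work ... */
tlb_free(a, 1);
tlb_free(b, 2);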

Chapter 2
Motivation, Hypothesis and Contribution

2.1 Motivation

TLBs improve the performance of the system by speeding up page translations, but they are also a source of unpredictability for real-time systems. Heap allocation is generally avoided in real-time systems due to the following unpredictability issues. First, the WCET of the standard heap allocation API is too high or is unbounded [12]. Second, standard heap allocation APIs do not guarantee the DTLB set in which the virtual page to physical frame mapping will be placed. Our focus in this thesis is on the latter point.

This unpredictability poses three problems. First, it makes it difficult for static timing analysis to determine whether a memory access will hit or miss in the DTLB. Hence, for calculating the WCET of a task, static timing analysis would take a pessimistic approach and calculate the WCET assuming all memory references to be misses. This pessimistic WCET leads to inefficient processor utilization. Second, heap allocation may utilize the DTLB sets in a non-uniform manner. For example, consider two tasks, T1 and T2, and a system with a 2-way set associative DTLB, which we further assume to be empty. Let each task request two virtual pages from the standard heap allocator. The heap allocator may allocate virtual pages in such a way that all four pages map to the same DTLB set, even though the initially empty DTLB had sufficient space to accommodate all four mappings without conflicts. Third, since T1 and T2 conflict in the DTLB, they might evict each other's mappings in the DTLB repeatedly. If tasks T1 and T2 are hard real-time (HRT) and soft real-time (SRT) tasks, respectively, the HRT task will suffer interference from the SRT task. This interference could result in the HRT task missing its deadline unless the DTLB were explicitly modelled.

But even if the DTLB were modelled, it might result in a loose WCET bound due to overestimations of DTLB misses.

2.2 Hypothesis

We hypothesize that the allocation of memory can be controlled in software so that real-time tasks will not interfere with one another in terms of DTLB conflicts.

2.3 Contributions

Our contributions in this thesis are:
1. We design and implement a new heap allocator that provides guarantees on the DTLB sets that hold a particular virtual page to physical frame mapping.
2. We devise experiments to assess which bits of the virtual address determine the DTLB set.
3. We conduct experimental evaluations of the heap allocator to demonstrate task isolation.

Chapter 3
Design

To describe our heap allocator, we start with a basic design and then incrementally add complexities and design details. Consider an N-way set associative DTLB supporting N × M entries (M sets) and a virtual address space with N × M pages, as shown in Figure 3.1. The DTLB handles translations for virtual pages of size page_size.

Figure 3.1: TLB coloring - max contiguous allocation = page size

Let us assume that page 0 maps to set 0, as shown in Figure 3.1. The translation for page 0 could be stored in any one of the N ways. For simplicity, let us assume that the DTLB is initially empty and that we fill the entries from left to right in each set. Further, assume that page 1 maps to set 1, page 2 to set 2 and so on until page M − 1. Pages M to 2M − 1 wrap around, i.e., page M maps to set 0, page M + 1 to set 1, etc.

We color pages 0, M, 2M, ..., M(N − 1) with the same color (red in this example) because all of them map to the same DTLB set. Similarly, pages 1, M + 1, 2M + 1, ..., M(N − 1) + 1 are colored blue, and so on. We can see that no two virtual pages of different colors can map to the same DTLB set. Each DTLB entry holds a translation for a virtual page of size page_size. Since each DTLB set is given one color, the maximum contiguous virtual address space one can allocate of a particular color is given by the page size.

Let us now consider that a task needs to allocate more than page_size bytes of contiguous memory in the virtual address space. This typically results from allocating arrays that span multiple pages. In our basic design, such an allocation would span two pages of different colors, assuming that the allocation is aligned to a page boundary. In order for tasks to be able to allocate contiguous virtual memory of a particular color larger than page_size, we reserve a certain number of DTLB sets for contiguous memory allocation. Consider the same example as in Figure 3.1 but with R sets reserved for contiguous memory allocation. Figure 3.2 shows an example of DTLB coloring that supports allocations greater than page_size for a particular color.

Figure 3.2: TLB coloring - max contiguous allocation = page size × R

Since we have R sets reserved for contiguous allocations, the maximum amount of contiguous memory that can be allocated of a particular color is R × page_size. Pages 0 to R − 1 are colored red. Starting from page R until page M − 1, the coloring is performed as described for the basic design. But pages M to M + (R − 1) are colored blue, again to support contiguous memory allocation. All other virtual pages are colored in a similar manner.
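
The following C sketch encodes our reading of the coloring rule of Figure 3.2; it is illustrative only, with the stripe-per-color assignment for the reserved region inferred from the description above.

/* Sketch of the coloring rule of Figure 3.2 (our reading of the text,
 * not code from the thesis). A page whose set index falls in the R
 * reserved sets belongs to a "contiguous" color, one per stripe of M
 * pages; all other pages get one color per remaining set. */
struct page_color { int contiguous; unsigned color; };

static struct page_color color_of(unsigned page, unsigned M, unsigned R)
{
    unsigned set = page % M;      /* DTLB set this page maps to      */
    struct page_color c;
    if (set < R) {                /* reserved region: pages 0..R-1   */
        c.contiguous = 1;         /* are red, pages M..M+R-1 blue,   */
        c.color = page / M;       /* ... one color per stripe        */
    } else {
        c.contiguous = 0;
        c.color = set - R;        /* ordinary colors 0..M-R-1        */
    }
    return c;
}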

Contemporary operating systems support the concept of huge pages, whose size varies from 2 MB to 1 GB depending on the configuration and the system in place. Computer architectures have a separate DTLB structure to support huge page translations. Our heap allocator handles huge page allocation in the same manner as described above.

We now generalize the methods for DTLB coloring described so far. Let M be the total number of DTLB sets, R be the number of sets reserved for contiguous allocations and N be the associativity of the DTLB. Let page_size denote either the normal page size or the huge page size of the system.

Corollary 1. The number of colors for allocations greater than page_size is equal to N.

Proof. Follows from the definition of N.

Let max_alloc_size refer to the maximum contiguous allocation size that can be served by our allocator. Let num_colors_upto_page_size refer to the number of colors available for allocations not exceeding page_size. Then,

    max_alloc_size = page_size × R,                (3.1)
    num_colors_upto_page_size = M − R.             (3.2)

The number of instances of a particular color is constrained by the associativity of the DTLB.
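
As a worked instance of Corollary 1 and Equations 3.1 and 3.2, consider a 4 KB-page DTLB with 64 entries and 4 ways, so M = 16; the choice R = 4 below is an assumption, since R is a tunable input:

#include <stdio.h>

/* Worked example of Eqs. 3.1/3.2 for a 64-entry, 4-way DTLB (M = 16
 * sets); R = 4 is an assumed configuration. */
int main(void)
{
    unsigned page_size = 4096, M = 16, R = 4, N = 4;

    printf("max_alloc_size            = %u\n", page_size * R); /* 16384, Eq. 3.1 */
    printf("num_colors_upto_page_size = %u\n", M - R);         /* 12,    Eq. 3.2 */
    printf("num_colors_above          = %u\n", N);             /* Corollary 1    */
    return 0;
}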

Chapter 4
Implementation

In this chapter, we describe the notation and the data structures needed to implement our heap allocator. We then present generic algorithms, which can be implemented on any architecture. Finally, we present a specific instance of the generic algorithms that we have implemented on the Intel Xeon E-2650.

4.1 Notations

We refer to the routine that initializes our heap allocator as tlb_malloc_init, to the heap allocator as tlb_malloc and to the deallocator as tlb_free. These three routines are exposed as library functions to user space applications. Compared to the standard heap allocation API, tlb_malloc and tlb_free take an additional parameter, color. This parameter indicates the color of the memory region to be allocated. For each DTLB to be colored, tlb_malloc_init sets aside a virtual address space of page_size × dtlb_sets bytes. Additional memory may be needed to handle page boundary alignment. Our heap allocator serves allocations from this virtual address space pool that is set aside.

In this paragraph, we give a generic description of the internals of tlb_malloc and tlb_free, applicable to all the DTLBs to be colored. Depending on the number of bytes requested, tlb_malloc will call one of the functions shown in Figure 4.1. LEN_BYTES refers to the number of bytes the allocator uses to store the size of the allocation. Similarly, tlb_free will call one of the functions shown in Figure 4.2, depending on the size of the allocation referenced by ptr.

Figure 4.1: tlb_malloc internals
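
The thesis does not spell out exact C prototypes; the following hypothetical declarations are consistent with the description above:

#include <stddef.h>

/* Hypothetical prototypes consistent with the description above.
 * tlb_malloc_init receives R, the number of reserved sets (see Section
 * 4.4); the allocator and deallocator take the extra color argument. */
int   tlb_malloc_init(unsigned reserved_sets /* R */);
void *tlb_malloc(size_t size, unsigned color);
void  tlb_free(void *ptr, unsigned color);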

Figure 4.2: tlb_free internals

4.2 Data structures

In this section, we introduce the data structures for our heap allocator. The description of the data structures is as generic as possible and is not tied to any specific language or implementation. The following code snippet shows the structure representing a chunk of free memory, which we refer to as a free block.

struct mblock {
    unsigned int length;
    struct mblock *next;
};

In addition to the mblock structure, we need a free list per color to keep track of the available free memory. Figure 4.3 shows a generic representation of the free list. Color 0 pointer, Color 1 pointer, ..., Color n pointer are the base pointers, each pointing to the first available free block of that particular color. Each free block in turn points to the next available free block of the same color. We need two such free lists, one for small allocations and the other for large allocations, for each DTLB to be colored.

4.3 Generic algorithms

Algorithm 1 shows the pseudo code for our heap allocator. Each of the functions shown in Figure 4.1 invokes this algorithm. The parameter type is used to identify which specific function is invoked. Line 3 sets free_list to the appropriate list depending on type and color. The function on line 5 is responsible for walking through the free list to find a suitable memory block and returns the starting address of the memory block. It is also responsible for storing the allocation size in the LEN_BYTES preceding the returned starting address.

Algorithm 2 shows the pseudo code for our heap deallocator. Since we store the length of the allocation in the LEN_BYTES preceding ptr, lines 2 and 3 obtain the length of the allocation and the base address of the block referenced by ptr. The function called in line 5 adds the memory block back to the appropriate free list.

Figure 4.3: Data structures

Algorithm 3 shows how to ascertain the bits in the virtual address that determine the DTLB set. The algorithm relies on the performance monitoring library PAPI [6]. Let no_of_entries_in_dtlb be the number of entries supported by the DTLB, dtlb_assoc be the associativity of the DTLB and page_size denote the page size of the system. The algorithm takes an array page_numbers as a parameter. Each element of the array indicates a page number to be accessed. Pages are numbered starting from zero. Line 5 allocates a memory region large enough to contain no_of_entries_in_dtlb + 1 pages. In this algorithm, data is of type integer, but any other data type could also be used. The conflict_page value (line 6) is used to index into data. This value represents the index of the first element in the conflicting page.

To explain the main idea of the algorithm, we consider the following example. On the Intel Xeon E-2650, the level-one DTLB is 4-way set associative, supporting 64 entries, and handles 4 kB translations. The conflict_page value is 65536 in this case (64 pages of 1024 integers each). We keep the conflicting page constant and vary the values in page_numbers over all possible permutations of 64 pages. Lines 9-12 access the pages as indicated by the page_numbers array. In this algorithm, we perform a write operation, but a read operation could also be used to trigger DTLB accesses. By reading the PAPI counters on lines 7 and 16, we determine the combinations of 5 pages that result in a DTLB miss on every memory access in the outer loop (assuming an LRU / PLRU replacement policy). Thus, knowing the sets of pages that conflict in the DTLB, we can examine the virtual addresses of these five pages to determine the common bits after stripping the page offset bits from the virtual address. After every memory access in the outer and inner loop, we need a memory fence instruction to ensure that the memory references are issued in order and are recorded by the performance monitoring registers.
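
On x86, the fences on lines 11 and 14 of Algorithm 3 could be issued from C as follows; this is a sketch, since the thesis only says "memory fence instruction":

/* x86 memory fence usable for lines 11 and 14 of Algorithm 3 (a
 * sketch; any serializing fence that orders the accesses would do). */
static inline void memory_fence(void)
{
    __asm__ __volatile__("mfence" ::: "memory");
}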

Algorithm 1 Heap allocator
 1: function allocate_memory(type, size, color)
 2:     mem_ptr = NULL
 3:     free_list = get_free_list(type, color)
 4:     if size != 0 && free_list != NULL then
 5:         mem_ptr = get_free_block(free_list, size)
 6:     end if
 7:     return mem_ptr
 8: end function

Algorithm 2 Heap deallocator
 1: function free_memory(type, ptr, color)
 2:     length = get_allocation_length(ptr)
 3:     base_ptr = get_base_ptr(ptr)
 4:     free_list = get_free_list(type, color)
 5:     add_block_to_free_list(length, free_list, base_ptr)
 6: end function

4.4 Intel Xeon E-2650

The Xeon E-2650 has a 4-way set associative DTLB supporting 64 entries at level one. From Corollary 1, we know that the value of N is 4. The parameter R is configurable and is passed as an input to tlb_malloc_init. The values for Equations 3.1 and 3.2 are determined at run time. In our implementation, the mblock structure is the same as described above. For the free list per color, we maintain an array of pointers. Each index in the array refers to a particular color, and the value at that index points to the first available block of free memory of that particular color.
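
A minimal C sketch of this per-color array and of a first-fit walk corresponding to get_free_block (line 5 of Algorithm 1) follows. It is our illustration, not the thesis' code; NUM_SMALL_COLORS assumes R = 4 (so M − R = 12), and block splitting is omitted for brevity.

#include <stddef.h>

#define NUM_SMALL_COLORS 12            /* M - R, assumed R = 4   */
#define NUM_LARGE_COLORS 4             /* N, the associativity   */
#define LEN_BYTES sizeof(unsigned int) /* size header per block  */

struct mblock { unsigned int length; struct mblock *next; };

static struct mblock *small_list[NUM_SMALL_COLORS]; /* <= page_size */
static struct mblock *large_list[NUM_LARGE_COLORS]; /* contiguous   */

/* First-fit walk: unlink the first block large enough, record the
 * allocation size in the LEN_BYTES header and return the address just
 * past the header (splitting of oversized blocks omitted). */
static void *get_free_block(struct mblock **list, unsigned int size)
{
    for (struct mblock **p = list; *p != NULL; p = &(*p)->next) {
        if ((*p)->length >= size + LEN_BYTES) {
            struct mblock *b = *p;
            *p = b->next;                /* unlink the block  */
            *(unsigned int *)b = size;   /* record the length */
            return (char *)b + LEN_BYTES;
        }
    }
    return NULL;
}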

Algorithm 3 Find address bits which determine DTLB set
 1: function find_bits(page_numbers[dtlb_assoc])
 2:     n = 1000
 3:     no_of_entries_in_dtlb = no_of_dtlb_sets × dtlb_assoc
 4:     int_elements = page_size / sizeof(int)
 5:     int *data = malloc((no_of_entries_in_dtlb + 1) × page_size)
 6:     conflict_page = no_of_entries_in_dtlb × int_elements
 7:     PAPI_read()
 8:     for i = 0 to n do
 9:         for j = 0 to dtlb_assoc − 1 do
10:             data[page_numbers[j] × int_elements] = 2
11:             asm(memory fence instruction)
12:         end for
13:         data[conflict_page] = 2
14:         asm(memory fence instruction)
15:     end for
16:     PAPI_read()
17: end function

Chapter 5
Experimental framework

In this chapter, we describe the experimental framework used to evaluate our new heap allocator. Algorithm 4 shows the pseudo code for our experimental setup. It implements periodic releases of a task set without any special support from the operating system or the hardware. start_time is a per-task variable representing the time instant at which a task is first released. Function t1_code represents the main body of a task. We use this experimental setup to evaluate both synthetic and standard benchmarks.

Algorithm 4 Pseudo code for our experimental setup
 1: struct timeval start_time_t1
 2: function main
 3:     gettimeofday(&start_time_t1)
 4:     plustime(&start_time_t1, 1)            // add one second
 5:     pthread_create(&t1, t1_code, ...)
 6:     ...
 7:     ...
 8: end function
 9: function t1_code
10:     struct timeval wait, now, release
11:     release = start_time_t1
12:     t1_init()
13:     loop
14:         gettimeofday(&now)
15:         minustime(&wait, &release, &now)   // wait = release − now
16:         nanosleep(&wait, &wait)
17:         t1_job()
18:         plus_time(&release, &period)
19:     end loop
20: end function

We conducted experiments on an Intel Xeon E-2650 processor running Linux. The system has 16 cores, each of which can run at 2 GHz. The level-one DTLB is 4-way set associative, supporting 64 entries. The level-two TLB supports 512 entries and is 4-way set associative; it is shared between instructions and data. Both of these TLBs handle 4 kB page translations. For all experiments, we use rate-monotonic scheduling of a set of tasks. On Linux, this can be achieved by setting the task priorities when creating the tasks. We use the SCHED_FIFO real-time scheduling policy and bind all tasks to a particular core to avoid task migration.
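
A sketch of this scheduling setup follows (not the thesis' code; the priority value and the core number are placeholders):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* SCHED_FIFO priority assigned via thread attributes (pass the attr to
 * pthread_create), and all tasks pinned to one core. */
static void make_rt_attr(pthread_attr_t *attr, int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    pthread_attr_init(attr);
    pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    pthread_attr_setschedparam(attr, &sp);
}

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}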

Chapter 6
Results

In this chapter, we first assess the predictability of malloc and tlb_malloc. Then, we present our approach to identifying the DTLB sets that are subject to OS noise. Further, we estimate the L1 and L2 DTLB miss penalties. We then evaluate the performance of our allocator with a set of synthetic benchmarks and standard benchmarks.

6.0.5 Predictability

In this experiment, we create two tasks, T1 and T2. Each task allocates 32 pages. For this pool of 64 pages, the number of mappings to each DTLB set is recorded. We then compute the maximum and average mappings per DTLB set over 5 runs. Figure 6.1 shows the graph of the DTLB set versus the number of mappings. malloc_max represents the maximum number of mappings that was observed when both tasks used malloc to allocate their pages. Similarly, malloc_avg represents the average number of mappings. tlb_malloc_max_avg represents both the maximum and average number of mappings when both tasks used tlb_malloc to allocate their pages; since the values are identical, we use one line to represent both cases.

The level-one DTLB is 4-way set associative and supports 64 entries. The best possible DTLB set utilization would be the one in which each DTLB set has four mappings. In Figure 6.1, malloc_max shows that DTLB sets 0-5, 12, 14 and 15 have four mappings, while DTLB sets 6-11 and 13 have five mappings. malloc_avg shows that the average number of mappings per DTLB set varies from 3.7 to 4.5. On the other hand, tlb_malloc_max_avg shows each DTLB set to have exactly four mappings. From Figure 6.1, we can conclude that tlb_malloc gives us more predictability than malloc. Our allocator is predictable in the sense that it guarantees the DTLB set that will hold a particular page translation. Thus, if tasks T1 and T2 use malloc to allocate pages, they might interfere with each other in the DTLB. If, in contrast, T1 and T2 use tlb_malloc, such inter-task interference can be avoided.
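
The measurement underlying Figure 6.1 can be approximated by a short probe like the following. This is a sketch under our assumptions (4 KB pages, 16 sets indexed by the low VPN bits); the thesis records actual DTLB mappings, which this approximates from virtual addresses:

#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate 64 page-aligned pages with the standard allocator and count
 * how many fall into each DTLB set. */
int main(void)
{
    enum { PAGES = 64, SETS = 16, PAGE = 4096 };
    int hist[SETS] = { 0 };

    for (int i = 0; i < PAGES; i++) {
        void *p = NULL;
        if (posix_memalign(&p, PAGE, PAGE) != 0)
            return 1;
        hist[((uintptr_t)p >> 12) % SETS]++;   /* set of this page */
    }
    for (int s = 0; s < SETS; s++)
        printf("set %2d: %d mappings\n", s, hist[s]);
    return 0;
}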

Figure 6.1: DTLB set utilization

6.0.6 Identifying noisy DTLB sets

So far, we have assumed that the entire DTLB is available to a set of tasks. But when the tasks run, they might invoke certain library functions. These library calls and OS activities, such as scheduling, interrupts, multi-level page table walks etc., make use of the DTLB. We call this OS noise. Hence, only a certain number of ways of a DTLB set can be used by the tasks. The following describes our methodology to identify the number of ways that can be used by tasks beyond those already occupied due to OS noise.

tlb_malloc gives us control over the DTLB set that will hold a particular page translation. Hence, we use tlb_malloc to identify the number of ways of a DTLB set that are available for our experiments. We turn off address space randomization so that the stacks of the tasks always have the same base address across multiple runs. We start with two tasks, where each task allocates one page of a particular color. Each task has a warm-up phase and a repeated access phase. When the tasks run, we should not observe any misses apart from those in the warm-up phase. However, if one or both tasks incur misses, then we change the color of the allocation until neither task incurs any misses. Further, the number of pages accessed by each task can be increased. Using this methodology, we can systematically distinguish DTLB sets subject to OS noise from clean ones available for a task set. We use this methodology for experiments with tlb_malloc to ensure that we do not use any noisy DTLB sets. For experiments using malloc, we cannot perform such identification because malloc does not give us control over the DTLB set that will hold a particular mapping.

6.0.7 L1/L2 DTLB miss penalty

In this experiment, we determine the DTLB miss penalties for the Intel Xeon E-2650. We measure execution cycles using the time stamp counter register. To calculate the L1 DTLB miss penalty, we map five pages to the same DTLB set using our allocator. We perform a warm-up by accessing the pages once. We then access these five pages in a cyclic manner similar to the one shown in lines 8-14 of Algorithm 3. This results in an L1 DTLB miss on every access. The average latency is computed by dividing the execution cycles by the number of L1 DTLB misses. This latency also factors in the cycles required for memory fence instructions and the overhead of the memory reference itself. To factor out this overhead via dual loop timing [1], we run another loop in which we access elements of an array in the same manner as we access the five pages. Subtracting this overhead from the calculated average latency closely approximates the L1 DTLB miss penalty. Averaging over 10 runs, we determined the L1 DTLB miss penalty to be about 7.5 cycles; the standard deviation of the results was about …

To determine the L2 DTLB miss penalty, we use the same idea as above but map pages to different DTLB sets. We measure the latency of the warm-up loop and compute the L2 DTLB miss penalty in a similar manner as described above. Averaging over 10 runs, we found the L2 DTLB miss penalty to be about 1731 cycles, with a standard deviation of …

6.0.8 Synthetic benchmarks

In the synthetic benchmarks, each task allocates and accesses a certain number of pages. During initialization (line 12 of Algorithm 4), each task accesses its set of pages for the first time. We refer to this as the warm-up phase. When a job of either task is released, the job accesses its set of pages repeatedly. We refer to this as the repeated access phase. The Intel Xeon has a 4-way set associative level-one DTLB. The worst-case scenario occurs when the combined accesses of both tasks cover more than 4 pages and these pages map to the same DTLB set. The best-case scenario occurs when the pages of each task map to different DTLB sets. We discuss two variants of this experiment.

Same set vs. different set using tlb_malloc. We use the following task set, where each task is denoted as (phase, period, execution): T1 (1ms, 2ms, 0.392ms), T2 (0ms, 16ms, 7.88ms). This task set ensures that the lower-priority task suffers more than one preemption. In this experiment, we compare the number of DTLB misses that each task incurs in the best-case and the worst-case scenario. For the worst case, both tasks request four pages each from tlb_malloc such that all pages are of the same color. This scenario is termed same set. Conversely, for the best case, T1 allocates four pages of color 1 and T2 allocates four pages of color 2. This scenario is termed different set.
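
The structure of each synthetic task can be sketched as follows; this is our illustration (using the hypothetical tlb_malloc prototype from Chapter 4), with the color and page count matching the same set variant above:

#include <stddef.h>

/* Sketch of a synthetic task body (not the thesis' code). t1_init and
 * t1_job plug into Algorithm 4. */
extern void *tlb_malloc(size_t size, unsigned color);  /* hypothetical */

#define NPAGES 4
#define PAGE   4096
static int *page_of[NPAGES];

static void t1_init(void)                 /* warm-up phase */
{
    for (int i = 0; i < NPAGES; i++) {
        page_of[i] = tlb_malloc(PAGE, /* color = */ 1);
        page_of[i][0] = 0;                /* first touch loads the DTLB */
    }
}

static void t1_job(void)                  /* repeated access phase */
{
    for (int i = 0; i < NPAGES; i++)
        page_of[i][0]++;                  /* one access per page */
}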

Figure 6.2 depicts the results for this experiment over two hyperperiods. The x axis represents time in milliseconds and the y axis represents the number of DTLB misses. The graphs for T1 and T2 are shown in the top half and bottom half of the figure, respectively. Each data point represents a job: the x axis shows the time at which the job is released and the y axis shows the number of DTLB misses for that job. The time instant before 0ms represents the warm-up phase; the corresponding value on the y axis gives the number of DTLB misses for the warm-up phase. We refer to this as the 0th job of a task.

Figure 6.2: Same set vs. diff set using tlb_malloc

The misses for the worst-case scenario are depicted as T1 same set and T2 same set. Jobs 1-5 (released at times 1ms, 3ms, 5ms, 7ms, 9ms) of T1 incur four misses each because all these jobs preempt T2 and need to reload their pages into the DTLB. Job 6 of T1 incurs four misses because this job is released after T2 completes its execution and also needs to reload its pages into the DTLB. Jobs 7 and 8 of T1 have zero misses because, when these jobs are released, T2 has finished its execution and, hence, these jobs do not suffer any interference. This behaviour repeats in each hyperperiod. Every job of T2 is preempted five times; hence, each job will incur 20 misses. The job in hyperperiod 1 has 20 misses, while the jobs in other hyperperiods have 24 misses. This is because the job in hyperperiod 1 begins execution right after the warm-up, whereas jobs in other hyperperiods first need to load their pages into the DTLB because the last job of T1 in the previous hyperperiod replaced T2's mappings in the DTLB.

The misses for the best case are represented by T1 diff set and T2 diff set. In the first hyperperiod, even though jobs 1-5 of T1 preempt T2, neither T1 nor T2 incurs any misses because the two tasks map to different DTLB sets. The misses observed are due to the warm-up phase. From Figure 6.2, we can conclude that if tasks use different DTLB sets, then they can be isolated from each other and will not be subject to inter-task interference.

Figure 6.3 depicts the number of cycles needed by each job of T1 over five hyperperiods. same set represents the run in which both tasks share a DTLB set; diff set represents the run in which the two tasks do not share DTLB sets. For both runs, the jobs in hyperperiod one seem to require a larger number of cycles to execute compared to the jobs in other hyperperiods. We attribute this to warm-up effects in the architecture, such as the instruction caches, data caches, branch prediction etc. The execution cycles for jobs in the second hyperperiod of the same set run vary depending upon the interference from T2. Jobs 9-13 preempt T2. Since these jobs of T1 have to reload their pages, they incur an extra cost in terms of the L1 DTLB miss penalty. Job 14 executes just after T2 completes its execution, also needs to reload its pages and, hence, incurs additional execution cycles. Jobs 15 and 16 are not subject to any interference from T2 and, hence, require fewer cycles to complete their execution. diff set shows the execution cycles for jobs of T1 in the second hyperperiod to be consistent at about … cycles. On average, the difference between the execution cycles of jobs 9-13 in same set and diff set is about 30 cycles, which is roughly the L1 DTLB miss penalty for 4 pages. This behaviour repeats in subsequent hyperperiods. From Figure 6.3, we further confirm that tasks can be isolated from each other if they use different DTLB sets.

The execution cycles for jobs of T2 also include the cost of preemption along with the cycles needed by the OS scheduler. Since we do not use a real-time OS, the scheduling activities are not bounded and may take a variable amount of time. Since the measurements of execution cycles for jobs of T2 are subject to this noise, it is difficult to attribute variations in the execution cycles directly to DTLB interference. Hence, we do not discuss the execution cycles for T2.

Figure 6.3: Jobs vs. execution cycles for T1

malloc vs. tlb_malloc. We use the following task set: T1 (1ms, 2ms, 0.385ms), T2 (0ms, 16ms, 7.38ms). In this experiment, each task initially allocates 32 pages. Each task then randomly chooses 14 pages from its pool of 32 pages. The tasks then perform a write operation on these 14 pages. Each task has the warm-up phase and repeated access phase described previously. For both tasks, we compare the number of DTLB misses when the tasks use malloc versus tlb_malloc to allocate their pages. For Figures 6.4 through 6.7, the axes and data points have the same meaning as described for Figure 6.2.

Figure 6.4: Malloc: Jobs vs. DTLB misses for T1

Figure 6.5: tlb_malloc: Jobs vs. DTLB misses for T1

Figure 6.4 shows the misses incurred by T1 when it uses malloc. The graph shows the misses over 10 runs and also their average. The number of DTLB misses varies with each run depending upon the interference experienced by T1. In hyperperiod one, on average, we observe that jobs 1 and 2 of T1 have about 5 misses, jobs 3-5 have about 2 misses, jobs 6 and 8 have about 3.5 misses and job 7 has about 6 misses. Similar values are observed for hyperperiod two. Figure 6.5 shows the misses incurred by T1 when it uses tlb_malloc. The number of misses for all 10 runs is identical and, hence, we use a single line to represent all 10 runs and the average. In Figure 6.5, the observed misses are due to the warm-up phase. Subsequent jobs of T1 do not incur any DTLB misses at all.

Figure 6.6: Malloc: Jobs vs. DTLB misses for T2

Figure 6.7: tlb_malloc: Jobs vs. DTLB misses for T2

Figures 6.6 and 6.7 show the results for T2 when it uses malloc and tlb_malloc, respectively. The results are similar to those for T1. In Figure 6.6, the number of misses for malloc varies with each run; the average number of misses is about 15. In Figure 6.7, we again use a single line to represent the misses for all 10 runs and the average for tlb_malloc, since the values are identical. The misses observed are due to the warm-up phase, while subsequent jobs do not incur any misses. From Figures 6.4 through 6.7, we can conclude that tlb_malloc does provide task isolation with respect to the DTLB. These figures also show that the DTLB misses are predictable when the tasks use tlb_malloc.

Figure 6.8 depicts the execution cycles for each job of T1 over five hyperperiods. We slightly modify the experiment described above: instead of accessing the first four bytes of each page, we access bytes such that the memory accessed maps to different L1 data cache sets. By this modification, we ensure that we do not have conflicts in the L1 data cache.

Figure 6.8: Jobs vs. execution cycles for T1

In Figure 6.8, malloc cycles and tlb_malloc cycles represent the number of cycles needed by each job of T1 when the tasks use malloc and tlb_malloc, respectively. malloc_misses represents the number of misses incurred by each job of T1 when the tasks use malloc; the number of misses is zero when the tasks use tlb_malloc. The x axis represents the jobs of T1, the primary y axis on the left represents the cycles and the secondary y axis on the right represents the number of DTLB misses. When the tasks use malloc, each job executes for a varying number of cycles depending on the DTLB interference incurred. But if the tasks use tlb_malloc, we observe that the execution cycles for each job are almost constant. For both malloc and tlb_malloc, the first job seems to take more execution cycles than the other jobs. This is most likely due to warm-up effects in the architecture, such as the instruction caches, branch prediction etc. Figure 6.8 further confirms that our allocator does provide task isolation. For similar reasons as mentioned in Section 6.0.8, we do not measure the execution cycles for T2.

6.0.9 Mälardalen benchmarks

For the experiments described in this section, we use benchmarks from the Mälardalen suite to show task isolation. The experimental setup is similar to that described in Section 6.0.8. We modified the benchmarks so that they use heap allocated regions instead of statically allocated ones. Table 6.1 shows the characteristics of the tasks for the various experiments. Phase, period and execution time are in milliseconds. Column 1 shows the number of tasks in an experiment. The tasks are listed in decreasing order of priority. In an experiment of j tasks, each task allocates 64/j pages. The number of pages accessed by each task is shown in column 6 of Table 6.1. The loop bounds of the repeated access phase are varied to select an appropriate execution time for a task. We use the bubble sort, insertion sort, nth largest and statistics benchmarks for our experiments. We believe that these benchmarks represent basic functionalities of a real-time task. Furthermore, we are constrained to small benchmarks since we only control DTLB entries that are not subject to OS noise under Linux.

Table 6.1: Task characteristics
# tasks  Task  Phase  Period  Execution  Pages
2        T1    …      …       …          …
         T2    …      …       …          …
3        T1    …      …       …          …
         T2    …      …       …          …
         T3    …      …       …          …
4        T1    …      …       …          …
         T2    …      …       …          …
         T3    …      …       …          …
         T4    …      …       …          …

The experiments are designed to show that our technique is applicable to a wide variety of scenarios. For two tasks, the periods are harmonic and the execution times are a combination of long and short runs. The numbers of pages accessed by the two tasks are equal. The lower-priority task T2 suffers more than one preemption. For three tasks, the periods are non-harmonic and there are simultaneous releases of tasks. The number of pages accessed by each task differs. Figures 6.9 through 6.12 show the average number of misses for each task across all jobs except the 0th (warm-up) job.

The x-axis denotes the total number of tasks in an experiment and the y-axis represents the average number of DTLB misses. malloc_avg and tlb_malloc_avg are the average results over 5 hyperperiods computed over 5 runs when the tasks use malloc and tlb_malloc, respectively, to allocate their pages. For all of the following graphs, tlb_malloc_avg has a value of zero; the y-axis starts at a negative value to make tlb_malloc_avg visible, and the DTLB misses for malloc_avg are indicated by the positive values on the y-axis.

Figure 6.9: T1

Figure 6.10: T2

In the experiment with 2 tasks (Figures 6.9 and 6.10), if the tasks use malloc, we observe the number of DTLB misses to be about 13 and 130 for T1 and T2, respectively. If the tasks use tlb_malloc, both T1 and T2 incur no misses at all. The standard deviations of the measurements for T1 and T2 are about 1.47 and 2.7, respectively, when the tasks use malloc. The standard deviation is zero for the two tasks when they use tlb_malloc.

Figure 6.11: T3

Figure 6.12: T4

For the experiment with 3 tasks (Figures 6.9 through 6.11), we can see that if the tasks use malloc, the numbers of DTLB misses for T1, T2 and T3 are about 6, 32 and 20, respectively. But if the tasks use tlb_malloc, then all three tasks incur zero misses. The standard deviations of the results for T1, T2 and T3 are about 1.1, 5.5 and 2.5, respectively, when the three tasks use malloc. When the three tasks use tlb_malloc, the standard deviation is zero. For the experiment with 4 tasks (Figures 6.9 through 6.12), the trend is similar. If the tasks use malloc, the numbers of misses incurred are around 1.7, 6.8, 2.5 and 6.7 on average for T1, T2, T3 and T4, respectively. If the four tasks use tlb_malloc, then each task incurs zero misses. The standard deviations of the results for T1, T2, T3 and T4 are about 0.5, 0.8, 1 and 4.8, respectively, when the tasks use malloc. The standard deviation is zero when the tasks use tlb_malloc. When the number of tasks is increased beyond four, the task set could not be isolated from OS noise, resulting in DTLB misses even for tlb_malloc. This is because we use a stock Linux system instead of a real-time kernel and lack control over page allocation for code, stack and global variables.

The adpcm and fft benchmarks represent larger and more realistic task workloads. We create two tasks with the following characteristics: T1 (1 ms, 4 ms, 2 ms) and T2 (0 ms, 20 ms, 4.6 ms), representing the fft and the adpcm benchmarks, respectively. The sum of the numbers of pages accessed by the two tasks is twenty. Figure 6.13 shows the results of this experiment. The description of the graph is identical to the ones above, except that the x-axis represents tasks in this figure. In Figure 6.13, T1 and T2 incur about 10 and 34 DTLB misses, respectively, when the tasks use malloc. If the tasks use tlb_malloc, then each task incurs zero misses. The standard deviations of the results for T1 and T2 are about 3.1 and 1.8, respectively, when the tasks use malloc. The standard deviation of the results is zero when the tasks use tlb_malloc. From Figures 6.9 through 6.13, we can conclude that a set of tasks can be isolated from each other with respect to the DTLB if they use tlb_malloc to allocate their memory.

Figure 6.13: T1 and T2 use fft and adpcm benchmarks, respectively

6.0.10 MiBench benchmarks

In this section, we describe the experiments that we conducted using MiBench benchmarks. We create two tasks, T1 (1 ms, 7 ms, 2.5 ms) and T2 (0 ms, 14 ms, 5.3 ms). T1 represents the FIR benchmark from the Mälardalen suite, while T2 represents the adpcm benchmark. T1 and T2 operate on input data of 16 kB and 450 kB, respectively. The adpcm benchmark from the MiBench suite operates on actual audio files and performs file I/O, while the adpcm benchmark from the Mälardalen suite described in the previous experiment operates on randomly generated data. Figure 6.14 depicts the results for this experiment. The description of the graph is identical to that of Figure 6.13.

Figure 6.14: Combination of Mälardalen and MiBench benchmarks

In Figure 6.14, T1 and T2 incur about 9 and 3 misses, respectively, when the tasks use malloc. If the tasks use tlb_malloc, they do not incur any misses.

The standard deviations of the results for T1 and T2 are about 2.03 and 0.5, respectively, when the two tasks use malloc. The standard deviation is zero when the tasks use tlb_malloc.

Chapter 7
Related Work

Providing a virtual address space for safety-critical systems with different integrity levels has been explored by Bennett and Audsley [4]. The main goal of their work was to provision virtual memory for safety-critical systems without complicating timing analysis for tasks with hard deadlines. To support the concept of a real-time address space, kernel modifications are imperative. Their paper, however, neither considered tasks that use dynamic memory allocation nor the interference due to TLB sets shared by tasks. Our approach differs in that no kernel modifications are needed and real-time tasks can use dynamic memory allocation with bounded or even without TLB interference.

Puaut et al. [14] have proposed a compiler approach to make paging more predictable. The main idea of their paper is to identify page-in and page-out points of virtual pages at compile time. This method relies on static knowledge of the possible references to virtual pages. However, the focus of their paper is to make demand paging more predictable, while ours is on task isolation with respect to the TLB.

Compiler-directed page coloring, proposed by Bugnion et al. [5], involves three key phases. First, a compiler creates a summary of array references and communicates this information to the run-time system. The run-time system then uses machine-specific parameters, such as the cache size, to generate a preferred color for each physical frame. The operating system then uses these colors as hints and makes a best effort to honor them. This technique is applicable to physically indexed caches, while our focus is on TLBs. In addition, our technique does not need profiling or modifications to the compiler and the operating system.

Software cache partitioning is related to our idea. This is commonly known as cache coloring. The main idea of this technique is to color physical frames such that two frames of different colors will not map to the same cache set. Liedtke et al. [10] propose OS-controlled cache partitioning. Mancuso et al. [11] use memory profiling to identify hot pages in virtual memory. Then, the kernel subsequently allocates physical frames to these pages such that there are no …


More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Memory Management 1 Hardware background The role of primary memory Program

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

Agenda. Recap: Components of a Computer. Agenda. Recap: Cache Performance and Average Memory Access Time (AMAT) Recap: Typical Memory Hierarchy

Agenda. Recap: Components of a Computer. Agenda. Recap: Cache Performance and Average Memory Access Time (AMAT) Recap: Typical Memory Hierarchy // CS 6C: Great Ideas in Computer Architecture (Machine Structures) Set- Associa+ve Caches Instructors: Randy H Katz David A PaFerson hfp://insteecsberkeleyedu/~cs6c/fa Cache Recap Recap: Components of

More information

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information

Virtual Memory: From Address Translation to Demand Paging

Virtual Memory: From Address Translation to Demand Paging Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014

More information

Process. Heechul Yun. Disclaimer: some slides are adopted from the book authors slides with permission

Process. Heechul Yun. Disclaimer: some slides are adopted from the book authors slides with permission Process Heechul Yun Disclaimer: some slides are adopted from the book authors slides with permission 1 Recap OS services Resource (CPU, memory) allocation, filesystem, communication, protection, security,

More information

Virtual Memory. CSCI 315 Operating Systems Design Department of Computer Science

Virtual Memory. CSCI 315 Operating Systems Design Department of Computer Science Virtual Memory CSCI 315 Operating Systems Design Department of Computer Science Notice: The slides for this lecture have been largely based on those from an earlier edition of the course text Operating

More information

ABSTRACT. PAN, XING. Providing DRAM Predictability for Real-Time Systems and Beyond. (Under the direction of Rainer Mueller.)

ABSTRACT. PAN, XING. Providing DRAM Predictability for Real-Time Systems and Beyond. (Under the direction of Rainer Mueller.) ABSTRACT PAN, XING. Providing DRAM Predictability for Real-Time Systems and Beyond. (Under the direction of Rainer Mueller.) DRAM memory of modern multicores is partitioned into sets of nodes, each with

More information

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France Operating Systems Memory Management Mathieu Delalandre University of Tours, Tours city, France mathieu.delalandre@univ-tours.fr 1 Operating Systems Memory Management 1. Introduction 2. Contiguous memory

More information

Memory Management Outline. Operating Systems. Motivation. Paging Implementation. Accessing Invalid Pages. Performance of Demand Paging

Memory Management Outline. Operating Systems. Motivation. Paging Implementation. Accessing Invalid Pages. Performance of Demand Paging Memory Management Outline Operating Systems Processes (done) Memory Management Basic (done) Paging (done) Virtual memory Virtual Memory (Chapter.) Motivation Logical address space larger than physical

More information

Performance of Various Levels of Storage. Movement between levels of storage hierarchy can be explicit or implicit

Performance of Various Levels of Storage. Movement between levels of storage hierarchy can be explicit or implicit Memory Management All data in memory before and after processing All instructions in memory in order to execute Memory management determines what is to be in memory Memory management activities Keeping

More information

Swapping. Operating Systems I. Swapping. Motivation. Paging Implementation. Demand Paging. Active processes use more physical memory than system has

Swapping. Operating Systems I. Swapping. Motivation. Paging Implementation. Demand Paging. Active processes use more physical memory than system has Swapping Active processes use more physical memory than system has Operating Systems I Address Binding can be fixed or relocatable at runtime Swap out P P Virtual Memory OS Backing Store (Swap Space) Main

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems

Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems Instructor: Dr. Turgay Korkmaz Department Computer Science The University of Texas at San Antonio Office: NPB

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT. Felipe Cerqueira and Björn Brandenburg

A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT. Felipe Cerqueira and Björn Brandenburg A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT Felipe Cerqueira and Björn Brandenburg July 9th, 2013 1 Linux as a Real-Time OS 2 Linux as a Real-Time OS Optimizing system responsiveness

More information

CS 3733 Operating Systems:

CS 3733 Operating Systems: CS 3733 Operating Systems: Topics: Memory Management (SGG, Chapter 08) Instructor: Dr Dakai Zhu Department of Computer Science @ UTSA 1 Reminders Assignment 2: extended to Monday (March 5th) midnight:

More information

Virtual Memory Outline

Virtual Memory Outline Virtual Memory Outline Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations Operating-System Examples

More information

Main Memory CHAPTER. Exercises. 7.9 Explain the difference between internal and external fragmentation. Answer:

Main Memory CHAPTER. Exercises. 7.9 Explain the difference between internal and external fragmentation. Answer: 7 CHAPTER Main Memory Exercises 7.9 Explain the difference between internal and external fragmentation. a. Internal fragmentation is the area in a region or a page that is not used by the job occupying

More information

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001 K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the

More information

Memory Management! Goals of this Lecture!

Memory Management! Goals of this Lecture! Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Why it works: locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware and

More information

CS 5523 Operating Systems: Memory Management (SGG-8)

CS 5523 Operating Systems: Memory Management (SGG-8) CS 5523 Operating Systems: Memory Management (SGG-8) Instructor: Dr Tongping Liu Thank Dr Dakai Zhu, Dr Palden Lama, and Dr Tim Richards (UMASS) for providing their slides Outline Simple memory management:

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 10: Paging Geoffrey M. Voelker Lecture Overview Today we ll cover more paging mechanisms: Optimizations Managing page tables (space) Efficient

More information

Memory Management Cache Base and Limit Registers base limit Binding of Instructions and Data to Memory Compile time absolute code Load time

Memory Management Cache Base and Limit Registers base limit Binding of Instructions and Data to Memory Compile time absolute code Load time Memory Management To provide a detailed description of various ways of organizing memory hardware To discuss various memory-management techniques, including paging and segmentation To provide a detailed

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

General Objective:To understand the basic memory management of operating system. Specific Objectives: At the end of the unit you should be able to:

General Objective:To understand the basic memory management of operating system. Specific Objectives: At the end of the unit you should be able to: F2007/Unit6/1 UNIT 6 OBJECTIVES General Objective:To understand the basic memory management of operating system Specific Objectives: At the end of the unit you should be able to: define the memory management

More information

Memory Allocation. Copyright : University of Illinois CS 241 Staff 1

Memory Allocation. Copyright : University of Illinois CS 241 Staff 1 Memory Allocation Copyright : University of Illinois CS 241 Staff 1 Recap: Virtual Addresses A virtual address is a memory address that a process uses to access its own memory Virtual address actual physical

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 33 Virtual Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ How does the virtual

More information

An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization

An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization An Energy-Efficient Asymmetric Multi-Processor for HP Virtualization hung Lee and Peter Strazdins*, omputer Systems Group, Research School of omputer Science, The Australian National University (slides

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

CS307: Operating Systems

CS307: Operating Systems CS307: Operating Systems Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn Download Lectures ftp://public.sjtu.edu.cn

More information

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems Processes CS 475, Spring 2018 Concurrent & Distributed Systems Review: Abstractions 2 Review: Concurrency & Parallelism 4 different things: T1 T2 T3 T4 Concurrency: (1 processor) Time T1 T2 T3 T4 T1 T1

More information

MEMORY MANAGEMENT/1 CS 409, FALL 2013

MEMORY MANAGEMENT/1 CS 409, FALL 2013 MEMORY MANAGEMENT Requirements: Relocation (to different memory areas) Protection (run time, usually implemented together with relocation) Sharing (and also protection) Logical organization Physical organization

More information

Caches in Real-Time Systems. Instruction Cache vs. Data Cache

Caches in Real-Time Systems. Instruction Cache vs. Data Cache Caches in Real-Time Systems [Xavier Vera, Bjorn Lisper, Jingling Xue, Data Caches in Multitasking Hard Real- Time Systems, RTSS 2003.] Schedulability Analysis WCET Simple Platforms WCMP (memory performance)

More information

Physical memory vs. Logical memory Process address space Addresses assignment to processes Operating system tasks Hardware support CONCEPTS 3.

Physical memory vs. Logical memory Process address space Addresses assignment to processes Operating system tasks Hardware support CONCEPTS 3. T3-Memory Index Memory management concepts Basic Services Program loading in memory Dynamic memory HW support To memory assignment To address translation Services to optimize physical memory usage COW

More information

File Systems. OS Overview I/O. Swap. Management. Operations CPU. Hard Drive. Management. Memory. Hard Drive. CSI3131 Topics. Structure.

File Systems. OS Overview I/O. Swap. Management. Operations CPU. Hard Drive. Management. Memory. Hard Drive. CSI3131 Topics. Structure. File Systems I/O Management Hard Drive Management Virtual Memory Swap Memory Management Storage and I/O Introduction CSI3131 Topics Process Management Computing Systems Memory CPU Peripherals Processes

More information

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy Memory Management Goals of this Lecture Help you learn about: The memory hierarchy Spatial and temporal locality of reference Caching, at multiple levels Virtual memory and thereby How the hardware and

More information

Memory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time.

Memory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time. Memory Management To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time. Basic CPUs and Physical Memory CPU cache Physical memory

More information

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it to be run Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester Mono-programming

More information

Real-Time Cache Management for Multi-Core Virtualization

Real-Time Cache Management for Multi-Core Virtualization Real-Time Cache Management for Multi-Core Virtualization Hyoseung Kim 1,2 Raj Rajkumar 2 1 University of Riverside, California 2 Carnegie Mellon University Benefits of Multi-Core Processors Consolidation

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Processes and threads

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Processes and threads ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Processes and threads 1 Overview Process concept Process scheduling Thread

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (I)

COSC 6385 Computer Architecture. - Memory Hierarchies (I) COSC 6385 Computer Architecture - Hierarchies (I) Fall 2007 Slides are based on a lecture by David Culler, University of California, Berkley http//www.eecs.berkeley.edu/~culler/courses/cs252-s05 Recap

More information

Virtual Memory. Motivations for VM Address translation Accelerating translation with TLBs

Virtual Memory. Motivations for VM Address translation Accelerating translation with TLBs Virtual Memory Today Motivations for VM Address translation Accelerating translation with TLBs Fabián Chris E. Bustamante, Riesbeck, Fall Spring 2007 2007 A system with physical memory only Addresses generated

More information

LECTURE 12. Virtual Memory

LECTURE 12. Virtual Memory LECTURE 12 Virtual Memory VIRTUAL MEMORY Just as a cache can provide fast, easy access to recently-used code and data, main memory acts as a cache for magnetic disk. The mechanism by which this is accomplished

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

CS6401- Operating System UNIT-III STORAGE MANAGEMENT

CS6401- Operating System UNIT-III STORAGE MANAGEMENT UNIT-III STORAGE MANAGEMENT Memory Management: Background In general, to rum a program, it must be brought into memory. Input queue collection of processes on the disk that are waiting to be brought into

More information

Process. Heechul Yun. Disclaimer: some slides are adopted from the book authors slides with permission 1

Process. Heechul Yun. Disclaimer: some slides are adopted from the book authors slides with permission 1 Process Heechul Yun Disclaimer: some slides are adopted from the book authors slides with permission 1 Recap OS services Resource (CPU, memory) allocation, filesystem, communication, protection, security,

More information

CACHE DESIGN AND TIMING ANALYSIS FOR PREEMPTIVE MULTI-TASKING REAL-TIME UNIPROCESSOR SYSTEMS. Yudong Tan

CACHE DESIGN AND TIMING ANALYSIS FOR PREEMPTIVE MULTI-TASKING REAL-TIME UNIPROCESSOR SYSTEMS. Yudong Tan CACHE DESIGN AND TIMING ANALYSIS FOR PREEMPTIVE MULTI-TASKING REAL-TIME UNIPROCESSOR SYSTEMS A Dissertation Presented to The Academic Faculty by Yudong Tan In Partial Fulfillment of the Requirements for

More information

Practical, transparent operating system support for superpages

Practical, transparent operating system support for superpages Practical, transparent operating system support for superpages Juan Navarro Sitaram Iyer Peter Druschel Alan Cox Rice University OSDI 2002 Overview Increasing cost in TLB miss overhead growing working

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Chapter 9: Virtual-Memory

Chapter 9: Virtual-Memory Chapter 9: Virtual-Memory Management Chapter 9: Virtual-Memory Management Background Demand Paging Page Replacement Allocation of Frames Thrashing Other Considerations Silberschatz, Galvin and Gagne 2013

More information

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1 Memory Management Disclaimer: some slides are adopted from book authors slides with permission 1 CPU management Roadmap Process, thread, synchronization, scheduling Memory management Virtual memory Disk

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

Carnegie Mellon. Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition

Carnegie Mellon. Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition Carnegie Mellon Virtual Memory: Concepts 5-23: Introduction to Computer Systems 7 th Lecture, October 24, 27 Instructor: Randy Bryant 2 Hmmm, How Does This Work?! Process Process 2 Process n Solution:

More information

OPERATING SYSTEM. Chapter 9: Virtual Memory

OPERATING SYSTEM. Chapter 9: Virtual Memory OPERATING SYSTEM Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Memory management: outline

Memory management: outline Memory management: outline Concepts Swapping Paging o Multi-level paging o TLB & inverted page tables 1 Memory size/requirements are growing 1951: the UNIVAC computer: 1000 72-bit words! 1971: the Cray

More information

Memory management: outline

Memory management: outline Memory management: outline Concepts Swapping Paging o Multi-level paging o TLB & inverted page tables 1 Memory size/requirements are growing 1951: the UNIVAC computer: 1000 72-bit words! 1971: the Cray

More information

Processes and Virtual Memory Concepts

Processes and Virtual Memory Concepts Processes and Virtual Memory Concepts Brad Karp UCL Computer Science CS 37 8 th February 28 (lecture notes derived from material from Phil Gibbons, Dave O Hallaron, and Randy Bryant) Today Processes Virtual

More information