High performance computing. Memory

Meryl Heather Cooper
5 years ago
1 High performance computing Memory

2 Performance of the computations
For many programs, a computation can be viewed as two activities: retrieving data from memory and processing it in the processor.
In practice, many algorithms (programs) have performance limited either only by the performance of the processor or only by the efficiency of data retrieval from memory.
The latter situation has become more widespread in recent years.

3 Memory wall

4 Estimating memory performance
For algorithms whose efficiency is limited by memory, the time spent retrieving data from memory determines the final performance of the implementation. It can be estimated as:
$T_{MEM} = \text{number\_of\_accesses} \times \text{access\_time}$
The access time for a single variable (a single argument) is often given as a memory parameter; for RAM it is several tens of ns.

5 Estimating memory performance
Example estimate for the matrix-vector multiplication algorithm (matrix size N x N):
number_of_accesses = $2N^2$
access_time = 100 ns
$T_{MEM} = 200 N^2 \times 10^{-9}$ s
For comparison, the execution time of the algorithm's $2N^2$ operations, assuming the theoretical performance of a 4-way processor clocked at 2.5 GHz:
$T_{proc} = 0.2 N^2 \times 10^{-9}$ s
Does memory slow down the execution 1,000 times?
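The estimate above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides: the function names and the matrix dimension `n = 1000` are assumptions chosen for illustration, while the access time, issue width and clock frequency are the values quoted on the slide.

```python
# Illustrative estimate; parameter values are the ones quoted on the slide,
# function names and n are hypothetical.

def memory_time_s(num_accesses, access_time_s):
    """T_MEM = number_of_accesses * access_time."""
    return num_accesses * access_time_s


def processor_time_s(num_ops, ops_per_cycle, clock_hz):
    """Time to execute num_ops at the theoretical peak rate."""
    return num_ops / (ops_per_cycle * clock_hz)


n = 1000  # hypothetical matrix dimension N
t_mem = memory_time_s(2 * n * n, 100e-9)        # 2N^2 accesses, 100 ns each
t_proc = processor_time_s(2 * n * n, 4, 2.5e9)  # 2N^2 ops, 4-way, 2.5 GHz
slowdown = t_mem / t_proc

print(f"T_MEM  = {t_mem:.4f} s")
print(f"T_proc = {t_proc:.6f} s")
print(f"ratio  = {slowdown:.0f}")
```

Both times scale with N^2, so the ratio does not depend on the matrix size. The model charges the full RAM access time to every reference and ignores caches entirely.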

6 Cache memory
Reason no. 1 why the actual access time is shorter: cache memory.
Approximate sizes and access times for different types of memory:

Memory type      Size              Access time
registers        hundreds of B     < 1 ns
cache (L1)       several kB        approximately 1 ns
cache (L2-L3)    several MB        a few ns
main memory      several GB        tens of ns
disk             hundreds of GB    a few ms

Optimal cache usage is crucial for the performance of programs.

7 Cache memory
Approximate memory sizes and access times for the different levels (this time in numbers of cycles)

8 Cache memory
The operation of the cache:
the processor issues a request for access to memory
the corresponding cache line (or set of lines) is checked:
  hit: the value is supplied from the cache
  miss: the contents of the entire memory line are loaded into the cache
cells are read from or written to the cache
if the CPU writes, a strategy is needed to maintain cache coherency (relative to main memory)

9 Cache memory
The organization of the cache:
lines of memory with a capacity of a few words (e.g. 64 B, 128 B)
mapping of areas (blocks) of main memory into cache lines:
  static: direct mapped (each block can be placed in exactly one line)
  dynamic:
    fully associative (each block can be mapped to any line of the entire cache)
    set associative (each block can be mapped to any line of a certain group)
Harvard architecture: separate caches for data and for code (to avoid resource hazards)
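The three mapping schemes differ only in how many sets the cache is divided into. The sketch below is illustrative and not from the slides: the function name `split_address` and the geometry (64 B lines, 64 sets, as in a hypothetical 32 KiB, 8-way cache) are assumptions. With one line per set the scheme is direct mapped; with `num_sets=1` it is fully associative.

```python
# Hypothetical cache geometry: 64-byte lines, 64 sets
# (e.g. 32 KiB / 64 B = 512 lines, 8 ways -> 64 sets).

def split_address(addr, line_size=64, num_sets=64):
    """Split a byte address into (tag, set index, offset within line)."""
    offset = addr % line_size       # byte position inside the cache line
    block = addr // line_size       # number of the main-memory block
    set_index = block % num_sets    # the group of lines the block may occupy
    tag = block // num_sets         # identifies which block is in the line
    return tag, set_index, offset


tag, set_index, offset = split_address(0x12345)
print(tag, set_index, offset)

# Two addresses exactly num_sets * line_size bytes apart compete for the same set.
other = split_address(0x12345 + 64 * 64)
print(other)
```

In a direct-mapped cache two such conflicting blocks evict each other on every alternation, which is the cost of the static mapping.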

10 Cache memory

11 Cache memory
Haswell: 2 loads and 1 store per cycle.
Sandy Bridge: the same only when using 256-bit memory accesses.

12 Locality of references
The usefulness of the cache depends on the locality of references to data in the program.
Temporal locality: data that has been used once will soon be used again (it is better to keep it in faster memory).
Spatial locality: if the program uses some data, neighboring data will be used in the near future (it is good to fetch entire blocks into the cache).
The measure of locality of references during program execution is the hit ratio: the ratio of hits to all memory accesses.
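The effect of spatial locality on the hit ratio can be illustrated with a toy model. The following sketch is an assumption-laden simulation, not a measurement: it models a hypothetical direct-mapped cache with 64 lines of 64 B, a 128 x 128 matrix of 8-byte elements stored row by row, and computes the hit ratio as hits divided by all accesses.

```python
# Toy direct-mapped cache model; sizes and names are hypothetical.

def hit_ratio(addresses, line_size=64, num_lines=64):
    """Fraction of accesses that hit in a direct-mapped cache."""
    stored = [None] * num_lines      # block currently held by each line
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_size
        line = block % num_lines
        if stored[line] == block:
            hits += 1
        else:
            stored[line] = block     # miss: load the whole memory line
        total += 1
    return hits / total


def row_order(n, elem_size=8):
    """Addresses of a row-major n x n matrix visited row by row."""
    return ((i * n + j) * elem_size for i in range(n) for j in range(n))


def column_order(n, elem_size=8):
    """Addresses of the same matrix visited column by column."""
    return ((i * n + j) * elem_size for j in range(n) for i in range(n))


print("row by row:      ", hit_ratio(row_order(128)))
print("column by column:", hit_ratio(column_order(128)))
```

In this model the row-by-row traversal misses once per line and then hits on the remaining seven elements, while the column-by-column traversal strides over a full row each time and never finds its line still in the cache.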

13 Locality of references
Using a cache usually increases performance, since many programs naturally have a high degree of locality of references (in practice, the hit ratio is greater than 90%).
To increase the performance of programs through optimal use of the cache, we should maximize the degree of locality of references in the code (group references to the same and neighboring data in one place).
An interesting alternative is the so-called cache-oblivious algorithms.

14 Memory access time
The calculation of the average memory access time for a specific program:
$t_c$: access time to the cache (hit time)
$t_m$: access time to main memory (miss time)
$h$: hit ratio in the cache
$T_{av}$: average data access time for a given program
$T_{av} = h \, t_c + (1 - h) \, t_m$
Access time can also be expressed in clock cycles; it must then be converted accordingly.
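As a worked instance of the formula, the sketch below evaluates the average access time and converts it to clock cycles. The numeric values (95% hit ratio, 1 ns hit time, 100 ns miss time, 2.5 GHz clock) and the function names are assumptions picked for illustration only.

```python
# Hypothetical parameters; only the formula itself comes from the slide.

def average_access_time(h, t_c, t_m):
    """T_av = h * t_c + (1 - h) * t_m, with h the hit ratio in [0, 1]."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("hit ratio must be between 0 and 1")
    return h * t_c + (1.0 - h) * t_m


def ns_to_cycles(t_ns, clock_ghz):
    """Convert a time in ns to clock cycles (1 GHz = 1 cycle per ns)."""
    return t_ns * clock_ghz


t_av = average_access_time(h=0.95, t_c=1.0, t_m=100.0)  # times in ns
print(f"T_av = {t_av:.2f} ns = {ns_to_cycles(t_av, 2.5):.3f} cycles")
```

Even with a 95% hit ratio, the 5% of misses contribute most of the average, which is why small improvements in hit ratio matter so much.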

15 Memory access time
For a multi-level cache we can apply the formula:
$t_{av} = t_{c1} + m_1 (t_{c2} + m_2 (t_{c3} + m_3 \, t_{m3}))$
$t_{ci}$: access time to the i-th level cache (hit time)
$t_{mi}$: time for handling a miss in the i-th level cache (including replacement of the line), i.e. the miss time
$m_i$: miss ratio in the i-th level cache
$t_{av}$: average access time for a given program
The above formula assumes a 3-level cache with identical modes of operation at each level (which, as we know, is not always the case, for example with a victim cache).
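The nested formula can be evaluated from the innermost level outwards. This is a minimal sketch under the same assumption as the slide (every level behaves identically); the function name and all numeric values are hypothetical.

```python
# Evaluates t_av = t_c1 + m_1 * (t_c2 + m_2 * (t_c3 + m_3 * t_m3)).
# All numbers below are made-up illustration values in ns.

def multilevel_access_time(hit_times, miss_ratios, t_last_miss):
    """hit_times[i] and miss_ratios[i] describe cache level i+1;
    t_last_miss is the miss time of the last cache level."""
    if len(hit_times) != len(miss_ratios):
        raise ValueError("need one miss ratio per cache level")
    t = t_last_miss
    # Work from the last level back to L1.
    for t_c, m in zip(reversed(hit_times), reversed(miss_ratios)):
        t = t_c + m * t
    return t


t_av = multilevel_access_time(
    hit_times=[1.0, 4.0, 12.0],    # t_c1, t_c2, t_c3
    miss_ratios=[0.1, 0.5, 0.5],   # m_1, m_2, m_3
    t_last_miss=100.0,             # t_m3
)
print(f"t_av = {t_av:.2f} ns")
```

Because the loop simply folds over the levels, the same function also covers one- or two-level hierarchies by passing shorter lists.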

16 Memory hierarchy

17 Roofline

18 Optimizing the performance of cache
Cache performance parameters:
hit handling time (hit time)
hit ratio (hit rate) or miss ratio (miss rate)
miss handling time (miss penalty)
The actual average data access time may be lower than the one obtained from the simplified analysis, due to the concurrency and parallelism of memory operations, e.g. pipelining, multislots etc.
The programmer should:
maximize the hit ratio
reduce the impact of miss handling time by fetching data from memory in advance (software prefetching)

19 Memory organization
Memory organization of today's computer systems:
a single processor:
  multi-level cache
  virtual memory
multiprocessor computers:
  shared memory (UMA, NUMA, ccNUMA)
  distributed memory
  hybrid architectures
  shared memory (hUMA, VSM)
embedded memory (eDRAM)

20 Virtual memory
Virtual memory is an extension of the memory address space beyond the main (physical) memory, provided by the operating system.
Main memory contains only a portion of the address space.
Main memory is divided into frames, which hold virtual memory pages.
If a page requested by the processor is not in any frame of main memory, a page fault occurs and pages are swapped (similar to line replacement in the cache, with similar replacement strategies).

21 Virtual memory
The mechanism for translating virtual addresses to real ones (taking memory paging into account), and for checking whether pages are in main memory, is based on a page table.
The page table can be very large, and access to it should be fast. It is therefore stored using the entire memory hierarchy, including a special type of cache called the Translation Lookaside Buffer (TLB).

22 Virtual memory
When the processor translates a virtual address to a real one, it first checks whether the entry is in the TLB. If not, TLB miss handling is invoked.
For programs occupying large areas of memory, the addresses of all their pages may not fit in the TLB.
To counteract the significant decline in performance (up to dozens of times!) in such cases, we can try to increase the size of the individual pages associated with the program (modern operating systems usually offer a range of page sizes).
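Why larger pages help can be seen by comparing the memory a TLB can cover with the number of pages a program touches. The sketch below is a rough illustration: the 64-entry TLB and the 1 GiB working set are hypothetical, and the 4 KiB and 2 MiB page sizes are used as typical examples of a small and a large page.

```python
# Hypothetical TLB size and working set; page sizes are example values.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach_bytes(tlb_entries, page_size):
    """Memory covered by the TLB when every entry maps one page."""
    return tlb_entries * page_size


def pages_needed(working_set_bytes, page_size):
    """Number of pages required to hold the working set (rounded up)."""
    return -(-working_set_bytes // page_size)


tlb_entries = 64
working_set = 1 * GIB
for page_size in (4 * KIB, 2 * MIB):
    reach = tlb_reach_bytes(tlb_entries, page_size)
    pages = pages_needed(working_set, page_size)
    print(f"page {page_size // KIB:>5} KiB: "
          f"TLB reach {reach // KIB} KiB, pages needed {pages}")
```

With the small page size the program needs thousands of times more translations than the TLB can hold, so accesses scattered over the working set will often miss in the TLB; the larger page size narrows that gap considerably.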

23 Virtual memory
The procedure for a memory access in the case of a page fault is complex and time-consuming.
High-performance programs should not allow page faults (beyond the initial ones, when the program and data are loaded into memory for the first time).
Thrashing must be avoided: the continuous reloading of pages can consume more than 90% of CPU time.

24 Summary
By analyzing the theoretical performance of the memory-processor system we can obtain some characteristics, such as:
maximum instruction processing rate (MIPS, as the product of the processor's instruction throughput per cycle and its clock frequency)
maximum floating-point instruction processing rate (as above, but only for floating-point operations: MFLOPS)
maximum throughput of the memory-processor system (derived from the timing and frequency of the bus and from memory properties: MB/s)
latency of memory access (access time)

25 Summary
Just as the complexity of the processor's operation makes it impossible to determine its actual performance for a specific program, the complexity of memory prevents a detailed estimate of its performance.
For some algorithms we can use simplified models and evaluate performance for these models, taking into account how actual performance is correlated with the model performance.

26 Summary
The main parameter characterizing a specific program in terms of memory usage is the number of memory references in the program.
A frequently used indicator of memory-use intensity is the ratio of the number of operations on data to the number of memory references required to carry out those operations.
With this indicator and the memory parameters we can calculate the maximum performance that the memory-processor system can achieve in a given calculation.
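One simple way to turn this indicator into a performance bound is to cap the rate of operations by both the processor peak and the rate at which memory can deliver references. The sketch below is an illustrative model, not the slides' own method: the function name, the 10 GB/s bandwidth and the 8-byte reference size are assumptions, while the 10 G operations/s peak corresponds to a 4-way processor at 2.5 GHz.

```python
# Simplified bound: performance <= min(peak, intensity * references per second).
# Bandwidth and reference size are hypothetical values.

def attainable_ops_per_s(intensity, bandwidth_bytes_per_s,
                         bytes_per_ref, peak_ops_per_s):
    """intensity = operations on data per memory reference."""
    refs_per_s = bandwidth_bytes_per_s / bytes_per_ref
    return min(peak_ops_per_s, intensity * refs_per_s)


peak = 4 * 2.5e9  # 4 operations per cycle at 2.5 GHz

# Matrix-vector product: 2N^2 operations for 2N^2 references -> intensity 1.
low = attainable_ops_per_s(1.0, 10e9, 8, peak)
# A hypothetical kernel that reuses each loaded value 100 times.
high = attainable_ops_per_s(100.0, 10e9, 8, peak)

print(f"intensity   1: {low / 1e9:.2f} G operations/s (memory bound)")
print(f"intensity 100: {high / 1e9:.2f} G operations/s (processor bound)")
```

Under these assumptions the low-intensity kernel cannot exceed one eighth of the processor peak regardless of how well the arithmetic is optimized.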

27 Summary
Actual computational performance is always lower than the optimistic estimates obtained from the analysis of the processor and of the memory-processor system.
It is always worth estimating each of these theoretical performance indicators for a particular system and obtaining a measure of real performance.
Actual performance at the level of tens of percent of the minimum of the theoretical performance values indicates a correct implementation for a given system.
