Memory Systems and Compiler Support for MPSoC Architectures
Mahmut Kandemir and Nikil Dutt
Chapter 9
Fernando Moraes
May 28, 2013
MPSoC - Advantages
An MPSoC architecture has several advantages over the conventional strategy of placing a single, more powerful (but more complex) processor on the chip:
- designing an on-chip multiprocessor composed of multiple simple processor cores is simpler than designing a complex single-processor system
- better utilization of the silicon space
- an MPSoC architecture can exploit loop-level parallelism at the software level in array-intensive embedded applications
- energy savings through careful and selective management of the individual processors
MPSoC Critical Component
MPSoC: platform for executing the array-intensive computations commonly found in embedded image and video processing applications.
Most critical component: the memory system
- applications spend a significant portion of their cycles in the memory hierarchy
- the memory system can contribute up to 90% of the overall system power
- a significant portion of the transistors in an MPSoC-based architecture is expected to be devoted to the memory hierarchy
Ways of Optimizing Memory Performance
1. constructing a suitable memory organization/hierarchy
   - caches, scratch pad memories, stream buffers, FIFOs, etc.
2. optimizing the software (application) for it
   - traditional objective: performance (execution cycles)
   - MPSoC: also energy/power consumption and memory space usage
MEMORY ARCHITECTURES
The application-specific nature of embedded systems presents new opportunities for aggressive customization and exploration of architectural issues → the features of the given application can be used to determine the architectural parameters.
Example: floating-point arithmetic → DISCUSS!
Traditionally, memory issues have been addressed separately by disparate research groups:
- computer architects
- compiler writers
- CAD/embedded systems community
Types of Architectures
(1) Cache
- line size and associativity can be customized for a given application
- access time is subject to cache misses
(2) Scratch Pad Memory (SPM)
- data memory residing on-chip, mapped into an address space disjoint from the off-chip memory but connected to the same address and data busses
- fast access (SRAM), single-cycle access time
Cache and SPM (figures)
Simple Example
- if source and mask were both accessed through the data cache, performance would suffer from cache conflicts
- storing the small mask array in the SPM eliminates all data conflicts in the data cache
- the data cache is then used only for the accesses to source, which are very regular
- keeping mask on-chip ensures that this frequently accessed data is never ejected off-chip, significantly improving memory performance and energy dissipation
(a) Procedure CONV. (b) Memory access pattern in CONV.
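A minimal C sketch of this idea (not the chapter's exact CONV code; the array sizes, mask values, and the commented-out ".spm" section attribute are assumptions used only for illustration). The point is that the tiny, heavily reused mask array is the natural candidate for the SPM, while the large source array streams through the data cache.

```c
#include <stdio.h>

#define N 128   /* source image is N x N (assumed size) */
#define M 3     /* mask is M x M, small enough to fit in the SPM */

/* Hypothetical placement attribute: a real toolchain would provide its own
 * way (linker script, pragma) to map this array into the on-chip SPM, where
 * accesses to 'source' can never evict it. */
static const int mask[M][M] /* __attribute__((section(".spm"))) */ = {
    {1, 2, 1}, {2, 4, 2}, {1, 2, 1}
};

static int source[N][N];   /* large array, streamed through the data cache */
static int dest[N][N];

void conv(void)
{
    for (int i = 0; i + M <= N; i++)
        for (int j = 0; j + M <= N; j++) {
            int acc = 0;
            for (int x = 0; x < M; x++)       /* every output pixel touches  */
                for (int y = 0; y < M; y++)   /* all M*M mask entries again  */
                    acc += source[i + x][j + y] * mask[x][y];
            dest[i][j] = acc;
        }
}

int main(void)
{
    conv();
    printf("%d\n", dest[0][0]);
    return 0;
}
```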
Types of Architectures
(3) DRAM
- multiple (large) embedded memory banks
- access modes of synchronous DRAMs that must be modeled include:
  - burst-mode read/write: fast successive accesses to data in the same page
  - interleaved row read/write: alternating burst accesses between banks
  - interleaved column access: alternating burst accesses between two chosen rows in different banks
Types of Architectures
(4) Special-Purpose Memories
- stacks (last-in, first-out, LIFO) are used in microcontrollers
- queues (first-in, first-out, FIFO) are used in network chips
- content-addressable memories (CAM) are used in search applications
Customization of Memory Architectures
Cache
1. cache line size
2. cache size
- if memory accesses are regular and consecutive (exhibit spatial locality), a longer cache line is desirable, since it minimizes the number of off-chip accesses and exploits the locality by prefetching elements that will be needed in the immediate future
- if memory accesses are irregular, or have large strides, a shorter cache line is desirable, as this reduces off-chip memory traffic by not bringing unnecessary data into the cache
- the maximum size of a cache line is the DRAM page size
How to estimate? (see the sketch below)
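A back-of-the-envelope sketch of the estimate, using my own first-order simplification (compulsory misses only; capacity and conflict misses, and the access count and strides, are assumptions): it counts the distinct cache lines touched by a strided sweep, which gives both the miss count and the off-chip traffic for each candidate line size. Longer lines cut the miss count for unit-stride accesses without increasing traffic, while for large strides they only inflate the traffic.

```c
#include <stdio.h>

/* Count the distinct cache lines touched by n_accesses strided accesses;
 * each distinct line costs one off-chip fetch of line_bytes. */
static long lines_touched(long n_accesses, long stride_bytes, long line_bytes)
{
    long distinct = 0, last_line = -1;
    for (long i = 0; i < n_accesses; i++) {
        long line = (i * stride_bytes) / line_bytes;
        if (line != last_line) { distinct++; last_line = line; }
    }
    return distinct;
}

int main(void)
{
    const long n = 1L << 16;                   /* assumed number of 4-byte accesses */
    const long sizes[] = {16, 32, 64, 128};    /* candidate line sizes (bytes)      */

    for (int k = 0; k < 4; k++) {
        long l = sizes[k];
        long m_seq = lines_touched(n, 4, l);   /* unit stride: spatial locality */
        long m_str = lines_touched(n, 64, l);  /* large stride: 64-byte steps   */
        printf("line %4ld B: unit stride %6ld misses (%8ld B traffic), "
               "stride 64 B %6ld misses (%8ld B traffic)\n",
               l, m_seq, m_seq * l, m_str, m_str * l);
    }
    return 0;
}
```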
Customization of Memory Architectures
SPM + Cache
The MemExplore framework optimizes the on-chip data memory organization, addressing the following problem: given a certain amount of on-chip memory space, partition it into data cache and SPM so that the total access time and energy dissipation are minimized, i.e., the number of accesses to off-chip memory is minimized.
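A minimal sketch of the exploration loop behind such a partitioning step; it is not the actual MemExplore tool. The cost estimator below is a made-up placeholder: in a real framework it would come from analytical models of the application's arrays and access frequencies.

```c
#include <stdio.h>

/* Placeholder cost model (assumption): estimated off-chip accesses for a
 * given cache/SPM split. A real flow derives this from the application. */
static long est_offchip_accesses(long cache_bytes, long spm_bytes)
{
    long misses = (cache_bytes >= 4096) ? 1000 : 100000 / (cache_bytes / 64 + 1);
    long spills = (spm_bytes >= 1024) ? 0 : 50000;  /* small hot data fits in 1 KB SPM */
    return misses + spills;
}

int main(void)
{
    const long total = 8192;          /* total on-chip budget in bytes (assumed) */
    long best_cache = 0, best_cost = -1;

    /* Enumerate power-of-two cache sizes; the remainder of the budget is SPM. */
    for (long cache = 1024; cache <= total; cache *= 2) {
        long cost = est_offchip_accesses(cache, total - cache);
        if (best_cost < 0 || cost < best_cost) { best_cost = cost; best_cache = cache; }
    }
    printf("best split: %ld B cache + %ld B SPM (est. %ld off-chip accesses)\n",
           best_cache, total - best_cache, best_cost);
    return 0;
}
```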
DRAM Optimization
Multiple banks: example of a data set larger than the cache line.
Since each bank has its own private page buffer, there is no interference between the arrays, and the memory accesses do not become a bottleneck.
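A small sketch of the placement idea; the ".dram_bank0"/".dram_bank1" section names are hypothetical (real toolchains expose bank mapping through their own linker scripts or pragmas). Putting the two streamed arrays in different banks lets each bank keep its own page open, so the alternating accesses stay in burst mode instead of repeatedly closing and reopening pages.

```c
#include <stdio.h>

#define N 4096

/* Hypothetical per-bank placement; shown as comments so the sketch compiles anywhere. */
static int x[N] /* __attribute__((section(".dram_bank0"))) */;
static int y[N] /* __attribute__((section(".dram_bank1"))) */;

int main(void)
{
    long dot = 0;
    for (int i = 0; i < N; i++)   /* alternates between the two banks */
        dot += (long)x[i] * y[i];
    printf("%ld\n", dot);
    return 0;
}
```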
Memory Reconfigurability
Idea: reconfigure the cache (or SPM) architecture dynamically according to the application at hand.
The compiler can analyze a given application, divide its code into regions, and, for each region, select an optimum cache configuration for each processor (see the sketch below).
Problems:
- architectural and circuit mechanisms for efficient and fast reconfiguration are essential
- control mechanisms are required for deciding when to reconfigure these caches
- mechanisms are needed to determine the optimal configuration of the cache
- techniques for minimizing the overhead of data invalidation across different reconfiguration phases are essential
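A purely illustrative sketch of what the compiler-inserted code could look like: reconfigure_cache() is a hypothetical hook into the reconfigurable cache controller (no such standard API exists), and the per-region configurations are assumptions. The compiler would emit one such call at each region boundary it has identified.

```c
#include <stdio.h>

struct cache_cfg { int size_kb; int assoc; int line_bytes; };

/* Hypothetical hook into the reconfigurable cache controller. */
static void reconfigure_cache(struct cache_cfg c)
{
    printf("cache -> %d KB, %d-way, %d-byte lines\n",
           c.size_kb, c.assoc, c.line_bytes);
}

int main(void)
{
    /* Region 1: streaming loop with little reuse -> small cache, long lines. */
    reconfigure_cache((struct cache_cfg){8, 1, 64});
    /* ... region 1 loop nest would execute here ... */

    /* Region 2: blocked matrix code with heavy reuse -> larger, more associative. */
    reconfigure_cache((struct cache_cfg){32, 4, 32});
    /* ... region 2 loop nest would execute here ... */

    return 0;
}
```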
COMPILER SUPPORT
Problems
o Parallelism
- the parallelization strategy determines how memory is utilized by the multiple on-chip processors and can be an important factor in achieving acceptable performance
- intrinsic data dependences
- interprocessor communication costs
o Instruction and Data Locality
- interprocessor communication can lead to frequent cache line invalidations/updates (interprocessor data sharing), which in turn increases overall latency
- false sharing
COMPILER SUPPORT
Problems (cont.)
o Power/Energy Consumption
- increasing the number of PEs means powering up more processors along with their local caches (and/or SPMs) → more power
- the compiler should be able to balance the increase in power consumption against the decrease in execution cycles
o Memory Space
- reducing memory space consumption can be of critical importance, as it increases the effectiveness of on-chip memory utilization and can reduce the number of off-chip references
COMPILER SUPPORT
Solutions
o Optimizing parallelism
- parallelism can either be expressed by the programmer at the source level or be derived automatically by an optimizing compiler from the sequential code
- in the compiler-driven case, the compiler needs to analyze data dependences and extract the available parallelism; its tasks are:
  1. estimate performance, power consumption, and space requirements of a given piece of code
  2. optimize the objective function under multiple constraints
- it is easier to estimate performance and energy consumption with SPMs (as opposed to caches), since the compiler is in full control of memory transfers
COMPILER SUPPORT
Solutions
o Optimizing locality
- locality optimization can be performed for both code and data
- goal: reduce the number of accesses to the slower levels of the memory hierarchy
- optimization of the data structures (statically or dynamically): data space (memory layout) transformations, or data transformations for short
- example: software data prefetching (see the sketch below)
  1. determine data references that are likely to be cache misses and therefore need to be prefetched
  2. isolate the predicted cache-miss instances through loop splitting
  3. apply software pipelining and insert explicit prefetch instructions in the code
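A hedged sketch of steps 2 and 3 (the loop, the streaming access pattern assumed to miss, and the prefetch distance PF_DIST are all assumptions; the prefetch is issued with GCC/Clang's __builtin_prefetch): the loop is split so that only iterations expected to miss carry a prefetch, requested PF_DIST iterations ahead of use in software-pipeline fashion, and the epilogue runs without prefetches.

```c
#include <stddef.h>

#define PF_DIST 16   /* assumed prefetch distance, in elements */

void scale(float *dst, const float *src, size_t n, float k)
{
    size_t i = 0;

    /* Main (split) loop: the data for iteration i+PF_DIST is requested now,
     * so it should already be in the cache when it is used. */
    for (; i + PF_DIST < n; i++) {
        __builtin_prefetch(&src[i + PF_DIST], 0 /* read */, 1 /* low temporal reuse */);
        dst[i] = k * src[i];
    }

    /* Epilogue: the last PF_DIST iterations, with nothing left to prefetch. */
    for (; i < n; i++)
        dst[i] = k * src[i];
}
```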
COMPILER SUPPORT
Solutions
o Optimizing locality (cont.)
- from a multiple-processor perspective, the locality problem is more challenging to tackle than in the single-processor case
- the SPM approach is simple and preferable
- if an L1 cache is necessary → energy-aware coherence protocols/algorithms are an important potential research direction
COMPILER SUPPORT
Solutions
o Optimizing Memory Space Utilization
- consider the lifetimes of program data structures/variables
- this enables sharing of the same data space (see the sketch below)
- sharing data space may reduce performance due to data sharing
- the compiler should be able to resolve this tradeoff, considering the performance and memory space constraints at the same time
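A small sketch of lifetime-based space sharing (the arrays, sizes, and phases are made up): A is live only in phase 1 and B only in phase 2, so the compiler (or programmer) can let both occupy the same buffer instead of reserving space for each.

```c
#include <stdio.h>

#define N 256

static int buffer[N];   /* single data space reused by both arrays */

int main(void)
{
    int *A = buffer;    /* live only during phase 1 */
    int *B = buffer;    /* live only during phase 2 */

    /* Phase 1: produce and fully consume A. */
    long sum_a = 0;
    for (int i = 0; i < N; i++) A[i] = i * i;
    for (int i = 0; i < N; i++) sum_a += A[i];

    /* Phase 2: A is dead, so B can safely overwrite the same space. */
    long sum_b = 0;
    for (int i = 0; i < N; i++) B[i] = i + 1;
    for (int i = 0; i < N; i++) sum_b += B[i];

    printf("%ld %ld\n", sum_a, sum_b);
    return 0;
}
```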
COMPILER SUPPORT
Solutions
o Power/Energy Optimization
- power/energy and performance optimizations are conflicting targets
- example: varying the number of PEs per application
- more PEs → better performance (results normalized to 1 PE)
COMPILER SUPPORT
Solutions
o Power/Energy Optimization (cont.)
- but the energy increases
COMPILER SUPPORT
Solutions
o Power/Energy Optimization (cont.)
- most published work on parallelism for high-end machines is static; that is, the number of processors that execute the code is fixed for the entire execution
- alternative: adaptive parallelization (see the sketch below)
- it is possible to consume much less energy by using the minimum number of processors for each loop nest and shutting down the caches of the unused processors
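A conceptual sketch of adaptive parallelization (OpenMP is used only as an illustrative vehicle, and the per-nest processor counts are assumptions, not figures from the chapter): each loop nest requests only the number of processors its amount of work justifies, leaving the remaining cores and their caches free to be shut down during that nest. Compile with an OpenMP-enabled compiler (e.g., -fopenmp); without it, the pragmas are simply ignored.

```c
#include <stdio.h>

#define N 1024
static double a[N], b[N][N], c[N];

int main(void)
{
    /* Nest 1: O(N) work -> 2 processors are enough (assumed). */
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    /* Nest 2: O(N^2) work -> use all 4 processors (assumed). */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += b[i][j] * a[j];
        c[i] = s;
    }

    printf("%f\n", c[0]);
    return 0;
}
```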
Conclusions
- with full advance knowledge of the applications to be implemented by the system, many design parameters can be optimized and/or customized
- the optimal memory architecture for an application-specific system can be significantly different from the typical cache hierarchy of processors
http://www.artist-embedded.org/docs/events/2010/autrans/talks/pdf/teich/autrans2010_teich_trr89.ppt.pdf