Application-Specific Design of Low Power Instruction Cache Hierarchy for Embedded Processors

Ji Gu
Onodera Laboratory, Department of Communications & Computer Engineering, Graduate School of Informatics, Kyoto University

Embedded Systems: What are they?
- An embedded system is a special-purpose computer system designed to perform one or a few dedicated functions [Wikipedia]
- Processor based: general processors, microcontrollers, DSPs
- A subsystem: part of a larger system which it controls
- With special purpose: in general, it does not provide programmability to users, as opposed to general-purpose computer systems like a PC

Embedded Systems: Where are they?
- Everywhere! Our daily lives depend on embedded systems

Why use microprocessors?
- Microprocessors are very efficient: the same logic can be used to perform many different functions
- Microprocessors simplify the design of families of products
- Sonicare Plus toothbrush: 8-bit Zilog Z8 microprocessor
- BMW 745i: 53 8-bit microprocessors, ... -bit microprocessors, 7 16-bit microprocessors
- NASA's Mars Rover: 8-bit Intel 8085 microprocessor
- Canon EOS 3: 3 32-bit microprocessors

Some Common Characteristics (1/2)
While embedded systems cover a wide range of special-purpose systems, they share common characteristics:
- Low cost: memory is typically small compared to a general-purpose system; lightweight processors are used
- Low power: power consumption should be low, especially in portable devices; low-power processors are used
- High performance: when playing video on portable devices, audio should be in sync with video; gaming gadgets need high performance
- Real-time property: a job should be done within a time limit; aerospace applications, car control systems, and medical gadgets are critical in terms of time constraints

Some Common Characteristics (2/2)
- It is challenging to satisfy all the characteristics
- You may not be able to achieve high performance while maintaining low power consumption and using cheap components
- This talk focuses on the low-power design of embedded systems

Processor power
- Processors in data centers consume 1.5% of the worldwide energy
- Processors and memories are typical components in embedded systems: cores, control logic, on-chip caches, registers, off-chip caches/memories
- Caches are energy-consuming due to instruction/data supply
- This is typical of the instruction cache, which is accessed frequently

Research overview
[Figure: Target System Architecture. A chip containing the CPU core (pipeline stages IF, ID, EX, MEM, WB, with the DLIC), the I-cache and the D-cache, connected to IMem and DMem. Labels 1 to 3 mark the three targets listed below.]

Processor power: instruction supply power
1. Off-chip instruction bus: SBIC design
2. Instruction cache: ROBTIC design
3. Instruction fetch and decode: DLIC design

Bus Encoding for Switching Reduction
- Transition-Zero (T0) (Benini1997)
- Gray (Su1994)
- Bus-Invert (BI) (Stan1995)
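As an illustration of the Bus-Invert idea listed above, the following is a minimal Python sketch, not the encoder evaluated in the talk. It inverts a word whenever more than half of the data lines would otherwise switch and signals that on one extra invert line. The function names, the all-zero initial bus state, and the choice to compare data lines only are assumptions made for this example.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Encode words for a `width`-bit bus; returns (bus_value, invert_flag) pairs.

    Assumption: the bus starts at all zeros and only the data lines
    (not the invert line) enter the Hamming-distance comparison.
    """
    mask = (1 << width) - 1
    prev_bus = 0
    encoded = []
    for word in words:
        word &= mask
        if hamming(prev_bus, word) > width / 2:
            bus, inv = ~word & mask, 1   # send the complement, raise invert
        else:
            bus, inv = word, 0           # send the word unchanged
        encoded.append((bus, inv))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(bus ^ mask) if inv else bus for bus, inv in encoded]


def count_transitions(encoded):
    """Total line toggles, counting the invert line, from an all-zero start."""
    total = 0
    prev_bus, prev_inv = 0, 0
    for bus, inv in encoded:
        total += hamming(prev_bus, bus) + (prev_inv ^ inv)
        prev_bus, prev_inv = bus, inv
    return total
```

For the 8-bit sequence 0x00, 0xFF, 0x00, 0xFF, this sketch drives 3 line transitions (all on the invert line) instead of the 24 an unencoded bus would see.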

Recent work: Code Compression
- Lv2003: dictionary-based opcode compression
- Petrov2004: major loops compression
- Drawbacks: complex codec units, performance overhead

Probe: Instruction Memory Data Bus?
Switching Pattern: IMDB vs. IMAB vs. DMDB
[Figure: bit switching probability]
IMDB:
- Generally random, though not typical
- BI encoding potentially applicable

Probe: BI Encoding for IMDB
- BI encoding requirement: Hamming distance (number of bit switches) > bus bit width / 2
[Figure: probability of BI being applicable (MiBench qsort) for NP [39..0], P1 [39..30], P2 [29..20], P3 [19..10], P4 [9..0]]

Probe: IMDB Partition for BI Encoding
- Instruction fields: Op [39..32], Rs [31..24], Rt [23..16], Imm [15..0]
- To apply BI on IMDB, partition first
- Based on the instruction fields
- Instruction type of major loops
- Partition as starting point, searching for most active bus lines

Searching Bus Lines for BI
- High Correlation Coefficient based: proposed in Shin2001, Partial Bus-Invert (PBI) coding for address bus
- Line Pairs Correlation: IMAB vs. IMDB
- Explore Hamming Distance directly, window-based searching

Window-based Segment Search
- Searching windows of different sizes and bit positions
- One segment for each partition (P i-1, P i, P i+1)
- For each window, use: δ = E(HD) + Dev(HD)
- Larger E(HD) means more bus lines switch per bus cycle
- Larger δ => better switching reduction
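The window score above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the search procedure from the talk: it reads the mean term as the mean per-cycle Hamming distance inside the window, takes Dev to be the population standard deviation, and compares only windows of one fixed size, since the slides do not say how windows of different sizes are compared. The names `window_hd`, `delta`, and `best_window` are hypothetical.

```python
from statistics import mean, pstdev


def window_hd(trace, lo, size):
    """Per-cycle Hamming distance restricted to bits lo .. lo+size-1."""
    mask = ((1 << size) - 1) << lo
    return [bin((a ^ b) & mask).count("1") for a, b in zip(trace, trace[1:])]


def delta(trace, lo, size):
    """Score of one window: mean switching plus its deviation.

    Assumption: Dev is the population standard deviation.
    """
    hd = window_hd(trace, lo, size)
    return mean(hd) + pstdev(hd)


def best_window(trace, part_lo, part_hi, size):
    """Lowest bit of the best `size`-bit window inside bits part_lo..part_hi."""
    candidates = range(part_lo, part_hi - size + 2)
    return max(candidates, key=lambda lo: delta(trace, lo, size))
```

A call such as `best_window(trace, 0, 9, 4)` would search partition P4 [9..0] for its most active 4-bit segment, where `trace` is a list of instruction words observed on the bus (it needs at least two words).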

Results: Switching Reduction
[Figure: average reduction [%] and control lines for BI, PBI, MPBI, SBIC]

Results: Power Evaluation
[Figure: power saving [%] and switching reduction [%]]

Introduction
Observations for embedded applications
- They consist of numerous basic loops whose executions are of high locality
- Such locality can be exploited for cache tag reduction
Objective: reduced tag array & tag compare for low power
- A Reduced One-Bit Tag Instruction Cache (ROBTIC)
- Power optimization without performance sacrifice

Design -- Tag size vs cache coverage
Cache coverage
- Definition: cache-mappable address space of the main memory
- A full-tag cache has full memory coverage
- A 1-bit tag cache has partial memory coverage: CC1 or CC2 or CC3

Design -- Dynamic cache coverage control
- Regions are identified by the cache coverage index: CC-index
- It changes dynamically with the PC address during program execution
Overlapped cache coverage
- To avoid the Ping-Pong effect and exploit the locality of loops
- Overlapped CC1 & CC2: instructions for B2 can be retained when cache coverage changes from CC1 to ...

Design -- ROBTIC architecture
Features
- 1-bit tag for each cache entry
- Cache operational control unit: uses 2 bits of PC and the current instruction type to control the cache operation
- Standard cache size of ROBTIC: 32
- Surveys of benchmarks: > 90% loops contain no more than ...

Design -- Cache coverage shift & detection
- Three consecutive coverages need to be differentiated for a Shift, which means 2 LSBs (CC-Flops) of the full tag are sufficient
- With Gray-encoded CC-Flops, adjacent coverages can be differentiated by 1 common value bit (V) and its bit position (VP) in a coverage

Design -- Dynamic cache coverage control
Three operation states
- Normal Cache Access: like a traditional cache operation
- Cache flush: invalidates all cache entries
- Cache coverage shift (Shift): moves the cache coverage to its neighboring region by an offset of the cache size

Experiments
Setup
- Processor: MIP2 ISA
- Benchmarks: MiBench, Powerstone, some kernel-like programs
Evaluation metrics
- Design cost, power
- Performance (hit rate)
Comparison
- Conventional I-cache
- 5-bit partial tag compare (PT-5)

Results
ROBTIC with standard size of 32
- Reduction: 30.9% of area and 2.1% of delay (normalized to a full-tag traditional I-cache)
- Performance: for most applications, it achieves the same hit rate as a conventional cache
- Power reduction: ROBTIC 25.8% on average and 27.8% at maximum; PT-5 1% on average and 2.18% at maximum
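Returning to the coverage shift & detection design: the property it relies on is that consecutive 2-bit Gray codes differ in exactly one bit, so each pair of adjacent coverages shares exactly one bit, and that shared bit's value and position identify the pair. The short Python sketch below checks this for a 2-bit code. It is only an illustration of the Gray-code property, not the ROBTIC hardware; the function names and the wrap-around over four regions are assumptions.

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def common_bit(cov_a: int, cov_b: int, bits: int = 2):
    """Return (V, VP) for two adjacent coverage indices.

    V is the value of the one bit both Gray codes share and VP is its
    position. Assumption: indices wrap modulo 2**bits, and bits == 2 so
    that adjacent codes share exactly one bit.
    """
    mask = (1 << bits) - 1
    ga, gb = gray(cov_a & mask), gray(cov_b & mask)
    same = ~(ga ^ gb) & mask       # bit positions where both codes agree
    vp = same.bit_length() - 1     # position of the shared bit
    v = (ga >> vp) & 1             # value of the shared bit
    return v, vp


# (V, VP) for each pair of neighbouring coverages 0-1, 1-2, 2-3, 3-0
pairs = {(i, (i + 1) % 4): common_bit(i, i + 1) for i in range(4)}
```

In this sketch the four neighbouring pairs map to four distinct (V, VP) combinations, which is why two Gray-coded bits are enough to tell adjacent coverages apart.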

Research Problem
- Caching decoded instructions for most loops, including large, complicated and nested loops, to avoid repeated instruction fetching and decoding operations as much as possible

Design Overview
[Figure: pipeline with DLIC. Labels: IF, IF/IDstall, load DLIC, ID, MUX, EX, EXstall, EXsrc, MEM, WB]
Decoded Instruction Loop Cache (DLIC):
- Able to cache large, complicated loops
- Efficient: great energy savings and low overhead

Design Approach
Hardware/software co-design
- Using software to control the operation of the DLIC, to reduce the complexity of hardware design
- Using customized hardware design, to reduce the area and power consumption overhead

Software Design: Control Flow
[Figure: control flow with MUX and inner loops]

Hardware Design: Hierarchical Cache Table
[Figure: Instruction Format, Decoded Instruction Word Format, and DLIC Overall Architecture. Tables: DLIC Index Table, Control Word Dictionary Table, Branch Cache Target Table. Field labels: opcode, operand, flag, c_index, control word, dlic_index, branch memory target address, branch cache target address]

Experimental Setup
[Figure: evaluation flow. Labels: ISA (PISA), Application, ASIPmeister, SimpleScalar GCC, I-cache, Memory, VHDL (Syn.), VHDL (Sim.), Object code, CACTI, Synopsys Design Compiler, DLIC, ModelSim, execution trace, HW/SW co-design]
- HW eval.: area, energy, delay
- SW eval.: performance profiling

Results: Reduction of Instr. fetch and decode
[Figure: DIB vs. DLIC for adpcm, bcnt, blowfish, crc32, des, jpeg, qsort, rawcaudio, rawdaudio, rc4, rijndael, salsa, sha, stringsearch, AVG]

Results: Energy consumption
[Figure: normalized energy consumption for adpcm, bcnt, blowfish, crc32, des, jpeg, qsort, rawcaudio, rawdaudio, rc4, rijndael, salsa, sha, stringsearch, AVG]

Results: Performance overhead
[Figure: normalized execution cycles, DIB vs. DLIC, for adpcm, bcnt, blowfish, crc32, des, jpeg, qsort, rawcaudio, rawdaudio, rc4, rijndael, salsa, sha, stringsearch, AVG]

Conclusions & Future work
SBIC: reducing instruction data bus switching power
- Little randomness/correlation can be exploited by existing bus encodings
- Profiling-based SBIC gives more reduction and less overhead
ROBTIC: reducing instruction cache power
- A 1-bit tag cache for applications of high spatial/temporal locality
- Similar performance to a full-tag cache, with power and area reduced by 26% and 31%
DLIC: reducing instruction fetch/decode power
- Software-controlled SPM-like structure for decoded loops
- 66% (up to 87%) energy saved with a performance overhead of 1.4%
Potential Extension
- Consider bus coupling in SBIC coding
- Loop cache design for loops having procedure/function calls
- A framework that incorporates these low-power techniques for design automation

Thank you!

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Dan Nicolaescu Alex Veidenbaum Alex Nicolau Dept. of Information and Computer Science University of California at Irvine