Bespoke Processors for Applications with Ultra-low Area and Power Constraints

1 Bespoke Processors for Applications with Ultra-low Area and Power Constraints, by Cherupalli et al., ISCA '17. Jielun Tan, Tim Wesley

2 Overview: Motivation; Intro to Bespoke; Benchmarks and Results; Discussion

3 General Purpose CPUs in ULP: Ultra-low-power (ULP) applications (IoT, wearables, implantables) typically use small, general-purpose microprocessors, whose development cost is amortized. Most capabilities of these processors are never used by the application, yet the unused gates still drain power and take up area.

4 What about ASICs and FPGAs? Both are expensive to develop. ASICs: different IPs are required for different applications, and they are expensive at small scales. FPGAs: often larger than needed, to accommodate programmability, and may still use too much power.

5 Algorithm Usage Examples

6 Bespoke Processors: Tuning Process. Bespoke processor design flow: first, use traditional module-level removal; next, use input-independent gate activity analysis; finally, cut and stitch the netlist to form the final design.

7 Input-Independent Gate Activity Analysis
1. Load the binary into memory.
2. Set the application inputs to Xs (unknown values).
3. After each cycle is simulated, mark the gates that toggled as "keep".
4. If an X propagates to the PC, there is a possible branch.
   a. Explore all possible branch paths, depth-first.
   b. Remember the most conservative state (the one with the most Xs).
      i. Take the union of the gates of the branches if the most conservative state is missing a few.
   c. If a branch is re-encountered:
      i. Skip the check if the current state is a substate of that most conservative state.
      ii. Merge the lists of activated gates and make the result the new conservative state.
5. Pass the list of all gates that are never toggled, along with their constant values, to the cut-and-stitch function.
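Steps 2, 3, and 5 above can be illustrated with a small three-valued (0, 1, X) logic simulation. This is only a sketch under stated assumptions: the netlist format, the gate names, and the function names `eval_gate` and `untoggled_gates` are hypothetical, the netlist is purely combinational, and the branch exploration of step 4 is omitted.

```python
X = "X"  # unknown value standing in for any possible application input


def eval_gate(kind, ins):
    """Evaluate one gate under three-valued logic (0, 1, X)."""
    if kind == "NOT":
        return X if ins[0] == X else 1 - ins[0]
    if kind == "AND":
        if 0 in ins:          # a 0 forces the output regardless of Xs
            return 0
        return X if X in ins else 1
    if kind == "OR":
        if 1 in ins:          # a 1 forces the output regardless of Xs
            return 1
        return X if X in ins else 0
    raise ValueError(f"unsupported gate kind: {kind}")


def untoggled_gates(netlist, cycles):
    """Return {gate: constant} for gates that never toggle.

    netlist: list of (name, kind, input_names) in topological order.
    cycles:  list of dicts mapping each primary input to 0, 1, or X.
    A gate is marked "keep" if its output is ever X or ever changes.
    """
    first_value = {}
    keep = set()
    for inputs in cycles:
        values = dict(inputs)
        for name, kind, ins in netlist:
            out = eval_gate(kind, [values[i] for i in ins])
            values[name] = out
            if out == X or first_value.setdefault(name, out) != out:
                keep.add(name)
    return {g: v for g, v in first_value.items() if g not in keep}


# Hypothetical netlist: "en" is fixed by the program, "a" and "b" are inputs.
netlist = [
    ("n1", "AND", ["a", "en"]),
    ("n2", "OR", ["n1", "b"]),
    ("n3", "NOT", ["en"]),
]
cycles = [{"a": X, "b": X, "en": 0}, {"a": X, "b": X, "en": 0}]
constants = untoggled_gates(netlist, cycles)
print(constants)  # {'n1': 0, 'n3': 1}
```

Here `n1` and `n3` hold a known constant for every possible input, so they are candidates for removal, while `n2` depends on an X and must be kept.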

8 Cutting and Stitching
1. After X propagation, untoggled gates are removed from the netlist and replaced by a constant voltage.
2. Rerun logic synthesis for further optimizations.
   a. Typically, gates that have constant inputs can be reduced to even simpler logic.
3. Place and route (this step is not optimized any further).
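The effect of steps 1 and 2 can be sketched as constant propagation over a toy netlist. In a real flow the re-synthesis is done by a logic synthesis tool; the function `cut_and_stitch`, the tuple-based netlist format, and the example gates below are hypothetical and only illustrate the idea for AND, OR, and NOT gates.

```python
def cut_and_stitch(netlist, constants):
    """Remove constant gates and simplify the gates they feed.

    netlist:   list of (name, kind, inputs) in topological order.
    constants: {gate_name: 0 or 1} for gates that never toggle.
    Returns (kept_gates, known_constants, aliases), where an alias maps a
    removed gate to the single signal that is equivalent to it.
    """
    known = dict(constants)
    alias = {}
    kept = []

    def resolve(sig):
        sig = alias.get(sig, sig)
        return known.get(sig, sig)

    for name, kind, ins in netlist:
        if name in known:
            continue  # cut: the gate is replaced by its constant value
        ins = [resolve(i) for i in ins]
        if kind == "NOT":
            if ins[0] in (0, 1):
                known[name] = 1 - ins[0]
            else:
                kept.append((name, kind, ins))
            continue
        if kind not in ("AND", "OR"):
            raise ValueError(f"unsupported gate kind: {kind}")
        # controlling value forces the output; identity value can be dropped
        control, identity = (0, 1) if kind == "AND" else (1, 0)
        if control in ins:
            known[name] = control
            continue
        live = [i for i in ins if i != identity]
        if not live:
            known[name] = identity
        elif len(live) == 1:
            alias[name] = live[0]  # gate degenerates to a wire
        else:
            kept.append((name, kind, live))
    return kept, known, alias


# Hypothetical example: n1 and n3 were found to be constant.
netlist = [
    ("n1", "AND", ["a", "en"]),
    ("n2", "OR", ["n1", "b"]),
    ("n3", "NOT", ["en"]),
    ("n4", "AND", ["n2", "n3"]),
    ("n5", "OR", ["n4", "c"]),
]
kept, known, alias = cut_and_stitch(netlist, {"n1": 0, "n3": 1})
print(kept)   # [('n5', 'OR', ['b', 'c'])]
print(alias)  # {'n2': 'b', 'n4': 'b'}
```

In this toy case five gates shrink to one: `n2` and `n4` become plain wires to `b` once their constant inputs are folded away.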

9 Input-independent Gate Activity Analysis Example

10 Benchmarks: The baseline is openMSP430 in TSMC 65nm, operating at 1 V and 100 MHz, with bare-metal simulation or FreeRTOS. The baseline is either completely general purpose or traditionally optimized for an application by removing modules. Each benchmark is then run on a bespoke processor optimized for that benchmark: all unused modules are removed, and X propagation and cut-and-stitch are performed.

11 Used Gates per Benchmark

12 Results: Reduction in gate count, area, and power for a bespoke design vs. the unmodified baseline

13 Results: Reduction in gate count, area, and power for a bespoke design vs. the module-optimized baseline

14 Results

15 Multiple Programs: To support multiple programs, run the bespoke tuning process on each program and take the union of the results. Ceiling at 80%... the test suite does not activate all gates.
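Taking the union of per-program results can be sketched with plain sets. The program names, gate names, and the total gate count below are made-up illustrative values, not data from the paper.

```python
def union_design(gate_sets):
    """Gates a multi-program bespoke design must keep: the union of all sets."""
    kept = set()
    for gates in gate_sets.values():
        kept |= gates
    return kept


def gate_savings(total_gates, kept):
    """Fraction of the baseline gates that can be removed."""
    return 1 - len(kept) / total_gates


# Hypothetical per-program sets of toggled gates.
gate_sets = {
    "binsearch": {"g1", "g2", "g3"},
    "fft": {"g2", "g3", "g4"},
}
kept = union_design(gate_sets)
print(sorted(kept))            # ['g1', 'g2', 'g3', 'g4']
print(gate_savings(8, kept))   # 0.5
```

Each added program can only grow the kept set, so the savings of a multi-program design never exceed those of its most demanding single program.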

16 In-Field Updates: Bug fixes may need to be deployed, which may change the set of toggled gates. The Milu mutation testing tool is used to emulate changes in the program for future updates. Type I: conditional operator changes (AND -> OR). Type II: computation operator mutants (add -> multiply). Type III: loop conditional operator mutants (less than -> less than or equal to).

17 Coverage for In-Field Updates: Between 25% and 100% of the mutants of each type are covered, and 70% of all mutants of all types are covered. If mutants are significantly different, they can be considered independent programs. Overhead is between 1% and 40%. Total area reductions are between 23% and 66%, and total power reductions are between 13% and 53%.
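One way to read "covered" is that a mutant runs correctly on the bespoke design when every gate it toggles is still present. Under that assumption, which the slides do not spell out, a coverage check reduces to a subset test. The function name `covered_mutants` and all gate and mutant names are hypothetical.

```python
def covered_mutants(design_gates, mutant_gates):
    """Names of mutants whose toggled gates all exist in the design.

    Assumes a mutant is "covered" exactly when its gate set is a subset
    of the gates kept in the bespoke design.
    """
    return sorted(
        name for name, gates in mutant_gates.items() if gates <= design_gates
    )


# Hypothetical data: gates kept for the original program and per mutant.
design_gates = {"g1", "g2", "g3"}
mutant_gates = {
    "m1": {"g1", "g2"},   # needs only gates the design already has
    "m2": {"g2", "g4"},   # needs g4, which was removed
    "m3": {"g3"},
}
covered = covered_mutants(design_gates, mutant_gates)
print(covered)  # ['m1', 'm3']
print(f"{len(covered)} of {len(mutant_gates)} mutants covered")
```

An uncovered mutant such as `m2` is the case where the design would have to be widened, for example by treating the mutant as an additional program.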

18 Coverage for In-Field Updates, cont.: An instruction that can be executed in one program is not necessarily executable in another program. For example, a particular ADD instruction may use only 16 bits of a 32-bit ALU. A tailored bespoke processor can support arbitrary software updates by supporting a Turing-complete instruction (e.g. subneg) or a set of such instructions. A program written using a Turing-complete instruction can consist solely of that instruction.
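To make the single-instruction idea concrete, here is a minimal interpreter for subneg (subtract and branch if negative), followed by an addition built from two subneg instructions. The encoding is an assumption for illustration: the program is a list of `(a, b, target)` triples kept separate from a dictionary-based data memory, and execution halts when the program counter leaves the program.

```python
def run_subneg(program, mem, max_steps=10_000):
    """Execute a subneg program and return the final data memory.

    Each instruction (a, b, target) performs mem[b] -= mem[a] and then
    jumps to `target` if the result is negative, else falls through.
    """
    pc = 0
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            return mem  # ran off the end of the program: halt
        a, b, target = program[pc]
        mem[b] -= mem[a]
        pc = target if mem[b] < 0 else pc + 1
    raise RuntimeError("step limit reached")


# b = a + b, using a scratch cell z that starts at 0:
#   z = z - a        (z becomes -a)
#   b = b - z        (b becomes b + a)
add_program = [
    ("a", "z", 1),  # both outcomes continue at instruction 1
    ("z", "b", 2),  # both outcomes continue at 2, which halts
]
result = run_subneg(add_program, {"a": 3, "b": 4, "z": 0})
print(result["b"])  # 7
```

Because every program is a sequence of the same instruction, a design that keeps all gates subneg can exercise would remain able to run any later update expressed this way.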

19 System Code: Application analysis of the system code for FreeRTOS shows that 57% of the gates are never used by the OS. When benchmarks are evaluated individually with FreeRTOS, 37% of gates are unused in the worst case and 49% are unused on average. Running 15 benchmarks on top of FreeRTOS still shows 27% of gates unused.

20 Generality and Limitations: Hardware with non-deterministic behavior needs additional techniques to be bespoke tuned: branch predictors, caches, speculative operations, and out-of-order cores. Xs need to be injected as the results of branch predictions, tag checks, and values where speculation may be used. Extending the X-prop process to explore data flow graphs may allow analysis of OoO cores to work.

21 Discussion Points
1. All of the examples they tested are just algorithms such as binary search or FFT. But actual applications, even in IoT and smaller, typically do more than just, e.g., binary search. Do bespoke-tuned processors have any value for real-world programs?
2. Is using Milu and adding mutations representative of what in-field updates would actually change?
3. Can the bespoke tuning process be used for lowering the power consumption of high-performance accelerators?
4. Is bespoke tuning better or worse for certain cases than technologies such as HLS, Simulate-and-Eliminate, or just making an ASIC design?

22 Related Works
High-Level Synthesis: Additional development costs. New high-level specs of application behavior need to be defined, and the high-level spec also needs to be verified. C to ASICs is very difficult to do, especially to do efficiently. Unlikely to support multiple applications or in-field updates.
Simulate-and-Eliminate: Simulates the target application with a user-provided set of inputs on multiple base designs. Requires significant user input. Only considers high-level, manually identified components. Relies on user inputs to determine unused components, so the user may forget a test case!
