EECS150 - Digital Design Lecture 13 - Accelerators. Recap and Outline

Oct. 10, 2013
Prof. Ronald Fearing
Electrical Engineering and Computer Sciences, University of California, Berkeley
(slides courtesy of Prof. John Wawrzynek)
http://www-inst.eecs.berkeley.edu/~cs150

Recap and Outline
[Figure: project block diagram. Labels: SRAM, USB WebCam, Host PC, VGA Interface, Frame Buffer, DVI Interface]
Note: partners, HW/project
Overview of MicroBlaze + feature detection
Hardware acceleration/co-processors

Motivation
90/10 rule: often 90 percent of the program runtime and energy is consumed by 10 percent of the code (the inner loops).
Only small portions of an application become the performance bottlenecks.
Usually, these portions of code are data-processing intensive with relatively fixed dataflow patterns (little control): cryptography, graphics, video, communications signal processing, networking, ...
The other 90 percent of the code is not performance critical: UI, control, glue, exceptional cases, ...
A hardware accelerator/economizer implements specialized circuits for the inner loops.
The processor packs the noncritical portions (90% of the code, 10% of the computation) into minimal space.
Hybrid: processor core + hardware accelerator.

Energy Efficiency of CPU versus ASIC versus FPGA
ASIC vs. CPU: 500x. Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. SIGARCH Comput. Archit. News, 38:37-47, June 2010.
ASIC vs. FPGA: 7x. Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA '06, pages 21-30, New York, NY, USA, 2006. ACM.
FPGA : CPU = 70x (roughly 500/7)
Similar story for performance efficiency.
(Wawrzynek, ReConFig, 12/14/2010)
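The 90/10 rule and the efficiency ratios above can be checked with a few lines of arithmetic. The sketch below applies Amdahl's law, which the slides do not name explicitly; the function names and the sample speedup factors are illustrative assumptions, not figures from the lecture.

```python
def overall_speedup(hot_fraction, accel_speedup):
    """Amdahl's law: total speedup when only hot_fraction of the
    original runtime is sped up by a factor of accel_speedup."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / accel_speedup)


def fpga_vs_cpu(asic_vs_cpu=500.0, asic_vs_fpga=7.0):
    """Combine the two gaps quoted on the slide into an FPGA : CPU ratio."""
    return asic_vs_cpu / asic_vs_fpga


if __name__ == "__main__":
    # 90% of the runtime lives in the inner loops (the 90/10 rule).
    for factor in (2, 10, 100):  # hypothetical accelerator speedups
        print(factor, round(overall_speedup(0.9, factor), 2))
    print(round(fpga_vs_cpu(), 1))
```

However fast the accelerator is, the untouched 10 percent of the runtime caps the overall speedup below 10x, which is why the remaining code still needs a reasonable processor.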

Why is HW more efficient than processors?
Performance/cost or Energy/op
1. Exploit problem-specific parallelism, at the thread and instruction level.
2. Custom instructions match the set of operations needed for the algorithm (replace multiple instructions with one), custom word-width arithmetic, etc.
3. Remove the overhead of instruction storage and fetch, and of ALU multiplexing.
What about FPGAs?

System on Chip Example
Three ARM cores, plus lots of accelerators
Targets smart phones

Processors in FPGAs
Xilinx Zynq
Altera: Dual-Core ARM Cortex-A9 MPCore Processor

Xilinx: MicroBlaze Soft Processor
Altera: Nios, MIPS

Custom Hardware in the Pipeline

Custom Instructions
Example: Tensilica product
A special language, TIE, is used for defining special function units.
The custom architecture is automatically compiled, e.g. custom SIMD instructions.
Compiler support is challenging.
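To make the idea of a custom SIMD instruction concrete, here is a small software model of one: a packed add over four 8-bit lanes held in a 32-bit word. This is not TIE syntax and not a Tensilica API; the function name and lane layout are assumptions chosen for illustration.

```python
def simd_add_u8x4(a, b):
    """Model of a hypothetical custom instruction: add four unsigned
    8-bit lanes packed into 32-bit words, wrapping within each lane."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        # Keep only 8 bits so a carry never spills into the next lane.
        result |= (lane_sum & 0xFF) << shift
    return result


if __name__ == "__main__":
    print(hex(simd_add_u8x4(0x01FF1020, 0x01010101)))
```

In software this takes a loop of shifts, masks, and adds; in hardware the four lane adders sit side by side and finish in one cycle, which is the "replace multiple instructions with one" argument from the earlier slide.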

Tightly Coupled Co-processor
MicroBlaze: Fast Simplex Links (FSL)
Similar to MIPS coprocessor model

FSL: Fast Simplex Link

MicroBlaze Fast Simplex Links

Fast Simplex Link
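A Fast Simplex Link is a one-way FIFO channel between the processor and a co-processor. The sketch below is a software model only, not the MicroBlaze instruction set or any Xilinx API: the class name, the FIFO depth of 16, and the squaring co-processor are all assumptions made for illustration. It shows the non-blocking style of use, where a put fails when the FIFO is full and a get fails when it is empty.

```python
from collections import deque


class SimplexLink:
    """One-directional FIFO carrying 32-bit words plus a control bit."""

    def __init__(self, depth=16):  # depth is an assumed value
        self.depth = depth
        self.fifo = deque()

    def try_put(self, word, control=False):
        """Non-blocking write; returns False when the FIFO is full."""
        if len(self.fifo) >= self.depth:
            return False
        self.fifo.append((word & 0xFFFFFFFF, control))
        return True

    def try_get(self):
        """Non-blocking read; returns None when the FIFO is empty."""
        if not self.fifo:
            return None
        return self.fifo.popleft()


def accelerator_step(to_accel, from_accel):
    """Hypothetical co-processor: squares one word per step, if it can."""
    if not to_accel.fifo or len(from_accel.fifo) >= from_accel.depth:
        return
    word, control = to_accel.try_get()
    from_accel.try_put(word * word, control)


def run(values):
    """Processor side: stream values out on one link, results back on another."""
    to_accel, from_accel = SimplexLink(), SimplexLink()
    pending, results = list(values), []
    while len(results) < len(values):
        if pending and to_accel.try_put(pending[0]):
            pending.pop(0)
        accelerator_step(to_accel, from_accel)
        item = from_accel.try_get()
        if item is not None:
            results.append(item[0])
    return results


if __name__ == "__main__":
    print(run([1, 2, 3]))
```

Two links are needed for a round trip because each link carries data in one direction only.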

Memory Mapped Accelerator
Memory mapped control/data registers

Memory Mapped Accelerator: Common Variations
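With memory-mapped control/data registers, the processor drives the accelerator using ordinary loads and stores. The following is a minimal simulation under stated assumptions: the register offsets, the start and done bits, the multiplier function, and the latency are all hypothetical and do not describe any real device.

```python
# Hypothetical register map (byte offsets from the accelerator base address).
CTRL, STATUS, OPERAND_A, OPERAND_B, RESULT = 0x00, 0x04, 0x08, 0x0C, 0x10
START_BIT = 0x1
DONE_BIT = 0x1


class MultiplyAccelerator:
    """Simulated device: multiplies two operands after a fixed latency."""

    def __init__(self, latency=3):
        self.regs = {CTRL: 0, STATUS: 0, OPERAND_A: 0, OPERAND_B: 0, RESULT: 0}
        self.latency = max(1, latency)
        self.countdown = 0

    def write(self, offset, value):
        if offset == CTRL and value & START_BIT:
            self.regs[STATUS] = 0  # clear done, begin work
            self.countdown = self.latency
        elif offset in (OPERAND_A, OPERAND_B):
            self.regs[offset] = value & 0xFFFFFFFF

    def read(self, offset):
        self._tick()  # each bus read lets one simulated cycle pass
        return self.regs[offset]

    def _tick(self):
        if self.countdown > 0:
            self.countdown -= 1
            if self.countdown == 0:
                product = self.regs[OPERAND_A] * self.regs[OPERAND_B]
                self.regs[RESULT] = product & 0xFFFFFFFF
                self.regs[STATUS] = DONE_BIT


def multiply(device, a, b):
    """Driver: store operands, set start, poll the done bit, load the result."""
    device.write(OPERAND_A, a)
    device.write(OPERAND_B, b)
    device.write(CTRL, START_BIT)
    polls = 0
    while not device.read(STATUS) & DONE_BIT:
        polls += 1
    return device.read(RESULT), polls


if __name__ == "__main__":
    print(multiply(MultiplyAccelerator(), 6, 7))
```

Polling keeps the driver simple but ties up the processor while the accelerator works; an interrupt on completion is the usual alternative.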

CPU/Accelerator Shared Memory
The processor instructs the accelerator to independently access memory and perform work.
How does the processor synchronize with the accelerator (how does it know when it is done)?
The data cache on the CPU creates a coherency issue.
What about a cache in the accelerator?

Tightly Coupled Co-processor
MIPS: load/store to/from coprocessor, coprocessor op
Memory mapped control/data registers
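The coherency issue can be reproduced in a toy model. The sketch below assumes a write-back CPU cache and an accelerator that reads and writes main memory directly; the class, the memory layout, and the done-flag convention are hypothetical and chosen only to show why the processor must flush before starting the accelerator and invalidate before reading its results.

```python
class WriteBackCache:
    """Toy CPU data cache: writes stay in the cache until flushed."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> cached value

    def write(self, addr, value):
        self.lines[addr] = value

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def flush(self):
        """Write every cached value back to main memory."""
        for addr, value in self.lines.items():
            self.memory[addr] = value

    def invalidate(self):
        """Drop cached copies so later reads go to main memory."""
        self.lines.clear()


def accelerator_sum(memory, src, count, dst, done_flag):
    """Hypothetical accelerator: sums a buffer directly in main memory."""
    memory[dst] = sum(memory[src:src + count])
    memory[done_flag] = 1  # completion signalled through shared memory


def offload_sum(values, flush_before_start):
    DST, DONE = 14, 15  # assumed locations for the result and the done flag
    memory = [0] * 16
    cache = WriteBackCache(memory)
    for i, v in enumerate(values):
        cache.write(i, v)  # the CPU fills the input buffer
    if flush_before_start:
        cache.flush()  # make the inputs visible to the accelerator
    accelerator_sum(memory, 0, len(values), DST, DONE)
    cache.invalidate()  # do not trust stale copies of DST or DONE
    while cache.read(DONE) != 1:
        pass
    return cache.read(DST)


if __name__ == "__main__":
    print(offload_sum([1, 2, 3], flush_before_start=True))
    print(offload_sum([1, 2, 3], flush_before_start=False))
```

Without the flush, the accelerator sums whatever was in main memory before the CPU's writes, so the result is wrong even though the done flag is set correctly.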

Summary so far
Custom hardware in pipeline
Tightly coupled co-processor
  e.g. Fast Simplex Link
  e.g. floating point co-processor
Memory-mapped co-processor

Feature Tracking Project
[Figure: system block diagram. Labels: USB WebCam, SRAM, Host PC, VGA Interface, Frame Buffer, DVI Interface, Feature Detector, P.L.B., serial interface, MicroBlaze CPU, F.S.L., Xilinx FPGA, DDRAM (program + data memory)]
(from EECS150 Lec12-video)

MicroBlaze Block Diagram

MicroBlaze IO
DPLB: Data interface, Processor Local Bus
DLMB: Data interface, Local Memory Bus (Block RAM only)
IPLB: Instruction interface, Processor Local Bus
ILMB: Instruction interface, Local Memory Bus (Block RAM only)
MFSL 0..15: FSL master interfaces
DWFSL 0..15: FSL master direct connection interfaces
SFSL 0..15: FSL slave interfaces
DRFSL 0..15: FSL slave direct connection interfaces
DXCL: Data side Xilinx CacheLink interface (FSL master/slave pair)
IXCL: Instruction side Xilinx CacheLink interface (FSL master/slave pair)
Core: Miscellaneous signals for clock, reset, debug, and trace
M_AXI_DP: Peripheral Data interface, AXI4-Lite or AXI4 interface
M_AXI_IP: Peripheral Instruction interface, AXI4-Lite interface
M0_AXIS..M15_AXIS: AXI4-Stream interface master direct connection interfaces
S0_AXIS..S15_AXIS: AXI4-Stream interface slave direct connection interfaces
M_AXI_DC: Data side cache AXI4 interface
M_AXI_IC: Instruction side cache AXI4 interface
(>= Virtex 6)

Processor Local Bus
For frame buffer interface

Feature Detection using D.o.G.
[Figure: feature detector datapath, from VGA to MicroBlaze]
WB: convolution, D.o.G.
"A Parallel Hardware Architecture for Scale and Rotation Invariant Feature Detection," Vanderlei Bonato, Eduardo Marques, and George A. Constantinides, IEEE Trans. on Circuits and Systems for Video Technology, vol. 18, no. 12, Dec. 2008.
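A difference of Gaussians (D.o.G.) blurs the image with two Gaussian kernels of different widths and subtracts the results, which leaves blob-like features and suppresses flat regions. The sketch below is a plain software version built from separable convolutions. It is not the hardware architecture from the cited paper; the sigma values, the kernel radius, and the clamped border handling are assumptions.

```python
import math


def gaussian_kernel(sigma, radius):
    """1-D Gaussian weights over [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma, radius):
    """Separable 2-D blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows_done = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(rows_done), kernel))


def difference_of_gaussians(image, sigma_small, sigma_large, radius):
    """Narrow blur minus wide blur, pixel by pixel."""
    narrow = gaussian_blur(image, sigma_small, radius)
    wide = gaussian_blur(image, sigma_large, radius)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(narrow, wide)]


if __name__ == "__main__":
    spot = [[0.0] * 9 for _ in range(9)]
    spot[4][4] = 1.0  # a single bright pixel as a test feature
    dog = difference_of_gaussians(spot, 1.0, 2.0, 4)
    print(dog[4][4] > 0)
```

The inner multiply-accumulate loop has a fixed dataflow and no data-dependent control, which is the kind of code the lecture argues belongs in hardware.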

SRAM
PS #6, problem 2
[Figure: SRAM block with signals DATA, DOUT, ADDR, DIN]

Conclusions
Custom hardware in pipeline
Tightly coupled co-processor
  e.g. Fast Simplex Link
  e.g. floating point co-processor
Memory-mapped co-processor
MicroBlaze connections to coprocessor and frame buffer