Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms


1 Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms
Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni
Columbia University, New York, NY, USA
IEEE High Performance Extreme Computing Conference (HPEC), 2017

2 Why Accelerators?
High-performance embedded systems are heterogeneous: they include multiple general-purpose processor cores and special-function hardware accelerators.
[Figure: processor cores (generality) contrasted with accelerators (efficiency); a tiled system with Processor Cores #1 to #4.]

3 Embedded Scalable Platforms (ESP)
To balance the demand for hardware specialization with the need to maintain helpful degrees of regularity and modularity, we proposed Embedded Scalable Platforms [L. Carloni, The Case for Embedded Scalable Platforms, DAC 2016].
[Figure: ESP tile grid with memory controllers, a processor core, and I/O and miscellaneous channels.]

4 Embedded Scalable Platforms (ESP)
ESP instance for WAMI (Wide-Area Motion Imagery)
[Figure: ESP tile grid with two memory controllers, a LEON3 CPU processor core, I/O and miscellaneous channels, and the tiles WARP, GRAYSCALE, MATRIX-SUB, MATRIX-RES, SD-UPDATE, GRADIENT, MATRIX-ADD, CHANGE-DET, HESSIAN, DEBAYER, MATRIX-MUL, and STEEP-DESC.]

5 Embedded Scalable Platforms (ESP)
System-Level Design with High-Level Synthesis (HLS)
[Figure: the same WAMI tile grid, with each of the twelve tiles from WARP to STEEP-DESC. labeled HLS.]

6 Embedded Scalable Platforms (ESP)
System-Level Design with High-Level Synthesis (HLS): rapid integration and prototyping
[Figure: ESP instance for WAMI (Wide-Area Motion Imagery), showing the same tile grid as on the previous slides.]

7 Accelerators with HLS
SystemC specification of the GRAYSCALE accelerator: the GRAYSCALE interface and the GRAYSCALE logic, organized as load, compute, and store.
Private Local Memories (PLMs): a multi-bank input PLM and a multi-bank output PLM.
[Figure: the WAMI tile grid with the GRAYSCALE tile expanded into its interface, logic, and PLM banks.]
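The load, compute, and store structure and the two PLMs can be pictured with a small software model. This is only a sketch in Python, not the SystemC specification from the slides: the function names, the pixel format, and the grayscale weights are all assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), 8 bits per channel (assumed format)


def to_gray(pixel: Pixel) -> int:
    """Integer luma approximation; the weights are an assumption, not from the slides."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def run_accelerator(image: List[Pixel], plm_words: int) -> List[int]:
    """Process `image` in chunks small enough to fit the input and output PLMs."""
    result: List[int] = []
    for base in range(0, len(image), plm_words):
        # load: fill the input PLM with the next chunk of pixels
        input_plm = image[base:base + plm_words]
        # compute: read the input PLM, write the output PLM
        output_plm = [to_gray(p) for p in input_plm]
        # store: drain the output PLM back to main memory
        result.extend(output_plm)
    return result


gray = run_accelerator([(255, 255, 255), (0, 0, 0), (10, 20, 30)], plm_words=2)
```

In this model the PLM capacity (`plm_words`) only changes how many load/store rounds are needed, never the result.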

8-11 Accelerators with HLS
[Figure, built up over four slides: High-Level Synthesis (HLS) turns the SystemC specification of GRAYSCALE (interface, load/compute/store logic, input and output PLMs) into RTL under knob configurations #1, #2, #3, and #4. Each configuration is plotted on axes of cost (area) and performance (latency).]

12 Accelerators with HLS
[Figure: the RTL implementations produced by HLS, plotted on axes of cost (area) and performance (latency) and marked as Pareto optimal or Pareto dominated.]
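The split between Pareto-optimal and Pareto-dominated implementations can be sketched as a simple filter over (latency, area) points. The design points below are hypothetical numbers chosen for illustration, not results from the talk.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    name: str
    latency: float  # lower is better
    area: float     # lower is better


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    return (a.latency <= b.latency and a.area <= b.area
            and (a.latency < b.latency or a.area < b.area))


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical knob configurations (values are made up).
points = [
    DesignPoint("conf1", latency=1.0, area=4.0),
    DesignPoint("conf2", latency=2.0, area=2.0),
    DesignPoint("conf3", latency=3.0, area=3.0),  # dominated by conf2
    DesignPoint("conf4", latency=4.0, area=1.0),
]
front = pareto_front(points)
```

The quadratic scan is fine for the handful of points a manual exploration produces; a sort-based sweep would scale better for large design spaces.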

13 Standard HLS Knobs
Standard knobs provided by current HLS tools:
Knob                  Settings and Effects
Loop manipulations    Unrolls, pipelines, or breaks the body of loops
Array mappings        Maps arrays to registers or on-chip memories
Clock period          Sets the target clock period for synthesis
These knobs already enable a rich design-space exploration. However, they are not sufficient for exploring accelerators: we need other knobs to broaden the exploration.

14 Motivational Example #1
[Figure: normalized area versus normalized effective latency for DEBAYER, comparing designs synthesized with the standard knobs and with the proposed knobs; annotation: bounded by on-chip memory bandwidth.]
Limiting factor: limited bandwidth to the on-chip memory.
We need knobs to tailor the PLM to the accelerator's needs.

15 Motivational Example #2
[Figure: normalized area versus normalized effective latency for GRAYSCALE, comparing designs synthesized with the standard knobs and with the proposed knobs; annotation: bounded by off-chip memory bandwidth.]
Limiting factor: limited bandwidth to the off-chip memory.
We need knobs to operate on the communication interfaces.

16 Contributions: XKnobs
Extended knobs for High-Level Synthesis:
XKnob        Settings and Effects
PLM PORTS    Sets the on-chip memory bandwidth
DMA WIDTH    Sets the off-chip memory bandwidth
DMA CHUNK    Sets the size of the input and output PLM
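One way to picture the design space these three XKnobs span is to enumerate every combination of settings. The sketch below does that in Python; the class and function names are hypothetical, and the value lists are the settings that appear on the later slides, combined here purely for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence


@dataclass(frozen=True)
class XKnobConfig:
    plm_ports: int  # read/write ports of the input/output PLMs
    dma_width: int  # width in bits of the DMA communication channels
    dma_chunk: int  # PLM size, in elements of the stored data type


def enumerate_configs(ports: Sequence[int],
                      widths: Sequence[int],
                      chunks: Sequence[int]) -> Iterator[XKnobConfig]:
    """Yield every combination of the three XKnob settings."""
    for p, w, c in product(ports, widths, chunks):
        yield XKnobConfig(plm_ports=p, dma_width=w, dma_chunk=c)


configs = list(enumerate_configs(
    ports=(1, 2, 4, 8),
    widths=(64, 128, 256, 512),
    chunks=(256, 512, 1024, 2048),
))
```

Each configuration would correspond to one synthesis run in an exploration flow; how the settings are passed to an HLS tool is tool-specific and not shown here.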

17 XKnob #1: PLM PORTS
Sets the number of read/write ports of the input/output PLMs.
Higher values of PLM PORTS: more read/write accesses.
Higher values of PLM PORTS: higher area (more banks).
[Figure: normalized area versus normalized effective latency for DEBAYER, with series for PLM PORTS = 1, PLM PORTS = 2, and a third PLM PORTS setting whose value is missing from the source.]
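The link between more ports and more banks can be illustrated with cyclic interleaving, a common way to serve several accesses in parallel from single-port memories. The slides do not say how ESP maps ports to banks, so the mapping below is an assumption used only to show the idea, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def bank_of(index: int, num_banks: int) -> Tuple[int, int]:
    """Cyclic interleaving (assumed): element i lives in bank i % B at offset i // B."""
    return index % num_banks, index // num_banks


def conflict_free(indices: Sequence[int], num_banks: int) -> bool:
    """True if all accesses hit different banks and can be served together."""
    banks = [bank_of(i, num_banks)[0] for i in indices]
    return len(set(banks)) == len(banks)


# Four consecutive elements map to four different banks.
parallel_ok = conflict_free([8, 9, 10, 11], num_banks=4)
# Elements 0 and 4 share bank 0, so they cannot be read in the same cycle.
parallel_bad = conflict_free([0, 4], num_banks=4)
```

Under this assumption, doubling the number of banks doubles how many consecutive elements can be accessed at once, at the cost of the extra memory instances.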

18 XKnob #2: DMA WIDTH
Sets the size in bits of the DMA communication channels.
Higher values of DMA WIDTH: higher memory throughput.
Higher values of DMA WIDTH: higher area (more banks, that is, a higher number of write/read ports of the input/output PLMs).
[Figure: normalized area versus normalized effective latency for GRAYSCALE, with series for DMA WIDTH = 64, 128, 256, and 512.]
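The throughput effect of DMA WIDTH can be approximated by counting how many channel-wide words a transfer needs. This is a back-of-the-envelope sketch: it assumes elements are packed densely into each word, and the 16-bit element size and the function name are illustrative assumptions, not values from the talk.

```python
def dma_beats(num_elements: int, element_bits: int, dma_width: int) -> int:
    """Number of dma_width-bit words needed to move the data (ceiling division)."""
    total_bits = num_elements * element_bits
    return -(-total_bits // dma_width)


# 1024 elements of 16 bits each (assumed), for the DMA WIDTH values on the slide.
beats = {width: dma_beats(1024, 16, width) for width in (64, 128, 256, 512)}
```

Under these assumptions each doubling of the channel width halves the number of words transferred, which is the throughput gain the slide pairs with the area cost of wider PLM access.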

19 XKnob #3: DMA CHUNK
Sets the size of the PLM in multiples of the stored data type.
Higher values of DMA CHUNK: optimized communication.
Higher values of DMA CHUNK: higher area (for the PLM).
[Figure: two panels, without contention and with contention, each plotting normalized area versus normalized effective latency for GRAYSCALE with DMA WIDTH = 256 and series for DMA CHUNK = 256, 512, 1024, and 2048. The first panel is labeled PLM PORTS = 4/8; the same label is truncated after "4/" in the second panel.]
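The trade-off behind DMA CHUNK can be sketched with two small formulas: a larger chunk means fewer DMA transactions for the same data, but a larger PLM. The image size and the 16-bit element size below are hypothetical, as are the function names.

```python
def num_transfers(total_elements: int, dma_chunk: int) -> int:
    """DMA transactions needed when each one moves at most dma_chunk elements."""
    return -(-total_elements // dma_chunk)


def plm_bits(dma_chunk: int, element_bits: int) -> int:
    """Storage needed by one PLM that holds a full chunk."""
    return dma_chunk * element_bits


total = 512 * 512  # hypothetical image size, in elements
transfers = {chunk: num_transfers(total, chunk) for chunk in (256, 512, 1024, 2048)}
storage = {chunk: plm_bits(chunk, 16) for chunk in (256, 512, 1024, 2048)}
```

Fewer, longer transactions amortize the per-transaction overhead, which may matter most when several accelerators contend for the same memory channel.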

20 Experimental Results
We evaluate the combined effects of the XKnobs by using the GRAYSCALE accelerator, which is limited by communication, and the DEBAYER accelerator, which is limited by computation.
The other WAMI accelerators behave similarly to either the GRAYSCALE accelerator or the DEBAYER accelerator.

21 Experiment #1
We consider two XKnobs: PLM PORTS and DMA WIDTH.
GRAYSCALE accelerator (limited by communication).
[Figure: normalized area versus normalized effective latency for GRAYSCALE. Series: DMA WIDTH = 32, 64, 128, and one further value missing from the source. Point labels: PLM PORTS = 8, 4, 2, and one further value missing from the source.]

22 Experiment #1
We consider two XKnobs: PLM PORTS and DMA WIDTH.
DEBAYER accelerator (limited by computation).
[Figure: normalized area versus normalized effective latency for DEBAYER. Series: DMA WIDTH = 32, 64, 128, and one further value missing from the source. Point labels: PLM PORTS = 4, 2, and one further value missing from the source.]

23 Experiment #2
We consider two XKnobs: PLM PORTS and DMA CHUNK.
GRAYSCALE accelerator (limited by communication).
[Figure: normalized area versus normalized effective latency for GRAYSCALE, with DMA WIDTH = 256 and contention. Series: DMA CHUNK = 256, 512, 1024, and 2048. Point labels: PLM PORTS = 8, 4, 2, and one further value missing from the source.]

24 Experiment #2
We consider two XKnobs: PLM PORTS and DMA CHUNK.
DEBAYER accelerator (limited by computation).
[Figure: normalized area versus normalized effective latency for DEBAYER, with DMA WIDTH = 256 and contention. Series: DMA CHUNK = 256, 512, 1024, and 2048. Point labels: PLM PORTS = 4, 2, and one further value missing from the source.]

25 Concluding Remarks
We presented the XKnobs, a set of knobs that extends the standard knobs used in current HLS tools.
For WAMI, the XKnobs broaden the design space by up to 8.5x for performance and 3.5x for cost.
The XKnobs can be integrated into any HLS tool and design-space exploration methodology to enrich the set of Pareto-optimal implementations of hardware accelerators.

26 Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms
IEEE High Performance Extreme Computing Conference (HPEC), 2017
Thank you for your attention!
Speaker: Luca Piccolboni, Columbia University, NY
