An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

Size: px

Start display at page:

Download "An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection"

Conrad Oswin Thompson
6 years ago
Views:

1 An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013, Toshiba Corporation.

2 Executive Summary Future architecture will have many cores A key challenge : How to efficiently use them? We evaluated techniques to accelerate one type of important application (face detection) Performance scales up to 64 cores Energy efficiency is 20x better than desktop CPU 2

3 Outline Introduction Face Detection using Joint Haar-Like Features Architecture of Energy Efficient Many-Core SoC Issues in Implementing Parallelized Face Detection Implementation and Evaluation of Parallelized Face Detection On the Single Cluster On the Dual Cluster Conclusion 3

4 Outline Introduction Face Detection using Joint Haar-Like Features Architecture of Energy Efficient Many-Core SoC Issues in Implementing Parallelized Face Detection Implementation and Evaluation of Parallelized Face Detection On the Single Cluster On the Dual Cluster Conclusion 4

5 Two Key Trends in Embedded Systems Trend 1 : New applications (e.g. image recognition) need more computing power while keeping low power Trend 2 : New architecture can enable much more parallelism than before 5

6 Two Key Trends in Embedded Systems Trend 1 : New applications (e.g. image recognition) need more computing power while keeping low power Trend 2 : New architecture can enable much more parallelism than before Now : 500GOPS Heterogeneous Multi-Core CPU CPU CPU CPU Accelerator Accelerator Visconti TM 2 [ISSCC 12] 6

7 Two Key Trends in Embedded Systems Trend 1 : New applications (e.g. image recognition) need more computing power while keeping low power Trend 2 : New architecture can enable much more parallelism than before Future : More than 1TOPS Now : 500GOPS Heterogeneous Many-Core Heterogeneous Multi-Core CPU CPU CPU CPU Accelerator Accelerator Visconti TM 2 [ISSCC 12] CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Toshiba Many-Core [VLSI Sympo. 12] Accelerator Accelerator Accelerator Accelerator 7

8 Two Key Trends in Embedded Systems Trend 1 : New applications (e.g. image recognition) need more computing power while keeping low power Trend 2 : New architecture can enable much more parallelism than before Future : More than 1TOPS Now : 500GOPS Heterogeneous Many-Core Heterogeneous Multi-Core CPU CPU CPU CPU Accelerator Accelerator Visconti TM 2 [ISSCC 12] 8 CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Accelerator Accelerator Accelerator Accelerator Toshiba Many-Core [VLSI Sympo. 12] Result : A need for efficient and scalable application performance on many-core

9 Power Consumption [W] Power and Performance Target of our Many-Core Performance [GOPS] 9

10 Power Consumption [W] Power and Performance Target of our Many-Core 1000 High Performance Multi & Many-Cores for HPC 100 GHz cores Intel 80-Tile Intel SCC 10 Tilera Tile64 Cell Broadband Engine TM Performance [GOPS] 10

11 Power Consumption [W] Power and Performance Target of our Many-Core 1000 High Performance Multi & Many-Cores for HPC Intel 80-Tile 100 GHz cores Intel SCC 10 Less than 3W is needed for embedded applications Tilera Tile64 Cell Broadband Engine TM Performance [GOPS] 11

12 Power Consumption [W] Power and Performance Target of our Many-Core 1000 High Performance Multi & Many-Cores for HPC Intel 80-Tile 100 GHz cores Intel SCC 10 Less than 3W is needed for embedded applications Tilera Tile64 Cell Broadband Engine TM Renesas RP-X ARM Cortex TM -A5 Energy Efficient Embedded Multi-Cores Toshiba Multi-Core (Visconti TM 2) Performance [GOPS] 12

13 Power Consumption [W] Power and Performance Target of our Many-Core 1000 High Performance Multi & Many-Cores for HPC Intel 80-Tile 100 GHz cores Intel SCC 10 Less than 3W is needed for embedded applications Tilera Tile64 Cell Broadband Engine TM Renesas RP-X ARM Cortex TM -A5 Energy Efficient Embedded Multi-Cores Toshiba Multi-Core (Visconti TM 2) Our Target Area Performance [GOPS] 13

14 Power Consumption [W] Power and Performance Target of our Many-Core 1000 High Performance Multi & Many-Cores for HPC 100 GHz cores Intel 80-Tile Intel SCC Less than 3W is needed for embedded applications Renesas RP-X ARM Cortex TM -A5 Energy Efficient Embedded Multi-Cores Tilera Tile64 Toshiba Multi-Core (Visconti TM 2) Cell Broadband Engine TM Performance [GOPS] Toshiba Energy Efficient Many-Core (64Core) Our Target Area 14

15 Many-Core Scalability Performance The Number of Cores 15

16 Many-Core Scalability Performance Ideal The Number of Cores 16

17 Many-Core Scalability Performance Ideal Actual? The Number of Cores 17

18 Performance Many-Core Scalability Ideal Actual? The Number of Cores Can we achieve good performance scaling-up on face detection? 18

19 Outline Introduction Face Detection using Joint Haar-Like Features Architecture of Energy Efficient Many-Core SoC Issues in Implementing Parallelized Face Detection Implementation and Evaluation of Parallelized Face Detection On the Single Cluster On the Dual Cluster Conclusion 19

20 Face Detection 20

21 Face Detection 25 pixels 1st ROI 25 pixels ROI : Region of Interest Check if a face exists or not 21

22 Face Detection 25 pixels 2nd ROI 25 pixels ROI : Region of Interest 2 pixels Check if a face exists or not 22

23 Face Detection 25 pixels 3rd ROI 25 pixels ROI : Region of Interest 4 pixels Check if a face exists or not 23

24 Face Detection ROI : Region of Interest Nth ROI Check if a face exists or not 24

25 Face Detection 25 pixels 2 pixels ROI : Region of Interest (N+1)th ROI 25 pixels Check if a face exists or not 25

26 Face Detection ROI : Region of Interest Target ROI 26

27 Joint Haar-Like Features [ICCV 05] Extension to widely-used Viola and John s Method [CVPR 01] (using Haar-like features) Haar-like feature : Difference of image intensities between blue and red rectangles. ROI (Region Of Interest) Joint Haar-Like Feature Compared to each threshold If greater than : 1, otherwise :

28 Joint Haar-Like Features [ICCV 05] Extension to widely-used Viola and John s Method [CVPR 01] (using Haar-like features) Haar-like feature : Difference of image intensities between blue and red rectangles. Eye is darker than cheek ROI (Region Of Interest) Joint Haar-Like Feature Compared to each threshold If greater than : 1, otherwise :

29 Classifier using Joint Haar-Like Features Joint Haar-Like Features Possibility of face or not face Weight of the feature Face? Table Accumulate Face? Table Positions of features and tables are learned in advance and stored in the dictionary Face or Not Face 29

30 Characteristics of Face Detection Face detection for each ROI can be executed in parallel There are a lot of ROIs in an image 3M ROIs when image size is 4000x3200 A lot of coarse grain thread parallelism based on ROIs Overhead of thread scheduling can be minimized Many-core is good for face detection! 30

31 Outline Introduction Face Detection using Joint Haar-Like Features Architecture of Energy Efficient Many-Core SoC Issues in Implementing Parallelized Face Detection Implementation and Evaluation of Parallelized Face Detection On the Single Cluster On the Dual Cluster Conclusion 31

32 Chip Micrograph and Features Core 15.0mm Reconfigurable Engines Cluster 0 2MB L2 Cache 2MB L2 Cache Cluster 1 DDR3 I/F DDR3 I/F 14.0mm L2Cache Bank0 Technology Interconnect Chip Size L2Cache Bank1 Cluster Size 7.4mm L2 Cache L2 Cache Bank2 Bank3 40nm LP Process 8 metal (Cu) 15.0mm x 14.0mm 7.4mm x 5.7mm 5.7mm Transistors 87.5Million Cluster Frequency 333MHz, 1.1V Package 1369-pin FCBGA 32

33 Chip Micrograph and Features Core 15.0mm Reconfigurable Engines Cluster 0 2MB L2 Cache 2MB L2 Cache Cluster 1 DDR3 I/F DDR3 I/F 14.0mm L2Cache Bank0 Technology Interconnect Chip Size L2Cache Bank1 Cluster Size 7.4mm L2 Cache L2 Cache Bank2 Bank3 40nm LP Process 8 metal (Cu) 15.0mm x 14.0mm 7.4mm x 5.7mm 5.7mm Transistors 87.5Million Cluster Frequency 333MHz, 1.1V Package 1369-pin FCBGA 33

34 Structure of Many-Core Cluster Core Core Core 1 Core Core Core Core Core Core 1 Core Core Core Core Core Core 1 Core Core Core Core Core Core 1 Core Core Core Tree-based NoC Leaf nodes: Core Root nodes: L2 cache banks Core(Leaf) Core Core Core Core Core Core Core Core L2$(Root) Core(Leaf) L2 Cache 512 KB Bank0 L2 Cache 512 KB Bank Cluster Control Module - Hardware Semaphores - Interrupt Controller L2 Cache 512 KB Bank2 L2 Cache 512 KB Bank3 Router Four L2 Cache Banks 34

35 Many-Core SoC Architecture Image Recognition & Processing Accelerators Reconfigurable Engine x 2 Engine x 2 Many-Core Cluster 0 Core x 32 Many-Core Cluster 1 Core x 32 SRAM 512KB x 4 ARM Cortex-A9 X 2 L2 Cache 2MB L2 Cache 2MB Video In x 4 Video Out PCIe X 1 Peripherals External Interface PCIe X 1 SRAM 512KB DDR3 Controller X GB/s 35

Core : Media Processing Block (MPB) 3-Way VLIW Processor (MPB) I$ RISC Core L1 Inst Cache 32KB Address Protect Unit L1 Data Cache 16KB NoC I/F SIMD Co-Processor Debug Module Address Translate Unit

36 Core : Media Processing Block (MPB) 3-Way VLIW Processor (MPB) I$ RISC Core L1 Inst Cache 32KB Address Protect Unit L1 Data Cache 16KB NoC I/F SIMD Co-Processor Debug Module Address Translate Unit Dec./RF ALU D$ RF 32b RISC Core Dec. Cop. RF ALU Acc. ALU MUL Acc. 64b 64b 2-way SIMD Coprocessor 3-Way VLIW Processor L1 Instruction Cache: 32KB L1 Data Cache: 16KB 333 MHz Exploits multi-grain parallelism Thread level by many cores Instruction level by VLIW architecture Data level by SIMD instructions 36

37 Outline Introduction Face Detection using Joint Haar-Like Features Architecture of Energy Efficient Many-Core SoC Issues in Implementing Parallelized Face Detection Implementation and Evaluation of Parallelized Face Detection On the Single Cluster On the Dual Cluster Conclusion 37

Issues in implementing parallelized face detection High coarse-grain parallelism: Good for Parallelization There are enough ROIs to exploit by many cores Imbalanced workload: Bad for Processor

38 Issues in implementing parallelized face detection High coarse-grain parallelism: Good for Parallelization There are enough ROIs to exploit by many cores Imbalanced workload: Bad for Processor Utilization The workload of an ROI where a face exists is higher than that of an ROI without a face Implementation of parallelized face-detection Minimize the number of threads in order to reduce synchronization cost Allocate one thread to one core Find a good thread partitioning with balancing workload of threads Reduce data bandwidth (L1$-L2$ and L2$-DDR3) 38

39 Outline Introduction Face Detection using Joint Haar-Like Features Architecture of Energy Efficient Many-Core SoC Issues in Implementing Parallelized Face Detection Implementation and Evaluation of Parallelized Face Detection On the Single Cluster On the Dual Cluster Conclusion 39

40 Implementation on the Single Cluster We implemented the face detection with two methods to allocate image to cores Allocating Cyclically Splitting Equally 40

41 (1) Allocating Cyclically This way allocates lines to each core cyclically Effective in balancing workload ROI Image Core0 Core1 Core2 Core31 41

42 (2) Splitting Equally This way divides the image evenly Effective to reduce data size read by each core Image Height/32 Core0 Core1 Height Core2 Core31 42

Images for Evaluation High Resolution Images (5.76-12.7Mp) including many faces No.

43 Images for Evaluation High Resolution Images ( Mp) including many faces No. Resolution Number of Faces x x x x x x Face Detection Result of Image 4 43

44 Evaluation Board I/O and switches for evaluation Many-Core SoC (Fan-less Cooling) 44

Relative Performance Relative Performance on Single Cluster 35 30 25 20 15 10 5 Allocating Cyclically 15.

45 Relative Performance Relative Performance on Single Cluster Allocating Cyclically 15.5x 30x Splitting Equally 11x 21x img.0 img.1 img.2 img.3 img.4 img.5 ideal Number of Cores 45

46 Average Relative Performance Average Relative Performance on Single Cluster Ideal 30x x 11x 21x Allocating Cyclically Splitting Equally ideal The Number of Cores With Allocating Cyclically, performance scales up to 32 cores 46

Time (sec) Execution Time of the Fastest and Slowest Cores 45 40 35 Allocating Cyclically Splitting

47 Time (sec) Execution Time of the Fastest and Slowest Cores Allocating Cyclically Splitting Equally Fastest Core Slowest Core x x Image Number 47

48 Processor Utilization Processor Utilization(%) Allocating Cyclically : 90 ~ 95% Splitting Equally : 55 ~ 75% Allocating Cyclically Splitting Equally Image Number Low processor utilization deteriorates the performance of Splitting Equally 48

49 Relative Amount of Transferred Data Bandwidth of L2 Cache and DDR L1-L2 L2-DDR3 Allocating Cyclically Splitting Equally L1-L2 bandwidth is nearly the same L1 cache is not enough to store ROI line About L2-DDR3, Allocating Cyclically is better All cores access the small area at the same 49

50 Outline Introduction Face Detection using Joint Haar-Like Features Architecture of Energy Efficient Many-Core SoC Issues in Implementing Parallelized Face Detection Implementation and Evaluation of Parallelized Face Detection On the Single Cluster On the Dual Cluster Conclusion 50

51 Implementation on Dual Cluster Each cluster has its own L2 cache and shares DDR3 Because bandwidth is narrower than L1 and L2 cache, reducing bandwidth between L2 cache and DDR3 is important We implemented the two ways Allocating Cyclically Bisection MPB L1Cache MPBx32 MPB L1Cache MPB L1Cache MPBx32 MPB L1Cache L2 cache L2 cache DDR3 51

52 (1) Allocating Cyclically This way is the same as that of a single cluster Effective in balancing workload Core0 ROI Core0 Core1 Core1 Core2 Core2 Core31 Cluster0 Image 52 Core31 Cluster1

53 (2) Bisection This way divides the image into two blocks Height/2 Cluster0 Cluster1 Each cluster processes each block Image 53

54 (2) Bisection This way divides the image into two blocks Effective to reduce data size read by each cluster ROI Core0 Core0 Core1 Core2 ROI Core1 Core2 Core31 In each block, Allocating Cyclically is used Image Cluster0 Cluster1 54 Core31

55 Performance of Dual Cluster (64 Cores) Relative Performance x Ideal (64x) x Image Number Allocating Cyclically Bisection 55

56 Performance of Dual Cluster (64 Cores) Relative Performance x Ideal (64x) x By 0Allocating 1 Cyclically, Image Number performance scales up to 64 cores Allocating Cyclically Bisection 56

Time(sec) Execution Time of the Fastest and Slowest Cores 18 16 14 12 10 Allocating Cyclically

57 Time(sec) Execution Time of the Fastest and Slowest Cores Allocating Cyclically Bisection 2.8x Fastest Core Slowest Core x Image Number 57

58 Processor Utilization Processor Utilization (%) Allocating Cyclically : 87 ~ 95% Bisection : 66 ~ 91% Allocating Cyclically Bisection Image Number Low processor utilization deteriorates the performance of Bisection 58

59 DDR3 Bandwidth DDR3 Bandwidth (MB/s) Bandwidth in Allocating Cyclically Image Number Utilized bandwidth is 750MB/s ( only 7% of maximum (10.7GB/s ) ) Memory bandwidth is not bottleneck even when two clusters operate. 59

60 Power (mw) Power Consumption Our many-core SoC achieves less than 3W SoC : 2.21W The Number of Cores 2Clusers 1.18W Clusters Bus DDR3 IO Other Typical Process, Room Temperature, using Allocating Cyclically 60

61 Relative Performance Energy (J) Comparison with Desk-Top CPU Compared with Desk-Top CPU (Core -i7-3820: 3.6GHz, 4 Cores, 8 Threads) Performance Many-core Core -i x better 16.1 Many-core Energy Core -i TDP of Core -i (130W) is used for calculating energy 61

62 Conclusion Future architecture will have many cores A key challenge : How to efficiently use them? We evaluated the many-core SoC with parallelized face detection Many-core is suited for the face detection because it exploits ROI based coarse-grained parallelism efficiently Scale up by 30x (32 cores) to 60x (64 cores) Balancing workload is important Power consumption is only 2.21W under actual workload : enables fan-less cooling Our many-core SoC is remarkably energy efficient in image recognition applications 20x better than the desk-top CPU 62

63 Thank you! 63

Multi-Core SoCs for ADAS and Image Recognition Applications

Multi-Core SoCs for ADAS and Image Recognition Applications Takashi Miyamori, Senior Manager Embedded Core Technology Development Department Center for Semiconductor Research & Development Storage Device