An Efficient Network-on-Chip (NoC) based Multicore Platform for Hierarchical Parallel Genetic Algorithms


1 An Efficient Network-on-Chip (NoC) based Multicore Platform for Hierarchical Parallel Genetic Algorithms
Yuankun Xue (1), Zhiliang Qian (2), Guopeng Wei (3), Paul Bogdan (1), Chi-Ying Tsui (2), Radu Marculescu (3)
(1) University of Southern California, (2) HKUST, (3) Carnegie Mellon University
Symposium on Network-on-Chips, Sept.,

2 Outline
- Introduction
  - Genetic Algorithm (GA) overview
  - Hierarchical Parallel Genetic Algorithms (HPGA)
- NoC-based HPGA platform
  - Island-based HPGA-NoC
  - Performance bottleneck analysis
- Proposed HPGA architecture
  - Dynamic Injection Bandwidth Multiplexing (DIBM)
  - Time-division Island Multiplexing (TDIM)
  - Task-aware adaptive routing
- Experimental results
- Conclusions

3 Genetic Algorithm
- Genetic algorithm (GA) overview
- [Figure: GA flow. Master process (GA phase): fitness-function-based individual selection from the parent population, crossover, mutation, production of the new generation, and distribution of individuals. Slave process (DIS phase): fitness calculation Fit(S_i) for each individual S_i.]
- Genetic operations mimic natural evolution
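The two phases on this slide can be illustrated with a minimal sketch in Python. It is not the authors' implementation: the bit-string genome, the tournament selection, the elitism step, and the names `evolve`, `select`, and `fitness` are all assumptions chosen for brevity. Only the crossover and mutation rates echo the values quoted later in the experimental setup.

```python
import random

def evolve(fitness, genome_len=16, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.2, seed=0):
    """Toy GA over bit strings; returns the best individual found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # DIS phase (slave side): evaluate the fitness of every individual.
        scores = [fitness(ind) for ind in pop]

        # GA phase (master side): selection, crossover, mutation.
        def select():
            # Binary tournament selection (an assumption of this sketch).
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        best = pop[scores.index(max(scores))]
        children = [best[:]]  # elitism: keep the current best (assumption)
        while len(children) < pop_size:
            p1, p2 = select(), select()
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, genome_len)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            if rng.random() < mutation_rate:
                k = rng.randrange(genome_len)
                child[k] ^= 1  # flip one bit
            children.append(child)
        pop = children
    return max(pop, key=fitness)
```

Calling `evolve(sum)` maximizes the number of ones in the genome; in a parallel version the `scores` line is the part that would be farmed out to slave processes.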

4 Example - Protein Folding Problem
- Protein folding problem used in the experiments
  - The final conformation of a protein corresponds to its minimal-energy state
  - Fitness function based on the 3DHPSC model
  - 6 contact types from the set {BB, BH, BP, HP, HH, PP} are considered
  - Unit distance in the cubic lattice is assumed
  - A penalty is introduced for overlapped positions
- Fitness is computed from the energy H and a penalty term N * PenaltyValue, where

  H = \sum_{i=1,\, j>i+1}^{N} e_{hh}\,\delta_{r_{ij}^{hh}} + \sum_{i=1,\, j>i+1}^{N} e_{hb}\,\delta_{r_{ij}^{hb}} + \sum_{i=1,\, j>i+1}^{N} e_{bb}\,\delta_{r_{ij}^{bb}} + \sum_{i=1,\, j>i+1}^{N} e_{bp}\,\delta_{r_{ij}^{bp}} + \sum_{i=1,\, j=i+1}^{N} e_{pp}\,\delta_{r_{ij}^{pp}} + \sum_{i=1,\, j=i+1}^{N} e_{hp}\,\delta_{r_{ij}^{hp}}

  - e_{lm}: contact weight, with l, m = b, h, p
  - \delta_{r_{ij}^{lm}} = 1 if the (i, j) contact exists, 0 otherwise
- [Figure: examples of BB, BH, BP, HP, HH, and PP contacts on the lattice]
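A rough Python sketch of the contact-energy sum is given here as an illustration only. Several things are assumptions and are not taken from the slides: the weight values in `WEIGHTS`, the representation of each residue as a backbone bead plus one side-chain bead, and the simplification of applying the j > i+1 rule to all six terms. The sketch returns the energy and the overlap count separately rather than combining them into a fitness value.

```python
from itertools import product

# Hypothetical contact weights e_lm; the slides do not give numeric values.
WEIGHTS = {"hh": -10.0, "bb": -3.0, "bh": -3.0,
           "bp": 1.0, "hp": 1.0, "pp": 1.0}

def delta(a, b):
    """1 if two lattice points are at unit distance (a contact), else 0."""
    return 1 if sum(abs(x - y) for x, y in zip(a, b)) == 1 else 0

def contact_energy(residues, weights):
    """Sum weighted contacts between non-adjacent residues.

    residues: list of (backbone_xyz, side_chain_xyz, side_type),
    where side_type is 'h' or 'p' and coordinates are integer tuples.
    """
    energy = 0.0
    n = len(residues)
    for i in range(n):
        for j in range(i + 2, n):  # j > i + 1 (applied to every term here)
            beads_i = [("b", residues[i][0]), (residues[i][2], residues[i][1])]
            beads_j = [("b", residues[j][0]), (residues[j][2], residues[j][1])]
            for (ti, pi), (tj, pj) in product(beads_i, beads_j):
                key = "".join(sorted(ti + tj))  # e.g. 'hb' -> 'bh'
                energy += weights[key] * delta(pi, pj)
    return energy

def count_overlaps(residues):
    """Number of beads that share a lattice point with an earlier bead."""
    points = [p for bb, sc, _ in residues for p in (bb, sc)]
    return len(points) - len(set(points))
```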

5 Parallel Genetic Algorithm
- The computation time of a GA grows dramatically with the problem size
- Parallel Genetic Algorithm (PGA)
  - Single-master multiple-slave GA platform [E. Cantu 1998]
  - Multiple-master multiple-slave GA (HPGA) [E. Cantu 1998]
- [Figure: a) single-master multiple-slave PGA: a master process connected to slave processes 1 to n by the individual distribution flow and the fitness return flow; b) HPGA: islands, each with a master process and slave processes, connected by the migration flow among masters.]
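The migration flow among masters can be sketched as follows. This is a simplified illustration, not the scheme used in the presented platform: the ring topology, the replace-the-worst policy, and the name `migrate_ring` are assumptions.

```python
def migrate_ring(islands, fitness, n_migrants=1):
    """Copy each island's best individuals over the next island's worst.

    islands: list of populations, each a list of individuals (lists).
    Island k receives migrants from island k-1 (ring topology, assumed).
    """
    # Snapshot the emigrants first so migration is simultaneous.
    bests = [sorted(pop, key=fitness, reverse=True)[:n_migrants]
             for pop in islands]
    for idx, pop in enumerate(islands):
        incoming = bests[(idx - 1) % len(islands)]
        pop.sort(key=fitness)  # worst individuals first
        pop[:n_migrants] = [ind[:] for ind in incoming]
    return islands
```

In an HPGA run, a step like this would be invoked by the masters once every fixed number of generations, while the slaves keep serving fitness evaluations.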

6 Implementation of PGA
- Previous PGA implementations
  - Computer-cluster based PGA [C. Benitez, 2009]
    - A single master process with multiple slave processes
    - Speedup tends to saturate because the work in the single master cannot be parallelized
  - GPU-based architecture [P. Pospichal et al., 2010]
    - The migration among master processes needs to be compatible with the CUDA software model
  - NoC-based MPSoC platform [R. Ferreira et al., 2010]
    - Migration occurs only among neighboring islands
- Motivation of this work
  - Dedicated NoC architecture supporting dynamic migrations
  - Time-division multiplexing (TDM) schemes with higher utilization of the processing elements

8 Island-based HPGA-NoC
- Straightforward implementation
  - Mapping multiple islands onto the NoC
- [Figure: a) operations for each island: a master processor (M) surrounded by slave processors (S) and routers, with the individual distribution flow and the fitness return flow; b) hierarchical network for islands: Master 0 to Master 3, with migrations among the islands.]

9 Performance Bottleneck Analysis
- Limited injection bandwidth of the master processor (limitation 1)
  - Every cycle, only one flit can be sent from the master to a slave in each island
- Low utilization of slave cores (limitation 2)
- [Figure, limitation 1: breakdown of distribution and fitness calculation times [Y.-k. Xue, DAC, 2014]]
- [Figure, limitation 2: ratio of the DIS phase (slave processors used for calculation) to the GA phase (slave processors idle)]
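A back-of-the-envelope model shows why limitation 1 caps the speedup. The sketch below is an assumption-laden illustration rather than the analysis from the cited paper: it assumes one individual per slave, ignores network latency and the fitness return packets, and uses hypothetical parameter names.

```python
def dis_phase_cycles(n_slaves, flits_per_packet, calc_cycles):
    """Cycles until the last slave finishes its fitness calculation.

    The master injects one flit per cycle, so the last of n_slaves
    packets leaves after n_slaves * flits_per_packet cycles.
    """
    distribution = n_slaves * flits_per_packet
    return distribution + calc_cycles

def baseline_speedup(n_slaves, flits_per_packet, calc_cycles):
    """Speedup over one processor evaluating all individuals serially."""
    serial = n_slaves * calc_cycles
    return serial / dis_phase_cycles(n_slaves, flits_per_packet, calc_cycles)
```

Under these assumptions the speedup can never exceed `calc_cycles / flits_per_packet`, however many slaves are added, which matches the early saturation described for the baseline.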

11 Dynamic injection bandwidth multiplexing (DIBM)
- Address limitation 1 by improving the master injection bandwidth
  - Utilization of the master and slave injection bandwidth is unbalanced
  - Time-multiplex the injection bandwidth of the slave processors
- [Figure: an effective number of processors share the injection bandwidth]
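The effect of sharing injection bandwidth can be sketched with a toy schedule. This is only an illustration under strong assumptions: packets are spread round-robin over `p` injection ports, each port injects one flit per cycle, and the cost of first handing individuals to the helping processors is ignored because the slides do not describe it. The function name `dibm_schedule` is hypothetical.

```python
def dibm_schedule(n_packets, flits_per_packet, p):
    """Round-robin packets over p injection ports.

    Returns the cycle at which each packet has been fully injected.
    """
    port_free = [0] * p  # next free cycle of each injection port
    finish = []
    for k in range(n_packets):
        port = k % p
        port_free[port] += flits_per_packet
        finish.append(port_free[port])
    return finish
```

With `p = 1` this reduces to the baseline master, where 12 packets of 8 flits need 96 cycles; with `p = 3` the same packets are injected in 32 cycles.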

12 Time-division island multiplexing (TDIM) scheme
- Address limitation 2 by improving the slave processors' utilization
  - Time-share the GA phase
  - The slave idle time in one island can be used to calculate the individual fitness of other islands
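The idea can be quantified with a small sketch. It assumes fixed phase lengths and perfectly staggered islands whose DIS phases never overlap; both are simplifications, and the function names are hypothetical.

```python
def max_multiplexed_islands(t_ga, t_dis):
    """Logical islands that fit on one physical island without their
    DIS phases overlapping, given fixed GA and DIS phase lengths."""
    return (t_ga + t_dis) // t_dis

def slave_utilization(k, t_ga, t_dis):
    """Fraction of time the slaves are busy when k logical islands
    share them, each needing t_dis out of every t_ga + t_dis cycles."""
    return min(1.0, k * t_dis / (t_ga + t_dis))
```

For example, with a GA phase of 300 cycles and a DIS phase of 100 cycles, a single island keeps the slaves busy 25% of the time, while four multiplexed islands keep them fully busy.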

13 Task-aware adaptive routing
- Avoid extra delays in the TDIM scheme when two masters distribute individuals simultaneously
- During routing, packets (individuals) change their destinations adaptively
- [Figure: mesh with occupied slaves marked]

14 Task-aware adaptive routing (Cont.)
- Adaptive routing flow:
  1. The master processor sets the initial destination for the chromosome packet
  2. The chromosome packet is sent through XY routing
  3. In each intermediate router, check the availability of the local slave processor
     - Free slave and usage granted: deliver the chromosome to the current PE
     - Slave PE not free: proceed to the next hop
  4. On reaching the original destination, check the availability of the destination slave processor
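The flow above can be sketched as follows. This is a hedged illustration, not the router logic of the platform: nodes are `(x, y)` mesh coordinates, `is_free` is a hypothetical callback that stands for "slave free and usage granted", and returning `None` when the original destination is busy is an assumption, since the slides do not say what happens in that case.

```python
def xy_next_hop(cur, dst):
    """Dimension-ordered (XY) routing: move along X first, then along Y."""
    x, y = cur
    if x != dst[0]:
        return (x + (1 if dst[0] > x else -1), y)
    if y != dst[1]:
        return (x, y + (1 if dst[1] > y else -1))
    return cur

def route_chromosome(src, dst, is_free):
    """Follow the XY path from src to dst and deliver the packet to the
    first slave on the path (including dst) for which is_free is true.

    Returns (delivery_node, path); delivery_node is None if no slave on
    the path, including the original destination, accepted the packet.
    """
    cur = src
    path = [cur]
    while cur != dst:
        cur = xy_next_hop(cur, dst)
        path.append(cur)
        if is_free(cur):
            return cur, path
    return None, path
```

For instance, with only node `(1, 0)` busy, a packet sent from `(0, 0)` toward `(3, 2)` is absorbed at `(2, 0)` instead of travelling the full path.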

16 Experimental results
- Simulation setup
  - The HPGA-NoC platform is implemented in C++
  - The protein folding problem with the 3D HPSC model is used as the GA problem
- Parameters for the GA simulations:
  - 2000 generations with a population size of 2400
  - Crossover rate of 80% and mutation rate of 20%
  - Migration happens among masters every 40 generations
  - NoC network size ranges from 2x2 to
  - Buffer depth of 4 flits and 4 virtual channels
  - The master processor sends multi-flit chromosome packets and each slave processor returns a single-flit fitness packet
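The listed parameters can be collected in a small configuration object, sketched here in Python. The class and field names are hypothetical and are not taken from the authors' C++ simulator; the network size is left out because its upper bound is not given.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimConfig:
    """GA and NoC parameters as quoted on the setup slide."""
    generations: int = 2000
    population_size: int = 2400
    crossover_rate: float = 0.80
    mutation_rate: float = 0.20
    migration_interval: int = 40   # generations between migrations
    buffer_depth_flits: int = 4
    virtual_channels: int = 4

    def migration_events(self) -> int:
        """Number of migrations over the whole run."""
        return self.generations // self.migration_interval
```

With the defaults above, `SimConfig().migration_events()` gives 50 migrations over the 2000 generations.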

17 Comparisons of speedup performance
- We compare the speedup gain of the baseline design and the proposed architecture with DIBM
  - Various degrees of injection bandwidth multiplexing are considered (P=3 to P=9)
  - The upper bound is obtained by theoretical derivation considering the level of multiplexing
- For the baseline design (naïve mesh), the speedup tends to saturate early because both limitations apply
- For the NoC with DIBM, a 75X-206X speedup can be obtained

18 Evaluation of TDIM schemes
- Comparison of slave processor utilization under the TDIM schemes
- Core utilization in the baseline (naïve mesh) drops significantly
- The TDIM schemes efficiently improve the utilization of the slave cores as the number of slave processors increases

19 Evaluation of TDIM schemes (Cont.)
- Maximum number of islands that can be multiplexed on a physical island
- Impact of the island multiplexing number on the overall speedup performance

20 Evaluation of adaptive routing
- Compare the proposed routing algorithm against XY and minimal adaptive routing
- The proposed task-aware routing effectively reduces the sojourn time of a chromosome packet
- The overlap ratio is the percentage of time during which the DIS phases of two logical islands overlap
- The proposed routing algorithm achieves a 10-15% reduction in the fitness calculation time
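One way to compute an overlap ratio from a recorded timeline is sketched below. The slide does not state what the percentage is measured against, so dividing by a caller-supplied `total_time` is an assumption, as are the interval representation and the function name.

```python
def overlap_ratio(dis_a, dis_b, total_time):
    """Percentage of total_time during which both islands are in DIS.

    dis_a, dis_b: lists of (start, end) DIS intervals for each logical
    island; the intervals within one list must not overlap each other.
    """
    overlap = 0
    for a0, a1 in dis_a:
        for b0, b1 in dis_b:
            # Length of the intersection of the two intervals, if any.
            overlap += max(0, min(a1, b1) - max(a0, b0))
    return 100.0 * overlap / total_time
```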

21 Hardware comparisons
- The hardware overhead is normalized to that of the baseline design in terms of the number of processors needed
- The overhead grows linearly for the baseline design
- The proposed TDIM scheme greatly reduces the number of PEs and routers required by sharing the same resources in the island
- Combining DIBM and the proposed routing with TDIM further reduces the hardware requirement

22 Case study: Protein folding analysis
- 7 real-world protein benchmarks are used for the analysis
- We compare the proposed architecture with a single-master-single-slave design
- 24 islands are mapped onto a mesh NoC
- The solution is represented by the H-H side-chain contacts (HH)

24 Conclusions
- An efficient NoC-based multicore platform for HPGA is presented with:
  - DIBM to overcome the master processor's injection bandwidth limitation
  - TDIM to improve the utilization of the slave processors and reduce the physical network size
  - Adaptive routing to reduce the chromosome packet delivery latency
- We demonstrate the effectiveness of the overall architecture and of each scheme using the protein folding problem as the case study
- Future work includes detailed hardware implementation and optimization

25 Thanks! Q&A

26 Evaluating DIBM on FPGA
- A DIBM-based platform is implemented in Verilog with multiplexing level P=3
- The speedup is simulated based on synthesized results on a Xilinx Virtex-6 LX760 FPGA
- Performance degradation is due to the complicated control logic in the hardware prototype
