GENETIC ALGORITHM BASED FPGA PLACEMENT ON GPU SUNDAR SRINIVASAN SENTHILKUMAR T. R.





1 GENETIC ALGORITHM BASED FPGA PLACEMENT ON GPU SUNDAR SRINIVASAN SENTHILKUMAR T R

2 FPGA PLACEMENT PROBLEM
Input: A technology-mapped netlist of Configurable Logic Blocks (CLBs) realizing a given circuit.
Output: The CLB netlist placed in a two-dimensional array of slots such that the total wirelength is minimized.
Cost function:
$$\text{Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
[Figure: CLB netlist, placement, FPGA]
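As a rough illustration of the cost function on this slide, the following Python sketch evaluates the weighted bounding-box sum for a toy placement. The function name, the data layout (a dict from block id to coordinates), and the choice of measuring each span as max minus min are assumptions made for this example; the slides do not specify them.

```python
from typing import Dict, Sequence, Tuple


def bounding_box_cost(
    nets: Sequence[Sequence[int]],
    positions: Dict[int, Tuple[int, int]],
    q: Sequence[float],
    cav_x: Sequence[float],
    cav_y: Sequence[float],
) -> float:
    """Sum over nets of q(i) * (bb_x(i) / C_av,x(i) + bb_y(i) / C_av,y(i))."""
    total = 0.0
    for i, net in enumerate(nets):
        xs = [positions[block][0] for block in net]
        ys = [positions[block][1] for block in net]
        bb_x = max(xs) - min(xs)  # horizontal span of the net (assumed definition)
        bb_y = max(ys) - min(ys)  # vertical span of the net (assumed definition)
        total += q[i] * (bb_x / cav_x[i] + bb_y / cav_y[i])
    return total


# Hypothetical toy data: three blocks, two nets, unit weights and capacities.
positions = {0: (0, 0), 1: (3, 1), 2: (1, 4)}
nets = [[0, 1], [0, 1, 2]]
cost = bounding_box_cost(
    nets, positions, q=[1.0, 1.0], cav_x=[1.0, 1.0], cav_y=[1.0, 1.0]
)
print(cost)
```

With unit weights and capacities the cost reduces to the plain sum of the horizontal and vertical spans of every net.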

3 EXISTING TECHNIQUES FOR FPGA PLACEMENT
- VPR: uses Simulated Annealing (SA) with an adaptive annealing schedule
- Tabu search based method
- Force-directed placement
- Genetic algorithm based placement
- Partitioning and clustering based techniques
Which one to choose for parallelization?
- All of the above algorithms are heavily time-consuming.
- Most of them are not easily parallelizable, except the genetic algorithm.
- A genetic algorithm is a population-based optimization technique. It runs through many iterations called generations.
- A genetic algorithm is heavily computation-intensive, but it is an efficient optimization technique for NP-hard problems.
- Each generation has at least two cost evaluation phases that are suitable for parallelization. Reason: the cost evaluation of each individual is independent of the others.

4 GENETIC ALGORITHM FOR FPGA PLACEMENT
Encoding
- An individual (chromosome) in a genetic algorithm is a string of integers (genes).
- These integers (genes) represent the positions of all the CLBs and IO blocks that are to be placed on the FPGA.
- The index of a gene identifies the CLB or IO block it belongs to.
Example: Consider a chromosome that represents the placement of 7 CLBs and 3 IO blocks.
[Figure: example placement; labels as extracted: C, IO 1, C 4, C 5, C 6, C 3, IO 3, IO 2, C 1, C 2]
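To make the encoding concrete, here is a small Python sketch of how a chromosome could be decoded into coordinates. It covers only the CLB part and assumes that each gene is a slot index in a row-major grid of a given width; that slot numbering, the function name `decode`, and the sample values are hypothetical and not taken from the slides.

```python
from typing import Dict, List, Tuple


def decode(chromosome: List[int], width: int) -> Dict[int, Tuple[int, int]]:
    """Map block index -> (x, y), assuming row-major slot numbering."""
    return {
        block: (slot % width, slot // width)
        for block, slot in enumerate(chromosome)
    }


def is_valid(chromosome: List[int]) -> bool:
    """A placement is legal only if no two blocks share a slot."""
    return len(set(chromosome)) == len(chromosome)


# Hypothetical example: 4 CLBs placed on a grid that is 3 slots wide.
chromosome = [4, 0, 7, 2]   # gene i holds the slot of CLB i
placement = decode(chromosome, width=3)
print(placement)
```

The IO part of the chromosome could be decoded the same way against a separate list of perimeter slots, which is one way to keep CLBs and IO blocks out of each other's positions.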

5 GENETIC ALGORITHM FOR FPGA PLACEMENT
Initialization of Population
Cost Evaluation: Bounding Box Cost
$$\text{Bounding Box Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
Tournament Selection: Pick two individuals from the population at random and select the better one based on the bounding box cost. Repeat this pop_size times so that a new population is created from the better individuals.
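The tournament selection step described on this slide can be sketched in a few lines of Python. This is a minimal sketch assuming the two contestants are drawn independently (so the same individual may be drawn twice) and that ties go to the first draw; the slides do not state either detail, and the sample population and costs are made up.

```python
import random
from typing import List


def tournament_select(
    population: List[List[int]], costs: List[float], rng: random.Random
) -> List[List[int]]:
    """Run pop_size binary tournaments; the lower bounding-box cost wins."""
    pop_size = len(population)
    new_population = []
    for _ in range(pop_size):
        a = rng.randrange(pop_size)
        b = rng.randrange(pop_size)  # assumption: drawn with replacement
        winner = a if costs[a] <= costs[b] else b
        new_population.append(list(population[winner]))  # copy the winner
    return new_population


# Hypothetical population of four chromosomes with precomputed costs.
rng = random.Random(0)
population = [[0, 1, 2], [2, 1, 0], [1, 0, 2], [2, 0, 1]]
costs = [9.0, 4.0, 7.0, 12.0]
selected = tournament_select(population, costs, rng)
print(selected)
```

Because the winner of each tournament is copied, later crossover and mutation can modify the new population without touching the old one.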

6 GENETIC ALGORITHM FOR FPGA PLACEMENT
Crossover: Perform Partially Mapped Crossover (PMX) on the CLB part and the IO part of the chromosome separately, because CLBs must not be placed in IO positions and vice versa.
Mutation: Choose a chromosome at random, then choose two of its genes at random and swap them. This is done separately for the CLB part and the IO part, for the same reason.
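The following Python sketch shows one way the two operators could look. It uses the textbook form of PMX and assumes that both parents are permutations of the same set of slots within each part; the helper names, the `num_clb` split point, and the slot values (0 to 3 for CLBs, 100 to 102 for IO blocks) are illustrative assumptions rather than details from the slides.

```python
import random
from typing import List


def pmx(parent1: List[int], parent2: List[int], cut1: int, cut2: int) -> List[int]:
    """Textbook PMX: keep parent1[cut1:cut2], fill the rest from parent2."""
    size = len(parent1)
    child = [None] * size
    child[cut1:cut2] = parent1[cut1:cut2]
    segment = set(parent1[cut1:cut2])
    for i in range(cut1, cut2):
        gene = parent2[i]
        if gene in segment:
            continue  # already present in the copied segment
        pos = i
        while cut1 <= pos < cut2:  # follow the mapping out of the segment
            pos = parent2.index(parent1[pos])
        child[pos] = gene
    for i in range(size):
        if child[i] is None:
            child[i] = parent2[i]
    return child


def crossover(p1: List[int], p2: List[int], num_clb: int, rng: random.Random) -> List[int]:
    """Apply PMX to the CLB genes and the IO genes independently."""
    clb_cuts = sorted(rng.sample(range(num_clb + 1), 2))
    io_cuts = sorted(rng.sample(range(len(p1) - num_clb + 1), 2))
    return (pmx(p1[:num_clb], p2[:num_clb], *clb_cuts)
            + pmx(p1[num_clb:], p2[num_clb:], *io_cuts))


def swap_mutation(chrom: List[int], num_clb: int, rng: random.Random) -> List[int]:
    """Swap two genes inside the CLB part and two inside the IO part."""
    mutated = list(chrom)
    for lo, hi in ((0, num_clb), (num_clb, len(chrom))):
        i, j = rng.sample(range(lo, hi), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(1)
p1 = [0, 1, 2, 3, 100, 101, 102]   # 4 CLB genes followed by 3 IO genes
p2 = [3, 2, 1, 0, 102, 100, 101]
child = crossover(p1, p2, num_clb=4, rng=rng)
mutant = swap_mutation(p1, num_clb=4, rng=rng)
print(child, mutant)
```

Treating the two parts separately means a CLB slot can never end up in an IO gene, so no repair step is needed after either operator.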

7 PARALLEL IMPLEMENTATION OF COST EVALUATION PHASE
[Figure: population, cost evaluation, bounding box cost]
$$\text{Bounding Box Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
- The cost evaluations of the individuals are independent of each other and involve heavy floating-point calculations.
- This is also the most time-consuming phase of the genetic algorithm.
- So cost evaluation can be PARALLELIZED!
- Multiple threads evaluate the costs of independent individuals.
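The independence argument can be illustrated on a CPU with a thread pool standing in for GPU threads: each worker evaluates one individual and no worker needs another's result. This is only an analogy, not the authors' kernel; the simplified cost (unit weights and capacities), the row-major slot decoding, and all names and data below are assumptions. In CPython the threads will not actually run faster because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def hpwl_cost(chromosome: List[int], nets: List[List[int]], width: int) -> int:
    """Bounding-box cost of one individual with unit weights (assumed)."""
    total = 0
    for net in nets:
        xs = [chromosome[block] % width for block in net]
        ys = [chromosome[block] // width for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def evaluate_population(
    population: List[List[int]], nets: List[List[int]], width: int, workers: int = 4
) -> List[int]:
    """One task per individual; tasks share only read-only data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda ind: hpwl_cost(ind, nets, width), population))


# Hypothetical data: 3 individuals, 4 blocks each, grid 3 slots wide.
population = [[0, 1, 2, 3], [3, 2, 1, 0], [0, 4, 8, 5]]
nets = [[0, 1], [1, 2, 3]]
costs = evaluate_population(population, nets, width=3)
serial = [hpwl_cost(ind, nets, 3) for ind in population]
print(costs)
```

The parallel and serial results are identical, which is the property that makes the one-thread-per-individual mapping safe.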

8 PARALLELIZATION STRATEGIES IMPLEMENTED
Global Memory Implementation
- Place the netlist structures in global memory.
- Place the population in global memory.
Shared Memory Implementation
- Place the netlist structures in global memory (too large to fit in shared memory). The netlist structures take about 100 KB for typical benchmarks.
- Place the population in shared memory.
- Copying the population to local registers is not possible because the chromosomes are stored as arrays.
- The shared memory implementation is expected to offer more speed-up than the global memory implementation because the chromosomes are accessed very frequently during the bounding box calculation.

9 GLOBAL MEMORY IMPLEMENTATION
- Memory space is not a constraint, but memory accesses take too long.
- The speed-up is very low.
- Since the cost is evaluated based on a netlist, it is highly difficult for the host or the programmer to coalesce the memory accesses.
- Before actually running the program, very little can be done about the memory accesses.
Next trial: increase the number of threads
- At most pop_size threads can be invoked in a kernel.
- Increasing the number of threads did not help because the memory latency increased.

10 GLOBAL MEMORY IMPLEMENTATION
Why not increase the granularity of the kernel?
- Calculate each net cost in a separate thread.
- The number of threads increases by a factor of num_nets.
$$\text{Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
- This increases the occupancy of the kernel and reduces the probability of multiple threads accessing the same location.
- BUT this again did not work because of warp divergence.
- Threads in a warp were accessing non-consecutive locations, depending on the netlist and the placement encoded in the chromosome.
Conclusions from the global memory implementation
- Global memory reduces performance drastically when multiple threads access the same location.
- Warp divergence reduces performance.

11 SHARED MEMORY IMPLEMENTATION
- The population array was moved to shared memory.
- The netlist could not be moved because of the memory limitation.
- The number of individuals that fit in shared memory is limited by the size of the netlist, i.e. the number of blocks to be placed.
- With our genotype (encoding of individuals), the memory required per individual is 4 bytes * number of blocks.
- Example: if 100 blocks are to be placed, a single individual requires 400 bytes. Shared memory is only 16 KB, so 16 KB / 400 bytes gives 40 individuals in one kernel, i.e. at most 40 threads per kernel.
- Occupancy is very low, but the speed-up was better than with global memory.
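The capacity arithmetic on this slide can be checked with a short Python calculation. The 16 KB shared memory size and the 4 bytes per gene come from the slide; the function names and the population size of 200 used for the kernel-call estimate are assumptions added for illustration.

```python
import math

SHARED_MEM_BYTES = 16 * 1024   # 16 KB of shared memory, as stated on the slide
BYTES_PER_GENE = 4             # one 4-byte integer per block to be placed


def max_individuals(num_blocks: int, shared_bytes: int = SHARED_MEM_BYTES) -> int:
    """How many whole chromosomes fit in shared memory at once."""
    return shared_bytes // (BYTES_PER_GENE * num_blocks)


def kernel_calls(pop_size: int, num_blocks: int) -> int:
    """Kernel launches needed if each launch holds one shared-memory load."""
    return math.ceil(pop_size / max_individuals(num_blocks))


fit_100 = max_individuals(100)          # 16384 // 400
calls_100 = kernel_calls(200, 100)      # hypothetical population of 200
calls_400 = kernel_calls(200, 400)      # larger circuit, same population
print(fit_100, calls_100, calls_400)
```

The same calculation shows why larger circuits need more kernel calls: as the chromosome grows, fewer individuals fit per launch.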

12 SHARED MEMORY IMPLEMENTATION
Why not use the shared memory resources of all the Streaming Multiprocessors (SMs)?
- This is possible because only the population arrays are placed in shared memory and each thread evaluates the cost of only one individual.
- The threads in an SM need access only to the individuals in that SM, not to those in other SMs.
- The compiler takes care of the internal memory partitioning among the 8 shared memory spaces based on the memory accesses made by the threads.
- To activate all 8 multiprocessors, enough threads must be invoked.
- But we do not have that many cost evaluations in a kernel!
- So we invoked dummy threads. These dummy threads are present just to increase the occupancy and activate multiple SMs at the point of kernel invocation.
Will this not cause a load imbalance?
- The dummy threads complete their execution almost immediately after invocation.
- It is also managed by setting appropriate grid and block dimensions and indexing.

13 Does shared memory solve all our problems? NO
We still face:
- Warp divergence
- Low occupancy
- Bank conflicts: multiple simultaneous accesses to the same memory location
These are the reasons for not achieving the ideal speed-up.
Why not use registers? Main drawbacks:
- Registers do not support arrays, so the individuals cannot be moved to registers.
- A limited number of registers is available per SM (8192 registers per SM), which creates a bottleneck on the number of threads that can be invoked.
- Registers are used for temporary values during the cost evaluation phase.

14 RESULTS AND DISCUSSION
Global Memory
[Figure: speed-up vs. number of blocks, global memory implementation]
The speed-up is not very good. Reasons:
- The speed-up decreases as the number of blocks increases.
- Simultaneous memory accesses increase with the number of blocks.
- More blocks lead to more randomness, so warp divergence increases.
- The number of threads does not increase with the number of blocks to be placed.

15 RESULTS AND DISCUSSION
Shared Memory Implementation
- The speed-up increases compared to the global memory implementation.
- The number of threads that can be invoked in a single kernel is influenced by the problem size.
- As the problem size increases, the memory bank conflicts increase.
[Figure: Limitation of Shared Memory; number of threads that can be invoked vs. number of blocks to be placed]
[Figure: Shared Memory Implementation; speed-up vs. number of blocks]

16 RESULTS AND DISCUSSION
- The shared memory implementation is up to 5 times faster than the global memory implementation.
- The difference is more prominent for smaller circuits.
- In the shared memory implementation, the number of kernel calls increases as the circuit size increases.
[Figure: speed-up vs. number of blocks to be placed]

17 LESSONS LEARNED
- Implement the global memory version before the shared memory version. This helps to predict the shared memory challenges.
- Explore multiple threading strategies before moving to shared memory. Pick the best method and then move on to the shared memory version.
- Memory bank conflicts: reduce the probability of multiple threads accessing the same memory locations simultaneously.
- Think carefully before moving to fine-grained parallelism.
- Warp divergence in global memory is more severe than in shared memory.
- Using the shared memory of multiple streaming multiprocessors is possible. The block and grid dimensions have to be chosen appropriately to activate multiple SMs. For example, using 64 blocks in a kernel helps to utilize the shared memory of all the SMs.
- Never blame the hardware architecture before completely exploring software optimizations.

18 CONCLUSIONS
- A genetic algorithm based FPGA placement was implemented on a GPU.
- Different parallelization strategies were explored and the best one was chosen (population array in shared memory).
- A speed-up curve was obtained with respect to the size of the netlist.
- A speed-up of up to 52X was achieved for smaller circuit sizes.

20 Back up Slides

21 Half-perimeter Wire Length Model
Bounding Box Cost Calculation
[Figure: net with 6 terminals and its bounding box]
For the 6-terminal net shown: 4 (horizontal distance) + 2 (vertical distance), so the BB cost of this net = 6.
The total cost is a summation over all the nets.
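The half-perimeter calculation for a single net can be reproduced with a few lines of Python. The terminal coordinates below are hypothetical, chosen only so that the bounding box is 4 wide and 2 tall as in the slide's example; the figure's actual coordinates are not recoverable from the text.

```python
from typing import List, Tuple


def hpwl(terminals: List[Tuple[int, int]]) -> int:
    """Half-perimeter wire length: bounding-box width plus height."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


# Hypothetical 6-terminal net spanning x = 0..4 and y = 0..2.
net = [(0, 0), (1, 2), (2, 1), (3, 0), (4, 2), (2, 2)]
net_cost = hpwl(net)          # 4 (horizontal) + 2 (vertical)

# The total cost sums the same quantity over every net.
total_cost = sum(hpwl(n) for n in [net, [(0, 0), (1, 1)]])
print(net_cost, total_cost)
```

Only the extreme terminals matter, so moving an interior terminal within the box leaves the cost unchanged.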

22 GENETIC ALGORITHM

23 FPGA CAD FLOW
