Dynamic Load-Balancing on Multi-FPGA Systems: A Case Study


1 Dynamic Load-Balancing on Multi-FPGA Systems: A Case Study. Volodymyr Kindratenko, Innovative Systems Lab (ISL), National Center for Supercomputing Applications (NCSA); Robert Brunner and Adam Myers, Department of Astronomy, University of Illinois at Urbana-Champaign (UIUC)

2 SRC-6 Reconfigurable Computer. [Block diagram: a dual-Xeon 2.8 GHz microprocessor board with 1 GB of memory connects through the SNAP interface to the 4-port SRC Hi-Bar switch, which also links Common Memory and two MAP processors (MAP C and MAP E). Each MAP contains a Control FPGA, two User FPGAs with dual-ported memory, and six on-board memory banks (OBM A-F). Sustained payload bandwidth through the switch is 1.4 GB/s. Applications are programmed with Carte 2.2.]

3 Angular Correlation Function. The two-point angular correlation function (TPACF), denoted ω(θ), is the frequency distribution of angular separations between celestial objects in the interval (θ, θ + δθ), where θ is the angular distance between two points. Blue points (random data) are, on average, randomly distributed; red points (observed data) are clustered. Blue points: ω(θ) = 0. Red points: ω(θ) > 0. ω(θ) can vary as a function of angular distance (yellow circles): for the blue points ω(θ) = 0 on all scales, while for the red points ω(θ) is larger on smaller scales.
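On the unit sphere the angular distance between two objects follows from the dot product of their Cartesian unit vectors, which is why the kernels shown later bin dot products directly. A minimal sketch, assuming equatorial input coordinates in radians (the struct and helper names are illustrative, not from the slides):

#include <math.h>

struct cartesian { double x, y, z; }; /* unit vector on the celestial sphere */

/* Hypothetical helper: convert right ascension/declination (radians)
   into a Cartesian unit vector. */
struct cartesian fromEquatorial(double ra, double dec)
{
    struct cartesian v;
    v.x = cos(ra) * cos(dec);
    v.y = sin(ra) * cos(dec);
    v.z = sin(dec);
    return v;
}

/* For unit vectors a and b, cos(theta) = a.b, so theta = acos(a.b).
   The TPACF kernels never call acos: they compare the dot product
   against precomputed cosines of the bin edges instead. */
double angularSeparation(struct cartesian a, struct cartesian b)
{
    return acos(a.x * b.x + a.y * b.y + a.z * b.z);
}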

4 The Method. The angular correlation function is calculated using the estimator derived by Landy & Szalay (1993):

    ω(θ) = [DD(θ) - 2 DR(θ) + RR(θ)] / RR(θ),

where DD(θ) and RR(θ) are the autocorrelation functions of the data and random points, respectively, and DR(θ) is the cross-correlation between the data and random points, each normalized by its number of pairs: n_D(n_D - 1)/2, n_D n_R, and n_R(n_R - 1)/2. With one observed dataset and n_r equal-sized random datasets whose pair counts are accumulated, this corresponds to the form computed in the code (slide 5):

    ω(θ) = (2 n_r DD(θ) - Σ_{i=1..n_r} DR_i(θ)) / Σ_{i=1..n_r} RR_i(θ) + 1.

5 Serial Code Organization

// pre-compute bin boundaries, binb
// compute DD
doCompute{CPU|MAP}(data, npd, data, npd, 1, DD, binb, nbins);
// loop through random data files
for (i = 0; i < random_count; i++) {
    // compute RR
    doCompute{CPU|MAP}(random[i], npr[i], random[i], npr[i], 1, RRS, binb, nbins);
    // compute DR
    doCompute{CPU|MAP}(data, npd, random[i], npr[i], 0, DRS, binb, nbins);
}
// compute w
for (k = 0; k < nbins; k++)
    w[k] = (random_count * 2*DD[k] - DRS[k]) / RRS[k] + 1.0;

6 Reference C Kernel Implementation

for (i = 0; i < ((autocorrelation) ? n1-1 : n1); i++) {
    double xi = data1[i].x, yi = data1[i].y, zi = data1[i].z;
    for (j = ((autocorrelation) ? i+1 : 0); j < n2; j++) {
        double dot = xi * data2[j].x + yi * data2[j].y + zi * data2[j].z;
        register int k, min = 0, max = nbins;
        if (dot >= binb[min]) data_bins[min] += 1;
        else if (dot < binb[max]) data_bins[max+1] += 1;
        else { // run binary search
            while (max > min+1) {
                k = (min + max) / 2;
                if (dot >= binb[k]) max = k;
                else min = k;
            }
            data_bins[max] += 1;
        }
    }
}

[Diagram: points p_i and p_j on the sphere; histogram bins q_0 through q_5 delimited by the binb[] boundaries.]
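The comparisons above work because binb[] holds the cosines of the angular bin edges in decreasing order: a larger dot product means a smaller separation, dot >= binb[0] falls below the smallest edge, and dot < binb[nbins] falls beyond the largest (hence the nbins+2 histogram bins seen in the scheduler code). A minimal sketch of the "pre-compute bin boundaries" step from slide 5, assuming logarithmically spaced bins; the parameter names are illustrative, not from the slides:

#include <math.h>

/* Hypothetical sketch: edges equally spaced in log10(theta), stored as
   cosines so binb[0] is the largest value and binb[nbins] the smallest. */
void computeBinBoundaries(double *binb, int nbins,
                          double min_arcmin, int bins_per_dec)
{
    int k;
    for (k = 0; k <= nbins; k++) {
        double arcmin = min_arcmin * pow(10.0, (double)k / bins_per_dec);
        double rad = (arcmin / 60.0) * (M_PI / 180.0);
        binb[k] = cos(rad); /* decreasing in k */
    }
}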

7 OpenMP Implementation

for (i = 0; i < random_count; i++) {
    #pragma omp parallel sections
    {
        #pragma omp section
        doComputeMAP1(..., mapc);
        #pragma omp section
        doComputeMAP2(..., mape);
    }
}

[Chart: execution time (s) and speedup over the CPU vs. number of points in the dataset (x10,000); MAP speedup rises from 82.8x through 84.9x, 86.7x, 88.1x, 89.1x, 89.6x to 89.7x, then dips to 89.3x for the largest dataset.]

MAP C processor is idle 18% of the time.

8 Simplified Performance Model. Analysis of a data/random file with 100 data points each. Autocorrelation between the points in the random data file requires 100*(100-1)/2 = 4,950 pair computations. Cross-correlation between the observed data and random data requires 100*100 = 10,000 pair computations. Since MAP C handles the autocorrelation while MAP E handles the larger cross-correlation, the MAP Series C processor is idle about 50% of the time!

MAP C: autocorrelation (AC), 4,950 pair computations. MAP E: cross-correlation (CC), 10,000 pair computations.

9 Consider Data Partitioning. Dataset A: 100 points, split into segments A1, A2, A3. Dataset B: 100 points, split into segments B1, B2, B3.

Autocorrelation jobs: A1-A1 (ac), A1-A2 (cc), A1-A3 (cc), A2-A2 (ac), A2-A3 (cc), A3-A3 (ac).
Cross-correlation jobs: A1-B1 (cc), A1-B2 (cc), A1-B3 (cc), A2-B1 (cc), A2-B2 (cc), A2-B3 (cc), A3-B1 (cc), A3-B2 (cc), A3-B3 (cc).
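A minimal sketch of generating such a segment-pair job list (the struct and function are illustrative; the slides do not show this code). For an autocorrelation only pairs with j >= i are needed, and the j == i jobs are the per-segment (ac) entries:

struct job_desc { int i, j, self; }; /* segment indices, autocorrelation flag */

/* Decompose a correlation of two datasets, each split into nseg segments,
   into segment-pair jobs; returns the number of jobs generated. */
int enumerateJobs(struct job_desc *jobs, int nseg, int autocorrelation)
{
    int i, j, k = 0;
    for (i = 0; i < nseg; i++)
        for (j = (autocorrelation ? i : 0); j < nseg; j++) {
            jobs[k].i = i;
            jobs[k].j = j;
            jobs[k].self = (autocorrelation && i == j);
            k++;
        }
    return k; /* 6 jobs for nseg = 3 autocorrelation, 9 for cross-correlation */
}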

10 Consider Data Partitioning. Analysis of a data/random file with 100 data points each. Each data file is divided into 3 equally sized segments. Autocorrelation is computed first, followed by the cross-correlation. Each MAP processor is invoked with the first available unprocessed pair of segments. The MAP Series C processor is now idle only about 7% of the time!

[Schedule diagram, in pair computations per job: MAP C runs AC 528, then 1,122, 1,122, 1,089, 1,089, 1,122, 1,089; MAP E runs 1,089, AC 528, AC 561, then 1,089, 1,122, 1,089, 1,089, 1,122.]

11 Job Scheduler. Workers: MAP C and MAP E. Jobs: all pairs of dataset 1 / dataset 2 segments.

Scheduler:
for each pair of d1/d2 segments, p_ij
    for each MAP processor, m
        if m is free
            assign p_ij to m
            break
        endif
    endfor
endfor

12 Job Scheduler Implementation

do {
    for (k = 0; k < K; k++) {                       // loop thru all the jobs
        if (job[k].status == running) continue;     // let it run
        if (job[k].status == done) continue;        // nothing to do anymore
        if (job[k].status == finished) {            // need to get results back
            pthread_join(job[k].thread, (void **)&mytd); // join the thread
            for (i = 0; i < nbins+2; i++)           // copy results
                res[i] += mytd->res[i];
            job[k].status = done;                   // set status to done
            TOTAL++;                                // count number of fully executed jobs
            continue;
        }
        for (t = NPROCS-1; t >= 0; t--) {           // is there a free MAP to run this job?
            if (thread_stat[t] == busy) continue;   // thread is busy
            if (self && i == j && t == 1) continue; // not a suitable thread for 'self'
            struct my_thread_data *mytd =
                (struct my_thread_data *)malloc(sizeof(struct my_thread_data));
            pthread_create(&(job[k].thread), NULL, my_map_proc, (void *)mytd);
            thread_stat[t] = busy;                  // lock it
            job[k].status = running;                // set status to running
            break;                                  // no need to check the rest of the MAPs
        }
    }
    usleep(1000);
} while (TOTAL != K);
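A hedged sketch of the worker routine the scheduler spawns; the slides show only its name, so the thread-data layout and kernel call here are assumptions (my_map_proc must return its my_thread_data record, since the scheduler retrieves it via pthread_join and reads res[]):

/* Assumed globals from the slide: job[], thread_stat[], the status
   values (running/finished/done) and the busy/idle flags. */
struct my_thread_data {
    int k;                        /* job index (assumed field) */
    int t;                        /* MAP index: 0 = MAP C, 1 = MAP E (assumed) */
    long long res[NBINS_MAX + 2]; /* per-job histogram, nbins+2 bins */
    /* ... segment pointers, binb, nbins, autocorrelation flag ... */
};

void *my_map_proc(void *arg)
{
    struct my_thread_data *mytd = (struct my_thread_data *)arg;
    doComputeMAP(mytd);             /* run the kernel on the assigned MAP */
    thread_stat[mytd->t] = idle;    /* release the MAP */
    job[mytd->k].status = finished; /* scheduler joins us and collects res[] */
    return (void *)mytd;            /* delivered to pthread_join */
}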

13 Load-balanced Implementation

for (i = 0; i < random_count; i++) {
    JobScheduler(data, random[i]);
    JobScheduler(random[i], random[i]);
}

[Chart: execution time (s) and speedup over the CPU vs. number of points in the dataset (x10,000); MAP speedup rises from 84.8x through 88.9x, 92.8x, 94.7x, 95.8x to 96.4x, then 96.2x for the largest dataset.]

The MAP E processor is idle less than 1% of the time.

14 Conclusions

Pros:
- A 9% performance improvement due to better utilization of the idle resources
- Near-identical load on each of the MAPs
- A scalable solution that allows mixing compute subroutines with different performance characteristics

Cons:
- A performance hit for smaller datasets due to the overhead of calling the MAP processors
- More complex execution flow and data management

15 Acknowledgements This work is funded by NASA Applied Information Systems Research (AISR) award number NNG06GH15G Prof. Robert Brunner and Dr. Adam Myers from UIUC Department of Astronomy NCSA Collaborators Dr. Rob Pennington, Dr. Craig Steffen, David Raila, Michael Showerman, Jeremy Enos, John Larson, David Meixner, Ken Sartain SRC Computers, Inc. David Caliga, Dr. Jeff Hammes, Dan Poznanovic, David Pointer, Jon Huppenthal
