Chapter 24 File Output on a Cluster
Part I. Preliminaries
Part II. Tightly Coupled Multicore
Part III. Loosely Coupled Cluster
    Chapter 18. Massively Parallel
    Chapter 19. Hybrid Parallel
    Chapter 20. Tuple Space
    Chapter 21. Cluster Parallel Loops
    Chapter 22. Cluster Parallel Reduction
    Chapter 23. Cluster Load Balancing
    Chapter 24. File Output on a Cluster
    Chapter 25. Interacting Tasks
    Chapter 26. Cluster Heuristic Search
    Chapter 27. Cluster Work Queues
    Chapter 28. On-Demand Tasks
Part IV. GPU Acceleration
Part V. Big Data
Appendices
Recall the single-node multicore parallel program from Chapter 11 that computes an image of the Mandelbrot Set. The program partitioned the image rows among the threads of a parallel thread team. Each thread computed the color of each pixel in a certain row, put the row of pixels into a ColorImageQueue, and went on to the next available row. Simultaneously, another thread took each row of pixels out of the queue and wrote them to a PNG file.

Now let's make this a cluster parallel program. The program will illustrate several features of the Parallel Java 2 Library, namely customized tuple subclasses and file output in a job. Studying the program's strong scaling performance will reveal interesting behavior that we didn't see with single-node multicore parallel programs.

Figure 24.1 shows the cluster parallel Mandelbrot Set program's design. Like the previous multicore version, the program partitions the image rows among multiple parallel team threads. Unlike the previous version, the parallel team threads are located in multiple separate worker tasks. Each worker task runs in a separate backend process on one of the backend nodes in the cluster. I'll use a master-worker parallel for loop to partition the image rows among the tasks and threads.

Also like the previous version, the program has an I/O thread responsible for writing the output PNG file, as well as a ColorImageQueue from which the I/O thread obtains rows of pixels. The I/O thread and the queue reside in an output task, separate from the worker tasks and shared by all of them. I'll run the output task in the job's process on the frontend node, rather than in a backend process on a backend node. That way, the I/O thread runs in the user's account and is able to write the PNG file in the user's directory.
(If the output task ran in a backend process, it would typically run in a special Parallel Java account rather than the user's account, and it would typically not be able to write files in the user's directory.)

This is a distributed memory program. The worker tasks' team threads compute pixel rows, which are located in the backend processes' memories. The output task's image queue is located in the frontend process's memory.
Figure 24.1  Cluster parallel Mandelbrot Set program
Thus, it's not possible for a team thread to put a pixel row directly into the image queue, as the multicore parallel program could. This is where tuple space comes to the rescue. When a team thread has computed a pixel row, the team thread packages the pixel row, along with its row index, into an output tuple and puts the tuple into tuple space. A second gather thread in the output task repeatedly takes an output tuple, extracts the row index and the pixel row, and puts the pixel row into the image queue at the proper index. The I/O thread then removes pixel rows from the image queue and writes them to the PNG file. In this way, the computation's results are communicated from the worker tasks through tuple space to the output task. However, going through tuple space imposes a communication overhead that the multicore parallel program did not have. This communication overhead affects the cluster parallel program's scalability.

Listing 24.1 gives the source code for class edu.rit.pj2.example.MandelbrotClu. Like all cluster parallel programs, it begins with the job main program that defines the tasks. The masterFor() method (line 40) sets up a task group with K worker tasks, where K is specified by the workers option on the pj2 command. The masterFor() method also sets up a master-worker parallel for loop that partitions the outer loop over the image rows, from 0 through height - 1, among the worker tasks. Because the running time is different in every loop iteration, the parallel for loop needs a load balancing schedule; I specified a proportional schedule with a chunk factor of 10 (lines 38-39). This partitions the outer loop iterations into 10 times as many chunks as there are worker tasks, and each task repeatedly executes the next available chunk in a dynamic fashion.
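The chunking arithmetic behind a proportional schedule can be sketched in a few lines. This is illustrative only, not PJ2's actual scheduler implementation, and the class and method names are made up for the example:

```java
// Sketch of how a proportional schedule with a given chunk factor
// could partition a master-worker loop's iterations: the index range
// 0..count-1 is divided into (workers * chunkFactor) nearly equal
// chunks, which workers then claim dynamically.
public class ProportionalChunks {
    // Returns {lb, ub} inclusive index pairs covering 0..count-1.
    static int[][] chunks(int count, int workers, int chunkFactor) {
        int nChunks = workers * chunkFactor;
        int size = (count + nChunks - 1) / nChunks;  // ceiling division
        int n = (count + size - 1) / size;           // actual chunk count
        int[][] result = new int[n][2];
        for (int i = 0; i < n; ++i) {
            result[i][0] = i * size;
            result[i][1] = Math.min((i + 1) * size, count) - 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // 3200 rows, 4 workers, chunk factor 10 -> 40 chunks of 80 rows.
        int[][] c = chunks(3200, 4, 10);
        System.out.println(c.length + " chunks, first = [" +
            c[0][0] + "," + c[0][1] + "]");
    }
}
```

Each worker repeatedly grabs the next unclaimed chunk, which is what balances the wildly uneven per-row Mandelbrot workloads.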
The job main program also sets up the output task that will write the PNG file (lines 43-44), with the output task running in the job's process. Both the worker tasks and the output task are specified by start rules and so commence execution at the start of the job.

Next comes the OutputTuple subclass (line 71). It conveys a row of pixel colors, a ColorArray (line 74), along with the row index (line 73), from a worker task to the output task. The tuple subclass also provides the obligatory no-argument constructor (lines 76-78), writeOut() method (lines 88-92), and readIn() method (lines 94-98).

The WorkerTask class (line 102) is virtually identical to the single-node multicore MandelbrotSmp class from Chapter 11. There are only two differences. First, the worker task provides the worker portion of the master-worker parallel for loop (line 151). When the worker task obtains a chunk of row indexes from the master, the indexes are partitioned among the parallel team threads using a dynamic schedule for load balancing. Each loop iteration (line 160) computes the pixel colors for one row of the image, storing the colors in a per-thread color array (line 153). The second difference is that
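The essential contract of writeOut() and readIn() is that they stream the tuple's fields in the same order. Here is a plain-JDK sketch of that symmetric contract, using the JDK's DataOutputStream/DataInputStream as stand-ins for PJ2's OutStream/InStream (an assumption made purely for illustration; the class below is hypothetical):

```java
import java.io.*;

// Hypothetical analogue of OutputTuple serialization: write the row
// index and the pixel data, then read them back in exactly the same
// order. An int[] stands in for ColorArray.
public class TupleStreamDemo {
    static byte[] writeOut(int row, int[] pixelData) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(row);               // field 1: row index
        out.writeInt(pixelData.length);  // field 2: array length
        for (int p : pixelData) out.writeInt(p);
        out.flush();
        return bytes.toByteArray();
    }

    static int[] readIn(byte[] data) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(data));
        int row = in.readInt();          // must match the write order
        int len = in.readInt();
        int[] result = new int[len + 1]; // row followed by pixels
        result[0] = row;
        for (int i = 1; i <= len; ++i) result[i] = in.readInt();
        return result;
    }

    public static void main(String[] args) throws IOException {
        int[] round = readIn(writeOut(7, new int[] {1, 2, 3}));
        System.out.println(round[0] + " " + round[1] + " " +
            round[2] + " " + round[3]);  // row, then the three pixels
    }
}
```

If the two methods ever disagree on field order, the receiving task deserializes garbage, so keeping them mirror images of each other is the one rule to remember.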
package edu.rit.pj2.example;
import edu.rit.image.Color;
import edu.rit.image.ColorArray;
import edu.rit.image.ColorImageQueue;
import edu.rit.image.ColorPngWriter;
import edu.rit.io.InStream;
import edu.rit.io.OutStream;
import edu.rit.pj2.Job;
import edu.rit.pj2.Loop;
import edu.rit.pj2.Schedule;
import edu.rit.pj2.Section;
import edu.rit.pj2.Task;
import edu.rit.pj2.Tuple;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class MandelbrotClu
   extends Job
   {
   // Job main program.
   public void main (String[] args)
      {
      // Parse command line arguments.
      if (args.length != 8) usage();
      int width = Integer.parseInt (args[0]);
      int height = Integer.parseInt (args[1]);
      double xcenter = Double.parseDouble (args[2]);
      double ycenter = Double.parseDouble (args[3]);
      double resolution = Double.parseDouble (args[4]);
      int maxiter = Integer.parseInt (args[5]);
      double gamma = Double.parseDouble (args[6]);
      File filename = new File (args[7]);

      // Set up task group with K worker tasks. Partition rows among
      // workers.
      masterSchedule (proportional);
      masterChunk (10);
      masterFor (0, height - 1, WorkerTask.class) .args (args);

      // Set up PNG file writing task.
      rule() .task (OutputTask.class) .args (args) .runInJobProcess();
      }

   // Print a usage message and exit.
   private static void usage()
      {
      System.err.println ("Usage: java pj2 [workers=<K>] " +
         "edu.rit.pj2.example.MandelbrotClu <width> <height> " +
         "<xcenter> <ycenter> <resolution> <maxiter> <gamma> " +
         "<filename>");
      System.err.println ("<K> = Number of worker tasks (default 1)");
      System.err.println ("<width> = Image width (pixels)");
      System.err.println ("<height> = Image height (pixels)");
      System.err.println ("<xcenter> = X coordinate of center " +

Listing 24.1  MandelbrotClu.java (part 1)
once all columns in the row have been computed, the worker task packages the row index and the color array into an output tuple and puts the tuple into tuple space (line 189), whence the output task can take the tuple.

Last comes the OutputTask class (line 196), which runs in the job's process. After setting up the PNG file writer and the color image queue, the task runs two parallel sections simultaneously in two threads (line 227). The first section repeatedly takes an output tuple out of tuple space and puts the tuple's pixel data into the image queue at the row index indicated in the tuple. The takeTuple() method is given a blank output tuple as the template; this matches any output tuple containing any pixel row, no matter which worker task put the tuple. The tuple's row index ensures that the pixel data goes into the proper image row, regardless of the order in which the tuples arrive. The first section takes exactly as many tuples as there are image rows (the height argument). The second section merely uses the PNG image writer to write the PNG file.

Each worker task terminates when there are no more chunks of pixel rows to calculate. The output task terminates when the first parallel section has taken and processed all the output tuples and the second parallel section has finished writing the PNG file. At that point the job itself terminates.

I ran the Mandelbrot Set program on the tardis cluster to study the program's strong scaling performance, computing images of five sizes. For partitioning at the master level, the program is hard-coded to use a proportional schedule with a chunk factor of 10. For partitioning at the worker level, the program is hard-coded to use a dynamic schedule.
To measure the sequential version, I ran the MandelbrotSeq program from Chapter 11 on one node, using commands like this:

$ java pj2 debug=makespan edu.rit.pj2.example.MandelbrotSeq \
    ms3200.png

To measure the parallel version on one core, I ran the MandelbrotClu program with one worker task and one thread, using commands like this:

$ java pj2 debug=makespan workers=1 cores=1 \
    edu.rit.pj2.example.MandelbrotClu \
    ms3200.png

To measure the parallel version on multiple cores, I ran the MandelbrotClu program with one to ten worker tasks and with all cores on each node (12 to 120 cores), using commands like this:

$ java pj2 debug=makespan workers=2 \
    edu.rit.pj2.example.MandelbrotClu \
    ms3200.png
         "point");
      System.err.println ("<ycenter> = Y coordinate of center " +
         "point");
      System.err.println ("<resolution> = Pixels per unit");
      System.err.println ("<maxiter> = Maximum number of " +
         "iterations");
      System.err.println ("<gamma> = Used to calculate pixel hues");
      System.err.println ("<filename> = PNG image file name");
      terminate (1);
      }

   // Tuple for sending results from worker tasks to output task.
   private static class OutputTuple
      extends Tuple
      {
      public int row;               // Row index
      public ColorArray pixelData;  // Row's pixel data

      public OutputTuple()
         {
         }

      public OutputTuple (int row, ColorArray pixelData)
         {
         this.row = row;
         this.pixelData = pixelData;
         }

      public void writeOut (OutStream out) throws IOException
         {
         out.writeUnsignedInt (row);
         out.writeObject (pixelData);
         }

      public void readIn (InStream in) throws IOException
         {
         row = in.readUnsignedInt();
         pixelData = (ColorArray) in.readObject();
         }
      }

   // Worker task class.
   private static class WorkerTask
      extends Task
      {
      // Command line arguments.
      int width;
      int height;
      double xcenter;
      double ycenter;
      double resolution;
      int maxiter;
      double gamma;

      // Initial pixel offsets from center.
      int xoffset;

Listing 24.1  MandelbrotClu.java (part 2)
Figure 24.2  MandelbrotClu strong scaling performance metrics
      int yoffset;

      // Table of hues.
      Color[] hueTable;

      // Worker task main program.
      public void main (String[] args) throws Exception
         {
         // Parse command line arguments.
         width = Integer.parseInt (args[0]);
         height = Integer.parseInt (args[1]);
         xcenter = Double.parseDouble (args[2]);
         ycenter = Double.parseDouble (args[3]);
         resolution = Double.parseDouble (args[4]);
         maxiter = Integer.parseInt (args[5]);
         gamma = Double.parseDouble (args[6]);

         // Initial pixel offsets from center.
         xoffset = -(width - 1) / 2;
         yoffset = (height - 1) / 2;

         // Create table of hues for different iteration counts.
         hueTable = new Color [maxiter + 2];
         for (int i = 1; i <= maxiter; ++ i)
            hueTable[i] = new Color().hsb
               (/*hue*/ (float) Math.pow ((double)(i - 1)/maxiter, gamma),
                /*sat*/ 1.0f,
                /*bri*/ 1.0f);
         hueTable[maxiter + 1] = new Color().hsb (1.0f, 1.0f, 0.0f);

         // Compute all rows and columns.
         workerFor() .schedule (dynamic) .exec (new Loop()
            {
            ColorArray pixelData;

            public void start()
               {
               pixelData = new ColorArray (width);
               }

            public void run (int r) throws Exception
               {
               double y = ycenter + (yoffset - r) / resolution;
               for (int c = 0; c < width; ++ c)
                  {
                  double x = xcenter + (xoffset + c) / resolution;

                  // Iterate until convergence.
                  int i = 0;
                  double aold = 0.0;
                  double bold = 0.0;
                  double a = 0.0;
                  double b = 0.0;
                  double zmagsqr = 0.0;

Listing 24.1  MandelbrotClu.java (part 3)
Figure 24.2 plots the running times, speedups, and efficiencies I observed. The running time plots' behavior is peculiar. The running times decrease as the number of cores increases, more or less as expected with strong scaling, but only up to a certain point. At around 36 or 48 cores, the running time plots flatten out, and there is no further reduction as more cores are added. Also, the efficiency plots show a steady decrease in efficiency as more cores are added, a much larger drop than we've seen before. What's going on?

To find out, I fit a running time model, with T a function of the problem size N and the number of cores K, to the measured data. Fitting this model to the data yields this running time formula:

    T = (a + b N) + (c + d N) K + (e + f N) N / K,    (24.1)

where a through f are the fitted coefficients. Plugging a certain problem size N into Equation 24.1 yields a running time formula as a function of just K. For example, plugging one image's problem size N (the number of inner loop iterations) into Equation 24.1, the formula becomes

    T = A + B K + C / K,    (24.2)

where A = a + b N, B = c + d N, and C = (e + f N) N are constants. For the numbers of cores I used, the second term in Equation 24.2 is negligible compared to the other terms. Figure 24.3 plots the first and third terms separately in black, along with their sum T in red. Because the third term's coefficient is so much larger than the first term's, the third term

Figure 24.3  MandelbrotClu running time model
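A fitted model of the form in Equation 24.2 lets you locate the optimum core count directly. The sketch below evaluates T(K) = A + B K + C/K and scans for the minimizing K; the coefficient values are invented placeholders for illustration, not the values fitted in this chapter:

```java
// Evaluate a strong-scaling running time model T(K) = A + B*K + C/K
// and find the K (1..120 cores) that minimizes it. The coefficients
// are made-up placeholders, not measured data.
public class ScalingModel {
    static final double A = 2.0;    // constant (sequential) term, seconds
    static final double B = 0.5;    // per-core overhead term, seconds
    static final double C = 3600.0; // parallelizable work term, seconds

    static double t(int K) {
        return A + B * K + C / K;
    }

    public static void main(String[] args) {
        int bestK = 1;
        for (int K = 1; K <= 120; ++K)
            if (t(K) < t(bestK)) bestK = K;
        System.out.printf("optimum K = %d, T(K) = %.2f sec%n",
            bestK, t(bestK));
        // Beyond bestK the B*K term makes T creep back up, so adding
        // cores past the optimum buys nothing.
    }
}
```

Analytically the minimum of A + B K + C/K falls at K = sqrt(C/B); the scan above finds the same point without calculus and works for any model you can evaluate.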
               while (i <= maxiter && zmagsqr <= 4.0)
                  {
                  ++ i;
                  a = aold*aold - bold*bold + x;
                  b = 2.0*aold*bold + y;
                  zmagsqr = a*a + b*b;
                  aold = a;
                  bold = b;
                  }

               // Record number of iterations for pixel.
               pixelData.color (c, hueTable[i]);
               }

            putTuple (new OutputTuple (r, pixelData));
            }
         });
      }
   }

   // Output PNG file writing task.
   private static class OutputTask
      extends Task
      {
      // Command line arguments.
      int width;
      int height;
      File filename;

      // For writing PNG image file.
      ColorPngWriter writer;
      ColorImageQueue imageQueue;

      // Task main program.
      public void main (String[] args) throws Exception
         {
         // Parse command line arguments.
         width = Integer.parseInt (args[0]);
         height = Integer.parseInt (args[1]);
         filename = new File (args[7]);

         // Set up for writing PNG image file.
         writer = new ColorPngWriter (height, width,
            new BufferedOutputStream
               (new FileOutputStream (filename)));
         filename.setReadable (true, false);
         filename.setWritable (true, false);
         imageQueue = writer.getImageQueue();

         // Overlapped pixel data gathering and file writing.
         parallelDo (new Section()
            {
            // Pixel data gathering section.
            public void run() throws Exception
               {
               OutputTuple template = new OutputTuple();

Listing 24.1  MandelbrotClu.java (part 4)
dominates for small K values, and T decreases as K increases. But as K gets larger, the third term gets smaller, while the first term stays the same. Eventually the third term becomes smaller than the first term. After that, the running time T flattens out and approaches the first term as K increases.

There's an important lesson here. When doing strong scaling on a cluster parallel computer, you don't necessarily want to run the program on all the cores in the cluster. Rather, you want to run the program on only as many cores as are needed to minimize the running time. This might be fewer than the total number of cores. Measuring the program's performance and deriving a running time model, as I did above, lets you determine the optimum number of cores to use. For the images I computed, the running times on 36 cores were very nearly the same as the running times on 120 cores. So on the tardis cluster I could compute three images on 36 cores each in about the same time as I could compute one image on 120 cores. Limiting the number of cores per job would improve utilization of the cluster, allowing more jobs to run in a given amount of time.

This scaling behavior is a consequence of Amdahl's Law. If you run a parallel program on too many cores, the sequential portion (for the Mandelbrot Set program, the portion that writes the output image file) is going to dominate the parallelizable portion, and you won't get any further decreases in the running time. We didn't see this happening with the multicore parallel program because we could scale up to only 12 cores on one tardis node. Now with the cluster parallel program we can scale up to 120 cores on the whole tardis cluster, and we can observe the diminishing returns.

Points to Remember

- In a cluster parallel program that must write (or read) a file, consider doing the file I/O in a task that runs in the job's process.
- Use tuple space to convey the worker tasks' results to the output task.
- Define a tuple subclass whose fields hold the output results.
- When doing strong scaling on a cluster parallel program, as the number of cores increases, the running time initially decreases but eventually flattens out.
- Use the program's running time model, fitted to the program's measured running time data, to determine the optimum number of cores on which to run the program: the smallest number of cores needed to minimize the running time.
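The Amdahl's Law behavior discussed above can be checked with a few lines of arithmetic. This sketch uses an invented sequential fraction, not a measurement from this chapter:

```java
// Amdahl's Law: with a sequential fraction f of the total work, the
// speedup on K cores is 1 / (f + (1 - f)/K), which is capped at 1/f
// no matter how many cores are added. The fraction below is a
// hypothetical placeholder.
public class AmdahlDemo {
    static double speedup(double f, int K) {
        return 1.0 / (f + (1.0 - f) / K);
    }

    public static void main(String[] args) {
        double f = 0.02;  // hypothetical 2% sequential (file-writing) portion
        for (int K : new int[] {1, 12, 36, 120})
            System.out.printf("K = %3d  speedup = %.1f%n",
                K, speedup(f, K));
        // The cap is 1/f = 50; efficiency (speedup / K) falls
        // steadily as K grows, just as in Figure 24.2.
    }
}
```

With a 2% sequential fraction, efficiency on 120 cores is already well under half, which is why running each job on fewer cores can improve overall cluster utilization.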
               OutputTuple tuple;
               for (int i = 0; i < height; ++ i)
                  {
                  tuple = (OutputTuple) takeTuple (template);
                  imageQueue.put (tuple.row, tuple.pixelData);
                  }
               }
            },
         new Section()
            {
            // File writing section.
            public void run() throws Exception
               {
               writer.write();
               }
            });
         }
      }
   }

Listing 24.1  MandelbrotClu.java (part 5)
*** TROUBLESHOOTING TIP *** If you are experiencing errors with your deliverable 2 setup which deliverable 3 is built upon, delete the deliverable 2 project within Eclipse, and delete the non working newbas
More informationIntroduction IS
Introduction IS 313 4.1.2003 Outline Goals of the course Course organization Java command line Object-oriented programming File I/O Business Application Development Business process analysis Systems analysis
More informationMaster-Worker pattern
COSC 6397 Big Data Analytics Master Worker Programming Pattern Edgar Gabriel Fall 2018 Master-Worker pattern General idea: distribute the work among a number of processes Two logically different entities:
More informationCSC630/CSC730 Parallel & Distributed Computing
CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2
More informationNote: Each loop has 5 iterations in the ThreeLoopTest program.
Lecture 23 Multithreading Introduction Multithreading is the ability to do multiple things at once with in the same application. It provides finer granularity of concurrency. A thread sometimes called
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete
More informationMaster-Worker pattern
COSC 6397 Big Data Analytics Master Worker Programming Pattern Edgar Gabriel Spring 2017 Master-Worker pattern General idea: distribute the work among a number of processes Two logically different entities:
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationCSE 332: Data Structures & Parallelism Lecture 15: Analysis of Fork-Join Parallel Programs. Ruth Anderson Autumn 2018
CSE 332: Data Structures & Parallelism Lecture 15: Analysis of Fork-Join Parallel Programs Ruth Anderson Autumn 2018 Outline Done: How to use fork and join to write a parallel algorithm Why using divide-and-conquer
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number
More informationChapter 8: Main Memory. Operating System Concepts 9 th Edition
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationGenerating Charts in PDF Format with JFreeChart and itext
Generating Charts in PDF Format with JFreeChart and itext Written by David Gilbert May 28, 2002 c 2002, Simba Management Limited. All rights reserved. Everyone is permitted to copy and distribute verbatim
More informationFractals exercise. Investigating task farms and load imbalance
Fractals exercise Investigating task farms and load imbalance Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationSubset Sum - A Dynamic Parallel Solution
Subset Sum - A Dynamic Parallel Solution Team Cthulu - Project Report ABSTRACT Tushar Iyer Rochester Institute of Technology Rochester, New York txi9546@rit.edu The subset sum problem is an NP-Complete
More informationAn Exceptional Class. by Peter Lavin. June 1, 2004
An Exceptional Class by Peter Lavin June 1, 2004 Overview When a method throws an exception, Java requires that it be caught. Some exceptions require action on the programmer s part and others simply need
More informationAnimations involving numbers
136 Chapter 8 Animations involving numbers 8.1 Model and view The examples of Chapter 6 all compute the next picture in the animation from the previous picture. This turns out to be a rather restrictive
More informationRepe$$on CSC 121 Spring 2017 Howard Rosenthal
Repe$$on CSC 121 Spring 2017 Howard Rosenthal Lesson Goals Learn the following three repetition structures in Java, their syntax, their similarities and differences, and how to avoid common errors when
More informationCSCI 135 Exam #0 Fundamentals of Computer Science I Fall 2012
CSCI 135 Exam #0 Fundamentals of Computer Science I Fall 2012 Name: This exam consists of 7 problems on the following 6 pages. You may use your single- side hand- written 8 ½ x 11 note sheet during the
More informationChapter 8: Main Memory
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationPage 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1
Program Performance Metrics The parallel run time (Tpar) is the time from the moment when computation starts to the moment when the last processor finished his execution The speedup (S) is defined as the
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationSorting. Bubble Sort. Selection Sort
Sorting In this class we will consider three sorting algorithms, that is, algorithms that will take as input an array of items, and then rearrange (sort) those items in increasing order within the array.
More informationProject 1 Computer Science 2334 Spring 2016 This project is individual work. Each student must complete this assignment independently.
Project 1 Computer Science 2334 Spring 2016 This project is individual work. Each student must complete this assignment independently. User Request: Create a simple movie data system. Milestones: 1. Use
More informationAcknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text
Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center
More informationLab 09 - Virtual Memory
Lab 09 - Virtual Memory Due: November 19, 2017 at 4:00pm 1 mmapcopy 1 1.1 Introduction 1 1.1.1 A door predicament 1 1.1.2 Concepts and Functions 2 1.2 Assignment 3 1.2.1 mmap copy 3 1.2.2 Tips 3 1.2.3
More informationA Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs Steve Wolfman, based on work by Dan Grossman (with small tweaks by Alan Hu) Learning
More informationUnderstanding Parallelism and the Limitations of Parallel Computing
Understanding Parallelism and the Limitations of Parallel omputing Understanding Parallelism: Sequential work After 16 time steps: 4 cars Scalability Laws 2 Understanding Parallelism: Parallel work After
More informationNesting Foreach Loops
Steve Weston doc@revolutionanalytics.com December 9, 2017 1 Introduction The foreach package provides a looping construct for executing R code repeatedly. It is similar to the standard for loop, which
More informationSubset Sum Problem Parallel Solution
Subset Sum Problem Parallel Solution Project Report Harshit Shah hrs8207@rit.edu Rochester Institute of Technology, NY, USA 1. Overview Subset sum problem is NP-complete problem which can be solved in
More informationCS 61C: Great Ideas in Computer Architecture. Amdahl s Law, Thread Level Parallelism
CS 61C: Great Ideas in Computer Architecture Amdahl s Law, Thread Level Parallelism Instructor: Alan Christopher 07/17/2014 Summer 2014 -- Lecture #15 1 Review of Last Lecture Flynn Taxonomy of Parallel
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationParallel Computing Concepts. CSInParallel Project
Parallel Computing Concepts CSInParallel Project July 26, 2012 CONTENTS 1 Introduction 1 1.1 Motivation................................................ 1 1.2 Some pairs of terms...........................................
More informationParallel Programming
Parallel Programming OpenMP Dr. Hyrum D. Carroll November 22, 2016 Parallel Programming in a Nutshell Load balancing vs Communication This is the eternal problem in parallel computing. The basic approaches
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationParallelism paradigms
Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization
More informationOutline. CSC 447: Parallel Programming for Multi- Core and Cluster Systems
CSC 447: Parallel Programming for Multi- Core and Cluster Systems Performance Analysis Instructor: Haidar M. Harmanani Spring 2018 Outline Performance scalability Analytical performance measures Amdahl
More informationCS 62 Practice Final SOLUTIONS
CS 62 Practice Final SOLUTIONS 2017-5-2 Please put your name on the back of the last page of the test. Note: This practice test may be a bit shorter than the actual exam. Part 1: Short Answer [32 points]
More informationAbout this exam review
Final Exam Review About this exam review I ve prepared an outline of the material covered in class May not be totally complete! Exam may ask about things that were covered in class but not in this review
More informationKeys to Faster Sampling in Dataflow
Keys to Faster Sampling in Dataflow Ben Chambers, former Cloud Software Engineer Rafael Fernandez, Cloud Engineering Manager Editor s Note: Ben Chambers made the majority of the contributions to this post
More informationLine Segment Intersection Dmitriy V'jukov
Line Segment Intersection Dmitriy V'jukov 1. Problem Statement Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers
More informationPart 1. Summary of For Loops and While Loops
NAME EET 2259 Lab 5 Loops OBJECTIVES -Understand when to use a For Loop and when to use a While Loop. -Write LabVIEW programs using each kind of loop. -Write LabVIEW programs with one loop inside another.
More information1.00 Introduction to Computers and Engineering Problem Solving. Quiz 1 March 7, 2003
1.00 Introduction to Computers and Engineering Problem Solving Quiz 1 March 7, 2003 Name: Email Address: TA: Section: You have 90 minutes to complete this exam. For coding questions, you do not need to
More informationFractals. Investigating task farms and load imbalance
Fractals Investigating task farms and load imbalance Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationTopics. Java arrays. Definition. Data Structures and Information Systems Part 1: Data Structures. Lecture 3: Arrays (1)
Topics Data Structures and Information Systems Part 1: Data Structures Michele Zito Lecture 3: Arrays (1) Data structure definition: arrays. Java arrays creation access Primitive types and reference types
More information