Chapter 24 File Output on a Cluster
Part I. Preliminaries
Part II. Tightly Coupled Multicore
Part III. Loosely Coupled Cluster
    Chapter 18. Massively Parallel
    Chapter 19. Hybrid Parallel
    Chapter 20. Tuple Space
    Chapter 21. Cluster Parallel Loops
    Chapter 22. Cluster Parallel Reduction
    Chapter 23. Cluster Load Balancing
    Chapter 24. File Output on a Cluster
    Chapter 25. Interacting Tasks
    Chapter 26. Cluster Heuristic Search
    Chapter 27. Cluster Work Queues
    Chapter 28. On-Demand Tasks
Part IV. GPU Acceleration
Part V. Big Data
Appendices
Recall the single-node multicore parallel program from Chapter 11 that computes an image of the Mandelbrot Set. The program partitioned the image rows among the threads of a parallel thread team. Each thread computed the color of each pixel in a certain row, put the row of pixels into a ColorImageQueue, and went on to the next available row. Simultaneously, another thread took each row of pixels out of the queue and wrote them to a PNG file.

Now let's make this a cluster parallel program. The program will illustrate several features of the Parallel Java 2 Library, namely customized tuple subclasses and file output in a job. Studying the program's strong scaling performance will reveal interesting behavior that we didn't see with single-node multicore parallel programs.

Figure 24.1 shows the cluster parallel Mandelbrot Set program's design. Like the previous multicore version, the program partitions the image rows among multiple parallel team threads. Unlike the previous version, the parallel team threads are located in multiple separate worker tasks. Each worker task runs in a separate backend process on one of the backend nodes in the cluster. I'll use a master-worker parallel for loop to partition the image rows among the tasks and threads.

Also like the previous version, the program has an I/O thread responsible for writing the output PNG file, as well as a ColorImageQueue from which the I/O thread obtains rows of pixels. The I/O thread and the queue reside in an output task, separate from the worker tasks and shared by all of them. I'll run the output task in the job's process on the frontend node, rather than in a backend process on a backend node. That way, the I/O thread runs in the user's account and is able to write the PNG file in the user's directory.
(If the output task ran in a backend process, it would typically run in a special Parallel Java account rather than the user's account, and it would typically not be able to write files in the user's directory.)

This is a distributed memory program. The worker tasks' team threads compute pixel rows, which are located in the backend processes' memories. The output task's image queue is located in the frontend process's memory.
Figure 24.1  Cluster parallel Mandelbrot Set program
Thus, it's not possible for a team thread to put a pixel row directly into the image queue, as the multicore parallel program could. This is where tuple space comes to the rescue. When a team thread has computed a pixel row, the team thread packages the pixel row, along with its row index, into an output tuple and puts the tuple into tuple space. A second gather thread in the output task repeatedly takes an output tuple, extracts the row index and the pixel row, and puts the pixel row into the image queue at the proper index. The I/O thread then removes pixel rows from the image queue and writes them to the PNG file. In this way, the computation's results are communicated from the worker tasks through tuple space to the output task. However, going through tuple space imposes a communication overhead that the multicore parallel program did not have. This communication overhead affects the cluster parallel program's scalability.

Listing 24.1 gives the source code for class edu.rit.pj2.example.MandelbrotClu. Like all cluster parallel programs, it begins with the job main program that defines the tasks. The masterFor() method (line 40) sets up a task group with K worker tasks, where K is specified by the workers option on the pj2 command. The masterFor() method also sets up a master-worker parallel for loop that partitions the outer loop over the image rows, from 0 through height - 1, among the worker tasks. Because the running time is different in every loop iteration, the parallel for loop needs a load balancing schedule; I specified a proportional schedule with a chunk factor of 10 (lines 38-39). This partitions the outer loop iterations into 10 times as many chunks as there are worker tasks, and each task repeatedly executes the next available chunk in a dynamic fashion.
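The chunking arithmetic behind a proportional schedule can be sketched in a few lines. This is illustrative only, not PJ2's actual scheduler implementation, and the class and method names are made up for the example:

```java
// Sketch of how a proportional schedule with a given chunk factor
// could partition a master-worker loop's iterations: the index range
// 0..count-1 is divided into (workers * chunkFactor) nearly equal
// chunks, which workers then claim dynamically.
public class ProportionalChunks {
    // Returns {lb, ub} inclusive index pairs covering 0..count-1.
    static int[][] chunks(int count, int workers, int chunkFactor) {
        int nChunks = workers * chunkFactor;
        int size = (count + nChunks - 1) / nChunks;  // ceiling division
        int n = (count + size - 1) / size;           // actual chunk count
        int[][] result = new int[n][2];
        for (int i = 0; i < n; ++i) {
            result[i][0] = i * size;
            result[i][1] = Math.min((i + 1) * size, count) - 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // 3200 rows, 4 workers, chunk factor 10 -> 40 chunks of 80 rows.
        int[][] c = chunks(3200, 4, 10);
        System.out.println(c.length + " chunks, first = [" +
            c[0][0] + "," + c[0][1] + "]");
    }
}
```

Each worker repeatedly grabs the next unclaimed chunk, which is what balances the wildly uneven per-row Mandelbrot workloads.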
The job main program also sets up the output task that will write the PNG file (lines 43-44), with the output task running in the job's process. Both the worker tasks and the output task are specified by start rules and so commence execution at the start of the job.

Next comes the OutputTuple subclass (line 71). It conveys a row of pixel colors, a ColorArray (line 74), along with the row index (line 73), from a worker task to the output task. The tuple subclass also provides the obligatory no-argument constructor (lines 76-78), writeOut() method (lines 88-92), and readIn() method (lines 94-98).

The WorkerTask class (line 102) is virtually identical to the single-node multicore MandelbrotSmp class from Chapter 11. There are only two differences. First, the worker task provides the worker portion of the master-worker parallel for loop (line 151). When the worker task obtains a chunk of row indexes from the master, the indexes are partitioned among the parallel team threads using a dynamic schedule for load balancing. Each loop iteration (line 160) computes the pixel colors for one row of the image, storing the colors in a per-thread color array (line 153). The second difference is that
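The essential contract of writeOut() and readIn() is that they stream the tuple's fields in the same order. Here is a plain-JDK sketch of that symmetric contract, using the JDK's DataOutputStream/DataInputStream as stand-ins for PJ2's OutStream/InStream (an assumption made purely for illustration; the class below is hypothetical):

```java
import java.io.*;

// Hypothetical analogue of OutputTuple serialization: write the row
// index and the pixel data, then read them back in exactly the same
// order. An int[] stands in for ColorArray.
public class TupleStreamDemo {
    static byte[] writeOut(int row, int[] pixelData) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(row);               // field 1: row index
        out.writeInt(pixelData.length);  // field 2: array length
        for (int p : pixelData) out.writeInt(p);
        out.flush();
        return bytes.toByteArray();
    }

    static int[] readIn(byte[] data) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(data));
        int row = in.readInt();          // must match the write order
        int len = in.readInt();
        int[] result = new int[len + 1]; // row followed by pixels
        result[0] = row;
        for (int i = 1; i <= len; ++i) result[i] = in.readInt();
        return result;
    }

    public static void main(String[] args) throws IOException {
        int[] round = readIn(writeOut(7, new int[] {1, 2, 3}));
        System.out.println(round[0] + " " + round[1] + " " +
            round[2] + " " + round[3]);  // row, then the three pixels
    }
}
```

If the two methods ever disagree on field order, the receiving task deserializes garbage, so keeping them mirror images of each other is the one rule to remember.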
package edu.rit.pj2.example;
import edu.rit.image.Color;
import edu.rit.image.ColorArray;
import edu.rit.image.ColorImageQueue;
import edu.rit.image.ColorPngWriter;
import edu.rit.io.InStream;
import edu.rit.io.OutStream;
import edu.rit.pj2.Job;
import edu.rit.pj2.Loop;
import edu.rit.pj2.Schedule;
import edu.rit.pj2.Section;
import edu.rit.pj2.Task;
import edu.rit.pj2.Tuple;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class MandelbrotClu
   extends Job
   {
   // Job main program.
   public void main (String[] args)
      {
      // Parse command line arguments.
      if (args.length != 8) usage();
      int width = Integer.parseInt (args[0]);
      int height = Integer.parseInt (args[1]);
      double xcenter = Double.parseDouble (args[2]);
      double ycenter = Double.parseDouble (args[3]);
      double resolution = Double.parseDouble (args[4]);
      int maxiter = Integer.parseInt (args[5]);
      double gamma = Double.parseDouble (args[6]);
      File filename = new File (args[7]);

      // Set up task group with K worker tasks. Partition rows among
      // workers.
      masterSchedule (proportional);
      masterChunk (10);
      masterFor (0, height - 1, WorkerTask.class) .args (args);

      // Set up PNG file writing task.
      rule() .task (OutputTask.class) .args (args) .runInJobProcess();
      }

   // Print a usage message and exit.
   private static void usage()
      {
      System.err.println ("Usage: java pj2 [workers=<K>] " +
         "edu.rit.pj2.example.MandelbrotClu <width> <height> " +
         "<xcenter> <ycenter> <resolution> <maxiter> <gamma> " +
         "<filename>");
      System.err.println ("<K> = Number of worker tasks (default 1)");
      System.err.println ("<width> = Image width (pixels)");
      System.err.println ("<height> = Image height (pixels)");
      System.err.println ("<xcenter> = X coordinate of center " +

Listing 24.1  MandelbrotClu.java (part 1)
once all columns in the row have been computed, the worker task packages the row index and the color array into an output tuple and puts the tuple into tuple space (line 189), whence the output task can take the tuple.

Last comes the OutputTask class (line 196), which runs in the job's process. After setting up the PNG file writer and the color image queue, the task runs two parallel sections simultaneously in two threads (line 227). The first section repeatedly takes an output tuple out of tuple space and puts the tuple's pixel data into the image queue at the row index indicated in the tuple. The takeTuple() method is given a blank output tuple as the template; this matches any output tuple containing any pixel row, no matter which worker task put the tuple. The tuple's row index ensures that the pixel data goes into the proper image row, regardless of the order in which the tuples arrive. The first section takes exactly as many tuples as there are image rows (the height argument). The second section merely uses the PNG image writer to write the PNG file.

Each worker task terminates when there are no more chunks of pixel rows to calculate. The output task terminates when the first parallel section has taken and processed all the output tuples and the second parallel section has finished writing the PNG file. At that point the job itself terminates.

I ran the Mandelbrot Set program on the tardis cluster to study the program's strong scaling performance, computing images of five sizes. For partitioning at the master level, the program is hard-coded to use a proportional schedule with a chunk factor of 10. For partitioning at the worker level, the program is hard-coded to use a dynamic schedule.
To measure the sequential version, I ran the MandelbrotSeq program from Chapter 11 on one node, using commands like this:

$ java pj2 debug=makespan edu.rit.pj2.example.MandelbrotSeq \
    ms3200.png

To measure the parallel version on one core, I ran the MandelbrotClu program with one worker task and one thread, using commands like this:

$ java pj2 debug=makespan workers=1 cores=1 \
    edu.rit.pj2.example.MandelbrotClu \
    ms3200.png

To measure the parallel version on multiple cores, I ran the MandelbrotClu program with one to ten worker tasks and with all cores on each node (12 to 120 cores), using commands like this:

$ java pj2 debug=makespan workers=2 \
    edu.rit.pj2.example.MandelbrotClu \
    ms3200.png
         "point");
      System.err.println ("<ycenter> = Y coordinate of center " +
         "point");
      System.err.println ("<resolution> = Pixels per unit");
      System.err.println ("<maxiter> = Maximum number of " +
         "iterations");
      System.err.println ("<gamma> = Used to calculate pixel hues");
      System.err.println ("<filename> = PNG image file name");
      terminate (1);
      }

   // Tuple for sending results from worker tasks to output task.
   private static class OutputTuple
      extends Tuple
      {
      public int row;               // Row index
      public ColorArray pixelData;  // Row's pixel data

      public OutputTuple()
         {
         }

      public OutputTuple (int row, ColorArray pixelData)
         {
         this.row = row;
         this.pixelData = pixelData;
         }

      public void writeOut (OutStream out) throws IOException
         {
         out.writeUnsignedInt (row);
         out.writeObject (pixelData);
         }

      public void readIn (InStream in) throws IOException
         {
         row = in.readUnsignedInt();
         pixelData = (ColorArray) in.readObject();
         }
      }

   // Worker task class.
   private static class WorkerTask
      extends Task
      {
      // Command line arguments.
      int width;
      int height;
      double xcenter;
      double ycenter;
      double resolution;
      int maxiter;
      double gamma;

      // Initial pixel offsets from center.
      int xoffset;

Listing 24.1  MandelbrotClu.java (part 2)
Figure 24.2  MandelbrotClu strong scaling performance metrics
      int yoffset;

      // Table of hues.
      Color[] hueTable;

      // Worker task main program.
      public void main (String[] args) throws Exception
         {
         // Parse command line arguments.
         width = Integer.parseInt (args[0]);
         height = Integer.parseInt (args[1]);
         xcenter = Double.parseDouble (args[2]);
         ycenter = Double.parseDouble (args[3]);
         resolution = Double.parseDouble (args[4]);
         maxiter = Integer.parseInt (args[5]);
         gamma = Double.parseDouble (args[6]);

         // Initial pixel offsets from center.
         xoffset = -(width - 1) / 2;
         yoffset = (height - 1) / 2;

         // Create table of hues for different iteration counts.
         hueTable = new Color [maxiter + 2];
         for (int i = 1; i <= maxiter; ++ i)
            hueTable[i] = new Color().hsb
               (/*hue*/ (float) Math.pow ((double)(i - 1)/maxiter, gamma),
                /*sat*/ 1.0f,
                /*bri*/ 1.0f);
         hueTable[maxiter + 1] = new Color().hsb (1.0f, 1.0f, 0.0f);

         // Compute all rows and columns.
         workerFor() .schedule (dynamic) .exec (new Loop()
            {
            ColorArray pixelData;

            public void start()
               {
               pixelData = new ColorArray (width);
               }

            public void run (int r) throws Exception
               {
               double y = ycenter + (yoffset - r) / resolution;
               for (int c = 0; c < width; ++ c)
                  {
                  double x = xcenter + (xoffset + c) / resolution;

                  // Iterate until convergence.
                  int i = 0;
                  double aold = 0.0;
                  double bold = 0.0;
                  double a = 0.0;
                  double b = 0.0;
                  double zmagsqr = 0.0;

Listing 24.1  MandelbrotClu.java (part 3)
Figure 24.2 plots the running times, speedups, and efficiencies I observed. The running time plots' behavior is peculiar. The running times decrease as the number of cores increases, more or less as expected with strong scaling, but only up to a certain point. At around 36 or 48 cores, the running time plots flatten out, and there is no further reduction as more cores are added. Also, the efficiency plots show a steady decrease in efficiency as more cores are added, a much larger drop than we've seen before. What's going on?

To find out, I fit a running time model, with T a function of the problem size N and the number of cores K, to the measured data. Fitting this model to the data yields this running time formula:

    T = (a + b N) + (c + d N) K + (e + f N) N / K,    (24.1)

where a through f are the fitted coefficients. Plugging a certain problem size N into Equation 24.1 yields a running time formula as a function of just K. For example, plugging one image's problem size N (the number of inner loop iterations) into Equation 24.1, the formula becomes

    T = A + B K + C / K,    (24.2)

where A = a + b N, B = c + d N, and C = (e + f N) N are constants. For the numbers of cores I used, the second term in Equation 24.2 is negligible compared to the other terms. Figure 24.3 plots the first and third terms separately in black, along with their sum T in red. Because the third term's coefficient is so much larger than the first term's, the third term

Figure 24.3  MandelbrotClu running time model
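A fitted model of the form in Equation 24.2 lets you locate the optimum core count directly. The sketch below evaluates T(K) = A + B K + C/K and scans for the minimizing K; the coefficient values are invented placeholders for illustration, not the values fitted in this chapter:

```java
// Evaluate a strong-scaling running time model T(K) = A + B*K + C/K
// and find the K (1..120 cores) that minimizes it. The coefficients
// are made-up placeholders, not measured data.
public class ScalingModel {
    static final double A = 2.0;    // constant (sequential) term, seconds
    static final double B = 0.5;    // per-core overhead term, seconds
    static final double C = 3600.0; // parallelizable work term, seconds

    static double t(int K) {
        return A + B * K + C / K;
    }

    public static void main(String[] args) {
        int bestK = 1;
        for (int K = 1; K <= 120; ++K)
            if (t(K) < t(bestK)) bestK = K;
        System.out.printf("optimum K = %d, T(K) = %.2f sec%n",
            bestK, t(bestK));
        // Beyond bestK the B*K term makes T creep back up, so adding
        // cores past the optimum buys nothing.
    }
}
```

Analytically the minimum of A + B K + C/K falls at K = sqrt(C/B); the scan above finds the same point without calculus and works for any model you can evaluate.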
               while (i <= maxiter && zmagsqr <= 4.0)
                  {
                  ++ i;
                  a = aold*aold - bold*bold + x;
                  b = 2.0*aold*bold + y;
                  zmagsqr = a*a + b*b;
                  aold = a;
                  bold = b;
                  }

               // Record number of iterations for pixel.
               pixelData.color (c, hueTable[i]);
               }

            putTuple (new OutputTuple (r, pixelData));
            }
         });
      }
   }

   // Output PNG file writing task.
   private static class OutputTask
      extends Task
      {
      // Command line arguments.
      int width;
      int height;
      File filename;

      // For writing PNG image file.
      ColorPngWriter writer;
      ColorImageQueue imageQueue;

      // Task main program.
      public void main (String[] args) throws Exception
         {
         // Parse command line arguments.
         width = Integer.parseInt (args[0]);
         height = Integer.parseInt (args[1]);
         filename = new File (args[7]);

         // Set up for writing PNG image file.
         writer = new ColorPngWriter (height, width,
            new BufferedOutputStream
               (new FileOutputStream (filename)));
         filename.setReadable (true, false);
         filename.setWritable (true, false);
         imageQueue = writer.getImageQueue();

         // Overlapped pixel data gathering and file writing.
         parallelDo (new Section()
            {
            // Pixel data gathering section.
            public void run() throws Exception
               {
               OutputTuple template = new OutputTuple();

Listing 24.1  MandelbrotClu.java (part 4)
dominates for small K values, and T decreases as K increases. But as K gets larger, the third term gets smaller, while the first term stays the same. Eventually the third term becomes smaller than the first term. After that, the running time T flattens out and approaches the first term as K increases.

There's an important lesson here. When doing strong scaling on a cluster parallel computer, you don't necessarily want to run the program on all the cores in the cluster. Rather, you want to run the program on only as many cores as are needed to minimize the running time. This might be fewer than the total number of cores. Measuring the program's performance and deriving a running time model, as I did above, lets you determine the optimum number of cores to use. For the images I computed, the running times on 36 cores were very nearly the same as the running times on 120 cores. So on the tardis cluster I could compute three images on 36 cores each in about the same time as I could compute one image on 120 cores. Limiting the number of cores per job would improve utilization of the cluster, allowing more jobs to run in a given amount of time.

This scaling behavior is a consequence of Amdahl's Law. If you run a parallel program on too many cores, the sequential portion (for the Mandelbrot Set program, the portion that writes the output image file) is going to dominate the parallelizable portion, and you won't get any further decreases in the running time. We didn't see this happening with the multicore parallel program because we could scale up to only 12 cores on one tardis node. Now with the cluster parallel program we can scale up to 120 cores on the whole tardis cluster, and we can observe the diminishing returns.

Points to Remember

- In a cluster parallel program that must write (or read) a file, consider doing the file I/O in a task that runs in the job's process.
- Use tuple space to convey the worker tasks' results to the output task.
- Define a tuple subclass whose fields hold the output results.
- When doing strong scaling on a cluster parallel program, as the number of cores increases, the running time initially decreases but eventually flattens out.
- Use the program's running time model, fitted to the program's measured running time data, to determine the optimum number of cores on which to run the program: the smallest number of cores needed to minimize the running time.
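The Amdahl's Law behavior discussed above can be checked with a few lines of arithmetic. This sketch uses an invented sequential fraction, not a measurement from this chapter:

```java
// Amdahl's Law: with a sequential fraction f of the total work, the
// speedup on K cores is 1 / (f + (1 - f)/K), which is capped at 1/f
// no matter how many cores are added. The fraction below is a
// hypothetical placeholder.
public class AmdahlDemo {
    static double speedup(double f, int K) {
        return 1.0 / (f + (1.0 - f) / K);
    }

    public static void main(String[] args) {
        double f = 0.02;  // hypothetical 2% sequential (file-writing) portion
        for (int K : new int[] {1, 12, 36, 120})
            System.out.printf("K = %3d  speedup = %.1f%n",
                K, speedup(f, K));
        // The cap is 1/f = 50; efficiency (speedup / K) falls
        // steadily as K grows, just as in Figure 24.2.
    }
}
```

With a 2% sequential fraction, efficiency on 120 cores is already well under half, which is why running each job on fewer cores can improve overall cluster utilization.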
               OutputTuple tuple;
               for (int i = 0; i < height; ++ i)
                  {
                  tuple = (OutputTuple) takeTuple (template);
                  imageQueue.put (tuple.row, tuple.pixelData);
                  }
               }
            },
         new Section()
            {
            // File writing section.
            public void run() throws Exception
               {
               writer.write();
               }
            });
         }
      }
   }

Listing 24.1  MandelbrotClu.java (part 5)
*** TROUBLESHOOTING TIP *** If you are experiencing errors with your deliverable 2 setup which deliverable 3 is built upon, delete the deliverable 2 project within Eclipse, and delete the non working newbas
More informationIntroduction IS
Introduction IS 313 4.1.2003 Outline Goals of the course Course organization Java command line Object-oriented programming File I/O Business Application Development Business process analysis Systems analysis
More informationMaster-Worker pattern
COSC 6397 Big Data Analytics Master Worker Programming Pattern Edgar Gabriel Fall 2018 Master-Worker pattern General idea: distribute the work among a number of processes Two logically different entities:
More informationCSC630/CSC730 Parallel & Distributed Computing
CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2
More informationNote: Each loop has 5 iterations in the ThreeLoopTest program.
Lecture 23 Multithreading Introduction Multithreading is the ability to do multiple things at once with in the same application. It provides finer granularity of concurrency. A thread sometimes called
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete
More informationMaster-Worker pattern
COSC 6397 Big Data Analytics Master Worker Programming Pattern Edgar Gabriel Spring 2017 Master-Worker pattern General idea: distribute the work among a number of processes Two logically different entities:
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationCSE 332: Data Structures & Parallelism Lecture 15: Analysis of Fork-Join Parallel Programs. Ruth Anderson Autumn 2018
CSE 332: Data Structures & Parallelism Lecture 15: Analysis of Fork-Join Parallel Programs Ruth Anderson Autumn 2018 Outline Done: How to use fork and join to write a parallel algorithm Why using divide-and-conquer
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number
More informationChapter 8: Main Memory. Operating System Concepts 9 th Edition
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationGenerating Charts in PDF Format with JFreeChart and itext
Generating Charts in PDF Format with JFreeChart and itext Written by David Gilbert May 28, 2002 c 2002, Simba Management Limited. All rights reserved. Everyone is permitted to copy and distribute verbatim
More informationFractals exercise. Investigating task farms and load imbalance
Fractals exercise Investigating task farms and load imbalance Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationSubset Sum - A Dynamic Parallel Solution
Subset Sum - A Dynamic Parallel Solution Team Cthulu - Project Report ABSTRACT Tushar Iyer Rochester Institute of Technology Rochester, New York txi9546@rit.edu The subset sum problem is an NP-Complete
More informationAn Exceptional Class. by Peter Lavin. June 1, 2004
An Exceptional Class by Peter Lavin June 1, 2004 Overview When a method throws an exception, Java requires that it be caught. Some exceptions require action on the programmer s part and others simply need
More informationAnimations involving numbers
136 Chapter 8 Animations involving numbers 8.1 Model and view The examples of Chapter 6 all compute the next picture in the animation from the previous picture. This turns out to be a rather restrictive
More informationRepe$$on CSC 121 Spring 2017 Howard Rosenthal
Repe$$on CSC 121 Spring 2017 Howard Rosenthal Lesson Goals Learn the following three repetition structures in Java, their syntax, their similarities and differences, and how to avoid common errors when
More informationCSCI 135 Exam #0 Fundamentals of Computer Science I Fall 2012
CSCI 135 Exam #0 Fundamentals of Computer Science I Fall 2012 Name: This exam consists of 7 problems on the following 6 pages. You may use your single- side hand- written 8 ½ x 11 note sheet during the
More informationChapter 8: Main Memory
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationPage 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1
Program Performance Metrics The parallel run time (Tpar) is the time from the moment when computation starts to the moment when the last processor finished his execution The speedup (S) is defined as the
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationSorting. Bubble Sort. Selection Sort
Sorting In this class we will consider three sorting algorithms, that is, algorithms that will take as input an array of items, and then rearrange (sort) those items in increasing order within the array.
More informationProject 1 Computer Science 2334 Spring 2016 This project is individual work. Each student must complete this assignment independently.
Project 1 Computer Science 2334 Spring 2016 This project is individual work. Each student must complete this assignment independently. User Request: Create a simple movie data system. Milestones: 1. Use
More informationAcknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text
Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center
More informationLab 09 - Virtual Memory
Lab 09 - Virtual Memory Due: November 19, 2017 at 4:00pm 1 mmapcopy 1 1.1 Introduction 1 1.1.1 A door predicament 1 1.1.2 Concepts and Functions 2 1.2 Assignment 3 1.2.1 mmap copy 3 1.2.2 Tips 3 1.2.3
More informationA Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs Steve Wolfman, based on work by Dan Grossman (with small tweaks by Alan Hu) Learning
More informationUnderstanding Parallelism and the Limitations of Parallel Computing
Understanding Parallelism and the Limitations of Parallel omputing Understanding Parallelism: Sequential work After 16 time steps: 4 cars Scalability Laws 2 Understanding Parallelism: Parallel work After
More informationNesting Foreach Loops
Steve Weston doc@revolutionanalytics.com December 9, 2017 1 Introduction The foreach package provides a looping construct for executing R code repeatedly. It is similar to the standard for loop, which
More informationSubset Sum Problem Parallel Solution
Subset Sum Problem Parallel Solution Project Report Harshit Shah hrs8207@rit.edu Rochester Institute of Technology, NY, USA 1. Overview Subset sum problem is NP-complete problem which can be solved in
More informationCS 61C: Great Ideas in Computer Architecture. Amdahl s Law, Thread Level Parallelism
CS 61C: Great Ideas in Computer Architecture Amdahl s Law, Thread Level Parallelism Instructor: Alan Christopher 07/17/2014 Summer 2014 -- Lecture #15 1 Review of Last Lecture Flynn Taxonomy of Parallel
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationParallel Computing Concepts. CSInParallel Project
Parallel Computing Concepts CSInParallel Project July 26, 2012 CONTENTS 1 Introduction 1 1.1 Motivation................................................ 1 1.2 Some pairs of terms...........................................
More informationParallel Programming
Parallel Programming OpenMP Dr. Hyrum D. Carroll November 22, 2016 Parallel Programming in a Nutshell Load balancing vs Communication This is the eternal problem in parallel computing. The basic approaches
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationParallelism paradigms
Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization
More informationOutline. CSC 447: Parallel Programming for Multi- Core and Cluster Systems
CSC 447: Parallel Programming for Multi- Core and Cluster Systems Performance Analysis Instructor: Haidar M. Harmanani Spring 2018 Outline Performance scalability Analytical performance measures Amdahl
More informationCS 62 Practice Final SOLUTIONS
CS 62 Practice Final SOLUTIONS 2017-5-2 Please put your name on the back of the last page of the test. Note: This practice test may be a bit shorter than the actual exam. Part 1: Short Answer [32 points]
More informationAbout this exam review
Final Exam Review About this exam review I ve prepared an outline of the material covered in class May not be totally complete! Exam may ask about things that were covered in class but not in this review
More informationKeys to Faster Sampling in Dataflow
Keys to Faster Sampling in Dataflow Ben Chambers, former Cloud Software Engineer Rafael Fernandez, Cloud Engineering Manager Editor s Note: Ben Chambers made the majority of the contributions to this post
More informationLine Segment Intersection Dmitriy V'jukov
Line Segment Intersection Dmitriy V'jukov 1. Problem Statement Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers
More informationPart 1. Summary of For Loops and While Loops
NAME EET 2259 Lab 5 Loops OBJECTIVES -Understand when to use a For Loop and when to use a While Loop. -Write LabVIEW programs using each kind of loop. -Write LabVIEW programs with one loop inside another.
More information1.00 Introduction to Computers and Engineering Problem Solving. Quiz 1 March 7, 2003
1.00 Introduction to Computers and Engineering Problem Solving Quiz 1 March 7, 2003 Name: Email Address: TA: Section: You have 90 minutes to complete this exam. For coding questions, you do not need to
More informationFractals. Investigating task farms and load imbalance
Fractals Investigating task farms and load imbalance Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationTopics. Java arrays. Definition. Data Structures and Information Systems Part 1: Data Structures. Lecture 3: Arrays (1)
Topics Data Structures and Information Systems Part 1: Data Structures Michele Zito Lecture 3: Arrays (1) Data structure definition: arrays. Java arrays creation access Primitive types and reference types
More information