I. Course Title
Parallel Computing 2

II. Course Description
Students study parallel programming and visualization in a variety of contexts with an emphasis on underlying and experimental technologies. Topics include orbital mechanics and the N-body problem, graphics rendering via ray tracing, and relaxation methods toward a steady state. The programming language is C, using both MPI and 3-D OpenGL. Additional tools and environments include OpenMP, pthreads, sockets, and Nvidia's CUDA for GPGPU.

III. Performance Indicators
TJ Specific Performance Indicators

Standard 1
The student will investigate and understand that parallelism must scale properly (and efficiently) in the case of large 3-D rendering problems, for example using recursive ray tracing to map a defined geometry onto an output bitmap. Ray tracing models a 3-D geometry involving an eye, a screen, and a set of objects. Vector calculations determine which object is visible and also whether it is in shadow. Recursive calculations determine reflections for those objects with that material property.

Benchmark 1.a Investigate and Understand Designated Lab Techniques
The student will investigate and understand designated lab techniques.

Indicator 1.a.1 Demonstrate Basic Lab Techniques
Demonstrate the following basic lab techniques: output a 2-D bitmap image file; solve a quadratic equation in the rendering code to determine sphere-line intersection; calculate a dot product to determine a gradient color value from a single point-light source.

Benchmark 1.b Investigate and Understand Graphics Rendering Techniques
The student will investigate and understand graphics rendering techniques.

Indicator 1.b.1 Graphics Rendering
Demonstrate the following technique: construct a scene containing spheres and infinite planes (axis-aligned, but also with a checkerboard pattern), including shadow calculations and reflection, as well as recursive rendering therein.

Indicator 1.b.2 Triangulated Geometry
Demonstrate the following technique: determine the point-of-intersection for a line and a triangle, such that any geometry whose surface has been triangulated can then be rendered (e.g., teapot, rabbit, pyramid, elephant).

Indicator 1.b.3 Animated Output Movie
Demonstrate the following technique: loop the rendering function where at least one parameter of the scene is changing, outputting for each particular value of that parameter a single frame of a movie. After the run, as a post-processing step, combine those frames into an animated movie file.

Benchmark 1.c Investigate and Understand Texture Mapping
The student will investigate and understand texture mapping.

Indicator 1.c.1 Texture Mapping
Rather than define a solid color for a particular geometric object (e.g., the floor, or a sphere), the student will map the calculated point-of-intersection on that object to an image file and from that image file determine a color for that point. This technique may be used to map geographic data onto a sphere to produce a globe, or photographic data to show a person's face or the image of an animal, separate from changing the actual geometry of the object.

Standard 2
The student will investigate and understand that fine-grain parallelism (i.e., not the decomposition of a coarse space, grid or otherwise) may be used for classic algorithms to improve runtime. A summation algorithm may be coded either in a loop or in a parallel tree. More sophisticated parallel tree code involves both up-and-down passes. The merge sort algorithm may then be coded to run in sub-linear time.

Benchmark 2.a Investigate and Understand Designated Lab Techniques
The student will investigate and understand designated lab techniques.
Indicator 2.a.1 Demonstrate Basic Lab Techniques
Demonstrate the following basic lab techniques: launch the XMT simulator, either directly or by first converting the source code to an OpenMP version; analyze the performance of an XMT-C code in terms of both work and time.
Benchmark 2.b Investigate and Understand the Use of Fine-Grain Parallel Code to Calculate a Summation
The student will investigate and understand the use of fine-grain parallel code to calculate a summation, by using a binary tree structure rather than a simple 1-D list of values, and a loop over parallel-spawns rather than a serial loop.

Indicator 2.b.1 Investigate and Understand the Use of a Parallel-Spawn for Simultaneous Execution
The student will investigate and understand the use of a parallel-spawn, a feature of the XMT-C language, which acts essentially like a massively multi-threaded code, only with highly efficient hardware and a much simpler coding interface. This spawn command can be used to execute a series of pair-wise sums on a list of data. Each such step happens simultaneously in O(1) time and O(N) work, halving the number of values still to be summed. After O(log_2 N) levels of such spawns, done in a loop, we arrive at the overall sum. Total time is O(log_2 N) and total work is, still, O(N) operations. The work cannot be reduced even in theory, since all O(N) values must be seen.

Indicator 2.b.2 Investigate and Understand the Use of Up-and-Down Passes in a Parallel Binary Tree
The student will investigate and understand the use of up-and-down passes in a parallel binary tree in order to code more sophisticated algorithms such as prefix-sum (widely used in general) and prefix-min. In these cases the result of the parallel process is not a single value (i.e., the sum) but rather a list of values (i.e., all of the prefix-sums).

Indicator 2.b.3 Investigate and Understand the Parallel Rank Operation on Two Sorted Sub-Lists
The student will investigate and understand the parallel rank operation on two sorted sub-lists. The ultimate goal is a parallel merge sort. A first step toward that goal is the determination of which slot a given value would occupy if it were actually in the other sorted list instead.
Its rank in its own list is obviously known (it is simply the index), and since the second list is sorted, the rank in that list can be determined with a binary search in O(log_2 N) time and work. Since all binary searches for all values in both lists can be performed in parallel, the total time is also O(log_2 N), but the total work is O(N log_2 N), worse than a serial zipper-merge, which requires only O(N) work (i.e., total operations).
Indicator 2.b.4 Implement a Parallel Merge Sort that Runs in Sub-Linear Time
Implement a parallel merge sort using the parallel rank operation described above. There are O(log_2 N) levels in total. As described, the amount of work on each level of the recursive sort would be O(N log_2 N) rather than O(N), so the total work is O(N (log_2 N)^2) instead of O(N log_2 N). Total time is O((log_2 N)^2) instead of O(N log_2 N). One goal is to maintain the significant time improvement while reducing total work back down to the serial level.

Standard 3
The student will investigate and understand that all-pairs communication may be required in a parallel code for problems involving a highly coupled calculation, such as when physical forces act at any distance. Applications such as gravity simulations use highly coupled calculations. The simulation progresses by calculating forces and then updating positions. Theoretical scaling of such codes is realized in practice on computing clusters.

Benchmark 3.a
The student will investigate and understand the construction and analysis of an all-pairs simulation, assuming parallel code with a standard communication protocol on a modern parallel system.

Indicator 3.a.1 Investigate and Understand the Construction of an All-Pairs Simulation
The student will investigate and understand the construction of an all-pairs simulation. Students should write code to build a working version of such a simulation. For instance, if celestial bodies are modeled where the interactions are based solely on gravity (i.e., no charged particles, no collisions), then each body will influence every other body, but perhaps by only a very small amount. Two loops are required: one over all the bodies, and then an inner loop over all the other bodies. Forces are accumulated for each body in the loops, after which a single loop updates all positions.
Indicator 3.a.2 Investigate and Understand the Scaling of an All-Pairs Simulation
The student will investigate and understand the scaling of an all-pairs simulation. On the one hand, theoretical results using Amdahl's Law may determine a bound on the expected speed-up of a parallel code, based on the fraction of the overall code that remains serial. On the other hand, an actual implementation of running code in MPI or OpenMP or pthreads or any other system will show measurable improvement when deployed on an actual parallel system, a dedicated cluster or otherwise. The observed results can then be compared to theory for a variety of cases.

Indicator 3.a.3 Orally Present the Results of an Investigation
Orally present the results.

Standard 4
The student will investigate and understand the use of massively parallel multi-threaded systems such as many-core chips, large supercomputers, and general-purpose computing on commodity graphics cards. Rather than decompose a problem across nodes, one can use threads instead. Threads have access to a shared memory space that sub-processes do not. The potential of appliance-like parallelism involves careful planning for the future.

Benchmark 4.a
The student will investigate and understand the use of threads, rather than processes, for parallel codes, where typically all sub-tasks are processed within a single machine or even within a single graphics card.

Indicator 4.a.1 Investigate and Understand the Use of a Multi-Threaded Code
The student will investigate and understand the use of a multi-threaded code. Options include the standard pthread library, scaled XMT-C spawn blocks, and graphics card programming such as for Nvidia's CUDA system. Typically, a list of data is decomposed in an embarrassingly parallel way so that individual threads can then compute on a sub-portion of that list, using their thread ID numbers as a convenient instrument for mapping onto a non-overlapping region of the shared list.

Indicator 4.a.2 Investigate and Understand the Various Applications of a Multi-Threaded Code
The student will investigate and understand various applications of a multi-threaded code. For instance, the discrete cosine transform used in signal processing can be applied to form a JPEG image, where a large matrix of pixel color values is decomposed into smaller 8-by-8 pixel blocks. These smaller blocks are then handled by separate threads in order to calculate the DCT and perform other operations required by this particular compression scheme.
Other examples of similar calculations include matrix operations to solve linear systems and also recursive ray tracing.

Indicator 4.a.3
Investigate and Understand the Potential for Wide-Scale Use of Thread-Based Parallelism
The student will investigate and understand the potential for the wide-scale use of thread-based parallelism, most obviously as a result of the deployment, through current and next-generation commodity graphics cards, of massively parallel multi-threaded systems to personal computers, in particular for gaming and entertainment purposes. This mass-market effect is driving the rapid deployment of high-end parallel systems, and that in turn opens the door for large-scale scientific and other technical applications, because powerful systems are now so widely accessible.