Run-time Reordering Transformations


Run-time Reordering Transformations
Michelle Mills Strout
December 5, 2005

Important Irregular Science & Engineering Applications
- Molecular Dynamics
- Finite Element Analysis
- Sparse Matrix Computations

Lackluster Performance in Irregular Applications [Mahinthakumar & Saied 99]
[Figure: % of peak Mflops (true vs. effective) versus problem size; effective performance falls from roughly 28% of peak to 4% as the problem size grows]
- 10% of peak is typical for irregular applications
- BiCGSTAB solver on an SGI Origin 2000
- Performance degrades upon falling out of L2 cache
- The processor-memory performance gap is increasing

Challenge: compilers are unable to effectively reorder data and computation at compile time in irregular applications
Approach: run-time reordering transformations
Goal:
[Figure: reordering transformations 1, 2, ..., N are given to a compiler, which generates inspector/executor code and decision code]

Contributions
- Full Sparse Tiling (FST)
  - Data locality in Gauss-Seidel (ICCS01) (JHPCA03)
  - Parallelism for Gauss-Seidel (LCPC02)
- Framework for run-time reordering transformations and their compositions (PLDI03)
[Figure: the framework diagram again, with FST among the reordering transformations 1..N fed to the compiler]

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Reordering Transformations
- Goal: improve data locality and exploit parallelism
- How? Reorder data at run time; reorder iterations at run time
- Result: performance improvements for irregular applications

Example Loop with an Irregular Memory Reference

    for i=0,7
        Y[i] = Z[x[i]]

- Z is the data array; x is the index array
- Simple cache model: 2 elements of Z in each cache line, 2 cache lines of Z in cache at any one time
- With the original data and iteration order, all eight accesses to Z = (a b c d e f g h) miss: M M M M M M M M (simulated in the sketch below)
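The miss/hit pattern depends on the contents of the index array, which the slides leave abstract. As a concrete illustration, here is a minimal, self-contained C sketch of the slide's toy cache model (2 elements per line, 2 lines resident); the index array values and the LRU replacement policy are assumptions, not from the slides.

    #include <stdio.h>

    #define N    8
    #define LINE 2   /* elements of Z per cache line  */
    #define WAYS 2   /* cache lines of Z held at once */

    /* Print M (miss) or H (hit) for each access Z[x[0]] .. Z[x[N-1]]
     * under the toy cache model, assuming LRU replacement. */
    static void simulate(const int *x) {
        int lines[WAYS], age[WAYS];
        for (int w = 0; w < WAYS; w++) { lines[w] = -1; age[w] = 0; }
        for (int i = 0; i < N; i++) {
            int ln = x[i] / LINE;             /* line holding Z[x[i]] */
            int hit = 0, victim = 0;
            for (int w = 0; w < WAYS; w++) {
                age[w]++;
                if (lines[w] == ln) hit = 1;
                if (age[w] > age[victim]) victim = w;
            }
            if (hit) {
                for (int w = 0; w < WAYS; w++)
                    if (lines[w] == ln) age[w] = 0;   /* refresh on hit */
            } else {
                lines[victim] = ln;                   /* evict LRU line */
                age[victim] = 0;
            }
            printf("%c ", hit ? 'H' : 'M');
        }
        printf("\n");
    }

    int main(void) {
        int x[N] = {0, 4, 2, 6, 1, 5, 3, 7};   /* hypothetical index array */
        simulate(x);   /* scattered accesses: prints all misses */
        return 0;
    }

Running it with this x prints M for every access; reordering the data or the iterations, as described next, turns some of those misses into hits.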

Data Locality
- Spatial locality occurs when memory locations mapped to the same cache line are used before the cache line is evicted
- Temporal locality occurs when the same memory location is reused before its cache line is evicted

Data Reordering

    for i=0,7
        Y[i] = Z'[x'[i]]

- Reorder items in data array Z at run time
- Update index array x with the new locations
- Increases spatial locality
[Figure: Z = (a b c d e f g h) is permuted into Z'; the access pattern becomes M S M S M M M M (M = miss, S = spatial-locality hit)]

Iteration Reordering

    for i'=0,7
        Y'[i'] = Z[x'[i']]

- Reorder the iterations of the i loop
- Remap index array x according to the new i loop ordering
- Increases temporal and spatial locality
[Figure: with the reordered iterations, the access pattern becomes M T M T S M M T (T = temporal-locality hit)]

Inspector/Executor [Saltz et al. 1994]
- Inspector
  - Traverses the index arrays
  - Creates data and/or iteration reordering functions
  - Reorders data and updates any affected index arrays by applying the reordering functions
- Executor
  - A transformed version of the original code that uses the reordered data arrays and/or executes the new iteration order

A data reordering inspector:
- Traverses the index array
- Generates a data reordering function sigma
- Reorders the data and updates the index array

Original Code

    for i=0,7
        Y[i] = Z[x[i]]

Inspector

    for i=0,7
        ... x[i] ...           // traverse the index array
    for k=0,7
        sigma[k] = ...         // build the data reordering function
    for k=0,7
        Z'[sigma[k]] = Z[k]    // reorder the data array
        x'[k] = sigma[x[k]]    // update the index array

Executor

    for i=0,7
        Y[i] = Z'[x'[i]]

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Effect of Composing Reordering Transformations
- Hand-generated inspector and executor code for three benchmarks
- Used existing data and iteration reordering heuristics within one loop
- Composed these with full sparse tiling to reorder iterations between loops
- Measured performance improvements and overhead

Sample Irregular Benchmarks [Han & Tseng 2000]

    for s=1,t
        for i=1,n
            ... = ... Z[i]
        for j=1,m
            Z[l[j]] = ...
            Z[r[j]] = ...
        for k=1,n
            ... += Z[k]

- IRREG, a partial differential equation solver
- Non-bonded force calculations in molecular dynamics
  - MOLDYN, abstracted from CHARMM
  - NBF, abstracted from GROMOS, with a different data structure
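The following is a direct C rendering of the inspector/executor pseudocode above. The slide leaves the construction of sigma open; this sketch fills it in with a first-touch ordering (the idea behind the CPACK heuristic discussed shortly), and the index array values are made up.

    #include <stdio.h>

    #define N 8

    int main(void) {
        int    x[N] = {3, 7, 3, 1, 6, 1, 4, 0};   /* hypothetical index array */
        double Z[N] = {0, 1, 2, 3, 4, 5, 6, 7};
        double Y[N], Zp[N];
        int    sigma[N], xp[N];

        /* --- Inspector --- */
        for (int k = 0; k < N; k++) sigma[k] = -1;
        int next = 0;
        for (int i = 0; i < N; i++)        /* traverse the index array */
            if (sigma[x[i]] < 0)
                sigma[x[i]] = next++;      /* first touch gets the next slot */
        for (int k = 0; k < N; k++)
            if (sigma[k] < 0)
                sigma[k] = next++;         /* untouched data goes last */
        for (int k = 0; k < N; k++) {
            Zp[sigma[k]] = Z[k];           /* reorder the data array */
            xp[k] = sigma[x[k]];           /* update the index array */
        }

        /* --- Executor: same computation against reordered data --- */
        for (int i = 0; i < N; i++)
            Y[i] = Zp[xp[i]];

        for (int i = 0; i < N; i++) printf("%g ", Y[i]);
        printf("\n");
        return 0;
    }

Since Zp[xp[i]] = Zp[sigma[x[i]]] = Z[x[i]], the executor computes exactly the original Y while walking Zp in a more cache-friendly order.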

Data Mapping for the j Loop

    for s=1,t
        for i=1,n
            ... = ... Z[i]
        for j=1,m
            Z[l[j]] = ...
            Z[r[j]] = ...
        for k=1,n
            ... += Z[k]

[Figure: each iteration of the j loop maps to the two elements of Z = (Z1 Z2 Z3 Z4 Z5 Z6) that it touches through l[j] and r[j]]

Data Reordering: CPACK [Ding & Kennedy 99]
[Figure: CPACK packs the elements of Z in the order the j loop first touches them, reordering Z = (Z1 Z2 Z3 Z4 Z5 Z6) into Z' = (Z4 Z5 Z2 Z3 Z6 Z1)]

Iteration Reordering: lexgroup [Ding & Kennedy 99]
[Figure: lexgroup groups the iterations of the j loop by the reordered data Z' = (Z4 Z5 Z2 Z3 Z6 Z1) that they touch]

Dependences Between Loops
[Figure: dependences run from the i loop through the elements of Z' to the j loop, and from the j loop on to the k loop]
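The slides give no code for lexgroup, so the sketch below shows one plausible reading of it in C: order the iterations of the j loop by the earliest (already CPACK-reordered) data location each one touches, using a stable counting sort. The function name, the min-of-endpoints key, and the bucket strategy are assumptions.

    #include <stdlib.h>

    /* lexgroup-style iteration reordering (after Ding & Kennedy 99):
     * delta[j] receives the new position of iteration j.  l[] and r[]
     * are assumed to already hold CPACK-reordered locations in [0,n). */
    void lexgroup(int m, int n, const int *l, const int *r, int *delta) {
        int *count = calloc(n + 1, sizeof(int));
        /* key(j) = min(l[j], r[j]): earliest location iteration j uses */
        for (int j = 0; j < m; j++) {
            int key = l[j] < r[j] ? l[j] : r[j];
            count[key + 1]++;
        }
        for (int k = 0; k < n; k++)      /* prefix sums: bucket starts */
            count[k + 1] += count[k];
        for (int j = 0; j < m; j++) {    /* stable counting sort */
            int key = l[j] < r[j] ? l[j] : r[j];
            delta[j] = count[key]++;
        }
        free(count);
    }

Because the sort is stable, iterations touching the same packed region stay in their original relative order, which keeps the traversal of Z' close to sequential.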

Iteration Reordering: Full Sparse Tiling (FST)
[Figure: full sparse tiling partitions the iterations of the i, j, and k loops within a time step into tiles, each of which touches a small portion of Z']

Iteration & Data Reordering: Tile Packing (tilepack)
[Figure: tile packing additionally reorders the data so that the elements touched by the same tile are adjacent in memory]

Executor for the Composed Transformations

    for s=1,t
        for tile=1,nt
            for i' in sched(tile,i)
                ... = ... Z'[i']
            for j' in sched(tile,j)
                Z'[l'[j']] = ...
                Z'[r'[j']] = ...
            for k' in sched(tile,k)
                ... += Z'[k']

Effect of FST in Composition
IBM Power 3, 375 MHz, 64 KB L1 cache
[Figure: normalized executor execution time for CPACK+lexgroup, CPACK+lexgroup+FST, Gpart+lexgroup, and Gpart+lexgroup+FST on IRREG (10 MB and 37 MB), NBF (3 MB and 8 MB), and MOLDYN (1 MB and 16 MB) data sets]
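The sched(tile, loop) lookup in the executor can be realized with a CSR-style schedule per loop. The C sketch below shows one way to store and walk such schedules; the struct, the names, the two-loop shape, and the stand-in loop bodies are illustrative assumptions, not the talk's actual data structure.

    /* A CSR-style schedule: iters[start[t] .. start[t+1]-1] lists the
     * iterations of one loop that belong to tile t. */
    typedef struct {
        int  ntiles;
        int *start;   /* length ntiles+1 */
        int *iters;   /* length = total number of iterations */
    } Schedule;

    /* Executor skeleton for one time step: run each tile's piece of
     * the i loop, then its piece of the j loop, tile by tile. */
    void run_tiles(const Schedule *si, const Schedule *sj,
                   double *Z, const int *l, const int *r) {
        for (int t = 0; t < si->ntiles; t++) {
            for (int p = si->start[t]; p < si->start[t + 1]; p++) {
                int i = si->iters[p];
                Z[i] += 1.0;              /* stand-in for the i-loop body */
            }
            for (int p = sj->start[t]; p < sj->start[t + 1]; p++) {
                int j = sj->iters[p];
                Z[l[j]] += Z[r[j]];       /* stand-in for the j-loop body */
            }
        }
    }

Walking all of a tile's iterations across the loops before moving to the next tile is what converts the between-loop reuse of Z' into cache hits.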

Effect of FST in Composition
Intel Pentium 4, 1.7 GHz, 8 KB L1 cache
[Figure: normalized executor execution time for the same CPACK/Gpart + lexgroup compositions, with and without FST, on the IRREG, NBF, and MOLDYN data sets]

Inspector for FST Compositions
IBM Power 3 and Intel Pentium 4

                Break-Even Min (time steps)   Break-Even Max (time steps)
    IRREG       -                             6
    NBF         5                             8
    MOLDYN      -                             -

Parallelism with Full Sparse Tiling
- The inspector determines the dependences between tiles
[Figure: a full sparse tiled iteration space with tiles assigned to processors P1 and P2 over time, and the resulting tile dependence graph]

Experimental Results
IBM Power 3, 375 MHz, 8 MB L2 cache
[Figure: speedup without overhead for owner-computes versus full sparse tiling at p=2, 4, and 8 on three meshes: Sphere (N=54,938; NZ=1,508,390), Pipe (NZ=5,300,288), and Wing (N=924,672; NZ=38,360,266)]
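A standard way to run a tile dependence graph in parallel is to peel it into wavefronts of tiles whose predecessors have all completed. The hedged C sketch below (OpenMP, with an invented CSR adjacency representation and function names) illustrates that idea; it is not the talk's actual scheduler.

    #include <stdlib.h>

    /* Execute a tile dependence graph wavefront by wavefront.
     * succ[sstart[v] .. sstart[v+1]-1] lists the tiles that depend on
     * tile v; indeg[v] counts v's unfinished predecessors and is
     * consumed by the traversal. */
    void run_wavefronts(int ntiles, const int *sstart, const int *succ,
                        int *indeg, void (*run_tile)(int)) {
        int *wave = malloc(ntiles * sizeof(int));
        int *next = malloc(ntiles * sizeof(int));
        int nwave = 0;

        for (int v = 0; v < ntiles; v++)   /* initial wavefront: no deps */
            if (indeg[v] == 0) wave[nwave++] = v;

        while (nwave > 0) {
            int nnext = 0;
            /* tiles within a wavefront are independent: run in parallel */
            #pragma omp parallel for
            for (int w = 0; w < nwave; w++)
                run_tile(wave[w]);
            /* retire the wavefront and collect newly ready tiles */
            for (int w = 0; w < nwave; w++) {
                int v = wave[w];
                for (int p = sstart[v]; p < sstart[v + 1]; p++)
                    if (--indeg[succ[p]] == 0)
                        next[nnext++] = succ[p];
            }
            int *tmp = wave; wave = next; next = tmp;
            nwave = nnext;
        }
        free(wave); free(next);
    }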

Experimental Results
SUN HPC10000, 36 UltraSPARC II, 400 MHz, 4 MB L2 cache
[Figure: speedup without overhead for owner-computes versus full sparse tiling at p=2, 4, 8, 16, and 32 on the Sphere, Pipe, and Wing meshes]

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Framework Needed for Compositions
- Current options
  - Analyze and generate inspectors one at a time
  - Prepare every interesting composition by hand
- The framework should provide a uniform representation for manipulating descriptions of any run-time reordering transformation at compile time
[Figure: framework diagram — reordering transformations 1, 2, ..., N → compiler → inspector/executor code and decision code]

Key Insights for Composing Reordering Transformations
- The inspectors traverse the data mappings and/or the data dependences
- We can express how the data mappings and data dependences will change
- Subsequent inspectors then traverse the new data mappings and data dependences (formalized in the sketch below)
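A minimal formal rendering of the second insight, in the relation notation of the next slides: a data reordering R and an iteration reordering T rewrite an access mapping M by composition. The composition-order convention here is my reading of the PLDI03 framework, not quoted from it.

    % A data reordering R_{Z -> Z'} updates a mapping on the range side;
    % an iteration reordering T_{I -> I'} updates it on the domain side.
    \[
      M_{I \rightarrow Z'} \;=\; R_{Z \rightarrow Z'} \circ M_{I \rightarrow Z},
      \qquad
      M_{I' \rightarrow Z} \;=\; M_{I \rightarrow Z} \circ T_{I \rightarrow I'}^{-1}
    \]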

Framework Elements [Kelly & Pugh 95] [Pugh & Wonnacott 95]
- Data mapping, known at compile time:
      M_{I→Z} = { [i] → [x(i)] : 0 ≤ i ≤ 7 }
- Data dependences, known at compile time:
      D_{I→I} = { [i] → [k] : (0 ≤ i < k ≤ 7) ∧ (i = b(k) ∨ k = b(i)) }

Reordering Transformations (PLDI 03)
- Data reordering, known at compile time:
      R_{Z→Z'} = { [i] → [σ(i)] }
  [Figure: Z = (a b c d e f g h) is permuted into Z' = (d h a f c b e g)]
- Iteration reordering, known at compile time:
      T_{I→I'} = { [i] → [δ(i)] }

Data Mappings and Data Dependences for MOLDYN

    for ts=1,t
        for i=1,n
            Z[i] = ...
        for j=1,M
            ... = Z[l[j]]
            ... = Z[r[j]]

- Data mapping for the i loop:   M_{I0→Z0} = { [i] → [i] }
- Data mapping for the j loop:   M_{J0→Z0} = { [j] → [i] : (i = l(j)) ∨ (i = r(j)) }
- Data dependences between the i and j loops:
      D_{I0→J0} = { [i] → [j] : (i = l(j)) ∨ (i = r(j)) }

Data Reordering: CPACK
[Figure: CPACK reorders Z = (Z1 Z2 Z3 Z4 Z5 Z6) into Z' = (Z4 Z5 Z2 Z3 Z6 Z1)]
- Data reordering:               R_{Z0→Z'} = { [i] → [σ(i)] }
- Induced iteration reordering:  T_{I0→I1} = { [i] → [σ(i)] }
- Updated data mapping:          M_{J0→Z'} = { [j] → [σ(i)] : (i = l(j)) ∨ (i = r(j)) }
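To make the CPACK update above explicit, the new j-loop mapping is obtained by composing the data reordering with the old mapping. This LaTeX block is my rendering under the same hedged composition convention as before.

    % CPACK's data reordering relation, and the j-loop access mapping
    % it induces by composition.
    \[
      R_{Z_0 \rightarrow Z'} = \{\, [m] \rightarrow [\sigma(m)] \,\}
    \]
    \[
      M_{J_0 \rightarrow Z'}
        \;=\; R_{Z_0 \rightarrow Z'} \circ M_{J_0 \rightarrow Z_0}
        \;=\; \{\, [j] \rightarrow [\sigma(i)] \;:\; (i = l(j)) \vee (i = r(j)) \,\}
    \]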

Iteration Reordering: lexgroup [Ding & Kennedy 99]
- Starting data mapping:     M_{J0→Z'} = { [j] → [σ(i)] : (i = l(j)) ∨ (i = r(j)) }
- lexgroup transformation:   T_{J0→J1} = { [j] → [δ(j)] }
- Updated data mapping:      M_{J1→Z'} = { [δ(j)] → [σ(i)] : (i = l(j)) ∨ (i = r(j)) }
[Figure: the iterations of the j loop regrouped over Z' = (Z4 Z5 Z2 Z3 Z6 Z1)]

Iteration Reordering: Full Sparse Tiling (FST)
- Updated data dependences:  D_{I1→J1} = { [σ(i)] → [δ(j)] : (i = l(j)) ∨ (i = r(j)) }
- FST transformations:
      T_{I1→I2} = { [i'] → [θ(1,i'), 1, i'] }
      T_{J1→J2} = { [j'] → [θ(2,j'), 2, j'] }

Compile-time Composition of Data and Iteration Reorderings (PLDI 03)
- A uniform representation uses uninterpreted functions to represent run-time information
- Describes how to compose run-time reordering transformations
- Determining legality constraints for run-time reordering functions helps reduce overhead
- A possible optimization is to remap the data only once
- Describes the space of possible transformations

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work
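Putting the pieces together for MOLDYN, the relations the FST inspector traverses can be written as a chain of compositions. This is my hedged reading of how the framework combines CPACK (σ), lexgroup (δ), and FST (θ, which assigns each reordered iteration to a tile); the composition order is not quoted from the slides.

    % Dependences seen by the FST inspector after CPACK and lexgroup,
    % followed by the FST tiling relations themselves.
    \[
      D_{I_1 \rightarrow J_1}
        \;=\; T_{J_0 \rightarrow J_1} \circ D_{I_0 \rightarrow J_0}
              \circ T_{I_0 \rightarrow I_1}^{-1}
        \;=\; \{\, [\sigma(i)] \rightarrow [\delta(j)]
              \;:\; (i = l(j)) \vee (i = r(j)) \,\}
    \]
    \[
      T_{I_1 \rightarrow I_2} = \{\, [i'] \rightarrow [\theta(1,i'),\,1,\,i'] \,\},
      \qquad
      T_{J_1 \rightarrow J_2} = \{\, [j'] \rightarrow [\theta(2,j'),\,2,\,j'] \,\}
    \]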

Conclusions
- The framework gives the compiler the ability to manipulate run-time reordering transformations
[Figure: framework diagram — FST and the other reordering transformations 1..N → compiler → inspector/executor code and decision code]
- FST composed with other run-time reordering transformations can improve serial performance

Future Work
- Automatic generation of optimized inspectors
- Transformation guidance
  - Compile-time guidance
  - Incorporating domain-specific knowledge into transformation guidance
- Separation of algorithmic and performance interfaces through generative techniques
