Run-time Reordering Transformations
Run-time Reordering Transformations
Michelle Mills Strout
December 5, 2005

Important Irregular Science & Engineering Applications
- Molecular dynamics
- Finite element analysis
- Sparse matrix computations

Lackluster Performance in Irregular Applications
[Figure: percent of peak Mflops vs. problem size, true vs. effective Mflops, for a BiCGSTAB solver on an SGI Origin 2000 [Mahinthakumar & Saied 99]]
- 10% of peak is typical for irregular applications
- Performance degrades upon falling out of L2 cache
- The processor-memory performance gap is increasing

Ideal Compiler/System for Irregular Applications
- Challenge: unable to effectively reorder data and computation at compile-time in irregular applications
- Approach: run-time reordering transformations
- Goal: a framework in which the compiler takes transformations 1, 2, ..., N and generates inspector/executor code plus decision code
Contributions
- Full Sparse Tiling (FST)
  - Data locality in Gauss-Seidel (ICCS01) (JHPCA03)
  - Parallelism for Gauss-Seidel (LCPC02)
- Framework for run-time reordering transformations and their compositions (PLDI03)

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Reordering Transformations Example: Loop with an Irregular Memory Reference
- Goal: improve data locality and exploit parallelism
- How? Reordering data at runtime; reordering iterations at runtime
- Result: performance improvements for irregular applications

for i=0,7
  Y[i] = Z[x[i]]

- Z is the data array; x is the index array
- Simple cache model: 2 elements of Z in each cache line, 2 cache lines of Z in cache at any one time
[Figure: cache-miss trace — under this model, every access Z[x[i]] misses (M)]
Data Locality
- Spatial locality occurs when memory locations mapped to the same cache line are used before the cache line is evicted.
- Temporal locality occurs when the same memory location is reused before its cache line is evicted.

Data Reordering
for i=0,7
  Y[i] = Z'[x'[i]]
- Reorder items in data array Z at run-time
- Update index array x with the new locations
- Increases spatial locality
[Figure: after data reordering, some accesses become spatial-locality hits (S) in the cache-miss trace]

Iteration Reordering
for i'=0,7
  Y'[i'] = Z[x'[i']]
- Reorder iterations of the i loop
- Remap index array x according to the new i loop ordering
- Increases temporal and spatial locality
[Figure: after iteration reordering, the trace shows temporal (T) and spatial (S) hits]

Inspector/Executor [Saltz et al. 1994]
- Inspector: traverses through index arrays; creates data and/or iteration reordering functions; reorders data and updates any affected index arrays by applying the reordering functions
- Executor: a transformed version of the original code which uses the reordered data arrays and/or executes the new iteration order
Inspector and Executor for Data Reordering
The inspector traverses the index array, generates a data reordering function sigma, reorders the data, and updates the index array.

Original code:
for i=0,7
  Y[i] = Z[x[i]]

Inspector:
for i=0,7
  ... x[i] ...
for k=0,7
  sigma[k] = ...
for k=0,7
  Z'[sigma[k]] = Z[k]
  x'[k] = sigma[x[k]]

Executor:
for i=0,7
  Y[i] = Z'[x'[i]]

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Effect of Composing Reordering Transformations
Sample Irregular Benchmarks [Han & Tseng 2000]
- Hand-generated inspector and executor code for three benchmarks
- Used existing data and iteration reordering heuristics within one loop
- Composed these with full sparse tiling to reorder iterations between loops
- Measured performance improvements and overhead

for s=1,T
  for i=1,n
    ... = ... Z[i]
  for k=1,m
    Z[l[k]] = ...
    Z[r[k]] = ...
  for k=1,n
    ... += Z[k]

- IRREG: partial differential equation solver
- Non-bonded force calculations in molecular dynamics: MOLDYN abstracted from CHARMM; NBF abstracted from GROMOS (different data structure)
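As a concrete illustration of the inspector/executor pattern above, here is a minimal runnable sketch (not the talk's actual code; function names are hypothetical). The inspector builds sigma in first-touch order over the index array, in the spirit of consecutive packing, and the executor then runs against the reordered data.

```python
# Sketch of a data-reordering inspector/executor for "for i: Y[i] = Z[x[i]]".
# sigma maps old data locations to new ones, chosen by first touch in x.

def inspector(x, n_data):
    """Build sigma: old location -> new location, in first-touch order of x."""
    sigma = [None] * n_data
    next_loc = 0
    for i in range(len(x)):          # traverse the index array
        if sigma[x[i]] is None:
            sigma[x[i]] = next_loc
            next_loc += 1
    for old in range(n_data):        # pack never-touched elements at the end
        if sigma[old] is None:
            sigma[old] = next_loc
            next_loc += 1
    return sigma

def apply_reordering(Z, x, sigma):
    """Z'[sigma[k]] = Z[k]; x'[k] = sigma[x[k]]."""
    Zp = [None] * len(Z)
    for k in range(len(Z)):
        Zp[sigma[k]] = Z[k]
    xp = [sigma[x[i]] for i in range(len(x))]
    return Zp, xp

def executor(Zp, xp):
    """Transformed loop: Y[i] = Z'[x'[i]]."""
    return [Zp[xp[i]] for i in range(len(xp))]
```

Because the executor dereferences the remapped index array, it computes the same Y as the original loop while touching Z' in nearly consecutive order.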
Data Mapping for the k Loop
Data Reordering: CPACK [Ding & Kennedy 99]

for s=1,T
  for i=1,n
    ... = ... Z[i]
  for k=1,m
    Z[l[k]] = ...
    Z[r[k]] = ...
  for k=1,n
    ... += Z[k]

[Figure: CPACK packs the elements of Z in the order the k loop first touches them, producing a reordered array Z' (e.g. Z4, Z5, Z2, Z3, Z6, ...)]

Iteration Reordering: lexgroup [Ding & Kennedy 99]
[Figure: lexgroup reorders the k iterations so that iterations touching nearby elements of the reordered array Z' execute consecutively]

Dependences Between Loops
[Figure: dependences between the i loop and the k loops, carried through the reordered array Z']
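The two single-loop heuristics named on these slides can be sketched roughly as follows — a simplified reading of [Ding & Kennedy 99] with hypothetical helper names, not the benchmark code: CPACK packs data in first-touch order of the k loop's index arrays, and lexgroup then reorders the k iterations by the new locations of the data each iteration touches.

```python
# Simplified sketch of CPACK (data reordering) and lexgroup (iteration
# reordering) for a loop "for k: ... Z[l[k]] ... Z[r[k]] ...".

def cpack(index_arrays, n_data):
    """First-touch packing over interleaved accesses l[0], r[0], l[1], r[1], ..."""
    sigma, next_loc = {}, 0
    for accesses in zip(*index_arrays):   # one tuple of accesses per iteration k
        for a in accesses:
            if a not in sigma:
                sigma[a] = next_loc
                next_loc += 1
    for d in range(n_data):               # pack untouched data at the end
        if d not in sigma:
            sigma[d] = next_loc
            next_loc += 1
    return sigma

def lexgroup(l, r, sigma):
    """Reorder the k iterations by the new locations of the data they touch."""
    key = lambda k: (min(sigma[l[k]], sigma[r[k]]), max(sigma[l[k]], sigma[r[k]]))
    order = sorted(range(len(l)), key=key)
    delta = [None] * len(l)               # delta: old iteration -> new position
    for new_k, old_k in enumerate(order):
        delta[old_k] = new_k
    return delta
```

After both run, consecutive iterations of the transformed k loop tend to access consecutive locations of Z', which is exactly the locality effect the composition slides measure.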
Iteration Reordering: Full Sparse Tiling (FST)
Iteration & Data Reordering: Tile Packing (tilepack)

Executor for the Composed Transformations

Original:
for s=1,T
  for i=1,n
    ... = ... Z[i]
  for k=1,m
    Z[l[k]] = ...
    Z[r[k]] = ...
  for k=1,n
    ... += Z[k]

Transformed:
for s=1,T
  for t=1,ntiles
    for i' in sched(t, i)
      ... = ... Z'[i']
    for k' in sched(t, k)
      Z'[l'[k']] = ...
    for k' in sched(t, k)
      ... += Z'[k']

Effect of FST in Composition
[Figure: normalized executor execution time on an IBM Power3 (375 MHz, 64 KB L1 cache) for IRREG, NBF, and MOLDYN at several data-set sizes, comparing CPACK+lexgroup, CPACK+lexgroup+FST, Gpart+lexgroup, and Gpart+lexgroup+FST]
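The control structure of the composed executor above can be sketched generically. This is only a skeleton with a hypothetical schedule representation — in the talk, the real schedule sched(t, loop) is produced by the full sparse tiling inspector.

```python
# Skeleton of the composed executor: visit tiles in order and, within each
# tile, run that tile's scheduled iterations of each inner loop in turn.
# sched maps (tile, loop_name) -> list of iterations (hypothetical encoding).

def composed_executor(T, ntiles, sched, body):
    """body(loop_name, iteration) executes one iteration of one inner loop."""
    for s in range(T):                      # time steps
        for t in range(ntiles):             # tiles
            for name in ('i', 'k1', 'k2'):  # the three inner loops
                for it in sched.get((t, name), []):
                    body(name, it)
```

The schedule must respect the dependences between the inner loops, which is what the inspector checks; given that, each tile's working set stays small enough to remain in cache across all three loops.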
Effect of FST in Composition
[Figure: normalized executor execution time on an Intel Pentium 4 (1.7 GHz, 8 KB L1 cache) for IRREG, NBF, and MOLDYN, same four compositions: CPACK+lexgroup, CPACK+lexgroup+FST, Gpart+lexgroup, Gpart+lexgroup+FST]

Inspector for FST Compositions
IBM Power3 and Intel Pentium 4
[Table: break-even points, in timesteps, for IRREG, NBF, and MOLDYN; e.g. NBF breaks even after 5 to 8 timesteps]

Parallelism with Full Sparse Tiling
- The inspector determines dependences between tiles
[Figure: full sparse tiled iteration space across processors P1, P2 and time, with the resulting tile dependence graph]

Experimental Results
[Figure: speedup without overhead on an IBM Power3 (375 MHz, 8 MB L2 cache), owner-computes vs. full sparse tiling at p=2, 4, 8, on three meshes: Sphere (N=54,938, NZ=1,508,390), Pipe (N=38,120, NZ=5,300,288), Wing (N=924,672, NZ=38,360,266)]
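The tile dependence graph that the inspector builds can be executed in parallel "waves" via a topological traversal: tiles in the same wave have no dependences on one another, so they may run concurrently. A minimal sketch (not the talk's implementation):

```python
# Sketch: group tiles into wavefronts of a tile dependence graph.
# deps is a set of (src_tile, dst_tile) edges; src must finish before dst.
from collections import defaultdict

def wavefront_schedule(ntiles, deps):
    """Return a list of waves; tiles within one wave can run in parallel."""
    indeg = [0] * ntiles
    succ = defaultdict(list)
    for a, b in deps:
        succ[a].append(b)
        indeg[b] += 1
    wave = [t for t in range(ntiles) if indeg[t] == 0]
    waves = []
    while wave:
        waves.append(wave)
        nxt = []
        for t in wave:                 # retire this wave's tiles
            for u in succ[t]:
                indeg[u] -= 1
                if indeg[u] == 0:      # all predecessors done
                    nxt.append(u)
        wave = nxt
    return waves
```

Owner-computes parallelization, by contrast, partitions by data ownership and synchronizes every loop; wavefront execution of tiles is what lets full sparse tiling overlap work across the loops.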
Experimental Results
[Figure: speedup without overhead on a Sun HPC10000 (36 UltraSPARC II, 400 MHz, 4 MB L2 cache), owner-computes vs. full sparse tiling at p=2, 4, 8, 16, 32, on the Sphere, Pipe, and Wing meshes]

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Framework Needed for Compositions
Current options:
- Analyze and generate inspectors one at a time
- Prepare every interesting composition by hand
The framework should provide a uniform representation for manipulating descriptions of any run-time reordering transformation at compile-time.

Key Insights for Composing Reordering Transformations
- The inspectors traverse the data mappings and/or the data dependences
- We can express how the data mappings and data dependences will change
- Subsequent inspectors traverse the new data mappings and data dependences
Framework Elements [Kelly & Pugh 95] [Pugh & Wonnacott 95]

Data mapping (compile-time):
  M_{I->Z} = {[i] -> [x(i)] : 0 <= i <= 7}

Data dependences (compile-time):
  D_{I->I} = {[i] -> [k] : (0 <= i < k <= 7) and ((i = b(k)) or (k = b(i)))}

Reordering Transformations (PLDI 03)
- Data reordering: R_{Z->Z'} = {[i] -> [sigma(i)]}
- Iteration reordering: T_{I->I'} = {[i] -> [delta(i)]}

Data Mappings and Data Dependences for MOLDYN

for ts=1,T
  for i=1,n
    Z[i] = ...
  for k=1,M
    ... = Z[l[k]]
    ... = Z[r[k]]

- Data mapping for the i loop: M_{I0->Z0} = {[i] -> [i]}
- Data mapping for the k loop: M_{J0->Z0} = {[k] -> [i] : (i = l(k)) or (i = r(k))}
- Data dependences between the i and k loops: D_{I0->J0} = {[i] -> [k] : (i = l(k)) or (i = r(k))}

Data Reordering: CPACK
- R_{Z0->Z1} = T_{I0->I1} = {[i] -> [sigma(i)]}
- Updated data mapping: M_{J0->Z1} = {[k] -> [sigma(i)] : (i = l(k)) or (i = r(k))}
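The update rule on this slide is a relation composition: at compile time sigma is an uninterpreted function, and once the inspector has computed it, the new mapping is M_{J0->Z1} = R_{Z0->Z1} o M_{J0->Z0}. A small sketch, modeling the relations as finite sets of pairs (an illustration, not the framework's actual representation):

```python
# Sketch: compose a data reordering R (z0 -> z1) with a data mapping M
# (iteration j -> data z0) to get the updated mapping (j -> z1).

def compose(R, M):
    """Return {(j, z1)} where M relates j -> z0 and R relates z0 -> z1."""
    Rmap = dict(R)                      # R is a function, so a dict suffices
    return {(j, Rmap[z0]) for (j, z0) in M}
```

Because every transformation is expressed as such a relation, the compiler can symbolically push any reordering through the remaining mappings and dependences before the run-time values of sigma or delta exist.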
Iteration Reordering: lexgroup [Ding & Kennedy 99]
- T_{J0->J1} = {[k] -> [delta(k)]}
- Updated data mapping: M_{J1->Z1} = {[delta(k)] -> [sigma(i)] : (i = l(k)) or (i = r(k))}
- Updated data dependences: D_{I1->J1} = {[sigma(i)] -> [delta(k)] : (i = l(k)) or (i = r(k))}

Iteration Reordering: Full Sparse Tiling (FST)
- T_{I1->I2} = {[i] -> [theta(1,i), 1, i]}
- T_{J1->J2} = {[k] -> [theta(2,k), 2, k]}

Compile-time Composition of Data and Iteration Reorderings (PLDI 03)
- Uniform representation using uninterpreted functions to represent run-time information
- Describes how to compose run-time reordering transformations
- Determining legality constraints for run-time reordering functions helps reduce overhead
- A possible optimization is to remap data only once
- Describes the space of possible transformations

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work
Conclusions
- The framework gives the compiler the ability to manipulate run-time reordering transformations
- FST composed with other run-time reordering transformations can improve serial performance

Future Work
- Automatic generation of optimized inspectors
- Compile-time transformation guidance
- Incorporating domain-specific knowledge into transformation guidance
- Separation of algorithmic and performance interfaces through generative techniques
Lecture 1 An Overview of High-Performance Computer Architecture ECE 463/521 Fall 2002 Edward F. Gehringer Automobile Factory (note: non-animated version) Automobile Factory (note: non-animated version)
More informationAdministration. Prerequisites. Website. CSE 392/CS 378: High-performance Computing: Principles and Practice
CSE 392/CS 378: High-performance Computing: Principles and Practice Administration Professors: Keshav Pingali 4.126 ACES Email: pingali@cs.utexas.edu Jim Browne Email: browne@cs.utexas.edu Robert van de
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationAutomatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.
Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee Outline Pre-intro: BLAS Motivation What is ATLAS Present release How ATLAS works
More informationCache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time
Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +
More informationCS 261 Fall Caching. Mike Lam, Professor. (get it??)
CS 261 Fall 2017 Mike Lam, Professor Caching (get it??) Topics Caching Cache policies and implementations Performance impact General strategies Caching A cache is a small, fast memory that acts as a buffer
More informationImproving Performance of Sparse Matrix-Vector Multiplication
Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science and Center of Simulation of Advanced Rockets University of Illinois at Urbana-Champaign
More informationAnnouncements. ! Previous lecture. Caches. Inf3 Computer Architecture
Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationExecuting Optimized Irregular Applications Using Task Graphs Within Existing Parallel Models
Executing Optimized Irregular Applications Using Task Graphs Within Existing Parallel Models Christopher D. Krieger, Michelle Mills Strout, Jonathan Roelofs Colorado State University Fort Collins, CO 8526
More informationCSE Computer Architecture I Fall 2009 Homework 08 Pipelined Processors and Multi-core Programming Assigned: Due: Problem 1: (10 points)
CSE 30321 Computer Architecture I Fall 2009 Homework 08 Pipelined Processors and Multi-core Programming Assigned: November 17, 2009 Due: December 1, 2009 This assignment can be done in groups of 1, 2,
More informationGPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.
GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran G. Ruetsch, M. Fatica, E. Phillips, N. Juffa Outline WRF and RRTM Previous Work CUDA Fortran Features RRTM in CUDA
More informationEvaluating the Impact of RDMA on Storage I/O over InfiniBand
Evaluating the Impact of RDMA on Storage I/O over InfiniBand J Liu, DK Panda and M Banikazemi Computer and Information Science IBM T J Watson Research Center The Ohio State University Presentation Outline
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationSPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning
SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning Tarek Elgamal 2, Shangyu Luo 3, Matthias Boehm 1, Alexandre V. Evfimievski 1, Shirish Tatikonda 4, Berthold Reinwald
More informationMULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT
MULTI-CORE PROGRAMMING Dongrui She December 9, 2010 ASSIGNMENT Goal of the Assignment 1 The purpose of this assignment is to Have in-depth understanding of the architectures of real-world multi-core CPUs
More informationA Characterization of Shared Data Access Patterns in UPC Programs
IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview
More information