Tools for Change David E. Bernholdt Oak Ridge National Laboratory


Tools for Change David E. Bernholdt bernholdtde@ornl.

1 Tools for Change
David E. Bernholdt
Group Leader, Computer Science Research
Computer Science and Mathematics Division and National Center for Computational Sciences
Oak Ridge National Laboratory

2 Acknowledgements
COMPOSE-HPC
- People:
  - Galois: Matt Sottile
  - LLNL: Tom Epperly, Tammy Dahlgren, Adrian Prantl
  - ORNL: David Bernholdt, Wael Elwasif, Samantha Foley
  - PNNL: Daniel Chavarria, Sriram Krishnamoorthy, Ajay Panyala
  - SNL: Rob Armstrong, Ben Allan, Geoff Hulette
- Funding: DOE/ASCR X-Stack Software Research (2010)
Hercules/Klonos
- People:
  - ORNL: Oscar Hernandez, Christos Kartsaklis, Chung-Hsing Hsu, Wayne Joubert, Rich Graham
  - U Houston: Barbara Chapman, Wei Ding (now ORNL)
- Funding:
  - ORNL Director's R&D Fund
  - Oak Ridge Leadership Computing Facility
  - Oak Ridge Associated Universities
This work was performed in part at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR

3 Stability is Good
- Since the mid-1990s, supercomputing has been dominated by one basic architecture
  - Commodity CPU (especially x86)
  - Deep memory hierarchy (multiple caches, DRAM)
  - Interconnect (distributed memory, explicit messaging)
  - External storage system (after some flirtation with node-local disk)
- Good for software productivity
  - Increasingly well-understood abstract machine model
  - Understand how to design SW for such machines

4 Stable Systems Resist Change
- The end of Dennard scaling (constant power density as transistors get smaller) triggered several new trends
  - CPU clocks stopped getting faster
  - Instead, increase parallelism to get more aggregate performance
  - Shift to multi-core CPUs to continue leveraging Moore's Law
- Most applications adapted without radical shifts in the abstract machine model they were targeting
  - Treat cores like CPUs: one MPI rank per core
  - Expose more parallelism: finer decompositions, larger problems
- Some changes start to appear, around the margins
  - Additional level of parallelism (threading)
  - More complexity in memory (NUMA domains)

5 We're Approaching a Breaking Point
Recent trends
- Multi-core -> many-core
  - Treating cores like CPUs leads to resource contention elsewhere
  - Cache, memory bandwidth, memory capacity, etc.
- Adding accelerators into the mix
  - Multiple architectures (GPUs, Phi, others)
  - Separate (PCI) bus or integrated?
  - Separate memory or integrated?
- Interconnects are just devices to access remote memory (RDMA)
Coming attractions
- Many-core -> many-many-core
- Greater heterogeneity
- Processing in memory
- Integrated NIC
- Multi-level networks (on-chip, in-rack, between racks)
- Multi-level parallelism
- Event/task parallelism to deal with massive parallelism
- Power, power, power
- Reliability, reliability, reliability

6 Stability Has Left Us Unprepared
- We are entering a period of high flux and diversity in HPC architectures
- Applications will have to be re-thought, or at least adapted to coming architectures
- We are not prepared to deal with the changes that will be necessary
  - Mentally
  - Tool-wise
- Detrimental to software productivity
- Need agility
- Need tools that support agility

7 Scientific Applications are Constantly Evolving
- Porting to new architectures
- Extending scientific capabilities
- Improving quality or performance
- Example code changes:
  - Updating library calls
  - Changing data structures
  - Platform-specific optimizations
- How do you transform the code?
  - Manual editing?
  - Sed, grep?
  - Perl or Python scripts?
- Simple text processing tools don't respect programming language semantics!
- COMPOSE-HPC focused on creating a tool chain to allow application developers to specify and apply language-aware transformations
- Hercules focuses on a compiler-based infrastructure to find patterns and apply transformations and optimizations; build knowledge-bases about code
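The point about text tools ignoring language semantics can be made concrete with a small sketch. It renames a variable twice: once with a word-boundary regular expression, and once by rewriting only identifier tokens. Python's standard `tokenize` module stands in here for a real language front end; the sample program and the function names are invented for illustration and are not part of COMPOSE-HPC or Hercules.

```python
import io
import re
import tokenize

# Invented sample program: the name "n" appears as a variable,
# inside a comment, and inside a string literal.
SOURCE = '''\
n = 3
count = n + 1  # n is the loop bound
print("n =", n, "count =", count)
'''

def rename_with_regex(source, old, new):
    """Text-level rename: also rewrites matches inside strings and comments."""
    return re.sub(r"\b" + re.escape(old) + r"\b", new, source)

def rename_with_tokens(source, old, new):
    """Token-level rename: rewrites only identifier (NAME) tokens equal to `old`."""
    lines = source.splitlines(keepends=True)
    hits = [tok.start
            for tok in tokenize.generate_tokens(io.StringIO(source).readline)
            if tok.type == tokenize.NAME and tok.string == old]
    # Apply edits from the end of the file backwards so that earlier
    # (row, column) positions stay valid while lines change length.
    for row, col in sorted(hits, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + new + line[col + len(old):]
    return "".join(lines)

if __name__ == "__main__":
    print(rename_with_regex(SOURCE, "n", "limit"))
    print(rename_with_tokens(SOURCE, "n", "limit"))
```

The regex version changes the string literal `"n ="` and therefore the program's output, while the token version leaves strings and comments alone. A tokenizer is still far short of the AST-level tools described in the slides, but it shows why the level of representation matters.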

8 KNOT: A Language-Aware Transformation Tool Chain (COMPOSE-HPC)
[Diagram: An application or tool developer annotates the original source code. The annotated source code is processed by a KNOT-based custom transformation tool, built from PAUL (defines the annotation language), ROTE (defines the transformations) and BRAID (defines generation/optimizations). The tool emits transformed source code, which goes to the compiler.]

9 PAUL: Annotation Parser (COMPOSE-HPC)
- Customizable, reusable, application-specific directives for parameterizing transformation tools
- Can be used with ROTE to guide rewrite tools with application-specific information, or can be used alone to add a structured annotation facility to ROSE-based tools

Annotated source code:

    /* %GUARD cond = NonZero arg = x
              onerrorcall = failwith errorcallarg = "x = 0" */
    double foo(double x, double y) { return y / x; }

Callouts on the annotation:
- "I am an annotation inside a comment!" (the /* ... */ comment)
- "I change the code in this way" (%GUARD)
- "Based on these parameters" (cond, arg, onerrorcall, errorcallarg)
- "With these values" (NonZero, x, failwith, "x = 0")

After the transformation (ROTE), the guard calls failwith when the NonZero condition is violated:

    double foo(double x, double y) {
      if (x == 0) { failwith("x = 0"); }
      return y / x;
    }
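As a rough illustration of how such an annotation can drive code generation, the sketch below splits the %GUARD text from the slide into key/value parameters and emits a C guard that fires when the NonZero condition is violated, that is, when x is zero. This is not PAUL's parser or ROTE's rewrite machinery: the regular expression, the `VIOLATION_TEST` table and both function names are assumptions made for this example.

```python
import re

ANNOTATION = '%GUARD cond = NonZero arg = x onerrorcall = failwith errorcallarg = "x = 0"'

# A key, an equals sign, then either a double-quoted string or a bare word.
PAIR = re.compile(r'(\w+)\s*=\s*("(?:[^"\\]|\\.)*"|\w+)')

# Hypothetical table: for each condition name, the C test that detects a violation.
VIOLATION_TEST = {"NonZero": "{arg} == 0"}

def parse_annotation(text):
    """Split '%NAME key = value ...' into (name, {key: value})."""
    name, _, rest = text.strip().partition(" ")
    if not name.startswith("%"):
        raise ValueError("annotation must start with %NAME")
    return name[1:], dict(PAIR.findall(rest))

def generate_guard(params):
    """Build the C guard statement described by the parsed parameters."""
    test = VIOLATION_TEST[params["cond"]].format(arg=params["arg"])
    return f'if ({test}) {{ {params["onerrorcall"]}({params["errorcallarg"]}); }}'

if __name__ == "__main__":
    name, params = parse_annotation(ANNOTATION)
    print(name, params)
    print(generate_guard(params))
```

Keeping the parameters in a plain dictionary mirrors the slide's split between "these parameters" and "these values": the same generator can serve other guards by changing only the annotation text.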

10 ROTE: Retargetable Open Transformation Engine (COMPOSE-HPC)
Internal structure of ROTE:
1. Source code (C, C++, Fortran)
2. ROSE parses the source into an abstract syntax tree (AST)
3. Minitermite (src2term --stratego) converts the AST into a term representation
4. Stratego/XT applies the rewrite rules, giving a transformed term representation
5. Minitermite (term2src --stratego) converts the terms back into a transformed AST
6. ROSE unparses the transformed AST into transformed source code (C, C++, Fortran)

11 A Simple Example of a Rewrite Rule (COMPOSE-HPC)

Before: x = (a+b)*c;
After:  x = a*c + b*c;

    R1 :
      multiply_op(
        add_op(stratego_a, stratego_b,
               binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        STRATEGO_c,
        binary_op_annotation(type_int, preprocessing_info(list)), gen_info())
      ->
      add_op(
        multiply_op(stratego_a, stratego_c,
                    binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        multiply_op(stratego_b, stratego_c,
                    binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        binary_op_annotation(type_int, preprocessing_info(list)), gen_info())

Color Legend: syntactic structure, variables, types, source-to-source location info, ROSE AST annotations
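The mechanics behind a rule like R1 (match a term against a pattern, bind the variables, build the replacement) can be sketched in a few lines. The code below is a toy term rewriter, not Stratego/XT: terms are nested Python tuples, pattern variables are strings starting with "?", and the type and location annotations carried by the real terms are left out. All names are hypothetical.

```python
def match(pattern, term, bindings):
    """Match `term` against `pattern`; return extended bindings, or None on failure."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings if bindings[pattern] == term else None
        return {**bindings, pattern: term}
    if (isinstance(pattern, tuple) and isinstance(term, tuple)
            and len(pattern) == len(term)):
        for sub_pattern, sub_term in zip(pattern, term):
            bindings = match(sub_pattern, sub_term, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None

def substitute(template, bindings):
    """Instantiate `template` by replacing each variable with its bound term."""
    if isinstance(template, str) and template.startswith("?"):
        return bindings[template]
    if isinstance(template, tuple):
        return tuple(substitute(part, bindings) for part in template)
    return template

def rewrite(term, rule):
    """Apply `rule` bottom-up, trying it once at every node of `term`."""
    lhs, rhs = rule
    if isinstance(term, tuple):
        term = tuple(rewrite(part, rule) for part in term)
    bindings = match(lhs, term, {})
    return substitute(rhs, bindings) if bindings is not None else term

# (a + b) * c  ->  a * c + b * c
DISTRIBUTE = (("mul", ("add", "?a", "?b"), "?c"),
              ("add", ("mul", "?a", "?c"), ("mul", "?b", "?c")))

if __name__ == "__main__":
    before = ("assign", "x", ("mul", ("add", "a", "b"), "c"))
    print(rewrite(before, DISTRIBUTE))
```

Because the rule is data, the same `rewrite` function applies it everywhere it matches in a code base, which is the property that plain text substitution cannot offer.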

12 Structural Differencing to Yield Patches (COMPOSE-HPC)
Goal: Infer rewrite rules from examples of transformation

Before:

    int foo(int a, int b, int c) { int x; x = (a+b)*c; return x; }

After:

    int foo(int a, int b, int c) { int x; x = a*c + b*c; return x; }

Stratego rewrite rule inferred from example code:

    R1 :
      multiply_op(
        add_op(stratego_a, stratego_b,
               binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        STRATEGO_c,
        binary_op_annotation(type_int, preprocessing_info(LIST)), gen_info())
      ->
      add_op(
        multiply_op(stratego_a, stratego_c,
                    binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        multiply_op(stratego_b, stratego_c,
                    binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        binary_op_annotation(type_int, preprocessing_info(list)), gen_info())
    strategies

13 Woven Representation for Differencing (COMPOSE-HPC)

14 Turning Structural Differences into Rewrite Rules (COMPOSE-HPC)

    R1 :
      multiply_op(
        add_op(stratego_a, stratego_b,
               binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        STRATEGO_c,
        binary_op_annotation(type_int, preprocessing_info(LIST)), gen_info())
      ->
      add_op(
        multiply_op(stratego_a, stratego_c,
                    binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        multiply_op(stratego_b, stratego_c,
                    binary_op_annotation(type_int, preprocessing_info(list)), gen_info()),
        binary_op_annotation(type_int, preprocessing_info(list)), gen_info())
    strategies
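One simple way to picture the step from a before/after example to a rule is: walk both trees in parallel, descend while exactly one child differs, and then generalize the differing pair by turning shared leaves into variables. The sketch below does just that on tuple terms. It is a deliberately simplified stand-in, not the differencing algorithm or woven representation used in COMPOSE-HPC, and the term encoding of the `foo` example is invented for the illustration.

```python
def diff(before, after):
    """Return the smallest (before, after) subterm pair that differs, or None if equal."""
    if before == after:
        return None
    same_shape = (isinstance(before, tuple) and isinstance(after, tuple)
                  and len(before) == len(after) and before[0] == after[0])
    if same_shape:
        changed = [(b, a) for b, a in zip(before[1:], after[1:]) if b != a]
        if len(changed) == 1:
            # Only one child differs, so the change is localized below it.
            return diff(*changed[0])
    return before, after

def leaves(term):
    """Collect the leaf symbols of a term, ignoring constructor names."""
    if isinstance(term, tuple):
        return set().union(*(leaves(child) for child in term[1:]))
    return {term}

def generalize(before, after):
    """Turn a concrete pair into a rule: leaves shared by both sides become variables."""
    shared = leaves(before) & leaves(after)

    def lift(term):
        if isinstance(term, tuple):
            return (term[0],) + tuple(lift(child) for child in term[1:])
        return "?" + term if term in shared else term

    return lift(before), lift(after)

# Hypothetical term encodings of the two versions of foo from the example.
BEFORE = ("func", "foo", ("block", ("decl", "x"),
          ("assign", "x", ("mul", ("add", "a", "b"), "c")), ("return", "x")))
AFTER = ("func", "foo", ("block", ("decl", "x"),
         ("assign", "x", ("add", ("mul", "a", "c"), ("mul", "b", "c"))), ("return", "x")))

if __name__ == "__main__":
    print(generalize(*diff(BEFORE, AFTER)))
```

On this example the inferred rule is the distributive rewrite over variables ?a, ?b and ?c. Real inference also has to decide how much context to keep and how to treat annotations, which this sketch ignores.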

15 BRAID Customized Code Generation (COMPOSE-HPC)
- Front ends: SIDL parser; PAUL annotations (under development); ROTE bridge
- Declarative IR (Intermediate Representation)
  - Based on SIDL
  - Term-based
  - Language-independent
  - Template-driven / rule-based transformations
  - IR optimization
  - Typemaps
- Code generators: C, C++, FORTRAN 77, Fortran 90/95, Fortran 2003/08, Python, Java, Chapel

16 Building on KNOT: Transformation Tools (for Exascale) (COMPOSE-HPC)
- Language interoperability (Chapel w/ C, C++, Fortran, Java)
- Transformation of NWChem dynamic load balancing using Global Array Toolkit shared counter to use TASCEL task pools with work stealing
- Porting to accelerator-based versions of libraries
- Selective instrumentation to facilitate simulation of exascale systems
- Composing code with different preferences for MPI processes vs threads
- Interface contracts
  - Verify code after transformations
  - Sanity checks for resilience (detect silent errors)

17 Porting to New Programming Models (COMPOSE-HPC)
- NWChem SCF uses global shared counter for dynamic load balancing
- Replace with TASCEL task pools with work stealing for load balancing
- Use a series of four KNOT-based transformations
  1. Separate obtaining next task and what the ID means
  2. Localize task enumeration
  3. Locally filter for sparsity
  4. Introduce TASCEL calls to queue tasks
[Figure: two plots of parallel efficiency vs. number of cores, each comparing Base and TASCEL; panels are labeled by number of Be atoms, one of them 352 Be atoms]
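To show what "task pools with work stealing" means operationally, here is a small single-threaded simulation. Tasks are enumerated into per-worker pools; a worker takes work from the front of its own pool and, when that is empty, steals from the back of the fullest pool. This is only a model of the scheduling idea: it does not use the TASCEL API, it is not the NWChem transformation, and the function name and cost values are made up.

```python
from collections import deque

def simulate_work_stealing(task_costs, n_workers):
    """Simulate per-worker task pools with stealing; return (tasks per worker, makespan)."""
    n_tasks = len(task_costs)
    # Local enumeration: each worker's pool starts with a contiguous block of task IDs.
    pools = [deque() for _ in range(n_workers)]
    for task_id in range(n_tasks):
        pools[task_id * n_workers // n_tasks].append(task_id)

    clocks = [0] * n_workers               # simulated time at which each worker is free
    executed = [[] for _ in range(n_workers)]
    active = set(range(n_workers))
    while active:
        # The next worker to act is the one that becomes free earliest.
        worker = min(sorted(active), key=lambda w: clocks[w])
        if pools[worker]:
            task = pools[worker].popleft()  # own work: take from the front
        else:
            victim = max(range(n_workers), key=lambda w: len(pools[w]))
            if not pools[victim]:
                active.remove(worker)       # nothing left anywhere: worker retires
                continue
            task = pools[victim].pop()      # stolen work: take from the back
        clocks[worker] += task_costs[task]
        executed[worker].append(task)
    return executed, max(clocks)

if __name__ == "__main__":
    # Four expensive tasks followed by four cheap ones, on two workers.
    executed, makespan = simulate_work_stealing([8, 8, 8, 8, 1, 1, 1, 1], 2)
    print(executed, makespan)
```

With these costs a fixed split of tasks 0-3 and 4-7 would leave one worker with 32 time units of work; in the simulation the second worker finishes its cheap tasks early and steals two expensive ones, so the run ends at 20.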

18 Hercules
A code translation tool that systematically helps to find patterns and transform codes in applications.
Distinctive features:
- Infrastructure to manage program analysis information to facilitate the understanding of the application
- Automates the process of applying transformations multiple times throughout the code base
- Separation of concerns: application science vs. optimizations
- Documents the transformation process done by scientists
- Reusability of transformation workflow
- Implementation leverages compiler infrastructure, can combine transformations and optimizations

19 Architecture and Implementation (Hercules)

20 Workflow and Example (Hercules)

21 HScan: Pattern Detection (Hercules)
- HScan can scan a code base for patterns
  - Identify interesting points
  - General scan of code base for exploration/understanding
- Patterns of interest selected by users at invocation time
- Pattern library (currently) defined by Hercules developers
Examples
- Detection of stencils

    $ bash-3.2$ hscan --stm-stencil --sln-triangular application.f90
    hscan: (V) --stm-stencil: in application.f90::test02_: array B (regular 1d stencil of 3 points)
    hscan: (V) --stm-stencil: in application.f90::test11_: array A (regular 2d stencil of 5 points)
    hscan: (V) --stm-stencil: in application.f90::test11_: array A (regular 2d stencil of 4 points)
    hscan: (V) --sln-triangular: in application.b::test07_: loops at lines 72 & 74.

- Patterns for GPU porting in CAM/SE climate code

    hscan: (V) --sl-aos: in aquaplanet.b::aquaplanet_init_state.in.aquaplanet: in loops at line 542; arrays: [ELEM]
    hscan: (V) --sl-disjoint-reads-writes: in aquaplanet.b::aquaplanet_init_state.in.aquaplanet: loop at line 492; rset=[qve,tme], wset=[elem]
    hscan: (V) --sl-parallelizable:in aquaplanet.b::aquaplanet_forcing.in.aquaplanet: loop at line 147.
    hscan: (V) --sl-parallelizable: in aquaplanet.b::aquaplanet_init_state.in.aquaplanet: loop at line
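A pattern such as "regular 1d stencil of 3 points" can be recognized from the syntax tree by collecting, for each array, the constant offsets at which it is read relative to a loop variable. The sketch below does this for Python loops using the standard `ast` module (Python 3.9 or later). It is a toy analogue written for this text: HScan operates on compiled Fortran/C through compiler infrastructure, and nothing here reflects its implementation or its pattern library.

```python
import ast
from collections import defaultdict

def offset_of(index, loop_var):
    """Return k if `index` is loop_var, loop_var + k or loop_var - k; otherwise None."""
    if isinstance(index, ast.Name) and index.id == loop_var:
        return 0
    if (isinstance(index, ast.BinOp)
            and isinstance(index.left, ast.Name) and index.left.id == loop_var
            and isinstance(index.right, ast.Constant)
            and isinstance(index.right.value, int)):
        if isinstance(index.op, ast.Add):
            return index.right.value
        if isinstance(index.op, ast.Sub):
            return -index.right.value
    return None

def find_1d_stencils(source):
    """List (array name, sorted offsets) for arrays read at several offsets in a for loop."""
    found = []
    for loop in ast.walk(ast.parse(source)):
        if not (isinstance(loop, ast.For) and isinstance(loop.target, ast.Name)):
            continue
        offsets = defaultdict(set)
        for node in ast.walk(loop):
            # Only reads of the form name[...] count; writes have a Store context.
            if (isinstance(node, ast.Subscript) and isinstance(node.ctx, ast.Load)
                    and isinstance(node.value, ast.Name)):
                k = offset_of(node.slice, loop.target.id)
                if k is not None:
                    offsets[node.value.id].add(k)
        found += [(name, sorted(ks)) for name, ks in offsets.items() if len(ks) > 1]
    return found

# Invented example loop: a three-point average over array b.
EXAMPLE = '''\
for i in range(1, n - 1):
    a[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0
'''

if __name__ == "__main__":
    for name, ks in find_1d_stencils(EXAMPLE):
        print(f"array {name}: 1d stencil of {len(ks)} points, offsets {ks}")
```

The scan only reads the source and reports candidates, which matches the exploratory use described above: the user decides afterwards what to transform.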


22 Spin-Off: Klonos Similarity Analysis
- The KLONOS system facilitates the porting of applications to new platforms
- Use supercomputers to analyze applications and create a porting plan
- Saves months of manual inspections of codes
- Analysis requires 1000s of computational hours, 2 TB of memory
[Diagram: CAM/SE climate code (simulation code) is processed by Hercules and the KLONOS system on NCCS supercomputers, producing a tree of families of similar codes and a porting plan to OLCF systems]
[Figure: source code similarity among CAM/SE procedures; classification of procedures by degree of similarity]

23 Summary and Conclusions
- The HPC community lacks robust, accessible tools to systematically transform code for refactoring and other purposes
- COMPOSE-HPC and Hercules begin to address this
  - Challenging (interesting) underlying problems
  - So far incomplete, but promising
- Trying to get the compiler expert out of the loop
- Potential to capture and share transformations in a systematic and reusable form
- Using novel measures of similarity to compare and comprehend code bases
