Research in High-Performance Grid Software

1 Research in High-Performance Grid Software Lennart Johnsson NADA, KTH and Department of Computer Science University of Houston

2 EU Project: Neurogenerator
Collection and processing of PET and fMRI data from neurological experiments
Partners: KI Neuroscience; KTH: PDC, TCS, CVAP; UU; Forewiss; Active Knowledge
Production starting February

3 Metadata Neurogenerator database schema Submission Interface Internet, DAT-tape etc. PDC/KI Submission Interface Manual inspection Population Raw Data Format conversion Segmentation Normalization Statistical Analysis Processing chains Workflow management User Interface Workflow User Interface Result databases Visualization

4 Swedish Space Corporation: ODIN research satellite Esrange PDC


5 Odin linus.esrange.ssc.se 300MB/day HSM toad.pdc.kth.se ftp://toad.pdc.kth.se goby.pdc.kth.se Level 0 Level 1a Level 1b Level 2 Mattias Claesson Odin data

6 Level 0 Odin Data Growth Level 1a Mattias Claesson


7 2003: PAMELA
Collaboration with institutes in: Bari, Florence, Moscow, NASA, NMSU, Rome, Siegen and Trieste
Main purpose: antiproton and positron fluxes in space
  80 MeV - 190 GeV (anti-p) / 50 MeV - 270 GeV (e+)
  2 x 10^4 anti-p and 2 x 10^5 e+ expected (2 years)
Also: H - C energy spectrum and search for anti-helium
Polar orbit (70.4°) allows the study of low energy particles
NB: Will be launched before AMS (now expected 2005)
KTH are responsible for anticoincidence shield
RE2B: PAMELA - CERN recognised experiment
Propose data downlink in Sweden - funding decision pending
Acknowledgement: Mark Pearce, SCFAB


8 2003: Downlink in Sweden
Acknowledgement: Mark Pearce, SCFAB
Collaboration is concerned with the proposal for a single downlink station in Moscow
  Only 1 Gbyte/day assured (8 minute pass). Does not maximise scientific return from PAMELA.
    For 5 Hz trigger rate, will record ~9 Gbyte per day.
    Bigger data set gives better understanding of systematic errors. Very important for antiproton studies.
  Likely delays between receipt of data and transmission to scientists in Europe.
    PAMELA mission is short (3 years). Need to be able to read data and fix problems efficiently.
Kiruna / Sturup has excellent coverage for the satellite's polar orbit
  Competitive price quote from SSC
  Data would be sent to Stockholm and then distributed to collaboration by internet: cheap and fast

9 Wavelength Disk Drives
Map labels (CA*net 3/4): Calgary, Regina, Winnipeg, St. John's, Vancouver, Montreal, Charlottetown, Fredericton, Halifax, Toronto, Ottawa; legend: WDD Node
Computer data continuously circulates around the WDD

11 SimDB Architecture

12 Biological Imaging
JEOL3000-FEG, Liquid He stage, NSF support
No. of Particles Needed for 3-D Reconstruction
  Resolution     8.5 Å     4.5 Å
  B = 100 Å^2    6,000     5,000,000
  B = 50 Å^2     3,        ,
500 Å
Å Structure of the HSV-1 Capsid


13 Vitrification Robot Particle Selection Power Spectrum Analysis EMAN Initial 3D Model EMEN Database Archival Data Mining Management Classify Particles Reproject 3D Model Align Average Deconvolute Build New 3D Model

14 Tele-Microscopy Osaka, Japan Mark Ellisman, UCSD

15 Computational Steering GEMSviz at iGrid 2000 INET NORDUnet Paralleldatorcentrum KTH Stockholm APAN STAR TAP University of Houston NORDUnet Sep 00 - #17

16 GrADS Grid Application Development Software

17 Grids Contract Development

18 Grids - Contract Development

19 Grids Contract Development

20 Grids Application Launch

21 Grids Library Evaluation

22 Grids Performance Models

23 Grids Library Evaluation

24 Grids Library Evaluation

26 Cactus Job Migration

27 Cactus Migration Architecture

28 Cactus Migration example

29 Adaptive Software

30 Challenges
Diversity of execution environments
  Growing complexity of modern microprocessors
    Deep memory hierarchies
    Out-of-order execution
    Instruction level parallelism
  Growing diversity of platform characteristics
    SMPs
    Clusters (employing a range of interconnect technologies)
    Grids (heterogeneity, wide range of characteristics)
Wide range of application needs
  Dimensionality and sizes
  Data structures and data types
  Languages and programming paradigms

31 Challenges
Algorithmic
  High arithmetic efficiency
    low floating-point vs. load/store ratio
  Unfavorable data access patterns (big 2^n strides)
  Application owns the data structures/layout
  Additions/multiplications unbalanced
Version explosion
  Verification
  Maintenance

32 Opportunities
Multiple algorithms with comparable numerical properties for many functions
Improved software techniques and hardware performance
Integrated performance monitors, models and data bases
Run-time code construction

33 Approach
Automatic algorithm selection: polyalgorithmic functions (CMSSL, FFTW, ATLAS, SPIRAL, ...)
Exploit multiple precision options
Code generation from high-level descriptions (WASSEM, CMSSL, CM-Convolution-Compiler, FFTW, UHFFT, SPIRAL, ...)
Integrated performance monitoring, modeling and analysis
Judicious choice between compile-time and run-time analysis and code construction
Automated installation process

34 The UHFFT
Program preparation at installation (platform dependent)
Integrated performance models (in progress) and data bases
Algorithm selection at run-time from set defined at installation
Automatic multiple precision constant generation
Program construction at run-time based on application and performance predictions

35 Performance Tuning Methodology
Installation
  Input parameters: system specifics, user options
  UHFFT code generator
  Library of FFT modules
  Performance database
Run-time
  Input parameters: size, dim., ...
  Initialization: select best plan (factorization)
  Execution: calculate one or more FFTs
  Performance monitoring
  Database update
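The "select best plan" step above can be pictured as a lookup over performance-database records. The following is only a minimal sketch under stated assumptions: the `plan_record` type, its fields, and `select_best_plan()` are hypothetical names for illustration and are not taken from the UHFFT sources.

```c
#include <stddef.h>

/* One performance-database entry. Both fields are hypothetical names. */
typedef struct {
    const char *factorization; /* illustrative label such as "4x8" */
    double seconds;            /* measured or predicted execution time */
} plan_record;

/* Return the index of the fastest candidate plan, or -1 if there are none. */
int select_best_plan(const plan_record *plans, size_t count)
{
    int best = -1;
    for (size_t i = 0; i < count; ++i) {
        if (best < 0 || plans[i].seconds < plans[best].seconds)
            best = (int)i;
    }
    return best;
}
```

In this picture, the run-time "database update" step would overwrite `seconds` with fresh measurements so that later initializations pick a different plan if conditions change.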

36 The UHFFT Software Architecture
UHFFT Library
  Library of FFT Modules
  Initialization Routines
  Execution Routines
  Utilities
FFT Code Generator
  Initializer (Algorithm Abstraction): Mixed-Radix (Cooley-Tukey), Prime Factor Algorithm, Split-Radix Algorithm, Rader's Algorithm
  Optimizer
  Scheduler
  Unparser
Key: Fixed library code, Generated code, Code generator

37 The UHFFT: Code Generation
Structure
  Algorithm abstraction
  Optimization
  Generation of a DAG
  Scheduling of instructions
  Unparsing
Implementation
  Code generator is written in C
  Speed, portability and installation tuning
  Highly optimized straight line C code
  Generates FFT codelets of arbitrary size, direction, and rotation

38 The UHFFT: Code Generation (cont'd)
Basic structure is an Expression
  Constant, variable, sum, product, sign change, ...
Basic functions
  Expression sum, product, assign, sign change, ...
Derived structures
  Expression vectors, matrices and lists
Higher level functions
  Matrix vector operations
  FFT specific operations
Algorithms currently supported
  Rader (two versions), PFA, Split-radix, Mixed-radix

39 The UHFFT: Factorization Logic
  if n <= 2 use DFT
  else if n is prime use Rader's algorithm
  else {
    Choose factor r of n
    if r and n/r are coprime use PFA
    else if n is divisible by r^2 and n > r^3 use Split-Radix algorithm
    else use Mixed-radix algorithm
  }
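The decision rule on this slide can be written as a small C function. This is a sketch, not the library's code: the enum and function names are hypothetical, the flattened "r 2" and "r 3" are read as r^2 and r^3, and because the slide does not say how the factor r is chosen, the caller passes it in (assumed to satisfy 1 < r < n with r dividing n).

```c
#include <stdbool.h>

/* Hypothetical labels for the algorithms named on the slide. */
typedef enum { ALG_DFT, ALG_RADER, ALG_PFA, ALG_SPLIT_RADIX, ALG_MIXED_RADIX } fft_alg;

/* Greatest common divisor by Euclid's algorithm. */
int gcd_int(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Trial-division primality test, adequate for small transform sizes. */
bool is_prime_int(int n)
{
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d)
        if (n % d == 0) return false;
    return true;
}

/* Pick an algorithm for size n given a chosen factor r of n.
 * r is ignored when n <= 2 or n is prime. */
fft_alg choose_algorithm(int n, int r)
{
    if (n <= 2) return ALG_DFT;
    if (is_prime_int(n)) return ALG_RADER;
    if (gcd_int(r, n / r) == 1) return ALG_PFA;
    if (n % (r * r) == 0 && (long long)n > (long long)r * r * r)
        return ALG_SPLIT_RADIX;
    return ALG_MIXED_RADIX;
}
```

For example, n = 6 with r = 3 selects PFA because 3 and 2 are coprime, and n = 3 selects Rader's algorithm because 3 is prime.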

40 The UHFFT: Representation of Factorization
FFTPrimeFactor n = 6, r = 3, dir = Forward, rot = 1
  FFTRader n = 3, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Inverse, rot = 1
  FFTRader n = 3, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Inverse, rot = 1
  DFT n = 2, r = 2, dir = Forward, rot = 1
  DFT n = 2, r = 2, dir = Forward, rot = 1
  DFT n = 2, r = 2, dir = Forward, rot = 1

41 The UHFFT: Code Generation
Mixed-Radix Algorithm
Equation:
  $W_n = (W_r \otimes I_m)\, D_{r,m}\, (I_r \otimes W_m)\, \Pi_{n,r}$
Is implemented as:
  /*
   * FFTMixedRadix() Mixed-radix splitting.
   * Input:
   *   r         radix,
   *   dir, rot  direction and rotation of the transform,
   *   u         input expression vector.
   */
  ExprVec *FFTMixedRadix(int r, int dir, int rot, ExprVec *u)
  {
      int m, n = u->n, *p;

      m = n/r;
      p = ModRSortPermutation(n, r);
      u = FFTxI(r, m, dir, rot,
                TwiddleMult(r, m, dir, rot,
                            IxFFT(r, m, dir, rot,
                                  PermuteExprVec(u, p))));
      free(p);
      return u;
  }
March 11, 2003
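FFTMixedRadix() starts by permuting its input with ModRSortPermutation(n, r), whose body is not shown on the slide. The sketch below is one plausible reading of a mod-r sort, offered as an assumption: `mod_r_sort_permutation()` is a hypothetical stand-in that returns, for each output position, the input index it takes, so that elements with equal index modulo r end up in one contiguous block of length m = n/r. Whether the real routine stores source or destination indices is not stated in the source.

```c
#include <stdlib.h>

/* Hypothetical stand-in for ModRSortPermutation(): p[a*m + b] = b*r + a,
 * i.e. block a of the output collects the inputs whose index is a modulo r.
 * Assumes r divides n. The caller frees the result; returns NULL on failure. */
int *mod_r_sort_permutation(int n, int r)
{
    int m = n / r;
    int *p = malloc((size_t)n * sizeof *p);
    if (p == NULL)
        return NULL;
    for (int a = 0; a < r; ++a)        /* residue class modulo r */
        for (int b = 0; b < m; ++b)    /* position inside the block */
            p[a * m + b] = b * r + a;
    return p;
}
```

With this layout each of the r blocks is a contiguous length-m vector, which is the shape the $I_r \otimes W_m$ factor operates on.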

42 The UHFFT: Performance Modeling
Analytic models
  Cache influence on library codes
  Performance measuring tools (PCL, PAPI)
  Prediction of composed code performance
  Updated from execution experience
Data base
  Library codes. Recorded at installation time
  Composed codes. Recorded and updated for each execution.

43 The UHFFT: Execution Plan Generation
Optimal plan search options
  Exhaustive
  Recursive
  Empirical
Algorithms used
  Rader (FFTW, UHFFT)
  PFA (UHFFT)
  Split-radix (UHFFT)
  Mixed-radix (FFTW, SPIRAL, UHFFT)

44 Characteristics of Some Processors
Processor          Clock frequency   Peak Performance   Cache structure
Intel Pentium IV   1.8 GHz           1.8 GFlops         L1: 8K+8K, L2: 256K
AMD Athlon         1.4 GHz           1.4 GFlops         L1: 64K+64K, L2: 256K
PowerPC G4         867 MHz           867 MFlops         L1: 32K+32K, L2: 256K, L3: 1-2M
Intel Itanium      800 MHz           3.2 GFlops         L1: 16K+16K, L2: 96K, L3: 2-4M
IBM Power3/4       375 MHz           1.5 GFlops         L1: 64K+32K, L2: 1-16M
HP PA 8x           MHz               3 GFlops           L1: 1.5M M
Alpha EV67/        MHz               1.66 GFlops        L1: 64K+64K, L2: 4M
MIPS R1x           MHz               1 GFlop            L1: 32K+32K, L2: 4M
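The peak figures in this table follow from clock frequency times floating-point operations per cycle, and the efficiency plots on the following slides are presumably measured rate over peak. A minimal sketch, with hypothetical function names and with the flops-per-cycle values inferred from the table ratios (not stated in the source):

```c
/* Peak floating-point rate in MFlops: clock (MHz) times flops per cycle. */
double peak_mflops(double clock_mhz, double flops_per_cycle)
{
    return clock_mhz * flops_per_cycle;
}

/* Fraction of peak reached by a measured rate; 0 for a non-positive peak. */
double efficiency(double measured_mflops, double peak)
{
    return peak > 0.0 ? measured_mflops / peak : 0.0;
}
```

For the Itanium row, 800 MHz at an assumed 4 flops per cycle gives 3200 MFlops, matching the 3.2 GFlops listed; the Pentium IV row corresponds to 1 flop per cycle.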

45 Codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz

46 Radix-4 codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz

47 Radix-8 codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz

48 Plan Performance, 32-bit Architectures

49 Power3 plan performance (plot labels: MFLOPS, Plan; 222 MHz, 888 Mflops)

50 Power3 plan performance PFA sizes 800 Mflops peak

51 Itanium..
Processor            Clock frequency   Peak Performance   Cache structure
Intel Itanium        800 MHz           3.2 GFlops         L1: 16K+16K (Data+Instruction), L2: 96K, L3: 2-4M (off-die)
Intel Itanium        MHz               3.6 GFlops         L1: 16K+16K (Data+Instruction), L2: 256K, L3: 1.5M (on-die)
Intel Itanium        MHz               4 GFlops           L1: 16K+16K (Data+Instruction), L2: 256K, L3: 3M (on-die)
Sun UltraSparc-III   750 MHz           1.5 GFlops         L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write), L2: up to 8M (off-die)
Sun UltraSparc-III   1050 MHz          2.1 GFlops         L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write), L2: up to 8M (off-die)
Tested configuration
VIGrid Kick-off, Lennart Johnsson

52 Memory Hierarchy
                              Itanium-2 (McKinley)               Itanium
L1I and L1D
  Size:                       16KB + 16KB                        16KB + 16KB
  Line size/associativity:    64B/4-way                          32B/4-way
  Latency:                    1 cycle                            1 cycle
  Write Policies:             Write through, No write allocate   Write through, No write allocate
Unified L2
  Size:                       256KB                              96KB
  Line size/associativity:    128B/8-way                         64B/6-way
  Integer Latency:            Min 5 cycles                       Min 6 cycles
  FP Latency:                 Min 6 cycles                       Min 9 cycles
  Write Policies:             Write back, write allocate         Write back, write allocate
Unified L3
  Size:                       3MB or 1.5MB on chip               4MB or 2MB off chip
  Line size/associativity:    128B/12-way                        64B/4-way
  Integer Latency:            Min 12 cycles                      Min 21 cycles
  FP Latency:                 Min 13 cycles                      Min 24 cycles
  Bandwidth:                  32B/cycle                          16B/cycle

53 Itanium Comparison
Workstation   HP i2000                   HP zx2000
Processor     800 MHz Intel Itanium      900 MHz Intel Itanium 2 (McKinley)
Bus Speed     133 MHz                    400 MHz
Bus Width     64 bit                     128 bit
Chipset       Intel 82460GX              HP zx1
Memory        2 GB SDRAM (133 MHz)       2 GB DDR SDRAM (266 MHz)
OS            64-bit Red Hat Linux 7.1   HP version of the 64-bit RH Linux 7.2
Compiler      Intel 6.0                  Intel 6.0


54 HP zx1 Chipset
2-way block diagram
Features:
  2-way and 4-way
  Low latency connection to the DDR memory (112 ns)
    Directly (112 ns latency)
    Through (up to 12) scalable memory expanders (+25 ns latency)
  Up to 64 GB of DDR today (256 in the future)
  AGP 4x today (8x in the future versions)
  1-8 I/O adapters supporting PCI, PCI-X, AGP

55 UHFFT Codelet Performance

56 UHFFT Codelet Performance

57 UHFFT Codelet Performance

58 Codelet Performance Radix-2

59 Codelet Performance Radix-3

60 Codelet Performance Radix-4

61 Codelet Performance Radix-5

62 Codelet Performance Radix-6

63 Codelet Performance Radix-7

64 Codelet Performance Radix-8

65 Codelet Performance Radix-9

66 Codelet Performance Radix-10

67 Codelet Performance Radix-11

68 Codelet Performance Radix-12

69 Codelet Performance Radix-13

70 Codelet Performance Radix-14

71 Codelet Performance Radix-15

72 Codelet Performance Radix-16

73 Codelet Performance Radix-24

74 Codelet Performance Radix-32

75 Codelet Performance Radix-64

76 The UHFFT: Summary
Code generator written in C
Code is generated at installation
Codelet library is tuned to the underlying architecture
The whole library can be easily customized through parameter specification
  No need for laborious manual changes in the source
  Existing code generation infrastructure allows easy library extensions
Future:
  Inclusion of vector/streaming instruction set extension for various architectures
  Implementation of new scheduling/optimization algorithms
  New codelet types and better execution routines
  Unified algorithm specification on all levels

77 New Tools for Library Code Development
Generalization of the tools developed for the UHFFT library
CODELAB: A Developers' Tool for Efficient Code Generation and Optimization
Combination of
  High-level scripting language
  Code generator
  Performance measurement tools
  Visualization
Under development
Several test examples show very promising results

78 CODELAB IDE STRUCTURE CODELAB IDE Script Interpreter Code Generator Visualization Performance measurement User Input Library code Support Code Compiler Execution Operating System Application

79 CODELAB Structure
(Diagram labels: Library code, Support Code, Application)
Application consists of
  Library code: Automatically generated and optimized collection of subroutines
  Supporting code: Code that binds the library routines together
    It could be hand-written or automatically generated
Application is instrumented for performance measurements automatically

80 CODELAB Structure
(Diagram labels: User Input, Script Interpreter, Supporting Code, Application)
User writes:
  Simple script that produces the code generator or supporting code
  Supporting code for the application
The code generator should be able to produce a large variety of code depending on a few input parameters (otherwise it is simpler to write the code by hand)
  Example: A single code generator for FFT codelets of different size, type, direction, rotation, ...
Script Interpreter:
  Simplifies construction of the code generator
  Very restricted set of commands at the moment

81 CODELAB Structure
(Diagram labels: Code Generator, Library code, Support Code, Application, Visualization)
Code generator
  C program that generates the application and supporting code
  Uses abstract expression algebra for code generation
  Several layers of software:
    Basic expressions
    Complex expressions algebra
    Vector and matrix algebra
    Polynomial algebra
  The generated code can be instrumented for performance measurements
  The initial expression list is transformed into DAG and optimized:
    Simplification of expressions
    Folding of constants
  User can get a variety of information about the generated code:
    Number of arithmetic ops
    DAG graph, etc.
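The "folding of constants" step can be illustrated on a toy expression type. This is a sketch under assumptions, not CODELAB's data structure: the `expr` type and every function name here are hypothetical, and the example uses a plain tree for brevity, whereas a real DAG with shared nodes would need reference counting before nodes could be freed.

```c
#include <stdlib.h>

typedef enum { EXPR_CONST, EXPR_VAR, EXPR_SUM, EXPR_PRODUCT } expr_kind;

/* Hypothetical expression node: constant, variable, sum or product. */
typedef struct expr {
    expr_kind kind;
    double value;            /* used by EXPR_CONST */
    int var;                 /* used by EXPR_VAR: variable index */
    struct expr *lhs, *rhs;  /* used by EXPR_SUM and EXPR_PRODUCT */
} expr;

expr *expr_new(expr_kind kind, double value, int var, expr *lhs, expr *rhs)
{
    expr *e = malloc(sizeof *e);
    if (e == NULL)
        abort();
    e->kind = kind; e->value = value; e->var = var; e->lhs = lhs; e->rhs = rhs;
    return e;
}

expr *expr_const(double v)           { return expr_new(EXPR_CONST, v, -1, NULL, NULL); }
expr *expr_var(int index)            { return expr_new(EXPR_VAR, 0.0, index, NULL, NULL); }
expr *expr_sum(expr *a, expr *b)     { return expr_new(EXPR_SUM, 0.0, -1, a, b); }
expr *expr_product(expr *a, expr *b) { return expr_new(EXPR_PRODUCT, 0.0, -1, a, b); }

/* Release a whole expression tree. */
void expr_free(expr *e)
{
    if (e == NULL)
        return;
    expr_free(e->lhs);
    expr_free(e->rhs);
    free(e);
}

/* Fold constants bottom-up, in place: a sum or product whose operands are
 * both constants is replaced by the single constant it evaluates to. */
expr *fold_constants(expr *e)
{
    if (e->kind != EXPR_SUM && e->kind != EXPR_PRODUCT)
        return e;
    e->lhs = fold_constants(e->lhs);
    e->rhs = fold_constants(e->rhs);
    if (e->lhs->kind == EXPR_CONST && e->rhs->kind == EXPR_CONST) {
        double v = (e->kind == EXPR_SUM) ? e->lhs->value + e->rhs->value
                                         : e->lhs->value * e->rhs->value;
        free(e->lhs);
        free(e->rhs);
        e->kind = EXPR_CONST;
        e->value = v;
        e->lhs = e->rhs = NULL;
    }
    return e;
}
```

Folding `x0 + (2 * 3)` leaves a sum whose right operand is the constant 6, which saves one multiplication in the generated straight-line code.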

82 CODELAB Structure
(Diagram labels: Visualization, Compiler, Performance measurements, Execution, Operating System)
Performance measurements
  The application can be compiled and executed from within the IDE
  The performance data are collected and visualized
  User can modify the code and repeat the process until a satisfactory performance is obtained
  Detailed performance information by using PAPI library interface

83 CODELAB Applications
FFT and DSP Libraries
Efficient multiple precision arithmetic
Finite Element Methods
Linear Algebra
Other well structured applications that allow for simple parameterization: few parameters define a large variety of code

84 Overview
UHFFT Performance on some new architectures
  Intel Itanium 800 MHz, Intel Itanium 2 (McKinley) 900 MHz
  Sun UltraSparc-III 750 MHz
New Tools for Library Code Development
  CODELAB Integrated Development Environment (IDE)
    Introduction
    Structure of the CODELAB IDE
    Applications

85 The UHFFT: An Adaptive FFT Library
UHFFT employs more ways of combining codelets for execution than any other library
  Better coverage of the space of possible algorithms
The PFA algorithm yields good performance where the Mixed-Radix algorithm (MR) performs poorly
  PFA algorithm requires fewer FP operations than MR
  Data access pattern in PFA is more complex than in MR, but large 2^n strides can be avoided
Example: IBM Power3
  Good: 128-way set associative L1 data and instruction caches
  Bad: Direct mapped L2 cache, very vulnerable to cache thrashing despite the large cache size
  PFA execution model works better for large FFT sizes
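Why large 2^n strides hurt a direct-mapped cache can be seen by counting how many distinct cache sets a strided access sequence touches. The sketch below is illustrative only: `distinct_cache_sets()` is a hypothetical helper, and the cache geometry used in the example afterwards (128-byte lines, 1024 sets, 16-byte complex elements) is an assumption chosen for round numbers, not the Power3's actual L2 parameters.

```c
#include <stdlib.h>

/* Count the distinct sets of a direct-mapped cache touched by `count`
 * accesses to elements of `elem_bytes` bytes spaced `stride_elems` apart.
 * Returns 0 if the geometry is invalid or memory cannot be allocated. */
size_t distinct_cache_sets(size_t count, size_t stride_elems, size_t elem_bytes,
                           size_t line_bytes, size_t num_sets)
{
    if (line_bytes == 0 || num_sets == 0)
        return 0;
    unsigned char *seen = calloc(num_sets, 1);
    if (seen == NULL)
        return 0;
    size_t distinct = 0;
    for (size_t i = 0; i < count; ++i) {
        size_t addr = i * stride_elems * elem_bytes;  /* byte offset of access i */
        size_t set = (addr / line_bytes) % num_sets;  /* direct-mapped set index */
        if (!seen[set]) {
            seen[set] = 1;
            ++distinct;
        }
    }
    free(seen);
    return distinct;
}
```

Under the assumed geometry, 16 accesses at a stride of 8192 elements all land in a single set, so each access evicts the previous one, while a stride of 1000 elements spreads the same 16 accesses over 16 different sets.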

86 Acknowledgements
GrADS contributors: Dave Angulo, Ruth Aydt, Fran Berman, Andrew Chien, Keith Cooper, Holly Dail, Jack Dongarra, Ian Foster, Sridhar Gullapalli, Lennart Johnsson, Ken Kennedy, Carl Kesselman, Chuck Koelbel, Bo Liu, Chuang Liu, Xin Liu, Anirban Mandal, Mark Mazina, John Mellor-Crummey, Celso Mendes, Graziano Obertelli, Alex Olugbile, Mitul Patel, Dan Reed, Martin Swany, Linda Torczon, Satish Vadhiyar, Shannon Whitmore, Rich Wolski, Huaxia Xia, Lingyun Yang, Asim YarKhan, ...
Funding: NSF Next Generation Software initiative, Los Alamos Computer Science Institute
