Grid Computing: Application Development


Grid Computing: Application Development Lennart Johnsson Department of Computer Science and the Texas Learning and Computation Center University of Houston Houston, TX Department of Numerical Analysis and Computer Science and PDC Royal Institute of Technology Stockholm, Sweden

Grids are Hot Courtesy Ken Kennedy

Grids: Application Development Courtesy Ken Kennedy

Grids: Middleware Development Courtesy Ken Kennedy

Grids: Administration Courtesy Ken Kennedy

Grid Application Development Courtesy Ken Kennedy

PDC Grid Application Development: A Software Grand Challenge Goal: Reliable performance on heterogeneous platforms, under varying load on computation nodes and on communication links Challenges: Presenting a high-level application development interface If programming is hard, it's useless Designing and constructing applications for adaptability Mapping applications to dynamically changing configurations Determining when to interrupt execution and remap Application monitors Performance estimators

What is Needed Execution infrastructure for reliable performance Automatic resource location and execution initiation dynamic configuration to available resources Performance monitoring and control strategies deep integration across compilers, tools, and runtime systems performance contracts and dynamic reconfiguration Abstract Grid programming models and easy-to-use programming interfaces design of an implementation strategy for those models problem-solving environments Robust reliable numerical and data-structure libraries predictability and robustness of accuracy and performance reproducibility, fault tolerance, and auditability

PDC GrADSoft Architecture Goal: reliable performance under varying load Performance Feedback Software Components Performance Problem Real-time Performance Monitor Whole-Program Compiler Source Application Configurable Object Program Service Negotiator Scheduler Negotiation Grid Runtime System Libraries Dynamic Optimizer GrADS Project (NSF NGS): Berman, Chien, Cooper, Dongarra, Foster, Gannon, Johnsson, Kennedy, Kesselman, Mellor-Crummey, Reed, Torczon, Wolski

PDC GrADSoft Architecture Program Preparation System Execution Environment Performance Feedback Software Components Performance Problem Real-time Performance Monitor Whole-Program Compiler Source Application Configurable Object Program Service Negotiator Scheduler Negotiation Grid Runtime System Libraries Dynamic Optimizer

High Level Programming System Built on Libraries Coded by Professionals Included mapping strategies and cost models High level knowledge of integration strategies Integration Produces Configurable Object Program Integrated mapping strategy and cost model Performance enhanced through context-dependent variants Context includes potential execution platform Dynamic Optimizer Performs Final Binding Implements mapping strategy Chooses machine-specific variants Inserts monitoring probes Strategy: move compilation overhead to PSE generation time

PDC Configurable Object Program Representation of the Application Supporting dynamic reconfiguration and optimization for distributed targets, may include Program intermediate code Annotations from the compiler mapping strategy and performance model Historical information (run profile to now) Mapping strategies Aggregation of data regions (submeshes) or tasks Definition of parameters for algorithm selection Challenge: synthesis of performance models User input, especially associated with libraries Synthesize different components, scale models to problem size

PDC Execution Cycle Configurable Object Program is presented Space of feasible resources must be defined Mapping strategy and performance model provided Service Negotiator solicits acceptable resource collections Performance model is used to evaluate each Best match is selected and contracted for Execution begins Dynamic optimizer tailors program to resources Selects mapping strategy Inserts sensors Contract monitoring is conducted during execution Soft violation detection based on fuzzy logic
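As a rough illustration of this cycle, the sketch below (not GrADSoft code; ResourceSet, PerfModel, and the cost formula are illustrative assumptions) ranks the resource collections solicited by the Service Negotiator with a performance model and returns the best match to contract for.

#include <stddef.h>
#include <float.h>

typedef struct { int nodes; double cpu_mflops; double link_mbytes_per_s; } ResourceSet;
typedef struct { size_t n; } PerfModel;   /* problem size only, for illustration */

/* Toy performance model: O(n^2) work spread over the nodes plus a
 * simple communication term; purely illustrative. */
static double estimate_runtime(const PerfModel *m, const ResourceSet *r)
{
    double flops = (double)m->n * (double)m->n;
    double comp  = flops / (r->nodes * r->cpu_mflops * 1e6);
    double comm  = (r->nodes > 1)
                 ? (double)m->n * 8.0 / (r->link_mbytes_per_s * 1e6)
                 : 0.0;
    return comp + comm;
}

/* Evaluate the performance model on each feasible collection and pick the best. */
int select_best_collection(const PerfModel *m, const ResourceSet *candidates, int count)
{
    int best = -1;
    double best_time = DBL_MAX;
    for (int i = 0; i < count; i++) {
        double t = estimate_runtime(m, &candidates[i]);
        if (t < best_time) { best_time = t; best = i; }
    }
    return best;   /* index of the collection to contract for */
}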

PDC Performance Contracts At the Heart of the GrADS Model Fundamental mechanism for managing mapping and execution What are they? Mappings from resources to performance Mechanisms for determining when to interrupt and reschedule Abstract Definition Random variable r(A, I, C, t0) with a probability distribution, where A = application, I = input, C = configuration, t0 = time of initiation Important statistics: lower and upper bounds (95% confidence) Challenge When should a contract be (viewed as) violated? Strict adherence balanced against cost of reconfiguration

Grids - Performance Courtesy Dan Reed

Grids - Performance Courtesy Dan Reed

Performance Signatures Courtesy Dan Reed

PDC Contract Monitoring Input: Performance model Integrated from a variety of sources: user, compiler, application experience, application signatures Resources contracted for Trigger Registration information from sensors installed in applications Output Inserted by dynamic optimizer or user Rule based contract monitor that decides when contract violation is serious enough to merit reconfiguration Based on information from sensors
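A minimal sketch of a rule-based check of this kind (not the GrADS contract monitor; the names, the bound test, and the tolerance rule are illustrative assumptions): a contract is treated as violated only when the measured progress rate stays outside the contracted bounds for several consecutive sensor reports.

typedef struct {
    double lower, upper;     /* contracted bounds on progress rate         */
    int    tolerance;        /* consecutive bad reports before violation   */
    int    bad_reports;      /* current run of out-of-bounds reports       */
} Contract;

/* Returns 1 when the violation is serious enough to merit
 * reconfiguration, 0 otherwise. */
int contract_check(Contract *c, double measured_rate)
{
    if (measured_rate < c->lower || measured_rate > c->upper)
        c->bad_reports++;
    else
        c->bad_reports = 0;          /* excursion ended, reset the counter */

    return c->bad_reports >= c->tolerance;
}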

Grid Computing Core Library Courtesy Jack Dongarra

Grid Computing Core Library Courtesy Jack Dongarra

Grid Computing Library Use Courtesy Jack Dongarra

Grid Computing Library Use Courtesy Jack Dongarra

GrADS Library Sequence User Library Routine User makes a sequential call to a numerical library routine. The Library Routine has crafted code which invokes other components. Slide courtesy Jack Dongarra

GrADS Library Sequence User Library Routine Resource Selector Slide courtesy Jack Dongarra Library Routine calls a grid based routine to determine which resources are possible for use. The Resource Selector returns a bag of processors (coarse grid) that are available.

GrADS Library Sequence User Slide courtesy Jack Dongarra Library Routine Resource Selector The Library Routine calls the Performance Modeler to determine the best set of processors to use for the given problem. May be done by evaluating a formula or running a simulation. May assign a number of processes to a processor. At this point have a fine grid. Performance Model

GrADS Library Sequence User Library Routine Resource Selector The Library Routine calls the Contract Development routine to commit the fine grid for this call. A performance guarantee is generated. Contract Development Performance Model Slide courtesy Jack Dongarra

GrADS Library Sequence User Library Routine Resource Selector App Launcher Contract Development Performance Model mpirun -machinefile fine_grid grid_linear_solve Slide courtesy Jack Dongarra

Grids Library Evaluation Courtesy Jack Dongarra

Arrays of Values Generated by Resource Selector Clique based: 2 @ UT, UCSD, UIUC Full at the cluster level and the connections (clique leaders) Bandwidth and latency information is returned as a matrix over the machines (shown as a grid on the original slide); linear arrays for CPU and memory Courtesy Jack Dongarra

Grids Performance Models Courtesy Jack Dongarra

Grids Library Evaluation Courtesy Jack Dongarra

Grids Library Evaluation Courtesy Jack Dongarra

Grids Library Evaluation Courtesy Jack Dongarra

Sample Application - Cactus Courtesy Ed Seidel

Cactus on the Grid Courtesy Ed Seidel

Cactus on the Grid Courtesy Ed Seidel

Cactus Migration basis Courtesy Ed Seidel, Ian Foster

Cactus Job Migration Courtesy Ed Seidel, Ian Foster

Cactus Migration Architecture Courtesy Ed Seidel, Ian Foster

Courtesy Ed Seidel, Ian Foster Cactus - Migration example

Bioimaging

Imaging Instruments Molecules Synchrotrons Microscopes Magnetic Resonance Imagers Macromolecular Complexes, Organelles, Cells Organs, Organ Systems, Organisms

JEOL3000-FEG Liquid He stage NSF support No. of Particles Needed for 3-D Reconstruction: at 8.5 Å resolution, 6,000 particles (B = 100 Å²) or 3,000 (B = 50 Å²); at 4.5 Å resolution, 5,000,000 particles (B = 100 Å²) or 150,000 (B = 50 Å²) Structure of the HSV-1 Capsid (500 Å, 8.5 Å)

EM imaging EMAN Vitrification Robot Particle Selection Power Spectrum Analysis Initial 3D Model Micrographs: 4-64 Mpixels, 16-bit (8-128 MB), 100-200/day per lab, 10-1,000 particles per micrograph EMEN Database Archival Data Mining Management Classify Particles Reproject 3D Model Several TB/yr per project, 200-10,000+ micrographs, 10,000-10,000,000 particles Align Average Deconvolute Build New 3D Model 10k-1,000k pixels/particle Up to hundreds of PFlops

EMAN

GrADSoft

Grid Application Software Managing Data, Codes and Resources Securely Performance

SimDB: A Grid Based Problem Solving Environment

Molecular Dynamics Jim Briggs University of Houston

Molecular Dynamics Simulations

Trajectory Properties (Function: Result)
Transport coefficients: Scalar
Normal modes: Set of scalar values
Solvent residence times: Set of scalar values
Average structures: 3 x 1D coordinates
Animation: 3 x 1D coordinates x N frames
Radial distribution functions: 1D histogram
Energy profiles: 1D or 2D histogram
Solvent densities: 2D or 3D histogram
Angles/distances between solute groups: 1D time series
Atom-atom distances: 1D time series
Dihedral angles: 1D time series
Root mean square deviations: 1D time series
Simple thermodynamic properties: 1D time series
Time correlation functions: 1D time series

SimDB Workflow

SimDB Architecture

SimDB Data Access

SimDB Data Generation

SimDB Administration

PDC GrADSoft Architecture Program Preparation System Execution Environment Performance Feedback Software Components Performance Problem Real-time Performance Monitor Whole-Program Compiler Source Application Configurable Object Program Service Negotiator Scheduler Negotiation Grid Runtime System Libraries Dynamic Optimizer

Adaptive Software

Challenges Algorithmic Multiple data structures and their interaction Unfavorable data access pattern (big 2^n strides) High efficiency of the algorithm (low floating-point vs. load/store ratio) Additions/multiplications imbalance Version explosion Verification Maintenance

PDC Challenges Diversity of execution environments Growing complexity of modern microprocessors. Deep memory hierarchies Out-of-order execution Instruction level parallelism Growing diversity of platform characteristics SMPs Clusters (employing a range of interconnect technologies) Grids (heterogeneity, wide range of characteristics) Wide range of application needs Dimensionality and sizes Data structures and data types Languages and programming paradigms

Opportunities Multiple algorithms with comparable numerical properties for many functions Improved software techniques and hardware performance Integrated performance monitors, models and data bases Run-time code construction

Approach Automatic algorithm selection (polyalgorithmic functions): entirely different algorithms, exploiting decomposition of operations Code generation from high-level descriptions Extensive application-independent compile-time analysis Integrated performance modeling and analysis Run-time application- and execution-environment-dependent composition Automated installation process

Performance Tuning Methodology Installation: input parameters (system specifics, user options) drive the UHFFT code generator, which produces the library of FFT modules. Run-time: input parameters (size, dimension, ...) drive initialization, which selects the best plan using the performance database; execution then calculates one or more FFTs.

The UHFFT Program preparation at installation (platform dependent) Integrated performance models (in progress) and data bases Algorithm selection at run-time from set defined at installation Automatic multiple precision constant generation Program construction at run-time based on application and performance predictions

The Fast Fourier Transform Application of W_n requires O(n^2) operations. Fast algorithms: sparse factorizations of W_n, W_n = A_1 A_2 ... A_k, where the A_i are sparse (O(n) operations each) and k = log(n). W_n can then be applied in O(n log n) operations. The sparse factorization is not unique; there are many different ways to factor the same W_n.
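In standard notation (the sign convention for ω_n is an assumption; the slide does not fix it), these statements can be written as a short LaTeX sketch:

\[
  (W_n)_{jk} = \omega_n^{jk}, \qquad \omega_n = e^{-2\pi i/n}, \qquad 0 \le j,k < n,
\]
\[
  W_n = A_1 A_2 \cdots A_k, \qquad k = \log n,
\]
so applying $W_n$ directly costs $O(n^2)$ operations, while applying the $k$ sparse factors $A_i$ (each $O(n)$ operations) costs $O(n \log n)$.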

Sparse Factorization Algorithms Rader s Algorithm Prime Factor Algorithm Split-radix Mixed-radix (Cooley-Tukey)

Notation Simple way to describe highly structured matrices
Tensor (Kronecker) product: A ⊗ B is the block matrix whose (i,j) block is a_ij B (A supplies the blocking/scaling, B the repeated block):
A ⊗ B = [ a_00 B       a_01 B       ...  a_0,n-1 B
          a_10 B       a_11 B       ...  a_1,n-1 B
          ...
          a_n-1,0 B    a_n-1,1 B    ...  a_n-1,n-1 B ]
Direct sum notation:
A ⊕ B = [ A  0
          0  B ]
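A small worked example of this notation for the 2-point DFT matrix W_2 and the identity I_2, as they appear in the factorizations below (a LaTeX sketch, not from the original slide):

\[
W_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad
W_2 \otimes I_2 = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}, \quad
I_2 \otimes W_2 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{pmatrix}, \quad
W_2 \oplus I_2 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.
\]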

Prime Factor Algorithm When n = rq and gcd(r,q) = 1, it is possible to reduce the number of operations by using the PFA (no twiddle factor multiplication): W_n = Π_1 (W_r ⊗ I_q)(I_r ⊗ W_q) Π_2, where Π_1, Π_2 are permutation matrices.

Split-radix Algorithm When n = 2^k it is the most efficient algorithm. Two levels of radix-2 splitting: W_n = (W_2 ⊗ I_{n/2}) D_{2,n/2} (I_2 ⊗ W_{n/2}) Π_{n,2}. When n = 2q = 4p, the two levels are combined into a single split-radix butterfly B^{SR}_{n,q,2}, applied together with modified twiddle matrices Ω̃_{n,p} and W̃_2 and the permutation Π_{n,2}.

Mixed-radix Splitting When n = rq, W_n can be written as W_n = (W_r ⊗ I_q) D_{r,q} (I_r ⊗ W_q) Π_{n,r}, where D_{r,q} = I_q ⊕ Ω_{n,q} ⊕ Ω²_{n,q} ⊕ ... ⊕ Ω^{r-1}_{n,q} is a diagonal twiddle factor matrix, Ω_{n,q} = diag(1, ω_n, ω_n², ..., ω_n^{q-1}), and Π_{n,r} is a mod-r sort permutation matrix.
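A minimal sketch (not UHFFT code; the sign of the exponent and the diagonal layout are assumptions) of how the diagonal of D_{r,q} can be tabulated:

#include <complex.h>
#include <math.h>

/* Fill the diagonal of D_{r,q} = I_q (+) Omega_{n,q} (+) ... (+) Omega_{n,q}^{r-1}:
 * the entry at position j*q + k is w_n^(j*k), with w_n = exp(-2*pi*i/n)
 * (forward-transform sign assumed). */
void twiddle_diagonal(int r, int q, double complex *d /* length r*q */)
{
    int n = r * q;
    const double pi = acos(-1.0);
    for (int j = 0; j < r; j++)
        for (int k = 0; k < q; k++)
            d[j * q + k] = cexp(-2.0 * pi * I * (double)(j * k) / (double)n);
}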

Algorithm Selection Logic
if n <= 2
    DFT
else if n is prime
    use Rader's algorithm
else {
    choose factor r of n
    if r and n/r are coprime
        use PFA
    else if n is divisible by r^2 and n > r^3
        use Split-Radix algorithm
    else
        use Mixed-Radix algorithm
}
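A compilable sketch of the same selection rules in C (the names and the factor-choosing heuristic are illustrative, not UHFFT's actual code):

enum Algorithm { ALG_DFT, ALG_RADER, ALG_PFA, ALG_SPLIT_RADIX, ALG_MIXED_RADIX };

static int is_prime(int n) {
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

/* Pick a factor r of n; here simply the smallest nontrivial factor. */
static int choose_factor(int n) {
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return d;
    return n;
}

enum Algorithm select_algorithm(int n)
{
    if (n <= 2)      return ALG_DFT;
    if (is_prime(n)) return ALG_RADER;

    int r = choose_factor(n);
    if (gcd(r, n / r) == 1)                return ALG_PFA;
    if (n % (r * r) == 0 && n > r * r * r) return ALG_SPLIT_RADIX;
    return ALG_MIXED_RADIX;
}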

Example N = 6
FFTPrimeFactor n = 6, r = 3, dir = Forward, rot = 1
  FFTRader n = 3, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Inverse, rot = 1
  FFTRader n = 3, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Inverse, rot = 1
  DFT n = 2, r = 2, dir = Forward, rot = 1
  DFT n = 2, r = 2, dir = Forward, rot = 1
  DFT n = 2, r = 2, dir = Forward, rot = 1

UHFFT Architecture UHFFT Library Library of FFT Modules Initialization Routines Execution Routines Utilities FFT Code Generator Mixed-Radix (Cooley-Tukey) Prime Factor Algorithm Split-Radix Algorithm Rader's Algorithm Unparser Scheduler Key: Optimizer Initializer (Algorithm Abstraction) Fixed library code Generated code Code generator Funded in part by the Alliance (NSF) and LACSI (DoE)

Optimization Strategy High level (performed at application initialization) Selection of the optimal factorization using preoptimized codelets Generation of an execution plan Low level Selected set of codelets Selected codelets automatically generated and highly optimized by a special-purpose compiler

The UHFFT: Code Generation Structure Algorithm abstraction Optimization Generation of a DAG Scheduling of instructions Unparsing Implementation Code generator is written in C Speed, portability and installation tuning Highly optimized straight-line C code Generates FFT codelets of arbitrary size, direction, and rotation

The UHFFT: Code Generation (cont'd) Basic structure is an Expression Constant, variable, sum, product, sign change, Basic functions Expression sum, product, assign, sign change, Derived structures Expression vectors, matrices and lists Higher level functions Matrix vector operations FFT specific operations Algorithms currently supported Rader (two versions), PFA, Split-radix, Mixed-radix
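A minimal sketch of what such an expression node might look like (the UHFFT generator's actual types are not shown on the slide; this is only an illustration of the idea of a DAG of constants, variables, sums, products, and sign changes from which straight-line code is unparsed):

typedef enum { EXPR_CONST, EXPR_VAR, EXPR_SUM, EXPR_PROD, EXPR_NEG } ExprKind;

typedef struct Expr {
    ExprKind     kind;
    double       value;         /* EXPR_CONST                */
    const char  *name;          /* EXPR_VAR                  */
    struct Expr *left, *right;  /* EXPR_SUM / EXPR_PROD      */
    struct Expr *child;         /* EXPR_NEG                  */
} Expr;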

The UHFFT: Code Generation Mixed-Radix Algorithm Equation: W_n = (W_r ⊗ I_m) D_{r,m} (I_r ⊗ W_m) Π_{n,r} is implemented as:
/*
 * FFTMixedRadix() Mixed-radix splitting.
 * Input:
 *   r        radix,
 *   dir, rot direction and rotation of the transform,
 *   u        input expression vector.
 */
ExprVec *FFTMixedRadix(int r, int dir, int rot, ExprVec *u)
{
    int m, n = u->n, *p;

    m = n/r;
    p = ModRSortPermutation(n, r);
    u = FFTxI(r, m, dir, rot,
              TwiddleMult(r, m, dir, rot,
                          IxFFT(r, m, dir, rot,
                                PermuteExprVec(u, p))));
    free(p);
    return u;
}

The UHFFT: Performance Modeling Analytic models Cache influence on library codes Performance measuring tools (PCL, PAPI) Prediction of composed code performance Updated from execution experience Data base Library codes. Recorded at installation time Composed codes. Recorded and updated for each execution.

The UHFFT: Execution Plan Generation Optimal plan search options Exhaustive Recursive Empirical Algorithms used Rader (FFTW, UHFFT) PFA (UHFFT) Split-radix (UHFFT) Mixed-radix (FFTW, SPIRAL, UHFFT)
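A compilable sketch of an exhaustive/recursive search of this kind (the cost model, codelet_time(), and the cost combination rule are illustrative assumptions, not UHFFT's planner): the best plan for size n is either a single codelet or the cheapest mixed-radix split n = r*q, combining codelet times from a performance database.

#include <float.h>
#include <math.h>

#define MAX_N 4096
static double memo[MAX_N + 1];   /* best predicted time per size (0 = unknown) */

/* Toy stand-in for the performance database: assume straight-line codelets
 * exist up to size 16 and model their cost as roughly 5*r*(log2(r)+1). */
static double codelet_time(int r)
{
    if (r > 16) return DBL_MAX / 4.0;          /* no codelet of this size */
    return 5.0 * r * (log2((double)r) + 1.0);
}

/* Recursive search: run size n as one codelet, or split n = r*q
 * (q size-r codelets plus r size-q sub-plans) and recurse. */
double best_plan_time(int n)
{
    if (n < 2 || n > MAX_N) return DBL_MAX / 4.0;
    if (memo[n] > 0.0) return memo[n];

    double best = codelet_time(n);
    for (int r = 2; r * r <= n; r++) {
        if (n % r) continue;
        int q = n / r;
        double t = q * codelet_time(r) + r * best_plan_time(q);
        if (t < best) best = t;
    }
    return memo[n] = best;
}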

PDC Characteristics of Some Processors
Processor: Clock frequency / Peak performance / Cache structure
Intel Pentium IV: 1.8 GHz / 1.8 GFlops / L1: 8K+8K, L2: 256K
AMD Athlon: 1.4 GHz / 1.4 GFlops / L1: 64K+64K, L2: 256K
PowerPC G4: 867 MHz / 867 MFlops / L1: 32K+32K, L2: 256K, L3: 1-2M
Intel Itanium: 800 MHz / 3.2 GFlops / L1: 16K+16K, L2: 96K, L3: 2-4M
IBM Power3/4: 375 MHz / 1.5 GFlops / L1: 64K+32K, L2: 1-16M
HP PA 8x00: 750 MHz / 3 GFlops / L1: 1.5M + 0.75M
Alpha EV67/68: 833 MHz / 1.66 GFlops / L1: 64K+64K, L2: 4M
MIPS R1x000: 500 MHz / 1 GFlop / L1: 32K+32K, L2: 4M

Codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz

Plan Performance, 32- bit Architectures

Power3 plan performance: MFLOPS (0-350) for different factorization plans of n = 16 (16; 2x8; 4x4; 8x2; 2x2x4; 2x4x2; 4x2x2; 2x2x2x2) on a 222 MHz (888 MFlops peak) IBM Power3.

IBM Power3 222 MHz, Execution Plan Comparison, n = 2520 (PFA plan): MFLOPS (roughly 340-430) for the different orderings of the factors 5, 7, 8, 9.

HP zx1 Chipset 2-way block diagram Features: 2-way and 4-way Low latency connection to the DDR memory: directly (112 ns latency) or through (up to 12) scalable memory expanders (+25 ns latency) Up to 64 GB of DDR today (256 GB in the future) AGP 4x today (8x in future versions) 1-8 I/O adapters supporting PCI, PCI-X, AGP

PDC Itanium..
Processor: Clock frequency / Peak performance / Cache structure
Intel Itanium: 800 MHz / 3.2 GFlops / L1: 16K+16K (Data+Instruction), L2: 96K, L3: 2-4M (off-die)
Intel Itanium 2: 900 MHz / 3.6 GFlops / L1: 16K+16K (Data+Instruction), L2: 256K, L3: 1.5M (on-die)
Intel Itanium 2: 1000 MHz / 4 GFlops / L1: 16K+16K (Data+Instruction), L2: 256K, L3: 3M (on-die)
Sun UltraSparc-III: 750 MHz / 1.5 GFlops / L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write), L2: up to 8M (off-die)
Sun UltraSparc-III: 1050 MHz / 2.1 GFlops / L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write), L2: up to 8M (off-die)
Tested configuration

PDC Memory Hierarchy: Itanium-2 (McKinley) vs. Itanium
L1I and L1D
  Size: 16KB + 16KB vs. 16KB + 16KB
  Line size/associativity: 64B/4-way vs. 32B/4-way
  Latency: 1 cycle vs. 1 cycle
  Write policies: write through, no write allocate (both)
Unified L2
  Size: 256KB vs. 96KB
  Line size/associativity: 128B/8-way vs. 64B/6-way
  Integer latency: min 5 cycles vs. min 6 cycles
  FP latency: min 6 cycles vs. min 9 cycles
  Write policies: write back, write allocate (both)
Unified L3
  Size: 3MB or 1.5MB on chip vs. 4MB or 2MB off chip
  Line size/associativity: 128B/12-way vs. 64B/4-way
  Integer latency: min 12 cycles vs. min 21 cycles
  FP latency: min 13 cycles vs. min 24 cycles
  Bandwidth: 32B/cycle vs. 16B/cycle

UHFFT Codelet Performance

UHFFT Codelet Performance

UHFFT Codelet Performance

Codelet Performance Radix-2

Codelet Performance Radix-3

Codelet Performance Radix-4

Codelet Performance Radix-5

Codelet Performance Radix-6

Codelet Performance Radix-7

Codelet Performance Radix-8

Codelet Performance Radix-9

Codelet Performance Radix-16

Codelet Performance Radix-32

Codelet Performance Radix-45

Codelet Performance Radix-64

GrADS Testbed

GrADS Testbed

GrADS Testbed Web Interface

GrADS Testbed Web Interface

GrADS Testbed Services

GrADS team Ken Kennedy Ruth Aydt/Celso Mendes Francine Berman/Henri Casanova Andrew Chien Keith Cooper Jack Dongarra Ian Foster Dennis Gannon Lennart Johnsson Carl Kesselman Dan Reed Richard Wolski Linda Torczon NSF funded under NGS Many students