Research in High-Performance Grid Software
1 Research in High-Performance Grid Software Lennart Johnsson NADA, KTH and Department of Computer Science University of Houston
2 EU Project: Neurogenerator. Collection and processing of PET and fMRI data from neurological experiments. Partners: KI Neuroscience; KTH: PDC, TCS, CVAP; UU; Forewiss; Active Knowledge. Production starting February
3 Metadata Neurogenerator database schema Submission Interface Internet, DAT-tape etc. PDC/KI Submission Interface Manual inspection Population Raw Data Format conversion Segmentation Normalization Statistical Analysis Processing chains Workflow management User Interface Workflow User Interface Result databases Visualization
4 Swedish Space Corporation: ODIN research satellite Esrange PDC
5 Odin linus.esrange.ssc.se 300MB/day HSM toad.pdc.kth.se ftp://toad.pdc.kth.se goby.pdc.kth.se Level 0 Level 1a Level 1b Level 2 Mattias Claesson Odin data
6 Level 0 Odin Data Growth Level 1a Mattias Claesson
7 2003: PAMELA. Collaboration with institutes in Bari, Florence, Moscow, NASA, NMSU, Rome, Siegen and Trieste. Main purpose: antiproton and positron fluxes in space, 80 MeV to 190 GeV (anti-p) / 50 MeV to 270 GeV (e+); 2 x 10^4 anti-p and 2 x 10^5 e+ expected (2 years). Also: H to C energy spectrum and search for anti-helium. Polar orbit (70.4°) allows the study of low energy particles. NB: Will be launched before AMS (now expected 2005). KTH are responsible for anticoincidence shield. RE2B PAMELA: CERN recognised experiment. Propose data downlink in Sweden; funding decision pending. Acknowledgement: Mark Pearce, SCFAB
8 2003: Downlink in Sweden. Acknowledgement: Mark Pearce, SCFAB. Collaboration is concerned with the proposal for a single downlink station in Moscow. Only 1 Gbyte/day assured (8 minute pass). Does not maximise scientific return from PAMELA. For 5 Hz trigger rate, will record ~9 Gbyte per day. Bigger data set gives better understanding of systematic errors. Very important for antiproton studies. Likely delays between receipt of data and transmission to scientists in Europe. PAMELA mission is short (3 years). Need to be able to read data and fix problems efficiently. Kiruna / Sturup has excellent coverage for satellite's polar orbit. Competitive price quote from SSC. Data would be sent to Stockholm and then distributed to collaboration by internet: cheap and fast
9 Wavelength Disk Drives. CA*net 3/4 WDD nodes: Calgary, Regina, Winnipeg, St. John's, Vancouver, Montreal, Charlottetown, Fredericton, Halifax, Toronto, Ottawa. Computer data continuously circulates around the WDD
10
11 SimDB Architecture
12 Biological Imaging JEOL 3000-FEG Liquid He stage NSF support No. of Particles Needed for 3-D Reconstruction 500 Å 8.5 Å 4.5 Å Resolution B = 100 Å² 6,000 5,000,000 B = 50 Å² 3, , Å Structure of the HSV-1 Capsid
13 Vitrification Robot Particle Selection Power Spectrum Analysis EMAN Initial 3D Model EMEN Database Archival Data Mining Management Classify Particles Reproject 3D Model Align Average Deconvolute Build New 3D Model
14 Tele-Microscopy Osaka, Japan Mark Ellisman, UCSD
15 Computational Steering GEMSviz at iGrid 2000 INET NORDUnet Paralleldatorcentrum KTH Stockholm APAN STAR TAP University of Houston NORDUnet Sep 00 - #17
16 GrADS Grid Application Development Software
17 Grids Contract Development
18 Grids - Contract Development
19 Grids Contract Development
20 Grids Application Launch
21 Grids Library Evaluation
22 Grids Performance Models
23 Grids Library Evaluation
24 Grids Library Evaluation
25
26 Cactus Job Migration
27 Cactus Migration Architecture
28 Cactus Migration example
29 Adaptive Software
30 Challenges: Diversity of execution environments. Growing complexity of modern microprocessors: deep memory hierarchies, out-of-order execution, instruction level parallelism. Growing diversity of platform characteristics: SMPs, clusters (employing a range of interconnect technologies), Grids (heterogeneity, wide range of characteristics). Wide range of application needs: dimensionality and sizes, data structures and data types, languages and programming paradigms.
31 Challenges: Algorithmic. High arithmetic efficiency: low floating-point vs. load/store ratio. Unfavorable data access patterns (big 2^n strides). Application owns the data structures/layout. Additions/multiplications unbalanced. Version explosion: verification, maintenance.
32 Opportunities Multiple algorithms with comparable numerical properties for many functions Improved software techniques and hardware performance Integrated performance monitors, models and data bases Run-time code construction
33 Approach Automatic algorithm selection polyalgorithmic functions (CMSSL, FFTW, ATLAS, SPIRAL,..) Exploit multiple precision options Code generation from high-level descriptions (WASSEM, CMSSL, CM-Convolution-Compiler, FFTW, UHFFT, SPIRAL,..) Integrated performance monitoring, modeling and analysis Judicious choice between compile-time and run-time analysis and code construction Automated installation process
34 The UHFFT Program preparation at installation (platform dependent) Integrated performance models (in progress) and data bases Algorithm selection at run-time from set defined at installation Automatic multiple precision constant generation Program construction at run-time based on application and performance predictions
35 Performance Tuning Methodology. Installation: input parameters (system specifics, user options); UHFFT code generator; library of FFT modules; performance database. Run-time: input parameters (size, dim., ...); initialization: select best plan (factorization); execution: calculate one or more FFTs; performance monitoring; database update.
36 The UHFFT Software Architecture UHFFT Library Library of FFT Modules Initialization Routines Execution Routines Utilities FFT Code Generator Mixed-Radix (Cooley-Tukey) Prime Factor Algorithm Split-Radix Algorithm Rader's Algorithm Unparser Scheduler Key: Optimizer Initializer (Algorithm Abstraction) Fixed library code Generated code Code generator
37 The UHFFT: Code Generation Structure Algorithm abstraction Optimization Generation of a DAG Scheduling of instructions Unparsing Implementation Code generator is written in C Speed, portability and installation tuning Highly optimized straight line C code Generates FFT codelets of arbitrary size, direction, and rotation
38 The UHFFT: Code Generation (cont'd). Basic structure is an Expression: constant, variable, sum, product, sign change, ... Basic functions: expression sum, product, assign, sign change, ... Derived structures: expression vectors, matrices and lists. Higher level functions: matrix vector operations, FFT specific operations. Algorithms currently supported: Rader (two versions), PFA, Split-radix, Mixed-radix
39 The UHFFT: Factorization Logic
if n <= 2
    use DFT
else if n is prime
    use Rader's algorithm
else {
    choose factor r of n
    if r and n/r are coprime
        use PFA
    else if n is divisible by r^2 and n > r^3
        use Split-Radix algorithm
    else
        use Mixed-radix algorithm
}
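The decision chain on this slide can be sketched in C as follows. This is only an illustration, not UHFFT source: the enum, the function name `choose_algorithm`, and the choice to pass the factor `r` in as a parameter are assumptions, since the slide does not say how `r` is chosen.

```c
#include <stdbool.h>

/* Hypothetical labels for the algorithms named on the slide. */
typedef enum {
    ALG_INVALID = -1,  /* r is not a proper factor of a composite n */
    ALG_DFT,           /* n <= 2: direct DFT */
    ALG_RADER,         /* n prime: Rader's algorithm */
    ALG_PFA,           /* r and n/r coprime: prime factor algorithm */
    ALG_SPLIT_RADIX,   /* r^2 divides n and n > r^3 */
    ALG_MIXED_RADIX    /* every other composite case */
} fft_algorithm;

/* Trial-division primality test; adequate for FFT-sized n. */
static bool is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

/* Euclid's algorithm. */
static long gcd(long a, long b)
{
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

/* Follows the slide's branches in order for size n and candidate factor r. */
fft_algorithm choose_algorithm(long n, long r)
{
    if (n < 1) return ALG_INVALID;
    if (n <= 2) return ALG_DFT;
    if (is_prime(n)) return ALG_RADER;
    if (r <= 1 || r >= n || n % r != 0) return ALG_INVALID;
    if (gcd(r, n / r) == 1) return ALG_PFA;
    if (n % (r * r) == 0 && n > r * r * r) return ALG_SPLIT_RADIX;
    return ALG_MIXED_RADIX;
}
```

For example, `choose_algorithm(6, 3)` yields `ALG_PFA`, which agrees with the FFTPrimeFactor root shown for n = 6, r = 3 on the next slide.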
40 The UHFFT: Representation of Factorization
FFTPrimeFactor n = 6, r = 3, dir = Forward, rot = 1
    FFTRader n = 3, r = 2, dir = Forward, rot = 1
        DFT n = 2, r = 2, dir = Forward, rot = 1
        DFT n = 2, r = 2, dir = Inverse, rot = 1
    FFTRader n = 3, r = 2, dir = Forward, rot = 1
        DFT n = 2, r = 2, dir = Forward, rot = 1
        DFT n = 2, r = 2, dir = Inverse, rot = 1
    DFT n = 2, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Forward, rot = 1
    DFT n = 2, r = 2, dir = Forward, rot = 1
41 The UHFFT: Code Generation, Mixed-Radix Algorithm
Equation: $W_n = (W_r \otimes I_m)\, D_{r,m}\, (I_r \otimes W_m)\, \Pi_{n,r}$
Is implemented as:
/*
 * FFTMixedRadix() Mixed-radix splitting.
 * Input:
 *   r         radix,
 *   dir, rot  direction and rotation of the transform,
 *   u         input expression vector.
 */
ExprVec *FFTMixedRadix(int r, int dir, int rot, ExprVec *u)
{
    int m, n = u->n, *p;

    m = n/r;
    p = ModRSortPermutation(n, r);
    /* Factors are applied right to left: permutation, I_r (x) W_m,
       twiddle multiplication D_{r,m}, then W_r (x) I_m. */
    u = FFTxI(r, m, dir, rot,
              TwiddleMult(r, m, dir, rot,
                          IxFFT(r, m, dir, rot,
                                PermuteExprVec(u, p))));
    free(p);
    return u;
}
March 11, 2003
42 The UHFFT: Performance Modeling Analytic models Cache influence on library codes Performance measuring tools (PCL, PAPI) Prediction of composed code performance Updated from execution experience Data base Library codes. Recorded at installation time Composed codes. Recorded and updated for each execution.
43 The UHFFT: Execution Plan Generation Optimal plan search options Exhaustive Recursive Empirical Algorithms used Rader (FFTW, UHFFT) PFA (UHFFT) Split-radix (UHFFT) Mixed-radix (FFTW, SPIRAL, UHFFT)
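The slide names the search options without defining them. As one possible reading of the "recursive" option, the C sketch below picks the cheapest first radix for size n by recursing on n/r. Everything in it is assumed for illustration: the codelet table, its cost values, the per-point twiddle cost, and the cost formula are placeholders, not measured UHFFT data or the library's actual model.

```c
/* Hypothetical codelet library: size and cost of one call (placeholder values). */
typedef struct { long size; double cost; } codelet;

/* Result of the search: total estimated cost and the first radix to apply. */
typedef struct { double cost; long first_radix; } plan;

static const codelet CODELETS[] = { {2, 4.0}, {3, 16.0}, {4, 16.0} };
static const int NUM_CODELETS = sizeof CODELETS / sizeof CODELETS[0];

/* Assumed cost of one twiddle multiplication per data point per split. */
#define TWIDDLE_COST_PER_POINT 6.0

/*
 * Recursive plan search under the assumed model
 *   cost(n) = (n/r) * codelet_cost(r) + r * cost(n/r) + twiddle_cost * n,
 * minimized over all library sizes r that divide n. A size that is itself
 * in the library can also be done with a single codelet call.
 * Returns cost < 0 when n cannot be factored with the available codelets.
 */
plan best_plan(long n)
{
    plan best = { -1.0, 0 };
    for (int i = 0; i < NUM_CODELETS; i++) {
        long r = CODELETS[i].size;
        double cost;
        if (r == n) {
            cost = CODELETS[i].cost;
        } else if (r < n && n % r == 0) {
            plan sub = best_plan(n / r);
            if (sub.cost < 0.0) continue;   /* n/r has no plan with this library */
            cost = (double)(n / r) * CODELETS[i].cost
                 + (double)r * sub.cost
                 + TWIDDLE_COST_PER_POINT * (double)n;
        } else {
            continue;
        }
        if (best.cost < 0.0 || cost < best.cost) {
            best.cost = cost;
            best.first_radix = r;
        }
    }
    return best;
}
```

With these placeholder costs, `best_plan(16)` prefers radix 4 over radix 2. An empirical search would replace the cost formula with timings of the candidate plans.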
44 Characteristics of Some Processors
Processor | Clock frequency | Peak Performance | Cache structure
Intel Pentium IV | 1.8 GHz | 1.8 GFlops | L1: 8K+8K, L2: 256K
AMD Athlon | 1.4 GHz | 1.4 GFlops | L1: 64K+64K, L2: 256K
PowerPC G4 | 867 MHz | 867 MFlops | L1: 32K+32K, L2: 256K, L3: 1-2M
Intel Itanium | 800 MHz | 3.2 GFlops | L1: 16K+16K, L2: 96K, L3: 2-4M
IBM Power3/4 | 375 MHz | 1.5 GFlops | L1: 64K+32K, L2: 1-16M
HP PA 8x | MHz | 3 GFlops | L1: 1.5M M
Alpha EV67/ | MHz | 1.66 GFlops | L1: 64K+64K, L2: 4M
MIPS R1x | MHz | 1 GFlop | L1: 32K+32K, L2: 4M
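The peak figures above are consistent with clock frequency multiplied by a fixed number of floating-point operations per cycle (for example 800 MHz x 4 = 3.2 GFlops for the Itanium, 375 MHz x 4 = 1.5 GFlops for the Power3). The flops-per-cycle factors are inferred from those ratios, not stated on the slide. A minimal C sketch of that arithmetic, with hypothetical function names, also gives the fraction-of-peak measure that efficiency plots typically use:

```c
/* Peak rate in MFlops: clock in MHz times floating-point operations per cycle. */
double peak_mflops(double clock_mhz, double flops_per_cycle)
{
    return clock_mhz * flops_per_cycle;
}

/* Fraction of peak reached by a measured rate; 0 when peak is not positive. */
double fraction_of_peak(double measured_mflops, double peak)
{
    return peak > 0.0 ? measured_mflops / peak : 0.0;
}
```

For instance, `peak_mflops(800.0, 4.0)` gives 3200 MFlops, and a codelet measured at 1600 MFlops on that processor would run at half of peak.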
45 Codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz
46 Radix-4 codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz
47 Radix-8 codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz
48 Plan Performance, 32-bit Architectures
49 Power3 plan performance MFLOPS Plan 222 MHz 888 Mflops
50 Power3 plan performance PFA sizes 800 Mflops peak
51 Itanium..
Processor | Clock frequency | Peak Performance | Cache structure
Intel Itanium | 800 MHz | 3.2 GFlops | L1: 16K+16K (Data+Instruction), L2: 96K, L3: 2-4M (off-die)
Intel Itanium | MHz | 3.6 GFlops | L1: 16K+16K (Data+Instruction), L2: 256K, L3: 1.5M (on-die)
Intel Itanium | MHz | 4 GFlops | L1: 16K+16K (Data+Instruction), L2: 256K, L3: 3M (on-die)
Sun UltraSparc-III | 750 MHz | 1.5 GFlops | L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write), L2: up to 8M (off-die)
Sun UltraSparc-III | 1050 MHz | 2.1 GFlops | L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write), L2: up to 8M (off-die)
Tested configuration
VIGrid Kick-off, Lennart Johnsson
52 Memory Hierarchy
Parameter | Itanium-2 (McKinley) | Itanium
L1I and L1D
Size | 16KB + 16KB | 16KB + 16KB
Line size/associativity | 64B/4-way | 32B/4-way
Latency | 1 cycle | 1 cycle
Write policies | Write through, no write allocate | Write through, no write allocate
Unified L2
Size | 256KB | 96KB
Line size/associativity | 128B/8-way | 64B/6-way
Integer latency | Min 5 cycles | Min 6 cycles
FP latency | Min 6 cycles | Min 9 cycles
Write policies | Write back, write allocate | Write back, write allocate
Unified L3
Size | 3MB or 1.5MB on chip | 4MB or 2MB off chip
Line size/associativity | 128B/12-way | 64B/4-way
Integer latency | Min 12 cycles | Min 21 cycles
FP latency | Min 13 cycles | Min 24 cycles
Bandwidth | 32B/cycle | 16B/cycle
53 Itanium Comparison
Workstation | HP i2000 | HP zx2000
Processor | 800 MHz Intel Itanium | 900 MHz Intel Itanium 2 (McKinley)
Bus Speed | 133 MHz | 400 MHz
Bus Width | 64 bit | 128 bit
Chipset | Intel 82460GX | HP zx1
Memory | 2 GB SDRAM (133 MHz) | 2 GB DDR SDRAM (266 MHz)
OS | 64-bit Red Hat Linux 7.1 | HP version of the 64-bit RH Linux 7.2
Compiler | Intel 6.0 | Intel 6.0
54 HP zx1 Chipset, 2-way block diagram. Features: 2-way and 4-way. Low latency connection to the DDR memory (112 ns): directly (112 ns latency) or through (up to 12) scalable memory expanders (+25 ns latency). Up to 64 GB of DDR today (256 in the future). AGP 4x today (8x in the future versions). 1-8 I/O adapters supporting PCI, PCI-X, AGP
55 UHFFT Codelet Performance
56 UHFFT Codelet Performance
57 UHFFT Codelet Performance
58 Codelet Performance Radix-2
59 Codelet Performance Radix-3
60 Codelet Performance Radix-4
61 Codelet Performance Radix-5
62 Codelet Performance Radix-6
63 Codelet Performance Radix-7
64 Codelet Performance Radix-8
65 Codelet Performance Radix-9
66 Codelet Performance Radix-10
67 Codelet Performance Radix-11
68 Codelet Performance Radix-12
69 Codelet Performance Radix-13
70 Codelet Performance Radix-14
71 Codelet Performance Radix-15
72 Codelet Performance Radix-16
73 Codelet Performance Radix-24
74 Codelet Performance Radix-32
75 Codelet Performance Radix-64
76 The UHFFT: Summary. Code generator written in C. Code is generated at installation. Codelet library is tuned to the underlying architecture. The whole library can be easily customized through parameter specification: no need for laborious manual changes in the source. Existing code generation infrastructure allows easy library extensions. Future: Inclusion of vector/streaming instruction set extension for various architectures. Implementation of new scheduling/optimization algorithms. New codelet types and better execution routines. Unified algorithm specification on all levels.
77 New Tools for Library Code Development Generalization of the tools developed for the UHFFT library CODELAB: A Developers' Tool for Efficient Code Generation and Optimization Combination of High-level scripting language Code generator Performance measurement tools Visualization Under development Several test examples show very promising results
78 CODELAB IDE STRUCTURE CODELAB IDE Script Interpreter Code Generator Visualization Performance measurement User Input Library code Support Code Compiler Execution Operating System Application
79 CODELAB Structure Application consists of Library code Support Code Application Library code: Automatically generated and optimized collection of subroutines Supporting code: Code that binds the library routines together It could be hand-written or automatically generated Application is instrumented for performance measurements automatically
80 CODELAB Structure: Script Interpreter, User Input, Supporting Code, Application. User writes: a simple script that produces the code generator or supporting code; supporting code for the application. The code generator should be able to produce a large variety of code depending on a few input parameters (otherwise it is simpler to write the code by hand). Example: A single code generator for FFT codelets of different size, type, direction, rotation, ... Script Interpreter: Simplifies construction of the code generator. Very restricted set of commands at the moment.
81 CODELAB Structure: Code Generator, Library code, Support Code, Application, Visualization. Code generator: C program that generates the application and supporting code. Uses abstract expression algebra for code generation. Several layers of software: basic expressions, complex expressions algebra, vector and matrix algebra, polynomial algebra. The generated code can be instrumented for performance measurements. The initial expression list is transformed into DAG and optimized: simplification of expressions, folding of constants. User can get a variety of information about the generated code: number of arithmetic ops, DAG graph, etc.
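The simplification and constant-folding step can be illustrated with a small expression tree in C. This is a sketch under stated assumptions, not CODELAB code: the node layout and all function names are hypothetical, it works on a tree rather than a shared DAG, and the rule x*0 = 0 is applied symbolically, ignoring IEEE special values.

```c
#include <stdlib.h>

typedef enum { EXPR_CONST, EXPR_VAR, EXPR_SUM, EXPR_PRODUCT, EXPR_NEG } expr_kind;

typedef struct expr {
    expr_kind kind;
    double value;            /* used by EXPR_CONST */
    int var;                 /* used by EXPR_VAR: variable index */
    struct expr *lhs, *rhs;  /* operands; rhs is NULL for EXPR_NEG */
} expr;

static expr *expr_new(expr_kind kind, double value, int var, expr *lhs, expr *rhs)
{
    expr *e = malloc(sizeof *e);
    if (e == NULL) abort();
    e->kind = kind; e->value = value; e->var = var; e->lhs = lhs; e->rhs = rhs;
    return e;
}

expr *expr_const(double v)            { return expr_new(EXPR_CONST, v, 0, NULL, NULL); }
expr *expr_var(int i)                 { return expr_new(EXPR_VAR, 0.0, i, NULL, NULL); }
expr *expr_sum(expr *a, expr *b)      { return expr_new(EXPR_SUM, 0.0, 0, a, b); }
expr *expr_product(expr *a, expr *b)  { return expr_new(EXPR_PRODUCT, 0.0, 0, a, b); }
expr *expr_neg(expr *a)               { return expr_new(EXPR_NEG, 0.0, 0, a, NULL); }

void expr_free(expr *e)
{
    if (e == NULL) return;
    expr_free(e->lhs);
    expr_free(e->rhs);
    free(e);
}

/* Folds constants and removes x+0, x*1, x*0. Takes ownership of e. */
expr *expr_simplify(expr *e)
{
    if (e->kind == EXPR_CONST || e->kind == EXPR_VAR) return e;
    e->lhs = expr_simplify(e->lhs);
    if (e->rhs != NULL) e->rhs = expr_simplify(e->rhs);
    expr *a = e->lhs, *b = e->rhs;

    if (e->kind == EXPR_NEG) {
        if (a->kind != EXPR_CONST) return e;
        double v = -a->value;
        expr_free(e);
        return expr_const(v);
    }
    int is_sum = (e->kind == EXPR_SUM);
    if (a->kind == EXPR_CONST && b->kind == EXPR_CONST) {
        double v = is_sum ? a->value + b->value : a->value * b->value;
        expr_free(e);
        return expr_const(v);
    }
    double identity = is_sum ? 0.0 : 1.0;   /* 0 for sums, 1 for products */
    if (a->kind == EXPR_CONST && a->value == identity) {
        e->rhs = NULL; expr_free(e); return b;
    }
    if (b->kind == EXPR_CONST && b->value == identity) {
        e->lhs = NULL; expr_free(e); return a;
    }
    if (!is_sum && ((a->kind == EXPR_CONST && a->value == 0.0) ||
                    (b->kind == EXPR_CONST && b->value == 0.0))) {
        expr_free(e);
        return expr_const(0.0);
    }
    return e;
}

/* Number of additions and multiplications in the tree (sign changes not counted). */
int expr_op_count(const expr *e)
{
    if (e == NULL || e->kind == EXPR_CONST || e->kind == EXPR_VAR) return 0;
    return (e->kind == EXPR_NEG ? 0 : 1) + expr_op_count(e->lhs) + expr_op_count(e->rhs);
}
```

As a usage example, the expression (x0 * 1) + (2 * 3) has three arithmetic operations before simplification and one afterwards, since it reduces to x0 + 6.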
82 CODELAB Structure: Visualization, Compiler, Performance measurements, Execution, Operating System. Performance measurements: The application can be compiled and executed from within the IDE. The performance data are collected and visualized. User can modify the code and repeat the process until a satisfactory performance is obtained. Detailed performance information by using PAPI library interface.
83 CODELAB Applications: FFT and DSP libraries. Efficient multiple precision arithmetic. Finite Element Methods. Linear Algebra. Other well structured applications that allow for simple parameterization: few parameters define a large variety of code.
84 Overview UHFFT Performance on some new architectures Intel Itanium 800 MHz, Intel Itanium 2 (McKinley) 900 MHz Sun UltraSparc-III 750 MHz New Tools for Library Code Development CODELAB Integrated Development Environment (IDE) Introduction Structure of the CODELAB IDE Applications
85 The UHFFT: An Adaptive FFT Library. UHFFT employs more ways of combining codelets for execution than any other library: better coverage of the space of possible algorithms. The PFA algorithm yields good performance where the Mixed-Radix algorithm (MR) performs poorly. PFA algorithm requires fewer FP operations than MR. Data access pattern in PFA is more complex than in MR, but large 2^n strides can be avoided. Example: IBM Power3. Good: 128-way set associative L1 data and instruction caches. Bad: Direct mapped L2 cache, very vulnerable to cache thrashing despite the large cache size. PFA execution model works better for large FFT sizes.
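Why large power-of-two strides hurt a direct-mapped cache can be shown with a few lines of C that count how many distinct cache sets a strided access sequence touches. This is a simplified model with a hypothetical function name: it looks only at the set index of each address, and the cache parameters in the example are assumptions, since the slide does not give the Power3 L2 line size or capacity.

```c
#include <stdlib.h>

/*
 * Counts the distinct cache sets touched by `accesses` references spaced
 * `stride_bytes` apart, for a cache with `num_sets` sets of `line_bytes`
 * each. In a direct-mapped cache every set holds one line, so references
 * that share a set evict one another. Returns -1 on invalid arguments.
 */
long distinct_cache_sets(long stride_bytes, long accesses,
                         long line_bytes, long num_sets)
{
    if (stride_bytes <= 0 || accesses <= 0 || line_bytes <= 0 || num_sets <= 0)
        return -1;
    unsigned char *touched = calloc((size_t)num_sets, 1);
    if (touched == NULL) return -1;

    long distinct = 0;
    for (long i = 0; i < accesses; i++) {
        long set = (i * stride_bytes / line_bytes) % num_sets;
        if (!touched[set]) {
            touched[set] = 1;
            distinct++;
        }
    }
    free(touched);
    return distinct;
}
```

With an assumed 128-byte line and 1024 sets, 64 accesses at a 4096-byte stride land in only 32 sets, so pairs of lines keep evicting each other, while a stride of 4224 bytes (one line of padding) spreads the same 64 accesses over 64 sets. Index mappings that avoid power-of-two strides, as the slide notes for PFA, sidestep this pattern.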
86 Acknowledgements. GrADS contributors: Dave Angulo, Ruth Aydt, Fran Berman, Andrew Chien, Keith Cooper, Holly Dail, Jack Dongarra, Ian Foster, Sridhar Gullapalli, Lennart Johnsson, Ken Kennedy, Carl Kesselman, Chuck Koelbel, Bo Liu, Chuang Liu, Xin Liu, Anirban Mandal, Mark Mazina, John Mellor-Crummey, Celso Mendes, Graziano Obertelli, Alex Olugbile, Mitul Patel, Dan Reed, Martin Swany, Linda Torczon, Sathish Vadhiyar, Shannon Whitmore, Rich Wolski, Huaxia Xia, Lingyun Yang, Asim YarKhan. Funding: NSF Next Generation Software initiative, Los Alamos Computer Science Institute
Adaptive Scientific Software Libraries
Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing
More informationAdvanced Computing Research Laboratory. Adaptive Scientific Software Libraries
Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity
More informationGrid Computing: Application Development
Grid Computing: Application Development Lennart Johnsson Department of Computer Science and the Texas Learning and Computation Center University of Houston Houston, TX Department of Numerical Analysis
More informationGrid Application Development Software
Grid Application Development Software Department of Computer Science University of Houston, Houston, Texas GrADS Vision Goals Approach Status http://www.hipersoft.cs.rice.edu/grads GrADS Team (PIs) Ken
More informationAn Adaptive Framework for Scientific Software Libraries. Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston
An Adaptive Framework for Scientific Software Libraries Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston Diversity of execution environments Growing complexity of modern microprocessors.
More informationIntroduction to HPC. Lecture 21
443 Introduction to HPC Lecture Dept of Computer Science 443 Fast Fourier Transform 443 FFT followed by Inverse FFT DIF DIT Use inverse twiddles for the inverse FFT No bitreversal necessary! 443 FFT followed
More informationInput parameters System specifics, user options. Input parameters size, dim,... FFT Code Generator. Initialization Select fastest execution plan
Automatic Performance Tuning in the UHFFT Library Dragan Mirković 1 and S. Lennart Johnsson 1 Department of Computer Science University of Houston Houston, TX 7724 mirkovic@cs.uh.edu, johnsson@cs.uh.edu
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More informationHigh Performance Computing. Without a Degree in Computer Science
High Performance Computing Without a Degree in Computer Science Smalley s Top Ten 1. energy 2. water 3. food 4. environment 5. poverty 6. terrorism and war 7. disease 8. education 9. democracy 10. population
More informationComponent Architectures
Component Architectures Rapid Prototyping in a Networked Environment Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/lacsicomponentssv01.pdf Participants Ruth Aydt Bradley Broom Zoran
More informationEmpirical Auto-tuning Code Generator for FFT and Trigonometric Transforms
Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms Ayaz Ali and Lennart Johnsson Texas Learning and Computation Center University of Houston, Texas {ayaz,johnsson}@cs.uh.edu Dragan
More informationBiological Sequence Alignment On The Computational Grid Using The Grads Framework
Biological Sequence Alignment On The Computational Grid Using The Grads Framework Asim YarKhan (yarkhan@cs.utk.edu) Computer Science Department, University of Tennessee Jack J. Dongarra (dongarra@cs.utk.edu)
More informationCompilers for High Performance Computer Systems: Do They Have a Future? Ken Kennedy Rice University
Compilers for High Performance Computer Systems: Do They Have a Future? Ken Kennedy Rice University Collaborators Raj Bandypadhyay Zoran Budimlic Arun Chauhan Daniel Chavarria-Miranda Keith Cooper Jack
More informationCompilers and Run-Time Systems for High-Performance Computing
Compilers and Run-Time Systems for High-Performance Computing Blurring the Distinction between Compile-Time and Run-Time Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/compilerruntime.pdf
More informationThe Mont-Blanc approach towards Exascale
http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are
More informationResearch Related Activities
Research Related Activities Lennart Johnsson Research Infrastructure Research Science and Engineering Research Infrastructure Observations Collaborators are increasingly chosen regardless of location Instruments
More informationIntel Enterprise Processors Technology
Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology
More informationGrADSoft and its Application Manager: An Execution Mechanism for Grid Applications
GrADSoft and its Application Manager: An Execution Mechanism for Grid Applications Authors Ken Kennedy, Mark Mazina, John Mellor-Crummey, Rice University Ruth Aydt, Celso Mendes, UIUC Holly Dail, Otto
More informationAlgorithms and Computation in Signal Processing
Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 14 th Lecture Feb. 24, 2005 Instructor: Markus Pueschel TA: Srinivas Chellappa Course Evaluation Email sent out
More informationPerformance Analysis of KDD Applications using Hardware Event Counters. CAP Theme 2.
Performance Analysis of KDD Applications using Hardware Event Counters CAP Theme 2 http://cap.anu.edu.au/cap/projects/kddmemperf/ Peter Christen and Adam Czezowski Peter.Christen@anu.edu.au Adam.Czezowski@anu.edu.au
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationVector IRAM: A Microprocessor Architecture for Media Processing
IRAM: A Microprocessor Architecture for Media Processing Christoforos E. Kozyrakis kozyraki@cs.berkeley.edu CS252 Graduate Computer Architecture February 10, 2000 Outline Motivation for IRAM technology
More informationAffordable and power efficient computing for high energy physics: CPU and FFT benchmarks of ARM processors
Affordable and power efficient computing for high energy physics: CPU and FFT benchmarks of ARM processors Mitchell A Cox, Robert Reed and Bruce Mellado School of Physics, University of the Witwatersrand.
More informationAutomatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University
Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning
More informationMy 2 hours today: 1. Efficient arithmetic in finite fields minute break 3. Elliptic curves. My 2 hours tomorrow:
My 2 hours today: 1. Efficient arithmetic in finite fields 2. 10-minute break 3. Elliptic curves My 2 hours tomorrow: 4. Efficient arithmetic on elliptic curves 5. 10-minute break 6. Choosing curves Efficient
More informationVirtual Grids. Today s Readings
Virtual Grids Last Time» Adaptation by Applications» What do you need to know? To do it well?» Grid Application Development Software (GrADS) Today» Virtual Grids» Virtual Grid Application Development Software
More informationLecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform
More informationCompiler Technology for Problem Solving on Computational Grids
Compiler Technology for Problem Solving on Computational Grids An Overview of Programming Support Research in the GrADS Project Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/gridcompilers.pdf
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationNode Hardware. Performance Convergence
Node Hardware Improved microprocessor performance means availability of desktop PCs with performance of workstations (and of supercomputers of 10 years ago) at significanty lower cost Parallel supercomputers
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationBindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core
Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationOutline. How Fast is -fast? Performance Analysis of KKD Applications using Hardware Performance Counters on UltraSPARC-III
Outline How Fast is -fast? Performance Analysis of KKD Applications using Hardware Performance Counters on UltraSPARC-III Peter Christen and Adam Czezowski CAP Research Group Department of Computer Science,
More informationThe Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)
The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache
More informationStatistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform
Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.
More informationItanium 2 Impact Software / Systems MSC.Software. Jay Clark Director, Business Development High Performance Computing
Itanium 2 Impact Software / Systems MSC.Software Jay Clark Director, Business Development High Performance Computing jay.clark@mscsoftware.com Agenda What MSC.Software does Software vendor point of view
More informationHigh-Performance Linear Algebra Processor using FPGA
High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible
More informationToday. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture
More informationFormal Loop Merging for Signal Transforms
Formal Loop Merging for Signal Transforms Franz Franchetti Yevgen S. Voronenko Markus Püschel Department of Electrical & Computer Engineering Carnegie Mellon University This work was supported by NSF through
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationToward a Framework for Preparing and Executing Adaptive Grid Programs
Toward a Framework for Preparing and Executing Adaptive Grid Programs Ken Kennedy α, Mark Mazina, John Mellor-Crummey, Keith Cooper, Linda Torczon Rice University Fran Berman, Andrew Chien, Holly Dail,
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationGeneration of High Performance Domain- Specific Languages from Component Libraries. Ken Kennedy Rice University
Generation of High Performance Domain- Specific Languages from Component Libraries Ken Kennedy Rice University Collaborators Raj Bandypadhyay Zoran Budimlic Arun Chauhan Daniel Chavarria-Miranda Keith