Performance Analysis and Modeling of the SciDAC MILC Code on Four Large-scale Clusters

Xingfu Wu and Valerie Taylor
Department of Computer Science, Texas A&M University
{wuxf, taylor}@cs.tamu.edu

Abstract

The MIMD Lattice Computation (MILC) code (version 7.4.0) is a set of codes developed by the MIMD Lattice Computation collaboration for doing simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. In this paper, we analyze the performance of the MILC code on four large-scale cluster systems: UNC RENCI IBM BlueGene/L, SDSC DataStar, NERSC Seaborg, and NERSC Jacquard. We discuss the scalability of the MILC code under processor and problem scaling, use processor partitioning to investigate the performance impact of the application components, and identify possible performance optimizations for further work. We then use Prophesy to generate performance models for MILC online on each platform, so that these models can be used to predict performance on larger numbers of processors.

1. MILC code

The MIMD Lattice Computation (MILC) code (version 7.4.0) [1] is a set of codes written in C, developed by the MIMD Lattice Computation collaboration for doing simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. The MILC code is one of the application packages that use the SciDAC modules [2] shown in Figure 1. It uses five of the SciDAC software modules: QMP, QLA, QIO, QDP/C, and QOPQDP. QMP (QCD Message Passing) provides a standard communications layer for lattice QCD based on MPI. QLA (QCD Linear Algebra) provides a standard interface for linear algebra routines that can act on a site or an array of sites with several indexing options. QIO (QCD Input/Output) provides a suite of input/output routines for lattice data. QDP/C is the C implementation of the QDP (QCD Data Parallel) interface. QOPQDP is an implementation of the QOP (QCD Operations) level-three interface using QDP.

Figure 1. The SciDAC layers and the software module architecture [2]

In this paper, we focus on simulations with dynamical Kogut-Susskind fermions with improved actions. The program su3_rmd uses the refreshed molecular dynamics algorithm with the Symanzik 1-loop gauge action and the Asqtad quark action.

2. Experiment Platforms

In this section, we describe the specifications of four large-scale cluster systems: the Linux cluster Jacquard [3] at NERSC (National Energy Research Scientific Computing Center), the IBM BlueGene/L system [4] at RENCI at the University of North Carolina, the IBM POWER3 SMP cluster Seaborg [5] at NERSC, and the IBM POWER4 SMP cluster DataStar [6] at SDSC (San Diego Supercomputer Center).

NERSC Jacquard has 356 dual-processor nodes available for scientific calculations; an additional 8 spare nodes are placed in service if available. Four nodes are dedicated as login nodes, and there are additional I/O and service nodes. Each processor runs at a clock speed of 2.2GHz and has a theoretical peak performance of 4.4 GFlop/s. Processors on each node share 6GB of memory. The nodes are interconnected with a high-speed InfiniBand network, and shared file storage is provided by a GPFS file system.

UNC RENCI BlueGene/L is designed to achieve high performance at low cost and with low power consumption. The BG/L has the following configuration:
Compute - dual 700 MHz PowerPC 440 nodes with 1GB of memory per node
Storage - 11 TB of cluster-wide disk storage
Network - IBM BlueGene Torus, Global Tree network (2.1 GB/s), and Global Interrupt networks

Table 1. Specifications of the four large-scale clusters

System Name     | Jacquard           | RENCI BlueGene/L       | DataStar          | Seaborg
Number of Nodes | 356                |                        | 272               | 380
CPUs per Node   | 2                  | 2                      | 8                 | 16
CPU type        | 2.2GHz AMD Opteron | 700MHz IBM PowerPC 440 | 1.5/1.7GHz POWER4 | 375MHz POWER3
Memory per Node | 6GB                | 1GB                    | 16-32GB           | 16-64GB
Network         | InfiniBand         | Torus                  | Federation        | Colony
OS              | Linux              | Linux                  | AIX               | AIX

The NERSC Seaborg is a distributed-memory computer with 6,080 processors available to run scientific computing applications. The processors are distributed among 380 compute nodes with 16 processors per SMP node. Processors on each node share a memory pool of between 16 and 64 GB (312 nodes with 16GB, 64 nodes with 32GB, and 4 nodes with 64GB). The compute nodes are connected by the IBM Colony switch. The use of the 16-way nodes is exclusive: only one user is allowed to use a node at any given time, regardless of the number of CPUs needed on that node.

DataStar is SDSC's largest IBM terascale machine. It has 176 8-way P655 compute nodes with 1.5GHz POWER4 processors and 16 GB of memory per node, and 96 8-way compute nodes with 1.7GHz POWER4 processors and 32 GB of memory per node. The nodes are connected by the IBM Federation switch. The use of the 8-way nodes is exclusive: only one user is allowed to use a node at any given time, regardless of the number of CPUs needed on that node.

3. Performance Analysis

In this section, we use only su3_rmd to analyze the performance of the MILC code on the four large-scale cluster systems; in particular, we discuss the scalability of the MILC code on up to 2,048 processors. We use Prophesy [7, 8] to instrument the code and collect performance data for our analysis. We ported this code and its libraries to SDSC DataStar and NERSC Seaborg. UNC RENCI BlueGene/L has 2 processors per node, NERSC Jacquard has 2 processors per node, SDSC DataStar has 8 processors per node, and NERSC Seaborg has 16 processors per node. The performance results in this section were obtained using all processors on each node. Note that the total execution time of the application is measured from the beginning of the program to the end and therefore includes I/O time.

3.1 Datasets for Problem and Processor Scaling

The problem sizes for the MILC code are listed in Table 2 (Input-8) and Table 3 (Input-10), where nx, ny, nz, and nt are the sizes of the X, Y, Z, and T dimensions of the grid.

Table 2. Dataset Input-8 with scaling the number of processors (columns: #Procs, nx, ny, nz, nt)

Table 3. Dataset Input-10 with scaling the number of processors (columns: #Procs, nx, ny, nz, nt)

For the problem dataset Input-8, the workload per processor is 8x8x8x8. For the problem dataset Input-10, the workload per processor is 10x10x10x10.

3.2 Weak Scaling

In this section, we use the execution time on 8 processors as a baseline and define the relative slowdown as the execution time on p processors divided by the execution time on 8 processors. Given a fixed workload per processor, the relative slowdown quantifies how much the application performance degrades as the number of processors increases.

1) Problem size (Input-8): 8x8x8x8 per processor

Figure 2. Performance comparisons for the MILC code with Input-8 on the four clusters (runtime in seconds versus number of processors for BlueGene/L, Jacquard, DataStar, and Seaborg)

Figure 3. Relative slowdowns for the MILC code with Input-8 on the four clusters (relative slowdown versus number of processors)

2) Problem size (Input-10): 10x10x10x10 per processor

Figure 4. Performance comparisons for the MILC code with Input-10 on the four clusters (runtime in seconds versus number of processors)

Figure 5. Relative slowdowns for the MILC code with Input-10 on the four clusters (relative slowdown versus number of processors)

Overall, the MILC code has the best weak-scaling performance on SDSC DataStar. When the workload per processor is increased from 8x8x8x8 to 10x10x10x10, the relative slowdown on BlueGene/L becomes the largest beyond 64 processors because of its small memory per node (1GB). The execution time of MILC on Jacquard increases significantly at 128 processors and beyond.

3.3 Strong Scaling

Using a fixed problem size of 32x32x32x32, we discuss the scalability of the MILC code on up to 2,048 processors.
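The scalability metrics behind Figures 2 through 6 are simple ratios of measured runtimes. The short C sketch below (not part of the MILC distribution) shows how they are computed; the runtimes are hypothetical placeholders rather than the measured values, and the relative-speedup definition used for Section 3.3 is our assumption, since only the relative slowdown is defined explicitly above.

/* Minimal sketch of the scalability metrics used in Sections 3.2 and 3.3.
 * Relative slowdown = T(p)/T(8) for a fixed workload per processor;
 * relative speedup is assumed here to be T(8)/T(p) for the fixed problem.
 * All runtimes below are hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    const int    procs[]       = { 8, 64, 256, 2048 };
    const double weak_time[]   = { 240.0, 255.0, 275.0, 310.0 };  /* weak scaling, hypothetical   */
    const double strong_time[] = { 900.0, 130.0,  40.0,   9.0 };  /* strong scaling, hypothetical */
    const int    n             = sizeof(procs) / sizeof(procs[0]);

    for (int i = 0; i < n; i++) {
        double slowdown = weak_time[i]   / weak_time[0];    /* Section 3.2 metric            */
        double speedup  = strong_time[0] / strong_time[i];  /* Section 3.3 metric (assumed)  */
        printf("p=%5d  relative slowdown=%.2f  relative speedup=%.1f (ideal %.1f)\n",
               procs[i], slowdown, speedup, procs[i] / 8.0);
    }
    return 0;
}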

Figure 6. Relative speedups for the MILC code with the fixed problem size of 32x32x32x32 on the four clusters (speedup on a log scale versus number of processors, with the ideal curve)

Overall, the MILC code has the best strong-scaling performance on RENCI BlueGene/L. Although BlueGene/L has the smallest memory per node (i.e., 1GB), for the fixed problem size of 32x32x32x32 the workload per processor decreases significantly as the total number of processors increases, and this benefits BlueGene/L.

3.4 Application Sensitivity Analysis Using Processor Partitioning

We define processor partitioning as the number of processors per node used for an application execution. A processor partitioning scheme NxM means N nodes with M processors per node. We can use processor partitioning to quantify the time differences among different processor partitioning schemes (PPS) and to investigate how sensitive the application and its components are to the different communication and memory access patterns of different PPS. For example, given the PPS NxM, when M=1 the communication pattern is internode only and a single processor on each node uses the whole memory; when M > 1 there are both internode and intranode communications, and the M processors on each node compete for the node's memory.

In this section, we investigate how processor partitioning impacts the performance of the application and its components. As an example, we run MILC 7.4 with Input-10 (the large problem size of 40x40x40x40) on 256 processors on the four large-scale clusters. Runtime means the total application execution time, which includes I/O time; update() is one function of the MILC code; and Others refers to all components of the MILC code other than update().
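To make the PPS concrete, the stand-alone MPI sketch below (not part of MILC; the hostname comparison and buffer length are our assumptions) reports how many ranks share a node, which is a quick way to verify the NxM scheme actually granted by the batch system before a run.

/* Report how many MPI ranks share this node, i.e. the M in the PPS NxM. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define HOST_LEN 64

int main(int argc, char **argv)
{
    int rank, size, i, local = 0;
    char myhost[HOST_LEN];
    char *allhosts;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    gethostname(myhost, HOST_LEN);
    myhost[HOST_LEN - 1] = '\0';

    /* gather every rank's hostname and count the ranks that share ours */
    allhosts = (char *) malloc((size_t) size * HOST_LEN);
    MPI_Allgather(myhost, HOST_LEN, MPI_CHAR, allhosts, HOST_LEN, MPI_CHAR,
                  MPI_COMM_WORLD);
    for (i = 0; i < size; i++)
        if (strncmp(myhost, allhosts + (size_t) i * HOST_LEN, HOST_LEN) == 0)
            local++;

    if (rank == 0)
        printf("PPS check: %d total MPI ranks, M = %d ranks on node %s\n",
               size, local, myhost);

    free(allhosts);
    MPI_Finalize();
    return 0;
}

Under a 256x1 scheme the reported M is 1 and all communication is internode; under 128x2, 32x8, or 16x16 the reported M matches the second factor of the scheme.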

Figure 7. Performance comparison for different PPS on 256 processors on UNC RENCI BlueGene/L (runtime in seconds for Runtime, update(), and Others under the 256x1 and 128x2 schemes)

Figure 7 shows the performance comparison for different PPS on 256 processors on UNC RENCI BlueGene/L. The time difference between the schemes 256x1 and 128x2 is 24.61%.

Figure 8. Performance comparison for different PPS on 256 processors on NERSC Jacquard (runtime in seconds for Runtime, update(), and Others under the 256x1 and 128x2 schemes)

Figure 8 shows the performance comparison for different PPS on 256 processors on NERSC Jacquard. The time difference between the schemes 256x1 and 128x2 is 19.92%.

Figure 9. Performance comparison for different PPS on 256 processors on NERSC Seaborg (runtime in seconds for Runtime, update(), and Others under the 256x1, 128x2, 64x4, 32x8, and 16x16 schemes)

Figure 9 shows the performance comparison for different PPS on 256 processors on NERSC Seaborg. The time difference between the schemes 256x1 and 16x16 is 15.25%.

Figure 10. Performance comparison for different PPS on 256 processors on SDSC DataStar (runtime in seconds for Runtime, update(), and Others under the 256x1, 128x2, 64x4, and 32x8 schemes)

Figure 10 shows the performance comparison for different PPS on 256 processors on SDSC DataStar. The time difference between the schemes 256x1 and 32x8 is 16.35%.

The function update() accounts for the time differences among the different processor partitioning schemes. It is dominated by the functions update_h(), ks_congrad_two_src(), and update_u(), and there is a large doubly nested loop in update_u(); for example, the number of iterations is 16 for Input-10, with a different count for Input-8.

4. Possible Performance Optimization

The dominant functions within update() are update_h(), ks_congrad_two_src(), and update_u().
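The optimization proposed in the remainder of this section is loop (cache) blocking. As a generic, hedged illustration of that transformation only, and not the actual update_u() kernel, array shapes, or block size used in MILC, consider the following sketch.

/* Generic illustration of cache blocking (loop tiling).  N, BS, and the
 * transpose-accumulate body are assumptions chosen only to show the
 * transformation; they are not taken from the MILC source. */
#include <stddef.h>

#define N  1024
#define BS   64   /* block size, to be tuned to the cache hierarchy */

/* Original form: the column-wise access to b[][] streams through memory
 * and reuses little of what is already in cache. */
void kernel_unblocked(double a[N][N], const double b[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] += 2.0 * b[j][i];
}

/* Blocked form: the same iteration space is traversed in BSxBS tiles,
 * so each tile of b[][] stays resident in cache while it is reused.
 * (N is assumed divisible by BS; callers would allocate a and b on the heap.) */
void kernel_blocked(double a[N][N], const double b[N][N])
{
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t jj = 0; jj < N; jj += BS)
            for (size_t i = ii; i < ii + BS; i++)
                for (size_t j = jj; j < jj + BS; j++)
                    a[i][j] += 2.0 * b[j][i];
}

The best block size differs across the four platforms because of their different cache sizes, which is why the blocking would have to be retuned per system.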

Use the loop blocking technique to optimize them: first try to optimize update_u(), and then consider how to optimize update_h() and ks_congrad_two_src(), because the major function calls in those two functions come from the QOPQDP library.

Two optimization strategies for further work:
Cache blocking at the source code level
Prefetching at compile time (suggested by Carleton DeTar)

5. Performance Modeling Using the Prophesy System

The Prophesy system [8] provides PAIDE, an automatic performance instrumentation and data entry tool [7]. PAIDE provides automatic instrumentation of codes at the level of basic blocks, procedures, or functions. The default mode instruments the entire code at the level of basic loops and procedures. A user can specify that the code be instrumented at a finer granularity than loops, or identify the particular events to be instrumented. The resulting performance data is automatically uploaded into the Prophesy database by PAIDE at the end of the application execution, and is used to produce performance models and predict application performance via the Prophesy web-based interfaces [9].

For the MILC code with Input-10, we use Prophesy to illustrate the modeling process as follows. Figures 11 through 13 show the performance models and the performance predicted on 4,096 processors based on the performance data from 8 to 2,048 processors.

Figure 11. Performance model for MILC with Input-10 on RENCI BlueGene/L

Figure 12. Performance model for MILC with Input-10 on SDSC DataStar

Figure 13. Performance model for MILC with Input-10 on NERSC Seaborg

Figures 11 and 12 show more accurate performance models on BlueGene/L and DataStar, where the norm of residuals is less than 2 (about 2% of the total execution times). The performance model in Figure 13, however, is not accurate: its norm of residuals amounts to 8.45% of the total execution time on 8 processors.
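To make the modeling step concrete, the sketch below fits a least-squares curve to runtime-versus-processor-count data and extrapolates it to 4,096 processors, in the spirit of Figures 11 through 13. The model form (a quadratic in log2(p)) and the data points are assumptions for illustration only; Prophesy selects the actual model terms and coefficients from the measured data in its database.

/* Hedged sketch of a least-squares performance model: fit
 * T(p) = c0 + c1*x + c2*x^2 with x = log2(p) to hypothetical runtimes on
 * 8..2048 processors, then predict the runtime on 4096 processors. */
#include <math.h>
#include <stdio.h>

#define NPTS 9

static double det3(const double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

int main(void)
{
    /* hypothetical (processors, runtime in seconds) pairs */
    const double p[NPTS] = { 8, 16, 32, 64, 128, 256, 512, 1024, 2048 };
    const double t[NPTS] = { 420, 428, 437, 450, 462, 478, 493, 512, 530 };

    double A[3][3] = { { 0 } }, b[3] = { 0 }, c[3];
    int i, r, s;

    /* build the normal equations for the basis {1, x, x^2}, x = log2(p) */
    for (i = 0; i < NPTS; i++) {
        double x = log2(p[i]);
        double phi[3] = { 1.0, x, x * x };
        for (r = 0; r < 3; r++) {
            for (s = 0; s < 3; s++)
                A[r][s] += phi[r] * phi[s];
            b[r] += phi[r] * t[i];
        }
    }

    /* solve the 3x3 system with Cramer's rule */
    double d = det3(A);
    for (r = 0; r < 3; r++) {
        double M[3][3];
        for (i = 0; i < 3; i++)
            for (s = 0; s < 3; s++)
                M[i][s] = (s == r) ? b[i] : A[i][s];
        c[r] = det3(M) / d;
    }

    double x = log2(4096.0);
    printf("fit: T(p) = %.2f + %.2f*log2(p) + %.3f*log2(p)^2\n", c[0], c[1], c[2]);
    printf("predicted runtime on 4096 processors: %.1f seconds\n",
           c[0] + c[1] * x + c[2] * x * x);
    return 0;
}

The norm of the residuals of such a fit is the accuracy measure quoted above for Figures 11 through 13.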

Figure 14. Application-level performance data stored in the Prophesy database

Figure 15. Function-level performance data stored in the Prophesy database

Figure 14 shows a summary of the application-level performance data stored in the Prophesy database, and Figure 15 shows a summary of the function-level performance data stored in the Prophesy database.

Acknowledgements

We would like to thank Ying Zhang from UNC RENCI for providing the initial MILC code and datasets, and Carleton DeTar from the University of Utah and Steven Gottlieb from Indiana University for their help in understanding the code and the results.

Summary

We have carried out the following tasks:
Compared the performance of the MILC code on four different clusters
Discussed which platform performs best for the MILC code, in other words, which platform is best suited to the SciDAC MILC application
Investigated how processor partitioning impacts the performance of MILC in order to identify possible performance bottlenecks for further work
Modeled the performance of MILC and predicted the performance on a larger number of processors

References

[1] The MIMD Lattice Computation (MILC) Collaboration code.
[2] US Lattice Quantum Chromodynamics.
[3] NERSC Jacquard.
[4] UNC RENCI BlueGene/L.
[5] NERSC Seaborg, resources/sp/.
[6] SDSC DataStar.
[7] Xingfu Wu, Valerie Taylor, and Rick Stevens, "Design and Implementation of Prophesy Automatic Instrumentation and Data Entry System," Proc. of the 13th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2001), Anaheim, CA, August 2001.
[8] Valerie Taylor, Xingfu Wu, and Rick Stevens, "Prophesy: An Infrastructure for Performance Analysis and Modeling of Parallel and Grid Applications," ACM SIGMETRICS Performance Evaluation Review, Volume 30, Issue 4, March 2003.
[9] Xingfu Wu, Valerie Taylor, and Joseph Paris, "A Web-based Prophesy Automated Performance Modeling System," IASTED International Conference on Web Technologies, Applications and Services (WTAS 2006), July 17-19, 2006, Calgary, Canada.
