1 5th Annual Workshop on Charm++ and Applications: Welcome and Introduction. State of Charm++. Laxmikant Kale, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign. 4/23/2007 CharmWorkshop2007
2 A Glance at History. 1987: Chare Kernel arose from parallel Prolog work; dynamic load balancing for state-space search, Prolog, ... Charm: position paper: application-oriented yet CS-centered research. NAMD: 1994. 1996: Charm++ in almost its current form: chare arrays, measurement-based dynamic load balancing. 1997: Rocket Center: a trigger for AMPI. 2001: era of ITRs: quantum chemistry collaboration; computational astronomy collaboration: ChaNGa.
3 Outline. What is Charm++ and why is it good. Overview of recent results. Language work: raising the level of abstraction. Domain-specific frameworks: ParFUM; Geubelle: crack propagation; Haber: space-time meshing. Applications: NAMD (picked by NSF, new scaling results to 32k procs.), ChaNGa (released; gravity performance), LeanCP (use at national centers). BigSim. Scalable performance tools. Scalable load balancers. Fault tolerance. Cell, GPGPUs, ... Upcoming challenges and opportunities: multicore. Funding.
4 PPL Mission and Approach. To enhance performance and productivity in programming complex parallel applications. Performance: scalable to thousands of processors. Productivity: of human programmers. Complex: irregular structure, dynamic variations. Approach: application-oriented yet CS-centered research: develop enabling technology for a wide collection of apps; develop, use, and test it in the context of real applications. How? Develop novel parallel programming techniques and embody them into easy-to-use abstractions, so application scientists can use advanced techniques with ease. Enabling technology: reused across many apps.
5 Migratable Objects (aka Processor Virtualization). Programmer: [over]decomposition into virtual processors. Runtime: assigns VPs to processors; enables adaptive runtime strategies. Implementations: Charm++, AMPI. User view vs. system implementation. Benefits: Software engineering: number of virtual processors can be independently controlled; separate VPs for different modules. Message-driven execution: adaptive overlap of communication; predictability: automatic out-of-core; asynchronous reductions. Dynamic mapping: heterogeneous clusters: vacate, adjust to speed, share; automatic checkpointing; change set of processors used; automatic dynamic load balancing; communication optimization.
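The over-decomposition idea above can be sketched in a few lines. This is a conceptual Python sketch, not the Charm++ API: the program creates many more virtual processors than there are physical processors, and a runtime owns the VP-to-processor mapping and may change it at any time, while user code only ever names VPs.

```python
# Conceptual sketch (not Charm++): over-decomposition into virtual
# processors (VPs) that a runtime maps, and may remap, onto real processors.

class VirtualProcessor:
    def __init__(self, vp_id, load):
        self.vp_id = vp_id
        self.load = load          # estimated work carried by this VP

class Runtime:
    def __init__(self, num_real_procs):
        self.num_real_procs = num_real_procs
        self.mapping = {}         # vp_id -> real processor

    def assign(self, vps):
        # Initial placement: simple round-robin of VPs over processors.
        for i, vp in enumerate(vps):
            self.mapping[vp.vp_id] = i % self.num_real_procs

    def migrate(self, vp_id, new_proc):
        # Adaptive strategy: the runtime may move a VP at any time;
        # user code never sees physical processor numbers.
        self.mapping[vp_id] = new_proc

# 16 VPs over-decomposed onto 4 real processors.
vps = [VirtualProcessor(i, load=1.0) for i in range(16)]
rt = Runtime(num_real_procs=4)
rt.assign(vps)
assert rt.mapping[5] == 1      # round-robin: VP 5 -> processor 1
rt.migrate(5, 3)               # e.g. a load balancer's decision
assert rt.mapping[5] == 3
```

Because the mapping is hidden behind the runtime, the same mechanism supports the benefits listed on the slide: checkpointing, shrinking or expanding the processor set, and dynamic load balancing are all just changes to `mapping`.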
6 Adaptive Overlap and Modules: SPMD and Message-Driven Modules (from A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994). Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco.
7 Realization: Charm++'s Object Arrays. A collection of data-driven objects with a single global name for the collection; each member addressed by an index: [sparse] 1D, 2D, 3D, tree, string, ... Mapping of element objects to procs handled by the system. A[0] A[1] A[2] A[3] A[..] User's view.
8 Realization: Charm++'s Object Arrays. A collection of data-driven objects with a single global name for the collection; each member addressed by an index: [sparse] 1D, 2D, 3D, tree, string, ... Mapping of element objects to procs handled by the system. A[0] A[1] A[2] A[3] A[..] User's view. System view: A[0] A[3].
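The object-array idea above can be illustrated with a small sketch. This is hypothetical Python, not the Charm++ chare-array API: a collection has one global name, elements are addressed by (possibly sparse) index, and the element-to-processor mapping is owned by the system, so senders address A[i] and never a processor.

```python
# Conceptual sketch (not the Charm++ API): an object array "A" with one
# global name; elements addressed by index; the element-to-processor
# mapping is system-owned and invisible to user code.

class Element:
    def __init__(self, index):
        self.index = index
        self.inbox = []

    def receive(self, msg):
        self.inbox.append(msg)

class ObjectArray:
    def __init__(self, name, indices, num_procs):
        self.name = name
        self.elements = {i: Element(i) for i in indices}  # any (sparse) index set
        # System-owned placement; user code never consults this:
        self._home = {i: hash((name, i)) % num_procs for i in indices}

    def send(self, index, msg):
        # User view: deliver to A[index]. Internally the system looks up
        # (and is free to change) the element's current processor.
        self.elements[index].receive(msg)

A = ObjectArray("A", indices=range(4), num_procs=2)
A.send(2, "hello")
assert A.elements[2].inbox == ["hello"]
```

The key design point from the slides: because only the system knows `_home`, elements can migrate without any change to the code that sends to them.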
10 AMPI: Adaptive MPI. 7 MPI processes.
11 AMPI: Adaptive MPI. 7 MPI processes, implemented as virtual processors (user-level migratable threads), mapped onto real processors.
12 Load Balancing: Refinement Load Balancing vs. Aggressive Load Balancing. Processor utilization against time on 128 and 1024 processors. On 128 processors a single load balancing step suffices, but on 1024 processors we need a refinement step.
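The refinement step mentioned above can be sketched as follows. This is a hypothetical illustration of the idea, not PPL's implementation: instead of remapping everything, move a few objects off the most loaded processors until no processor exceeds (1 + tol) times the average load.

```python
# Sketch of a refinement load balancer: repeatedly move an object from the
# most loaded processor to the least loaded one, but only if the receiver
# stays under the threshold. Hypothetical code, for illustration only.

def refine(proc_objs, tol=0.1):
    # proc_objs: one list of object loads per processor (mutated in place).
    loads = [sum(objs) for objs in proc_objs]
    avg = sum(loads) / len(loads)
    threshold = (1 + tol) * avg
    moves = []
    while max(loads) > threshold:
        p = max(range(len(loads)), key=loads.__getitem__)   # most loaded
        q = min(range(len(loads)), key=loads.__getitem__)   # least loaded
        # Largest object that still fits under the receiver's threshold:
        candidates = [o for o in proc_objs[p] if loads[q] + o <= threshold]
        if not candidates:
            break                      # refinement alone can't fix it
        obj = max(candidates)
        proc_objs[p].remove(obj)
        proc_objs[q].append(obj)
        loads[p] -= obj
        loads[q] += obj
        moves.append((obj, p, q))
    return moves

procs = [[4.0, 3.0, 2.0], [1.0], [1.0]]          # heavily imbalanced
refine(procs)
assert max(sum(objs) for objs in procs) <= 1.1 * (11.0 / 3)
```

Each move strictly lowers the heaviest processor's load and never pushes a receiver over the threshold, which is why only a handful of migrations are needed; an "aggressive" strategy would instead discard the current mapping and rebuild it from scratch.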
13 Shrink/Expand. Problem: availability of the computing platform may change. Fit applications on the platform by object migration. Time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.
14 So, What's New?
15 New Higher-Level Abstractions. Previously: Multiphase Shared Arrays: provides a disciplined use of the global address space; each array can be accessed only in one of the following modes: read-only, write-by-one-thread, accumulate-only; the access mode can change from phase to phase; phases delineated by a per-array sync. Charisma++: global view of control: allows expressing global control flow in a Charm program; separate expression of parallel and sequential code; functional implementation (Chao Huang, PhD thesis); LCR 04, HPDC 07.
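The Multiphase Shared Array discipline described above can be sketched as a small class. This is hypothetical Python for illustration, not the MSA API: every phase has exactly one access mode, the mode changes only at a sync point, and write-by-one-thread is enforced per element.

```python
# Sketch of the MSA discipline: one access mode per phase (read-only /
# write-by-one / accumulate-only); mode changes happen only at sync().
# Hypothetical code, not the real MSA interface.

class MSA:
    MODES = ("read", "write_once", "accumulate")

    def __init__(self, size):
        self.data = [0] * size
        self.mode = "read"
        self.written = set()        # enforce write-by-one per element

    def sync(self, new_mode):
        assert new_mode in self.MODES
        self.mode = new_mode        # phase boundary: everyone agrees on the mode
        self.written.clear()

    def read(self, i):
        assert self.mode == "read", "reads allowed only in the read phase"
        return self.data[i]

    def write(self, i, value):
        assert self.mode == "write_once" and i not in self.written
        self.data[i] = value
        self.written.add(i)

    def accumulate(self, i, delta):
        assert self.mode == "accumulate"
        self.data[i] += delta       # commutative updates may come from anyone

a = MSA(4)
a.sync("write_once"); a.write(0, 10)
a.sync("accumulate"); a.accumulate(0, 5); a.accumulate(0, 5)
a.sync("read")
assert a.read(0) == 20
```

Restricting each phase to one mode is what makes the global address space "disciplined": races are impossible by construction, since conflicting access kinds can never occur in the same phase.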
16 Multiparadigm Interoperability. Charm++ supports concurrent composition: allows multiple modules written in multiple paradigms to cooperate in a single application. Some recent paradigms implemented: ARMCI (for Global Arrays). Use of multiparadigm programming: you heard yesterday how ParFUM made use of multiple paradigms effectively.
17 Blue Gene Provided a Showcase. Cooperation with the Blue Gene team; Sameer Kumar joins the Blue Gene team. BGW days competition: 2006: Computer Science day; 2007: computational cosmology: ChaNGa. LeanCP collaboration with Glenn Martyna, IBM.
18 Cray and PSC Warm Up. 4000 fast processors at PSC; 12,500 processors at ORNL; Cray support via a gift grant.
19 IBM Power7 Team. Collaboration begun with the NSF Track 1 proposal.
20 Our Applications Achieved Unprecedented Speedups.
21 Applications and Charm++. Application issues, techniques & libraries, Charm++, other applications. Synergy between computer science research and biophysics has been beneficial to both.
22 Charm++ and Applications. Synergy between computer science research and biophysics has been beneficial to both. LeanCP, space-time meshing, NAMD, ChaNGa, rocket simulation; issues, techniques & libraries, Charm++, other applications.
23 Develop abstractions in the context of full-scale applications: quantum chemistry (LeanCP), NAMD (molecular dynamics, protein folding), computational cosmology, STM virus simulation, crack propagation, rocket simulation, dendritic growth, space-time meshes; all built on parallel objects, an adaptive runtime system, and libraries and tools. The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE.
24 Molecular Dynamics in NAMD. Collection of [charged] atoms, with bonds; Newtonian mechanics; thousands to millions of atoms; 1 femtosecond time-step, millions of steps needed! At each time-step: calculate forces on each atom: bonded, and non-bonded (electrostatic and van der Waals): short-distance every timestep, long-distance every 4 timesteps using PME (3D FFT), i.e., multiple time stepping; then calculate velocities and advance positions. Collaboration with K. Schulten, R. Skeel, and coworkers.
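The multiple-time-stepping loop described above can be sketched directly. The force routines here are stand-ins, not NAMD's kernels: short-range forces are evaluated every step, while the expensive long-range (PME-style) force is recomputed only every fourth step and reused in between.

```python
# Sketch of a multiple-time-stepping MD loop: cheap short-range forces
# every step; expensive long-range (PME-style) forces every 4th step.
# Placeholder force functions, for illustration only.

def short_range_force(step):
    return 1.0                       # placeholder: bonded + short non-bonded

def long_range_force(step):
    return 0.25                      # placeholder: PME long-range part (3D FFT)

def run(num_steps, pme_every=4):
    long_range_evals = 0
    cached_long = 0.0
    for step in range(num_steps):
        f = short_range_force(step)
        if step % pme_every == 0:    # recompute expensive force, then reuse it
            cached_long = long_range_force(step)
            long_range_evals += 1
        f += cached_long
        # ... integrate velocities and advance positions with total force f ...
    return long_range_evals

assert run(16) == 4                  # 16 steps -> PME at steps 0, 4, 8, 12
```

This is why the slide stresses that a femtosecond timestep implies millions of steps: amortizing the 3D FFT over four steps is one of the main levers for keeping each step short.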
25 NAMD: A Production MD Program. Fully featured program; NIH-funded development; distributed free of charge (~20,000 registered users); binaries and source code; installed at NSF centers; user training and support; large published simulations.
27 NAMD Design. Designed from the beginning as a parallel program. Uses the Charm++ idea: decompose the computation into a large number of objects, and have the intelligent runtime system (of Charm++) assign objects to processors for dynamic load balancing. Hybrid of spatial and force decomposition: spatial decomposition of atoms into cubes (called patches); for every pair of interacting patches, create one object for calculating electrostatic interactions. Recent: Blue Matter, Desmond, etc. use this idea in some form.
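The hybrid decomposition above can be sketched with a toy patch grid. This is illustrative Python, not NAMD's data structures: atoms are binned spatially into cubic patches, and one compute object is created per pair of neighboring patches (plus one per patch for its self interactions).

```python
# Sketch of hybrid spatial/force decomposition: spatial patches, plus one
# compute object per pair of neighboring patches. Illustrative only.

from itertools import product

def make_patches(grid=(2, 2, 2)):
    # One patch per cell of a small 3D grid of cubes.
    return list(product(*(range(n) for n in grid)))

def neighboring(p, q):
    # Patches interact if they differ by at most one cell in every dimension.
    return all(abs(a - b) <= 1 for a, b in zip(p, q))

def make_computes(patches):
    computes = []
    for i, p in enumerate(patches):
        for q in patches[i:]:              # includes (p, p): self-compute
            if neighboring(p, q):
                computes.append((p, q))
    return computes

patches = make_patches((2, 2, 2))
computes = make_computes(patches)
assert len(patches) == 8
# In a 2x2x2 grid every patch neighbors every other: 8 self pairs + C(8,2) = 36
assert len(computes) == 36
```

The point of the design is that the compute objects vastly outnumber the patches, giving the runtime many small, migratable units of work to balance, which is exactly the object population described on the next slide.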
28 NAMD Parallelization using Charm++. Example configuration: 108 VPs, 847 VPs, 100,000 VPs; these 100,000 objects (virtual processors, or VPs) are assigned to real processors by the Charm++ runtime system.
29 Performance on BlueGene/L. Simulation rate in nanoseconds per day vs. processors, for IAPP (5.5K atoms), LYSOZYME (40K atoms), APOA1 (92K atoms), ATPase (327K atoms), STMV (1M atoms), and BAR d. (1.3M atoms). IAPP simulation (Rivera, Straub, BU) at 20 ns per day on 256 processors: 1 us in 50 days. STMV simulation at 6.65 ns per day on 20,000 processors.
30 Comparison with Blue Matter. ApoLipoprotein-A1 (92K atoms): ms/step across node counts for Blue Matter (SC 06), NAMD, and NAMD in virtual-node mode. NAMD is about 1.8 times faster than Blue Matter on 1024 nodes (and 2.4 times faster in VN mode, where NAMD can use both processors on a node effectively). However, note that NAMD does PME every 4 steps.
31 Performance on Cray XT3. Simulation rate in nanoseconds per day vs. processors, for IAPP (5.5K atoms), LYSOZYME (40K atoms), APOA1 (92K atoms), ATPase (327K atoms), STMV (1M atoms), BAR d. (1.3M atoms), and RIBOSOME (2.8M atoms).
32 Computational Cosmology. N-body simulation (NSF): N particles (1 million to 1 billion) in a periodic box, moving under gravitation, organized in a tree (oct, binary (k-d), ...). Output data analysis: in parallel (NASA): particles are read in parallel; interactive analysis. Issues: load balancing, fine-grained communication, tolerating communication latencies, multiple time stepping. New code released: ChaNGa. Collaboration with T. Quinn (Univ. of Washington). UofI team: Filippo Gioachin, Pritish Jetley, Celso Mendes.
33 ChaNGa Load Balancing. Challenge: trade-off between communication and balance.
34 Recent Successes in Scaling ChaNGa. [Plot: number of processors x execution time, and execution-time scaling vs. number of processors, for the drgas, lambb, dwf, hrwh_lcdms, and dwarf datasets.]
35 Car-Parrinello MD Quantum Chemistry: LeanCP. Illustrates the utility of separating decomposition and mapping; a very complex set of objects and interactions; excellent scaling achieved. Collaboration with Glenn Martyna (IBM), Mark Tuckerman (NYU). UofI team: Eric Bohm, Abhinav Bhatele.
36 LeanCP Decomposition.
37 LeanCP Scaling.
38 Space-Time Meshing. Discontinuous Galerkin method; tent-pitcher algorithm. Collaboration with Bob Haber, Jeff Erickson, Michael Garland. PPL team: Aaron Becker, Sayantan Chakravorty, Terry Wilmarth.
40 Rocket Simulation. Dynamic, coupled physics simulation in 3D: finite-element solids on an unstructured tet mesh; finite-volume fluids on a structured hex mesh; coupling every timestep via a least-squares data transfer. Challenges: multiple modules; dynamic behavior: burning surface, mesh adaptation. Robert Fiedler, Center for Simulation of Advanced Rockets. Collaboration with M. Heath, P. Geubelle, others.
41 Dynamic Load Balancing in Crack Propagation.
42 Colony: FAST-OS Project. DOE-funded collaboration: Terry Jones (LLNL); Jose Moreira et al. (IBM). At Illinois: supports scalable dynamic load balancing and fault tolerance.
43 Colony Project Overview. Collaborators: Lawrence Livermore National Laboratory: Terry Jones; University of Illinois at Urbana-Champaign: Laxmikant Kale, Celso Mendes, Sayantan Chakravorty; International Business Machines: Jose Moreira, Andrew Tauferner, Todd Inglett. Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors. Topics: parallel resource instrumentation framework; scalable load balancing; OS mechanisms for migration; processor virtualization for fault tolerance; single system management space; parallel awareness and coordinated scheduling of services; Linux OS for cellular architecture.
44 Load Balancing on Very Large Machines. Existing load balancing strategies don't scale on extremely large machines; consider an application with 1M objects on 64K processors. Centralized: object load data are sent to processor 0; integrated into a complete object graph; migration decisions are broadcast from processor 0; global barrier. Distributed: load balancing among neighboring processors; build a partial object graph; migration decisions are sent to neighbors; no global barrier.
45 A Hybrid Load Balancing Strategy. Divide processors into independent sets of groups, with groups organized in hierarchies (decentralized); each group has a leader (the central node) which performs centralized load balancing; a particular hybrid strategy that works well. Gengbin Zheng, PhD thesis, 2005.
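The hybrid scheme above can be sketched in miniature. This is hypothetical code, not the thesis implementation: processors are split into groups, and each group's leader runs a centralized balancing pass over only its own group, so no single node ever has to see the whole object graph.

```python
# Sketch of hierarchical/hybrid load balancing: centralized balancing
# within each group, run by that group's leader. Hypothetical code.

def hybrid_balance(loads, group_size):
    # loads: per-processor load; groups are contiguous blocks of processors.
    decisions = []
    for start in range(0, len(loads), group_size):
        group = list(range(start, min(start + group_size, len(loads))))
        leader = group[0]                       # central node for this group
        avg = sum(loads[p] for p in group) / len(group)
        for p in group:
            # The leader records how much work p should shed (+) or gain (-),
            # using only its own group's load data.
            decisions.append((leader, p, loads[p] - avg))
    return decisions

loads = [8, 2, 6, 4]                            # two groups of two
d = hybrid_balance(loads, group_size=2)
# Group 0 (leader 0): avg 5 -> proc 0 sheds 3, proc 1 gains 3.
assert (0, 0, 3.0) in d and (0, 1, -3.0) in d
# Group 1 (leader 2): avg 5 -> proc 2 sheds 1, proc 3 gains 1.
assert (2, 2, 1.0) in d and (2, 3, -1.0) in d
```

This captures the trade-off from the previous slide: within a group the leader gets the quality of a centralized decision, while across groups the work and data stay decentralized, avoiding the processor-0 bottleneck and the global barrier.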
46 Fault Tolerance. Automatic checkpointing: migrate objects to disk; in-memory checkpointing as an option; automatic fault detection and restart. Proactive fault tolerance: impending-fault response: migrate objects to other processors; adjust processor-level parallel data structures. Scalable fault tolerance: when a processor out of 100,000 fails, all 99,999 shouldn't have to run back to their checkpoints! Sender-side message logging; latency tolerance helps mitigate costs; restart can be sped up by spreading out objects from the failed processor.
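The in-memory checkpointing and spread-out restart mentioned above can be sketched as follows. This is hypothetical code, not the Charm++ protocol: each object's checkpoint lives on its home processor and on one buddy processor, so a single failure loses nothing, and the failed processor's objects can be scattered across survivors at restart.

```python
# Sketch of buddy-based in-memory checkpointing: every object's state is
# stored on its home processor and one buddy; on failure, the lost objects
# are restored from buddies and spread over the survivors. Hypothetical.

def checkpoint(objects, num_procs):
    # objects: {obj_id: home_proc}. Store each checkpoint on home and buddy.
    stores = {p: {} for p in range(num_procs)}
    for obj, home in objects.items():
        buddy = (home + 1) % num_procs
        stores[home][obj] = "state"
        stores[buddy][obj] = "state"
    return stores

def restart(objects, stores, failed, num_procs):
    survivors = [p for p in range(num_procs) if p != failed]
    new_map = {}
    for i, (obj, home) in enumerate(sorted(objects.items())):
        if home != failed:
            new_map[obj] = home                       # unaffected objects stay put
        else:
            target = survivors[i % len(survivors)]    # spread to speed up restart
            assert obj in stores[(home + 1) % num_procs]  # buddy holds the copy
            new_map[obj] = target
    return new_map

objs = {"a": 0, "b": 1, "c": 1, "d": 2}
st = checkpoint(objs, num_procs=3)
m = restart(objs, st, failed=1, num_procs=3)
assert m["a"] == 0 and m["d"] == 2               # survivors untouched
assert m["b"] != 1 and m["c"] != 1               # lost objects respawned elsewhere
```

Spreading the failed processor's objects over many survivors is what makes restart fast: no single processor absorbs the whole lost load, and the other 99,999 processors never roll back.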
47 BigSim. Simulating very large parallel machines using smaller parallel machines. Reasons: predict performance on future machines; predict performance obstacles for future machines; do performance tuning on existing machines that are difficult to get allocations on. Idea: emulation run using virtual processors (AMPI); get traces; detailed machine simulation using traces.
48 Objectives and Simulation Model. Objectives: develop techniques to facilitate the development of efficient peta-scale applications, based on performance prediction of applications on large simulated parallel machines. Simulation-based performance prediction: focus on Charm++ and AMPI programming models; performance prediction based on PDES; supports varying levels of fidelity: processor prediction, network prediction. Modes of execution: online and post-mortem.
49 Big Network Simulation. Simulate network behavior: packetization, routing, contention, etc.; incorporate with post-mortem simulation; switches are connected in a torus network. Pipeline: the BGSIM Emulator produces BG log files (tasks & dependencies); POSE timestamp correction in BigNetSim yields timestamp-corrected tasks.
50 Projections: Performance Visualization.
51 Architecture of BigNetSim.
52 Performance Prediction (contd.). Predicting time of sequential code: user-supplied time for every code block; wall-clock measurements on the simulating machine can be used via a suitable multiplier; hardware performance counters to count floating point, integer, branch instructions, etc.; cache performance and memory footprint are approximated by the percentage of memory accesses and the cache hit/miss ratio; instruction-level simulation (not implemented). Predicting network performance: no contention, time based on topology & other network parameters; back-patching, which modifies communication time using the amount of communication activity; network simulation, modeling the network entirely.
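The simplest sequential-time prediction mode above, wall-clock measurement times a suitable multiplier, amounts to one line of arithmetic. A hedged sketch, with hypothetical names and values:

```python
# Sketch of multiplier-based sequential-time prediction: scale code-block
# times measured on the simulating (host) machine by a user-supplied
# multiplier to estimate times on the target machine. Illustrative only.

def predict_block_times(measured_times, multiplier):
    # measured_times: seconds per code block on the simulating machine.
    # multiplier: target time relative to host (2.0 = target cores 2x slower).
    return [t * multiplier for t in measured_times]

host_times = [0.1, 0.4, 0.2]        # measured on the simulating machine
target_times = predict_block_times(host_times, multiplier=2.0)
assert target_times == [0.2, 0.8, 0.4]
```

The fancier modes on the slide refine exactly this estimate: performance counters and cache models replace the single multiplier with a per-block cost model, and instruction-level simulation would replace it with direct execution.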
53 Multi-Cluster Co-Scheduling. Cluster A: intra-cluster latency (microseconds); Cluster B: inter-cluster latency (milliseconds). A job co-scheduled to run across two clusters provides access to large numbers of processors, but cross-cluster latencies are large! Virtualization within Charm++ masks the high inter-cluster latency by allowing overlap of communication with computation.
54 Multi-Cluster Co-Scheduling.
55 Faucets: Optimizing Utilization Within/Across Clusters. [Diagram: job specs submitted to clusters, which return bids; file upload and job id on submission; job monitor tracks the job across clusters.]
56 Other Ongoing Projects. Parallel debugger; automatic out-of-core execution; parallel algorithms (current: Prim's spanning tree algorithm, sorting, ...). New collaborations being explored: Prof. Paulino, Prof. Pantano, ...
57 Domain-Specific Frameworks. Motivation: reduce the tedium of parallel programming for commonly used paradigms and parallel data structures; encapsulate parallel data structures and algorithms; provide an easy-to-use interface; used to build concurrently composable parallel modules. Frameworks: Unstructured meshes: ParFUM: generalized ghost regions; used in Rocfrac and Rocflu at the rocket center, and outside CSAR; fast collision detection. Structured meshes: multiblock framework: automates communication; AMR, common to both of the above. Particles: multiphase flows; MD, tree codes.
58 Summary and Messages. We at PPL have advanced migratable-objects technology. We are committed to supporting applications, and we grow our base of reusable techniques via such collaborations. Try using our technology: AMPI, Charm++, Faucets, ParFUM, ... Available via the web: charm.cs.uiuc.edu
59 Parallel Programming Laboratory. Senior staff. Grants: IBM PERCS High Productivity; NSF ITR Chemistry: Car-Parrinello MD, QM/MM; NCSA Faculty Fellows Program; NSF ITR, NASA: Computational Cosmology and Visualization; DOE HPC-Colony: Services and Interfaces for Large Computers; NIH Biophysics: NAMD; DOE CSAR: Rocket Simulation; NSF ITR CPSD: Space/Time Meshing; NSF Next Generation Software: BlueGene. Enabling projects: Faucets: dynamic resource management for grids; Charm++ and Converse; load balance: centralized, distributed, hybrid; AMPI: Adaptive MPI; fault tolerance: checkpointing, fault recovery, processor evacuation; Projections: performance analysis; ParFUM: supporting unstructured meshes (comp. geometry); orchestration and parallel languages; BigSim: simulating big machines and networks.
60 Over the next two days. Keynote: Kathy Yelick, PGAS Languages and Beyond. System progress talks: Adaptive MPI; BigSim: performance prediction; scalable performance analysis; fault tolerance; Cell processor; Grid: multi-cluster applications. Applications: molecular dynamics, quantum chemistry (LeanCP), computational cosmology, rocket simulation. Tutorials: Charm++, AMPI, Projections, BigSim.
More informationMemory Tagging in Charm++
Memory Tagging in Charm++ Filippo Gioachin Department of Computer Science University of Illinois at Urbana-Champaign gioachin@uiuc.edu Laxmikant V. Kalé Department of Computer Science University of Illinois
More informationCharm++ on the Cell Processor
Charm++ on the Cell Processor David Kunzman, Gengbin Zheng, Eric Bohm, Laxmikant V. Kale 1 Motivation Cell Processor (CBEA) is powerful (peak-flops) Allow Charm++ applications utilize the Cell processor
More informationA Case Study in Tightly Coupled Multi-paradigm Parallel Programming
A Case Study in Tightly Coupled Multi-paradigm Parallel Programming Sayantan Chakravorty 1, Aaron Becker 1, Terry Wilmarth 2 and Laxmikant Kalé 1 1 Department of Computer Science, University of Illinois
More informationNAMD Performance Benchmark and Profiling. November 2010
NAMD Performance Benchmark and Profiling November 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: HP, Mellanox Compute resource - HPC Advisory
More informationChapter 20: Database System Architectures
Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types
More informationPreliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations
Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations Bilge Acun 1, Nikhil Jain 1, Abhinav Bhatele 2, Misbah Mubarak 3, Christopher D. Carothers 4, and Laxmikant V. Kale 1
More informationScalaIOTrace: Scalable I/O Tracing and Analysis
ScalaIOTrace: Scalable I/O Tracing and Analysis Karthik Vijayakumar 1, Frank Mueller 1, Xiaosong Ma 1,2, Philip C. Roth 2 1 Department of Computer Science, NCSU 2 Computer Science and Mathematics Division,
More informationPresenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance
More informationScheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications
Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications Sep 2009 Gilad Shainer, Tong Liu (Mellanox); Jeffrey Layton (Dell); Joshua Mora (AMD) High Performance Interconnects for
More informationReducing Communication Costs Associated with Parallel Algebraic Multigrid
Reducing Communication Costs Associated with Parallel Algebraic Multigrid Amanda Bienz, Luke Olson (Advisor) University of Illinois at Urbana-Champaign Urbana, IL 11 I. PROBLEM AND MOTIVATION Algebraic
More informationCenter Extreme Scale CS Research
Center Extreme Scale CS Research Center for Compressible Multiphase Turbulence University of Florida Sanjay Ranka Herman Lam Outline 10 6 10 7 10 8 10 9 cores Parallelization and UQ of Rocfun and CMT-Nek
More informationMemory Tagging in Charm++
Memory Tagging in Charm++ Filippo Gioachin Department of Computer Science University of Illinois at Urbana-Champaign gioachin@uiuc.edu Laxmikant V. Kalé Department of Computer Science University of Illinois
More informationLeveraging Flash in HPC Systems
Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,
More informationFlexible Hardware Mapping for Finite Element Simulations on Hybrid CPU/GPU Clusters
Flexible Hardware Mapping for Finite Element Simulations on Hybrid /GPU Clusters Aaron Becker (abecker3@illinois.edu) Isaac Dooley Laxmikant Kale SAAHPC, July 30 2009 Champaign-Urbana, IL Target Application
More informationGeneric finite element capabilities for forest-of-octrees AMR
Generic finite element capabilities for forest-of-octrees AMR Carsten Burstedde joint work with Omar Ghattas, Tobin Isaac Institut für Numerische Simulation (INS) Rheinische Friedrich-Wilhelms-Universität
More informationMemory Tagging in Charm++
Memory Tagging in Charm++ Filippo Gioachin Laxmikant V. Kalé Department of Computer Science University of Illinois at Urbana Champaign Outline Overview Memory tagging Charm++ RTS CharmDebug Charm++ memory
More informationBigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines
BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kalé Dept. of Computer Science University of Illinois at
More informationNAMD Performance Benchmark and Profiling. January 2015
NAMD Performance Benchmark and Profiling January 2015 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource
More informationIntroducing Overdecomposition to Existing Applications: PlasComCM and AMPI
Introducing Overdecomposition to Existing Applications: PlasComCM and AMPI Sam White Parallel Programming Lab UIUC 1 Introduction How to enable Overdecomposition, Asynchrony, and Migratability in existing
More informationc Copyright by Milind A. Bhandarkar, 2002
c Copyright by Milind A. Bhandarkar, 2002 CHARISMA: A COMPONENT ARCHITECTURE FOR PARALLEL PROGRAMMING BY MILIND A. BHANDARKAR M.S., Birla Institute of Technology and Science, Pilani, 1989 M.Tech., Indian
More informationNAMD GPU Performance Benchmark. March 2011
NAMD GPU Performance Benchmark March 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Dell, Intel, Mellanox Compute resource - HPC Advisory
More informationScalable In-memory Checkpoint with Automatic Restart on Failures
Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th
More informationNAMD Performance Benchmark and Profiling. February 2012
NAMD Performance Benchmark and Profiling February 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource -
More informationProjections A Performance Tool for Charm++ Applications
Projections A Performance Tool for Charm++ Applications Chee Wai Lee cheelee@uiuc.edu Parallel Programming Laboratory Dept. of Computer Science University of Illinois at Urbana Champaign http://charm.cs.uiuc.edu
More informationCycle Sharing Systems
Cycle Sharing Systems Jagadeesh Dyaberi Dependable Computing Systems Lab Purdue University 10/31/2005 1 Introduction Design of Program Security Communication Architecture Implementation Conclusion Outline
More informationNAMD: Biomolecular Simulation on Thousands of Processors
NAMD: Biomolecular Simulation on Thousands of Processors James Phillips Gengbin Zheng Laxmikant Kalé Abstract NAMD is a parallel, object-oriented molecular dynamics program designed for high performance
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationEnabling Contiguous Node Scheduling on the Cray XT3
Enabling Contiguous Node Scheduling on the Cray XT3 Chad Vizino vizino@psc.edu Pittsburgh Supercomputing Center ABSTRACT: The Pittsburgh Supercomputing Center has enabled contiguous node scheduling to
More informationPreliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede
Preliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede Qingyu Meng, Alan Humphrey, John Schmidt, Martin Berzins Thanks to: TACC Team for early access to Stampede J. Davison
More informationDEBUGGING LARGE SCALE APPLICATIONS WITH VIRTUALIZATION FILIPPO GIOACHIN DISSERTATION
c 2010 FILIPPO GIOACHIN DEBUGGING LARGE SCALE APPLICATIONS WITH VIRTUALIZATION BY FILIPPO GIOACHIN DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
More informationIn-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD
In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University
More informationLecture 23 Database System Architectures
CMSC 461, Database Management Systems Spring 2018 Lecture 23 Database System Architectures These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationc 2011 by Ehsan Totoni. All rights reserved.
c 2011 by Ehsan Totoni. All rights reserved. SIMULATION-BASED PERFORMANCE ANALYSIS AND TUNING FOR FUTURE SUPERCOMPUTERS BY EHSAN TOTONI THESIS Submitted in partial fulfillment of the requirements for the
More informationOptimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6
Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6 Yanhua Sun, Gengbin Zheng, Chao Mei, Eric J. Bohm, James C. Phillips, Laxmikant V. Kalé University of Illinois
More informationIOS: A Middleware for Decentralized Distributed Computing
IOS: A Middleware for Decentralized Distributed Computing Boleslaw Szymanski Kaoutar El Maghraoui, Carlos Varela Department of Computer Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/wwc
More informationScalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance
Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale* jliffl2@illinois.edu,
More informationParallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?
Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and
More informationHPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser
HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,
More informationxsim The Extreme-Scale Simulator
www.bsc.es xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa Seminar @ BSC, 28 Feb 2014 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of
More informationDetecting and Using Critical Paths at Runtime in Message Driven Parallel Programs
Detecting and Using Critical Paths at Runtime in Message Driven Parallel Programs Isaac Dooley, Laxmikant V. Kale Department of Computer Science University of Illinois idooley2@illinois.edu kale@illinois.edu
More informationMolecular Dynamics and Quantum Mechanics Applications
Understanding the Performance of Molecular Dynamics and Quantum Mechanics Applications on Dell HPC Clusters High-performance computing (HPC) clusters are proving to be suitable environments for running
More informationHPC Algorithms and Applications
HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear
More informationOutline. Definition of a Distributed System Goals of a Distributed System Types of Distributed Systems
Distributed Systems Outline Definition of a Distributed System Goals of a Distributed System Types of Distributed Systems What Is A Distributed System? A collection of independent computers that appears
More informationHigh-Performance Scientific Computing
High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org
More informationCommunication and Topology-aware Load Balancing in Charm++ with TreeMatch
Communication and Topology-aware Load Balancing in Charm++ with TreeMatch Joint lab 10th workshop (IEEE Cluster 2013, Indianapolis, IN) Emmanuel Jeannot Esteban Meneses-Rojas Guillaume Mercier François
More informationVisualizing Biomolecular Complexes on x86 and KNL Platforms: Integrating VMD and OSPRay
Visualizing Biomolecular Complexes on x86 and KNL Platforms: Integrating VMD and OSPRay John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology
More informationScalability of Uintah Past Present and Future
DOE for funding the CSAFE project (97-10), DOE NETL, DOE NNSA NSF for funding via SDCI and PetaApps, INCITE, XSEDE Scalability of Uintah Past Present and Future Martin Berzins Qingyu Meng John Schmidt,
More information