1 Computational Science, High Performance Computing, and the IGMCS Program. Jack Dongarra, Innovative Computing Laboratory, University of Tennessee / Oak Ridge National Laboratory

2 The Third Pillar of 21st Century Science. Computational science is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems. Computational science enables us to: investigate phenomena where economics or constraints preclude experimentation; evaluate complex models and manage massive data volumes; and transform business and engineering practices.

3 Computational Science Fuses Three Distinct Elements: applied mathematics, computer-related disciplines, and the domain sciences.

4 Computational Science as an Emerging Academic Pursuit. Many programs in computational science exist, ranging from a college of computing (Georgia Tech, NJIT, CMU) to degrees (Rice, Utah, UCSB), a minor (Penn State, U Wisc, SUNY Brockport), a certificate (Old Dominion, U of Georgia, Boston U), a concentration (Cornell, Northeastern, Colorado State), and individual courses.

5 Graduate Minor in Computational Science. Students in one of the three general areas of Computational Science (Applied Mathematics, Computer-related, or a Domain Science) will become exposed to, and better versed in, the other two areas that are currently outside their home area. A pool of courses dealing with each of the three main areas has been put together by the participating departments for students to select from.

6 IGMCS: Requirements. The Minor requires a combination of course work from three disciplines: Computer-related, Mathematics/Statistics, and a participating Science/Engineering domain (e.g., Chemical Engineering, Chemistry, Physics). At the Masters level, a minor in Computational Science requires 9 hours (3 courses) from the pools; at least 6 hours (2 courses) must be taken outside the student's home area, and students must take at least 3 hours (1 course) from each of the 2 non-home areas (Computer Related, Applied Mathematics, Domain Sciences). At the Doctoral level, a minor in Computational Science requires 15 hours (5 courses) from the pools; at least 9 hours (3 courses) must be taken outside the student's home area, and students must take at least 3 hours (1 course) from each of the 2 non-home areas.

7 IGMCS Process for Students. 1. A student, with guidance from their faculty advisor, lays out a program of courses. 2. Next, discussion with the department's IGMCS liaison. 3. A form is generated with the courses to be taken. 4. The form is submitted for approval by the IGMCS Program Committee.

8 IGMCS Participating Departments (Department: IGMCS Liaison)
Biochemistry & Cellular and Molecular Biology: Dr. Cynthia Peterson
Chemical Engineering: Dr. David Keffer
Chemistry: Dr. Robert Hinde
Earth and Planetary Sciences: Dr. Edmund Perfect
Ecology & Evolutionary Biology: Dr. Louis Gross
Electrical Engineering and Computer Science: Dr. Jack Dongarra
Genome Science & Technology: Dr. Cynthia Peterson
Geography: Dr. Bruce Ralston
Information Science: Dr. Peiling Wang
Materials Science and Engineering: Dr. James Morris (morrisj@ornl.gov)
Mathematics: Dr. Chuck Collins (ccollins@math.utk.edu)
Mechanical, Aerospace and Biomedical Engineering: Dr. A.J. Baker (ajbaker@utk.edu)
Physics: Dr. Thomas Papenbrock (tpapenbr@utk.edu)
Statistics: Dr. Hamparsum Bozdogan (bozdogan@utk.edu)

9 Currently 12 students are signed up for the program. One graduate so far: Daniel Lucio.

10 Students in Departments Not Participating in the IGMCS Program. A student in such a situation can still participate: the student and advisor should submit the courses to be taken to the Chair of the IGMCS Program Committee. The requirement is still the same: the Minor requires a combination of course work from three disciplines, Computer Science related, Mathematics/Statistics, and a participating Science/Engineering domain (e.g., Chemical Engineering, Chemistry, Physics). The student's department should be encouraged to participate in the IGMCS program; this is easy to do, requiring only an approved set of courses and a liaison.

11 Internship. This is optional but strongly encouraged. Students in the program can fulfill 3 hrs. of their requirement through an internship with researchers outside the student's major. The internship may be taken offsite, e.g. at ORNL, or on campus by working with a faculty member in another department. Internships must have the approval of the IGMCS Program Committee.

12 IGMCS Seminar Series
Oct 15: Prof. Jack Dongarra, Electrical Eng. and Comp. Science
Oct 29: Prof. Lou Gross, Ecology & Evolutionary Biology
Nov 12: Prof. Jeremy Smith, Biochem/Cell & Molec Biol/ORNL
Nov 26: IGMCS student
Dec 03: Dr. Phil Andrews, Project Director of the UT National Center for Computational Sciences

14 H. Meuer, H. Simon, E. Strohmaier, & JD: a listing of the 500 most powerful computers in the world. Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem; TPP performance). Updated twice a year, at SC xy in the States in November and at the meeting in Germany in June. All data available from www.top500.org.
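
To make the yardstick concrete, here is a minimal sketch of an Rmax-style measurement. It is an illustration only, using NumPy's dense solver rather than the actual HPL benchmark code, and the flop count uses the customary 2n^3/3 + 2n^2 formula for a dense solve:

    import time
    import numpy as np

    n = 4000                                  # problem size; real HPL runs use far larger n
    A = np.random.rand(n, n)
    b = np.random.rand(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                 # LU factorization plus triangular solves
    elapsed = time.perf_counter() - t0

    flops = 2.0 / 3.0 * n**3 + 2.0 * n**2     # standard operation count for a dense Ax=b solve
    print(f"n={n}: {flops / elapsed / 1e9:.1f} Gflop/s")

The measured rate approaches the machine's peak as n grows, which is exactly the Rmax idea behind the chart on the next slides.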

15 Performance Development. (TOP500 chart of the SUM, #1, and #500 performance trend lines over time: in 1993 the list totaled 1.17 TF/s, with #1 at 59.7 GF/s and #500 at 0.4 GF/s; by June 2008 the SUM is 11.7 PF/s and #1, the IBM Roadrunner, is at 1.02 PF/s. Milestone #1 systems along the way include the Fujitsu 'NWT', Intel ASCI Red, IBM ASCI White, NEC Earth Simulator, and IBM BlueGene/L. The #500 entry trails #1 by roughly 6-8 years; 'My Laptop' is marked for comparison near the bottom of the chart.)

16 Performance Development & Projections. (TOP500 chart extrapolating the SUM, N=1, and N=500 trend lines forward, on an axis running from 1 Mflop/s up to 10 Eflop/s.)

17 Performance Development & Projections. (The same chart annotated with the time to run a fixed workload, and the concurrency expected, at each performance level: 1 Gflop/s, O(1) thread, ~1000 years; 1 Tflop/s, O(10^3) threads, ~1 year; 1 Pflop/s, O(10^6) threads, ~8 hours; 1 Eflop/s, O(10^9) threads, ~1 min.)
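
A quick back-of-the-envelope check of those annotations. This is a minimal sketch; the workload size of roughly 3x10^19 flops is my own back-calculation from "1000 years at 1 Gflop/s", not a figure stated on the slide:

    SECONDS_PER_YEAR = 365 * 24 * 3600
    work_flops = 1000 * SECONDS_PER_YEAR * 1e9   # ~3.15e19 flops: 1000 years at 1 Gflop/s

    for label, rate in [("1 Gflop/s", 1e9), ("1 Tflop/s", 1e12),
                        ("1 Pflop/s", 1e15), ("1 Eflop/s", 1e18)]:
        seconds = work_flops / rate
        if seconds > SECONDS_PER_YEAR:
            print(f"{label}: ~{seconds / SECONDS_PER_YEAR:,.0f} years")
        elif seconds > 3600:
            print(f"{label}: ~{seconds / 3600:,.1f} hours")
        else:
            print(f"{label}: ~{seconds / 60:,.1f} minutes")

The output reproduces the slide's scale: about a year at a teraflop/s, under nine hours at a petaflop/s, and well under a minute at an exaflop/s.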

18 LANL Roadrunner: A Petascale System in 2008. A hybrid design (2 kinds of chips and 3 kinds of cores) based on the 100 Gflop/s (DP) Cell chip paired with dual-core Opteron chips, one Cell chip for each Opteron core; programming is required at 3 levels. Connected Unit cluster: 192 Opteron nodes (180 w/ 2 dual-Cell blades connected w/ 4 PCIe x8 links); 17 clusters; 2nd-stage InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches). In total: ~13,000 Cell HPC chips delivering 1.33 PetaFlop/s (from Cell), ~7,000 dual-core Opterons, and ~122,000 cores.
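
The 1.33 Pflop/s figure follows almost directly from the slide's own per-chip number; a tiny sketch of the arithmetic, using the round chip count quoted above:

    # Peak contribution of the Cell chips, from the slide's round numbers.
    cell_chips = 13_000
    dp_gflops_per_cell = 100                                  # 100 Gflop/s double precision per Cell chip
    print(cell_chips * dp_gflops_per_cell / 1e6, "Pflop/s")   # ~1.3 Pflop/s, matching the ~1.33 quoted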

19 Top10 of the June 2008 List (columns on the slide: Computer, Rmax [TF/s], Rmax/Rpeak, Installation Site, Country, #Cores, Power [MW], MFlops/Watt):
1. IBM Roadrunner, BladeCenter QS22/LS21: Rmax 1,026 TF/s (75% of peak), DOE/NNSA/LANL, USA, ~122,000 cores
2. IBM BlueGene/L, eServer Blue Gene Solution: DOE/NNSA/LLNL, USA, ~212,000 cores
3. IBM Intrepid, Blue Gene/P Solution: DOE/OS/ANL, USA, ~163,000 cores
4. SUN Ranger, SunBlade x6420: NSF/TACC, USA, ~62,000 cores
5. CRAY Jaguar, Cray XT4 QuadCore: DOE/OS/ORNL, USA, ~30,000 cores
6. IBM JUGENE, Blue Gene/P Solution: Forschungszentrum Juelich (FZJ), Germany, ~65,000 cores
7. SGI Encanto, SGI Altix ICE 8200: New Mexico Computing Applications Center, USA, ~14,000 cores
8. HP EKA, Cluster Platform 3000 BL460c: Computational Research Lab, TATA SONS, India, ~14,000 cores
9. IBM Blue Gene/P Solution: IDRIS, France, ~40,000 cores
10. SGI Altix ICE 8200EX: Total Exploration Production, France, ~10,000 cores

20 Top10 of the June 2008 List (repeat of the table on the previous slide).

21 ORNL/UTK Computer Power Cost Projections. Over the next 5 years ORNL/UTK will deploy 2 large petascale systems: using 4 MW today, going to 15 MW before year end, and by 2012 possibly using more than 50 MW. Cost estimates are based on $ per kWh; cooling adds 30% to the technical load, which is very efficient. Power becomes the architectural driver for future large systems. (Chart: power cost per year, including both DOE and NSF systems.)
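
To make the cost arithmetic concrete, a minimal sketch. The electricity rate below is a placeholder chosen for illustration; the slide's actual $/kWh figure is not preserved in this transcript:

    # Annual electricity cost for a machine-room load, with cooling overhead on top.
    def annual_power_cost(load_mw, dollars_per_kwh, cooling_overhead=0.30):
        total_kw = load_mw * 1000 * (1 + cooling_overhead)   # cooling adds ~30% to the technical load
        return total_kw * 24 * 365 * dollars_per_kwh

    for load in (4, 15, 50):                                 # today, year end, ~2012 (from the slide)
        print(f"{load} MW -> ${annual_power_cost(load, 0.06):,.0f} per year")   # $0.06/kWh assumed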

22 Something's Happening Here (from K. Olukotun, L. Hammond, H. Sutter, and B. Smith). A hardware issue just became a software problem. In the old days, each year processors would become faster; today the clock speed is fixed or getting slower, yet things are still doubling every 18-24 months. Moore's Law reinterpreted: it is now the number of cores that doubles every 18-24 months.

23 Multicore. What is multicore? A multicore chip is a single chip (socket) that combines two or more independent processing units, each providing an independent thread of control. Why multicore? The race for ever higher clock speeds is over. In the old days, new chips were faster and applications ran faster on them; today new chips are not faster, they just have more processors per chip. Applications and software must use those extra processors to become faster.
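
A minimal illustration of that last point, i.e. that the application has to spread its work across the extra cores before it gets any faster. This is a sketch using Python's standard library, with process-based parallelism standing in for whatever threading model a real code would use:

    import time
    from concurrent.futures import ProcessPoolExecutor

    def count_primes(bounds):
        # Trial-division prime count over [lo, hi); deliberately compute-bound.
        lo, hi = bounds
        return sum(all(n % d for d in range(2, int(n**0.5) + 1)) for n in range(max(lo, 2), hi))

    if __name__ == "__main__":
        N, workers = 200_000, 4
        t0 = time.perf_counter()
        serial = count_primes((0, N))                      # one thread of control
        t1 = time.perf_counter()
        chunks = [(i * N // workers, (i + 1) * N // workers) for i in range(workers)]
        with ProcessPoolExecutor(workers) as pool:         # one worker per core
            parallel = sum(pool.map(count_primes, chunks))
        t2 = time.perf_counter()
        print(serial, parallel, f"serial {t1 - t0:.2f}s, 4-way {t2 - t1:.2f}s")

Run on a quad-core machine, the same unmodified serial loop sees no benefit from the extra cores; only the partitioned version does.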

24 Power Cost of Frequency. Power is proportional to Voltage^2 x Frequency (V^2F); since Frequency is proportional to Voltage, Power scales as Frequency^3.

25 Power Cost of Frequency (repeat of the previous slide).
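
A worked example of why the cubic Power-vs-Frequency relationship above favors more, slower cores. This is a sketch of the standard argument with normalized, illustrative numbers, not figures from the slide:

    # Power ~ Frequency^3 (since Power ~ V^2 * F and V scales with F).
    # Compare one core at full clock with two cores at half clock.
    base_freq, base_power = 1.0, 1.0                          # normalized single fast core

    half_freq_power = (0.5 / base_freq) ** 3 * base_power     # one core at half clock: 1/8 the power
    two_slow_cores_power = 2 * half_freq_power                # two such cores: 1/4 the power
    two_slow_cores_throughput = 2 * 0.5                       # same aggregate instruction rate

    print(f"Two half-speed cores: {two_slow_cores_throughput:.1f}x throughput "
          f"at {two_slow_cores_power:.2f}x the power of one full-speed core")

Same throughput at a quarter of the power is the basic economics pushing chip designers toward multicore.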

26 Today's Multicores: 98% of Top500 systems are based on multicore chips; 282 use quad-core, 204 use dual-core, and 3 use the nona-core IBM Cell (9 cores). Examples: Intel Clovertown (4 cores), Sun Niagara 2 (8 cores), SiCortex (6 cores), Intel Polaris (80 cores), AMD Opteron (4 cores), IBM BG/P (4 cores), IBM Cell (9 cores).

27 Moore's Law Reinterpreted. The number of cores per chip doubles every 2 years, while the clock speed decreases (not increases); as a result, the number of threads of execution doubles every 2 years. We need to deal with systems with millions of concurrent threads, and future generations will have billions of threads. We also need to be able to easily replace inter-chip parallelism with intra-chip parallelism.
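
As an illustration of where that doubling leads, here is my own extrapolation from the slide's rule of thumb, with an assumed starting point rather than data from the talk:

    # Illustrative extrapolation: cores per chip double every 2 years (clock roughly flat),
    # so a system with a fixed number of sockets quickly reaches tens of millions of threads.
    sockets = 100_000                      # assumed socket count, roughly a large 2008 system
    cores_per_chip, year = 4, 2008         # quad-core baseline assumed
    for _ in range(7):
        print(year, f"{cores_per_chip:4d} cores/chip -> ~{sockets * cores_per_chip:,} threads")
        year, cores_per_chip = year + 2, cores_per_chip * 2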

28 And then there are the GPGPUs. NVIDIA's Tesla T10P: the T10P chip has 240 cores at 1.5 GHz, with Tpeak of 1 Tflop/s in 32-bit floating point and 100 Gflop/s in 64-bit floating point. The S1070 board carries 4 T10P devices at 700 Watts. The GTX uses a T10P at 1.3 GHz, with Tpeak of 864 Gflop/s in 32-bit and 86.4 Gflop/s in 64-bit floating point.

29 Intel's Larrabee Chip. Many x86 IA cores, scalable to Tflop/s. New cache architecture and a new vector instruction set: vector memory operations, conditionals, integer and floating point arithmetic. New vector processing unit / wide SIMD.

30 Architecture of Interest: a manycore chip composed of hybrid cores, some general purpose, some graphics, some floating point.

31 Architecture of Interest: a board composed of multiple such chips sharing memory.

32 Architecture of Interest: a rack composed of multiple boards.

33 Architecture of Interest: a room full of these racks. Think millions of cores.

34 Major Changes to Software. We must rethink the design of our software: this is another disruptive technology, similar to what happened with cluster computing and message passing, and it means rethinking and rewriting the applications, algorithms, and software.

35 Exascale Computing. Exascale systems (10^18 Flop/s) are likely feasible by around 2017, with millions of processing elements (cores or mini-cores) and chips perhaps as dense as 1,000 cores per socket; clock rates will grow more slowly. 3D packaging is likely, along with large-scale optics-based interconnects; PBs of aggregate memory; tens of thousands of I/O channels to Exabytes of secondary storage, with disk-bandwidth-to-storage ratios not optimal for HPC use; and hardware- and software-based fault management. Achievable performance per watt will likely be the primary measure of progress.

36 Conclusions. Moore's Law reinterpreted: the number of cores per chip doubles every two years while clock speed stays roughly stable, so threads of execution double every 2 years, heading toward systems with 100 M cores. We need to deal with systems with millions of concurrent threads, and future generations will have billions of threads; MPI and programming languages from the 60s will not make it. Power is limiting clock-rate growth, and power becomes the architectural driver for Exascale systems.
