MAGMA QR, 1 GPU, All Available Cores. Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory


1 A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures MAGMA QR, 1 GPU, All Available Cores Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory University of Tennessee 20 July, 2011

2 Outline 1) Motivation 2) Algorithm Description 3) Algorithm Tuning 4) Algorithm Optimization 5) Results 6) What's Next 7) Power Efficiency

3 Motivation Moore's Law The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. This trend has continued for more than half a century and is expected to continue until 2015 or 2020 or later. Wikipedia Kepler 6,000,000,

4 Motivation May's Law Software efficiency halves every 18 months, compensating Moore's Law. Wikipedia

5 Motivation Nothing you can't spell will ever work. Will Rogers Sourcebook of Parallel Computing Dongarra, Foster, Fox

6 Algorithm Description: MAGMA QR, 1 GPU, All Available Cores, 2 x 6 cores; 3360 x 3360, NB = 128, IB = 12, OB = 128. Diagram labels: QUARK LAPACK Factorization (6 cores, Q).

7 Algorithm Description: MAGMA QR, 1 GPU, All Available Cores, 2 x 6 cores; 3360 x 3360, NB = 128, IB = 12, OB = 128. Diagram labels: Sequential LAPACK Update (6 cores, P); GPU Update.

8 Algorithm Description: MAGMA QR, 1 GPU, All Available Cores, 2 x 6 cores; 3360 x 3360, NB = 128, IB = 12, OB = 128. Diagram labels: QUARK LAPACK Factorization (6 cores, Q); Sequential LAPACK Update (6 cores, P); GPU Update.

9 Algorithm Description: MAGMA QR, 1 GPU, All Available Cores, 2 x 6 cores; 3360 x 3360, NB = 128, IB = 12, OB = 128. Diagram labels: Sequential LAPACK Update (6 cores, P); GPU Update.

10 Algorithm Description: MAGMA QR, 1 GPU, All Available Cores, 2 x 6 cores; 3360 x 3360, NB = 128, IB = 12, OB = 128. Diagram labels: QUARK LAPACK Factorization (6 cores, Q); Sequential LAPACK Update (6 cores, P); GPU Update.

11 Algorithm Description: MAGMA QR, 1 GPU, All Available Cores, 2 x 6 cores; 3360 x 3360, NB = 128, IB = 12, OB = 128. Diagram labels: Sequential LAPACK Update (6 cores, P); QUARK LAPACK Factorization (6 cores, Q).

12 Algorithm Description: MAGMA QR, 1 GPU, All Available Cores, 2 x 6 cores; 3360 x 3360, NB = 128, IB = 12, OB = 128. Diagram labels: Sequential LAPACK Update (6 cores, P).

13 Algorithm Description: MAGMA QR, 1 GPU, All Available Cores, 2 x 6 cores; 3360 x 3360, NB = 128, IB = 12, OB = 128. Diagram labels: MAGMA Optimized Panel Factorization, 6 cores.
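
The diagrams above describe a panel-by-panel scheme: a panel of NB columns is factored on CPU cores, and the remaining columns are then updated, partly by CPU cores and partly by the GPU. As a minimal sketch of the bookkeeping only, the following Python lists the panel steps for the 3360 x 3360, NB = 128 case. The function name `panel_schedule` and the `cpu_cols` parameter (how many trailing columns stay on the CPU at each step) are assumptions made for illustration; they are not MAGMA's API, and the slides do not state how the CPU/GPU split is actually chosen.

```python
def panel_schedule(n, nb, cpu_cols):
    """Yield one dict per panel step of a blocked, right-looking factorization.

    n        -- number of matrix columns
    nb       -- panel width (NB on the slides)
    cpu_cols -- hypothetical: trailing columns updated on the CPU at each step
    """
    for start in range(0, n, nb):
        width = min(nb, n - start)         # the last panel may be narrower than nb
        trailing = n - (start + width)     # columns still waiting for an update
        # Assumption: the CPU takes the columns closest to the panel.
        on_cpu = min(cpu_cols, trailing)
        yield {
            "panel": (start, start + width),
            "cpu_update": (start + width, start + width + on_cpu),
            "gpu_update": (start + width + on_cpu, n),
        }

# Column ranges are half-open: (first column, one past the last column).
steps = list(panel_schedule(3360, 128, cpu_cols=128))
print(len(steps), steps[0], steps[-1])
```

With these inputs the schedule has 27 steps: 26 full panels of 128 columns and a final panel of 32 columns, after which both update ranges are empty.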

14 Algorithm Tuning Nightmare: Q, P, NB, IB, OB

15 Algorithm Tuning Nightmare: Q, P, NB, IB, OB (3360 x 3360)

16 Algorithm Tuning Nightmare: Q, P, NB, IB, OB (5920 x 5920)

17 Algorithm Tuning Nightmare x Q P NB IB OB

18 Algorithm Tuning Nightmare: 8 x 6 cores, double precision. [Figure: Q, P, NB, OB, and IB plotted against Matrix Size; matrix sizes 2080, 3360, 4640, 5920, 7200, 8480]

19 Algorithm Tuning Nightmare: 2 x 6 cores, double precision. [Figure: Q, P, NB, OB, and IB plotted against Matrix Size; matrix sizes 2080, 3360, 4640, 5920, 7200, 8480]
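
The tuning slides vary five parameters (Q, P, NB, IB, OB) for each matrix size and machine. One simple way to approach such a search is an exhaustive sweep, sketched below. This is not the authors' tuning procedure, which the slides do not describe. The function `tune`, the user-supplied `measure` callback, and the constraints `Q + P <= cores` and `IB <= NB` are all assumptions made for illustration.

```python
import itertools

def tune(measure, cores, nb_values, ib_values, ob_values):
    """Exhaustively search (Q, P, NB, IB, OB) and return the best setting.

    measure -- hypothetical callable (q, p, nb, ib, ob) -> score, higher is better
    cores   -- CPU cores available; Q + P may not exceed this (an assumption)
    """
    best_score, best_params = float("-inf"), None
    for q, p in itertools.product(range(1, cores + 1), range(0, cores + 1)):
        if q + p > cores:
            continue
        for nb, ib, ob in itertools.product(nb_values, ib_values, ob_values):
            if ib > nb:                    # assumed: inner block fits inside the panel
                continue
            score = measure(q, p, nb, ib, ob)
            if score > best_score:
                best_score, best_params = score, (q, p, nb, ib, ob)
    return best_params, best_score

# Stand-in for a real timing run: a synthetic score that peaks at the
# setting shown on the slides (Q=6, P=6, NB=128, IB=12, OB=128).
def fake_measure(q, p, nb, ib, ob):
    return -(abs(q - 6) + abs(p - 6) + abs(nb - 128) + abs(ib - 12) + abs(ob - 128))

params, score = tune(fake_measure, cores=12,
                     nb_values=[64, 128, 192], ib_values=[8, 12, 16],
                     ob_values=[64, 128, 192])
print(params, score)
```

In practice `measure` would run and time the factorization. Even this small grid visits over two thousand settings per matrix size, which gives a sense of why the slides call the tuning a nightmare.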

20 Algorithm Optimization: Performance of multicore QR factorization on Tall Skinny Matrices, Comparing Different Algorithms. 12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s Peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s Peak; Single Node, Single GPU; Single Precision. [Figure: Gflop/s versus Matrix Size for Recursive, Left Looking Execution, Left Looking Insertion, and Parallel MKL; plot annotations: x 192, 2 cores, IB=16; x 128, 4 cores, IB=8; x 64, 6 cores, IB=12]

21 Results: Performance of MAGMA QR with 1 GPU and all Available Cores, Comparing Precisions. 48 Cores (8 x 6-cores), 2.8 GHz AMD Opteron 8439 SE, 129 GB, Peak 1080 Gflop/s [ig]; 1 GeForce GTX GHz Clock - Theoretical Peak: * 2 * 480 = Tflop/s; numactl --interleave=all. [Figure: Gflop/s versus Matrix Size for Single, Double, Complex Single, and Complex Double]
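
The peak figures quoted on the results slides follow the usual pattern of units x clock x flops per cycle; the GPU line on this slide has the same shape (clock * 2 * 480). As a rough check, the sketch below recomputes the two CPU peaks. The helper `peak_gflops` is hypothetical, and the value of 8 single-precision flops per cycle per core is an assumption that the slides do not state. The GPU peak is not evaluated here because its clock value did not survive in the transcript.

```python
def peak_gflops(units, clock_ghz, flops_per_cycle):
    """Theoretical peak in Gflop/s: units x clock (GHz) x flops per cycle per unit."""
    return units * clock_ghz * flops_per_cycle

# CPU hosts from the slides; 8 flops per cycle per core is an assumption.
keeneland = peak_gflops(12, 2.8, 8)   # slides quote 270 Gflop/s
ig = peak_gflops(48, 2.8, 8)          # slides quote 1080 Gflop/s
print(round(keeneland, 1), round(ig, 1))
```

Under that assumption the results are about 268.8 and 1075.2 Gflop/s, close to the rounded 270 and 1080 Gflop/s shown on the slides.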

22 Results: Performance of MAGMA QR with 1 GPU and all Available Cores, Comparing Precisions. 12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s Peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s Peak; Single Node, Single GPU. [Figure: Gflop/s versus Matrix Size for Double, Single, Complex Single, and Complex Double]

23 Results: Performance of MAGMA QR with 1 GPU and all Available Cores, Double Precision, Comparing Against MAGMA 1.0 and MKL. 12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s Peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s Peak; Single Node, Single GPU. [Figure: Gflop/s versus Matrix Size; legend: 1 GPU, 2 sockets, New Approach; 1 GPU, 1 Socket, MAGMA; GPU, 1 Core, MAGMA; GPUs, 1 Socket, MKL]

24 Results: Performance of MAGMA LU with 1 GPU and all Available Cores, Single Precision, Comparing Against MAGMA 1.0 and MKL. 12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s Peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s Peak; Single Node, Single GPU. [Figure: Gflop/s versus Matrix Size; legend: 1 GPU, All Cores; 1 GPU, 1 Socket, MAGMA; GPUs, 2 Sockets, MKL]

25 What's Next

26 Power Efficiency Math is free. Transistors are free. Power is expensive. Performance Per Watt = Performance. Jen Hsun (gen shyuhn) [jensen] Huang, CEO of Nvidia

27 Power Efficiency Peak performance of any system is essentially limited by the amount of power it can draw and the amount of heat it can dissipate. Consequently, performance per watt of a GPU design translates directly into peak performance of a system that uses that design. While performance per watt is useful, absolute power requirements are also important. Claims of improved performance per watt may be used to mask increasing power demands. For instance, though newer generation GPU architectures may provide better performance per watt, continued performance increases can negate the gains in efficiency, and the GPUs continue to consume large amounts of power. Wikipedia

28 Power Efficiency A Google engineer has warned that if the performance per watt of today's computers doesn't improve, the electrical costs of running them could end up far greater than the initial hardware price tag. "If performance per watt is to remain constant over the next few years, power costs could easily overtake hardware costs, possibly by a large margin," Luiz Andre Barroso, who previously designed processors for Digital Equipment Corp., said in a September paper. Google recently unveiled a major new datacenter site in a remote part of Oregon, where power costs are a fraction of those at Google's home base in Silicon Valley.

29 Power Efficiency This has nothing to do with being "green." Every system and subsystem has to fit within some power budget. Brough Turner

30 Power Efficiency "You want a battery on this device and that device that lasts three or four days? I do, too. Well, if we have much more high-performing systems at much lower watts that will trickle down into your cell phone and this recorder and everything else. So, if I can get a processor that runs on 5 watts, as opposed to 100 watts or whatever... Voila, batteries are going to last a lot longer..." ORNL scientific computing chief Jeff Nichols

31 Questions
