Scaling Deep Learning. Bryan


1 Scaling Deep

2 What do we want AI to do? Guide us to content. Keep us organized. Help us find things. Help us communicate (帮助我们沟通). Drive us to work. Serve drinks?

3 Image Q&A (Baidu IDL): sample questions and answers.

4 Medical Diagnostics App (Baidu BDL): AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.

5 Progress in AI (figure: a cycle of Idea, Test, and Code). Latency from idea to idea is the limiting factor.

6 Why Deep Learning? 1. Scale matters: bigger models usually win. 2. Data matters: more data means less cleverness is necessary. 3. Productivity matters: teams with better tools can try out more ideas. (Figure: accuracy versus data and compute, with curves for deep learning and for many previous methods.)

7 Scaling up: make progress on AI by focusing on systems. Make models bigger. Tackle more data. Reduce research cycle time. Accelerate large-scale experiments.

8 Training Deep Neural Networks: computation is dominated by dot products, i.e. GEMM (compute bound!). Convolutional layers are even more compute bound. Training one model takes about 20 exaFLOPs (total floating-point operations).
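
One way to see why GEMM tends to be compute bound is to compare its floating-point work with the data it moves. The sketch below is illustrative only: the function names and matrix sizes are assumptions, and it idealizes memory traffic by assuming each matrix is read or written exactly once.

```python
def gemm_flops(m, n, k):
    """FLOPs for C = A @ B with A (m x k) and B (k x n): one multiply and one add per term."""
    return 2 * m * n * k


def gemm_intensity(m, n, k, bytes_per_element=4):
    """FLOPs per byte moved, assuming A, B and C each cross the memory bus once (idealized)."""
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return gemm_flops(m, n, k) / bytes_moved


if __name__ == "__main__":
    # Hypothetical layer: 2048 x 2048 weight matrix, batch size n.
    for batch in (1, 32, 64, 512):
        print(batch, round(gemm_intensity(2048, batch, 2048), 2))
```

Under these assumptions the FLOPs-per-byte ratio grows with the batch dimension, so large batches keep the arithmetic units busy while very small batches are limited by memory traffic.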

9 Natural User Interfaces. Goal: make interacting with computers as natural as interacting with humans. AI problems: speech recognition, emotion recognition, semantic understanding, dialog systems, speech synthesis.

10 End-to-end speech with Deep Learning: a deep neural network predicts characters directly from audio (T H _ E D O G).
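
A per-frame character sequence such as the one on this slide is usually turned into text by merging repeated symbols and dropping blanks. The sketch below assumes that `_` denotes the CTC blank symbol (the slide does not say so explicitly); `ctc_collapse` is a hypothetical helper name.

```python
BLANK = "_"  # assumption: "_" is the blank symbol, not a space


def ctc_collapse(frames, blank=BLANK):
    """Merge consecutive repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in frames:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)


if __name__ == "__main__":
    print(ctc_collapse("TTHH_EE__DDOO_G"))
```

A blank between two identical characters keeps them distinct, which is how doubled letters survive the collapse.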

11 Bidirectional Recurrent Network: RNNs model temporal dependence. Various flavors are used in many applications, especially for time series data. Sequential dependence complicates parallelism.
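
The sequential dependence can be seen in a toy bidirectional recurrence. This is a minimal sketch with a scalar hidden state and made-up weights (`w_in`, `w_rec` are illustrative assumptions), not the network from the talk: each step needs the previous hidden state, so the time loop cannot be trivially parallelized.

```python
import math


def rnn_pass(xs, w_in, w_rec):
    """Simple recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}) with h_0 = 0."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)  # depends on the previous step
        states.append(h)
    return states


def bidirectional_rnn(xs, w_in=0.5, w_rec=0.9):
    """Run one pass left-to-right and one right-to-left, then pair the states per time step."""
    fwd = rnn_pass(xs, w_in, w_rec)
    bwd = rnn_pass(xs[::-1], w_in, w_rec)[::-1]
    return list(zip(fwd, bwd))


if __name__ == "__main__":
    for step in bidirectional_rnn([1.0, 0.0, 0.0]):
        print(step)
```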

12 Connectionist Temporal Classification (T H _ E D O G). How do we connect speech data with its transcription? Use the CTC loss function, from [Graves 06]: efficient dynamic programming over all possible alignments to compute the error of an {audio, transcription} pair. The GPU implementation uses ModernGPU plus custom kernels to get a 10-30X speedup over a simple OpenMP implementation.
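
The dynamic programming over alignments can be sketched as the standard CTC forward recursion. This is a small pure-Python illustration working in plain probabilities (a practical implementation would work in log space and on the GPU); the function name and argument layout are assumptions, and this is not the SVAIL implementation. The CTC loss is the negative log of the returned probability.

```python
def ctc_probability(probs, target, blank=0):
    """Total probability of all frame alignments that collapse to `target`.

    probs[t][s] is the probability of symbol id s at frame t;
    target is a list of symbol ids without blanks.
    """
    # Extended target: a blank before, between, and after every symbol.
    ext = [blank]
    for s in target:
        ext += [s, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][blank]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for i in range(size):
            total = alpha[i]                      # stay on the same position
            if i >= 1:
                total += alpha[i - 1]             # advance by one position
            if i >= 2 and ext[i] != blank and ext[i] != ext[i - 2]:
                total += alpha[i - 2]             # skip the blank between two different symbols
            new[i] = total * probs[t][ext[i]]
        alpha = new

    # Valid alignments end on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


if __name__ == "__main__":
    uniform = [[0.5, 0.5], [0.5, 0.5]]  # two frames, symbols: 0 = blank, 1 = "a"
    print(ctc_probability(uniform, [1]))
```

With two uniform frames, the alignments "aa", "a_" and "_a" all collapse to "a", which is why the recursion sums several paths rather than scoring a single one.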

13 SVAIL Infrastructure (figure: FT77CB7079, from the Service Engineer's Manual). Software: CUDA, MPI, Majel (SVAIL library). Hardware: 8 * Titan X (NVIDIA GeForce GTX Titan X), Mellanox FDR InfiniBand. ~5 petaflops, single precision (SP).

14 Parallelism (figure: model-parallel and data-parallel layouts over the training data, with MPI_Allreduce()). For these models, data parallelism works best.
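
In data-parallel training each worker computes gradients on its own slice of the batch, and an allreduce combines them so that every worker applies the same update. The sketch below only simulates that step in a single process with plain lists; it does not call MPI, and `allreduce_mean` and `data_parallel_step` are hypothetical names.

```python
def allreduce_mean(per_worker_grads):
    """Element-wise average of the gradient vectors from all workers (simulated allreduce)."""
    n = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    return [s / n for s in summed]


def data_parallel_step(weights, per_worker_grads, lr=0.1):
    """One SGD step using the averaged gradient; every worker would end up with the same weights."""
    avg = allreduce_mean(per_worker_grads)
    return [w - lr * g for w, g in zip(weights, avg)]


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0]]  # gradients from two simulated workers
    print(data_parallel_step([1.0, 1.0], grads, lr=0.5))
```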


15 Performance for RNN training (figure: TFLOP/s versus number of GPUs, one node and multi node). A typical training run reaches 55% of GPU FMA peak using a single GPU and ~48% of peak using 8 GPUs in one node. Weak scaling is very efficient, albeit algorithmically challenged.
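
The two utilization figures can be turned into a per-GPU scaling efficiency with a little arithmetic. This is a rough sketch: the helper names are hypothetical, and the per-GPU peak passed to `sustained_tflops` is a placeholder value, not a number from the talk.

```python
def relative_efficiency(multi_gpu_fraction, single_gpu_fraction):
    """Per-GPU throughput in the multi-GPU run relative to the single-GPU run."""
    return multi_gpu_fraction / single_gpu_fraction


def sustained_tflops(num_gpus, fraction_of_peak, peak_tflops_per_gpu):
    """Aggregate sustained throughput for a given utilization and per-GPU peak."""
    return num_gpus * fraction_of_peak * peak_tflops_per_gpu


if __name__ == "__main__":
    # 48% of peak on 8 GPUs versus 55% on one GPU (figures from the slide).
    print(relative_efficiency(0.48, 0.55))
    # peak_tflops_per_gpu=6.0 is a hypothetical placeholder.
    print(sustained_tflops(8, 0.48, peak_tflops_per_gpu=6.0))
```

By this measure, each of the 8 GPUs retains roughly 87% of the throughput it achieves alone.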

16 Scalability: batch size is hard to increase (algorithm, memory limits). Performance at small batch sizes (32, 64) leads to scalability limits.

17 Determinism: determinism is very important. With so much randomness, it is hard to tell if you have a bug. Networks train despite bugs, although accuracy is impaired. Reproducibility is important for the usual scientific reasons: progress is not possible without reproducibility.
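
Two common sources of run-to-run variation are unseeded random initialization and floating-point sums whose order changes between runs. The sketch below illustrates both in plain Python; it is an assumption about what determinism involves in practice, not a description of the SVAIL software, and the function names are hypothetical.

```python
import random


def init_weights(n, seed):
    """Draw n initial weights from a private, explicitly seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(n)]


def ordered_sum(values):
    """Sum in a fixed left-to-right order; floating-point addition is not associative."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    # Same seed, same weights.
    print(init_weights(3, seed=0) == init_weights(3, seed=0))
    # Same numbers, different order, different rounding.
    print(ordered_sum([0.1, 0.2, 0.3]) == ordered_sum([0.3, 0.2, 0.1]))
```

Fixing the seed and the reduction order makes two runs comparable bit for bit, which makes it easier to distinguish a bug from noise.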


18 Precision (figure: weight distribution, count versus magnitude). FP32 works; no need for FP64. FP16 also works; use FP32 for softmax and weight updates.
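
A small experiment shows why weight updates are a natural place to keep FP32. The sketch below emulates FP16 and FP32 rounding with the standard `struct` module (format codes `e` and `f`); the starting weight, update size and step count are illustrative assumptions.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def apply_updates(w0, update, steps):
    """Apply the same small update repeatedly, accumulating in FP16 and in FP32."""
    w16 = to_fp16(w0)
    w32 = to_fp32(w0)
    for _ in range(steps):
        w16 = to_fp16(w16 + update)  # result rounded back to FP16 every step
        w32 = to_fp32(w32 + update)  # result rounded back to FP32 every step
    return w16, w32


if __name__ == "__main__":
    print(apply_updates(1.0, 1e-4, 1000))
```

Near 1.0 the spacing between FP16 values is about 0.001, so an update of 0.0001 is rounded away on every step and the FP16 weight never moves, while the FP32 copy accumulates the updates as expected.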

19 FP16 HGEMM for deployment (figure: TFLOP/s versus outer dimension n of x in Ax = b, comparing Nervana and Baidu kernels). Performance on a Quadro K1200 (1.1 TFLOP/s peak, 45 W). We batch, but n is still small. Custom kernels for HGEMM help: 2-2.5X more performance at small batches.

20 Conclusion: Deep learning is extreme HPC. Systems matter a lot for deep learning. We favor dense clusters of GPUs for training. Custom software makes it efficient. 50 TFLOP/s sustained. GPUs work for deployment as well. Thanks to Andrew Ng, Adam Coates, Awni Hannun, Patrick LeGresley and all of
