Large Scale Distributed Deep Networks

Size: px

Start display at page:

Download "Large Scale Distributed Deep Networks"

Maria Townsend
5 years ago
Views:

1 Large Scale Distributed Deep Networks Yifu Huang School of Computer Science, Fudan University COMP Data Intensive Computing Report, 2013 Yifu Huang (FDU CS) COMP Report 2013/11/20 1 / 21

2 Outline 1 Motivation 2 Software Framework: DistBelief 3 Distributed Algorithm Downpour SGD Sandblaster L-BFGS 4 Experiments 5 Discussion Yifu Huang (FDU CS) COMP Report 2013/11/20 2 / 21

3 Motivation Motivation Why I choose this paper [1]? Google + Stanford Jerey Dean + Andrew Y. Ng Why we need large scale distributed deep networks? Large model can dramatically improve performance Training examples + model parameters Exist methods have limitations GPU, MapReduce, GraphLab What can we learn from this paper? Best parallelism design ideas for deep networks up to now Model parallelism + data parallelism Yifu Huang (FDU CS) COMP Report 2013/11/20 3 / 21

4 Motivation Motivation Why I choose this paper [1]? Google + Stanford Jerey Dean + Andrew Y. Ng Why we need large scale distributed deep networks? Large model can dramatically improve performance Training examples + model parameters Exist methods have limitations GPU, MapReduce, GraphLab What can we learn from this paper? Best parallelism design ideas for deep networks up to now Model parallelism + data parallelism Yifu Huang (FDU CS) COMP Report 2013/11/20 3 / 21

5 Motivation Motivation Why I choose this paper [1]? Google + Stanford Jerey Dean + Andrew Y. Ng Why we need large scale distributed deep networks? Large model can dramatically improve performance Training examples + model parameters Exist methods have limitations GPU, MapReduce, GraphLab What can we learn from this paper? Best parallelism design ideas for deep networks up to now Model parallelism + data parallelism Yifu Huang (FDU CS) COMP Report 2013/11/20 3 / 21

6 Motivation Preliminaries Neural Networks [6] Deep Networks [7] Multiple hidden layers This will allow us to compute much more complex features of the input Yifu Huang (FDU CS) COMP Report 2013/11/20 4 / 21

7 Motivation Preliminaries Neural Networks [6] Deep Networks [7] Multiple hidden layers This will allow us to compute much more complex features of the input Yifu Huang (FDU CS) COMP Report 2013/11/20 4 / 21

8 Software Framework: DistBelief Software Framework: DistBelief Model parallelism Inside parallelism Multi-thread + message passing -> large scale User denes Computation in node, message upward/downward Framework manages Synchronization, data transfer Performance depends on Connectivity structure, computational needs Yifu Huang (FDU CS) COMP Report 2013/11/20 5 / 21

9 Software Framework: DistBelief Software Framework: DistBelief Model parallelism Inside parallelism Multi-thread + message passing -> large scale User denes Computation in node, message upward/downward Framework manages Synchronization, data transfer Performance depends on Connectivity structure, computational needs Yifu Huang (FDU CS) COMP Report 2013/11/20 5 / 21

10 Software Framework: DistBelief Software Framework: DistBelief Model parallelism Inside parallelism Multi-thread + message passing -> large scale User denes Computation in node, message upward/downward Framework manages Synchronization, data transfer Performance depends on Connectivity structure, computational needs Yifu Huang (FDU CS) COMP Report 2013/11/20 5 / 21

11 Software Framework: DistBelief Software Framework: DistBelief Model parallelism Inside parallelism Multi-thread + message passing -> large scale User denes Computation in node, message upward/downward Framework manages Synchronization, data transfer Performance depends on Connectivity structure, computational needs Yifu Huang (FDU CS) COMP Report 2013/11/20 5 / 21

12 Software Framework: DistBelief Software Framework: DistBelief (cont.) An example of model parallelism in DistBelief Yifu Huang (FDU CS) COMP Report 2013/11/20 6 / 21

13 Distributed Algorithm Distributed Algorithm Data parallelism Outside parallelism Multiple model instances optimize a single objective -> high speed A centralized sharded parameter server Dierent model replicas retrieve/update their own parameters Load balance, robust Tolerate variance in the processing speed of dierent model replicas The wholesale failure may be taken oine or restart at random Yifu Huang (FDU CS) COMP Report 2013/11/20 7 / 21

14 Distributed Algorithm Distributed Algorithm Data parallelism Outside parallelism Multiple model instances optimize a single objective -> high speed A centralized sharded parameter server Dierent model replicas retrieve/update their own parameters Load balance, robust Tolerate variance in the processing speed of dierent model replicas The wholesale failure may be taken oine or restart at random Yifu Huang (FDU CS) COMP Report 2013/11/20 7 / 21

15 Distributed Algorithm Distributed Algorithm Data parallelism Outside parallelism Multiple model instances optimize a single objective -> high speed A centralized sharded parameter server Dierent model replicas retrieve/update their own parameters Load balance, robust Tolerate variance in the processing speed of dierent model replicas The wholesale failure may be taken oine or restart at random Yifu Huang (FDU CS) COMP Report 2013/11/20 7 / 21

16 Distributed Algorithm Downpour SGD Outline 1 Motivation 2 Software Framework: DistBelief 3 Distributed Algorithm Downpour SGD Sandblaster L-BFGS 4 Experiments 5 Discussion Yifu Huang (FDU CS) COMP Report 2013/11/20 8 / 21

17 Distributed Algorithm Downpour SGD Downpour SGD SGD [8] Minimize the object function F (ω) Update parameters ω = ω η ω asynchronous SGD [3] Downpour Massive parameters retrieved and updated Adagrad learning rate [4] η i,k = γ/ K j=1 ω i,j 2 Improve both robust and scale Yifu Huang (FDU CS) COMP Report 2013/11/20 9 / 21

18 Distributed Algorithm Downpour SGD Downpour SGD SGD [8] Minimize the object function F (ω) Update parameters ω = ω η ω asynchronous SGD [3] Downpour Massive parameters retrieved and updated Adagrad learning rate [4] η i,k = γ/ K j=1 ω i,j 2 Improve both robust and scale Yifu Huang (FDU CS) COMP Report 2013/11/20 9 / 21

Distributed Algorithm Downpour SGD Downpour SGD SGD [8] Minimize the object function F (ω) Update parameters ω = ω η ω asynchronous SGD [3] Downpour Massive

19 Distributed Algorithm Downpour SGD Downpour SGD SGD [8] Minimize the object function F (ω) Update parameters ω = ω η ω asynchronous SGD [3] Downpour Massive parameters retrieved and updated Adagrad learning rate [4] η i,k = γ/ K j=1 ω i,j 2 Improve both robust and scale Yifu Huang (FDU CS) COMP Report 2013/11/20 9 / 21

20 Distributed Algorithm Downpour SGD Downpour SGD (cont.) Algorithm visualization Yifu Huang (FDU CS) COMP Report 2013/11/20 10 / 21

21 Distributed Algorithm Downpour SGD Downpour SGD (cont.) Algorithm pseudo code Yifu Huang (FDU CS) COMP Report 2013/11/20 11 / 21

22 Distributed Algorithm Sandblaster L-BFGS Outline 1 Motivation 2 Software Framework: DistBelief 3 Distributed Algorithm Downpour SGD Sandblaster L-BFGS 4 Experiments 5 Discussion Yifu Huang (FDU CS) COMP Report 2013/11/20 12 / 21

23 Distributed Algorithm Sandblaster L-BFGS Sandblaster L-BFGS BFGS [9] An iterative method for solving unconstrained nonlinear optimization Compute an approximation to the Hessian matrix B Limited-memory BFGS [10] Sandblaster Massive commands issued by coordinator Load banancing scheme Dynamic work assigned by coordinator Yifu Huang (FDU CS) COMP Report 2013/11/20 13 / 21

24 Distributed Algorithm Sandblaster L-BFGS Sandblaster L-BFGS BFGS [9] An iterative method for solving unconstrained nonlinear optimization Compute an approximation to the Hessian matrix B Limited-memory BFGS [10] Sandblaster Massive commands issued by coordinator Load banancing scheme Dynamic work assigned by coordinator Yifu Huang (FDU CS) COMP Report 2013/11/20 13 / 21

Distributed Algorithm Sandblaster L-BFGS Sandblaster L-BFGS BFGS [9] An iterative method for solving unconstrained nonlinear optimization Compute an approximation to the Hessian matrix B

25 Distributed Algorithm Sandblaster L-BFGS Sandblaster L-BFGS BFGS [9] An iterative method for solving unconstrained nonlinear optimization Compute an approximation to the Hessian matrix B Limited-memory BFGS [10] Sandblaster Massive commands issued by coordinator Load banancing scheme Dynamic work assigned by coordinator Yifu Huang (FDU CS) COMP Report 2013/11/20 13 / 21

26 Distributed Algorithm Sandblaster L-BFGS Sandblaster L-BFGS (cont.) Algorithm visualization Yifu Huang (FDU CS) COMP Report 2013/11/20 14 / 21

27 Distributed Algorithm Sandblaster L-BFGS Sandblaster L-BFGS (cont.) Algorithm pseudo code Yifu Huang (FDU CS) COMP Report 2013/11/20 15 / 21

28 Experiments Experiments Setup Object recognition in still images [5] Acoustic processing for speech recognition [2] Model parallelism benchmarks Yifu Huang (FDU CS) COMP Report 2013/11/20 16 / 21

29 Experiments Experiments Setup Object recognition in still images [5] Acoustic processing for speech recognition [2] Model parallelism benchmarks Yifu Huang (FDU CS) COMP Report 2013/11/20 16 / 21

30 Experiments Experiments (cont.) Optimization method comparisons Yifu Huang (FDU CS) COMP Report 2013/11/20 17 / 21

31 Experiments Experiments (cont.) Optimization method comparisons Application to ImageNet [2] This network achieved a cross-validated classication accuracy of over 15%, a relative improvement over 60% from the best performance we are aware of on the 21k category ImageNet classication task Yifu Huang (FDU CS) COMP Report 2013/11/20 18 / 21

32 Discussion Discussion Contributions Increase the scale and speed of deep networks training Drawbacks There is no guarantee that parameters are consistent Yifu Huang (FDU CS) COMP Report 2013/11/20 19 / 21

33 Discussion Discussion Contributions Increase the scale and speed of deep networks training Drawbacks There is no guarantee that parameters are consistent Yifu Huang (FDU CS) COMP Report 2013/11/20 19 / 21

34 Appendix References I [1] Large Scale Distributed Deep Networks. NIPS [2] Building High-level Features Using Large Scale Unsupervised Learning. ICML [3] Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS [4] Adaptive subgradient methods for online learning and stochastic optimization. JMLR [5] Improving the speed of neural networks on cpus. NIPS [6] [7] Deep_Networks:_Overview [8] Gradient_checking_and_advanced_optimization Yifu Huang (FDU CS) COMP Report 2013/11/20 20 / 21

35 Appendix References II [9] [10] Yifu Huang (FDU CS) COMP Report 2013/11/20 21 / 21

Toward Scalable Deep Learning

한국정보과학회 인공지능소사이어티 머신러닝연구회 두번째딥러닝워크샵 2015.10.16 Toward Scalable Deep Learning 윤성로 Electrical and Computer Engineering Seoul National University http://data.snu.ac.kr Breakthrough: Big Data + Machine Learning