Won Woo Ro, Ph.D. School of Electrical and Electronic Engineering


1 Won Woo Ro, Ph.D. School of Electrical and Electronic Engineering

2 Degrees
  Ph.D., EE, University of Southern California, USA (2004.5)
  M.S., EE, University of Southern California, USA (1999.5)
  B.S., EE, Yonsei University, Korea (1996.9)
Career
  Associate Professor, School of Electrical and Electronic Engineering, College of Engineering, Yonsei University (2007 ~ present)
  Assistant Professor, ECE, California State University, Northridge ( ~ )
  Contract Software Engineer, ARM Inc., Irvine, California (2006 ~ 2007)
  Assistant Specialist (Post-Doc.), University of California, Irvine (2004)
  Intern, Apple Computer Inc., Cupertino, California (2003 ~ 2004)
Contact
  Engr. Bldg. III C220, Yonsei University, Seoul, Korea
  Tel:
  E-mail: wro@yonsei.ac.kr
Research Topics
  Microprocessor and Graphics Processing Unit (GPU) architectures
  High-performance computing systems
  Parallel processing and distributed systems
  Neural processors and machine learning accelerators

3 Embedded Systems and Computer Architecture Lab

6 Computer Systems

7 High-Performance Computer Systems
Computer Architectures
  Multi-Core Microprocessor Architectures
  Memory Hierarchy: Cache & Memory
  Storage: Flash & SSD
GPU Architectures
  Thread Scheduling Algorithms
  Power-Efficient Architectures
  Data Compression / Computation Reuse
Neural Network Accelerator
  Memory Design for Neural Networks
  New Computation Paradigm for NN
  Neural Network Processing Unit
Parallel Processing
  High-Performance Computing
  Heterogeneous Computing
  Parallel Processing

8 AlphaGo ( )
TPU: a dedicated ASIC, neither a CPU nor a GPU ( )
Nature (2016.1)

9 Implementing Artificial Neural Network Accelerators

10 TPU Details @ ISCA 2017
  Domain-specific ASIC called TPU
  Targets inference of NNs
  65,536 8-bit MACs
  28 MiB on-chip memory
  NN applications: MLPs, CNNs, and LSTMs
  15X ~ 30X faster than a GPU (K80) and a CPU (Haswell)
  TOPS/Watt: 30X ~ 80X higher
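The MAC count on this slide translates into a peak throughput figure once a clock rate is fixed. The sketch below is only a back-of-the-envelope calculation: the 700 MHz clock is an assumption that does not appear on the slide, and the function name `peak_tops` is made up for illustration. Each MAC is counted as two operations (one multiply, one add).

```python
def peak_tops(num_macs: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Peak tera-operations per second of a MAC array.

    Counts each multiply-accumulate as `ops_per_mac` operations
    (multiply + add) issued once per clock cycle.
    """
    return num_macs * ops_per_mac * clock_hz / 1e12


NUM_MACS = 65_536          # from the slide (equals a 256 x 256 array)
ASSUMED_CLOCK_HZ = 700e6   # assumption: the clock is NOT given on the slide

if __name__ == "__main__":
    # About 91.75 TOPS with the assumed clock.
    print(f"{peak_tops(NUM_MACS, ASSUMED_CLOCK_HZ):.2f} TOPS (8-bit, peak)")
```

Changing `ASSUMED_CLOCK_HZ` scales the result linearly, so the figure is only as reliable as the clock assumption.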

11 Systolic execution on Matrix Multiply Unit
[Figure: Weight FIFO, Control, and Systolic Data Setup feeding an array of multiply-add cells]
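Systolic execution can be illustrated with a small cycle-by-cycle software model. The sketch below assumes a weight-stationary arrangement: weights are preloaded into the cells, input values enter from the left with a one-cycle skew per row, and partial sums flow downward through the adders. It is a simplified illustration, not a description of the actual TPU control logic, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, W):
    """Compute C = A x W on a simulated K x N weight-stationary systolic array.

    A is M x K (inputs), W is K x N (weights held inside the cells).
    Returns (C, cycles), where cycles is the number of simulated clock ticks.
    """
    M, K, N = len(A), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]   # input value latched in each cell
    p_reg = [[0] * N for _ in range(K)]   # partial sum latched in each cell
    C = [[0] * N for _ in range(M)]
    cycles = M + K + N - 2

    for t in range(cycles):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    # Left edge: row k receives A[m][k] skewed by k cycles.
                    m = t - k
                    a_in = A[m][k] if 0 <= m < M else 0
                else:
                    a_in = a_reg[k][j - 1]          # value from the left neighbour
                p_in = p_reg[k - 1][j] if k > 0 else 0  # sum from the cell above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * W[k][j]
        a_reg, p_reg = new_a, new_p

        # Finished sums leave the bottom row; column j lags by j cycles.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                C[m][j] = p_reg[K - 1][j]
    return C, cycles


if __name__ == "__main__":
    C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, cycles)  # [[19, 22], [43, 50]] 4
```

Each cell only talks to its left and upper neighbours, which is the property that lets the hardware avoid reading and writing a register file for every multiply-add.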

12 TPU2
  A: 4 TPU2 chips (with heat sinks)
  B: 2 BlueLink 25 GB/s
  C: 2 Omni-Path Architecture (OPA) cables
  D: back of power connector
  E: a network switch


13 TPU2 Stamp
  Blue LEDs: TPU2 racks
  Green LEDs: CPU racks
  A Google TPU2 stamp is composed of four racks
  CPU boards and TPU2 boards are connected one-to-one
Stamp Architecture
  4 racks (2 TPU2 racks, 2 CPU racks)
  Each rack, CPU or TPU2, holds 32 boards
  128 CPU chips (Intel Xeon) and 256 TPU2 chips per stamp
  Google has currently deployed 4 stamps: 512 CPU chips and ( ) TPU2 chips in total

14 Performance Speedup
  FP16: 45 TFLOPS per chip, 11.5 PFLOPS per stamp
Power Consumption
  Judging from the size of the heat sinks, each chip is assumed to draw at least 200~250 W
  TPU2 rack: estimated 30~36 kW per rack
  TPU2 stamp: estimated 100~112 kW per stamp, 100~115 GFLOPS/Watt per stamp

                    TPU chip                 TPU2 chip
  Power             40 W                     200 W
  Matrix Multiply   8 bits (integer)         16 bits (floating point)
  Performance       23 TOPS (INT 16)         45 TFLOPS (FP16)
  Design            Single chip              Quad-chip board
  Connection        2 PCI-Express 3.0 x8     Dual OPA ports and network switch, two BlueLink ports
  Purpose           Inference                Inference & Training
  Memory Bandwidth  16 GB/s                  N/A
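The per-stamp figures follow from the per-chip and per-rack numbers on the two TPU2 stamp slides. The sketch below reproduces that arithmetic: 2 TPU2 racks of 32 boards with 4 chips each, 45 TFLOPS per chip, and the estimated 100~112 kW stamp power. All function and constant names are made up for illustration, and the power values are the slides' estimates, not measurements.

```python
TPU2_RACKS_PER_STAMP = 2      # from the slide
BOARDS_PER_RACK = 32          # from the slide
TPU2_CHIPS_PER_BOARD = 4      # from the slide (quad-chip board)
TFLOPS_PER_CHIP_FP16 = 45     # from the slide


def tpu2_chips_per_stamp() -> int:
    """Number of TPU2 chips in one stamp."""
    return TPU2_RACKS_PER_STAMP * BOARDS_PER_RACK * TPU2_CHIPS_PER_BOARD


def stamp_pflops() -> float:
    """Peak FP16 PFLOPS of one stamp."""
    return tpu2_chips_per_stamp() * TFLOPS_PER_CHIP_FP16 / 1000


def gflops_per_watt(pflops: float, stamp_kw: float) -> float:
    """Efficiency for a given stamp power estimate in kW."""
    return (pflops * 1e6) / (stamp_kw * 1e3)


if __name__ == "__main__":
    print(tpu2_chips_per_stamp())                 # 256
    print(f"{stamp_pflops():.2f} PFLOPS")         # 11.52 PFLOPS
    for kw in (100, 112):                         # estimated power range
        print(f"{kw} kW -> {gflops_per_watt(stamp_pflops(), kw):.1f} GFLOPS/W")
```

With these inputs the efficiency comes out at roughly 103 to 115 GFLOPS/Watt, which matches the 100~115 range quoted on the slide.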

15 Two Phases of a Neural Network
  TRAINING: learning a new capability from existing data
  INFERENCE: applying this capability to new data
[Figure: a deep learning framework trains an untrained neural network model on a training dataset, producing a trained model with a new capability; the trained model, optimized for prediction accuracy, then labels new data ("cat", "dog", "cat")]
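The split between the two phases can be shown with a single artificial neuron. The sketch below is a toy example, not the framework from the figure: `train` fits a perceptron on four made-up labelled points, and `infer` applies the frozen weights to new points. The data, labels, and function names are all invented for illustration.

```python
def train(samples, labels, epochs=10, lr=1.0):
    """TRAINING: learn weights from labelled data (perceptron rule)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred                 # 0 when the prediction is right
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def infer(model, x):
    """INFERENCE: apply the trained weights to new data; no weights change."""
    w, b = model
    return "cat" if w[0] * x[0] + w[1] * x[1] + b > 0 else "dog"


if __name__ == "__main__":
    # Made-up 2-feature training set: label 1 = cat, 0 = dog.
    samples = [(2, 0), (3, 1), (0, 2), (1, 3)]
    labels = [1, 1, 0, 0]
    model = train(samples, labels)
    print(infer(model, (4, 1)))    # cat
    print(infer(model, (0.5, 4)))  # dog
```

Training loops over the data and updates weights repeatedly, while inference is a single forward evaluation, which is why the two phases place different demands on hardware.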

16 CONVOLUTION = SIMD Paradise!
100s, 1000s
Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing (ISCA 2016)
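The idea named in the paper title, skipping work for ineffectual (zero-valued) neurons, can be illustrated in software. The sketch below is a plain 1-D convolution that counts how many multiply-accumulates it avoids when an activation is zero. It is only an illustration of the principle; the slide does not describe the Cnvlutin hardware, and the function name and sample data are invented.

```python
def conv1d_skip_zeros(activations, weights):
    """Valid 1-D convolution (correlation form) that skips zero activations.

    Returns (outputs, total_macs, skipped_macs).
    """
    out_len = len(activations) - len(weights) + 1
    out = [0] * out_len
    total = skipped = 0
    for i in range(out_len):
        for k, w in enumerate(weights):
            a = activations[i + k]
            total += 1
            if a == 0:
                skipped += 1       # zero input contributes nothing: skip the MAC
                continue
            out[i] += a * w
    return out, total, skipped


if __name__ == "__main__":
    # Activations after a ReLU layer are often zero; this sample is made up.
    acts = [0, 3, 0, 0, 2, 1]
    out, total, skipped = conv1d_skip_zeros(acts, [1, -1, 2])
    print(out)                               # [-3, 3, 4, 0]
    print(f"skipped {skipped} of {total}")   # skipped 7 of 12
```

The output is identical to a dense convolution; only the amount of work changes, and the saving grows with the fraction of zero activations.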

18 * Slides from David Kirk Keynote Speech (GTX Korea 2016)

19 NVIDIA Volta Streaming Multiprocessor (SM)

20 Commercial Processors

                      Skylake-X                   Volta                                  Knights Landing
                      (Core i9-7980XE)            (Tesla V100, GV100)                    (Xeon Phi 7290F)
  Release Date        Q3 '17                      Q3 '17                                 Q4 '16
  Die Size            473 mm^2 (estimated)        815 mm^2                               646 mm^2
  Total Transistors   5.9 B (estimated)           21.1 B                                 8 B
  Num. of Cores       18 CPU Cores / 36 Threads   5,120 CUDA Cores + 640 Tensor Cores    72 Cores / 288 Threads
  TDP                 165 W                       300 W                                  260 W
  Clock & Technology  ~ 4.4 GHz / 14 nm           ~ 1455 MHz / 12 nm                     ~ 1.7 GHz / 14 nm
  Memory              -                           16 GB, 900 GB/sec                      16 GB, 7.2 GT/sec
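The core count and clock in the table give a rough peak-throughput estimate for the Volta column. The sketch below assumes each CUDA core retires one fused multiply-add per cycle, counted as 2 floating-point operations; that assumption and the function names are not from the slide, and Tensor Cores are left out of the estimate.

```python
def peak_tflops(cores: int, clock_hz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Peak TFLOPS, assuming every core issues one FMA (2 FLOPs) per cycle."""
    return cores * flops_per_core_per_cycle * clock_hz / 1e12


def gflops_per_watt(tflops: float, tdp_watts: float) -> float:
    """Peak efficiency relative to the listed TDP."""
    return tflops * 1000 / tdp_watts


if __name__ == "__main__":
    v100 = peak_tflops(cores=5_120, clock_hz=1455e6)   # table values
    print(f"{v100:.2f} TFLOPS FP32 (peak, CUDA cores only)")   # 14.90
    print(f"{gflops_per_watt(v100, 300):.1f} GFLOPS/W at 300 W TDP")  # 49.7
```

The same two functions apply to the CPU columns only if a per-core FLOPs-per-cycle figure is supplied, which the table does not provide.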

21 Movidius - Neural Compute Stick
  Stick-form device that plugs in over USB 3.0
  Peak power of 1.2 W
  Runs trained NNs at low power consumption
  Movidius Myriad 2 SoC processor
  Supports the Caffe neural network framework
  4 Gb LPDDR3 memory capacity
  GoogLeNet: 17 inferences/sec (55 ms/inference)
  Movidius Neural Compute Stick ( , $79)
Hardware Spec. (Myriad 2)
  Core Clock: 600 MHz
  SIMD Width: 48 (3x16)
  IPC:
  Memory Bandwidth: Gbit/s
  Data Bit: 32
  Load Inst. Delay: 1 cycle


22 Mobile AI Case Study
  On September 2, at IFA 2017, Huawei announced the Kirin 970, a new SoC with a built-in NPU
  On October 16, in Munich, Huawei announced the Mate 10 series of phones built on the Kirin 970
  Mobile AI = Cloud AI + On-Device AI
  Latency, Stability, Privacy
Kirin 970 Specification
  CPU: 8-Core, up to 2.4 GHz
  GPU: 12-Core Mali G72MP12
  Process: TSMC 10 nm
  Kirin NPU: 1.92T FP16 OPS
  HiAI...
  Image DSP: 512-bit SIMD
  Integration density: 5.5 billion TRs/cm^2
  Performance comparison: 25x a 4-Core Cortex-A73
  Performance/Watt: 50x a 4-Core Cortex-A73
[Figure: main chips of the Kirin 970]

23 Alan Turing
Turing's View on A.I. (Intelligent Machinery, 1948)
"Certainly the nerve has many advantages. It is extremely compact, does not wear out (probably for hundreds of years if kept in a suitable medium!) and has a very low energy consumption. Against these advantages the electronic circuits have only one counter-attraction, that of speed. This advantage is, however, on such a scale that it may possibly outweigh the advantages of the nerve."

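24 Applications of Machine Learning (1)
Autonomous Driving
  Cars that use machine learning to perceive objects and respond to them
  Tesla, BMW, Hyundai, and many other companies are researching autonomous driving
Augmented Reality (AR)
  Technology that recognizes objects and overlays related information virtually
  IKEA offers a virtual furniture placement service based on augmented reality
Face Recognition
  Uses face recognition for payment, unlocking, and similar tasks
  Samsung, Apple, and others provide security services based on face recognition
[Figures: Autonomous Driving, Augmented Reality, Face Recognition]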


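25 Applications of Machine Learning (2)
Speech Recognition
  Application area that uses machine learning to convert speech to text
  Voice assistant services such as Siri, Bixby, and Cortana
Image Captioning
  Analyzes the positions, actions, and relationships of objects in an image to describe its content
  Google, Naver, and other companies offer search services that generate sentences from images
Machine Translation
  Machine-learning-based translation is highly accurate and handles colloquial speech and free translation well
  Google, Naver, and other companies offer machine-learning-based translation services
[Figures: Speech Recognition, Image Captioning & Searching, Machine Translation]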

26 LAB. PICTURES Embedded Systems and Computer Architecture Lab.

27 LAB. MEMBERS Embedded Systems and Computer Architecture Lab.

30 Any questions or comments?
