Developing Machine Learning Models. Kevin Tsai
GOTO Chicago 2018. Developing Machine Learning Models. Kevin Tsai, Google
Agenda:
- Machine Learning with tf.estimator
- Performance pipelines with TFRecords and tf.data
- Distributed Training with GPUs and TPUs
Machine Learning with tf.estimator
Machine Learning Pipeline. In the training pipeline, training data passes through transform & feature engineering; split() divides it into training and evaluation sets, train() and evaluate() build the model (writing logs along the way), and export() emits the finished model. In the inference pipeline, inference requests pass through the same transform & feature engineering, and model serving returns predictions.
Machine Learning Pipeline with tf.estimator. tf.estimator wraps the same stages: an estimator built from a model_fn exposes train(), evaluate(), and predict(), fed by train_input_fn, eval_input_fn, and predict_input_fn respectively. The estimator writes logs automatically, and export_savedmodel() together with a serving_input_rec_fn produces the model that the inference pipeline serves to answer inference requests with predictions.
Estimator in TensorFlow (the API stack): canned estimators ("models in a box") sit on top of Estimator and Keras models (train and evaluate models), which sit on top of Layers (build models). Underneath are the language frontends (Python, C++, Java, Go, ...) and the TensorFlow distributed execution engine, which runs on CPU, GPU, TPU, Android, iOS, ...
Example: MNIST (MNIST dataset: Yann LeCun)
Steps to Run Training on a Premade Estimator
Training:
1. Import data
2. Create input functions
3. Create estimator
4. Create train and evaluate specs
5. Run train and evaluate
Test Inference:
6. Test run predict()
7. Export model
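The shape of this recipe can be made concrete with a toy stand-in written in plain Python. This is not the tf.estimator API; every name below is illustrative, and the "model" is just a mean predictor, but the call sequence mirrors steps 1-7:

```python
class ToyEstimator:
    """Illustrative stand-in for an estimator; fits a mean predictor."""

    def __init__(self):
        self.mean = None

    def train(self, input_fn):
        xs, ys = input_fn()
        self.mean = sum(ys) / len(ys)

    def evaluate(self, input_fn):
        xs, ys = input_fn()
        return sum((self.mean - y) ** 2 for y in ys) / len(ys)  # MSE

    def predict(self, input_fn):
        xs, _ = input_fn()
        return [self.mean for _ in xs]

    def export(self, path):
        return {"path": path, "params": {"mean": self.mean}}

# 1. import data
data = [(float(x), 2.0 * x) for x in range(10)]
# 2. create input functions: zero-argument callables, like a tf input_fn
train_input_fn = lambda: ([x for x, _ in data], [y for _, y in data])
eval_input_fn = train_input_fn
# 3. create the estimator
est = ToyEstimator()
# 4./5. train and evaluate
est.train(train_input_fn)
metrics = est.evaluate(eval_input_fn)
# 6. test-run predict()
preds = est.predict(eval_input_fn)
# 7. export the model
saved = est.export("/tmp/toy_model")
```

The point is the contract: input functions are deferred callables handed to the estimator, which owns training, evaluation, prediction, and export.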
Premade and Custom Estimators. Premade estimators: DNNClassifier, DNNRegressor, LinearClassifier, LinearRegressor, DNNLinearCombinedClassifier, DNNLinearCombinedRegressor, plus more in tf.contrib.estimator; or write a custom Estimator.
Creating a Custom Estimator
#1 Inference(). The forward pass: Reshape the flat input (-1,784) to (-1,28,28,1); Convolution + ReLU + Max Pool twice, giving (-1,14,14,64) and then (-1,7,7,64); Reshape to (-1,3136); Dense + ReLU + Dropout to (-1,1024); and a final Dense + Softmax to (-1,10). Each convolution block lives in a tf.name_scope and is built from tf.nn.conv2d, tf.nn.relu, and tf.nn.max_pool; each dense block from tf.add, tf.matmul, and tf.nn.relu (relu=False on the output layer); both log weights, biases, and activations with tf.summary.histogram. (Hands-on TensorBoard: Dandelion Mane)
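The shape flow above can be checked with simple arithmetic. A minimal sketch, assuming 5x5 convolutions with 'SAME' padding and stride 1 (spatial size preserved) and 2x2 max pooling with stride 2 (spatial size halved), consistent with the kernel=[5,5,...] layers shown on the next slide:

```python
def conv_same(shape, c_out):
    # 'SAME' convolution, stride 1: spatial dims unchanged, channels -> c_out
    h, w, _ = shape
    return (h, w, c_out)

def max_pool_2x2(shape):
    # 2x2 max pool, stride 2: spatial dims halved
    h, w, c = shape
    return (h // 2, w // 2, c)

trace = [(28, 28, 1)]                     # after reshaping (-1, 784)
trace.append(conv_same(trace[-1], 64))    # conv1 + ReLU
trace.append(max_pool_2x2(trace[-1]))     # pool1 -> (14, 14, 64)
trace.append(conv_same(trace[-1], 64))    # conv2 + ReLU
trace.append(max_pool_2x2(trace[-1]))     # pool2 -> (7, 7, 64)
h, w, c = trace[-1]
flat = h * w * c                          # the dense layer's size_in
```

This is where the 7*7*64 = 3136 in the _dense call on the next slide comes from.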
#1 Inference() and #2 Calculations and Metrics

#### 1 INFERENCE MODEL
input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
conv1 = _conv(input_layer, kernel=[5,5,1,64], name='conv1', log=params['log'])
conv2 = _conv(conv1, kernel=[5,5,64,64], name='conv2', log=params['log'])
dense = _dense(conv2, size_in=7*7*64, size_out=params['dense_units'],
               name='dense', relu=True, log=params['log'])
if mode == tf.estimator.ModeKeys.TRAIN:
    dense = tf.nn.dropout(dense, params['drop_out'])
logits = _dense(dense, size_in=params['dense_units'],
                size_out=10, name='output', relu=False, log=params['log'])
#### 2 CALCULATIONS AND METRICS
predictions = {"classes": tf.argmax(input=logits, axis=1),
               "logits": logits,
               "probabilities": tf.nn.softmax(logits, name='softmax')}
export_outputs = {'predictions': tf.estimator.export.PredictOutput(predictions)}
if (mode == tf.estimator.ModeKeys.TRAIN or mode == tf.estimator.ModeKeys.EVAL):
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    accuracy = tf.metrics.accuracy(labels=labels,
                                   predictions=tf.argmax(logits, axis=1))
    metrics = {'accuracy': accuracy}
    tf.summary.scalar('accuracy', accuracy[1])
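The three calculations here, softmax probabilities, sparse softmax cross-entropy, and accuracy, are easy to spell out. A NumPy sketch of what these ops compute, on toy logits:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sparse_softmax_cross_entropy(labels, logits):
    # mean negative log-probability assigned to the true class
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def accuracy(labels, logits):
    # fraction of examples where the argmax class matches the label
    return np.mean(np.argmax(logits, axis=1) == labels)

logits = np.array([[2.0, 0.5, 0.1], [0.2, 3.0, 0.5]])
labels = np.array([0, 1])
loss = sparse_softmax_cross_entropy(labels, logits)
acc = accuracy(labels, logits)
```

Note that the loss takes integer labels directly ("sparse"), while accuracy compares argmaxed logits against them, exactly as in the slide's code.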
#3 Predict, #4 Train, and #5 Eval

#### 3 MODE = PREDICT
if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(
        mode=mode, predictions=predictions, export_outputs=export_outputs)
#### 4 MODE = TRAIN
if mode == tf.estimator.ModeKeys.TRAIN:
    learning_rate = tf.train.exponential_decay(
        params['learning_rate'], tf.train.get_global_step(),
        decay_steps=100000, decay_rate=0.96)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    if params['replicate']:
        optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)
    train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
    tf.summary.scalar('learning_rate', learning_rate)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
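The schedule used here is a smooth geometric decay: with staircase=False (the default), tf.train.exponential_decay computes learning_rate * decay_rate ** (global_step / decay_steps). A one-function sketch with the slide's parameters:

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate):
    # smooth (staircase=False) schedule of tf.train.exponential_decay
    return learning_rate * decay_rate ** (global_step / decay_steps)

lr_start = exponential_decay(0.1, 0, 100000, 0.96)         # unchanged at step 0
lr_decayed = exponential_decay(0.1, 100000, 100000, 0.96)  # one full decay period
```

So with decay_steps=100000 and decay_rate=0.96, the learning rate shrinks by 4% every 100k global steps (the 0.1 starting value here is illustrative; the real value comes from params['learning_rate']).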
#### 5 MODE = EVAL
if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=metrics)
TFRecords and tf.data
(Inefficient) Pipeline

# Load Data
mnist = tf.contrib.learn.datasets.load_dataset("mnist")
train_data = mnist.train.images
train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
eval_data = mnist.test.images
eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

# Create Input Functions
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_data}, y=train_labels, batch_size=100,
    num_epochs=None, shuffle=True)
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data}, y=eval_labels, num_epochs=1, shuffle=False)
pred_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data}, num_epochs=1, shuffle=False)
TFRecord. A TFRecord file (e.g. train004.tfrecords) is a sequence of serialized tf.train.Example records, written with tf.python_io.TFRecordWriter. Each tf.train.Example holds a tf.train.Features map of named tf.train.Feature values.
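On disk, a TFRecord file is just a sequence of length-delimited records (the real format also stores CRC32C checksums for the length and the payload, omitted here). A simplified pure-Python analog of the container, to show why it streams well:

```python
import io
import struct

def write_records(f, records):
    # each record: 8-byte little-endian length, then the payload bytes
    for data in records:
        f.write(struct.pack("<Q", len(data)))
        f.write(data)

def read_records(f):
    # records can only be read sequentially; there is no index
    while True:
        header = f.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        yield f.read(length)

buf = io.BytesIO()
write_records(buf, [b"example-0", b"example-1"])
buf.seek(0)
records = list(read_records(buf))
```

In the real format each payload would be a serialized tf.train.Example proto; the framing is what lets tf.data read many files in parallel with plain sequential I/O.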
tf.data. To read the records back, tf.parse_single_example unpacks each serialized tf.train.Example against a schema of tf.FixedLenFeature entries, recovering the named feature values.
tf.data input_fn()

dataset = tf.data.TFRecordDataset(c['files'], num_parallel_reads=c['threads'])
dataset = dataset.map(parse_tfrecord, num_parallel_calls=c['threads'])
dataset = dataset.map(lambda x, y: (image_scaling(x), y),
                      num_parallel_calls=c['threads'])
if c['mode'] == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.map(lambda x, y: (distort(x), y),
                          num_parallel_calls=c['threads'])
    dataset = dataset.shuffle(buffer_size=c['shuffle_buff'])
dataset = dataset.repeat()
dataset = dataset.batch(c['batch'])
dataset = dataset.prefetch(2 * c['batch'])
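What repeat() and batch() do to the element stream can be mimicked with plain generators (shuffle is omitted, and prefetch is not modeled: it overlaps producing batch N+1 with consuming batch N rather than changing any values). An illustrative sketch:

```python
import itertools

def repeat(dataset):
    # like dataset.repeat(): cycle through the examples indefinitely
    return itertools.cycle(dataset)

def batch(iterator, batch_size):
    # like dataset.batch(): group consecutive elements into fixed-size lists
    while True:
        yield list(itertools.islice(iterator, batch_size))

examples = [0, 1, 2, 3, 4]
batches = batch(repeat(examples), batch_size=2)
first_three = [next(batches) for _ in range(3)]  # batches cross epoch boundaries
```

Because repeat() comes before batch(), batches can span epoch boundaries, which is why the third batch here mixes the end of one pass with the start of the next.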
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
features = {'images': images}
return features, labels
train_spec = tf.estimator.TrainSpec(input_fn=lambda: dataset_input_fn(train_params))
eval_spec = tf.estimator.EvalSpec(input_fn=lambda: dataset_input_fn(eval_params))
Distributed Training

Parallelism is Not Automatic
Data and Model Parallelism. On a single device, the graph computes ypred = WX + b (X flows into a matmul with W, then an add with b). Data parallelism: split X into shards and run a complete replica of the graph on each device (gpu:0, gpu:1), one shard per replica. Model parallelism: split the graph itself across devices, each computing part of the matmul/add for the full X. The two combine: for example, two model-parallel pairs (gpu:0/gpu:1 and gpu:2/gpu:3), each pair handling one shard of X.
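The data-parallel picture, split X, replicate the graph, run one shard per device, is easy to verify numerically. A NumPy sketch with plain arrays standing in for the per-GPU graph replicas:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))    # model parameters, replicated on every device
b = rng.normal(size=(3, 1))
X = rng.normal(size=(8, 10))   # a batch of 10 examples, one per column

# single device: ypred = WX + b
ypred = W @ X + b

# data parallelism: split the batch, run the same graph per "device", concat
shards = np.split(X, 2, axis=1)               # gpu:0 and gpu:1 get 5 examples each
parts = [W @ shard + b for shard in shards]   # identical graph on each shard
ypred_parallel = np.concatenate(parts, axis=1)
```

The forward pass decomposes cleanly over the batch dimension; the part that does not decompose for free is combining the per-replica gradients, which is what the next slides address.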
Output Reduction. Each replica emits [grad, ypred, loss]. These per-replica outputs are reduced into model-level outputs: gradients are averaged, predictions are concatenated, and losses are summed, yielding [grad(model), ypred(model), loss(model)]; the averaged gradient then updates the model parameters.
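Those three reductions, avg for gradients, concat for predictions, sum for losses, in a NumPy sketch with toy per-replica outputs:

```python
import numpy as np

# toy [grad, ypred, loss] outputs from two replicas, gpu:0 and gpu:1
grads = [np.array([0.2, -0.4]), np.array([0.4, 0.0])]
ypreds = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
losses = [0.5, 0.25]

grad_model = np.mean(grads, axis=0)    # avg: one gradient to update the params
ypred_model = np.concatenate(ypreds)   # concat: predictions for the full batch
loss_model = float(np.sum(losses))     # sum: total loss across replicas
```

Averaging the gradients keeps the effective update equivalent to one step on the combined batch, which is why gradients get avg while predictions simply get stitched back together.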
Output reduction can be organized two ways: naive consolidation funnels every GPU's outputs through a single cpu/gpu, while ring all-reduce passes partial results around a ring of GPUs (gpu:0 -> gpu:1 -> gpu:2 -> gpu:3 -> gpu:0) with no central bottleneck. (Baidu All-Reduce: Andrew Ng)
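Naive consolidation makes one device a bandwidth bottleneck; in ring all-reduce each worker exchanges only 2*(N-1) chunks regardless of ring size. A minimal pure-Python simulation of sum-all-reduce (reduce-scatter, then all-gather), an illustrative sketch rather than any library's implementation:

```python
def ring_allreduce(vectors):
    """Sum all-reduce over a ring of workers.

    vectors: one list per worker, with one chunk (here, one number) per worker.
    Returns the final per-worker state; every worker ends with the full sum.
    """
    n = len(vectors)
    state = [list(v) for v in vectors]

    # Phase 1, reduce-scatter: in step s, worker i sends chunk (i - s) % n to
    # its ring neighbor, which accumulates it. After n-1 steps, worker i holds
    # the fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, state[i][(i - step) % n]) for i in range(n)]
        for i, chunk, value in sends:
            state[(i + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate each finished chunk around the ring so
    # every worker ends up holding every fully summed chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, state[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, chunk, value in sends:
            state[(i + 1) % n][chunk] = value

    return state

# three workers, each contributing a 3-chunk gradient
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Each worker only ever talks to its ring neighbor, so total traffic per link stays constant as workers are added, which is the property the all-reduce options on the next slide exploit.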
Distributed Accelerator Options
- replicate_model_fn (TF 1.5+; naive consolidation; single server)
- distribute.MirroredStrategy (TF 1.8+; all-reduce; single server for now)
- TPUEstimator (TF 1.4+; all-reduce; 64-TPU pod)
- Horovod (TF 1.8+; all-reduce; multi server)
- multi_gpu_model (Keras; naive consolidation; single server)
(Meet Horovod: Alex Sergeev, Mike Del Balso)
TPU / TPU Pod. [Table comparing TPU count, batch size, time to 90 epochs, and accuracy; the numeric counts, batch sizes, and times did not survive transcription. Reported accuracies: 76.1%, 75.0%, 76.1%.] (ImageNet is the New MNIST)
Summary
tf.estimator:
- Premade Estimators for quick-and-dirty jobs (5-part recipe)
- Custom models w/o the plumbing (5-part model_fn)
- Keras Functional API for Inference() in a custom model_fn()
- Use checkpoints / preemptible VMs / Cloud Storage to reduce cost
- Scalable model building from laptop to GPUs to TPU Pods
TFRecords & tf.data for pipeline performance:
- Parallelize creation of TFRecord files (~ MB)
- Use a tf.data input_fn()
Distributed accelerator options:
- Fix the input pipeline before buying bigger/faster/more accelerators
- Scale up before scaling out
Thank you.
More informationLecture 3: Overview of Deep Learning System. CSE599W: Spring 2018
Lecture 3: Overview of Deep Learning System CSE599W: Spring 2018 The Deep Learning Systems Juggle We won t focus on a specific one, but will discuss the common and useful elements of these systems Typical
More informationMIXED PRECISION TRAINING OF NEURAL NETWORKS. Carl Case, Senior Architect, NVIDIA
MIXED PRECISION TRAINING OF NEURAL NETWORKS Carl Case, Senior Architect, NVIDIA OUTLINE 1. What is mixed precision training with FP16? 2. Considerations and methodology for mixed precision training 3.
More informationDEEP NEURAL NETWORKS AND GPUS. Julie Bernauer
DEEP NEURAL NETWORKS AND GPUS Julie Bernauer GPU Computing GPU Computing Run Computations on GPUs x86 CUDA Framework to Program NVIDIA GPUs A simple sum of two vectors (arrays) in C void vector_add(int
More informationDynamic Routing Between Capsules
Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet
More informationStreet Address Image Recognition (CNN) Machine Learning Engineer Nanodegree
Street Address Image Recognition (CNN) Machine Learning Engineer Nanodegree YOONSU PARK Oct 22 2016 Definition Project overview With the big advance in computing power with the enhanced deep learning technique,
More informationTraining Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL
Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL 1 THIS TALK Using mixed precision and Volta your networks can be: 1. 2-4x faster 2. half the
More informationTwo-Stream Convolutional Networks for Action Recognition in Videos
Two-Stream Convolutional Networks for Action Recognition in Videos Karen Simonyan Andrew Zisserman Cemil Zalluhoğlu Introduction Aim Extend deep Convolution Networks to action recognition in video. Motivation
More informationA performance comparison of Deep Learning frameworks on KNL
A performance comparison of Deep Learning frameworks on KNL R. Zanella, G. Fiameni, M. Rorro Middleware, Data Management - SCAI - CINECA IXPUG Bologna, March 5, 2018 Table of Contents 1. Problem description
More informationFrameworks in Python for Numeric Computation / ML
Frameworks in Python for Numeric Computation / ML Why use a framework? Why not use the built-in data structures? Why not write our own matrix multiplication function? Frameworks are needed not only because
More informationConvolutional Neural Networks and Supervised Learning
Convolutional Neural Networks and Supervised Learning Eilif Solberg August 30, 2018 Outline Convolutional Architectures Convolutional neural networks Training Loss Optimization Regularization Hyperparameter
More informationIST 597 Deep Learning Overfitting and Regularization. Sep. 27, 2018
IST 597 Deep Learning Overfitting and Regularization 1. Overfitting Sep. 27, 2018 Regression model y 1 3 x3 13 2 x2 36x10 import numpy as np import matplotlib.pyplot as plt from sklearn.linear_model import
More informationPouya Kousha Fall 2018 CSE 5194 Prof. DK Panda
Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training
More informationEmel: Deep Learning in One Line using Type Inference, Architecture Selection, and Hyperparameter Tuning
Emel: Deep Learning in One Line using Type Inference, Architecture Selection, and Hyperparameter Tuning BARAK OSHRI and NISHITH KHANDWALA We present Emel, a new framework for training baseline supervised
More informationMini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class
Mini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class Guidelines Submission. Submit a hardcopy of the report containing all the figures and printouts of code in class. For readability
More informationPointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space Sikai Zhong February 14, 2018 COMPUTER SCIENCE Table of contents 1. PointNet 2. PointNet++ 3. Experiments 1 PointNet Property
More informationQuo Vadis, Action Recognition? A New Model and the Kinetics Dataset. By Joa õ Carreira and Andrew Zisserman Presenter: Zhisheng Huang 03/02/2018
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset By Joa õ Carreira and Andrew Zisserman Presenter: Zhisheng Huang 03/02/2018 Outline: Introduction Action classification architectures
More informationImageNet Classification with Deep Convolutional Neural Networks
ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture
More informationAnalyzing rtweet data with kerasformula
Analyzing rtweet data with kerasformula Pete Mohanty BARUG March 2018 hosted by Google Overview Now on CRAN, kerasformula package offers a high-level interface for the R interface to Keras. install.packages("kerasformula")
More informationPolytechnic University of Tirana
1 Polytechnic University of Tirana Department of Computer Engineering SIBORA THEODHOR ELINDA KAJO M ECE 2 Computer Vision OCR AND BEYOND THE PRESENTATION IS ORGANISED IN 3 PARTS : 3 Introduction, previous
More informationDeep Model Compression
Deep Model Compression Xin Wang Oct.31.2016 Some of the contents are borrowed from Hinton s and Song s slides. Two papers Distilling the Knowledge in a Neural Network by Geoffrey Hinton et al What s the
More information