CP365 Artificial Intelligence
Tech News!
- Apple news conference tomorrow?
- Google cancels Project Ara modular phone
Weather-Based Stock Market Predictions?
Dataset Preparation
- Clean: remove bogus data / fill in missing data
- Normalize: adjust features to be similar magnitudes
Deal with Missing Data
- Option 1: remove datapoints with any missing feature values
- Option 2: fill in missing data with <data_missing> tags for categorical data
- Option 3: fill in missing data with global means for numeric data
- Option 4: fill in missing data with values from similar data points
Remove Outliers
Some datapoints may have ridiculous feature values. We can remove outliers from our dataset to increase performance. What is an outlier?
Outliers

Patient Height (cm)   Patient Weight (kg)   ...   Prognosis
131.2                 59.2                  ...   Good
176.7                 82.9                  ...   Good
12613.9               66.0                  ...   Poor
161.0                 70.2                  ...   Poor
Outliers
The 12613.9 cm height is an obvious outlier. How can we define what makes an outlier? We could use 3σ as the threshold.
Outliers
The height column has mean 156.3 and σ = 23.1 (without the possible outlier). The 3σ thresholds would be (156.3 − 3 · 23.1, 156.3 + 3 · 23.1), or (87.0, 225.6).
A Bad Dataset
How will these large differences affect learning?

Patient Height (nm)   Patient Weight (tons)   ...   Prognosis
1.31 × 10⁹            0.065                   ...   Good
1.76 × 10⁹            0.091                   ...   Good
1.23 × 10⁹            0.073                   ...   Poor
1.61 × 10⁹            0.077                   ...   Poor
Data Normalization Procedure
Patient Height (nm): 1.31 × 10⁹, 1.76 × 10⁹, 1.23 × 10⁹, 1.61 × 10⁹
Range of extreme values: 1.23 × 10⁹ to 1.76 × 10⁹
Normalized range mapping: 0.0 (or −1.0) to 1.0
Data Normalization Formula
Patient Height (nm): 1.31 × 10⁹, 1.76 × 10⁹, 1.23 × 10⁹, 1.61 × 10⁹
Say we want the normalized value, newpt, for the first height, 1.31 × 10⁹, called pt.
oldmax = 1.76 × 10⁹, oldmin = 1.23 × 10⁹, newmax = 1.0, newmin = 0.0

newpt = (pt − oldmin) / (oldmax − oldmin) × (newmax − newmin) + newmin
newpt = 0.15
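The normalization formula above can be sketched as a small Python helper (the function name `normalize` and the variable names are my own, not from the slides):

```python
def normalize(pt, oldmin, oldmax, newmin=0.0, newmax=1.0):
    """Linearly map pt from [oldmin, oldmax] onto [newmin, newmax]."""
    return (pt - oldmin) / (oldmax - oldmin) * (newmax - newmin) + newmin

# The patient heights from the slide, in nm
heights_nm = [1.31e9, 1.76e9, 1.23e9, 1.61e9]
lo, hi = min(heights_nm), max(heights_nm)
normalized = [round(normalize(h, lo, hi), 2) for h in heights_nm]
# The first height maps to (1.31 - 1.23) / (1.76 - 1.23) ≈ 0.15
```

Note that the old min and max come from the training data; values seen later may fall slightly outside the new range.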
How do we know if an ML model is any good?
Overfitting
[Figure: testing error plotted against training epoch]
A Biological Neuron
Human Brain
How many neurons?

Animal             Neurons (cerebral cortex)
Rat                20,000,000
Dog                160,000,000
Cat                300,000,000
Pig                450,000,000
Horse              1,200,000,000
Dolphin            5,800,000,000
African Elephant   11,000,000,000
Human              20,000,000,000
How many connections?

Human                      100,000,000,000,000
Google (2012)              1,700,000,000
Google/Stanford (2013)     11,200,000,000
Digital Reasoning (2015)   160,000,000,000
Artificial Neuron
[Diagram: input connections with weights w1, w2, w3 feed a threshold function, which drives the output connections]
Hard Threshold
S = sum of all input_i * weight_i
if S > THRESHOLD: output = 1
else: output = 0
Hard Threshold: Step Function
Write down artificial neurons with weights and thresholds that model the following functions:
- Identity
- Logical AND
- Logical OR
- Logical XOR
- Constant function
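One way to check answers to the exercise above is to simulate a hard-threshold neuron directly. The weights and thresholds below are one possible solution, not the only one:

```python
def neuron(inputs, weights, threshold):
    """Hard-threshold artificial neuron: fire iff the weighted input sum exceeds the threshold."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s > threshold else 0

# One set of weights/thresholds that realizes the exercise functions
AND      = lambda x1, x2: neuron([x1, x2], [1, 1], 1.5)
OR       = lambda x1, x2: neuron([x1, x2], [1, 1], 0.5)
IDENTITY = lambda x: neuron([x], [1], 0.5)
CONSTANT = lambda x: neuron([x], [0], -1)  # threshold below zero, so it always fires

# XOR is not linearly separable: no single neuron's weights and threshold can model it.
```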
Sigmoid Threshold
S = sum of all input_i * weight_i
output = 1 / (1 + e^(−S))
Sigmoid Threshold: 'S' Function
sigmoid
Weights: w1 = 0.1, w2 = 0.2, w3 = 0.42
Features: x1 = 0.66, x2 = 0.11, x3 = 0.20
Output Calculations
s = w1 * x1 + w2 * x2 + w3 * x3
s = 0.1 * 0.66 + 0.2 * 0.11 + 0.42 * 0.2
s = 0.17
1 / (1 + e^(−0.17)) = 0.54
y1 = 0.54
sigmoid
Weights: w1 = 0.1, w2 = 0.2, w3 = 0.42
Features: x1 = 0.66, x2 = 0.11, x3 = 0.20
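The output calculation above can be reproduced in a few lines of Python (variable names are my own):

```python
import math

def sigmoid(s):
    """Sigmoid threshold function: 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

weights  = [0.1, 0.2, 0.42]
features = [0.66, 0.11, 0.20]

# Weighted sum: 0.066 + 0.022 + 0.084
s = sum(w * x for w, x in zip(weights, features))
y1 = sigmoid(s)
print(round(s, 3), round(y1, 2))  # 0.172 0.54
```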
Perceptron Network Output Layer Input Layer
Perceptron: Linear Boundary
Linear Boundary?
Multilayer Network Output Layer Hidden Layer(s) Input Layer
ANN Learning
How do we get the weights?
[Figure: error surface plotted over weight1 and weight2]
ANN Learning
How do we get the right weights?
- Perceptron: gradient descent
- Multilayer network: backpropagation
Node Activation Function
a_j = g(input_j) = g( Σ_{i=0}^{n} w_ij · a_i )
The activation (output) of node j: g is the threshold activation function, applied to the sum of all weighted input activations.
Minimize Global Error Function
error = Σ_j (t_j − a_j)²
For every output node j, sum the squared difference between the target value t_j and the generated output value a_j.
Perceptron Learning
Δw_ij = η (t_j − a_j) · a_i
Update the weight on connection i → j: η is the learning rate (0.3ish), (t_j − a_j) is the difference between target and generated output, and a_i is the input activation.
Let's learn NAND!
Starting weight values: W1 = 0.81, W2 = 0.55, W3 = 0.16 (W3 connects a constant bias input of 1.0)
η = 0.3, use sigmoid threshold
a_j = g(input_j) = g( Σ_{i=0}^{n} w_ij · a_i )
Δw_ij = η (t_j − a_j) · a_i

Dataset: NAND
Input1   Input2   Label
0        0        1
0        1        1
1        0        1
1        1        0
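A minimal sketch of this exercise, assuming the perceptron update rule from the previous slide, the slide's starting weights, η = 0.3, and W3 wired to a constant bias input of 1.0 (epoch count and variable names are my own):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# NAND dataset: (input1, input2, target label)
data = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

w = [0.81, 0.55, 0.16]  # W1, W2, W3 (W3 weights the constant 1.0 bias input)
eta = 0.3

for epoch in range(5000):
    for in1, in2, t in data:
        x = [in1, in2, 1.0]
        a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # Perceptron rule: delta_w_i = eta * (t - a) * x_i
        for i in range(3):
            w[i] += eta * (t - a) * x[i]

outputs = [round(sigmoid(sum(wi * xi for wi, xi in zip(w, [i1, i2, 1.0]))))
           for i1, i2, _ in data]
# The rounded outputs should reproduce the NAND labels [1, 1, 1, 0]
```

NAND is linearly separable, so a single neuron can learn it; the trained weights end up negative on the two inputs with a positive bias weight.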
ANN Learning - Backpropagation Output Layer Hidden Layer Input Layer Put in input values and feed the activation forward to produce the output.
ANN Learning - Backpropagation Output Layer Hidden Layer Input Layer Calculate the error in the output layer and then backpropagate it to update lower weights.
ANN Learning - Backpropagation
Δw_ij = η δ_j a_i
Update the weight on connection i → j: a_i is the input activation, and δ_j is the error measure for node j (computed differently for output and hidden nodes).
ANN Learning - Backpropagation for Output Nodes
δ_j = a_j (1 − a_j) (t_j − a_j)
Error measure for output node j: a_j (1 − a_j) is the derivative of the sigmoid function, and (t_j − a_j) is the difference between target and generated output.
ANN Learning - Backpropagation for Hidden Nodes
δ_j = a_j (1 − a_j) Σ_k δ_k w_jk
Error measure for hidden node j: a_j (1 − a_j) is the derivative of the sigmoid function, and Σ_k δ_k w_jk combines the errors of the output nodes this node's weights contribute to.
ANN Learning
Initialize random network weights
for epoch in range NUMBER_EPOCHS:
    Train network on random presentation of instances
    Update weights with backpropagation
    Report global error function value
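The training loop and the δ formulas above can be combined into a small 2-2-1 network that learns XOR, the function a single perceptron cannot model. This is a sketch under my own assumptions (random seed, 10,000 epochs, η = 0.5, bias inputs on every node); it is not code from the course:

```python
import math
import random

random.seed(0)

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# XOR dataset: 2 inputs -> 2 hidden nodes -> 1 output node
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
eta = 0.5

# w_ih[i][j]: weight from input i to hidden node j (i == 2 is a constant 1.0 bias input)
w_ih = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
# w_ho[j]: weight from hidden node j to the output node (j == 2 is the output bias)
w_ho = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    h = [sigmoid(x[0] * w_ih[0][j] + x[1] * w_ih[1][j] + w_ih[2][j]) for j in range(2)]
    o = sigmoid(h[0] * w_ho[0] + h[1] * w_ho[1] + w_ho[2])
    return h, o

def global_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

err_start = global_error()
for epoch in range(10000):
    for x, t in data:
        h, o = forward(x)
        delta_o = o * (1 - o) * (t - o)                   # output-node error measure
        delta_h = [h[j] * (1 - h[j]) * delta_o * w_ho[j]  # hidden-node error measures
                   for j in range(2)]
        for j in range(2):                                # hidden -> output updates
            w_ho[j] += eta * delta_o * h[j]
        w_ho[2] += eta * delta_o                          # output bias
        for j in range(2):                                # input -> hidden updates
            w_ih[0][j] += eta * delta_h[j] * x[0]
            w_ih[1][j] += eta * delta_h[j] * x[1]
            w_ih[2][j] += eta * delta_h[j]                # hidden biases
err_end = global_error()

predictions = [round(forward(x)[1]) for x, _ in data]
```

Note the hidden δ values are computed with the pre-update output weights, matching the slide's formula; depending on the random initialization, XOR training can occasionally stall in a local minimum, but the global error still falls from its starting value.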
Choosing the Learning Rate, η What happened when our learning rate was too high for linear regression? How do we choose an appropriate learning rate for ANNs?
Bold Driver
After each epoch...
if error went down: η = η * 1.05
else: η = η * 0.50
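The bold driver heuristic is a one-line conditional; a sketch as a helper function (the name `bold_driver` and the example error values are mine):

```python
def bold_driver(eta, prev_error, curr_error):
    """Adapt the learning rate after each epoch using the bold driver heuristic."""
    if curr_error < prev_error:
        return eta * 1.05   # error went down: cautiously speed up
    return eta * 0.50       # error went up: back off hard

eta = 0.3
eta = bold_driver(eta, prev_error=1.2, curr_error=1.0)  # error fell, so eta grows by 5%
```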
Choosing the Network Structure
How many nodes? What are their connections?
- The number of output nodes is determined by the number of function outputs.
- The number of input nodes is determined by the number of function inputs.
- Too few hidden nodes: unable to get a detailed enough approximation of the target function.
- Too many hidden nodes: slower to train and easier to overfit training data.
ANN Representational Power
- With one hidden layer: model all continuous functions
- With two hidden layers: model all functions
Rules of Thumb
- Use 1 or 2 hidden layers
- Use about (2/3)n hidden nodes for reasonably complex functions
- Don't train for too many epochs
Splitting up datasets
- Training data: used to train your ML model
- Validation data: used to improve your ML model while training
- Testing data: used to test performance of your ML model
K-Fold Cross Validation
[Diagram: the full dataset split into k chunks]
K-Fold Cross Validation: Pass 1
[Diagram: one chunk held out as the validation dataset, the rest used as the training dataset]
K-Fold Cross Validation: Pass 2
[Diagram: a different chunk held out as the validation dataset]
K-Fold Cross Validation
- Perform k training/validation passes
- Each pass counts as a classification accuracy sample
- Extreme case: k = dataset size, i.e. leave-one-out testing
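The splitting scheme above can be sketched as a generator (the function name `k_fold_splits` is my own; this simple version assumes the dataset size divides evenly by k and does no shuffling):

```python
def k_fold_splits(dataset, k):
    """Yield (training, validation) splits for k-fold cross validation."""
    fold_size = len(dataset) // k
    for i in range(k):
        start, end = i * fold_size, (i + 1) * fold_size
        validation = dataset[start:end]          # the held-out chunk for this pass
        training = dataset[:start] + dataset[end:]  # everything else
        yield training, validation

data = list(range(10))
splits = list(k_fold_splits(data, 5))
# Pass 1 validates on the first chunk; k = len(data) gives leave-one-out testing
```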
ANN Implementation?
Break!