PREDICTION OVERVIEW
In this experiment, two of the Project PEACH datasets are used to predict a user's reaction to atmospheric factors. This experiment represents the first iteration of the Machine Learning Process; further iterations and explorations should be performed in order to achieve the desired performance of the model. As a reference, a version of the Machine Learning Process is described in the diagram below. This experiment can be found in the Cortana Analytics Gallery: https://gallery.cortanaanalytics.com/experiment/6bf324d25ae24ca19b20494e12c3b44d
LOAD DATA
There are multiple ways to bring data into Azure Machine Learning Studio. For this sample experiment the datasets were uploaded from local files (af_data.csv and the user profile dataset).

1. Upload data from local file
Click on New at the bottom left of the page in Azure ML Studio. Go to Dataset -> From Local File.
Choose the file to upload from your local machine, check the correct options are entered and click OK. Repeat this for all datasets you want to upload. The datasets uploaded should now appear under Saved Datasets -> My Datasets when you start a new experiment.
2. Use the Reader module
The Reader module can be used to load data into Azure ML from various data sources. You can choose from public Web URL, Hive Query, Azure table, Azure blob and Data feed provider. As an example, you can load the Project PEACH datasets, which are stored in Azure blob storage (more info at http://aka.ms/uclhack). You need to insert the URI and choose the file format as shown below.
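Outside Azure ML Studio, the same kind of blob-hosted CSV can be loaded with a few lines of Python. This is only a sketch; the URL below is a placeholder, not the actual Project PEACH blob URI:

```python
import io
import pandas as pd

# Placeholder URI: substitute the real Project PEACH blob URL here.
BLOB_URI = "https://example.blob.core.windows.net/peach/af_data.csv"

def load_csv(source) -> pd.DataFrame:
    # pandas can read directly from an HTTP(S) URL or a file-like object,
    # mirroring the Reader module's "Web URL" / "Azure blob" sources.
    return pd.read_csv(source)
```

The Reader module handles authentication and format options through its properties pane; in pandas the equivalent knobs are `read_csv` parameters such as `sep` and `header`.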
EXPLORE DATA AND ENGINEER FEATURES
Once your datasets are in Azure ML, you can drag them into the workspace. You can visualize the data by right-clicking on the output of the dataset module and selecting Visualize. The User Profile dataset has 200000 rows and 21 columns, and you can see a preview of the data. If you click on one of the columns, some basic statistics appear on the right-hand side. Visualize the atmospheric factors dataset. The TIMESTAMP column is a numeric feature and the format of the date is not one recognized by Azure ML. To convert this column into a DateTime feature, you need to first convert it to a string and then use some simple R code to convert it to a format accepted by Azure ML.
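The per-column statistics that Visualize shows can be approximated in pandas. The column names below come from this walkthrough, but the sample values are invented for illustration:

```python
import pandas as pd

# Tiny invented sample mimicking the atmospheric factors dataset.
df = pd.DataFrame({
    "TIMESTAMP": [20150101, 20150102, 20150103],
    "EXPOSURE_LEVEL": [1, 2, 2],
})

# Shape and summary statistics, comparable to the Visualize pane.
print(df.shape)                           # (rows, columns)
print(df["EXPOSURE_LEVEL"].describe())    # count, mean, std, quartiles
```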
Search for the Metadata Editor module in the search box at the top-left of the page and drag the module onto the workspace. Connect the output of the Atmospheric dataset to the input of the Metadata Editor. Click on the Metadata Editor module and its Properties will appear on the right-hand side. Click Launch column selector and include the TIMESTAMP column in the pop-up window as shown below.
Insert an Execute R Script module onto the workspace and click on it. On the right-hand side you will see a box filled with sample R code. Delete the existing code and insert the code from the text box below. This code reads the values in the TIMESTAMP column and converts them to a standard date format that is readable by Azure ML.
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
dataset1$TIMESTAMP <- as.Date(dataset1$TIMESTAMP, format="%Y%m%d")
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("dataset1");

Click Run at the bottom of the page to run the experiment. As you progress through this sample, run the experiment after inserting new modules so you can visualize their output and check everything is running as expected. After the experiment has finished running, right-click the output of the Execute R Script module and select Visualize. The TIMESTAMP column should now be in a different format, as shown below. The ATMOSPHERIC_CONDITION column is a string feature and the EXPOSURE_LEVEL column is a numeric feature. These columns only take a specific number of values, so their data type should be converted to categorical. This can be achieved using another Metadata Editor module.
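For reference, the same conversion (a numeric yyyymmdd value to a proper date type) looks like this in pandas; the 20150101-style raw values are an assumption about the source format:

```python
import pandas as pd

df = pd.DataFrame({"TIMESTAMP": [20150101, 20150215]})

# As in the R script: go via a string, then parse with an explicit format.
df["TIMESTAMP"] = pd.to_datetime(df["TIMESTAMP"].astype(str), format="%Y%m%d")
```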
Drag another Metadata Editor module into the workspace and change its properties as shown below.
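The Metadata Editor's categorical conversion corresponds to pandas' category dtype. The column names are from this walkthrough; the values are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "ATMOSPHERIC_CONDITION": ["rain", "sun", "rain"],
    "EXPOSURE_LEVEL": [1, 3, 1],
})

# Mark both columns as categorical, mirroring the Metadata Editor step.
for col in ["ATMOSPHERIC_CONDITION", "EXPOSURE_LEVEL"]:
    df[col] = df[col].astype("category")
```

Treating a small, fixed set of values as categories (rather than free strings or continuous numbers) lets downstream learners handle them as discrete levels.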
The next step is to join the two datasets so that a machine learning model can be trained on the combined data. Drag the Join module into the workspace and connect the output of the Metadata Editor module to the left-hand input and the output of the User Profile dataset to the right-hand input of the Join module. The two datasets are joined on User ID. Click on the Join module and set its properties as shown below. Make sure the Keep the right key columns in the joined table box is unticked.
Run the experiment and, once it has finished running, right-click the output of the Join module and select Visualize. You should see a joined dataset as below.
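The Join module's behavior here, an inner join on the user key that keeps only one copy of the key column, matches pandas' merge. The USER_ID column name and the sample rows are assumptions for illustration:

```python
import pandas as pd

atmos = pd.DataFrame({"USER_ID": [1, 2], "EXPOSURE_LEVEL": [3, 1]})
users = pd.DataFrame({"USER_ID": [1, 2], "AGE": [34, 51]})

# Inner join on the key; merge keeps a single USER_ID column, like
# unticking "Keep the right key columns in the joined table".
joined = atmos.merge(users, on="USER_ID", how="inner")
```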
TRAIN, SCORE AND EVALUATE MACHINE LEARNING MODELS
The data needs to be split into two sets: a training set and a testing set. The training set is used to train the machine learning models and the testing set is used to measure the performance of the trained model. To split the data, we use the Split Data module. Set the properties of the Split Data module to: Splitting mode Split Rows, Fraction of rows in the first output dataset 0.7, Random seed 0 and Stratified split False. Make sure the Randomized split checkbox is ticked. The first output of this module will contain 70% of the data (the training set) and the second output will contain the remaining 30% (the testing set).

Next we need to choose a machine learning model for this prediction problem. We want to predict whether the user has no reaction or a negative reaction to the atmospheric factors. The user reaction is recorded in the dataset, so we will use this information to make future predictions. Hence, in this scenario we use the class of machine learning algorithms called supervised learning, as our data points are labeled. More specifically, we will use binary classification algorithms to predict the user reaction to atmospheric factors.

For this sample, we select the Two-class boosted decision tree model. Search for this module and drag it into the workspace. Add a Train Model module. Connect the output of the Two-class boosted decision tree to the left-hand input of the Train Model module and the first output of the Split Data module to the right-hand input of the Train Model module as shown below. Click on the Train Model module and set the Label column to USER'S FEEDBACK using the Launch column selector. Add a Score Model module to the workspace and connect it to the Train Model and the Split Data modules as shown below. In this sample, we want to compare two different models to determine which one performs better for this particular problem. Add a Two-class logistic regression module and connect it to new Train Model and Score Model modules as in the previous step. You should obtain something like the figure below.
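The Split Data / Train Model / Score Model chain above can be sketched with scikit-learn. The synthetic features and labels are invented, and GradientBoostingClassifier stands in for Azure ML's Two-class boosted decision tree:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                          # invented features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # invented binary label

# 70/30 randomized split with a fixed seed, matching the Split Data settings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

# "Train Model": fit both candidate models on the same training set.
boosted = GradientBoostingClassifier().fit(X_train, y_train)
logreg = LogisticRegression().fit(X_train, y_train)

# "Score Model": predictions on the held-out 30%.
scores = {
    "boosted tree": boosted.predict(X_test),
    "logistic regression": logreg.predict(X_test),
}
```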
In order to compare and evaluate these trained models, we use the Evaluate Model module connected as shown below. Run the experiment and, once it has finished running, right-click the output of the Evaluate Model module and select Visualize. The Evaluation Results page shows different metrics and performance measures for your trained algorithms. For example, the accuracy of the boosted decision tree algorithm is 0.912, very similar to the accuracy of the logistic regression algorithm, which is 0.911. The boosted decision tree performs better in terms of the ROC curve, while the logistic regression performs better in terms of precision.
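The metrics the Evaluate Model module reports (accuracy, the ROC curve's area, precision) have direct scikit-learn equivalents; the toy labels and probabilities below are invented:

```python
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]              # invented ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]              # invented hard predictions
y_prob = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]  # invented predicted probabilities

acc = accuracy_score(y_true, y_pred)     # fraction of correct predictions
prec = precision_score(y_true, y_pred)   # correct positives / predicted positives
auc = roc_auc_score(y_true, y_prob)      # area under the ROC curve
```

Comparing models on several metrics at once, as the Evaluation Results page does, matters because one model can win on accuracy while losing on precision or AUC.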