Ensemble of Specialized Neural Networks for Time Series Forecasting. Slawek Smyl ISF 2017

Ensemble of Specialized Neural Networks for Time Series Forecasting Slawek Smyl slawek@uber.com ISF 2017

Ensemble of Predictors Ensembling a group of predictors (preferably diverse), or choosing one of them to do the forecast in a particular situation, is an attractive idea that is being actively researched. Dr. Sven F. Crone, in his talk* at the 2016 International Joint Conference on Neural Networks (IJCNN), reported a method of creating diversity in a group of neural networks by training them on different subsets of features. My work was inspired by his talk, but it uses a different method of achieving the diversity and also attempts the more traditional (and usually unsuccessful :)) goal of choosing the best predictor from a pool. *Sven F. Crone and Stephan Häger, Feature selection of autoregressive Neural Network inputs for trend Time Series Forecasting, 2016 International Joint Conference on Neural Networks (IJCNN)

The Idea 1. Create a pool of NNs that specialize in (i.e. are particularly good at) a subset of time series. 2. Find a good way either to ensemble them or to choose one of them to do the forecast. An obvious way to approach task 1 is to apply clustering, but it is not obvious what metric should be used, as a standard metric may not have much to do with forecasting performance.

Data Preprocessing All experiments were done on the 1428 monthly time series of the M3 Competition data set (package Mcomp in R).

Data Preprocessing, cont. All inputs and outputs are normalized (i.e. divided) by some reasonably stable value reflecting the size of the series near the end of the output window. Here it is the last point of the deseasonalized and smoothed input.
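
A minimal NumPy sketch of building one normalized record; the function and window lengths are hypothetical, and a 12-month trailing moving average stands in for the deseasonalized, smoothed input (the exact smoother is not specified on the slide):

import numpy as np

def make_record(series, in_len=12, out_len=18, t=None):
    # Build one (input window, output window) pair with the input ending at position t,
    # normalized by a stable level value taken near the end of the input.
    if t is None:
        t = len(series) - out_len                 # last record for this series
    x = np.asarray(series[t - in_len:t], dtype=float)
    y = np.asarray(series[t:t + out_len], dtype=float)
    # Last value of a 12-month trailing moving average: a rough stand-in for the
    # last point of the deseasonalized, smoothed input used in the talk.
    level = np.mean(series[t - 12:t])
    return x / level, y / level, level            # keep level to de-normalize forecasts later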

Data Preprocessing, cont. This is a simple preprocessing. Instead we could, e.g.: 1. Create several (perhaps overlapping) smaller output windows. 2. Apply a statistical algorithm to the time series (up to and including the last input point) and extract some features (e.g. acf, ETS model types, an ETS forecast).

Data Preprocessing, cont. Last record in the training set (for this time series)

Data Preprocessing, cont. Last record (and the only one used) in the validation set (for this time series). Because M3 time series are short, each of them is split into a training part and a non-overlapping validation part composed of the last 18 points (= the forecast horizon). I am not doing proper backtesting.

Neural Network Architecture Applying the idea of ResNet (Residual Networks) while stacking LSTM (Long Short-Term Memory) recurrent neural networks. [Diagram: Inputs -> LSTM1 -> LSTM2 (with a ResNet-style skip connection across the stack) -> standard hidden layer (no bias) -> Output, a vector of max forecast horizon length.]
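
A minimal sketch of this stacking in CNTK's Python API (the toolkit named later in these slides); the layer sizes are hypothetical and the exact placement of the skip connection is my assumption, it only illustrates the ResNet idea of adding a lower LSTM's output to the one above it:

import cntk as C

INPUT_DIM, HIDDEN_DIM, HORIZON = 13, 40, 18                    # hypothetical sizes

x = C.sequence.input_variable(INPUT_DIM)
h1 = C.layers.Recurrence(C.layers.LSTM(HIDDEN_DIM))(x)         # LSTM1
h2 = C.layers.Recurrence(C.layers.LSTM(HIDDEN_DIM))(h1)        # LSTM2 stacked on LSTM1
h = h1 + h2                                                    # ResNet-style skip connection
hid = C.layers.Dense(HIDDEN_DIM, activation=C.tanh, bias=False)(h)  # hidden layer, no bias (activation assumed)
out = C.layers.Dense(HORIZON)(hid)                             # output: vector of max forecast horizon length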

Loss Function Many neural network systems allow you to specify a custom loss function, which opens the possibility of: optimizing for a metric (e.g. smape); using pinball loss to make a neural network do quantile regression, and therefore, by running it several times, getting a number of prediction intervals. E.g. in CNTK, using Python, one can define a smape loss as:

import cntk.ops as o  # assuming o is an alias for CNTK's ops module, as the slides' code suggests

def smapeloss(z, t):
    # z and t are the forecast and the actual (the formula is symmetric in them)
    loss = o.reduce_mean(o.abs(t - z) / (t + z)) * 200
    return loss

This definition will work even if actuals and forecasts are normalized by dividing by a common level.

Task 1: Creating a Pool of Specialized NNs
Start by randomly allocating a part, e.g. half, of the time series to each network (e.g. 7 of them). Then, in a loop, for each network:
- Execute a single training pass on the allocated training subset.
- Record performance on the whole training set (in-sample, averaged over all points of the training part of each series).
- Rank the networks for each series; a series gets allocated to its top N (e.g. 2) best networks.
Repeat, and make an early stop when the average error on the validation part starts growing (a sketch of this loop follows below).
[Chart: an example allocation of 10 series to 7 nets, topN=2]
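
A plain-Python sketch of the allocation loop; train_one_epoch, avg_insample_error and valid_error are hypothetical stand-ins for a single training pass and the per-series error measures, and the validation criterion used for early stopping is my assumption:

import numpy as np

def specialize(nets, series_list, train_one_epoch, avg_insample_error, valid_error,
               top_n=2, max_epochs=50):
    rng = np.random.default_rng(0)
    # Random initial allocation: each net gets roughly half of the series.
    alloc = {i: set(rng.choice(len(series_list), size=len(series_list) // 2, replace=False))
             for i in range(len(nets))}
    best_valid = np.inf
    for epoch in range(max_epochs):
        for i, net in enumerate(nets):
            train_one_epoch(net, [series_list[s] for s in alloc[i]])
        # Rank nets on every series (in-sample, averaged over the series' training points).
        errors = np.array([[avg_insample_error(net, s) for net in nets] for s in series_list])
        ranking = np.argsort(errors, axis=1)               # per series: nets from best to worst
        alloc = {i: set(np.where(ranking[:, :top_n] == i)[0]) for i in range(len(nets))}
        # Early stop when the average validation error starts growing
        # (one plausible choice of validation criterion).
        v = np.mean([min(valid_error(net, s) for net in nets) for s in series_list])
        if v > best_valid:
            break
        best_valid = v
    return nets, alloc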

Task 1: Creating a Pool of Specialized NNs, cont. Success is reinforced: a network that does well, on average, on a particular series is likely to see it again in the next epoch of training, and will certainly see it if it is the best network for that series. The experiments were done with recurrent NNs, so a particular net was receiving the whole series. If I were to use standard networks, without state, it would be possible to allocate pieces of one series to different networks. The architecture and inputs are all the same; what is different, and actively manipulated between epochs, is the composition of the training data set for each network. Example: identity of the best network for a particular series, along the series' steps (one-hot over the 7 nets):
[ 0., 0., 1., 0., 0., 0., 0.]
[ 0., 1., 0., 0., 0., 0., 0.]
[ 0., 1., 0., 0., 0., 0., 0.]
[ 0., 0., 0., 0., 0., 0., 1.]
[ 0., 0., 0., 0., 0., 0., 1.]
[ 0., 0., 0., 0., 0., 0., 1.]
[ 0., 0., 0., 0., 0., 0., 1.]
[ 0., 0., 0., 0., 0., 0., 1.]
[ 0., 0., 0., 0., 0., 0., 1.]
[ 0., 0., 0., 0., 0., 0., 1.]
[ 0., 0., 0., 0., 0., 0., 1.]
[ 0., 0., 0., 0., 0., 0., 1.]

Task 2: Ensembling or Choosing the Best Network
Advanced, second-level network methods:
- Using the forecasts of each network as an additional input to a final network - did not work.
- Creating a recommender network that chooses which of the base networks should deal with a particular forecast - did not work.
- Creating a network that outputs weights for the base networks - did not work.
Simpler, rule-based methods worked (a sketch of two of them follows below):
- Simple average.
- Weighted average by average training loss.
- Weighted average by the frequency of being the best network.
- Weighted average of the top N (2) networks (by average loss in training).
- Choosing the network that was best on the last training point.
- Choosing the network that was best on average.
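
A small NumPy sketch of two of the rules that worked, with hypothetical inputs: forecasts is an array of shape (num_nets, horizon) and the loss arrays hold per-net errors; inverse-loss weighting is an assumption, as the slide only says "weighted average by average training loss":

import numpy as np

def weighted_by_training_loss(forecasts, avg_train_loss):
    # Lower average training loss -> larger weight.
    w = 1.0 / np.asarray(avg_train_loss, dtype=float)
    w = w / w.sum()
    return (w[:, None] * forecasts).sum(axis=0)

def best_at_last_training_point(forecasts, last_point_loss):
    # Choose the net that was best on the last training point of this series.
    return forecasts[int(np.argmin(last_point_loss))]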

Comparing Network Validation Results, smape (specialized vs. non-specialized nets)
- Best possible (perfect choosing of a) single net: 11.46 (specialized), 13.47 (non-specialized)
- Single net: 14.15, sd=0.06 (specialized), 14.37, sd=0.04 (non-specialized)
- Best nets (min avg error over all training points of a series) / single best (best at last training point) net: 14.51, sd=0.07; 14.3
- Ensemble of top 2 out of 7 nets (by average error over all training points of a series): 14.09, sd=0.04
- Weighted ensemble of the above nets: 13.99 (specialized), 14.35 (non-specialized)

Results In Context The M3 data set is a popular benchmark; smape on the Monthly series:
- ARIMA: 14.98
- Ensemble of non-specialized NNs: 14.35
- ETS (ZZZ): 14.15
- Ensemble of specialized NNs: 13.99
- Hybrid (mean of ARIMA and ETS): 13.91
- Best algorithm in the M3 competition: 13.89 (THETA)
- SGT: 13.72, sd=0.019
- Best of Bagged ETS: 13.64 (BLD.MBB)
SGT (Seasonal, Global Trend) is my algorithm, a kind of generalized Exponential Smoothing: it has a generalized (between additive and multiplicative) trend, a Student-t distribution of the error, a function for the error size, etc., and is fitted with MCMC.

Questions? Slawek Smyl slawek@uber.com

Tools Used - CNTK Microsoft's CNTK, which used to stand for Computational Network Toolkit but now means Cognitive Toolkit :). It is modern, fast and, I believe, easier to use than TensorFlow. Open-sourced, it runs on Windows and Linux. It has its own scripting/configuration language, as well as the Python APIs that I used; there are also C++ and some C# APIs.

SGT - Seasonal Global Trend Algorithm Created by S. Smyl: a time series algorithm that generalizes (between additive and multiplicative) the Trend and Error Exponential Smoothing models. It uses the Student-t distribution for errors.
Formula for the expected value: µ_t = (l_{t-1} + γ l_{t-1}^p) s_t, where l_{t-1} is the previous level, γ is the global trend coefficient, p ∈ <0,1>, and s_t is the seasonality coefficient.
Function for the scale of the error distribution: σ l_{t-1}^τ + ς, where σ, ς are positive and τ ∈ <0,1>.
All parameters are fitted by MCMC (Stan).

LGT - Local Global Trend Algorithm Created by S. Smyl: a time series algorithm that generalizes the Additive and Multiplicative Trend and Error Exponential Smoothing models. It uses the Student-t distribution for errors.
Formula for the expected value: µ_t = l_{t-1} + γ l_{t-1}^p + λ b_{t-1}, where l_{t-1} is the previous level, γ is the global trend coefficient, p, λ ∈ <0,1>, and b_{t-1} is the previous (local) trend.
Function for the scale of the error distribution: σ l_{t-1}^τ + ς, where σ, ς are positive and τ ∈ <0,1>.
All parameters are fitted by MCMC (Stan).
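
As a worked reading of the expected-value and error-scale formulas on the SGT and LGT slides above, a tiny Python sketch (the function and argument names are mine):

def sgt_mu(l_prev, gamma, p, s_t):
    # SGT: mu_t = (l_{t-1} + gamma * l_{t-1}^p) * s_t
    return (l_prev + gamma * l_prev ** p) * s_t

def lgt_mu(l_prev, b_prev, gamma, p, lam):
    # LGT: mu_t = l_{t-1} + gamma * l_{t-1}^p + lambda * b_{t-1}
    return l_prev + gamma * l_prev ** p + lam * b_prev

def error_scale(l_prev, sigma, tau, zeta):
    # Scale of the Student-t error: sigma * l_{t-1}^tau + zeta
    return sigma * l_prev ** tau + zeta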

My ISF 2016 paper: results for Yearly M3 series (645 series, rather short, 14-40 points). Algorithm: smape, MASE
- ETS (ZZZ): 17.37, 2.86
- ARIMA: 17.12, 2.96
- Auto NN (M3 participant): 18.57, 3.06
- Best algorithm in M3: 16.42 (RBF), 2.63 (ROBUST-Trend)
- LSTM with minimal preprocessing: 16.4, 2.88
- LSTM using ETS for preprocessing: 15.94, 2.73
- LSTM using LGT for preprocessing: 15.46, 2.65
- LGT: 15.23 (sd=0.015), 2.50 (sd=0.004)
smape = (200/h) Σ_{t=1..h} |y_t − ŷ_t| / (y_t + ŷ_t)
MASE = ( h^{-1} Σ_{t=1..h} |y_t − ŷ_t| ) / ( (n − s)^{-1} Σ_{t=s+1..n} |y_t − y_{t−s}| )
Summation happens over the prediction horizons, except for the divisor of the MASE equation, where it happens over the past (in-sample) data.
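
A NumPy sketch of these two metrics as defined above (function and argument names are mine); y and yhat are actuals and forecasts over the h prediction horizons, insample is the training part of the series, and s is the seasonal period (1 for the yearly series on this slide, 12 for monthly data):

import numpy as np

def smape(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 200.0 / len(y) * np.sum(np.abs(y - yhat) / (y + yhat))

def mase(y, yhat, insample, s=1):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    insample = np.asarray(insample, float)
    # Divisor: mean absolute seasonal-naive error over the in-sample data.
    scale = np.mean(np.abs(insample[s:] - insample[:-s]))
    return np.mean(np.abs(y - yhat)) / scale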

Quantile Regression and Pinball Loss In normal regression we minimize the sum of squared errors, which leads to an average-value forecast. If we minimize the L1 metric, we get a median forecast. More generally, if we minimize the pinball loss, defined as
(Actual - Forecast) * tau when Actual >= Forecast
(Forecast - Actual) * (1 - tau) otherwise
we obtain a quantile forecast for the specified quantile value tau, e.g. 0.95. Code for the pinball loss in CNTK:

import numpy as np
import cntk.ops as o  # assuming o is an alias for CNTK's ops module, as on the smape slide

def PinballLoss(z, t, tau):
    # z = forecast, t = actual, tau = target quantile, e.g. 0.95
    tauv = o.constant(tau, shape=t.shape, dtype=np.float32)
    tau1mv = o.constant(1 - tau, shape=t.shape, dtype=np.float32)
    loss = o.reduce_mean(o.element_select(
        o.greater_equal(t, z),                    # actual >= forecast?
        o.element_times(o.minus(t, z), tauv),     # (Actual - Forecast) * tau
        o.element_times(o.minus(z, t), tau1mv)))  # (Forecast - Actual) * (1 - tau)
    return loss
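
A quick NumPy check (not from the slides) that minimizing this loss with tau = 0.95 indeed picks out roughly the 95th percentile of a sample:

import numpy as np

def pinball(forecast, actuals, tau):
    d = actuals - forecast
    return np.mean(np.where(d >= 0, tau * d, (tau - 1) * d))

rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)
grid = np.linspace(-3, 3, 601)
best = grid[np.argmin([pinball(f, sample, 0.95) for f in grid])]
print(best, np.quantile(sample, 0.95))   # both close to ~1.64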