Predicting Time Series with a Local Support Vector Regression Machine

Rodrigo Fernandez
LIPN, Institut Galilee - Universite Paris 13
Avenue J.B. Clement, 93430 Villetaneuse, France
rf@lipn.univ-paris13.fr

Abstract

Recently, a new Support Vector Regression (SVR) algorithm has been proposed. This approach, called ν-SV regression, allows the SVR to adjust its accuracy parameter ε automatically. In this paper, we combine ν-SV regression with a local approach in order to obtain accurate estimations of both the function and the noise distribution. This approach seems to be extremely useful when the noise distribution depends on the input values. We illustrate the properties of the algorithm on a toy example and we benchmark our approach on a 100,000-point time series (Santa Fe Competition data set D), obtaining state-of-the-art performance on this data set.

1 Introduction

Support Vector (SV) Machines are becoming a powerful and useful tool for classification and regression tasks. SV regressors are able to approximate a real-valued function y(x) in terms of a small subset (named support vectors) of the training examples. In [7], together with the SVR algorithm itself, Vapnik proposes a loss function with nice properties: |ξ|_ε = max{|ξ| − ε, 0}, where ξ = y − f(x) is the estimation error. This loss function, called ε-insensitive, does not penalize errors smaller than ε > 0, which has to be chosen a priori. In [3], Scholkopf et al. propose a new SVR algorithm, called ν-SV regression (ν-SVR), which adjusts the parameter ε automatically. In this paper, we combine ν-SVR and a local approach to obtain accurate estimations of both the function y(x) and the noise model.

The paper is organized as follows: in Section 2 we briefly sketch our algorithm, Section 3 illustrates the properties of the algorithm on a toy example, in Section 4 we benchmark our approach on a 100,000-point time series (Santa Fe Competition data set D), and a brief discussion of the results and of some possible further issues concludes the paper.

2 Local ν-SV regression

Support Vector Regression seeks to estimate functions

    f(x) = w · x + b,   w, x ∈ IR^n, b ∈ IR,    (1)

starting from l independent identically distributed (iid) data points {(x_i, y_i)} ⊂ IR^n × IR.
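As a side note, the ε-insensitive loss introduced above can be written in a couple of lines of code; the function name and the example values below are ours, not the paper's.

```python
# A short rendering of the epsilon-insensitive loss |xi|_eps = max(|xi| - eps, 0):
# errors smaller than eps cost nothing, larger errors cost only their excess over eps.
import numpy as np

def eps_insensitive(y, f_x, eps):
    return np.maximum(np.abs(y - f_x) - eps, 0.0)

# For instance, with eps = 0.3: an error of 0.2 costs 0.0, an error of 0.5 costs 0.2.
```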

We show here how SVR can be used when the data are not iid, for example for non-stationary time series forecasting. The ν-SVR algorithm minimizes the risk

    R_ν[f] = (1/2) ‖w‖² + C ( (1/l) Σ_{i=1}^{l} |y_i − f(x_i)|_ε + νε ).    (2)

The solution (w*, b*) of (2) can be obtained by solving the following QP problem:

    max over α, α*:   −(1/2) Σ_{i,j=1}^{l} (α_i* − α_i)(α_j* − α_j) x_i · x_j + Σ_{i=1}^{l} y_i (α_i* − α_i)    (3)

    subject to   Σ_{i=1}^{l} (α_i − α_i*) = 0,   Σ_{i=1}^{l} (α_i + α_i*) ≤ Cν,   α_i, α_i* ∈ [0, C/l],  i = 1, …, l.

The quantities ε and b are the dual variables of (3); as we use an interior-point primal-dual solver, LOQO [4], they can be recovered easily. The computation of w follows the standard SV techniques: w = Σ_{i=1}^{l} (α_i* − α_i) x_i. The extension of the algorithm to non-linear regression via Hilbert-Schmidt kernels also follows standard SV techniques; see [6] for the details. Among other interesting properties, Scholkopf et al. prove that the positive quantity ν is an upper bound on the fraction of errors and a lower bound on the fraction of support vectors [3].

Our local approach for support vector regression is inspired by the Local Linear classifier proposed by Bottou and Vapnik [1]. The algorithm is simple. Let Z = {(x_1, y_1), …, (x_l, y_l)} be the training set and let x* be a test input. In order to estimate the output y(x*), Local SVR proceeds as follows:

1. Compute the M nearest neighbors, in input space, of x* among the training inputs. These M neighbors define a subset of the training set; call this subset Z_M(x*) ⊂ Z.
2. Using Z_M(x*) as learning set, compute an SVR machine (SVRM). This SVRM approximates y(x) in a neighborhood of x* by f_M(x).
3. Estimate y(x*) by f_M(x*).

The combined choice of ν and M tightly bounds the number of training examples playing a role in the expansion of f_M: M defines the number of examples used to compute the mini-SVRM, and ν bounds from below the proportion of these examples supporting f_M. For example, if M = 100 and ν = 0.3, the number of training examples that support the local estimation is between 30 and 100.
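A minimal sketch of this local prediction step, using scikit-learn's NuSVR and NearestNeighbors as stand-ins for the paper's LOQO-based solver and neighbor search; the function name and the default values of M, nu, C and gamma are illustrative, not the paper's settings.

```python
# Local nu-SVR prediction (steps 1-3 above) with off-the-shelf components.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import NuSVR

def local_nu_svr_predict(X_train, y_train, x_star, M=100, nu=0.3, C=1.0, gamma=1.0):
    """Estimate y(x_star) with a nu-SVR fitted on the M nearest training inputs."""
    # 1. Find the M nearest neighbours of x_star among the training inputs.
    knn = NearestNeighbors(n_neighbors=M).fit(X_train)
    idx = knn.kneighbors(np.atleast_2d(x_star), return_distance=False)[0]
    # 2. Fit a small nu-SVR (an RBF-kernel "mini SVRM") on that neighbourhood only.
    model = NuSVR(nu=nu, C=C, kernel="rbf", gamma=gamma)
    model.fit(X_train[idx], y_train[idx])
    # 3. Use the local model f_M to estimate y at x_star.
    return float(model.predict(np.atleast_2d(x_star))[0])
```

Note that scikit-learn does not directly expose the ε that ν-SVR determines for each local fit, so this sketch only returns the prediction itself.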

Remark that every Local ν-SVR adjusts the (local) accuracy parameter ε automatically. It is clear that ε is directly related to the noise, in the following sense: the larger the noise, the larger the ε computed by ν-SVR. That means that Local ν-SVR estimates the target function and the noise simultaneously. Since the parameter M is much smaller than l, a single-point regression using Local SVR is as fast (or as slow) as the KNN algorithm.

3 A toy example

The task is to estimate the sinc function, defined by sinc(x) = sin(πx)/(πx), from observations corrupted by a noise ρ(x) which is input dependent¹. The learning set is composed of 41 samples drawn uniformly from the interval [-2, 2]. In order to test the ability of ν-SVR to match the noise, two different noise models were used. Two kinds of regressors were trained: a global (or standard) ν-SVR and a local ν-SVR. The risk (or test error) was computed with respect to the sinc function without noise over the interval [-1.5, 1.5]; the results are averaged over 5 trials. Figure 1 shows two examples of noise-model estimation via the automatic ε-tube one gets from the regression. For each test point x, local ν-SVR provides, besides the estimation of the target function, a value of ε which can be interpreted as the estimated accuracy of the model in a neighborhood of x. After processing the whole training set, one gets a curve ε(x) estimating the accuracy of the model at each point. The results are given in Table 1; for each noise model, the best 3 performances are reported. In both cases the local approach performs better than the global one.

¹ That is, ρ(x) = u(x)a(x), where u(x) follows a uniform distribution over [-0.5, 0.5] and a(x) modulates its amplitude. Different choices of a(x) give different noise models.

4 Time series prediction

We implemented the local ν-SVR algorithm using LOQO as solver for the subproblems. The algorithm was tested on a time series known to be long (100,000 points) and difficult (the series is non-stationary). Data set D from the Santa Fe Competition [8] is artificial data generated by numerically integrating the equations of motion for a damped, driven particle:

    d²x/dt² + dx/dt + ∇V(x) = F(t).

In the experiments, we used embedding sizes of 25, 20 and 15 consecutive points; the kernel for the SVMs is an RBF kernel with σ² = 0.3 and the regularization parameter C was fixed to 1. No cross-validation over these parameters was made, but some preliminary experiments ensured that these values of σ and C work fairly well. The last 1,000 points of the series were used to determine, by cross-validation over one-step forecasts, the parameters M and ν of the local ν-SVR algorithm. For the final prediction the full time series was used. Tables 2, 3 and 4 show the results of the cross-validation procedure for different lengths of the embedding dimension.
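A minimal sketch, under the assumptions above, of the delay-embedding construction and of an iterated multi-step forecast of the kind used later for the 25-step prediction: each input is a window of p consecutive values, the target is the next value, and longer-range predictions are obtained by feeding each prediction back into the window. The helper names and defaults are ours, not the paper's.

```python
# Delay embedding and iterated one-step forecasting for a scalar time series.
import numpy as np

def delay_embed(series, p):
    """Return inputs X (windows of length p) and one-step-ahead targets y."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + p] for i in range(len(series) - p)])
    y = series[p:]
    return X, y

def iterated_forecast(predict_one, last_window, n_steps=25):
    """Iterate a one-step predictor n_steps times, feeding predictions back in."""
    window = list(last_window)
    preds = []
    for _ in range(n_steps):
        y_hat = predict_one(np.array(window))
        preds.append(y_hat)
        window = window[1:] + [y_hat]
    return np.array(preds)
```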

[Figure 1 appears here: four panels. Top row: Noise Model 1 and Noise Model 2, each showing the real noise together with the +ε and −ε tubes. Bottom row: regression examples for each noise model, showing the learning set, the test set, the local estimate and the full estimate.]

Figure 1: Top: the automatic ε estimates the noise model. In this figure we compare the smoothed ε(x)-tube (ν = 0.3, averaged over 5 trials) with a sample of each noise model. Bottom: regression examples (only one trial), according to the results shown in Table 1. Left: Noise Model 1; right: Noise Model 2.

                 Noise Model 1                        Noise Model 2
    ν      Full error    ν      Local error     ν      Full error    ν      Local error
    0.5    0.077         0.1    0.057           0.5    0.070         0.1    0.056
    0.4    0.078         0.05   0.051           0.4    0.069         0.05   0.052
    0.3    0.081         0.03   0.048           0.3    0.077         0.03   0.056

Table 1: Local ν-SVR consistently outperforms the full regression. Remark that full ν-SVR performs well for large values of ν while local ν-SVR works well with small values of ν.
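For completeness, a hedged sketch of how toy data in the spirit of Section 3 could be generated; numpy's sinc is sin(πx)/(πx), and the amplitude function a(x) below is purely illustrative since the paper does not spell out the two noise models it used.

```python
# Noisy sinc data with input-dependent noise rho(x) = u(x) a(x), u uniform on [-0.5, 0.5].
import numpy as np

def make_noisy_sinc(n=41, a=lambda x: 0.2 + 0.3 * np.abs(x), seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, size=n)            # inputs drawn uniformly from [-2, 2]
    clean = np.sinc(x)                            # numpy's sinc(x) is sin(pi x)/(pi x)
    rho = rng.uniform(-0.5, 0.5, size=n) * a(x)   # noise u(x) * a(x)
    return x.reshape(-1, 1), clean + rho
```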

Only the best results are reported. The couples (ν, M) achieving the minimum validation error are selected for the final forecast. Figure 2 shows the best 25-step iterated prediction; in Table 5 we compare the root mean squared error over the first 25 predicted values with the error achieved by other algorithms. We include the results reported by Muller et al. [2] and by Pawelzik et al. [5] using a segmentation of the learning set into regimes of approximately stationary dynamics. The best local ν-SVR result is 39% better than the best result reported using the full set and it is slightly better than the best result reported using the segmentation preprocessing; in general, local ν-SVR performs consistently better than global approaches. This result is remarkable because in our experimental setup no a priori assumption about the stationarity of the data was made.

5 Discussion

This paper presents a local regression algorithm that profits from the properties of ν-SV regression. Local ν-SVR outperforms global (or "standard") SV regression; one can explain this by the fact that the local approach is free to change the radius of the ε-tube for each point to be estimated (see Figure 1), hence it is more robust to variations of the amplitude of the noise. In addition, it provides an estimation of the noise that matches the real noise model quite well. The results on the Santa Fe data set D are excellent. The local ν-SVR algorithm outperforms by nearly 40% the best result reported using the full set and it is comparable (slightly better, in fact) to the best result reported using a segmentation algorithm as preprocessing. Since we used the full 100,000-point series to predict the next 25 values, very little information about the stationarity of the data was used in our prediction². It would be interesting to test the local approach on the segmented set and to compare the estimation of the noise model over the training set provided by the local approach with the segmentation proposed in [5].

² The parameters M and ν were determined using the last 1,000 values of the series. One could argue that the dynamics of the last 1,000 points of the series and of the 25 predicted points are the same; in this sense, our predictor incorporates a small amount of information about the stationarity of the series.

Validation error, 15 points
    M      ν = 0.12   ν = 0.15   ν = 0.18   ν = 0.21
    60     0.34       0.33       0.33       0.36
    70     0.34       0.30       0.30       0.32
    80     0.33       0.32       0.34       0.38
    100    0.326      0.332      0.338      0.345

Table 2: One-step forecast over the last 1,000 points of the series. The root mean squared (RMS) error as a function of M and ν is reported. The embedding size is 15, σ² = 0.3 and C = 1.

Validation error, 20 points
    M      ν = 0.2    ν = 0.3    ν = 0.4    ν = 0.5
    50     0.332      0.326      0.32       0.32
    60     0.288      0.28       0.278      0.281
    70     0.299      0.297      0.31       0.36
    80     0.321      0.327      0.341      0.352

Table 3: One-step forecast over the last 1,000 points of the series. The RMS error as a function of M and ν is reported. The embedding size is 20, σ² = 0.3 and C = 1.

Validation error, 25 points
    M      ν = 0.1    ν = 0.14   ν = 0.2    ν = 0.24
    60     0.355      0.354      0.355      0.357
    80     0.327      0.323      0.319      0.319
    100    0.343      0.336      0.332      0.332
    120    0.329      0.324      0.323      0.326

Table 4: One-step forecast over the last 1,000 points of the series. The RMS error as a function of M and ν is reported. The embedding size is 25, σ² = 0.3 and C = 1.

                     p=15    p=20    p=25    Muller [2]   ZH [9]   PKM [5]
    full set         0.456   0.539   0.391   0.639        0.665    -
    segmented set    -       -       -       0.418        -        0.596

Table 5: Root mean squared error over the first 25 predicted values achieved by several algorithms. Full-set predictions assume stationary data; segmented-set predictions use a segmentation of the learning set into regimes of approximately stationary dynamics. '-' indicates no prediction available.
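A hedged sketch of the (M, ν) selection behind Tables 2 to 4: each pair is scored by the RMS error of one-step forecasts on a held-out tail of the series, and the best pair is kept. It reuses the delay_embed and local_nu_svr_predict sketches given earlier; the grids, the size of the validation tail and the brute-force loop are illustrative, not the paper's implementation.

```python
# Grid search over (M, nu) by one-step-forecast RMS on the last n_val points.
import numpy as np

def select_M_nu(series, p, M_grid, nu_grid, n_val=1000):
    X, y = delay_embed(series, p)
    X_tr, y_tr = X[:-n_val], y[:-n_val]           # everything but the validation tail
    X_va, y_va = X[-n_val:], y[-n_val:]           # last n_val one-step-ahead examples
    best = None
    for M in M_grid:
        for nu in nu_grid:
            preds = np.array([local_nu_svr_predict(X_tr, y_tr, x, M=M, nu=nu)
                              for x in X_va])     # one local fit per validation point
            rms = float(np.sqrt(np.mean((preds - y_va) ** 2)))
            if best is None or rms < best[0]:
                best = (rms, M, nu)
    return best                                    # (validation RMS, M, nu)
```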

[Figure 2 appears here: the true series and the predicted series over the 25 forecast steps.]

Figure 2: Santa Fe data set D. 25-step iterated prediction using an embedding dimension of 25 points. The mean value of the full training set (0.575) was subtracted from both curves.

References

[1] L. Bottou, V. Vapnik. Local Learning Algorithms. Neural Computation, 4(6):888-900, 1992.
[2] K.-R. Muller, A. Smola, G. Ratsch, B. Scholkopf, J. Kohlmorgen and V. Vapnik. Predicting time series with support vector machines. In Proceedings of ICANN'97, 1997.
[3] B. Scholkopf, P. Bartlett, A. Smola, R. Williamson. Support Vector Regression with Automatic Accuracy Control. In Proceedings of ICANN'98, 1998.
[4] R.J. Vanderbei. LOQO: User's Manual. Program in Statistics and Operations Research, Princeton University, 1997.
[5] K. Pawelzik, J. Kohlmorgen and K.-R. Muller. Annealed competition of experts for segmentation and classification of switching dynamics. Neural Computation, 8(2):342-358, 1996.
[6] A.J. Smola, B. Scholkopf. A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report TR-1998-030, 1998.
[7] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[8] A. Weigend and N. Gershenfeld (Eds.). Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1994.
[9] X. Zhang and J. Hutchinson. Simple architectures on fast machines: practical issues in nonlinear time series prediction. In [8], 1994.