Time Series Analysis with R 1


Time Series Analysis with R [1]

Yanchang Zhao
http://www.rdatamining.com

R and Data Mining Workshop @ AusDM 2014
27 November 2014

[1] Presented at UJAT and Canberra R Users Group
1 / 39

Outline
- Introduction
- Time Series Decomposition
- Time Series Forecasting
- Time Series Clustering
- Time Series Classification
- Online Resources
2 / 39

Time Series Analysis with R [2]
- time series data in R
- time series decomposition, forecasting, clustering and classification
- autoregressive integrated moving average (ARIMA) model
- Dynamic Time Warping (DTW)
- Discrete Wavelet Transform (DWT)
- k-NN classification

[2] Chapter 8: Time Series Analysis and Mining, in book R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/rdatamining.pdf
3 / 39

R
- a free software environment for statistical computing and graphics
- runs on Windows, Linux and MacOS
- widely used in academia and research, as well as in industrial applications
- 5,800+ packages (as of 13 Sept 2014)
- CRAN Task View: Time Series Analysis http://cran.r-project.org/web/views/timeseries.html
4 / 39

Time Series Data in R
- class ts represents data which has been sampled at equispaced points in time
- frequency=7: a weekly series
- frequency=12: a monthly series
- frequency=4: a quarterly series
5 / 39

Time Series Data in R

a <- ts(1:20, frequency = 12, start = c(2011, 3))
print(a)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2011           1   2   3   4   5   6   7   8   9  10
## 2012  11  12  13  14  15  16  17  18  19  20
str(a)
## Time-Series [1:20] from 2011 to 2013: 1 2 3 4 5 6 7 8 9 10 ...
attributes(a)
## $tsp
## [1] 2011 2013 12
##
## $class
## [1] "ts"
6 / 39


What is Time Series Decomposition
To decompose a time series into components:
- Trend component: long-term trend
- Seasonal component: seasonal variation
- Cyclical component: repeated but non-periodic fluctuations
- Irregular component: the residuals
8 / 39
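As a quick illustration of these components, an additive series with a known trend, seasonal pattern and irregular part can be built and decomposed in base R alone (a minimal sketch; the variable names are illustrative only):

```r
# sketch: build a series with known components, then recover them
set.seed(42)
trend    <- (1:120) / 10                         # long-term trend
seasonal <- rep(sin(2 * pi * (1:12) / 12), 10)   # period-12 seasonal variation
noise    <- rnorm(120, sd = 0.1)                 # irregular component
x <- ts(trend + seasonal + noise, frequency = 12)
d <- decompose(x)                                # additive decomposition
names(d)   # includes "seasonal", "trend" and "random"
```

The moving-average trend estimate in `d$trend` is NA at the series ends, since the centred window does not fit there.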

Data AirPassengers
Data AirPassengers: monthly totals of Box & Jenkins international airline passengers, 1949 to 1960. It has 144 (= 12 × 12) values.

plot(AirPassengers)

[Figure: line plot of the AirPassengers series, 1949-1960]
9 / 39

Decomposition

apts <- ts(AirPassengers, frequency = 12)
f <- decompose(apts)
plot(f$figure, type = "b")  # seasonal figures

[Figure: the 12 monthly seasonal figures in f$figure]
10 / 39

Decomposition

plot(f)

[Figure: decomposition of additive time series — observed, trend, seasonal and random components]
11 / 39


Time Series Forecasting
- To forecast future events based on known past data
- For example, to predict the price of a stock based on its past performance
Popular models:
- Autoregressive moving average (ARMA)
- Autoregressive integrated moving average (ARIMA)
13 / 39

Forecasting

# build an ARIMA model
fit <- arima(AirPassengers, order = c(1, 0, 0),
             seasonal = list(order = c(2, 1, 0), period = 12))
fore <- predict(fit, n.ahead = 24)
# error bounds at 95% confidence level
U <- fore$pred + 2 * fore$se
L <- fore$pred - 2 * fore$se
ts.plot(AirPassengers, fore$pred, U, L,
        col = c(1, 2, 4, 4), lty = c(1, 1, 2, 2))
legend("topleft", col = c(1, 2, 4), lty = c(1, 1, 2),
       c("Actual", "Forecast", "Error Bounds (95% Confidence)"))
14 / 39

Forecasting

[Figure: AirPassengers (actual) with 24-month forecast and 95% error bounds, 1950-1962]
15 / 39


Time Series Clustering
To partition time series data into groups based on similarity or distance, so that time series in the same cluster are similar.
Measures of distance/dissimilarity:
- Euclidean distance
- Manhattan distance
- Maximum norm
- Hamming distance
- The angle between two vectors (inner product)
- Dynamic Time Warping (DTW) distance
- ...
17 / 39
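The first three measures above are available directly from base R's dist(); a minimal sketch on two toy series:

```r
# sketch: three of the distances above, computed on two toy series
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
m <- rbind(x, y)                # pointwise differences: 1, 2, 3, 4
dist(m, method = "euclidean")   # sqrt(1 + 4 + 9 + 16) = sqrt(30)
dist(m, method = "manhattan")   # 1 + 2 + 3 + 4 = 10
dist(m, method = "maximum")     # largest pointwise difference = 4
```

Unlike these pointwise measures, the DTW distance (next slide) first warps the time axis to align the two series, so it can match similar shapes that are out of phase.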

Dynamic Time Warping (DTW)
DTW finds the optimal alignment between two time series.

library(dtw)
idx <- seq(0, 2 * pi, len = 100)
a <- sin(idx) + runif(100)/10
b <- cos(idx)
align <- dtw(a, b, step = asymmetricP1, keep = TRUE)
dtwPlotTwoWay(align)

[Figure: two-way plot of the query and reference series with their DTW alignment]
18 / 39

Synthetic Control Chart Time Series
The dataset contains 600 examples of control charts synthetically generated by the process in Alcock and Manolopoulos (1999). Each control chart is a time series with 60 values.
Six classes:
- 1-100: Normal
- 101-200: Cyclic
- 201-300: Increasing trend
- 301-400: Decreasing trend
- 401-500: Upward shift
- 501-600: Downward shift
http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.html
19 / 39

Synthetic Control Chart Time Series

# read data into R; sep = "" means the separator is white space,
# i.e., one or more spaces, tabs, newlines or carriage returns
sc <- read.table("./data/synthetic_control.data", header = FALSE, sep = "")
# show one sample from each class
idx <- c(1, 101, 201, 301, 401, 501)
sample1 <- t(sc[idx, ])
plot.ts(sample1, main = "")
20 / 39

Six Classes

[Figure: six stacked time series plots, one sample per class (rows 1, 101, 201, 301, 401, 501)]
21 / 39

Hierarchical Clustering with Euclidean Distance

# sample n cases from every class
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100 + s, 200 + s, 300 + s, 400 + s, 500 + s)
sample2 <- sc[idx, ]
observedlabels <- rep(1:6, each = n)
# hierarchical clustering with Euclidean distance
hc <- hclust(dist(sample2), method = "ave")
plot(hc, labels = observedlabels, main = "")
22 / 39

Hierarchical Clustering with Euclidean Distance

[Figure: dendrogram of the 60 sampled series, leaves labelled by class; hclust (*, "average") on dist(sample2)]
23 / 39

Hierarchical Clustering with Euclidean Distance

# cut tree to get 8 clusters
memb <- cutree(hc, k = 8)
table(observedlabels, memb)
##               memb
## observedlabels  1  2  3  4  5  6  7  8
##              1 10  0  0  0  0  0  0  0
##              2  0  3  1  1  3  2  0  0
##              3  0  0  0  0  0  0 10  0
##              4  0  0  0  0  0  0  0 10
##              5  0  0  0  0  0  0 10  0
##              6  0  0  0  0  0  0  0 10
24 / 39

Hierarchical Clustering with DTW Distance

# the "DTW" distance is provided by the dtw package (via proxy)
mydist <- dist(sample2, method = "DTW")
hc <- hclust(mydist, method = "average")
plot(hc, labels = observedlabels, main = "")
# cut tree to get 8 clusters
memb <- cutree(hc, k = 8)
table(observedlabels, memb)
##               memb
## observedlabels  1  2  3  4  5  6  7  8
##              1 10  0  0  0  0  0  0  0
##              2  0  4  3  2  1  0  0  0
##              3  0  0  0  0  0  6  4  0
##              4  0  0  0  0  0  0  0 10
##              5  0  0  0  0  0  0 10  0
##              6  0  0  0  0  0  0  0 10
25 / 39

Hierarchical Clustering with DTW Distance

[Figure: dendrogram from hierarchical clustering with DTW distance; hclust (*, "average") on mydist]
26 / 39


Time Series Classification
To build a classification model based on labelled time series, and then use the model to predict the label of unlabelled time series.
Feature Extraction:
- Singular Value Decomposition (SVD)
- Discrete Fourier Transform (DFT)
- Discrete Wavelet Transform (DWT)
- Piecewise Aggregate Approximation (PAA)
- Perceptually Important Points (PIP)
- Piecewise Linear Representation
- Symbolic Representation
28 / 39
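Of the techniques above, Piecewise Aggregate Approximation is simple enough to sketch in a few lines of base R: each series is divided into equal-width segments and each segment is replaced by its mean. The paa() helper below is hypothetical, not from any package:

```r
# hypothetical helper: PAA replaces each equal-width segment by its mean
paa <- function(x, segments) {
  groups <- cut(seq_along(x), breaks = segments, labels = FALSE)
  as.numeric(tapply(x, groups, mean))
}
paa(1:60, 6)   # reduces a length-60 series to 6 segment means
```

Applied to a 60-value control chart, paa(series, 6) yields a 6-dimensional feature vector, trading temporal detail for a much lower dimensionality.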

Decision Tree (ctree)
ctree from package party

classid <- rep(as.character(1:6), each = 100)
newsc <- data.frame(cbind(classid, sc))
library(party)
ct <- ctree(classid ~ ., data = newsc,
            controls = ctree_control(minsplit = 20, minbucket = 5, maxdepth = 5))
29 / 39

Decision Tree

pclassid <- predict(ct)
table(classid, pclassid)
##        pclassid
## classid   1   2   3   4   5   6
##       1 100   0   0   0   0   0
##       2   1  97   2   0   0   0
##       3   0   0  99   0   1   0
##       4   0   0   0 100   0   0
##       5   4   0   8   0  88   0
##       6   0   3   0  90   0   7
# accuracy
(sum(classid == pclassid)) / nrow(sc)
## [1] 0.8183
30 / 39

DWT (Discrete Wavelet Transform)
- Wavelet transform provides a multi-resolution representation using wavelets.
- Haar Wavelet Transform: the simplest DWT http://dmr.ath.cx/gfx/haar/
- DFT (Discrete Fourier Transform): another popular feature extraction technique
31 / 39
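The idea behind a single Haar step can be sketched by hand in base R: scaled pairwise sums give the smooth (approximation) coefficients and scaled pairwise differences give the detail coefficients. This is a minimal sketch assuming the common 1/sqrt(2) normalization, not the internals of any package:

```r
# sketch: one level of the Haar transform, done by hand
x <- c(4, 6, 10, 12)
odd  <- x[c(TRUE, FALSE)]          # 1st, 3rd, ... elements
even <- x[c(FALSE, TRUE)]          # 2nd, 4th, ... elements
smooth <- (odd + even) / sqrt(2)   # low-frequency (approximation) part
detail <- (odd - even) / sqrt(2)   # high-frequency (detail) part
# the transform is invertible: the original signal comes back exactly
x_back <- as.vector(rbind((smooth + detail) / sqrt(2),
                          (smooth - detail) / sqrt(2)))
```

Recursing on the smooth part gives the multi-resolution representation; the detail coefficients at each level are what the next slide extracts as classification features.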

DWT (Discrete Wavelet Transform)

# extract DWT (with Haar filter) coefficients
library(wavelets)
wtdata <- NULL
for (i in 1:nrow(sc)) {
  a <- t(sc[i, ])
  wt <- dwt(a, filter = "haar", boundary = "periodic")
  wtdata <- rbind(wtdata, unlist(c(wt@W, wt@V[[wt@level]])))
}
wtdata <- as.data.frame(wtdata)
wtsc <- data.frame(cbind(classid, wtdata))
32 / 39

Decision Tree with DWT

ct <- ctree(classid ~ ., data = wtsc,
            controls = ctree_control(minsplit = 20, minbucket = 5, maxdepth = 5))
pclassid <- predict(ct)
table(classid, pclassid)
##        pclassid
## classid  1  2  3  4  5  6
##       1 98  2  0  0  0  0
##       2  1 99  0  0  0  0
##       3  0  0 81  0 19  0
##       4  0  0  0 74  0 26
##       5  0  0 16  0 84  0
##       6  0  0  0  3  0 97
(sum(classid == pclassid)) / nrow(wtsc)
## [1] 0.8883
33 / 39

plot(ct, ip_args = list(pval = FALSE), ep_args = list(digits = 0))

[Figure: conditional inference tree on the DWT coefficients (splits on V57, W43, W31, W22, W5), with class-distribution bar plots at the leaf nodes]
34 / 39

k-nn Classification find the k nearest neighbours of a new instance label it by majority voting needs an efficient indexing structure for large datasets k <- 20 newts <- sc[501, ] + runif(100) * 15 distances <- dist(newts, sc, method = "DTW") s <- sort(as.vector(distances), index.return = TRUE) # class IDs of k nearest neighbours table(classid[s$ix[1:k]]) ## ## 4 6 ## 3 17 35 / 39

k-nn Classification find the k nearest neighbours of a new instance label it by majority voting needs an efficient indexing structure for large datasets k <- 20 newts <- sc[501, ] + runif(100) * 15 distances <- dist(newts, sc, method = "DTW") s <- sort(as.vector(distances), index.return = TRUE) # class IDs of k nearest neighbours table(classid[s$ix[1:k]]) ## ## 4 6 ## 3 17 Results of majority voting: class 6 35 / 39

The TSclust Package
TSclust: a package for time series clustering [3]
- measures of dissimilarity between time series to perform time series clustering
- metrics based on raw data, on generating models and on the forecast behavior
- time series clustering algorithms and cluster evaluation metrics

[3] http://cran.r-project.org/web/packages/TSclust/
36 / 39


Online Resources
- Chapter 8: Time Series Analysis and Mining, in book R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/rdatamining.pdf
- R Reference Card for Data Mining http://www.rdatamining.com/docs/r-refcard-data-mining.pdf
- Free online courses and documents http://www.rdatamining.com/resources/
- RDataMining Group on LinkedIn (8,000+ members) http://group.rdatamining.com
- RDataMining on Twitter (1,800+ followers): @RDataMining
38 / 39

The End
Thanks!
Email: yanchang(at)rdatamining.com
39 / 39