ABSTRACT

The developed system produces motifs by making use of data mining methods. Data mining is a method of extracting useful information from large databases, and data mining methods strongly support efforts to predict future outcomes. The developed system identifies motifs in a given time series dataset, and these patterns are employed to improve the prediction of future outcomes. The method used in the developed system is a data mining method called Symbolic Aggregate approximation (SAX). The Random Projection algorithm is used to discover previously unknown time series motifs, and these motifs are tested for accuracy by comparing them with the results of a brute force algorithm. Unlike other methods that require a large number of parameters, this method requires only one parameter to identify the time series discords.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Figures
List of Tables
1. Background and Rationale
   Data Mining
   Need for Data Mining
      Scalability
      High Dimensionality
      Heterogeneous and Complex Data
      Data Ownership and Distribution
      Nontraditional Analysis
   Artificial Neural Network Method
   Symbolic Aggregate approximation
   Time Series Motifs
2. Narrative
   TCOON Data
   Symbolic Aggregate approximation
   EMMA Algorithm
   ADM Algorithm
   Probabilistic Discovery of Time Series Motifs
      Euclidean Distance Measure
      Lower Bounding Distance Measure

      2.5.3 Time Series
      Subsequence
      Sliding Window
      Match
      Trivial Match
   Time Series Projection
   Brute Force Algorithm
   Finding Planted Motifs
3. Developed Research
   Code Design
   Obtaining SAX Breakpoints
   Random Projection Algorithm Functionality
   Brute Force Algorithm
   Basic System Functionality
4. Testing and Evaluation
   Problems Faced while Developing the Project
   Results
5. Expected Results and Future Work
Acknowledgements
Appendix
Bibliography and References

LIST OF FIGURES

Figure 1.3 TCOON Station
Figure 2.1 Tidal Datums
Figure 2.2 Three Representations of Distance Measures
Figure 2.3 EMMA Algorithm
Figure 2.4 ADM Algorithm
Figure 2.5 Euclidean Distance Measure
Figure 2.6 Lower Bounding Distance Measure
Figure 2.7 Time Series
Figure 2.8 Time Series Representation
Figure 2.9 Comparison between Time Series Representations
Figure 2.10 Match
Figure 2.11 Trivial Match
Figure 2.12 Time Series
Figure 2.13 A Randomly Chosen Mask
Figure 2.14 Another Iteration with Another Randomly Chosen Mask
Figure 2.15 Two Planted Motifs
Figure 2.16 Data Set with Planted Motifs
Figure 2.17 Contour Plot with Planted Motifs
Figure 2.18 Motifs Discovered
Figure 3.1 A Normal Probability Plot
Figure 3.2 Time Series Representation
Figure 3.3 Symbolic Strings

Figure 3.4 System Architecture
Figure 4.1 Scalability of Algorithms
Figure 4.2 Motif Matches for two day long
Figure 4.3 Best Motif for two day long
Figure 4.4 Motif Matches for week long
Figure 4.5 Best Motif for week long
Figure 4.6 Motif Matches for the year 2001 two day long
Figure 4.7 Best Motif among the curves for the year 2001 two day long
Figure 4.8 Motif Matches for the year 2001 for week long
Figure 4.9 Best Motif among the curves for the year 2001 for week long
Figure 4.10 Motif Matches for the year 2008 for two day long
Figure 4.11 Best Motif for the year 2008 for week long
Figure 4.12 Best Motif for the year 2008 for week long for brute force algorithm

LIST OF TABLES

Table 2.1 TCOON Data Schema
Table 2.2 Input timeseries2symbol
Table 2.3 Output timeseries2symbol
Table 3.4 Pseudo Code for the EMMA Algorithm

1. BACKGROUND AND RATIONALE

The discovery of time series motifs is of much importance for improving water level predictions. These predictions are useful to the shipping industry, to people living in coastal areas, and even for emergency evacuation in case of a hurricane. Different algorithms are available for the discovery of time series motifs; this project makes use of the random projection algorithm to extract the motifs in the given primary water levels.

1.1 Data Mining

Data mining is the process of organizing large amounts of data and selecting the appropriate information from it. Data mining is commonly known as the science of automatically discovering useful information from large data sets [Wikipedia 2008]. Traditionally, most information has been extracted from stored data, but as the amount of data being stored has increased, data sets have grown rapidly in size and complexity, which emphasizes the need for more sophisticated tools. Current technologies have made data collection and organization much easier. Data mining is an integral part of knowledge discovery in databases (KDD). Knowledge discovery is the process of converting unprocessed data into a readable format; it provides precise information that can be easily understood by a user. Data mining uses real data and makes the resulting information clearly readable, whereas in some approaches, such as the neural network method, the extracted information is not as clear. Some data mining methods are based only on prediction rather than

knowledge discovery in data sets. The collected data can be stored in various formats and can be distributed across many sites. [Tan 2006]

1.2 Need for Data Mining

As data sets have grown in size and complexity, they have posed difficulties for traditional data analysis techniques, in terms of both analysis and storage, and this led to the development of data mining techniques. Some of the challenges that led to the development of data mining are described in the following sections.

1.2.1 Scalability

Due to the increase in the amount of data being collected, data sets are growing to gigabytes, terabytes, and even petabytes. To handle data sets of this size, data mining algorithms must be scalable. Various data mining algorithms employ special search strategies to handle exponential search problems. Sampling is another technique by which scalability can be improved. [Tan 2006]

1.2.2 High Dimensionality

Data sets now come with hundreds or even thousands of attributes instead of the few commonly used before. Traditional data analysis techniques work for low-dimensional data, but they cannot handle high-dimensional data. For example, if temperature measurements of the water are taken at regular intervals, the number of dimensions increases as the number of measurements over a given period increases. This type of data may not be handled well by traditional data analysis techniques. [Tan 2006]

1.2.3 Heterogeneous and Complex Data

Traditional data analysis methods make use of data sets whose attributes are all of the same type. Analysis may be difficult if attributes reside on different systems. Data mining techniques can handle heterogeneous attributes, and they take into consideration relationships in the data, such as temporal and spatial autocorrelation and parent-child relationships between semistructured text elements and XML documents. [Tan 2006]

1.2.4 Data Ownership and Distribution

In some situations the data needed for analysis does not reside in one location, or it may not belong to one particular organization; it may be distributed among multiple entities. This kind of data raises challenges such as how to reduce the amount of time needed to perform the computation and how to solve the security issues of a distributed computation. Data mining techniques address the problem of data ownership and distribution by reducing the amount of time taken for computation. [Tan 2006]

1.2.5 Nontraditional Analysis

The traditional approach to data analysis is based on the hypothesize-and-test paradigm: based on a hypothesis, an experiment is designed to collect data, and the collected data is analyzed. This process is very time consuming and difficult. Data mining techniques make use of nontraditional analysis and often work with opportunistic samples of the data. [Tan 2006]

1.3 Texas Coastal Ocean Observation Network

The Texas Coastal Ocean Observation Network (TCOON) is a state-of-the-art water-level measurement system along the Texas Coast. TCOON is operated by the Conrad Blucher Institute for Surveying and Science (CBI) at Texas A&M University-Corpus Christi, and its measuring stations are located along the Gulf Coast of Texas. TCOON provides measurements of precise water levels, wind, temperature, and barometric pressure. It follows NOAA/NOS standards and maintains a real-time, online database. TCOON data can be used for predicting tidal datums and littoral boundaries, for oil-spill response and navigation, and even for storm prediction and preparation. [TCOON 2008]

Figure 1.3 shows a TCOON station used to obtain environmental measurements. Sensors measure the various environmental parameters, and these sensors are controlled by a data collection computer. Solar panels provide the power needed by the TCOON stations. Most TCOON stations make use of the Next Generation Water Level Measurement System (NGWLMS). This system has a computer at its heart that controls the sensors, stores the collected data on-site temporarily, and transmits the data onward. TCOON measures environmental parameters such as water levels at six-minute intervals. The collected data is then stored in a database and used for forecasting water levels, wind speeds, and barometric pressures. [TCOON 2008]

Figure 1.3 TCOON station [TCOON 2008]

1.4 Symbolic Aggregate approximation (SAX)

SAX is the primary symbolic representation for time series data; it allows dimensionality reduction and indexing, and it provides a lower-bounding distance measure. SAX was developed in 2002 by Eamonn Keogh and Jessica Lin [Eamonn 2008]. In this project, SAX is used for dimensionality reduction and for the discretization of the time series. SAX requires less storage space and is equally as good as the Discrete Fourier Transform (DFT) and the Discrete Wavelet Transform (DWT). SAX provides solutions to many data mining tasks, including motif discovery. SAX performs well compared to other tools and represents the state of the art for time series data sets. SAX is used in many applications: to symbolize street data, to create discrete data from continuous data, and to perform anomaly detection in network traffic. [Eamonn 2008]

1.5 Time Series Motifs

In time series data mining models, the main task is to find approximately repeated subsequences, called motifs, in a longer time series. The two main limitations in finding time series motifs are the poor scalability of the motif discovery algorithm being used and the inability to discover motifs in the presence of noise. The random projection algorithm used in this project can find time series motifs with very high probability even in the presence of noise or don't-care symbols. Some of the ways motifs are used by other algorithms are as follows [Chiu 2003]:

Motif discovery is important for mining association rules in time series. Such motifs are commonly referred to as primitive shapes and frequent patterns.

Many time series algorithms work by developing typical prototypes of each class. These prototypes are usually considered motifs.

Several time series detection algorithms model normal behavior using a set of typical shapes and flag future patterns that differ from the typical shapes.

Motifs are also utilized in robotics, where a method has been introduced to generalize from a set of qualitatively dissimilar experiences; these experiences are known as motifs.

Motifs are of much significance in medical data mining, for example in characterizing a physiotherapy patient's recovery based on the discovery of motifs. [Chiu 2003]

In all of the above areas, the discovery of motifs plays an important role in finding patterns in the given time series data.

2. NARRATIVE

The main objective of the project is to find approximately repeated subsequences in a longer time series. The data sources are the various water levels recorded along the Texas coast, such as information from the TCOON database. The data extracted from the database is stored in plain text format on a local machine and then used as input to the SAX-based random projection algorithm, which produces motifs.

2.1 TCOON Data

There are many TCOON stations located along the coast of the Gulf of Mexico. The water level data from the Gulf of Mexico is collected by all the DNR stations located along the coast. Some of these serve socio-economic purposes, while others are for research. Each DNR station has a station datum (STND), usually an arbitrary zero used internally to measure all other elevations, including water level, Mean Higher High Water (MHHW), Mean High Water (MHW), Mean Tide Level, Mean Sea Level (MSL), and benchmarks. A benchmark is usually a brass survey disk attached permanently to a stainless steel rod driven 50 ft into the ground [DNR 2008]. All water elevations along the Gulf of Mexico coast are measured at each station relative to the station datum. The arbitrary zero is chosen so that all water level observations are positive numbers. Every station has its own unique station datum due to the physical conditions at that station, and benchmarks maintain the station's zero point over time [TCOON 2008]. Figure 2.1 shows published tidal datum information for some of the stations. In this project, only primary water levels are considered.

Figure 2.1 Tidal Datums [DNR 2008]

The data collected from all DNR stations is stored in the TCOON database. The TCOON database schema has five logical fields (id, jul, ser, smv0-smv9, and src), as follows:

Table 2.1 TCOON Database Schema

Field  Type         Null  Key  Default  Extra
id     char(3)      NO         NULL
jul    int(11)      NO    PRI  NULL
ser    char(4)      NO    PRI  NULL
smv0   smallint(6)  NO
smv1   smallint(6)  NO
smv2   smallint(6)  NO
smv3   smallint(6)  NO
smv4   smallint(6)  NO
smv5   smallint(6)  NO
smv6   smallint(6)  NO
smv7   smallint(6)  NO
smv8   smallint(6)  NO
smv9   smallint(6)  NO
src    char(4)      NO    PRI

id - station id
jul - a date/time stamp in the format YYYYjjjHH, where YYYY is the year, jjj is the Julian day, and HH is the hour.

ser - a series identifier for the data type (pwl = primary water level, wsd = wind speed)
smv0-9 - value fields, one for each six-minute interval
src - identifies where the data came from (nesdis, nwstag, etc.)

In the project, data from the TCOON database is saved on the local machine in a plain text file, and the SAX algorithm is applied to the stored data.

2.2 Symbolic Aggregate approximation (SAX)

SAX is the first symbolic representation of time series that allows dimensionality reduction and indexing with a lower-bounding distance measure. Lower bounding means that the estimated distance in the reduced space is always less than or equal to the distance in the original space. Such lower-bounding functions are known for wavelets, Fourier transforms, SVD, piecewise polynomials, Chebyshev polynomials, and clipped data [Lin 2008]. Symbolic approximations are used to represent time series because they enable the following techniques:

Hashing
Suffix trees
Markov models

The best known symbolic approximation that offers lower bounding is SAX. It provides:

Lower bounding of the Euclidean distance
Lower bounding of the DTW distance
Dimensionality reduction
Numerosity reduction [Lin 2008]

These features make SAX usable with most time series representations. In order to obtain SAX, one first needs to convert the time series to

piecewise aggregate approximation (PAA) representation, and then convert the PAA to symbols. SAX takes a linear amount of time. Figure 2.2(A) shows the Euclidean distance between two time series: the square root of the sum of the squared differences of each pair of corresponding points. Figure 2.2(B) shows the distance measure defined for the PAA approximation: the square root of the sum of the squared differences between each pair of corresponding PAA coefficients, multiplied by the square root of the compression ratio. [Motifs 2002] Figure 2.2(C) illustrates the distance between two symbolic representations of a time series.

Figure 2.2 Three representations of distance measures [Motifs 2002]

In this project the raw data obtained from TCOON is converted to the PAA representation, and this representation is then converted to symbols. SAX was developed within the Matlab environment. The method chosen to implement the project plays an important role in its success. There have been many high

level representations proposed for data mining, including Fourier transforms, wavelets, eigenwaves, and piecewise polynomial models. Many of these suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as that of the original data; second, although distance measures can be defined on the symbolic approaches, they have little correlation with the distance measures defined on the original series. SAX solves both problems: it allows dimensionality and numerosity reduction, and it provides an efficient method to find the most unusual time series subsequence [Eamonn 2008]. The symbolic representation of the original time series produced by SAX is used as input to the random projection algorithm to identify motifs, and the motifs are used to improve the water level prediction [Eamonn 2008].

SAX makes use of two functions to produce the motifs. The first is the timeseries2symbol function, which takes in a raw time series and converts it into strings. One of the main requirements of this function is that N/n must be an integer, where N is the length of the raw time series and n is the number of symbols in the low-dimensional approximation of the subsequence [Eamonn 2008]. The input to this function is shown in Table 2.2.

Table 2.2 Input timeseries2symbol [Eamonn 2008]

Input:
data - the raw time series
N - the length of the raw time series
n - the number of symbols in the low dimensional approximation of the subsequence
alphabet_size - the number of discrete symbols (2 <= alphabet_size <= 10)
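The conversion performed by timeseries2symbol can be sketched as follows. This is an illustrative Python reimplementation, not the original Matlab code; the breakpoints shown are the standard SAX breakpoints for an alphabet of size 4, and the function assumes N/n is an integer, as the text requires.

```python
import numpy as np

def timeseries_to_symbols(data, n, alphabet_size=4):
    """Sketch of SAX: z-normalize, reduce to n PAA segments, map to symbols."""
    # Breakpoints dividing a N(0,1) distribution into equal-probability
    # regions; only the standard values for alphabet_size = 4 are given here.
    breakpoints = {4: [-0.67, 0.0, 0.67]}[alphabet_size]
    data = np.asarray(data, dtype=float)
    # z-normalize so the Gaussian breakpoints apply
    data = (data - data.mean()) / data.std()
    # PAA: average each block of N/n consecutive points
    segments = data.reshape(n, len(data) // n).mean(axis=1)
    # map each PAA coefficient to a letter a, b, c, d
    return "".join("abcd"[np.searchsorted(breakpoints, s)] for s in segments)

series = [1, 2, 3, 4, 5, 6, 7, 8]          # toy raw time series, N = 8
word = timeseries_to_symbols(series, n=4)  # reduce to n = 4 symbols
print(word)  # prints "abcd": a steadily rising series yields a rising word
```

A monotonically increasing series maps to monotonically increasing symbols, which makes the dimensionality reduction easy to check by eye.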

For the above input, the output of the timeseries2symbol function is symbolic data, as shown in Table 2.3.

Table 2.3 Output timeseries2symbol [Eamonn 2008]

Output:
symbolic_data - matrix of symbolic data (no repetition). If consecutive subsequences have the same string, only the first occurrence is recorded, with a pointer to its location stored in "pointers"
pointers - locations of the first occurrences of the strings

The other function is min_dist, which calculates the minimum distance between two strings of the same length. The input to this function is [Eamonn 2008]:

string1 - first string
string2 - second string
alphabet_size - the alphabet size used to construct the strings
compression_ratio - original data length / symbolic length

The output of the function is:

distance - lower-bounding distance

Usage: distance = min_dist(str1, str2, alphabet_size, compression_ratio)

The distance measure computed by this function is not a good measure for comparing strings in general, but it works well for classification and clustering of the strings [Eamonn 2008].

One of the main advantages of the data mining method of forecasting is that it provides a water level forecasting system for the bays and the Texas coast in general. Currently, harmonic water forecasts for bays are provided through NOAA's Physical

Oceanographic Real-Time System (PORTS), and harmonic and persistence model forecasts are provided by the DNR website. The other advantages of the data mining method are the ability to model nonlinear systems, its robustness, and its generic modeling capability. [TCOON 2008]

The input to the developed model plays an important role in determining its performance. One of the important inputs to the model is the primary water level. In the Gulf of Mexico, meteorological factors play a larger role in water level forecasts than tidal factors. The implemented system is able to predict the water levels more accurately than other models.

2.3 EMMA Algorithm

Many algorithms have been proposed for the discovery of motifs. Among them, the EMMA algorithm has the widest application range [Chiu 2003]. The pseudo code for the algorithm is shown in Figure 2.3; the line numbers in Figure 2.3 are used in the discussion of the algorithm that follows.

Figure 2.3 EMMA Algorithm [Chiu 2003]

The algorithm begins by sliding a moving window of length n across the time series (line 4). The hash function h() (line 5) normalizes the time series, converts it to the symbolic representation, and computes an address:

h(C) = 1 + sum_{i=1}^{w} (ord(c_i) - 1) * a^(i-1)   (1) [Chiu 2003]

where ord(i) is the ordinal value of symbol i, i.e., ord(a) = 1, ord(b) = 2, and so on. The hash function computes an integer in the range 1 to a^w, and a pointer to the subsequence is placed in the corresponding bucket (line 6) [Chiu 2003]. At this point we have simply rearranged the data into a hash table with a^w addresses and a total size of O(m). This information can be used as a heuristic for motif search, since if there is a truly over-represented pattern in the time series, we should expect that most, if not all, copies of it hashed to the same location. The address with the most hits is called the Most Promising Candidate (MPC) (line 8). A list of all subsequences that mapped to this address is built (line 9), but it is possible that some subsequences that hashed to different addresses are also within R of the subsequences contained in the MPC. The MINDIST function can be used to determine which addresses could possibly contain such subsequences (line 12). All such subsequences are added to the list of subsequences that need to be examined in our small matrix (line 14). At this point the list of similar subsequences is passed to the ADM subroutine (line 17). [Chiu 2003]

Next, a simple test is performed. If the number of matches to the current best-so-far motif is greater than the largest unexplored neighborhood (line 18), we are done: we record the best-so-far motif as the true best match (line 19), note the number of matching subsequences (line 20), and abandon the search (line 21). If the test fails, however, we must set the most promising candidate to be the next largest bucket (line 23), initialize the new neighborhood with the contents of the bucket (line 24), and loop back to line 11,

where the full neighborhood is discovered (lines 13 and 14) and the search continues [Chiu 2003]. For simplicity, the pseudo code ignores one possible optimization: it is possible (in fact, likely) that the neighborhood in one iteration will overlap with the neighborhood in the next. Even so, this algorithm, together with the ADM algorithm, is inefficient: for processing more than one year of data, the computing resources of a personal computer appear insufficient. The random projection algorithm for motif discovery [Chiu 2003] solves the problem of processing more than one year of data.

2.4 ADM Algorithm

The ADM algorithm is used for searching the small neighborhood matrix. The ADM algorithm pre-computes an arbitrary set of distances. The matrices ADM and MIN store lower and upper bounds on the distance between any two objects. Each entry ADM[i, j] is either the exact distance between i and j (i.e., one of those that are pre-computed) or a lower bound on the distance between i and j. In other words, if P_{i,j} contains the set of all paths from i to j, then ADM[i, j] is the largest lower bound on the distance between i and j, obtained from all paths in P_{i,j} using the triangle inequality [Chiu 2003].

Figure 2.4 ADM Algorithm [Chiu 2003]

A matrix MIN is also used because it is impractical to enumerate all paths in P_{i,j} to get the maximum lower bound between i and j. MIN[i, j] stores the minimum distance of

any path from i to j (i.e., it is the least upper bound on the distance between i and j). ADM and MIN are initialized in lines 4 to 8. [Chiu 2003]

In lines 10 to 22, we construct the matrices ADM and MIN. For each stage k, 1 <= k <= n, ADM[i, j] is the greatest lower bound of any path from i to j that does not pass through an object numbered greater than k. Similarly, MIN[i, j] is the smallest upper bound on the distance between i and j. Note that the algorithm is further optimized by storing or computing only half of each matrix, due to distance symmetry [Chiu 2003].

On line 24, we allocate and initialize an array, count, which stores the number of items within R (i.e., the number of matching subsequences) for each motif center. In lines 26 to 35, we scan the matrix ADM and compute the actual distance between i and j whenever ADM[i, j] is a lower bound smaller than R (because the true distance might still be greater than R). Again, the optimization from distance symmetry is omitted for simplicity. At each step, we keep track of the number of items within R (line 32) [Chiu 2003].

The ADM algorithm is a scheme for answering best-match queries from a file containing a collection of objects. A best-match query finds the objects in the file that are closest (according to some (dis)similarity measure) to a given target. Starting with a data structure that reflects some precomputed intrafile distances, the number of comparisons required to achieve the desired results can be reduced using the triangle inequality. The technique generalizes to allow the optimal use of any given set of precomputed intrafile distances, and empirical results illustrate the effectiveness and performance of the ADM algorithm relative to previous algorithms [Lin 2002]. Finally, the ADM algorithm returns the best-matching motif, the motif that has the most items within R, together with a count of the number of matching subsequences.
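The triangle-inequality bounding idea at the heart of ADM can be sketched as follows. This is a simplified illustration, not the full ADM pseudo code: given two precomputed distances, it derives a lower bound on an uncomputed distance and uses the bound to skip the exact computation when it already exceeds R. The objects and distances here are hypothetical.

```python
# Hypothetical one-dimensional "objects"; any metric distance would do.
objects = [0.0, 1.0, 9.0]
dist = lambda a, b: abs(objects[a] - objects[b])

# Suppose only d(0,1) and d(0,2) were precomputed.
d01, d02 = dist(0, 1), dist(0, 2)

# Triangle inequality: |d(0,2) - d(0,1)| <= d(1,2), so this is a
# valid lower bound on the uncomputed distance d(1,2).
lower_bound = abs(d02 - d01)

R = 2.0  # matching range
if lower_bound > R:
    # The bound alone proves objects 1 and 2 cannot match within R,
    # so the exact distance computation is skipped entirely.
    matches = False
else:
    matches = dist(1, 2) <= R

print(lower_bound, matches)  # prints: 8.0 False
```

Because the bound (8.0) already exceeds R (2.0), the pair is rejected without ever computing d(1, 2); ADM applies this idea systematically across the whole neighborhood matrix.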

2.5 Probabilistic Discovery of Time Series Motifs

The SAX-based approach is probabilistic in nature; empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or don't-care symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately and gradually improving the quality of its results over time [Chiu 2003].

Noise plays an important role when attempting to discover motifs. Even small amounts of noise can affect distance measures, even the most commonly used ones such as the Euclidean distance. To cope with noise, the definition of time series motifs is generalized to allow for don't-care subsections, and a novel time- and space-efficient algorithm can be used to discover motifs [Chiu 2003].

2.5.1 Euclidean Distance Measure

The Euclidean distance is the ordinary distance between two points, derivable by repeated application of the Pythagorean theorem. It is the square root of the sum of the squared differences between the coordinates of a pair of objects, and when this formula is used as the distance, the space becomes a metric space [Euclidean 2008]. The Euclidean distance between two points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n) in Euclidean n-space can be defined as

D(P, Q) = sqrt( (p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2 )   (1)

Figure 2.5 illustrates the Euclidean distance measure between two points P and Q.

Figure 2.5 Euclidean Distance Measure [Chiu 2003]

The one-dimensional Euclidean distance between points P = (p_x) and Q = (q_x) is computed as

D(P, Q) = sqrt( (p_x - q_x)^2 ) = |p_x - q_x|   (2)

The two-dimensional distance between points P = (p_x, p_y) and Q = (q_x, q_y) is expressed as

D(P, Q) = sqrt( (p_x - q_x)^2 + (p_y - q_y)^2 )

The three-dimensional distance between two 3D points P = (p_x, p_y, p_z) and Q = (q_x, q_y, q_z) is computed as

D(P, Q) = sqrt( (p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2 )   (3)

For two N-dimensional points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), the N-dimensional distance is computed as

D(P, Q) = sqrt( sum_{i=1}^{n} (p_i - q_i)^2 )   (4) [Euclidean 2008]

2.5.2 Lower Bounding Distance Measure

Lower bounding means that the estimated distance in the reduced space is less than or equal to the distance in the original space.

Figure 2.6 Lower Bounding Distance Measure [Lin 2002]

The lower bounding distance measure D_LB between the two time series, illustrated in Figure 2.6, can be determined as

D_LB(Q, S) = sqrt( sum_{i=1}^{M} (sr_i - sr_{i-1}) (qv_i - sv_i)^2 )

The lower bounding distance so calculated is less than or equal to the Euclidean distance measure.

The following are definitions of some of the data types used in the discovery of time series motifs.

2.5.3 Time Series

A time series is a series of data collected in sequence; it is useful for decision making as an effective way to explore the information hidden in the given data. A time series T = c_1, c_2, ..., c_m can be defined as an ordered set of m real-valued variables. Time series can be very long, containing billions or trillions of observations, and they contain subsections, which are called subsequences. In short, a time series is a collection of observations made sequentially in time [Lin 2002].
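The Euclidean distance of Equation (4) and the lower-bounding property of Section 2.5.2 can be sketched together as follows. The lower bound here uses the PAA form described earlier in this chapter (distance between segment means, scaled by the square root of the compression ratio), which is a simplification of the region-based formula above.

```python
import math

def euclidean(p, q):
    """Equation (4): square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def paa(series, w):
    """Reduce a series of length n to w segment means (n/w must be an integer)."""
    seg = len(series) // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def paa_lower_bound(p, q, w):
    """PAA distance: Euclidean distance between PAA coefficients, scaled by
    sqrt(n/w). By construction it never exceeds euclidean(p, q)."""
    n = len(p)
    return math.sqrt(n / w) * euclidean(paa(p, w), paa(q, w))

p = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
q = [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0]
d_true = euclidean(p, q)
d_lb = paa_lower_bound(p, q, w=4)
print(d_lb <= d_true)  # prints: True (reduced-space distance never exceeds the true one)
```

This is exactly the guarantee SAX relies on: pruning with the cheap reduced-space distance can never discard a pair that the true Euclidean distance would accept.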

Figure 2.7 Time Series [Lin 2002]

In Figure 2.7 the blue curve represents the time series. Time series representations can be classified into many types, as illustrated in Figure 2.8: data adaptive representations (sorted coefficients, piecewise polynomials such as piecewise linear approximation by interpolation or regression, adaptive piecewise constant approximation, singular value decomposition, symbolic representations such as natural language and strings, and trees) and non data adaptive representations (wavelets, including the orthonormal Haar and Daubechies families and the bi-orthonormal Coiflet and Symlet families, random mappings, spectral methods such as the Discrete Fourier Transform and the Discrete Cosine Transform, and the Piecewise Aggregate Approximation).

Figure 2.8 Time Series Representation [Motifs 2002]

In the project, PAA is used to calculate the distance, defined as the square root of the sum of the squared differences between each pair of corresponding PAA coefficients, multiplied by the square root of the compression ratio. Figure 2.9 compares several time series representations (DFT, PLA, Haar wavelets, and APCA).

Figure 2.9 Comparison between Time Series Representations [Motifs 2002]

2.5.4 Subsequence

Given a time series T of length m, a subsequence C of T is a sampling of length n <= m of contiguous positions from T; in other words, a subsequence C = t_p, ..., t_{p+n-1} for 1 <= p <= m-n+1. Since any subsequence of a time series can be a motif, it is necessary to extract all of them, and this can be done with a sliding window. [Chiu 2003]
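Extracting every subsequence with a sliding window, as described above, can be sketched as:

```python
def subsequences(T, n):
    """Slide a window of length n across T one position at a time,
    yielding every contiguous subsequence C = t_p, ..., t_{p+n-1}."""
    return [T[p:p + n] for p in range(len(T) - n + 1)]

T = [3, 1, 4, 1, 5, 9, 2, 6]     # toy time series, m = 8
windows = subsequences(T, n=3)   # m - n + 1 = 6 subsequences
print(len(windows))              # prints: 6
print(windows[0], windows[-1])   # prints: [3, 1, 4] [9, 2, 6]
```

Each extracted window becomes one row of the matrix S used by the random projection step later in this chapter.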

2.5.5 Sliding Window

Given a time series T of length m and a subsequence length n, we can build a matrix S of all possible subsequences by sliding a window of size n across T and placing subsequence C_p in the p-th row of S. The size of this matrix S is (m - n + 1) by n. [Chiu 2003]

To determine whether two subsequences are alike, a match test is needed.

2.5.6 Match

Given a positive real number R (the range), a time series T containing a subsequence C beginning at position p, and another subsequence M beginning at position q: if the distance measure D(C, M) <= R, then M is said to be a matching subsequence of C. This is illustrated in Figure 2.10.

Figure 2.10 Match [Chiu 2003]

2.5.7 Trivial Match

Given a time series T containing a subsequence C beginning at position p and a matching subsequence M beginning at position q, M is a trivial match to C if either p = q or there does not exist a subsequence M' beginning at q', with q' between p and q, such that D(C, M') > R. A trivial match is illustrated in Figure 2.11. [Chiu 2003]

Figure 2.11 Trivial Match [Chiu 2003]

2.6 Time Series Projection

For time series projection in this project we consider the K-Motif(n, R) case. Suppose we have a time series T of 1,000 data points that has two occurrences of a motif of length 16, at times T_1 and T_58, and that the occurrence at T_58 is corrupted by noise from positions 8 to 12.

Figure 2.12 Time Series [Chiu 2003]

Figure 2.12 illustrates the process: subsequences are extracted by a sliding window, converted into symbolic form, and then placed into the matrix S. Here

each row of the matrix S points back to the original subsequence. Once the matrix S is constructed, random projection begins. Two columns of S are randomly selected as a mask, and the k = 985 words in S are hashed into buckets based on their values in the selected columns.

Figure 2.13 A randomly chosen mask [Chiu 2003]

Figure 2.13 illustrates this: a mask {1, 2} is chosen at random, and the symbols in columns 1 and 2 are used to project the rows of S into buckets. Collisions are recorded by incrementing the appropriate cell of the collision matrix; buckets are not reused from one iteration to the next.

Figure 2.14 Another iteration with another randomly chosen mask [Chiu 2003]

Figure 2.14 illustrates the result of another iteration, with {2, 4} as the chosen mask. After the process is repeated several times, the collision matrix is examined. If all the entries are roughly uniform, there are no more motifs to be found in the given dataset.
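One bucketing iteration of the kind illustrated in Figures 2.13 and 2.14 might look like the following Python sketch (a hypothetical helper, not the code of [Chiu 2003]; it records collisions in a dictionary rather than a sparse matrix):

```python
import random
from collections import defaultdict

def projection_iteration(words, mask_size, collisions):
    """One iteration of random projection over SAX words: choose
    mask_size column positions at random, bucket every word by the
    symbols seen through the mask, and record one collision for each
    pair of words landing in the same bucket."""
    word_len = len(words[0])
    mask = sorted(random.sample(range(word_len), mask_size))
    buckets = defaultdict(list)
    for row, word in enumerate(words):
        key = "".join(word[c] for c in mask)
        buckets[key].append(row)
    for rows in buckets.values():
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                collisions[(rows[i], rows[j])] += 1
    return mask
```

Repeating this over many iterations accumulates high counts only for pairs of rows that agree on many randomly chosen columns, which is what the contour plots later in the chapter visualize.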

2.7 Brute Force Algorithm

Brute-force search is simple to implement and will always find a solution if one exists. However, its cost is proportional to the number of candidate solutions, which in many practical problems grows very quickly as the size of the problem increases. Brute-force search is therefore typically used when the problem size is limited, or when problem-specific heuristics can reduce the set of candidate solutions to a manageable size. The method is also used when simplicity of implementation is more important than speed, for example in critical applications where any error in the algorithm would have very serious consequences, or when using a computer to prove a mathematical theorem. Brute-force search is also useful as a "baseline" method when benchmarking other algorithms or metaheuristics; indeed, it can be viewed as the simplest metaheuristic. Brute-force search should not be confused with backtracking, in which large sets of solutions can be discarded without being explicitly enumerated.

2.8 Finding Planted Motifs

To see how the algorithm recovers planted motifs, consider a small dataset into which two motifs, each with two occurrences, are planted. Figure 2.15 shows the two motifs, which are clearly distinct from each other and noisy along their entire length.

Figure 2.15 Two Planted Motifs [Chiu 2003]

These four subsequences (two occurrences of each motif) are planted into a small dataset of length 1128, as illustrated in Figure 2.16. The algorithm is run with n = 128, w = 16, and a = 4 for 100 iterations.

Figure 2.16 Dataset with planted motifs [Chiu 2003]

Figure 2.17 shows the resulting collision matrix as a contour plot; the dark smudges indicate the positions of the planted motifs.

Figure 2.17 Contour Plot with Planted Motifs [Chiu 2003]

The subsequences discovered at these positions are shown in Figure 2.18. They may not look exactly like the planted motifs, but they are similar to each other.

Figure 2.18 Motifs Discovered [Chiu 2003]

3. DEVELOPED RESEARCH

The data source for the project is the TCOON database. Data from the TCOON database is stored in a file on a local machine and then passed to SAX. The motifs obtained from SAX are used to improve the predictions from the regression. The idea behind using data mining here is to take past data from the TCOON database and pass these values through SAX to yield motifs. Extracting motifs directly from raw time series data takes a large amount of time, so SAX is used to reduce the computation time. The process does not require a database to store the data extracted from the TCOON data sets: the data is stored in a file, so the method can be implemented on a local machine.

3.1 Code Design

The data mining method of prediction is tested within the MATLAB environment. The computers used for the study are Pentium IV PCs with 3.19 GHz CPUs.

3.2 Obtaining SAX

To obtain SAX, the time series data is first converted to its piecewise aggregate approximation (PAA), and the PAA is then converted to symbols, yielding a discrete representation. It is desirable for the discretization technique to produce symbols with equal probability. Because normalized subsequences have a highly Gaussian distribution, the breakpoints can be determined easily. This is illustrated in Figure 3.1, where subsequences of length 128 are extracted from 8 different time series and a normal probability plot of the data is drawn.

Figure 3.1 A Normal Probability Plot [Motifs 2002]

Breakpoints

The breakpoints are a sorted list of numbers B = beta_1, ..., beta_(a-1) such that the area under an N(0,1) Gaussian curve from beta_i to beta_(i+1) equals 1/a (with beta_0 = -infinity and beta_a = +infinity). The breakpoints can be determined by looking them up in a statistical table. [Chiu 2003] In Figure 3.2, the blue curve represents the time series data; a fixed-size window is taken and the distance between the window and the time series is calculated.

Figure 3.2 Time Series Representation [SAX 2008]
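Instead of a statistical table, the breakpoints can be read off the inverse CDF of the standard normal, as in this Python sketch (illustrative only; `sax_breakpoints` is a hypothetical helper name):

```python
from statistics import NormalDist

def sax_breakpoints(a):
    """Breakpoints beta_1 .. beta_(a-1) that cut the N(0,1) curve into
    a regions of equal area 1/a: these are simply the i/a quantiles of
    the standard normal distribution."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]
```

For a = 4, for example, this yields approximately [-0.67, 0, 0.67], the familiar quartile breakpoints of the SAX literature.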

A Gaussian curve is then divided into a equiprobable regions (three, in this example). PAA coefficients below the smallest breakpoint are mapped to the symbol a, coefficients greater than or equal to the smallest breakpoint and less than the second smallest breakpoint are mapped to the symbol b, and so on; the symbols are thus obtained. This is shown in Figure 3.3, where the red lines are the breakpoints, segments of the curve above the highest breakpoint are labeled c, and segments below the lowest breakpoint are labeled a. The output for the time series in the figure is baabccbc.

Figure 3.3 Symbolic Strings [SAX 2008]

3.3 Random Projection Algorithm

The random projection algorithm is used to extract motifs from time series data; it makes use of SAX. The projection algorithm is designed to attack the planted (w, d)-motif problem: each string is planted with exactly one approximate occurrence of an unknown motif y of length w, an occurrence with d substitutions. The planted motif problem has a huge search space for motif discovery; in
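The whole discretization step described above (normalize, reduce to PAA, map to symbols) can be sketched in Python as follows (a simplified stand-in for the project's MATLAB timeseries2symbol function; `sax_word` is a hypothetical name):

```python
from statistics import NormalDist

def sax_word(series, w, a):
    """Convert a time series to a SAX word of length w over an
    alphabet of size a: z-normalize, average into w PAA frames, then
    map each frame mean to a letter via the Gaussian breakpoints."""
    n = len(series)
    mu = sum(series) / n
    sd = (sum((x - mu) ** 2 for x in series) / n) ** 0.5 or 1.0
    z = [(x - mu) / sd for x in series]
    frames = [z[i * n // w:(i + 1) * n // w] for i in range(w)]
    coeffs = [sum(f) / len(f) for f in frames]
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    alphabet = "abcdefghijklmnop"[:a]
    # a coefficient's symbol index is the number of breakpoints at or below it
    return "".join(alphabet[sum(b <= c for b in breakpoints)] for c in coeffs)
```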

order to reduce this huge search space, the projection algorithm is used. Random projection is used to guess at least some of the occurrences of the unknown planted motif. [Chiu 2003]

The main factors behind the success of random projection are the choices of the projection size k, the number of iterations i, and the threshold value s. The projection size k has to be chosen such that k < w - d, so that a randomly chosen mask can avoid the substituted positions. To make the projection algorithm useful for motif discovery, the dimensionality of the subsequences must be significantly reduced while high correlation is retained. [Chiu 2003]

3.3.1 Functionality

In the random projection motif discovery algorithm, the length of a motif subsequence cannot exceed the length of the time series. In this algorithm the time series data is converted to SAX. The functionality of the random projection algorithm for motif discovery is as follows:

Initially, the parameters for loading the data are set: the filename containing the time series data; a start date for the motif search, for example '01/01/2000'; an end date, for example '12/01/2000'; and the interval of measurements, for example 1 hour. The data are then loaded by the function [X Y] = data_load(filename, start_date, end_date, interval).

The parameters of the SAX algorithm are defined: a, the alphabet size {a, b, c, ...}; n, the length of a motif subsequence; and w, the length of a word.

The parameters of the random projection algorithm are defined: t, the projection size; num_iteration, the number of trials; s, the threshold for the largest value in the collision table; and r, the range. The path to a folder where results will be saved is also defined, for example 'motifs\two-day-long\2000\'; this folder is created in advance in the project folder.

The SAX algorithm is then carried out, transforming the initial time series into symbolic form using the function timeseries2symbol(Y, w, a). Next, the random projection algorithm is run, producing a set of motif candidates. The original time series is then consulted to check the candidates for their number of matches; the candidate motif with the largest number of matches is the best one.

Subsequences are first extracted with a sliding window, converted into symbolic form, and placed into the S_hat matrix, in which each row index points back to the original location of the subsequence. Once the matrix is constructed, random projection starts: m columns of the S_hat matrix are randomly selected as a mask, and the collision matrix is initialized as a sparse zero matrix. During random projection, each substring of size w in the sequence is mapped to a string of size m, called its projection, by reading the symbols through the mask.
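The thresholding controlled by the parameter s can be sketched as follows (an illustrative Python fragment, with collision counts held in a plain dictionary rather than a sparse matrix; `motif_candidates` is a hypothetical helper):

```python
def motif_candidates(collisions, s):
    """Select motif candidate pairs: entries of the collision table
    whose count reaches the threshold s. ('Significantly larger than
    average' is simplified here to a fixed threshold, as with the
    project's parameter s.)"""
    return sorted(pair for pair, count in collisions.items() if count >= s)
```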

After projection, cells of the collision matrix with values significantly larger than the average are treated as motif candidates, and the Euclidean distance between the original time series subsequences of these candidates is calculated. The search for motifs over a given time span, such as one or two weeks, is carried out and the results are stored in a folder. An array of predicted points 48 hours long is produced, and motifs, also 48-hour arrays, are extracted for each year.

3.4 Brute Force Algorithm

The brute force algorithm (BF) consists of scanning the text and determining, at every position between 0 and n-m, whether an occurrence of the pattern starts there or not. After each attempt, it shifts the pattern window by exactly one position to the right. This requires only constant extra space, and the character comparisons can be done in any order. [Xi 2007] This is a true brute force approach, hence the name, and two loops are required for its implementation. C code for the brute force algorithm is as follows:

void BF(char *x, int m, char *y, int n)
{
    int i, j;
    for (j = 0; j <= n - m; ++j) {
        for (i = 0; i < m && x[i] == y[i + j]; ++i);
        if (i >= m) {
            output(j);
        }
    }
}
[Xi 2007]

3.5 Basic System Functionality

The TCOON data on the local machine is sent to SAX, which identifies the motifs and returns the best motif by analyzing the data it has received. These motifs are used to improve the predictions from the support vector regression model, and are compared with the results of the brute force algorithm. SAX reduces the computation time and produces precise results. Figure 3.4 shows the data mining model used in the project: given an input, SAX produces symbolic strings such as aaabbbaabb, and these strings are used to find the motifs that improve the predictions.

Figure 3.4 System Architecture (past data is fed both to the random projection algorithm via SAX and to the brute force algorithm, and the resulting motifs are compared)

4. TESTING AND EVALUATION

Testing plays an important role in any project: every system must be tested before it can be declared functional. Testing is the act of subjecting the system to experimental data in order to determine how well it works. In this project the following data are retrieved from TCOON: primary water levels and mean high water level. The raw primary water level data, from the year 2000 onward, is stored in a file. The system is tested by passing the raw data in the file to the random projection motif discovery algorithm, and motifs for each of the years are extracted. The results from the random projection algorithm (RPA) are compared to those of the brute force algorithm on a small range of data. The results are also tested by comparing them with already known data: for example, data from one year is passed to the system, and the predicted output from the data mining model is compared with the already known data for the following year.

The input to the brute force algorithm is the primary water level; its output is compared with the output from the RPA to check whether the found motifs are accurate. Several test cases are given as input to SAX, and the output is compared with the brute force output. The test cases include finding the motifs for one week, one month, one year, and the whole 8 years; the time required for motif discovery varied accordingly. The scalability of the algorithms is tested and the results are illustrated in Figure 4.1. The results confirm that the brute force algorithm is quadratic in the length of the time series, while time series projection is linear.

Figure 4.1 Scalability of Algorithms [Chiu 2003]

The inputs to the system are the standard values taken from the TCOON database over the past 8 years. The data are stored in plain text format on the local machine. These values are passed through the SAX subsystem, the output values are tabulated, and a motif graph is plotted.

4.1 Problems Faced While Developing the Project

The input data was first tested with the EMMA and ADM algorithms, which proved inefficient: when processing more than one year of data, a personal computer was insufficient. Preprocessing for small amounts of missing data, such as one hour, is performed by averaging the available values, but large amounts of missing data are skipped. The selection of kernel functions for the regression algorithm is important, as the start time of a motif subsequence is not considered. When large amounts of data are given as input, motif discovery takes very long. Initially the identified motifs were tested with linear regression, but the output turned out to be a straight line.

4.2 Results

The search for two-week-long motifs is executed with an interval of 1 hour. The algorithm parameters are as follows:

Alphabet size, a = 4
Length of motif subsequence, n = 168
Length of word, w = 14
Projection size, t = 9
Number of trials, iter = 50
Threshold for largest value in collision table, s = 30
Range, R = 0.30

The search for motifs of length 2 days is also executed, with the following parameters:

Alphabet size, a = 4
Length of motif subsequence, n = 48
Length of word, w = 6
Projection size, t = 4
Number of trials, iter = 50
Threshold for largest value in collision table, s = 30
Range, R = 0.10

Figure 4.2 gives the motif matches for the year 2000 for two-day lengths. The colored curves in the figure denote the motif matches for every month of the year 2000. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.2 Motif Matches for two-day lengths in the year 2000

Figure 4.3 gives the best motif among the curves for the year 2000. From the motif matches obtained for each month, the best motif for the year is extracted: the motif matches for every two days of each month are considered, and the best motif is selected on the basis of the most frequently recurring pattern among all the found motifs. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.3 Best Motif for two-day lengths in the year 2000

Figure 4.4 gives the motif matches for the year 2000 for two-week lengths; the colored curves denote the motif matches for every two weeks of the year 2000. The X axis represents the time period and the Y axis represents the water level measurements. The X axis values are numbered up to 168, as the two-week-length motifs extend to 168.

Figure 4.4 Motif Matches for two-week lengths in the year 2000

Figure 4.5 gives the best motif of the year 2000 for two-week lengths. From the motif matches obtained for two-week lengths, the best motif for the year is extracted. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.5 Best Motif for two-week lengths in the year 2000

Figure 4.6 gives the motif matches for the year 2001 for two-day lengths. The colored curves in the figure denote the motif matches for every month of the year 2001. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.6 Motif Matches for the year 2001 for two-day lengths

Figure 4.7 gives the best motif among the curves for the year 2001. From the motif matches obtained for each month, the best motif for the year is extracted. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.7 Best Motif among the curves for the year 2001 for two-day lengths

Figure 4.8 gives the motif matches for the year 2001 for two-week lengths. From the motif matches obtained for two-week lengths, the best motif for the year is extracted. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.8 Motif Matches for the year 2001 for two-week lengths

Figure 4.9 gives the best motif among the curves for the year 2001 for two-week lengths. From the motif matches obtained for two-week lengths, the best motif for the year is extracted. The X axis represents the time period and the Y axis represents the water level measurements.


More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS 4.1 Introduction Although MST-based clustering methods are effective for complex data, they require quadratic computational time which is high for

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture - 24

Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture - 24 Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras Lecture - 24 So in today s class, we will look at quadrilateral elements; and we will

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

Going nonparametric: Nearest neighbor methods for regression and classification

Going nonparametric: Nearest neighbor methods for regression and classification Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 3, 208 Locality sensitive hashing for approximate NN

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Time Series Prediction as a Problem of Missing Values: Application to ESTSP2007 and NN3 Competition Benchmarks

Time Series Prediction as a Problem of Missing Values: Application to ESTSP2007 and NN3 Competition Benchmarks Series Prediction as a Problem of Missing Values: Application to ESTSP7 and NN3 Competition Benchmarks Antti Sorjamaa and Amaury Lendasse Abstract In this paper, time series prediction is considered as

More information

7 Fractions. Number Sense and Numeration Measurement Geometry and Spatial Sense Patterning and Algebra Data Management and Probability

7 Fractions. Number Sense and Numeration Measurement Geometry and Spatial Sense Patterning and Algebra Data Management and Probability 7 Fractions GRADE 7 FRACTIONS continue to develop proficiency by using fractions in mental strategies and in selecting and justifying use; develop proficiency in adding and subtracting simple fractions;

More information

CS490D: Introduction to Data Mining Prof. Chris Clifton

CS490D: Introduction to Data Mining Prof. Chris Clifton CS490D: Introduction to Data Mining Prof. Chris Clifton April 5, 2004 Mining of Time Series Data Time-series database Mining Time-Series and Sequence Data Consists of sequences of values or events changing

More information

Pattern Recognition ( , RIT) Exercise 1 Solution

Pattern Recognition ( , RIT) Exercise 1 Solution Pattern Recognition (4005-759, 20092 RIT) Exercise 1 Solution Instructor: Prof. Richard Zanibbi The following exercises are to help you review for the upcoming midterm examination on Thursday of Week 5

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET

CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET 69 CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET 3.1 WAVELET Wavelet as a subject is highly interdisciplinary and it draws in crucial ways on ideas from the outside world. The working of wavelet in

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

CS 112 Introduction to Programming

CS 112 Introduction to Programming Running Time CS 112 Introduction to Programming As soon as an Analytic Engine exists, it will necessarily guide the future course of the science. Whenever any result is sought by its aid, the question

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Nearest Neighbor Predictors

Nearest Neighbor Predictors Nearest Neighbor Predictors September 2, 2018 Perhaps the simplest machine learning prediction method, from a conceptual point of view, and perhaps also the most unusual, is the nearest-neighbor method,

More information

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi 1. Introduction The choice of a particular transform in a given application depends on the amount of

More information

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Midterm Examination CS540-2: Introduction to Artificial Intelligence Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search

More information

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Central Valley School District Math Curriculum Map Grade 8. August - September

Central Valley School District Math Curriculum Map Grade 8. August - September August - September Decimals Add, subtract, multiply and/or divide decimals without a calculator (straight computation or word problems) Convert between fractions and decimals ( terminating or repeating

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning Justin Chen Stanford University justinkchen@stanford.edu Abstract This paper focuses on experimenting with

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

File System Interface and Implementation

File System Interface and Implementation Unit 8 Structure 8.1 Introduction Objectives 8.2 Concept of a File Attributes of a File Operations on Files Types of Files Structure of File 8.3 File Access Methods Sequential Access Direct Access Indexed

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Clustering. (Part 2)

Clustering. (Part 2) Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works

More information

Diffusion Wavelets for Natural Image Analysis

Diffusion Wavelets for Natural Image Analysis Diffusion Wavelets for Natural Image Analysis Tyrus Berry December 16, 2011 Contents 1 Project Description 2 2 Introduction to Diffusion Wavelets 2 2.1 Diffusion Multiresolution............................

More information

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

Data Mining and Data Warehousing Classification-Lazy Learners

Data Mining and Data Warehousing Classification-Lazy Learners Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Spring 2019 Alexis Maciel Department of Computer Science Clarkson University Copyright c 2019 Alexis Maciel ii Contents 1 Analysis of Algorithms 1 1.1 Introduction.................................

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

MRT based Fixed Block size Transform Coding

MRT based Fixed Block size Transform Coding 3 MRT based Fixed Block size Transform Coding Contents 3.1 Transform Coding..64 3.1.1 Transform Selection...65 3.1.2 Sub-image size selection... 66 3.1.3 Bit Allocation.....67 3.2 Transform coding using

More information

31.6 Powers of an element

31.6 Powers of an element 31.6 Powers of an element Just as we often consider the multiples of a given element, modulo, we consider the sequence of powers of, modulo, where :,,,,. modulo Indexing from 0, the 0th value in this sequence

More information

Data Mining and Machine Learning: Techniques and Algorithms

Data Mining and Machine Learning: Techniques and Algorithms Instance based classification Data Mining and Machine Learning: Techniques and Algorithms Eneldo Loza Mencía eneldo@ke.tu-darmstadt.de Knowledge Engineering Group, TU Darmstadt International Week 2019,

More information

Comparison of Digital Image Watermarking Algorithms. Xu Zhou Colorado School of Mines December 1, 2014

Comparison of Digital Image Watermarking Algorithms. Xu Zhou Colorado School of Mines December 1, 2014 Comparison of Digital Image Watermarking Algorithms Xu Zhou Colorado School of Mines December 1, 2014 Outlier Introduction Background on digital image watermarking Comparison of several algorithms Experimental

More information