ABSTRACT

The developed system produces motifs by making use of data mining methods. Data mining is a method of extracting useful information from large databases, and data mining methods strongly support efforts to predict future outcomes. The developed system identifies motifs in a given time series dataset, and these patterns are employed to improve the prediction of future outcomes. The method used in the developed system is a data mining method called Symbolic Aggregate approximation (SAX). The Random Projection algorithm is used to discover previously unknown time series motifs, and these motifs are tested for accuracy by comparing them with the results of a brute force algorithm. Unlike other methods that require a large number of parameters, this method requires only one parameter to identify the time series discords.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Figures
List of Tables
1. Background and Rationale
   Data Mining
   Need for Data Mining
      Scalability
      High Dimensionality
      Heterogeneous and Complex Data
      Data Ownership and Distribution
      Nontraditional Analysis
   Artificial Neural Network Method
   Symbolic Aggregate approximation
   Time Series Motifs
2. Narrative
   TCOON Data
   Symbolic Aggregate approximation
   EMMA Algorithm
   ADM Algorithm
   Probabilistic Discovery of Time Series Motifs
      Euclidean Distance Measure
      Lower Bounding Distance Measure

      2.5.3 Time Series
      Subsequence
      Sliding Window
      Match
      Trivial Match
   Time Series Projection
   Brute Force Algorithm
   Finding Planted Motifs
3. Developed Research
   Code Design
   Obtaining SAX Breakpoints
   Random Projection Algorithm Functionality
   Brute Force Algorithm
   Basic System Functionality
4. Testing and Evaluation
   Problems Faced while Developing the Project
   Results
5. Expected Results and Future Work
Acknowledgements
Appendix
Bibliography and References

LIST OF FIGURES

Figure 1.3 TCOON Station
Figure 2.1 Tidal Datums
Figure 2.2 Three Representations of Distance Measures
Figure 2.3 EMMA Algorithm
Figure 2.4 ADM Algorithm
Figure 2.5 Euclidean Distance Measure
Figure 2.6 Lower Bounding Distance Measure
Figure 2.7 Time Series
Figure 2.8 Time Series Representation
Figure 2.9 Comparison between Time Series Representations
Figure 2.10 Match
Figure 2.11 Trivial Match
Figure 2.12 Time Series
Figure 2.13 A Randomly Chosen Mask
Figure 2.14 Another Iteration with Another Randomly Chosen Mask
Figure 2.15 Two Planted Motifs
Figure 2.16 Data Set with Planted Motifs
Figure 2.17 Contour Plot with Planted Motifs
Figure 2.18 Motifs Discovered
Figure 3.1 A Normal Probability Plot
Figure 3.2 Time Series Representation
Figure 3.3 Symbolic Strings

Figure 3.4 System Architecture
Figure 4.1 Scalability of Algorithms
Figure 4.2 Motif Matches for two day long
Figure 4.3 Best Motif for two day long
Figure 4.4 Motif Matches for week long
Figure 4.5 Best Motif for week long
Figure 4.6 Motif Matches for the year 2001 two day long
Figure 4.7 Best Motif among the curves for the year 2001 two day long
Figure 4.8 Motif Matches for the year 2001 for week long
Figure 4.9 Best Motif among the curves for the year 2001 for week long
Figure 4.10 Motif Matches for the year 2008 for two day long
Figure 4.11 Best Motif for the year 2008 for week long
Figure 4.12 Best Motif for the year 2008 for week long for brute force algorithm

LIST OF TABLES

Table 2.1 TCOON Data Schema
Table 2.2 Input timeseries2symbol
Table 2.3 Output timeseries2symbol
Table 3.4 Pseudo Code for the EMMA Algorithm

1. BACKGROUND AND RATIONALE

The discovery of time series motifs is of much importance for improving water level predictions. These predictions are useful to the shipping industry, to people living in coastal areas, and even for emergency evacuation in case of a hurricane. Different algorithms are available for the discovery of time series motifs; this project makes use of the random projection algorithm to extract the motifs in the given primary water levels.

1.1 Data Mining

Data mining is the process of organizing large amounts of data and selecting the appropriate information from it. Data mining is commonly known as the science of automatically discovering useful information from large data sets [Wikipedia 2008]. Traditionally, most information has been extracted from stored data, but as the amount of data being stored has increased, data sets have grown rapidly in size and complexity, which emphasizes the need for more sophisticated tools. Current technologies have made data collection and organization much easier. Data mining is an integral part of knowledge discovery in databases (KDD). Knowledge discovery is the process of converting unprocessed data into a readable format; it provides precise information that can be easily understood by a user. Data mining uses real data and makes the resulting information clearly readable, whereas in some approaches, such as the neural network method, the extracted information is not as clear. Some data mining methods are based only on prediction rather than

knowledge discovery in data sets. The collected data can be stored in various formats and can be distributed across many sites. [Tan 2006]

1.2 Need for Data Mining

As data sets have grown in size and complexity, they have posed difficulties for traditional data analysis techniques, in terms of both analysis and storage, and this led to the development of data mining techniques. Some of the challenges that led to the development of data mining are described in the following sections.

1.2.1 Scalability

Due to the increase in the amount of data being collected, data sets are growing to gigabytes, terabytes, and even petabytes. To handle data sets of this size, data mining algorithms must be scalable. Various data mining algorithms employ special search strategies to handle exponential search problems. Sampling is another technique by which scalability can be improved. [Tan 2006]

1.2.2 High Dimensionality

Data sets now come with hundreds or even thousands of attributes instead of the few commonly used before. Traditional data analysis techniques work for low-dimensional data, but they cannot handle high-dimensional data. For example, if temperature measurements of the water are taken at regular intervals, the number of dimensions increases as the number of measurements over a given period increases. This type of data may not be handled well by traditional data analysis techniques. [Tan 2006]

1.2.3 Heterogeneous and Complex Data

Traditional data analysis methods make use of data sets whose attributes are all of the same type. Analysis may be difficult if attributes reside on different systems. Data mining techniques can handle heterogeneous attributes, and they take into consideration relationships in the data, such as temporal and spatial autocorrelation and parent-child relationships between semistructured text elements and XML documents. [Tan 2006]

1.2.4 Data Ownership and Distribution

In some situations the data needed for analysis does not reside in one location, or it may not belong to one particular organization; it may be distributed among multiple entities. This kind of data raises challenges such as how to reduce the amount of time needed to perform the computation and how to solve the security issues of a distributed computation. Data mining techniques address the problem of data ownership and distribution by reducing the amount of time taken for computation. [Tan 2006]

1.2.5 Nontraditional Analysis

The traditional approach to data analysis is based on the hypothesize-and-test paradigm: based on a hypothesis, an experiment is designed to collect data, and the collected data is analyzed. This process is very time consuming and difficult. Data mining techniques make use of nontraditional analysis and often work with opportunistic samples of the data. [Tan 2006]

1.3 Texas Coastal Ocean Observation Network

The Texas Coastal Ocean Observation Network (TCOON) is a state-of-the-art water-level measurement system along the Texas Coast. TCOON is operated by the Conrad Blucher Institute for Surveying and Science (CBI) at Texas A&M University-Corpus Christi, and its measuring stations are located along the Gulf Coast of Texas. TCOON provides measurements of precise water levels, wind, temperature, and barometric pressure. It follows NOAA/NOS standards and maintains a real-time, online database. TCOON data can be used for predicting tidal datums and littoral boundaries, for oil-spill response and navigation, and even for storm prediction and preparation. [TCOON 2008]

Figure 1.3 shows a TCOON station used to obtain environmental measurements. Sensors measure the various environmental parameters, and these sensors are controlled by a data collection computer. Solar panels provide the power needed by the TCOON stations. Most TCOON stations make use of the Next Generation Water Level Measurement System (NGWLMS). This system has a computer at its heart that controls the sensors, stores the collected data on-site temporarily, and transmits the data onward. TCOON measures environmental parameters such as water levels at six-minute intervals. The collected data is then stored in a database and used for forecasting water levels, wind speeds, and barometric pressures. [TCOON 2008]

Figure 1.3 TCOON station [TCOON 2008]

1.4 Symbolic Aggregate approximation (SAX)

SAX is the primary symbolic representation for time series data; it allows dimensionality reduction and indexing, and it provides a lower-bounding distance measure. SAX was developed in 2002 by Eamonn Keogh and Jessica Lin [Eamonn 2008]. In this project, SAX is used for dimensionality reduction and for the discretization of the time series. SAX requires less storage space and is equally as good as the Discrete Fourier Transform (DFT) and the Discrete Wavelet Transform (DWT). SAX provides solutions to many data mining tasks, including motif discovery. SAX performs well compared to other tools and represents the state of the art for time series data sets. SAX is used in many applications: to symbolize street data, to create discrete data from continuous data, and to perform anomaly detection in network traffic. [Eamonn 2008]

1.5 Time Series Motifs

In time series data mining models, the main task is to find approximately repeated subsequences, called motifs, in a longer time series. The two main limitations in finding time series motifs are the poor scalability of the motif discovery algorithm being used and the inability to discover motifs in the presence of noise. The random projection algorithm used in this project can find time series motifs with very high probability even in the presence of noise or don't-care symbols. Some of the ways motifs are used by other algorithms are as follows [Chiu 2003]:

Motif discovery is important for mining association rules in time series. Such motifs are commonly referred to as primitive shapes and frequent patterns.

Many time series algorithms work by developing typical prototypes of each class. These prototypes are usually considered motifs.

Several time series detection algorithms model normal behavior using a set of typical shapes and flag future patterns that differ from the typical shapes.

Motifs are also utilized in robotics, where a method has been introduced to generalize from a set of qualitatively dissimilar experiences; these experiences are known as motifs.

Motifs are of much significance in medical data mining, for example in characterizing a physiotherapy patient's recovery based on the discovery of motifs. [Chiu 2003]

In all of the above areas, the discovery of motifs plays an important role in finding patterns in the given time series data.

2. NARRATIVE

The main objective of the project is to find approximately repeated subsequences in a longer time series. The data sources are the various water levels recorded along the Texas coast, such as information from the TCOON database. The data extracted from the database is stored in plain text format on a local machine and then used as input to the SAX-based random projection algorithm, which produces motifs.

2.1 TCOON Data

There are many TCOON stations located along the coast of the Gulf of Mexico. The water level data from the Gulf of Mexico is collected by all the DNR stations located along the coast. Some of these serve socio-economic purposes, while others are for research. Each DNR station has a station datum (STND), usually an arbitrary zero used internally to measure all other elevations, including water level, Mean Higher High Water (MHHW), Mean High Water (MHW), Mean Tide Level, Mean Sea Level (MSL), and benchmarks. A benchmark is usually a brass survey disk attached permanently to a stainless steel rod driven 50 ft into the ground [DNR 2008]. All water elevations along the Gulf of Mexico coast are measured at each station relative to the station datum. The arbitrary zero is chosen so that all water level observations are positive numbers. Every station has its own unique station datum due to the physical conditions at that station, and benchmarks maintain the station's zero point over time [TCOON 2008]. Figure 2.1 shows published tidal datum information for some of the stations. In this project, only primary water levels are considered.

Figure 2.1 Tidal Datums [DNR 2008]

The data collected from all DNR stations is stored in the TCOON database. The TCOON database schema has five logical fields (id, jul, ser, smv0-smv9, and src), as follows:

Table 2.1 TCOON Database Schema

Field  Type         Null  Key  Default  Extra
id     char(3)      NO         NULL
jul    int(11)      NO    PRI  NULL
ser    char(4)      NO    PRI  NULL
smv0   smallint(6)  NO
smv1   smallint(6)  NO
smv2   smallint(6)  NO
smv3   smallint(6)  NO
smv4   smallint(6)  NO
smv5   smallint(6)  NO
smv6   smallint(6)  NO
smv7   smallint(6)  NO
smv8   smallint(6)  NO
smv9   smallint(6)  NO
src    char(4)      NO    PRI

id - station id
jul - a date/time stamp in the format YYYYjjjHH, where YYYY is the year, jjj is the Julian day, and HH is the hour.

ser - a series identifier for the data type (pwl = primary water level, wsd = wind speed)
smv0-9 - value fields, one for each six-minute interval
src - identifies where the data came from (nesdis, nwstag, etc.)

In the project, data from the TCOON database is saved on the local machine in a plain text file, and the SAX algorithm is applied to the stored data.

2.2 Symbolic Aggregate approximation (SAX)

SAX is the first symbolic representation of time series that allows dimensionality reduction and indexing with a lower-bounding distance measure. Lower bounding means that the estimated distance in the reduced space is always less than or equal to the distance in the original space. Such lower-bounding functions are known for wavelets, Fourier transforms, SVD, piecewise polynomials, Chebyshev polynomials, and clipped data [Lin 2008]. Symbolic approximations are used to represent time series because they enable the following techniques:

Hashing
Suffix trees
Markov models

The best known symbolic approximation that offers lower bounding is SAX. It provides:

Lower bounding of the Euclidean distance
Lower bounding of the DTW distance
Dimensionality reduction
Numerosity reduction [Lin 2008]

These features make SAX usable with most time series representations. In order to obtain SAX, one first needs to convert the time series to

piecewise aggregate approximation (PAA) representation, and then convert the PAA to symbols. SAX takes a linear amount of time. Figure 2.2(A) shows the Euclidean distance between two time series: the square root of the sum of the squared differences of each pair of corresponding points. Figure 2.2(B) shows the distance measure defined for the PAA approximation: the square root of the sum of the squared differences between each pair of corresponding PAA coefficients, multiplied by the square root of the compression ratio. [Motifs 2002] Figure 2.2(C) illustrates the distance between two symbolic representations of a time series.

Figure 2.2 Three representations of distance measures [Motifs 2002]

In this project the raw data obtained from TCOON is converted to the PAA representation, and this representation is then converted to symbols. SAX was developed within the Matlab environment. The method chosen to implement the project plays an important role in its success. There have been many high

level representations proposed for data mining, including Fourier transforms, wavelets, eigenwaves, and piecewise polynomial models. Many of these suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as that of the original data; second, although distance measures can be defined on the symbolic approaches, they have little correlation with the distance measures defined on the original series. SAX solves both problems: it allows dimensionality and numerosity reduction, and it provides an efficient method to find the most unusual time series subsequence [Eamonn 2008]. The symbolic representation of the original time series produced by SAX is used as input to the random projection algorithm to identify motifs, and the motifs are used to improve the water level prediction [Eamonn 2008].

SAX makes use of two functions to produce the motifs. The first is the timeseries2symbol function, which takes in a raw time series and converts it into strings. One of the main requirements of this function is that N/n must be an integer, where N is the length of the raw time series and n is the number of symbols in the low-dimensional approximation of the subsequence [Eamonn 2008]. The input to this function is shown in Table 2.2.

Table 2.2 Input timeseries2symbol [Eamonn 2008]

Input:
data - the raw time series
N - the length of the raw time series
n - the number of symbols in the low dimensional approximation of the subsequence
alphabet_size - the number of discrete symbols (2 <= alphabet_size <= 10)
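The conversion performed by timeseries2symbol can be sketched as follows. This is an illustrative Python reimplementation, not the original Matlab code; the breakpoints shown are the standard SAX breakpoints for an alphabet of size 4, and the function assumes N/n is an integer, as the text requires.

```python
import numpy as np

def timeseries_to_symbols(data, n, alphabet_size=4):
    """Sketch of SAX: z-normalize, reduce to n PAA segments, map to symbols."""
    # Breakpoints dividing a N(0,1) distribution into equal-probability
    # regions; only the standard values for alphabet_size = 4 are given here.
    breakpoints = {4: [-0.67, 0.0, 0.67]}[alphabet_size]
    data = np.asarray(data, dtype=float)
    # z-normalize so the Gaussian breakpoints apply
    data = (data - data.mean()) / data.std()
    # PAA: average each block of N/n consecutive points
    segments = data.reshape(n, len(data) // n).mean(axis=1)
    # map each PAA coefficient to a letter a, b, c, d
    return "".join("abcd"[np.searchsorted(breakpoints, s)] for s in segments)

series = [1, 2, 3, 4, 5, 6, 7, 8]          # toy raw time series, N = 8
word = timeseries_to_symbols(series, n=4)  # reduce to n = 4 symbols
print(word)  # prints "abcd": a steadily rising series yields a rising word
```

A monotonically increasing series maps to monotonically increasing symbols, which makes the dimensionality reduction easy to check by eye.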

For the above input, the output of the timeseries2symbol function is symbolic data, as shown in Table 2.3.

Table 2.3 Output timeseries2symbol [Eamonn 2008]

Output:
symbolic_data - matrix of symbolic data (no repetition). If consecutive subsequences have the same string, only the first occurrence is recorded, with a pointer to its location stored in "pointers"
pointers - locations of the first occurrences of the strings

The other function is min_dist, which calculates the minimum distance between two strings of the same length. The input to this function is [Eamonn 2008]:

string1 - first string
string2 - second string
alphabet_size - the alphabet size used to construct the strings
compression_ratio - original data length / symbolic length

The output of the function is:

distance - lower-bounding distance

Usage: distance = min_dist(str1, str2, alphabet_size, compression_ratio)

The distance measure computed by this function is not a good measure for comparing strings in general, but it works well for classification and clustering of the strings [Eamonn 2008].

One of the main advantages of the data mining method of forecasting is that it provides a water level forecasting system for the bays and the Texas coast in general. Currently, harmonic water forecasts for bays are provided through NOAA's Physical

Oceanographic Real-Time System (PORTS), and harmonic and persistence model forecasts are provided by the DNR website. The other advantages of the data mining method are the ability to model nonlinear systems, its robustness, and its generic modeling capability. [TCOON 2008]

The input to the developed model plays an important role in determining its performance. One of the important inputs to the model is the primary water level. In the Gulf of Mexico, meteorological factors play a larger role in water level forecasts than tidal factors. The implemented system is able to predict the water levels more accurately than other models.

2.3 EMMA Algorithm

Many algorithms have been proposed for the discovery of motifs. Among them, the EMMA algorithm has the widest application range [Chiu 2003]. The pseudo code for the algorithm is shown in Figure 2.3; the line numbers in Figure 2.3 are used in the discussion of the algorithm that follows.

Figure 2.3 EMMA Algorithm [Chiu 2003]

The algorithm begins by sliding a moving window of length n across the time series (line 4). The hash function h() (line 5) normalizes the time series, converts it to the symbolic representation, and computes an address:

h(C) = 1 + sum_{i=1}^{w} (ord(c_i) - 1) * a^(i-1)   (1) [Chiu 2003]

where ord(i) is the ordinal value of symbol i, i.e., ord(a) = 1, ord(b) = 2, and so on. The hash function computes an integer in the range 1 to a^w, and a pointer to the subsequence is placed in the corresponding bucket (line 6) [Chiu 2003]. At this point we have simply rearranged the data into a hash table with a^w addresses and a total size of O(m). This information can be used as a heuristic for motif search, since if there is a truly over-represented pattern in the time series, we should expect that most, if not all, copies of it hashed to the same location. The address with the most hits is called the Most Promising Candidate (MPC) (line 8). A list of all subsequences that mapped to this address is built (line 9), but it is possible that some subsequences that hashed to different addresses are also within R of the subsequences contained in the MPC. The MINDIST function can be used to determine which addresses could possibly contain such subsequences (line 12). All such subsequences are added to the list of subsequences that need to be examined in our small matrix (line 14). At this point the list of similar subsequences is passed to the ADM subroutine (line 17). [Chiu 2003]

Next, a simple test is performed. If the number of matches to the current best-so-far motif is greater than the largest unexplored neighborhood (line 18), we are done: we record the best-so-far motif as the true best match (line 19), note the number of matching subsequences (line 20), and abandon the search (line 21). If the test fails, however, we must set the most promising candidate to be the next largest bucket (line 23), initialize the new neighborhood with the contents of the bucket (line 24), and loop back to line 11,

where the full neighborhood is discovered (lines 13 and 14) and the search continues [Chiu 2003]. For simplicity, the pseudo code ignores one possible optimization: it is possible (in fact, likely) that the neighborhood in one iteration will overlap with the neighborhood in the next. Even so, this algorithm, together with the ADM algorithm, is inefficient: for processing more than one year of data, the computing resources of a personal computer appear insufficient. The random projection algorithm for motif discovery [Chiu 2003] solves the problem of processing more than one year of data.

2.4 ADM Algorithm

The ADM algorithm is used for searching the small neighborhood matrix. The ADM algorithm pre-computes an arbitrary set of distances. The matrices ADM and MIN store lower and upper bounds on the distance between any two objects. Each entry ADM[i, j] is either the exact distance between i and j (i.e., one of those that are pre-computed) or a lower bound on the distance between i and j. In other words, if P_{i,j} contains the set of all paths from i to j, then ADM[i, j] is the largest lower bound on the distance between i and j, obtained from all paths in P_{i,j} using the triangle inequality [Chiu 2003].

Figure 2.4 ADM Algorithm [Chiu 2003]

A matrix MIN is also used because it is impractical to enumerate all paths in P_{i,j} to get the maximum lower bound between i and j. MIN[i, j] stores the minimum distance of

any path from i to j (i.e., it is the least upper bound on the distance between i and j). ADM and MIN are initialized in lines 4 to 8. [Chiu 2003]

In lines 10 to 22, we construct the matrices ADM and MIN. For each stage k, 1 <= k <= n, ADM[i, j] is the greatest lower bound of any path from i to j that does not pass through an object numbered greater than k. Similarly, MIN[i, j] is the smallest upper bound on the distance between i and j. Note that the algorithm is further optimized by storing or computing only half of each matrix, due to distance symmetry [Chiu 2003].

On line 24, we allocate and initialize an array, count, which stores the number of items within R (i.e., the number of matching subsequences) for each motif center. In lines 26 to 35, we scan the matrix ADM and compute the actual distance between i and j whenever ADM[i, j] is a lower bound smaller than R (because the true distance might still be greater than R). Again, the optimization from distance symmetry is omitted for simplicity. At each step, we keep track of the number of items within R (line 32) [Chiu 2003].

The ADM algorithm is a scheme for answering best-match queries from a file containing a collection of objects. A best-match query finds the objects in the file that are closest (according to some (dis)similarity measure) to a given target. Starting with a data structure that reflects some precomputed intrafile distances, the number of comparisons required to achieve the desired results can be reduced using the triangle inequality. The technique generalizes to allow the optimal use of any given set of precomputed intrafile distances, and empirical results illustrate the effectiveness and performance of the ADM algorithm relative to previous algorithms [Lin 2002]. Finally, the ADM algorithm returns the best-matching motif, the motif that has the most items within R, together with a count of the number of matching subsequences.
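The triangle-inequality bounding idea at the heart of ADM can be sketched as follows. This is a simplified illustration, not the full ADM pseudo code: given two precomputed distances, it derives a lower bound on an uncomputed distance and uses the bound to skip the exact computation when it already exceeds R. The objects and distances here are hypothetical.

```python
# Hypothetical one-dimensional "objects"; any metric distance would do.
objects = [0.0, 1.0, 9.0]
dist = lambda a, b: abs(objects[a] - objects[b])

# Suppose only d(0,1) and d(0,2) were precomputed.
d01, d02 = dist(0, 1), dist(0, 2)

# Triangle inequality: |d(0,2) - d(0,1)| <= d(1,2), so this is a
# valid lower bound on the uncomputed distance d(1,2).
lower_bound = abs(d02 - d01)

R = 2.0  # matching range
if lower_bound > R:
    # The bound alone proves objects 1 and 2 cannot match within R,
    # so the exact distance computation is skipped entirely.
    matches = False
else:
    matches = dist(1, 2) <= R

print(lower_bound, matches)  # prints: 8.0 False
```

Because the bound (8.0) already exceeds R (2.0), the pair is rejected without ever computing d(1, 2); ADM applies this idea systematically across the whole neighborhood matrix.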

2.5 Probabilistic Discovery of Time Series Motifs

The SAX-based approach is probabilistic in nature; empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or don't-care symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately and gradually improving the quality of its results over time [Chiu 2003].

Noise plays an important role when attempting to discover motifs. Even small amounts of noise can affect distance measures, even the most commonly used ones such as the Euclidean distance. To cope with noise, the definition of time series motifs is generalized to allow for don't-care subsections, and a novel time- and space-efficient algorithm can be used to discover motifs [Chiu 2003].

2.5.1 Euclidean Distance Measure

The Euclidean distance is the ordinary distance between two points, derivable by repeated application of the Pythagorean theorem. It is the square root of the sum of the squared differences between the coordinates of a pair of objects, and when this formula is used as the distance, the space becomes a metric space [Euclidean 2008]. The Euclidean distance between two points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n) in Euclidean n-space can be defined as

D(P, Q) = sqrt( (p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2 )   (1)

Figure 2.5 illustrates the Euclidean distance measure between two points P and Q.

Figure 2.5 Euclidean Distance Measure [Chiu 2003]

The one-dimensional Euclidean distance between points P = (p_x) and Q = (q_x) is computed as

D(P, Q) = sqrt( (p_x - q_x)^2 ) = |p_x - q_x|   (2)

The two-dimensional distance between points P = (p_x, p_y) and Q = (q_x, q_y) is expressed as

D(P, Q) = sqrt( (p_x - q_x)^2 + (p_y - q_y)^2 )

The three-dimensional distance between two 3D points P = (p_x, p_y, p_z) and Q = (q_x, q_y, q_z) is computed as

D(P, Q) = sqrt( (p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2 )   (3)

For two N-dimensional points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), the N-dimensional distance is computed as

D(P, Q) = sqrt( sum_{i=1}^{n} (p_i - q_i)^2 )   (4) [Euclidean 2008]

2.5.2 Lower Bounding Distance Measure

Lower bounding means that the estimated distance in the reduced space is less than or equal to the distance in the original space.

Figure 2.6 Lower Bounding Distance Measure [Lin 2002]

The lower bounding distance measure D_LB between the two time series, illustrated in Figure 2.6, can be determined as

D_LB(Q, S) = sqrt( sum_{i=1}^{M} (sr_i - sr_{i-1}) (qv_i - sv_i)^2 )

The lower bounding distance so calculated is less than or equal to the Euclidean distance measure.

The following are definitions of some of the data types used in the discovery of time series motifs.

2.5.3 Time Series

A time series is a series of data collected in sequence; it is useful for decision making as an effective way to explore the information hidden in the given data. A time series T = c_1, c_2, ..., c_m can be defined as an ordered set of m real-valued variables. Time series can be very long, containing billions or trillions of observations, and they contain subsections, which are called subsequences. In short, a time series is a collection of observations made sequentially in time [Lin 2002].
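The Euclidean distance of Equation (4) and the lower-bounding property of Section 2.5.2 can be sketched together as follows. The lower bound here uses the PAA form described earlier in this chapter (distance between segment means, scaled by the square root of the compression ratio), which is a simplification of the region-based formula above.

```python
import math

def euclidean(p, q):
    """Equation (4): square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def paa(series, w):
    """Reduce a series of length n to w segment means (n/w must be an integer)."""
    seg = len(series) // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def paa_lower_bound(p, q, w):
    """PAA distance: Euclidean distance between PAA coefficients, scaled by
    sqrt(n/w). By construction it never exceeds euclidean(p, q)."""
    n = len(p)
    return math.sqrt(n / w) * euclidean(paa(p, w), paa(q, w))

p = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
q = [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0]
d_true = euclidean(p, q)
d_lb = paa_lower_bound(p, q, w=4)
print(d_lb <= d_true)  # prints: True (reduced-space distance never exceeds the true one)
```

This is exactly the guarantee SAX relies on: pruning with the cheap reduced-space distance can never discard a pair that the true Euclidean distance would accept.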

Figure 2.7 Time Series [Lin 2002]

In Figure 2.7 the blue curve represents the time series. Time series representations can be classified into many types, as illustrated in Figure 2.8: data adaptive representations (sorted coefficients, piecewise polynomials such as piecewise linear approximation by interpolation or regression, adaptive piecewise constant approximation, singular value decomposition, symbolic representations such as natural language and strings, and trees) and non data adaptive representations (wavelets, including the orthonormal Haar and Daubechies families and the bi-orthonormal Coiflet and Symlet families, random mappings, spectral methods such as the Discrete Fourier Transform and the Discrete Cosine Transform, and the Piecewise Aggregate Approximation).

Figure 2.8 Time Series Representation [Motifs 2002]

In the project, PAA is used to calculate the distance, defined as the square root of the sum of the squared differences between each pair of corresponding PAA coefficients, multiplied by the square root of the compression ratio. Figure 2.9 compares several time series representations (DFT, PLA, Haar wavelets, and APCA).

Figure 2.9 Comparison between Time Series Representations [Motifs 2002]

2.5.4 Subsequence

Given a time series T of length m, a subsequence C of T is a sampling of length n <= m of contiguous positions from T; in other words, a subsequence C = t_p, ..., t_{p+n-1} for 1 <= p <= m-n+1. Since any subsequence of a time series can be a motif, it is necessary to extract all of them, and this can be done with a sliding window. [Chiu 2003]
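Extracting every subsequence with a sliding window, as described above, can be sketched as:

```python
def subsequences(T, n):
    """Slide a window of length n across T one position at a time,
    yielding every contiguous subsequence C = t_p, ..., t_{p+n-1}."""
    return [T[p:p + n] for p in range(len(T) - n + 1)]

T = [3, 1, 4, 1, 5, 9, 2, 6]     # toy time series, m = 8
windows = subsequences(T, n=3)   # m - n + 1 = 6 subsequences
print(len(windows))              # prints: 6
print(windows[0], windows[-1])   # prints: [3, 1, 4] [9, 2, 6]
```

Each extracted window becomes one row of the matrix S used by the random projection step later in this chapter.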

2.5.5 Sliding Window

Given a time series T of length m and a subsequence length n, we can build a matrix S of all possible subsequences by sliding a window of size n across T and placing subsequence C_p in the p-th row of S. The size of this matrix S is (m - n + 1) by n. [Chiu 2003]

To determine whether two subsequences are alike, a match test is needed.

2.5.6 Match

Given a positive real number R (the range), a time series T containing a subsequence C beginning at position p, and another subsequence M beginning at position q: if the distance measure D(C, M) <= R, then M is said to be a matching subsequence of C. This is illustrated in Figure 2.10.

Figure 2.10 Match [Chiu 2003]

2.5.7 Trivial Match

Given a time series T containing a subsequence C beginning at position p and a matching subsequence M beginning at position q, M is a trivial match to C if either p = q or there does not exist a subsequence M' beginning at q', with q' between p and q, such that D(C, M') > R. A trivial match is illustrated in Figure 2.11. [Chiu 2003]

Figure 2.11 Trivial Match [Chiu 2003]

2.6 Time Series Projection

For time series projection in this project we consider the K-Motif(n, R) case. Suppose we have a time series T of 1,000 data points that has two occurrences of a motif of length 16, at times T_1 and T_58, and that the occurrence at T_58 is corrupted by noise from positions 8 to 12.

Figure 2.12 Time Series [Chiu 2003]

Figure 2.12 illustrates the process: subsequences are extracted by a sliding window, converted into symbolic form, and then placed into the matrix S. Here

each row of the matrix S points back to the original subsequence. Once the matrix S is constructed, random projection begins. Two columns of S are randomly selected as a mask, and the k = 985 words in S are hashed into buckets based on their values in the selected columns.

Figure 2.13 A randomly chosen mask [Chiu 2003]

Figure 2.13 illustrates this: a mask {1, 2} is chosen at random, and the symbols in columns 1 and 2 are used to project the rows of S into buckets. Collisions are recorded by incrementing the appropriate cell of the collision matrix; buckets are not reused from one iteration to the next.

Figure 2.14 Another iteration with another randomly chosen mask [Chiu 2003]

Figure 2.14 illustrates the result of another iteration, with {2, 4} as the chosen mask. After the process is repeated several times, the collision matrix is examined. If all the entries are roughly uniform, there are no more motifs to be found in the given dataset.
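One bucketing iteration of the kind illustrated in Figures 2.13 and 2.14 might look like the following Python sketch (a hypothetical helper, not the code of [Chiu 2003]; it records collisions in a dictionary rather than a sparse matrix):

```python
import random
from collections import defaultdict

def projection_iteration(words, mask_size, collisions):
    """One iteration of random projection over SAX words: choose
    mask_size column positions at random, bucket every word by the
    symbols seen through the mask, and record one collision for each
    pair of words landing in the same bucket."""
    word_len = len(words[0])
    mask = sorted(random.sample(range(word_len), mask_size))
    buckets = defaultdict(list)
    for row, word in enumerate(words):
        key = "".join(word[c] for c in mask)
        buckets[key].append(row)
    for rows in buckets.values():
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                collisions[(rows[i], rows[j])] += 1
    return mask
```

Repeating this over many iterations accumulates high counts only for pairs of rows that agree on many randomly chosen columns, which is what the contour plots later in the chapter visualize.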

2.7 Brute Force Algorithm

Brute-force search is simple to implement and will always find a solution if one exists. However, its cost is proportional to the number of candidate solutions, which in many practical problems grows very quickly as the size of the problem increases. Brute-force search is therefore typically used when the problem size is limited, or when problem-specific heuristics can reduce the set of candidate solutions to a manageable size. The method is also used when simplicity of implementation is more important than speed, for example in critical applications where any error in the algorithm would have very serious consequences, or when using a computer to prove a mathematical theorem. Brute-force search is also useful as a "baseline" method when benchmarking other algorithms or metaheuristics; indeed, it can be viewed as the simplest metaheuristic. Brute-force search should not be confused with backtracking, in which large sets of solutions can be discarded without being explicitly enumerated.

2.8 Finding Planted Motifs

To see how the algorithm recovers planted motifs, consider a small dataset into which two motifs, each with two occurrences, are planted. Figure 2.15 shows the two motifs, which are clearly distinct from each other and noisy along their entire length.

Figure 2.15 Two Planted Motifs [Chiu 2003]

These four subsequences (two occurrences of each motif) are planted into a small dataset of length 1128, as illustrated in Figure 2.16. The algorithm is run with n = 128, w = 16, and a = 4 for 100 iterations.

Figure 2.16 Dataset with planted motifs [Chiu 2003]

Figure 2.17 shows the resulting collision matrix as a contour plot; the dark smudges indicate the positions of the planted motifs.

Figure 2.17 Contour Plot with Planted Motifs [Chiu 2003]

The subsequences discovered at these positions are shown in Figure 2.18. They may not look exactly like the planted motifs, but they are similar to each other.

Figure 2.18 Motifs Discovered [Chiu 2003]

3. DEVELOPED RESEARCH

The data source for the project is the TCOON database. Data from the TCOON database is stored in a file on a local machine and then passed to SAX. The motifs obtained from SAX are used to improve the predictions from the regression. The idea behind using data mining here is to take past data from the TCOON database and pass these values through SAX to yield motifs. Extracting motifs directly from raw time series data takes a large amount of time, so SAX is used to reduce the computation time. The process does not require a database to store the data extracted from the TCOON data sets: the data is stored in a file, so the method can be implemented on a local machine.

3.1 Code Design

The data mining method of prediction is tested within the MATLAB environment. The computers used for the study are Pentium IV PCs with 3.19 GHz CPUs.

3.2 Obtaining SAX

To obtain SAX, the time series data is first converted to its piecewise aggregate approximation (PAA), and the PAA is then converted to symbols, yielding a discrete representation. It is desirable for the discretization technique to produce symbols with equal probability. Because normalized subsequences have a highly Gaussian distribution, the breakpoints can be determined easily. This is illustrated in Figure 3.1, where subsequences of length 128 are extracted from 8 different time series and a normal probability plot of the data is drawn.

Figure 3.1 A Normal Probability Plot [Motifs 2002]

Breakpoints

The breakpoints are a sorted list of numbers B = beta_1, ..., beta_(a-1) such that the area under an N(0,1) Gaussian curve from beta_i to beta_(i+1) equals 1/a (with beta_0 = -infinity and beta_a = +infinity). The breakpoints can be determined by looking them up in a statistical table. [Chiu 2003] In Figure 3.2, the blue curve represents the time series data; a fixed-size window is taken and the distance between the window and the time series is calculated.

Figure 3.2 Time Series Representation [SAX 2008]
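Instead of a statistical table, the breakpoints can be read off the inverse CDF of the standard normal, as in this Python sketch (illustrative only; `sax_breakpoints` is a hypothetical helper name):

```python
from statistics import NormalDist

def sax_breakpoints(a):
    """Breakpoints beta_1 .. beta_(a-1) that cut the N(0,1) curve into
    a regions of equal area 1/a: these are simply the i/a quantiles of
    the standard normal distribution."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]
```

For a = 4, for example, this yields approximately [-0.67, 0, 0.67], the familiar quartile breakpoints of the SAX literature.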

A Gaussian curve is then divided into a equiprobable regions (three, in this example). PAA coefficients below the smallest breakpoint are mapped to the symbol a, coefficients greater than or equal to the smallest breakpoint and less than the second smallest breakpoint are mapped to the symbol b, and so on; the symbols are thus obtained. This is shown in Figure 3.3, where the red lines are the breakpoints, segments of the curve above the highest breakpoint are labeled c, and segments below the lowest breakpoint are labeled a. The output for the time series in the figure is baabccbc.

Figure 3.3 Symbolic Strings [SAX 2008]

3.3 Random Projection Algorithm

The random projection algorithm is used to extract motifs from time series data; it makes use of SAX. The projection algorithm is designed to attack the planted (w, d)-motif problem: each string is planted with exactly one approximate occurrence of an unknown motif y of length w, an occurrence with d substitutions. The planted motif problem has a huge search space for motif discovery; in
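The whole discretization step described above (normalize, reduce to PAA, map to symbols) can be sketched in Python as follows (a simplified stand-in for the project's MATLAB timeseries2symbol function; `sax_word` is a hypothetical name):

```python
from statistics import NormalDist

def sax_word(series, w, a):
    """Convert a time series to a SAX word of length w over an
    alphabet of size a: z-normalize, average into w PAA frames, then
    map each frame mean to a letter via the Gaussian breakpoints."""
    n = len(series)
    mu = sum(series) / n
    sd = (sum((x - mu) ** 2 for x in series) / n) ** 0.5 or 1.0
    z = [(x - mu) / sd for x in series]
    frames = [z[i * n // w:(i + 1) * n // w] for i in range(w)]
    coeffs = [sum(f) / len(f) for f in frames]
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    alphabet = "abcdefghijklmnop"[:a]
    # a coefficient's symbol index is the number of breakpoints at or below it
    return "".join(alphabet[sum(b <= c for b in breakpoints)] for c in coeffs)
```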

order to reduce this huge search space, the projection algorithm is used. Random projection is used to guess at least some of the occurrences of the unknown planted motif. [Chiu 2003]

The main factors behind the success of random projection are the choices of the projection size k, the number of iterations i, and the threshold value s. The projection size k has to be chosen such that k < w - d, so that a randomly chosen mask can avoid the substituted positions. To make the projection algorithm useful for motif discovery, the dimensionality of the subsequences must be significantly reduced while high correlation is retained. [Chiu 2003]

3.3.1 Functionality

In the random projection motif discovery algorithm, the length of a motif subsequence cannot exceed the length of the time series. In this algorithm the time series data is converted to SAX. The functionality of the random projection algorithm for motif discovery is as follows:

Initially, the parameters for loading the data are set: the filename containing the time series data; a start date for the motif search, for example '01/01/2000'; an end date, for example '12/01/2000'; and the interval of measurements, for example 1 hour. The data are then loaded by the function [X Y] = data_load(filename, start_date, end_date, interval).

The parameters of the SAX algorithm are defined: a, the alphabet size {a, b, c, ...}; n, the length of a motif subsequence; and w, the length of a word.

The parameters of the random projection algorithm are defined: t, the projection size; num_iteration, the number of trials; s, the threshold for the largest value in the collision table; and r, the range. The path to a folder where results will be saved is also defined, for example 'motifs\two-day-long\2000\'; this folder is created in advance in the project folder.

The SAX algorithm is then carried out, transforming the initial time series into symbolic form using the function timeseries2symbol(Y, w, a). Next, the random projection algorithm is run, producing a set of motif candidates. The original time series is then consulted to check the candidates for their number of matches; the candidate motif with the largest number of matches is the best one.

Subsequences are first extracted with a sliding window, converted into symbolic form, and placed into the S_hat matrix, in which each row index points back to the original location of the subsequence. Once the matrix is constructed, random projection starts: m columns of the S_hat matrix are randomly selected as a mask, and the collision matrix is initialized as a sparse zero matrix. During random projection, each substring of size w in the sequence is mapped to a string of size m, called its projection, by reading the symbols through the mask.
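The thresholding controlled by the parameter s can be sketched as follows (an illustrative Python fragment, with collision counts held in a plain dictionary rather than a sparse matrix; `motif_candidates` is a hypothetical helper):

```python
def motif_candidates(collisions, s):
    """Select motif candidate pairs: entries of the collision table
    whose count reaches the threshold s. ('Significantly larger than
    average' is simplified here to a fixed threshold, as with the
    project's parameter s.)"""
    return sorted(pair for pair, count in collisions.items() if count >= s)
```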

After projection, cells of the collision matrix with values significantly larger than the average are treated as motif candidates, and the Euclidean distance between the original time series subsequences of these candidates is calculated. The search for motifs over a given time span, such as one or two weeks, is carried out and the results are stored in a folder. An array of predicted points 48 hours long is produced, and motifs, also 48-hour arrays, are extracted for each year.

3.4 Brute Force Algorithm

The brute force algorithm (BF) consists of scanning the text and determining, at every position between 0 and n-m, whether an occurrence of the pattern starts there or not. After each attempt, it shifts the pattern window by exactly one position to the right. This requires only constant extra space, and the character comparisons can be done in any order. [Xi 2007] This is a true brute force approach, hence the name, and two loops are required for its implementation. C code for the brute force algorithm is as follows:

void BF(char *x, int m, char *y, int n)
{
    int i, j;
    for (j = 0; j <= n - m; ++j) {
        for (i = 0; i < m && x[i] == y[i + j]; ++i);
        if (i >= m) {
            output(j);
        }
    }
}
[Xi 2007]

3.5 Basic System Functionality

The TCOON data on the local machine is sent to SAX, which identifies the motifs and returns the best motif by analyzing the data it has received. These motifs are used to improve the predictions from the support vector regression model, and are compared with the results of the brute force algorithm. SAX reduces the computation time and produces precise results. Figure 3.4 shows the data mining model used in the project: given an input, SAX produces symbolic strings such as aaabbbaabb, and these strings are used to find the motifs that improve the predictions.

Figure 3.4 System Architecture (past data is fed both to the random projection algorithm via SAX and to the brute force algorithm, and the resulting motifs are compared)

4. TESTING AND EVALUATION

Testing plays an important role in any project: every system must be tested before it can be declared functional. Testing is the act of subjecting the system to experimental data in order to determine how well it works. In this project the following data are retrieved from TCOON: primary water levels and mean high water level. The raw primary water level data, from the year 2000 onward, is stored in a file. The system is tested by passing the raw data in the file to the random projection motif discovery algorithm, and motifs for each of the years are extracted. The results from the random projection algorithm (RPA) are compared to those of the brute force algorithm on a small range of data. The results are also tested by comparing them with already known data: for example, data from one year is passed to the system, and the predicted output from the data mining model is compared with the already known data for the following year.

The input to the brute force algorithm is the primary water level; its output is compared with the output from the RPA to check whether the found motifs are accurate. Several test cases are given as input to SAX, and the output is compared with the brute force output. The test cases include finding the motifs for one week, one month, one year, and the whole 8 years; the time required for motif discovery varied accordingly. The scalability of the algorithms is tested and the results are illustrated in Figure 4.1. The results confirm that the brute force algorithm is quadratic in the length of the time series, while time series projection is linear.

Figure 4.1 Scalability of Algorithms [Chiu 2003]

The inputs to the system are the standard values taken from the TCOON database over the past 8 years. The data are stored in plain text format on the local machine. These values are passed through the SAX subsystem, the output values are tabulated, and a motif graph is plotted.

4.1 Problems Faced While Developing the Project

The input data was first tested with the EMMA and ADM algorithms, which proved inefficient: when processing more than one year of data, a personal computer was insufficient. Preprocessing for small amounts of missing data, such as one hour, is performed by averaging the available values, but large amounts of missing data are skipped. The selection of kernel functions for the regression algorithm is important, as the start time of a motif subsequence is not considered. When large amounts of data are given as input, motif discovery takes very long. Initially the identified motifs were tested with linear regression, but the output turned out to be a straight line.

4.2 Results

The search for two-week-long motifs is executed with an interval of 1 hour. The algorithm parameters are as follows:

Alphabet size, a = 4
Length of motif subsequence, n = 168
Length of word, w = 14
Projection size, t = 9
Number of trials, iter = 50
Threshold for largest value in collision table, s = 30
Range, R = 0.30

The search for motifs of length 2 days is also executed, with the following parameters:

Alphabet size, a = 4
Length of motif subsequence, n = 48
Length of word, w = 6
Projection size, t = 4
Number of trials, iter = 50
Threshold for largest value in collision table, s = 30
Range, R = 0.10

Figure 4.2 gives the motif matches for the year 2000 for two-day lengths. The colored curves in the figure denote the motif matches for every month of the year 2000. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.2 Motif Matches for two-day lengths in the year 2000

Figure 4.3 gives the best motif among the curves for the year 2000. From the motif matches obtained for each month, the best motif for the year is extracted: the motif matches for every two days of each month are considered, and the best motif is selected on the basis of the most frequently recurring pattern among all the found motifs. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.3 Best Motif for two-day lengths in the year 2000

Figure 4.4 gives the motif matches for the year 2000 for two-week lengths; the colored curves denote the motif matches for every two weeks of the year 2000. The X axis represents the time period and the Y axis represents the water level measurements. The X axis values are numbered up to 168, as the two-week-length motifs extend to 168.

Figure 4.4 Motif Matches for two-week lengths in the year 2000

Figure 4.5 gives the best motif of the year 2000 for two-week lengths. From the motif matches obtained for two-week lengths, the best motif for the year is extracted. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.5 Best Motif for two-week lengths in the year 2000

Figure 4.6 gives the motif matches for the year 2001 for two-day lengths. The colored curves in the figure denote the motif matches for every month of the year 2001. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.6 Motif Matches for the year 2001 for two-day lengths

Figure 4.7 gives the best motif among the curves for the year 2001. From the motif matches obtained for each month, the best motif for the year is extracted. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.7 Best Motif among the curves for the year 2001 for two-day lengths

Figure 4.8 gives the motif matches for the year 2001 for two-week lengths. From the motif matches obtained for two-week lengths, the best motif for the year is extracted. The X axis represents the time period and the Y axis represents the water level measurements.

Figure 4.8 Motif Matches for the year 2001 for two-week lengths

Figure 4.9 gives the best motif among the curves for the year 2001 for two-week lengths. From the motif matches obtained for two-week lengths, the best motif for the year is extracted. The X axis represents the time period and the Y axis represents the water level measurements.


More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS 4.1 Introduction Although MST-based clustering methods are effective for complex data, they require quadratic computational time which is high for

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture - 24

Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture - 24 Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras Lecture - 24 So in today s class, we will look at quadrilateral elements; and we will

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

Going nonparametric: Nearest neighbor methods for regression and classification

Going nonparametric: Nearest neighbor methods for regression and classification Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 3, 208 Locality sensitive hashing for approximate NN

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Time Series Prediction as a Problem of Missing Values: Application to ESTSP2007 and NN3 Competition Benchmarks

Time Series Prediction as a Problem of Missing Values: Application to ESTSP2007 and NN3 Competition Benchmarks Series Prediction as a Problem of Missing Values: Application to ESTSP7 and NN3 Competition Benchmarks Antti Sorjamaa and Amaury Lendasse Abstract In this paper, time series prediction is considered as

More information

7 Fractions. Number Sense and Numeration Measurement Geometry and Spatial Sense Patterning and Algebra Data Management and Probability

7 Fractions. Number Sense and Numeration Measurement Geometry and Spatial Sense Patterning and Algebra Data Management and Probability 7 Fractions GRADE 7 FRACTIONS continue to develop proficiency by using fractions in mental strategies and in selecting and justifying use; develop proficiency in adding and subtracting simple fractions;

More information

CS490D: Introduction to Data Mining Prof. Chris Clifton

CS490D: Introduction to Data Mining Prof. Chris Clifton CS490D: Introduction to Data Mining Prof. Chris Clifton April 5, 2004 Mining of Time Series Data Time-series database Mining Time-Series and Sequence Data Consists of sequences of values or events changing

More information

Pattern Recognition ( , RIT) Exercise 1 Solution

Pattern Recognition ( , RIT) Exercise 1 Solution Pattern Recognition (4005-759, 20092 RIT) Exercise 1 Solution Instructor: Prof. Richard Zanibbi The following exercises are to help you review for the upcoming midterm examination on Thursday of Week 5

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET

CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET 69 CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET 3.1 WAVELET Wavelet as a subject is highly interdisciplinary and it draws in crucial ways on ideas from the outside world. The working of wavelet in

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

CS 112 Introduction to Programming

CS 112 Introduction to Programming Running Time CS 112 Introduction to Programming As soon as an Analytic Engine exists, it will necessarily guide the future course of the science. Whenever any result is sought by its aid, the question

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Nearest Neighbor Predictors

Nearest Neighbor Predictors Nearest Neighbor Predictors September 2, 2018 Perhaps the simplest machine learning prediction method, from a conceptual point of view, and perhaps also the most unusual, is the nearest-neighbor method,

More information

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi 1. Introduction The choice of a particular transform in a given application depends on the amount of

More information

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Midterm Examination CS540-2: Introduction to Artificial Intelligence Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search

More information

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Central Valley School District Math Curriculum Map Grade 8. August - September

Central Valley School District Math Curriculum Map Grade 8. August - September August - September Decimals Add, subtract, multiply and/or divide decimals without a calculator (straight computation or word problems) Convert between fractions and decimals ( terminating or repeating

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning Justin Chen Stanford University justinkchen@stanford.edu Abstract This paper focuses on experimenting with

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

File System Interface and Implementation

File System Interface and Implementation Unit 8 Structure 8.1 Introduction Objectives 8.2 Concept of a File Attributes of a File Operations on Files Types of Files Structure of File 8.3 File Access Methods Sequential Access Direct Access Indexed

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Clustering. (Part 2)

Clustering. (Part 2) Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works

More information

Diffusion Wavelets for Natural Image Analysis

Diffusion Wavelets for Natural Image Analysis Diffusion Wavelets for Natural Image Analysis Tyrus Berry December 16, 2011 Contents 1 Project Description 2 2 Introduction to Diffusion Wavelets 2 2.1 Diffusion Multiresolution............................

More information

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

Data Mining and Data Warehousing Classification-Lazy Learners

Data Mining and Data Warehousing Classification-Lazy Learners Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Spring 2019 Alexis Maciel Department of Computer Science Clarkson University Copyright c 2019 Alexis Maciel ii Contents 1 Analysis of Algorithms 1 1.1 Introduction.................................

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

MRT based Fixed Block size Transform Coding

MRT based Fixed Block size Transform Coding 3 MRT based Fixed Block size Transform Coding Contents 3.1 Transform Coding..64 3.1.1 Transform Selection...65 3.1.2 Sub-image size selection... 66 3.1.3 Bit Allocation.....67 3.2 Transform coding using

More information

31.6 Powers of an element

31.6 Powers of an element 31.6 Powers of an element Just as we often consider the multiples of a given element, modulo, we consider the sequence of powers of, modulo, where :,,,,. modulo Indexing from 0, the 0th value in this sequence

More information

Data Mining and Machine Learning: Techniques and Algorithms

Data Mining and Machine Learning: Techniques and Algorithms Instance based classification Data Mining and Machine Learning: Techniques and Algorithms Eneldo Loza Mencía eneldo@ke.tu-darmstadt.de Knowledge Engineering Group, TU Darmstadt International Week 2019,

More information

Comparison of Digital Image Watermarking Algorithms. Xu Zhou Colorado School of Mines December 1, 2014

Comparison of Digital Image Watermarking Algorithms. Xu Zhou Colorado School of Mines December 1, 2014 Comparison of Digital Image Watermarking Algorithms Xu Zhou Colorado School of Mines December 1, 2014 Outlier Introduction Background on digital image watermarking Comparison of several algorithms Experimental

More information