Constructing Horizontal layout and Clustering Horizontal layout by applying Fuzzy Concepts for Data mining Reasoning


International Journal Of Engineering And Computer Science ISSN: Volume 4 Issue 1 January 2015, Page No

Kalluri N V Satya Naresh, Divya Vani.Y
divyasudha99@gmail.com
Shri Vishnu Engineering College for Women, Bhimavaram, Andhra Pradesh, India

Abstract: Clustering is one of the significant tasks in data mining; it helps many users with analysis and decision making. This paper introduces a quick and efficient way to construct a horizontal layout and to use it directly in data mining algorithms such as clustering. Preparing a data set for analysis in a data mining project is a time-consuming and laborious task, so horizontal layouts are created and stored in the database, which removes the burden of data preprocessing in data mining projects. The vertical layouts produced by vertical aggregations in SQL are unsuitable for data mining algorithms, so horizontal aggregations are used to create horizontal layouts instead. A horizontal layout is preferable because standard SQL (Structured Query Language) aggregations return only one aggregated column per group, whereas a horizontal layout returns many values per aggregated group or row, which is the form data mining algorithms need. Horizontal aggregations are evaluated with the CASE and SPJ methods. This paper shows that a horizontal layout is easier to create with the CASE method than with the SPJ method. Because preparing a data set for clustering takes considerable time and effort, the horizontal layout created here is used directly for clustering.
Since uncertainty is a key feature of real data, the horizontal layout is clustered using soft computing concepts, namely fuzzy sets, so that the clustered data is useful to users for analysis and decision making. The whole process is illustrated with examples and experimental results.

Keywords: Horizontal Aggregation, Horizontal layout, Vertical layout, Vertical Aggregation, Data mining algorithms, Clustering, Fuzzy Concepts.

1. Introduction:

Horizontal layouts are very helpful in data mining algorithms, so this paper covers the easy creation of horizontal layouts and their clustering while handling imprecise data.

Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No Page 10028

Building a data set for a data mining project is generally one of the most time-consuming steps. The vertical layouts produced by standard SQL aggregation functions (vertical aggregations) are ill-suited to data mining tasks: they consist of a large number of rows, which is not I/O (input/output) efficient. To solve the problem of building data sets, horizontal aggregations are adopted to create horizontal layouts easily. Horizontal layouts are more I/O efficient than vertical layouts as input to data mining algorithms such as classification, regression analysis, PDA and clustering. A horizontal layout also removes the burden of a separate data preprocessing phase and data set creation phase written with complex SQL queries. Vertical layouts built with standard SQL functions are limited because they return only one aggregated column per group, whereas horizontal aggregations create many columns (values) per aggregated group or row. Horizontal aggregations have several advantages, including the ability to generate SQL code automatically, and in this paper they are evaluated using the SPJ and CASE methods. The paper shows by example that creating a horizontal layout is easier and more time efficient with the CASE method than with the SPJ method. The horizontal layout, created in advance, is then used directly for clustering without any further preprocessing, which saves time and effort. Clustering of the horizontal layout is performed using fuzzy concepts, which handle imprecise and vague data. Data aggregation is the process by which information is gathered, expressed in summary form and reused for analysis, for example demographic analysis.
A common aim of aggregation is to obtain more information about particular groups based on specific variables such as gender, name, age, address, profession, phone number or income. Most data mining algorithms require a data set in horizontal layout as input, because a horizontal layout returns a set of values per aggregated row instead of a single value. A new class of aggregate functions is proposed that aggregates numeric expressions and transposes the results to return a table in horizontal layout. Functions belonging to this class are called horizontal aggregations. Horizontal aggregations extend traditional SQL aggregations: they return a set of values (columns) in a horizontal layout per aggregated group instead of a single value per aggregated row. Several operators and functions are needed to compute aggregations in SQL. The sum of a column is the most widely used aggregation; other aggregation operators return the row count, maximum, average and minimum over groups of rows. All of these existing operators have limitations when they are used to build large data sets for data mining. Database schemas for OLTP (online transaction processing) need to be highly normalized, but data mining, machine learning and statistical algorithms generally require aggregated data in summarized form. Data mining algorithms take input in cross-tabular (horizontal) form, so significant effort is required to compute the aggregations that produce it.
As noted above, horizontal layouts are I/O and time efficient inputs for data mining algorithms and avoid separate preprocessing and data set creation phases written with complex SQL queries, whereas vertical layouts return only one aggregated column per group. Horizontal aggregation is a class of functions that returns the aggregated columns in a horizontal layout, which is the input most algorithms require. Managing data sets without the support of a DBMS is a tedious task. Inside a relational database, experimenting with different subsets of dimensions and data points is easier, faster and more flexible than working outside the database with another tool. Like select, project and join, horizontal aggregation can be treated as an operator, and it is best implemented inside the query processor.

Soft computing tools are increasingly used in everyday and advanced applications. In real applications data uncertainty is a prominent feature, and because hard computing cannot handle vague and uncertain data, soft computing is used. Zadeh introduced the concept of the fuzzy set, with its notion of graded membership, to capture impreciseness in data by generalizing the characteristic function of a set. Clustering, the most important unsupervised learning problem, deals with discovering structure in a collection of unlabeled data. Fuzzy clustering algorithms are used to cluster inexact and imprecise data. In both the K-Means and the Fuzzy C-Means algorithm the number of clusters is fixed in advance; K-Means assigns each element to exactly one cluster, whereas Fuzzy C-Means gives each element a degree of membership in every cluster.
Because a horizontal layout can be used directly by data mining algorithms, we use it here for clustering, which is one of the most important tasks in data mining. Clustering of the horizontal layout is performed with the Fuzzy C-Means algorithm.

2. Literature Review:

A database is an organized collection of data that models relevant aspects of reality in a way that supports processes requiring information. Database management systems (DBMS) are specially designed software applications that interact with users, other applications and the database itself to capture and analyze data. A DBMS allows the definition, creation, querying, update and administration of databases. Well-known DBMSs include MySQL, PostgreSQL, MariaDB, SQLite, Oracle, Microsoft SQL Server, dBASE, SAP HANA, FoxPro, LibreOffice Base, IBM DB2 and FileMaker Pro. The SELECT statement is used to select data from a database. Projection selects the columns of a table that should appear in the result. An SQL join combines rows from two or more tables based on a common field. A left outer join returns all rows from the left table together with the matching rows from the right table. An aggregation function groups the values of multiple rows to form a single value according to some criterion. The most commonly used aggregation functions are average(), maximum(), mode(), median(), count(), minimum() and sum(). These standard SQL aggregation functions are also called vertical aggregation functions and are used to create vertical layouts. The GROUP BY clause gathers all the rows that contain the same data in the chosen columns and allows aggregation functions to operate on one or more columns.

Data mining is the process of extracting knowledge from data. Data from various sources is collected and stored in data warehouses; data mining functions are then applied to the preprocessed data, producing results in a form users can understand. Data cleaning, transformation, reduction, regression analysis, association rule generation, classification, clustering and outlier analysis are all data mining tasks. Of these, this paper deals with clustering. Data clustering partitions a data set into distinct clusters according to the similarity of its elements: elements with similar features are placed in the same cluster, and dissimilar elements are placed in different clusters.

In 1965 Zadeh introduced the notion of the fuzzy set. A fuzzy set is characterized by a membership function and is intended to deal with imprecise data. A fuzzy set $A$ in a universe $X$ is defined by its membership function $\mu_A : X \to [0, 1]$; every $y \in X$ is associated with a real number $\mu_A(y)$, called the membership value of $y$, which satisfies $0 \le \mu_A(y) \le 1$. The horizontal layouts created here are clustered using the fuzzy set concept to handle impreciseness, and the clustered data is useful to users for analysis and decision making.

2.2 Need For Creating Horizontal Layout:

A horizontal layout relieves much of the burden of data mining projects, because preparing data sets in the data preparation phase takes a lot of time and effort. Horizontal layouts can be used directly as input data sets by data mining algorithms such as classification, regression analysis, clustering and PDA, without preparing data sets from the data tables again.
Standard SQL aggregation functions such as min, avg, sum and max can be used to create a vertical layout. Vertical layouts produced this way are not I/O efficient for data mining algorithms, because they contain only one aggregated column per group and a large number of rows. A horizontal layout, with many columns per aggregated group (that is, many values per row), is therefore needed. It can be produced by functions called horizontal aggregations. Data mining tools can generate the SQL code automatically. The CASE and SPJ methods can be used to evaluate horizontal aggregations.

2.3 Advantages of creating horizontal layouts using horizontal aggregations and clustering them:

• A horizontal aggregation acts as a template from which data mining tools can generate SQL code, automating the writing, optimization and testing of SQL queries for correctness.
• SQL queries generated automatically are more efficient than queries written by an end user.
• The data set produced by horizontal aggregations can be created directly in the database.
• The horizontal layouts can be given directly as input to data mining algorithms such as classification, regression analysis, clustering and PDA.
• The clustered data obtained by clustering the horizontal layout is more useful to users for analysis and decision making.

2.4 Definitions:

T is a database table with primary key P, discrete columns C1, C2, ..., Ci and one numeric column N, written T(P, C1, C2, ..., Ci, N). In OLAP terms, T is a fact table with primary key P, i dimensions and measure column N; M is the size of the table, and C1, C2, ..., Ci are foreign keys in the fact table and primary keys in the lookup tables. T is the input table; executing SQL queries on it creates the tables T_V and T_H, where T_V is the vertical layout and T_H is the horizontal layout. The goal of horizontal aggregations is to convert the vertical layout into a horizontal layout. Consider as an example a table T with primary key P, discrete columns C1 and C2, and numeric column N.

Table 2.1 Database Table

Consider the query:

Select C1, C2, Sum(N) from T group by C1, C2 order by C1, C2

Table 2.2 Vertical layout

Table 2.2 is the output of the above query with the SQL aggregation function sum, and it is called a vertical layout. Because this vertical layout has only one aggregated column, with C1 and C2 together acting as the primary key, it is not useful as input to data mining algorithms, so a horizontal tabular layout is required. Table 2.3 is a horizontal layout with two aggregated columns and a single primary key column, which is suitable as input to data mining tasks or algorithms.

Table 2.3 A Horizontal layout Table

3. Methodology

3.1 Horizontal Aggregations:

Horizontal aggregations help when the user wants output in horizontal form, or wants to combine a vertical layout with aggregations based on the grouping columns. Because vertical layouts are not convenient for data mining algorithms, horizontal layouts are created using horizontal aggregations. A horizontal aggregation converts the vertical layout into a horizontal layout by transposing the aggregated column N over a list of transposing columns Y1, ..., Yk. Consider an SQL query that takes X1, ..., Xm as a subset of C1, ..., Cp. The syntax for creating the vertical layout is:

Select X1, ..., Xm, sum(N) from T group by X1, ..., Xm

This query returns a vertical layout data set with m+1 columns, where the m columns X1, ..., Xm act as the primary key and sum(N) is the only aggregated column. To turn the vertical layout into a horizontal layout, horizontal aggregation functions are used. The syntax for creating a horizontal layout is:

SELECT X1, ..., Xj, Ha(N BY Y1, ..., Yk) FROM T GROUP BY X1, ..., Xj

The essential purpose of horizontal aggregations is to transpose the aggregated column N by a list of columns Y1, ..., Yk, where Y1, ..., Yk are a subset of the columns X1, ..., Xm and k < m. Generating SQL code for a horizontal aggregation therefore takes four input parameters: T, X1, ..., Xm, N and Y1, ..., Yk, where T is the input table, X1, ..., Xm are the grouping columns, N is the aggregated column and Y1, ..., Yk are the transposing columns. The general framework for horizontal aggregations is similar to that for vertical aggregations.

As a concrete example, consider a stores database that holds store information in a table named transaction, with columns strid, deptid, date, month, year, day, rate, qty, totalsales, itemqty and costamt. Suppose we want to find the total sales for each store ID by each day of the week. The standard SQL statement for this is:

Select strid, day, sum(totalsales) from transaction group by strid, day order by strid, day

This gives a vertical layout.
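As an illustration, the vertical layout for the stores example can be reproduced on a few rows with Python's built-in sqlite3 module. This is a minimal sketch under assumptions: the sample rows are invented, only three of the table's columns are created, and the table name is quoted because TRANSACTION is a reserved word in SQLite.

```python
import sqlite3

# Hypothetical sample rows: (strid, day, totalsales). Values are invented.
rows = [
    (1, "Mon", 100.0), (1, "Mon", 50.0), (1, "Tue", 70.0),
    (2, "Mon", 30.0), (2, "Wed", 90.0),
]

conn = sqlite3.connect(":memory:")
# Only the three columns needed by the query are created here.
conn.execute('CREATE TABLE "transaction" (strid INTEGER, day TEXT, totalsales REAL)')
conn.executemany('INSERT INTO "transaction" VALUES (?, ?, ?)', rows)

# Vertical aggregation: one output row per (strid, day) group,
# with a single aggregated column.
vertical = conn.execute(
    'SELECT strid, day, SUM(totalsales) FROM "transaction" '
    "GROUP BY strid, day ORDER BY strid, day"
).fetchall()

for row in vertical:
    print(row)
```

Each (strid, day) pair occupies its own row, so a store appears as many times as it has distinct days. This is the shape that a horizontal aggregation transposes into one row per store.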
The horizontal aggregation function is denoted Ha(N BY Y1, ..., Yk), where Ha is a standard SQL aggregation function, N is the aggregation column and Y1, ..., Yk are the transposing columns. The standard (vertical) SQL aggregation function is extended with a BY clause, which transposes the aggregation column N over the list of transposing columns Y1, ..., Yk and so produces a horizontal layout instead of a vertical one.

Fig: Vertical layout created by using vertical aggregations

The vertical layout for the stores example is not useful for data mining tasks, because it has only one aggregated column and both strid and day of week act as the primary key, so many records are returned. Using horizontal aggregations, a horizontal layout is created that has many aggregated columns and only strid as the primary key. The SQL syntax with horizontal aggregations is:

Select strid, sum(total_sales BY day_of_week) from transaction group by strid

Fig: Horizontal layout created by using Horizontal aggregations

3.2 Creation and Clustering Horizontal Layout

This paper covers the creation of a horizontal layout with the CASE and SPJ methods and clusters the resulting horizontal layout using the Fuzzy C-Means algorithm. An example with results is also given. Horizontal layouts can be created by the CASE, SPJ and PIVOT methods. PIVOT and CASE give the same result with almost the same time complexity, and CASE has better time complexity than SPJ, so only the CASE and SPJ methods are used in our process; both give the same result with different time complexities. Creation and clustering of horizontal layouts is done in three modules. The proposed system architecture is shown in Fig 3.2.

Fig 3.2 System Architecture

Module 1 (Selection Process)

First the table is selected from the database, together with the columns to group by, aggregate and transpose in the horizontal layout:
• Select the group-by columns X1, ..., Xj
• Select the aggregate column N
• Select the transposing columns Y1, ..., Yk

Module 2 (Creation of Horizontal Layout)

In this module horizontal layouts are created using the SPJ and CASE methods.

SPJ Method: In this method the column is aggregated horizontally with the help of SPJ (Select, Project, Join) queries. The basic idea is to create one table with a vertical aggregation for each result column, and then join all those tables to produce T_H. We aggregate from T into d projected tables with d Select-Project-Join-Aggregation queries (selection, projection, join, aggregation).
Each table T_I (I = 1, ..., d) corresponds to one subgrouping combination and has {X1, ..., Xj} as primary key and an aggregation on N as the only non-key column. It is necessary to introduce an additional table T0 that will be outer joined with the projected tables to get a complete result set.

Three main steps in the SPJ method to create a horizontal layout:
• First, table T0 is created with the distinct combinations of the group-by columns X1, ..., Xj.
• For each unique combination of the transposing columns Y1, ..., Yk, tables T1, ..., Td are created.
• Lastly, table T0 is left outer joined with each table T1 to Td.

How these tables are created is explained below. Table T0 defines the number of result rows and builds the primary key. T0 is populated so that it contains every existing combination of X1, ..., Xj. Table T0 has X1, ..., Xj as primary key and has no non-key column.

INSERT INTO T0 SELECT DISTINCT X1, ..., Xj FROM T;

Next, tables T1 to Td are created. They contain the individual aggregations for each combination of Y1, ..., Yk. The primary key of each table T1, ..., Td is X1, ..., Xj, and the aggregation of N is its only non-key column. For the first combination, reading either from T or from the vertical layout T_V:

INSERT INTO T1 SELECT X1, ..., Xj, V(N) FROM T WHERE Y1 = v11 AND ... AND Yk = vk1 GROUP BY X1, ..., Xj;

Each table T_I aggregates only those rows that correspond to the I-th unique combination of Y1, ..., Yk, given by the WHERE clause. A possible optimization is to synchronize table scans so that the d tables are computed in one pass. Finally, to get T_H, d left outer joins are needed between T0 and the d tables, so that all individual aggregations are assembled as a set of d dimensions for each group. Outer joins set result columns to null for combinations that are missing in a given group. In general, null should be the default value for groups with missing combinations. We believe it would be incorrect to set the result to zero or some other number by default when there are no qualifying rows; such an approach should be considered on a case-by-case basis.

INSERT INTO T_H SELECT T0.X1, T0.X2, ..., T0.Xj, T1.N, T2.N, ..., Td.N FROM T0 LEFT OUTER JOIN T1 ON T0.X1 = T1.X1 AND ... AND T0.Xj = T1.Xj LEFT OUTER JOIN T2 ON T0.X1 = T2.X1 AND ... AND T0.Xj = T2.Xj ... LEFT OUTER JOIN Td ON T0.X1 = Td.X1 AND ... AND T0.Xj = Td.Xj;

Real Time Example for SPJ method: Consider a database holding store information, in which Transaction is a table with columns StoreId, DepId, Date, Month, Year, Day, ItemId, Rate, Qty and Amt. Suppose we want to find the total sales amount for each store ID by each day of the week. The following queries are computed to construct the horizontal layout using the SPJ method.

Query1: INSERT INTO F0 SELECT DISTINCT storeid FROM Transaction;
Query2: INSERT INTO F1 SELECT storeid, sum(amt) AS totalsalesamt FROM Transaction WHERE Day = 'Mon' GROUP BY storeid;
Query3: INSERT INTO F2 SELECT storeid, sum(amt) AS totalsalesamt FROM Transaction WHERE Day = 'Tue' GROUP BY storeid;
Query4: INSERT INTO F3 SELECT storeid, sum(amt) AS totalsalesamt FROM Transaction WHERE Day = 'Wed' GROUP BY storeid;
Query5: INSERT INTO F4 SELECT storeid, sum(amt) AS totalsalesamt FROM Transaction WHERE Day = 'Thu' GROUP BY storeid;
Query6: INSERT INTO F5 SELECT storeid, sum(amt) AS totalsalesamt FROM Transaction WHERE Day = 'Fri' GROUP BY storeid;
Query7: INSERT INTO F6 SELECT storeid, sum(amt) AS totalsalesamt FROM Transaction WHERE Day = 'Sat' GROUP BY storeid;
Query8: INSERT INTO F7 SELECT storeid, sum(amt) AS totalsalesamt FROM Transaction WHERE Day = 'Sun' GROUP BY storeid;
Query9: INSERT INTO FH SELECT F0.storeid, F1.totalsalesamt AS Mon_amt, F2.totalsalesamt AS Tue_amt, F3.totalsalesamt AS Wed_amt, F4.totalsalesamt AS Thu_amt, F5.totalsalesamt AS Fri_amt, F6.totalsalesamt AS Sat_amt, F7.totalsalesamt AS Sun_amt FROM F0 LEFT OUTER JOIN F1 ON F0.storeid = F1.storeid LEFT OUTER JOIN F2 ON F0.storeid = F2.storeid LEFT OUTER JOIN F3 ON F0.storeid = F3.storeid LEFT OUTER JOIN F4 ON F0.storeid = F4.storeid LEFT OUTER JOIN F5 ON F0.storeid = F5.storeid LEFT OUTER JOIN F6 ON F0.storeid = F6.storeid LEFT OUTER JOIN F7 ON F0.storeid = F7.storeid;

Evaluating these queries yields the desired horizontal layout, but it takes a lot of effort, because many subqueries must be written and many join operations performed. For the same request, a vertical layout needs just one query: select storeid, day, sum(amt) from Transaction group by storeid, day. Creating the horizontal layout took 9 queries, so the CASE method can be used to reduce the effort and time complexity.

CASE Method: In this method the column is aggregated horizontally through the CASE construct. The CASE statement returns a value selected from a set of values based on Boolean expressions. From a relational database theory point of view this is equivalent to a simple projection/aggregation query in which each non-key value is given by a function that returns a number based on some conjunction of conditions. As with SPJ, the method can aggregate directly from T: horizontal aggregation queries are evaluated by aggregating from T and transposing rows at the same time to produce T_H. First, the unique combinations of Y1, ..., Yk are obtained; they define the matching Boolean expression for each result column. In the SQL code that computes horizontal aggregations directly from T, V() is a standard (vertical) SQL aggregation that has a CASE statement as its argument. Horizontal aggregations must set the result to null when there are no qualifying rows for a specific horizontal group, to be consistent with the SPJ method and with the extended relational model.
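The two evaluation strategies can be compared side by side on a toy table. The sketch below assumes SQLite (through Python's sqlite3 module), a hypothetical three-column table Trans with invented rows, and only three day values to keep it short; it is an illustration of the SPJ and CASE ideas, not the authors' implementation.

```python
import sqlite3

DAYS = ["Mon", "Tue", "Wed"]  # transposing values, shortened to three days

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Trans (storeid INTEGER, day TEXT, amt REAL)")
conn.executemany("INSERT INTO Trans VALUES (?, ?, ?)", [
    (1, "Mon", 100.0), (1, "Mon", 50.0), (1, "Tue", 70.0),
    (2, "Mon", 30.0), (2, "Wed", 90.0),
])

# --- SPJ method: F0 holds the keys, F1..Fd hold one vertical aggregation each.
conn.execute("CREATE TABLE F0 (storeid INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO F0 SELECT DISTINCT storeid FROM Trans")
for i, day in enumerate(DAYS, start=1):
    conn.execute(
        f"CREATE TABLE F{i} (storeid INTEGER PRIMARY KEY, totalsalesamt REAL)")
    conn.execute(
        f"INSERT INTO F{i} SELECT storeid, SUM(amt) FROM Trans "
        "WHERE day = ? GROUP BY storeid", (day,))

# Assemble the horizontal layout with d left outer joins against F0.
cols = ", ".join(f"F{i}.totalsalesamt" for i in range(1, len(DAYS) + 1))
joins = " ".join(
    f"LEFT OUTER JOIN F{i} ON F0.storeid = F{i}.storeid"
    for i in range(1, len(DAYS) + 1))
spj = conn.execute(
    f"SELECT F0.storeid, {cols} FROM F0 {joins} ORDER BY F0.storeid"
).fetchall()

# --- CASE method: one scan of Trans, one CASE expression per result column.
# The day values are fixed constants here, so inlining them is safe.
case_cols = ", ".join(
    f"SUM(CASE WHEN day = '{d}' THEN amt ELSE NULL END)" for d in DAYS)
case = conn.execute(
    f"SELECT storeid, {case_cols} FROM Trans GROUP BY storeid ORDER BY storeid"
).fetchall()

print(spj)
print(case)
```

Both strategies return one row per store with a null wherever a store has no sales on a given day. SPJ needs d + 1 intermediate tables and d joins to get there, while CASE needs a single query.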

SQL syntax for the CASE method is given below; in the syntax T is the original table and T_H is the horizontal layout:

SELECT DISTINCT Y1, ..., Yk FROM T;
INSERT INTO T_H SELECT X1, ..., Xj, V(CASE WHEN Y1 = v11 AND ... AND Yk = vk1 THEN N ELSE null END), ..., V(CASE WHEN Y1 = v1d AND ... AND Yk = vkd THEN N ELSE null END) FROM T GROUP BY X1, X2, ..., Xj;

Example: Suppose that in a store database we want to find the total number of items sold in each department of each store by each day of the week. The following query creates the horizontal layout using the CASE method.

select StoreId, DepId, sum(CASE when Day='Fri' then Qty else null end), sum(CASE when Day='Mon' then Qty else null end), sum(CASE when Day='Sat' then Qty else null end), sum(CASE when Day='Thr' then Qty else null end), sum(CASE when Day='Tue' then Qty else null end), sum(CASE when Day='Wed' then Qty else null end) from Trans1 Group By StoreId, DepId

Module 3 (Clustering)

The main objective of this paper is to create a data set easily so that it can be used directly in data mining tasks or projects, avoiding the data preprocessing phase. The horizontal layout can be used by any data mining algorithm, so we use it directly for clustering. The previously created horizontal layout is taken as input for clustering. Suppose there is a stores database and we want to find the stores that have the same total sales amount for each day of the week, or to cluster the stores based on total sales for each day of the week; then the previously created data set can be taken directly as input for clustering instead of creating a data set again. The clustered horizontal layout is useful for analysis and decision making. Because the Fuzzy C-Means algorithm can handle vagueness in data, it is used to cluster horizontal layouts.

FUZZY C-MEANS ALGORITHM:

In real-life situations, clustering a data set by hard c-means leads to a crisp partition of the data set, in which each element belongs to exactly one cluster. This is undesirable in many cases, which limits the applicability of hard c-means. Fuzzy C-Means instead uses the concept of fuzzy sets, so that an element can belong to any number of clusters with different membership values. The objective function is

$$J_m(U, v) = \sum_{k=1}^{n} \sum_{i=1}^{c} (\mu_{ik})^{m'} (d_{ik})^2$$

where $m'$ is a real number such that $1 \le m' < \infty$, called the fuzzifier, and $\mu_{ik} \in [0, 1]$ is the membership of the $k$-th pattern to $v_i$.

Algorithm:

STEP 1: Fix $c$ ($2 \le c < n$) and select a value for $m'$. Initialize the partition matrix $U^{(0)}$. For $r = 0, 1, 2, \ldots$ do:

STEP 2: Calculate the $c$ centers $v_i^{(r)}$, $i = 1, 2, \ldots, c$, using the formula

$$v_{ij} = \frac{\sum_{k=1}^{n} \mu_{ik}^{m'} \, x_{kj}}{\sum_{k=1}^{n} \mu_{ik}^{m'}}$$

STEP 3: Update the partition matrix for the $r$-th step, $U^{(r)}$, to $U^{(r+1)} = (\mu_{ik}^{(r+1)})$. Taking $I_k = \{\, i \mid 2 \le c < n;\ d_{ik}^{(r)} = 0 \,\}$ and $\tilde{I}_k = \{1, 2, \ldots, c\} - I_k$,

$$\mu_{ik}^{(r+1)} = \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}^{(r)}}{d_{jk}^{(r)}} \right)^{2/(m'-1)} \right]^{-1} \text{ if } I_k = \emptyset, \qquad \mu_{ik}^{(r+1)} = 0 \text{ for all } i \in \tilde{I}_k \text{ otherwise.}$$

STEP 4: If $\| U^{(r+1)} - U^{(r)} \| \le \varepsilon_L$, STOP; else go to STEP 2.

Here c denotes the number of clusters, v the cluster centers, x a data point, d the distance between a cluster center and a data point, and U the partition matrix, each element of which is the membership value of a data point x in a cluster.

4. Results:

One real-time example of constructing the horizontal layout by the SPJ method and the CASE method is provided. After the horizontal layout is created, it is taken as the input data set for clustering, and clustering is done using the Fuzzy C-Means algorithm.

Example: Consider a database holding store information. Transaction is a table in the database with columns StoreId, DepId, Date, Month, Year, Day, ItemId, Rate, Qty and Amt. Suppose we want to find the total sales amount for each store ID by each day of the week.

SPJ Method: The horizontal layout is constructed with the SPJ queries of the real-time example in Module 2. Query1 fills F0 with the distinct storeid values of Transaction (INSERT INTO F0 SELECT DISTINCT storeid FROM Transaction). Query2 to Query8 fill F1 to F7 with the per-store total, one table per day from Mon to Sun, each of the form INSERT INTO F1 SELECT storeid, sum(amt) AS totalsalesamt FROM Transaction WHERE Day = 'Mon' GROUP BY storeid.

Query9: INSERT INTO FH SELECT F0.storeid, F1.totalsalesamt AS Mon_amt, F2.totalsalesamt AS Tue_amt, F3.totalsalesamt AS Wed_amt, F4.totalsalesamt AS Thu_amt, F5.totalsalesamt AS Fri_amt, F6.totalsalesamt AS Sat_amt, F7.totalsalesamt AS Sun_amt FROM F0 LEFT OUTER JOIN F1 ON F0.storeid = F1.storeid LEFT OUTER JOIN F2 ON F0.storeid = F2.storeid LEFT OUTER JOIN F3 ON F0.storeid = F3.storeid LEFT OUTER JOIN F4 ON F0.storeid = F4.storeid LEFT OUTER JOIN F5 ON F0.storeid = F5.storeid LEFT OUTER JOIN F6 ON F0.storeid = F6.storeid LEFT OUTER JOIN F7 ON F0.storeid = F7.storeid;

Evaluating the above queries produces the horizontal layout that we want, but it takes considerable effort, because many subqueries must be written and many join operations must be performed. Note that a single query is enough to create the vertical layout: SELECT storeid, day, SUM(amt) FROM Transaction GROUP BY storeid, day. To create the horizontal layout with the SPJ method, by contrast, we had to write nine queries. The CASE method reduces this effort.

CASE Method: Suppose we want to find the total number of items sold in each department of each store for each day of the week. The following query creates the horizontal layout using the CASE method:

SELECT StoreId, DepId, SUM(CASE WHEN Day='Fri' THEN Qty ELSE NULL END), SUM(CASE WHEN Day='Mon' THEN Qty ELSE NULL END), SUM(CASE WHEN Day='Sat' THEN Qty ELSE NULL END), SUM(CASE WHEN Day='Thr' THEN Qty ELSE NULL END), SUM(CASE WHEN Day='Tue' THEN Qty ELSE NULL END), SUM(CASE WHEN Day='Wed' THEN Qty ELSE NULL END) FROM Trans1 GROUP BY StoreId, DepId;

The results are as follows:

First, we select from the database the Transaction table, which contains the store information for which we want to create the horizontal layout. The input frame is as follows. Pressing the Select Table button selects the Transaction table, and pressing the Display button displays the selected table as follows. Pressing the Generate button then displays the SQL CODE GENERATION frame. In this frame, pressing the View Columns button displays all the columns of the selected table.
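The contrast between the vertical layout, the SPJ method, and the CASE method can be tried on a small example. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an invented six-row Trans1 table restricted to two days of the week. The sample values and the aliases Mon_qty, Tue_qty, and q are illustrative assumptions, not data or names from this paper.

```python
import sqlite3

# In-memory database with a tiny, made-up Trans1 table (assumed sample data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Trans1 (StoreId INTEGER, DepId INTEGER, Day TEXT, Qty INTEGER);
INSERT INTO Trans1 VALUES
  (1, 10, 'Mon', 5), (1, 10, 'Mon', 3), (1, 10, 'Tue', 2),
  (1, 20, 'Mon', 7), (2, 10, 'Tue', 4), (2, 10, 'Tue', 6);
""")

# Vertical layout: one row per (StoreId, DepId, Day) group, one aggregate column.
vertical = conn.execute("""
SELECT StoreId, DepId, Day, SUM(Qty)
FROM Trans1
GROUP BY StoreId, DepId, Day
ORDER BY StoreId, DepId, Day
""").fetchall()

# CASE method: a single query yields one row per (StoreId, DepId)
# and one aggregate column per day.
case_rows = conn.execute("""
SELECT StoreId, DepId,
       SUM(CASE WHEN Day = 'Mon' THEN Qty ELSE NULL END) AS Mon_qty,
       SUM(CASE WHEN Day = 'Tue' THEN Qty ELSE NULL END) AS Tue_qty
FROM Trans1
GROUP BY StoreId, DepId
ORDER BY StoreId, DepId
""").fetchall()

# SPJ method: one aggregated subquery per day, left-outer-joined to the
# table of distinct grouping keys (written here as inline subqueries).
spj_rows = conn.execute("""
SELECT F0.StoreId, F0.DepId, F1.q AS Mon_qty, F2.q AS Tue_qty
FROM (SELECT DISTINCT StoreId, DepId FROM Trans1) AS F0
LEFT OUTER JOIN (SELECT StoreId, DepId, SUM(Qty) AS q FROM Trans1
                 WHERE Day = 'Mon' GROUP BY StoreId, DepId) AS F1
  ON F0.StoreId = F1.StoreId AND F0.DepId = F1.DepId
LEFT OUTER JOIN (SELECT StoreId, DepId, SUM(Qty) AS q FROM Trans1
                 WHERE Day = 'Tue' GROUP BY StoreId, DepId) AS F2
  ON F0.StoreId = F2.StoreId AND F0.DepId = F2.DepId
ORDER BY F0.StoreId, F0.DepId
""").fetchall()

for row in case_rows:
    print(row)
```

In this sketch both methods return the same rows, with NULL (None in Python) where a store and department pair has no sales on a given day. The SPJ version needs one subquery and one join per day, while the CASE version stays a single query.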


By selecting the columns that we want to group by, aggregate, and transpose, together with the aggregation function and the method name, and then clicking the Generate button, we get the horizontal layout as output.

The above horizontal layout output is taken as input for clustering, and clustering is performed using the Fuzzy C-Means algorithm. The store IDs are clustered by the Fuzzy C-Means algorithm. Suppose that, from the stores database, we want to find the stores that have the same total sales amount for each day of the week, or we want to cluster the stores based on total sales for each day of the week. In that case the previously created data set can be taken directly as input for clustering instead of creating the data set again.

This is the input frame, where we select the data set that we want to cluster by using the Browse button. Here we select the previously created horizontal layout as input for clustering.

The clustering results are as follows:
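The clustering step can be illustrated with a short sketch of the standard Fuzzy C-Means update rules applied to rows of a horizontal layout. This is not the implementation used in this paper. The four stores, their two sales columns, the fuzzifier m = 2, and the choice of two data rows as starting centers are all assumptions made for illustration.

```python
import math

def fuzzy_c_means(points, init_centers, m=2.0, max_iter=100, tol=1e-6):
    """Standard Fuzzy C-Means; returns (centers, memberships).

    memberships[i][j] is the degree to which points[i] belongs to cluster j,
    computed against the centers of the last completed iteration.
    """
    centers = [list(c) for c in init_centers]
    n, k, dim = len(points), len(centers), len(points[0])
    power = 2.0 / (m - 1.0)
    u = []
    for _ in range(max_iter):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        u = []
        for p in points:
            dist = [math.dist(p, c) for c in centers]
            if 0.0 in dist:
                # The point coincides with a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dist]
                u.append([h / sum(hits) for h in hits])
            else:
                u.append([1.0 / sum((dj / dk) ** power for dk in dist)
                          for dj in dist])
        # Center update: mean of the points weighted by u_ij^m.
        new_centers = []
        for j in range(k):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            new_centers.append([sum(w[i] * points[i][d] for i in range(n)) / total
                                for d in range(dim)])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u

# Hypothetical horizontal layout: one row per store, one column per day.
store_ids = [1, 2, 3, 4]
layout = [[100.0, 120.0], [110.0, 115.0], [500.0, 520.0], [510.0, 505.0]]

# Assumed initialization: the first and last rows serve as starting centers.
centers, memberships = fuzzy_c_means(layout, [layout[0], layout[3]])
for sid, row in zip(store_ids, memberships):
    print(sid, [round(v, 3) for v in row])
```

Unlike a hard clustering, each store receives a membership degree in every cluster, and the degrees for a store sum to 1. In this sketch the two low-sales stores end up almost entirely in the first cluster and the two high-sales stores in the second.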

4. Conclusion:
- Preparing a data set for data mining projects takes considerable effort and time, but a horizontal layout data set can be created easily using horizontal aggregation functions.
- It is easier to create a horizontal layout with the CASE method than with the SPJ method, because the SPJ method requires computing many subqueries, whereas in the CASE method a single query is enough.
- The time complexity of the CASE method, O(N log N + dkn log n + dN), is better than the time complexity of the SPJ method, O(N log N) + dkn log n + dN, where N is the size of the input table F, n is the size of the output horizontal layout table, d is the number of distinct combinations of the transposing columns, and k is the number of transposing columns.
- The Fuzzy C-Means algorithm can give better clustering results than the K-Means and Hard C-Means algorithms, as it handles vagueness in the data.

5. Future Work:
- Other data mining tasks, such as classification, regression analysis, and decision making, can also be implemented by taking the horizontal layout as input.
- The horizontal layout can be clustered using other soft computing clustering algorithms to handle imprecision in the data.
- Missing values in the data are not handled, so the rough set concept can be used to handle missing data.
- To reduce the execution time of the clustering algorithm, it can be parallelized using OpenMP.
