Data Mining Primitives, Languages, and Systems


Data Mining Primitives

A data mining task is specified in terms of five primitives: the task-relevant data, the kind of knowledge to be mined, background knowledge, interestingness measures, and the presentation and visualization of discovered patterns.

Task-relevant data: This is the portion of the database to be investigated. Rather than mining the entire database, the user can specify just the portion that is relevant to the analysis. For example, only the transactions involving customer purchases in Canada need to be retrieved.

The kind of knowledge to be mined: This specifies the data mining function to be performed, such as characterization, discrimination, association, classification, or clustering.

Background knowledge: Users can specify background knowledge, that is, knowledge about the domain to be mined. Such knowledge is useful for guiding the knowledge discovery process and for evaluating the patterns found. There are several kinds of background knowledge:
- Concept hierarchies, which allow data to be mined at multiple levels of abstraction.
- Beliefs regarding relationships in the data, which help to evaluate discovered patterns according to their degree of unexpectedness or expectedness.

Interestingness measures: Interestingness measures are used to separate uninteresting patterns from knowledge, to guide the mining process, and to evaluate the discovered patterns. Different kinds of knowledge have different interestingness measures. Association rule mining, for example, uses:
- Support: the percentage of task-relevant data tuples for which the rule pattern appears.
- Confidence: an estimate of the strength of the implication of the rule.
Rules whose support and confidence values fall below the user-specified thresholds are considered uninteresting.
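A minimal Python sketch of these two measures, using invented transactions and item names; the 5% and 70% thresholds echo the DMQL threshold example given later:

```python
# Hypothetical market-basket data; each transaction is a set of items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

def support(itemset):
    """Percentage of task-relevant transactions containing `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """Estimated strength of the implication lhs => rhs."""
    return support(lhs | rhs) / support(lhs)

s = support({"computer", "software"})       # 2/4 = 0.50
c = confidence({"computer"}, {"software"})  # 0.50 / 0.75 = 0.67

# A rule is uninteresting if it falls below the user-specified thresholds.
print(f"support = {s:.0%}, confidence = {c:.0%}")
print("interesting" if s >= 0.05 and c >= 0.70 else "uninteresting")
```

Here the rule meets the support threshold but narrowly misses the 70% confidence threshold, so it would be discarded as uninteresting.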

Presentation and visualization of discovered patterns: This refers to the form in which discovered patterns are to be displayed, such as tables, charts, or rules.

A Data Mining Query Language

DMQL adopts an SQL-like syntax, so that it can easily be integrated with the relational query language SQL. The syntax of DMQL is defined in an extended BNF grammar, where [ ] represents zero or one occurrence, { } represents zero or more occurrences, and words in sans serif font represent keywords.

Syntax for task-relevant data specification:

DMQL provides clauses for the specification of such information:

use database <database_name> / use data warehouse <data_warehouse_name>
in relevance to <attribute_or_dimension_list>
from <relation(s)/cube(s)>
[where <condition>]
[order by <order_list>]
[group by <grouping_list>]
[having <condition>]   // condition by which groups of data are considered relevant

Example:

use database AllElectronics_db
in relevance to I.name, I.price, C.income, C.age
from customer C, item I, purchase P, items_sold S
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
  and P.cust_ID = C.cust_ID and C.country = "Canada"
group by P.date

Syntax for specifying the kind of knowledge to be mined:

Specifying the kind of knowledge to be mined determines the data mining function to be performed.

Characterization:
<Mine_Knowledge_Specification> ::= mine characteristics [as <pattern_name>] analyze <measure(s)>
This specifies that characteristic descriptions are to be mined.
Example: mine characteristics as customerPurchasing analyze count%

Discrimination:
<Mine_Knowledge_Specification> ::= mine comparison [as <pattern_name>]
  for <target_class> where <target_condition>
  {versus <contrast_class_i> where <contrast_condition_i>}
  analyze <measure(s)>
This specifies that discriminant descriptions are to be mined.

Association:
<Mine_Knowledge_Specification> ::= mine associations [as <pattern_name>] [matching <metapattern>]
This specifies that patterns of association are to be mined.

Classification:
<Mine_Knowledge_Specification> ::= mine classification [as <pattern_name>] analyze <classifying_attribute_or_dimension>
This specifies that patterns for data classification are to be mined.

Syntax for concept hierarchy specification:

Concept hierarchies allow the mining of knowledge at multiple levels of abstraction:

use hierarchy <hierarchy_name> for <attribute_or_dimension>

Schema hierarchy: defined over the attributes of a schema, e.g.
street < city < province_or_state < country

Set-grouping hierarchy: organizes the values of a given attribute into groups of constants or range values, e.g.

define hierarchy age_hierarchy for age on customer as
  level1: {young, middle_aged, senior} < level0: all
  level2: {20...39} < level1: young
  level2: {40...59} < level1: middle_aged
  level2: {60...89} < level1: senior
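A minimal Python sketch of how this set-grouping hierarchy could be represented programmatically; the function names are illustrative and the range boundaries follow the definition above:

```python
def age_to_level1(age):
    """Generalize a raw age (level 2) to its level-1 concept."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    return None  # value outside the hierarchy's defined ranges

def generalize_age(age, level):
    """Climb the hierarchy to the requested level (2, 1, or 0)."""
    if level == 2:
        return age                 # raw value, no generalization
    if level == 1:
        return age_to_level1(age)  # group of range values
    return "all"                   # level 0: the single top concept

print(generalize_age(45, 1))  # middle_aged
print(generalize_age(45, 0))  # all
```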

Syntax for interestingness measure specification:

Interestingness measures and thresholds can be specified by the user with the statement

with [<interest_measure>] threshold = <threshold_value>

E.g.:
with support threshold = 5%
with confidence threshold = 70%

Syntax for pattern presentation and visualization:

The DMQL display statement for the visualization of patterns is

display as <result_form>

where <result_form> can be any of the knowledge presentation/visualization forms, such as a table, pie chart, etc. To view the patterns at different levels of abstraction:

<multilevel_manipulation> ::= roll up on <attribute_or_dimension>
  | drill down on <attribute_or_dimension>
  | add <attribute_or_dimension>
  | drop <attribute_or_dimension>

Designing Graphical User Interfaces Based on a Data Mining Query Language

Inexperienced users may find a data mining query language awkward to use and its syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a GUI. A data mining GUI may consist of the following functional components:

Data collection and data mining query composition: This component allows the user to specify task-relevant data sets and to compose data mining queries.

Presentation of discovered patterns: This component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves, and other visualization techniques.

Hierarchy specification and manipulation: This component allows concept hierarchies to be specified, either manually by the user or automatically, and allows them to be modified by the user or adjusted automatically.

Manipulation of data mining primitives: This component allows the dynamic adjustment of data mining thresholds, as well as the selection, display, and modification of concept hierarchies.

Interactive multilevel mining: This component allows roll-up or drill-down operations on discovered patterns.

Other miscellaneous information: This includes on-line help manuals, indexed search, debugging, etc.

Architectures of Data Mining Systems

Database and data warehouse systems have become mainstream information systems, and comprehensive information processing and data analysis infrastructures have been systematically constructed around them. A data mining system can be integrated with a DB/DW system using one of the following coupling schemes: no coupling, loose coupling, semi-tight coupling, or tight coupling.

No coupling: The data mining system does not utilize any function of a DB or DW system. It may fetch data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file. It is simple to implement.

Disadvantages: A no-coupling DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data, whereas in DB/DW systems data tend to be well organized, indexed, cleaned, and integrated, so that finding task-relevant, high-quality data is easy.

Moreover, many tested, scalable algorithms and data structures are already implemented in DB/DW systems. Without any coupling to such systems, a DM system needs to use other tools, making it difficult to integrate it into an information processing environment. Hence, no coupling represents a poor design.

Loose coupling: The DM system uses some facilities of a DB/DW system. It fetches data from a repository managed by these systems, performs data mining, and then stores the results either in a file or in a designated place in the DB/DW.

Advantages: Loose coupling is better than no coupling, since the DM system can fetch any portion of the data stored in the DB/DW using query processing, indexing, and other system facilities, and thus gains the flexibility and efficiency provided by DB/DW systems.

Disadvantages: It is difficult to achieve high scalability and good performance with large data sets, since loosely coupled DM systems are memory-based and do not exploit the data structures and query optimization methods provided by DB/DW systems.

Semi-tight coupling: Besides the DB/DW facilities used under loose coupling, the system also provides efficient implementations of a few essential data mining primitives, such as sorting, indexing, aggregation, histogram analysis, multiway join, and the precomputation of some essential statistical measures, such as sum and count. In addition, some frequently used intermediate mining results can be precomputed and stored in the DB/DW system.

Tight coupling: The DM system is smoothly integrated into the DB/DW system and treated as one functional component of an information system. Data mining queries and functions are optimized based on mining query analysis and on the data structures, indexing schemes, and query processing methods of the DB/DW system.

Advantages: This is a highly desirable architecture. It facilitates efficient implementation of data mining functions, provides high system performance, and offers an integrated information processing environment.

Introduction to Data Generalization

Data generalization is a process that abstracts a large set of task-relevant data in a database from relatively low conceptual levels to higher conceptual levels. Methods for the efficient and flexible generalization of large data sets can be categorized into two approaches: (1) the data cube approach and (2) the attribute-oriented induction approach.

Attribute-oriented induction

The general idea of attribute-oriented induction (AOI) is to first collect the task-relevant data using a relational database query and then perform generalization based on the examination of the number of distinct values of each attribute in the relevant set of data. The generalization is performed by either attribute removal or attribute generalization (also known as concept hierarchy ascension). Aggregation is performed by merging identical generalized tuples and accumulating their respective counts, which reduces the size of the generalized data set. The resulting generalized relation can be mapped into different forms, such as charts or rules, for presentation to the user.

The essential operation of attribute-oriented induction is data generalization, which can be performed in one of two ways on the initial working relation: (1) attribute removal or (2) attribute generalization.

1. Attribute removal is based on the following rule: if there is a large set of distinct values for an attribute of the initial working relation, but either (1) there is no generalization operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or (2) its higher-level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation. In case 1 the attribute cannot be generalized, so it should be removed. In case 2 the attribute is redundant, because its higher-level concepts can be obtained by generalizing the other attributes, so it can likewise be removed.

2. Attribute generalization is based on the following rule: if there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute.

This rule is based on the following reasoning. Using a generalization operator to generalize an attribute value within a tuple of the working relation makes the rule cover more of the original data tuples, thus generalizing the concept it represents. This corresponds to the generalization rule known as climbing generalization trees in learning-from-examples.

Controlling how high an attribute should be generalized is called attribute generalization control. If the attribute is generalized "too high", overgeneralization may occur, and the resulting rules may not be very informative. On the other hand, if the attribute is not generalized to a sufficiently high level, under-generalization may result, and the rules obtained may not be informative either. Thus, a balance should be attained in attribute-oriented generalization.

There are many possible ways to control a generalization process. Two common approaches are described below.

The first technique, called attribute generalization threshold control, either sets one generalization threshold for all of the attributes or sets one threshold for each attribute. If the number of distinct values of an attribute is greater than the attribute threshold, further attribute removal or attribute generalization should be performed.

The second technique, called generalized relation threshold control, sets a threshold for the generalized relation. If the number of (distinct) tuples in the generalized relation is greater than the threshold, further generalization should be performed; otherwise, no further generalization should be performed. For example, if a user feels that the generalized relation is too small, she can increase the threshold, which implies drilling down; to further generalize a relation, she can reduce the threshold, which implies rolling up.

Efficient implementation of attribute-oriented induction

Algorithm: attribute-oriented induction. Mining generalized characteristics in a relational database based on a user's data mining request.

Input: (i) a relational database DB; (ii) a data mining query, DMQuery; (iii) a_list, a list of attributes; (iv) Gen(a_i), a set of concept hierarchies or generalization operators on attributes a_i; and (v) a_gen_thresh(a_i), an attribute generalization threshold for each attribute a_i.

Output: P, the prime generalized relation.

Method: The method is outlined as follows.

1. W <- get_task_relevant_data(DMQuery, DB).

2. prepare_for_generalization(W). This is performed by (1) scanning the initial working relation W once and collecting the distinct values of each attribute a_i; (2) computing the minimum desired level L_i for each attribute a_i based on its given or default attribute threshold; and (3) determining the mapping pairs (v, v') for each attribute a_i in W, where v is a distinct value of a_i in W and v' is its corresponding generalized value at level L_i.

3. P <- generalization(W). This is done by replacing each value v in W with its corresponding v' while accumulating count and computing any other aggregate values. This step can be implemented efficiently in two variations: (1) for each generalized tuple, insert the tuple into a sorted prime relation P by binary search; if the tuple is already in P, simply increase its count and other aggregate values accordingly, otherwise insert it into P; (2) since in most cases the number of distinct values at the prime relation level is small, the prime relation can be coded as an m-dimensional array, where m is the number of attributes in P and each dimension contains the corresponding generalized attribute values; each array element holds the corresponding count and other aggregate values, if any, and the insertion of a generalized tuple is performed by aggregating the measures in the corresponding array element.
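The Python sketch below condenses the generalization step into a single pass, using a dictionary of generalized tuples in place of the sorted relation or m-dimensional array. It is simplified: each hierarchy is a flat value-to-group mapping that climbs only one level rather than iterating to the minimum desired level L_i, and the relation, hierarchies, and thresholds are all invented for illustration:

```python
from collections import Counter

# Illustrative toy hierarchy (Gen) and thresholds (a_gen_thresh).
GEN = {
    "city": {"Vancouver": "Canada", "Toronto": "Canada",
             "Seattle": "USA", "Chicago": "USA"},
}
THRESH = {"city": 2, "gender": 2, "name": 3}

def aoi(working_relation, gen=GEN, thresh=THRESH):
    """Return the prime generalized relation P with accumulated counts."""
    attrs = list(working_relation[0])
    plan = []  # (attribute, mapping or None); removed attributes are omitted
    for a in attrs:
        distinct = {t[a] for t in working_relation}
        if len(distinct) <= thresh[a]:
            plan.append((a, None))    # already general enough: keep as-is
        elif a in gen:
            plan.append((a, gen[a]))  # attribute generalization
        # else: attribute removal (no generalization operator exists)

    # Merge identical generalized tuples, accumulating their counts.
    prime = Counter()
    for t in working_relation:
        prime[tuple(m.get(t[a], t[a]) if m else t[a] for a, m in plan)] += 1
    return [dict(zip([a for a, _ in plan], key), count=n)
            for key, n in prime.items()]

W = [
    {"name": "Ann",  "gender": "F", "city": "Vancouver"},
    {"name": "Bob",  "gender": "M", "city": "Toronto"},
    {"name": "Carl", "gender": "M", "city": "Seattle"},
    {"name": "Dave", "gender": "M", "city": "Chicago"},
]
for row in aoi(W):
    print(row)  # name removed, city generalized, counts accumulated
```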

Data cube implementation of attribute-oriented induction

The data cube implementation of attribute-oriented induction can be performed in two ways.

1. Construct a data cube on-the-fly for the given data mining query: This is desirable if either the task-relevant data set is too specific to match any predefined data cube or it is not very large. Since such a data cube is computed only after the query is submitted, the major motivation for constructing it is to facilitate efficient drill-down analysis.

2. Use a predefined data cube: An alternative method is to construct a data cube before a data mining query is posed to the system, and to use this predefined cube for subsequent data mining. This is desirable if the granularity of the task-relevant data matches that of the predefined data cube and the set of task-relevant data is quite large. Since such a data cube is precomputed, it facilitates attribute relevance analysis, attribute-oriented induction, dicing and slicing, roll-up, and drill-down. The cost one must pay is the cost of cube computation and the nontrivial storage overhead.

Methods of attribute relevance analysis

The general idea behind attribute relevance analysis is to compute some measure that quantifies the relevance of an attribute with respect to a given class. Such measures include the information gain, the Gini index, uncertainty, and correlation coefficients.

Let S be a set of training objects (or tuples) whose class labels are known. Suppose that there are m classes and that S contains s_i objects of class C_i, for i = 1, ..., m. An arbitrary object belongs to class C_i with probability s_i/s, where s is the total number of objects in S. The expected information needed to classify a given tuple is

I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}.

If an attribute A with values {a_1, a_2, ..., a_v} is used to partition S into the subsets {S_1, S_2, ..., S_v}, where S_j contains those objects in S that have value a_j of A, and S_j contains s_ij objects of class C_i, then the expected information based on this partitioning by A, known as the entropy of A, is the weighted average

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj}).

The information gained by branching on A is defined by

Gain(A) = I(s_1, \ldots, s_m) - E(A).

The attribute that maximizes Gain(A) is selected.
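A minimal Python sketch of this computation, with invented (age group, class label) training pairs:

```python
from collections import Counter
from math import log2

def expected_info(labels):
    """I(s1, ..., sm) for the class labels of a set of objects."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows):
    """Gain(A) = I(s1, ..., sm) - E(A); rows are (value_of_A, class) pairs."""
    all_labels = [cls for _, cls in rows]
    partitions = {}
    for value, cls in rows:              # partition S by the values of A
        partitions.setdefault(value, []).append(cls)
    entropy_A = sum(len(p) / len(rows) * expected_info(p)
                    for p in partitions.values())
    return expected_info(all_labels) - entropy_A

# Toy training data, invented for the example.
rows = [("young", "yes"), ("young", "no"),
        ("senior", "yes"), ("senior", "yes")]
print(f"Gain(age) = {info_gain(rows):.3f} bits")  # I = 0.811, E(A) = 0.5
```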

Attribute relevance analysis for class description is performed as follows.

1. Data collection: Collect data for both the target class and the contrasting class by query processing. For class comparison, both the target class and the contrasting class are provided by the user in the data mining query. For class characterization, the target class is the class to be characterized, whereas the contrasting class is the set of comparable data that are not in the target class.

2. Preliminary relevance analysis using conservative AOI: Attribute-oriented induction can be used to perform some preliminary relevance analysis on the data by removing or generalizing attributes that have a large number of distinct values (such as name and phone#). Such attributes are unlikely to be meaningful for concept description. To be conservative, the AOI should employ attribute generalization thresholds that are set reasonably large, so as to allow more attributes to be considered in the further relevance analysis performed in step 3. The relation obtained by this attribute removal and attribute generalization process is called the candidate relation of the mining task.

3. Remove irrelevant or weakly relevant attributes using the selected measure: The selected relevance measure is used to evaluate (rank) each attribute in the candidate relation; for example, the information gain measure described above may be used. The attributes are then sorted according to their computed relevance values, and those that are irrelevant or weakly relevant are removed based on a set threshold. The resulting relation is called the initial target class (or contrasting class) relation.

Mining Class Comparisons

Class comparison involves the following steps: data collection, dimension relevance analysis, synchronous generalization, and presentation of the derived comparison.

Working example:

use database Big_University_DB
mine comparison as grad_vs_undergrad_students
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for graduate_students where status in "graduate"
versus undergraduate_students where status in "undergraduate"
analyze count%
from student

1. Data collection: the target and contrasting classes are collected by query processing.
2. Attribute relevance analysis: the attributes name, gender, major, and phone# are removed.
3. Synchronous generalization: controlled by user-specified dimension thresholds, producing the prime target and contrasting class relations/cuboids.
4. Drill-down, roll-up, and other OLAP operations are applied to the target and contrasting classes to adjust the levels of abstraction of the resulting descriptions.
5. Presentation: as generalized relations, crosstabs, bar charts, pie charts, or rules, together with contrasting measures (e.g., count%) that reflect the comparison between the target and contrasting classes.
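A minimal Python sketch of step 5, printing count% as a contrasting measure between the two classes; the generalized tuples (here, majors only) are invented for illustration:

```python
from collections import Counter

# Invented generalized tuples for the target and contrasting classes.
grad      = ["Science", "Science", "Engineering", "Business"]
undergrad = ["Science", "Engineering", "Engineering", "Business", "Business"]

def count_percent(values):
    """count% of each generalized tuple within its own class."""
    return {v: 100 * n / len(values) for v, n in Counter(values).items()}

g, u = count_percent(grad), count_percent(undergrad)
print(f"{'major':<12}{'grad count%':>14}{'undergrad count%':>20}")
for major in sorted(set(g) | set(u)):
    print(f"{major:<12}{g.get(major, 0.0):>13.1f}%{u.get(major, 0.0):>19.1f}%")
```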