Data Warehousing & Data Mining


VTU Question Paper Solution

Unit-1

1. What is data mining? Explain data mining and knowledge discovery. [June/July 2014] [10 marks]

Simply stated, data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long. "Knowledge mining," a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material (Figure 1.3). Thus, such a misnomer that carries both "data" and "mining" became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD; alternatively, others view data mining as simply an essential step in the process of knowledge discovery.

Knowledge base: This is the domain knowledge used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.

User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

2. What is an operational data store (ODS)? Explain with a neat diagram. [June/July 2014] [10 marks]

The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as online analytical processing (OLAP) systems.

The major distinguishing features between OLTP and OLAP are summarized as follows:

Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.

Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.

View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.

3. What is ETL? Explain the steps in ETL. [Dec 2013] [10 marks]

The data analysis task involves data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. For example, how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute? Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values. Such metadata can be used to help avoid errors in schema integration. The metadata may also be used to help transform the data (e.g., where data codes for pay type in one database may be "H" and "S", and 1 and 2 in another). Hence, this step also relates to data cleaning, as described earlier.

Redundancy is another important issue. An attribute (such as annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For numerical attributes, we can evaluate the correlation between two attributes by computing the correlation coefficient (see the sketch at the end of this unit).

4. What are the guidelines for implementing a data warehouse? [Dec 2013] [10 marks]

Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. Data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that with competition mounting in every industry, data warehousing is the latest must-have marketing weapon - a way to retain customers by learning more about their needs.

Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis. According to William H. Inmon, a leading architect in the construction of data warehouse systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process." This short but comprehensive definition presents the major features of a data warehouse. The four keywords - subject-oriented, integrated, time-variant, and nonvolatile - distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems.
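Relating to the redundancy discussion in question 3, the following is a minimal sketch of correlation analysis between two numeric attributes, assuming the Pearson correlation coefficient; the attribute values and the 0.9 threshold are invented for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Example: annual_revenue is almost a linear function of monthly_revenue,
# so one of the two attributes is likely redundant.
monthly = [10, 20, 30, 40, 50]
annual  = [121, 238, 365, 479, 602]
r = pearson_correlation(monthly, annual)
if abs(r) > 0.9:          # illustrative threshold
    print(f"strong correlation (r={r:.3f}); consider dropping one attribute")
```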

Unit-2

1. Explain four types of attributes by giving appropriate examples. [June/July 2014] [10 marks]

2. Distinguish between OLTP and OLAP. [June/July 2014] [10 marks]

To facilitate efficient data accessing, most data warehouse systems support index structures and materialized views (using cuboids). General methods to select cuboids for materialization were discussed in the previous section. In this section, we examine how to index OLAP data by bitmap indexing and join indexing.

The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the domain of the attribute. If the domain of a given attribute consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.
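A minimal sketch of the bitmap index construction just described: one bit vector per distinct attribute value, with the bit for row i set to 1 when row i holds that value. The sample city column is invented for illustration.

```python
def build_bitmap_index(column):
    """One bit vector (list of 0/1) per distinct value of the attribute."""
    index = {}
    for row_id, value in enumerate(column):
        if value not in index:
            index[value] = [0] * len(column)
        index[value][row_id] = 1
    return index

# Example: a 'city' attribute with a 3-value domain -> 3 bit vectors.
city = ["Vancouver", "Chicago", "Vancouver", "Toronto"]
for value, bits in build_bitmap_index(city).items():
    print(value, bits)
# Vancouver [1, 0, 1, 0]
# Chicago   [0, 1, 0, 0]
# Toronto   [0, 0, 0, 1]
```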

3. Explain the operation of the data cube with suitable examples. [Dec 2013] [10 marks]

Data warehouses contain huge volumes of data. OLAP servers demand that decision support queries be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques. In this section, we present an overview of methods for the efficient implementation of data warehouse systems.

Efficient Computation of Data Cubes

At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions. In SQL terms, these aggregations are referred to as group-bys. Each group-by can be represented by a cuboid, where the set of group-bys forms a lattice of cuboids defining a data cube. In this section, we explore issues relating to the efficient computation of data cubes.

The Compute Cube Operator and the Curse of Dimensionality

One approach to cube computation extends SQL so as to include a compute cube operator. The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation. This can require excessive storage space, especially for large numbers of dimensions. We start with an intuitive look at what is involved in the efficient computation of data cubes.

Example: A data cube is a lattice of cuboids. Suppose that you would like to create a data cube for AllElectronics sales that contains the following: city, item, year, and sales_in_dollars. You would like to be able to analyze the data with queries such as the following: compute the sum of sales, grouping by city and item; compute the sum of sales, grouping by city; compute the sum of sales, grouping by item. (A small sketch enumerating this cuboid lattice appears at the end of this unit.)

4. Write short notes on: i) ROLAP ii) MOLAP iii) data cube iv) FASMI. [Dec 2013] [10 marks]

i) OLAP servers may use relational OLAP (ROLAP), multidimensional OLAP (MOLAP), or hybrid OLAP (HOLAP). A ROLAP server uses an extended relational DBMS that maps OLAP operations on multidimensional data to standard relational operations.

ii) A MOLAP server maps multidimensional data views directly to array structures. A HOLAP server combines ROLAP and MOLAP. For example, it may use ROLAP for historical data while maintaining frequently accessed data in a separate MOLAP store.

iii) A multidimensional data model is typically used for the design of corporate data warehouses and departmental data marts. Such a model can adopt a star schema, snowflake schema, or fact constellation schema. The core of the multidimensional model is the data cube.
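To make the lattice-of-cuboids idea concrete, here is a minimal sketch that enumerates all 2^n group-bys over the dimensions of the example above (city, item, year) and sums sales for each cuboid. The sample rows are invented, and this brute-force enumeration is only for illustration; it is not how a real OLAP server computes cubes.

```python
from itertools import combinations
from collections import defaultdict

# (city, item, year, sales) rows -- invented sample data
rows = [
    ("Vancouver", "TV",    2004, 1200),
    ("Vancouver", "phone", 2004,  300),
    ("Chicago",   "TV",    2004,  900),
]
dims = ("city", "item", "year")

# Every subset of the dimensions is one cuboid; 3 dims -> 2**3 = 8 cuboids.
for k in range(len(dims) + 1):
    for group_by in combinations(range(len(dims)), k):
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[i] for i in group_by)   # group-by key
            totals[key] += row[3]                   # sum of sales
        names = [dims[i] for i in group_by] or ["(apex: all)"]
        print(names, dict(totals))
```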

Unit-3

1. Explain the challenges that motivated the development of data mining. [June/July 2014] [10 marks]

The challenges that motivated data mining are:
- Scalability
- High dimensionality
- Heterogeneous and complex data
- Data ownership and distribution
- Non-traditional analysis

2. Discuss the tasks of data mining with suitable examples. [June/July 2014] [10 marks]

Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed. A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or depths. The data mining primitives specify the following.

The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or dimensions).

The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.

The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction.

The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.

The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.

A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with data mining systems. Having a data mining query language provides a foundation on which user-friendly graphical interfaces can be built.

3. Explain briefly any five data preprocessing approaches. [Dec 2013] [10 marks]

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to the sales at your branch. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions to be included in your analysis, such as item, price, and units_sold. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history or modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

There are many possible reasons for noisy data (having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes used, or inconsistent formats for input fields, such as date. Duplicate tuples also require data cleaning.

Data cleaning routines work to clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied to them. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines, as sketched below.
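A minimal sketch of two of the cleaning routines just mentioned: filling in missing numeric values with the attribute mean, and flagging values far from the mean as possible outliers. The price values and the deviation threshold are invented for illustration.

```python
import statistics

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the recorded values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def flag_outliers(values, k=3.0):
    """Return indices of values more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > k * std]

prices = [23.0, None, 25.5, 24.1, None, 980.0]   # invented 'price' attribute
cleaned = fill_missing_with_mean(prices)
print(cleaned)
print(flag_outliers(cleaned, k=1.5))   # -> [5], the suspicious 980.0 entry
```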

Unit-4

1. Explain descriptive tasks in detail. [June/July 2014] [10 marks]

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology. Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems, which may help potential users distinguish between such systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows:

2. Develop the Apriori algorithm for generating frequent itemsets. [Dec 2013] [10 marks]

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall see below. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property, presented below, is used to reduce the search space. We will first describe this property, and then show an example illustrating its use.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent. The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent; that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either; that is, P(I ∪ A) < min_sup. This property belongs to a special category of properties called antimonotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called antimonotone because the property is monotonic in the context of failing a test.

How is the Apriori property used in the algorithm? To understand this, let us look at how Lk-1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join and prune actions.

1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second-to-last item in l1). By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k-1)-itemset li, this means that the items are sorted such that li[1] < li[2] < ... < li[k-1]. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]). The condition l1[k-1] < l2[k-1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k-2], l1[k-1], l2[k-1]}.

2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge.

3. What is association analysis? [Dec 2013] [10 marks]

What is keyword-based association analysis? Such analysis collects sets of keywords or terms that occur frequently together and then finds the association or correlation relationships among them. Like most of the analyses in text databases, association analysis first preprocesses the text data by parsing, stemming, removing stop words, and so on, and then invokes association mining algorithms. In a document database, each document can be viewed as a transaction, while the set of keywords in the document can be considered as the set of items in the transaction. That is, the database is in the format {document_id, a set of keywords}.

The problem of keyword association mining in document databases is thereby mapped to item association mining in transaction databases, where many interesting methods have been developed. Notice that a set of frequently occurring consecutive or closely located keywords may form a term or a phrase. The association mining process can help detect compound associations, that is, domain-dependent terms or phrases, such as [Stanford, University] or [U.S., President, George W. Bush], or noncompound associations, such as [dollars, shares, exchange, total, commission, stake, securities]. Mining based on these associations is referred to as term-level association mining (as opposed to mining on individual words). Term recognition and term-level association mining enjoy two advantages in text analysis: (1) terms and phrases are automatically tagged, so there is no need for human effort in tagging documents; and (2) the number of meaningless results is greatly reduced, as is the execution time of the mining algorithms.

With such term and phrase recognition, term-level mining can be invoked to find associations among a set of detected terms and keywords. Some users may like to find associations between pairs of keywords or terms from a given set of keywords or phrases, whereas others may wish to find the maximal set of terms occurring together. Therefore, based on user mining requirements, standard association mining or max-pattern mining algorithms may be invoked.
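A minimal Python sketch of the level-wise Apriori procedure described in question 2, which also underlies keyword association mining once documents are mapped to transactions of keywords. Candidate generation, Apriori-property pruning, and support counting follow the join-and-prune description above; the baskets and minimum support are invented for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:               # candidate is contained in the transaction
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # join step: union pairs of frequent (k-1)-itemsets that yield a k-itemset
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = count(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"bread", "beer"}]
print(apriori(baskets, min_support=2))
```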

Unit-5

1. Explain (a) continuous, (b) discrete, and (c) asymmetric attributes with examples. [June/July 2014] [10 marks]

A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical or numeric. Discrete attributes are often represented by integer values, e.g., zip code, counts, etc.

A continuous attribute is one whose values are real numbers. Continuous attributes are typically represented as floating-point variables, e.g., temperature, height, etc.

For asymmetric attributes, only presence - a nonzero attribute value - is regarded as important. For example, consider a data set where each object is a student and each attribute records whether or not the student took a particular course at a university. For a specific student, an attribute has a value of 1 if the student took the course and a value of 0 otherwise. Because each student takes only a small fraction of all available courses, most of the values in such a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the nonzero values (a short sketch of this sparse representation appears below).

2. Explain Hunt's algorithm and illustrate its working. [June/July 2014] [10 marks]

Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.

Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related. For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task. Including such attributes may otherwise slow down, and possibly mislead, the learning step.
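A small sketch of the asymmetric-attribute point in question 1: because almost every (student, course) value is 0, it is more compact to keep only the nonzero entries. All names and values here are invented for illustration.

```python
# Dense representation: one 0/1 attribute per course, mostly zeros.
dense_record = {"student": "S42",
                "CS101": 1, "CS230": 0, "CS311": 1, "MA110": 0, "PH100": 0}

# Sparse (asymmetric) representation: keep only the courses actually taken.
sparse_record = {"student": "S42", "courses_taken": {"CS101", "CS311"}}

# Presence is what matters; the absent zeros carry little information.
print("CS311" in sparse_record["courses_taken"])   # True
```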

3. What is a rule-based classifier? Explain how a rule-based classifier works. [Dec 2013] [7 marks]

Using IF-THEN rules for classification: Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form IF condition THEN conclusion. An example is rule R1:

R1: IF age = youth AND student = yes THEN buys_computer = yes.

The IF-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The THEN-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth and student = yes) that are logically ANDed. The rule's consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).

If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple. A rule R can be assessed by its coverage and accuracy. Given a tuple X from a class-labeled data set D, let n_covers be the number of tuples covered by R, n_correct be the number of tuples correctly classified by R, and |D| be the number of tuples in D. We can define the coverage and accuracy of R as

coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers

That is, a rule's coverage is the percentage of tuples that are covered by the rule (i.e., whose attribute values hold true for the rule's antecedent). For a rule's accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.

4. Write the algorithm for k-nearest neighbour classification. [Dec 2013] [3 marks]

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods and the effectiveness of methods for clustering complex shapes.
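A minimal sketch of the coverage and accuracy measures defined in question 3, applied to rule R1; the tuples in the tiny data set D are invented for illustration.

```python
def rule_r1(tuple_):
    """R1: IF age = youth AND student = yes THEN buys_computer = yes."""
    return tuple_["age"] == "youth" and tuple_["student"] == "yes"

def coverage_and_accuracy(rule, predicted_class, data, class_attr="buys_computer"):
    covered = [t for t in data if rule(t)]
    correct = [t for t in covered if t[class_attr] == predicted_class]
    coverage = len(covered) / len(data)        # n_covers / |D|
    accuracy = len(correct) / len(covered)     # n_correct / n_covers
    return coverage, accuracy

D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "senior", "student": "no",  "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "yes"},
]
print(coverage_and_accuracy(rule_r1, "yes", D))   # (0.5, 0.5)
```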

Unit-6

1. Explain (a) the origins of data mining and (b) data mining tasks in brief. [June/July 2014] [10 marks]

We have observed various types of databases and information repositories on which data mining can be performed. Let us now examine the kinds of data patterns that can be mined. Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.

In some cases, users may have no idea regarding what kinds of patterns in their data may be interesting, and hence may like to search for several different kinds of patterns in parallel. Thus it is important to have a data mining system that can mine multiple kinds of patterns to accommodate different user expectations or applications. Furthermore, data mining systems should be able to discover patterns at various granularity (i.e., different levels of abstraction). Data mining systems should also allow users to specify hints to guide or focus the search for interesting patterns. Because some patterns may not hold for all of the data in the database, a measure of certainty or trustworthiness is usually associated with each discovered pattern. Data mining functionalities, and the kinds of patterns they can discover, are described below.

Concept/class description: characterization and discrimination. Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or (3) both data characterization and discrimination.

Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query. For example, to study the characteristics of software products whose sales increased by 10% in the last year, the data related to such products can be collected by executing an SQL query.

2. What is Bayes' theorem? Show how it is used for classification. [June/July 2014] [10 marks]

3. Discuss the methods for estimating the predictive accuracy of a classification method. [Dec 2013] [7 marks]

How can we use the above measures to obtain a reliable estimate of classifier accuracy (or predictor accuracy in terms of error)? Holdout, random subsampling, cross-validation, and the bootstrap are common techniques for assessing accuracy based on randomly sampled partitions of the given data. The use of such techniques to estimate accuracy increases the overall computation time, yet is useful for model selection.

Holdout method and random subsampling: The holdout method is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model. Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration. (For prediction, we can take the average of the predictor error rates.)

Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds, D1, D2, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set in order to obtain a first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on. Unlike the holdout and random subsampling methods above, here each sample is used the same number of times for training and once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data. For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples. (A code sketch of this procedure appears at the end of this unit.)

4. What are the two approaches for extending binary classifiers to handle multi-class problems? [Dec 2013] [7 marks]

Supervised learning (classification):
a. Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
b. New data are classified based on the training set.

Unsupervised learning (clustering):
c. The class labels of the training data are unknown.
d. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
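A minimal sketch of the k-fold cross-validation procedure described in question 3: partition the tuples into k folds, hold out one fold per iteration, and accumulate correct classifications. The majority-class "classifier" is only a stand-in; any train/predict pair could be plugged in, and the data are invented for illustration.

```python
def k_fold_accuracy(data, labels, k, train_fn, predict_fn):
    """Average accuracy over k folds: fold i is the test set in iteration i."""
    n = len(data)
    folds = [list(range(i, n, k)) for i in range(k)]   # simple round-robin split
    correct = 0
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, data[i]) == labels[i] for i in test_idx)
    return correct / n   # total correct classifications over all tuples

# Stand-in 'classifier': always predicts the majority class of the training set.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

X = [[0], [1], [2], [3], [4], [5]]
y = ["yes", "yes", "yes", "no", "yes", "yes"]
print(k_fold_accuracy(X, y, k=3,
                      train_fn=train_majority,
                      predict_fn=predict_majority))   # 5 of 6 tuples correct
```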

Unit-7

1. Explain predictive tasks in detail with an example. [June/July 2014] [10 marks]

2. List and explain four distance measures used to compute the distance between a pair of points, and find the distance between two objects represented by their attributes. [June/July 2014] [10 marks] (A short sketch of four such measures appears at the end of this unit.)

Imagine that you are given a set of data objects for analysis where, unlike in classification, the class label of each object is not known. This is quite common in large databases, because assigning class labels to a large number of objects can be a very costly process. Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning. Typical requirements of clustering in data mining include the following.

Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.

Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control.

Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.

Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.

High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.

2. Explain the cluster analysis methods briefly. [Dec 2013] [10 marks]

Classification predicts categorical class labels: it classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data. Prediction models continuous-valued functions, i.e., it predicts unknown or missing values. Typical applications include credit approval, target marketing, medical diagnosis, and treatment effectiveness analysis.

Several methods exist for association-based classification:
- ARCS: quantitative association mining and clustering of association rules (Lent et al., 1997). It beats C4.5 in (mainly) scalability and also accuracy.
- Associative classification (Liu et al., 1998): it mines high-support and high-confidence rules of the form cond_set => y, where y is a class label.
- CAEP (classification by aggregating emerging patterns) (Dong et al., 1999): emerging patterns (EPs) are itemsets whose support increases significantly from one class to another; EPs are mined based on minimum support and growth rate.

3. What are the features of cluster analysis? [Dec 2013] [10 marks]

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis is an important human activity. By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are far away from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity.

As a data mining function, cluster analysis can be used as a standalone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research.
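Question 2 of this unit asks for distance measures, and the discussion above refers to Euclidean and Manhattan distance; here is a minimal sketch of four common measures (Euclidean, Manhattan, Minkowski, and supremum) evaluated on two small example points.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, h):
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

def supremum(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

x, y = (1, 2), (3, 5)        # two objects described by two numeric attributes
print(euclidean(x, y))       # 3.605...
print(manhattan(x, y))       # 5
print(minkowski(x, y, h=3))  # lies between the supremum and Manhattan values
print(supremum(x, y))        # 3
```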

Unit-8

1. Explain (a) anomaly detection and (b) data set, giving an example of each. [June/July 2014] [10 marks]

a) Anomaly detection: Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. Example: credit card fraud detection.

b) Data set: Common kinds of data sets include:
- Record data: data matrix, document data, transaction data
- Graph data: the World Wide Web, molecular structures
- Ordered data: spatial data, temporal data, sequential data

2. Explain web content mining. [June/July 2014] [10 marks]

Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page content. The heterogeneity and the lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery and organization difficult. Search and indexing tools for the Internet and the World Wide Web, such as Lycos, AltaVista, WebCrawler, ALIWEB [6], and MetaCrawler, provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level of organization for semi-structured data available on the web.

The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information. Web content mining is differentiated from two points of view [1]: the information retrieval view and the database view. R. Kosala et al. [2] summarized the research work done for unstructured and semi-structured data from the information retrieval view. It shows that most of the research uses a bag of words, based on statistics about single words in isolation, to represent unstructured text, and takes single words found in the training corpus as features. For semi-structured data, all the works utilize the HTML structures inside the documents, and some utilize the hyperlink structure between documents for document representation. As for the database view, in order to have better information management and querying on the web, the mining always tries to infer the structure of a web site so as to transform the web site into a database.

There are several ways to represent documents; the vector space model is typically used. The documents constitute the whole vector space. If a term t occurs n(d, t) times in document d, the t-th coordinate of d is n(d, t); document vectors may also be normalized, for example by the maximum term count max_t n(d, t). This representation does not capture the importance of a word in a document relative to the whole collection. To address this, tf-idf (term frequency times inverse document frequency) weighting is introduced, as sketched below.
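The vector-space discussion above ends by introducing tf-idf; here is a minimal sketch of the standard tf * idf weighting over a tiny invented corpus. Exact normalization conventions vary across systems, so this is illustrative rather than definitive.

```python
import math

def tfidf(corpus):
    """Return, per document, a dict mapping term -> tf * idf weight."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for doc in corpus:
        w = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)            # term frequency
            idf = math.log(n_docs / doc_freq[term])    # inverse document frequency
            w[term] = tf * idf
        weights.append(w)
    return weights

docs = [["data", "mining", "data"],
        ["web", "mining"],
        ["data", "warehouse"]]
for w in tfidf(docs):
    print({t: round(v, 3) for t, v in w.items()})
```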

The agent based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user, to discover and organize web-based information. Web content mining is differentiated from two different points of view:[1] Information Retrieval View and Database View. R. Kosala et al.[2] summarized the research works done for unstructured data and semi-structured data from information retrieval view. It shows that most of the researches use bag of words, which is based on the statistics about single words in isolation, to represent unstructured text and take single word found in the training corpus as features. For the semi-structured data, all the works utilize the HTML structures inside the documents and some utilized the hyperlink structure between the documents for document representation. As for the database view, in order to have the better information management and querying on the web, the mining always tries to infer the structure of the web site to transform a web site to become a database. There are several ways to represent documents; vector space model is typically used. The documents constitute the whole vector space. If a term t occurs n(d, t) in document D, the t-th coordinate of D is n(d, t). When the length of the words in a document goes to D maxt n(d, t). This representation does not realize the importance of words in a document. To resolve this, tfidf (Term Frequency Times Inverse Document Frequency) is introduced. 3. Briefly describe the text mining. [Dec 13] [10marks] Most previous studies of data mining have focused on structured data, such as relational, transactional, and data warehouse data. However, in reality, a substantial portion of the available information is stored in text databases (or document databases), which consist of large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, and Web pages. Text databases are rapidly growing due to the increasing amount of information available in electronic form, such as electronic publications, various kinds of electronic documents, e-mail, and the World Wide Web (which can also be viewed as a huge, interconnected, dynamic text database). Nowadays most of the information in government, industry, business, and other institutions are stored electronically, in the form of text databases. Data stored in most text databases are semistructured data in that they are neither completely unstructured nor completely structured. For example, a document may contain a few structured fields, such as title, authors, publication date, category, and so on, but also contain some largely unstructured text components, such as abstract and contents. There have been a great deal of studies on the modeling and implementation of semistructured data in recent database research. Moreover, information retrieval techniques, such as text indexing methods, have been developed to handle unstructured documents. Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data. Typically, only a small fraction of the many available documents will be relevant to a given individual Dept of CSE, SJBIT Page 23