INSTITUTE OF AERONAUTICAL ENGINEERING
(Autonomous)
Dundigal, Hyderabad - 500 043

INFORMATION TECHNOLOGY

DEFINITIONS AND TERMINOLOGY

Course Name: DATA WAREHOUSING AND DATA MINING
Course Code: AIT006
Program: B. Tech
Semester: VI
Branch: IT
Section: A, B
Academic Year: 2018 - 2019
Course Faculty: Ms. K. LaxmiNarayanamma, Assistant Professor, Dept. of IT

OBJECTIVES
I. To help students to consider in depth the terminology and nomenclature used in the syllabus.
II. To focus on the meaning of new words / terminology / nomenclature.

DEFINITIONS AND TERMINOLOGY QUESTION BANK

UNIT - I

1. Define Database.
A database is a collection of information that is organized so that it can be easily accessed, managed and updated. Data is organized into rows, columns and tables, and it is indexed to make it easier to find relevant information.

2. What is data warehouse?
Data warehousing is a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components which allows the strategic use of data.

3. What is data store?
A data store is a repository for storing, managing and distributing data sets on an enterprise level. It is a broad term that incorporates all types of data that are produced, stored and used by an organization.

4. What is data integration?
Data integration is a process in which heterogeneous data is retrieved and combined into a unified form and structure. Data integration allows different data types (such as data sets, documents and tables) to be merged by users, organizations and applications, for use in personal or business processes and functions.

5. What is data mart?
A data mart is a repository of data that is designed to serve a particular community of knowledge workers. Because data marts catalog specific data, they often require less space than enterprise data warehouses, making them easier to search and cheaper to run.

6. What is Enterprise Data warehouse?
In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources.

7. Define Meta Data.
Metadata is data that describes other data. Meta is a prefix that in most information technology usages means "an underlying definition or description." Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier.

8. What is operational Data?
An operational data store (ODS) is a type of database that is often used as an interim logical area for a data warehouse. An ODS can be used for integrating disparate data from multiple sources so that business operations, analysis and reporting can be carried out while business operations are occurring.

9. Define OLTP (online transaction processing).
OLTP (online transaction processing) is a class of software programs capable of supporting transaction-oriented applications on the Internet. Typically, OLTP systems are used for order entry, financial transactions, customer relationship management (CRM) and retail sales.

10. Define data cube.
An OLAP cube is a multidimensional database that is optimized for data warehouse and online analytical processing (OLAP) applications. An OLAP cube is a method of storing data in a multidimensional form, generally for reporting purposes. In OLAP cubes, data (measures) are categorized by dimensions.

11. Define OLAP (online analytical processing).
OLAP (online analytical processing) is a computing method that enables users to easily and selectively extract and query data in order to analyze it from different points of view. OLAP is the technology behind many Business Intelligence (BI) applications. OLAP is a powerful technology for data discovery, including capabilities for limitless report viewing, complex analytical calculations, and predictive "what-if" scenario (budget, forecast) planning.

12. List OLAP operations.
   1. Roll-up (Drill-up)
   2. Drill-down
   3. Slice and Dice
   4. Pivot

13. Define Roll-up operation on data cube?
The roll-up operation performs aggregation on a data cube either by climbing up the concept hierarchy of a dimension or by dimension reduction.

14. Define Drill-down operation on data cube?
Drill-down is the reverse of roll-up. It navigates from a higher-level summary to lower-level, more detailed data. Drill-down can be performed either by
   1. Stepping down a concept hierarchy for a dimension
   2. Introducing a new dimension.

15. Define Slice operation on data cube?
The Slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube. This reduces the dimensionality of the cube.

16. Define Dice operation on data cube?
The Dice operation defines a sub-cube by performing a selection on two or more dimensions.

CLO codes for the questions above: CLO1 AIT006.01, CLO2 AIT006.02, CLO3 AIT006.03
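The roll-up, slice and dice operations defined above, together with pivot, can be illustrated on a toy cube. The following is only a minimal sketch: the cube contents, the dimension names in `DIMS`, and the helper names `roll_up`, `slice_cube`, `dice` and `pivot` are assumptions made for illustration, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales cube: (quarter, city, item) -> units sold.
CUBE = {
    ("Q1", "Hyderabad", "phone"): 10,
    ("Q1", "Hyderabad", "laptop"): 4,
    ("Q1", "Chennai", "phone"): 7,
    ("Q2", "Hyderabad", "phone"): 12,
    ("Q2", "Chennai", "laptop"): 5,
}
DIMS = ("quarter", "city", "item")


def roll_up(cube, dims, drop):
    """Roll-up by dimension reduction: aggregate away one dimension."""
    i = dims.index(drop)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out), dims[:i] + dims[i + 1:]


def slice_cube(cube, dims, dim, value):
    """Slice: fix a single value on one dimension, giving a sub-cube."""
    i = dims.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, dims, selections):
    """Dice: select allowed values on two or more dimensions at once."""
    idx = {dims.index(d): allowed for d, allowed in selections.items()}
    return {k: v for k, v in cube.items()
            if all(k[i] in allowed for i, allowed in idx.items())}


def pivot(cube, dims, new_order):
    """Pivot (rotate): reorder the axes without changing any values."""
    order = [dims.index(d) for d in new_order]
    rotated = {tuple(k[i] for i in order): v for k, v in cube.items()}
    return rotated, tuple(new_order)


by_quarter_item, reduced_dims = roll_up(CUBE, DIMS, "city")
q1_only = slice_cube(CUBE, DIMS, "quarter", "Q1")
q1_phones = dice(CUBE, DIMS, {"quarter": {"Q1"}, "item": {"phone"}})
rotated, rotated_dims = pivot(CUBE, DIMS, ("item", "city", "quarter"))
```

Drill-down is the reverse direction: moving from `by_quarter_item` back to the per-city figures is only possible because the detailed `CUBE` is still available.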

17. Define Pivot operation on data cube?
Pivot is also known as rotate. It rotates the data axes to view the data from different perspectives.

18. Distinguish OLTP and OLAP with respect to users and system orientation?
An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

19. Distinguish OLTP and OLAP with respect to data contents?
An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.

20. List distinguishing features of OLTP and OLAP.
   1. Users and system orientation
   2. Data contents
   3. Database design
   4. View
   5. Access patterns

21. What is a multidimensional data model?
The multidimensional data model is an integral part of On-Line Analytical Processing (OLAP). Because OLAP is analytic, its queries are complex. The multidimensional data model is designed to answer such complex queries in real time.

22. Define Star Schema?
In data warehousing and business intelligence (BI), a star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or login. A dimension contains reference information about the fact, such as date, product, or customer. A star schema is diagrammed by surrounding each fact with its associated dimensions. The resulting diagram resembles a star.

23. Define Snowflake schema.
In computing, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions, and the dimension tables are themselves normalized into multiple related tables.

CLO codes for the questions above: CLO1 AIT006.01, CLO2 AIT006.02, CLO3 AIT006.03

UNIT - II

1. What is data mining?
Data mining is the process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis. Data mining tools allow enterprises to predict future trends.

2. Need of Data Mining?
In the present world, a huge amount of data is available in the information industry. This data is of no use until it is converted into useful knowledge.

3. List the steps included in data mining.
   1. Data Cleaning
   2. Data Integration
   3. Data Selection
   4. Data Transformation
   5. Data Mining
   6. Pattern Evaluation
   7. Knowledge Presentation

4. What is data cleaning?
Data cleaning (data cleansing) is the process of altering data in a given storage resource to make sure that it is accurate and correct. There are many ways to pursue data cleansing in various software and data storage architectures; most of them center on the careful review of data sets and the protocols associated with any particular data storage technology.

5. What is Data Integration?
Data integration is the process of combining data from multiple data sources.

6. Define data Selection?
Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. Data selection precedes the actual practice of data collection.

7. What is Data Transformation?
Data transformation is the process of converting data or information from one format to another, usually from the format of a source system into the required format of a new destination system. The usual process involves converting documents, but data conversions sometimes involve the conversion of a program from one computer language to another to enable the program to run on a different platform. The usual reason for this data migration is the adoption of a new system that is totally different from the previous one.
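The first four steps in the list above (cleaning, integration, selection, transformation) can be sketched on a few in-memory records. This is a minimal illustration under stated assumptions: the two sources `store_a` and `store_b`, their field names, and the helper functions are hypothetical, and "selection" is taken here to mean keeping only the attributes relevant to the analysis.

```python
# Hypothetical records from two sources with different field names and formats.
store_a = [
    {"id": 1, "name": " Asha ", "amount": "250.0"},
    {"id": 2, "name": "Ravi", "amount": "99.5"},
    {"id": 2, "name": "Ravi", "amount": "99.5"},   # duplicate row
    {"id": 4, "name": "Kiran", "amount": None},    # missing value
]
store_b = [
    {"cust_id": 3, "customer": "Meena", "amt_paise": 120050},
]


def clean(rows):
    """Data cleaning: drop duplicates and rows with missing amounts, trim names."""
    seen, out = set(), []
    for row in rows:
        if row["amount"] is None or row["id"] in seen:
            continue
        seen.add(row["id"])
        out.append({**row, "name": row["name"].strip()})
    return out


def integrate(rows_a, rows_b):
    """Data integration: map the second source onto the first source's schema."""
    mapped = [{"id": r["cust_id"], "name": r["customer"],
               "amount": r["amt_paise"] / 100} for r in rows_b]
    return rows_a + mapped


def select(rows, fields):
    """Data selection: keep only the attributes needed for the analysis."""
    return [{f: r[f] for f in fields} for r in rows]


def transform(rows):
    """Data transformation: convert every amount to a float."""
    return [{**r, "amount": float(r["amount"])} for r in rows]


prepared = transform(select(integrate(clean(store_a), store_b), ("id", "amount")))
```

After these steps, `prepared` holds one record per customer in a single consistent format, which is the form a mining step would expect as input.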

8. List the major components of data mining system architecture.
   1. Database or Data Warehouse Server
   2. Data Mining Engine
   3. Pattern Evaluation
   4. Knowledge Base
   5. Graphical User Interface

9. What is data mining Engine?
Data mining is a very important process where potentially useful and previously unknown information is extracted from large volumes of data. There are a number of components involved in the data mining process. These components constitute the architecture of a data mining system. The data mining engine is the core component among them: it contains the modules that perform the actual mining tasks, such as association, classification, and cluster analysis.

10. Define Knowledge Base with respect to data mining.
The knowledge base consists of data that is very important in the process of data mining. The knowledge base provides input to the data mining engine, which guides the engine in the process of pattern search.

11. List the applications of Data Mining?
   1. Market Analysis
   2. Fraud Detection
   3. Customer Retention
   4. Production Control
   5. Science Exploration

12. List the Data Mining Functionalities?
   1. Concept / Class description
   2. Association (correlation and causality)
   3. Classification and Prediction
   4. Cluster analysis
   5. Outlier analysis
   6. Trend and evolution analysis

13. What are the predictive tasks of data mining?
   1. Classification
   2. Prediction
   3. Time Series analysis

14. What is Clustering principle?
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.

15. List the different measures to evaluate the pattern / rules.
   1. Objective measures: based on statistics and structures of patterns, e.g., support, confidence, etc.
   2. Subjective measures: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

16. List the Data Mining system classifications.
   1. Based on different views of the data mining system
   2. Kinds of databases to be mined
   3. Kinds of knowledge to be discovered
   4. Kinds of techniques utilized
   5. Kinds of applications adapted

CLO codes for the questions above: CLO5 AIT006.05, CLO6 AIT006.06

17. List the Data Mining task primitives.
   1. Set of task-relevant data to be mined
   2. Kind of knowledge to be mined
   3. Background knowledge to be used in the discovery process
   4. Mining methodology and user interaction
   5. Performance and scalability

18. Define Data reduction.
Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form. The basic concept is the reduction of multitudinous amounts of data down to the meaningful parts.

19. State the need for data reduction in data mining?
A database or data warehouse may store terabytes of data, so it may take a very long time to perform data analysis and mining on such huge amounts of data.

20. List the Data Reduction Strategies.
   1. Data Cube Aggregation
   2. Dimensionality Reduction
   3. Data Compression
   4. Numerosity Reduction
   5. Discretisation and concept hierarchy generation

UNIT - III

1. What is a market basket?
A market basket is a collection of items purchased by a customer in a single transaction, which is a well-defined business activity.

2. What is association rule mining?
Association rule mining is finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

3. What is the need for association rule mining?
In a given transaction with multiple items, it tries to find the rules that govern how or why items are often bought together.

4. What are the measures of rule interestingness?
Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules.

CLO codes for the questions above: CLO5 AIT006.05
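Support and confidence, the two interestingness measures named above, can be computed directly from a list of market baskets. The sketch below uses a made-up set of four transactions; the function names `support` and `confidence` are illustrative choices.

```python
# Hypothetical market baskets: each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent):
    """How often the rule antecedent -> consequent holds when the antecedent occurs."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


rule_support = support({"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence({"bread"}, {"milk"})  # 2 of the 3 baskets with bread
```

Here the rule bread -> milk has support 0.5 and confidence 2/3: half of all baskets contain both items, and two of the three baskets containing bread also contain milk.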

5. List the applications of association rule mining?
   1. Market basket data analysis
   2. Catalog design
   3. Cross marketing

6. State Apriori property?
The Apriori property states that all nonempty subsets of a frequent itemset must also be frequent. The Apriori algorithm is an influential algorithm for mining frequent itemsets for boolean association rules. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.

7. Define Monotonic functions?
A monotonic function is a function which is either entirely nonincreasing or nondecreasing. A function is monotonic if its first derivative (which need not be continuous) does not change sign.

8. What is Multilevel association rule?
Multilevel association rules can be defined as association rules applied over different levels of data abstraction.

9. What is Multidimensional association rule?
A multidimensional association rule can be defined as a rule that contains two or more predicates/dimensions.

10. Define Categorical Attributes.
In statistics, a categorical variable is a variable that can take on one of a limited and usually fixed number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.

12. Define Quantitative Attributes.
A quantitative attribute (QA) is a numeric attribute, such as age or income, whose values are measurable and ordered, so that a value can be compared against a lower limit and an upper limit. For example, the result of a test is inferred by comparing the measured value against an upper and a lower limit.

13. State constraint based association mining.
Constraint-based association rule mining aims to develop a systematic method by which the user can find important associations among items in a database of transactions. To elaborate, many retailers, such as supermarkets, carry a large number of items.

14. List the constraints used in mining.
The kinds of constraints used in mining are
   1. Knowledge type constraints
   2. Data constraints
   3. Dimension/level constraints
   4. Rule constraints
   5. Interestingness constraints

CLO codes for the questions above: CLO9 AIT006.09, CLO10 AIT006.10
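The bottom-up candidate generation used by Apriori (question 6 of this unit) can be sketched as follows. This is a simplified illustration, not an optimized implementation: the function name `apriori`, the absolute `min_support_count` parameter and the sample transactions are assumptions. The pruning line relies on the fact that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frequent itemset: support count}, growing itemsets one item at a time."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Test the candidates against the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(current)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune any candidate that has an infrequent (k-1)-subset.
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
result = apriori(baskets, min_support_count=2)
```

With a minimum support count of 2, all three single items and all three pairs are frequent, while the triple {bread, milk, butter} is generated as a candidate but rejected because it occurs in only one basket.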

15. What is Closed Frequent Item set?
It is an item set that is both closed and frequent, that is, its support is greater than or equal to the minimum support. An item set is closed in a data set if there exists no proper superset that has the same support count as the original item set.

16. Why Is Frequent Pattern Growth Fast?
   1. No candidate generation
   2. No candidate test
   3. Uses a compact data structure
   4. Eliminates repeated database scans
   5. Basic operations are counting and FP-tree building

17. What is Support?
Support is an indication of how frequently the item set appears in the dataset.

18. What is Confidence?
Confidence is an indication of how often the rule has been found to be true.

19. What is support and minimum support?
The minimum support and minimum confidence are set by the users, and are parameters of the Apriori algorithm for association rule generation. These parameters are used to exclude rules in the result that have a support or a confidence lower than the minimum support and minimum confidence respectively.

20. What is pruning in data mining?
Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

21. What is frequent Item set?
A frequent itemset is an itemset whose support is greater than or equal to the minimum support threshold.

UNIT - IV

1. Define classification in data mining?
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.

2. What is prediction?
Prediction in data mining is to identify data points purely on the description of another related data value. It is not necessarily related to future events, but the values being predicted are unknown. Prediction is used to estimate unknown or missing values. Prediction in data mining is also known as numeric prediction.

3. List the steps in classification?
   1. Model construction: describing a set of predetermined classes
   2. Model usage: for classifying future or unknown objects

4. List the common machine learning algorithms.
   1. Linear Regression
   2. Logistic Regression
   3. Decision Tree
   4. SVM
   5. Naive Bayes
   6. KNN
   7. K-Means
   8. Random Forest

5. Define a decision Tree?
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

6. What is Tree pruning in data mining?
Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

7. State the use of Decision tree?
A decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making).

8. What are attribute selection measures?
   1. Information Gain
   2. Gain Ratio
   3. Gini Index

9. What is Probabilistic learning?
Probabilistic learning calculates explicit probabilities for hypotheses and is among the most practical approaches to certain types of learning problems.

10. Define Probabilistic prediction?
Probabilistic prediction predicts multiple hypotheses, weighted by their probabilities.

CLO codes for the questions above: CLO9 AIT006.09, CLO12 AIT006.12, CLO13 AIT006.13
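The three attribute selection measures listed in question 8 of this unit can be computed for a candidate split as shown below. This is a sketch using the standard textbook formulas (entropy-based information gain, gain ratio as gain divided by split information, and Gini impurity); the loan-applicant rows and the attribute names `income` and `risk` are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index (impurity) of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of the target after splitting on the attribute."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder


def gain_ratio(rows, attribute, target):
    """Information gain normalized by the split information of the attribute."""
    split_info = entropy([r[attribute] for r in rows])
    return information_gain(rows, attribute, target) / split_info if split_info else 0.0


# Hypothetical loan applicants: income level perfectly separates the risk class.
applicants = [
    {"income": "high", "risk": "low"},
    {"income": "high", "risk": "low"},
    {"income": "low", "risk": "high"},
    {"income": "low", "risk": "high"},
]
gain = information_gain(applicants, "income", "risk")
ratio = gain_ratio(applicants, "income", "risk")
impurity = gini([r["risk"] for r in applicants])
```

Because `income` separates the two risk classes perfectly in this toy data, the information gain equals the full 1 bit of entropy in the class labels, while the Gini index of the unsplit labels is 0.5.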

11. Define Bayesian classification?
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

12. Define lazy learning?
Lazy learning is a learning method in which generalization of the training data is delayed until a query is made to the system.

13. Define eager learning?
Eager learning is a learning method in which the system tries to generalize the training data before receiving queries.

14. State the disadvantage of lazy learning?
The disadvantages of lazy learning include the large space requirement to store the entire training dataset. Noisy training data in particular increases the case base unnecessarily, because no abstraction is made during the training phase.

15. State the reason why the nearest neighbor is a lazy algorithm?
K-NN is a lazy learner because it does not learn a discriminative function from the training data but memorizes the training dataset instead.

16. Define regression analysis in data mining?
Regression is a data mining technique used to predict a range of numeric values (also called continuous values), given a particular dataset. Regression is used across multiple industries for business and marketing planning, financial forecasting, environmental modeling and analysis of trends.

17. Give the methods for comparing classification and prediction.
The criteria for comparing the methods of classification and prediction are
   1. Accuracy
   2. Speed
   3. Robustness
   4. Scalability
   5. Interpretability

18. Define data science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

19. Define accuracy of a classifier?
The accuracy of a classifier refers to its ability to predict the class label correctly. The accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.

CLO codes for the questions above: CLO12 AIT006.12, CLO13 AIT006.13
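The lazy behaviour of k-nearest neighbours and the notion of classifier accuracy described above can be seen together in a small sketch. The class name `KNearestNeighbors`, the helper `accuracy` and the two-dimensional sample points are assumptions for illustration; distance is plain Euclidean distance via `math.dist` (Python 3.8 or later).

```python
from collections import Counter
from math import dist


class KNearestNeighbors:
    """Lazy learner: fit() only stores the data; all work happens at query time."""

    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # No discriminative function is learned; the training set is memorized.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # Rank the stored points by distance to the query and vote among the k nearest.
        ranked = sorted(zip(self.points, self.labels),
                        key=lambda pair: dist(pair[0], query))
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


def accuracy(model, points, labels):
    """Fraction of test points whose predicted class label is correct."""
    correct = sum(model.predict(p) == y for p, y in zip(points, labels))
    return correct / len(labels)


train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["low", "low", "low", "high", "high", "high"]
model = KNearestNeighbors(k=3).fit(train_points, train_labels)

test_points = [(2, 2), (9, 9), (4, 4)]
test_labels = ["low", "high", "high"]
score = accuracy(model, test_points, test_labels)
```

The model stores all six training points, which reflects the space cost noted for lazy learning. On this toy test set it classifies two of the three points correctly, giving an accuracy of 2/3.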

20. What is Naïve Bayes algorithm?
Naive Bayes is a machine learning algorithm for classification problems. It is based on Bayes' probability theorem. It is primarily used for text classification, which involves high-dimensional training data sets.

UNIT - V

1. What is a cluster in data mining?
Clustering is the process of grouping abstract objects into classes of similar objects. A cluster of data objects can be treated as one group. While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.

2. Define Cluster analysis.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

3. What is supervised learning?
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

4. What is machine learning?
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.

5. Give examples of clustering.
   1. Biology
   2. Information retrieval
   3. Land use
   4. Marketing
   5. City-planning
   6. Climate

CLO codes for the questions above: CLO15 AIT006.15, CLO16 AIT006.16, CLO17 AIT006.17

6. Give the Considerations for Cluster Analysis.
   1. Partitioning criteria
   2. Separation of clusters
   3. Similarity measure
   4. Clustering space

7. Give Major Clustering Approaches.
   1. Partitioning approach
   2. Hierarchical approach
   3. Density-based approach
   4. Grid-based approach

8. What is good clustering?
A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.

9. State the weakness of K-Means algorithm.
   1. Applicable only when the mean is defined
   2. Need to specify k, the number of clusters, in advance
   3. Unable to handle noisy data and outliers
   4. Not suitable to discover clusters with non-convex shapes

10. Define time series database?
A time series database (TSDB) is a software system that is optimized for handling time series data, that is, arrays of numbers indexed by time (a datetime or a datetime range).

11. What is the Type of data variables used in clustering analysis?
   1. Interval-scaled variables
   2. Binary variables
   3. Nominal, ordinal, and ratio variables
   4. Variables of mixed types

12. What is the Categorization of Major Clustering Methods?
   1. Partitioning algorithms
   2. Hierarchy algorithms
   3. Density-based
   4. Grid-based
   5. Model-based

13. List the steps in K-means clustering algorithm.
   1. Initialize the centers of the clusters
   2. Assign each data point to its closest cluster
   3. Set the position of each cluster to the mean of all data points belonging to that cluster
   4. Repeat steps 2 and 3 until the assignments no longer change

14. List the classification of clustering methods.
   1. Partitioning Method
   2. Hierarchical Method
   3. Density-based Method
   4. Grid-Based Method
   5. Model-Based Method
   6. Constraint-based Method

15. State the key difference between classification and clustering.
Classification takes data and puts it into pre-defined categories, whereas in clustering the set of categories into which the data is to be grouped is not known beforehand.

16. Define EM (Expectation-Maximization) algorithm.
The Expectation-Maximization (EM) algorithm is a commonly used algorithm for model-based clustering. EM clustering is an iterative algorithm that alternates between an expectation (E) step and a maximization (M) step.

17. Define outlier.
In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.

CLO codes for the questions above: CLO15 AIT006.15, CLO17 AIT006.17, CLO19 AIT006.19, CLO20 AIT006.20
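The K-means steps listed in this unit (initialize the centers, assign each point to its closest cluster, move each center to the mean of its points) can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and caller-supplied initial centers so that the result is deterministic; the function name `k_means` and the sample points are hypothetical. The loop repeats the assign and update steps until the centers stop changing.

```python
from math import dist


def k_means(points, initial_centers, max_iterations=100):
    """Return (centers, clusters) after iterating assignment and mean update."""
    centers = [tuple(c) for c in initial_centers]      # step 1: initial centers
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Step 2: assign every point to the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its members (keep it if empty).
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:                      # converged: nothing moved
            break
        centers = new_centers
    return centers, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, initial_centers=[(1.0, 1.0), (9.0, 8.0)])
```

The need to pass `initial_centers` (and therefore to fix k in advance) mirrors one of the K-means weaknesses listed in question 9, and because each center is a mean, a single distant outlier would pull its center away from the bulk of the cluster.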