Pattern Management. Irene Ntoutsi, Yannis Theodoridis
Information Systems Lab, Department of Informatics, University of Piraeus, Hellas
Technical Report Series UNIPI-ISL-TR, March 2007
Abstract

The need for pattern management has become pressing, as the information flood has spread data mining technology across application domains. This need has been recognized by both the academic and the industrial communities, resulting in a number of approaches towards efficient pattern management. In this chapter we review the work performed so far in the area of pattern management and outline open research directions regarding complete pattern management.

1 Introduction

Nowadays a huge quantity of raw data is collected from different application domains (business, science, telecommunication, health care systems etc.). Also, the World Wide Web overwhelms us with information. According to a recent survey [19], the world produces between 1 and 2 exabytes of unique information per year, which is roughly 250 megabytes for every man, woman, and child on earth. Due to their quantity and complexity, it is impossible for humans to thoroughly investigate these data collections directly. Knowledge Discovery in Databases (KDD) and Data Mining provide a solution to this problem by generating compact and semantically rich representations of raw data, called patterns [22]. Patterns are compact since they provide a high level description of raw data characteristics. They are also rich in semantics since they reveal new knowledge hidden in the huge amount of raw data. Clusters, decision trees, association rules and frequent itemsets are some of the most well known patterns produced by Data Mining applications.

Due to the spread of Data Mining technology, itself a result of the data flood, even the amount of patterns extracted from heterogeneous data sources is huge and, quite often, unmanageable by humans.
Thus, there is a need for efficient pattern management, including issues like modeling, storage, retrieval and manipulation of patterns. However, the majority of work in the data mining area focuses on efficient mining and puts aside the pattern management problem, which has only lately begun to gain attention from both the scientific and the industrial communities. In the next sections we review the work performed so far in the area of pattern management, pointing out the advantages and disadvantages of each approach.

This research is partially supported by the Greek Ministry of Education and the European Union under a grant of the Heracletos EPEAEK II Programme ( ). Part of this work has been presented at the tutorial Mining the Volatile Web [23].

2 Current pattern management approaches

The need for pattern management has been recognized by both the scientific and the industrial communities, and as a result several pattern management approaches have been proposed in the literature [10]. Among them, the efforts of the scientific community try to deal with all the aspects of the pattern management problem, including both representation and manipulation issues. Examples of this category are the inductive databases approach [15, 1], the 3 Worlds model [17] and the PANDA project approach [22, 8, 24]. Industrial approaches, on the other hand, mainly focus on pattern representation and storage, aiming at easy interchange of patterns between different vendors' applications. Examples of this category include specifications and standards like the Predictive Model Markup Language (PMML) [13], the Common Warehouse Metamodel [2], the SQL/MM Part 6 for Data Mining [4] and the Java Data Mining API (JDM API) [5], as well as extensions of existing commercial DBMS systems like Oracle 10g Data Mining [7], Microsoft SQL Server 2005 Analysis Manager [6] and IBM DB2 Intelligent Miner [3].

In the next sections we review the approaches proposed so far, taking into account the following aspects [10]:
- the chosen architecture with respect to the raw data:
  - integrated, if patterns and data are stored together
  - independent, if patterns are stored independently of raw data
- the adopted pattern model, i.e. what pattern characteristics are supported by the model
- the capabilities of the proposed pattern languages, i.e. what operations are supported by the language:
  - pattern operations that refer exclusively to patterns
  - cross over operations that relate patterns to raw data
- the support for temporal management of patterns. Since the raw data from which patterns are extracted are usually dynamic, patterns are also dynamic, i.e. they are associated with some temporal notion. Taking this temporal notion into account enables a variety of applications, like monitoring, synchronization with respect to the underlying raw data space etc.
3 Scientific approaches

Scientific approaches try to provide an overall solution to the pattern management problem, providing both representation/storage and retrieval/manipulation capabilities. In the next subsections we present three of these approaches in more detail, namely inductive databases, the 3 Worlds model and the PANDA model.

3.1 Inductive databases

The inductive databases framework, introduced by Imielinski and Mannila in 1996 [15], is inspired by the idea that the data mining process should be supported by database technology. This is why the approach relies on an integrated architecture where data and patterns are stored together in the same repository. Within the inductive databases framework the KDD process is considered as a kind of extended query processing in which users can query both data and patterns to gain insight about the data. To this aim a so called inductive query language is used. An inductive query language is an extension of a database language that allows one to:
- select, manipulate and query data as in standard queries
- select, manipulate and query patterns
- execute cross over queries over patterns, i.e. queries that associate patterns to raw data

Due to the importance of query languages within the inductive databases framework, many such languages have been proposed, most of them as SQL extensions. The most well known among them are DMQL, MINE RULE and MINE SQL.

DMQL

Han et al [14] proposed DMQL for the generation of patterns from relational data. DMQL supports several kinds of rules: generalized relations (a relation obtained by generalizing from a large set of low level data), characteristic rules (an assertion which characterizes a concept), discriminant rules (an assertion which discriminates concepts described by different classes), classification rules (a set of rules associated with a specific class problem) and association rules (which describe association relationships between the data).
The general syntax of a query in DMQL is depicted in Figure 1. An example of generating association rules through DMQL is depicted in Figure 2. In this example, student is the raw data relation from which patterns will be mined; gpa, birth place and address are the relevant attributes to be used; association rules is the type of patterns to be mined; major = cs and birth place = Canada are the conditions or constraints over the structure components of the generated rules; and support threshold=0.05, confidence threshold=0.7 are the conditions or constraints over the measure components of the generated rules.

Figure 1: The general syntax of DMQL (taken from [14])

Figure 2: Example of a DMQL query (taken from [14])

Within DMQL the resulting rules can be stored into relational tables. The limitation of DMQL, however, is that only pattern extraction operations are supported; no support is provided for further retrieval and manipulation of the extracted patterns. Furthermore, only rule patterns are considered.

MINE-RULE

Meo et al [20] proposed MINE RULE, an extension of SQL, for discovering association rules from relational data. It introduces a new operator, named MINE RULE; an example of its usage is depicted in Figure 3.

Figure 3: Example of a MINE RULE query (taken from [20])

The result of this query is a new table called SimpleAssociations with schema BODY, HEAD, SUPPORT, CONFIDENCE (defined by the SELECT clause), where each tuple corresponds to a discovered rule. These rules come from the raw data relation Purchase (as specified by the FROM clause), where data are grouped by the attribute transaction (as specified in the GROUP BY clause). The mining parameters, namely support and confidence, are specified in the EXTRACTING RULES WITH clause.

As the example demonstrates, within MINE RULE the resulting rules are stored into relational tables. The limitation of MINE RULE, however, is that only association rule discovery is supported; in particular, several variants of simple association rules are covered, like association rules from filtered groups, association rules with clustering and association rules with mining conditions. Furthermore, further manipulation of the retrieved rules is not supported.

MINE-SQL

Imielinski and Virmani [16] proposed MINE SQL (MSQL) for the generation and querying of association rules. MSQL provides operators for rule generation, post processing and cross over queries over patterns. A rule generation example is depicted in Figure 4. Here a set of rules is generated from the relation Transactions with specific confidence and support criteria, and the results are stored into the relation MarketAssRules.

Figure 4: Example of an MSQL query: rule generation (taken from [16])

A pattern post processing example is depicted in Figure 5; from all the rules generated by the previous query (relation MarketAssRules) only those with a specific structure, i.e. body=yes, are requested.

Figure 5: Example of an MSQL query: rule post processing (taken from [16])

Finally, a cross over query example is depicted in Figure 6. It is implemented through the operator VIOLATES ALL, which returns all raw data tuples of the relation Transactions that do not participate in any of the rules specified by the GetRules operator.
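The three kinds of operations described above (rule generation, post processing and cross over querying) can be illustrated outside SQL with a small Python sketch. It is not MSQL and does not reproduce the queries of [16]: the transactions, the brute-force one-item rule generator and the `participates` test are all assumptions made for illustration. In particular, "participates" is read here as "the transaction contains both the body and the head of the rule", which may differ from the exact semantics of VIOLATES ALL in MSQL.

```python
from itertools import permutations

# Hypothetical market-basket data; not taken from [16].
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "chips"},
]


def support(itemset, data):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in data) / len(data)


def generate_rules(data, min_support, min_confidence):
    """Brute-force generation of one-item body => one-item head rules."""
    items = sorted(set().union(*data))
    rules = []
    for body, head in permutations(items, 2):
        rule_support = support({body, head}, data)
        if rule_support < min_support:
            continue
        # Every item occurs in the data, so the body support is never zero.
        confidence = rule_support / support({body}, data)
        if confidence >= min_confidence:
            rules.append({"body": body, "head": head,
                          "support": rule_support, "confidence": confidence})
    return rules


def participates(transaction, rule):
    """Assumed reading: the transaction contains both body and head."""
    return {rule["body"], rule["head"]} <= transaction


# Generation: rules are kept as records with BODY, HEAD, SUPPORT, CONFIDENCE.
market_rules = generate_rules(transactions, min_support=0.5, min_confidence=0.6)

# Post processing: keep only the rules with a given body.
bread_rules = [r for r in market_rules if r["body"] == "bread"]

# Cross over: raw transactions that participate in none of the rules.
uncovered = [t for t in transactions
             if not any(participates(t, r) for r in market_rules)]

print(len(market_rules), [r["head"] for r in bread_rules], uncovered)
```

Storing each rule as a flat record mirrors the relational storage used by the languages above, which is what makes post processing a plain selection over the rule table.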
Figure 6: Example of an MSQL query: cross over query (taken from [16])

Besides the VIOLATES ALL operator, further pattern-data consistency checking operators are provided, like VIOLATES ANY, SATISFIES ALL and SATISFIES ANY. As already noted, within MSQL the resulting rules are stored into relational tables and, in addition to rule generation, post processing and cross over queries over rules are also supported. The limitation of MSQL, however, is that only association rule manipulation is supported.

Overview of inductive databases

Inductive databases are based on an integrated architecture where both data and patterns are stored in the same repository and treated in a similar manner; in fact, the data mining process is considered as an extended querying process. Within the inductive databases framework only specific types of patterns are supported, usually association rules and text mining patterns. Inductive databases treat patterns as permanent objects that can be stored (together with data, in relational tables) and further manipulated through several languages, like DMQL, MINE RULE and MSQL. Among them MSQL seems the most complete, in that it supports generation, manipulation and cross over queries over patterns.

There is no temporal information regarding patterns, like for example their creation time and/or their validity period. The closest attempt in this direction is MSQL, which provides cross over operations like SATISFIES ALL/ANY and VIOLATES ALL/ANY. These operators perform a kind of pattern synchronization or validity checking against a raw data set rather than capturing pattern changes, so they provide only an indirect temporal notion of patterns.

3.2 3 Worlds model

The 3 Worlds model, introduced in 2000 by Johnson et al [17], provides a unified framework for pattern management that supports the combination of different mining, analysis and aggregation KDD tasks. It relies on three key notions: regions, dimensions and hierarchies.
Different pattern types serve to split a given data set into a collection of subsets of tuples, which are called regions. For example, the cube operator merely splits up a data set into regions corresponding to all possible group by expressions. In general, the regions created vary in their spatial shapes and in whether they overlap. A set of related regions is called a dimension. As an example, consider the set of itemsets with support exceeding a given threshold, or a set of itemsets containing a specific item p. Dimensions might come with interesting structure that relates their regions in the form of a hierarchy. In the case of frequent itemsets, for example, the lattice of all possible subsets of the set of itemsets is the associated hierarchy, whereas in the case of a decision tree the associated hierarchy is the tree.

The 3 Worlds model is based on a separated architecture consisting of three distinct worlds:
- Intensional world (I World): In the I World, each region is represented in its intensional form, i.e. as a description of its members through constraints. For example, a cluster of products based on their price can be described as: 10 ≤ price ≤ 20. Regions produced by most mining/analysis operations are convex and can be modeled as a set of linear inequalities. Non convex regions can be modeled as a union of multiple convex regions.
- Extensional world (E World): In the E World, each region is represented in its extensional form, i.e. by an explicit enumeration of the tuples belonging to each region. Recalling the previous example, the extensional representation contains all source data items with price between 10 and 20.
- Data world (D World): The D World consists of raw data, e.g. in the form of relations, from which regions and dimensions can be created as a result of mining. It can be viewed as a relational database.
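The difference between the intensional and the extensional form of a region can be made concrete with a short Python sketch. It is only an illustration under stated assumptions, not part of the 3 Worlds model of [17]: the class `IntervalRegion`, the function `extension`, the product data and the inclusive bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalRegion:
    """Intensional form: a convex region given by low <= attribute <= high."""
    attribute: str
    low: float
    high: float

    def contains(self, row):
        return self.low <= row[self.attribute] <= self.high


def extension(regions, relation):
    """Extensional form: enumerate the tuples satisfying any of the regions.

    Passing several convex regions models a non convex region as their union.
    """
    return [row for row in relation if any(r.contains(row) for r in regions)]


# Hypothetical raw data relation (the D World in the terminology above).
products = [
    {"name": "pen", "price": 5},
    {"name": "notebook", "price": 12},
    {"name": "lamp", "price": 20},
    {"name": "chair", "price": 45},
]

# Intensional description of the price cluster from the running example.
price_cluster = [IntervalRegion("price", 10, 20)]
cluster_members = [row["name"] for row in extension(price_cluster, products)]

# A non convex region expressed as a union of two convex regions.
extremes = [IntervalRegion("price", 0, 5), IntervalRegion("price", 40, 50)]
extreme_members = [row["name"] for row in extension(extremes, products)]

print(cluster_members, extreme_members)
```

The intensional form stays constant in size however large the relation grows, whereas the extensional form must be recomputed whenever the raw data change.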
In short, the I World is the world of patterns, the D World is the world of raw data from which patterns have been extracted through some mining process, and the E World is the intermediate world that bridges the gap between the other two worlds by determining which data produce which patterns.

Manipulating the worlds

The manipulation of structures in each world can be performed through an algebra of choice. For the manipulation of regions and dimensions in the I World the authors propose the dimension algebra, which comprises operators similar to those of relational algebra as well as very different ones that add significant value to data mining and analysis. Examples of such operators are:
- Selection operator: The selection operator invokes various spatial predicates such as overlap, containment, disjointness and non containment. As an example, consider the query: find all regions of a given decision tree that overlap with a given constant region: age 25.
- Purge operator: The purge operator removes inconsistent regions that might result after applying some operator on consistent regions. For example, although the regions A ≤ 2 and A ≥ 3 are each consistent, their conjunction (A ≤ 2) ∧ (A ≥ 3) is inconsistent.
- Cartesian product: Cartesian product creates new regions, which are obtained by taking the pairwise intersection of regions in the two dimension instances.
- Union: In contrast to the traditional approaches, here union does not require union compatibility between its operands, and thus a fully heterogeneous union is permitted.
- Minus: Just like the union operator, minus does not require union compatibility between its operands. The result of minus contains exactly those regions in D1 for which no equivalent region exists in D2.

For the manipulation of extensional dimensions in the E World, the authors state that relational algebra, extended with aggregation and with some modifications, could be utilized. Relational algebra (extended with aggregation) is also the natural algebra of choice for the D World.

Connecting the worlds

An important part of this work is the set of bridge operators between the three worlds, which are depicted in Figure 7. The bridge operators that facilitate moving in and out of the worlds are the following:
- Populate: The populate operator amounts to populating the regions in an intensional dimension D with tuples from a given data set E, producing an extensional dimension E. Intuitively, for each region in D, there is a corresponding extensional region in E which contains just those tuples in E that satisfy the region description. For example, populating a collection of frequent sets with a transaction database would give an extensional dimension with each region containing the set of supporting transactions. Thus, the populate operator relates the I World with the D World and lets us move to the E World.
- Mine: Given a parameter p, the mine operation maps a relation E to an intensional dimension D. As an example of the mine operator, consider decision tree extraction. The mine operator specifies the type of the model, but its exact definition and computation depend entirely on the specific mining task invoked. In practice, invocation of the mine operator would result in the running of a relevant (fast) mining algorithm. Thus, the mine operator bridges the D and I Worlds.
- Lookup: The lookup operator is invoked whenever one wants to recall, or look up, which intensional dimension gave rise to an extensional dimension. This operator is utilized if someone wants to find out the intensional descriptions of regions in E, as opposed to enumerations of their member tuples. Thus, the lookup operator bridges the I and E Worlds.
- Refresh: Refresh is an incremental version of the populate operator. The refresh and lookup operators help in maintaining extensional (resp., intensional) dimensions in sync with manipulations being done on the intensional (resp., extensional) dimensions.

Figure 7: The different worlds in the 3 Worlds model and the bridges (taken from [17])

Overview of 3 Worlds model

The 3 Worlds model provides a generic framework towards unified mining, supporting the management of patterns, of the underlying raw data and of their interconnections. It relies on an independent architecture, thus distinguishing pattern management from raw data management while preserving the mappings between patterns and the underlying raw data. The 3 Worlds model is not restricted to a specific type of patterns; rather, it considers patterns as constraints and thus it is more generic than inductive databases. Algebras for the manipulation of each world, as well as for their relationships, have been proposed. The temporal aspects of patterns are not considered within this model, although synchronization and consistency issues are addressed through the relevant query operators.

The 3 Worlds model is very appealing in theory; in practice, however, there has been no follow-up work or prototype implementation. This, of course, does not negate the fact that the 3 Worlds model is an important attempt towards unified data mining. Also, to the best of our knowledge, the 3 Worlds model constitutes the first attempt towards discrete pattern management that also considers linkages to the underlying raw data set from which patterns have been extracted.

3.3 PANDA framework

PANDA provides a unified framework for the representation of heterogeneous patterns, relying on a separated architecture. Its goal is the design of a Pattern Base Management System (PBMS) for patterns, in correspondence to the Data Base Management Systems (DBMS) for raw data. The PBMS is a system for handling (storing/processing/retrieving) patterns defined over raw data in order to efficiently support pattern matching and to exploit pattern related operations generating intensional information. The set of patterns managed by a PBMS is called a pattern base. The PANDA logical model is composed of three basic building blocks:
- Pattern type: A pattern type pt is a quintuple pt = (n, ss, ds, ms, et), where n is the name of the pattern type, ss is the structure schema, ds is the source schema, ms is the measure schema and et is an expression template, written in a given language, which refers to type names appearing in the source and structure schemas. The name component declares the name of the pattern type, e.g. association rules, decision tree, etc.
The structure schema component defines the pattern space by describing the structure of the pattern instances of the pattern type. The source schema component defines the related source space by describing the dataset from which patterns are constructed. The measure schema component describes the measures which quantify the quality of the source data representation achieved by the pattern. The expression template component describes the relationship between the source space and the pattern space, thus carrying the semantics of the pattern. An example of defining the pattern type Association Rule is depicted below:
n : AssociationRule
ss : RECORD(head : SET(STRING), body : SET(STRING))
ds : BAG(transaction : SET(STRING))
ms : RECORD(confidence : REAL, support : REAL)
et : head ∪ body ⊆ transaction
- Pattern: A pattern is an instantiation of the pattern type. More formally, a pattern p of type pt is a quintuple p = (pid, s, d, m, e), where pid is the unique pattern identifier, s is the structure component, d is the source component, m is the measure component and e is the expression component. The structure component positions the pattern within the pattern space defined by its pattern type, the source component identifies the specific dataset the pattern relates to, the measure component estimates the quality of the raw data representation achieved by the pattern, and finally the expression component relates the pattern to the source dataset. For example, in the case of an association rule, its structure is comprised of the head and the body components; its measure consists of the support and the confidence components; its source is the dataset from which the rule has been extracted; and finally, the expression component of the rule is an enumeration of the dataset tuples that support it. An example of defining a pattern belonging to the pattern type Association Rule is depicted below:
pid : 512
s : (head = {Boots}, body = {Socks, Hat})
d : SELECT SETOF(article) FROM sales GROUP BY transactionid
m : (confidence = 0.75, support = 0.55)
e : {Boots, Socks, Hat} ⊆ transaction
- Class: A class is a set of semantically related patterns. As an example, consider the class SalesRules containing the set of patterns (including pattern 512) generated from sales using e.g. the Apriori algorithm.

To increase the expressiveness of the model, the authors have also proposed the following relationships between patterns:
- Specialization: Specialization allows new entities to be cheaply derived from existing ones, thus providing the model with extensibility and reusability. For example, the pattern type Cluster could be specialized to the type Cluster of Integer.
- Composition: Composition allows the creation of a part-of hierarchy of patterns, where the structure schema of a given pattern is defined in terms of other patterns. As an example, consider the pattern type Clustering, which refers to a set of patterns of type Cluster. In this example, there is a composition relationship between the type Clustering and the type Cluster.
- Refinement: Refinement supports the modeling of patterns obtained by mining other existing patterns. As an example, consider a pattern type Cluster of Rules that defines a cluster over association rules instead of raw data. In this example, the Cluster of Rules type refines the Association Rule pattern type.

Within the PANDA framework a Pattern Manipulation Language (PML) and a Pattern Query Language (PQL) have been proposed. Besides pattern queries, cross over queries that relate patterns with the raw data from which they have been extracted have also been proposed. The PBMS architecture of the PANDA approach is depicted in Figure 8.

Figure 8: The PBMS architecture of PANDA (taken from [22])
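The three building blocks of the PANDA logical model described above (pattern type, pattern and class) can be mirrored in a few Python data classes. This is only a sketch for illustration and not the PANDA implementation: the class and field names, the schema-as-dictionary encoding and the `conforms` check are assumptions; the example values are those of the Association Rule pattern 512 given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class PatternType:
    """pt = (n, ss, ds, ms, et); schemas are encoded as plain dicts/strings."""
    name: str
    structure_schema: dict
    source_schema: str
    measure_schema: dict
    expression_template: str


@dataclass
class Pattern:
    """p = (pid, s, d, m, e), an instance of a pattern type."""
    pid: int
    pattern_type: PatternType
    structure: dict
    source: str
    measure: dict
    expression: str

    def conforms(self):
        """Check that the structure and measure fields match the type's schemas."""
        return (set(self.structure) == set(self.pattern_type.structure_schema)
                and set(self.measure) == set(self.pattern_type.measure_schema))


@dataclass
class PatternClass:
    """A set of semantically related patterns of one pattern type."""
    name: str
    pattern_type: PatternType
    patterns: list = field(default_factory=list)

    def add(self, pattern):
        if pattern.pattern_type is not self.pattern_type or not pattern.conforms():
            raise ValueError("pattern does not match the pattern type of the class")
        self.patterns.append(pattern)


association_rule = PatternType(
    name="AssociationRule",
    structure_schema={"head": "SET(STRING)", "body": "SET(STRING)"},
    source_schema="BAG(transaction : SET(STRING))",
    measure_schema={"confidence": "REAL", "support": "REAL"},
    expression_template="head ∪ body ⊆ transaction",
)

rule_512 = Pattern(
    pid=512,
    pattern_type=association_rule,
    structure={"head": {"Boots"}, "body": {"Socks", "Hat"}},
    source="SELECT SETOF(article) FROM sales GROUP BY transactionid",
    measure={"confidence": 0.75, "support": 0.55},
    expression="{Boots, Socks, Hat} ⊆ transaction",
)

sales_rules = PatternClass("SalesRules", association_rule)
sales_rules.add(rule_512)
print([p.pid for p in sales_rules.patterns])
```

Keeping the source as a query string rather than as materialized tuples reflects the separated architecture: the pattern base refers to the raw data without copying them.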
The PANDA model [22] has been extended in [12] in order to address the need for managing temporal information associated with patterns. More specifically, the notions of temporal validity, semantic validity and safety of patterns have been added to the definition of a pattern. The temporal validity is defined by the user and reflects the user's intuition regarding the period during which the pattern will be valid. The semantic validity is stricter, in that it depends on whether the underlying raw data space has changed. If a pattern is both temporally and semantically valid then it is considered safe. The previously proposed PML and PQL languages [22] have been extended in [12] to cope with temporal features during pattern manipulation and querying. As a proof of concept, the PSYCHO prototype [11] has been implemented.

Recognizing the importance of dissimilarity operators for querying, indexing, mining etc., the PANDA group has proposed a generic and flexible framework [9] capable of assessing dissimilarity between both simple patterns and complex ones (i.e. patterns defined over other patterns instead of raw data). The dissimilarity score between complex patterns (e.g. clusterings) is conceptually evaluated in a bottom up fashion by aggregating the dissimilarities between the matched component patterns (e.g. clusters). Multiple coupling types and aggregation logics are supported. Furthermore, efficiency is addressed by avoiding accesses to the raw data from which patterns have been extracted.

Overview of PANDA model

The PANDA model provides a generic and extensible framework for full pattern management: generic in that it treats different pattern types in a uniform way, and extensible in that new pattern types are easily accommodated into the model.
It relies on an independent architecture where patterns are treated as first class citizens, but it also considers mappings to the underlying raw data space from which patterns have been extracted. Languages for querying and manipulating patterns, as well as for their association to raw data, have been proposed. The temporal aspects of patterns are taken into account in the core of the model and through the query languages.

Overview of theoretical approaches

Summarizing the theoretical approaches, we can state that they try to fully facilitate the management of data mining results, considering issues like storage, representation, querying etc.

Regarding the adopted architecture, inductive databases rely on an integrated architecture, whereas the 3 Worlds and PANDA models rely on a separated architecture that distinguishes the management of patterns from the management of raw data, although the intermediate mappings are preserved and exploited. The separated approach seems more appealing since it is specialized to patterns and their characteristics, whereas the integrated approach seems more applicable in that it exploits the infrastructure of already existing DBMSs.

Regarding the modeling of patterns, inductive databases concentrate mainly on association rules, whereas the 3 Worlds and PANDA approaches consider a generic model capable of handling different types of patterns. The first approach seems too restricted and inefficient considering the large diversity of patterns, whereas the second approach seems more efficient since it can be easily extended. Ideally, patterns should be treated in a unified way so that new types can easily be managed with the existing infrastructure. Furthermore, this generality should not ignore the special characteristics of each pattern type; on the contrary, these characteristics should be exploited for efficiency reasons.

Regarding the querying of patterns, all approaches recognize the need for pattern querying and retrieval as well as the need for cross over queries involving both patterns and raw data. Several issues arise here that need further investigation, for example efficient querying through index structures or sophisticated pattern synchronization with respect to raw data.

Finally, the temporal nature of patterns has been indirectly recognized by all approaches, in that synchronization and refresh operators have been provided. Only within PANDA, however, have the temporal aspects of patterns been considered in an integrated way in both the modeling and the querying of patterns.

4 Industrial approaches

Industrial approaches mainly focus on pattern representation and storage, aiming at easy interchange of patterns between different vendors' applications. In the next subsections we present four of these approaches in more detail, namely PMML, SQL/MM, CWM and JDM API. The most popular effort for modeling patterns is the Predictive Model Markup Language [13], which uses XML to represent data mining models. Another approach, based on SQL, is the SQL/MM standard [4]; here, the supported mining models are represented by structured types that constitute first-class SQL types made accessible through the SQL:1999 base syntax.
Another framework for metadata representation is proposed by the Common Warehouse Metamodel [2]: the main purpose of CWM is to enable easy interchange of warehouse and business intelligence metadata between various heterogeneous repositories, and not the effective manipulation of these metadata. The Java Data Mining API [5] addresses the need for procedural support of all the existing and evolving data mining standards. The JDM API specification supports the building of data mining models, as well as the creation, storage, access and maintenance of data and metadata that represent data mining results.

From an architectural point of view, all the above solutions can be divided into two categories. In the first one, an additional layer that manipulates patterns is built on top of a DBMS, following the traditional object-relational approach. In the second one, extension is achieved by allowing the integration of new system components (such as data types, access methods, storage structures, and low-level query processing techniques) into the DBMS kernel.

Overall, the listed approaches seem inadequate to represent and handle different classes of patterns in a flexible, effective, and coherent way: in fact, only a given list of predefined pattern types is considered, and no general and extensible approach to pattern modeling is proposed.

4.1 Specification and standards

Predictive Model Markup Language (PMML)

PMML [13] is an XML-based language that provides a quick and easy way for companies to:
- define data mining and statistical models using a vendor-independent method
- share models between PMML compliant applications

It allows users to develop models within one vendor's application and use other vendors' applications to visualize, analyze, evaluate or otherwise use these models. PMML has been developed by the Data Mining Group (DMG), an independent, vendor led group for DM standards. Among the DMG members are IBM, Microsoft, Oracle, SPSS and SAP. The PMML format is supported by major commercial products like Oracle, MSSQL and DB2.

PMML supports only predefined types. The types supported so far are: Association Rules, Decision Trees, Center/Distribution Based Clustering, (General) Regression, Neural Networks, Naive Bayes, Rulesets, Sequences, Text and Support Vector Machines. Through PMML, quality measures associated with patterns can also be represented. Furthermore, the relation between a pattern and the subset of the source dataset represented by the pattern is stored. Regarding the temporal notion of patterns, only the creation date/time of the model is stored.

An example of PMML representation for association rules is depicted in Figure 9. The raw data set used for the generation of the model is: t1: {Cracker, Coke, Water}, t2: {Cracker, Water}, t3: {Cracker, Water}, t4: {Cracker, Coke, Water}.

The data flow in PMML is depicted in Figure 10. Below we present each block of the flow in more detail:
- DataDictionary: The data dictionary block describes the raw input data hosted in external sources. More specifically, it defines how the model interprets these data according to their type (e.g. categorical) and value range.
- Transformation: The transformation block converts the original values to internal values using, e.g., normalization of numbers to [0, 1], discretization of continuous fields (derived fields) etc.
- MiningSchema: The mining schema block defines which fields a user has to provide in order to apply a specific model. It contains model-specific data like field usage type (e.g. predicted, ignored, input).
- Model: The model block defines specific model parameters (e.g. for trees, neural networks, regression).
- Output: The output block depends on the specific kind of model. It is described by leaf nodes in the case of a decision tree or by output neurons in the case of a neural network.
- Result: The result block is computed from the output of the model (e.g. in the case of a neural network, the numeric output value should be de-normalized to original domain values).

Figure 9: Representing association rules through PMML (taken from [13])

Figure 10: The data flow in PMML (taken from [13])

PMML supports model composition through sequences of models and model selection. A sequence of models is formed when two or more models are combined so that the results of one model are used as input to another. As an example, consider using regression elements within the nodes of a decision tree. Through model selection, the result of a model can be used to select which model should be applied next. PMML also supports built-in and user-defined functions for fine-grained transformations (e.g. sum, product, trim).

An important feature of PMML is model verification, which gives providers and consumers of PMML models a mechanism to ensure that a model deployed to a new environment generates results consistent with the environment where the model was developed. Recall that, due to differences in operating systems, data precision and algorithm implementation, the model's performance can be affected. To deal with this problem, a dataset of model inputs and known results is generated, which can be used to verify that the model produces accurate results regardless of the environment.

From the above, it is clear that PMML emphasizes pattern representation issues rather than management, providing a framework for pattern exchange between PMML compliant applications.

Common Warehouse Metamodel (CWM)

CWM [2] is a specification that enables easy interchange of metadata between data warehousing tools and metadata repositories in distributed heterogeneous environments. CWM is based on three key industry standards:
- Unified Modeling Language (UML), which is a modeling standard. All classes in CWM are expressed in UML.
- Meta Object Facility (MOF), which is a metamodeling and metadata repository standard.
- XML Metadata Interchange (XMI), which is a metadata interchange standard.

CWM consists of a number of sub-metamodels representing common warehouse metadata in the areas of Data Resources, Data Analysis (OLAP, Data Mining, ...) and Warehouse Management. The CWM Data Mining sub-metamodel represents three conceptual areas:
- Model description: The model description is a generic representation of a data mining model. It consists of the MiningModel, which is a representation of the mining model itself; the MiningSettings, which drive the construction of the model; the ApplicationInputSpecification, which specifies the set of input attributes for the model; and the MiningModelResult, which represents the result set produced by the testing or application of a generated model.
- Settings: The settings area elaborates further on the MiningSettings and their usage relationships to the attributes of the input specification (e.g. ClusteringSettings, AssociationRulesSettings etc.).
- Attributes: Two subclasses of MiningAttribute are defined: NumericAttribute and CategoricalAttribute.

The CWM specification has been designed for the analysis of large amounts of data, and the data mining process is just a small part of it. Only predefined pattern types are supported, like Clustering, Association Rules, Supervised, Approximation, Attribute Importance and Classification. CWM provides information concerning quality measures associated with patterns and supports links between a pattern and the subset of the source dataset represented by the pattern.

To conclude with CWM, we can state that it is not dedicated to patterns; rather, it is a more generic specification for the interchange of metadata. Furthermore, CWM mainly considers representation issues rather than general management issues.

SQL/MM Part 6, Data Mining

SQL/MM [4] has been developed by the International Organization for Standardization (ISO). It is a specification for supporting data management of common data types (text, spatial information, images, data mining results) relevant in multimedia and other knowledge-intensive applications. It consists of the following parts:
- Part 1: Framework
- Part 2: Full-text
- Part 3: Spatial
- Part 5: Still Image
- Part 6: Data Mining

SQL/MM defines first-class SQL types that can be accessed through SQL:1999 base syntax.
In the case of data mining models, every model has a corresponding SQL structured user-defined type. A set of predefined types completes the full definition of each model. These types are:
- DM-*Settings: It stores various parameters of the data mining model (e.g. maximum number of clusters, depth of a decision tree etc.). The * symbol stands for Class in the case of a classification model, Rule in the case of a rule model, Clustering in the case of a clustering model, and Regression in the case of a regression model.
- DM-*TestResult: It holds the results of testing during the training phase of the data mining models.
- DM-*Result: It is created by running a data mining model against real data.
- DM-*Task: It stores metadata that describe the process and control of test and application runs.

SQL/MM supports only the predefined types Clustering, Association Rules, Supervised, Classification and Regression. Furthermore, it stores the link between a pattern and the subset of the source dataset represented by the pattern. For each pattern type, a set of quality measures is provided. No temporal information is taken into account (except for the creation time of the model). Pattern manipulation and querying are possible through SQL. An example of SQL/MM - DM is depicted in Figure 11.

Concluding with SQL/MM, we can state that although only a part of it is dedicated to patterns, it provides both representation and management capabilities through SQL statements. However, since its goals concern the management of general multimedia data rather than patterns, it does not go deeply into efficient pattern management.

Java Data Mining API

The Java Data Mining API (JDMAPI) [5] addresses the need for a Java API, independent of the underlying data mining system, that supports the creation, storage, access and maintenance of data and metadata. It has been developed by Sun and is supported by many companies such as IBM, Oracle, SPSS and Blue Martini Software. JDMAPI provides standardized access to data mining patterns represented in various formats, including CWM, SQL/MM DM and PMML. It supports four conceptual areas: settings, models, transformations and results.
It consists of three logical components that are implemented either as one executable or in a distributed environment:
- Application Programming Interface (API): The API is the end-user-visible component of a JDMAPI implementation that allows access to services provided by the DME.
- Data Mining Engine (DME): The DME provides the infrastructure that offers a set of data mining services to its API clients.
- Metadata Repository (MR): The MR stores the data mining objects. It can be based on the CWM framework, specifically using the CWM Data Mining metamodel, or implemented using a vendor representation. The MR might exist in a file-based environment or in a relational database.

Figure 11: SQL/MM - DM example regarding association rules (taken from [4])

JDMAPI supports only the predefined types Clustering, Association Rules, Classification, Approximation and Attribute Importance. It also supports associating patterns with quality measures. Both pattern representation and management issues are considered within JDMAPI. An example of using JDMAPI for building a clustering model is depicted in Figure 12.

Figure 12: JDMAPI example - building a clustering model (taken from [5])

Overview of Specifications and Standards

Summarizing the specifications and standards, we can state that several efforts have been carried out in this direction, focusing mainly on pattern representation issues rather than full pattern management issues. This is because specifications and standards emphasize the efficient and effective interchange of patterns between different vendors' applications. In effect, they act as an agreement between data mining producers and consumers. Further pattern management issues like querying, indexing etc. are either not considered at all or receive little attention. The most popular effort in this direction is PMML, which is dedicated to data mining patterns and statistical models and is supported by major vendors in the field of data mining.

4.2 DBMS extensions

Oracle Data Mining (ODM)

ODM [7] provides data mining functionality embedded in Oracle Database 10g to mine data, extract hidden patterns and insights, and build advanced business intelligence applications. All the data mining functionality is embedded in the Oracle DB (Figure 13). ODM supports a variety of data mining tasks including Attribute importance, Classification and Regression, Clustering, Associations, Feature extraction, Text mining, and Sequence matching and alignment. ODM functionality is available to end users through the ODM Java API, the ODM PL/SQL API and the Data Mining Client (GUI). To build a model, the user defines the input data and the mining function settings. Models are built and stored in the Data Mining Server (DMS). ODM supports the persistence of mining results as independent, named entities in the DMS. A mining results object contains the operation start time and end time, the name of the model
used, input data location, and output data location (if any) for the data mining operation. The user can query model settings and details as well as test the model. ODM works towards supporting PMML import/export. Currently, only Naive Bayes and Association Rules models can be imported from and exported to the PMML format.

Figure 13: Oracle Data Mining embedded into Oracle DB (taken from [7])

Microsoft SQL Server 2005

MSSQL [6] provides an integrated environment for creating and working with data mining models. Currently, the supported models are Decision Trees, Clustering, Association Rules, Itemsets, Naive Bayes, Sequence Clustering, Time Series, and Neural Networks. It also supports the PMML 2.1 format. The major contributions of MSSQL regarding data mining are the following:
- The OLE DB for Data Mining interface, which provides the functionality to mine knowledge from relational data sources or multi-relational repositories.
- The Data Mining Extensions (DMX) to the SQL language, which provide a familiar query interface for data mining purposes. More specifically, model creation and training become familiar CREATE and INSERT INTO statements, whereas prediction and content discovery become familiar SELECT statements. Also, the storage and manipulation of patterns is performed in an SQL-like style and the results are stored in relational tables.

The data mining process in MSSQL is depicted in Figure 14.
Figure 14: Data Mining process in MSSQL (taken from [6])

IBM DB2 Intelligent Miner

IBM's DB2 Intelligent Miner [3] is a suite of tools for supporting Knowledge Management. The supported data mining algorithms are Association Rules, Clustering and Classification. Intelligent Miner produces PMML data mining models, which are stored as Binary Large Objects (BLOBs). The end user can interact with the system through an SQL API, so data mining results can be integrated into business applications using external programming languages. Furthermore, Intelligent Miner supports synchronization between patterns and data through a scoring mechanism that can be started by triggers monitoring changes in the raw data.

Overview of DBMS extensions

Summarizing the DBMS extensions to pattern management, we can state that industrial approaches have recognized the emerging need for pattern management and try to incorporate it into existing DBMSs. So far, issues like pattern storage, simple querying and naive synchronization with respect to raw data are considered. These issues are usually addressed by exploiting the infrastructure of the DBMS, e.g. SQL-like languages are used for pattern retrieval. In effect, these approaches treat patterns as just another type of (complex) data and rely on an integrated architecture that stores patterns and raw data in the same repository. We should also note that all approaches are PMML compatible, so as to facilitate the interchange of data mining models between different vendors' applications.
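The integrated style summarized above (patterns stored as ordinary rows and retrieved with SQL) can be illustrated with a minimal sketch. It is not ODM, DMX or Intelligent Miner code: SQLite, through Python's standard sqlite3 module, stands in for the DBMS, and the table names raw_data and rules, the helper functions and the thresholds are all assumptions made for illustration. The sketch mines single-item association rules from the four transactions of the PMML example, stores them next to the raw data, and then runs one pattern retrieval query and one simple cross-over query.

```python
import sqlite3
from itertools import permutations

# Raw data: the four transactions of the association-rule example (t1..t4).
TRANSACTIONS = {
    1: {"Cracker", "Coke", "Water"},
    2: {"Cracker", "Water"},
    3: {"Cracker", "Water"},
    4: {"Cracker", "Coke", "Water"},
}


def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for items in TRANSACTIONS.values() if itemset <= items)
    return hits / len(TRANSACTIONS)


def mine_rules(min_support=0.5, min_confidence=0.5):
    """Brute-force rules A -> B between single items (toy data only)."""
    items = sorted(set().union(*TRANSACTIONS.values()))
    rules = []
    for a, b in permutations(items, 2):
        s = support({a, b})
        c = s / support({a})
        if s >= min_support and c >= min_confidence:
            rules.append((a, b, s, c))
    return rules


def build_store():
    """Keep raw data and patterns side by side in one relational store.

    The schema is hypothetical; it only mirrors the integrated architecture.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_data (tid INTEGER, item TEXT)")
    conn.execute(
        "CREATE TABLE rules (rule_id INTEGER PRIMARY KEY, antecedent TEXT, "
        "consequent TEXT, support REAL, confidence REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_data VALUES (?, ?)",
        [(tid, item) for tid, items in TRANSACTIONS.items() for item in sorted(items)],
    )
    conn.executemany(
        "INSERT INTO rules (antecedent, consequent, support, confidence) "
        "VALUES (?, ?, ?, ?)",
        mine_rules(),
    )
    return conn


def strong_rules(conn, min_confidence=0.9):
    """Simple pattern retrieval: a plain SELECT over the pattern table."""
    return conn.execute(
        "SELECT antecedent, consequent FROM rules "
        "WHERE confidence >= ? ORDER BY antecedent, consequent",
        (min_confidence,),
    ).fetchall()


def supporting_transactions(conn, antecedent, consequent):
    """Simple cross-over query: join a stored rule back to the raw data."""
    rows = conn.execute(
        "SELECT DISTINCT a.tid FROM rules r "
        "JOIN raw_data a ON a.item = r.antecedent "
        "JOIN raw_data b ON b.item = r.consequent AND b.tid = a.tid "
        "WHERE r.antecedent = ? AND r.consequent = ? ORDER BY a.tid",
        (antecedent, consequent),
    ).fetchall()
    return [tid for (tid,) in rows]


if __name__ == "__main__":
    conn = build_store()
    print(strong_rules(conn))
    print(supporting_transactions(conn, "Coke", "Water"))
```

Keeping patterns as plain rows means that retrieval and cross-over queries need nothing beyond SQL, which is the appeal of the integrated architecture. The sketch also shows its limits: nothing here keeps the rules table synchronized when raw_data changes, and no index is tailored to pattern queries.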
5 Conclusions - Open issues - Research directions

In this chapter we have reviewed the work performed so far in the area of pattern management by both academia and industry. Academic approaches try to provide an overall and generic solution to the pattern management problem, sometimes starting from scratch, whereas industrial approaches try to incorporate pattern management into existing DBMSs.

Pattern management includes several issues like representation/storage, querying/retrieval, indexing, visualization etc. So far, all approaches deal with the pattern representation issue, either by adopting a separate representation for each pattern type, like PMML, or by adopting a common model for all pattern types, like the PANDA or 3-Worlds models. Extensibility to novel mining models is a desired feature for a pattern base management system. Ideally, a general model should be adopted that would also preserve and utilize pattern-type-specific characteristics.

Regarding the pattern storage issue, some approaches adopt the integrated architecture, where patterns and raw data are stored in the same repository, like inductive databases and industrial approaches, whereas others adopt a separate architecture, like PANDA and the 3-Worlds model. For the storage of patterns, existing data storage approaches are utilized, like relational databases, object-relational databases, XML and data files. Although this is a straightforward solution for vendors of commercial DBMSs, it could be revisited, since data mining results are far from traditional data [18].

A lot of work remains to be done in the area of pattern querying, especially in that of efficient querying through sophisticated index structures. So far, only naive querying capabilities are supported, like retrieval of patterns or simple cross-over queries involving both patterns and raw data.
Interesting extensions are sophisticated cross-over queries, similarity queries, and queries involving patterns of different pattern types [21].

References

[1] The CINQ (Consortium on Discovering Knowledge with Inductive Queries) project. (valid as of July 2006).
[2] Common Warehouse Metamodel (CWM). (valid as of July 2006).
[3] IBM DB2 Intelligent Miner. ibm.com/software/data/iminer, (valid as of July 2006).
[4] ISO SQL/MM Part 6. Documents/FCD/fcd-datamining pdf, (valid as of July 2006).
[5] Java Data Mining API. (valid as of July 2006).
[6] Microsoft SQL Server. (valid as of July 2006).
[7] Oracle10g Data Mining Concepts. (valid as of July 2006).
[8] The PANDA Project. (valid as of July 2006).
[9] I. Bartolini, P. Ciaccia, I. Ntoutsi, M. Patella, and Y. Theodoridis. A unified and flexible framework for comparing simple and complex patterns. In PKDD, pages Springer,
[10] B. Catania and A. Maddalena. Pattern Management: Practice and Challenges, pages Processing and Managing Complex Data for Decision Support. Idea Group Publishing,
[11] B. Catania, A. Maddalena, and M. Mazza. Psycho: A prototype system for pattern management. In VLDB, pages ,
[12] B. Catania, A. Maddalena, M. Mazza, E. Bertino, and S. Rizzi. A framework for data mining pattern management. In ECML/PKDD,
[13] DMG. Predictive Model Markup Language (PMML), (valid as of July 2006).
[14] J. Han, Y. Fu, K. Koperski, W. Wang, and O. Zaiane. DMQL: A data mining query language for relational databases. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery,
[15] T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58-64,
[16] T. Imielinski and A. Virmani. MSQL: A Query Language for Database Mining. Data Mining and Knowledge Discovery, 3: ,
[17] T. Johnson, L. Lashmanan, and T. Ravmond. The 3W Model and Algebra for Unified Data Mining. In VLDB,
[18] E. Kotsifakos, I. Ntoutsi, and Y. Theodoridis. Database support for data mining patterns. In Panhellenic Conference on Informatics,
[19] P. Lyman and H. R. Varian. How Much Information?, 2000 (valid as of December 2006).
[20] R. Meo, G. Psaila, and S. Ceri. A New SQL like Operator for Mining Association Rules. In VLDB, pages ,
[21] I. Ntoutsi and Y. Theodoridis. Measuring and evaluating dissimilarity in data and pattern spaces. In Proceedings of VLDB'05 PhD Workshop,
More informationSummary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4
Principles of Knowledge Discovery in Data Fall 2004 Chapter 3: Data Preprocessing Dr. Osmar R. Zaïane University of Alberta Summary of Last Chapter What is a data warehouse and what is it for? What is
More informationDATA WAREHOUSING AND MINING UNIT-V TWO MARK QUESTIONS WITH ANSWERS
DATA WAREHOUSING AND MINING UNIT-V TWO MARK QUESTIONS WITH ANSWERS 1. NAME SOME SPECIFIC APPLICATION ORIENTED DATABASES. Spatial databases, Time-series databases, Text databases and multimedia databases.
More informationKnowledge Discovery. URL - Spring 2018 CS - MIA 1/22
Knowledge Discovery Javier Béjar cbea URL - Spring 2018 CS - MIA 1/22 Knowledge Discovery (KDD) Knowledge Discovery in Databases (KDD) Practical application of the methodologies from machine learning/statistics
More informationModelling in Enterprise Architecture. MSc Business Information Systems
Modelling in Enterprise Architecture MSc Business Information Systems Models and Modelling Modelling Describing and Representing all relevant aspects of a domain in a defined language. Result of modelling
More informationCopyright 2016 Ramez Elmasri and Shamkant B. Navathe
Chapter 12 Outline Overview of Object Database Concepts Object-Relational Features Object Database Extensions to SQL ODMG Object Model and the Object Definition Language ODL Object Database Conceptual
More informationE-R Model. Hi! Here in this lecture we are going to discuss about the E-R Model.
E-R Model Hi! Here in this lecture we are going to discuss about the E-R Model. What is Entity-Relationship Model? The entity-relationship model is useful because, as we will soon see, it facilitates communication
More informationDSS based on Data Warehouse
DSS based on Data Warehouse C_13 / 19.01.2017 Decision support system is a complex system engineering. At the same time, research DW composition, DW structure and DSS Architecture based on DW, puts forward
More informationCHAPTER-23 MINING COMPLEX TYPES OF DATA
CHAPTER-23 MINING COMPLEX TYPES OF DATA 23.1 Introduction 23.2 Multidimensional Analysis and Descriptive Mining of Complex Data Objects 23.3 Generalization of Structured Data 23.4 Aggregation and Approximation
More informationData Mining and Warehousing
Data Mining and Warehousing Sangeetha K V I st MCA Adhiyamaan College of Engineering, Hosur-635109. E-mail:veerasangee1989@gmail.com Rajeshwari P I st MCA Adhiyamaan College of Engineering, Hosur-635109.
More informationDatabase Technology Introduction. Heiko Paulheim
Database Technology Introduction Outline The Need for Databases Data Models Relational Databases Database Design Storage Manager Query Processing Transaction Manager Introduction to the Relational Model
More informationThe Unified Modelling Language. Example Diagrams. Notation vs. Methodology. UML and Meta Modelling
UML and Meta ling Topics: UML as an example visual notation The UML meta model and the concept of meta modelling Driven Architecture and model engineering The AndroMDA open source project Applying cognitive
More informationData Access Paths for Frequent Itemsets Discovery
Data Access Paths for Frequent Itemsets Discovery Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science {marekw, mzakrz}@cs.put.poznan.pl Abstract. A number
More informationCopyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 1-1
Slide 1-1 Chapter 1 Introduction: Databases and Database Users Outline Types of Databases and Database Applications Basic Definitions Typical DBMS Functionality Example of a Database (UNIVERSITY) Main
More informationChapter 8: Enhanced ER Model
Chapter 8: Enhanced ER Model Subclasses, Superclasses, and Inheritance Specialization and Generalization Constraints and Characteristics of Specialization and Generalization Hierarchies Modeling of UNION
More informationAssociating Terms with Text Categories
Associating Terms with Text Categories Osmar R. Zaïane Department of Computing Science University of Alberta Edmonton, AB, Canada zaiane@cs.ualberta.ca Maria-Luiza Antonie Department of Computing Science
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationWeb Services Annotation and Reasoning
Web Services Annotation and Reasoning, W3C Workshop on Frameworks for Semantics in Web Services Web Services Annotation and Reasoning Peter Graubmann, Evelyn Pfeuffer, Mikhail Roshchin Siemens AG, Corporate
More informationEssay Question: Explain 4 different means by which constrains are represented in the Conceptual Data Model (CDM).
Question 1 Essay Question: Explain 4 different means by which constrains are represented in the Conceptual Data Model (CDM). By specifying participation conditions By specifying the degree of relationship
More informationAn Archiving System for Managing Evolution in the Data Web
An Archiving System for Managing Evolution in the Web Marios Meimaris *, George Papastefanatos and Christos Pateritsas * Institute for the Management of Information Systems, Research Center Athena, Greece
More informationLesson 3: Building a Market Basket Scenario (Intermediate Data Mining Tutorial)
From this diagram, you can see that the aggregated mining model preserves the overall range and trends in values while minimizing the fluctuations in the individual data series. Conclusion You have learned
More informationLecturer 2: Spatial Concepts and Data Models
Lecturer 2: Spatial Concepts and Data Models 2.1 Introduction 2.2 Models of Spatial Information 2.3 Three-Step Database Design 2.4 Extending ER with Spatial Concepts 2.5 Summary Learning Objectives Learning
More information1.1 Jadex - Engineering Goal-Oriented Agents
1.1 Jadex - Engineering Goal-Oriented Agents In previous sections of the book agents have been considered as software artifacts that differ from objects mainly in their capability to autonomously execute
More information1 Executive Overview The Benefits and Objectives of BPDM
1 Executive Overview The Benefits and Objectives of BPDM This is an excerpt from the Final Submission BPDM document posted to OMG members on November 13 th 2006. The full version of the specification will
More informationDatabases and Database Systems
Page 1 of 6 Databases and Database Systems 9.1 INTRODUCTION: A database can be summarily described as a repository for data. This makes clear that building databases is really a continuation of a human
More informationA Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)
International Journal of Innovation and Scientific Research ISSN 2351-8014 Vol. 12 No. 1 Nov. 2014, pp. 217-222 2014 Innovative Space of Scientific Research Journals http://www.ijisr.issr-journals.org/
More informationData warehouses Decision support The multidimensional model OLAP queries
Data warehouses Decision support The multidimensional model OLAP queries Traditional DBMSs are used by organizations for maintaining data to record day to day operations On-line Transaction Processing
More informationDiscovery of Actionable Patterns in Databases: The Action Hierarchy Approach
Discovery of Actionable Patterns in Databases: The Action Hierarchy Approach Gediminas Adomavicius Computer Science Department Alexander Tuzhilin Leonard N. Stern School of Business Workinq Paper Series
More informationChapter 1: Introduction. Chapter 1: Introduction
Chapter 1: Introduction Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 1: Introduction Purpose of Database Systems View of Data Database Languages Relational Databases
More informationGenerating Cross level Rules: An automated approach
Generating Cross level Rules: An automated approach Ashok 1, Sonika Dhingra 1 1HOD, Dept of Software Engg.,Bhiwani Institute of Technology, Bhiwani, India 1M.Tech Student, Dept of Software Engg.,Bhiwani
More informationDynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering
Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Abstract Mrs. C. Poongodi 1, Ms. R. Kalaivani 2 1 PG Student, 2 Assistant Professor, Department of
More informationEnhanced Entity-Relationship (EER) Modeling
CHAPTER 4 Enhanced Entity-Relationship (EER) Modeling Copyright 2017 Ramez Elmasri and Shamkant B. Navathe Slide 1-2 Chapter Outline EER stands for Enhanced ER or Extended ER EER Model Concepts Includes
More informationChapter 1. Types of Databases and Database Applications. Basic Definitions. Introduction to Databases
Chapter 1 Introduction to Databases Types of Databases and Database Applications Numeric and Textual Databases Multimedia Databases Geographic Information Systems (GIS) Data Warehouses Real-time and Active
More informationChapter 2 Overview of the Design Methodology
Chapter 2 Overview of the Design Methodology This chapter presents an overview of the design methodology which is developed in this thesis, by identifying global abstraction levels at which a distributed
More informationExecutive Summary for deliverable D6.1: Definition of the PFS services (requirements, initial design)
Electronic Health Records for Clinical Research Executive Summary for deliverable D6.1: Definition of the PFS services (requirements, initial design) Project acronym: EHR4CR Project full title: Electronic
More informationSpemmet - A Tool for Modeling Software Processes with SPEM
Spemmet - A Tool for Modeling Software Processes with SPEM Tuomas Mäkilä tuomas.makila@it.utu.fi Antero Järvi antero.jarvi@it.utu.fi Abstract: The software development process has many unique attributes
More informationSage 300 ERP Intelligence Reporting Connector Advanced Customized Report Writing
Sage 300 ERP Intelligence Reporting Connector Advanced Customized Report Writing Sage Intelligence Connector Welcome Notice This document and the Sage software may be used only in accordance with the accompanying
More informationAn Approach to Intensional Query Answering at Multiple Abstraction Levels Using Data Mining Approaches
An Approach to Intensional Query Answering at Multiple Abstraction Levels Using Data Mining Approaches Suk-Chung Yoon E. K. Park Dept. of Computer Science Dept. of Software Architecture Widener University
More informationUnified Modeling Language (UML)
Appendix H Unified Modeling Language (UML) Preview The Unified Modeling Language (UML) is an object-oriented modeling language sponsored by the Object Management Group (OMG) and published as a standard
More informationModeling and Language Support for the Management of Pattern-Bases
Modeling and Language Support for the Management of Pattern-Bases Manolis Terrovitis a,, Panos Vassiliadis b, Spiros Skiadopoulos c, Elisa Bertino d, Barbara Catania e, Anna Maddalena e, and Stefano Rizzi
More informationValue Added Association Rules
Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency
More information