Mining XML Functional Dependencies through Formal Concept Analysis

Size: px

Start display at page:

Download "Mining XML Functional Dependencies through Formal Concept Analysis"

John Richard
5 years ago
Views:

1 Mining XML Functional Dependencies through Formal Concept Analysis Viorica Varga May 6, 2010

2 Outline Definitions for XML Functional Dependencies Introduction to FCA FCA tool to detect XML FDs Finding XML keys Detecting XML data redundancy Conclusions and Future Work

3 Outline Definitions for XML Functional Dependencies Introduction to FCA FCA tool to detect XML FDs Finding XML keys Detecting XML data redundancy Conclusions and Future Work

4 Outline Definitions for XML Functional Dependencies Introduction to FCA FCA tool to detect XML FDs Finding XML keys Detecting XML data redundancy Conclusions and Future Work

5 Outline Definitions for XML Functional Dependencies Introduction to FCA FCA tool to detect XML FDs Finding XML keys Detecting XML data redundancy Conclusions and Future Work

6 Outline Definitions for XML Functional Dependencies Introduction to FCA FCA tool to detect XML FDs Finding XML keys Detecting XML data redundancy Conclusions and Future Work

7 Outline Definitions for XML Functional Dependencies Introduction to FCA FCA tool to detect XML FDs Finding XML keys Detecting XML data redundancy Conclusions and Future Work

8 XML Design XML data design: choose an appropriate XML schema, which usually come in the form of DTD (Document Type Definition) or XML Scheme. Functional dependencies (FDs) are a key factor in XML design. The objective of normalization is to eliminate redundancies from an XML document, eliminate or reduce potential update anomalies. Arenas, M., Libkin, L.: A normal form for XML documents. TODS 29(1), (2004) Yu, C., Jagadish, H. V.: XML schema refinement through redundancy detection and normalization. VLDB J. 17(2): (2008)

9 Schema definition Definition (Schema) A schema is defined as a set S = (E, T, r), where: E is a finite set of element labels; T is a finite set of element types, and each e E is associated with a τ T, written as (e : τ), τ has the next form: τ ::= str int float SetOf τ Rcd [e 1 : τ 1,..., e n : τ n ]; r E is the label of the root element, whose associated element type can not be SetOf τ. Types str, int and float are the system defined simple types and Rcd indicate complex scheme elements. Keyword SetOf is used to indicate set schema elements Attributes and elements are treated in the same way, with a symbol before attributes.

10 Figure: CustOrder XML tree

11 Customer s Orders Example Scheme 0 CustOrder : Rcd 1 Customers : SetOf Rcd 2 CustomerID : s t r 3 CompanyName : s t r 4 A d d r e s s : s t r 5 C i t y : s t r 6 P o s talcode : s t r 7 Country : s t r 8 Phone : s t r 9 Orders : SetOf Rcd 10 OrderID : i n t 11 CustomerID : s t r 12 OrderDate : s t r 13 OrderDetails : SetOf Rcd 14 OrderID : i n t 15 ProductID : i n t 16 U n i t P r i c e : f l o a t 17 Q u a n t i t y : f l o a t 18 ProductName : s t r 19 CategoryID : i n t

12 A schema element e k can be identified through a path expression, path(e k ) = /e 1 /e 2 /.../e k, where e 1 = r, and e i is associated with type τ i ::= Rcd [..., e i+1 : τ i+1,...] for all i [1, k 1]. A path is repeatable, if e k is a set element. We adopt XPath steps. (self) and.. (parent) Definition (Data tree) An XML database is defined to be a rooted labeled tree T = N, P, V, n r, where: N is a set of labeled data nodes, each n N has a label e and a node key that uniquely identifies it in T ; n r N is the root node; P is a set of parent-child edges, there is exactly one p = (n, n) in P for each n N (except n r ), where n N, n n, n is called the parent node, n is called the child node; V is a set of value assignments, there is exactly one v = (n, s) in V for each leaf node n N, where s is a value of simple type.

13 Descendant, repeatable element definition We assign a node key, referred to to each data node in the data tree in a pre-order traversal. A data element n k is a descendant of another data element n 1 if there exists a series of data elements n i, such that (n i, n i+1 ) P for all i [1, k 1]. Data element n k can be addressed using a path expression, path(n k ) = /e 1 /... /e k, where e i is the label of n i for each i [1, k], n 1 = n r, and (n i, n i+1 ) P for all i [1, k 1]. A data element n k is called repeatable if e k corresponds to a set element in the schema. Element n k is called a direct descendant of element n a, if n k is a descendant of n a, path(n k ) =... /e a /e 1 /... /e k 1 /e k, and e i is not a set element for any i [1, k 1].

14 Warehouse Example

15 Warehouse Example Scheme 0 warehouse : Rcd 1 s t a t e : SetOf Rcd 2 name : s t r 3 s t o r e : SetOf Rcd 4 contact : Rcd 5 name : s t r 6 a d d r e s s : s t r 7 book : SetOf Rcd 8 ISBN : s t r 9 author : SetOf s t r 10 t i t l e : s t r 11 p r i c e : s t r

16 Element-value equality Definition(Element-value equality) Two data elements n 1 of T 1 = N 1, P 1, V 1, n r1 and n 2 of T 2 = N 2, P 2, V 2, n r2 are element-value equal (written as n 1 = ev n 2 ) if and only if: n 1 and n 2 both exist and have the same label; There exists a set M, such that for every pair (n 1, n 2 ) M, n 1 = ev n 2, where n 1, n 2 are children elements of n 1, n 2, respectively. Every child element of n 1 or n 2 appears in exactly one pair in M. (n 1, s) V 1 if and only if (n 2, s) V 2,where s is a simple value. Example Data elements node 30 and 50 are element value equal if and only if the subtrees rooted at those two elements are identical when the order among sibling elements is ignored.

17 Path-value equality Definition(Path-value equality) Two data element paths p 1 on T 1 = N 1, P 1, V 1, n r1 and p 2 on T 2 = N 2, P 2, V 2, n r2 are path-value equal (written as T 1.p 1 = pv T 2.p 2 ) if and only if there is a set M of matching pairs where For each pair m = (n 1, n 2 ) in M, n 1 N 1, n 2 N 2, path(n 1 ) = p 1, path(n 2 ) = p 2, and n 1 = ev n 2 ; All data elements with path p 1 in T 1 and path p 2 in T 2 participate in M, and each such data element participates in only one such pair. Value equality between two paths is complicated by the fact that a single path can match multiple data elements in the data tree. This definition consider two paths value equal if each node which is pointed to by one path must have a corresponding node that is pointed to by the other path, where the two nodes are element value equal.

18 Generalized tree tuple Definition A generalized tree tuple of data tree T = N, P, V, n r, with regard to a particular data element n p (called pivot node), is a tree t T n p = N t, P t, V t, n r, where: N t N is the set of nodes, n p N t ; P t P is the set of parent-child edges; V t V is the set of value assignments; n r is the same root node in both t T n p and T ; n N t if and only if: n is a descendant or ancestor of np in T, or n is a non-repeatable direct descendant of an ancestor of np in T ; (n 1, n 2 ) P t if and only if n 1 N t, n 2 N t, (n 1, n 2 ) P; (n, s) V t if and only if n N t, (n, s) V.

19 Tuple class A generalized tree tuple is a data tree projected from the original data tree. It has an extra parameter called a pivot node. In contrast with tree tuple defined in Arenas and Libkin s article, which separate sibling nodes with the same path at all hierarchy levels, the generalized tree tuple separate sibling nodes with the same path above the pivot node Based on the pivot node, generalized tree tuples can be categorized into tuple classes: Definition(Tuple class) A tuple class Cp T of the data tree T is the set of all generalized tree tuples tn T, where path(n) = p. Path p is called the pivot path.

20 Figure: Example tree tuple

21 Figure: Example tree tuple

22 XML Functional Dependency Definition(XML FD) An XML FD is a triple C p, LHS, RHS, written as LHS RHS w.r.t. C p, where C p denotes a tuple class, LHS is a set of paths (P li, i = [1, n]) relative to p, and RHS is a single path (P r ) relative to p. An XML FD holds on a data tree T (or T satisfies an XML FD) if and only if for any two generalized tree tuples t 1, t 2 C p - i [1, n], t 1.P li = or t 2.P li =, or - If i [1, n], t 1.P li = pv t 2.P li, then t 1.P r, t 2.P r, t 1.P r = pv t 2.P r. A null value,, results from a path that matches no node in the tuple, and = pv is the path-value equality defined previous.

23 XML Functional Dependency Example Example (XML FD) In our running example whenever two products agree on ProductID values, they have the same ProductName. This can be formulated as follows:./productid./productname w.r.t. C OrderDetails Another example is:./productid./categoryid w.r.t C OrderDetails Example (XML FD) In warehause tree:./isbn./title w.r.t C book../contact/name,./isbn./price w.r.t C book./isbn./author w.r.t C book./author,./title./isbn w.r.t C book

24 Trivial XML FD Definition: (Trivial XML FD) An XML FD C p, LHS, RHS is trivial if: 1. RHS LHS, or 2. For any generalized tree tuple in C p, there is at least one path in LHS that matches no data element. The 2. point can arrise, because of the existence of Choice elements. Example If Contact is a Choice element instead of Rcd, i.e. it can have either name or address as its childs, but not both, then the XML FD: C store,./contact/name,./contact/address,./@key is trivial, because no C store tuple will have both LHS node.

25 XML key Definition (XML key) An XML Key of a data tree T is a pair C p, LHS, where T satisfies the XML FD C p, LHS,./@key. Example We have the XML FD: C Orders,./OrderID,./@key, which implies that C Orders,./OrderID is an XML key. Example C State,./name C Store,./contact/name,./contact/address are XML keys.

26 Structurally redundant XML FDs Theorem Let FD = C p, LHS, RHS, if none of the paths in LHS and RHS specifies a data element that is descendent of the pivot node in the tuple, then FD holds on a data tree T if and only if FD = C p, LHS, RHS holds on T, where C p is the lowest-repeatable-ancestor tuple class of C p paths in LHS and RHS are equivalent to paths in LHS and RHS (i.e. they correspond to the same absolute paths). Example../ISBN../title w.r.t C author is structurally redundant with./isbn./title w.r.t C book

27 Interesting XML FD Tuple classes with repeatable pivot paths are called essential tuple classes. Definition(Interesting XML FD) An XML FD C p, LHS, RHS is interesting if it satisfies the following conditions: RHS / LHS; C p is an essential tuple class; RHS matches to descendent(s) of the pivot node. An interesting XML FD is a non-trivial XML FD with an essential tuple class and is not structurally redundant to any other XML FD.

28 XML data redundancy Definition(XML data redundancy) A data tree T contains a redundancy if and only if T satisfies an interesting XML FD C p, LHS, RHS, but does not satisfy the XML Key C p, LHS. Intuitively: if C p, LHS is not a key for T, then there exists two distinct tuples in C p that share the same LHS. T satisfies C p, LHS, RHS, so RHS of these two tuples must be value equal so: data is redundantly stored

29 GTT-XNF Definition(GTT-XNF) An XML schema S is in GTT-XNF given the set of all satisfied interesting XML FDs if and only if for each such XML FD ( C p, LHS, RHS ), C p, LHS is an XML key. Intuitively: GTT-XNF disallows any satisfied interesting XML FD that indicates data redundancies. Rule 1 (Reflexivity) LHS P 1 w.r.t. C p is satisfied if P 1 LHS. Rule 2 (Augmentation) LHS P 1 w.r.t. C p {LHS, P 2 } P 1 w.r.t. C p. Rule 3 (Transitivity) LHS P 1 w.r.t. C p... LHS P n w.r.t. C p {P 1,..., P n } P w.r.t. C p LHS P w.r.t. C p.

30 XML Data Flat Representation Figure: One relation for the whole XML data

31 XML Data Hierarhical Representation

32 Intra-relation FDs/Keys, Inter-relation FDs/Keys Example (XML FD) In warehause tree:./isbn./title w.r.t C book intra-relation FD../contact/name,./ISBN./price w.r.t C book inter-relation FD./ISBN./author w.r.t C book inter-relation FD./author,./title./ISBN w.r.t C book inter-relation FD Yu, C., Jagadish, H. V.: XML schema refinement through redundancy detection and normalization. VLDB J. 17(2): (2008) presents algorithm based on partitioning to detect intra-relation FDs, another very complicated for inter-relation FDs.

33 Eliminating redundancy-indicating FDs if C p, LHS is not a key for T T satisfies C p, LHS, RHS, so RHS is redundantly stored to eliminate such FD, the schema element corresponding to RHS is moved into a new schema location, such that those data elements are no longer redundantly stored. Let Σ be the set of redundancy-indicating FDs. Example./ISBN./title w.r.t C book {../../name,../contact/name,./isbn }./price w.r.t C book Assumption: {../name,./contact/name} is a key for C store.

34 Local/global XML FD Definition (Local/global XML FD) An XML FD C p, LHS, RHS is local if there exists LHS LHS such that C p, LHS is an XML key, where C p is an ancestor tuple class of C p (i.e. p is a prefix of p). Otherwise, the FD is global. Example./ISBN./title w.r.t C book is global, because no subset of its LHS is a key for any tuple class above C book means: 2 books, regardless whether they are under the same store or state, if they have the same ISBN, then they will have the same title Example {../../name,../contact/name,./isbn }./price w.r.t C book is local, because {../../name,../contact/name} is a key for C store. means: state name and store name uniquely identifies each store, any 2 books, if they have the same ISBN, they will have the same price, as long as they are under the same store.

35 Eliminate global FD Procedure 1 Let F = {P 1,..., P n } P r w.r.t. C p be a redundancy indicated global FD on Schema S root ; {e i i [1, n]} and {τ i i [1, n]} be the sets of schema element labels and types, respectively, associated with each P i ; e r and τ r be the schema element label and type,respectively, associated with P r ; τ parent be the schema element type of the parent element of P r ; τ root =Rcd[e 1 : τ 1,..., e m : τ m] be the element type of the root element. Eliminating redundancy: Create a new schema element with label e new and type τ new =SetOfRcd[e 1 : τ 1,..., e n : τ n, e r : τ r ] Set τ root =Rcd[e 1 : τ 1,..., e m : τ m, e new : τ new ] Remove (e r : τ r ) from τ parent

36 Eliminate global FD Example./ISBN./title w.r.t C book Scheme after eliminating global FD: 0 warehouse : Rcd 1 s t a t e : SetOf Rcd 2 name : s t r 3 s t o r e : SetOf Rcd 4 contact : Rcd 5 name : s t r 6 a d d r e s s : s t r 7 book : SetOf Rcd 8 ISBN : s t r 9 author : SetOf s t r 10 p r i c e : s t r 11 new book : SetofRcd 12 ISBN : s t r 13 t i t l e : s t r

37 Adjusting FD remove F from Σ. the semantics of F is captured by: {P 1,..., P n } P r w.r.t. C new, it is not redundancy, does not need to be added to Σ remove all FDs from Σ that are affected by the move of P r. Example {./author,./title}./isbn w.r.t C book is removed, because it is not valid. It is safe to do, because ISBN is no longer redundant.

38 Eliminate local FD Procedure 2 Let F = {P 1,..., P k 1, P k,..., P n } P r w.r.t. C p be a redundancy indicated local FD on Schema S root ; {P 1,..., P k 1 } is the key for C p; C p is is an ancestor tuple class of C p and there is no other subset L of {P i i [1, n]} such that L is a key for C p C p is an ancestor of C p and a descendant of C p (i.e. C p is the lowest tuple class that can be identified) {e i i [k, n]} and {τ i i [k, n]} be the sets of schema element labels and types, respectively, associated with each P i ; e r and τ r be the schema element label and type,respectively, associated with P r ; τ parent be the schema element type of the parent element of P r ; τ p =Rcd[e 1 : τ 1,..., e m : τ m] be the element type of the schema element corresponding to the pivot path of C p.

39 Eliminate redundancy Create a new schema element with label e new and type τ new =SetOfRcd[e k : τ k,..., e n : τ n, e r : τ r ] Set τ p =Rcd[e 1 : τ 1,..., e m : τ m, e new : τ new ] Remove (e r : τ r ) from τ parent Explanation to eliminate a local FD like {../../name,../contact/name,./isbn }./price w.r.t C book create a new schema element containing the subset of its LHS (ISBN) that are not part of the key for ancestor tuple class (C store ) and RHS element (price) put this new element under the schema element corresponding to the pivot path of the ancestor tuple class (/warehouse/state/store). RHS element is removed from its original position

40 Eliminate redundancy cont. by creating the new schema element under the non-root ancestor, fewer elements needs to be copied under the new scheme after the the modification of the scheme remove any FD that is affected by the move of P r Scheme after eliminating local FD: 0 warehouse : Rcd 1 s t a t e : SetOf Rcd 2 name : s t r 3 s t o r e : SetOf Rcd 4 contact : Rcd 5 name : s t r 6 a d d r e s s : s t r 7 book : SetOf Rcd 8 ISBN : s t r 9 author : SetOf s t r 10 t i t l e : s t r 11 new book : SetOfRcd 12 ISBN : s t r 13 p r i c e : s t r

41 Special case for Procedure 2 if the entire LHS of the FD is a key for some ancestor tuple class Example In DBLP scheme year of an article is determined by the identity (@key) of the issue containing the article instead of creating a new scheme element containing a single element year we move year after issue

42 Normalization algorithm

43 Normalization algorithm explanation FDs are grouped according to their LHS Example./ISBN./title w.r.t C book./isbn./author w.r.t C book if they are dealt separately two new scheme element are created FDs are processed according to the number of paths in their LHS to reduce the storage cost. Example {./title,.author}./isbn w.r.t C book If this FD is processed first, then the elements title and author will remain under book, not ISBN

44 Normalization algorithm explanation cont. FDs are processed according to the hierarchy depth of their tuple class in a bottom-up fashion (lowest first) this is because during the process of FDs for a lower hierarchy tuple class, redundancy-indicating FDs for a higher hierarchy tuple class may be created algorithm terminates because each application of Procedure 1 and 2 either removes one redundancy-indicating FD or converts one redundancy-indicating FD into another one with a tuple class at a higher hierarchy

45 GTT-XNF Scheme of Warehouse xml data Eliminating first:./isbn./title w.r.t C book./isbn./author w.r.t C book this FD is no longer redundancy indicating {../../name,../contact/name,./isbn }./price w.r.t C book 0 warehouse : Rcd 1 s t a t e : SetOf Rcd 2 name : s t r 3 s t o r e : SetOf Rcd 4 contact : Rcd 5 name : s t r 6 a d d r e s s : s t r 7 book : SetOf Rcd 8 ISBN : s t r 9 p r i c e : s t r 10 new book : SetOfRcd 11 ISBN : s t r 12 t i t l e : s t r 13 author : SetOf s t r

46 Introduction to FCA From a philosophical point of view a concept is a unit of thoughts consisting of two parts: the extension, which are objects; the intension consisting of all attributes valid for the objects of the context; Formal Concept Analysis (FCA) introduced by Wille gives a mathematical formalization of the concept notion. A detailed mathematic foundation of FCA can be found in: Ganter, B., Wille, R.: Formal Concept Analysis. Mathematical Foundations. Springer, Berlin-Heidelberg-New York. (1999) Formal Concept Analysis is applied in many different realms like psychology, sociology, computer science, biology, medicine and linguistics. FCA is a useful tool to explore the conceptual knowledge contained in a database by analyzing the formal conceptual structure of the data.

47 Introduction to FCA FCA studies how objects can be hierarchically grouped together according to their common attributes. In FCA the data is represented by a cross table, called formal context. A formal context is a triple (G, M, I ). G is a finite set of objects M is finite set of attributes The relation I G M is a binary relation between objects and attributes. Each couple (g, m) I denotes the fact that the object g G is related to the item m M.

48 Introduction to FCA For a set A G of objects we define A := {m M gim for all g A} the set of all attributes common to the objects in A. Dually, for a set B M of attributes we define B := {g G gim for all m B} the set of all objects which have all attributes in B. A formal concept of the context K := (G, M, I ) is a pair (A, B) where A G, B M, A = B, and B = A. We call A the extent and B the intent of the concept (A, B). The set of all concepts of the context (G, M, I ) is denoted by B(G, M, I ).

49 Example formal context The following cross table describes for some hotels the attributes they have. In this case the objects are: Oasis, Royal, Amelia, California, Grand, Samira; and the attributes are: Internet, Sauna, Jacuzzi, ATM, Babysitting. ({California, Grand}) := {Sauna, Jacuzzi}. Internet Sauna Jacuzzi ATM Babysitting Oasis X X X X Royal X X X Amelia X X California X X Grand X X Samira X X Table: Formal context of the Hotel facilities example

50 FCA tool to detect XML FDs we elaborate an FCA based tool that identify functional dependencies in XML documents. to achieve this, as a first step, we have to construct the Formal Context of functional dependencies for XML data. we have to identify the objects and attributes of this context in case of XML data. tuple-based XML FD notion proposed in the above section suggests a natural technique for XFD discovery XML data can be converted into a fully unnested relation, a single relational table, and apply existing FD discovery algorithms directly. given an XML document, which contains at the beginning the schema of the data, we create generalized tree tuples from it.

51 Construct Formal Context of XML FDs each tree tuple in a tuple class has the same structure, so it has the same number of elements. we use the flat representation which converts the generalized tree tuples into a flat table each row in the table corresponds to a tree tuple in the XML tree in the flat table we insert non-leaf and leaf level elements (or attributes) from the tree for non-leaf level nodes the associated keys are used as values we include non-leaf level nodes with associated key values, to detect XML keys

52 Flat table for tuple class C Orders Example Let us construct the flat table for tuple class C Orders. There are two non-leaf nodes: Orders, appears as Orders@key OrderDetails, appears as OrderDetails@key.

53 Formal Context for class C Orders Context s Attributes: PathEnd/ElementName for non-leaf level nodes: the name of the attribute is constructed as: and its value will be the associated key value for non-leaf level nodes: the element names of the leaves. Context s Objects: the objects are considered to be the tree tuple pairs, actually the tuple pairs of the flat table. The key values associated to non-leaf elements and leaf element s values are used in these tuple pairs. Context s Properties: the mapping between objects and attributes is defined by a binary relation, this incidence relation of the context shows which attributes of this tuple pairs have the same value.

Beginning of the Formal Context of functional dependencies for tuple class C Orders the analyzed XML document may have a large number of tree tuples.

54 Beginning of the Formal Context of functional dependencies for tuple class C Orders the analyzed XML document may have a large number of tree tuples. we filter the tuple pairs and we leave out those pairs in which there are no common attributes, by an operation called clarifying the context, which does not alter the conceptual hierarchy.

55 Concept Lattice of functional dependencies Formal Context for tuple class C Orders we run the Concept Explorer (ConExp) engine to generate the concepts and create the concept lattice.

56 Processing the Output of FCA a concept lattice consists of the set of concepts of a formal context and the subconcept-superconcept relation between the concepts; every circle represents a formal concept; each concept is a tuple of a set of objects and a set of common attributes, but only the attributes are listed; an edge connects two concepts if one implies the other directly; each link connecting two concepts represents the transitive subconcept-superconcept relation between them; the top concept has all formal objects in its extension; the bottom concept has all formal attributes in its intension.

57 The relationship between FDs in databases and implications in FCA a FD X Y holds in a relation r over R iff the implication X Y holds in the context (G, R, I ) where G = {(t 1, t 2 ) t 1, t 2 r, t 1 t 2 } and A R, (t 1, t 2 )IA t 1 [A] = t 2 [A]. objects of the context are couples of tuples and each object intent is the agree set of this couple the implications in this lattice corresponds to functional dependencies in XML. Example C Orders,./OrderID,./CustomerID C Orders,./Orders@key,./CustomerID C Orders,./OrderDetail/OrderID,./CustomerID

58 Reading the Concept Lattice in the lattice we list only the attributes, these are relevant for our analysis; let there be a concept, labeled by A, B and a second concept labeled by C, where A, B and C are FCA attributes; let concept labeled by A, B be the subconcept of concept labeled by C; tuple pairs of concept labeled by A, B have the same values for attributes A, B, but for attribute C too. tuple pairs of concept labeled by C do not have the same values for attribute A, nor for B, but have the same value for attribute C. tuple pairs of every subconcept of concept labeled by A, B have the same values for attributes A, B. the labeling of the lattice is simplified by putting each attribute only once, at the highest level.

59 Reading the Concept Lattice we analyze attributes A and B: if we have only A B, then A would be a subconcept of B; if only B A holds then B should be a subconcept of A; we have A B and B A, that s why they come side by side in the lattice. So attributes from a concept imply each other. Example We have the next XML FDs: C Orders,./OrderID,./OrderDetails/OrderID C Orders,./OrderID,./Orders@key C Orders,./Orders@key,./OrderID C Orders,./Orders@key,./OrderDetails/OrderID C Orders,./OrderDetails/OrderID,./Orders@key C Orders,./OrderDetails/OrderID,./OrderID

60 The functional dependencies found by software FCAMineXFD Figure: Functional dependencies in tuple class C Orders

61 The concept lattice for the whole XML document

62 Data Analysis we can see the hierarchy of the analyzed data: the node labeled by Customers/Country is on a higher level, than node labeled by Customers/City; the Customer s node with every attribute is a subconcept of node labeled Customers/City; in our XML data, every customer has different name, address, phone number, so these attributes appear in one concept node and imply each other; the Orders node in XML is child of Customers, in the lattice, the node labeled with the key of Orders node, is subconcept of Customers node, so the hierarchy is visible; these are 1:n relationships, from Country to City, from City to Customers, from Customers to Orders. information about products is on the other side of the lattice; Products are in n:m relationship with Customers, linked by OrderDetail node in this case.

63 FDs for the whole XML document

64 Finding XML keys FDs with RHS values can be used to detect the keys in XML. In tuple class C Orders we have XML FD: C Orders,./OrderID,./@key, which implies that COrders,./OrderID is an XML key. C Orders,./OrderDetails/OrderID,./@key, so COrders,./OrderDetails/OrderID is an XML key too. In tuple class C Customers software found XML FD: C Customers,./CustomerID,./@key, which implies that C Customers,./CustomerID is an XML key. other detected XML keys are: CCustomers,./Orders/CustomerID ; CCustomers,./CompanyName ; C Customers,./Address ; C Customers,./Phone.

65 Detecting XML data redundancy having the set of functional dependencies for XML data in a tuple class, we can detect interesting functional dependencies. in essential tuple class C Orders an interesting FD: C Orders,./OrderDetails/ProductID,./OrderDetails/ProductName but C Orders,./OrderDetails/ProductID is not an XML key So it is a data redundancy. the same reason applies for XML FD C Orders,./OrderDetails/ProductName,./OrderDetails/ProductID. the other XML FD s have as LHS a key for tuple class C Orders.

66 Conclusions This paper introduces an approach for mining functional dependencies in XML documents based on FCA. Based on the flat representation of XML, we constructed the concept lattice. We analyzed the resulted concepts, which allowed us to discover a number of interesting dependencies. Our framework offers an graphical visualization for dependency exploration.

67 Future Work given the set of dependencies discovered by our tool: propose a normalization algorithm for converting any XML schema into a correct one

Detecting XML Functional Dependencies through Formal Concept Analysis

Detecting XML Functional Dependencies through Formal Concept Analysis Katalin Tunde Janosi-Rancz 1, Viorica Varga 2, and Timea Nagy 2 1 Hungarian University of Transylvania, Tirgu Mures, Romania tsuto@ms.sapientia.ro