Interrogation System Architecture of Heterogeneous Data for Decision Making

Cécile Nicolle, Youssef Amghar, Jean-Marie Pinon
Laboratoire d'Ingénierie des Systèmes d'Information, INSA de Lyon

Abstract

Decision support systems have been a very active research domain for several years. Today's decision makers face an ever-increasing flow of data that is becoming more and more diverse, and it is therefore difficult, within a company, for a decision maker to find the data needed to make a decision. Our study is placed in this context. In this article, we propose an access system for heterogeneous databases. This system lets decision makers formulate requests without having to know the structure of their data, and gives them an overview of their information system. The system is centered on a repository which constitutes its knowledge base. A thesaurus is also used so that decision makers can formulate their requests in natural language. Finally, the links that exist between the databases are exploited to find the most pertinent data and to reduce the risk of noise in the answers.

Introduction

A common problem facing many organizations today concerns disparate information sources and repositories, including databases, object stores, knowledge bases, file systems, electronic mail systems, etc. Decision makers often need information from multiple sources. Among the solutions to this problem, approaches built around mediator and wrapper techniques are widely used (Hammer et al, 1995; Wiederhold, 1992; Zhou et al, 1996). The project presented in this paper is developed within this framework.

More precisely, the work presented in this paper deals with the heterogeneous data sources used within the family allowance administration (CNEDI 1). The data manipulated in such an organization are, on the one hand, juridical and legal documents structured according to the SGML standard and, on the other hand, alphanumeric (atomic and complex) data describing the individuals entitled to family allowances. The legal documents contain the attribution conditions for family allowances (Chabbat, 1997). These data require different systems for storage, handling and interrogation. In addition, the most important point is that documents and alphanumeric data are interrelated through attribution rules materialized by production rules (more than 15,000 rules). A rule determines how an individual attribute (for instance, the allowance amount) can be valued. Consequently, this administration, and particularly its decision makers, needs new means to access and interpret these data. The existing applications, developed independently, do not take into account the links between legal texts and individual data. Beyond classical processing of this information, it is often indispensable for decision makers to correlate data and documents; to do so, they may formulate a query that does not concern a single base but requires the full semantics of the administration's data. Another interesting point of this work is to support the development of new applications, which requires knowing the information cartography of the architectures in use, in order to reuse existing data and to avoid the useless creation or modification of database schemas.

To summarize, the CNEDI de Lyon uses three databases concerning:
- Legal texts structured with respect to the SGML standard. These texts apply to the framework of family allowances; they contain laws, decrees, etc.
- Alphanumeric data about the individuals entitled to family allowances. These data are described by simple properties such as name, family situation, resources, number of children, income tax, etc. The database schema is a relational schema.
- Rules of allowance attribution, expressed as production rules following the pattern "if condition then action". The condition corresponds to the individual's resources and/or family situation, and the action corresponds to the allowance computation. These rules are manually extracted from the legal texts by law specialists. A sketch of such a rule is given below.

1 This work is supported by the CNEDI de Lyon (Centre National d'Études et de Développements Informatiques), which manages the attribution of family allowances.
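As a minimal sketch (not the actual CNEDI rule format), such an attribution rule could be represented as a condition predicate and an action over an individual's record; the property names and amounts below are hypothetical.

```python
# Minimal sketch of an attribution rule following the "if condition then action"
# pattern. Field names and sample data are hypothetical, not the CNEDI format.
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class AttributionRule:
    name: str
    condition: Callable[[Dict[str, Any]], bool]   # tests the individual's properties
    action: Callable[[Dict[str, Any]], float]     # computes an allowance amount

# Example rule with a made-up income threshold for a family allowance.
rule = AttributionRule(
    name="family_allowance_basic",
    condition=lambda person: person["children"] >= 2 and person["income"] < 30000,
    action=lambda person: 120.0 * person["children"],
)

individual = {"name": "Dupont", "children": 3, "income": 25000}
if rule.condition(individual):
    amount = rule.action(individual)   # 360.0 for this sample individual
    print(f"{rule.name}: {amount}")
```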

Among the CNEDI's applications, let us note a real-time application based on an expert system that computes individuals' allowances. The allowance computation uses the rule set appropriate for the individual under consideration.

Related works

The search for information across different databases has been a very active research topic for many years (Ahmed et al, 1991; Chrisment et al, 1997; Flesca et al, 1998; Fourel, 1998; Gupta, 1989; Hull & Zhou, 1996). In the 1990s, federated databases (Litwin et al, 1990; Sheth & Larson, 1990; Su et al, 1996; Thomas et al, 1990; Wiederhold, 1992) were proposed to integrate existing databases while limiting the complexity of the integration and preserving some of the autonomy of the existing databases. One of the main drawbacks of federated databases is that there is no real coordination of the flow of information, which can lead to information dispersion. During the last few years, systems based on mediators (Wiederhold, 1992) have been studied, such as Information Manifold (from AT&T) (Levy, 1996) and TSIMMIS (from Stanford University) (Chawathe et al, 1994; Garcia-Molina et al, 1997), which are two examples of mediator implementations.

The basic mediator architecture can be defined as a two-level architecture: a mediator level and a wrapper level. A wrapper is a translator which, being defined for a particular database, translates information from the mediator into a language understandable by the database to which it is attached, and vice versa. A mediator, on the other hand, is in charge of querying one or several databases and is therefore linked to one or more wrappers (or mediators). When different databases are queried, several problems have to be taken into account: the dispersion of data types, the differences between the interrogation languages of the systems, and the heterogeneity of the data schemas.

When a user sends a request to a central access system, the system must decide to which databases the request has to be sent, and it must also be able to modify the request, when necessary, before sending it to a database. The request must reach the database management system in a language it understands. The role of the mediator is to route the request towards the right database. It is the wrapper that translates the request into a language understandable by the management system of the database in question; the wrapper then receives the answer from that system and sends it back to the mediator in a language the mediator knows. In this way a mediator can dialog with many wrappers (which can themselves be used by several mediators), while a wrapper is dedicated to one type of database.

Proposition

To allow decision makers to access and retrieve relevant data from legal texts and individual data, we propose a mediator as the main component of the interrogation system architecture. The mediation level is based on a repository that constitutes the knowledge base. The repository model summarizes the document models and the individual models across the production rules (allowance attribution rules). The interrogations issued by end users concern queries such as: (1) which texts apply to a given individual? (2) which properties are relevant in the allowance computation? The building of the repository model (knowledge database) is motivated by the two main functionalities mentioned above: i) the search for correlated data and ii) the knowledge of the cartography elements.
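Since the proposed architecture relies on the mediator/wrapper interplay described above, the rough sketch below shows how a mediator could route a request to the wrappers of the relevant sources; the class names, request format and translation step are illustrative assumptions, not the system's actual implementation.

```python
# Hedged sketch of the mediator/wrapper pattern. A real wrapper would speak
# SQL, an SGML/XML query language or a rule-engine call; here the translation
# is only a placeholder string.
from typing import Dict, List


class Wrapper:
    """Translates mediator requests into the source's own language and back."""

    def __init__(self, source_name: str):
        self.source_name = source_name

    def translate_request(self, request: Dict) -> str:
        # Placeholder translation into a hypothetical native query.
        return f"QUERY {self.source_name} WHERE {request['criteria']}"

    def execute(self, request: Dict) -> List[Dict]:
        native_query = self.translate_request(request)
        # ... send native_query to the source, convert the results back ...
        return [{"source": self.source_name, "query": native_query}]


class Mediator:
    """Decides which sources a request concerns and merges their answers."""

    def __init__(self, wrappers: Dict[str, Wrapper]):
        self.wrappers = wrappers

    def query(self, request: Dict) -> List[Dict]:
        answers: List[Dict] = []
        for source in request["targets"]:          # routing decision
            answers.extend(self.wrappers[source].execute(request))
        return answers


mediator = Mediator({
    "legal_texts": Wrapper("legal_texts"),
    "rules": Wrapper("rules"),
    "individuals": Wrapper("individuals"),
})
print(mediator.query({"targets": ["legal_texts", "rules"], "criteria": "law='94-629'"}))
```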
The first functionality requires the elaboration of a knowledge database that memorizes the relationships between the data of the different databases. The relationships between legal texts and production rules are easy to understand, because the rules are a translation of the legislator's legal texts made by law specialists; these relationships describe the links between texts and rules. Indeed, although it is important for decision makers to know which production rules are extracted from a decree, it is also important to retrieve the decrees behind a given set of rules. Queries such as "which law treats home allowance attribution?" can then be treated easily. The relationships between legal texts and individual data allow a specialist to understand the allowance situation of each individual: the set of decrees applying to an individual, and the reason for a modification of an attribution, can be known. Let us note that the legal text base stores the different versions of the laws in order to justify the allowance history. The semantics of the extracted information is visualized with a labeled graph of dependencies, in order to restore the inter-data semantics.
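As a minimal sketch of this first functionality, the bidirectional text/rule links memorized by the knowledge database could be kept as a pair of indexes; the identifiers below are made up for illustration.

```python
# Sketch of bidirectional links between legal texts and production rules,
# as memorized by the knowledge database. Identifiers are hypothetical.
from collections import defaultdict

# Each pair (text_id, rule_id) records that the rule was extracted from the text.
links = [("decree_94_629", "rule_APE_01"),
         ("decree_94_629", "rule_APE_02"),
         ("law_home_allowance", "rule_HOME_07")]

rules_of_text = defaultdict(set)
texts_of_rule = defaultdict(set)
for text_id, rule_id in links:
    rules_of_text[text_id].add(rule_id)
    texts_of_rule[rule_id].add(text_id)

# "Which production rules are extracted from this decree?"
print(rules_of_text["decree_94_629"])   # {'rule_APE_01', 'rule_APE_02'}
# "Which law treats home allowance attribution?" (asked from the rule side)
print(texts_of_rule["rule_HOME_07"])    # {'law_home_allowance'}
```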

The second functionality requires elaborating a model that memorizes all the entities corresponding to the elements of the information system. The components of this model are the entities and relationships used in each database, the model type (relational, document-oriented, SGML), the data used by the applications, and their distribution among the administration's sites.

System architecture

The architecture of the interrogation system (Figure 1) is composed of three levels: i) the interface level, whose goal is to present results, ii) the mediation level, which is the heart of the interrogation system and whose components are briefly described below, and iii) the source level, which contains the existing data.
1. The parser module analyses the type of the user's interrogation. Obviously, the interrogation system does not filter queries that concern a single database; in this case it simply plays the role of a multiplexer, since a query on individual data alone does not need any system specificity.
2. The extraction module needs the metadata issued from the repository in order to know which databases must be accessed, which language to use, the physical names of the data, etc. This module uses a rule base, and intermediate results are stored in log files.
3. The integration module processes the results of the extraction module; the integration consists in elaborating a response from the heterogeneous data.
4. The result formatting module builds a homogeneous response to the initial query. The answers are combined into a unique response and coded according to the XML standard, guaranteeing a coherent exchange with any application.

Figure 1: Logical architecture of the interrogation system (decision maker workstation with the user interface; mediation level with the parser, extraction, integration and result-formatting modules around the knowledge database; heterogeneous sources: juridical texts, production rules, individual data).
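To illustrate the last step of this pipeline, the sketch below shows how a result-formatting module could merge partial answers coming from the heterogeneous sources into a single XML response; the element and attribute names are assumptions, since the paper does not specify the XML schema used.

```python
# Sketch of the result-formatting step: partial answers from the extraction
# and integration modules are merged into one XML response.
# Element and attribute names are illustrative assumptions.
import xml.etree.ElementTree as ET

def format_response(query: str, partial_answers: list) -> str:
    root = ET.Element("response", attrib={"query": query})
    for answer in partial_answers:
        item = ET.SubElement(root, "answer", attrib={"source": answer["source"]})
        item.text = answer["value"]
    return ET.tostring(root, encoding="unicode")

partial_answers = [
    {"source": "legal_texts", "value": "Law 94-629 (APE)"},
    {"source": "individuals", "value": "allowers currently collecting the APE"},
]
print(format_response("impact of a modification of law 94-629", partial_answers))
# <response query="..."><answer source="legal_texts">...</answer>...</response>
```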

Knowledge base

We propose a knowledge base model to identify the information necessary for querying the different databases. The diagram below illustrates this model.

Figure 2: Knowledge base model, in UML (the model relates Database, Data, Model Type, Schema, Entity, Element, Association and Rule, and distinguishes textual data, structured or not, from non-textual and multimedia data such as images, sound, video and alphanumerical values).

The role of the knowledge base is to give a general view of the searched data. It particularly allows the identification of:
- the location of the data,
- the links within a database (in particular through the Association entity),
- the links between databases,
- the links between the rules and the texts from which they are issued.
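As an illustration only, a few entities of this model (databases, entities, elements and associations) could be encoded as plain records, as in the sketch below; the attribute names are assumptions derived from Figure 2, not a normative schema.

```python
# Sketch of a few knowledge-base model entities from Figure 2.
# Attribute names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Database:
    name: str
    model_type: str          # e.g. "relational", "SGML document", "rule base"

@dataclass
class Entity:
    name: str
    database: Database
    elements: List[str] = field(default_factory=list)   # attributes / fields

@dataclass
class Association:
    """A labeled link between two entities, possibly across databases."""
    label: str
    source: Entity
    target: Entity

legal_texts = Database("legal_texts", "SGML document")
rules = Database("rules", "rule base")
individuals = Database("individuals", "relational")

decree = Entity("decree", legal_texts, ["number", "date", "body"])
rule = Entity("attribution_rule", rules, ["condition", "action"])
allower = Entity("allower", individuals, ["name", "resources", "children"])

links = [
    Association("comes from", rule, decree),      # a rule is extracted from a decree
    Association("applies to", rule, allower),     # a rule values an allower's attribute
]
for a in links:
    print(f"{a.source.name} --{a.label}--> {a.target.name}")
```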

Example

We use both a knowledge base and a thesaurus. The knowledge base allows navigation through the knowledge of the various managed databases, whereas the thesaurus in particular permits the analysis of a request formulated in natural language. A decision maker could, for example, formulate a request like "allowers concerned by law number 94-629". A decision maker who wants to know the impact of a possible modification of law number 94-629 formulates the request as "impact of a modification of law n° 94-629?". The figure below illustrates the different steps of the search.

Thanks to the thesaurus, the access system determines that it needs to search for everything connected with law n° 94-629 (step 0 in the figure). First, we search for the existing links between legal texts of the law type and the other bases (step 1). Thanks to the knowledge base, we find that a link exists between the legal text base and the rule base; from there, the link between the rule base and the allowers database is identified. We thus know not only the links between the bases but also the fields which allow these different bases to be connected. Once this is determined, we can elaborate the succession plan and the decomposition filter that produce the various sub-queries for the bases (here, we have to question three bases: the legal text base, the rule base and the allowers base). We then search for the law in question (step 2), which gives us the following information (step 3): the text about the APE (Allocation Parentale d'Éducation).

Furthermore, we search the rule base for all rules (step 4) which have:
- as a predicate, the APE associated with another one (that is, other predicates), which gives us the different benefits impacted by the modification of the APE management (for example, collection of the APE invalidates payment of the Family Complement);
- as the validated term (that is, the right part of the rule), the APE (for example, if an allower collects the APJE 2, he cannot collect the APE).

We get in return (step 5) all the benefits which are implicated, directly (like the APE) or indirectly (like the APJE), by the modification of the law. Afterwards, we search for all allowers collecting the APE or another benefit resulting from the search on the rule base (step 6). The answers (step 7) are then sent back to the access system, which forms the final answer before giving it to the user (step 8).

Figure 3: The different steps of a search (the access system analyses the query "impact of a modification of law n° 94-629?" with the thesaurus and the knowledge base, questions the legal text, rule and allower bases in steps 0 to 7, and returns the answer to the user in step 8).

Once the answers are obtained, two treatments are possible:
1. Either we search for the texts associated with the various benefits found, in order to give the user directly the impact of the law modification on the other texts, then we search for the benefits implicated by those texts, and so on, until there are no more modifications. But this can be long and unnecessary for the decision maker;
2. Or we let the user, once the final answer is given, launch a further search merely by clicking on one of the concerned benefits, which issues the query "impact of a modification of the text treating this benefit".

We chose the second option. We thus supply only the answer concerning the initial request, but we also give decision makers the possibility of easily carrying out a more thorough search.

2 APJE: Allocation Pour Jeunes Enfants
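The chaining over the rule base in steps 4 to 6 amounts to a transitive closure over benefits: starting from the benefit named in the law, we repeatedly collect every benefit whose rules mention an already-impacted one. The sketch below illustrates this idea with a hypothetical, heavily simplified rule encoding.

```python
# Sketch of steps 4-6: find the benefits directly or indirectly impacted by a
# modification of the APE, by chaining over rules that relate benefits.
# The rule encoding (sets of benefit names) is a hypothetical simplification.

# Each rule is summarized by the benefits it mentions as predicates (left part)
# and the benefit it validates or invalidates (right part).
rules = [
    {"predicates": {"APE"}, "validates": "Family Complement"},   # APE blocks FC
    {"predicates": {"APJE"}, "validates": "APE"},                # APJE blocks APE
    {"predicates": {"Family Complement"}, "validates": "Housing Aid"},
]

def impacted_benefits(start: str) -> set:
    impacted = {start}
    changed = True
    while changed:                       # transitive closure over the rule base
        changed = False
        for rule in rules:
            if rule["predicates"] & impacted and rule["validates"] not in impacted:
                impacted.add(rule["validates"])
                changed = True
            if rule["validates"] in impacted and not rule["predicates"] <= impacted:
                impacted |= rule["predicates"]
                changed = True
    return impacted

print(impacted_benefits("APE"))
# e.g. {'APE', 'APJE', 'Family Complement', 'Housing Aid'}
```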

Conclusion

The work developed in this project can be used to build powerful information retrieval systems, since all the data related to a business process domain of an enterprise can be exploited during the interrogation step. The repository model can be generalized to other data types; indeed, multimedia data can be included, for instance voice databases allowing the recording of expert annotations. However, let us note that repository maintenance problems, particularly after database updates, have not been considered and constitute our future work.

It is important, and even essential, to have all the useful data available for good decision making. Nowadays, the data to be consulted are not only numerous but also heterogeneous. In this paper, we have proposed an access system which makes it possible to question all the databases of a firm through a single, easy-to-use access point, thanks to requests written in natural language. The access system is based on mediators and wrappers to realize the connection with the databases. Moreover, we use XML to handle requests and answers internally in the access system, as well as to question the various databases.

References

Ahmed, R. et al. (1991). The Pegasus heterogeneous multidatabase system, IEEE Computer, 24:19-27, 1991.
Chabbat, B. et al. (1996). Structuring the semantics: an experiment with legal documents, In: Proceedings of the International Conference on Circuits, Systems and Computers (CSC'96), Hellenic Naval Academy, Piraeus, Greece, July 1996, pp. 297-308.
Chabbat, B. (1997). Modélisation multiparadigme de textes réglementaires, PhD thesis in Computer Science, defended 8 December 1997, INSA de Lyon, France.
Chawathe, S. et al. (1994). The TSIMMIS Project: Integration of heterogeneous information sources, Proceedings of the IPSJ Conference, pp. 7-18, Tokyo, Japan, October 1994.
Chrisment, C. et al. (1997). Extraction et synthèse de connaissances à partir de bases de données hétérogènes, Ingénierie des Systèmes d'Information, Vol. 5, No. 3, 1997, pp. 367-400.
Flesca, S. et al. (1998). An architecture for accessing a large number of autonomous, heterogeneous databases, Networking and Information Systems Journal, Vol. 1, No. 4-5, 1998, pp. 495-518.
Fourel, F. (1998). Modélisation, indexation et recherche de documents structurés, PhD thesis, IMAG, 5 February 1998.
Garcia-Molina, H. et al. (1997). The TSIMMIS Approach to Mediation: Data Models and Languages, Journal of Intelligent Information Systems, 1997.
Gupta, A. (1989). Integration of Information Systems: Bridging Heterogeneous Databases, IEEE Press, 1989.
Hammer, J. et al. (1995). Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System, In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California, June 1995.
Hull, R. & Zhou, G. (1996). A framework for supporting data integration using the materialized and virtual approaches, In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 481-492, June 1996.

Levy, A. et al. (1996). Querying heterogeneous information sources using source descriptions, Proceedings of the 22nd Very Large Data Bases (VLDB) Conference, pp. 251-262, 1996.
Litwin, W. et al. (1990). Interoperability of multiple autonomous databases, ACM Computing Surveys, 22:267-293, 1990.
Sheth, A.P. & Larson, J.A. (1990). Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases, ACM Computing Surveys, Vol. 22, No. 3, ACM Press, pp. 183-236, 1990.
Su, S.Y.W. et al. (1996). NCL: A Common Language for Achieving Rule-Based Interoperability Among Heterogeneous Systems, Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies, Vol. 6, No. 2-3, Kluwer Academic Publishers, pp. 171-198, 1996.
Thomas, G. et al. (1990). Heterogeneous distributed database systems for production use, ACM Computing Surveys, 22:237-266, 1990.
Wiederhold, G. (1992). Mediators in the architecture of future information systems, IEEE Computer, Vol. 25, No. 3, March 1992, pp. 38-49.
Zhou, G. et al. (1996). Generating Data Integration Mediators that Use Materialization, Kluwer Academic Publishers, Boston, MA, July 1996, ISBN 0-7923-9726-6.