Assessing data quality in records management systems as implemented in Noark 5


Dimitar Ouzounov
School of Computing, Dublin City University, Dublin, Ireland
dimitar.ouzounov2@computing.dcu.ie

Abstract

Good data quality is crucial in records management, but these two areas have never been studied together. The main contribution of this paper lies in bridging that gap. We do so by examining the data quality issues in a Norwegian standard for records management called Noark 5, and by developing a software component that objectively measures data quality on the basis of user-defined DQ requirements. There was no Noark 5 system that we could use freely in our study. To address the issue, we designed an architecture for records management and DQ assessment and developed a Noark 5 compliant system which is in many respects superior to the existing commercial solutions. We also created an innovative language for specifying data quality requirements in the form of rules. The language allows data quality to be objectively measured in a wide variety of systems regardless of their underlying technology.

I. INTRODUCTION

Nowadays, electronic information plays a more important role in our society than ever. Before the advent of the World Wide Web, information was mostly contained within the organisations that created it. The Web, along with the emergence of web-based service-oriented architectures, allowed electronic data to be easily exchanged between different parties. The quality of the exchanged data has become fundamental to the relationships between citizens, businesses, and governmental institutions.

Organisations worldwide lose significant amounts of money as a direct consequence of poor data quality. The Data Warehousing Institute estimates that businesses in the United States alone lose around 600 billion dollars a year. These losses come not only from staff overtime and unnecessary printing and postage but also from the diminishing credibility of the organisation in the eyes of its customers and suppliers [6]. Another striking example of the consequences of poor data quality is the Year 2000 problem, which necessitated modifications to various software applications and databases that cost about 1.5 trillion dollars [2].

Although data quality is as important to governmental institutions as it is to businesses, research has mostly focused on business application domains including customer relationship management, supply chain management, and enterprise resource management [12]. One particularly under-researched area is that of records management. Records management systems are used by different types of organisations to track documents and organise them into a records structure, which is stored in a way that allows future users to grasp the meaning of the original documents and the context within which they were created. The content, structure and context of the documents are described by various kinds of metadata, which can be assigned manually or automatically by the system. In many countries, records management systems are widely used by the public administration, in which case the systems typically need to comply with a number of laws and regulations in order to ensure that the actions of the public bodies are properly documented. Norwegian law, for instance, obliges public bodies to keep records for all cases they handle, and specifies how the recordkeeping function should be organised, what must be stored in an archive, and for how long it should be retained.
The archive function of the public administration of Norway includes activities such as maintaining an overview of all documents assigned to a particular case, assigning documents to cases, archiving cases that have been handled, and responding to enquiries regarding the case handling status and the content of documents. After a specified period of time, the archive material is set aside and deposited at an archival repository such as the National Archives of Norway.

If a data quality error is introduced in a record (e.g. as a result of the data entry process), it becomes harder to fix the longer the time that has elapsed since the record was created. The reason is that the data in the record need to reflect facts about the state of the real world at the time the record was created, not at the time it is corrected. So, for example, if an incorrect address is entered today and the address changes after six months, fixing the record after one year will require the original correct address to be entered and not the current one. Once an erroneous record is transferred to the National Archives of Norway, it becomes impossible to fix the error for legal reasons. As a consequence, poor data quality in records management systems used by Norwegian public bodies leads to poor data quality in the national archive, which may render the data stored in the archive useless for future generations. Although it is impossible to know how the data in the national archive will be used in the future, the responsible parties need to ensure that the data are of high quality with respect to current standards.

In order to meet legal requirements, all public bodies in Norway are obliged to use records management systems based on the Noark 5 standard [15]. Every Noark 5 compliant system must include the so-called inner core, which provides means for storing and retrieving records from a database and an interface that allows administrators to modify records. A Noark 5 outer core is built on top of the inner core and includes functionality such as user administration, reporting, and case handling.

Finally, a Noark 5 complete system extends the outer core and includes even more functionality, but it is highly doubtful that this level of compliance has been achieved by any of the existing solutions.

There was no open-source Noark 5 system that we could use in our study. We addressed the issue by designing a comprehensive architecture for records management and DQ assessment, and by building a Noark 5 inner core from scratch on top of this architecture. Due to space limitations, we will publish a description of the architecture separately and only provide a brief overview of the system here. The new Noark 5 inner core is based on Enterprise JavaBeans (EJB) technology and was designed with modularity, flexibility, and scalability in mind. Some of the features of our system clearly distinguish it from the existing commercial solutions. First, it allows data quality to be objectively measured using DQ requirements, which are written in an innovative domain-specific language that we developed. Second, the system can be easily adapted to specific organisational scenarios. And third, it can be deployed both in a cluster and as a hosted service accessed over the Internet. We will release the Noark 5 system under an open-source license, which is an important contribution both to the records management community and to society in general.

The goal of this study is to develop an understanding of how data quality can be objectively measured in records management systems on the basis of user-defined DQ requirements. We begin our discussion in Section II with a brief description of the research methodology we used. Section III contains an overview of data quality, data quality dimensions and metrics, and data quality methodologies. In Section IV we discuss the Noark 5 standard with a focus on metadata, describe how cases are typically handled with a Noark 5 system, and identify several data quality dimensions that are applicable to records management systems. Section V presents the language that we created for specifying data quality requirements. Next, we describe in Section VI the data quality component that we developed for our Noark 5 system, and show the results of validating the component with two different datasets in Section VII. Finally, we present our conclusions in Section VIII.

II. RESEARCH METHODOLOGY

Research in the field of Information Systems (IS) embodies two fundamental paradigms [10]. The first one, behavioural science, aims to formulate and verify theories that describe human and organisational behaviour in the context of IS. The second paradigm, design science, focuses on solving organisational problems by developing innovative artifacts. Artifacts include, among other things, algorithms, practices, and prototype solutions, which aim to facilitate the design, development and management of information systems. We followed the design science research methodology in our study and observed the research guidelines outlined by Hevner et al. [10]. The three most important guidelines are described here. First, design science research should be conducted iteratively, seeking improvement of the artifact during each iteration. Second, research must result in artifacts which attempt to solve specific organisational problems. And third, research must clearly contribute to the area of the produced artifact.
All these guidelines are based on the fundamental principle of design science that new knowledge is acquired through the creation and application of useful artifacts [20], [13], [9].

III. BACKGROUND

A. Overview

A widely accepted definition of data quality, which reflects the fact that quality cannot be uniquely defined for a product such as information, is fitness for use [3]. Such a definition is flexible because it can be used to describe data quality in many different domains. Data quality is regarded as a multidimensional concept [21] and a number of models have been created to describe it. These models identify data quality dimensions using one of three approaches [2]. In the intuitive approach, dimensions are derived from the intuition and previous experience of the researcher. The theoretical approach makes use of formal models and logical analysis to construct the set of dimensions. In the empirical approach, dimensions are identified based on experiments, surveys, and interviews with data consumers. Regardless of the approach used, dimensions are either generic and apply to all kinds of data, or specific to a particular domain.

The perception of data quality is highly dependent on the perspective of the people who use the data [3]. Data quality, however, can be determined not only by surveying users but also by looking at the data itself and by examining the process of accessing the data [14]. Based on this, data quality dimensions can be classified into three categories. Subjective dimensions, such as understandability, can only be assessed by users based on their background and experience. Objective dimensions are assessed by analysing the information itself; an example of such a dimension is completeness. Process dimensions, such as response time, are assessed by querying the data. Both objective and process dimensions can be measured automatically and may have objective scores, while subjective dimensions do not allow for objective scores.

B. Data quality models

A data quality model developed using the intuitive approach is the one proposed by Redman [17]. He identified a number of dimensions and divided them into three categories. The first category includes dimensions related to the conceptual schema of data. The second category contains dimensions related to data values. Finally, the third category focuses on dimensions related to the internal representation of data.

Following the empirical approach, Wang and Strong [23] developed a data quality model on the basis of what data quality means to data consumers. Surveys were used as the main research tool and a framework was developed which, according to the authors, closely represents the aspects of data quality that are most important to data consumers. The framework includes twenty dimensions grouped into four categories: intrinsic data quality, contextual data quality, representational data quality, and accessibility data quality.

Intrinsic data quality includes dimensions which are fundamental to all types of data. Contextual data quality contains dimensions that must be considered with regard to the specific task at hand. Representational data quality includes dimensions that are related to the format and meaning of data. The last category, accessibility data quality, covers dimensions related to the accessibility and security of data.

A similar but more recent data quality model is the one developed by Bovee, Srivastava and Mak [3]. They classified data quality dimensions into four categories, namely integrity, interpretability, relevance and accessibility. The dimensions in the first category are intrinsic in nature, while the dimensions in the other three categories are extrinsic.

Wand and Wang [21] developed a theoretical model of data quality by looking at discrepancies between two different views of the real world: the one obtained by directly observing real-world phenomena, and the one obtained by inspecting an information system that represents the same phenomena. Four intrinsic data quality dimensions were derived which characterise data according to whether they are complete, unambiguous, meaningful, and correct. Incompleteness occurs when a real-world state is not represented in the information system. Ambiguity is observed when two or more real-world states are mapped to the same information system state. An information system state that does not map to any real-world state is meaningless. Finally, incorrectness occurs when a real-world state is mapped to the wrong information system state. The last two problems are a direct result of what the authors call garbling, which is usually caused by errors in the data entry process.

C. Metrics

The quality of the data in an information system can be determined using various metrics, which are either based on some fundamental principles or created on an ad hoc basis. Metrics are used in both subjective and objective evaluation of data quality. Pipino, Lee and Wang [16] propose three principles for developing DQ metrics: simple ratio, MIN and MAX operations, and weighted average. Simple ratio is used for dimensions such as completeness and consistency, which can be evaluated for a given piece of data by testing whether each of its data elements satisfies some condition. Data quality is commonly measured on a scale of 0 to 1, where 0 denotes poor quality and 1 denotes good quality. For this reason, simple ratio is defined as the number of undesirable outcomes divided by the total number of outcomes, subtracted from one. Next, MIN and MAX operations are useful for dimensions such as believability, where multiple DQ values are typically aggregated, e.g. by interviewing users. MIN is conservative because it assigns to the dimension the lowest of the DQ values, while MAX is liberal as it assigns the highest of the values. Finally, weighted average is appropriate for multivariate cases but requires a good understanding of how important each variable is for the dimension that is being evaluated. A number of metrics based on the three principles outlined here are proposed by Batini and Scannapieco [2].
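To make the three principles concrete, the following sketch expresses them as small aggregation functions in Java. It is an illustration written for this overview, not code from [16] or [2], and the class and method names are our own.

import java.util.List;

// Illustrative sketch of the three metric principles as aggregation functions.
public final class DqMetrics {

    // Simple ratio: one minus the fraction of undesirable outcomes,
    // so 1.0 means no violations and 0.0 means every test failed.
    public static double simpleRatio(long undesirable, long total) {
        if (total == 0) return 1.0;              // nothing to test
        return 1.0 - (double) undesirable / total;
    }

    // MIN: conservative aggregation of several DQ scores for one dimension.
    public static double min(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).min().orElse(1.0);
    }

    // MAX: liberal aggregation.
    public static double max(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).max().orElse(1.0);
    }

    // Weighted average: the weights are assumed to sum to one.
    public static double weightedAverage(double[] scores, double[] weights) {
        double result = 0.0;
        for (int i = 0; i < scores.length; i++) {
            result += scores[i] * weights[i];
        }
        return result;
    }
}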
D. Methodologies

A data quality methodology is a set of models, techniques and guidelines which define a systematic process for measuring and improving the quality of the data in an organisation [2]. Methodologies can be classified according to various criteria. One of them distinguishes between general-purpose and special-purpose methodologies. General-purpose methodologies include a variety of activities that can be applied in different domains. These methodologies are also known as management methodologies because they follow several key principles of quality management. Special-purpose methodologies address a particular activity or a specific application domain. For example, DQ assessment methodologies focus on measuring and assessing the data quality in an organisation, benchmarking the results against other organisations or a set of best practices, and suggesting suitable improvement steps. We provide here a summary of the key data quality management and assessment methodologies.

1) TDQM: The Total Data Quality Management methodology (TDQM) is a comprehensive methodology for data quality management, which is extensively based on the principles of Total Quality Management (TQM) [22]. Analogously to product manufacturing in TQM, information manufacturing in TDQM is the process in which a system works with raw data to create an information product (IP). TDQM includes four phases to define requirements for, measure, analyse, and improve data quality. Executing the phases iteratively is essential to producing high-quality IP. The methodology also distinguishes between four different types of stakeholders. Information suppliers gather or create raw data. Information manufacturers are responsible for designing and developing the systems that produce IP. Information consumers use the IP. Finally, IP managers are responsible for managing the whole IP production cycle.

2) TIQM: Another methodology for managing data quality is the Total Information Quality Management methodology (TIQM) developed by English [7]. TIQM includes five phases, which are executed iteratively and focus on assessing the quality of both information architecture and data, measuring costs and risks, correcting data, and improving business processes. There is a separate phase without a fixed beginning and end, which aims to make data quality an important part of the organisational culture. The author of the methodology makes an important point about data quality management: DQ problems can hardly be solved by a one-time improvement programme because they are caused by broken business processes. Unless the processes are improved, they will keep producing poor-quality data. For the same reason, better data quality cannot be achieved only by using data correction software.

3) HIQM: A more recently developed methodology is the Hybrid Information Quality Management methodology (HIQM) developed by Cappiello, Ficiaro and Pernici [4].

HIQM is based on TDQM but includes eight phases: environment analysis, resource management, quality requirements definition, quality measurement, analysis and monitoring, improvement, strategy correction, and warning management. Several of these phases distinguish HIQM from other methodologies. The environment analysis phase, for example, focuses on acquiring knowledge of the organisational processes and data and on determining the feasibility of introducing a data quality management programme in the organisation. Improvement steps at the strategic level are executed as part of the strategy correction phase. Finally, the warning management phase supports interactive analysis of any errors originating in the improvement phase.

4) CDQ: The Comprehensive Data Quality methodology (CDQ) attempts to integrate processes found in other methodologies such as TDQM and TIQM and is applicable to both structured and unstructured data [1]. It includes three major phases which are executed iteratively: state reconstruction, assessment, and choice of improvement steps. The state reconstruction phase aims at modelling the organisational units, processes, services, and data sources, along with the relationships between them. The assessment phase focuses on measuring and assessing data quality along different dimensions and on setting new quality targets. Before data quality is assessed, however, relevant DQ issues are identified by interviewing users. The last phase in the methodology helps organisations select the optimal improvement steps by evaluating different alternatives using cost-benefit analysis.

5) IPMAP: The last management methodology we describe is the IPMAP methodology, which was developed by Shankaranarayanan, Ziad and Wang [18]. It follows the principles of IP manufacturing defined by TDQM and aims to support decision makers in scenarios which are characterised by large volumes of data, data sources distributed across several locations, and multiple stakeholders. The methodology includes three major components. The first component, called IPMAP, provides a set of constructs for modelling the stages of IP manufacturing. The second component is a set of metadata elements which can be attached to any of the manufacturing stages. Metadata elements include, among other things, the business rules associated with a given stage, the business unit responsible for it, and the timeliness, accuracy and completeness dimensions associated with the data at that stage. The third component is a set of capabilities that allow the decision maker to estimate the time required to produce an IP and to determine the exact stage in the manufacturing process in which data quality problems originated.

6) AIMQ: The AIMQ methodology provides a comprehensive set of techniques for assessing and benchmarking data quality [11]. It includes three components. The first component is a DQ model which defines quality based on what it means to various stakeholders. The second component is a questionnaire for assessing quality in terms of the dimensions that are important to stakeholders. The last component provides two techniques to help organisations interpret the collected measurements. One of them calculates the gap between the assessments of different stakeholders, while the other is used for benchmarking data quality against a set of best practices.
7) Francalanci and Pernici: The assessment methodology developed by Francalanci and Pernici aims to overcome the limitations of other methodologies, which assess quality only by looking at the source where data are stored, without considering how users perceive quality [8]. Indeed, different users have different requirements, and their perceptions of the quality of the same dataset may vary. The methodology allows quality to be assessed from the perspective of different users by assigning them to classes, which group similar users together. Generally, users in the same class access the same system services and use the same types of data. Explicit data quality requirements are specified for each class by associating an evaluation function with each dimension and setting a minimum acceptable value for that dimension. Class-level requirements are inherited by all users in the class but can be modified by each user based on personal preferences. When a user requests data from a service, the system evaluates the quality of the data, checks whether the data satisfy the class-level and user-level requirements, and presents the results to the user.

E. Methodologies and Noark 5

A suitable DQ methodology can help public bodies that use Noark 5 systems to produce higher-quality data and thus increase data reusability and provide better services to citizens. However, in order to be applicable to records management, a methodology needs to include a number of important features. First, it must offer techniques for data and business process modelling. Second, it must allow data consumers to specify which DQ dimensions are important and to define DQ requirements in a format that can be parsed by software and used for objective DQ assessment. The methodology must also provide suitable data quality improvement techniques and a means for estimating the cost of improvement. Note that data correction tools which require little or no human involvement cannot be used in the context of Noark 5 systems because records in such systems may be modified only by designated people, for legal reasons. Next, the methodology has to include capabilities for benchmarking data quality, improving business processes that lead to poor DQ, and correcting the strategy of the organisation. Finally, it must offer guidelines for making DQ an important part of the organisational culture.

A careful analysis of the reviewed methodologies reveals that none of them supports all of the features outlined above. The analysis results are summarised in Table I. Although the methodologies are not directly applicable to Noark 5, some of their features can be combined in order to develop a comprehensive data quality management methodology for records management systems. Some of the important features are the data quality requirements definition in TDQM, the cost-benefit analysis in CDQ, the modelling techniques in IPMAP, and the techniques for DQ measurement and benchmarking in AIMQ.

TABLE I: Data quality methodologies. Features compared across TDQM, TIQM, HIQM, CDQ, IPMAP, AIMQ, and Francalanci & Pernici: modelling techniques, user-defined DQ dimensions, executable DQ requirements, objective assessment, benchmarking, business process improvement, improvement costs analysis, strategy correction, and organisational culture transformation.

Creating a new methodology is outside the scope of the present study, but we lay the groundwork for such an endeavour by identifying several dimensions that are applicable to Noark 5 and can be objectively measured, and by proposing an approach for measuring data quality along these dimensions on the basis of user-defined DQ requirements.

IV. THE NOARK 5 STANDARD

Apart from specifying functional requirements, the Noark 5 standard defines metadata that must be supported by compliant systems. Metadata are conceptually represented with an entity-relationship model, which we translated to EJB entities when we developed our system: entities in the model correspond to classes, attributes correspond to class fields, and relationships between entities correspond to fields that reference other classes. Noark 5 systems can handle both electronic and paper documents, which are stored externally and are thus outside the scope of the system. For this reason, our attention is drawn to the data quality of the metadata and not of the actual documents. This section provides an overview of the Noark 5 conceptual model, describes how cases are typically handled with Noark 5 systems, and presents several applicable data quality dimensions.

A. Metadata

The key entities in the Noark 5 conceptual model are depicted in Fig. 1, while a detailed description of their attributes can be found in the standard. The dotted lines in the diagram denote entity relationships which are part of the standard and fully supported by our system but were disregarded in this study for the sake of simplicity.

The work with any Noark 5 system begins by creating a fonds, i.e. an archive. A fonds is associated with one or more fonds creators and references a number of series. Each series points to case files, which represent cases in the system. Fonds, series, and case files all have associated storage locations, which denote where documents are physically stored (in the case of paper documents). Each series points to a classification system with a number of classes and keywords. A classification system may be based on social security numbers, in which case a given class will be the social security number of a particular person. Classes are therefore used to organise information; all records for one person, for example, will belong to the same class. Keywords, on the other hand, are used to facilitate searching. A case file includes one or more registry entries, each of which records a particular transaction and points to documents related to the case. The link from a registry entry to a document is indirect: a registry entry is first linked to a document description that references one or more document objects, which in turn point to either electronic or paper documents. Both case files and registry entries can be assigned to classes and can have associated keywords.
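To illustrate the mapping described above, the following sketch shows how a case file might be declared as a JPA/EJB entity. It is an illustration only: the field and relationship names follow the identifiers used in the DQ rule listings later in the paper rather than the normative Noark 5 metadata names, and the remaining entities, getters and setters are omitted.

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;
import java.time.LocalDate;
import java.util.Set;

// Hypothetical sketch: a conceptual-model entity mapped to a JPA/EJB entity class.
// Attributes become simple fields; relationships become fields that reference other entities.
@Entity
public class CaseFile {

    @Id
    @GeneratedValue
    private Long id;

    private String fileType;                        // attribute of the case file
    private String caseStatus;                      // e.g. "open" or "closed"
    private LocalDate createdDate;

    @ManyToOne
    private Series refSeries;                       // the series the case belongs to

    @ManyToOne
    private StorageLocation refLocation;            // physical storage location

    @OneToMany(mappedBy = "caseFile")
    private Set<RegistryEntry> refRegistryEntries;  // transactions recorded on the case

    // Series, StorageLocation and RegistryEntry would be mapped analogously;
    // getters and setters are omitted for brevity.
}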
B. Case handling

Users of every Noark 5 system are assigned a role on the basis of how they use the system. The two most important roles are archive leader and executive officer. An archive leader manages the archive of an organisation. Given a new Noark 5 system, the archive leader establishes the archive structure by creating a fonds, possibly several subfonds, a series, and so on. He also assigns cases to executive officers for handling, monitors the cases in his organisation or department, and ensures that they are handled both efficiently and effectively. An archive leader has superuser access rights to the system, which allow him, among other things, to modify the access privileges of other users, edit wrongly entered metadata, and move misclassified cases. Executive officers are responsible for handling the cases that have been assigned to them by the archive leader.

The Business Process Modelling Notation (BPMN) diagram in Fig. 2 shows the process of case handling. When a citizen submits an application to a public body, a Noark 5 system automatically creates a new case and sends it to the archive leader. At this stage, the case has the status registered. The archive leader then delegates the case to one of the executive officers. Once the officer opens the case, its status changes to reserved for editing. The officer adds a number of registry entries to the case over some period of time, and each entry includes documents related to the case. Once the case is complete, it gets the status finished.

At this point, the officer writes a response that will later be sent to the citizen and forwards both the response and the case to the archive leader for approval. The status of the case changes to pending approval. If the case is not complete, the archive leader sends it back to the officer. Otherwise, he signs the response, either electronically or on paper, sends it to the citizen, and sets the status of the case to archived.

Fig. 1: Noark 5 simplified conceptual model
Fig. 2: Case handling in Noark 5

In the presence of a DQ measurement component, officers would be able to check the quality of any given case before submitting it for approval. Furthermore, such a component would help the archive leader decide whether a case should be archived or sent back to an officer for corrections.
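The status progression just described can be summarised as a small state machine. The sketch below is our own reading of Fig. 2, given only as an illustration; Noark 5 itself does not prescribe these Java names or transitions.

import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative state machine for the case statuses used in the case handling process.
public enum CaseStatus {
    REGISTERED, RESERVED_FOR_EDITING, FINISHED, PENDING_APPROVAL, ARCHIVED;

    private static final Map<CaseStatus, Set<CaseStatus>> ALLOWED = Map.of(
        REGISTERED, EnumSet.of(RESERVED_FOR_EDITING),            // delegated and opened by an officer
        RESERVED_FOR_EDITING, EnumSet.of(FINISHED),              // all registry entries added
        FINISHED, EnumSet.of(PENDING_APPROVAL),                  // forwarded to the archive leader
        PENDING_APPROVAL, EnumSet.of(ARCHIVED, RESERVED_FOR_EDITING),  // approved, or sent back
        ARCHIVED, EnumSet.noneOf(CaseStatus.class)               // terminal state
    );

    public boolean canMoveTo(CaseStatus next) {
        return ALLOWED.get(this).contains(next);
    }
}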

C. Dimensions applicable to Noark 5

Many of the data quality dimensions identified in the literature are applicable to Noark 5, but we focus in our analysis only on those that can be objectively measured. They are listed below along with two other dimensions which are specific to Noark 5 and possibly to the domain of records management.

Completeness can be defined as the degree to which data are of sufficient depth, breadth, and scope for the task at hand [2]. We consider an object (entity) in Noark 5 to have good data quality in terms of completeness when the important fields in the object (as regarded by users of the data) are not NULL and, if they are strings, not empty.

Accuracy is defined by Redman [17] as the distance between a data value v and the real-world value v' which v represents. Batini and Scannapieco [2] call this type of accuracy semantic accuracy and also discuss syntactic accuracy, which unlike semantic accuracy can be measured objectively. In syntactic accuracy we are not interested in comparing v to v' but instead in comparing v to the possible values that it can take, i.e. its input domain. An object in Noark 5 has good data quality in terms of syntactic accuracy when each of its fields matches a value in its input domain.

Consistency is described by a set of constraints that must hold true for all objects in the system. These constraints can be defined for one or more fields in a single class, or in a class and a referenced class. There are two major types of integrity constraints, namely functional dependency and inclusion dependency. Functional dependency constraints specify which fields depend on what values of other fields in the same or in another object. Inclusion dependency constraints indicate that an object must refer to other objects, or that it includes objects that are also included in an object that it references.

Correctness describes the situation in which every real-world state maps to the correct system state [21]. A case in Noark 5, for example, can be assigned to the wrong class, which produces incorrectness that cannot be automatically detected. Other errors of the same type, however, can be detected. For example, it is possible to detect that a case is assigned to the wrong series when the creation date of the case does not fall between the start and end dates of the series.

Disposal and processing delay are two other dimensions that we believe are important to the domain of records management. A case in a Noark 5 system has low data quality along the disposal dimension when the age of the case exceeds some specified period of time and the case has not been removed from the system. Note that case disposal is a legal requirement. A case has low data quality along the processing delay dimension when no new registry entries have been added to it for some period of time but it has not been archived.

V. LANGUAGE FOR SPECIFYING DQ REQUIREMENTS

Noark 5 does not consider data quality, but we hope that this will change in future versions of the standard, which will not only need to specify what constitutes good data quality but also to define different levels of DQ compliance. Translating DQ requirements to code that measures quality is difficult and has to be done on an ad hoc basis unless a domain-specific language is used to describe the requirements. However, to the best of our knowledge, no such language exists, and DQ requirements are typically defined in a non-portable way even in commercial data quality software systems.

We developed a portable domain-specific language which allows DQ requirements to be specified in the form of DQ rules for the six dimensions described in Section IV. The language can be used to measure data quality in a wide variety of systems and does not impose our definitions of DQ dimensions on its users but instead provides constructs that allow users to specify what data quality aspects are included in a dimension. We wrote the language grammar in EBNF and used ANTLR to construct a language parser and interpreter. Due to space limitations, the grammar and the actual data quality rules that we developed for Noark 5 will be published separately. A description of the language and a representative set of the rules is provided here instead.
The language we developed works on objects and can therefore be used to measure data quality in systems that use object-oriented databases, but also in systems that use relational databases, although in the latter case an object-relational mapping is required to translate the relational schema to a set of interrelated objects. In fact, our system works with a relational database and uses Hibernate to implement the object-relational mapping. Focusing on the object-oriented view of the data, each DQ rule in our language describes what good data quality is for a given class in terms of its fields.

Evaluating the data quality of every case in our Noark 5 system is done in three steps. First, a set of DQ rules is defined for the classes that taken together form a case. Second, a language interpreter measures the DQ of all instances of these classes based on the previously defined rules. Finally, the measurements are combined into a single data quality value for each case.

Structure of DQ rules

Every DQ rule in our language includes three parts: header, body, and trailer. The header specifies a dimension that the rule contributes to and a class whose instances will be inspected by the rule. The body includes one or more DQ tests which are executed for each instance of the class and return a boolean value as a result. Note that TRUE indicates good DQ, while FALSE denotes poor DQ. A DQ test is made up of one or two logical predicates. Finally, the trailer marks the end of the rule with a fixed terminator string.

Predicates

Predicates use one of several supported operators to compare fields or a field with a literal, and return a boolean value. Supported operators include the relational operators <, <=, >, >=, == and !=, as well as seven other operators which will be illustrated later in the text using Noark-specific examples. The types of fields that can be used in a predicate are those present in Noark 5, i.e. integer, string, date, reference to an object, or a set of references to objects.

Measuring DQ

The language interpreter evaluates each DQ rule for a given dimension for all instances of the class specified in the header and returns as a result the number of successful and unsuccessful DQ tests for each instance. This result, along with the dimension that it applies to, is stored in a database using an approach similar to that proposed by Storey and Wang [19]. Storing DQ measurements in this way allows us to calculate data quality along a given dimension for a single object or for an arbitrary set of objects. Consequently, we can easily calculate data quality for objects in Noark 5 which stand on their own, such as the instances of StorageLocation, Author, and FondsCreator.

More importantly, we can calculate the data quality of cases, which consist of a number of objects of the classes CaseFile, RegistryEntry, DocumentDescription, and DocumentObject.

Example DQ rule

Listing 1 shows an example rule which will be applied to all instances of the Type1 class. Note that the rule specifies requirements not only for the Type1 instances but also for the Type2 objects that each Type1 instance refers to. The name of the reference is rtype2. A reference is essentially a pointer from one object to another.

DIM "Example" FOR Type1 t1, rtype2 t2
t1.intf >= 1;
WHEN t1.strf MATCHES "ab" THEN t2.intf > 0;
Listing 1: Example DQ rule

The example rule contributes to a dimension called Example and includes two DQ tests. The first test contains a single predicate and evaluates to TRUE when the integer field intf in the instance of Type1 is greater than or equal to 1. The second DQ test uses a WHEN-THEN construct with two predicates. This test returns TRUE either when the string field strf in the Type1 instance does not match ab, or when it matches ab and the integer field intf in the object pointed to by the field rtype2 is greater than 0.

Noark 5 DQ rules

We defined eighteen rules for Noark 5 with more than seventy DQ tests. In the following we provide a small subset of these rules in order to better illustrate the capabilities of the language we developed.

The rules in Listing 2 contribute to the completeness dimension. The first one works on CaseFile objects and returns TRUE when the object's file type is neither NULL nor an empty string. The second rule works on RegistryEntry objects and evaluates to TRUE when the object's created date is not NULL.

DIM "Completeness" FOR CaseFile f
EXISTS f.filetype;
DIM "Completeness" FOR RegistryEntry e
EXISTS e.createddate;
Listing 2: Completeness rules

The rule in Listing 3 applies to syntactic accuracy and works on CaseFile objects. It returns TRUE when the object's status matches either of the strings open and closed. Note that any regular expression can be placed on the right-hand side of the MATCHES operator.

DIM "Syntactic Accuracy" FOR CaseFile f
f.casestatus MATCHES "(open|closed)";
Listing 3: Syntactic accuracy rule

The rule in Listing 4 is part of the consistency dimension and works on CaseFile objects and on a Series object that each of them references. The first test evaluates to TRUE when the CaseFile object contains a non-empty set of references to RegistryEntry objects. The second test returns TRUE when the reference to StorageLocation in the CaseFile object is contained in the set of references to StorageLocation in the series that the case belongs to.

DIM "Consistency" FOR CaseFile f, rseries s
INCLUDES f.refregistryentries;
f.reflocation CONTAINED IN s.reflocations;
Listing 4: Consistency rule

The rule in Listing 5 contributes to the correctness dimension and works on CaseFile objects and on a Series object that each of them points to. The rule evaluates to TRUE when the case creation date is between the start and end dates of the corresponding series.

DIM "Correctness" FOR CaseFile f, refseries s
f.createddate BETWEEN s.startdate, s.enddate;
Listing 5: Correctness rule

The rule in Listing 6 is part of the disposal dimension and works on CaseFile objects. It evaluates to TRUE when the case age is less than 1825 days (five years). The creation date of the case is denoted by the createddate field in the CaseFile object.
DIM "Disposal" FOR CaseFile f
f.createddate AGE< 1825;
Listing 6: Disposal rule

Finally, the rule in Listing 7 applies to the processing delay dimension and works on CaseFile objects and on the set of referenced RegistryEntry objects. The rule evaluates to TRUE for cases that are currently being processed when the age of the newest registry entry in the case is less than five days.

DIM "Processing Delay" FOR CaseFile f, rentry r
WHEN f.casestatus MATCHES "open" THEN r.createddate AGE_LATEST< 5;
Listing 7: Processing delay rule
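As a rough illustration of the semantics the interpreter attaches to such rules, the sketch below shows how a single WHEN-THEN test could be represented and evaluated in Java. It is not the actual ANTLR-generated interpreter (the grammar is published separately), and the class, method and field-access details are our own simplifications.

import java.lang.reflect.Field;
import java.util.function.Predicate;

// Illustrative sketch of one DQ test: it fails only when the WHEN part holds
// and the THEN part does not, mirroring the description of the WHEN-THEN construct.
public final class DqTest {

    private final Predicate<Object> when;   // null means an unconditional test
    private final Predicate<Object> then;

    public DqTest(Predicate<Object> when, Predicate<Object> then) {
        this.when = when;
        this.then = then;
    }

    // TRUE indicates good quality for this instance, FALSE indicates poor quality.
    public boolean evaluate(Object instance) {
        if (when != null && !when.test(instance)) {
            return true;                    // WHEN not satisfied: the test passes vacuously
        }
        return then.test(instance);
    }

    // Helper for predicates in the style of MATCHES, reading the named field
    // reflectively from the inspected object.
    public static Predicate<Object> matches(String fieldName, String regex) {
        return instance -> {
            try {
                Field f = instance.getClass().getDeclaredField(fieldName);
                f.setAccessible(true);
                Object value = f.get(instance);
                return value != null && value.toString().matches(regex);
            } catch (ReflectiveOperationException e) {
                return false;               // a missing field counts as a failed test
            }
        };
    }
}

A test built with matches("caseStatus", "(open|closed)") would mirror the behaviour of Listing 3 for a single object, assuming the rule's field name corresponds to a declared field of the inspected class.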

VI. DATA QUALITY COMPONENT

The data quality component that we developed for our Noark 5 system is made up of two separate units: a backend which evaluates the data quality in the system, and a frontend which triggers data quality analysis and visualises the results as reported by the backend. We implemented the backend as an EJB module that can be easily plugged into our Noark 5 system. The frontend is a web application based on the ZK framework. It communicates with the backend either by using remote objects or web services.

The backend provides a number of functionalities to its clients. A client can start data quality analysis both synchronously and asynchronously, check whether data quality analysis is in progress, retrieve raw data quality results, i.e. results for each object for a given dimension, or retrieve the results for the cases handled by the system. When a client triggers the data quality analysis process, a language recogniser parses the DQ rules defined in a specific text file in the module. An interpreter then measures data quality based on the rules and stores the results in the database. Parsing the rules every time analysis is triggered is not an expensive operation, and at the same time it allows new rules to be added on the fly without restarting the system. When a client needs to retrieve the data quality results for all cases in the system, the module uses the raw data quality results stored in the database to calculate a data quality value ranging from 0 to 1 for each case, using simple ratio as defined by Pipino et al. [16]. This approach is preferred to fetching raw data quality results, as the latter may impose a significant communication overhead. However, the module also provides the necessary mechanisms to allow a client to fetch raw results and calculate a metric based on some principle other than simple ratio if it wishes to do so. Most of the functionality of the module is contained in a class library which can be reused in other systems that are not based on EJB.

Fig. 3: Visual appearance of the DQ frontend

The frontend (Fig. 3) allows its users to start DQ analysis of the data in the system and to visualise the results once the analysis is performed. It contains three graphical widgets: a list of all analysed dimensions, a list of the cases in the system, and a dial chart for displaying data quality measurements on a scale of 0 to 100. Both lists provide sorting capabilities; for example, a user can sort the available cases by name or author. When a user selects one or more dimensions and one or more cases from the lists, the application calculates the average quality for the selected cases in each of the selected dimensions, calculates the average of these values, and visualises it using the dial chart.
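The client-facing operations listed above could be summarised by an interface along the lines of the sketch below. This is a hypothetical paraphrase, not the actual EJB business interface of our module, and all names and record shapes are our own.

import java.util.List;
import java.util.concurrent.Future;

// Hypothetical client view of the backend module, paraphrasing the operations described above.
public interface DqAnalysisService {

    void runAnalysis();                        // synchronous analysis

    Future<Void> runAnalysisAsync();           // asynchronous analysis

    boolean isAnalysisInProgress();

    // Raw results: one outcome record per object for a given dimension.
    List<RawResult> getRawResults(String dimension);

    // Aggregated results: a simple-ratio score in [0, 1] for each case.
    List<CaseScore> getCaseScores();

    record RawResult(String className, long objectId, String dimension,
                     int passedTests, int failedTests) {}

    record CaseScore(long caseFileId, String dimension, double score) {}
}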
VII. RESULTS

We validated the data quality component by using it to analyse the quality of two different datasets, one with good and one with poor quality. Both datasets were created automatically by a tool that connects to the Noark 5 system, establishes a basic archive structure, and creates 100 cases. Each case includes a random number of RegistryEntry, DocumentDescription, and DocumentObject objects. The tool operates in two modes. In the first mode, it follows the data quality rules we specified for Noark 5 when it creates the cases. In the second mode, it randomly introduces the specific types of errors described below.

In order to lower the quality of the data in terms of completeness, the tool creates objects with NULL or empty fields. Accuracy is decreased by introducing syntactically incorrect values for string fields. Consistency is lowered by randomly breaking the inclusion and functional dependency rules that we specified. For example, the tool deliberately sets the finalised date of randomly selected cases to a date earlier than their creation date. It also sets the storage location of randomly chosen cases to a location that is not included in the series that the cases belong to. Correctness is decreased by changing the creation date of random cases to a date which is not between the start and end dates of the series that the cases belong to. The tool reduces data quality along the disposal dimension by setting the creation date of random cases to a date more than five years before the current date. Finally, quality along the processing delay dimension is lowered by setting the status of random cases to being processed and setting the creation date of all registry entries in those cases to a date more than five days in the past.

The data quality module successfully analysed both datasets and reported a data quality of 1.0 along the six dimensions for the dataset that follows our rules, and a lower quality for the other dataset. We verified that the module detects all the types of errors that we had deliberately introduced in one of the datasets.
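As an illustration of the kind of error injection just described, the sketch below back-dates a random subset of cases past the disposal limit used in Listing 6. It is not the actual test-data generator; the class name is ours, and it assumes the hypothetical CaseFile entity sketched in Section IV with a setter for its creation date.

import java.time.LocalDate;
import java.util.List;
import java.util.Random;

// Sketch only: lowering quality along the disposal dimension by back-dating
// randomly chosen cases beyond the five-year limit of the Disposal rule.
public final class DisposalErrorInjector {

    private static final int DISPOSAL_LIMIT_DAYS = 1825;  // five years, as in Listing 6
    private final Random random = new Random();

    public void inject(List<CaseFile> cases, double errorRate) {
        for (CaseFile caseFile : cases) {
            if (random.nextDouble() < errorRate) {
                // Older than the disposal limit, so the Disposal rule will report a failure.
                caseFile.setCreatedDate(LocalDate.now().minusDays(DISPOSAL_LIMIT_DAYS + 30));
            }
        }
    }
}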

VIII. CONCLUSION

The goal of the present study was to develop an understanding of how data quality can be objectively measured in records management systems on the basis of user-defined requirements. In order to answer this main research question, we identified a number of data quality issues in the Noark 5 standard, experimented with different measurement techniques for the identified issues, and developed a software component for DQ assessment. As there was no open-source Noark 5 system that we could use in our study, we designed a comprehensive architecture for records management and DQ assessment and developed a new Noark 5 core based on it. Our system is not only modular and flexible but, unlike existing commercial solutions, supports data quality analysis and can be deployed both in a clustered environment and as a hosted service.

We provided the background necessary for our study by investigating several approaches to identifying data quality dimensions, by looking at fundamental principles for developing metrics, and by outlining a number of DQ management and assessment methodologies. Our analysis of these methodologies showed that they cannot be directly applied in organisations that use Noark 5 systems due to the specifics of the records management domain. After identifying six applicable dimensions that can be objectively measured, we developed a domain-specific language for expressing data quality requirements. The language can be used to measure data quality in a variety of systems regardless of the technology they are based on. Moreover, it is applicable not only to Noark 5 but also to other records management standards such as MoReq2, which is used by both public bodies and private organisations throughout the European Union [5]. The approach of specifying data quality requirements with a domain-specific language is, to the best of our knowledge, novel, since even commercial DQ solutions specify such requirements in a non-portable way.

This paper makes an important contribution by bridging the domains of records management and data quality, which has not been done before. Our study showed that good data quality is crucial for records management systems, particularly for those based on Noark 5, but that the area is under-researched. DQ is not even part of the Noark 5 standard, and although the implications of poor-quality data are not widely understood and not currently observable, they may render data in the national archive of Norway unusable in the future. For this reason, data quality must be considered in the next version of the standard. The language we developed is also an important contribution because it allows, among other things, a uniform set of DQ requirements to be developed for public bodies in Norway and used to objectively measure and benchmark the quality of the data they produce.

Our study has two limitations, related to the set of DQ dimensions that we considered and to the practical utility of the domain-specific language. A study of other dimensions applicable to records management systems will help to better understand the scope and importance of DQ issues in such systems. Furthermore, the language we developed needs to be tested in different scenarios and possibly extended with new operators, constructs, and DQ tests.
The ultimate goal of research on data quality in records management systems should be the development of a comprehensive DQ management methodology for such systems, as well as a suite of software tools to support the methodology. This paper provides a good starting point from both a theoretical and a practical perspective.

REFERENCES

[1] C. Batini, F. Cabitza, C. Cappiello, and C. Francalanci. A comprehensive data quality methodology for web and structured data. International Journal of Innovative Computing and Applications, 1(3):205, 2008.
[2] C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006.
[3] M. Bovee, R. P. Srivastava, and B. Mak. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems, 18(1):51-74, 2003.
[4] C. Cappiello, P. Ficiaro, and B. Pernici. HIQM: a methodology for information quality monitoring, measurement, and improvement. In Advances in Conceptual Modeling - Theory and Practice, volume 4231 of Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2006.
[5] European Commission. Model requirements for the management of electronic records (MoReq2) - update and extension, 2008.
[6] W. Eckerson. Data quality and the bottom line. Technical report, The Data Warehousing Institute, 2002.
[7] L. English. Total information quality management - a complete methodology for IQ management. Information Management Magazine.
[8] C. Francalanci and B. Pernici. Data quality assessment from the user's perspective. In Proceedings of the 2004 International Workshop on Information Quality in Information Systems (IQIS '04), page 68, Paris, France, 2004.
[9] A. Hevner and S. Chatterjee. Design science research in information systems. In Design Research in Information Systems, volume 22. Springer US, Boston, MA, 2010.
[10] A. R. Hevner, S. T. March, J. Park, and S. Ram. Design science in information systems research. MIS Quarterly, 28(1):75-105, 2004.
[11] Y. Lee. AIMQ: a methodology for information quality assessment. Information & Management, 40(2), 2002.
[12] S. E. Madnick, R. Y. Wang, Y. W. Lee, and H. Zhu. Overview and framework for data and information quality research. Journal of Data and Information Quality, 1(1):1-22, 2009.
[13] S. T. March and G. F. Smith. Design and natural science research on information technology. Decision Support Systems, 15(4), 1995.
[14] F. Naumann and C. Rolker. Assessment methods for information quality criteria, 2000.
[15] National Archival Services of Norway. Noark 5 standard for records management.
[16] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data quality assessment. Communications of the ACM, 45(4):211-218, 2002.
[17] T. C. Redman. Data Quality for the Information Age. Artech House, 1996.
[18] G. Shankaranarayanan, M. Ziad, and R. Y. Wang. Managing data quality in dynamic decision environments. Journal of Database Management, 14(4):14-32, 2003.
[19] V. C. Storey and R. Y. Wang. Modeling quality requirements in conceptual database design. Pages 64-87.
[20] K. Peffers, T. Tuunanen, M. A. Rothenberger, and S. Chatterjee. A design science research methodology for information systems research. Journal of Management Information Systems, 24(3):45-77, 2007.
[21] Y. Wand and R. Y. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11):86-95, 1996.
[22] R. Y. Wang. A product perspective on total data quality management. Communications of the ACM, 41(2):58-65, 1998.
[23] R. Y. Wang and D. M. Strong. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 12(4):5-33, 1996.


More information

Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2

Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2 IJSRD - International Journal for Scientific Research & Development Vol. 2, Issue 03, 2014 ISSN (online): 2321-0613 Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2 1, 2 Department

More information

PRINCIPLES AND FUNCTIONAL REQUIREMENTS

PRINCIPLES AND FUNCTIONAL REQUIREMENTS INTERNATIONAL COUNCIL ON ARCHIVES PRINCIPLES AND FUNCTIONAL REQUIREMENTS FOR RECORDS IN ELECTRONIC OFFICE ENVIRONMENTS RECORDKEEPING REQUIREMENTS FOR BUSINESS SYSTEMS THAT DO NOT MANAGE RECORDS OCTOBER

More information

A Practical Look into GDPR for IT

A Practical Look into GDPR for IT Andrea Pasquinucci, March 2017 pag. 1 / 7 A Practical Look into GDPR for IT Part 1 Abstract This is the first article in a short series about the new EU General Data Protection Regulation (GDPR) looking,

More information

Building Resilience to Disasters for Sustainable Development: Visakhapatnam Declaration and Plan of Action

Building Resilience to Disasters for Sustainable Development: Visakhapatnam Declaration and Plan of Action Building Resilience to Disasters for Sustainable Development: Visakhapatnam Declaration and Plan of Action Adopted at the Third World Congress on Disaster Management Visakhapatnam, Andhra Pradesh, India

More information

HOW AND WHEN TO FLATTEN JAVA CLASSES?

HOW AND WHEN TO FLATTEN JAVA CLASSES? HOW AND WHEN TO FLATTEN JAVA CLASSES? Jehad Al Dallal Department of Information Science, P.O. Box 5969, Safat 13060, Kuwait ABSTRACT Improving modularity and reusability are two key objectives in object-oriented

More information

Semantics, Metadata and Identifying Master Data

Semantics, Metadata and Identifying Master Data Semantics, Metadata and Identifying Master Data A DataFlux White Paper Prepared by: David Loshin, President, Knowledge Integrity, Inc. Once you have determined that your organization can achieve the benefits

More information

Semantic interoperability, e-health and Australian health statistics

Semantic interoperability, e-health and Australian health statistics Semantic interoperability, e-health and Australian health statistics Sally Goodenough Abstract E-health implementation in Australia will depend upon interoperable computer systems to share information

More information

2 The IBM Data Governance Unified Process

2 The IBM Data Governance Unified Process 2 The IBM Data Governance Unified Process The benefits of a commitment to a comprehensive enterprise Data Governance initiative are many and varied, and so are the challenges to achieving strong Data Governance.

More information

What our members see as being the value of TM 2.0

What our members see as being the value of TM 2.0 About TM 2.0 The TM 2.0 ERTICO Platform originated in 2011 by TomTom and Swarco-Mizar and was formally established on 17 June 2014 during the ITS Europe Congress in Helsinki. It now comprises more than

More information

Business Analysis for Practitioners - Requirements Elicitation and Analysis (Domain 3)

Business Analysis for Practitioners - Requirements Elicitation and Analysis (Domain 3) Business Analysis for Practitioners - Requirements Elicitation and Analysis (Domain 3) COURSE STRUCTURE Introduction to Business Analysis Module 1 Needs Assessment Module 2 Business Analysis Planning Module

More information

Level 5 Diploma in Computing

Level 5 Diploma in Computing Level 5 Diploma in Computing 1 www.lsib.co.uk Objective of the qualification: It should available to everyone who is capable of reaching the required standards It should be free from any barriers that

More information

Data Quality:A Survey of Data Quality Dimensions

Data Quality:A Survey of Data Quality Dimensions Data Quality:A Survey of Data Quality Dimensions 1 Fatimah Sidi, 2 Payam Hassany Shariat Panahy, 1 Lilly Suriani Affendey, 1 Marzanah A. Jabar, 1 Hamidah Ibrahim,, 1 Aida Mustapha Faculty of Computer Science

More information

Business Architecture concepts and components: BA shared infrastructures, capability modeling and guiding principles

Business Architecture concepts and components: BA shared infrastructures, capability modeling and guiding principles Business Architecture concepts and components: BA shared infrastructures, capability modeling and guiding principles Giulio Barcaroli Directorate for Methodology and Statistical Process Design Istat ESTP

More information

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing

More information

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Shigeo Sugimoto Research Center for Knowledge Communities Graduate School of Library, Information

More information

Terms in the glossary are listed alphabetically. Words highlighted in bold are defined in the Glossary.

Terms in the glossary are listed alphabetically. Words highlighted in bold are defined in the Glossary. Glossary 2010 The Records Management glossary is a list of standard records terms used throughout CINA s guidance and training. These terms and definitions will help you to understand and get the most

More information

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Dong Han and Kilian Stoffel Information Management Institute, University of Neuchâtel Pierre-à-Mazel 7, CH-2000 Neuchâtel,

More information

Executing Evaluations over Semantic Technologies using the SEALS Platform

Executing Evaluations over Semantic Technologies using the SEALS Platform Executing Evaluations over Semantic Technologies using the SEALS Platform Miguel Esteban-Gutiérrez, Raúl García-Castro, Asunción Gómez-Pérez Ontology Engineering Group, Departamento de Inteligencia Artificial.

More information

Chapter No. 2 Class modeling CO:-Sketch Class,object models using fundamental relationships Contents 2.1 Object and Class Concepts (12M) Objects,

Chapter No. 2 Class modeling CO:-Sketch Class,object models using fundamental relationships Contents 2.1 Object and Class Concepts (12M) Objects, Chapter No. 2 Class modeling CO:-Sketch Class,object models using fundamental relationships Contents 2.1 Object and Class Concepts (12M) Objects, Classes, Class Diagrams Values and Attributes Operations

More information

MONITORING DATA PRODUCT QUALITY

MONITORING DATA PRODUCT QUALITY Association for Information Systems AIS Electronic Library (AISeL) UK Academy for Information Systems Conference Proceedings 2010 UK Academy for Information Systems Spring 3-23-2010 MONITORING DATA PRODUCT

More information

European Conference on Quality and Methodology in Official Statistics (Q2008), 8-11, July, 2008, Rome - Italy

European Conference on Quality and Methodology in Official Statistics (Q2008), 8-11, July, 2008, Rome - Italy European Conference on Quality and Methodology in Official Statistics (Q2008), 8-11, July, 2008, Rome - Italy Metadata Life Cycle Statistics Portugal Isabel Morgado Methodology and Information Systems

More information

Data Governance Central to Data Management Success

Data Governance Central to Data Management Success Data Governance Central to Data Success International Anne Marie Smith, Ph.D. DAMA International DMBOK Editorial Review Board Primary Contributor EWSolutions, Inc Principal Consultant and Director of Education

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

MetaData for Database Mining

MetaData for Database Mining MetaData for Database Mining John Cleary, Geoffrey Holmes, Sally Jo Cunningham, and Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand. Abstract: At present, a machine

More information

A model of information searching behaviour to facilitate end-user support in KOS-enhanced systems

A model of information searching behaviour to facilitate end-user support in KOS-enhanced systems A model of information searching behaviour to facilitate end-user support in KOS-enhanced systems Dorothee Blocks Hypermedia Research Unit School of Computing University of Glamorgan, UK NKOS workshop

More information

Eight units must be completed and passed to be awarded the Diploma.

Eight units must be completed and passed to be awarded the Diploma. Diploma of Computing Course Outline Campus Intake CRICOS Course Duration Teaching Methods Assessment Course Structure Units Melbourne Burwood Campus / Jakarta Campus, Indonesia March, June, October 022638B

More information

A MAS Based ETL Approach for Complex Data

A MAS Based ETL Approach for Complex Data A MAS Based ETL Approach for Complex Data O. Boussaid, F. Bentayeb, J. Darmont Abstract : In a data warehousing process, the phase of data integration is crucial. Many methods for data integration have

More information

Impact of Dependency Graph in Software Testing

Impact of Dependency Graph in Software Testing Impact of Dependency Graph in Software Testing Pardeep Kaur 1, Er. Rupinder Singh 2 1 Computer Science Department, Chandigarh University, Gharuan, Punjab 2 Assistant Professor, Computer Science Department,

More information

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, 00142 Roma, Italy e-mail: pimassol@istat.it 1. Introduction Questions can be usually asked following specific

More information

SDMX self-learning package No. 3 Student book. SDMX-ML Messages

SDMX self-learning package No. 3 Student book. SDMX-ML Messages No. 3 Student book SDMX-ML Messages Produced by Eurostat, Directorate B: Statistical Methodologies and Tools Unit B-5: Statistical Information Technologies Last update of content February 2010 Version

More information

Content Management for the Defense Intelligence Enterprise

Content Management for the Defense Intelligence Enterprise Gilbane Beacon Guidance on Content Strategies, Practices and Technologies Content Management for the Defense Intelligence Enterprise How XML and the Digital Production Process Transform Information Sharing

More information

Managing Changes to Schema of Data Sources in a Data Warehouse

Managing Changes to Schema of Data Sources in a Data Warehouse Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2001 Proceedings Americas Conference on Information Systems (AMCIS) December 2001 Managing Changes to Schema of Data Sources in

More information

A Visual Tool for Supporting Developers in Ontology-based Application Integration

A Visual Tool for Supporting Developers in Ontology-based Application Integration A Visual Tool for Supporting Developers in Ontology-based Application Integration Tobias Wieschnowsky 1 and Heiko Paulheim 2 1 SAP Research tobias.wieschnowsky@sap.com 2 Technische Universität Darmstadt

More information

PROJECT PERIODIC REPORT

PROJECT PERIODIC REPORT PROJECT PERIODIC REPORT Grant Agreement number: 257403 Project acronym: CUBIST Project title: Combining and Uniting Business Intelligence and Semantic Technologies Funding Scheme: STREP Date of latest

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

A Study on Website Quality Models

A Study on Website Quality Models International Journal of Scientific and Research Publications, Volume 4, Issue 12, December 2014 1 A Study on Website Quality Models R.Anusha Department of Information Systems Management, M.O.P Vaishnav

More information

Vendor: The Open Group. Exam Code: OG Exam Name: TOGAF 9 Part 1. Version: Demo

Vendor: The Open Group. Exam Code: OG Exam Name: TOGAF 9 Part 1. Version: Demo Vendor: The Open Group Exam Code: OG0-091 Exam Name: TOGAF 9 Part 1 Version: Demo QUESTION 1 According to TOGAF, Which of the following are the architecture domains that are commonly accepted subsets of

More information

A Transformation-Based Model of Evolutionary Architecting for Embedded System Product Lines

A Transformation-Based Model of Evolutionary Architecting for Embedded System Product Lines A Transformation-Based Model of Evolutionary Architecting for Embedded System Product Lines Jakob Axelsson School of Innovation, Design and Engineering, Mälardalen University, SE-721 23 Västerås, Sweden

More information

Data Management Glossary

Data Management Glossary Data Management Glossary A Access path: The route through a system by which data is found, accessed and retrieved Agile methodology: An approach to software development which takes incremental, iterative

More information

ISO INTERNATIONAL STANDARD. Information and documentation Managing metadata for records Part 2: Conceptual and implementation issues

ISO INTERNATIONAL STANDARD. Information and documentation Managing metadata for records Part 2: Conceptual and implementation issues INTERNATIONAL STANDARD ISO 23081-2 First edition 2009-07-01 Information and documentation Managing metadata for records Part 2: Conceptual and implementation issues Information et documentation Gestion

More information

Data Models: The Center of the Business Information Systems Universe

Data Models: The Center of the Business Information Systems Universe Data s: The Center of the Business Information Systems Universe Whitemarsh Information Systems Corporation 2008 Althea Lane Bowie, Maryland 20716 Tele: 301-249-1142 Email: Whitemarsh@wiscorp.com Web: www.wiscorp.com

More information

ICT & Computing Progress Grid

ICT & Computing Progress Grid ICT & Computing Progress Grid Pupil Progress ion 9 Select, Algorithms justify and apply appropriate techniques and principles to develop data structures and algorithms for the solution of problems Programming

More information

The Impact of Data Quality Tagging on Decision Outcomes

The Impact of Data Quality Tagging on Decision Outcomes Association for Information Systems AIS Electronic Library (AISeL) ACIS 2001 Proceedings Australasian (ACIS) 2001 The Impact of Data Quality Tagging on Decision Outcomes Graeme Shanks The University of

More information

Data Curation Handbook Steps

Data Curation Handbook Steps Data Curation Handbook Steps By Lisa R. Johnston Preliminary Step 0: Establish Your Data Curation Service: Repository data curation services should be sustained through appropriate staffing and business

More information

Alignment of Business and IT - ArchiMate. Dr. Barbara Re

Alignment of Business and IT - ArchiMate. Dr. Barbara Re Alignment of Business and IT - ArchiMate Dr. Barbara Re What is ArchiMate? ArchiMate is a modelling technique ("language") for describing enterprise architectures. It presents a clear set of concepts within

More information

Participatory Quality Management of Ontologies in Enterprise Modelling

Participatory Quality Management of Ontologies in Enterprise Modelling Participatory Quality Management of Ontologies in Enterprise Modelling Nadejda Alkhaldi Mathematics, Operational research, Statistics and Information systems group Vrije Universiteit Brussel, Brussels,

More information

Usability Evaluation as a Component of the OPEN Development Framework

Usability Evaluation as a Component of the OPEN Development Framework Usability Evaluation as a Component of the OPEN Development Framework John Eklund Access Testing Centre and The University of Sydney 112 Alexander Street, Crows Nest NSW 2065 Australia johne@testingcentre.com

More information

NEW DATA REGULATIONS: IS YOUR BUSINESS COMPLIANT?

NEW DATA REGULATIONS: IS YOUR BUSINESS COMPLIANT? NEW DATA REGULATIONS: IS YOUR BUSINESS COMPLIANT? What the new data regulations mean for your business, and how Brennan IT and Microsoft 365 can help. THE REGULATIONS: WHAT YOU NEED TO KNOW Australia:

More information

On The Theoretical Foundation for Data Flow Analysis in Workflow Management

On The Theoretical Foundation for Data Flow Analysis in Workflow Management Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2005 Proceedings Americas Conference on Information Systems (AMCIS) 2005 On The Theoretical Foundation for Data Flow Analysis in

More information

Diplomatic Analysis. Case Study 03: HorizonZero/ZeroHorizon Online Magazine and Database

Diplomatic Analysis. Case Study 03: HorizonZero/ZeroHorizon Online Magazine and Database Diplomatic Analysis Case Study 03: HorizonZero/ZeroHorizon Online Magazine and Database Tracey Krause, UBC December 2006 INTRODUCTION The InterPARES 2 case study 03 was proposed to explore the distinction

More information

GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS

GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS Dear Participant of the MScIS Program, If you have chosen to follow an internship, one of the requirements is to write a Thesis. This document gives you

More information

A METHODOLOGY FOR INFORMATION QUALITY MANAGEMENT IN SELF-HEALING WEB SERVICES (Completed paper)

A METHODOLOGY FOR INFORMATION QUALITY MANAGEMENT IN SELF-HEALING WEB SERVICES (Completed paper) A METHODOLOGY FOR INFORMATION QUALITY MANAGEMENT IN SELF-HEALING WEB SERVICES (Completed paper) Cinzia Cappiello Barbara Pernici Politecnico di Milano, Milano, Italy {cappiell, pernici}@elet.polimi.it

More information

TEL2813/IS2820 Security Management

TEL2813/IS2820 Security Management TEL2813/IS2820 Security Management Security Management Models And Practices Lecture 6 Jan 27, 2005 Introduction To create or maintain a secure environment 1. Design working security plan 2. Implement management

More information

2/18/2009. Introducing Interactive Systems Design and Evaluation: Usability and Users First. Outlines. What is an interactive system

2/18/2009. Introducing Interactive Systems Design and Evaluation: Usability and Users First. Outlines. What is an interactive system Introducing Interactive Systems Design and Evaluation: Usability and Users First Ahmed Seffah Human-Centered Software Engineering Group Department of Computer Science and Software Engineering Concordia

More information

ISO/IEC TR TECHNICAL REPORT. Information technology Procedures for achieving metadata registry (MDR) content consistency Part 1: Data elements

ISO/IEC TR TECHNICAL REPORT. Information technology Procedures for achieving metadata registry (MDR) content consistency Part 1: Data elements TECHNICAL REPORT ISO/IEC TR 20943-1 First edition 2003-08-01 Information technology Procedures for achieving metadata registry (MDR) content consistency Part 1: Data elements Technologies de l'information

More information

Network protocols and. network systems INTRODUCTION CHAPTER

Network protocols and. network systems INTRODUCTION CHAPTER CHAPTER Network protocols and 2 network systems INTRODUCTION The technical area of telecommunications and networking is a mature area of engineering that has experienced significant contributions for more

More information

THE STATE OF IT TRANSFORMATION FOR RETAIL

THE STATE OF IT TRANSFORMATION FOR RETAIL THE STATE OF IT TRANSFORMATION FOR RETAIL An Analysis by Dell EMC and VMware Dell EMC and VMware are helping IT groups at retail organizations transform to business-focused service providers. The State

More information

Digital Archives: Extending the 5S model through NESTOR

Digital Archives: Extending the 5S model through NESTOR Digital Archives: Extending the 5S model through NESTOR Nicola Ferro and Gianmaria Silvello Department of Information Engineering, University of Padua, Italy {ferro, silvello}@dei.unipd.it Abstract. Archives

More information

Generic Statistical Business Process Model

Generic Statistical Business Process Model Joint UNECE/Eurostat/OECD Work Session on Statistical Metadata (METIS) Generic Statistical Business Process Model Version 3.1 December 2008 Prepared by the UNECE Secretariat 1 I. Background 1. The Joint

More information

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2 Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1907-1911 1907 Web-Based Data Mining in System Design and Implementation Open Access Jianhu

More information

Distributed Hybrid MDM, aka Virtual MDM Optional Add-on, for WhamTech SmartData Fabric

Distributed Hybrid MDM, aka Virtual MDM Optional Add-on, for WhamTech SmartData Fabric Distributed Hybrid MDM, aka Virtual MDM Optional Add-on, for WhamTech SmartData Fabric Revision 2.1 Page 1 of 17 www.whamtech.com (972) 991-5700 info@whamtech.com August 2018 Contents Introduction... 3

More information

International Journal of Software and Web Sciences (IJSWS) Web service Selection through QoS agent Web service

International Journal of Software and Web Sciences (IJSWS)   Web service Selection through QoS agent Web service International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Evaluation and Design Issues of Nordic DC Metadata Creation Tool

Evaluation and Design Issues of Nordic DC Metadata Creation Tool Evaluation and Design Issues of Nordic DC Metadata Creation Tool Preben Hansen SICS Swedish Institute of computer Science Box 1264, SE-164 29 Kista, Sweden preben@sics.se Abstract This paper presents results

More information

Solving the Enterprise Data Dilemma

Solving the Enterprise Data Dilemma Solving the Enterprise Data Dilemma Harmonizing Data Management and Data Governance to Accelerate Actionable Insights Learn More at erwin.com Is Our Company Realizing Value from Our Data? If your business

More information

European Commission - ISA Unit

European Commission - ISA Unit DG DIGIT Unit.D.2 (ISA Unit) European Commission - ISA Unit INTEROPERABILITY QUICK ASSESSMENT TOOLKIT Release Date: 12/06/2018 Doc. Version: 1.1 Document History The following table shows the development

More information

Designing a System Engineering Environment in a structured way

Designing a System Engineering Environment in a structured way Designing a System Engineering Environment in a structured way Anna Todino Ivo Viglietti Bruno Tranchero Leonardo-Finmeccanica Aircraft Division Torino, Italy Copyright held by the authors. Rubén de Juan

More information

Opinion 02/2012 on facial recognition in online and mobile services

Opinion 02/2012 on facial recognition in online and mobile services ARTICLE 29 DATA PROTECTION WORKING PARTY 00727/12/EN WP 192 Opinion 02/2012 on facial recognition in online and mobile services Adopted on 22 March 2012 This Working Party was set up under Article 29 of

More information

XETA: extensible metadata System

XETA: extensible metadata System XETA: extensible metadata System Abstract: This paper presents an extensible metadata system (XETA System) which makes it possible for the user to organize and extend the structure of metadata. We discuss

More information

Exploring the Concept of Temporal Interoperability as a Framework for Digital Preservation*

Exploring the Concept of Temporal Interoperability as a Framework for Digital Preservation* Exploring the Concept of Temporal Interoperability as a Framework for Digital Preservation* Margaret Hedstrom, University of Michigan, Ann Arbor, MI USA Abstract: This paper explores a new way of thinking

More information

data elements (Delsey, 2003) and by providing empirical data on the actual use of the elements in the entire OCLC WorldCat database.

data elements (Delsey, 2003) and by providing empirical data on the actual use of the elements in the entire OCLC WorldCat database. Shawne D. Miksa, William E. Moen, Gregory Snyder, Serhiy Polyakov, Amy Eklund Texas Center for Digital Knowledge, University of North Texas Denton, Texas, U.S.A. Metadata Assistance of the Functional Requirements

More information

What is cloud computing? The enterprise is liable as data controller. Various forms of cloud computing. Data controller

What is cloud computing? The enterprise is liable as data controller. Various forms of cloud computing. Data controller A guide to CLOUD COMPUTING 2014 Cloud computing Businesses that make use of cloud computing are legally liable, and must ensure that personal data is processed in accordance with the relevant legislation

More information

Harmonization of usability measurements in ISO9126 software engineering standards

Harmonization of usability measurements in ISO9126 software engineering standards Harmonization of usability measurements in ISO9126 software engineering standards Laila Cheikhi, Alain Abran and Witold Suryn École de Technologie Supérieure, 1100 Notre-Dame Ouest, Montréal, Canada laila.cheikhi.1@ens.etsmtl.ca,

More information

CoE CENTRE of EXCELLENCE ON DATA WAREHOUSING

CoE CENTRE of EXCELLENCE ON DATA WAREHOUSING in partnership with Overall handbook to set up a S-DWH CoE: Deliverable: 4.6 Version: 3.1 Date: 3 November 2017 CoE CENTRE of EXCELLENCE ON DATA WAREHOUSING Handbook to set up a S-DWH 1 version 2.1 / 4

More information

Intranets and Organizational Learning: Impact of Metadata Filters on Information Quality, User Satisfaction and Intention to Use

Intranets and Organizational Learning: Impact of Metadata Filters on Information Quality, User Satisfaction and Intention to Use Intranets and Organizational Learning: Impact of Metadata Filters on Information Quality, User Satisfaction and Intention to Use Suparna Goswami suparnag@comp.nus.edu.sg Hock Chuan Chan chanhc@comp.nus.edu.sg

More information

Developing Web-Based Applications Using Model Driven Architecture and Domain Specific Languages

Developing Web-Based Applications Using Model Driven Architecture and Domain Specific Languages Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 2. pp. 287 293. Developing Web-Based Applications Using Model Driven Architecture and Domain

More information

"Charting the Course... ITIL 2011 Managing Across the Lifecycle ( MALC ) Course Summary

Charting the Course... ITIL 2011 Managing Across the Lifecycle ( MALC ) Course Summary Course Summary Description ITIL is a set of best practices guidance that has become a worldwide-adopted framework for IT Service Management by many Public & Private Organizations. Since early 1990, ITIL

More information

International Roaming Charges: Frequently Asked Questions

International Roaming Charges: Frequently Asked Questions MEMO/06/144 Brussels, 28 March 2006 International Roaming Charges: Frequently Asked Questions What is international mobile roaming? International roaming refers to the ability to use your mobile phone

More information

Security Management Models And Practices Feb 5, 2008

Security Management Models And Practices Feb 5, 2008 TEL2813/IS2820 Security Management Security Management Models And Practices Feb 5, 2008 Objectives Overview basic standards and best practices Overview of ISO 17799 Overview of NIST SP documents related

More information