Semantics, Metadata and Identifying Master Data

A DataFlux White Paper
Prepared by: David Loshin, President, Knowledge Integrity, Inc.

Once you have determined that your organization can achieve the benefits of integrating data quality and data governance through introducing a master data management (MDM) program, some typical early questions emerge, such as "What architectural approaches will we take to deploy our MDM solution?" or "What are the business approaches for acquiring the appropriate tools and technologies required for MDM success?" These are good questions, but they are often asked prematurely. Even before determining how to manage the enterprise master data asset, there are more fundamental questions that need to be asked and comprehensively explored, such as:

- What data elements constitute our master data?
- How do we locate and isolate master data objects that exist within the enterprise?
- How do we assess the variances between the different representations in order to consolidate instances into a single view?

Because of the ways that diffused application architectures have evolved within different organizations, it is likely that while there are a relatively small number of core master objects in use, there are many different ways that these objects are modeled, represented and stored. For example, any application that must manage contact information for individual customers will rely on a data model that maintains the customer's name. Yet one application will track an individual's full name, while others will break up the name into its first, middle and last parts. And even those that track the given and family names of a customer will do it differently: perform a quick scan of the data sets within your own organization and you are likely to find LAST_NAME attributes with a wide range of field lengths.

Figure 1: Isolating master data from different data sets.
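The kind of quick scan described above can be sketched in a few lines. The schemas below are entirely hypothetical, invented to show how the same family-name concept surfaces under different names, types and lengths:

```python
# Sketch: survey how a "last name" concept is declared across hypothetical
# application schemas. All table and column definitions here are invented
# for illustration only.
schemas = {
    "crm.CUSTOMER":    {"FULL_NAME": ("VARCHAR", 100)},
    "billing.CLIENT":  {"FIRST_NAME": ("VARCHAR", 30),
                        "MIDDLE_NAME": ("VARCHAR", 30),
                        "LAST_NAME": ("VARCHAR", 40)},
    "support.CONTACT": {"FNAME": ("CHAR", 20),
                        "LNAME": ("CHAR", 25)},
}

# Collect every attribute whose name suggests a family-name concept.
last_name_like = {
    (table, col): spec
    for table, cols in schemas.items()
    for col, spec in cols.items()
    if col.upper() in {"LAST_NAME", "LNAME", "SURNAME", "FAMILY_NAME"}
}

for (table, col), (dtype, length) in sorted(last_name_like.items()):
    print(f"{table}.{col}: {dtype}({length})")
```

Even in this tiny example, the scan surfaces two different declarations (VARCHAR(40) versus CHAR(25)) for the same concept, and misses the third source entirely because the family name is buried inside FULL_NAME, which is exactly the variance problem described above.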

The challenges are not limited to determining what master objects are used. Indeed, the core requirement is to find where master objects are used and to chart a strategy for standardizing, harmonizing and consolidating them into a master repository or registry. When the intention is to create an organizational asset that is not just another data silo, it is imperative that your organization provide the means for both the consolidation and integration of master data and facilitate the most effective and appropriate sharing of that master data.

What is Master Data?

What are the characteristics of master data? So far, the industry has been better at describing master data than at actually defining what master data is. As a description, master data objects are those core business objects that are used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies. Master data objects are the things we care about: the things that are logged in our transaction systems, measured and reported on in our reporting systems, and analyzed in our analytical systems. Common examples of master data include:

- Customers
- Suppliers
- Parts
- Products
- Locations
- Contact mechanisms

For example, consider the following transaction: "David Loshin purchased seat 15B on flight 238 from BWI to SFO on July 20, 2006." Some of the master data elements in this example and their types are shown in Table 1.

Master Data Object    Value
Customer              David Loshin
Product               Flight 238
Product               Seat 15B
Location              BWI
Location              SFO

Table 1: Master data elements for a typical airline reservation.

Aside from the above description, master data objects share certain characteristics:

- The real-world objects modeled within the environment as master data objects tend to be referenced in multiple business areas. For example, the concept of a vendor may exist in the finance application at the same time as in the procurement application.

- Master data objects are referenced in both transaction and analytic system records. While the sales system may log and process the transactions initiated by a customer, those same activities may be analyzed for the purposes of segmentation and marketing.
- Master data objects may be classified within a semantic hierarchy, with different levels of classification, attribution and specialization applied depending on the application. For example, we may have a master data category of party, which in turn is comprised of individuals or organizations. Those parties may also be classified based on their roles, such as prospect, customer, supplier, vendor or employee.
- Master data objects may require specialized application functions to create new instances, as well as to manage the updating and removal of instance records. Each application that involves supplier interaction may have a function enabling the creation of a new supplier record.
- They are likely to have models reflected across multiple applications, possibly embedded in legacy data structure models.
- While we may see a natural hierarchy across one dimension, the taxonomies that are applied to our data instances may actually cross multiple hierarchies. For example, a party may be an individual, a customer and an employee simultaneously.

In turn, the same master data categories and their related taxonomies would be used for transactions, analysis and reporting. For example, the headers in a monthly sales report may be derived from the master data categories (sales by customer, by region, by time period). Enabling the transactional systems to refer to the same data objects as the subsequent reporting systems ensures that the analysis reports are consistent with the transaction systems.

Centralizing Semantic Metadata

Master data may be sprinkled across the application environment.
The objective of a master data management program is to facilitate the effective management of the set of master data instances as a single centralized master resource. But before we can materialize a single master record for any entity, we must be able to:

1. Discover which data resources may contain entity information
2. Understand which attributes carry identifying information
3. Extract identifying information from the data resource
4. Transform the identifying information into a standardized or canonical form
5. Establish similarity to other standardized records

This entails cataloging the data sets, their attributes, formats, data domains, definitions, contexts and semantics, not just as an operational resource, but in a way that can be used to automate master data consolidation as well as to govern the ongoing application interactions with the master repository. In other words, to be able to manage the master data, one must first be able to manage the master metadata. But as there is a need to resolve multiple variant models into a single view, the interaction with the master metadata must facilitate the resolution of three critical aspects:

- Format at the element level
- Structure at the instance level
- Semantics across all levels

Figure 2: Preparation for a master data integration process must resolve the differences between the syntax, structure and semantics of different source data sets.

These are effectively three levels of integration that need to dovetail as a prelude to any kind of enterprise-wide integration, and they introduce three corresponding challenges for master metadata management:

1. Collecting and analyzing master metadata
2. Resolving similarity in structure
3. Understanding and unifying master data semantics

Challenge 1: Consolidating and Analyzing Master Metadata

One approach is to analyze and document the metadata associated with all data objects across the enterprise and use that information to guide analysts seeking out master data. Many of the data sets may have documented some of the necessary metadata. For example, relational database systems allow for querying table structure and data element types, and COBOL copybooks reveal some structure and potentially even some alias data. Some of the data may have little or no documented metadata, such as fixed-format or character-separated files.
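For those sources that do document their structure, the catalog can be queried directly. A minimal sketch using SQLite's `table_info` pragma (the `customer` table itself is invented for illustration):

```python
import sqlite3

# Sketch: for sources with documented metadata (here, a relational database),
# table structure and data element types can be read from the catalog.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        last_name   VARCHAR(40),
        first_name  VARCHAR(30)
    )
""")

# PRAGMA table_info returns rows of (cid, name, type, notnull, dflt, pk);
# we keep the (name, declared type) pairs for the metadata repository.
columns = [(row[1], row[2]) for row in
           conn.execute("PRAGMA table_info(customer)")]
print(columns)
```

Fixed-format and character-separated files offer no such catalog, which is why the empirical profiling described next is needed to recover their structure.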

If the objective is to collect comprehensive and consistent metadata, as well as to ensure that the data appropriately correlates to its documented metadata, we can use data profiling as the tool of choice. Because of its ability to apply both statistical and analytical algorithms for characterizing data sets, data profiling can drive the empirical assessment of structure and format metadata while simultaneously exposing embedded data models and dependencies. Our consolidated metadata repository will eventually enumerate the relevant characteristics associated with each data set in a standardized way, including the data set name, its type (e.g., RDBMS table, VSAM file, CSV file) and the characteristics of each of its columns/attributes (e.g., length, data type or format pattern).

At the end of this process, we will have more than simply a comprehensive catalog of all data sets. We will also be able to review the frequency of meta-model characteristics, such as frequently used names, field sizes and data types. Capturing these values with a standard representation allows the metadata characteristics themselves to be subjected to the kinds of statistical analysis that data profiling provides. For example, we can assess the dependencies between common attribute names (e.g., CUSTOMER) and their assigned data types (e.g., VARCHAR(20)) to identify (and potentially standardize against) commonly used types, sizes and formats.

Challenge 2: Resolving Similarity in Structure

Despite the expectation that there are many variant forms and structures for your organization's master data, the different underlying models of each master data object are bound to share many commonalities. For example, the structure for practically any residential customer table will contain a name, an address and a telephone number. Likewise, almost any vendor table will probably also contain a name, an address and a telephone number.
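This kind of attribute-set comparison can be mechanized with simple set operations. A minimal sketch, using hypothetical attribute sets for the party, customer and vendor structures just described:

```python
# Sketch: compare attribute sets of hypothetical structures to detect
# overlapping structures (shared identifying attributes) and derived
# structures (one attribute set wholly embedded in another).
party    = {"name", "address", "phone"}
customer = {"name", "address", "phone", "loyalty_tier"}
vendor   = {"name", "address", "phone", "tax_id"}

def overlap(a, b):
    """Jaccard-style overlap between two attribute sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

# The party attributes are wholly embedded in both customer and vendor:
# a derived-structure signal suggesting a common underlying model.
assert party <= customer and party <= vendor

print(f"customer/vendor overlap: {overlap(customer, vendor):.2f}")
```

A high overlap score between two structures, or full containment of one in another, flags candidate pairs for the analyst to review as potential master objects sharing a common underlying type.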
A closer look might suggest considering an underlying model concept of a party, used as the basis for both customer and vendor. In turn, the analyst might review any model that contains those same identifying attributes as a structure type that can be derived from, or is related to, a party type.

There are two aspects to structure similarity for the purpose of tracking down master data instances. The first is seeking out overlapping structures, in which the core attributes determined to carry identifying information for one data object overlap with a similar set of attributes in another data object. The second is identifying derived structures, in which one object's set of attributes is completely embedded within other data objects. Both cases indicate a structural relationship, and when related attributes carry identifying information, the analyst should review those objects to determine whether they indeed represent master objects.

Challenge 3: Unifying Semantics

The third challenge focuses on the qualitative difference between pure syntactic or structural metadata (which we can discover through the profiling process) and the underlying semantic metadata. This involves more than just analyzing structure similarity. It involves understanding what the data means, how that meaning is conveyed, how that meaning connects data sets across the enterprise, and approaches to capturing semantics as an attribute of your metadata framework. As a data set's metadata is collected, the semantic analyst must approach the business client to understand that data object's business meaning. One step in this process involves reviewing the degree to which semantic consistency in data element naming is related to overlapping data types, sizes and structures. The next step is to

document the business meanings assumed for each of the data objects, which involves asking questions like: What are the definitions for the data elements? Or for the data sets themselves? Are there authoritative sources for the definitions? Do similar objects have different business meanings? The answers to these questions not only help in determining which data sets truly refer to the same underlying real-world objects, they also contribute to an organizational resource that can be used to standardize a representation for each data object as its definition is approved through the data governance process. Managing that semantic metadata as a central asset enables the metadata repository to grow in value as it consolidates semantics from different enterprise data collections.

Identifying and Qualifying Master Data

Once the semantic metadata has been collected and centralized, the analyst's task of identifying master data should be simplified. As more metadata representations of similar objects and entities populate the repository, the frequency of specific models will provide a basis for assessing whether the attributes of a represented object qualify the data elements represented by the model as master data. By adding additional characterization data to each data set's metadata profile, we can add more knowledge to the process of determining sources that can feed a master data repository, which will help in the analyst's task.

One approach is to characterize the value set associated with each column in each table. At the conceptual level, designating a value set using a simplified classification scheme reduces the level of complexity associated with data variance, and allows for loosening the constraints when comparing multiple metadata instances. For example, we can limit ourselves to six data value classes, such as these:

1. Boolean or Flag: There are only two valid values, one representing true and one representing false.
2. Time/Date Stamp: A value that represents a point in time.
3. Magnitude: A numeric value on a continuous range, such as a quantity or an amount.
4. Code Enumeration: A small set of values, either used directly (e.g., the colors "red" and "blue") or mapped as a numeric enumeration (e.g., 1 = red, 2 = blue).
5. Handle: A character string with limited duplication across the set that may be used as part of an object description (e.g., name or address_line_1 fields contain handle information).
6. Cross-Reference: An identifier that either is uniquely assigned to the record or provides a reference to that identifier in another data set.
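A classifier along these lines can be sketched as follows. The heuristics and thresholds (date pattern, enumeration size of 10) are purely illustrative; a profiling tool would apply far richer statistics:

```python
import re

# Sketch: assign a column's observed values to one of the six value
# classes described above. Thresholds and patterns are illustrative only.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def classify_values(values):
    distinct = set(values)
    if all(isinstance(v, str) and DATE_RE.fullmatch(v) for v in distinct):
        return "time/date stamp"
    if len(distinct) == 2:
        return "boolean/flag"
    if all(isinstance(v, (int, float)) for v in distinct):
        # Wide continuous numeric range -> magnitude; tiny domain -> code.
        return "magnitude" if len(distinct) > 10 else "code enumeration"
    if len(distinct) <= 10:
        return "code enumeration"
    # Mostly-unique strings: unique per record suggests an identifier.
    if len(distinct) == len(values):
        return "cross-reference"
    return "handle"

print(classify_values(["Y", "N", "Y"]))                 # boolean/flag
print(classify_values(["2006-07-20", "2006-07-21"]))    # time/date stamp
```

Attaching a value class to each profiled column gives the metadata repository a coarse, comparable descriptor that survives differences in naming and declared type.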

The Fractal Nature of Metadata Profiling

At this point, each data attribute can be summarized in terms of a small number of descriptive characteristics: data type, length, data value class, etc. In turn, each data set can be described as a collection of its component attributes. Because we are looking for similar data sets with similar structures, formats and semantics, our job is to assess each data set's identifying attribution, try to find the collections of data sets that share similar characteristics, and determine if they represent the same objects. Let's summarize:

- We are using our tools to assess data element structure
- We are collecting this information into a metadata repository
- We use our tools to look for data attributes that share similar characteristics
- We use our tools to seek out attributes with similar names
- We analyze the data value sets and assign them into value classes
- We use our tools to detect similarities between representative data metamodels

In essence, the techniques and tools we can use for determining the sources of master data objects are the same types of tools we use for consolidating the data into a master repository! Using data profiling, parsing, standardization and matching, we can facilitate the process of identifying which data sets (tables, files, spreadsheets, etc.) represent which master data objects.

Standardizing the Representation

The analyst now has a collection of master object representations. But as a prelude to developing the consolidation road map, decisions must be made as part of the organization's governance process. To consolidate the variety of diverse master object representations into a single repository, the relevant stakeholders need to agree on a common representation as well as the underlying semantics for that representation.
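Once such a common representation is agreed on, every variant source shape must be mapped into it before matching. A minimal sketch; the canonical form chosen here (upper-case FIRST/MIDDLE/LAST parts) and the field names are invented for illustration, since the actual standard is whatever the governance process approves:

```python
import re

# Sketch: transform variant source records into one agreed-upon canonical
# name representation prior to matching. Field names and the canonical
# shape are hypothetical.
def standardize_name(record):
    """Map either a FULL_NAME field or split name fields to one shape."""
    if "FULL_NAME" in record:
        parts = record["FULL_NAME"].split()
        first, last = parts[0], parts[-1]
        middle = " ".join(parts[1:-1])
    else:
        first = record.get("FIRST_NAME", "")
        middle = record.get("MIDDLE_NAME", "")
        last = record.get("LAST_NAME", "")

    def clean(s):
        # Normalize case and strip punctuation for comparison purposes.
        return re.sub(r"[^A-Z ]", "", s.upper()).strip()

    return {"FIRST": clean(first), "MIDDLE": clean(middle), "LAST": clean(last)}

a = standardize_name({"FULL_NAME": "David Loshin"})
b = standardize_name({"FIRST_NAME": "David", "LAST_NAME": "Loshin"})
assert a == b  # two variant representations resolve to one canonical form
```

The point of the sketch is the assertion at the end: once both source shapes reduce to the same canonical record, downstream similarity matching can treat them as comparable.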
It is critical that a standard representation be defined and agreed to so that the participants expecting to benefit from the data in the master repository can effectively share the data. And because MDM is a solution that integrates tools with policies and procedures for data governance, there should be a process for defining and agreeing to data standards.

Summary: Metadata Profiling Drives the Process

In effect, we have described a process for analyzing similarity of syntax, structure and semantics as a prelude to identifying enterprise sources of master data. And since the objective of identifying and consolidating master data representations requires empirical analysis and similarity assessment as part of the resolution process, it is comforting to know that the same kinds of tools and techniques that will subsequently be used to facilitate data integration can also isolate and catalog organizational master data.