CHAPTER 3 BUILDING ARCHITECTURAL DATA WAREHOUSE FOR CANCER DISEASE


3.1 INTRODUCTION

With advances in technology, an increasing number of hospitals use electronic medical records to accumulate substantial amounts of data about their patients, together with the associated clinical conditions and treatment details. Electronic medical records are becoming ubiquitous in day-to-day clinical practice. They capture clinical data, store it in a personal database, and mirror it in local and regional databases; data capture, storage, retrieval and display are all performed. They also display alerts and warnings, and guide a clinician through clinical practice by way of workflow and On-Line Transaction Processing (OLTP) of intelligent data. Intelligent data processing extends OLTP by creating a new information repository that integrates basic data from various sources, arranges the data into proper formats, and then makes the data available for the analysis and evaluation that support planning and decision making. For this analytical work the data warehouse model is an efficient and effective technique: it handles the intelligent data, makes the huge volume of electronic data stored over many years uncomplicated to manage, and allows the data to be processed routinely for daily medical transactions.

Data Warehouse

The data warehouse is a phenomenon that grew out of the huge amount of electronic data stored in recent years and the need to use that data to accomplish goals that go beyond the routine tasks of daily processing. A detection and prevention system is essential in a data warehouse process in the scenario of a large chain of multispecialty hospitals or a cancer institute with many branches, where patient administration managers also need to quantify and evaluate how each

branch can update its patient details to the central hospital. The group database stores detailed patient data on the tasks performed by the branches.

A data warehouse is a store of convenient, consistent, complete and consolidated data, collected to enable quick analysis by the end users who take part in Decision Support Systems (DSS). These data are obtained from different operational sources and kept in a separate physical store. A data warehouse is not only a relational database containing historical data derived from transactional data; it is also an environment that includes all the operations and applications needed to manage the process of gathering data and delivering it to medical users, such as an extraction, cleansing, transformation and loading (ECTL) solution, an On-Line Analytical Processing (OLAP) engine, and client analysis tools.

Data warehouses have no standard definition, and those who work on the subject have defined them in many ways:

"The basic data warehouse architecture interposes between end-user desktops and production data sources a warehouse that we usually think of as a single, large system maintaining an approximation of an enterprise data model." (O'Donnell 2001)

"A data warehouse is a copy of transaction data specifically structured for querying and reporting." (Kimball et al. 1998, p.19)

William Inmon defined a data warehouse as "a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process" (Lane and Schupmann 2002, pp.42-43).

A data warehouse is thus a collection of data that supports decision-making processes. It provides the following features (Inmon 2005): it is subject-oriented; it is integrated and consistent; it shows its evolution over time; and it is not volatile.

Subject-oriented: Data warehouses are designed to aid decision making for a particular subject. For example, medical data from an application contains specific sales of medicines to patients; in contrast, medical data for decision support contains a historical record of sales over specific time intervals. If designed well, subject-oriented data provides a stable image of the medical process, independent of legacy systems; in other words, it captures the basic nature of the medical environment.

Integrated: A data warehouse consists of different kinds of data collected from separate legacy systems, which can create conflicts and inconsistencies among units of measure. The data therefore have to be put into a consistent format, and in this way they become integrated.

Nonvolatile: Once entered into the warehouse, data should not change. This is logical, because the purpose of a warehouse is to enable a user to analyze what has occurred. New data is always appended to the database rather than replacing existing data; the database continually absorbs new data, integrating it with the previous data.

Time-variant: Operational data and informational data differ in their time variance. Operational data is valid only at the moment of access, capturing a moment in time; when performance over time must be assessed, historical data is needed. A data warehouse represents data over a long time horizon, so historical analysis can be performed easily (Lane and Schupmann 2002, pp.42-43).

Data warehouses are subject-oriented because they pivot on enterprise-specific concepts such as orders, patients, medicines/equipment, medical consultants and administrators/doctors. Operational databases, on the contrary, pivot on many different enterprise applications.
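The nonvolatile and time-variant properties described above can be illustrated with a minimal append-only store. This is only a sketch; the table, field names and statuses are illustrative assumptions, not the thesis schema.

```python
import datetime

# Minimal sketch of a nonvolatile, time-variant store: snapshots are
# only ever appended, never updated in place (names are illustrative).
warehouse = []

def load_snapshot(patient_id, status, snapshot_date):
    """Append a dated snapshot; existing rows are never modified."""
    warehouse.append({
        "patient_id": patient_id,
        "status": status,
        "snapshot_date": snapshot_date,
    })

def history(patient_id):
    """Time-variant view: every recorded state, in load order."""
    return [r for r in warehouse if r["patient_id"] == patient_id]

load_snapshot("P001", "diagnosed", datetime.date(2010, 1, 5))
load_snapshot("P001", "in treatment", datetime.date(2010, 6, 5))
# The earlier row is still present, so historical analysis remains possible.
```

Because nothing is overwritten, each load adds another "photograph" of the operational data, in the sense of the movie analogy used later in this section.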
We put emphasis on integration and consistency because data warehouses take advantage of multiple data sources, such as data extracted from production systems and stored in enterprise databases, or even data from third-party information systems. A data warehouse should provide a unified view of all

the data. Generally speaking, creating a data warehouse system does not require that new information be added; rather, existing information needs rearranging. This implicitly means that an information system should already be available. Operational data usually covers a short period of time, because most transactions involve the latest data, whereas a data warehouse should enable analysis covering a few years. For this reason, data warehouses are regularly updated from operational data and keep on growing. If the data were visually represented, one could imagine a photograph of the operational data taken at regular intervals; the sequence of photographs, stored in the data warehouse, would yield a movie revealing the status of the enterprise from its foundation until the present. In addition, a data warehouse must be accessible and process-oriented.

Accessible: The primary purpose of a data warehouse is to provide readily accessible information to end users.

Process-oriented: It is important to view data warehousing as a process for the delivery of information; the maintenance of a data warehouse is ongoing and iterative in nature.

Figure 3.1 Data Flow Diagram of Data Warehouse

Need to Build Data Warehouse

The concept of data warehousing evolved out of the need for easy access to a structured store of quality data that can be used for decision making. It is globally accepted that information is a powerful and sensitive asset that can provide significant benefits to any organization. Organizations have vast amounts of data but find it increasingly difficult to access and use, because the data exists in many different formats, on many different platforms, and in different file and database structures developed by different processes. Organizations have therefore had to write and maintain perhaps hundreds of programs to extract, prepare and consolidate data for the many different applications used for analysis and reporting. Moreover, decision makers often want to dig deeper into the data once initial findings are made, which typically requires modifying the extract programs or developing new ones. This process is costly, inefficient, and highly time consuming.

Data warehousing offers a better approach. It implements a process that accesses heterogeneous data sources; cleans, filters and transforms the data; and stores the data in a structure that is easy to access, understand and use. The data is then used for querying, reporting and analysis. As such, the access, usage, technology and performance requirements are completely different from those in a transaction-oriented operational environment. The volume of data in a data warehouse can be very high, particularly given the requirements of large-scale data analysis. Analysis programs often must scan vast amounts of data, which could negatively affect operational applications, which are more performance-sensitive.
Therefore, the two environments must be separated to minimize conflicts and performance degradation in the operational environment.

Need to Build Data Warehouse for Cancer Disease

It is well known that a data warehouse is very important at the level of a large chain of organizations for maintaining massive data records, so it is necessary to implement data warehouse systems in the health care sector. Medical

databases are scattered across a large area; when brought together and integrated, they can be used to obtain results and make decisions ranging from earlier detection of the disease to predicting the exact treatment, its results, and the curability of the disease. A cancer database is one such instance, spread across a wide geographical area where the pieces of much-needed information are scattered.

Cancer is one of the leading causes of death worldwide, and every year large numbers of men and women are diagnosed with it. Early diagnosis of cancer in its beginning stage is very helpful in curing the disease. Cancer may occur due to genetic, biological and environmental factors, and attributes such as age, gender, marital status, habits, and family history of cancer play a major role in its causation. A data warehouse built on a cancer database will provide clear information to a medical analyst.

3.2 BASIC ELEMENTS OF THE DATA WAREHOUSE

The data warehousing elements used to develop the data warehouse are the source system, data staging area, presentation server, dimensional model, data mart, and On-Line Analytical Processing (OLAP).

Figure 3.2 The Basic Elements of the Data Warehouse
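The risk attributes listed above (age, gender, marital status, habits, family history) can be sketched as a patient record with a toy rule set. The field names and thresholds are illustrative assumptions, not the registry's actual schema or any validated clinical rule.

```python
from dataclasses import dataclass, field

# Illustrative record for the cancer-risk attributes discussed above;
# the field names are assumptions, not the registry's actual schema.
@dataclass
class PatientRecord:
    patient_id: str
    age: int
    gender: str
    marital_status: str
    habits: list = field(default_factory=list)  # e.g. ["smoking"]
    family_history_of_cancer: bool = False

def risk_flags(p: PatientRecord) -> list:
    """Toy rule set marking attributes a medical analyst might review."""
    flags = []
    if p.age >= 60:                      # threshold is an assumption
        flags.append("age")
    if "smoking" in p.habits:
        flags.append("habits")
    if p.family_history_of_cancer:
        flags.append("family history")
    return flags

p = PatientRecord("P100", 64, "M", "married", ["smoking"], True)
# risk_flags(p) -> ["age", "habits", "family history"]
```

A real warehouse would derive such flags during analysis from historical fact data rather than hard-coded rules; the sketch only shows how the attributes hang together in one record.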

Data Source System

A source system, also called a legacy system, captures medical data and transactions. It is the largest source of data for analysis systems, and it is therefore a burden to create queries and administrative reports directly from these systems. The data source can comprise several elements: On-Line Transaction Processing (OLTP), i.e. day-to-day transactions, historical data, and external data sources on cancer patients. The data collected, which contain personal details, habits, family history of cancer patients, symptoms, diagnoses and treatment details, were provided by the Department of Biostatistics and Cancer Registry, Adyar Cancer Institute (WIA), Chennai, India.

OLTP Technique of Data Transactions

On-line operational systems perform transaction and query processing, and so are termed On-Line Transaction Processing (OLTP) systems. Severe database bottlenecks can sometimes cause major problems for retailers and other organizations with highly distributed online environments. The database contains patient as well as medical information that is read from and written to constantly and in near real time, to support the quality and timeliness of each transaction.

Features of OLTP

a. Users and system orientation: OLTP is customer-oriented and is used for transaction and query processing by administrators, patients and medical professionals.

b. Data contents: An OLTP system manages current data in a highly detailed format.

c. Database design: An OLTP system generally adopts an entity-relationship data model and an application-oriented database design.

d. View: An OLTP system focuses mainly on current data, without referring to historical data or data in other organizations.

e. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.

Contrasting OLTP and Data Warehousing Environments

One major difference between the two types of system is that a data warehouse is not usually kept in the fully normalized form common in OLTP environments. Data warehouses and OLTP systems have very different requirements; here are some examples of the differences between typical data warehouses and OLTP systems.

Figure 3.3 Contrasting OLTP and Data Warehousing Environments

Workload: Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so it should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations, and your applications might be specifically tuned or designed to support only those operations.

Data modifications: A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques on, for example, patient records. The end users of a data warehouse do not update it directly. In OLTP systems, end users routinely issue individual data modification statements to the database, so the OLTP database is always up to date and reflects the current state of each medical transaction.

Schema design: Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance and to guarantee data consistency.

Typical operations: A typical data warehouse query scans thousands or millions of rows, for example, "Find the total number of patients who received treatment." A typical OLTP operation accesses only a handful of records, for example, "Retrieve the grievance record of a particular patient."

Historical data: Data warehouses usually store many months or years of data, to support historical analysis. OLTP systems usually store data from only a few weeks or months, keeping only as much historical data as is needed to meet the requirements of the current transaction.

Data Staging Area

A data staging area is an initial storage area where a set of processes that clean, transform, combine, de-duplicate, household and archive the data are performed, so that the data can be used in the data warehouse. The staging area acts as a bridge between the source system and the presentation server. It can be spread over a number of machines and need not be based on relational technology. Unlike the presentation server, described below, the data staging area never provides query and presentation services.
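The contrast in typical operations described above can be sketched with an in-memory SQLite database. The table and column names are illustrative, not the thesis schema.

```python
import sqlite3

# Sketch of the warehouse-vs-OLTP contrast using in-memory SQLite.
# Table and column names are illustrative, not the thesis schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE treatment_fact (patient_id TEXT, treated INTEGER)")
con.executemany(
    "INSERT INTO treatment_fact VALUES (?, ?)",
    [("P1", 1), ("P2", 1), ("P3", 0), ("P4", 1)],
)

# Typical warehouse query: scans every row to compute an aggregate.
total_treated = con.execute(
    "SELECT COUNT(*) FROM treatment_fact WHERE treated = 1"
).fetchone()[0]

# Typical OLTP operation: touches a single record.
one_patient = con.execute(
    "SELECT treated FROM treatment_fact WHERE patient_id = ?", ("P3",)
).fetchone()[0]

print(total_treated, one_patient)  # 3 0
```

The aggregate has to visit the whole fact table, while the point lookup reads one row; this is why the two workloads favour such different schema and indexing designs.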
A conventional ETL process extracts, transforms, and loads the cancer patient records into the database or data store; in our research area, however, the records of disease factors may contain irrelevant information. In this

study, ECTL (Extract, Cleanse, Transform and Load) is shown to be efficient when compared to ETL.

ECTL Process for Data Warehouse

The traditional ETL process (Extract, Transform and Load) is the traditional and widely accepted approach to data warehouse development. Data is extracted from the data sources (the line of medical applications) using a data extraction tool, via whatever data connectivity is available. It is then transformed using a series of transformation routines, a process largely dictated by the data format of the output. Data quality and integrity checking is performed as part of the transformation process, and corrective actions are built into the process; transformations and integrity checking are performed in the data staging area. Finally, once the data is in the target format, it is loaded into the data warehouse, ready for presentation.

The process is often designed from the end backwards, in that the required output is designed first; this determines exactly what data is required from the source. The routines designed and developed to implement the process are written specifically to achieve the desired output, and only the data required for the output is included in the extraction process. In addition, the output design must incorporate all the facts and dimensions required to present both the aggregation levels required by the BI solution and any possible future requirements. Medical rules that define how aggregations are achieved, and the relationships between the various entities in both the source and the target, are designed and coded into the routines that implement the ETL process. This approach leads to tight dependencies among the routines at each stage of the process. Data warehouse maintenance issues include data extraction, cleansing, transformation, loading, subsequent loading (refreshing) and data purging.
ECTL refers to the methods involved in accessing and manipulating source data and loading it into the target database.

1. Extract: Some of the data elements in the operational database can reasonably be expected to be useful for decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse.

2. Cleansing: Information quality is the key consideration in determining the value of information. The developer of the data warehouse makes the data as error-free as possible before it enters the warehouse; this process is known as data cleansing. It must deal with many types of possible errors, including missing and incorrect data at a single source, and inconsistent and conflicting data when two or more sources are involved.

3. Transform: An operational database may be developed around any set of priorities, which keep changing with requirements; those who develop a data warehouse based on such databases are therefore typically faced with inconsistency among their data sources. In our research, data inconsistency is handled in the cleansing step, to avoid errors from the operational database.

4. Loading: This often implies the physical movement of data from the computer storing the source database to the one that will store the data warehouse, assuming they are different.

5. Data refreshing and data purging: After the initial loading, updates at the source database should be propagated to the data warehouse; this propagation is called data refreshing. The removal from the warehouse of data that is no longer needed is known as data purging.

Presentation Server

A presentation server is a physical machine that stores the processed data for the end users' querying and reporting requirements. It is fed from the data staging area. If the queryable presentation resource for an enterprise's data is organized around an entity-relationship model, understandability and performance are lost.
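The ECTL steps listed above can be sketched end-to-end as a small pipeline. The source fields, cleaning rules and record contents are illustrative assumptions, not the actual registry data.

```python
# End-to-end sketch of the ECTL steps described above.
# Field names and cleaning rules are illustrative assumptions.
source_rows = [
    {"patient_id": "P1", "age": "45", "diagnosis": "lung"},
    {"patient_id": "P2", "age": "",   "diagnosis": "breast"},  # missing age
    {"patient_id": "P1", "age": "45", "diagnosis": "lung"},    # duplicate
]

def extract(rows):
    """Step 1: pull only the decision-relevant fields."""
    return [{k: r[k] for k in ("patient_id", "age", "diagnosis")} for r in rows]

def cleanse(rows):
    """Step 2: drop records with missing data and remove duplicates."""
    seen, clean = set(), []
    for r in rows:
        key = tuple(r.values())
        if r["age"] and key not in seen:
            seen.add(key)
            clean.append(r)
    return clean

def transform(rows):
    """Step 3: put data into a consistent format (age as integer)."""
    return [{**r, "age": int(r["age"])} for r in rows]

def load(rows, warehouse):
    """Step 4: physically move the records into the warehouse store."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(cleanse(extract(source_rows))), [])
# warehouse now holds a single clean, typed record for P1
```

Note that cleansing runs before transformation, which is exactly the ordering that distinguishes the ECTL variant from plain ETL in this chapter.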

The tables are instead organized as a star schema, with the presentation server presenting and storing data in a dimensional framework.

Dimensional Model

The dimensional model, designed to provide higher query performance, resilience to change, and greater understandability, is an alternative to the entity-relationship model. It consists of a fact table and dimension tables. A fact table contains measurements of the medical process, which are preferably numeric and additive, together with a set of two or more foreign keys that join the dimension tables to the fact table. A dimension table is complementary to the fact table; most dimension tables have many textual attributes, and each has a primary key that relates it to the fact table.

Data Mart

A data mart is a logical subset of the complete data warehouse, prepared for a single medical process in an organization. Taken together, the data marts form an integrated enterprise data warehouse. Data marts must be built from shared dimensions and facts so that they can be combined and used together.

OLAP (On-Line Analytic Processing)

OLAP enables end users to query and present text and numeric data from data warehouses. OLAP technology is based on a multidimensional cube of data, and OLAP databases have a multidimensional structure.

End User Application

These applications help end users prepare queries, perform analyses, and carry out other activities that support medical needs, such as end-user data access tools and ad hoc query tools. An end-user data access tool works with a SQL session and provides the user with a report, a screen of data, or another form of analysis. An ad hoc query tool facilitates query preparation by letting the user work from pre-built query templates.
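The fact/dimension structure described above can be sketched as a tiny star schema. The table and column names are assumptions for illustration only.

```python
import sqlite3

# Tiny star schema: one fact table joined to two dimension tables
# by foreign keys. Names are illustrative, not the thesis schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patient_dim   (patient_key INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE treatment_dim (treatment_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE visit_fact (
    patient_key   INTEGER REFERENCES patient_dim,
    treatment_key INTEGER REFERENCES treatment_dim,
    cost          REAL           -- numeric, additive measure
);
INSERT INTO patient_dim   VALUES (1, 'F'), (2, 'M');
INSERT INTO treatment_dim VALUES (1, 'chemotherapy'), (2, 'radiation');
INSERT INTO visit_fact    VALUES (1, 1, 500.0), (1, 2, 300.0), (2, 1, 500.0);
""")

# A typical dimensional query: sum the additive measure per dimension value.
rows = con.execute("""
    SELECT t.name, SUM(f.cost)
    FROM visit_fact f JOIN treatment_dim t USING (treatment_key)
    GROUP BY t.name ORDER BY t.name
""").fetchall()
# rows == [('chemotherapy', 1000.0), ('radiation', 300.0)]
```

The additive `cost` measure rolls up cleanly along any dimension, which is exactly the property the fact-table design is meant to guarantee.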

Modeling Application

Modeling applications transform or summarize data from the data warehouse using forecasting models, behavior scoring models, allocation models, and data mining tools.

Metadata

Metadata contains information and definitions about the data that is stored. The first image most people have of the data warehouse is a large collection of historical, integrated data. While that image is correct in many regards, there is another element of the data warehouse that is vital: metadata. Metadata is data about data, and it has been around as long as there have been programs and data for those programs to operate on.

Figure 3.4 Metadata in Simple Form

Figure 3.5 Role of Metadata and the Community Served by Metadata

CANCER DATA WAREHOUSE ARCHITECTURE (CDWA)

Data warehousing is a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. A data warehouse (or smaller-scale data mart) is a specially prepared repository of data designed to support decision making; the data comes from operational systems and external sources. To create the data warehouse, cancer data are extracted from source systems such as questionnaires and the cancer institute database, cleaned (e.g., to detect and correct errors), transformed (e.g., put into subject groups or summarized), and loaded into a data store (i.e., placed into the data warehouse). The architecture includes tools for extracting data from multiple operational databases and external sources; for cleaning, transforming and integrating this data; for loading data into the data warehouse; and for periodically refreshing the warehouse to reflect updates at the sources and to purge data from the warehouse, perhaps onto slower archival storage. In addition to the main warehouse, there may be several departmental data marts.

Figure 3.6 Cancer Data Warehousing Architecture using ECTL, OLTP, and OLAP Servers

Data in the warehouse and data marts are stored and managed by one or more warehouse servers, which present multidimensional views of the data to a variety of front-end tools: query tools, report writers, analysis tools, and data mining tools. Finally, there is a repository for storing and managing metadata, along with tools for monitoring and administering the warehousing system. Data warehousing technologies have been successfully deployed in healthcare.

Cancer Data Warehouse Architecture (Enhanced)

The metadata and raw data of a traditional OLTP system are present, together with an additional type of data: summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance; a typical cancer data warehouse query, for example, retrieves summarized patient data from the cancer institute. In the data warehouse enhancement process, the data were collected by questionnaire and obtained from the cancer institute database. This provides a better source of data records, handled by the OLTP process at the specialist institutes, which supply the information zone- and region-wise at the database level. A summary stored in the database is called a materialized view, and the user can easily identify data records in the data warehouse model using materialized views.

Figure 3.7 Cancer Data Warehouse Architecture (Enhanced)
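The idea of a precomputed summary (materialized view) can be sketched as follows. The zone/region grouping and patient counts are assumptions based on the description above, not the institute's actual figures.

```python
from collections import defaultdict

# Sketch of a materialized view: a summary computed once in advance
# and then queried directly, instead of rescanning the raw facts.
raw_facts = [
    {"zone": "north", "region": "R1", "patients": 120},
    {"zone": "north", "region": "R2", "patients": 80},
    {"zone": "south", "region": "R3", "patients": 150},
]

def materialize_zone_summary(facts):
    """Precompute patient totals per zone (the 'long operation')."""
    summary = defaultdict(int)
    for f in facts:
        summary[f["zone"]] += f["patients"]
    return dict(summary)

zone_view = materialize_zone_summary(raw_facts)  # computed once, in advance

def patients_in_zone(zone):
    """Queries read the stored summary, not the raw fact rows."""
    return zone_view[zone]

# patients_in_zone("north") -> 200
```

In a real warehouse the database engine maintains such views and refreshes them when the facts change; here the refresh would simply be another call to materialize_zone_summary.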

Operational Source Systems

Operational source systems are developed to capture and process the original medical transactions in specialist (cancer) institutes. These systems are designed for data entry rather than reporting, but it is from them that the data in the data warehouse is populated.

Data Warehouse (DW): A data warehouse contains data that is organized and stored specifically for direct user queries and reports. It differs from an OLTP database in that it is designed primarily for reading rather than writing. Data warehouses and their architectures vary with the specifics of each organization's situation; our enhanced data warehouse unifies the data scattered throughout an organization into a single centralized data structure with a common format. A fundamental concept of a DW is the distinction between data and information: data is composed of observable and recordable facts, often found in operational or transactional systems, while a data warehouse is a repository of integrated information, available for querying and analysis.

A Cancer Data Warehouse (CDW) is a DW tailored to the needs of users in a clinical environment. It combines data from various medical databases and cleanses the medical data to form a centralized repository that answers the informational needs of all clinical users and supports medical decision making. The medical data gathered in the healthcare process include data related to patient care: specific demographics, input/output data recorded for the patient, diagnosis data, treatments, procedures performed, and the costs associated with the patient's care. Utilizing DW techniques in the medical domain therefore raises several challenges and requirements, including the patient data format, medical transaction analysis, data integration, data quality, and the ECTL process technique.

Patient Data Format

CDW technology aims at determining the relationships in clinical data, discovering disease trends, evaluating the performance of the different treatment protocols used, supporting measurement, improving patient outcomes, and providing information to users in areas ranging from research to management. The medical data are collected during regular day-to-day events and stored in various systems, including the statistical information system, medical information system, and laboratory information system; the clinical data are stored in these systems during the patient's visits. The types of data include:

Demographic information: collected once, to provide a rich data analysis environment.

Clinical information: information about the patient's life habits, used to enhance the data analysis capabilities.

Diagnosis information: describing the diagnosis process.

Treatment information: information about the treatment process, including treatment type, treatment procedure, and treatment risk information.

Laboratory information: the laboratory test results.

Figure 3.8 Important Components of Cancer Medical Processes

Cancer disease clinical systems have accumulated substantial amounts of data about patients and the associated clinical conditions and treatments. The hidden relationships and patterns within this medical information are used to monitor the impact of specific diseases and the effect of medical processes and their efficiencies or deficiencies. The medical data contain various types of data: text and qualitative formats, numeric and quantitative formats, and sequential or time series data.

Medical Transaction Analysis

Medical transaction analysis identifies the purpose of treatment and determines solutions to disease problems. One of the most important aspects of developing a CDW is to define the disease reason; without a clearly defined disease reason, the CDW cannot achieve its objectives. The phases of medical transaction analysis are significant for studying and analyzing the existing process from a medical perspective, and for determining the project objectives, requirements, constraints and acceptance criteria. The analysis is composed of four phases, illustrated in the following:

The requirements are gathered in order to understand the purpose of the CDW problems and to identify the suitable data model to be used. Determining and gathering the requirements must be done properly, as this states the value of the CDW and drives its architecture.

The requirements are further analyzed and investigated to determine the data integration problems. This is followed by producing an initial dimensional model showing facts, measures, dimension keys, and dimension hierarchies; dimension hierarchies can include parallel hierarchical paths.

The validity of the model is assessed to confirm that the medical objectives, goals and needs are clearly understood, and the CDW architecture is designed per the medical requirements.

The database is planned to be stored in a multidimensional database showing all elements of the model and their properties; the detailed dimensional models can be further extended and optimized.

CDW development must meet certain functional requirements in order to maintain data integration in the CDW. These requirements include: understanding the medical purpose, requirements and constraints; determining the medical objectives and needs; determining the medical rules; determining a suitable model that supports data analysis; identifying the data sources of the required data and performing the sizing of the model; and providing a mechanism that answers queries related to healthcare.

Data Integration

Data integration is the process of combining data from two or more disparate data sources, within one or several institutions, into a single physical repository. This large volume of data is integrated, rearranged and consolidated to provide a unified view for analysis. Data integration becomes a significant issue when developing a CDW because of the complexity of the hospital environment: varied care practices, data types and definitions. Additionally, the clinical data are integrated from various medical information systems with different clinical routines, incompatible structures, and incomplete clinical information. Handling medical data integration issues and challenges requires: developing an enhanced integration framework to combine heterogeneous medical data sources into the CDW; and providing a mechanism to integrate medical data from the various clinical information systems, so that the data are consistent and ready for analysis.

It also requires reducing the dimensions of the medical facts describing a patient's current situation, and minimizing the time required for extracting, transforming and storing the data in the CDW.

Data Quality

Data quality is an essential characteristic that determines the reliability of data for analysis, decision making and planning. Acceptable data quality in the medical field is a critical issue for the reliability of medical decision making and the research environment. Data quality is achieved when the required (useful) data exactly meet the specific needs and are stored in the common format required by the CDW. Achieving it requires laying down a strong mechanism to manage medical data quality; defining levels of data quality appropriate to the organization; understanding the data quality problems from a medical perspective, since there is a wide variety of dimensions along which data quality can be affected; and understanding the format of the data stored by each source, since there are wide varieties of structured and unstructured data.

Cancer Data Warehouse Architecture (with a Staging Area using ECTL)

ETL is improved to handle the medical data and thus enhanced into ECTL, which plays a vital role in DW solutions. It is responsible for extracting data from heterogeneous data sources, converting the extracted data into a common format suitable for analysis and mining, identifying data quality problems, cleansing the data to eliminate undesired records, and finally loading the data into the DW (Extract-Cleanse-Transform-Load). In the medical field, ECTL process activities are highly

sensitive with respect to data quality and data integration; poor data quality damages the reputation of an organization and causes low-quality decision making.

Figure 3.9 Cancer Data Warehouse Architecture (with a Staging Area using ECTL)

Due to the complexity of medical data structures and clinical operations in a real-world clinical environment, it is important to develop a powerful ECTL tool to integrate, transform and clean medical data before loading it into the CDW. Furthermore, the ECTL process is quite complex in the medical field, requiring extraction of data from several sources, cleansing and transformation activities, and loading facilities.

Figure 3.10 Cancer Data ECTL Process Flow Diagram
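As a rough illustration of the four ECTL stages (a minimal sketch, not the actual implementation; all source and field names here are invented), the flow could look like this in Python:

```python
# Minimal ECTL (Extract-Cleanse-Transform-Load) sketch.
# All source and field names are hypothetical illustrations.

def extract(sources):
    """Pull raw records from heterogeneous source systems."""
    rows = []
    for source in sources:
        rows.extend(source)            # each source is a list of dicts
    return rows

def cleanse(rows):
    """Drop records lacking a unique patient id and remove duplicates."""
    seen, clean = set(), []
    for r in rows:
        pid = r.get("patient_id")
        if pid is None or pid in seen:
            continue                   # invalid or duplicate record
        seen.add(pid)
        clean.append(r)
    return clean

def transform(rows):
    """Convert cleansed records into the common CDW format."""
    return [{"patient_id": r["patient_id"],
             "symptom": r.get("symptom", "unknown").lower()}
            for r in rows]

def load(warehouse, rows):
    """Append new rows; update rows whose key already exists (upsert)."""
    for r in rows:
        warehouse[r["patient_id"]] = r
    return warehouse

hospital_a = [{"patient_id": 1, "symptom": "Cough"}]
hospital_b = [{"patient_id": 1, "symptom": "Cough"},   # duplicate record
              {"patient_id": 2}]                        # missing symptom
cdw = load({}, transform(cleanse(extract([hospital_a, hospital_b]))))
print(len(cdw))                        # 2 patients survive cleansing
```

The same staging-area structure carries over however the individual stages are implemented: each stage takes the previous stage's output, so quality checks are applied before any data reaches the warehouse.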

Extraction process

The extraction process is responsible for extracting data from various heterogeneous data sources. The ECTL process requires connecting to the source systems and selecting the relevant cancer data needed for disease-type analytical processing and research within the CDW. The data is extracted from numerous disparate source systems, and each of these data sources has its own distinct set of characteristics that must be managed in order to extract data effectively for the ECTL process. Furthermore, the complexity of the extraction process depends on the data characteristics and attributes, the amount of source data and the processing time. The ECTL process therefore needs to integrate technology that extracts these data effectively. Handling the extraction process and its challenges requires the following, to ensure the subject orientation of the CDW:
1. Analyzing the cancer data sources in order to comprehend their structure and contents, to understand the disease data that exists in the source databases, and to identify the relevant source data needed for the purpose of the CDW. The selection of these data requires:
   a. Identifying the source systems that contain the required data, and identifying the quality and scope of each cancer data source.
   b. Understanding the format of the disease data stored by each source, to determine whether all the data needed to fulfil the requirements is available, and whether the required data fields are populated properly and consistently.
   c. Identifying the attributes contained in each data source.
2. Determining the options for extracting data from the source systems, which include update notification, incremental extracts and full extracts, to capture only the changes in the source files.
3. Determining the protocols for data transfer.

4. Determining the encryption standards that need to be set with each of the source systems.
5. Monitoring data transfer failures and errors, and issuing notifications through different methods such as control files, metadata files, system log writing and file-system log writing.

Cleansing process

Data cleansing is one of the most important issues in the ECTL process, as it ensures the quality of the data in the DW. Data cleansing deals with detecting and removing errors and inconsistencies from the cancer disease data in order to improve data quality. The data cleansing phase involves three steps: data analysis, data refinement and data verification. The objective of data analysis is to identify and detect data issues. Data problems include completeness, validity, accuracy, consistency, conformity and integrity. For each problematic area, the data quality issues and acceptance criteria are identified; then, for each data quality issue, a solution is developed. The data with quality issues is refined using data cleansing methods to realize its full benefit. The cleaned data is then assessed against the acceptance criteria again, to ensure that the issues have been resolved by the cleansing process. Finally, after verification, the data is moved from the staging area to the CDW. The aim of the data cleansing process is therefore to clean and conform the extracted data so as to obtain accurate data of high quality. Handling the cleansing process and its challenges in the CDW requires the following:
- Understanding data quality problems from a medical perspective.
- Cleaning the extracted data set according to the required medical rules.

- Ensuring all the requisite information is available, free from errors and in a usable state.
- Ensuring the data collected is relevant to the medical purpose.
- Providing the ability to link related records together, to ensure the data is consistent in format.
- Ensuring the data satisfies a set of constraints and is maintained in a consistent fashion, so that data values are consistent across data sets.
- Ensuring all patient basic-information records contain a unique patient identification number for each patient, with auto-generated primary keys and cross-reference tables.

Transformation process

The transformation process converts the extracted data into a common format by applying a set of conditions, rules or functions. The transformation phase performs multiple data manipulations on the incoming data, according to medical needs, to ensure that the data loaded into the CDW is integrated and accurate. The transformation process requires joining the data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values and applying advanced validation rules by defining the granularity of the fact tables and the dimension tables. In the medical field, this very complex transformation must meet the following requirements of the targeted system:
a. Understanding the format of the data stored by each source, to determine whether all the data fulfils the requirements.
b. Working out a way of mapping the fields of external and internal data sources to the CDW fields.
c. Transforming and coding the medical data into the content format required for CDW storage.

d. Providing the amount of manipulation needed by the transformation process according to the objective of the CDW, such as summarization, integration and aggregation, using different techniques according to the requirement specifications.
e. Providing a suitable data model to allow querying by multiple dimensions.

Loading process

The loading process loads data from the staging area into the CDW. The extracted and transformed data is written into the dimensional structure that is actually accessed by the end users and applications. A major data loading problem is the ability of the ECTL process to discriminate between new and existing data at loading time: new rows need to be appended, while rows that already exist need to be updated. Handling the loading process and its challenges requires the following, to ensure that loading performs correctly with few resources:
a. The ability of the ECTL process to provide the desired latency in updating the data set.
b. The ability of the ECTL process to discriminate between new and existing data at loading time: new rows are appended, and rows that already exist are updated.
c. Information that is up to date, or provided at the time specified (data tagged with a time).

Cancer Data Warehouse Elements Evolution

The framework is composed of the development environment and the user (patient) environment. The development environment houses the data warehouse metadata repository and other components, which will be described later; the ECTL processes and change processing are conducted here. The metadata management tool incorporates the CDW analysis tool, which is used by an administrator or developer

to design a data warehouse schema and specify ECTL processes. The metadata management tool maintains the static part of the mapping repository within the metadata repository, where the metadata of the latest data warehouse version and the mappings defined by the logic of the ECTL processes are stored.

Objectives of dimensional modeling

There are two major differences between operational databases and data warehouses:
- End-user access: In a data warehousing environment, users (medical admin) write queries directly against the database structure, whereas in an operational environment users generally access the database only through an application-system front end. In a traditional application system, the structure of the database is invisible to the user.
- Read-only: Data warehouses are effectively read-only databases; users can retrieve and analyze data but cannot update it. Data stored in the data warehouse is updated via batch extract processes.

Dimensional modeling

The type of analysis that will be done with the data warehouse can determine the type of model and the model's contents. Because query, reporting and multidimensional analysis require summarization and explicit metadata, it is important that the model contain these elements. Multidimensional analysis also usually entails drilling down and rolling up, so these characteristics need to be in the model as well. A clean and clear data warehouse model is a requirement; otherwise end-user tasks become too complex, and end users stop trusting the contents of the data warehouse and the information drawn from it because of highly inconsistent results. Data mining, however, usually works best with the lowest level of detail available. Thus, if the data warehouse is used for data mining, a low level

of detail data should be included in the model. The objective of dimensional modeling is to produce database structures that are easy for end users to understand and to write queries against. A secondary objective is to maximize the efficiency of queries. It achieves these objectives primarily by minimizing the number of tables and the relationships between them, which reduces the complexity of the database and minimizes the number of joins required in user queries.

Cancer Data Warehouse Architecture (with a Staging Area and Data Mart)

The CDW customizes the warehouse's architecture for different groups within the organization. It does this by adding data marts, which are designed for a particular line of transactions.

Figure 3.11 Cancer Data Warehouse Architecture (with a Staging Area and Data Mart)

An example is where patient entry, treatment process and medicine/drug availability are separated, listed as cancer symptoms, diagnosis recommendation and decision system. In this example, a disease symptom analysis

might analyze historical data for the cancer disease, diagnosis results, patient history and behavior.

Data mart

A data mart is a logical subset of an enterprise-wide data warehouse. For example, a data warehouse can be constructed incrementally from individual, conformed data marts dealing with separate subject areas such as cancer symptoms. Dimensional data marts are organized by subject area, such as patients and medical consultants, and by coordinated data category, such as patients, diagnosis and treatment labs, and medical consultants. These flexible information stores allow the data structures to respond to medical changes in product line, new patients, responsibilities, mergers, consolidations and acquisitions. Three different patterns, or informal models, of data mart development have appeared. The first views data marts as subsets (often somewhat or highly aggregated) of the data warehouse, sited on relatively inexpensive computing platforms that are closer to the user, and periodically updated from the central data warehouse. In this view, the cancer data warehouse is the parent of the data mart. The second pattern of development bypasses the data warehouse and sees the data mart as independently derived from the islands of information that predate both data warehouses and data marts. The data mart uses data warehousing techniques of organization and tools; structurally, it is a data warehouse, just a smaller one with a specific medical function. Moreover, its relation to the cancer data warehouse turns the first pattern of development on its head: here, multiple data marts are the parents of the cancer data warehouse, which evolves from them organically. The third pattern of development attempts to synthesize the first two and remove their inherent conflict. Here, data marts are seen as developing in parallel with the cancer data warehouse.
Both develop from islands of information, but data marts do not have to wait for the cancer data warehouse to be implemented.

It is enough that each data mart is guided by the enterprise data model developed for the data warehouse, and is developed in a manner consistent with that model. The data marts can then be finished quickly.

a. Top Down Approach

A top-down progression requires more planning and design work to be completed at the beginning of the Cancer Data Warehouse Development (CDWD) process. This brings the need to involve people from each of the analytic groups, departments, or lines of cancer diagnosis access that will participate in the data warehouse implementation. Decisions concerning the data sources to be used, security, data structure, data quality, data standards and an overall data model typically need to be completed before actual implementation begins. Top-down implementation can also imply more need for an enterprise-wide or organization-wide data warehouse, with a higher degree of cross-group access to the data (for example by cancer type, department-wise treatment process or line of cancer diagnosis). With this approach, it is more typical to structure a global data warehouse. If data marts are included in the configuration, they are typically built afterwards, and are more typically populated from the global data warehouse rather than directly from the operational or external data sources.

Figure 3.12 Top down approach - with a Staging Area and Data Mart

A top-down development can result in more consistent data definitions and the enforcement of diagnosis rules across the cancer institute. However, the cost of the initial planning and design can be significant; it is a time-consuming process that can delay actual implementation, benefits, and easy identification of disease features. For example, it is difficult and time-consuming to determine, and to get agreement on, the data definitions and process rules among all the different analytic cancer-type groups, treatment-wise departments, and lines of diagnosis factors that participate. Developing a global data model is also a lengthy task, and in many hospitals management is becoming less and less willing to accept these delays. Here, the data warehouse model of the cancer disease data is created first, and the data marts are then developed from the cancer data warehouse. The cancer disease-type data that progresses into the data marts built after the top-down approach comprises cancer symptoms, diagnosis recommendation and decision system. The ETT (extract, transform, transport) process moves cancer data into the cancer data warehouse and also creates the cancer data marts from the cancer data set, which contains the overall records to be extracted according to the prescribed dimensional features. A dimensional feature depicts a variable of the cancer disease attributes that can be extracted to form the data warehouse, so that the data warehouse analyzes the cancer data and computes information about the disease diagnosis and symptom-oriented result features, transforming and transporting this extracted data into the cancer data warehouse model. Thus the ETT process extracts cancer data into the data marts, rather than the data marts operating directly from the operational or external data sources.

b.
Bottom Up Approach

A bottom-up progression involves planning and designing data marts without waiting for a more global infrastructure to be put in place. This does not mean that a more global infrastructure will not be developed; rather, it will be built incrementally as the initial data mart implementations expand. This approach is more

widely accepted today than the top-down approach, because immediate results from the data marts can be realized and used as justification for expanding to a more global implementation. Along with its positive aspects, the bottom-up approach brings some considerations. For example, as more data marts are created, data redundancy (e.g., the same patient information captured in the district hospital and inserted again in the central hospital) and inconsistency between the data marts can occur. With careful planning, monitoring and design guidelines, this redundancy can be minimized. Multiple data marts may also bring an increased load on operational systems, because more data extract operations are required. Integration of the data marts into a more global environment, if that is the desire, can be difficult unless some degree of planning has been done. Some rework may also be required as the implementation grows and new issues are uncovered that force changes to existing areas of the implementation. In contrast to the top-down approach, data marts can be built before, or in parallel with, a global data warehouse. As the figure shows, data marts can be populated either from a global data warehouse or directly from the operational or external data sources. After the data marts have been generated, the ETT process builds the data marts into the cancer data warehouse.

Figure 3.13 Bottom-Up approach - with a Staging Area and Data Mart

The bottom-up approach has become the choice of many hospitals, especially of medical management, because of the faster payback. It enables faster results because data marts have a less complex design than a global data warehouse.

3.4 MULTIDIMENSIONAL CANCER DATA WAREHOUSE

The multidimensional data model is based on key concepts such as cube, dimension and hierarchy. This model allows its users to view and extract results from the cancer data warehouse in several different ways. The OLAP structure comprises cubes, dimensions, measures, hierarchies and levels. A data cube is a multidimensional data storage unit. Habits, risk factors and symptoms are a few of the dimensions of the cancer data warehouse. Measures are the facts that are to be analyzed. This cancer data warehouse contains a fact table named Medical that has Patient_id as primary key; Habit_id, Risk_factor_id, Symptom_id, Blood group_id and Cancer_id as foreign keys representing the other dimensions; and Treatment, Diagnostic status, Districts and Date of Entry as measures. If a medical analyst analyzes the quantity of drugs required for a month, the diagnostic status of patients in that month will be the measure and the cancer cube will be the dimension.

Multidimensional Cancer Data Cube

The model of data analysis underlying the design is based on several layers: data integration, querying, analyzing, visual presentation and interactive exploration. A trivariate setting already presents all the challenges of the multidimensional approach. Therefore, for easier exposition, we restrict our attention to the trivariate setting and, when required, point out any specific attributes of models with more than three factors. Here, the multidimensional model is mainly concerned with modeling the spatial distribution of risks for several combinations of two factors.
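To make the fact-and-measure idea above concrete, a small sketch (invented records, not drawn from the actual cancer data set) can aggregate one measure of the fact table along the time dimension:

```python
# Hypothetical fact rows: dimension keys plus measures, simplified.
fact_medical = [
    {"patient_id": 1, "symptom_id": 10, "month": "2014-01", "diagnostic_status": 1},
    {"patient_id": 2, "symptom_id": 10, "month": "2014-01", "diagnostic_status": 0},
    {"patient_id": 3, "symptom_id": 11, "month": "2014-02", "diagnostic_status": 1},
]

# Aggregate the measure (diagnosed count) along the time dimension.
from collections import Counter
diagnosed_per_month = Counter()
for row in fact_medical:
    diagnosed_per_month[row["month"]] += row["diagnostic_status"]

print(dict(diagnosed_per_month))   # {'2014-01': 1, '2014-02': 1}
```

The same pattern extends to any dimension key (symptom, habit, district): the measure columns are summed or averaged while the chosen dimension columns act as the grouping keys.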

The general factor in this setting will always be the patient details, while one of the other major factors will usually be the disease symptoms (from a set of diseases) together with the disease diagnosis/treatment; the remaining factor may either be unstructured, such as sex or race, or structured in some way, such as time period, age group or marital status. Data integration means that an integrated view over the different kinds of data has to be provided. Health data, e.g. cancer cases with a spatio-temporal relation, has by itself a complex structure with additional statistical aspects. Every data item, a cancer case, is described by the location of occurrence (the geographic coordinates) and the time the event occurred (the date of diagnosis). In the process of analysis, this item has to be combined with patient behavior data.

Figure 3.14 Multidimensional Cancer Data Warehouse - with dimensions X, Y, Z

Multidimensional analysis has become a popular way to extend the capabilities of query and reporting: rather than submitting multiple queries, the data is structured to enable fast and easy access. Dimensions can have individual entities or a hierarchy of entities, such as region, storage and sector. Multidimensional analysis enables users to look at a large number of interdependent factors involved in a medical problem and to view the data in complex relationships. End users are interested in exploring the data at different levels of detail, which is determined dynamically.
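Exploring the data at different levels of detail amounts to rolling measures up a dimension hierarchy. A minimal sketch, assuming a made-up district-to-region hierarchy (the names are illustrative, not from the thesis data):

```python
# Roll patient counts up an assumed District -> Region hierarchy.
district_counts = {"DistrictA": 40, "DistrictB": 25, "DistrictC": 35}
region_of = {"DistrictA": "North", "DistrictB": "North", "DistrictC": "South"}

region_counts = {}
for district, count in district_counts.items():
    region = region_of[district]                      # climb one level
    region_counts[region] = region_counts.get(region, 0) + count

print(region_counts)   # {'North': 65, 'South': 35}
```

Drilling down is the inverse: navigating from the region totals back to the stored district-level cells.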

Cancer data aggregation architecture

Our multidimensional data warehouse now confronts us with a third inference problem. Aggregation is the central means of summarizing and condensing the information contained in the various sources. It occurs (1) when integrating data from the sources, (2) when building views for the data marts, and (3) in ad hoc queries. As queries to the sources or to larger views are far more expensive than those to smaller views, we are confronted with a new problem: given a query involving aggregation and a (materialized) view, can this query be computed using (the aggregations contained in) this view? This depends on whether the aggregations contained in the view are still fine-grained enough to compute the aggregations required by the query. A data warehouse conceptual schema may contain detailed descriptions of the structure of aggregates, but it may not explicitly include aggregation functions. A value in a single cell may represent an aggregated measure computed from more specific data at some lower level of the same dimension. Aggregation involves computing aggregation functions (according to the attribute hierarchy within dimensions, or to cross-dimensional formulas) for one or more dimensions. For example, the value 100,000 for the patient report in year 2014 may have been consolidated as the sum of the disaggregated weekly (or day-by-day) patient entries. Another example, introducing an aggregation grounded on a different dimension, is the rate of entries (e.g., patients with or without cancer) as a sum over the rates of all patient diagnoses. The relational database tables therefore contain patient records (or rows).
Each record consists of fields (or columns): Age, Gender, Marital Status; Age at Marriage, Age of 1st Child, Age of Last Child, No. of Children; Occupation History, District, State, Education, Annual Income; Habits, Occupation Hazards, Diet, Fast-food Addiction, Family History of Cancer, Relationship with Cancer Patient, Weight Loss, Anemia, Earlier Cancer Diagnosed, Symptoms 1, Symptoms 2, Symptoms 3, Symptoms 4, Symptoms 5; and Blood Group, an influencing factor for the major factor, disease diagnosis. In a normal relational

database, a number of fields in each record (keys) may uniquely identify each record. In contrast, a multidimensional database contains n-dimensional arrays (sometimes called hypercubes, or cubes), where each dimension has an associated hierarchy of levels of consolidated data.

Figure 3.15 Dimension X (Variables)
Figure 3.16 Dimension Y (Variables)
Figure 3.17 Dimension Z (Variables)

We begin with the Entity-Relationship conceptual data model, which represents the structure of the aggregations. A conceptual schema will thus be able to describe the abstract properties of multidimensional cubes, their interrelationships and, most notably, their components. In order to support multiple hierarchies, the data model must provide means of defining and structuring these hierarchies, and of performing arbitrary aggregation along them. A conceptual data model in which both multidimensional aggregations and multiple hierarchically organized dimensions can be abstracted and described provides support for query languages over multidimensional data models. In fact, in the few attempts where a data cube

introduces the notion of multiple dimensions and of dimension levels (x, y, z) within those dimensions, the data warehouse conceptual schema could serve as a reference meta-model for deriving the interrelations among levels and dimensions.

Multidimensional Star Schema

The basic building block used in dimensional modeling is the star schema. A star schema consists of one large central table, called the fact table, and a number of smaller tables called dimension tables. The fact table forms the centre of the star, while the dimension tables form its points. A star schema may have any number of dimensions. The fact table contains measurements (e.g. patient history, risk factor, cancer, symptoms, treatment and diagnosis) which may be aggregated in various ways. The dimension tables provide the basis for aggregating the measurements in the fact table. The fact table is linked to all the dimension tables by one-to-many relationships, and its primary key is the concatenation of the primary keys of all the dimension tables; in this example, Patient details, History of Patient, Risk_factor, Cancer, Symptoms, Treatment and Diagnosis. Dimension tables are often highly denormalized and generally contain embedded hierarchies. Patient, which represents a single dimension in the star schema above, consists of three independent hierarchies when they are normalized out. The advantage of using star schemas to represent data is that they reduce the number of tables in the database and the number of relationships between them, and therefore the number of joins required in user queries. The First Principles approach is based on an analysis of user query requirements. It begins by identifying the relevant facts that need to be aggregated and the dimensional attributes to aggregate by, and forms star schemas from these. It results in a data warehouse design that is a set of discrete star schemas.

However, there are a number of practical problems with this approach:
- User analysis requirements are highly unpredictable and subject to change over time, which provides an unstable basis for design.
- It can lead to incorrect designs if the designer does not understand the underlying relationships in the data.
- It results in loss of information through premature aggregation, which limits the ways in which the data can be analyzed.
- The approach is presented by examples rather than via an explicit design procedure.

Figure 3.18 Star Schema representing Cancer Data Warehouse

Fact Table

The fact table that describes the subject matter is named fact_medical. It consists of foreign keys (Patient_id, Date of Entry, Habit_id, Risk_factor_id, Symptoms_id, Cancer_id, Treatment, Blood group_id, Diagnose status and Districts) that relate to the dimension tables, and measures (Type of Cancer, Severity of Cancer, and Method of Diagnosis).

Dimension Tables

Six dimension tables detail each entity: the Patient, Habits, Risk_factor, Symptoms, Blood_factor and Cancer tables.

Table 3.1 Dimension Tables

TABLE NAME: TABLE DESCRIPTION
Dim Patient: A table that stores patient information, such as patient name, gender, age, address, phone, etc. The data is used to show demographic data for the cancer disease.
Dim Habits: A table that stores the full history of the patient's habits before and after patient entry: active smoking, passive smoking, alcohol, hot beverages, fast-food addiction, etc.
Dim Symptoms: A table that stores all symptoms of the related cancer types: Oral Cancer symptoms 1-5, Lung Cancer symptoms 1-5, Breast Cancer symptoms 1-5, Stomach Cancer symptoms 1-5, Cervix Cancer symptoms 1-5, Blood Cancer symptoms.
Dim Cancer: A table that stores the patient's cancer status and risk scores; the severity of cancer is based on the risk score attained for the cancer type, the method of diagnosis, and the treatment process such as chemotherapy, radiotherapy, etc.
Dim Risk_factor: A table that stores family history of cancer, weight loss, anemia, relationship with a cancer patient, etc.
Dim Blood_factor: A table that stores each patient's blood group.
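An illustrative subset of this star schema can be expressed directly in SQL. The sketch below (using Python's built-in sqlite3, with only two of the six dimension tables and simplified, assumed column names) shows the fact table keyed on concatenated dimension keys and a one-join star query aggregating a measure:

```python
# Illustrative subset of the fact_medical star schema; column names
# are simplified assumptions, not the thesis's exact field list.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_patient  (patient_id INTEGER PRIMARY KEY, name TEXT, district TEXT);
CREATE TABLE dim_symptoms (symptom_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE fact_medical (
    patient_id INTEGER REFERENCES dim_patient(patient_id),
    symptom_id INTEGER REFERENCES dim_symptoms(symptom_id),
    severity   INTEGER,                      -- measure
    PRIMARY KEY (patient_id, symptom_id)     -- concatenated dimension keys
);
""")
con.execute("INSERT INTO dim_patient VALUES (1, 'A', 'North'), (2, 'B', 'South')")
con.execute("INSERT INTO dim_symptoms VALUES (10, 'cough')")
con.execute("INSERT INTO fact_medical VALUES (1, 10, 3), (2, 10, 5)")

# Star join: one join per dimension needed, aggregating the measure.
rows = con.execute("""
    SELECT p.district, AVG(f.severity)
    FROM fact_medical f JOIN dim_patient p ON f.patient_id = p.patient_id
    GROUP BY p.district ORDER BY p.district
""").fetchall()
print(rows)    # [('North', 3.0), ('South', 5.0)]
```

Note how the query touches only the dimensions it groups by: this is the join-minimizing property of the star layout discussed above.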

Multidimensional Data Structure

A multidimensional view of the data is important when designing front-end tools, database designs and query engines for OLAP. The modeling technique named the star schema is used to represent the multidimensional data; it is adopted here mainly because of its clarity, convenience and rapid indexing. Concisely, a star schema can be defined as a specific type of database design used to support analytical processing, which includes a specific set of denormalized tables. The star schema here contains seven tables designed as a multidimensional star schema: a central table called the fact table (Medical), and six other tables that link directly to it, called the dimension tables. In general, the fact table contains the keys and the measurements.

3.5 OLAP TECHNIQUES - CLUSTERING ANALYSIS

In the multidimensional data warehouse, data is organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. Using these hierarchies, different types of OLAP operations are possible. Online analytical processing (OLAP) systems must cope with huge volumes of patient data while allowing the short response times needed for interactive usage. They must also be able to scale, meaning they are easily extensible as the accumulated data volumes increase. Furthermore, the requirement that the data analyzed be up to date is becoming more and more important. These requirements, however, run counter to the performance needs of day-to-day transaction processing. Most OLAP systems nowadays are kept separate from mission-critical systems, which means they offer a compromise between up-to-dateness, that is, freshness of data, and query response time.
The data needed is propagated into the OLAP system on a regular basis, preferably when it does not slow down day-to-day medical operations, for example during nights or weekends. OLAP users then have no alternative but to analyze stale data.
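The OLAP operations that these concept hierarchies enable can be sketched over a toy cube (invented cell values, not the actual warehouse contents): a slice fixes one dimension, and a roll-up then aggregates the remaining cells:

```python
# Toy cube: (cancer_type, district, year) -> patient count. Invented data.
cube = {
    ("Oral", "North", 2013): 12, ("Oral", "North", 2014): 15,
    ("Oral", "South", 2014): 9,  ("Lung", "North", 2014): 20,
}

# Slice: fix one dimension (year = 2014).
slice_2014 = {k: v for k, v in cube.items() if k[2] == 2014}

# Roll-up: aggregate the sliced cells over district, keeping cancer_type.
rollup = {}
for (cancer, _district, _year), count in slice_2014.items():
    rollup[cancer] = rollup.get(cancer, 0) + count

print(rollup)   # {'Oral': 24, 'Lung': 20}
```

Dice (fixing several dimensions) and drill-down (returning to the finer cells) follow the same pattern of filtering and regrouping the cube's cells.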

Since OLAP systems can present a general description of the information in data warehouses, OLAP functions serve user-specified summary and comparison. These are the basics of the data mining functionalities, which on a larger scale include, under the broad categorizations of descriptive and predictive data mining, association, classification, prediction, time-series analysis and other data analysis tasks. A concept refers to a collection of data about, for example, patients with cancer disease. Information processing based on queries can find useful information; however, the answers to such queries reflect only the information that is directly stored in the database or, at most, computable by aggregate functions. Concept description generates descriptions for the characterization and comparison of data. Given the large amount of data stored in a database, it is useful to be able to describe its characteristics in the most concise manner that covers most of the data. Data from a set of databases, data warehouses, spreadsheets or other information repositories forms the first tier. A cancer data warehouse or multidimensional cancer data warehouse server is then responsible for fetching the relevant data from the database based on the user's mining request. A knowledge base supports the data mining engine that processes the user queries. This is the domain knowledge that guides the search or evaluates the resulting patterns for knowledge. It can include concept hierarchies, which are used to organize attributes and attribute values into different levels of abstraction. Domain knowledge can also include additional constraints and threshold values, as well as metadata describing the data from multiple heterogeneous sources. Allowing generalizations of the data at each level enables the user to examine the data at different levels of abstraction.
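Generalizing an attribute along such a concept hierarchy can be sketched as follows; the age-group boundaries here are invented for illustration, not taken from the thesis:

```python
# Generalize a low-level attribute (age) to a higher conceptual level
# (age group) using a simple concept hierarchy. Boundaries are assumed.
def age_group(age):
    if age < 20:
        return "young"
    if age < 60:
        return "adult"
    return "senior"

ages = [14, 35, 47, 72]
generalized = [age_group(a) for a in ages]
print(generalized)   # ['young', 'adult', 'adult', 'senior']
```

Replacing many distinct low-level values with a few high-level concepts is what makes the resulting descriptions concise while still covering most of the data.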
Data generalization is a process that abstracts a large set of task-relevant data from a low conceptual level to higher conceptual levels.

On-Line Analytical Processing

Data mining methods offer solutions that help manage data and information overload and build knowledge for information systems and decision support in the treatment, prevention and detection areas of health care. Applying data mining techniques enhances the creation of untapped useful knowledge from large medical datasets. The increasing use of these techniques can be observed in healthcare applications

that support decision making, e.g., in patient and treatment outcomes; in healthcare delivery quality; in the development of clinical guidelines; in the allocation of medical resources; and in the identification of drug therapeutic or adverse-effect associations. Data mining techniques applied to cancer data have therefore focused on feature extraction from diagnostic results to detect and classify the disease analysis factors. Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint. Clustering is usually accomplished by determining the similarity among the data on predefined attributes; the most similar data are grouped into clusters. Cluster analysis gathers observation points into clusters or groups such that (1) each observation point within a group is similar, that is, cluster elements are of the same nature or close in a certain characteristic, here based on symptoms of cancer; and (2) observation points in different clusters differ, that is, clusters are distinct from one another. The k-means algorithm is a partitioning method that iteratively adjusts the clusters. A special type of clustering is called segmentation: a database is partitioned into disjoint groupings of similar tuples called segments. Segmentation is often viewed as being identical to clustering; in other circles it is viewed as a specific type of clustering applied to a database itself.

OLAP Operations of Medical Cancer Data Warehouse

OLAP is performed on the cancer data warehouse or on cancer disease data marts. The primary goal of OLAP is to support the ad hoc queries needed by a decision support system. The multidimensional view of cancer data is fundamental to OLAP functions.
OLAP is a practical view, not a data structure or schema. The complex nature of the OLAP process requires a multidimensional view of the cancer data. The OLAP operations in the Multidimensional Cancer Data Warehouse (MCDW) are:
1. Roll-up
2. Drill-down
3. Slice and Dice
4. Pivot

1. Roll-up (drill-up): Roll-up is performed by climbing up the hierarchy of a dimension or by dimension reduction (reducing the cube by one or more dimensions). A roll-up on the location dimension, for example, is equivalent to grouping the data by country. Roll-up operations do not remove any events but change the level of granularity of a particular dimension. For example, the types of disease symptoms in dimension Z, {types} = {{symptoms 1}, {symptoms 2}, {symptoms 3}, {symptoms 4}, {symptoms 5}}, are rolled up into {types} = {{symptoms 1, symptoms 2, symptoms 3, symptoms 4, symptoms 5}}. Similarly, in dimension Y, Gender {types} = {{Male}, {Female}, {Others}} is rolled up into Gender {types} = {{Male, Female, Others}}.

Figure 3.19 Original view of Cancer Data Warehouse
Figure 3.20 Roll-Up View
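The roll-up over the symptom hierarchy can be sketched in plain Python. The fact records and the measure (patient count) below are illustrative assumptions:

```python
from collections import defaultdict

# Illustrative fact records: (gender, symptom type, patient count).
facts = [
    ("Male", "symptoms 1", 12), ("Male", "symptoms 2", 7),
    ("Female", "symptoms 1", 9), ("Female", "symptoms 3", 4),
]

def roll_up(facts, keep):
    """Aggregate the measure over every dimension NOT listed in `keep`."""
    cube = defaultdict(int)
    for gender, symptom, count in facts:
        key = tuple(v for v, dim in [(gender, "gender"), (symptom, "symptom")]
                    if dim in keep)
        cube[key] += count
    return dict(cube)

# Roll up dimension Z (symptoms): all symptom types collapse into one group.
print(roll_up(facts, keep={"gender"}))   # {('Male',): 19, ('Female',): 13}
# Dimension reduction to the apex: no dimensions kept at all.
print(roll_up(facts, keep=set()))        # {(): 32}
```

No events are removed; the same patient counts are merely re-summed at a coarser granularity, which is the defining property of roll-up noted above.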

2. Drill-down (roll-down): Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data by
o stepping down a concept hierarchy for a dimension, or
o introducing additional dimensions.
Drill-down operations do not remove any events but change the level of granularity of a particular dimension. For example, before drilling down, the patient Treatment Duration is TD {monthly/yearly} = {Jan-2010, Jan-2011, Jan-2012}; after drilling down, TD {monthly/yearly} = {Jan-2010, Jan-2011, Feb-2011, March-2011, ..., Nov-2011, Dec-2011, Jan-2012}.

Figure 3.19 Original view of Cancer Data Warehouse
Figure 3.21 Drill-down View

3. Slice and Dice: The slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube.

The slice operation produces a sliced OLAP cube by allowing the analyst to pick a specific value for one of the dimensions. For example, when slicing is performed on the cancer data warehouse along dimension Y (Age), the Age dimension is fixed to the selected value while the remaining dimensions, X for Habits (Smoking, Chewing, Alcohol, Hot beverage) and Z (Weight loss, Anemia, Symptoms 1-5, Family History of Cancer), stay the same. In effect, only the selected Age factor is analyzed in relation to habits and disease symptoms. The dice operation defines a sub-cube by performing a selection on two or more dimensions; it produces a sub-cube by allowing the analyst to pick specific values for multiple dimensions. For example, one could dice on Treatment Duration 2010 and 2011, Gender (Male, Female), and the disease factor Anemia. No dimensions are removed: only the Treatment Duration values 2010 and 2011 are kept, the data are analyzed gender-wise, and in the Z dimension the disease factors Anemia and/or Weight Loss are considered.

Figure 3.19 Original view of Cancer Data Warehouse
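Slice and dice reduce to predicates over the dimension values. The patient records below are illustrative assumptions, not warehouse data:

```python
# Illustrative patient facts: each record carries its dimension values.
facts = [
    {"age": "41-60", "gender": "Male",   "year": 2010, "anemia": True},
    {"age": "21-40", "gender": "Female", "year": 2011, "anemia": False},
    {"age": "41-60", "gender": "Female", "year": 2011, "anemia": True},
    {"age": "61-80", "gender": "Male",   "year": 2012, "anemia": True},
]

def slice_cube(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [f for f in facts if f[dim] == value]

def dice_cube(facts, **selections):
    """Dice: restrict two or more dimensions to sets of allowed values."""
    return [f for f in facts
            if all(f[dim] in allowed for dim, allowed in selections.items())]

print(len(slice_cube(facts, "age", "41-60")))                    # 2
print(len(dice_cube(facts, year={2010, 2011},
                    gender={"Male", "Female"}, anemia={True})))  # 2
```

The slice keeps a single value of one dimension; the dice keeps sets of values on several dimensions at once, exactly as in the Treatment Duration, Gender and Anemia example above.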

Figure 3.22 Slice View
Figure 3.23 Dice View

4. Pivot: Pivot is a visualization operation that rotates the data axes in the view in order to provide an alternative presentation of the data; it may also remove a measure.

Figure 3.19 Original view of Cancer Data Warehouse
Figure 3.24 Pivot View

For example, the data for Marital Status, Diet, and the habit of Hot Beverage are obtained from the input cube, and querying is allowed only for specific measures. Projection is evaluated by removing a measure from the sub-cube query.
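Pivoting is essentially a rotation of the presentation axes. The small gender-by-year cross-tabulation below is an illustrative assumption:

```python
# A small gender-by-year cross-tab of patient counts (illustrative values).
table = {("Male", 2010): 5, ("Male", 2011): 8,
         ("Female", 2010): 6, ("Female", 2011): 4}

def pivot(table):
    """Rotate the axes: (row, col) keys become (col, row) keys."""
    return {(c, r): v for (r, c), v in table.items()}

rotated = pivot(table)
print(rotated[(2010, "Male")])    # 5
print(rotated[(2011, "Female")])  # 4
```

No cell values change; only the orientation of the presentation does, which is why pivot is classed as a visualization operation rather than an aggregation.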

Analytical Clustered Cancer Data Warehouse

Clustering is a process of separating a dataset into subgroups according to their unique features; here it separates the dataset into data relevant and non-relevant to the cancer data types. The aim of clustering is to group objects or data into a number of categories or classes where each class contains objects with identical features. More precisely, clustering is the process of grouping data objects into a set of disjoint classes, called clusters, so that objects within the same class have high similarity to each other, while objects in separate classes are more dissimilar. Clustering is an example of unsupervised classification. Classification refers to a procedure that assigns data objects to a set of predefined classes; unsupervised means that clustering does not rely on predefined classes and training examples while classifying the data objects. Thus, clustering is distinguished from pattern recognition and from the areas of statistics known as discriminant analysis and decision analysis, which seek to find rules for classifying objects from a given set of pre-classified objects. The main benefit of clustering is that data objects can be assigned to previously unknown classes. Here k is a positive integer representing the number of clusters. The pre-processed data is clustered using the k-means clustering algorithm with the value of k equal to S, where S = 6 clusters: each cluster contains the data relevant to one of the Lung, Stomach, Breast, Oral, Blood and Cervix cancer types, while the remaining, non-relevant data falls outside these clusters. For numerical datasets, the k-means clustering algorithm performs the following generic steps:
1. Insert the first k objects into k new clusters.
2. Calculate the initial k means for the k clusters.
3. For each object x,
   a. calculate the dissimilarity between x and the means of all clusters;
   b. insert x into the cluster C whose mean is closest to x.
4.
Recalculate the cluster means so that the dissimilarity between each cluster mean and the objects in its cluster is minimized.

5. Repeat steps 3 and 4 until no or few objects change clusters after a full pass over all the objects.

The method has known weaknesses: it often terminates at a local optimum rather than the global optimum; the number of clusters k must be specified in advance by the user; and the results depend on the order of the objects in the input dataset, as different orderings produce different results. The k-means method has been shown to be effective in producing good clustering results for many practical applications. However, because it generally converges at local optima, the quality of the end clustering results is affected; the algorithm is heavily dependent on the selection of the initial centroids, which are selected randomly at the beginning of the algorithm. We have developed a k-means clustering algorithm with mean-based initial centroids. This algorithm removes the limitation of the k-means algorithm terminating at local optima and makes it applicable to a wide variety of input datasets. The following is an example of k-means clustering in which the centroids are taken randomly. Suppose we have several objects, and for each object more than 15 attributes or features are collected. Our goal is to group these objects into k = 6 groups based on the primary feature of age grouping, so we calculate the distance between each cluster centroid and each object.

Table 3.2 Cancer Patient Different Age Level of Cluster Average
(For each of the six cancer types, Lung, Stomach, Blood, Breast, Oral and Cervix, the table lists the age levels at which the disease is identified most and least, together with the cluster average for each type.)
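The steps above can be sketched on a single numeric feature. This is a minimal sketch, with the mean-based initial-centroid idea approximated by placing the starting centroids at evenly spaced positions in the sorted data rather than choosing them at random; the age values are illustrative assumptions:

```python
# Sketch of k-means on one numeric feature (age), with deterministic,
# spread-based initial centroids instead of random ones.

def kmeans(values, k, iters=100):
    data = sorted(values)
    # Mean-based initialization: centroids at evenly spaced quantiles,
    # so they start near the data's mass instead of at random positions.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:  # step 3: assign each object to the closest centroid
            i = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[i].append(x)
        new = [sum(c) / len(c) if c else centroids[i]  # step 4: recompute means
               for i, c in enumerate(clusters)]
        if new == centroids:  # step 5: stop once no centroid moves
            break
        centroids = new
    return centroids, clusters

ages = [23, 25, 27, 44, 46, 48, 66, 68, 70]
centroids, clusters = kmeans(ages, k=3)
print([round(c) for c in centroids])  # [25, 46, 68]
```

Because the starting centroids are deterministic, repeated runs give the same clustering regardless of the input order, which addresses the order-dependence weakness noted above.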

Each cluster that is formed can be viewed as a class of objects from which rules can be derived. Clustering can also facilitate a taxonomy structure, that is, the organization of observations into a hierarchy of classes that groups similar events together. The cancer patient data contains many disease identification patterns; all patterns that occur in the dataset k times are collected. The patterns are then clustered based on distances computed by comparing the lists of individual patients that match the patterns. The clusters are formed such that all patients with the same patterns are identified, and the clusters also identify the relationships between the parameters shared by all the patients.

Integration of Cancer Data Warehouse

A multidimensional data warehouse is normally built to validate assumptions and to discover trends on large amounts of patient data using OLAP. MySQL emerged as a low-cost entry in a crowded DBMS market, but MySQL's selling point was never just, or even primarily, price: it was ease of use and administration, and suitability for important applications, in particular as the database back-end of choice for Web publishing. The technology now rivals that of established closed-source vendors for the broad set of traditional database applications, which we can classify as publishing, operational and analytical. MySQL is able to compete in these areas because its modular architecture allows one to choose the storage engine that performs best for a spectrum of needs:
o Cancer DB for transactional systems.
o CDW for analytical systems, including data warehouses and data marts.
o Memory, formerly known as Heap, for high-performance applications.
o The Cluster storage engine, for high availability and scalability.
o Archive, for efficient storage of large data volumes.
o Federated, providing local access to remote data tables.
o Merge, also known as CDW, which collects identical CDW tables for unified access.

The different engines share common administration and query interfaces, and MySQL even allows us to select engines on a table-by-table basis within a database. MySQL has a single version of SQL and a smart optimizer that insulates developers and users from technical details, letting them focus on delivering the best possible applications. Data warehouses are databases optimized for data analysis rather than for transaction processing. They are structured using dimensional modeling techniques (star schemas) to provide rapid responses to complex queries. Most data warehouses store textual and geospatial data in addition to numerical data. By including harmonized, cleansed metadata that describes the tables, fields and value sets, data warehouses are able to host diverse applications that range from structured reporting and performance dashboards to ad hoc query and intensive statistical data mining. Whether we are creating data warehouses, or data marts (datasets specially designed to respond to the analytical needs of particular users or applications), or both, MySQL's CDW engine provides the fast bulk and incremental data loading and indexing, and the reliability, needed to support large data volumes and diverse, complex queries. MySQL is also, of course, fully capable of supporting real-time data analysis that works directly off operational data stores managed with Cancer DB or another of MySQL's engines. Because MySQL provides a standard SQL implementation and application programming interfaces (APIs) usable across the complete set of engines, the user has the flexibility to run analyses off an appropriately structured data warehouse, data mart or operational data store.

Implementation of Multidimensional OLAP

From a conceptual standpoint, we contend that OLAP calls for the following four kinds of functionality:

Querying: the ability to pose powerful ad hoc queries through a simple and declarative interface.
Restructuring: the ability to restructure information in a multidimensional database, exploiting the dimensionality of the data and bringing out different perspectives of the data.

Classification: the ability to classify or group data sets in an appropriate manner for subsequent summarization.

Summarization/Consolidation: a generalization of the aggregate operators in standard SQL. In general, summarization maps multi-sets of values of a numeric type to a single, consolidated value.

Multidimensional On-Line Analytical Processing (MOLAP) does not rely on the relational model but instead materializes the multidimensional views. MOLAP can thus provide better performance with the materialized and optimized multidimensional views. However, MOLAP demands substantial storage for materializing the views and, due to the multidimensionality, is usually not scalable to large datasets. The multidimensional data structure of a multidimensional database is what we call an n-dimensional table. We wish to be able to see the values of certain attributes as a function of others, in whichever way suits us, exploiting the possibilities of multidimensional rendering. Drawing on the terminology of statistical databases, we can classify the attribute set associated with the schema of a table into two kinds: parameters and measures. There is no a-priori distinction between parameters and measures, in that any attribute can play either role. Multidimensional access methods, which are commonly used in spatial DBMSs, provide multidimensional clustering in order to efficiently answer multidimensional range queries. In combination with a suitable hierarchy encoding, these methods can be used to significantly speed up OLAP queries. Many attributes in relational MDBMSs in general, and in data warehouses in particular, have an actual domain consisting of a very small set of values. A typical example is the cancer-symptom attribute of the dimension table of cancer disease details, of

which has an actual domain of n values; however, a much longer character string is used to store the values. We call the data type of an attribute an enumeration type if its actual domain consists of a relatively small finite set of values. It makes sense to store pre-computed aggregates for the highest aggregation levels, with restrictions in only one dimension. Since not all possible aggregations can be stored in general, the MDW of cancer data allows one to derive many further aggregates efficiently from the raw data. The data cube is intended to identify the proposed multidimensional data and to support common OLAP tasks such as pie graphs. Even though such tasks are usually possible with standard SQL queries, the queries may become very complex. Therefore, the OLAP cube writer, using an SQL table or Excel, performs operations such as dimension creation and the measurement of cancer disease variables, and visualizes the data cube according to the cancer disease variables.

Figure 3.25 Dimension creation of a specific field from the Cancer Data Warehouse
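The saving promised by treating such a small-domain attribute as an enumeration type can be sketched as a simple dictionary encoding; the symptom labels below are illustrative assumptions:

```python
# Dictionary-encode a small-domain (enumeration-type) attribute:
# store each long label once and keep only small integer codes per row.

labels = ["weight loss", "anemia", "weight loss", "fever", "anemia",
          "weight loss"]

domain = sorted(set(labels))              # the small actual domain
code = {v: i for i, v in enumerate(domain)}
encoded = [code[v] for v in labels]       # one small integer per row

print(domain)   # ['anemia', 'fever', 'weight loss']
print(encoded)  # [2, 0, 2, 1, 0, 2]
```

Each row now stores a small integer instead of the full character string, and range or equality predicates over the attribute operate on the compact codes, which is what makes hierarchy encodings effective for speeding up OLAP queries.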

Figure 3.26 Measurement of cancer disease variables for Cancer data
Figure 3.27 Data Cube dimension of Age Factor of Cancer data

Figure 3.28 Data Cube dimension of Habit-Chewing factor of Cancer data
Figure 3.29 Data Cube dimension of patient Weight Loss of Cancer data


More information

DATA WAREHOUSING DEVELOPING OPTIMIZED ALGORITHMS TO ENHANCE THE USABILITY OF SCHEMA IN DATA MINING AND ALLIED DATA INTELLIGENCE MODELS

DATA WAREHOUSING DEVELOPING OPTIMIZED ALGORITHMS TO ENHANCE THE USABILITY OF SCHEMA IN DATA MINING AND ALLIED DATA INTELLIGENCE MODELS DATA WAREHOUSING DEVELOPING OPTIMIZED ALGORITHMS TO ENHANCE THE USABILITY OF SCHEMA IN DATA MINING AND ALLIED DATA INTELLIGENCE MODELS Harshit Yadav Student, Bal Bharati Public School, Dwarka, New Delhi

More information

Data Warehouse and Data Mining

Data Warehouse and Data Mining Data Warehouse and Data Mining Lecture No. 03 Architecture of DW Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Basic

More information

Data Mining. Associate Professor Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology

Data Mining. Associate Professor Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Data Mining Associate Professor Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology (1) 2016 2017 Department of CS- DM - UHD 1 Points to Cover Why Do We Need Data

More information

Application software office packets, databases and data warehouses.

Application software office packets, databases and data warehouses. Introduction to Computer Systems (9) Application software office packets, databases and data warehouses. Piotr Mielecki Ph. D. http://www.wssk.wroc.pl/~mielecki piotr.mielecki@pwr.edu.pl pmielecki@gmail.com

More information

Decision Support, Data Warehousing, and OLAP

Decision Support, Data Warehousing, and OLAP Decision Support, Data Warehousing, and OLAP : Contents Terminology : OLAP vs. OLTP Data Warehousing Architecture Technologies References 1 Decision Support and OLAP Information technology to help knowledge

More information

DATAWAREHOUSING AND ETL PROCESSES: An Explanatory Research

DATAWAREHOUSING AND ETL PROCESSES: An Explanatory Research DATAWAREHOUSING AND ETL PROCESSES: An Explanatory Research Priyanshu Gupta ETL Software Developer United Health Group Abstract- In this paper, the author has focused on explaining Data Warehousing and

More information

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended.

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended. Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your needs. The previews cannot be printed. TDWI strives to provide

More information

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data warehousing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 22 Table of contents 1 Introduction 2 Data warehousing

More information

What is a Data Warehouse?

What is a Data Warehouse? What is a Data Warehouse? COMP 465 Data Mining Data Warehousing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Defined in many different ways,

More information

CHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP)

CHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP) CHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP) INTRODUCTION A dimension is an attribute within a multidimensional model consisting of a list of values (called members). A fact is defined by a combination

More information

1Z0-526

1Z0-526 1Z0-526 Passing Score: 800 Time Limit: 4 min Exam A QUESTION 1 ABC's Database administrator has divided its region table into several tables so that the west region is in one table and all the other regions

More information

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management Management Information Systems Review Questions Chapter 6 Foundations of Business Intelligence: Databases and Information Management 1) The traditional file environment does not typically have a problem

More information

Course Number : SEWI ZG514 Course Title : Data Warehousing Type of Exam : Open Book Weightage : 60 % Duration : 180 Minutes

Course Number : SEWI ZG514 Course Title : Data Warehousing Type of Exam : Open Book Weightage : 60 % Duration : 180 Minutes Birla Institute of Technology & Science, Pilani Work Integrated Learning Programmes Division M.S. Systems Engineering at Wipro Info Tech (WIMS) First Semester 2014-2015 (October 2014 to March 2015) Comprehensive

More information

Chapter 3. Databases and Data Warehouses: Building Business Intelligence

Chapter 3. Databases and Data Warehouses: Building Business Intelligence Chapter 3 Databases and Data Warehouses: Building Business Intelligence How Can a Business Increase its Intelligence? Summary Overview of Main Concepts Details/Design of a Relational Database Creating

More information

CHAPTER 8 DECISION SUPPORT V2 ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

CHAPTER 8 DECISION SUPPORT V2 ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI CHAPTER 8 DECISION SUPPORT V2 ADVANCED DATABASE SYSTEMS Assist. Prof. Dr. Volkan TUNALI Topics 2 Business Intelligence (BI) Decision Support System (DSS) Data Warehouse Online Analytical Processing (OLAP)

More information

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)? Introduction to Data Warehousing and Business Intelligence Overview Why Business Intelligence? Data analysis problems Data Warehouse (DW) introduction A tour of the coming DW lectures DW Applications Loosely

More information

Designing Data Warehouses. Data Warehousing Design. Designing Data Warehouses. Designing Data Warehouses

Designing Data Warehouses. Data Warehousing Design. Designing Data Warehouses. Designing Data Warehouses Designing Data Warehouses To begin a data warehouse project, need to find answers for questions such as: Data Warehousing Design Which user requirements are most important and which data should be considered

More information

The Data Organization

The Data Organization C V I T F E P A O TM The Data Organization Best Practices Metadata Dictionary Application Architecture Prepared by Rainer Schoenrank January 2017 Table of Contents 1. INTRODUCTION... 3 1.1 PURPOSE OF THE

More information

DATA VAULT MODELING GUIDE

DATA VAULT MODELING GUIDE DATA VAULT MODELING GUIDE Introductory Guide to Data Vault Modeling GENESEE ACADEMY, LLC 2012 Authored by: Hans Hultgren DATA VAULT MODELING GUIDE Introductory Guide to Data Vault Modeling Forward Data

More information

Data Warehousing. Adopted from Dr. Sanjay Gunasekaran

Data Warehousing. Adopted from Dr. Sanjay Gunasekaran Data Warehousing Adopted from Dr. Sanjay Gunasekaran Main Topics Overview of Data Warehouse Concept of Data Conversion Importance of Data conversion and the steps involved Common Industry Methodology Outline

More information

Business Intelligence and Decision Support Systems

Business Intelligence and Decision Support Systems Business Intelligence and Decision Support Systems (9 th Ed., Prentice Hall) Chapter 8: Data Warehousing Learning Objectives Understand the basic definitions and concepts of data warehouses Learn different

More information

Building a Data Warehouse step by step

Building a Data Warehouse step by step Informatica Economică, nr. 2 (42)/2007 83 Building a Data Warehouse step by step Manole VELICANU, Academy of Economic Studies, Bucharest Gheorghe MATEI, Romanian Commercial Bank Data warehouses have been

More information

Data Warehouse Testing. By: Rakesh Kumar Sharma

Data Warehouse Testing. By: Rakesh Kumar Sharma Data Warehouse Testing By: Rakesh Kumar Sharma Index...2 Introduction...3 About Data Warehouse...3 Data Warehouse definition...3 Testing Process for Data warehouse:...3 Requirements Testing :...3 Unit

More information

Topics covered 10/12/2015. Pengantar Teknologi Informasi dan Teknologi Hijau. Suryo Widiantoro, ST, MMSI, M.Com(IS)

Topics covered 10/12/2015. Pengantar Teknologi Informasi dan Teknologi Hijau. Suryo Widiantoro, ST, MMSI, M.Com(IS) Pengantar Teknologi Informasi dan Teknologi Hijau Suryo Widiantoro, ST, MMSI, M.Com(IS) 1 Topics covered 1. Basic concept of managing files 2. Database management system 3. Database models 4. Data mining

More information

Data Warehousing. Seminar report. Submitted in partial fulfillment of the requirement for the award of degree Of Computer Science

Data Warehousing. Seminar report.  Submitted in partial fulfillment of the requirement for the award of degree Of Computer Science A Seminar report On Data Warehousing Submitted in partial fulfillment of the requirement for the award of degree Of Computer Science SUBMITTED TO: SUBMITTED BY: www.studymafia.org www.studymafia.org Preface

More information

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:-

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:- UNIT III: Data Warehouse and OLAP Technology: An Overview : What Is a Data Warehouse? A Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to

More information

Lecture 18. Business Intelligence and Data Warehousing. 1:M Normalization. M:M Normalization 11/1/2017. Topics Covered

Lecture 18. Business Intelligence and Data Warehousing. 1:M Normalization. M:M Normalization 11/1/2017. Topics Covered Lecture 18 Business Intelligence and Data Warehousing BDIS 6.2 BSAD 141 Dave Novak Topics Covered Test # Review What is Business Intelligence? How can an organization be data rich and information poor?

More information

Data Mining: Approach Towards The Accuracy Using Teradata!

Data Mining: Approach Towards The Accuracy Using Teradata! Data Mining: Approach Towards The Accuracy Using Teradata! Shubhangi Pharande Department of MCA NBNSSOCS,Sinhgad Institute Simantini Nalawade Department of MCA NBNSSOCS,Sinhgad Institute Ajay Nalawade

More information

ETL and OLAP Systems

ETL and OLAP Systems ETL and OLAP Systems Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first semester

More information

A Star Schema Has One To Many Relationship Between A Dimension And Fact Table

A Star Schema Has One To Many Relationship Between A Dimension And Fact Table A Star Schema Has One To Many Relationship Between A Dimension And Fact Table Many organizations implement star and snowflake schema data warehouse The fact table has foreign key relationships to one or

More information

STEP Data Governance: At a Glance

STEP Data Governance: At a Glance STEP Data Governance: At a Glance Master data is the heart of business optimization and refers to organizational data, such as product, asset, location, supplier and customer information. Companies today

More information

Teradata Aggregate Designer

Teradata Aggregate Designer Data Warehousing Teradata Aggregate Designer By: Sam Tawfik Product Marketing Manager Teradata Corporation Table of Contents Executive Summary 2 Introduction 3 Problem Statement 3 Implications of MOLAP

More information

Microsoft SQL Server Training Course Catalogue. Learning Solutions

Microsoft SQL Server Training Course Catalogue. Learning Solutions Training Course Catalogue Learning Solutions Querying SQL Server 2000 with Transact-SQL Course No: MS2071 Two days Instructor-led-Classroom 2000 The goal of this course is to provide students with the

More information

CS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures)

CS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures) CS614- Data Warehousing Solved MCQ(S) From Midterm Papers (1 TO 22 Lectures) BY Arslan Arshad Nov 21,2016 BS110401050 BS110401050@vu.edu.pk Arslan.arshad01@gmail.com AKMP01 CS614 - Data Warehousing - Midterm

More information

by Prentice Hall

by Prentice Hall Chapter 6 Foundations of Business Intelligence: Databases and Information Management 6.1 2010 by Prentice Hall Organizing Data in a Traditional File Environment File organization concepts Computer system

More information

Power Distribution Analysis For Electrical Usage In Province Area Using Olap (Online Analytical Processing)

Power Distribution Analysis For Electrical Usage In Province Area Using Olap (Online Analytical Processing) Power Distribution Analysis For Electrical Usage In Province Area Using Olap (Online Analytical Processing) Riza Samsinar 1,*, Jatmiko Endro Suseno 2, and Catur Edi Widodo 3 1 Master Program of Information

More information

CLINICAL information systems (CISs) provide

CLINICAL information systems (CISs) provide Challenges of Building Clinical Data Analysis Solutions George W. Gray Increasingly, owners of clinical information systems are turning to clinical data warehouses (CDWs) to store and to analyze their

More information

COURSE 20466D: IMPLEMENTING DATA MODELS AND REPORTS WITH MICROSOFT SQL SERVER

COURSE 20466D: IMPLEMENTING DATA MODELS AND REPORTS WITH MICROSOFT SQL SERVER ABOUT THIS COURSE The focus of this five-day instructor-led course is on creating managed enterprise BI solutions. It describes how to implement multidimensional and tabular data models, deliver reports

More information

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes?

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? White Paper Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? How to Accelerate BI on Hadoop: Cubes or Indexes? Why not both? 1 +1(844)384-3844 INFO@JETHRO.IO Overview Organizations are storing more

More information

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing. About the Tutorial A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries and decision making. This

More information

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS PART A 1. What are production reporting tools? Give examples. (May/June 2013) Production reporting tools will let companies generate regular operational reports or support high-volume batch jobs. Such

More information

How Turner Broadcasting can avoid the Seven Deadly Sins That. Can Cause a Data Warehouse Project to Fail. Robert Milton Underwood, Jr.

How Turner Broadcasting can avoid the Seven Deadly Sins That. Can Cause a Data Warehouse Project to Fail. Robert Milton Underwood, Jr. How Turner Broadcasting can avoid the Seven Deadly Sins That Can Cause a Data Warehouse Project to Fail Robert Milton Underwood, Jr. 2000 Robert Milton Underwood, Jr. Page 2 2000 Table of Contents Section

More information

Decision Support Systems aka Analytical Systems

Decision Support Systems aka Analytical Systems Decision Support Systems aka Analytical Systems Decision Support Systems Systems that are used to transform data into information, to manage the organization: OLAP vs OLTP OLTP vs OLAP Transactions Analysis

More information

Data Warehousing and OLAP

Data Warehousing and OLAP Data Warehousing and OLAP INFO 330 Slides courtesy of Mirek Riedewald Motivation Large retailer Several databases: inventory, personnel, sales etc. High volume of updates Management requirements Efficient

More information

1. Analytical queries on the dimensionally modeled database can be significantly simpler to create than on the equivalent nondimensional database.

1. Analytical queries on the dimensionally modeled database can be significantly simpler to create than on the equivalent nondimensional database. 1. Creating a data warehouse involves using the functionalities of database management software to implement the data warehouse model as a collection of physically created and mutually connected database

More information

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 1 Database Systems

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 1 Database Systems Database Systems: Design, Implementation, and Management Tenth Edition Chapter 1 Database Systems Objectives In this chapter, you will learn: The difference between data and information What a database

More information

Conceptual modeling for ETL

Conceptual modeling for ETL Conceptual modeling for ETL processes Author: Dhananjay Patil Organization: Evaltech, Inc. Evaltech Research Group, Data Warehousing Practice. Date: 08/26/04 Email: erg@evaltech.com Abstract: Due to the

More information

Aggregating Knowledge in a Data Warehouse and Multidimensional Analysis

Aggregating Knowledge in a Data Warehouse and Multidimensional Analysis Aggregating Knowledge in a Data Warehouse and Multidimensional Analysis Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.com Objectives Explain the basics of: 1. Data

More information

The University of Iowa Intelligent Systems Laboratory The University of Iowa Intelligent Systems Laboratory

The University of Iowa Intelligent Systems Laboratory The University of Iowa Intelligent Systems Laboratory Warehousing Outline Andrew Kusiak 2139 Seamans Center Iowa City, IA 52242-1527 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Tel. 319-335 5934 Introduction warehousing concepts Relationship

More information

Enterprise Data-warehouse (EDW) In Easy Steps

Enterprise Data-warehouse (EDW) In Easy Steps Enterprise Data-warehouse (EDW) In Easy Steps Data-warehouses (DW) are centralised data repositories that integrate data from various transactional, legacy, or external systems, applications, and sources.

More information