
MEDICAL INFORMATICS & DATABASE MANAGEMENT
MODULE 5: BIG DATA MANAGEMENT AND ANALYSIS
DR. ORALUCK PATTANAPRATEEP
Doctor of Philosophy Program in Clinical Epidemiology
Section for Clinical Epidemiology & Biostatistics
Faculty of Medicine Ramathibodi Hospital, Mahidol University
Semester I, Academic Year 2016
www.ceb-rama.org

RACE 614 Medical Informatics & Database Management Module 5: Big data management and analysis

Contents

Objectives
References
I. Big data and data science
   Why big data
   Data science
II. Data warehouse and visualization
   What is a data warehouse
   - Basic data warehousing environment
   - Data mart and its components
   Big data and data lake
   Data visualization
   - Infographic
III. Machine learning algorithms and big data analytics
   What are machine learning algorithms
   - Classification model
   - Regression model
   - Cluster analysis
   - Association analysis

Objectives

Students should be able to:
1. Understand the concepts of big data, data science, and data warehousing
2. Utilize the data science process for big data problems
3. Select appropriate data visualizations to clearly communicate analytic insights to audiences
4. Apply appropriate machine learning algorithms to analyze big data

References
1. Lantz B. Machine learning with R (2nd edition). Packt Publishing. 2015.
2. Provost F and Fawcett T. Data science for business. O'Reilly Media, Inc. 2013.
3. Reeves LL. A manager's guide to data warehousing. Wiley Publishing, Inc. 2009.
4. Berka P, Rauch J, and Zighed DA. Data mining and medical knowledge management: cases and applications. Information Science Reference. 2009.
5. Han J and Kamber M. Data mining: concepts and techniques (2nd edition). Morgan Kaufmann Publishers, CA, USA. 2006.
6. Kimball R and Ross M. The data warehouse toolkit: the complete guide to dimensional modeling (2nd edition). Wiley Computer Publishing. 2002.

In previous modules, we explored the management of primary data sources, from designing record forms to managing a database in EpiData. In the real world, however, there is another source of data: secondary data, especially in electronic format, which has recently grown enormously in size. It is increasingly gathered by high-performance, convenient devices, and is therefore called big data. In this final module, we will cover three domains of big data:
1. Big data and data science: introduces the concept of adding value to data and presents data science, the new era of data management.
2. Data warehouse and visualization: presents the concepts behind building a data warehouse and demonstrates how to communicate the findings.
3. Machine learning algorithms and big data analytics: explores how to mine data with four main machine learning algorithms.

I. Big data and data science

How do we find information and knowledge in data, or big data? Figure 1 shows the value added as we move from meaningless raw data at the base of the pyramid to meaningful information, knowledge, and wisdom. For example, the two numbers at the raw data level, 115 and 90, have no meaning without any clue. By adding meaning to the numbers, we find a relationship between them: a fasting blood sugar (FBS) that decreases from 115 to 90. But the next question is whether a lower FBS is good or bad.

Figure 1: from data to wisdom
- Wisdom (understand principles; applied): controlling the diet will improve the patient's health
- Knowledge (understand patterns; context added): FBS should be less than 100, by dietary control
- Information (understand relations; meaning added): FBS decreases from 115 to 90
- Data (raw): 115, 90

By adding the context that FBS should be less than 100, the information we found (FBS decreases from 115 to 90) turns out to be good. Moreover, tools and techniques called machine learning algorithms may be applied to understand patterns and predict the future. From this example, we may find patterns among patients who control their diet well and predict their future FBS levels. Finally, at the top of the pyramid, we may conclude that controlling one's diet will improve the patient's health.

Why big data

Big data simply means datasets that are too large, too varied, and too fast-moving for traditional data processing systems. In the past, we processed only small volumes of data, with little variety of data types, in overnight batches, to find information and knowledge. With today's hardware performance and technology, data is generated in many forms and at high volume, and can be stored, retrieved, and analyzed far more rapidly. Big data, then, is mainly described by the following three characteristics:
- Volume: the amount of data, from a few to millions of records, from one to hundreds of tables.
- Variety: the range of data types and sources, from structured to unstructured, from text to images.
- Velocity: the speed of data in and out, from batch to real time.

Data science

Once massive data in flexible forms can be processed in a few minutes, there are many more opportunities to find information and dig out knowledge. The next questions are who will perform the analysis and what special talents they need.

Figure 2: data science Venn diagram 2.0 — three overlapping circles (computer science; maths and statistics; subject matter expertise), with data science (the "unicorn") at the centre, and the pairwise overlaps labelled machine learning, traditional software, and traditional research.
Ref: http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html

The three essential skills of a data scientist (figure 2) are: 1. computer science; 2. maths and statistics; and 3. subject matter expertise (Venn diagram 2.0). Since it is quite hard to find one person who is strong in all three areas, most data scientists work in teams.

The data science process (figure 3) starts with collecting raw data from real-world situations, e.g., human behaviour, financial issues, or medicine utilization. We then formulate a hypothesis, and process and clean the data for exploratory data analysis, which may take the form of summary statistics or graphs. If the data are not sufficient or cannot answer the question, more data should be collected, processed, and cleaned. The next step is building models with machine learning algorithms. The final outcome of the data science process is the data product, or value-added data. Throughout the process, the important thing is communication to the audience to support decision making, such as through reports or dashboards.

Figure 3: data science process — raw data is collected → data is processed → clean data → exploratory data analysis → models & algorithms → data product → communicate → make decisions.
Ref: http://www.kdnuggets.com/2016/03/data-science-process.html

II. Data warehouse and visualization

In the data science process we discussed collecting, processing, and cleaning data (called ETL in data warehousing), as well as the importance of communicating to the audience for decision making. We begin with a comparison of the two forms in which information is kept, each serving a different purpose. Information is mainly kept in two forms: the operational systems of record and the data warehouse. The operational systems are where the data is put in, and they almost always deal with one record at a time; the data warehouse is where data integrated from different operational systems is kept, and it almost never deals with one row at a time. Table 1 compares operational systems and the data warehouse.

Table 1: comparison of operational systems and data warehouse

Area of comparison | Operational systems | Data warehouse
Purpose of data | Daily business tasks | Analysis, planning, decision support
Function | Day-to-day operation, detailed data | Long-term information, summarized data
Design | Application oriented, real time | Subject oriented; depends on the length of the cycle for data supplements to the warehouse
Access | Read and write | Mostly read
Size | 100 MB to GB | 100 GB to TB

What is a data warehouse

A data warehouse is a central repository of integrated data from one or more disparate transactional data sources, such as relational databases or enterprise resource planning systems. Figure 4 shows the basic data warehousing environment. Starting from the transactional data sources, a process called ETL:
- extracts data from the transactional data sources, normally keeping it temporarily in staging tables;
- transforms the data into the proper format for the purposes of querying and analysis; and
- loads it into the final target, which is designed and modeled in dimensional format.

Then, at the client side, a user retrieves data that is already in the data warehouse to create their own dashboards or reports for exploration or analysis.
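To make the three ETL steps concrete, here is a minimal sketch in Python with pandas; the file names, column names, and cleaning rules are hypothetical illustrations, not part of the environment described above.

import pandas as pd

# Extract: read raw visit records from a transactional source
# (a CSV export here; in practice this could be a database query).
staging = pd.read_csv("visits_export.csv")  # hypothetical source file

# Transform: clean and reshape into a format suitable for analysis.
staging["visit_date"] = pd.to_datetime(staging["visit_date"])
staging = staging.dropna(subset=["clinic", "health_scheme"])

# Load: aggregate into a fact table (dimensional format) and save it.
fact_visits = (
    staging.groupby(["visit_date", "clinic", "health_scheme"])
    .size()
    .reset_index(name="n_visits")
)
fact_visits.to_csv("fact_visits.csv", index=False)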

- Basic data warehousing environment

Figure 4 explains the basic data warehousing environment. From left to right, data in various forms is extracted, transformed, and loaded into a data warehouse, which consists of many data marts. The ETL process, or metadata management, deals with the ODS (operational data store, a mirror or backup of the transactional database), the staging tables (temporary databases), and the master tables (which mainly keep the master data warehouse dimensions).

Figure 4: basic data warehouse environment — four layers, from left to right:
1. Transactional data sources: relational DB/HIS, ERP, other sources, flat files.
2. Metadata management: ODS, staging tables, master tables, metadata.
3. Dimensional modeling: the data warehouse, composed of data marts.
4. Query/report/visualization: pivots in MS Excel, dashboards in BI tools.
Skills per layer: DB design and administration, SQL (sources); extract, transform, load (ETL) (metadata management); dimensional modeling (modeling); multidimensional queries, data mining, predictive analysis (visualization).

Tools per layer: MS SQL Server, MySQL, Oracle 11g, MS Access, DB2, etc. (sources); IBM Data Manager, Informatica, Oracle ODI, SAS DI Studio, SQL Server Integration Services (metadata management); IBM Framework Manager, Oracle Warehouse Builder, MS Analysis Services (dimensional modeling); IBM Cognos, Business Objects, MS Power BI, MS PowerPivot, QlikSense, Tableau (query/report/visualization).
DB = database, HIS = hospital information system, ERP = enterprise resource planning, ODS = operational data store

- Data mart and its components

A simple form of data warehouse focused on a single functional area is called a data mart, or cube. A data mart is designed in dimensional format as a fact table, which comprises measures and dimensions. Typically, measures are values that can be aggregated, and dimensions are groups of hierarchies that describe the facts. For example, in figure 5, the number of visits is a measure, while date, clinic, and health scheme are dimensions. A dimension may have zero, one, or more hierarchies. Health scheme has no hierarchy. Clinic has one hierarchy with one level, meaning a clinic can be drilled up to a building. Date has two hierarchies, calendar year and fiscal year, and each hierarchy has several levels, i.e., week, month, and year. In addition, the date dimension has one attribute, day (Monday to Sunday).

Figure 5: a data mart in star schema

Figure 6 shows a data mart as a cube reporting the number of visits in three dimensions: date on the X axis, clinic on the Y axis, and health scheme on the Z axis. Each box contains a number of visits; e.g., on 1/1/16, 22 NHSO patients visited the medicine clinic (three dimensions: date, clinic, and health scheme). A cube can accommodate data for all the dimensions that define a business problem. When a dimension is dropped, the measure is summed as the boxes are combined, e.g., 119 patients visited the medicine clinic on 1/1/16 (two dimensions: clinic and date); and when drilling up, 198 patients visited building 1 on 1/1/16.
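The roll-up behaviour of a cube can be sketched with a simple group-by. The following Python/pandas snippet is a minimal illustration; the fact table and its values are hypothetical, chosen only so that the medicine clinic total matches the figure's 119 visits.

import pandas as pd

# Hypothetical fact table: one row per (date, clinic, health scheme) cell.
fact_visits = pd.DataFrame({
    "visit_date": ["2016-01-01"] * 3,
    "clinic": ["medicine", "medicine", "surgery"],
    "health_scheme": ["NHSO", "CSMBS", "NHSO"],
    "n_visits": [22, 97, 79],
})

# 3 dimensions -> 2 dimensions: sum over health scheme.
by_clinic_date = fact_visits.groupby(["visit_date", "clinic"])["n_visits"].sum()
print(by_clinic_date)  # medicine on 2016-01-01 -> 22 + 97 = 119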

Figure 6: a data mart as a cube Big data and data lake With the growth of data in the last decade, the new term dealing with data management system is data lake. Table 2 compares key differences between data warehouse and data lake. Table 2: comparison of data warehouse and data lake Area of comparison data warehouse data lake Data structure Structured Structured and unstructured Data type Cleansed/aggregated Raw Data volume Large (Terabytes) Extremely large (Petabytes) 11

Area of comparison data warehouse data lake Access methods SQL NoSQL However, data lake can be added to data system with data warehouse to maximize the use of data. In figure 7, Hadoop is added to retrieve data from unstructured data sources. Figure 7: basic data warehouse environment plus data lake architecture Transactional data sources Data system Query/Report/ Visualization Relational DB/HIS Pivot in MS Excel ERP Other sources Staging tables Data mart Dashboard in BI tools Flat files Master tables Data mart Unstructured data file Meta data 12

Data visualization

Turning to the last column of figures 4 and 7, where transformed data becomes information and knowledge: data visualization, which is both an art and a science, is the step of the data science process (figure 3) that communicates information, knowledge, or even data products clearly and efficiently to audiences. Effective visualization helps users analyze data and draw evidence from it. To visualize data well, we need to understand the data we are trying to visualize, know what the audience wants to learn, and then use a visual in the best and simplest form to convey the information. There are many tools for data visualization, from simple tools such as MS Excel, to small BI (business intelligence) tools such as MS Power BI, QlikSense, and Tableau, to large BI tools such as IBM Cognos. Figures 8 and 9 are examples of using pivot tools in MS Excel and a dashboard in MS Power BI to present data from a data mart.

Figure 8: pivot table and chart in MS Excel
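The same pivot idea can also be sketched outside Excel. Below is a minimal Python/pandas illustration; the fact table and its values are hypothetical, reusing the earlier sketches.

import pandas as pd

fact_visits = pd.DataFrame({
    "visit_date": ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"],
    "clinic": ["medicine", "surgery", "medicine", "surgery"],
    "n_visits": [119, 79, 104, 66],
})

# Pivot: clinics as rows, dates as columns, summed visit counts as values.
pivot = fact_visits.pivot_table(
    index="clinic", columns="visit_date", values="n_visits", aggfunc="sum"
)
print(pivot)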

Figure 9: a dashboard designed in MS Power BI

- Infographic

Infographic is a combination of two words: information and graphic. It is a kind of data visualization composed of three parts: visual, content, and knowledge. The visual concerns how to make the graphic attractive and memorable, since vision is the sense through which humans receive significantly more information than through any of the other four (touch, hearing, smell, taste). The content must be a statistically proven fact and must be able to transfer the knowledge to audiences. Figure 10 is a sample infographic from the WHO in 2013.

Figure 10: a sample infographic from the WHO

III. Machine learning algorithms and big data analytics

What are machine learning algorithms

In the previous sections, we discussed how to manage big data to get information. In this section, we move on to how to transform data or information into knowledge with a set of algorithms called machine learning.

Figure 11: machine learning and its combinations
AI = Artificial Intelligence, KDD = Knowledge Discovery and Data mining

As statisticians, we may ask what the difference is between statistical modelling and machine learning. The answer is that statistical models are formalizations of relationships between variables in the form of mathematical equations, while machine learning algorithms can learn from data without relying on rules-based programming. Machine learning algorithms are generally divided into two major categories (descriptive and predictive) and handle two major types of data (continuous and categorical), as shown with sample techniques in table 3. The objective of descriptive tasks is to derive patterns that summarize the underlying relationships in the data. They are often exploratory in nature, serving to validate and explain the results. Predictive tasks aim to predict the value of a particular attribute based on the values of other attributes.

Table 3: machine learning techniques/algorithms

Data type | Descriptive tasks (unsupervised) | Predictive tasks (supervised)
Continuous | Clustering | Regression
Categorical | Association | Classification

Choosing the best algorithm for a specific analytical task can be a challenge. While different algorithms can be used to perform the same business task, each algorithm produces a different result, and some algorithms can produce more than one type of result. For example, we can use the Microsoft Decision Trees algorithm not only for prediction, but also as a way to reduce the number of columns in a dataset, because the decision tree can identify columns that do not affect the final mining model. The machine learning process has six steps, as shown in figure 12: 1. understanding the business and the type of problem; 2. understanding the data (which may come from different sources); 3. preparing the data (ETL); 4. creating the model; 5. evaluating the model; and 6. deploying it.

Figure 12: CRISP-DM model

- Classification model

Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the algorithm processes a training set containing a set of attributes and the respective outcome. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the prediction set, which contains the same set of attributes except for the outcome, is used to test the classification model. Many algorithms are used for classification, such as k-NN, Naïve Bayes, and decision trees. The k-NN, or k-nearest neighbors, algorithm uses information about a record's k nearest neighbors to classify an unknown outcome. Figure 13 is an example of diagnosing breast cancer with the k-NN algorithm. With two attributes, texture and radius, each dot represents malignant (m) or benign (b). To classify x as m or b, k-NN calculates the distances and decides the outcome for x.
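Below is a minimal sketch of this k-NN idea in Python with scikit-learn; the feature values and the choice of k = 3 are hypothetical, not taken from figure 13.

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: (texture, radius) pairs with known outcomes.
X_train = [[2.1, 1.8], [2.4, 2.0], [1.1, 0.9], [0.8, 1.0], [2.2, 1.5], [1.0, 1.2]]
y_train = ["m", "m", "b", "b", "m", "b"]

# Classify an unknown point x by majority vote among its 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[2.0, 1.7]]))  # -> ['m'] for this made-up point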

Figure 13: an example of the k-NN algorithm

Among the other algorithms, Naïve Bayes (a Bayesian method) uses training data to calculate the probability of an unknown outcome with the formula P(A|B) = P(B|A)P(A) / P(B). For example, if 10% of patients have disease A, and a test B is positive in 90% of diseased patients but also in 20% of the others, then P(B) = 0.9 × 0.1 + 0.2 × 0.9 = 0.27 and P(A|B) = 0.9 × 0.1 / 0.27 ≈ 0.33. A decision tree uses a tree structure to model the relationships among attributes and outcomes.

- Regression model

While classification algorithms apply to categorical outcomes, regression algorithms apply to continuous ones in supervised models. This algorithm is the same one taught in statistics classes, which uses independent variables to predict a dependent variable.
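Below is a minimal sketch of a regression model in Python with scikit-learn; the patients, predictors, and FBS values are made up for illustration.

from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical data: predict FBS (continuous) from age and weight.
X = np.array([[30, 60], [45, 75], [50, 80], [60, 85], [35, 65]])
y = np.array([92, 110, 118, 135, 98])

# Fit a linear model and predict for a new patient.
model = LinearRegression().fit(X, y)
print(model.predict(np.array([[40, 70]])))  # predicted FBS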

- Cluster analysis

Clustering is an unsupervised model that divides data into clusters. In classification, we have training data with known outcomes and predict the outcomes of testing data; in clustering, there are no known outcomes. For example, in figure 14, as medical staff we want to organize diabetes patients into three groups, based on age and blood sugar level, to help them learn how to control their diet and exercise.

Figure 14: an example of diabetes patients

The most common algorithm for cluster analysis is k-means. K-means first assigns each of the n examples to one of k clusters; it then tries to minimize the differences within each cluster and maximize the differences between clusters. Figure 15 shows the result of a k-means cluster analysis in which patients are divided into three groups based on the similarity of their age and blood sugar level.
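Below is a minimal k-means sketch in Python with scikit-learn; the ages and blood sugar values are made up for illustration.

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical patients: (age, fasting blood sugar).
patients = np.array([
    [25, 95], [30, 100], [28, 92],    # younger, lower FBS
    [55, 140], [60, 150], [58, 145],  # older, higher FBS
    [40, 118], [42, 122], [45, 115],  # in between
])

# Divide the patients into 3 groups by similarity of age and FBS.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(patients)
print(labels)  # cluster index (0-2) for each patient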

Figure 15: an example of diabetes patients after k-means clustering

- Association analysis

Association, or market basket, analysis is another unsupervised model that finds relationships among categorical variables in a dataset. Table 4 is an example of 5 prescriptions from one clinic.

Table 4: an example of drugs in prescriptions

Rx no. | Drug items
1 | {PPI, NSAIDs, Calcium}
2 | {Antidepressant, NSAIDs, Antianxiety, Muscle relaxant}
3 | {NSAIDs, Muscle relaxant, PPI}
4 | {Antidepressant, Antianxiety, Calcium}
5 | {NSAIDs, PPI, Calcium}

By looking at just these 5 prescriptions, we may guess some patterns: Rx nos. 1, 3, and 5 are for orthopedic patients, while Rx nos. 2 and 4 are for psychiatric patients. Using similar rules on large transaction databases, association analysis applies statistical measures (the support and confidence measures) to locate associations of items and group them into the same basket. The most common method is the Apriori approach, an association rule mining algorithm based on the principle of frequent pattern mining. Performing an Apriori analysis involves 2 steps, as follows:

1. Generate the candidate set: the first step finds itemsets that occur in the dataset with a frequency exceeding a specified threshold (defined as the support measure), where for a rule A → B:

   Support = (number of observations having both A and B) / (total number of observations)

2. Derive the association rules: the second step analyses the itemsets in the candidate set to mine association rules, which express conditional probabilities between pairs of item groups. Rules are generated from pairs whose conditional probability exceeds a user-defined threshold (called the confidence measure):

   Confidence = (number of observations having both A and B) / (number of observations having A)
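To make the two measures concrete, the sketch below computes support and confidence for one candidate rule, NSAIDs → PPI, directly from the five prescriptions in table 4. It is a hand-rolled Python illustration of the measures, not a full Apriori implementation.

# The five prescriptions from table 4, as sets of drug items.
prescriptions = [
    {"PPI", "NSAIDs", "Calcium"},
    {"Antidepressant", "NSAIDs", "Antianxiety", "Muscle relaxant"},
    {"NSAIDs", "Muscle relaxant", "PPI"},
    {"Antidepressant", "Antianxiety", "Calcium"},
    {"NSAIDs", "PPI", "Calcium"},
]

a, b = {"NSAIDs"}, {"PPI"}
n_a = sum(1 for rx in prescriptions if a <= rx)         # Rx containing A: 4
n_ab = sum(1 for rx in prescriptions if (a | b) <= rx)  # Rx containing A and B: 3

support = n_ab / len(prescriptions)  # 3/5 = 0.60
confidence = n_ab / n_a              # 3/4 = 0.75
print(support, confidence)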