Study on Building and Applying Technology of Multi-level Data Warehouse Lihua Song, Ying Zhan, Yebai Li

Similar documents
4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

Chapter 6. Foundations of Business Intelligence: Databases and Information Management VIDEO CASES

A priority based dynamic bandwidth scheduling in SDN networks 1

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm

Research on the value of search engine optimization based on Electronic Commerce WANG Yaping1, a

CISC 7610 Lecture 2b The beginnings of NoSQL

The application of OLAP and Data mining technology in the analysis of. book lending

Construction of the Library Management System Based on Data Warehouse and OLAP Maoli Xu 1, a, Xiuying Li 2,b

Research on Computer Network Virtual Laboratory based on ASP.NET. JIA Xuebin 1, a

Research on Parallelized Stream Data Micro Clustering Algorithm Ke Ma 1, Lingjuan Li 1, Yimu Ji 1, Shengmei Luo 1, Tao Wen 2

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

Research on software development platform based on SSH framework structure

ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS

Data Warehousing 11g Essentials

Oracle Database 11g: Data Warehousing Fundamentals

SQL Query Optimization on Cross Nodes for Distributed System

A Design of Remote Monitoring System based on 3G and Internet Technology

Design of Greenhouse Temperature and Humidity Monitoring System Based on ZIGBEE Technique Ming Xin 1,a, Wei Zhongshan 1,b,*

Research on Design Information Management System for Leather Goods

The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2015)

The Application of CAN Bus in Intelligent Substation Automation System Yuehua HUANG 1, a, Ruiyong LIU 2, b, Peipei YANG 3, C, Dongxu XIANG 4,D

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

Proceedings of the IE 2014 International Conference AGILE DATA MODELS

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c

Construction Scheme for Cloud Platform of NSFC Information System

Preliminary Research on Distributed Cluster Monitoring of G/S Model

A Study on Load Balancing Techniques for Task Allocation in Big Data Processing* Jin Xiaohong1,a, Li Hui1, b, Liu Yanjun1, c, Fan Yanfang1, d

Building Energy Saving Configuration Software Data Processing System

20466C - Version: 1. Implementing Data Models and Reports with Microsoft SQL Server

Design of Physical Education Management System Guoquan Zhang

What is Gluent? The Gluent Data Platform

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini

Improved Balanced Parallel FP-Growth with MapReduce Qing YANG 1,a, Fei-Yang DU 2,b, Xi ZHU 1,c, Cheng-Gong JIANG *

A Test Sequence Generation Method Based on Dependencies and Slices Jin-peng MO *, Jun-yi LI and Jian-wen HUANG

COURSE 20466D: IMPLEMENTING DATA MODELS AND REPORTS WITH MICROSOFT SQL SERVER

CHAPTER 3 Implementation of Data warehouse in Data Mining

Research on Centralized Data-Sharing Model Based on Master Data Management

Concurrency Control and Self-optimization based on SQL Server Database System Rongchuan Guo

A Random Forest based Learning Framework for Tourism Demand Forecasting with Search Queries

QUALITY MONITORING AND

Crop Production Management Information System Design and Implementation

Efficient Load Balancing and Fault tolerance Mechanism for Cloud Environment

High Performance Computing on MapReduce Programming Framework

AN EFFICIENT DEADLOCK DETECTION AND RESOLUTION ALGORITHM FOR GENERALIZED DEADLOCKS. Wei Lu, Chengkai Yu, Weiwei Xing, Xiaoping Che and Yong Yang

Data Warehouse and Data Mining

Study on Gear Chamfering Method based on Vision Measurement

The Development of Mobile Shopping System Based on Android Platform

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

The Data Organization

Construction of SSI Framework Based on MVC Software Design Model Yongchang Rena, Yongzhe Mab

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective

DATA MINING TRANSACTION

Chapter 6 VIDEO CASES

Fig 1.2: Relationship between DW, ODS and OLTP Systems

Segregating Data Within Databases for Performance Prepared by Bill Hulsizer

A Cache Utility Monitor for Multi-core Processor

Research on Data Mining Technology Based on Business Intelligence. Yang WANG

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

The Research and Design of the Application Domain Building Based on GridGIS

Yunfeng Zhang 1, Huan Wang 2, Jie Zhu 1 1 Computer Science & Engineering Department, North China Institute of Aerospace

Research and Design of Data Storage Scheme for Electric Power Big Data

Design on Students Score Management System based on Asp.net Zhe Li1,a, Jiahui Wang2,b, Shuang Wei3,c

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

A Hybrid Genetic Algorithm for the Distributed Permutation Flowshop Scheduling Problem Yan Li 1, a*, Zhigang Chen 2, b

Research on the Knowledge Representation Method of Instance Based on Functional Surface

Efficient Algorithm for Frequent Itemset Generation in Big Data

A Method and System for Thunder Traffic Online Identification

Data Mining in the Application of E-Commerce Website

Distributed KIDS Labs 1

Application of RMAN Backup Technology in the Agricultural Products Wholesale Market System

Research on Two - Way Interactive Communication and Information System Design Analysis Dong Xu1, a

Question Bank. 4) It is the source of information later delivered to data marts.

Construction and Application of Cloud Data Center in University

CS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures)

Entities and Attributes. Image 6.2 Image 6.3

An Intelligent Retrieval Platform for Distributional Agriculture Science and Technology Data

6+ years of experience in IT Industry, in analysis, design & development of data warehouses using traditional BI and self-service BI.

Management Information Systems MANAGING THE DIGITAL FIRM, 12 TH EDITION FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT

Design on Data Storage Structure for Course Management System Li Ma

Quality Assessment of Power Dispatching Data Based on Improved Cloud Model

Research on Full-text Retrieval based on Lucene in Enterprise Content Management System Lixin Xu 1, a, XiaoLin Fu 2, b, Chunhua Zhang 1, c

SAP CERTIFIED APPLICATION ASSOCIATE - SAP HANA 2.0 (SPS01)

A Design Method for Composition and Reuse Oriented Weaponry Model Architecture Meng Zhang1, a, Hong Wang1, Yiping Yao1, 2

Design of student information system based on association algorithm and data mining technology. CaiYan, ChenHua

Available online at Procedia Engineering 29 (2012) 69 73

Exploration of Fault Diagnosis Technology for Air Compressor Based on Internet of Things

After completing this course, participants will be able to:

Query processing on raw files. Vítor Uwe Reus

A Novel Data Mining Platform Design with Dynamic Algorithm Base

Research on adaptive network theft Trojan detection model Ting Wu

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing

SAP BW/4HANA the next generation Data Warehouse

An Agricultural Tri-dimensional Pollution Data Management Platform Based on DNDC Model

Oracle 1Z0-515 Exam Questions & Answers

Design of Liquid Level Control System Based on Simulink and PLC

Research on Design and Application of Computer Database Quality Evaluation Model

Full file at

Transcription:

International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2015) Study on Building and Applying Technology of Multi-level Data Warehouse Lihua Song, Ying Zhan, Yebai Li College of Computer Science,North China University of Technology,Beijing, China e-mail: blackzhanying@qq.com Keywords: Multi-level Data Warehouses; Extract; Incremental Update;E- Business Abstract. Based on multi-database management systems application, we use multi-level data warehouses. This paper proposes an incremental method to update the data set of result and summarizes the two transmission strategies. Verify that the data-driven policy through experimental simulation platform environment analysis and draw the feasibility and effectiveness of this method is more suitable platform applications conclusion. Introduction In the background of fast-growing information, data shows the characteristics of massive, distributed and heterogeneous. It makes the centralized data warehouse processing capacity in data analysis is increasingly limited. Because of distributed data warehouses have the features of low cost of maintenance, data integrity, high tolerance against system failures[1], it's more competitive for some special cases which include bank and e-commerce platform. In recent years, Hive is the mainly technology to be used to build distributed data warehouses[2], but Hadoop is not yet available analysis tools. In order to avoid complex development work in the application presentation layer, we use the database technology and combine with the open source analysis and presentation tools(e.g. Mondrian and JPivot). Related research on multilevel data warehouses structure for typical distributed data warehouses form(including global warehouse and local data warehouses) is as follows: Paper [3] presented double channels view updating algorithm to maintain multiple views on line and parallel realize OLAP query for ensuring data consistency and improving query efficiency. Paper [4] proposed a model of distributed data warehouses for sale decision and it also present a solution for data transmission from local data warehouses to global data warehouse, which was based on large-scale clothing enterprises. In this paper, global data warehouse and local data warehouse are referred to as platform data warehouse(pdw) and enterprise data warehouse(edw). PDW is built for e-commerce platform, and EDWs are built for the enterprise users who registered on the platform There are two strategies for transmission between the PDW and EDWs. a). Round-robin scheduling, PDW extracts data from the EDWs; b). Data-driven, after EDWs completing update they transmit data to PDW immediately. Both of them are related to data exchange across database servers. The strategy a) is characterized by PDW based on specific conditions to determine the priority of the EDWs. Paper [5] presented using round-robin scheduling strategy in distributed data warehouses to solve the problem which is in the poor flexibility and real-time, but the global data warehouse had to maintain the additional views for update and the communication frequency was increased. The research also includes the relevant papers [6], [7]. Strategy b) is featured in EDWs extracting data in parallel then push them to PDW, and it has strong continuity, low network communication frequency and high concurrency. The adverse factor is that the strategy is very likely to cause conflict between the update transactions, and paper [8] proposed solution to the conflict from a data storage perspective. SYSTEM STRUCTURE Business analysis system is aimed at building data warehouses to store the decision data, 2015. The authors - Published by Atlantis Press 591

analyzing the historical data and displaying readable results to the users. Based on the above objectives, the system structure as showed in Figure 1. Business analysis system is classified as the presentation layer, application layer, and data storage management layer. Users can send data requests, then web server gets the request and OLAP server uses MDX query statement to Application Layer Present Layer data WEB Server request data set MDX query OLAP Server Data Storage Management Layer data request Multi-Level Data Warehouses ETL Data Source Figure 1. Structure of marketing analysis system DBMS 0 PDW DBMS 1 DBMS 2 DBMS N EDW 11...EDW 1M Enterprise DB 11...DB 1M EDW 21...EDW 2M Enterprise DB 21...DB 2M EDW N1...EDW NM Enterprise DB N1 DB NM Figure 2. Data storage management layer interacts with the data warehouses eventually result data is returned and displayed in the front page step by step. The open source tools can be used in the presentation layer and application layer. So this research focuses on data storage management multi-level data warehouse construction. The platform is based on multiple database environment and multiple distribution databases up to homogenous in multi-database management systems. To achieve platform and enterprise-level different analytical functions, data storage management structure is shown in Figure 2. Tablespaces are allocated for different users whose business models are independent. And the same operating mode determines corresponding to the same table structure in the separate tablespaces. All the databases and data warehouses used Oracle 11g database management systems. The data sources of PDW are the statistic from the registered EDWs. And data sources of the EDWs are taken from the corresponding database. Demand for analysis and decision making for different points of view, both logical data model and data granularity for the two level data warehouses also varies. LOGIC MODEL DESIGN The main purpose of analysis can be described as the follow: 592

Make users master the product marketing of their enterprises and propose the new marketing strategy. Tell users what merchandise sells well by analyzing the whole registered enterprises' marketing. From the above, it can be deduced that enterprise-level analysis focuses on specific commodity marketing and platform-level analysis emphasizes on the categories of product marketing. Thus, the analysis granularities for the PDW and EDWs are different. In order to facilitate the subsequent statistics, we added the statistical values as a column to the fact table which is in the PDW. Based on a star model and snowflake model features and advantages, logic models for sale topic are designed as the Figure 3 and Figure 4. Time Regional Company Sale Fact Table Customer Commodity Categories of Commodity Figure 3. Logic model in a EDW Sales Volum Trading Volum Return Quantity Time Regional Sale Fact Table Categories of Commodity Company Figure 4. Logic model in PDW Centered on sales fact table and creating the dimension tables about time, region, merchandise(or categories of merchandise). Especially, design for sale fact table can be expressed as that: SALE_FACT={A, P(ai) } (1) A={a1,a2,..., an}, P(x) ={count(x), sum(x) } (2) "A" represents the set of attribute values as the columns in the fact table, and the ai stand for a specific attribute value; P(x) is operational set which include COUNT and SUM as the grouping function in database, which is used to calculate the orders for sale, return order and sales amount. Attributes to this design, amount of data transferred between the single EDW and PDW in one day will be limited to N tuples. The integer N depends on the count of category dimension table. DATA PROCESSING Due to use the same database management system and have identical table structure between the multiple database, the data sources are non-heterogeneous and not having dirty data. Based on the data model design, dimension tables and the fact table can be categorized by the update frequency. The tables which almost do not need to be updated: the time dimension tables and regional dimension tables. Periodic update of tables: the commodity dimension tables, customers dimension tables and companies dimension tables. The fact tables, which are updated every day. The trigger can be utilized to the periodic update of tables. And updating strategy for the fact tables should be mainly designed. Each EDW counts its marketing results. Then the EDWs which complete the statistic transmit the data set to PDW. The ETL process for data transmission from enterprise databases to EDWs, or from EDWs to PDW will mainly utilize the operations: Mapping( ), choice(ơ) and join( ).So they can be implemented by encapsulating as a procedure. 593

And there are two transmission strategies summarized in this paper round-robin scheduling and data-driven. A. Round-Robin scheduling PDW extracts data from the EDWs according to some algorithm. The main difficulty of the strategy is selecting the appropriate scheduling algorithm which must ensure that minimizing the time overhead and network traffic between the two level data warehouse, and reducing the maintenance cost. Classic genetic algorithm is superior to the traditional search algorithm to get the approximate optimal solution sets, but it's cumbersome, difficult to achieve and not conducive to maintenance. Paper [7] proposed the task scheduling based on a greedy algorithm, but it had much prerequisite and some of them can't be met in this case. The strategy in paper [5] is according to the system resource consumption adjust their algorithms in real time. But it didn't give the assignment priority rule. Serial extraction from EDWs by company code does not need to consider the complex scheduling algorithm. It's easy to implement and do not need to create the tables to record the communication data and scheduling rules. But it has to modify the procedure when a new company registering on the platform. Considering the flexibility not suitable for practical application, we chose the data-driven as the way to transmit. B. Data-Driven After EDWs completing update they transmit data to PDW immediately. The time overhead T is showed in the following: T=max{(ta1+tb1'),(ta2+tb2'),...,(tan+ tbn')}+ t (3) The tai represents time overhead of one EDW extract data from the corresponding database; And the tbi stands for the transmission time from a EDW to PDW; t is extra time overhead. As the equation, the goal is to minimize extra time overhead. An experiment showed that the extra time was caused by the conflicts between multiple tasks, the segment in database automatic expanding and multiple network connections. Paper [8] proposed that utilize ORACLE's functionality on distribution to solve the conflict problem. Technical documentation [9] also provides solutions to reduce network connections. The specific approach is using list partitioning technology[10] to partition the fact table in PDW according to the value of company code. Next, the paper will verify the method's validity by the simulation experiment EXPERIMENTAL ANALYSIS Efficiency of two-tier data warehouse update was modeled on the actual platform environment simulation test. Servers hardware configuration parameters as showed in Table I. TABLE I. Routing Server Database Server 1 Database Server 2 SERVERS HARDWARE CONFIGURATION Configuration Parameters Intel(R) Xeon(R) 8 CPUs 2.0GHz,6.0G RAM Intel(R) Xeon(R) 4 CPUs 2.0GHz,4.0G RAM Intel(R) Xeon(R) 2 CPUs 1.6GHz,5.0G RAM 594

Figure 5. Time overhead of three cases Establish data warehouses for enterprise users separately in the two database servers; Construct data warehouse for the platform in the routing server. Set time to ensure the system time of three servers consistent (the error is within 1 second). In order to close to the practical application environment, choose the time overhead which was in the nighttime and repeated to measure average as experimental results. According to the established design, compare the time overhead in the three cases: No partition fact table. Partition in a single datafile Partition in multiple datafiles Experimental result is shown at the Figure 5. The amount of EDWs and time consumption(unit: s) is described in the X axis and Y axis respectively. Experiments show that under the same conditions, time consumption of transmission from the enterprise databases to EDWs and then to PDW can be reduced through table partition technology. Further analysis found that, the t is mainly caused by the time difference of job delay in the two DBMSs which were established in the two different database servers. Table partition can reduce conflict between the multiple inserting tasks and then reduce the time difference, which may attribute to the storage features of the database. And experiments show that this method does not cause any data loss and alteration. This paper proposes that limit the number of data by transmitting the statistics from EDW to PDW; utilize the data-driven to ensure high flexibility and consistency of the transmission; the table partition can be used to reduce time consumption in this application environment. Acknowledgement This paper funded by : Project of Construction of the central financial support for Universities and Colleges (Project No.:P X M2014_014212_000097); Beijing Higher Education Young Elite Teacher Project(Project No.: YETP1419 ); The Project of Construction of Innovative Teams and Teacher Career Development for Universities and Colleges Under Beijing Municipality (Project No.: IDHT20130502); Beijing Higher Education Teachers Team Construction Special Training Project---2014 General Abroad Visiting Program for Young Backbone Teachers ( Project No.: 067145301400) References [1] Mehdi Kashfi, Abdolreza Hajmoosaei. Optimal Distributed Data Warehouse System Architecture: Big Data and Cloud Computing, 2014[C]. Sydney: IEEE, 2014:110-115 [2] Li Weiwei, Li Mei, Zhang Yang etc. Research of Classification Analysis for Distributed Data Warehouse[J]. Application Research of Computers, 2013, 30(10):2936-2939,2043 595

[3] Xiong Zhongyang, Huang Hailong, Zhang Yufang etc. Multi-level Data Warehouse Architecture and Update Algorithm of Double Channels View [J]. Computer Science,2002, 29(6):79-81 [4] Ye Zheng. Design of Disributed Data Warehouse for Sales Decision of Large-scale Clothing Enterprise [D]. Zhejiang: Zhejiang University, 2006 [5] Yang Yiping. Research and Design for Data Scheduling Based on Distributed Data Warehouse [D]. Beijing: Beijing University of Posts and Telecommunications, 2006 [6] Wang Shan, Chen Kun. Research on Task Scheduling Method Based on the Greedy Algorithm in ETL [J]. Microelectronics & Computer, 2009, 26(7):130-133 [7] Zhong Qiuxi, Xie Tao, Chen Huowang. Task Matching and Scheduling by Using Genetic Algorithms [J]. Journal of Computer Research and Development, 2000, 37(10):1197-1203 [8] Liu Peiyu. A Study on Sharing Techniques for Multi-users in LAN [J]. Computer Engineering and Applications, 2000(07):114-115,141 [9] Steve Fogel, Tony Morales, Padmaja Potineni, Sheila Moore. Oracle Database Administrator's Guide[EB/OL]. (2008-03)[2015-01-06] http://docs.oracle.com/cd/b28359_01/server.111/b28310/ds_concepts002.htm#admin12085 [10] Muxiu Zhu, Xianjiu Zhang. The Management of Big Data Tables Based on Oracle Partition Technology: Computer Science and Education, 2014[C]. Vancouver:IEEE, 2014:570-572 596