A Multi-Analyzer Machine Learning Model for Marine Heterogeneous Data Schema Mapping

Wang Yan 1, 2, Le Jiajin 3, Zhang Yun 2
1 Glorious Sun School of Business and Management, Donghua University
2 College of Information Technology, Shanghai Ocean University
3 School of Computer Science and Technology, Donghua University
Shanghai, China

ABSTRACT: In heterogeneous data integration, an effective machine learning model plays an important role in schema mapping. This paper first analyzes a schema mapping machine learning model and its improvement through probability learning, and then puts forward the concept of a multi-analyzer model based on fuzzy comprehensive evaluation to improve the efficiency and accuracy of the machine learning results. Comprehensive experiments confirm the effectiveness of the multi-analyzer model.

Keywords: Marine Data, Heterogeneity, Machine Learning, Schema Mapping, Fuzzy Comprehensive Evaluation

Received: 30 May 2015, Revised 2 July 2015, Accepted 11 July 2015

2015 DLINE. All Rights Reserved

1. Introduction

1.1 Heterogeneity of Marine Data and Integration Mappings
Marine data is typical heterogeneous data with multi-scale, multi-temporal and multi-semantic features. Marine data covers marine resources, marine geography, marine biology, marine chemistry, and many other fields. Because of differences between acquisition equipment, information processing platforms and data storage formats, heterogeneous data exist even within the same field. For schema mapping, the features of marine heterogeneous data can be described as follows:

(1) Large-scale. Marine data come from large monitoring sensor systems, which transmit huge amounts of marine data.

(2) Distribution uncertainty. Distribution uncertainty includes both physical and logical distribution uncertainty.
Physical distribution uncertainty means that the data sources are not physically concentrated but are distributed across multiple regions connected by a network. Logical distribution uncertainty means that different properties of the same data cube may exist in different data sources.

110 Journal of Multimedia Processing and Technologies Volume 6 Number 4 December 2015

(3) Semantic heterogeneity.
Naming rules and data types in each marine data source are more or less different, and there is also semantic heterogeneity of the data in marine databases. When a solid data model is built, different granularity divisions, different relations between entities, and differences in the semantic descriptions of entities across data sources all result in semantically heterogeneous data.

(4) Semi-standardized. Although semantic heterogeneity of schema and data exists in marine databases, the descriptions of marine information follow certain specifications and standards and have some commonality. For example, hydrographic tables contain tide, water temperature and wave height data, while the time, location and wave fields in these tables have significant structural features. These loose constraints are called semi-standardized.

Marine heterogeneous integration is the organic, logical or physical integration of marine data of different formats, sources, natures and characteristics. The goal is to convert, adjust, decompose and merge the differing formal features of marine data (such as units, format and scale) and internal features (such as properties), and finally form compatible, seamless marine data sets [1]. In the process of integrating marine heterogeneous data, how to improve the efficiency of heterogeneous data schema mapping must be considered.

A schema mapping relation is the combination of element mappings between heterogeneous data models in the same field or heterogeneous island. The schema mapping problem is a research hotspot in heterogeneous data integration and transformation. Instantiating a schema mapping is a process in which users select some data or data sets by operating on the data entries, determine the data flow direction, and select the target database and mapping table. How to implement automatic or semi-automatic schema mapping is the problem addressed by machine mapping.
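The instantiation process described above can be pictured, at its simplest, as a table of element correspondences from source fields to target fields. The following minimal sketch is illustrative only; the field names and the `apply_mapping` helper are hypothetical, not taken from any real marine database.

```python
def apply_mapping(record, mapping):
    """Map a source record onto target fields via a correspondence table."""
    return {target: record[source]
            for source, target in mapping.items() if source in record}

# Hypothetical correspondence table: source field -> target field
tide_mapping = {
    "stn_id": "station",
    "obs_time": "timestamp",
    "wave_h": "wave_height_m",
    "tide_lvl": "tide_level_m",
}

source_record = {"stn_id": "S01", "obs_time": "2015-05-30T12:00",
                 "wave_h": 1.8, "tide_lvl": 3.2}
target_record = apply_mapping(source_record, tide_mapping)
```

The machine learning problem of the following sections is then to discover such a correspondence table automatically rather than having a user specify it by hand.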
1.2 Automatic Schema Mapping Based on Machine Learning
Automatic mapping between heterogeneous data can be achieved by constructing a learning machine. A large number of studies indicate that multi-strategy machine learning usually has higher accuracy than a single strategy; at the cost of increased system complexity, a lower error rate can be obtained. Current research on multi-strategy learning machines focuses on combining learning models and on combining the outputs of single learners. In [2, 3], BayesIDF learning and grammar learning are combined effectively, and a multi-strategy approach is proposed for information extraction. The multi-strategy learning model in [4] refers to a framework that combines the results of multiple learning methods, also known as meta-learning [5]. These analyses theoretically prove that a combination of learning results is more accurate than the best result of any individual learner. In [6], a rule-based method and a maximum-entropy model are combined into a hybrid approach to determine the appropriate temporal relation between a pair of entities. A new parameter-based multi-domain online learning framework is proposed in [7]. EnsembleMatrix [8] helps users understand the relative merits of various classifiers and analyzers, and allows users to explore and build a combined model through direct, interactive visualization. Other ensemble methods have also been presented, such as boosting [9], stacking [10] and bagging [11]; these methods repeat single-learner training, apply the results to different parts of one problem, and combine the results to obtain a performance improvement. However, because of the huge differences between marine samples and heterogeneous data, it is difficult to give an accurate judgment from the individual outputs of the learners.
Firstly, this paper describes the characteristics of a machine learning schema mapping model for marine heterogeneous data, and discusses how to improve the efficiency of the automatic schema mapping model by probability learning. Then, in order to evaluate the output of automatic schema mapping more accurately, the concept of multi-learner and multi-analyzer models is put forward, and fuzzy comprehensive evaluation is applied to the multi-analyzer model to quantify the various factors of the learner output during heterogeneous sampling and obtain more accurate combined learning results. Finally, the multi-analyzer model is verified in conversion experiments on multi-source marine observation data.

2. Automatic Schema Mapping Based on Machine Learning

An automatic schema mapping model based on machine learning generally contains an input interface, a learner, an analyzer and a human-machine interface, as shown in Figure 1. The input interface checks the data, and filters and selects the samples. The learner is responsible for statistical analysis and feature extraction; by training and learning on the input data and the sample object model, it generates an internal data model.
After analyzing the learning results, the analyzer can adjust the parameters of the learning machine and reconfigure strategies to improve the efficiency of schema mapping and real-time matching. The human-machine interface lets users operate the UI and input feedback adjustment information and parameters to form positive feedback.

Figure 1. Flow diagram of automatic schema mapping based on machine learning

In the training phase, the training data sets are set by default; the learning machine extracts the characteristic parameters of the data sets, and an internal schema mapping prediction model is generated. The analyzer analyzes the output data of the learning machine and feeds parameters back. In the schema mapping phase, the learner analyzes the data source according to the prediction model and monitors the schema mapping results in real time. In this process, users can intervene in the schema mapping results through the UI.

In the self-learning process, the selection of learning samples is an important part. Sample sets should have maximal global matching attributes. Let the sample set β have the attributes {b_1, b_2, ..., b_n}, and let the global learning sample γ have the global attribute set {b_f1, b_f2, ..., b_fn}, where n is the number of global properties. The sample set β that minimizes the difference between {b_1, b_2, ..., b_n} and {b_f1, b_f2, ..., b_fn} is called the best sample set.

For a marine data system, assume there are two kinds of data, the source data SOD and the target data DOD, together with some schema mapping constraints and constraint relationships given in the user interface. A schema mapping can be established from all of these. For example, the marine source data contain the sites, latitudes and longitudes of sensor nodes, sensing times, and tidal waves. After the target schema fields and the specified schema relationships are defined, a detailed training process table can be obtained.
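One way to read the best-sample-set criterion above is as maximizing attribute overlap with the global attribute set (equivalently, minimizing the unmatched attributes). A minimal sketch under that assumption, with hypothetical attribute names:

```python
# Sketch of one reading of the "best sample set" criterion: among
# candidate sample sets, prefer the one whose attributes cover the
# most global attributes. Attribute names are illustrative.

def best_sample_set(candidates, global_attrs):
    """Return the candidate attribute set with maximal overlap with global_attrs."""
    return max(candidates, key=lambda attrs: len(set(attrs) & set(global_attrs)))

global_attrs = {"b1", "b2", "b3", "b4"}
candidates = [{"b1", "b2"}, {"b1", "b2", "b3"}, {"b4"}]
best = best_sample_set(candidates, global_attrs)
```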
If the samples are extracted according to a probability distribution model, the probability distribution of the properties over the large sample can be calculated and the data selection function of the input interface can be optimized. Although enough samples are required, the training process cannot cover all instance sets. If the statistical distribution of the sample properties is known, a prospective study can be made in advance. When the sample data sets are selected, the attribute tags in the sample data can be identified beforehand; these tags, recorded in the real environment, mark the distribution density of each attribute, as shown in Figure 2. Samples can be searched according to tags in the high-probability intervals of the distribution and sorted according to the probability distribution. In this way we obtain data sets for the various properties: highly matched collection sets, moderately matched sets and lowly matched ones. Samples are selected randomly from these sets, and the training results are then modified through the manual intervention interface during training. The results of probability distribution learning are transferred to the learners and analyzers; the comprehensive analysis modules in the analyzer feed back the judgment information and query the learner's pre-judgments for training. The analyzer's field judgments can be expressed mathematically as follows:
Figure 2. The sample filter based on the probability distribution

m(f(x_s, p, t, x_d)) =
    1, if p > 0.8 (default probability-1 threshold)
    0, if p < 0.4 (default probability-0 threshold)

For any instance group x_s, we obtain the schema mapping values through the schema mapping function f and compare these values to the known target instances; 0 or 1 is obtained by calculating the matching probability and comparing it to the probability thresholds. The learner then gives the scores of the schema mapping effect. If there are multiple learners, the outputs of the learners need to be compared and ranked. Finally, the analyzer outputs a record information table, which covers the sample properties and the proposed parameters of the learning machine, and dynamically adjusts the parameters according to the current input and response results.

After training, the analyzer can begin the mapping process for new data sources and new data. An ocean observation data mapping process is shown in Figure 3. The data matching the distribution probability are extracted from instances of the observed data source and used to detect and judge the information units. The correlation matching probability from the learning machine is then given to the analysis module for its 0/1 judgment according to the predefined threshold values. Throughout the process, users give real-time feedback and modifications by operating the UI.

3. Heterogeneous Data Schema Mapping Optimization Based on Multi-Analyzer

3.1 Multi-Analyzer Concept
The analyzer selects the best learning path from the outputs of the learners, so it must first consider the subdivision of the analysis theme domain, the identification of the major analysis subjects, and object sorting. When selecting the subject field, universality, completeness and independence must be considered first, so that the selected subject field covers the other subject fields. Secondly, the evaluation granularity affects the difficulty and parallelism of judgment and calculation.
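The sample filter of Figure 2 and the 0/1 field judgment above can be sketched together. The bucketing thresholds (0.7, 0.4) for the sample filter are illustrative; the judgment thresholds (0.8, 0.4) are the defaults from the formula, with probabilities between them returned as undecided and deferred to user feedback, which is one reasonable reading of the model.

```python
import random

def bucket_samples(samples):
    """Split (sample_id, tag_probability) pairs into highly / moderately /
    lowly matched collection sets. Bucket thresholds are illustrative."""
    high = [s for s, p in samples if p >= 0.7]
    moderate = [s for s, p in samples if 0.4 <= p < 0.7]
    low = [s for s, p in samples if p < 0.4]
    return high, moderate, low

def field_judgment(p, hi=0.8, lo=0.4):
    """Analyzer's 0/1 judgment on a matching probability p.
    Returns 1 (match) above hi, 0 (no match) below lo, and None in
    between, where the model defers to the user interface."""
    if p > hi:
        return 1
    if p < lo:
        return 0
    return None

samples = [("s1", 0.9), ("s2", 0.55), ("s3", 0.2), ("s4", 0.75)]
high, moderate, low = bucket_samples(samples)
training_draw = random.choice(high)  # random selection within a bucket
```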
Finally, the selection of an appropriate evaluation method must be based on the features of the samples; we can select probability-based or fuzzy evaluation methods.

In the automatic machine schema mapping model, a multi-analyzer improves the schema mapping accuracy. As shown in Figure 4, the multi-analyzer builds a score matrix for the matching correlation of the learners' output results. The multi-analyzer uses a multi-scoreboard strategy to match different data or learning strategies. It can expand the matching and synthesis process, introduce multi-stage matching and integration, and select the best analysis result from the outputs.
Figure 3. Machine schema mapping process of ocean observation data

Figure 4. Multi-analyzer automatic schema mapping structure

P_n = [m_1(f(x_s, p, t, x_d)), m_2(f(x_s, p, t, x_d)), m_3(f(x_s, p, t, x_d)), ..., m_n(f(x_s, p, t, x_d))]^T

Field analysis can thus be expressed as a multidimensional function that extends the determination mode. But for multidimensional analysis results, how to convert the multiple outputs into a single output, and how to select an appropriate function to achieve dimension reduction and simplify the analysis, are difficult and important questions. There are huge differences in the massive samples and heterogeneous data, so the final output and determination of the multi-analyzer become more difficult and confused.
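The n-analyzer output P_n above is simply a vector of per-analyzer judgments on the same schema-mapping value, which is what makes the dimension-reduction question arise. A tiny sketch (the analyzer functions are hypothetical stand-ins):

```python
# Sketch: with n analyzers m_1..m_n, the multi-analyzer output P_n is
# the vector of each analyzer's judgment on the same mapping value.

def multi_analyzer(analyzers, mapping_value):
    return [m(mapping_value) for m in analyzers]

# Hypothetical analyzers differing only in their default thresholds
analyzers = [
    lambda p: 1 if p > 0.8 else 0,
    lambda p: 1 if p > 0.6 else 0,
    lambda p: 1 if p > 0.9 else 0,
]
P = multi_analyzer(analyzers, 0.85)
# P is multidimensional and still needs reduction to a single output
```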
So fuzzy comprehensive evaluation is introduced for the multi-analyzer's results, simplifying them by fuzzy dimension reduction.

3.2 Fuzzy Comprehensive Evaluation Method
The basic idea of fuzzy comprehensive evaluation is to consider the various factors associated with the objects, and to make a reasonable evaluation using fuzzy linear transformation theory and the maximum membership degree principle.

There are m factors related to the evaluated object; they are called the factor set and denoted

U = {u_1, u_2, ..., u_m}

There are n comments, which are called the evaluation set and denoted

V = {v_1, v_2, ..., v_n}

First, each factor u_i in the factor set U undergoes a single-factor evaluation, which determines the membership degree value r_ij of u_i with respect to each comment v_j and forms the single-factor evaluation set of u_i:

r_i = (r_i1, r_i2, ..., r_in), r_i ∈ F(V)

which is a fuzzy subset of the evaluation set V. The fuzzy-valued schema mapping function is then

f: U → F(V), u_i ↦ f(u_i)

where f(u_i) = r_i = (r_i1, r_i2, ..., r_in) is the fuzzy comment vector for the factor u_i, and r_ij is the relationship degree between u_i and v_j. All the single-factor evaluations are then put together to build the general evaluation matrix, i.e. the fuzzy comprehensive evaluation matrix R from U to V:

R = (r_ij)_{m×n}

In other words, the fuzzy-valued function f induces the fuzzy relation from U to V, and R is the comprehensive evaluation matrix.

As the factors influence the evaluation with different strengths, a fuzzy weight set on U is necessary, defined as

A = (a_1, a_2, ..., a_m), with a_1 + a_2 + ... + a_m = 1

where a_i measures the impact of the factor u_i (i = 1, 2, ..., m) on the evaluation. A is called the importance fuzzy subset on U, and a_i the importance coefficient of factor u_i. The fuzzy comprehensive evaluation model is then used to calculate the fuzzy comprehensive evaluation set.
When the importance fuzzy set A and the comprehensive evaluation matrix (fuzzy relation) R are known, the fuzzy subset B on the evaluation set V is obtained from R by the linear transform

B = A ∘ R
In the composition B = A ∘ R = (b_1, b_2, ..., b_n), the operator ∘ combines the generalized fuzzy "and" operation ∧ with the generalized fuzzy "or" operation ∨, i.e. b_j = ∨_i (a_i ∧ r_ij). B is called the fuzzy comprehensive evaluation set for V, and the formula is known as the comprehensive evaluation model. Finally, according to the maximum membership degree principle, the largest membership b_j in the set (b_1, b_2, ..., b_n) gives the result of the comprehensive evaluation, namely the comment element v_j corresponding to b_j in the fuzzy comprehensive evaluation set. Compared with a single-factor evaluation, in which the membership of factor u_i for comment v_j is just r_ij (j = 1, 2, ..., n), the results of the operations a_i ∧ r_ij (denoted r_ij*) reflect all the influencing factors and give more accurate evaluations. That is the merit of the fuzzy comprehensive evaluation method.

3.3 Multi-Analyzer with Fuzzy Comprehensive Evaluation
In order to evaluate the learners' outputs, the fuzzy comprehensive evaluation method can be applied during output quantization to the learning machine for massive heterogeneous marine data schema mapping, taking all dynamic, ambiguous and real-time factors into account. In a schema mapping system with a multi-analyzer, the fuzzy comprehensive evaluation method improves the accuracy and sensitivity of automatic mapping.

(1) The evaluation factor set of the learning model is U = {U_1, U_2, ..., U_n}, and the weight set is A = {A_1, A_2, ..., A_n}, where A_i is the weight of evaluation factor U_i and the A_i sum to 1. Each U_i = {u_i1, u_i2, ..., u_in}, where n is the number of specific performance factors derived from the design factors. The weight set A_i = {a_i1, a_i2, ..., a_in}, where a_ij is the weight of u_ij and the a_ij sum to 1. The performance factor set of the automatic schema mapping learning machine model is shown in Table 1.
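As a concrete illustration of B = A ∘ R with the max-min composition (∨ over ∧) and the maximum-membership decision, here is a minimal sketch; the weights and membership values are made-up examples, not data from the paper.

```python
# Sketch of fuzzy comprehensive evaluation B = A ∘ R using the
# max-min composition, followed by the maximum-membership principle.

def fce(A, R):
    """A: weight vector of length m; R: m x n membership matrix.
    Returns B of length n with b_j = max_i min(a_i, r_ij)."""
    n = len(R[0])
    return [max(min(a, row[j]) for a, row in zip(A, R)) for j in range(n)]

A = [0.5, 0.3, 0.2]          # importance of factors u1..u3 (illustrative)
R = [[0.7, 0.2, 0.1],        # membership of u1 in comments v1..v3
     [0.4, 0.5, 0.1],
     [0.1, 0.3, 0.6]]
B = fce(A, R)
verdict = B.index(max(B))    # index of the winning comment v_j
```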
Table 1. Evaluation factor hierarchy

Objective: evaluation of the learner's output
    Level 1: Correlation model evaluation U1 (Level 2 factors U11, U12)
    Level 1: Performance analysis evaluation U2 (Level 2 factors U21, U22)
    Level 1: Adaptability analysis evaluation U3 (Level 2 factors U31, U32)

The weight set is determined by matching-degree evaluation, according to an expert scoring system's evaluation of all the performance indexes.

The single-factor performance evaluation of the learning model evaluates the membership index of each factor, and a fuzzy schema
mapping function f: U → F(V) is given. For each U_i, R_i can be written as the fuzzy matrix

R_i = (r_jk)

where r_jk represents the membership grade of factor u_ij to the k-th evaluation level v_k. The value of r_jk can be determined by the expert scoring system: if the number of level-v_k reviews for u_ij is s_k, then r_jk = s_k / (s_1 + s_2 + ... + s_n).

The membership vector B_i of the evaluation set V, B_i = A_i ∘ R_i = (b_i1, b_i2, ..., b_im), is the summarized single-factor fuzzy evaluation result for the factor U_i. After the single-factor evaluations, the fuzzy comprehensive evaluation extends to all levels of the learning indexes: the single-factor results B_i are stacked to constitute the fuzzy matrix R, and after the fuzzy matrix operations on R, the membership vector B from the factor set U to the comment levels V is given by

B = A ∘ R = (b_1, b_2, ..., b_m)

with the normalized expression

b_k' = b_k / (b_1 + b_2 + ... + b_m)

According to the maximum membership principle, the performance review v_k corresponding to the maximum membership degree b_k in B is the performance level of the model design stage. The decision-maker determines the performance index of the model design according to the critical index threshold. It should be noted that the weights of the model need to be adjusted for different learning machine models and learning strategies, so for multiple learning strategies the weight parameter sets may be loaded dynamically.

4. Experiments

The factors and weights of the fuzzy comprehensive evaluation can be adjusted according to the actual requirements during the marine heterogeneous data schema mapping process. For example, for multi-source marine tidal data, the evaluation factors and weight ranges are shown in Table 2. We compare the schema mapping conversion accuracy and efficiency of 20 groups of multi-source marine data under a single analyzer and under a multi-analyzer based on fuzzy comprehensive evaluation.
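The two-level procedure above (single-factor vectors B_i stacked and weighted again by A, then normalized) can be sketched as follows, here using the weighted-sum variant of the composition (one common choice; the paper's ∘ could equally be the max-min form shown earlier). All weights and memberships are illustrative.

```python
# Sketch of a two-level fuzzy comprehensive evaluation with the
# weighted-average operator b_j = sum_i a_i * r_ij, plus normalization.

def weighted_fce(A, R):
    """A: weight vector of length m; R: m x n membership matrix."""
    n = len(R[0])
    return [sum(a * row[j] for a, row in zip(A, R)) for j in range(n)]

def normalize(B):
    s = sum(B)
    return [b / s for b in B]

# Level 2 -> level 1: one membership vector B_i per first-level factor U_i
B1 = weighted_fce([0.6, 0.4], [[0.8, 0.2], [0.5, 0.5]])
B2 = weighted_fce([0.5, 0.5], [[0.3, 0.7], [0.6, 0.4]])

# Level 1 -> objective: stack the B_i as rows of R and weight by A
B = weighted_fce([0.7, 0.3], [B1, B2])
grade = normalize(B)  # normalized membership over the comment levels
```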
There are eight types of conversion information: location coordinates, timestamp information, wave heights, tidal times, observation instrument information, data packet length, data backhaul IP addresses, and temperature information. The schema mapping error rate is defined as the ratio of errors to the total number of schema mappings after the learning process; it measures the schema mapping accuracy and model quality. The different machine learning models are built with Matlab to map 100,000 heterogeneous data fields.

Table 2. Tidal data evaluation factors hierarchy

Objective: evaluation of the multi-source marine tidal data learning machine output

Level 1: Correlation model evaluation
    Conversion coefficient score of the outputs of tide data in time t1~tn: 15%~20%
    Matching score of same-dimension tidal data (surface data / underwater data) conversion output: 10%~15%
    Output distribution probability matching of key data (dynamic location coordinates, time information, wave height information, tide time information): 5%~10%
    High-tide / low-tide limit spatial matching degree: 0%~5%

Level 1: Conversion performance analysis evaluation
    Response time (max/min/avg) under typical long-term observation: 5%~10%
    Response time (max/min/avg) under typical short-term observation: 5%~10%
    Response time (max/min/avg) under a long-term observation tidal day: 0%~5%
    Response time (max/min/avg) under a short-term observation tidal day: 5%~10%

Level 1: Adaptability analysis evaluation
    Data fluctuation under long-term observation: 5%~10%
    Output fluctuation of same-dimension (surface data / underwater data) tidal data: 5%~10%

We obtain the following results in the experiment. The Y-axis in Figures 5 and 6 is the schema mapping error rate. The error rate drops by 5% from the single-analyzer strategy to the multi-analyzer strategy, and the multi-analyzer with the fuzzy comprehensive evaluation method achieves a 3% average error rate.

Figure 5. Comparative analysis of the single-learner strategy

Figure 6. Comparative analysis of multi-learner strategies

5. Conclusion

Multi-strategy learning machine models are a research hotspot in automatic data schema mapping. Because of marine data's heterogeneity and large volume, the outputs of the learners become more difficult to judge and select accurately. In order to evaluate the learners' outputs more effectively and accurately, this paper presents the concept of the multi-analyzer, evaluates the multi-analyzer output with the fuzzy comprehensive evaluation method, and quantifies the various factors of the output under massive, heterogeneous samples. Finally, we compare three machine schema mapping methods in observation data conversion experiments. The experimental results show that the multi-analyzer improves accuracy significantly.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (Grant No. 41376178) and the Shanghai Science and Technology Committee (Project No. 11510501300).
References

[1] Huang Dongmei, Zhang Chi, Du Jipeng (2012). Integration of massive multi-source heterogeneous space-time data in digital sea. Marine Environmental Science, 31 (1), 11-113.
[2] Michalski, R. S., Carbonell, J. G., Mitchell, T. M. (1985). Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann.
[3] Domingos, P. (1996). Unifying instance-based and rule-based induction. Machine Learning, 24 (2), 141-168.
[4] Freitag, D. (2000). Machine learning for information extraction in informal domains. Machine Learning, 39 (2), 169-202.
[5] Chan, P. K., Stolfo, S. J. (1993). Experiments on multistrategy learning by meta-learning. In: Proceedings of the Second International Conference on Information and Knowledge Management, 314-323.
[6] Chang, Y. C., Dai, H. J., Wu, J. C. Y., et al. (2013). TEMPTing system: a hybrid method of rule and machine learning for temporal relation extraction in patient discharge summaries. Journal of Biomedical Informatics.
[7] Dredze, M., Kulesza, A., Crammer, K. (2010). Multi-domain learning by confidence-weighted parameter combination. Machine Learning, 79 (1-2), 123-149.
[8] Talbot, J., Lee, B., Kapoor, A., et al. (2009). EnsembleMatrix: interactive visualization to support machine learning with multiple classifiers. In: Proceedings of the 27th International Conference on Human Factors in Computing Systems. ACM, 1283-1292.
[9] Freund, Y., Schapire, R. E. (1996). Experiments with a new boosting algorithm. In: Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.
[10] Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5 (2), 241-259.
[11] Breiman, L. (1996). Bagging predictors. Machine Learning, 24 (2), 123-140.