UNIVERSITI PUTRA MALAYSIA A NEW APPROACH FOR INSTANCE-BASED SCHEMA MATCHING

Similar documents
SEMANTICS ORIENTED APPROACH FOR IMAGE RETRIEVAL IN LOW COMPLEX SCENES WANG HUI HUI

THE COMPARISON OF IMAGE MANIFOLD METHOD AND VOLUME ESTIMATION METHOD IN CONSTRUCTING 3D BRAIN TUMOR IMAGE

ISOGEOMETRIC ANALYSIS OF PLANE STRESS STRUCTURE CHUM ZHI XIAN

PERFOMANCE ANALYSIS OF SEAMLESS VERTICAL HANDOVER IN 4G NETWOKS MOHAMED ABDINUR SAHAL

BLOCK-BASED NEURAL NETWORK MAPPING ON GRAPHICS PROCESSOR UNIT ONG CHIN TONG UNIVERSITI TEKNOLOGI MALAYSIA

RECOGNITION OF PARTIALLY OCCLUDED OBJECTS IN 2D IMAGES ALMUASHI MOHAMMED ALI UNIVERSITI TEKNOLOGI MALAYSIA

ENHANCEMENT OF UML-BASED WEB ENGINEERING FOR METAMODELS: HOMEPAGE DEVELOPMENT CASESTUDY KARZAN WAKIL SAID

HARDWARE AND SOFTWARE CO-SIMULATION PLATFORM FOR CONVOLUTION OR CORRELATION BASED IMAGE PROCESSING ALGORITHMS SAYED OMID AYAT

HARDWARE/SOFTWARE SYSTEM-ON-CHIP CO-VERIFICATION PLATFORM BASED ON LOGIC-BASED ENVIRONMENT FOR APPLICATION PROGRAMMING INTERFACING TEO HONG YAP

IMPROVED IMAGE COMPRESSION SCHEME USING HYBRID OF DISCRETE FOURIER, WAVELETS AND COSINE TRANSFORMATION MOH DALI MOUSTAFA ALSAYYH

HARDWARE-ACCELERATED LOCALIZATION FOR AUTOMATED LICENSE PLATE RECOGNITION SYSTEM CHIN TECK LOONG UNIVERSITI TEKNOLOGI MALAYSIA

OPTIMIZE PERCEPTUALITY OF DIGITAL IMAGE FROM ENCRYPTION BASED ON QUADTREE HUSSEIN A. HUSSEIN

UNIVERSITI PUTRA MALAYSIA EFFECTS OF DATA TRANSFORMATION AND CLASSIFIER SELECTIONS ON URBAN FEATURE DISCRIMINATION USING HYPERSPECTRAL IMAGERY

A LEVY FLIGHT PARTICLE SWARM OPTIMIZER FOR MACHINING PERFORMANCES OPTIMIZATION ANIS FARHAN BINTI KAMARUZAMAN UNIVERSITI TEKNOLOGI MALAYSIA

UNIVERSITI PUTRA MALAYSIA MULTI-LEVEL MOBILE CACHE CONSISTENCY SCHEMES BASED ON APPLICATION REQUIREMENTS DOHA ELSHARIEF MAHMOUD YAGOUB

UNIVERSITI PUTRA MALAYSIA RELIABILITY PERFORMANCE EVALUATION AND INTEGRATION OF ROUTING ALGORITHM IN SHUFFLE EXCHANGE WITH MINUS ONE STAGE

SYSTEMATIC SECURE DESIGN GUIDELINE TO IMPROVE INTEGRITY AND AVAILABILITY OF SYSTEM SECURITY ASHVINI DEVI A/P KRISHNAN

LINK QUALITY AWARE ROUTING ALGORITHM IN MOBILE WIRELESS SENSOR NETWORKS RIBWAR BAKHTYAR IBRAHIM UNIVERSITI TEKNOLOGI MALAYSIA

OPTIMIZED BURST ASSEMBLY ALGORITHM FOR MULTI-RANKED TRAFFIC OVER OPTICAL BURST SWITCHING NETWORK OLA MAALI MOUSTAFA AHMED SAIFELDEEN

ENHANCING TIME-STAMPING TECHNIQUE BY IMPLEMENTING MEDIA ACCESS CONTROL ADDRESS PACU PUTRA SUARLI

FUZZY NEURAL NETWORKS WITH GENETIC ALGORITHM-BASED LEARNING METHOD M. REZA MASHINCHI UNIVERSITI TEKNOLOGI MALAYSIA

INTEGRATION OF CUBIC MOTION AND VEHICLE DYNAMIC FOR YAW TRAJECTORY MOHD FIRDAUS BIN MAT GHANI

SECURE-SPIN WITH HASHING TO SUPPORT MOBILITY AND SECURITY IN WIRELESS SENSOR NETWORK MOHAMMAD HOSSEIN AMRI UNIVERSITI TEKNOLOGI MALAYSIA

ONTOLOGY-BASED SEMANTIC HETEROGENEOUS DATA INTEGRATION FRAMEWORK FOR LEARNING ENVIRONMENT

IMPLEMENTATION OF UNMANNED AERIAL VEHICLE MOVING OBJECT DETECTION ALGORITHM ON INTEL ATOM EMBEDDED SYSTEM

DYNAMIC TIMESLOT ALLOCATION TECHNIQUE FOR WIRELESS SENSOR NETWORK OON ERIXNO

AN IMPROVED PACKET FORWARDING APPROACH FOR SOURCE LOCATION PRIVACY IN WIRELESS SENSORS NETWORK MOHAMMAD ALI NASSIRI ABRISHAMCHI

SUPERVISED MACHINE LEARNING APPROACH FOR DETECTION OF MALICIOUS EXECUTABLES YAHYE ABUKAR AHMED

A NEW STEGANOGRAPHY TECHNIQUE USING MAGIC SQUARE MATRIX AND AFFINE CIPHER WALEED S. HASAN AL-HASAN UNIVERSITI TEKNOLOGI MALAYSIA

LOCALIZING NON-IDEAL IRISES VIA CHAN-VESE MODEL AND VARIATIONAL LEVEL SET OF ACTIVE CONTOURS WITHTOUT RE- INITIALIZATION QADIR KAMAL MOHAMMED ALI

AN INTEGRATED SERVICE ARCHITECTURE FRAMEWORK FOR INFORMATION TECHNOLOGY SERVICE MANAGEMENT AND ENTERPRISE ARCHITECTURE

AUTOMATIC APPLICATION PROGRAMMING INTERFACE FOR MULTI HOP WIRELESS FIDELITY WIRELESS SENSOR NETWORK

RGB COLOR IMAGE WATERMARKING USING DISCRETE WAVELET TRANSFORM DWT TECHNIQUE AND 4-BITS PLAN BY HISTOGRAM STRETCHING KARRAR ABDUL AMEER KADHIM

ENHANCING WEB SERVICE SELECTION USING ENHANCED FILTERING MODEL AJAO, TAJUDEEN ADEYEMI

CLOUD COMPUTING ADOPTION IN BANKING SYSTEM (UTM) IN TERMS OF CUSTOMERS PERSPECTIVES SHAHLA ASADI

COLOUR IMAGE WATERMARKING USING DISCRETE COSINE TRANSFORM AND TWO-LEVEL SINGULAR VALUE DECOMPOSITION BOKAN OMAR ALI

UNIVERSITI PUTRA MALAYSIA

ADAPTIVE LOOK-AHEAD ROUTING FOR LOW LATENCY NETWORK ON-CHIP NADERA NAJIB QAID AL AREQI UNIVERSITI TEKNOLOGI MALAYSIA

SLANTING EDGE METHOD FOR MODULATION TRANSFER FUNCTION COMPUTATION OF X-RAY SYSTEM FARHANK SABER BRAIM UNIVERSITI TEKNOLOGI MALAYSIA

UNIVERSITI SAINS MALAYSIA. CMT322/CMM323 Web Engineering & Technologies [Kejuruteraan & Teknologi Web]

UNIVERSITI PUTRA MALAYSIA RANK-ORDER WEIGHTING OF WEB ATTRIBUTES FOR WEBSITE EVALUATION MEHRI SAEID

STUDY OF FLOATING BODIES IN WAVE BY USING SMOOTHED PARTICLE HYDRODYNAMICS (SPH) HA CHEUN YUEN UNIVERSITI TEKNOLOGI MALAYSIA

This item is protected by original copyright

MODELLING AND REASONING OF LARGE SCALE FUZZY PETRI NET USING INFERENCE PATH AND BIDIRECTIONAL METHODS ZHOU KAIQING

DETECTION OF WORMHOLE ATTACK IN MOBILE AD-HOC NETWORKS MOJTABA GHANAATPISHEH SANAEI


LOGICAL OPERATORS AND ITS APPLICATION IN DETERMINING VULNERABLE WEBSITES CAUSED BY SQL INJECTION AMONG UTM FACULTY WEBSITES NURUL FARIHA BINTI MOKHTER

MICRO-MOBILITY ENHANCEMENT IN MULTICAST MOBILE IPv6 WIRELESS NETWORKS. By YONG CHU EU

PROBLEMS ASSOCIATED WITH EVALUATION OF EXTENSION OF TIME (EOT) CLAIM IN GOVERNMENT PROJECTS

DEVELOPMENT OF SPAKE S MAINTENANCE MODULE FOR MINISTRY OF DEFENCE MALAYSIA SYED ARDI BIN SYED YAHYA KAMAL UNIVERSITI TEKNOLOGI MALAYSIA

Signature :.~... Name of supervisor :.. ~NA.lf... l.?.~mk.. :... 4./qD F. Universiti Teknikal Malaysia Melaka

UNSTEADY AERODYNAMIC WAKE OF HELICOPTER MAIN-ROTOR-HUB ASSEMBLY ISKANDAR SHAH BIN ISHAK UNIVERSITI TEKNOLOGI MALAYSIA

STATISTICAL APPROACH FOR IMAGE RETRIEVAL KHOR SIAK WANG DOCTOR OF PHILOSOPHY UNIVERSITI PUTRA MALAYSIA

BORANG PENGESAHAN STATUS TESIS

ADAPTIVE ONLINE FAULT DETECTION ON NETWORK-ON-CHIP BASED ON PACKET LOGGING MECHANISM LOO LING KIM UNIVERSITI TEKNOLOGI MALAYSIA

UNIVERSITI PUTRA MALAYSIA AMTREE PROTOCOL ENHANCEMENT BY MULTICAST TREE MODIFICATION AND INCORPORATION OF MULTIPLE SOURCES ALI MOHAMMED ALI AL SHARAFI

HIGH SPEED SIX OPERANDS 16-BITS CARRY SAVE ADDER AWATIF BINTI HASHIM

IMAGE SLICING AND STATISTICAL LAYER APPROACHES FOR CONTENT-BASED IMAGE RETRIEVAL JEHAD QUBIEL ODEH AL-NIHOUD

DEVELOPMENT OF A MOBILE ROBOT SPATIAL DATA ACQUISITION SYSTEM OOI WEI HAN MASTER OF SCIENCE UNIVERSITI PUTRA MALAYSIA

ENERGY-EFFICIENT DUAL-SINK ALGORITHMS FOR SINK MOBILITY IN EVENT-DRIVEN WIRELESS SENSOR NETWORKS

UNIVERSITI PUTRA MALAYSIA GRAPHICAL USER INTERFACE LAYOUT LANGUAGE USING COMBINATORS KHAIRUL AZHAR KASMIRAN FSKTM

HERMAN. A thesis submitted in fulfilment of the requirements for the award of the degree of Doctor of Philosophy (Computer Science)

DYNAMIC MOBILE SERVER FOR LIVE CASTING APPLICATIONS MUHAMMAD SAZALI BIN HISHAM UNIVERSITI TEKNOLOGI MALAYSIA

CAMERA CALIBRATION FOR UNMANNED AERIAL VEHICLE MAPPING AHMAD RAZALI BIN YUSOFF

MULTICHANNEL ORTHOGONAL FREQUENCY DIVISION MULTIPLEXING -ROF FOR WIRELESS ACCESS NETWORK MOHD JIMMY BIN ISMAIL

SOLUTION AND INTERPOLATION OF ONE-DIMENSIONAL HEAT EQUATION BY USING CRANK-NICOLSON, CUBIC SPLINE AND CUBIC B-SPLINE WAN KHADIJAH BINTI WAN SULAIMAN


AN ENHANCED CONNECTIVITY AWARE ROUTING PROTOCOL FOR VEHICULAR AD HOC NETWORKS AHMADU MAIDORAWA

UNIVERSITI PUTRA MALAYSIA PERFORMANCE ENHANCEMENT OF AIMD ALGORITHM FOR CONGESTION AVOIDANCE AND CONTROL

VIRTUAL PRIVATE NETWORK: ARCHITECTURE AND IMPLEMENTATIONS

UNIVERSITI PUTRA MALAYSIA ADAPTIVE METHOD TO IMPROVE WEB RECOMMENDATION SYSTEM FOR ANONYMOUS USERS

FINITE ELEMENT INVESTIGATION ON THE STRENGTH OF SEMI-RIGID EXTENDED END PLATE STEEL CONNECTION USING LUSAS SOFTWARE MOHD MAIZIZ BIN FISHOL HAMDI

A SEED GENERATION TECHNIQUE BASED ON ELLIPTIC CURVE FOR PROVIDING SYNCHRONIZATION IN SECUERED IMMERSIVE TELECONFERENCING VAHIDREZA KHOUBIARI

IMPLEMENTATION AND PERFORMANCE ANALYSIS OF IDENTITY- BASED AUTHENTICATION IN WIRELESS SENSOR NETWORKS MIR ALI REZAZADEH BAEE

UNIVERSITI PUTRA MALAYSIA A MATRIX USAGE FOR LOAD BALANCING IN SHORTEST PATH ROUTING NOR MUSLIZA MUSTAFA FSKTM

ENHANCING SRAM PERFORMANCE OF COMMON GATE FINFET BY USING CONTROLLABLE INDEPENDENT DOUBLE GATES CHONG CHUNG KEONG UNIVERSITI TEKNOLOGI MALAYSIA

UNIVERSITI PUTRA MALAYSIA METHODOLOGY OF FUZZY-BASED TUNING FOR SLIDING MODE CONTROLLER

UNIVERSITI PUTRA MALAYSIA WEIGHTED WINDOW FOR TCP FAIR BANDWIDTH ALLOCATION IN WIRELESS LANS

RESOURCE ALLOCATION SCHEME FOR FUTURE USER-CENTRIC WIRELESS NETWORK WAHEEDA JABBAR UNIVERSITI TEKNOLOGI MALAYSIA

DATASET GENERATION AND NETWORK INTRUSION DETECTION BASED ON FLOW-LEVEL INFORMATION AHMED ABDALLA MOHAMEDALI ABDALLA

DEVELOPMENT OF VENDING MACHINE WITH PREPAID PAYMENT METHOD AMAR SAFUAN BIN ALYUSI

ANOMALY DETECTION IN WIRELESS SENSOR NETWORK (WSN) LAU WAI FAN

PRIVACY FRIENDLY DETECTION TECHNIQUE OF SYBIL ATTACK IN VEHICULAR AD HOC NETWORK (VANET) SEYED MOHAMMAD CHERAGHI

UNIVERSITI PUTRA MALAYSIA TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY WITH POSITION SCORE AND MEAN VALUE FOR MINING WEB CONTENT OUTLIERS

DETERMINING THE MULTI-CURRENT SOURCES OF MAGNETOENCEPHALOGRAPHY BY USING FUZZY TOPOGRAPHIC TOPOLOGICAL MAPPING

UNIVERSITI PUTRA MALAYSIA IMPROVED WAVELET DENOISING OF HYPERSPECTRAL REFLECTANCE USING LEVEL-INDEPENDENT WAVELET THRESHOLDING

UNIVERSITI PUTRA MALAYSIA IMPROVED MULTICROSSOVER GENETIC ALGORITHM FOR TWO- DIMENSIONAL RECTANGULAR BIN PACKING PROBLEM MARYAM SARABIAN FS

UNIVERSITI PUTRA MALAYSIA DEVELOPMENT OF CLASS 2 AND CLASS 3 SURGE PROTECTION DEVICES FOR LOW VOLTAGE PROTECTION SYSTEMS

ssk 2023 asas komunikasi dan rangkaian TOPIK 4.0 PENGALAMATAN RANGKAIAN Minggu 11

MINIATURIZED PLANAR CAPACITANCE TOMOGRAPHY SYSTEM USING FAN BEAM PROJECTION TECHNIQUE FOR STAGNANT SAMPLES VISUALIZATION

AMBA AXI BUS TO NETWORK-ON-CHIP BRIDGE NG KENG YOKE UNIVERSITI TEKNOLOGI MALAYSIA

QOS-AWARE HANDOVER SCHEME FOR HIERARCHICAL MOBILE IPv6 USING CONTEXT TRANSFER WITH LINK LAYER TRIGGER

UNIVERSITI SAINS MALAYSIA. CST333 Distributed & Grid Computing [Perkomputeran Teragih & Grid]

A TRUST MODEL FOR BUSINESS TO CUSTOMER CLOUD E-COMMERCE HOSSEIN POURTAHERI

BORANG PENGESAHAN STATUS TESIS

UNIVERSITI SAINS MALAYSIA. Peperiksaan Semester Pertama Sidang Akademik 2003/2004. September/Oktober 2003

A RULE MODELING ENGINE FOR COMPLEX EVENT PROCESSING (A CASE STUDY ON PASSIVE RFID READERS FOR A VIRTUAL SHOPPING MALL)

DESIGN AND IMPLEMENTATION OF A MUSIC BOX USING FPGA TAN KIAN YIAK

AN ENHANCED SIMULATED ANNEALING APPROACH FOR CYLINDRICAL, RECTANGULAR MESH, AND SEMI-DIAGONAL TORUS NETWORK TOPOLOGIES NORAZIAH BINTI ADZHAR

MAGNETIC FLUX LEAKAGE SYSTEM FOR WIRE ROPE INSPECTION USING BLUETOOTH COMMUNICATION MUHAMMAD MAHFUZ BIN SALEHHON UNIVERSITI TEKNOLOGI MALAYSIA

Transcription:

UNIVERSITI PUTRA MALAYSIA A NEW APPROACH FOR INSTANCE-BASED SCHEMA MATCHING OSAMAH ABDUL SATTAR MAHDI FSKTM 2014 5

A NEW APPROACH FOR INSTANCE-BASED SCHEMA MATCHING By OSAMAH ABDUL SATTAR MAHDI Thesis Submitted to the School of Graduate Studies, Universiti Putra Malaysia, in Fulfilment of the Requirements for the Degree of Master of Science May 2014

COPYRIGHT All material contained within the thesis, including without limitation text, logos, icons, photographs and all other artwork, is copyright material of Universiti Putra Malaysia unless otherwise stated. Use may be made of any material contained within the thesis for noncommercial purposes from the copyright holder. Commercial use of material may only be made with the express, prior, written permission of Universiti Putra Malaysia. Copyright Universiti Putra Malaysia

بسم هللا الرمحن الرحيم ك م ا أ ر س ل ن ا ف يك م ر س وال م نك م ي ت ل و ع ل ي ك م آي ات ن ا و ي ز ك يك م و ي ع ل م ك م ال ك ت اب و ال ح ك م ت و ي ع ل م ك م م ا ل م ت ك ون وا ت ع ل م ون DEDICATION سورة البقرة - آيت 151 This thesis is dedicated to my Dearest, precious and First Teachers: My Father and Mother I will always be grateful for your endless love, unlimited support and deep faith in me And My brother and sisters, who are like candles that burn to provide others light, Thanks to Allah for sending these angels to my world. ii

Abstract of thesis presented to the Senate of Universiti Putra Malaysia in fulfilment of the requirement for the degree of Master of Science A NEW APPROACH FOR INSTANCE-BASED SCHEMA MATCHING Chairman: By OSAMAH ABDUL SATTAR MAHDI May 2014 Professor Hamidah Ibrahim, PhD Faculty : Computer Science and Information Technology Schema matching is a crucial phase in data integration that aims to find correspondences between schema attributes by utilizing schema information. However, this information is not always available or useful to be used since it could be abbreviation. Consequently, instances could be an alternative choice for schema information. Various instance based schema matching approaches have been proposed to achieve the goal of discovering correspondences between schema attributes, by treating the instances as strings including the numeric instances. This prevents discovering common patterns or performing statistical computation among the numeric instances. As a consequence, this causes unidentified matches especially for attribute with numeric instances which further reduces the quality of match results. This thesis aims at proposing an efficient approach which is able to identify attribute matches between schemas by fully exploiting the instances. The approach utilizes the concept of pattern recognition to determine attribute matches for numeric and mix instances. This is acquired by automatically creating regular expression based on the instances. While, for alphabetic instances the approach calculates the semantic similarity score by utilizing Google similarity to capture the semantic relationships between instances. The proposed approach consists of five main phases, namely: (i) analysing instances, (ii) classifying schema attributes, (iii) extracting the optimal sample size, (iv) identifying instance similarity, and (v) identifying the match. Three analyses have been designed and conducted on two different data sets, namely: (i) Restaurant and (ii) Census, with respect to precision (P), recall (R), and F-measure (F). The first analysis aims at identifying the optimal sample size of tuples to be used during the phase of extracting the optimal sample size. The purpose of identifying the optimal sample size is to reduce the number of comparisons between the instances which lead to reduce the processing time of matching operation. This analysis showed that the optimal sample size is 50% from the actual table size of both data sets. The second analysis aims to investigate and to prove that combining both Google similarity and regular expression as in our proposed approach achieve higher accuracy compared to utilizing Google iii

similarity or regular expression separately. The results showed that our proposed approach achieved precision (P), recall (R), and F-measure (F) in the range of 93% - 99% for both data sets. On the other hand, Google similarity and regular expression which are performed separately achieved precision (P), recall (R), and F-measure (F) in the range of 36% - 74%. While the third analysis intents to compare the performance of our proposed approach to the previous approaches. The results showed that our proposed approach outperformed the previous approaches although only a sample of instances is used instead of considering the whole instances during the process of instance based schema matching as used in the previous works. iv

Abstrak tesis dipersembahkan kepada Senat Universiti Putra Malaysia sebagai memenuhi keperluan mendapatkan Ijazah Sarjana Sains PENDEKATAN BAHARU UNTUK PEMADANAN SKEMA BERASASKAN KETIKAAN Pengerusi : Oleh OSAMAH ABDUL SATTAR MAHDI Mei 2014 Profesor Hamidah Ibrahim, PhD Fakulti : Sains Komputer dan Teknologi Maklumat Pemadanan skema adalah fasa penting dalam integrasi data yang bertujuan untuk mencari koresponden antara atribut skema dengan menggunakan maklumat skema. Walau bagaimanapun, maklumat ini tidak selalunya tersedia atau berguna untuk digunakan kerana ia boleh jadi singkatan. Akibatnya, ketikaan boleh jadi satu pilihan alternatif bagi maklumat skema. Pelbagai pendekatan padanan skema berasaskan ketikaan telah dicadangkan untuk mencapai matlamat dalam penemuan koresponden antara atribut skema, dengan menganggap ketikaan sebagai rentetan termasuklah ketikaan numerik. Ini menghalang penemuan pola biasa atau menjalankan pengiraan statistikal di kalangan ketikaan numerik. Sebagai akibat, ini menyebabkan pemadanan yang tidak dikenalpasti terutamanya bagi atribut dengan ketikaan numerik yang seterusnya mengurangkan kualiti hasil pemadanan. Tesis ini bertujuan mencadangkan satu pendekatan efisien yang dapat mengenal pasti padanan atribut antara skema dengan mengeksploitasi sepenuhnya ketikaan. Pendekatan ini menggunakan konsep pengecaman pola untuk menentukan pemadanan atribut bagi ketikaan numerik dan campuran. Ini diperolehi dengan mencipta ungkapan biasa secara automatik berdasarkan kepada ketikaan. Sementara itu, bagi ketikaan abjad pendekatan ini mengira skor persamaan semantik dengan menggunakan persamaan Google bagi mendapatkan pertalian semantik antara ketikaan. Pendekatan yang dicadangkan mengandungi lima fasa utama, iaitu: (i) menganalisis ketikaan, (ii) mengelaskan atribut skema, (iii) mengekstrak saiz sampel yang optimal, (iv) mengenal pasti persamaan ketikaan, dan (v) mengenal pasti pemadanan. Tiga analisis telah direka bentuk dan dijalankan ke atas dua set data yang berbeza, iaitu: (i) Restoran dan (ii) Census dengan merujuk kepada precision (P), recall (R), dan F- measure (F). Analisis pertama bertujuan untuk mengenal pasti saiz sampel tuple yang optimum untuk digunakan semasa fasa mengekstrak saiz sampel yang optimum. Tujuan mengenal pasti saiz sampel yang optimum adalah untuk mengurangkan bilangan perbandingan antara ketikaan yang mana dapat mengurangkan masa pemprosesan v

operasi pemadanan. Analisis ini menunjukkan bahawa saiz sampel yang optimum adalah 50% daripada saiz jadual yang sebenar bagi kedua-dua data set. Analisis kedua bertujuan untuk menyiasat dan membuktikan bahawa menggabungkan kedua-dua persamaan Google dan ungkapan biasa sebagaimana dalam pendekatan cadangan kami mencapai ketepatan yang lebih tinggi berbanding menggunakan persamaan Google atau ungkapan biasa secara berasingan. Hasil menunjukkan bahawa pendekatan cadangan kami mencapai precision (P), recall (R), and F-measure (F) dalam julat 93% - 99% untuk kedua-dua set data. Sebaliknya, persamaan Google dan ungkapan biasa yang dilaksanakan secara berasingan mencapai precision (P), recall (R), dan F-measure (F) dalam julat 36% - 74%. Manakala analisis ketiga bertujuan untuk membandingkan prestasi pendekatan cadangan kami dengan pendekatan yang sebelum ini. Hasil menunjukkan bahawa pendekatan cadangan kami mengatasi pendekatan sebelumnya walaupun suatu sampel ketikaan digunakan daripada mempertimbangakan keseluruhan ketikaan semasa proses padanan skema berasaskan ketikaan sebagaimana digunakan dalam kajian sebelum ini. vi

ACKNOWLEDGEMENTS In the name of ALLAH, the most merciful and most compassionate. Praise to ALLAH S. W. T. who granted me strength, courage, patience and inspirations to complete this work. In the first place I would like to record my deepest gratitude to my supervisor Professor Dr. Hamidah Ibrahim, for her continuous and caring help, advice, and encouragement during my Ms.c. studies at the Universiti Putra Malaysia. I thank her for being very patient with my questions and my progress, for the countless lessons on writing and presenting technical materials, on doing research in general, and also for the English lessons, even though she is not a language teacher. I gratefully acknowledge Associate Professor Dr. Lilly Suriani Affendey for her advices, supervision, and crucial contribution, which made the backbone of this research and so to this thesis. Where would I be without my family? My parents deserve special mention for their inseparable support and prayers. My Father, Professor Dr. Abddul Sattar, in the first place is the person who put the fundament my learning character, showing me the joy of intellectual pursuit ever since I was child. My Mother is the one who sincerely raised me with her caring and gently love. Aseel, Nawfal, and Zena thanks for being supportive and caring siblings. Osamah Abdul Sattar Mahdi 2014 vii

viii

This thesis was submitted to the Senate of Universiti Putra Malaysia and has been accepted as fulfillment of the requirement for the degree of Master of Science. The members of the Supervisory Committee were as follows: Hamidah Ibrahim, PhD Professor Faculty of Computer Science and Information Technology Universiti Putra Malaysia (Chairman) Lilly Suriani Affendey, PhD Associate Professor Faculty of Computer Science and Information Technology Universiti Putra Malaysia (Member) BUJANG BIN KIM HUAT, PhD Professor and Dean School of Graduate Studies Universiti Putra Malaysia Date: ix

Declaration by graduate student I hereby confirm that: this thesis is my original work; quotations, illustrations and citations have been duly referenced; this thesis has not been submitted previously or concurrently for any other degree at any other institutions; intellectual property from the thesis and copyright of thesis are fully-owned by Universiti Putra Malaysia, as according to the Universiti Putra Malaysia (Research) Rules 2012; written permission must be obtained from supervisor and the office of Deputy Vice- Chancellor (Research and Innovation) before thesis is published (in the form of written, printed or in electronic form) including books, journals, modules, proceedings, popular writings, seminar papers, manuscripts, posters, reports, lecture notes, learning modules or any other materials as stated in the Universiti Putra Malaysia (Research) Rules 2012; there is no plagiarism or data falsification/fabrication in the thesis, and scholarly integrity is upheld as according to the Universiti Putra Malaysia (Graduate Studies) Rules 2003 (Revision 2012-2013) and the Universiti Putra Malaysia (Research) Rules 2012. The thesis has undergone plagiarisim detection software. Signature: Date: Name and Matric No.: x

Declaration by Members of Supervisory Committee This is to confirm that: the research conducted and the writing of this thesis was under our supervision; supervision responsibilities as stated in the Universiti Putra Malaysia (Graduate Studies) Rules 2003 (Revision 2012-2013) are adhered to. Signature: Signature: Name of Name of Chairperson of Member of Supervisory Supervisory Committee: Committee: xi

TABLE OF CONTENTS DEDICATION ABSTRACT ABSTRAK ACKNOWLEDGEMENTS APPROVAL DECLARATION LIST OF TABLES LIST OF FIGURES CHAPTER 1 INTRODUCTION 1 1.1 Overview 1 1.2 Problem Statement 3 1.3 Objective of the Research 3 1.4 Scope of the Research 3 1.5 Organisation of the Thesis 3 xii Page ii iii v vii ix x xiv xv 2 LITERATURE REVIEW 2.1 Introduction 5 2.2 Concept of Schema Matching 5 2.2.1 Schema Level Matching 2.2.1.1 Element Level Matching 2.2.1.2 Structure Level Matching 7 8 8 2.2.2 Instance Level Matching 8 2.2.3 Combination of Multiple Matcher 9 2.3 Instance Based Schema Matching 9 2.3.1 Element Level Approaches 10 2.4 Techniques Applied for Syntactic and Semantic Matching 11 2.5 Previous Approaches of Instance Based Schema Matching 12 2.5.1 Neural Network 12 2.5.2 Machine Learning 16 2.5.3 Information Theoretic Discrepancy 18 2.5.4 Rule Based 21 2.6 Point of Departure 23 2.7 Summary 24 3 METHODOLOGY OF RESEARCH 3.1 Introduction 25 3.2 Methodology of the Research 25 3.3 The Research Framework of the Instance Based Schema Matching 28 3.3.1 The Sub-Components of the Pre-Processing 28 3.3.2 The Sub-Components of the Instance Matching 29

3.4 Performance Measurement 29 3.5 Data Set 31 3.6 Summary 32 4 INSTANCE BASED SCHEMA MATCHING 4.1 Introduction 33 4.2 The Proposed Approach 33 4.2.1 Analysing Instances 34 4.2.2 Classifying Schema Attributes 36 4.2.3 Extracting the Optimal Sample Size 37 4.2.4 Identifying Instance Similarity 4.2.4.1 Regular Expression (regexes) 4.2.4.2 Using Regular Expressions to Describe Data Instances 4.2.4.3 Google Similarity Distance 4.2.4.4 Google Similarity for Alphabetic Data Type 38 38 39 45 46 4.2.5 Identifying the Match 47 4.3 Summary 49 5 EVALUATION 5.2 Introduction 50 5.3 Result 50 5.3.4 Analysis 1 50 5.3.5 Analysis 2 54 5.3.6 Analysis 3 55 5.4 Discussion 58 5.5 Summary 59 6 CONCLUSION AND FUTURE WORKS 6.2 Conclusion of Research 60 6.3 Future Work 60 REFERENCES 62 BIODATA OF STUDENT 67 LIST OF PUBLICATIONS 68 xiii