UNIVERSITI PUTRA MALAYSIA TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY WITH POSITION SCORE AND MEAN VALUE FOR MINING WEB CONTENT OUTLIERS


UNIVERSITI PUTRA MALAYSIA TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY WITH POSITION SCORE AND MEAN VALUE FOR MINING WEB CONTENT OUTLIERS WAN RUSILA BINTI WAN ZULKIFELI FSKTM 2013 8

TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY WITH POSITION SCORE AND MEAN VALUE FOR MINING WEB CONTENT OUTLIERS By WAN RUSILA BINTI WAN ZULKIFELI Thesis Submitted to the School of Graduate Studies, Universiti Putra Malaysia, in Fulfillment of the Requirements for the Degree of Master of Science December 2013

COPYRIGHT All material contained within the thesis, including without limitation text, logos, icons, photographs and all other artwork, is copyright material of Universiti Putra Malaysia unless otherwise stated. Use may be made of any material contained within the thesis for noncommercial purposes from the copyright holder. Commercial use of material may only be made with the express, prior, written permission of Universiti Putra Malaysia. Copyright Universiti Putra Malaysia

Alhamdulillah.

I lovingly dedicate this thesis to my:

Husband, who supported me each step of the way. Thank you for your love, encouragement and sacrifices.
Son, who gave me the strength and motivation to finish my study.
Parents (Mama, Papa, Mak, Abah), who believe in me. Thank you for all the moral support, guidance and patience.
Sister, who offered time, thought and laughter. Thank you for the hand you always lend me.

This is a thesis that I truly would not have been able to finish without every one of them. Thank you. Barokallahufikum.

Abstract of thesis presented to the Senate of Universiti Putra Malaysia in fulfilment of the requirement for the degree of Master of Science

TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY WITH POSITION SCORE AND MEAN VALUE FOR MINING WEB CONTENT OUTLIERS

By

WAN RUSILA BT WAN ZULKIFELI

December 2013

Chairman: Norwati Mustapha, PhD
Faculty: Computer Science and Information Technology

In the past few years, activity in the Web Content Mining area has expanded rapidly. However, the focus has been only on the technical aspects, visual design and frequent web content patterns, while less frequent web content patterns, called outliers, have been undervalued. Mining Web Content Outliers is used to detect irrelevant web content within a web portal. Detecting outliers is especially important when a web portal has been hacked. To date, only a few approaches to Mining Web Content Outliers have been suggested, such as the Signed-with-Weight technique and mining through a mathematical approach. The mathematical approach is based on two-way rectangular representations and a correlation method. However, these approaches do not take advantage of position score and a stemmed domain dictionary. Position score and a stemmed domain dictionary are very useful in mining web content outliers because they can affect the reduction in the relevance of documents.

Therefore, this study was undertaken to resolve the problems in Mining Web Content Outliers by combining the strengths of word-based techniques, a position score weighting technique and a stemmed domain dictionary. The existing weighting technique was transformed into the Term Frequency and Inverse Document Frequency with Position Score and Mean Value (TF.IDF.PSM) weighting technique by bringing into Web Content Mining a standard weighting technique from Information Retrieval, Term Frequency and Inverse Document Frequency (TF.IDF), and a weighting technique from Text Categorization, Term Frequency and Relevance Frequency (TF.RF).

The technique starts by extracting the web pages, preprocessing them and then generating the full word profile. Depending on the length of the word in characters, the respective index in the stemmed domain dictionary is searched. The positive count is incremented by one if the word is present in both the dictionary and the document. Then the word frequency in a web page and across all web pages, and the position score, are counted. Finally, the dissimilarity measure is computed to determine outliers. In the dissimilarity measure, TF.IDF.PSM is used not only to calculate and analyze the relevant words but also to consider the importance of the irrelevant words by assigning weight based on the word's position in a page. A statistical mean is added to balance the weight of the position score.

The technique has been tested on 431 web pages from the Course folder of the University of Wisconsin, provided by the World Wide Knowledge Base, while the 43 benchmark data are taken from the Science Medical folder provided by The 20 Newsgroups Dataset. The TF.IDF weighting technique from Information Retrieval (IR) and the TF.RF weighting technique from Text Categorization are used during the experimental phase, and the results are measured by two parameters: the percentage of accuracy and the F1-measure. The experimental results show that the TF.IDF.PSM weighting technique achieves up to 98.95% accuracy, about 3.21% higher than the Signed-with-Weight technique. It also achieves an F1-measure of up to 94.19%, an 18.12% improvement over the Signed-with-Weight technique.

Abstrak (Malay-language abstract) of thesis presented to the Senate of Universiti Putra Malaysia in fulfilment of the requirement for the degree of Master of Science

TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY WITH POSITION SCORE AND MEAN VALUE FOR MINING WEB CONTENT OUTLIERS
(PEMBERAT KEKERAPAN PERKATAAN DAN KEKERAPAN DOKUMEN SONGSANG BERSAMA PERINCIAN KEDUDUKAN DAN NILAI PURATA UNTUK MELOMBONG KANDUNGAN OUTLIERS DI LAMAN SESAWANG)

By

WAN RUSILA BT WAN ZULKIFELI

December 2013

Chairman: Norwati Mustapha, PhD
Faculty: Computer Science and Information Technology

Over the past several years, the field of web content mining has grown very rapidly. However, attention has been given only to technical aspects, visual design and frequent web content patterns, while less frequent web content patterns, called outliers, are often neglected. Web content outlier mining is used to detect irrelevant web content within a web portal. Detecting outliers is important, especially when the web portal has been hacked. To date, only a few approaches to web content outlier mining have been proposed, among them the Signed-with-Weight technique and mining using a mathematical approach. The mathematical approach is built on two-way rectangular representation and correlation. However, these approaches do not exploit the advantages of position score and a stemmed domain dictionary. Position score and a stemmed domain dictionary are very useful in web content outlier mining because they affect the reduction of relevant documents.

For this reason, this study was carried out to solve the problems in web content outlier mining by combining the strengths of text-based techniques, the position score weighting technique and the stemmed domain dictionary. The original weighting technique was modified into the Term Frequency and Inverse Document Frequency with Position Score and Mean Value (TF.IDF.PSM) weighting technique by combining basic weighting techniques from information retrieval (TF.IDF) and text categorization (TF.RF) into web content mining.

The technique begins by parsing the web pages, which are then processed, and a full word profile is generated. The relevant index in the stemmed domain dictionary is searched according to the length of the word. If the word is found in both the dictionary and the document, the positive count is incremented by one. The word frequency within a web page and across every web page, and the position score, are then counted. Finally, the dissimilarity measure is computed to determine the outliers. In the dissimilarity measure, the TF.IDF.PSM technique is used not only to calculate and analyse the relevant words, but also to take into account the importance of irrelevant words by assigning weights based on the position of the word within a page. The statistical mean is added to balance the weight of the position score.

The technique was tested on 431 web pages from the Course folder of the University of Wisconsin provided by the World Wide Knowledge Base, while the 43 benchmark data sets are from the Science Medical folder provided by The 20 Newsgroups Dataset. The Term Frequency and Inverse Document Frequency (TF.IDF) weighting technique from information retrieval and the Term Frequency and Relevance Frequency (TF.RF) weighting technique from text categorization were used during the experimental phase, and the results were measured on two parameters, namely the percentage of accuracy and the F1-measure. The experimental results show that the TF.IDF.PSM weighting technique achieved an accuracy of up to 98.95%, which is 3.21% higher than the Signed-with-Weight technique. In addition, it obtained an F1-measure of up to 94.19%, which is 18.12% higher than the score obtained with the Signed-with-Weight technique.
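The two evaluation parameters named above, accuracy and F1-measure, can be illustrated with a short sketch. It assumes the standard confusion-matrix definitions, with outliers as the positive class; the function name `accuracy_and_f1` and the counts passed to it are hypothetical and are not results from the thesis.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1-measure from confusion-matrix counts.

    tp: outliers correctly flagged, fp: normal pages wrongly flagged,
    fn: outliers missed, tn: normal pages correctly left alone.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, f1


# Hypothetical counts for illustration only.
acc, f1 = accuracy_and_f1(tp=8, fp=2, fn=2, tn=88)
```

Because outliers are rare, accuracy can stay high even when many outliers are missed, which is one reason to report the F1-measure alongside it.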

ACKNOWLEDGEMENTS

I would like to extend my gratitude to the people who helped bring this research to a successful conclusion. First, I would like to thank Associate Professor Norwati Mustapha for her help, professionalism, valuable guidance and support throughout my entire program of study. Secondly, I would like to express my sincere thanks and appreciation to the supervisory committee member, Dr. Aida Mustapha, for her guidance and valuable suggestions in making this work a success. My sincere appreciation also goes to all my friends in UPM for their continuous help and sharing of knowledge.

I would also like to thank the expert who was involved in the validation for this research project, Assistant Professor Pookuzhali Sugumaran. Without her passionate participation and input, the validation could not have been successfully conducted.

Finally, I must express my very profound gratitude to my parents and to my husband for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Wan Rusila Bt Wan Zulkifeli
December 2013

APPROVAL

This thesis was submitted to the Senate of Universiti Putra Malaysia and has been accepted as fulfilment of the requirement for the degree of Master of Science. The members of the Supervisory Committee were as follows:

Norwati Binti Mustapha, PhD
Associate Professor
Faculty of Computer Science and Information Technology
Universiti Putra Malaysia
(Chairman)

Aida Binti Mustapha, PhD
Faculty of Computer Science and Information Technology
Universiti Putra Malaysia
(Member)

BUJANG BIN KIM HUAT, PhD
Professor and Dean
School of Graduate Studies
Universiti Putra Malaysia

Date:

Declaration by graduate student

I hereby confirm that:
- this thesis is my original work;
- quotations, illustrations and citations have been duly referenced;
- this thesis has not been submitted previously or concurrently for any other degree at any other institutions;
- intellectual property from the thesis and copyright of the thesis are fully owned by Universiti Putra Malaysia, according to the Universiti Putra Malaysia (Research) Rules 2012;
- written permission must be obtained from the supervisor and the office of the Deputy Vice-Chancellor (Research and Innovation) before the thesis is published in book form;
- there is no plagiarism or data falsification/fabrication in the thesis, and scholarly integrity is upheld according to the Universiti Putra Malaysia (Graduate Studies) Rules 2003 (Revision 2012-2013) and the Universiti Putra Malaysia (Research) Rules 2012. The thesis has been checked with plagiarism detection software.

Signature:
Date:
Name and Matric No.: Wan Rusila Binti Wan Zulkifeli, GS22215

Declaration by Members of Supervisory Committee

This is to confirm that:
- the research conducted and the writing of this thesis were under our supervision;
- supervision responsibilities as stated in the Universiti Putra Malaysia (Graduate Studies) Rules 2003 (Revision 2012-2013) are adhered to.

Signature:
Name of Chairman of Supervisory Committee:

Signature:
Name of Member of Supervisory Committee:

TABLE OF CONTENTS

ABSTRACT
ABSTRAK
ACKNOWLEDGEMENTS
APPROVAL
DECLARATION
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER

1 INTRODUCTION
  1.1 Motivation and Background
  1.2 Problem Statement
  1.3 Research Objectives
  1.4 Research Scope
  1.5 Research Contribution
  1.6 Organization of Thesis

2 LITERATURE REVIEW
  2.1 Introduction
  2.2 Web Mining
  2.3 Data Mining and Text Mining
  2.4 Web Content Outliers
    2.4.1 Web Content Outlier Analysis
    2.4.2 Issues in Web Content Outliers
    2.4.3 The Importance of Mining Web Content Outliers
  2.5 Web Content Outlier Mining Process
  2.6 Categories of Outlier Detection Methods
    2.6.1 Distribution-Based
    2.6.2 Depth-Based
    2.6.3 Deviation-Based
    2.6.4 Distance-Based
    2.6.5 Density-Based
  2.7 Sources and Type of Data in Web Content Outlier Mining
  2.8 Stemming Issues in Mining Web Content Outliers
  2.9 Summary

3 RESEARCH METHODOLOGY
  3.1 Introduction
  3.2 Research Steps
  3.3 Experimental Design
    3.3.1 Domain Dictionary
    3.3.2 Data Collection
    3.3.3 Dataset for Frequent Web Pages
    3.3.4 Dataset for Less Frequent Web Pages
  3.4 System Requirements
    3.4.1 Hardware Requirements
    3.4.2 Software Requirements
  3.5 Evaluation Metrics
  3.6 Summary

4 THE TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY WITH POSITION SCORE AND MEAN VALUE (TF.IDF.PSM) WEIGHTING TECHNIQUE
  4.1 Introduction
  4.2 The Term Frequency and Inverse Document Frequency with Position Score and Mean Value (TF.IDF.PSM)
    4.2.1 Extracted Web Pages
    4.2.2 Preprocessing
    4.2.3 Generate Full Word Profile
    4.2.4 Generate Word Frequency and Position Score
    4.2.5 Compute Dissimilarity Measure
    4.2.6 Determine Outliers
  4.3 The Term Frequency and Inverse Document Frequency with Position Score and Mean Value (TF.IDF.PSM) Weighting Technique Algorithm
  4.4 Summary

5 RESULTS AND DISCUSSIONS
  5.1 Introduction
  5.2 Experimental Remarks
  5.3 Results for the Signed-With-Weight Technique
  5.4 Results of Term Frequency and Relevance Frequency (TF.RF) Weighting Technique
  5.5 Results for Term Frequency and Inverse Document Frequency (TF.IDF) Weighting Technique
  5.6 Results for Term Frequency and Inverse Document Frequency with Relevance Frequency (TF.IDRF) Weighting Technique
  5.7 Results for Term Frequency and Inverse Document Frequency with Position Score (TF.IDF.PS) Weighting Technique
  5.8 Discussion of Results
  5.9 Summary

6 CONCLUSION AND FUTURE RESEARCH
  6.1 Introduction
  6.2 Conclusion
  6.3 Future Research

REFERENCES

APPENDICES
  A Positive Outliers Detected for Every Experiment
  B Results for Experiments
  C Example Records of Data after TF.IDF.PSM Technique
  D Example Results: Data 1 / Times 1 - TF.IDF.PSM

BIODATA OF STUDENT
LIST OF PUBLICATIONS