Ubiquitous Computing and Communication Journal (ISSN )


UbiCC Journal, Volume 6: Issue 3

A STRATEGY TO COMPROMISE HANDWRITTEN DOCUMENTS PROCESSING AND RETRIEVING USING ASSOCIATION RULES MINING

Prof. Dr. Alaa H. AL-Hamami, Amman Arab University for Graduate Studies, Amman, Jordan, 2011. Alaa_hamami@yahoo.com
Dr. Mohammad A. AL-Hamami, Delmon University, Bahrain, 2011. M_ah_1@yahoo.com
Dr. Soukaena H. Hashem, University of Technology, Iraq, 2011. soukaena_hassan@yahoo.com

ABSTRACT
A massive amount of new information is being created: the world's data doubles every 18 months, and 80-90% of all data is held in various unstructured formats. Useful information can be derived from this unstructured data. The aim of this research is to present a framework for handling handwritten documents in all their aspects. Since handwritten documents are unstructured data, the objectives of the proposed strategy are to:
1. Convert the unstructured handwritten documents into structured ones and store them in a convenient database. The proposed database is customized to contain three dimensions: the first for writer features, the second for data features, and the third for document features. The multidimensional database is then converted into a transactional one, and the feature values of all attributes are encoded.
2. Mine the proposed database. The resulting association rules extract new patterns that serve many prediction purposes.

Keywords: handwritten documents, data mining, association rules.

1 INTRODUCTION
Data mining is also known as knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Data mining work has two branches:

Descriptive: understanding underlying processes or behavior (patterns, trends, and clustering). In detail, this covers pattern and trend analysis, knowledge base creation, summarization, and visualization.

Predictive: predicting unseen or unmeasured values (future projections, missing values, and classification). In detail, this covers classification, question answering, and pattern and trend forecasting [1].

2 TEXT MINING
Text mining is a process that employs statistical Natural Language Processing (NLP), a set of algorithms for converting unstructured text into structured data objects, plus data mining, the quantitative methods that analyze these data objects to discover knowledge. Text mining techniques include the following:

Information Retrieval (indexing and retrieval of textual documents).
Information Extraction (extraction of partial knowledge in the text).
Web Mining (indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building)).
Clustering (generating collections of similar text documents).

The text mining process consists of the following sequenced steps [2, 3] (see Fig. 1):

Figure 1: The overall process of text mining.

1. Text Preprocessing (syntactic/semantic text analysis): part-of-speech (POS) tagging (finding the corresponding POS for each word), word sense disambiguation (context based or proximity based), and parsing (generating a parse tree (graph) for each sentence, where each sentence is a stand-alone graph).
2. Features Generation (bag of words): a text document is represented by the words it contains (and their occurrences). The order of words is not that important for certain applications. Stemming identifies a word by its root and so reduces dimensionality. Stop words are the common words unlikely to help text mining.
3. Features Selection (simple counting and statistics): reduces dimensionality, because learners have difficulty addressing tasks with high dimensionality and only the information relevant to what is being analyzed is of interest. Not all features help, so irrelevant features are dropped.
4. Text/Data Mining (classification (supervised) / clustering (unsupervised)):
Supervised learning (classification): the training data is labeled with the class, and new data is classified based on the training set. A classification is correct when the known label of a test sample is identical to the class produced by the classification model.
Unsupervised learning (clustering): the class labels of the training data are unknown, and the task is to establish the existence of classes or clusters in the data. A good clustering method yields high intra-cluster similarity.

A. Text Mining (classification definition): Given: a collection of labeled records (training set), where each record contains a set of features (attributes) and the true class (label). Find: a model for the class as a function of the values of the features. Goal: previously unseen records should be assigned a class as accurately as possible.

B. Text Mining (clustering definition): Given: a set of documents and a similarity measure among documents. Find: clusters such that documents in one cluster are more similar to one another and documents in separate clusters are less similar to one another. Goal: finding a correct set of document clusters.

C. Analyzing results: Are the results satisfactory? Does more mining need to be done? Does a different technique need to be used? Does another iteration of one or more steps in the process need to be done?

3 THE PROPOSED SYSTEM
Some previous works have dealt with handwritten documents. Fig. 2 presents a method that uses an Artificial Neural Network (ANN) to classify documents according to data features for the writing group. The finding is that the ANN does a good job but cannot clearly explain its output. That is sufficient when the result of classification only has to determine the group of writers, but it leaves open classification according to the subject of the documents and classification according to the documents' features.

Figure 2: ANN classifying handwritten documents according to their writing group.

The proposed system for text mining of handwritten documents can be explained in the following steps:

Step One: Determine the input and output.
Input: samples of handwritten documents (200 documents).
Output: association rules that introduce predicted patterns, which help to determine and extract many more relationships among writers, features of data, and features of documents.
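As an aside on the general text mining process of Section 2, the features generation step (bag of words, stop-word removal, stemming) can be sketched as follows. This is only a minimal illustration under stated assumptions, not part of the proposed system: the stop-word list is a tiny invented sample, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); real lists are much larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}


def naive_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Represent a document by its stemmed, non-stop words and their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)
```

Word order is discarded on purpose, which matches the bag-of-words view described above; stemming and stop-word removal both shrink the vocabulary and therefore the dimensionality.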

Step Two: Determine the attributes and their values.

Determine the attributes of the first dimension, features of the document's writer, which include the following: Age, Gender, Handedness, Ethnicity, Education, and Schooling. These attributes are obtained as prior knowledge associated with the documents (each standard document is naturally supplied with this information about its writer). The proposed encoding of the attributes of the first dimension is:
1. Age: since the writers of the documents are certainly older than 20 years, this attribute is represented by A if the age is less than 45; otherwise A will not appear.
2. Gender: if the gender is female, B will appear; otherwise B will not appear.
3. Handedness, Ethnicity, Education, and Schooling: all of these attributes are represented by the same strategy.

Determine the attributes of the second dimension, features of the written data, which include the following: dark, blob, hole, slant, width, skew, height, slopehor, slopeneg, slopever, slopepos, and pixelfreq. These features are obtained by applying image processing procedures designed to extract them. The proposed encoding of the attributes of the second dimension is:
1. Dark: the value is normalized, and a threshold is then set on the normalized value according to the different values it takes in different cases. For example, if dark is less than 0.5 then G will appear; otherwise G will not appear.
2. Blob, Hole, Slant, Width, Skew, Height, Slopehor, Slopeneg, Slopever, Slopepos, and Pixelfreq: all of these attributes are represented by the same strategy.

Determine the attributes of the third dimension, features of the text written in the documents, which include the following: language, subject of document, and type of document. These features are obtained by using Optical Character Recognition software to turn the documents into digital documents, and then processing these digital documents to extract all the recognized features. The proposed encoding of the attributes of the third dimension is:
1. Language: if the language is English, S will appear; otherwise S will not appear.
2. Subject of document: if the subject is medical, T will appear; otherwise T will not appear.
3. Type of document: if the document is text only, U will appear; otherwise U will not appear.

Step Three: As a first view of the proposal, we present a multidimensional database with three dimensions: features of the document's writer, features of the written data, and features of the text written in the documents (see Fig. 3).

Figure 3: The multidimensional database.

This multidimensional database is then converted into a simple transactional one (see Fig. 4).

Figure 4: The transactional database.

The data of the transactional database is then written using the proposed encoding of the feature values (see Fig. 5).

Tid      Attributes
Doc 1    ABCDE
Doc 2    CDEFGHIJ
Doc 3    ..

Figure 5: Encoded transactional database.

Step Four: Because the transactional database has very long itemsets, searching for the frequent itemsets to find

the association rules would consume a great deal of space and time if we used one of the two traditional methods for finding frequent itemsets. These methods are:

Breadth search: can be viewed as a bottom-up approach, where the algorithm visits patterns of size k+1 after finishing the patterns of size k.
Depth search: does the opposite, where the algorithm starts by visiting patterns of size k before those of size k-1.

The proposed procedure finds the set of frequent itemsets in a transactional database that has long itemsets. It works as follows:
1. Use a traversing approach that combines depth and breadth search to find the longest frequent itemset.
2. Find all its children, which yields most of the frequent itemsets.
3. Determine the support of each frequent itemset.

Some frequent itemsets do not appear among the children of the longest frequent itemset. These exceptional frequent itemsets are found, with their supports, using the traditional Apriori algorithm.

The proposed procedure consists of two phases; the first phase must be applied, while the second is applied only when necessary. The first phase of the traversal consists of a depth search. From this search only the deepest node on the leftmost side is taken, and the support of this node in the database is computed. If its support passes the minimum support threshold, the search terminates; otherwise the second phase is applied. The second phase consists of a breadth search that takes the node generated in the first phase, treats it as the root of the tree, and traverses that tree breadth-first looking for the longest frequent itemset. The search terminates when the longest frequent itemset has been found.

Step Five: After all frequent itemsets have been found, the traditional association rule procedure is applied. This procedure produces the extracted association rules.
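To make support counting and the traditional rule procedure concrete, here is a minimal sketch of a plain level-wise (breadth) search in the style of Apriori, followed by rule generation. It is not the two-phase depth/breadth procedure proposed above, which the text does not specify in enough detail to reproduce. The transactions, the function names, and the use of a confidence threshold (the usual companion of support in association rule mining, though the text names only support) are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical encoded transactions; the rows are invented for illustration.
transactions = {
    "Doc 1": set("ABCDE"),
    "Doc 2": set("ABCD"),
    "Doc 3": set("ABE"),
    "Doc 4": set("BCD"),
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions.values() if items <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: all size-k patterns are visited before size k+1."""
    items = sorted(set().union(*transactions.values()))
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s, transactions)
        # Join frequent k-itemsets into (k+1)-candidates, then prune by support.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return result


def association_rules(frequent, min_confidence):
    """Split each frequent itemset into left --> right; keep confident rules."""
    rules = []
    for itemset, sup in frequent.items():
        for k in range(1, len(itemset)):
            for left in combinations(sorted(itemset), k):
                left = frozenset(left)
                # Every subset of a frequent itemset is frequent, so the
                # lookup of the left side always succeeds.
                confidence = sup / frequent[left]
                if confidence >= min_confidence:
                    rules.append((left, itemset - left, confidence))
    return rules
```

Calling `frequent_itemsets(transactions, 0.5)` and passing the result to `association_rules(..., 0.9)` yields rules as `(left, right, confidence)` triples. With long itemsets the candidate sets grow quickly, which is the cost that motivates searching for the longest frequent itemset first.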
Examples of the extracted association rules are:

A --> B
..
GHFRSTU --> OABCD
..

After the extraction, we propose a procedure that is applied before the analysis stage. This procedure, called Rule Classification, classifies the rules into seven classes depending on which dimensions contain the itemsets of the right and left sides.

Rule classification:
Class1: The itemsets on both the right and left sides are included in the first dimension.
Class2: The itemsets on both the right and left sides are included in the second dimension.
Class3: The itemsets on both the right and left sides are included in the third dimension.
Class4: The itemsets on both the right and left sides are included in the first and second dimensions.
Class5: The itemsets on both the right and left sides are included in the first and third dimensions.
Class6: The itemsets on both the right and left sides are included in the second and third dimensions.
Class7: The itemsets on both the right and left sides are included in the first, second, and third dimensions.

The classification of the association rules above is:

A --> B (Class1)
..
GHFRSTU --> OABCD (Class7)
..

Step Six: This step is the analysis stage, which is the most important step because it produces a full report of predictions, relationships, and future trends to improve the performance of the mined database, which represents the encoding for the system. To explain this stage we show how to analyze the following rule:

GHFRSTU --> OABCD (Class7)

This rule is classified as Class7 because its left and right sides together span the three dimensions. The left side has the frequent itemset GHFRSTU, which is composed of F in the first dimension, GHR in the second dimension, and STU in the third dimension. The right side has the frequent itemset OABCD, which is composed of ABCD in the first dimension and O in the second dimension. From the classification and composition analysis, and from translating the encoded

letters into their attributes, we could predict the following:
1. If dark is less than 0.5, blob is more than 0.3, schooling is high, pixel frequency passes the threshold, language is English, the subject of the document is medical, and the type of document is text, then
2. slopeneg will pass the threshold, the age of the writer will be less than 45, the gender will be female, and the handedness will be right.

So we could predict the age, gender, and handedness of the writer, and also the slopeneg of the document data, by knowing the dark, blob, and pixelfreq of the document data and the schooling of the writer, combined with the language, subject, and type features of the document.

4 IMPLEMENTATIONS
To explain the implementation of the proposed system, we follow these phases:

The first phase: Each handwritten document is taken and used to build the proposed multidimensional database as follows:
Convert the document to an image and extract from it all the features of the second dimension (dark, blob, hole, slant, width, skew, height, slopehor, slopeneg, slopever, slopepos, pixelfreq). The feature values are obtained by traditional image processing procedures.
The metadata of the writer, represented by the first dimension (Age, Gender, Handedness, Ethnicity, Education, and Schooling), is obtained as metadata appended to the document.
The metadata of the document, represented by the third dimension (language, subject of document, type of document), is obtained as metadata appended to the document.

Fig. 6 shows how to build the proposed multidimensional database by filling the text boxes with the feature values and the scanned document and then clicking the insertion command. The database is converted to a transactional one by clicking the convert command; if the conversion succeeds, the message in Fig. 7 appears.

Figure 6: Form1 for building the multidimensional database and converting it to a transactional one.

Figure 7: Message indicating that the conversion completed successfully.

Fig. 8 shows how to extract the frequent itemsets from the transactional database using the proposed procedure, after entering the initial expected longest frequent itemset and clicking the extracting command, and how to get the rule classification after clicking the classification command.

Figure 8: Form2 displaying the extraction of frequent itemsets and the classification of rules.

The second phase: This phase presents an application of the proposed system: extracting the features of the author from the features of the written document and the features of its subject. This is done by taking the document as an image and extracting from it the feature values of the document (second dimension), together with the subject feature values supplied with the document (third dimension). To extract the writer (author) features, the system mines the existing multidimensional database, according to the features extracted from the document image and the supplied features, to get the author features that correspond to them (see Fig. 9). Certainly, the proposed

retrieval process would involve many thresholds related to the feature values.

Figure 9: Trying to get the author features by introducing the document and subject features.

5 CONCLUSIONS
From the proposed research we conclude the following:
1. Converting unstructured handwritten documents to a structured form, by building the proposed multidimensional database and then converting it into a transactional one, enriches the mining process, since most of the features of documents, writers, and data are included.
2. Building a transactional database with long itemsets enables us to include all the features we consider important for prediction and for extracting new patterns.
3. Using association rule techniques to deal with the features, instead of an ANN, makes the mining process much more powerful, since there is no limit on the number of features entered or on the number of features produced for classification and clustering.
4. Using the proposed procedure to find frequent itemsets, instead of the traditional procedure, makes finding all frequent itemsets in long itemsets efficient and less consuming of time and space.
5. The proposed procedure for classifying the extracted rules makes the analysis process much easier and faster.

6 DISCUSSION
In the proposed system we assume that the number of attributes is less than 26, so we represent each attribute by one capital letter. If the number of attributes exceeds twenty-six, we will use both capital and small letters. Representing all features as binary values will not decrease system performance, since we use critical thresholds for these features, which represent the attributes of the user's handwritten documents. The handwritten documents are represented initially by a multidimensional database so that they have a general form suitable for much future work on these documents. In the proposed system we convert this multidimensional database into a transactional database because our aim here is to apply the association rule data mining technique.
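The one-letter binary encoding discussed above, and the classification of rules by dimension, can be sketched as follows. This is a minimal illustration under assumptions: letters are assigned in the order the attributes are listed in the paper, which agrees with the letters the text gives explicitly (A = age, B = gender, G = dark, S = language, T = subject, U = type), and a rule's class is taken from the union of dimensions touched by both sides, which agrees with the Class7 example. The names `encode` and `rule_class` are hypothetical.

```python
import string

# Attribute order follows the paper's three dimensions; assigning A..U in
# this order is an assumption that matches the letters given in the text.
WRITER = ["age", "gender", "handedness", "ethnicity", "education", "schooling"]
DATA = ["dark", "blob", "hole", "slant", "width", "skew", "height",
        "slopehor", "slopeneg", "slopever", "slopepos", "pixelfreq"]
DOCUMENT = ["language", "subject", "type"]

ATTRIBUTES = WRITER + DATA + DOCUMENT
# One capital letter per attribute; this only works for up to 26 attributes.
LETTER = dict(zip(ATTRIBUTES, string.ascii_uppercase))
DIMENSION = {LETTER[a]: d
             for d, names in enumerate([WRITER, DATA, DOCUMENT], start=1)
             for a in names}


def encode(flags):
    """Build an itemset string from {attribute: bool}; True means the letter appears."""
    return "".join(LETTER[a] for a in ATTRIBUTES if flags.get(a, False))


def rule_class(left, right):
    """Return the class 1..7 from the dimensions touched by both rule sides."""
    dims = frozenset(DIMENSION[ch] for ch in left + right)
    classes = {frozenset({1}): 1, frozenset({2}): 2, frozenset({3}): 3,
               frozenset({1, 2}): 4, frozenset({1, 3}): 5,
               frozenset({2, 3}): 6, frozenset({1, 2, 3}): 7}
    return classes[dims]
```

A document would be encoded by first applying the thresholds, for example `encode({"age": age < 45, "gender": gender == "female", "dark": dark < 0.5})`, so each letter is present or absent exactly as in the binary scheme described above.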
REFERENCES
[1] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[3] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[4] S. Mitra and T. Acharya. Data Mining: Multimedia, Soft Computing, and Bioinformatics. John Wiley and Sons, Inc., 2003.
[5] Ala'a H. AL-Hamami, Mohammad Ala'a Al-Hamami, and Soukaena Hassan Hasheem. Applying data mining techniques in intrusion detection system on web and analysis of web usage. Asian Journal of Information Technology, Vol. 5, No. 1, pp. 57-63, 2006.
[6] Ala'a H. AL-Hamami and Soukaena Hassan Hasheem. Privacy Preserving for Data Mining Applications. Journal of Technology, University of Technology, Baghdad, Iraq, Vol. 26, No. 5, 2008.
[7] Mohammad A. Al-Hamami and Soukaena Hassan Hashem. Applying Data Mining Techniques to Discover Methods that Used for Hiding Messages Inside Images. The IEEE First International Conference on Digital Information Management (ICDIM 2006), Bangalore, India, 2006.
[8] Ala'a H. AL-Hamami, Mohammad Ala'a Al-Hamami, and Soukaena Hassan Hasheem. A Proposed Technique for Medical Diagnosis Using Data Mining. Fourth International Conference on Intelligent Computing and Information Systems (ICICIS 2009), Cairo, Egypt, March 19-22, 2009.