Ubiquitous Computing and Communication Journal (ISSN )


A STRATEGY TO COMPROMISE HANDWRITTEN DOCUMENTS PROCESSING AND RETRIEVING USING ASSOCIATION RULES MINING

Prof. Dr. Alaa H. AL-Hamami, Amman Arab University for Graduate Studies, Amman, Jordan
Dr. Mohammad A. AL-Hamami, Delmon University, Bahrain, 2011
Dr. Soukaena H. Hashem, University of Technology, Iraq, 2011

UbiCC Journal, Volume 6: Issue 3

ABSTRACT
A massive amount of new information is being created: the world's data doubles every 18 months, and 80-90% of all data is held in various unstructured formats. Useful information can be derived from this unstructured data. The aim of this research is to present a framework for handling handwritten documents in all their aspects. Since handwritten documents are unstructured data, the proposed strategy has two objectives. First, it converts the unstructured handwritten documents into structured ones and stores them in a convenient database. The proposed database is customized to contain three dimensions: the first for writer features, the second for data features and the third for document features. The multidimensional database is then converted into a transactional one, and the feature values of all attributes are encoded. Second, it mines the proposed database; the resulting association rules extract new patterns that serve many prediction purposes.

Keywords: handwritten documents, data mining, association rules.

1 INTRODUCTION
Data mining is also known as knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting and business intelligence. Data mining work has two branches [1]:
Descriptive: understanding underlying processes or behavior (patterns, trends and clustering); in detail, pattern and trend analysis, knowledge base creation, summarization and visualization.
Predictive: predicting unseen or unmeasured values (future projections, missing values and classification); in detail, classification, question answering, and pattern and trend forecasting.

2 TEXT MINING
Text mining is a process that employs statistical Natural Language Processing (NLP), a set of algorithms for converting unstructured text into structured data objects, together with data mining, the quantitative methods that analyze these data objects to discover knowledge. Text mining techniques include the following:
Information Retrieval: indexing and retrieval of textual documents.
Information Extraction: extraction of partial knowledge in the text.
Web Mining: indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building).
Clustering: generating collections of similar text documents.
The text mining process consists of the following sequenced steps [2, 3]; see Fig. 1.


Figure 1: The overall process of text mining.

1. Text Preprocessing (syntactic/semantic text analysis): Part-Of-Speech (POS) tagging (finding the corresponding POS for each word), word sense disambiguation (context based or proximity based) and parsing (generating a parse tree (graph) for each sentence; each sentence is a stand-alone graph).
2. Features Generation (bag of words): A text document is represented by the words it contains and their occurrences. The order of words is not that important for certain applications. Stemming identifies a word by its root, which reduces dimensionality. Stop words are the common words that are unlikely to help text mining.
3. Features Selection (simple counting and statistics): Dimensionality is reduced because learners have difficulty addressing tasks with high dimensionality, and only the information relevant to what is being analyzed is of interest. Not all features help; some are irrelevant.
4. Text/Data Mining (classification (supervised) / clustering (unsupervised)): In supervised learning (classification), the training data is labeled to indicate the class, and new data is classified based on the training set. A classification is correct when the known label of a test sample is identical to the class produced by the classification model. In unsupervised learning (clustering), the class labels of the training data are unknown, and the task is to establish the existence of classes or clusters in the data. A good clustering method produces high intra-cluster similarity.
A. Text Mining (classification definition): Given: a collection of labeled records (training set), where each record contains a set of features (attributes) and the true class (label). Find: a model for the class as a function of the values of the features. Goal: previously unseen records should be assigned a class as accurately as possible.
B. Text Mining (clustering definition): Given: a set of documents and a similarity measure among documents. Find: clusters such that documents in one cluster are more similar to one another and documents in separate clusters are less similar to one another. Goal: finding a correct set of document clusters.
C. Analyzing results: Are the results satisfactory? Does more mining need to be done? Does a different technique need to be used? Does another iteration of one or more steps in the process need to be done?

3 THE PROPOSED SYSTEM
Some previous work has dealt with handwritten documents. Fig. 2 presents a method that uses an Artificial Neural Network (ANN) to classify documents according to the data features of the writing group. The finding was that the ANN does a good job but cannot clearly explain its output. That approach is valid as far as it goes, since the classification result determines the group of writers, but it leaves open classification according to the subject of the documents and classification according to the documents' features.

Figure 2: An ANN classifies handwritten documents according to their writing group.

The proposed system for text mining of handwritten documents can be explained in the following steps:

Step One: Determine the input and output.
Input: Samples of handwritten documents (200 documents).
Output: Association rules that introduce predicted patterns, which help to determine and extract many more relationships among writers, features of data and features of documents.

Step Two: Determine the attributes and their values.

Determine the attributes of the first dimension, the features of the document's writer: Age, Gender, Handedness, Ethnicity, Education and Schooling. These attributes are obtained as prior knowledge associated with the documents (each standard document is naturally supplied with this information about its writer). The proposed encoding of the attributes of the first dimension is:
1. Age: since the writers of the documents are all older than 20 years, this attribute is represented by A if the age is less than 45; otherwise A does not appear.
2. Gender: if the gender is female, B appears; otherwise B does not appear.
3. Handedness, Ethnicity, Education and Schooling: all of these attributes are represented by the same strategy.

Determine the attributes of the second dimension, the features of the written data: dark, blob, hole, slant, width, skew, height, slopehor, slopeneg, slopever, slopepos and pixelfreq. These features are obtained by applying image processing procedures designed to extract them. The proposed encoding of the attributes of the second dimension is:
1. Dark: the value is normalized, and a threshold is then applied to the normalized value, chosen according to the different values observed in different cases. For example, if dark is less than 0.5, G appears; otherwise G does not appear.
2. Blob, Hole, Slant, Width, Skew, Height, Slopehor, Slopeneg, Slopever, Slopepos and Pixelfreq: all of these attributes are represented by the same strategy.

Determine the attributes of the third dimension, the features of the text written in the documents: language, subject of document and type of document. These features are obtained by using Optical Character Recognition software to enter the documents as digital documents, and then processing these digital documents to extract all the recognized features. The proposed encoding of the attributes of the third dimension is:
1. Language: if the language is English, S appears; otherwise S does not appear.
2. Subject of document: if the subject is medical, T appears; otherwise T does not appear.
3. Type of document: if the document is text only, U appears; otherwise U does not appear.

Step Three: As a first view of the proposal we present a multidimensional database with three dimensions: features of the document's writer, features of the written data and features of the text written in the documents; see Fig. 3.

Figure 3: The multidimensional database.

This multidimensional database is then converted into a simple transactional one; see Fig. 4.

Figure 4: The transactional database.

The data of the transactional database is then written using the proposed encoding of the feature values; see Fig. 5.

Tid     Attributes
Doc 1   ABCDE
Doc 2   CDEFGHIJ
Doc 3   ..

Figure 5: Encoded transactional database.

Step Four: Because the transactional database has very long itemsets, searching for frequent itemsets in order to find the association rules consumes considerable space and time if one of the two traditional methods for finding frequent itemsets is used. These methods are:
Breadth search, which can be viewed as a bottom-up approach in which the algorithm visits patterns of size k+1 after finishing the patterns of size k.
Depth search, which does the opposite: the algorithm visits patterns of size k before those of size k-1.

The proposed procedure finds the set of frequent itemsets in a transactional database that has long itemsets. It works as follows:
1. Use a traversing approach that combines depth and breadth search to find the longest frequent itemset.
2. Find all its children; this yields most of the frequent itemsets.
3. Determine the support of each frequent itemset.
Some frequent itemsets do not appear among the children of the longest frequent itemset. These exceptional frequent itemsets are found, with their supports, using the traditional Apriori algorithm.

The proposed procedure consists of two phases; the first phase must always be applied, while the second is applied only when necessary. The first phase of the traversal is a depth search: only the deepest node on the leftmost side is taken, and the support of this node in the database is computed. If its support passes the minimum support threshold, the search terminates; otherwise the second phase is applied. The second phase is a breadth search: the node generated in the first phase is taken as the root of the tree, and that tree is traversed breadth-first looking for the longest frequent itemset. The search terminates when the longest frequent itemset has been found.

Step Five: After all frequent itemsets have been found, the traditional association rule procedure is applied. This procedure produces the extracted association rules.
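As a rough illustration of Steps Four and Five, the following Python sketch computes supports, enumerates the children (sub-itemsets) of a frequent itemset that is already known, and derives rules from their confidence. It is not the proposed two-phase traversal itself: it assumes the longest frequent itemset is given. The toy transactions, the function names and the 0.9 confidence threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical encoded transactions: one set of attribute letters per document.
transactions = [
    frozenset("ABCDE"),
    frozenset("ABCD"),
    frozenset("ABCG"),
    frozenset("ABD"),
    frozenset("CDEFGHIJ"),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def children_with_support(longest, transactions):
    """Map every non-empty subset of `longest` to its support.

    By downward closure, each subset of a frequent itemset is at least
    as frequent as the itemset itself, so no pruning is needed here.
    """
    items = sorted(longest)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            result[frozenset(combo)] = support(combo, transactions)
    return result


def generate_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules lhs -> rhs.

    confidence = support(lhs and rhs together) / support(lhs).
    """
    rules = []
    for itemset, itemset_support in frequent.items():
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                lhs = frozenset(combo)
                confidence = itemset_support / frequent[lhs]
                if confidence >= min_confidence:
                    rhs = itemset - lhs
                    rules.append(
                        ("".join(sorted(lhs)), "".join(sorted(rhs)), confidence)
                    )
    return rules


# Assume "ABCD" was found to be the longest frequent itemset.
frequent = children_with_support("ABCD", transactions)
strong_rules = generate_rules(frequent, min_confidence=0.9)
```

Frequent itemsets that are not children of the longest one would still have to be found separately, for which the text names Apriori.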
Examples of the extracted association rules are:
A -> B
..
GHFRSTU -> OABCD
..

After extraction, we propose a procedure that is applied before the analysis stage. This procedure, called Rule Classification, classifies the rules into seven classes depending on the dimensions in which the itemsets of the right and left sides are found:
Class 1: The itemsets on both sides are included in the first dimension.
Class 2: The itemsets on both sides are included in the second dimension.
Class 3: The itemsets on both sides are included in the third dimension.
Class 4: The itemsets on both sides are included in the first and second dimensions.
Class 5: The itemsets on both sides are included in the first and third dimensions.
Class 6: The itemsets on both sides are included in the second and third dimensions.
Class 7: The itemsets on both sides are included in the first, second and third dimensions.

The classification of the association rules above is:
A -> B (Class 1)
..
GHFRSTU -> OABCD (Class 7)
..

Step Six: This step is the analysis stage, which is the most important step because it produces a full report of predictions, relationships and future trends to improve the performance of the mined database, which represents the encoding for the system. To explain this stage we show how to analyze the following rule:
GHFRSTU -> OABCD (Class 7)
This rule is classified as Class 7 since the itemsets of its left and right sides are included in the three dimensions. The left side has the frequent itemset GHFRSTU, which is composed of F in the first dimension, GHR in the second dimension and STU in the third dimension. The right side has the frequent itemset OABCD, which is composed of ABCD in the first dimension and O in the second dimension. From the classification and composition analysis, and from translating the encoded letters into their attributes, we can predict the following:
1. If dark is less than 0.5, blob is more than 0.3, schooling is high, pixel frequency passes its threshold, the language is English, the subject of the document is medical and the type of document is text, then
2. slopeneg will pass its threshold, the age of the writer will be less than 45, the gender will be female and the handedness will be right.
So we can predict the age, gender and handedness of the writer, and also the slopeneg of the data document, by knowing the dark, blob and pixelfreq of the data document and the schooling of the writer, combined with knowing the language, subject and type of the document.

4 IMPLEMENTATIONS
To explain the implementation of the proposed system, we follow these phases:

The first phase: Each handwritten document is taken and used to build the proposed multidimensional database as follows:
Convert the document to an image, and extract from it all the features of the second dimension (dark, blob, hole, slant, width, skew, height, slopehor, slopeneg, slopever, slopepos, pixelfreq). The feature values are obtained by traditional image processing procedures.
The metadata of the writer, represented by the first dimension (Age, Gender, Handedness, Ethnicity, Education and Schooling), is obtained as metadata appended to the document.
The metadata of the document, represented by the third dimension (language, subject of document, type of document), is obtained as metadata appended to the document.
Fig. 6 shows how to build the proposed multidimensional database by filling the text boxes with the feature values and the scanned document, then clicking the insertion command. The database is converted to a transactional database by clicking the convert command; if the conversion completes successfully, the message in Fig. 7 appears.

Figure 6: Form1 for building the multidimensional database and converting it to a transactional one.
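The conversion of one document's feature values into an encoded transaction, as described in Step Two and carried out in this first phase, might look like the Python sketch below. Only the encodings the text states explicitly (A, B, G, S, T, U) are implemented; the remaining attributes follow "the same strategy" but their thresholds are not given, so they are left out. The record layout and field names are assumptions made for illustration.

```python
def encode_document(doc):
    """Return the encoded transaction (a string of letters) for one document.

    A letter is emitted only when its condition holds; otherwise the
    letter is simply absent from the transaction.
    """
    letters = []

    # First dimension: writer features.
    if doc["age"] < 45:
        letters.append("A")
    if doc["gender"] == "female":
        letters.append("B")

    # Second dimension: written-data features (dark assumed already normalized).
    if doc["dark"] < 0.5:
        letters.append("G")

    # Third dimension: document features.
    if doc["language"] == "English":
        letters.append("S")
    if doc["subject"] == "medical":
        letters.append("T")
    if doc["type"] == "text":
        letters.append("U")

    return "".join(letters)


# Hypothetical record for a single scanned document.
example = {
    "age": 30,
    "gender": "female",
    "dark": 0.42,
    "language": "English",
    "subject": "medical",
    "type": "text",
}
encoded = encode_document(example)
```

Each encoded string becomes one row of the transactional database, in the style of the rows shown in Fig. 5.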
Figure 7: Message indicating that the conversion completed successfully.

Fig. 8 shows how to extract the frequent itemsets from the transactional database using the proposed procedure, by entering the initial expected longest frequent itemset and then clicking the extracting command, and how to obtain the rule classification by clicking the classification command.

Figure 8: Form2 displaying the extraction of frequent itemsets and the classification of rules.

The second phase: The second phase presents an application of the proposed system: extracting the features of the author from the features of the written document and the features of its subject. This is done by taking the document as an image, extracting the feature values of the document (second dimension) from it, and taking the subject feature values supplied with the document (third dimension). To extract the writer (author) features, the system mines the existing multidimensional database, according to the features extracted from the document image and the supplied features, to obtain the author features that correspond to them; see Fig. 9. Certainly, the proposed retrieving process depends on many thresholds related to the feature values.

Figure 9: Obtaining author features by supplying document and subject features.

5 CONCLUSIONS
From the proposed research we conclude the following:
1. Converting unstructured handwritten documents into a structured form, by building the proposed multidimensional database and then converting it into a transactional one, enriches the mining process since most of the features of documents, writers and data are included.
2. Building a transactional database with long itemsets makes it possible to include all the features considered important for prediction and for extracting new patterns.
3. Using association rule techniques instead of an ANN to deal with the features makes the mining process much more powerful, since there is no limitation on the number of features entered or on the number of features produced for classification and clustering.
4. Using the proposed procedure instead of the traditional procedure to find frequent itemsets makes finding all frequent itemsets from long itemsets efficient and less costly in time and space.
5. The proposed procedure for classifying the extracted rules makes the analysis process much easier and faster.

6 DISCUSSION
In the proposed system we assume that the number of attributes is less than 26, so each attribute is represented by one capital letter; if the number of attributes exceeds 26, both capital and small letters will be used. Representing all features as binary values does not decrease system performance, since critical thresholds are used for these features, which represent the attributes of the users' handwritten documents. The handwritten documents are initially represented by a multidimensional database so as to provide a general form for much future work on these documents; in the proposed system this multidimensional database is converted into a transactional database since the aim here is to apply the association rule data mining technique.
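Because each attribute is one letter, the Rule Classification procedure of Section 3 reduces to looking up which dimensions a rule's letters belong to. The Python sketch below makes two assumptions that the text does not state outright. First, the letter ranges (A-F for writer features, G-R for data features, S-U for document features) are inferred from the order in which the attributes are listed and from the decoded example rule. Second, a rule's class is taken from the dimensions covered by both sides together, which matches the two classified examples given.

```python
# Assumed letter-to-dimension assignment (inferred, not stated in the text).
DIMENSION_OF = {}
for letter in "ABCDEF":          # first dimension: writer features
    DIMENSION_OF[letter] = 1
for letter in "GHIJKLMNOPQR":    # second dimension: written-data features
    DIMENSION_OF[letter] = 2
for letter in "STU":             # third dimension: document features
    DIMENSION_OF[letter] = 3

# Class number for each combination of dimensions, following the class list.
CLASS_OF = {
    frozenset({1}): 1,
    frozenset({2}): 2,
    frozenset({3}): 3,
    frozenset({1, 2}): 4,
    frozenset({1, 3}): 5,
    frozenset({2, 3}): 6,
    frozenset({1, 2, 3}): 7,
}


def classify_rule(lhs, rhs):
    """Return the class (1-7) of the rule lhs -> rhs.

    The class depends on the set of dimensions touched by the letters
    of both sides taken together.
    """
    dimensions = frozenset(DIMENSION_OF[letter] for letter in lhs + rhs)
    return CLASS_OF[dimensions]


class_of_simple_rule = classify_rule("A", "B")
class_of_long_rule = classify_rule("GHFRSTU", "OABCD")
```

If the number of attributes exceeded 26 and lower-case letters were added, only the `DIMENSION_OF` table would need to grow.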
REFERENCES
[1] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.
[3] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.
[4] S. Mitra and T. Acharya. Data Mining: Multimedia, Soft Computing, and Bioinformatics. John Wiley and Sons, Inc.
[5] Ala'a H. AL-Hamami, Mohammad Ala'a Al-Hamami, and Soukaena Hassan Hasheem. Applying data mining techniques in intrusion detection system on web and analysis of web usage. Asian Journal of Information Technology, Vol. 5, No. 1, pp. 57-63.
[6] Ala'a H. AL-Hamami and Soukaena Hassan Hasheem. Privacy Preserving for Data Mining Applications. Journal of Technology, University of Technology, Baghdad, Iraq, Vol. 26, No. 5, 2008.
[7] Mohammad A. Al-Hamami and Soukaena Hassan Hashem. Applying Data Mining Techniques to Discover Methods that Used for Hiding Messages Inside Images. The IEEE First International Conference on Digital Information Management (ICDIM2006), Bangalore, India.
[8] Ala'a H. AL-Hamami, Mohammad Ala'a Al-Hamami, and Soukaena Hassan Hasheem. A Proposed Technique for Medical Diagnosis Using Data Mining. Fourth International Conference on Intelligent Computing and Information Systems (ICICIS 2009), Cairo, Egypt, March 19-22.


Applying Packets Meta data for Web Usage Mining

Prof. Dr. Alaa H. AL-Hamami, Amman Arab University for Graduate Studies, Zip Code: 11953, POB 2234, Amman, Jordan, 2009, Alaa_hamami@yahoo.com
Dr. Mohammad A. AL-Hamami