ISSN Vol.04,Issue.11, August-2016, Pages:

Size: px

Start display at page:

Download "ISSN Vol.04,Issue.11, August-2016, Pages:"

Aleesha Whitney Gardner
5 years ago
Views:

WWW.IJITECH.ORG ISSN 2321-8665 Vol.04,Issue.

1 ISSN Vol.04,Issue.11, August-2016, Pages: Design-Based Matter for Paper Representation in Data Filtering ATIYA ANJUM 1, K SHILPA 2 1 PG Scholar, Dept of CS, Shadan Women s College of Engineering and Technology, Hyderabad, TS, India, 2 Associate Professor, Dept of CSE, Shadan Women s College of Engineering and Technology, Hyderabad, TS, India, Abstract: In design-based approaches it have been used in the field of data filtering to make users data needs from a collection of paper. A basic assumption for these approaches is that the paper in the collection are all about one matter. However, users interests can be various and the paper in the collection often involve multiple subjects. Subject representation, such as Latent Dirichlet Allocation (LDA), was put forward to make statistical frame work to represent multiple issue in a collection of documents, and this has been utilized in the fields of machine learning and data retrieval, etc. But its effectiveness in data filtering has not been so well examine. Patterns are always more critical than single terms for describing a paper. However, the huge amount of designs stop them from being able to use in real approach, therefore, selection of the most critical and representative designs from the huge amount of designs becomes crucial. To deal with this problems, a novel data filtering model, Maximum matched design-based matter representation (MDBMR), is proposed. Keywords: Topic Model, Data Filtering, Design Mining, Relevance Ranking, and User Interest Model. I. INTRODUCTION Data filtering (DF) is a system to remove redundant or unwanted data from a data or document stream based on paper representations which represent user s interest. Traditional DF models were developed based on a term-based approach, whose advantage is efficient computational performance, as well as mature theories for term weighting. But term based paper representation suffers from the problems of polysemy and synonymy. To overcome the limitations of term based approaches, design mining based techniques have been used for data filtering and achieved some improvements on effectiveness, since design carry more semantic meaning than terms. Topic modeling has become one of the most popular probabilistic text modeling techniques and has been quickly accepted by machine learning and text mining communities. It can automatically classify papers in a collection by a number of subjects and represents every paper with multiple subject and their corresponding distribution. Two representative approaches are Probabilistic Latent Semantic Analysis (PLSA) and LDA. However, there are two problems in directly applying topic models for data filtering. The first problem is that the topic distribution itself is insufficient to represent documents due to its limited number of dimensions (i.e. a pre-specified number of subjects). The second problem is that the word based topic representation is limited to distinctively represent documents which have different semantic content since many words in the topic modelling are frequent general words. In this project propose to overcome the limitation of existing system by using Natural Language Processing (NLP), i.e., the open English NLP 2.0 library used in enhanced LDA algorithm for filtering semantic meanings of designs from the collections of subjects. Here the LDA apply through the Gibbs sampling method and here also proposed Maximum matched Design-Based Matter Representation (MDBMR) for maximum matched design representation and document relevance ranking and also it to select the most representative and discriminative designs, which are to represent subjects instead of using frequent designs. After installation of this application, it helpful for paper searching inefficient and easy way from number of different papers and also available to download and view the paper based on user's interested area or designs. II. RELATED WORK Data filtering deals with the delivery of data that the user is likely to find interesting or useful. A data filtering system assists users by filtering the data source and deliver relevant information to the users. When the delivered information comes in the form of suggestions an information filtering system is called a recommender system. Because users have different interests the data filtering system must be personalized to accommodate the individual user's interests. This requires the gathering of feedback from the user in order to make a user profile of the preferences. Two major approaches exist for data filtering: content-based filtering and collaborative filtering system. A content-based filtering system selects items based on the correlation between the content of the items and the user's preferences, while a collaborative filtering system chooses items based on the correlation between people with similar preferences. Text clustering methods can be used to structure large sets of text or hypertext papers. The well-known methods of text clustering, however, do not really address the special problems of text clustering: very high dimensionality of the data, very large size of the databases and understand ability of the cluster description. Here introduced novel approach which uses frequent item (term) sets for text clustering IJIT. All rights reserved.

2 Such frequent sets can be efficiently discovered using algorithms for association rule mining. To cluster based on frequent term sets; here measure the mutual overlap of frequent sets with respect to the sets of supporting papers. We present two algorithms for frequent term-based text clustering, FTC which creates at clustering and HFTC for hierarchical clustering. An experimental evaluation on classical text papers as well as on web papers demonstrates that the proposed algorithms obtain clustering of comparable quality significantly more efficiently than state-of-the art text clustering algorithms. Furthermore, our methods provide an understandable description of the discovered clusters by their frequent term sets. Direct Discriminative design Mining for Effective Classification: direct discriminative design mining approach, DDP Mine, to tackle the efficiency issue arising from the two-step approach. DDP Mine performs a branchand bound search for directly mining discriminative designs without generating the complete design set. Instead of selecting best designs in a batch, a "feature centered" mining approach that generates discriminative designs sequentially on a progressively shrinking FP-tree by incrementally eliminating training instances. III. EXISTING SYSTEM All these data mining and text mining techniques hold the assumption that the user s Interest is only related to a single topic. However, in reality this is not necessarily the case. The number of returned patterns is huge because if a pattern is frequent, then each of its sub patterns is frequent too. Thus, selecting reliable patterns is always very crucial IV. PROPOSED SYSTEM User data needs are generated in terms of multiple subjects each subject is represented by design. designs are generated from topic models and are organized in terms of their statistical and taxonomic features; The most discriminative and representative designs, called Maximum Matched Designs, are proposed to estimate the document relevance to the user s data needs in order to filter out irrelevant documents. This phase consist of four stages. They are: Design Equivalence Class Topic-based User Interest Modeling Topic-based Document Relevance Ranking Algorithms. A. Latent Dirichlet Allocation Topic modeling algorithms are used to discover a set of hidden subjects from collections of papers, where a subject is represented as a distribution over words. Topics models provide interpret able low-dimensional representation of papers. LDA is a typical statistical topic modeling technique and the most common topic modeling tool currently in use. It can discover the hidden subjects in collections of papers using the words that appear in the paper. ATIYA ANJUM, K SHILPA contain structural information which can reveal the association between words. In order to discover semantically meaningful designs to represent subjects and papers, two steps are proposed: 1st construct a new transnational data set from the LDA model results of the document collection secondly, generate design-based representations from the transnational data set to represent user needs of the collection. C. Maximum Matched Designs Maximum Matched Designs (MDBMM), the design which represent user interests are not only grouped in terms of subjects, but also partitioned based on equivalence classes in each subject group. The design in different groups or different equivalence classes have different meanings and distinct properties. Thus, user data needs are clearly represented according to various semantic meanings as well as distinct properties of the specific designs in different subject groups and equivalence classes. Here NFA based Maximum Matched Designs are used, this help to expression based searching and NLP is used in the enhanced LDA model for Semantic meaning designs data or subjects filtering. The general structure of the proposed DF model is depicted. In the training part, here first generate user interest models from user profile (documents) by utilizing the proposed two-stage design-based matter representation. The two models are proposed to generate different designs to represent subjects in the user interest models. The first DF model is based on Design-Based Matter Representation (DBMR), in which user's interests are represented by multiple subjects and each subject is represented by using all frequent designs or closed designs. B. Design Enhanced LDA Design-based representations are considered more meaningful and more accurate to represent subjects than word based representations. Moreover, design based representations Fig.1. The Structure of the Proposed DF Model for Data Filtering and Retrieval Based on NLP and NFA Based Document Search.

3 Design-Based Matter for Paper Representation in Data Filtering The second DF model is based on LDA, in which the designs in each subject is further organized by different groups of equivalence classes. In the filtering part, for new incoming papers, based on the user interest models, relevant topical designs are selected and used to calculate the relevance of papers to the user's interest in order to filter out irrelevant papers and provide relevant papers to the user. Further, for the relevance ranking in Design-Based Matter Representation, corresponding to frequent designs and closed designs, there are two ranking models that is Design-Based Matter Representation FP and Design-Based Matter Representation FCP. Similarly, there are two ranking models in the Design-Based Matter Representation. The proposed topic model represents subjects using designs with structural characteristics which make it possible to interpret the subjects with semantic meanings. As with existing topic models, the proposed model is application independent and can be applied to various domains. Data filtering (DF) is a system to remove redundant or unwanted data from a data or paper stream based on paper representations which represent user's interests. The input data of DF is usually a collection of papers that a user is interested, which represent the user's long-term interests often called the user's profile. As mentioned before, user's data needs usually involve multiple subjects. Hence, the proposed Design-based Matter Representation is applied to extract long-term user's interest through DF. Information retrieval (IR) typically seeks to find papers that are related to a user generated query from a given collection. The input data of IR is a query consisting of a number of terms which represent the user s short-term interest. For both IF and IR systems, besides user interest modeling, another essential part is document relevance ranking, which estimates the relevance between user's interests and documents. Whether the information retrieval or information filtering, users always to find the most relevant and reliable information. The quality of relevance ranking can potentially impact on user's perception of the specific ranking system's reliability. In this thesis, the relevance ranking models are established based on how to manage subjects and extract the most meaningful and useful data from the multiple subject user interest model and the semantic meaning search design and NFA extract most meaningful and expression based document data. Following are the contributions are: The important contributions to the domains of topic modeling, information filtering and information retrieval For more information filtering and retrieval, here proposed Open English 2.0 NLP and NFA To extract efficient semantic meaning design based document data by using Open English 2.0 NLP To extract more papers based on expression related searching by using regular expression Non-deterministic niter automaton (NFA) method. 1. NFA based Maximum Matched Design Algorithm Input: Document topics Zj Equivalent class ECj...n Search expression R Output: Maximum Matched Design MC Process: 1. Take search expression R 2. Apply NFA expression to get Regular Expression set Rs 3. For expression ER on Rs 4. Check Rs sum Ec and Rs sum d 5. If (valid) Mc Rs 6. Update Rank (Mci) + MCjk 0.5 fci vci 7. End for generate Mc D. GibbsLDA++ Gibbs LDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference. It is very fast and is designed to analyze hidden/latent topic structures of large scale data sets including very large collections of text/web documents. There have been several implementations of this model in C, Java, and Mat-lab. Here decided to release this implementation of LDA in C/C++ using Gibbs Sampling to provide an alternative choice to the topic-model community. Gibbs LDA++ is useful for the following potential application areas Information Retrieval (analyzing semantic/latent topic/concept structures of large text collection for a more intelligent information search. Document Classification/ Clustering, Document Summarization, and Text/Web Data Mining community in general. Collaborative Filtering. Content-based Image Clustering, Object Recognition, and other applications of Computer Vision in general. Other potential applications in biological data. V. EVALUATION TABLE I. The MPBTM with Different Topic Number Two hypotheses are designed for verifying the DF model proposed in this paper. The first hypothesis is given that user data needs involve multiple subjects, then document modeling by taking multiple subjects into consideration can generate more accurate user models to represent user data needs. The second hypothesis is that the proposed maximum matched designs are more effective than other designs to be used in determining relevant documents. To verify the hypotheses, experiments and evaluation have been conducted. This section discusses the experiments and evaluation in terms of data collection, baseline models, measures and results. We implement PLSA model with Lemur toolkit1 with 1000 iterations as default setting. And the LDA model is implemented by MALLET toolkit. The parameters for all LDA-based topic models are set as follows: the number of Iterations of Gibbs sampling is 1000, the hyper-parameters of the LDA model are α = 50/V and β = 0.01, which were used

and justified. Our experience shows that the performance of the proposed MDBMR model is not very sensitive to the settings of these parameters. The results are shown in Table 1.

4 and justified. Our experience shows that the performance of the proposed MDBMR model is not very sensitive to the settings of these parameters. The results are shown in Table 1. In the process of generating design enhanced subject representations, the minimum support σ rel for every subject in each collection is different, because the number of positive documents in different collections of the RCV1 is very different. In order to ensure enough transactions from positive documents to generate accurate designs for representing user needs, the minimum support σ rel is set as follows: 1 n 2 max (2/n, 0.3) 2 < n 10 σ rel = max 3/n, < n 13 (1) max(4/n, 0.3)13 < n oterwise. Where n is the number of transactions from relevant documents in each transactional database. (1) Pattern-based category 1. FCP Frequent closed designs are generated from the documents in the training dataset and used to represent the user s interests. The minimum support in the design-based representation, including the following two models for sequential closed design and phrases, is set to Sequential Closed Pattern (SCP) The design Taxonomy Model is one of the state-of-the art design-based models. It was developed to discover sequential closed patterns from the training dataset and rank incoming documents in the filtering stage with the relative supports of the discovered patterns that appear in the papers. In this model, every paper in the training dataset (D) is split in paragraphs which are the transactions for design mining. 3. n-gram TABLE II. Comparison of All Models on All Measures Using the First 50 Document Collections of RCV1 Most researches on phrases in modeling documents have employed an independent collocation discovery module. In ATIYA ANJUM, K SHILPA this way, a phrase with independent statistics can be indexed exactly as a word-based representation. a document collection, where n is empirically set to 3. (2) Term-based category 4. BM25 BM25 [2] is one of the state-of-the-art term-based document ranking approaches. In this paper, the term weights are estimated using the following equation: W t = tf (k+1) k 1 b +b DL AVDL log N n+0.5 +tf n+0.5 Where N is total number of documents in the collection; n is the number of documents that contain term t; tf is the term frequency; DL and AVDL are the document length and average document length, receptively; and k and b are the parameters, which are set as 1.2 and 0.75 as used and explained. 5. Support Vector Machine (SVM) The linear SVM has been proven very effective for text categorization and filtering. We would compare it with other baseline models; however, most existing SVMs are designed for making a binary decision rather than ranking documents. The SVM only uses term-based features extracted from training documents. There are two classes: y i ϵ {-1, 1} where +1 is assigned to a document if it is relevant; otherwise it is assigned with -1 and there are N labeled training examples: (d 1, y 1 ),, (d i, y i ),,(d N, y N ), d i ϵ R n where n is the dimensionality of the vector. Given a function h (d)= < w.d > +b where b is the bias, h(d) =+1 if < w.d > +b 0; otherwise h(d) =-1, and < w. d > is the dot product of an optimal weight vector w and the document vector d. To find the optimal weight vector w for the training set, we perform the N l following function: w = i=1 yiαidi subject to i=1 αiyi = 0 and α i 0, where α i is the weight of the sample d i. For the purpose of ranking, b can be ignored. VI. RESULTS For different document collections, the number of subjects involved in the collections can be different. Therefore, selecting an appropriate number of subjects is important. As Table 1 shows, the result of the MDBMR with 5 or 10 subjects achieves relatively the best performance for this particular dataset. When the subject number rises or reduces, the performance drops. Especially when the subject number rises to 15, the performance drops dramatically, although still outperforms most of the baseline models in Table 2. The proposed model MDBMR with 10 subjects is compared with all the baseline models mentioned above using the 50 human assessed collections. The results are depicted in Table 2 and evaluated using the measures in Section 4.2. Table 2 consists of three parts. The top, middle, and bottom parts in Table 6 provide the results of the topic modeling methods, the pattern mining methods, and term-based methods, respectively. The improvement% line at the bottom of each part provides the percentage of improvement achieved by the MDBMR against the best model among all the other baseline models in that part for each measure. (2)

5 Design-Based Matter for Paper Representation in Data Filtering A. Comparisons with Topic-based Models From the top part of Table 2, we can see that, the MDBMR outperforms all other topic-based models for all the four measures. The DBMR_FCP is the second best model for measures top20 and b/p, and is in a tie with the PBTM_FP as the second best model for measure F 1. The PBTM_FP is the second best model for measure MAP. This result demonstrates that using closed patterns (DBMR_FCP) and, especially, using the proposed maximum matched design (MDBMR) to represent topics achieved better results than using frequent designs (DBMR_FP) for most measures and better than using phrases (TNG) or words (PLSA_word and LDA_word) for all measures. The improvement% line in the top part of Table 1 shows that, the MDBMR which uses the maximum matched designs consistently achieves the best performance with the improvement percentage against the second best model from a minimum of 8.5 percent to a maximum of 11.7 percent. The comparison results clearly support the second hypothesis. By observing the results in Table 3 across all three parts, we can see that the topic-based models in the top part (except for TNG and PLSA_word) achieved better performance than the models in the second and third parts. Even though the TNG model did not always perform better than the non-topic modeling based models, it performs considerably better than the n-gram models. Both TNG and n-gram use phrases, the difference is that TNG uses phrases to represent the semantic meaning of subjects and also uses subject distributions to represent the user s subject preference, whereas n-gram directly uses phrases to represent the user s data needs. Similar comparisons can be made between DBMR_FCP and FCP, LDA_word and BM25, and between LDA_word and SVM. Both DBMR_FCP and FCP use frequent closed patterns to represent user interests. However, DBMR_FCP achieved clearly better performance TABLE III. T-Test P-Values for All Models Compared with the MDBMR the performance of the PLSA_word model is not better than BM25 or SVM. B. Comparisons with Pattern-Based Models The comparison results among the proposed model and pattern-based baseline models are in the middle part of Table 2. We can see that all the three pattern-based topic modelling models, i.e. MPBTM, PBTM_FCP and PBTM_FP, outperform the three pattern-based baseline models, i.e. SCP, n-gram, and FCP, which clearly shows the strength obtained by combining topic modelling with pattern-based models. Among the three baseline models, the SCP outperforms the other two models for b/p, MAP and F 1, while the FCP model performs the best for top20. The bottom line of the patternbased part in the table provides the percentage of improvement achieved by the MPBTM against the SCP for b/p, MAP and F 1, and against the FCP model for top20. C. Comparisons with Term-based Models From the bottom section of Table 2, we can see that the SVM achieved better performance than the BM25, while the MDBMR and the DBMR_FCP and the DBMR_FP consistently outperform the SVM. The maximum and minimum improvement achieved by the MDBMR against the SVM is 23.5 and 9.3 percent, respectively. We also conducted the T-test to compare the MDBMR with all other DBMR models and baseline models. The results are listed in Table 3. The statistical results indicate that the proposed MDBMR significantly outperforms all the other models (all values in Table 3 are less than 0.05) and Fig Point Results of Comparison between the Proposed MDBMR and Baseline Models. Than FCP simply because it takes multiple topics into consideration when generating user interests. The same reason applies for the better performance of LDA_word over BM25 and SVM; all of these use words to represent user interest, but LDA_word is a topic modelling method while BM25 and SVM are not. These comparisons can strongly validate the first hypothesis, i.e. taking multiple topics into consideration can generate more accurate user information needs. However, The improvements are consistent on all four measures. Therefore, we conclude that the MDBMR is an exciting achievement in discovering high-quality features in text documents mainly because it represents the text documents not only using the topic distributions at a general level but also using hierarchical design representations at a detailed specific level, both of which contribute to the accurate document relevance ranking.

6 VII. CONCLUSION Here presents an innovative design enhanced topic model for data filtering including user interest modeling and document relevance ranking. The proposed MDBMR model generates design enhanced topic representations to model user's interest s across multiple subjects. In the filtering stage, the MDBMR selects maximum matched designs, instead of using all discovered designs, for estimating the relevance of incoming documents. The proposed approach incorporates the semantic structure from topic modeling and the statistical relevant method from the most representative designs. The proposed model has been evaluated by using the Open English NLP 2.0 on the enhanced LDA models for retrieving semantic meaning of designs and NFA based Maximum matched designs based on regular expression for the task of more data filtering. In the proposed model demonstrates excellent strength on document modeling and relevance ranking. The proposed model automatically generates discriminative and semantic rich representations for modeling topics and documents by combining statistical relevant topic modeling techniques and data mining techniques. The technique not only can be used for data filtering, but also can be applied to many content-based feature extraction and modeling tasks. VIII. REFERENCES [1]Yang Gao, Yue Xu, and Yuefeng Li, Pattern-based Topics for Document Modelling in Information Filtering, IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 6, June [2]S. Robertson, H. Zaragoza, and M. Taylor, Simple BM25 extension to multiple weighted fields, in Proc. 13th ACM Int. Conf. Inform. Knowl. Manag., 2004, pp [3]F. Beil, M. Ester, and X. Xu, Frequent term-based text clustering, in Proc. 8th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2002, pp [4]Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal, Mining frequent patterns with counting inference, ACM SIGKDD Explorations News lett., vol. 2, no. 2, pp , [5]H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative frequent pattern analysis for effective classification, in Proc. IEEE 23rd Int. Conf. Data Eng., 2007, pp [6]R. J. Bayardo Jr, Efficiently mining long patterns from databases, in Proc. ACM Sigmod Record, 1998, vol. 27, no. 2, pp [7]J. Han, H. Cheng, D. Xin, and X. Yan, Frequent pattern mining: Current status and future directions, Data Min. Knowl. Discov., vol. 15, no. 1, pp , [8]M. J. Zaki and C.-J. Hsiao, CHARM: An efficient algorithm for closed item set mining. in Proc. SDM, vol. 2, 2002, pp [9]Y. Xu, Y. Li, and G. Shaw, Reliable representations for association rules, Data Knowl. Eng., vol. 70, no. 6, pp , [10]X. Wei and W. B. Croft, LDA-based document models for ad-hoc retrieval, in Proc. 29th Annu. Int. ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 2006, pp ATIYA ANJUM, K SHILPA [11]C. Wang and D. M. Blei, Collaborative topic modeling for recommending scientific articles, in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2011, pp [12]D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., vol. 3, pp , Author s Profile: Ms.Atiya Anjum is currently pursuing M. Tech with specialization in Computer Science at Shadan Women s College of Engineering and Technology, Telangana, India. She has received her Bachelor degree in Computer Science and Engineering from V.R.K. Women s College of Engineering And Technology, Telangana, India. K Shilpa is an Associate Professor in Shadan Women s College of Engineering and Technology. She has 6 years of teaching experience. She holds a post graduate degree of M.Tech (SE) from JNTU.

Mining User - Aware Rare Sequential Topic Pattern in Document Streams

Mining User - Aware Rare Sequential Topic Pattern in Document Streams A.Mary Assistant Professor, Department of Computer Science And Engineering Alpha College Of Engineering, Thirumazhisai, Tamil Nadu,