A LITERATURE SURVEY: EDUCATIONAL DATA MINING AND WEB DATA MINING

Sarvesh Tendulkar, Swarada Adavadkar
Department of Computer Engineering
MIT College of Engineering
Pune, India

ABSTRACT: Data Mining (DM) is the extraction of information from large sets of data to obtain meaningful results. Data mining has evolved into diverse fields, which require the development of data collection and creation, data management (including data storage, retrieval, and database transaction processing), and advanced data analysis (data warehousing and mining). This paper covers two data mining fields: Educational Data Mining (EDM) and Web Data Mining (WDM). EDM is an emerging field that applies data mining algorithms for the betterment of students and to improve the quality of student courseware. WDM is a technique used to crawl through web resources to collect information that an individual or a company can use to promote business. This survey compares published research papers on the methodologies used in EDM and WDM and gives an insight into the techniques used in both fields.

Keywords: keyword indexing, search engines, document frequency, term weights, data mining, neural network, predicting school failure, job trends analysis

PART I: EDM

[1] INTRODUCTION

Educational Data Mining (EDM) is a new trend in the data mining and Knowledge Discovery in Databases (KDD) field. It focuses on mining useful patterns and discovering useful knowledge from educational information systems, such as admissions systems, registration systems, course management systems (Moodle, Blackboard, etc.), and any other systems dealing with students at different levels of education, from schools to colleges and universities. As interest in EDM continued to increase, EDM researchers established an academic journal in 2009, the Journal of Educational Data Mining, for sharing and disseminating research results.
In 2011, EDM researchers established the International Educational Data Mining Society to connect EDM researchers and continue to grow the field.

Goals of EDM in the educational field: Baker and Yacef [1] describe the following four goals of EDM:
1) Predicting students' future learning behaviour

2) Discovering or improving domain models
3) Studying the effects of educational support
4) Advancing scientific knowledge about learning and learners

[2] EDM METHODS

Classification
It is a two-step technique (training and testing) that maps data into predefined classes. This technique is useful for success analysis of low-, medium-, and high-risk students [2], student monitoring systems [3], predicting student performance, misuse detection [4], etc.

Statistics
It is a technique to identify outlier fields and records using the mean, mode, etc., and hypothesis testing. This technique is useful to improve course management systems and student response systems [5].

Clustering
It is a technique to group similar data into clusters in such a way that the groups are not predefined. This technique is useful to distinguish learners by their preferences in using interactive multimedia systems [6] and for comprehensive character analysis of students [7], and it is suitable for collaborative learning.

Prediction
It is a technique that predicts a future state rather than a current state. This technique is useful to predict students' success rate and dropout, as in Dekker et al. [7], and for retention management [8].

Neural Network
It is a technique to improve the interpretability of the learned network by using extracted rules for learning networks. This technique is useful to determine residency and ethnicity [9], to predict academic performance [10], for accurate prediction in branch selection, and to explore learning performance in a TESL-based e-learning system.

Association Rule Mining
It is a technique to identify specific relationships among data.
This technique is useful to identify students' failure patterns [11]; to study parameters related to the admission process, migration, contribution of alumni, and student assessment; to find correlations between different groups of students; and to guide the search for a better-fitting transfer model of student learning.

Apart from the above methods, two newer methods, distillation of data for human judgment and discovery with models, are used to analyze the behavioral impact of students in learning environments.

[3] APPLICATIONS
1. Analysis and Visualization of Data
2. Predicting Student Performance
3. Enrolment Management
4. Grouping Students
5. Predicting Student Profiling
6. Planning and Scheduling
7. Organization of Syllabus
8. Detecting Cheating in Online Examination
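
As an illustration of the association rule mining method described under EDM METHODS, the following minimal Python sketch computes the two standard measures of a rule, support and confidence, for a rule of the form "a student who fails course A also fails course B". The records, the course names, and the helper functions (count_containing, support, confidence) are hypothetical and are not taken from any of the surveyed papers; the sketch only shows how a failure pattern of the kind reported in [11] could be scored.

```python
# Hypothetical data: each set holds the courses one student failed.
records = [
    {"math", "physics"},
    {"math", "physics", "chemistry"},
    {"math"},
    {"physics", "chemistry"},
    {"math", "physics"},
]


def count_containing(itemset, records):
    """Number of records that contain every item in itemset."""
    return sum(1 for record in records if set(itemset) <= record)


def support(itemset, records):
    """Fraction of all records that contain itemset."""
    return count_containing(itemset, records) / len(records)


def confidence(antecedent, consequent, records):
    """Share of records containing antecedent that also contain consequent."""
    antecedent_count = count_containing(antecedent, records)
    if antecedent_count == 0:
        return 0.0  # the rule never applies in this data
    both = set(antecedent) | set(consequent)
    return count_containing(both, records) / antecedent_count


# Rule under test: "fails math" -> "fails physics"
rule_support = support({"math", "physics"}, records)
rule_confidence = confidence({"math"}, {"physics"}, records)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

On this toy data, three of the five students fail both courses (support 0.60), and three of the four students who fail math also fail physics (confidence 0.75). A full association rule miner such as Apriori would additionally enumerate candidate itemsets and keep only the rules above chosen support and confidence thresholds.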

[5] LITERATURE SURVEY OF EDM TRENDS

Dutta Borah et al., 2011 [13]
   Data mining software: SPSS Clementine 11.1
   Data mining algorithm(s): Classification (Decision Tree)
   Input dataset: AIEEE 2007
   Educational outcome: Predict student's enrolment decision

Lin S.H., 2012 [8]
   Data mining software: WEKA
   Data mining algorithm(s): Decision Tree algorithm
   Input dataset: 5943 records of 1st year students of Biola University
   Educational outcome: Student retention management

Merceron and Yacef, 2005 [14]
   Data mining software: Excel, Clementine, Tada-Ed, SODAS
   Data mining algorithm(s): Classification (DT), Clustering, Association Mining
   Input dataset: Logic-ITA student data
   Educational outcome: Student/teachers' performance monitoring system

Amjad Abu Saa, 2016 [15]
   Data mining software: RapidMiner, WEKA
   Data mining algorithm(s): Decision Tree (C4.5, CART, CHAID, ID3)
   Input dataset: Survey distributed to students within their daily classes and as an online survey using Google Forms
   Educational outcome: Students' performance prediction

Carlos Márquez-Vera, Cristóbal Romero Morales, and Sebastián Ventura Soto, 2013 [12]
   Data mining software: WEKA
   Data mining algorithm(s): Decision Tree, Rule Induction, Synthetic Minority Oversampling Technique (SMOTE)
   Input dataset: A specific survey, a general survey, the Department of School Services
   Educational outcome: Predicting the academic status of students

PART II: WDM

[6] INTRODUCTION

Web Data Mining is a technique used to crawl through various web resources to collect required information, which enables an individual or a company to promote business, understand marketing dynamics, track new promotions floating on the Internet, etc. Web mining refers to the automatic discovery of interesting and useful patterns from the

data associated with the usage, content, and linkage structure of Web resources. Web mining lies between data mining and text mining and copes with semi-structured and/or unstructured data. It calls for creative use of data mining and/or text mining techniques as well as its own distinctive approaches. Mining web data is one of the most challenging tasks for data mining and data management scholars because the data available on the web is huge, heterogeneous, and less structured, and one can easily get overwhelmed by it [18].

Web data mining can be broadly classified into 3 types:
Web usage mining
Web content mining
Web structure mining

[7] WDM APPLICATIONS

Applications:
1. E-commerce, to do personalized marketing
2. Analyzing feedback for a particular product
3. Government agencies, to identify threats and fight terrorism
4. Companies, to keep better relations with customers
5. Analyzing current market trends for better performance [20]

[9] LITERATURE SURVEY: INDEXING TECHNIQUES

1. Changshang Zhou, Wei Ding, Na Yang, 2006 [24]
   Indexing technique: Double Indexing Mechanism of Search Engine
   Methodology: It has a document index as well as a word index. The document index is based on clustering and is ordered by the position in each document. During retrieval, the search engine first gets the document id of the word from the word index and then goes to the position of the corresponding word in the document index.
   Limitations: Time consuming, as the index exists at two levels.

2. Fabrizio Silvestri, Raffaele Perego and Salvatore Orlando, 2004 [23]
   Indexing technique: Assigning Document Identifiers to Enhance Compressibility of Web Search Engines Indexes
   Methodology: A reordering algorithm that partitions the set of documents into k ordered clusters on the basis of a similarity measure. The biggest document is selected as the centroid of the first cluster, and the n/k - 1 most similar documents are assigned to this cluster. Then the biggest remaining document is selected and the same process repeats.
   Limitations: It is not effective in clustering the most similar documents. The biggest document may not be similar to any of the other documents, but it is still taken as the representative of the cluster.

3. L. Huilin, K. Chunhua and W. Guangxing, 2007 [22]
   Indexing technique: Efficiently Crawling Strategy for Focused Searching Engine
   Methodology: A focused search engine comprising two components: filtering and link forecasting. The first component filters out the irrelevant documents from the previously fetched list of documents, and the second component predicts the best-matched links from related documents and assigns them to the crawler for further downloading.
   Limitations: The focused search engine does not consider the context of the keywords of the user query.

4. Rada Mihalcea and Dan Moldovan, 2000 [21]
   Indexing technique: Semantic Indexing using WordNet senses
   Methodology: A Boolean information retrieval system adds word semantics to classic word-based indexing. Two of the core tasks of the system, namely the indexing and retrieval components, use a combined word-based and sense-based approach. It shows improved effectiveness over classic word-based indexing techniques.
   Limitations: The representation is dense, so it is hard to index based on individual dimensions.

5. X. Chen, X. Zhang, 2008 [20]
   Indexing technique: HAWK: A Focused Crawler with Content and Link Analysis
   Methodology: It is based on the content of the document and on link analysis. It is implemented using a user-defined relevance formula, shark-search, and PageRank.
   Limitations: It does not make use of domain ontologies to represent topical maps and link Web pages.

6. Parul Gupta, Dr. A.K. Sharma, 2010 [16]
   Indexing technique: Context based Indexing in Search Engines using Ontology
   Methodology: An index is built on the basis of the context of the document, using ontology, rather than on a term basis. The indexer extracts the context of the documents collected by the crawler in the repository using the context repository, thesaurus, and ontology repository, and the documents are then indexed according to their respective context.
   Limitations: It considers terms only in the title of the document with maximum frequency. The user has to pass the context along with the query keyword, which creates a problem for naive users.

7. P. Mudgil, A.K. Sharma, P. Gupta, 2013 [17]
   Indexing technique: An Improved Indexing Mechanism to Index Web Documents
   Methodology: This indexing considers the senses of the keyword. Different users type the same query to get different results according to their interest. It provides the different senses to the user, which helps in generating more relevant results. The processing of a file includes removal of stop words, stemming, adding of weights, and indexing of the file.

[10] CONCLUSION

Data Mining technology has emerged to remedy the issues of management and analysis of massive volumes of data. In this paper, a comparison of trends and techniques of EDM and WDM is presented. This work focuses on various trends used in educational data mining and

different technologies used for web data mining. It is observed that the educational outcome varies according to the input dataset provided for the given experiment. From the above survey it can also be inferred that context-based indexing gives more relevant results efficiently.

REFERENCES

[1] Baker, R.S., Yacef, K. (2009), "The state of educational data mining in 2009: A review and future visions", JEDM - Journal of Educational Data Mining, Vol. 1, No. 1, pp. 3-17.
[2] Vandamme, J.P. et al. (2007), Predicting academic performance by Data Mining methods, Taylor and Francis Group journal Education Economics, Vol. 15, No. 4, pp. 405-419.
[3] Luan, J. (2002), Data mining, knowledge management in higher education, potential applications, in Workshop associate of institutional research international conference.
[4] Baker, R.S., Corbett, A.T., Koedinger, K.R. (2004), Detecting Student Misuse of Intelligent Tutoring Systems, in Proc. Lecture Notes in Computer Science, Vol. 3220, pp. 531-540.
[5] Campbell, J.P., and Oblinger, D.G. (2007, Oct.), Academic Analytics, EDUCAUSE, Washington, D.C. [Online]. Available: http://net.educause.edu/ir/library/pdf/pub6101.pdf
[6] Chrysostomou, K. et al. (2009), Investigation of users' preference in interactive multimedia learning systems: a data mining approach, Taylor and Francis online journal Interactive Learning Environments, Vol. 17, No. 2.
[7] Zhang, Y. et al. (2010), Using data mining to improve student retention in HE: a case study, in Proc. 12th Int. Conf. on Enterprise Information Systems, Volume 1: Databases and Information Systems Integration, Portugal, pp. 190-197.
[8] Lin, S.H. (2012), Data Mining for student retention management, ACM Journal of Computing Sciences in Colleges, Vol. 27, No. 4, pp. 92-99.
[9] Yu et al. (2010), A data mining approach for identifying predictors of student retention from sophomore to junior year, Journal of Data Science, Vol. 8, pp. 307-325.
[10] Vandamme, J.P. et al. (2007), Predicting academic performance by Data Mining methods, Taylor and Francis Group journal Education Economics, Vol. 15, No. 4, pp. 405-419.
[11] Oladipupo, O.O., Oyelade, O.J. (2009), Knowledge Discovery from Students' Result Repository: Association Rule Mining Approach, International Journal of Computer Science & Security, Vol. 4, No. 2, pp. 199-207.
[12] Carlos Márquez-Vera, Cristóbal Romero Morales, and Sebastián Ventura Soto, "Predicting School Failure and Dropout by Using Data Mining Techniques", IEEE Journal of Latin-American Learning Technologies, Vol. 8, No. 1, February 2013.
[13] Dutta Borah, M., Jindal, R., Gupta, D., Deka, G.C. (2011), Application of knowledge based decision technique to Predict student's enrolment decision, in Proc. Int. Conf. on Recent Trends in Information Systems, India: IEEE, pp. 180-184, DOI: 10.1109/ReTIS.2011.6146864.
[14] Merceron, A., and Yacef, K. (2005), Educational Data Mining: a case study, in Proc. Conf. on Artificial Intelligence in Education: Supporting Learning through Intelligent and Socially Informed Technology, IOS Press, Amsterdam, The Netherlands, pp. 467-474.
[15] Amjad Abu Saa, "Educational Data Mining & Students' Performance Prediction", International Journal of Advanced Computer Science and Applications, Vol. 7, No. 5, 2016.
[16] P. Gupta and Dr. A.K. Sharma, Context based Indexing in Search Engines using Ontology, International Journal of Computer Applications, Volume 1, No. 14.
[17] P. Mudgil, A.K. Sharma and P. Gupta, An Improved Indexing Mechanism to Index Web Documents, 5th International Conference on Computational Intelligence and Communication Networks.
[18] Smith, D., & Ali, A., Analyzing computer programming job trend using web data mining, Issues in Informing Science and Information Technology, 11, pp. 203-214.
[19] http://www.ieeefinalyearprojects.org/web_mining_projects_in_java_dotnet.html
[20] X. Chen, X. Zhang, HAWK: A Focused Crawler with Content and Link Analysis, IEEE International Conference on e-Business Engineering, pp. 677-680, 2008.
[21] Rada Mihalcea and Dan Moldovan, Semantic Indexing using WordNet senses, in Proceedings of ACL Workshop on IR and NLP, Hong Kong, 2000.
[22] L. Huilin, K. Chunhua and W. Guangxing, Efficiently Crawling Strategy for Focused Searching Engine, Advances in Web and Network Technologies and Information Management.
[23] Fabrizio Silvestri, Raffaele Perego and Salvatore Orlando, Assigning Document Identifiers to Enhance Compressibility of Web Search Engines Indexes, in Proceedings of SAC, 2004.
[24] Changshang Zhou, Wei Ding, Na Yang, Double Indexing Mechanism of Search Engine based on Campus Net, Proceedings of the 2006 IEEE Asia-Pacific Conference on Services Computing (APSCC'06).