Adopting Data Mining Techniques on the Recommendations of Library Collections


Shu-Meng Huang (a), Lu Wang (b) and Wan-Chih Wang (c)
(a) Department of Information Management, Hsing Wu College, Taiwan (simon@mail.hwc.edu.tw)
(b, c) Graduate School of Management Sciences, Tamkang University, Taiwan (pheobemimilucky@hotmail.com)
Correspondence: Shu-Meng Huang (a)

Abstract
This research uses Data Mining techniques to explore both the clusters of readers with similar characteristics and the connection between those readers and the library's book collections. With this knowledge, a library can improve its interaction with readers and increase the usage of its collections. The Modified Attribute-Oriented Induction (MAOI) method was used to process the multi-valued attribute table and sort the readers into clusters. Instead of concept hierarchies and concept trees, MAOI performs concept climbing and generalization on the multi-valued attribute table with Boolean Algebra and a modified Karnaugh Map, and describes the resulting clusters with concept descriptions. The Chinese books in the library collections were classified into four groups according to the New Classification Scheme for Chinese Libraries (CCL). The multi-valued attribute table includes both the attributes of the readers and the attributes of the library collections they borrowed, so that after induction the reading preferences of readers with the same characteristics can be learned.

Keywords: Data Mining, Recommendations, MAOI, Multi-Valued Attribute

I. INTRODUCTION
Potential readers seeking information in a library often face a daunting task that consumes time and energy. Given the immense body of data gathered in modern libraries, it can be difficult for these readers to sift quickly through the mass of information and uncover what they need.
This difficulty can affect how often, and how willingly, readers make use of the library's vast resources. Some studies [1][2] have indicated that one of the key factors in a library's service and marketing success is how well it actively provides information to readers through personalized service technology based on the readers' preferences and needs. In recent years, many e-commerce websites have adopted personal recommender systems to increase their interaction with customers and generate a higher rate of return patronage [3][4]. On YouTube, for example, the "Recommended for you" column offers viewers videos related to those they have just browsed; other sites analyze the pages customers have browsed and then actively recommend books the customers might be interested in. Applying this concept in the library should therefore improve the relationship between the library and its readers. In this research, a concept-description method from Data Mining, Modified Attribute-Oriented Induction (MAOI) [5], was adopted to sort the readers into clusters. Each cluster contains readers with similar characteristics and preferences.

II. RELATED STUDIES
Two kinds of recommender systems are mainly applied in online stores [6]: one for merchandise that customers consume frequently, such as books or movies, and one for merchandise that customers consume infrequently, such as cars or computers. The first kind analyzes the customers' consumption records to uncover their preferences and then provides advice. It usually combines Data Mining technology with personalized service. For merchandise that customers do not buy often, advice based on earlier consumption records may not achieve the expected results. Readers' use of library collections resembles the consumption of frequently bought merchandise.
Therefore, many studies on library recommendations adopt Data Mining technology to discover the relationship between readers and books.

2.1 Data Mining
When facing massive data, Data Mining provides powerful and effective tools to transform the data into useful information and knowledge [7][8]. Table 1 summarizes the functions and technologies frequently used in Data Mining [9]. Market Basket Analysis, also called Association Rule Analysis, explores the connections between attributes or objects, which makes it an appropriate means of uncovering the relationships between readers and books. A library, however, holds a massive number of records. To balance analysis complexity, execution efficiency and recommendation quality, researchers usually group the readers before analysis. Yu-Ling Cheng (2002), Jien-Hwa Tsao (2003) and Chang-Ting Yang (2007) classified the readers directly by department. Ching-Shium Chen (2000), Yuan-Jing Zhang (2001) and Chien-Yu Chen (2009) clustered the readers with Cluster Detection. Kuan-Hua Sun (2000) adopted Multilevel Association Rule Mining to discover the characteristics of readers with different preferences. Yu-Ling Cheng (2003) adopted Memory-Based Reasoning to group readers with the same background. Some of these methods are complicated and some are simple, but most of them can only deal with single-valued data. Multi-valued data requires a great deal of pre-processing before further analysis. In this research, Modified Attribute-Oriented Induction, proposed by Shu-Meng Huang (2010), was adopted to induct and sort multi-valued data such as the reader's department and year.

Table 1. Functions and technologies of Data Mining
Functions (columns): Classification, Estimation, Prediction, Affinity grouping, Clustering, Description
Technologies (rows): Statistics, Market Basket Analysis, Memory-Based Reasoning, Genetic Algorithm, Cluster Detection, Link Analysis, Decision Tree, Artificial Neural Network
(Statistics is marked for all six functions; the positions of the remaining check marks are not recoverable.)

2.2 Modified Attribute-Oriented Induction (MAOI)
The Attribute-Oriented Induction (AOI) approach was proposed in 1991 [10]. It is a form of descriptive Data Mining [11] and can efficiently produce different kinds of knowledge rules, such as characteristic rules, discrimination rules, quantitative rules, and data evolution regularities [12]. The approach is one of the main classification schemes in Data Mining [13].
The basic concepts and steps of AOI include [14]:
(1) Concept Hierarchy
(2) Attribute-Removal
(3) Concept-Tree climbing
(4) Vote propagation
(5) Attribute-Threshold Control
(6) Rule transformation

AOI conveniently inducts data into simple rule descriptions and eliminates some complicated data-processing procedures. However, different people build different Concept Hierarchies, and different definitions lead to different results. Confidence is low if there is no apparent Concept Hierarchy between attributes. Besides, AOI can only deal with single-valued attribute data [3]. Shu-Meng Huang (2010) therefore combined the concepts of the Boolean bit and a simplified Karnaugh map with AOI, naming the result MAOI, to deal with multi-valued attribute data. Figure 1 presents the steps of MAOI. Table 2 is the database of high-crime areas used in [3]; it is a multi-valued attribute database, and the steps of MAOI are explained with it below.

Figure 1. The induction steps of Modified AOI (attribute table -> Boolean bit -> modified K-map -> induction)

Boolean Bit Transformation
To decide a value's Boolean bit, a cutting point must be defined first. Taking the mean value of the attribute as the cutting point for that attribute, all the values in that attribute can be transformed: if a value is greater than or equal to the cutting point, its Boolean bit is 1; if a value is smaller than the cutting point, its Boolean bit is 0. Table 3 presents the result after transformation.

Table 2. Database of high-frequent-crime areas
Area ID  Gender           Age                     Education
1        <g1,30><g2,70>   <a1,20><a2,30>          <e1,20><e2,10><e3,40><e4,30>
2        <g1,45><g2,55>   <a1,25><a2,35><a3,40>   <e1,15><e2,10><e3,35><e4,30>
3        <g1,65><g2,35>   <a1,35><a2,25><a3,40>   <e1,30><e2,40><e3,10><e4,20>
4        <g1,40><g2,60>   <a1,20><a2,40><a3,40>   <e1,10><e2,10><e3,40><e4,40>
5        <g1,35><g2,65>   <a1,30><a2,20>          <e1,25><e2,5><e3,40><e4,30>
6        <g1,60><g2,40>   <a1,25><a2,25>          <e1,20><e2,15><e3,35><e4,30>
7        <g1,20><g2,80>   <a1,10><a2,40>          <e1,30><e2,5><e3,35><e4,30>
8        <g1,70><g2,30>   <a1,30><a2,40><a3,30>   <e1,10><e2,40><e3,40><e4,10>
9        <g1,40><g2,60>   <a1,20><a2,10><a3,70>   <e1,20><e2,20><e3,30><e4,30>
10       <g1,35><g2,65>   <a1,20><a2,30>          <e1,20><e2,10><e3,40><e4,30>
In this table, g1 means male and g2 means female; a1 means young, a2 means adult and a3 means elderly; e1 means primary education, e2 means secondary education, e3 means university education and e4 means graduate school.

Table 3. Database after Boolean bit transformation (columns: Area ID, Gender, Age, Education; cell values not recoverable)

Karnaugh Map Concept
A Karnaugh Map presents the simplification of Boolean Algebra as an intuitive graph. To avoid double counting, the researcher simplified it: only the nearest neighbors that have the largest added value are combined and simplified. Figure 2 is the Karnaugh Map of attribute Age.

Figure 2. The Karnaugh Map of Age (a1 against a2,a3)

Figure 2 shows that 001 and 011 can be combined, that is, 001, 011 -> 0_1, where _ means "don't care". In the same way, the education attribute can be simplified: 0011, 1011 -> _011.

Data Replacement
Table 4 presents the database after the attribute values have been replaced with the simplified values inducted from the Karnaugh Map.

Table 4. Database after Karnaugh Map simplification (columns: Area ID, Gender, Age, Education; cell values not recoverable)

Scan and Recount
Scan the database again and count the rows with the same attribute values. There are 4 rules in Table 5.

Table 5. Database after scan and recount (columns: Gender, Age, Education, vote; cell values not recoverable)

The Descriptive Rules
In Table 5, rule number 1 has the highest vote value. It can be described as:
{<g1,L><g2,H>} {<a1,L><a3,H>} {<e1,L><e3,H><e4,H>} 70%
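The transformation, simplification and recount steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `boolean_bits` and `merge` are hypothetical, `merge` simply joins two patterns that differ in exactly one bit (a simplification of the modified Karnaugh Map, which additionally picks the neighbors with the largest added value), and only the Gender and Education columns of Table 2 are used because some Age entries are incomplete there.

```python
from collections import Counter


def boolean_bits(values):
    """Use the mean of the cell as the cutting point: >= mean gives '1', otherwise '0'."""
    cut = sum(values) / len(values)
    return "".join("1" if v >= cut else "0" for v in values)


def merge(p, q):
    """Join two patterns that differ in exactly one position into a don't-care pattern.

    Returns None when the patterns cannot be combined (assumed simplification rule).
    """
    if len(p) != len(q):
        return None
    diff = [i for i, (a, b) in enumerate(zip(p, q)) if a != b]
    if len(diff) != 1:
        return None
    i = diff[0]
    return p[:i] + "_" + p[i + 1:]


# Gender (g1, g2) and Education (e1..e4) values of the ten areas in Table 2.
gender = [(30, 70), (45, 55), (65, 35), (40, 60), (35, 65),
          (60, 40), (20, 80), (70, 30), (40, 60), (35, 65)]
education = [(20, 10, 40, 30), (15, 10, 35, 30), (30, 40, 10, 20),
             (10, 10, 40, 40), (25, 5, 40, 30), (20, 15, 35, 30),
             (30, 5, 35, 30), (10, 40, 40, 10), (20, 20, 30, 30),
             (20, 10, 40, 30)]

# Step 1: Boolean bit transformation.
g_bits = [boolean_bits(g) for g in gender]
e_bits = [boolean_bits(e) for e in education]

# Step 2: data replacement with the simplified education pattern from the text.
simplified = merge("0011", "1011")
e_bits = [simplified if b in ("0011", "1011") else b for b in e_bits]

# Step 3: scan and recount rows that share the same patterns.
votes = Counter(zip(g_bits, e_bits))
top_rule, top_vote = votes.most_common(1)[0]
print(top_rule, top_vote)
```

On these two columns the leading pattern pair is ("01", "_011") with 7 of the 10 areas, and four distinct pattern pairs remain, which agrees with the 70% vote and the four rules reported for Table 5.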

The interpretation of rule 1: 70% of the high-crime areas have more females, more elderly people, and more people with university and graduate-school education.

III. RESEARCH METHOD
3.1 Research Process
In this research, all the data came from a college library. After data selection, matching, pruning and replacement, the data was inducted by MAOI to generate descriptive rules. Figure 3 shows the research process.

Figure 3. Research process

3.2 The Multi-Valued Table
The selected data contains all of the college students' library records over twelve months. In addition to some data pre-processing, the twelve departments were divided into three academies, and the library collections were divided into four groups by book number according to the New Classification Scheme for Chinese Libraries (CCL). Table 6 presents the database ready for further analysis.
Attribute A stands for the academy: a1 is the first academy, comprising the departments of Accounting Information, Business Administration, International Trade and Business, Marketing and Distribution Management, and Finance; a2 is the second academy, comprising the departments of Tourism Management, Hospitality Management, Travel Management, and Applied English; a3 is the third academy, comprising the departments of Information Management, Information Technology, and Information Communication.
Attribute B stands for the student's year in the college: b1 is Freshman, b2 is Sophomore, b3 is Junior, and b4 is Senior.
Attribute C stands for gender: c1 is male and c2 is female.
Attribute D stands for the student's grade: d1 is 90~100, d2 is 80~89, d3 is 70~79, and d4 is below 70.
Attribute E stands for the classification of the library collections: e1 is 000~299, e2 is 300~499, e3 is 500~799, and e4 is 800~999.

Table 6. Database of students' records in the library
(Each cell lists the counts in subscript order, for example A = a1,a2,a3.)
Month ID  A (a1,a2,a3)   B (b1,b2,b3,b4)   C (c1,c2)   D (d1,d2,d3,d4)   E (e1,e2,e3,e4)
1         101,111,83     2,91,68,134       134,161     9,96,124,66       36,98,75,194
2         117,115,81     1,96,67,149       167,146     28,143,74,68      44,119,77,201
3         207,191,143    3,160,147,231     295,246     45,198,206,92     98,186,144,380
4         203,177,48     3,127,155,240     252,176     38,195,89,106     87,226,147,294
5         156,154,102    1,110,123,178     194,218     56,126,143,87     62,161,115,248
6         136,110,93     3,106,118,112     178,161     31,94,135,79      67,134,79,226
7         11,4,19        0,3,6,25          21,13       0,27,4,3          5,24,7,6
8         6,9,18         0,6,6,21          22,11       2,19,8,4          3,19,5,8
9         147,151,117    92,114,114,95     176,239     48,149,132,86     70,140,82,282
10        170,199,131    78,175,149,98     269,231     70,123,165,142    69,213,121,302
11        177,206,143    76,214,139,97     253,273     63,202,194,67     67,183,147,331
12        195,192,152    121,184,139,95    285,254     76,211,176,76     63,213,131,324

Boolean Bit Transformation
In grid 1A, the data is <a1,101><a2,111><a3,83>. The total of these values is 295, and the average is 295/3 = 98.3. Taking 98.3 as the cutting point, because 101 > 98.3, 111 > 98.3, and 83 < 98.3, the Boolean bit of 1A becomes 110. Repeating these steps transforms all the attribute values into Boolean bits, as shown in Table 7. Column E is a special case: because every student can borrow more than one kind of book, the cutting point is defined as a quarter of the number of students in the month. The cutting point of column E therefore equals the cutting point of column B or D.

Table 7. Database after Boolean bit transformation (columns: M. ID, A, B, C, D, E; cell values not recoverable)

Karnaugh Map Concept
The Karnaugh maps of attributes A, B, D and E are presented in Figure 4.

Figure 4. The Karnaugh Map of Attribute A, B, D, E (A: a1 against a2,a3; B: b1,b2 against b3,b4; D: d1,d2 against d3,d4; E: e1,e2 against e3,e4). The simplified patterns used in Table 8 are 1_0 for A, 011_ for B, 01_0 for D, and 01_1 for E.

Data Replacement
Replacing the attribute values with the rules inducted from the Karnaugh Maps in Figure 4 yields Table 8.

Table 8. Database after Karnaugh Map simplification
(Rows with missing digits are reproduced as they survive.)
M. ID   A B C D E
1       1_ _0 01_1
2       1_ _0 01_1
3       1_0 011_ 10 01_0 01_1
4       1_0 011_ 10 01_0 01_1
5       1_0 011_ 01 01_0 01_1
6       1_0 011_ 10 01_0 01_
(7, 8)  _ _0 0100
9       1_0 011_ 01 01_0 01_1
10      1_0 011_ _1
11      1_0 011_ 01 01_0 01_1
12      1_0 011_ 10 01_0 01_

Scan and Recount
Table 9. Database after scan and recount (columns: A, B, C, D, E, vote). The two leading patterns are 1_0 011_ 10 01_0 01_1 and 1_0 011_ 01 01_0 01_1; the remaining rows are not recoverable.

The Descriptive Rules
From Table 9, the sum of the votes for rules 1, 2 and 3 is 9 out of 12, so these three rules cover 75% of the data. They are the three highest-voted rules inducted in this research and can be described as follows:

(1) {<a1,H><a3,L>} {<b1,L><b2,H><b3,H>} {<c1,H><c2,L>} {<d1,L><d2,H><d4,L>} {<e1,L><e2,H><e4,H>} 33.3%
About 33.3% of the readers are males in the second or third year of the first academy. Their grades are about 80~89. Their reading preference is for book numbers 300~499 and 800~999.

(2) {<a1,H><a3,L>} {<b1,L><b2,H><b3,H>} {<c1,L><c2,H>} {<d1,L><d2,H><d4,L>} {<e1,L><e2,H><e4,H>} 25.0%
About 25% of the readers are females in the second or third year of the first academy. Their grades are about 80~89. Their reading preference is for book numbers 300~499 and 800~999.

(3) {<a1,L><a2,L><a3,H>} {<b1,L><b2,L><b3,L><b4,H>} {<c1,H><c2,L>} {<d1,L><d2,H><d4,L>} {<e1,L><e2,H><e3,L><e4,L>} 16.6%
About 16.6% of the readers are males in the fourth year of the third academy. Their grades are about 80~89. Their reading preference is for book numbers 300~499.

IV. Conclusion
To improve a library's service and marketing success, the readers' needs should be satisfied. Many methods have been proposed to analyze the relationships between readers and library collections, but most of them can only handle single-valued attributes. In daily life, however, much information appears as multi-valued attributes. MAOI can induct multi-valued attributes directly and present the results briefly and descriptively. In this research, three rules were uncovered that explain the characteristics of the readers and their reading preferences. Together they account for about 75% of the data, so the induction can be considered successful. The library can actively provide readers with appropriate recommendations and take the rules into account in its purchasing strategy for the collections. [...] sets of applications and a wider range of multi-valued tables, in order to verify this algorithm and to discover generalized knowledge from relational databases.
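The Boolean bit transformation of Section 3.2 and the step from a simplified pattern to a descriptive rule can be illustrated with month 1 of Table 6. The sketch below is an assumption-laden illustration rather than the authors' code: `boolean_bits` and `describe` are hypothetical helper names, and the optional `cut` argument is introduced here only to express the special cutting point of column E (a quarter of the month's students).

```python
def boolean_bits(values, cut=None):
    """Map each count to '1' if it reaches the cutting point, else '0'.

    By default the cutting point is the mean of the cell; column E passes its own.
    """
    if cut is None:
        cut = sum(values) / len(values)
    return "".join("1" if v >= cut else "0" for v in values)


def describe(attr, pattern):
    """Turn a pattern such as '1_0' into rule notation such as {<a1,H><a3,L>}.

    Don't-care positions ('_') are left out of the description.
    """
    parts = [f"<{attr}{i},{'H' if bit == '1' else 'L'}>"
             for i, bit in enumerate(pattern, start=1) if bit != "_"]
    return "{" + "".join(parts) + "}"


# Month 1 of Table 6.
month1 = {
    "A": (101, 111, 83),
    "B": (2, 91, 68, 134),
    "C": (134, 161),
    "D": (9, 96, 124, 66),
    "E": (36, 98, 75, 194),
}

students = sum(month1["A"])  # 295 students borrowed books in month 1

# Columns A to D use the cell mean; column E uses a quarter of the student count.
bits = {name: boolean_bits(cell) for name, cell in month1.items() if name != "E"}
bits["E"] = boolean_bits(month1["E"], cut=students / 4)
print(bits)

# Rule 1 of the descriptive rules, rebuilt from its simplified patterns.
rule1 = {"a": "1_0", "b": "011_", "c": "10", "d": "01_0", "e": "01_1"}
print(" ".join(describe(attr, pattern) for attr, pattern in rule1.items()))
print(f"{100 * 4 / 12:.1f}%")  # share of a rule that covers 4 of the 12 months
```

Grid 1A comes out as 110, matching the worked example in the text, and column E of month 1 becomes 0111 because only e1 falls below 295/4. Leaving the don't-care positions out of the description is what keeps the rules short: 01_1 for attribute E names only e1, e2 and e4.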

REFERENCE
[1] Jun-Rong Huang, "Using clusters to find the most adaptive recommendations of books," Journal of Educational Media & Library Science, 43:3, 2006.
[2] J. Ou, S. Lin, and J. Li, "The Personalized Index Service System in Digital Library," Proc. of the Third International Symposium on Cooperative Database Systems for Advanced Applications, pp. 92-99, 2001.
[3] J. B. Schafer, J. A. Konstan, and J. Riedl, "E-Commerce Recommendation Applications," Data Mining and Knowledge Discovery, 5(1), 2001.
[4] A. Ansari, S. Essegaier, and R. Kohli, "Internet Recommendation Systems," Journal of Marketing Research, 37(3), 2000.
[5] Shu-Meng Huang, "A study on the Modified Attributed-Oriented-Induction Algorithm of Mining the Multi-Value Attribute Data," ICERM, p. 62, 2010.
[6] W. P. Lee, C. H. Liu, and C. C. Lu, "Intelligent agent-based systems for personalized recommendations in Internet commerce," Expert Systems with Applications, Vol. 22, No. 4, 2002.
[7] M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview From a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, 1996.
[8] U. M. Fayyad, "Data Mining and Knowledge Discovery: Making Sense Out of Data," IEEE Expert, Vol. 11, Issue 5, pp. 20-25, 1996.
[9] M. J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons.
[10] Y. Cai, N. Cercone, and J. Han, "Attribute-Oriented Induction in Relational Database," Knowledge Discovery in Databases, Ch. 12, AAAI/MIT Press.
[11] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Second Edition), Morgan Kaufmann, 2006.
[12] J. Han, Y. Cai, and N. Cercone, "Knowledge Discovery in Databases: An Attribute-Oriented Approach," Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, Canada, 1992.
[13] Yen-Liang Chen and Ching-Cheng Shen, "Mining generalized knowledge from ordered data through attribute-oriented induction techniques," European Journal of Operational Research, 166, 2005.
[14] J. Han, Y. Cai, and N. Cercone, "Data-Driven Discovery of Quantitative Rules in Relational Database," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 1, February 1993.

Ubiquitous Computing and Communication Journal (ISSN )
