An Agent for Semi-automatic Management of Emails

Fangfang Xia (a) and Liu Wenyin (b)
(a) Dept. of Computer Science & Technology, Tsinghua University, Beijing, China
(b) Dept. of Computer Science, City University of Hong Kong, Hong Kong SAR, China
xff99@mails.tsinghua.edu.cn; csliuwy@cityu.edu.hk

ABSTRACT

Recent growth in the use of emails for communication and the corresponding growth in the volume of emails have made automatic processing of emails desirable. However, most existing systems have failed to work in practice due to low classification accuracy and inconvenient user interfaces. In this paper, we present an adaptive Personal Email Agent (PEA) which can learn the mail handling preferences of its user and automatically categorize and manage its user's emails. One of the key ideas in this approach is extracting both high-level semantic features (e.g., concept information) from the body text and low-level features (e.g., sender, time, importance, etc.) from the entire message for similarity assessment based on the standard Information Retrieval (IR) approach. Another main contribution of our work is establishing both global and local information space models for building relevance categories based on the user's email folders. In addition, a query refinement strategy is incorporated to make the agent act as an incremental learner. That is, it can adjust its working strategy based on only the new examples and avoid a total re-training using all previous examples. To test the effectiveness of our system, we ran experiments on its two main functions, email retrieval and relevance categorization, and obtained promising preliminary results.

Keywords: Email Overload, Email Management, Example-based Learning, Information Retrieval, Content-based Retrieval, Relevance Categories, Query Refinement, Personal Email Agent (PEA)

1. INTRODUCTION

The explosion in electronic communication is dramatically changing the way people interact with one another.
Email overload [1,2] has become a growing problem as more and more users have embraced online technologies in recent years. According to Forrester Research, 7 trillion emails were sent per day in 2002, and an estimated 81 percent of organizations that introduced email to improve their efficiency now complain that email is becoming a victim of its own success. IDC estimates that in 2002 the average business user spent over 2.4 hours a day just dealing with an average of 30 work-related messages [2]. These numbers are still increasing.

To address the problem of email overload, many researchers have evaluated common manual email management strategies, including those of Prioritizers and Archivers [3], No filers, Spring cleaners, and Frequent filers [2], and Folderless cleaners [4]. Whittaker and Sidner [2] found that a major aim of filing is to reduce the huge number of undifferentiated inbox items into a relatively small set of folders, each containing multiple related messages. Bälter [5] developed a mathematical model to illustrate that storage time is the major time consumer for users with more than a thousand stored messages and that the best long-term strategy is to use folders sparsely (4 to 20) in combination with the search functionality. He suggests that users who want to use folders employ agents that can automatically suggest folders for archiving, since such agents could drastically reduce the storage time, and a larger number of folders may help reduce the time to retrieve a message.

Hence, the early research focused on a variety of machine learning techniques to classify emails into folders. Among the well-known prototypes, SwiftFile used shortcut buttons to archive messages into folders, but only when initiated by the user [6]. Mock used a nearest-neighbor classifier to group inbox emails into categories in his experimental framework [7].
Some projects, such as Enfish Onespace and Metastorm's infowise, use information retrieval techniques to measure similarities among folders or individual messages [8]. Other companies, such as Abridge, Plumtree, and Tacit, use rules or user-supplied categories to group emails. There are also flexible email organizers. For example, the Gnus news and mail reading system [9], distributed with recent versions of GNU Emacs, has hooks that allow installation of arbitrary programs for filtering and foldering news and mail. Furthermore, there are several open-source email readers which could be modified to include a hook for arbitrary classifiers [10].

With the vast amount of interest and research that has been accomplished with automatic email categorization, why hasn't the concept been incorporated into existing email readers? The current difficulties with automatic email organization lie in the following aspects. First, the user's email folders are usually not well organized and they change over time as new messages are received; this inbox irregularity sets hurdles for accurate classification. Second, most of the learning algorithms are based on statistics, and for the algorithms to perform well, a large amount of data must be on hand; the training time is usually considerable. Third, many of the current algorithms do not learn incrementally: they update by requiring a complete re-training based upon all data, including the original training messages. Fourth, most existing systems provide limited user-oriented functions; they do not allow classification into multiple categories and use implicit rules that users cannot adjust.

In this paper, we focus on the issue of automatic email categorization to save the time spent on archiving (when there are a large number of folders) and present an example-based semi-automatic learning approach for this purpose. A prototype system, the Personal Email Agent (PEA), is built based on this approach; it can adapt to an individual user by learning his/her email management preferences from examples of interaction between the user and the system. Based on the user's preferences, PEA can automatically categorize and manage his/her incoming and/or stored emails. One of the key ideas in this approach is extracting both high-level semantic features (e.g., concept information) from the text and low-level features (e.g., sender, time, importance, etc.) from the entire message for similarity assessment. Another main contribution of our work is establishing both global and local information space models for building relevance categories based on the user's email folders. In addition, a query refinement strategy is incorporated to make the agent act as an incremental learner. Preliminary experiments indicate the effectiveness of the proposed approach.

The remainder of this paper is structured as follows.
In Section 2, we present our solution, the Personal Email Agent, and describe its system architecture and user interface. We then present the core algorithms and other implementation details in Section 3, and the preliminary experimental results of the agent in Section 4. Finally, we conclude and present some directions for future work in Section 5.

2. SOLUTIONS

Many of the difficulties described with email classification may be alleviated through better classifiers, while another way to resolve these difficulties is to sidestep the entire problem with an alternate technology. We adopt one alternate technology, Relevance Categories [8], which addresses some of the same information management issues as automatic classification while avoiding many of the problems discussed in the previous section. In order to utilize as much detailed information as possible, we extract all useful features from an email message, including sender, recipients, time, topic, body, etc. A different method is then employed to compute the similarity for each feature. The overall similarity between two messages is the weighted sum of these per-feature similarities. Note that different sets of weights are assigned to the features in different folders. Learning from the user's feedback, the agent can adjust the weights automatically to represent more exactly the user's preferences among the diverse features within one folder and thus refine the query of this folder.

2.1 Architecture

The architecture of our agent system is shown in Figure 1. The system consists of two components: the user interface and the core component of the Personal Email Agent. The user interface is divided into four parts: two functional parts and two peripheral ones. The functional parts include an email retrieval interface and an email classification interface, both of which provide user feedback interfaces. The system configuration part is where the user can set the parameters and manually adjust part of the folder space coefficients.
The non-feedback function part consists of some auxiliary functions such as event logging and message filing according to category.

In the core component, we have three spaces (the weights space, the local information space, and the global information space); five modules (the feature extractor, the nearest-neighbor similarity evaluator, the inverted indexer, the matcher, and the relevance categorizer); and two databases, which store low-level features and high-level semantic features, respectively. They work together to perform both function and feedback routines.

A typical scenario of the system is as follows. Upon installation of the agent, the feature extractor scans all the emails in the user's personal folders; both low-level and high-level features of the emails are extracted and the corresponding databases are constructed. Then, the nearest-neighbor similarity evaluator and the inverted indexer work simultaneously. The indexer builds the global information space for each folder according to the existing inbox structure; the evaluator compares emails within each folder to set up the local information space and decide the initial weights for the features. Once the three space models are available, the matcher compares the user's query with the local space model of emails to yield the retrieval results, and the outcome is given in the form of a ranked list. The user can mark irrelevant emails that are ranked improperly high, and thus negative feedback is applied. The relevance categorizer is triggered when a new message comes in or the user adjusts the inbox structure, e.g., by moving emails from one folder to another or creating new folders. On these occasions, the agent first updates its databases and space models and then refreshes its classification. The agent learns from user feedback by refining its internal space models to yield more accurate results in the future.

Figure 1. Architecture of the PEA

2.2 User Interface

We implement our Personal Email Agent as an add-in for Microsoft Outlook 2002 on Windows XP. The basic interface is a supplemental command bar, indicated by the red (or gray) rectangle containing the Retrieval, Archive, and Settings buttons in the upper-right part of Figure 2. On first startup, a scanning process is performed which automatically creates a category out of every folder the user maintains. The messages in each folder are then associated with that category. While the agent is enabled, new emails are automatically classified into the best-matching folder. They are only grouped together, not moved immediately. The user can view inbox emails grouped into categories and move them to their assigned folders simply by clicking the Archive button. When the user manually adjusts the categorization result in the inbox or moves mail from one folder to another, relevance feedback is provided and the learning process is triggered. On these occasions, the agent automatically shows the accompanying changes it has made, and the user can cancel some of them.

The Retrieval button aids users who wish to search emails. This function quickly displays a list of messages ranked by relevance (using the similarity metrics) to the selected messages. In this manner, other messages in the same thread or on the same topic are displayed at the top of the list. The feedback mechanism is also provided for the retrieval function. Finally, the Settings button lets users access and change the agent's parameters, such as constants and feature weights. Users can also enable or disable some non-feedback functions and change the running modes there.

Figure 2. User Interface of the PEA

3. ALGORITHMS AND IMPLEMENTATIONS

3.1 Feature Extracting and Similarity Assessment

There are two kinds of features that can be used in our agent. One kind is the low-level features, such as sender, time, importance, etc. The other is the high-level semantic features extracted from the subject and body of an email. We first compute the similarity between two emails at each level and then calculate their weighted sum as the overall similarity. We implement the relevant-email retrieval functionality of our agent by similarity assessment: all the emails are compared with the query email and then sorted in descending order of similarity. A high rank usually indicates significant relevance.

3.1.1 Low-level features

We have extracted eight basic low-level features in our agent. They are sender, recipients, creation time, importance, body format, and three Boolean variables (IsRead, IsReplied, and IsWithAttachment). To compute the similarity, we also incorporate an additional feature, sender-recipients, which is useful on some particular occasions.
This is not another independent feature; we add it mainly because of the following concern. Quite frequently, a user wants to keep all correspondence with a person in the same folder. However, neither the sender feature nor the recipients feature alone can help. For example, two emails, one from A to B and the other from B to A, are obviously related, but the similarities calculated based on sender and on recipients are both 0. In such a case, the sender-recipients feature merges the sender and recipients into one set, and the similarity calculated on it should be 1. This feature is also useful for work groups. The similarity for each feature is computed differently; the detailed calculation methods will be presented in an extended version of this paper.
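
The A-to-B and B-to-A example can be checked with a minimal sketch. The paper defers its actual calculation methods to an extended version, so the Jaccard set overlap used here, the dictionary layout of a message, and the function names are all assumptions made for illustration only.

```python
def participants(sender, recipients):
    """Merge sender and recipients into one case-insensitive address set."""
    return {sender.strip().lower()} | {r.strip().lower() for r in recipients}


def jaccard(a, b):
    """Set overlap in [0, 1] (an assumed metric; two empty sets score 0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sender_recipients_similarity(msg_a, msg_b):
    """Similarity on the merged sender-recipients set of each message."""
    return jaccard(participants(msg_a["sender"], msg_a["recipients"]),
                   participants(msg_b["sender"], msg_b["recipients"]))


# Hypothetical messages: one from A to B, the other from B to A.
a_to_b = {"sender": "a@example.org", "recipients": ["b@example.org"]}
b_to_a = {"sender": "b@example.org", "recipients": ["a@example.org"]}

sender_sim = 1.0 if a_to_b["sender"] == b_to_a["sender"] else 0.0
recipients_sim = jaccard(set(a_to_b["recipients"]), set(b_to_a["recipients"]))
merged_sim = sender_recipients_similarity(a_to_b, b_to_a)
print(sender_sim, recipients_sim, merged_sim)
```

Under these assumptions the sender and recipients similarities are both 0 while the merged feature gives 1, matching the behavior described above.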

3.1.2 High-level features

We have extracted two high-level features in our agent: subject and body. Since both are text features, we use the same method to compare them. Our implementation is based upon an inverted index with integrated TF/IDF [11] values. The detailed algorithm will be presented in an extended version of this paper.

3.1.3 The overall similarity

Although many sophisticated similarity assessment methods exist, we use the simplest similarity model to obtain the overall similarity. With the high-level and low-level similarities calculated separately, the overall similarity is simply calculated as their linear combination. Note that different folders are assigned different sets of weights, and these are consistently refined by the user's feedback. This is the key point for our agent to gain intelligence and will be further discussed in the following sections.

3.2 Folder Space and Relevance Categories

A key function of our agent is to classify emails according to existing folders. Section 3.1 gives an algorithm for computing the similarity between two individual messages. In order to assess the similarity between a message and a folder, we should also build a user folder space model, through which the nature of different folders can be well characterized. Many existing systems achieve this goal by assigning each folder a vector compatible with the email vector. Since such a vector is usually the average of all the emails in the folder, it classifies poorly when folders are not well organized and change over time, as described in Section 1. To utilize as much detailed information as possible, we explore both global and local properties of a folder in establishing its space model. (More exactly, "folder" here should be replaced by "relevance category", a concept that will be discussed soon.)
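
The message-to-message similarity of Section 3.1, on which the folder model builds, is a weighted sum with a separate weight set per folder. A minimal sketch follows; the weight values, the normalization by the total weight, and the function name are assumptions, and the folder names are borrowed from the experiments only as labels.

```python
def overall_similarity(feature_sims, weights):
    """Linear combination of per-feature similarities.

    Dividing by the total weight (an assumption, not stated in the paper)
    keeps the result in [0, 1] when every input similarity is in [0, 1].
    """
    total = sum(weights[name] for name in feature_sims)
    if total == 0:
        return 0.0
    return sum(weights[name] * sim for name, sim in feature_sims.items()) / total


# Hypothetical per-feature similarities between two messages.
feature_sims = {"sender": 1.0, "subject": 0.2, "body": 0.4}

# Hypothetical per-folder weight sets: one folder stresses the sender,
# the other stresses the text features.
folder_weights = {
    "FROM HER": {"sender": 0.8, "subject": 0.1, "body": 0.1},
    "PROJECTS": {"sender": 0.1, "subject": 0.4, "body": 0.5},
}

scores = {folder: overall_similarity(feature_sims, w)
          for folder, w in folder_weights.items()}
print(scores)
```

With these illustrative numbers the same pair of messages scores about 0.86 under the sender-heavy weights and about 0.38 under the text-heavy weights, which is why per-folder weights matter.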
Global Information: The global information of a folder is the semantic information of all the messages in that folder. (As we introduce the concept of relevance categories below, the messages in the category linked with this folder should also be included.) The messages are concatenated and treated as a single document. The N most frequent terms (from either the body or the subject field) and their term frequencies are extracted. (In our agent, N is set to 50 by default.) The resulting terms comprise part of the query for the category that it represents. Note that as the set of messages changes, the queries are simple to update: all that is required is to re-compute the term frequencies.

Local Information: The local information of a folder is obtained by the simple nearest-neighbor method. Given a target message to classify, its features are extracted and compared to all messages in the folder using the algorithm introduced in Section 3.1. The top M matches are averaged as the local measure for the category. (M is set to 3 by default in our agent.) The introduction of local information is helpful because some users maintain overly generic folders (e.g., "Projects") encompassing multiple unrelated sub-categories. It is also useful when dealing with topic drift.

The basic concept of Relevance Categories [8] is to provide the same functionality as regular folders or categories. Users can assign emails to categories, or remove them from categories, just as they are used to. Relevance categories are initially built based on the existing folders in the user's inbox. When new emails come in, they are automatically assigned to one category by our agent. The user can manually correct wrong classifications or assign one email to multiple categories. On these occasions, our agent refines the queries based on the feedback, trying to approximate the user's subjective intention more precisely.
Otherwise, the newly assigned emails are regarded as members of their category from then on, even though they are not actually moved to the destination folders until the user explicitly performs the Archive function of our agent.

In the computation of the email-category similarity, a unique weight vector indicating the user's preference among the different features is assigned to each category to obtain the weighted feature sum. Apart from the global and local information, this weight vector is another important part of the folder space model, and it alone makes up the weights space. How to compute the weight vector and adjust it based on user feedback thus becomes the central problem in our query refinement strategy.

3.3 Query Refinement Strategy

Queries are created for each relevance category. Corresponding to the folder space model, the query refinement strategy of our agent is divided into two parts: global query refinement and local query refinement.
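
As a concrete illustration of the folder space model of Section 3.2 that these queries refine, the sketch below builds the global information (top-N terms of the concatenated messages) and the local information (average of the top-M nearest-neighbor similarities). The tokenizer, the message layout, and the word-overlap similarity standing in for the Section 3.1 algorithm are assumptions; only the defaults N = 50 and M = 3 come from the text. How the agent combines the global and local measures is not specified, so no combination is shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def global_query(messages, n=50):
    """Top-n terms and frequencies, treating the folder as one document."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg["subject"] + " " + msg["body"]))
    return dict(counts.most_common(n))


def local_score(target, folder_messages, similarity, m=3):
    """Average of the m highest similarities between target and the folder."""
    top = sorted((similarity(target, msg) for msg in folder_messages),
                 reverse=True)[:m]
    return sum(top) / len(top) if top else 0.0


def word_overlap(a, b):
    """Stand-in for the Section 3.1 similarity: Jaccard overlap of tokens."""
    ta = set(tokenize(a["subject"] + " " + a["body"]))
    tb = set(tokenize(b["subject"] + " " + b["body"]))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical folder contents and target message.
folder = [
    {"subject": "soccer match", "body": "soccer on friday"},
    {"subject": "soccer practice", "body": "practice moved to friday"},
]
target = {"subject": "soccer friday", "body": ""}

query = global_query(folder, n=3)
score = local_score(target, folder, word_overlap)
print(query, score)
```

Because the global query is just a term-frequency table, updating it after the folder changes amounts to recounting, as the text notes.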

Global query refinement aims at a precise representation of the global semantic features of a category. Negative training can be employed for emails the user explicitly marks as not belonging to the category. These might arise in the agent's retrieval function if the user wishes to apply corrective action to highly ranked messages so that they are displayed toward the bottom of the list. To apply negative training, the N most frequent terms are extracted from the negative examples and their frequencies are subtracted from those of the N most frequent terms from the positive examples. This may result in some terms with negative frequencies.

Local query refinement is mainly the adjustment of the weight vector mentioned in Section 3.2. Our agent learns from user feedback so that the weight vector increasingly matches the user's subjective emphasis on the features. The detailed algorithm is presented in an extended version of this paper.

4. PERFORMANCE EVALUATION

In order to test the two main functions of our agent, retrieval and classification, we designed two corresponding experiments. Since the effectiveness of relevance categories on the purely semantic feature, i.e., our global information space, has been tested by Mock over the Reuters corpus [8], we concentrate only on the overall performance of our agent on the multi-feature basis. The test data we use are mainly the daily emails of the authors. The volume is not very large (about 1000 emails). However, it represents a typical user's situation well.

4.1 Retrieval Accuracy

In this experiment, we randomly select a number of emails belonging to the same category as query (positive feedback) examples and perform retrieval. The number is less than 20, since a user usually does not have the patience to select more than 5 emails in each iteration or to go through more than 4 iterations. Since we use exactly 100 emails as the ground truth for each query and check only the first 100 retrieved emails, the values of precision and recall are the same.
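
To spell this out, write R for the set of retrieved emails that are checked and G for the ground-truth set (symbols introduced here for illustration only), with |R| = |G| = 100:

```latex
\[
\text{precision} = \frac{|R \cap G|}{|R|} = \frac{|R \cap G|}{100},
\qquad
\text{recall} = \frac{|R \cap G|}{|G|} = \frac{|R \cap G|}{100}.
\]
```

Both ratios share the same numerator and the same denominator, so they coincide for every query.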
Therefore, we use the term "accuracy" to refer to both. The results are shown in Figure 3, with the x axis being the number of query (positive feedback) emails and the y axis the average retrieval accuracy. As the figure shows, the average retrieval accuracy exceeds 50% when the number of query emails reaches 10.

Figure 3. Retrieval Accuracy (x axis: Number of Query Emails; y axis: Accuracy)

4.2 Categorization Accuracy and Feature Abilities

The second experiment evaluates the performance of the categorizer in learning a user's mail sorting preferences from hand-sorted mails. The input data are six months of the first author's sorted mails. Table 1 shows the folders and the distribution of messages in the data set. These data pose an interesting challenge for a learning system. Not only is the distribution of messages in the folders highly non-uniform, but the selection of folders for messages is also strongly idiosyncratic. While the content of the folder FROM HER was exclusively determined by a single keyword match (sender = "Arendt"), other folders were not determined by a single keyword match with the from or to fields, but rather by the first author's subjective judgment of which folder would be the best mnemonic for later retrieval of the message based on its content, time, recipients, etc. For example, the REMINDER folder only maintains emails received within the past week, while the E-MAGAZINE folder contains various HTML messages from websites to which the first author subscribed. In this case, the task of the agent is to learn a model of the user's sorting preferences.

Table 1. Hand-Archived Emails in Our Experiments.

Folder Name         Count    Percentage
CS                           %
E-MAGAZINE                   %
FROM HER                     %
MISCELLANEOUS                %
PERSONAL                     %
PHILOSOPHY GROUP             %
PROJECTS                     %
REMINDER                     %
SOCCER                       %
Total Examples               %

Figure 4. (a) Categorization Accuracy and (b) Feature Discrimination Abilities

The results of this experiment are shown in Figures 4 (a) and (b). Through learning, the agent achieves 82% test accuracy after 100 training examples and 87% after 200. The weights of the features begin to reflect the user's different emphasis on them as the number of training examples increases. We show only three of the features in the figure. However, the trends of the features are clear, which indicates that the agent is capable of learning a user's preferences through our query refinement strategy.

The strategy of our agent has many advantages. First, relevance categories are not hard folders; they are merely an add-on to existing categories and can be ignored and used exactly like a normal category without impacting performance. Therefore, the errors made by our agent are more likely to be tolerated by users. Second, because it is based on a simple similarity-computing algorithm, our agent can still manage emails in the presence of sparse data. Third, since both high-level and low-level features are extracted, the agent can handle diverse situations well. In dealing with categories like FROM HER in the above experiment, our agent has a clear advantage over traditional classifiers that focus only on text features. Fourth, the incorporation of global and local information enables the agent to fit the various user inboxes that are not well organized. In addition, the query refinement can be done quickly and hence avoids the intensive computation that most classifiers require at the adjusting stage.

5. CONCLUSION AND FUTURE WORK

We present an intelligent email agent which can learn from the user's interactions with the system and hence can semi-automatically manage the user's emails. Four features distinguish our system from existing email retrieval or management approaches. First, different features of emails are extracted, with corresponding similarity assessment methods designed for them. The use of both high-level semantic features and low-level features enables our agent to perform well on both kinds of evidence. Second, the adoption of relevance categories for our UI sidesteps some of the common hurdles that its peer systems normally face. Though the concept of relevance categories is really a step back from pure categorization, it allows for multiple or overlapping categories and is more likely to be tolerated by users when classification errors occur.
Third, a unique space model is established for each user folder based on both global and local information about the emails it contains. This makes it possible for the agent to fit a user's sorting habits, which may be extremely idiosyncratic. Fourth, an efficient query refinement strategy is presented to facilitate the learning process.

The next phase is to further refine our space models. For example, noun phrase extraction, better term selection, use of more terms, support for languages other than English and for mixed languages, variation of test parameters and assumptions, and different similarity metrics might significantly improve the categorization accuracy. Additional work is also required to quantify the performance of current classification algorithms with both test data and user studies. In addition, much work remains to be completed in code enhancements such as latching into more Outlook events, database integration for classifiers, or MS .NET upgrades. Finally, new experiments that integrate classification and information retrieval techniques across email and into calendaring, notes, or other types of data may also be explored.

REFERENCES

1. Email overload--facts and figures: an e-mountain of _overload.htm
2. Whittaker S and Sidner C. Email overload: exploring personal information management of email. SIGCHI 96, pp
3. Pliskin N. Interacting with electronic mail can be a dream or a nightmare: a user's point of view. Interacting with Computers 1(3):
4. Bälter O. Strategies for organizing email messages. SIGCHI 97, pp
5. Bälter O. Keystroke level analysis of email message organization. SIGCHI 2000, pp
6. Segal R and Kephart J. Incremental learning in SwiftFile. ICML
7. Mock K. An experimental framework for email categorization and management. SIGIR
8. Mock K. Dynamic email organization via relevance categories. ICTAI
9. Ingebrigsten LM. Gnus network user services
10. Malone TW, Lai KY, and Fry C. Experiments with Oval: a radically tailorable tool for cooperative work. ACM TOIS 13(2):
11. Salton G. Automatic Text Processing, Addison-Wesley,
