A Framework for Delivery of Thai Content through Mobile Devices


Chuleerat Jaruskulchai, Atichart Khanthong, and Wanlapa Tantiprasongchai
Intelligent Information Retrieval and Database, Department of Computer Science, Faculty of Science, Kasetsart University, Bangkok, Thailand.

Abstract

With the increasing number of mobile devices, providing text information to mobile clients poses challenges. Mobile devices have limited display and navigation capabilities, and their tiny keypads make it difficult to input keywords or other information. The problem is more challenging when working with Thai text. This paper introduces a framework for delivery of Thai content through mobile devices. It explores one particular aspect: the automated construction of a personalized focus, or user's attention. Documents are disseminated based on the personalized focus and routed to a mobile device. Instead of delivering every document, the documents are clustered and a topic is extracted for each cluster. Additionally, the content of each document is summarized. A basic naive Bayes algorithm is deployed to filter documents by the user's attention, topic extraction is based on term frequency and inverse document frequency, and important sentences are extracted for summarization. Object-oriented technology is used to develop the demo system.

Keywords: Mobile device application, Document summarization, Processing of Thai text.

1. Introduction

It has been expected that the handheld computer market will grow larger than the computer industry, and mobile clients have become a new target for business. The popularity and growing capability of Personal Digital Assistants (PDAs) and mobile devices increase their usability. Wireless technology provides an opportunity to access information at any time and in any place.
Thus, numerous information services are offered to mobile clients, such as travel guides, entertainment advice, news, flight schedules, and driving directions. However, these services are task-specific, and mobile clients know where to locate the information. For web browsing and searching applications, mobile devices have limited display, graphics, and navigation capabilities and limited processing speed. Furthermore, the tiny keypad makes it difficult to input keywords or other information. This poses a number of issues in designing user interfaces for PDAs. It is a particular challenge when working with Thai text because of the large number of Thai characters. Most Thai PDA applications on offer are installed and run on the PDA itself.

This paper presents a framework to facilitate web navigation, searching, and browsing of Thai texts on small devices. Documents are clustered, and a topic is automatically created to describe each cluster's content. To address the limited display, text summarization techniques are proposed. To save the user's time in locating information, a naïve Bayes classifier is employed to classify documents according to the user's preferences and notify the user by e-mail.

2. Related Works

Several aspects of mobile clients have attracted researchers, and they can be categorized into three groups. The first, basic aspect is effective browsing on client devices. The work in [Matt J. et al. 1999] reports that a small display size reduces user effectiveness by up to 50% on the original tasks, where effectiveness is measured by the amount of scrolling needed to view information. However, Dillon et al. reported that the comprehension rate on a PDA is the same as on a desktop display (cited in [Matt J. et al. 1999]). The second aspect is the dynamic transformation of web page formats for small devices [Watters C. and Zhang R., 2003]. Another approach in this aspect is called the Web Clipping Application.
This approach uses a proprietary language to request portions of web pages, which are then reformatted for display. The third aspect is the design of web searching for PDAs that is effective from an information retrieval standpoint. This aspect is more concerned with navigation capabilities and with how users interact with a PDA when searching for information. Many successful information retrieval methodologies have been deployed, such as text summarization, clustering of documents using concept hierarchies, and term extraction [Chan D.L. et al. 2002]. Most current PDA applications are available for download from the Internet and run on the local device. Web browsing on Thai PDAs uses the web clipping technology.

190 Jaruskulchai, C.; Khanthong, A. and Tantiprasongchai, W.

In processing Thai text, there are several issues, such as word boundary detection and sentence extraction. A report from the National Electronics and Computer Technology Center (NECTEC) [Sornlertlamvanich V. et al. 2000] states that a number of Thai text processing problems have not been fully resolved. Because the Thai writing system has no end-of-word marker, word segmentation is still an active research topic. The effectiveness of word segmentation is around 80-95% in precision and 80% in recall. However, many researchers have moved on to sentence extraction, and natural language processing is employed to improve word segmentation.

3. A Framework for Delivery of Thai Content

The framework for delivery of Thai content extends our previous research [Jaruskulchai J., Kiewsuwansuk S., and Kantasena J., 2001] to provide facilities to mobile clients. The main focus of our previous research was to investigate clustering algorithms and topic extraction for Thai text. That research showed the potential for working with Thai text with little concern for word or sentence boundaries. It is a challenge to move our research forward to serve mobile clients. Our framework offers several utilities to manage personal information. It not only provides a navigation function for users to browse information, but also monitors new information and informs the user through e-mail. Additionally, two types of information are offered to mobile clients: full-text documents and summarized documents.

There are many issues in designing and developing mobile client applications. Albers and Kim [Michael J. Albers and Loel Kim, 2000] laid out a theoretical framework for understanding the differences between handheld and full-sized web environments and their intended uses. They reported that two types of functions can be provided to mobile clients: simple look-up and information manipulation.
Simple look-up describes the skills and activities involved in locating and recognizing a desired chunk of information, such as checking a stock price, looking up a phone number, or reading an e-mail. Information manipulation is a more complex task; generally, it means the user needs to interact with different pieces of information, such as comparing airline fares. In our framework, we apply many mechanisms from the information retrieval area, such as simple look-up, information filtering according to the user's preferences, document clustering, and text summarization.

The framework design consists of two main modules. The server module is responsible for collecting, indexing, filtering, and summarizing documents. The server also sends the documents matching each user's preferences by mail. The client module allows users to manage (edit, review, delete) their profiles through a web browser or a PDA. Access to personal information is controlled by an e-mail address and a password. Figure 1 shows our framework. Details of information filtering and summarizing are described in the next sections.

Figure 1. A Framework for Delivery of Thai Content

3.1 An Automated Construction of Personalized Information

Personalized information and information filtering can be used interchangeably, and both are closely related to text classification. Thus, to classify documents by the user's preferences, the probabilistic naïve Bayes classifier is employed. Naïve Bayes is a very efficient algorithm and has been employed in many research fields, such as text classification, classification of e-mail, and automated mail filtering. To estimate the naïve Bayes parameters, the user needs to provide a set of training documents, that is, documents matching the user's preferences. The user's interests are expressed as keywords of interest, with term weights assigned to the most important keywords. In theory, the keywords and the weights of the most important terms are represented as a term weight vector, following the vector space model. In most implementations of naïve Bayes, the initial information or training data set is obtained from the user. In a real-world application, asking the user to rate relevant documents at registration time may not be workable. Thus, the system precomputes the posterior probability for each user from the user's keywords and term weights, and these probabilities are updated when the user retrieves documents. The system presumes that retrieved documents are relevant and updates the probability parameters accordingly. When new documents arrive in the system, each document is classified, or filtered, by the probabilistic naïve Bayes classifier and stored in each matching user's preference file, waiting to be delivered to the user.

3.2 Document Clustering

To make better use of the mobile device's limited capability, instead of delivering each document individually, documents are clustered according to their contents and a topic is extracted to describe the content of each cluster. Complete linkage clustering is employed in this framework to group documents of similar concept before delivery to users.
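
The complete linkage step can be illustrated with a short Python sketch. It is not the system's Java implementation. It assumes that each document is already represented as a dictionary of term weights, that similarity is measured by cosine similarity (the paper does not name the measure), and that merging stops when no pair of clusters reaches a similarity threshold. The names `cosine`, `complete_linkage`, and `threshold` are hypothetical, and the English tokens stand in for segmented Thai words.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def complete_linkage(docs, threshold=0.3):
    """Agglomerative clustering; returns lists of document indices."""
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Complete linkage: two clusters are only as similar as their
        # least similar pair of documents.
        return min(cosine(docs[i], docs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        a, b = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[a], clusters[b]) < threshold:
            break  # assumed stopping rule: no pair is similar enough
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because combinations() guarantees a < b
    return clusters


# Illustrative documents only (not data from the paper).
docs = [
    {"bank": 2.0, "rate": 1.0},
    {"bank": 1.0, "rate": 2.0},
    {"football": 3.0, "goal": 1.0},
]
print(complete_linkage(docs))
```

Because a merge requires every pair of documents across the two clusters to be similar, complete linkage tends to produce compact clusters, which suits the goal of describing each cluster with a single short topic.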
To extract a topic from the full text, the most frequent words are extracted and grouped to describe the cluster. Full details of our document clustering technique can be found in [Jaruskulchai J., Kiewsuwansuk S., and Kantasena J., 2001].

3.3 Extracting Summary Sentences

The main focus of this research is extracting summary sentences to represent content compactly. Summarization is the process of abstracting key content from one or more information sources. A variety of methods have been investigated. With respect to the target reader or function, a summary falls into one of three categories: indicative, informative, and critical [Hahn Udo and Mani Inderjeet, 2000]. An indicative summary provides compact content to alert the user not to miss the information. An informative summary provides the essential information and can substitute for the original source. Lastly, a critical summary not only abstracts the information but also provides some opinion on the content.

However, the most difficult task in summarizing Thai text is finding sentence boundaries, even if the results of the word boundary algorithm are acceptable. Current research on sentence boundaries can be found in [Charoenpornsawat P. and Sornlertlamvanich V. 2001]; the approaches deployed are probabilistic part-of-speech trigram, grammatical rule-based, and feature-based. The feature-based approach was evaluated on ORCHID [Information Research and Development Division, 2003], a part-of-speech tagged corpus. The Thai sentence has not been fully defined. Figure 2 excerpts some text from the ORCHID corpus in which each excerpt is claimed to be a sentence. Thus, correct sentence boundaries are not our concern, and the summary is aimed at being an indicative summary.

พยายามวางแผน และประสานงานอย่างใกล้ชิด (Try to plan and to cooperate closely)
รัฐมนตรีว่าการกระทรวงวิทยาศาสตร์

Figure 2. Excerpt of sentences from ORCHID corpus

Thus, phrases are the major component used in the process of Thai text summarization.
The simple algorithm presented by H.P. Luhn [Luhn H.P. 1958] is used to measure the importance of sentences and will be referred to as the within-sentence clustering technique. This method has been studied in [Buyukkokten O., Garcia-Molina H. and Paepcke A. 2000] with a different data set and an inverse document frequency weighting scheme. The summarization process starts by filtering out Thai stop words. Then, in each document, the term frequencies (TF, total frequencies) are computed to identify the significant words of that document. Frequent words that account for more than 10 percent of the document are eliminated. Rare words and specific words are not removed. Each sentence is then divided into clusters according to the distance between significant words. A cluster is a sequence of consecutive words that starts and ends with a significant word and in which no more than D_n non-significant words separate neighboring significant words. If a sentence has more than one cluster, the highest-scoring one is selected. The sentence is then ranked by the square of the number of significant words in the cluster divided by the total number of words in the cluster. Figure 3 shows the details of the sentence ranking computation.
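
The within-sentence cluster scoring described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's Java implementation: sentences are assumed to be already segmented into word lists, the set of significant words is passed in rather than derived from term frequencies, and the names `luhn_sentence_score`, `summarize`, `max_gap` (standing in for D_n), and `max_sentences` are hypothetical.

```python
def luhn_sentence_score(words, significant, max_gap=2):
    """Score of the best cluster: (significant words)^2 / cluster length."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        # pos - prev - 1 is the number of non-significant words in between.
        if pos - prev - 1 > max_gap:
            # Gap too wide: close the current cluster and start a new one.
            best = max(best, count * count / (prev - start + 1))
            start = pos
            count = 0
        count += 1
        prev = pos
    # Close the last open cluster.
    return max(best, count * count / (prev - start + 1))


def summarize(sentences, significant, max_sentences=3, max_gap=2):
    """Keep the top-scoring sentences, returned in original document order."""
    scored = [(luhn_sentence_score(s, significant, max_gap), i)
              for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda t: t[1])]


# Illustrative nine-word sentence; S1..S4 stand in for significant Thai words.
words = ["a", "S1", "b", "S2", "c", "d", "S3", "S4", "e"]
significant = {"S1", "S2", "S3", "S4"}
print(round(luhn_sentence_score(words, significant), 1))
```

In this example the cluster runs from S1 to S4, spans seven words, and holds four significant words, so the score is 4 * 4 / 7, roughly 2.3, which appears to correspond to the ranking value quoted with Figure 3.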

[Figure 3: a sentence whose words are numbered 1 to 9, containing one cluster within the sentence, shown in brackets, whose words are numbered 1 to 7 and which contains four words marked * as significant.]

Figure 3. Computation of clusters and sentence ranking

If D_n > 2, the sentence ranking is 2.3. The number of non-significant words (D_n) should be greater than 2. The system limits the number of sentences in the summary of each document. Thus, to reduce the loss of content, the system allows the user to view the original document. For this report, the effectiveness of summarization is evaluated by user satisfaction. Thirty documents were randomly selected and evaluated by second-year undergraduate students. The evaluation criteria are satisfactory, fair, and unacceptable. The results show that more than 53% of the summaries were rated fair, around 11% were rated satisfactory, and the rest were rated unacceptable.

4. System Implementation

The framework is implemented in Java. It is built on a client-server model, where the text-based and client modules are both handled on the server side. The indexing and retrieval processes are developed using RMI, a distributed processing technology. The summary content delivered to the client is marked up with XML tags, so it is easy to reformat and display on any device. To display information on a PDA, the kXML parser is employed; a general browser uses the default IE parser. The mobile client needs at least 16 MB of memory and Palm OS version 4 or higher. There is no standard for displaying Thai on a PDA. To display Thai, the current version needs the Thai routine from Thaihack to be installed. For text summarization, the test data set was collected from the Thai newspaper Daily News. The data set covers daily events, such as economic, foreign affairs, political, and social news in Thailand. Figure 4 shows some displays of this framework.

5. Conclusion and Future Works

This framework is our experiment in exploring new applications for PDA devices and extends our previous research to mobile clients.
The framework serves a particular search task. We present compact content by applying summarization and clustering techniques. Furthermore, information is filtered according to the user's preferences and mailed to the user. The system has not been evaluated in terms of theoretical information retrieval measures, since the purpose is to present a possible model for delivering Thai text to mobile users. For future research on the summarization process, a number of approaches may be investigated and experimented with on Thai text. Approaches to text summarization have been grouped into four categories: a summary consisting of a list of terms or concept terms, a single passage extracted from the text, a sequence of sentences extracted from the text, and a summary generated using natural language understanding. The most important need for Thai text summarization is a standard test data set, since there are a number of parameters to explore. Additionally, to improve the summarization results, natural language processing techniques may be needed to extract proper nouns.

Figure 4. Original Document, Summarization Results, and Clustering Results

Acknowledgments

This research was partially supported by the Office of the National Research Council of Thailand.

References

[1] Buyukkokten O., Garcia-Molina H. and Paepcke A., Seeing the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices, Proceedings of the Tenth International World-Wide Web Conference, 2000
[2] Marsden G., Cheery R., and Haefele A., Small Screen Access to Digital Libraries, Computer Network 31 (1999)
[3] Michael J. Albers, Loel Kim, Implications of the wireless web for technical communicators: User web browsing characteristics using palm handhelds for information retrieval, Proceedings of IEEE professional communication society international professional communication conference and Proceedings of the 18th annual ACM international conference on Computer documentation: technology & teamwork, September 2000
[4] Watters C. and Zhang R., PDA Access to Internet Content: Focus on Forms, HICSS 36 Hawaii, Jan
[5] Matt J., Gary M., Norliza M., and Boone K., Improving Web Interaction on Small Displays, Computer Network 31 (1999)
[6] Sornlertlamvanich V., Potipiti T., Wutiwiwatchai C., and Mittrapiyanuruk P., The State of the Art in Thai Language Processing, Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL2000), Hong Kong, pp , October
[7] Hahn Udo and Mani Inderjeet, The Challenges of Automatic Summarization, IEEE Computer, Nov. 2000, (Vol 33, No. 11)
[8] Luhn H.P., The Automatic Creation of Literature Abstracts, Advances in Automatic Text Summarization, edited by Inderjeet Mani and Mark T. Maybury, The MIT Press, Cambridge, Massachusetts, London, England,
[9] Charoenpornsawat P. and Sornlertlamvanich V., Automatic Sentence Break Disambiguation for Thai, Proceedings of ICCPOL2001, Korea, pp , May
[10] Chan D.L., Luk R.W.P., Mark W.K., Leon H.V., Ho E.K.S. and Lu Q., Multiple Related Document Summary and Navigation using Concept Hierarchies for Mobile Clients, Proceedings of the 2002 ACM Symposium on Applied Computing (SAC), March 10-14, 2002, Madrid, Spain. ACM 2002
[11] Jaruskulchai J., Kiewsuwansuk S., and Kantasena J., Thai Text Document Clustering, The Fifth National Computer Science and Engineering Conference, 7-9 Nov 2001, Chiang Mai, Thailand,
[12] Information Research and Development Division, Orchid Corpus, National Electronics and Computer Technology Centers, /itech/download.html, Jan
