

Document Summarization on Handheld Device: An Information Visualization Tool for Mobile Commerce

Christopher C. Yang
Dept. of Systems Eng. and Eng. Management, The Chinese University of Hong Kong, Hong Kong SAR, China

Fu Lee Wang
Dept. of Systems Eng. and Eng. Management, The Chinese University of Hong Kong, Hong Kong SAR, China

Abstract: Wireless access with mobile (or handheld) devices is a promising addition to the WWW and traditional electronic business. Much information is generated or accessed on the road, in conversation, while shopping in stores, or while meeting business clients, whereas traditional electronic business requires users to be stationary with a network connection. Mobile devices provide convenient and portable access to the huge information space on the Internet, and advances in mobile technologies create added value and business opportunities for mobile commerce. However, handheld devices suffer from limited screen size, narrow network bandwidth, small memory capacity, and low computing power. Loading and visualizing large documents on handheld devices becomes impossible: the limited resolution restricts the amount of information that can be displayed, and the download time is intolerably long. All these shortcomings make vendors and users hesitant to explore the enormous opportunities in mobile commerce. In this paper, we introduce the fractal summarization model for document summarization on handheld devices. A three-tier architecture in which the middle tier conducts the major computation is also proposed, and visualization of the summary on handheld devices is investigated. Automatic summarization, the three-tier architecture, and information visualization are potential solutions to the existing problems of handheld devices for mobile commerce.

Keywords: Document summarization, mobile commerce, fisheye view, handheld devices.

1. Introduction

The advance of mobile networks creates business opportunities and provides value-added services to users. Access to the Internet through mobile phones and other handheld devices has grown significantly in recent years. The Wireless Application Protocol (WAP) and Wireless Markup Language (WML) provide a universal open standard and markup language. In this age of information, many information-centric applications have been developed for handheld devices [3][18][22]. For example, users can now surf the web, check e-mail, read news, and get stock quotes using handheld devices.

The convenience of handheld devices allows information access without geographic limitation. In a dynamic business environment, executive officers may need to access important documents to make decisions while they are on the road and away from the office. However, there may not be any network connection to access the Internet, and the decision making cannot be deferred. In these situations, mobile devices, which provide Internet access anywhere, are able to solve the problem. Internet access with wireless handheld devices eliminates the geographic limitation; however, other limitations of handheld devices restrict their capability.

1.1 Shortcomings of Wireless Handheld Devices

Although wireless handheld devices have developed rapidly in recent years, they still have many shortcomings, such as screen size, bandwidth, and memory capacity. There are two major categories of wireless handheld devices, namely WAP-enabled mobile phones and wireless PDAs. Products of Nokia and Palm are taken as examples here since both are leading companies in mobile technologies.

Screen Resolution Popular Wireless Handheld Devices Nokia 3320, 3330, 3360, 5510, 5210, 8310, 8390, Nokia Nokia 3350, 3410, 3510, 3590, 3610, 6310, 6510, 6590, Nokia 6610, Nokia Nokia 9110i, Palm i705

Table 1: Screen Resolutions of Wireless Handheld Devices.

At present, the typical display sizes of popular WAP-enabled handsets and PDAs are pixels and pixels, respectively, which are approximately 1/126 to 1/30 of the display area of a standard personal computer ( pixels). The memory capacity of a handheld device greatly limits the amount of information that can be stored. The maximum WML deck size is 64 kilobytes (Nokia 9110i and Nokia 9210), while the maximum WML deck size for most popular handsets is about 1.4 to 2.8 kilobytes in binary form. The typical memory capacity of PDAs is 8 MB.

The current bandwidth available for WAP is 9.6 kbps, which can be increased to 40.2 kbps with GPRS; however, this is not comparable with the broadband Internet connections available to PCs.

1.2 Potential Solutions by Automatic Summarization and Information Visualization

Although handheld devices are convenient, they impose constraints that do not exist on desktop computers. Low bandwidth and low screen resolution are their major shortcomings. Information overload is a critical problem; advanced search techniques address it by filtering out most of the irrelevant information. However, the precision of most commercial search engines is not high, and users may find only a few relevant documents in a large pool of search results. Given the large screen and high bandwidth of desktop computing, users can still browse the search results one by one and identify the relevant information. On a handheld device, however, it is impossible to search for and visualize the critical information on a small screen with an intolerably slow download speed. Automatic summarization condenses a document so that users can preview its major content. Users can determine whether the information fits their needs by reading the summaries instead of browsing the whole documents one by one. The amount of information displayed and the download time are significantly reduced. In this paper, we propose the fractal summarization model based on the statistical data and the structure of documents. Information visualization techniques are also employed to reduce the visual load. A three-tier architecture that reduces the computing load of the handheld devices is discussed.

2. Three-tier Architecture

A two-tier architecture is typically used for Internet access. The user's PC connects to the Internet directly, and the loaded content is fed to the web browser and presented to the user, as illustrated in Figure 1.
Figure 1. Traditional Web Browsing (components: Web Server, HTML, Browser on the user's PC, User)

The summarizer summarizes a document for users to preview before the whole document is presented. As shown in Figure 2, the content is first fed to the summarizer after it is loaded onto the user's PC. The summarizer connects to the database server when necessary and generates a summary to display in the browser.

Figure 2. Document Browsing with Summarizer on PC (components: Web Server, Document Server, HTML, XML, Summarizer, Browser, DB Server for Summarizer, SQL, User's PC, User)

Figure 3. Document Browsing with Summarizer on WAP (components: Web Server, Document Server, HTML, XML, Summarizer, DB Server for Summarizer, SQL, WAP Server, WAP Gateway, Wireless WML, Handheld PDA, Local Synchronization, User)

The two-tier architecture cannot be applied to handheld devices, since their computing power is insufficient to perform summarization and the mobile network does not provide sufficient bandwidth for navigation between the summarizer and other servers. The three-tier architecture illustrated in Figure 3 is therefore proposed. A WAP gateway is set up to perform the summarization. The WAP gateway connects to the Internet through a broadband network. Wireless handheld devices can navigate interactively with the gateway through the wireless network to retrieve the summary piece by piece. Alternatively, if the PDA is equipped with more memory, the complete summary can be downloaded to the PDA through local synchronization.
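The middle tier's role of serving the summary "piece by piece" might be sketched as follows. This is only an illustration under stated assumptions: the class name `SummaryGateway`, the `fetch` and `summarize` callables, and the nested-dictionary summary layout are all hypothetical and do not come from the paper.

```python
class SummaryGateway:
    """Hypothetical middle tier: summarize once, then serve one node per request."""

    def __init__(self, fetch, summarize):
        self._fetch = fetch          # assumed callable: url -> document text
        self._summarize = summarize  # assumed callable: text -> nested summary dict
        self._cache = {}             # url -> summary tree kept on the gateway

    def get_node(self, url, path=()):
        """Return only the requested node, so the handheld downloads a small piece."""
        if url not in self._cache:
            # The heavy work (download over broadband, summarization) stays here.
            self._cache[url] = self._summarize(self._fetch(url))
        node = self._cache[url]
        for index in path:           # walk down the tree: (0, 2) = first child, third grandchild
            node = node["children"][index]
        return {
            "title": node["title"],
            "sentences": node["sentences"],
            "children": [child["title"] for child in node["children"]],
        }
```

The handheld would request the root first and then follow child indices, so each wireless round trip carries only one node of the summary tree.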

3. Automatic Summarization

3.1 Automatic Summarization

Traditional automatic text summarization is the selection of sentences from the source document based on their significance to the document [5][19]. The selection is based on the salient features of the document. The thematic, location, title, and cue features are the most widely used summarization features.

The thematic feature was first identified by Luhn [19]. Edmundson proposed assigning a thematic weight to each keyword based on its term frequency, and computing the sentence weight as the sum of the thematic weights of its constituent keywords [5]. In information retrieval, absolute term frequency by itself is considered less useful than term frequency normalized by the document length and by the term frequency in the collection [11]. As a result, the tfidf (Term Frequency, Inverse Document Frequency) method was proposed to calculate the thematic weight of a keyword [23].

The significance of a sentence is indicated by its location [2], based on the hypothesis that topic sentences tend to occur at the beginning or the end of documents or paragraphs [5]. Edmundson proposed assigning positive weights to sentences according to their ordinal position in the document, i.e., to the sentences in the first and last paragraphs and to the first and last sentences of each paragraph. Several functions have been proposed to calculate the location weight of a sentence. Alternatively, the preferred sentence locations can be stored in a list called the Optimum Position Policy, and sentences are selected based on their order in the list [17].

The title feature is based on the hypothesis that the author conceives the title as circumscribing the subject matter of the document. When the author partitions the document into major sections, he summarizes each by choosing an appropriate heading [5]. The heading weight is computed in much the same way as in the keyword approach.
A title glossary is a list of all the words in the title, sub-titles, and headings. Positive weights are assigned to the words in the title glossary, where the title words are assigned a weight relatively prime to that of the heading words. The heading weight of a sentence is the sum of the heading weights of its constituent words.

The cue phrase approach was proposed by Edmundson [5] based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words such as "significant", "impossible", and "hardly". A stored cue dictionary is used to identify the cue phrases; it comprises three sub-dictionaries: (i) bonus words, which are positively relevant; (ii) stigma words, which are negatively relevant; and (iii) null words, which are irrelevant. The cue weight of a sentence is the sum of the cue weights of its constituent words.
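The four features described in this section can be illustrated with a small scoring sketch. It is a simplification under stated assumptions, not the authors' implementation: the word lists, the function names, the raw term-frequency keyword weight (no idf), the single title weight (no separate heading weight), and the first/last-sentence location rule are all chosen here for illustration, and the four scores are simply summed with adjustable coefficients.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only; real dictionaries are far larger.
STOP_WORDS = {"the", "a", "an", "of", "is", "are", "in", "on", "and", "to"}
BONUS_WORDS = {"significant"}
STIGMA_WORDS = {"hardly", "impossible"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def sentence_weights(sentences, title, coeffs=(1, 1, 1, 1)):
    """Return one combined weight per sentence from cue, keyword, title, location."""
    a_cue, a_key, a_title, a_loc = coeffs
    term_freq = Counter(w for s in sentences for w in tokenize(s))
    title_words = set(tokenize(title))
    last = len(sentences) - 1
    weights = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        # Cue: bonus words count positively, stigma words negatively.
        w_cue = sum(w in BONUS_WORDS for w in words) - sum(w in STIGMA_WORDS for w in words)
        # Keyword (thematic): sum of document term frequencies of the words.
        w_key = sum(term_freq[w] for w in words)
        # Title: number of words shared with the title glossary.
        w_title = sum(w in title_words for w in words)
        # Location: assumed rule, reward only the first and last sentence.
        w_loc = 1 if i in (0, last) else 0
        weights.append(a_cue * w_cue + a_key * w_key + a_title * w_title + a_loc * w_loc)
    return weights
```

Sentences whose combined weight exceeds a chosen threshold would then be kept for the summary.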

Typical summarization systems select a combination of summarization features [5][16][17]:

$$W_{\text{sentence}}(s_n) = a_1\, w_{\text{cue}}(s_n) + a_2\, w_{\text{keyword}}(s_n) + a_3\, w_{\text{title}}(s_n) + a_4\, w_{\text{location}}(s_n)$$

where a_1, a_2, a_3, and a_4 are positive integers that adjust the weighting of the four summarization methods. Sentences whose weight is higher than a threshold are selected as part of the summary.

3.2 Fractal Summarization

The traditional summarization model considers the source document as a sequence of sentences; the extracted sentences are concatenated into a summary, ignoring the structure of the document. Advanced summarization techniques make use of the document structure to calculate the probability that a sentence should be included in the summary (for example, sentences in the conclusion section receive a higher score), and some other systems use graph theory to calculate the sentence weights. Many studies of the human abstraction process have shown that human abstractors search for topic sentences according to the document structure: they start from the top-level document structure and extend their search to lower levels until they have found enough information [6][9]. The document structure is therefore an important factor in summarization; a high-quality summary cannot be generated without considering it. In this paper, we propose the fractal summarization model.

In addition, the traditional summarization model requires the whole summary to be generated in a single batch. Generating a summary for a long document takes a long time, which makes the model unsuitable for M-Commerce. The fractal summarization model can generate a brief skeleton of the summary and generate the details only on user demand, and hence provides an interactive interface.

Figure 4. Document Structure (levels, from top to bottom: Document, Chapter, Section, Sub-section, Paragraph, Sentence, Term, Word, Character)

A document can be represented by a hierarchical structure, as shown in Figure 4. A document consists of chapters. A chapter consists of sections. A section may consist of subsections. A section or subsection consists of paragraphs. A paragraph consists of sentences. A sentence consists of terms. A term consists of words. A word consists of characters. A document structure can be considered as a fractal [20] structure. At the lower abstraction levels of a document, more specific information can be obtained. Although a document is not a true mathematical fractal object, since a document cannot be viewed at an infinite number of abstraction levels, we may consider a document as a prefractal [7]. The smallest unit in a document is the character; however, neither a character nor a word conveys any meaningful information about the overall content of a document. The lowest abstraction level in our consideration is therefore the term.

The Fractal Summarization Model applies a technique similar to fractal image compression [1][13]. An image is regularly segmented into sets of non-overlapping square blocks, called range blocks, and each range block is then subdivided into sub-range blocks until a contractive mapping can be found to represent each sub-range block. The Fractal Summarization Model generates the summary by a simple recursive deterministic algorithm based on the iterated representation of a document. The original document is partitioned according to the document structure, and each block is iteratively partitioned into child blocks until each block can be transformed into some key sentences by traditional summarization methods (Figure 5).

Fractal Summarization Algorithm
1. Choose a Compression Ratio.
2. Choose a Threshold Value.
3. Calculate the Sentence Number Quota of the summary.
4. Divide the document into range blocks.
5. Repeat
   5.1 For each range block, calculate the sum of the weights of the sentences under the range block.
   5.2 Allocate the Quota to each range block in proportion to its sum of weights.
   5.3 For each range block,
       If the quota is less than the Threshold Value
           Select the sentences in the range block by the traditional summarization method
       Else
           Divide the range block into sub-range blocks
           Repeat Steps 5.1, 5.2, and 5.3
   Until all the range blocks are processed
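The recursive algorithm above could be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the `Block` class and `fractal_summarize` function are hypothetical names, sentence weights are assumed to be precomputed, each child's quota is simply rounded (so the allocated quotas may not sum exactly to the parent's), and a quota equal to the threshold is treated as selectable, following the later description of the threshold as a maximum.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A range block: a chapter, section, paragraph, or the whole document."""
    name: str
    sentences: list = field(default_factory=list)  # (text, weight) pairs at leaf blocks
    children: list = field(default_factory=list)   # sub-range blocks

    def all_sentences(self):
        """All (text, weight) pairs under this block, in document order."""
        if not self.children:
            return list(self.sentences)
        return [s for child in self.children for s in child.all_sentences()]

    def weight(self):
        """Step 5.1: sum of the weights of the sentences under this block."""
        return sum(w for _, w in self.all_sentences())


def fractal_summarize(block, quota, threshold=5):
    """Return the texts selected from `block` for a given sentence quota."""
    if quota <= 0:
        return []
    if quota <= threshold or not block.children:
        # Small enough: pick the top-weighted sentences, then restore document order.
        sents = block.all_sentences()
        top = sorted(range(len(sents)), key=lambda i: sents[i][1], reverse=True)[:quota]
        return [sents[i][0] for i in sorted(top)]
    total = block.weight()
    selected = []
    for child in block.children:
        # Step 5.2: quota in proportion to the child's share of the weight (rounded).
        share = round(quota * child.weight() / total) if total else 0
        selected.extend(fractal_summarize(child, share, threshold))
    return selected
```

The initial quota would be the number of sentences in the document multiplied by the compression ratio. Because each child is processed by its own call, a gateway could also expand a single branch on demand instead of summarizing the whole tree at once.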

Document (Weight 1000, Quota 40)
    Chapter 1 (Weight 300, Quota 12)
        Section 1.1 (Weight 100, Quota 4)
        Section 1.2 (Weight 150, Quota 6)
        Section 1.3 (Weight 50, Quota 2)
    Chapter 2 (Weight 500, Quota 20)
        Section 2.1 (Weight 100, Quota 3)
        Section 2.2 (Weight 250, Quota 10)
        Section 2.3 (Weight 150, Quota 7)
    Chapter 3 (Weight 200, Quota 8)
        Section 3.1 (Weight 120, Quota 5)
        Section 3.2 (Weight 80, Quota 3)
            Paragraphs...

Figure 5. An Example of Fractal Summarization Model

The compression ratio of summarization is defined as the ratio of the number of sentences in the summary to the number of sentences in the source document. It has been chosen as 25% in most of the literature because it has been shown that an extract of 20% of the sentences can be as informative as the full text of the source document [21]; such summarization systems can achieve a recall of up to 96% [5][14][25]. However, Teufel and Moens pointed out that high-compression abstracting is more useful, and reported a recall of 49.6% at a compression ratio of around 4% [24]. In order to minimize the bandwidth requirement and reduce the pressure on the computing power of handheld devices, the default compression ratio is chosen as 4%. By the definition of the compression ratio, the sentence quota of the summary is the number of sentences in the source document times the compression ratio.

The weights of the sentences under a range block are calculated by the traditional summarization methods described in Section 3.1. The total sentence weight is a weighted sum of all the summarization features [5]. It has been reported that the weighting of the different features does not have any substantial effect on the average precision [16]; therefore, the total weight of a sentence is calculated as the sum of the scores of all the summarization methods.

The threshold value is the maximum number of sentences that can be selected from a range block; if the quota is larger than the threshold value, the range block must be divided into sub-range blocks.
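As a worked check of the proportional allocation, the quotas of Chapters 1 to 3 and of the sections of Chapter 1 in Figure 5 follow directly from the weights. The symbols Q (quota) and W (weight) are notation introduced here for illustration, not the authors':

```latex
Q_{\text{child}} = Q_{\text{parent}} \cdot \frac{W_{\text{child}}}{W_{\text{parent}}}

Q_{\text{Chapter 1}} = 40 \cdot \frac{300}{1000} = 12, \qquad
Q_{\text{Chapter 2}} = 40 \cdot \frac{500}{1000} = 20, \qquad
Q_{\text{Chapter 3}} = 40 \cdot \frac{200}{1000} = 8

Q_{\text{Section 1.1}} = 12 \cdot \frac{100}{300} = 4, \qquad
Q_{\text{Section 1.2}} = 12 \cdot \frac{150}{300} = 6, \qquad
Q_{\text{Section 1.3}} = 12 \cdot \frac{50}{300} = 2
```

With the default threshold of 5, Chapter 1 (quota 12) is divided further, Section 1.2 (quota 6) is divided again, and Sections 1.1 and 1.3 are small enough for sentences to be selected directly.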
Document summarization differs from image compression in that more than one attractor can be chosen in one range block. It has been shown that, for summarization by extraction of a fixed number of sentences, the optimal summary length is 3 to 5 sentences [9]. The default threshold value is chosen as 5 in our system.

It is believed that a full-length text document contains a set of subtopics [12] and that a good-quality summary should cover as many subtopics as possible. Fractal summarization produces a summary with a wider coverage of subtopics than the traditional summarization model. The traditional summarization model extracts most of its sentences from a few chapters. Taking the Hong Kong Annual Report as an example (Table 2), with a sentence quota of 80 sentences the traditional summarization model extracted 29 sentences from one chapter and a total of 53 sentences from the top 3 chapters out of 23 chapters, while no sentences at all were extracted from 8 chapters. The fractal summarization model, in contrast, extracts sentences evenly from each chapter. In our example, it extracts at most 8 sentences from any single chapter and at least 1 sentence from each chapter.

                                                                  Fractal          Traditional
                                                                  Summarization    Summarization
Maximum no. of sentences extracted from one single chapter        8                29
Minimum no. of sentences extracted from one single chapter        1                0
Standard deviation of no. of sentences extracted from chapters

Table 2: No. of sentences extracted by two summarization models from Hong Kong Annual Report 2000

An experiment was conducted to benchmark fractal summarization against the traditional summarization model. Five college graduates were selected as subjects. They were asked to evaluate the quality of the summaries without knowing how the summaries were generated. Fractal summarization achieved a precision of 0.84 on average, while traditional summarization achieved a precision of 0.67 on average. Fractal summarization thus outperforms traditional summarization.

4. Summary Visualization on Handheld Devices

The summary generated by the Fractal Summarization Model is represented in a tree structure. WML is the markup language supported by wireless handheld devices. The basic unit of a WML file is a deck; each deck must contain one or more cards. The card element defines the content displayed to users, and cards cannot be nested. Each card can link to another card within the same deck or in another deck.
Nodes on the same level of the fractal summarization model are converted into cards, and anchor links are used to implement the tree structure. A card for a summary node may contain many sentences or child nodes, and a large number of sentences in a small display area is difficult to read. Fisheye view is a visualization technique that enlarges the focus of interest and diminishes the information that is less important [8] (Figure 6). Objects in the neighborhood of the focus of interest are displayed with a larger visual size, and the visual size of objects further away decreases in inverse proportion to their distance from the focus point. In our system, we have implemented the fisheye view with the 3-scale font mode available in WML. The prototype system, running on the Nokia 6590 Handset Simulator, is shown in Figure 7. The document presented is the Hong Kong Annual Report 2000. There are 23 chapters in total in the annual report; 6 of them are shown in large font, which means that they are more important, and the rest are in normal or small font according to their importance to the report (Figure 7a). Figure 7b shows the summary of Chapter 19.

Figure 6: Example of Fisheye View

Figure 7. Screen Capture of the WAP Summarization System: (a) Chapters of HKAR 2000; (b) Chapter 19 of HKAR 2000

Figure 8. KWML on Palm V

A handheld PDA is usually equipped with more memory, and the complete summary can be downloaded as a single WML file to the PDA through local synchronization. To read the summary, the PDA must have a standard WML file reader installed, such as KWML, shown in Figure 8 [15].
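The conversion of summary nodes into WML cards with a three-level font scale might look like the sketch below. It is an illustration only: the function names, the card identifiers, and the cut-offs of 0.66 and 0.33 of the largest weight are assumptions made here, and the paper does not state how weights map to font sizes. The sketch relies on the WML `<big>` and `<small>` elements for the large and small fonts and on `<a href="#id">` anchors for links between cards in the same deck.

```python
from xml.sax.saxutils import escape

# Assumed header for a WML 1.1 deck.
WML_HEADER = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" '
    '"http://www.wapforum.org/DTD/wml_1.1.xml">'
)


def attr(value):
    """Escape a string for use inside a double-quoted XML attribute."""
    return escape(value, {'"': "&quot;"})


def font_for(weight, max_weight):
    """Map a weight to 'big', None (normal font), or 'small'; cut-offs are assumed."""
    ratio = weight / max_weight if max_weight else 0.0
    if ratio >= 0.66:
        return "big"
    if ratio >= 0.33:
        return None
    return "small"


def render_card(card_id, title, items):
    """Render one card; items are (text, weight, target_card_id or None) tuples."""
    max_weight = max((w for _, w, _ in items), default=0)
    lines = [f'<card id="{attr(card_id)}" title="{attr(title)}">', "<p>"]
    for text, weight, target in items:
        body = escape(text)
        if target:
            # Anchor link to the child node's card in the same deck.
            body = f'<a href="#{attr(target)}">{body}</a>'
        tag = font_for(weight, max_weight)
        if tag:
            body = f"<{tag}>{body}</{tag}>"
        lines.append(body + "<br/>")
    lines += ["</p>", "</card>"]
    return "\n".join(lines)


def render_deck(cards):
    """Wrap already-rendered cards in a single WML deck."""
    return WML_HEADER + "\n<wml>\n" + "\n".join(cards) + "\n</wml>"
```

A deck for the chapter list would hold one card whose items link to the chapter cards, with the most heavily weighted chapters wrapped in `<big>`.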

5. Conclusion

Mobile commerce, through the adoption of portable handheld devices, is a promising addition to electronic commerce. However, handheld devices have many shortcomings, such as limited resolution and low bandwidth. To overcome these shortcomings, fractal summarization and visualization by fisheye views are proposed in this paper. Fractal summarization creates a summary in a tree structure and presents the summary on handheld devices through cards in WML. Users may browse the selected summary by following the anchor links from the highest abstraction level to the lowest abstraction level. Based on the sentence weights computed by the summarization technique, fisheye views are employed to enlarge the focus of interest and diminish the less significant sentences. This visualization effect draws users' attention to the important content. The three-tier architecture is presented to reduce the computing load of the handheld devices. The proposed system creates an information visualization environment that avoids the existing shortcomings of handheld devices for mobile commerce.

6. References

[1] M. F. Barnsley and A. E. Jacquin, Application of recurrent iterated function systems to images. In Proceedings SPIE Visual Communications and Image Processing '88, vol. 1001.
[2] P. B. Baxendale, Machine-Made Index for Technical Literature - An Experiment. IBM Journal (October).
[3] O. Buyukkokten, H. Garcia-Molina, A. Paepcke, and T. Winograd, Power Browser: Efficient Web Browsing for PDAs. Human-Computer Interaction Conference, The Hague, The Netherlands.
[4] O. Buyukkokten, H. Garcia-Molina, and A. Paepcke, Text Summarization of Web pages on Handheld Devices. In Proc. of Workshop on Automatic Summarization 2001 held in conjunction with NAACL.
[5] H. P. Edmundson, New Method in Automatic Extraction. Journal of the ACM, vol. 16, no. 2.
[6] B. Endres-Niggemeyer, E. Maier, and A. Sigel, How to Implement a Naturalistic Model of Abstracting: Four Core Working Steps of an Expert Abstractor. Information Processing & Management, vol. 31, no. 5.
[7] J. Feder, Fractals. Plenum, New York.
[8] G. W. Furnas, Generalized Fisheye Views. In Proceedings of the SIGCHI Conference on Human Factors in Computing System.
[9] B. G. Glaser and A. L. Strauss, The discovery of grounded theory; strategies for qualitative research. Aldine de Gruyter, New York.
[10] J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell, Summarizing text documents: Sentence selection and evaluation metrics. In Proceedings of SIGIR.

[11] D. K. Harman, Ranking algorithms. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, chapter 14. Prentice-Hall.
[12] M. A. Hearst, Subtopic Structuring for Full-Length Document Access. In Proc. of the 16th Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval.
[13] A. Jacquin, Fractal image coding: A review. In Proceedings of the IEEE, vol. 81, no. 10.
[14] J. Kupiec, J. Pedersen, and F. Chen, A Trainable Document Summarizer. In Proc. of the 18th Annual International ACM Conf. on Research and Development in Information Retrieval (SIGIR).
[15] KWML, KWML - KVM WML (WAP) Browser on Palm.
[16] A. M. Lam-Adesina and G. J. F. Jones, Applying summarization Techniques for Term Selection in Relevance Feedback. In Proceedings of SIGIR, pp. 1-9.
[17] C. Y. Lin and E. H. Hovy, Identifying Topics by Position. In Proc. of the Applied Natural Language Processing Conference (ANLP-97), Washington, DC.
[18] H. Lo, M-Mail: A Case Study of Dynamic Application Partitioning in Mobile Computing. Master Thesis of Waterloo University.
[19] H. P. Luhn, The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development.
[20] B. Mandelbrot, The fractal geometry of nature. W. H. Freeman, New York.
[21] G. Morris, G. M. Kasper, and D. A. Adams, The effect and limitation of automated text condensing on reading comprehension performance. Information System Research.
[22] PALM, PALM: Providing Fluid Connectivity in a Wireless World. White Paper of Palm Inc.
[23] G. Salton and C. Buckley, Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, vol. 24.
[24] S. Teufel and M. Moens, Sentence Extraction and rhetorical classification for flexible abstracts. AAAI Spring Symposium on Intelligent Text summarization, Stanford.
[25] S. Teufel and M. Moens, Sentence Extraction as a Classification Task. In Workshop Intelligent and scalable Text summarization, ACL/EACL.
