Supporting Constructivist Learning in a Multimedia Presentation System


Dula Kumela 1, Kenneth Watts 2, and W. Richards Adrion 3

1 Dula Kumela, Research Assistant, University of Massachusetts Amherst, RIPPLES laboratory, dkumela@cs.umass.edu
2 Kenneth Watts, Senior Software Engineer, University of Massachusetts Amherst, RIPPLES laboratory, watts@cs.umass.edu
3 W. Richards Adrion, Professor of Computer Science, University of Massachusetts Amherst, RIPPLES laboratory, adrion@cs.umass.edu

Abstract - The Research in Presentation Production for Learning Electronically (RIPPLES) group in the Department of Computer Science at UMASS Amherst has developed a course delivery system named Multimedia Asynchronous Networked Individualized Courseware (MANIC). MANIC uses the approach of record and playback. While record and playback technologies can be very effective in supporting a constructivist mode of instructional delivery, the technology is not inherently constructivist. In support of a more constructivist mode of instruction, we have implemented advanced indexing and search features in MANIC that make use of ranking and relevance, and a query expansion technique to generate queries and conduct searches over the World Wide Web (WWW) using Google. In this paper we describe the initial experiments we conducted and our plans for additional assessment and enhancement of the search mechanism.

Index Terms - Constructivist Learning, Query expansion, Precision, Ranking, Recall, Relevance and Search.

INTRODUCTION

Constructivist-oriented instructional models, including active learning, peer learning, and cooperative learning, are effective in computer science education; see, for example, [9,10]. Constructivism holds that learners give their own meaning to the world by constructing understanding through experience. For the purpose of this paper, we view constructivist learning as an environment where learners, typically in groups, explore (with guidance from an instructor) the learning environment and construct meaning based on their (shared) learning experiences and active investigation of provided materials. A constructivist-learning environment [11], whether physical or technology-based, should be active (learners are engaged), constructive (learners build on prior knowledge, integrating with new ideas and experiences to construct meaning), intentional (goal-directed), complex (learners face ill-structured and complicated problems), contextual (problems are situated in a realistic context), conversational (learners interact with each other) and reflective (learners report on progress and process).

The RIPPLES Project investigates how to most effectively use the WWW and CD/DVD-ROM technology to deliver lectures and course materials in and outside of the classroom. The RIPPLES Project delivers lectures using several related technologies. Most are based on the MANIC framework and include streaming and CD/DVD modes of courseware delivery. Current RIPPLES technologies include a number of features important to creating a constructivist-learning environment. Figure 1 shows how the indexing and search mechanisms within the CD-MANIC courseware can be linked with an on-line text. Indexes are interlinked so that students can search the text or the CD-MANIC presentation, or use either the text or the CD-MANIC index to reach linked points in the text and associated lecture.

FIGURE 1 CD MANIC EXAMPLE

The Learner Logger [7] is embedded within the CD-MANIC browser. Data are collected continuously and uploaded to a server in the background during student sessions, and allow us to track all student behavior, including: format choices (video, audio, slides); relative size and positioning on the screen; time spent on each step of the presentation; external links followed; menu choices; responses to queries; quiz scores; and student feedback and evaluation forms.

Most record and playback approaches, such as the original RIPPLES/MANIC technology, produce courseware that is largely passive, typically allowing only some form of simple search over the text in the recorded lectures, and they generally have limited ability to support learner navigation, a key to knowledge discovery and construction. Our opportunity to apply constructivist pedagogies depends on creating a constructivist-learning environment in and outside the classroom. To do this, we had to go well beyond record-and-playback technologies, and the extended search mechanism described here is an important first step.

ADVANCED SEARCH IN CD-MANIC

In the original RIPPLES/MANIC courseware, learners can search over the text in lecture slides and, if available, the electronic version of the course textbook using simple Boolean combinations of keywords [1]. An index to the matched slides/text is presented to the learner as a sequentially ordered list. These results are of limited use because the search does not rank them by relevance to the topic the user is querying. Search is also limited to a single course, rather than allowing searches over multiple courses and over the great wealth of information available on the Internet. As the standard index (slides, text) has improved, Learner Logger data show that users seldom employ search.

To support constructivist learning, learners need a mechanism to search for a given topic or subject in material available within a given course, across related courses, within texts and other reference material, and throughout the WWW. To enable broader exploration and knowledge construction, we replaced CD-MANIC's brute-force search with a more advanced search technique that makes use of ranking and relevance. The updated system retrieves matching documents (e.g., relevant lecture segments) from the locally available material. A second-level query, derived from the content of the matching documents and the initial user query using query expansion techniques, is then run over the WWW and a database of reference material using Google™.
The returned local and WWW links are filtered by relevance to learner interests.

FIGURE 2 ARCHITECTURE OF MANIC SEARCH (components shown: MANIC Course Developer, MANIC User, MANIC Indexer, MANIC Browser, Logger, Log Database Server, Index database, JNI Connector, Query Expansion Module, Lucene API, Google API)

TECHNOLOGY ARCHITECTURE

The new RIPPLES/MANIC search mechanism was developed using open source systems and frameworks and free online services, integrated with software developed by the RIPPLES Group. RIPPLES component software includes a query expansion module, a user interface module for indexing, an extension to the Learner Logger to capture user interaction with the system, and a Java Native Interface (JNI) component that provides the interface between open source code implemented in Java and the existing C++ implementation of CD-MANIC.

To index and search lecture slides (and online texts), we used the open source Lucene search Application Program Interface (API) available from the Apache Jakarta project [2]. We chose Lucene because it is an open source API that we can modify to meet our needs. Lucene also provides most of the features we were looking for in a search engine, such as ranking and relevance when searching over indexed material. The Google™ search API [3] is used to search for materials over the WWW. The overall architecture of the new CD-MANIC search is depicted in Figure 2.

DESCRIPTION OF COMPONENTS

This section describes the components that make up the advanced search feature of CD-MANIC. The discussion covers the description and usage of the following:

- Lucene API
- Google API
- Query expansion module
- JNI integration with CD-MANIC
- Learner Logger interface

Lucene API

Lucene is a high-performance, full-featured text search engine written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform applications.
Lucene was started by Doug Cutting 4 as an independent project; in September 2001 it became an official Jakarta project. Like most Jakarta projects, Lucene is provided as an open source library. Lucene provides an easy-to-use API for indexing a set of documents and performing searches over the created index.

4 Doug Cutting, originator of the Lucene search, cutting@apache.org

Indexing is the process of creating a special database (an "index") that contains a compiled version of the documents, optimized for quick lookup of the list of documents that contain certain words or terms. By default, Lucene stores the index as a set of files in a file system. Lucene also provides the flexibility to implement other storage methods, such as non-resident in-memory storage or mapping of Lucene data to a third-party relational database.

The Lucene API provides elaborate control over the information stored in the index for each document and over how this information is used during indexing and searching. At one extreme, it is possible to store for each document just its location (e.g., URL or file path) and index the content of the document as a monolithic piece of text. At the other extreme, it is possible to store the entire document as well as various attributes such as Author, Title, and Date, and to perform searches that consider these attributes for matching and ranking.

One of the major advantages of Lucene over other currently available open source search engines is its ability to index any file or document type. The Lucene API is not specific to any file type; it leaves the responsibility of reading and parsing the document to the developer using the API. A given document can therefore be indexed with Lucene as long as the developer can provide a parser that delivers the document content in the form Lucene expects.

Searching in Lucene is the operation of locating the subset of documents that contain desired content or have attributes that match some specification. The input to a search operation is a query (a term, a set of terms, or a Boolean combination of terms) that specifies the criteria for selecting documents, and its output is a list of documents, or "hits", that match those criteria. The hit list is typically ordered by some measure of relevancy (called "ranking" or "scoring") and may contain only a subset of the documents that matched the query (typically the documents with the highest rank or score). The search operation is performed on the index, which is optimized for quickly locating documents that contain certain words or terms. The rank or score of a document for a given search query is calculated using (1).
$$\mathrm{score}_d = \Big( \sum_t \frac{\mathrm{tf}_q \cdot \mathrm{idf}_t}{\mathrm{norm}_q} \cdot \frac{\mathrm{tf}_d \cdot \mathrm{idf}_t}{\mathrm{norm}_{d,t}} \cdot \mathrm{boost}_t \Big) \cdot \mathrm{coord}_{q,d} \tag{1}$$

TABLE I DESCRIPTION OF VALUES IN (1)

VALUE         DESCRIPTION
score_d       Score for document d
sum_t         Sum over all terms t
tf_q          The square root of the frequency of t in the query
idf_t         log(numdocs / (docfreq_t + 1)) + 1
norm_q        sqrt(sum_t((tf_q * idf_t)^2))
tf_d          The square root of the frequency of t in d
norm_{d,t}    Square root of the number of tokens in d in the same field as t
boost_t       The user-specified boost for term t
coord_{q,d}   Number of terms in both query and document / number of terms in query
numdocs       Number of documents in the index
docfreq_t     Number of documents containing term t

Lucene is a library that provides an API to index and search a set of documents; it is not a standalone application and does not include a user interface. For this reason, we had to develop an interface that can be used to select and index documents. The first version of the user interface module lets a user select a set of documents from a local file system and index them. It also indicates the indexing progress and lists the documents that have been indexed. We currently support indexing of text and HTML documents, which meets our current requirements. Support for other file types can be added with minimal effort as needed. The interface also lets a user load a given index database and perform search operations, which helps in testing an index database after a set of documents has been indexed.

FIGURE 3 CD-MANIC INDEXER USER INTERFACE

Google™ API

Google™ is a widely used search engine whose popularity has increased over the years. The developers at Google™ provide an API that allows developers to use the Google™ search engine from within their own applications. The API is provided as a Web service that can be accessed over the Internet. The Google™ API provides functions to support three different request types: search, cache and spelling.

Search requests submit a query string and a set of parameters to the Google™ Web API service and receive in return a set of search results derived from Google™'s index of over 2 billion Web pages. Cache requests submit a URL and receive in return the contents of the URL as of the last time Google™'s crawlers visited the page (if available). Spelling requests submit a query and receive in return a suggested spelling correction for it (if available). Spelling corrections mimic the behavior found on the Google™ Web site and are subject to the same query string limitations as any other search request: the input string is limited to 2048 bytes and 10 individual words. The return type for spelling requests is a text string.

In our implementation of the RIPPLES/MANIC search we used only the Google™ search request feature. At the current stage of our product, we have not found a use for the other two features provided in the API. When the search query supplied to CD-MANIC search was passed directly to the Google™ search API, some of the results returned from Google™ were not relevant to the subject the user was querying. To increase the proportion of relevant documents returned by the search, we implemented the query expansion module described in the next section.
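Returning to the local search, the scoring formula (1) and the definitions in Table I can be made concrete with a short sketch. The following Python function is an illustration only, not Lucene code and not part of CD-MANIC: the function name and its arguments are hypothetical, each document is assumed to have a single field (so norm_{d,t} is the square root of the document's token count), and idf_t is read as log(numdocs / (docfreq_t + 1)) + 1.

```python
import math
from collections import Counter


def lucene_style_score(query_terms, doc_terms, doc_freq, num_docs, boost=None):
    """Score one document against a query following formula (1).

    query_terms, doc_terms: lists of tokens (single field assumed).
    doc_freq: dict mapping a term to the number of documents containing it.
    num_docs: number of documents in the index (must be >= 1).
    boost: optional dict of per-term boosts (defaults to 1.0).
    """
    boost = boost or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)

    def idf(term):
        # idf_t = log(numdocs / (docfreq_t + 1)) + 1
        return math.log(num_docs / (doc_freq.get(term, 0) + 1)) + 1.0

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2)), with tf_q = sqrt(query frequency)
    norm_q = math.sqrt(
        sum((math.sqrt(c) * idf(t)) ** 2 for t, c in q_counts.items())
    )
    # norm_d_t = sqrt(number of tokens in the document's field)
    norm_d = math.sqrt(len(doc_terms))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0

    total = 0.0
    matched = 0
    for term, q_count in q_counts.items():
        d_count = d_counts.get(term, 0)
        if d_count == 0:
            continue  # term absent from the document contributes nothing
        matched += 1
        tf_q = math.sqrt(q_count)
        tf_d = math.sqrt(d_count)
        total += (
            (tf_q * idf(term) / norm_q)
            * (tf_d * idf(term) / norm_d)
            * boost.get(term, 1.0)
        )

    # coord_q_d = terms shared by query and document / terms in query
    coord = matched / len(q_counts)
    return total * coord
```

Under these assumptions, a document that contains every query term receives both a larger sum and a larger coord factor than a document that contains only some of them, and a document that shares no term with the query scores zero.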

Query Expansion Module

The Google™ search engine returns some documents that are not relevant when supplied with the query as entered by the user. This could be because Google™'s index contains many documents that include the terms in the user query but are not related to the subject the user is searching for. We needed a mechanism to address this problem, so we implemented a query expansion module that uses locally available documents to generate a more descriptive query, which in turn yields more relevant results from Google™.

Automatic query expansion has long been suggested as a technique for dealing with the fundamental issue of word mismatch in information retrieval [4]. If a relevant document does not contain the terms that are in the query, then that document will not be retrieved. The aim of query expansion is to reduce this query/document mismatch by expanding the query using words or phrases with similar meaning or some other statistical relation to the set of relevant documents [5]. A number of approaches to query expansion have been suggested and studied. More recently, attention has been given to techniques that analyze the corpus to discover word relationships (global techniques) and those that analyze documents retrieved by the initial query (local feedback). It is widely accepted that local feedback is more effective than the global techniques.

The approach we took for our query expansion module was local feedback analysis. The general concept of local feedback dates back at least to a 1977 paper by Attar and Fraenkel [6]. This technique fit well with the problem we were facing. Local feedback analysis involves taking the content of the top ranked documents and generating an expanded query from it. This approach works best if the top ranked documents are relevant, and that is the assumption made when using the technique.
In our implementation the expanded query is generated using the following procedure. The user-supplied query is passed to the Lucene API to retrieve the matching lecture slides, and the top 5 documents are selected. The content of these 5 documents is then used to create a list that pairs each word with its rank. The rank of a word is based on the weighted frequency of the word in the content of the top ranked documents and in the query supplied by the user. The formula used is given in (2).

$$\mathrm{Rank} = 8q + 4d \tag{2}$$

In (2), q is the number of times the word appears in the query and d is the average number of times it occurs in the top documents. Once the words are ranked, we take the top 3 ranked words and combine them with the initial user query to generate the expanded query.

This approach falls a little short of what we plan to achieve, since it does not take into consideration words like "the" and "a" that could create noise in the word ranking. There are various techniques that could be used to disregard noise-creating words. One approach is to create a stop list and ignore the words in the list during the query expansion process. The other approach is to employ a mechanism that gives lower weight to common words. Our approach was to use our larger collection of lecture slides to identify common words and give them lower weight. For each word being considered, we calculated the inverse document frequency (IDF) value using (3).

$$\mathrm{IDF}(w) = \log(N/n) \tag{3}$$

In (3), w is the word being measured, N is the number of documents in the collection and n is the number of documents that contain the word. Once the IDF value of a word is calculated, it is multiplied by the weight of the word from (2) to get the value used to rank the word. Essentially, this is the same as calculating TF 5 * IDF to rank words, which in turn gives common words lower weight.
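The procedure above, with the weight from (2) multiplied by the IDF from (3), can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CD-MANIC module: the names `tokenize`, `expand_query` and `SLIDES` are hypothetical, the sample slide texts are invented, the top documents are passed in directly instead of being retrieved through Lucene, words already in the query are assumed to be skipped when picking the added terms, and ties are broken alphabetically.

```python
import math
import re
from collections import Counter

# Hypothetical slide texts standing in for the indexed lecture collection.
SLIDES = [
    "the file system keeps a directory handle for the nfs directory handle",
    "the nfs file system returns a directory handle",
    "the remote objects live on a server",
    "the locking protocol is strict",
    "the server uses a cache",
]


def tokenize(text):
    """Lowercase the text and split it into words (hyphenated words stay whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def expand_query(query, top_docs, collection, extra_terms=3):
    """Append the best-ranked words from top_docs to the query.

    Each candidate word is ranked by (8q + 4d) * log(N / n), where
    q = occurrences of the word in the query,
    d = average occurrences of the word in the top documents,
    N = number of documents in the collection,
    n = number of collection documents containing the word.
    """
    if not top_docs:
        return query  # nothing to learn from; leave the query unchanged

    q_counts = Counter(tokenize(query))
    top_tokens = [tokenize(doc) for doc in top_docs]
    collection_sets = [set(tokenize(doc)) for doc in collection]
    total_docs = len(collection_sets)

    ranks = {}
    for word in {w for tokens in top_tokens for w in tokens}:
        q = q_counts.get(word, 0)
        d = sum(tokens.count(word) for tokens in top_tokens) / len(top_tokens)
        n = sum(1 for words in collection_sets if word in words)
        if n == 0:
            continue  # word not in the collection: IDF is undefined
        ranks[word] = (8 * q + 4 * d) * math.log(total_docs / n)

    # Assumption: query words are not re-added; ties are broken alphabetically.
    ordered = sorted(ranks, key=lambda w: (-ranks[w], w))
    new_terms = [w for w in ordered if w not in q_counts][:extra_terms]
    return " ".join([query] + new_terms)
```

With the sample data, `expand_query("file system", SLIDES[:2], SLIDES)` returns `"file system directory handle nfs"`. The word "the" occurs in every slide, so its IDF is log(1) = 0 and it can never outrank a topical word, which is the effect the IDF weighting is meant to achieve.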
This means that the ranking of terms is done using (4). Table 2 shows sample queries and their expansions.

$$\mathrm{Rank} = (8q + 4d) \cdot \log(N/n) \tag{4}$$

5 TF = Term Frequency

TABLE 2 SAMPLE OF QUERY EXPANSION

ORIGINAL QUERY    EXPANDED QUERY
locking           locking strict two-phase
file system       file system directory handle nfs
Remote objects    remote objects server pointers system-wide

Learner Logger Interface

In the design of CD-MANIC, great emphasis was given to capturing student interaction with the system. The idea is to analyze the student interaction logs and improve the system by creating a student model that represents usage of the system. The Learner Logger interface was developed to meet this need. Since CD-MANIC is a standalone system that executes on the student's machine, the mechanism for capturing student interaction is implemented as a two-step process.

First, the ability to log the student's behavior within the context of the application was built into CD-MANIC through the logger interface. Actions such as clicking on the slide index, resizing the application, searching, and selecting items from the application menu are recorded in the log file by the logger interface, along with a date and time stamp. No personal information about the student is ever captured or recorded in the log. The logger interface stores the log data locally on the client machine in a text file consisting of a series of variable-length, comma-separated records.

The second step is sending the data to a place where it can be retrieved and reviewed. This can be initiated either by the student through a menu item or automatically by the application. In the case of automatic update, the system uploads the student's log data to the CD-MANIC log server at a given time interval. Students are uniquely identified using the serial number of their local hard drive.

The biggest advantage of the logger interface is that it allows us to study student interaction with the system and improve the system accordingly. It also allows us to capture student interaction even if the student is not connected to the Internet while using the system, as long as the student connects at some point.

We have noticed some disadvantages of our approach to logging. First, some students' log data may never be retrieved. For example, a student may only use the application when not connected to the network, so the logged data will never be sent to the server. Second, a single student may run the application on more than one machine. Since the logger interface does not attempt to identify a student based on personal information, and students are distinguished only by their hardware, a single student can appear as multiple students in the log data. These disadvantages do not appear to significantly affect the data collection process, given the amount and diversity of the data that has been received by the log server [7].

JNI Connector

One of the challenges we faced while integrating CD-MANIC, the Lucene API and the Google™ API was that the different modules were implemented in different programming languages. CD-MANIC was built using C++, while the Google™ API and Lucene were implemented in Java. We were able to find an open source implementation of Lucene in C++, but it was not stable enough to be used in a production environment. To solve this problem, we implemented a module using the Java JNI technology [8] that serves as a connector between the C++ implementation of CD-MANIC and the Java implementations of Lucene and the Google™ API.
The JNI connector provides a messaging service to the different modules of the system by passing calls from one module to another and performing the necessary data conversions so that the modules interact seamlessly. Refer to Figure 2 to see how this interaction occurs in the system.

EXPERIMENTS

To evaluate our work, we performed experiments on the Lucene API and on the effectiveness of our query expansion module for Google™ search. Our experiments emphasized precision and recall, which are calculated using (5) and (6).

$$\mathrm{Recall} = \frac{A}{A + C} \tag{5}$$

$$\mathrm{Precision} = \frac{A}{A + B} \tag{6}$$

where A = relevant documents retrieved, B = non-relevant documents retrieved, and C = relevant documents not retrieved.

While evaluating the Lucene API we observed that as precision goes down, recall goes up. This observation is consistent with the general belief that recall and precision are inversely related. It is possible to achieve a recall of 1 by returning every document that contains the query terms, but doing so reduces precision. In any implementation of a search engine, there is a tradeoff between precision and recall. Refer to Figure 4 for the Lucene experimental results.

Our evaluation of the query expansion produced interesting results. In general, it increased the precision of the search done over the Internet to some degree. However, if the local search results used to expand the query are not relevant to the subject matter, the expanded query results are not useful. The success of the query expansion therefore depends on the relevance of the local search results to the subject under consideration. In cases where the local search results are relevant, the expanded query does a good job of refining the search results we get from Google™. For example, a search for the query term "Java" with good local search results produced results that point to documents about Java RMI.
The documents retrieved by the local search discussed Java RMI extensively, so in this case query expansion improved the results we obtained from the Google™ search. The other interesting observation was that continuing to increase the number of terms by which the query is expanded has a negative impact on precision. For our query expansion module, we found that expanding the query by a maximum of 3 terms gives the best precision and that going above this threshold reduces it. Figure 5 depicts this result.

FIGURE 4 A GRAPH OF PRECISION VS. RECALL FOR LOCAL (LUCENE) SEARCH (axes: precision vs. recall)

FIGURE 5 A GRAPH OF PRECISION VS. NUMBER OF EXPANDED TERMS (axes: precision vs. number of expanded terms)
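For reference, the precision and recall measures in (5) and (6) can be computed from a list of retrieved documents and a set of relevance judgments as in the short sketch below. The function name and the document identifiers in the usage note are hypothetical and are not taken from the experiments reported here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as defined in (6) and (5).

    A = relevant retrieved, B = not relevant retrieved,
    C = relevant not retrieved.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    a = len(retrieved & relevant)
    b = len(retrieved - relevant)
    c = len(relevant - retrieved)
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    return precision, recall
```

For instance, if a search returns four slides of which two are relevant, and one relevant slide is missed, then A = 2, B = 2 and C = 1, giving a precision of 0.5 and a recall of 2/3. Returning more documents can only keep or raise recall, while precision typically falls, which is the tradeoff described above.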

Figure 6 shows the runtime evaluation of the Google™ search API. As expected, the time it takes to obtain search results from the Google™ API increases as the length of the query increases.

FIGURE 6 A GRAPH OF RUNTIME VS. NUMBER OF TERMS IN QUERY (axes: runtime in seconds vs. number of terms in query)

We have not yet completed our evaluation of student interaction with the system; data collection began with the Spring 2004 courses. We are receiving log data on a daily basis and expect to have sufficient data for evaluation soon. At this point, the collected data is not large enough to give us a complete picture of how the search feature is being used. We assume that we have not received enough log information because most CD-MANIC users use the system offline, so the collected data is not being sent to the log server. According to a survey we conducted, most users like the updated search feature and report that it improved their learning experience. The survey also indicated that some students use CD-MANIC at work, where a company firewall prevents CD-MANIC from sending the log data to the log server.

CONCLUSIONS AND FUTURE WORK

We described the improved search feature of CD-MANIC and the results we obtained. We showed that the query expansion mechanism we implemented for searching Google™ helped refine the search so that it lists relevant documents. We believe that adding this advanced search feature to CD-MANIC will help create a constructivist-learning environment. We are continuing to improve all versions of MANIC by adding new features:

- Incorporating audio/video indexing into MANIC to access information that is present in audio/video recordings of lectures [12].
- Adding the ability for MANIC to suggest different queries to the learner when the original query returns no or few results; for example, using spell checkers and technical thesauri to suggest spelling corrections or alternatives.
- Combining quiz/test results with Learner Logger data to generate queries based on a model of student comprehension; for example, data on the time spent navigating and searching for certain topics can be combined with test results to lead students to supplementary material from within the course, other courses or the WWW.

ACKNOWLEDGMENTS

We thank James Allan and Andrew McCallum of the Center for Intelligent Information Retrieval in the Department of Computer Science at UMASS Amherst for their help during the development of an improved search tool for CD-MANIC. This work was partially supported by the National Science Foundation under EIA

REFERENCES

[1] Thampuran, R. S., "A Multimedia Course Delivery System Combining Web and CD/DVD-Based Technologies", Thesis publication
[2]
[3] Google API
[4] Tianjin, P. R., "Probabilistic Query Expansion Using Query Logs", WWW 2002, Honolulu
[5] Jinxi, Xu, "Query Expansion using local and global document analysis"
[6] Attar, R., "Local feedback in Full-text retrieval systems", Journal of the Association for Computing Machinery
[7] Burleson, W., "An Empirical Study of Student Interaction with CD-based Multimedia Courseware", Proc. of American Society of Engineering Education, 2002
[8]
[9] J. D. Chase and Edward G. Okie, "Combining Cooperative Learning And Peer Instruction In Introductory Computer Science", Proceedings of SIGCSE 2000, 2000
[10] Said Hadjerrouit, "A Constructivist Approach to Object-Oriented Design and Programming", Proceedings of the 4th Annual Conference on Integrating Technology into Computer Science Education ITiCSE 99, 1999
[11] D. Jonassen, "Designing Constructivist Learning Environments", in Instructional Theories and Models, C. M. Reigeluth, Ed., 1998
[12]
