Information Retrieval Technique for MIR and P2P Network

Size: px

Start display at page:

Download "Information Retrieval Technique for MIR and P2P Network"

Egbert Potter
6 years ago
Views:

1 2014 年 5 月第十七卷二期 Vol. 17, No. 2, May 2014 Information Retrieval Technique for MIR and P2P Network Huei-Chen Hsu Ya-Li Chung Hui-Chun Chan

2 Web Journal of Chinese Management Review Vol. 17 No 2 1 Information Retrieval Technique for MIR and P2P Network ABSTRACT Huei-Chen Hsu Ya-Li Chung Hui-Chun Chan Peer-to-Peer (P2P) systems are application layer networks which enable networked hosts to share resources in a distributed manner. An important problem in such networks is to be able to efficiently search the contents of other peers. The focus of this paper is the motivation how technological innovations enable new value propositions. In this paper, the authors describe a P2P system music metadata registration and query resolution and present a survey of search techniques for information retrieval in P2P networks. These studies showed the technical feasibility of these ideas in a prototypical P2P environment and also present a realistic experimental evaluation and comparison of these techniques, using a distributed middleware infrastructure we have designed and implemented. Keywords: Information Retrieval; P2P network; music information retrieval (MIR) Huei-Chen Hsu Department of Marketing Management, TransWorld University Ya-Li Chung Department of Business Administration, TransWorld University Hui-Chun Chan Department of Business Administration, TransWorld University

3 Web Journal of Chinese Management Review Vol. 17 No 2 2 Introduction The advances in public networks and the deployment of powerful personal computing units by end users have brought a shift from the traditional Client-Server computing model to the Peer-to-Peer (P2P) computing model. There are many advantages to P2P networks, such as distributed computing and storage power, fault-tolerance, and reliability. In the P2P model, each node (peer) has a symmetric role of both a client and a server. Large numbers of peers collaborate in a dynamic and ad-hoc manner and share information in large-scale distributed environments without any centralized coordination. Recently the P2P model has also been proposed in as an alternative model to WWW-crawling based systems to cope with information that changes frequently. One of the greatest potential benefits of P2P networks [Baumann, 2003] is the ability to harness the collaborative efforts of users to provide semantic, subjective, and community-based tags to describe musical content. The digital content industry has to think about new value propositions for customers. It leads to successful adoption by end users. New value propositions being enabled by technological innovations are possible. In this paper we consider the information retrieval problem in P2P networks. Assume that each peer has a database (or collection) of documents that it shares in the network. The documents can be collections of text, audio, video or other semi-structured documents. A node searches for information by sending query messages to its peers. We present recent work in the field of music information retrieval (MIR) satisfying the requirements. The information retrieval problem is a more complex operation than traditional search techniques based on object identifiers or filenames, currently being used in P2P systems. The Information Retrieval (IR) community has over the years developed algorithms [Baumann & Klüter, 2002] for precise document retrieval in static data environments (such as a corpus of documents). However these methods are not directly applicable to P2P systems where there is no central repository, there are large numbers of documents and nodes are joining and leaving in a dynamic and ad-hoc manner. We believe that such new capabilities are an important step for making P2P systems applicable to a wide set of applications than simply object storage. After the presentation of a recent approach [Burns & Richard, 2006] to value-based modeling in e-commerce scenarios we present a survey of search techniques for information retrieval in P2P networks including our recently proposed Intelligent Search Mechanism and introduce the middleware simulation infrastructure and present some of our experimental results.

4 Web Journal of Chinese Management Review Vol. 17 No 2 3 Information Retrieval techniques for MIR and P2P The main challenge for information retrieval in P2P networks is to be able to guide the query to the sources that contain the most relevant answers in a fast and efficient way. In particular, our objective is to decrease the number of messages sent per query while at the same time maintain a high recall rate (i.e. search result quality). One could argue that both the ideas of Music Information Retrieval (MIR) and Peer-to-Peer networks (P2P) essentially started with Napster ( While a variety of novel ways of searching and retrieving music, especially in audio format, have been proposed, they haven t found their way into P2P networks and remain largely academic exercises. We present the search techniques using object identifiers rather than keywords. The Breadth First Search (BFS) Technique BFS is a technique widely used in P2P file sharing applications such as Gnutella [Gnutella, 2012]. BFS sacrifices its performance and network utilization in the sake of its simplicity. The BFS search protocol (figure 1a) in a peer-to-peer network N works as follows. A node q generates a Query message which is propagated to all of its neighbors (peers). When a peer p receives a Query request, it first forwards the query to all the peers, other than the sender, and then searches its local repository for relevant matches. If some node r receives the query and has a match, r generates a QueryHit message to transmit the result. The QueryHit message includes information such the number of corresponding documents and the network connectivity of the answering peer. When node q receives QueryHits from more than one peer, it may decide to download the file from the peer with the best network connectivity. QueryHit messages are sent along the same path that carried the Query messages; therefore no path information needs to be stored in the transmitted messages. The Random Breadth-First-Search (RBFS) Technique The Random Breadth-First-Search (RBFS) technique can dramatically improve over the naive BFS approach. In RBFS (figure 1b) a peer q forwards a search message to only a fraction of its peers, selected at random. The fraction of peers is a parameter to the mechanism. The advantage of RBFS is that it does not require global knowledge; a node is able to make local decisions in a fast manner since it only needs to select a portion of its peers. On the other hand, this algorithm is probabilistic. Therefore some large segments of the network may become unreachable because a node was not able to understand that a particular link would lead the query to a large segment of the network. The Intelligent Search Mechanism (ISM) The Intelligent Search Mechanism (ISM) is a new mechanism for information retrieval in P2P networks (figure 1c). The objective of the algorithm, which is to help the querying peer is to find the most relevant answers to its query quickly and efficiently.

5 Web Journal of Chinese Management Review Vol. 17 No 2 4 Keys to improving the speed and efficiency of the information retrieval mechanism is to minimize the communication costs, that is, the number of messages sent between the peers, and to minimize the number of peers that are queried for each search request. To achieve this, a peer estimates for each query, which of its peers are more likely to reply to this query, and propagates the query message to those peers only. Directed BFS and the Most Results in Past (>RES) Heuristic In Yang [Yang & Garcia-Molina, 2002] present a technique where each node forwards a query to a subset of its peers based on some aggregated statistics (figure 1d). The authors compare a number of query routing heuristics and mention that the Most Results in Past (>RES) heuristic has the most satisfactory performance. A query is satisfied if Z, for some constant Z, or more results are returned. In >RES a peer q forwards a search message to k peers which returned the most results for the last m queries. In their experiments they chose k = 1 and m = 10 turning in that way their approach from a Directed BFS into a Depth-First-Search approach. Figure 1 Routing Query Messages in a P2P Network Using ISM works well in environments which exhibit strong degrees of query locality and where peers hold some specialized knowledge. One problem of the ISM mechanism is that search messages may get locked into a cycle and consequently fail to explore other parts of the network. To solve this problem, we pick a small random subset of peers and add it to the set of most relevant peers for each query. As a result, with high probability the mechanism will explore a larger part of the network and will learn about the contents of additional peers. P2P Community Platform P2P system that provides flexible search semantics based on attribute-value (AV) pairs

6 Web Journal of Chinese Management Review Vol. 17 No 2 5 and supports automatic extraction of musical features and content-based similarity retrieval. It is a very natural paradigm to process the content where it is, in this case at the private computers of the consumers. We will first make some remarks why a mainstream music service should follow a P2P paradigm in order to be successful. In addition to this a P2P protocol offers the possibility to implement peer clustering by content to realize such issues as collaborative retrieval on top. This step offers the potential to have automatic and dynamic community building (e.g. special genre interest groups) over the time the network evolves. These features fall into both categories of the value type framework: efficiency = reduction of search time and esthetics = exploring the beauty of recommended, previously unknown but relevant new tracks. JXTA was chosen as the platform for developing our integrated approach. We used straight forward the following functionalities of the client to realize our own version M-Peer: (1) Peer Groups: can be used to model different interest groups according to specific styles, genres or artist fan communities. (2) Group Chat, 1 to 1 Chat: supporting the verbal communication about preferred music has been one of the cornerstones in our design of an intelligent P2P music platform since we aim at extracting cultural metadata from chats in the future (see Fig. 2). (3) File Sharing: there is nothing special about this feature, but we realized that we would have a great benefit by adding our special additional tags (e.g. pre-computed audio profiles, extended meta-tags such as similar artists, etc.) into the existing MP3 file format which is supported at hand by its mime type. (4) Meta-tag representation: by using ID3v2 as part of the MP3 file format we were able to supply the interesting musical features using a well-established format, having a critical mass of users, resp. standard applications such as MP3 players, playlist and cataloguing tools to read and store this format. In this way simple file-sharing enters a totally new dimension of quality since each MP3 file can be interpreted as a self-contained music knowledge container. (5) File Search: at this point we added again our component for a fuzzy match of phonetic misspelled queries to correct entities in the artist field. (6) Presentation of metadata attributes and values: after a computation of similar songs by using the music similarity metrics we added the according artists as a new meta-tag. Furthermore standard ID3 tags such as artist and genre are presented (see Fig.3).

These features can be used for the determination of music similarity.

7 Web Journal of Chinese Management Review Vol. 17 No 2 6 Figure 2 MPeer Client: Embedded recommendations Figure 3 M-Peer Client: Meta-tag Search The aforementioned automatic audio analysis recognizes properties about loudness, tempo features. These features can be used for the determination of music similarity. Lots of authors have presented different approaches in the recent past to tackle this problem which is a research issue in music information retrieval (MIR). An excellent introduction to the entire MIR field and related communities can be found in [Futrelle & Downie, 2002]. While we are interested in providing recommendations for similar music other authors have coped with the problem of identifying audio tracks in large music databases even for distorted audio queries [Allamanche et. al., 2001]. Such methods are essential to realize stable digital right management systems (DRM) either for central server or P2P approaches. In our perspective on consumer value we focus on the value for end-users, which is not directly increased by using DRM technology. An initial description of the system presented in this paper can be found in Grimm and Nützel [Grimm & Nützel, 2002]. The main difference of this paper from previous work is presenting an alternative model without copy protection for a friendly P2P approach

8 Web Journal of Chinese Management Review Vol. 17 No 2 7 enabling users to re-distribute musical content. Users do not have to care for DRM and have the freedom to receive, consume and re-distribute multimedia content. Re-distribution allows for earning credits and represents therefore increased consumer value. In this way their approach can be seen as complementary to our work [Pachet & Aucouturier, 2003]. Wang, Li and Shi [Wang et al., 2002] evaluated four different P2P models with integrated content-based music retrieval. They showed how acceleration of retrieval could be achieved in large-scale distributed music networks. Therefore they reach an increase of consumer value by improving the extrinsic value of efficiency for music selection. The paper outlines a very generic content-based retrieval method missing details about the audio features and similarity measure. It seems to be optimized for exact retrieval using audio extracts or sung queries. In contrast to our approach recommendations for similar music based on sound or cultural context are neglected. Following recent work in P2P networks [Pfeiffer & Vincent, 2001]; our system decouples the process of locating content from the process of downloading it. Each node in the network not only stores music files for sharing but also information about the location of other music files in the network. Therefore when the user submits a search to the P2P system, the system returns a set of peers from which the music files that satisfy the search criteria can then be downloaded. Each file in the system is described by a Music File Description (MFD) which essentially is a set of attribute-value (AV) pairs. The user as well as automatically extracted features for describing musical content (underlined). The Music Feature Extraction Engine (MFEE) is the component that calculates these features from the audio files (see Figure 4). These automatically extracted features in addition to being used for content-based similarity search can also be used for various other types of audio analysis such as musical genre classification. There are two operations that are supported by the system and both take MFD as arguments. In registration, a new music file is made available for sharing and its associated MFD is registered to the P2P system. During searching, the user query is converted into an appropriate MFD which is then used to locate the nodes that contain files that match the search criteria. Once the nodes are located they are contacted directly to start the actual downloading for the file. The main concern in the design of our system is the efficient content discovery rather than the actual downloading mechanism [Logan, 2000].

9 Web Journal of Chinese Management Review Vol. 17 No 2 8 Figure 4 Software architecture on a peer node

10 Web Journal of Chinese Management Review Vol. 17 No 2 9 PeerWare Simulation Infrastructure & Experiments In order to benchmark the efficiency of the information retrieval algorithms, we have implemented PeerWare 3, a distributed middleware infrastructure which allows us to benchmark different query routing algorithms over large-scale P2P systems. For the experiments we deployed 104 nodes running on a network of 50 workstations, each of which has an AMD Athlon4 1.4 GHz processor with 1GB RAM running Mandrake Linux 8.0 (kernel ) all interconnected with a 10/100 LAN. Our evaluation metrics were: (i) the recall rate, that is, the fraction of documents each of the search mechanisms retrieves, and (ii) the efficiency of the techniques, that is, the number of messages used to find the results. To test the applicability of the information retrieval algorithms in P2P systems, we chose to implement only the algorithms that require local knowledge when searching for the documents (i.e. BFS, RBFS, >RES and ISM). RBFS, >RES and ISM all forward some query message to only half of the neighbors that would be contacted with BFS. Additionally for >RES, we set k = 0:5 * d and m = 100, where d is the degree of a node. In this way a peer propagates the request selectively to half of its peer which returned the most results in the past 100 queries. In our first experiment we performed 10 queries, each of which was run 10 consecutive times, and where each host has an average degree of 8 connections. The queries are keywords randomly sampled from the dataset. Figure 5 (top, left) shows the number of messages required by the four query routing techniques. The figure indicates that Breadth-First-Search (BFS) requires almost 2,5 times as many messages as its competitors with around 1050 messages per query. BFS's recall rate is used as the basis for comparing the recall rate of the other techniques and is therefore set to 100%. Random Breadth-First-Search (RBFS), the Intelligent Search Mechanism (ISM) and the Most Results in the Past (>RES) on the other hand use all significantly less messages but ISM is the one that finds the most documents. That is attributed to the fact that ISM improves its knowledge over time. More specifically, ISM achieves almost 90% recall rate while using only 38% of the BFS's messages. Figure 5 (top, right) illustrates that both >RES and ISM start out with a low recall rate (i.e %) because they are initially both choosing their neighbors at random. Therefore their recall rate performance is comparable to that of RBFS. The values shown are the averages of 10 consecutive requests. In the second experiment we investigated the effect of increasing the TTL parameter to our results. Figure 5 (bottom, left) shows that by increasing the value of TTL to 5 the ISM mechanism discovers almost the same documents with what BFS finds for TTL=4. More specifically, our experimental results show that our mechanism achieves 100% recall rate (Figure 2 (bottom, right)) while using only 57% of the number of messages used in BFS. We have also shown that ISM has approximately the same query response time (QRT) (~250ms) with its two competitors (RBFS and >RES) and far less QRT than BFS (which required 60% more

11 Web Journal of Chinese Management Review Vol. 17 No 2 10 time). Figure 5 Messages (left) and Recall Rate (right) used by the 4 Algorithms for 10x10 queries for (TTL=4) (top) and (TTL=5) (bottom)

12 Web Journal of Chinese Management Review Vol. 17 No 2 11 Conclusions In this paper we presented a number of different query routing techniques that enable efficient information retrieval in P2P systems. Existing techniques are not scaling well because they are either based on the idea of preceding the network with queries or because they require some form of global knowledge. We have shown the various tradeoffs and experimentally evaluated four of the techniques that require no global knowledge. The main challenge for a query routing technique is to query peers that contain the most relevant content with minimum messaging. We described technological solutions may act as value enabling tools in e-commerce scenarios for virtual music. Advanced features for artist recommendations based on audio and cultural features may act in the same way. We showed the technical feasibility of these ideas in a prototypical P2P environment. Another important aspect of the proposed system is that new attributes can be incorporated into the system with minimum effort. In addition, user studies need to be conducted to explore how the users interact with the system and what are their typical queries. In this paper we consider the information retrieval problem in P2P networks. In particular, our objective is to decrease the number of messages sent per query while at the same time maintain a high recall rate (i.e. search result quality). Also, we present a survey of search techniques for information retrieval in P2P networks, including recent techniques and express a realistic experimental evaluation and comparison of these techniques, using a distributed middleware infrastructure we have designed and implemented.

13 Web Journal of Chinese Management Review Vol. 17 No 2 12 REFERENCES Allamanche, E., Herre, J., Hellmuth, O., Froba, B.,Kastner, T., & Cremer, M. (2001). Content-based identification of audio material using MPEG-7 low level description. In proceedings of the 2 nd ISMIR 2001, Indiana University Bloomington, Indiana, USA. Baumann, S. (2003). Music similarity analysis in a P2P environment. In proceedings of the 4 th WIAMIS 2003, London, UK. Baumann, S., & Klüter, A. (2002). Super-convenience for non-musicians: Querying MP3 and the semantic web. In proceedings of the 3 rd ISMIR 2002, Paris, France. Burns, K. S., & Lutz, R. J. (2006). The function of format: Consumer responses to six on-line advertising formats. Journal of Advertising, 35(1), Futrelle, J., & Downie, J. S. (2002). Interdisciplinary communities and research issues in music information retrieval. In proceedings of the 3 rd ISMIR 2002, Paris, France. Gnutella (2012). Grimm, R., & Nützel, J. (2002). Peer-to-peer music-sharing with profit but without copy protection. In proceedings of the 2 nd WEDELMUSIC 2002, Darmstadt, Germany. Logan,S. (2000). Mel frequency cepstral coefficients for music modelling. In proceedings of the 1 st ISMIR 2000, Plymouth, USA. Pachet, F., & Aucouturier, J. (2003). Representing musical genre: A State of the Art. Journal of New Music Research, 32(1), Pfeiffer, S., & Vincent, T. (2001). Formalization of MPEG-1 compressed domain audio features. Report No 01/196 of CSIRO Mathematical and Information Sciences, Australia. Wang, C., Li, J., & Shi, S. (2002). A kind of content-based music information retrieval method in a peer-to-peer environment. In proceedings of the 3 rd ISMIR 2002, Paris, France. Yang, B., & Garcia-Molina, H. (2002). Efficient search in peer-to-peer networks. In proceedings of the 22 nd ICDCS2002, Vienna, Austria.

Exploiting Locality for Scalable Information Retrieval in Peer-to-Peer Networks

Exploiting Locality for Scalable Information Retrieval in Peer-to-Peer Networks D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios Gunopulos Department of Computer Science and Engineering University of California