Advanced techniques for management of personal digital music libraries

Jukka Rauhala
TKK, Laboratory of Acoustics and Audio Signal Processing

Abstract

In this paper, advanced techniques based on music information retrieval (MIR) are reviewed in the context of personal digital music library management. There is a growing need for improved search services, as the size of personal digital music libraries is growing rapidly. MIR-based approaches offer advanced methods for extracting metadata, such as genre and mood information, directly from audio files. Moreover, audio features extracted with MIR can be used to form a statistics-based profile of the user's musical taste, which can be used to generate user-tailored playlists. In addition, query by humming (QBH) is a MIR technique that offers a new way to search effectively for a known music track in a large library. Finally, the introduced methods are discussed in the context of personal digital music libraries.

1 INTRODUCTION

Many people nowadays have a personal digital music library on their desktop or in their mobile player due to the increasing popularity of MP3 players. As these libraries can easily reach a size equivalent to hundreds of CD records, there is a huge need for efficient library management techniques, more specifically, for techniques that improve and speed up the search for music. Currently, a typical MP3 player offers ways to search for music based on different classifications, such as artist, album, and genre information, that have been predefined by the user or downloaded from an Internet server. Music information retrieval (MIR) based approaches (Downie, 2003) offer new, interesting ways to search for music. MIR is a research topic focusing on developing algorithms to extract information from, e.g., audio files or MIDI files. This information can be used for the classification of data and, hence, for digital music library management. The MIR-based approach has two major advantages over the current approach. First, an Internet connection is not needed to provide classification metadata. Second, the user can tailor the metadata according to his preferences by controlling the MIR-based methods via parameters. In this paper, four interesting techniques are examined: automatic genre recognition, automatic mood detection, intelligent recommendation, and the query by humming (QBH) system. Automatic genre recognition means that the algorithm detects

the genre of the music from audio data. Tzanetakis and Cook (2002) have introduced an algorithm that is examined in this work. Another way to enhance digital library management is to extract emotional information from audio files and to offer a way to create playlists on the fly based on emotions. There are several existing approaches for this, e.g., by van Breemen and Bartneck (2003), by Tolos et al. (2004), and by Li and Ogihara (2004). Moreover, intelligent recommendation methods use some kind of intelligent logic to form a profile of the user's musical taste. They offer a way for the user to tell the system whether he likes or dislikes the song proposed by the system. The system then combines the user input with the obtained MIR information to improve the profile. The system can also extract information from the environment in order to make the profile dependent on environmental variables. A number of such systems, both commercial (e.g., Last.fm (2006)) and academic (e.g., Lifetrak (Reddy and Mascia, 2006)), exist nowadays. From the digital music library point of view, learning-based methods provide ways to develop profiles that can be used for creating playlists. The fourth advanced technique, QBH (McNab et al., 1997), is one of the most popular MIR-related research topics at the moment. In QBH, the user can search the digital music library by humming a short melody from a certain song. The algorithm then searches the database and suggests the songs that best match the input signal. Existing QBH methods are still in their infancy, as they do not reach the desired robustness due to several challenges, such as the analysis of the humming.

There has been a lot of research in the area of managing digital music libraries, mostly focusing on public libraries located on the Internet or in online music stores. For instance, most QBH research projects, such as the MELody index (McNab et al., 1997), incorporate a public online digital music library. However, most of the research results can be applied to personal digital music libraries as well. Additionally, Pachet et al. (2004) have presented a personal digital music library that uses advanced techniques based on MIR.

This paper is organized as follows. First, personal digital music libraries are introduced in Section 2. Then, current systems for managing digital music libraries are presented in Section 3. In Section 4, the four MIR-based management techniques are presented. The introduced techniques are discussed from the personal digital music library point of view in Section 5, followed by the conclusions in Section 6.

2 PERSONAL DIGITAL MUSIC LIBRARIES

In the past few years, many people have acquired their own personal digital music library. This has been made possible by the rapid increase of devices that combine the capability to play digital music with large storage memory. Some examples of these devices, the Apple iPod MP3 player and the Nokia N91 mobile phone, are shown in Figure 1. Since most of these devices are portable, it is common to have a large digital music library stored on a PC and a copy of the library (or a limited version of it) on the portable device.

Figure 1. Pictures of the Apple iPod MP3 player (left) and the Nokia N91 mobile phone (right).

There are three main reasons that have enabled the rapid growth of the market for devices with a built-in digital music library. First, the introduction of advanced auditory-based audio codecs has reduced the need for memory. The most important milestone was the MPEG-1 Layer 3 (MP3) codec, which is the most popular audio codec for digital music at the moment. Second, the processing power of portable devices, not to speak of PCs, has increased over the years. Hence, almost all portable devices are able to decode MP3 data in real time. Third, the size and price of hard disks have decreased, which means that it is nowadays possible to store digital music equivalent to hundreds of CDs on a portable device.

A very important part of a digital music library is the user interface and the management system. The importance of these components has only grown as the size of digital music libraries has increased rapidly due to cheap storage memory. Moreover, portable digital music players place major restrictions on the user interface, which further limit the management system. Hence, there is a growing need for advanced digital library management techniques.

3 DIGITAL MUSIC LIBRARY MANAGEMENT SYSTEMS

In general, there are two main actions in digital music library management when considered from the user's point of view: the addition and removal of music tracks, and the playing of music tracks. An important part of playing is searching for the music tracks to be played. In this work, we concentrate on the searching part of the management process, as it is the most challenging part. The search for music tracks can be divided into two approaches: a known-item search and a general search. In the known-item search, the user wants to play, for example, a certain song or album. The user might not remember all the details about the particular track(s), but the management system should assist the user by providing a fast way to search for the tracks. The other option is the general search, where the user just wants to play some music without having anything particular in mind. However, the user might have priorities concerning, for example, mood, genre, artist,

etc., which the management system should be able to take into account.

At the moment, the most popular personal digital music library applications are Microsoft's Windows Media Player, Apple's iTunes, and Winamp. The majority of the available systems rely on metadata classifying the music tracks, which is obtained from an Internet server or entered by the user. An example of a metadata format is ID3v2 (ID3v2, 2006), which is supported by the three applications mentioned above. The most important classifications specified in ID3v2 are shown in Table 1. Moreover, Microsoft Windows Media Player version 10 provides automatic playlists, such as "Music tracks I dislike", "Favorites - listen to at night", and "Favorites - listen to on weekends" (Microsoft, 2005).

Table 1. Selection of ID3v2 tags (ID3v2, 2006).
Album title, Composer, BPM (beats per minute), Lyricist, Language, Mood, Lead performer, Band, Conductor, Publisher, Track number, Album, Performer, Track, Initial key

An interesting new standard, which will be included in future digital music libraries, is MPEG-7 (Martínez, 2002). MPEG-7 is not an audio coding standard but a metadata standard encapsulating a large variety of audio and video features. In addition to ordinary metadata features, such as the tags shown in Table 1, MPEG-7 specifies 17 low-level audio descriptors, which are listed in Table 2. These low-level features are common in MIR, and many MIR-based algorithms take advantage of them. Hence, MPEG-7 could be used in the future to assist MIR-based management algorithms by providing the data required in the process.

There has been some research on advanced management systems for personal digital music libraries. An example of this is the Sony Music Browser (Pachet et al., 2004), which uses MIR-based methods. A screenshot of the application is shown in Figure 2.
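To make the conventional, metadata-driven management described above concrete, the following sketch scans a folder of MP3 files, reads a few ID3v2 tags (cf. Table 1), and performs both a general search (by genre) and a known-item search (by title). It is only a minimal illustration, assuming the third-party mutagen package for tag reading; the folder path and example queries are hypothetical.

```python
# Minimal sketch of metadata-based library search.
# Assumes the third-party 'mutagen' package (pip install mutagen).
from pathlib import Path
from mutagen.easyid3 import EasyID3

def scan_library(root):
    """Read basic ID3v2 tags from all MP3 files under 'root'."""
    library = []
    for path in Path(root).expanduser().rglob("*.mp3"):
        try:
            tags = EasyID3(str(path))
        except Exception:
            continue  # skip files without readable ID3 tags
        library.append({
            "path": str(path),
            "title": tags.get("title", [""])[0],
            "artist": tags.get("artist", [""])[0],
            "album": tags.get("album", [""])[0],
            "genre": tags.get("genre", [""])[0],
        })
    return library

def general_search(library, genre):
    """General search: all tracks whose genre tag matches."""
    return [t for t in library if t["genre"].lower() == genre.lower()]

def known_item_search(library, title_fragment):
    """Known-item search: tracks whose title contains the given fragment."""
    return [t for t in library if title_fragment.lower() in t["title"].lower()]

if __name__ == "__main__":
    lib = scan_library("~/Music")                 # hypothetical library location
    print(len(general_search(lib, "Jazz")))       # general search by genre
    print(known_item_search(lib, "imagine"))      # known-item search by title
```

As the text notes, the quality of such a search is limited by the tags available, which is precisely the gap that MIR-based methods aim to fill.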

Table 2. MPEG-7 low-level audio descriptors.
Basic descriptors: Audio waveform, Audio power
Spectral descriptors: Audio spectrum envelope, Audio spectrum centroid, Audio spectrum spread, Audio spectrum flatness
Signal parameter descriptors: Harmonic ratio, Upper limit of harmonicity, Audio fundamental frequency
Timbral descriptors: Log attack time, Temporal centroid, Harmonic spectral centroid, Harmonic spectral deviation, Harmonic spectral spread, Harmonic spectral variation, Spectral centroid

Figure 2. A screenshot of the Sony Music Browser (Pachet et al., 2004).

4 ADVANCED MANAGEMENT TECHNIQUES

In this section, four advanced management techniques based on MIR are introduced. First, the basics of MIR are presented. Then, the four techniques, namely automatic genre recognition, automatic mood detection, the intelligent recommendation method, and QBH, are explained.

4.1 Introduction to music information retrieval (MIR)

MIR is a wide research field that includes the retrieval of all kinds of information from, for example, audio signals. Downie (2003) has defined seven classes, or facets, of information considered in MIR, as shown in Table 3. Of these seven classes, all except the textual and bibliographic classes can, at least in theory, be extracted directly from an audio signal. Moreover, the bibliographic class differs from the other classes in that it cannot be derived from the content.

Table 3. Seven classes of music information according to Downie (2003).
Pitch:         perceived frequencies of pitched tones, intervals, keys
Temporal:      duration of musical events: tempo, meter, pitch duration, harmonic duration, and accents
Harmonic:      relations between multiple pitched notes
Timbral:       everything related to tone color
Editorial:     fingerings, dynamic instructions, articulations, etc.
Textual:       lyrics
Bibliographic: music metadata: title, composer, performers, etc.

The audio features used in MIR algorithms can be divided into low-level and high-level audio features, where the low-level features can be extracted in a straightforward manner from the signal and the high-level features are usually determined from a group of low-level features. For instance, a common way to detect the key of a music signal (a high-level feature) is to determine a histogram of the pitches occurring in the signal (low-level features) and to compare it with predefined histograms determined for all keys. The low-level audio features are typically extracted from audio signals with common audio signal processing methods, such as the autocorrelation function (ACF), the short-time Fourier transform (STFT), and mel-frequency cepstral coefficients (MFCCs).

4.2 Automatic genre recognition

Musical genre can be defined as a categorical label created by humans to classify music tracks (Tzanetakis and Cook, 2002). Genres can be organized in a tree format; for example, heavy music is a top-level genre that can be divided into white metal and black metal. Current digital music library software includes genre classifications, usually obtained from the Internet, that can be used for searches and playlists. An automatic genre recognition algorithm would offer interesting enhancements. First, it could provide a genre classification for a large music library without requiring an Internet connection. Second, it would make it possible to control the genre classification with parameters, which is not possible with current software. Humans are usually very good at recognizing genres, whereas for computers it is a difficult task, and automatic systems usually fail to reach the accuracy of humans.
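Before looking at a complete genre recognition system, the following sketch illustrates how a few of the low-level features mentioned in Section 4.1 (spectral centroid, zero crossings, MFCCs) can be extracted in practice. It is a minimal sketch, assuming the third-party librosa package; the particular feature set and frame parameters are illustrative and not those of any specific published system.

```python
# Sketch of low-level audio feature extraction (STFT/MFCC-based),
# assuming the third-party 'librosa' package (pip install librosa).
import numpy as np
import librosa

def extract_features(path, n_mfcc=5):
    """Return a small feature vector: spectral centroid, ZCR, and MFCC statistics."""
    y, sr = librosa.load(path, sr=22050, mono=True)              # decode and resample
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]  # per-frame values
    zcr = librosa.feature.zero_crossing_rate(y)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)       # shape (n_mfcc, frames)
    return np.concatenate([
        [centroid.mean(), centroid.var()],   # timbral texture statistics
        [zcr.mean()],
        mfcc.mean(axis=1),                   # means of the first MFCCs
        mfcc.var(axis=1),                    # variances of the first MFCCs
    ])

# Feature vectors like this can then be fed to a statistical classifier,
# as in the system described next.
```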

Tzanetakis and Cook (2002) have proposed a musical genre recognition system that uses a 30-dimensional feature vector including timbral texture, rhythm content, and pitch content features. Table 4 shows the full list of features used in the algorithm. The timbral texture features are determined by using the STFT and MFCCs, whereas the rhythm content features are obtained by calculating a beat histogram with a wavelet transform (WT). In the extraction of the pitch content features, the multi-pitch estimation algorithm by Tolonen and Karjalainen (2000) is used to determine the pitches in the signal, which are then rounded to the nearest MIDI note number. The resulting note data is used to form two kinds of histograms: an unfolded histogram and a folded histogram. In the unfolded histogram, each MIDI note corresponds to a single histogram bin, whereas in the folded histogram all notes are folded onto a single octave, so that each bin corresponds to a pitch class. Finally, standard statistical pattern recognition (SPR) methods are used to determine the genre classification from the feature vector.

Table 4. List of the features used in the genre recognition algorithm by Tzanetakis and Cook (2002).
Timbral texture features: mean of spectral centroid, variance of spectral centroid, spectral rolloff, spectral flux, zero crossings over the texture window, low energy, means of mel-frequency cepstral coefficients (5 parameters), variances of mel-frequency cepstral coefficients (5 parameters)
Rhythm content features: relative amplitudes of the first two peaks in the beat histogram (2 parameters), ratio of these amplitudes, periods of the first two peaks in beats per minute (2 parameters), overall sum of the beat histogram
Pitch content features: amplitude of the maximum peak of the folded histogram, period of the maximum peak of the unfolded histogram, period of the maximum peak of the folded histogram, pitch interval between the two most prominent peaks of the folded histogram, overall sum of the histogram

Tzanetakis and Cook have developed a graphical user interface, called GenreGram, for their genre classification algorithm. The software graphically indicates the results of the genre detection process, as seen in Figure 3. This software is part of MARSYAS, a freely available musical signal analysis package (MARSYAS, 2006).
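The unfolded and folded pitch histograms described above can be sketched as follows, assuming the multi-pitch stage has already produced a list of MIDI note numbers. This is an illustrative reconstruction of the idea, not the authors' implementation.

```python
# Sketch of the unfolded and folded pitch histograms used as pitch content
# features; assumes MIDI note numbers have already been estimated from audio.
import numpy as np

def pitch_histograms(midi_notes):
    """midi_notes: iterable of integer MIDI note numbers (0-127)."""
    notes = np.asarray(list(midi_notes), dtype=int)
    unfolded = np.bincount(notes, minlength=128)     # one bin per MIDI note
    folded = np.bincount(notes % 12, minlength=12)   # one bin per pitch class
    return unfolded, folded

def pitch_content_features(midi_notes):
    unfolded, folded = pitch_histograms(midi_notes)
    fa0 = int(folded.max())                  # amplitude of max peak, folded histogram
    up0 = int(unfolded.argmax())             # period (note) of max peak, unfolded histogram
    fp0 = int(folded.argmax())               # period (pitch class) of max peak, folded histogram
    second = int(np.argsort(folded)[-2])     # second most prominent pitch class
    interval = abs(fp0 - second)             # interval between the two most prominent peaks
    return [fa0, up0, fp0, interval, int(unfolded.sum())]

print(pitch_content_features([60, 64, 67, 60, 72, 64]))  # toy C major arpeggio
```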

Figure 3. A screenshot of GenreGram, a genre recognition application.

4.3 Automatic mood detection

Mood is strongly connected to musical performances: humans perceive emotions through listening to music. The perception varies from person to person, not to mention between Western and non-Western listeners, but some general rules can be determined. Roughly speaking, a song played in a major key with a fast tempo is perceived as happy, whereas a song played in a minor key with a slow tempo is perceived as sad. Table 5 shows one proposed mapping (Mancini et al., 2006) of acoustic cues to emotions. Emotional classification is not yet in wide use, even though it is included, for example, in ID3v2 as the mood tag. Again, an automatic mood detection algorithm would provide emotion classification without requiring an Internet connection. Automatic mood detection algorithms have been presented by, for example, Friberg et al. (2002), van Breemen and Bartneck (2003), Li and Ogihara (2004), Tolos et al. (2004), and Lu et al. (2006).
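As a toy illustration of such rules of thumb (the fuller cue-to-emotion mapping appears in Table 5 below), the following sketch assigns a coarse mood label from two high-level cues, mode and mean tempo. The thresholds are illustrative assumptions only; real detectors such as those cited above combine many more cues statistically.

```python
# Toy rule-based mood labelling from mode and mean tempo; the thresholds
# are illustrative assumptions, not values from any of the cited systems.
def rough_mood(mode, tempo_bpm, sound_level_db=None):
    """mode: 'major' or 'minor'; tempo_bpm: estimated mean tempo in BPM."""
    if mode == "major" and tempo_bpm >= 120:
        return "happy"
    if mode == "minor" and tempo_bpm < 90:
        return "sad"
    if tempo_bpm >= 120 and (sound_level_db is None or sound_level_db > -10):
        return "angry/energetic"
    return "neutral"

print(rough_mood("major", 140))   # -> 'happy'
print(rough_mood("minor", 70))    # -> 'sad'
```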

Table 5. Mapping of acoustic cues to emotions according to Mancini et al. (2006).
Acoustic cue       | Sadness                 | Anger                     | Happiness
Mean tempo         | slow                    | fast                      | fast
Tempo/timing       | large timing variations | small tempo variability   | small tempo variability and timing variations
Sound level        | low                     | high                      | high, little sound level variability
Articulation       | legato                  | staccato                  | staccato, large articulation variability
Duration contrasts | soft                    | sharp                     | sharp
Timbre             | dull                    | sharp                     | bright
Tone attacks       | slow                    | abrupt                    | fast
Micro-intonation   | flat                    | accent on unstable notes  | rising
Vibrato            | slow                    | large vibrato extent      | -
Ritardando         | final ritardando        | no ritardando             |

4.4 Intelligent recommendation

Intelligent recommendation is another advanced approach to digital music library management. Intelligent recommendation systems use a number of features obtained from the audio signal to form a user profile based on user feedback. Web-based services use the feedback from every user to improve the user profiles, but a statistics-based search in a personal digital music library can rely purely on the feedback of a single user. Based on the user profile, the system is able to recommend music tracks that it believes are in line with the user's musical taste. In addition, the system can use environmental variables to make the user profile take the environmental conditions into account. A block diagram of a typical intelligent recommendation system is shown in Figure 4. Commercial web-based systems utilizing a statistics-based approach have been launched recently, for example, Last.fm (2006), which is shown in Figure 5. Reddy and Mascia (2006) have proposed a statistics-based recommendation system that examines five environmental variables: space, time, kinetic, entropic, and meteorological.
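A minimal statistics-based profile of the kind described above can be sketched as a running mean of the feature vectors of liked tracks, with candidate tracks ranked by their similarity to that profile. This is an illustrative simplification of the structure in Figure 4, not any particular published system; the dislike weighting is an assumption.

```python
# Minimal sketch of a statistics-based recommendation profile: the profile is
# a weighted running mean of feedback feature vectors, and candidate tracks
# are ranked by cosine similarity to it. Purely illustrative.
import numpy as np

class TasteProfile:
    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.count = 0

    def feedback(self, features, liked):
        """Update the profile from one piece of user feedback."""
        sign = 1.0 if liked else -0.25   # dislikes push the profile away (assumed weighting)
        self.count += 1
        self.mean += sign * (np.asarray(features, dtype=float) - self.mean) / self.count

    def recommend(self, library, top_n=5):
        """library: dict track_id -> feature vector; returns the best matches."""
        def similarity(v):
            v = np.asarray(v, dtype=float)
            denom = np.linalg.norm(v) * np.linalg.norm(self.mean) + 1e-12
            return float(np.dot(v, self.mean) / denom)
        ranked = sorted(library.items(), key=lambda kv: similarity(kv[1]), reverse=True)
        return [track_id for track_id, _ in ranked[:top_n]]
```

Environmental variables, as in Lifetrak, could be appended to the feature vectors so that the same similarity ranking becomes context-dependent.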

Figure 4. A block diagram of an intelligent recommendation system.

Figure 5. Screenshot of the Last.fm application (2006). The user can give feedback by clicking the heart icon (positive), by clicking the cancel icon (negative), or by clicking the tag icon, which categorizes the song with a user-defined tag.

4.5 Query by humming (QBH)

QBH is an interesting MIR application for searching for music in large libraries. The user hums a short piece of music, and the QBH system performs a search to find the particular track. A simplified block diagram of a typical QBH system is shown in Figure 6. The QBH system consists of four blocks: humming transcription, music track transcription, a database of the symbolic data of the music tracks, and a search algorithm. First, the QBH system requires a predefined database of symbolic data of the music tracks. When the user hums or sings a line, the humming transcription block converts it to the symbolic form used in the database. Then, the search algorithm performs the search and returns the results to the user.
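The humming transcription step can be illustrated by converting an estimated fundamental frequency track into MIDI note numbers. The sketch below assumes the frame-wise f0 values (in Hz) have already been obtained with a pitch estimator; it is a simplified illustration, not a complete transcription front end.

```python
# Sketch of the humming-transcription step: map a frame-wise fundamental
# frequency track (Hz) to MIDI note numbers and collapse repeated frames
# into a note sequence. Assumes f0 estimation has already been done.
import math

def hz_to_midi(f0_hz):
    """Standard conversion: MIDI 69 = A4 = 440 Hz."""
    return int(round(69 + 12 * math.log2(f0_hz / 440.0)))

def f0_track_to_notes(f0_track):
    """f0_track: list of frame-wise f0 values in Hz (0 or None = unvoiced)."""
    notes = []
    for f0 in f0_track:
        if not f0:                      # skip unvoiced frames
            continue
        midi = hz_to_midi(f0)
        if not notes or notes[-1] != midi:
            notes.append(midi)          # new note only when the pitch changes
    return notes

print(f0_track_to_notes([0, 262, 262, 294, 330, 0, 330, 392]))  # -> [60, 62, 64, 67]
```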

Figure 6. Block diagram of a typical QBH system.

The first task in implementing a QBH system is to construct a database of the music data. The transcription from an audio signal to symbolic data can be done either manually, which is extremely laborious, or automatically. The goal is to transcribe all pitched notes in the music signal into a symbolic format, such as MIDI. There is a lot of ongoing research on automatic music transcription algorithms, for example, by Klapuri (2004). However, even the best algorithms at the moment fail to produce reliable results. The use of an automatic transcription algorithm would allow the database to be built inside the QBH software without requiring an Internet connection. The other option is to use an external database located on the Internet.

The second part of the QBH system is the transcription of the user input, which can be humming, singing, whistling, etc. Even though this part is much easier than the transcription of the music signal due to the monophonic nature of the input, there are still major challenges. First, it is very difficult to make the transcription independent of the user, so that anyone can use the system without calibration. Second, the system should be robust enough that the service does not require any musical skills.

Finally, the QBH system takes the transcribed input signal and performs a search to find the best matching tracks. The search should be robust to notes that are out of tune or not in the correct rhythm. Moreover, it should be independent of key and tempo. A number of database search algorithms have been proposed, for example, by Wiggins et al. (2002). In addition, one solution to improve the robustness is to examine the pitch differences between adjacent notes and to consider only the signs of these differences. Hence, the melodies can be coded as sequences of + and - symbols. As a result, the database search does not take into account whether the notes are input in tune and in the correct rhythm; what matters is whether the note relations are input correctly. The disadvantage of this solution is that the search criteria are loosened and the algorithm might return a large number of candidates.
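The sign-of-interval coding described above can be sketched as follows: both the hummed query and the database melodies are reduced to sequences of interval signs, and matching becomes a simple substring search. This is an illustrative simplification of real QBH matchers; the '=' symbol for repeated notes is an added assumption beyond the plain +/- scheme.

```python
# Sketch of contour ("+/-") coding and matching for QBH. Both the hummed query
# and each database melody are reduced to the signs of successive pitch
# intervals; matching is then a substring search, so the query may be
# transposed or slightly out of tune and still match.
def contour(midi_notes):
    out = []
    for prev, cur in zip(midi_notes, midi_notes[1:]):
        out.append("+" if cur > prev else "-" if cur < prev else "=")
    return "".join(out)

def qbh_search(query_notes, database):
    """database: dict track_id -> list of MIDI note numbers."""
    q = contour(query_notes)
    return [track_id for track_id, notes in database.items() if q in contour(notes)]

db = {
    "twinkle": [60, 60, 67, 67, 69, 69, 67],
    "scale":   [60, 62, 64, 65, 67, 69, 71],
}
print(qbh_search([62, 62, 69, 69, 71], db))   # hummed two semitones sharp, still matches 'twinkle'
```

The example also shows the disadvantage mentioned in the text: because the coding is so coarse, short queries can match many candidates, so practical systems need longer queries or additional cues such as rhythm.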

Figure 7. Screenshot of New York University's Query By Humming service (2006).

QBH systems have been proposed by, for example, McNab et al. (1997). Moreover, there are a number of web-based QBH interfaces for large digital music libraries, such as Muspedia (2006) and New York University's Query By Humming (2006). A screenshot of the New York University Query By Humming service is shown in Figure 7. There are also variations of QBH, for example, query by tapping (QBT), which uses only the rhythmic content. The advantage of this approach is that it is simpler than QBH; on the other hand, it might be more difficult to narrow down the results to the correct song.

5 DISCUSSION

Figure 8 displays how the advanced management techniques presented in the previous section can be applied to a personal digital music library. Automatic genre recognition and mood detection algorithms can provide genre and mood classification data, which can be used for general searches. In other words, these algorithms, as well as the intelligent recommendation method, can be applied to generate a playlist when the user just wants to listen to a certain type of music instead of specific tracks. On the other hand, when the user wants to play a specific track, the QBH method is a powerful way to make efficient searches in large databases. Table 6 summarizes the four techniques.

These new features would require small changes in the user interface. First, genre recognition and mood detection can be thought of as background processes that are not visible to the user. However, if the user is to control the genre recognition and mood detection via parameters, the corresponding controls need to be implemented in the user interface. Second, QBH requires a recording control for capturing the hummed input signal. Third, the intelligent recommendation system needs feedback controls in the user interface as well as the optional environmental detectors.

Figure 8. A block diagram of a personal digital music library management system that uses the advanced techniques.

Table 6. Comparison of the introduced techniques.
Technique                  | Maturity          | Search type | Advantages                                                     | Challenges
Genre recognition          | Under development | General     | Genre information without online connection, parameterization | Accuracy and robustness
Mood detection             | Under development | General     | Mood information without online connection, parameterization  | Accuracy, robustness, detection of a large number of moods
Intelligent recommendation | Good              | General     | Automatic generation of tailored playlists                     | Determining the user profile
Query by humming           | Under development | Known-item  | Powerful search with large databases                           | Robustness against inaccuracies in the input signal

6 CONCLUSION

In this paper, four advanced techniques for managing a personal digital music library have been presented: automatic genre recognition, automatic mood detection, intelligent recommendation, and QBH. Genre recognition, mood detection, and intelligent recommendation can be used for generating tailored playlists for the user, whereas QBH provides a fast way to search for a known track. These methods offer significant improvements to the management of current personal digital music libraries, especially with large databases. Furthermore, MPEG-7, a new metadata standard incorporating audio features used in MIR, is promising and will most likely be implemented in future personal digital music libraries. Hence,

it can be suggested that in the future personal digital music libraries will take advantage of the introduced techniques as well as other MIR-based methods. Future work includes implementing software that incorporates the presented features, which can then be used in usability testing.

REFERENCES

Downie, J. S. 2003. Music information retrieval. In Annual Review of Information Science and Technology 37, ed. Blaise Cronin. Medford, NJ. Chapter 7.

Friberg, A. 2002. A fuzzy analyzer of emotional expression in music performance and body motion. In J. Sundberg and B. Brunson (eds.), Proceedings of Music and Music Science. Stockholm, Sweden.

Klapuri, A. P. 2004. Signal processing methods for the automatic transcription of music. Ph.D. dissertation, Tampere University of Technology.

Last.fm. 2006. [Online].

Li, T. and Ogihara, M. 2004. Content-based music similarity search and emotion detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Montreal, Canada.

Mancini, M., Bresin, R., and Pelachaud, C. 2006. From acoustic cues to an expressive agent. In S. Gibet, N. Courty, and J.-F. Kamp (eds.), Gesture in Human-Computer Interaction and Simulation: 6th International Gesture Workshop. Berlin/Heidelberg: Springer.

MARSYAS. 2006. [Online].

Martínez, J. M., Koenen, R., and Pereira, F. 2002. MPEG-7: the generic multimedia content description standard, part 1. IEEE Multimedia, Vol. 9, No. 2.

McNab, R. J., Smith, L. A., Bainbridge, D., and Witten, I. H. 1997. The New Zealand Digital Library MELody index. D-Lib Magazine.

Microsoft. 2005. Mix your music in playlists. [Online].

Muspedia. 2006. [Online].

New York University Query By Humming. 2006. [Online].

Pachet, F., La Burthe, A., Zils, A., and Aucouturier, J.-J. 2004. Popular music access: the Sony Music Browser. Journal of the American Society for Information Science and Technology, Vol. 55, No. 12.

Reddy, S. and Mascia, J. 2006. Lifetrak: music in tune with your life. In Proceedings of the 1st ACM International Workshop on Human-centered Multimedia. Santa Barbara, USA.

Tolonen, T. and Karjalainen, M. 2000. A computationally efficient multipitch analysis model. IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6.

Tolos, M., Tato, R., and Kemp, T. 2004. Mood-based navigation through large collections of musical data. In Proceedings of the 2nd IEEE Consumer Communications and Networking Conference. Las Vegas, USA.

Tzanetakis, G. and Cook, P. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5.

van Breemen, A. and Bartneck, C. 2003. An emotional interface for a music gathering application. In Proceedings of the 8th International Conference on Intelligent User Interfaces. Miami, USA.

Wiggins, G. A., Lemström, K., and Meredith, D. 2002. SIA(M)ESE: an algorithm for transposition invariant, polyphonic content-based music retrieval. In Proceedings of ISMIR 2002, the Third International Conference on Music Information Retrieval. Paris, France.


More information

AN AUDIO PROCESSING LIBRARY FOR MIR APPLICATION DEVELOPMENT IN FLASH

AN AUDIO PROCESSING LIBRARY FOR MIR APPLICATION DEVELOPMENT IN FLASH 11th International Society for Music Information Retrieval Conference (ISMIR 2010) AN AUDIO PROCESSING LIBRARY FOR MIR APPLICATION DEVELOPMENT IN FLASH Jeffrey Scott, Raymond Migneco, Brandon Morton,Christian

More information

_APP B_549_10/31/06. Appendix B. Producing for Multimedia and the Web

_APP B_549_10/31/06. Appendix B. Producing for Multimedia and the Web 1-59863-307-4_APP B_549_10/31/06 Appendix B Producing for Multimedia and the Web In addition to enabling regular music production, SONAR includes a number of features to help you create music for multimedia

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 213 http://acousticalsociety.org/ ICA 213 Montreal Montreal, Canada 2-7 June 213 Engineering Acoustics Session 2pEAb: Controlling Sound Quality 2pEAb1. Subjective

More information

Effectiveness of HMM-Based Retrieval on Large Databases

Effectiveness of HMM-Based Retrieval on Large Databases Effectiveness of HMM-Based Retrieval on Large Databases Jonah Shifrin EECS Dept, University of Michigan ATL, Beal Avenue Ann Arbor, MI 489-2 jshifrin@umich.edu William Birmingham EECS Dept, University

More information

DESIGN AND ARCHITECTURE OF A DIGITAL MUSIC LIBRARY ON THE WEB

DESIGN AND ARCHITECTURE OF A DIGITAL MUSIC LIBRARY ON THE WEB DESIGN AND ARCHITECTURE OF A DIGITAL MUSIC LIBRARY ON THE WEB ABSTRACT In this paper, a Web-based digital music library based on a threetier architecture is presented. The digital library s primary goal

More information

Click Freegal Music from the surreylibraries.ca (hover over the blue Research and Downloads tab and select Downloads.

Click Freegal Music from the surreylibraries.ca (hover over the blue Research and Downloads tab and select Downloads. Freegal Quick Facts Freegal gives Surrey residents with a valid Surrey Libraries card 3 free songs per week. Residents can download and KEEP the songs. You simply log into Freegal with your library card

More information

Computer Vesion Based Music Information Retrieval

Computer Vesion Based Music Information Retrieval Computer Vesion Based Music Information Retrieval Philippe De Wagter pdewagte@andrew.cmu.edu Quan Chen quanc@andrew.cmu.edu Yuqian Zhao yuqianz@andrew.cmu.edu Department of Electrical and Computer Engineering

More information

Connecting your smartphone or tablet to the HDD AUDIO PLAYER through a Wi- Fi (wireless LAN) network [6]

Connecting your smartphone or tablet to the HDD AUDIO PLAYER through a Wi- Fi (wireless LAN) network [6] A specialized application for HDD AUDIO PLAYER HDD Audio Remote About the HDD Audio Remote Features of HDD Audio Remote [1] System requirements [2] Compatible HDD AUDIO PLAYER models [3] Trademarks [4]

More information

Music, Radio & Podcasts

Music, Radio & Podcasts Music, Radio & Podcasts *Buying Music *Streaming Music *Radio Online *Podcasts Buying Music (downloading): itunes Store, Amazon. Single tracks are mostly $1.29. Older music is less. Album prices vary.

More information

Content-based retrieval of music using mel frequency cepstral coefficient (MFCC)

Content-based retrieval of music using mel frequency cepstral coefficient (MFCC) Content-based retrieval of music using mel frequency cepstral coefficient (MFCC) Abstract Xin Luo*, Xuezheng Liu, Ran Tao, Youqun Shi School of Computer Science and Technology, Donghua University, Songjiang

More information

CS3242 assignment 2 report Content-based music retrieval. Luong Minh Thang & Nguyen Quang Minh Tuan

CS3242 assignment 2 report Content-based music retrieval. Luong Minh Thang & Nguyen Quang Minh Tuan CS3242 assignment 2 report Content-based music retrieval Luong Minh Thang & Nguyen Quang Minh Tuan 1. INTRODUCTION With the development of the Internet, searching for information has proved to be a vital

More information

memory product Doesn t play videos like the ipod Comes in 2, 4, and 8 Cost ranges from $135 to $225

memory product Doesn t play videos like the ipod Comes in 2, 4, and 8 Cost ranges from $135 to $225 The Apple ipod Is basically a hard drive with special software and a display Comes in 30, 60 and 80 GB sizes Price is about $230 to $330 Apple has sold over 100 million units 1 The Apple Nano Nano line

More information