Audio Content-Based Music Retrieval

Tutorial: Audio Content-Based Music Retrieval

Meinard Müller, Saarland University and MPI Informatik
Joan Serrà, Artificial Intelligence Research Institute, Spanish National Research Council

Meinard Müller
- 2007: Habilitation, Bonn, Germany
- 2007: MPI Informatik, Saarland, Germany
- Senior Researcher: Multimedia Information Retrieval & Music Processing
- 6 PhD students

Joan Serrà
- Ph.D., Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
- Postdoctoral Researcher, Artificial Intelligence Research Institute, Spanish National Research Council
- Music information retrieval, machine learning, time series analysis, complex networks, ...

Music Retrieval
- Textual metadata
  - Traditional retrieval: searching for artist, title, ...
  - Rich and expressive metadata: generated by experts; crowd tagging, social networks
- Content-based retrieval
  - Automatic generation of tags
  - Query-by-example

Query-by-Example
- A query is compared against a database to produce hits
- Example query: Beethoven, Symphony No. 5, Bernstein (1962)
- Example database content:
  - Beethoven, Symphony No. 5: Bernstein (1962), Karajan (1982), Gould (1992)
  - Beethoven, Symphony No. 9
  - Beethoven, Symphony No. 3
  - Haydn, Symphony No. 94
- Retrieval tasks:
  - Audio identification
  - Audio matching
  - Version identification
  - Category-based music retrieval

Query-by-Example: Taxonomy
- Specificity level: ranges from high specificity (audio identification) to low specificity (category-based music retrieval)
- Granularity level: fragment-based retrieval (audio identification, audio matching) versus document-based retrieval (version identification, category-based music retrieval)

Overview
- Part I: Audio identification
- Part II: Audio matching
- Coffee Break
- Part III: Version identification
- Part IV: Category-based music retrieval

Part I: Audio Identification

Audio Identification
- Database: Huge collection consisting of all audio recordings (feature representations) to be potentially identified.
- Goal: Given a short query audio fragment, identify the original audio recording the query is taken from.
- Notes:
  - Instance of fragment-based retrieval
  - High specificity
  - What is identified is not the piece of music but a specific rendition of the piece

Application Scenario
- User hears music playing in the environment
- User records a music fragment (5-15 seconds) with a mobile phone
- Audio fingerprints are extracted from the recording and sent to an audio identification service
- Service identifies the audio recording based on the fingerprints
- Service sends back metadata (track title, artist) to the user

Audio Fingerprints
An audio fingerprint is a content-based compact signature that summarizes a piece of audio content.
Requirements:
- Discriminative power
  - Ability to accurately identify an item within a huge number of other items (informative, characteristic)
  - Low probability of false positives
  - Recorded query excerpt is only a few seconds long
  - Large audio collection on the server side (millions of songs)
- Invariance to distortions
  - Recorded query may be distorted and superimposed with other audio sources
  - Background noise
  - Pitching (audio played faster or slower)
  - Equalization
  - Compression artifacts
  - Cropping, framing
- Compactness
  - Reduction of complex multimedia objects
  - Reduction of dimensionality
  - Making indexing feasible
  - Allowing for fast search
- Computational simplicity
  - Computational efficiency
  - Extraction of fingerprint should be simple
  - Size of fingerprints should be small

Literature (Audio Identification)
- Allamanche et al. (AES 2001)
- Cano et al. (AES 2002)
- Haitsma/Kalker (ISMIR 2002)
- Kurth/Clausen/Ribbrock (AES 2002)
- Wang (ISMIR 2003)
- Dupraz/Richard (ICASSP 2010)
- Ramona/Peeters (ICASSP 2011)

Fingerprints (Shazam)
Steps:
1. Spectrogram
   - Efficiently computable
   - Standard transform
   - Robust
2. Peaks (local maxima)
Robustness (peaks of the original and of the distorted signal differ only slightly):
- Noise, reverb, room acoustics, equalization
- Audio codec
- Superposition of other audio sources
[Figure: spectrogram with marked peaks; axes: time, frequency (Hz), intensity]
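The peak-picking step can be sketched as follows. This is a minimal illustration, not the Shazam implementation: the neighborhood sizes `dist_freq` and `dist_time` and the threshold `thresh` are hypothetical parameters chosen for the toy example.

```python
import numpy as np

def constellation_map(spec, dist_freq=2, dist_time=2, thresh=0.0):
    """Mark local maxima of a magnitude spectrogram (frequency bins x frames).

    A point counts as a peak if it exceeds `thresh` and is the maximum of its
    (2*dist_freq+1) x (2*dist_time+1) neighborhood (assumed peak criterion).
    """
    K, N = spec.shape
    cmap = np.zeros((K, N), dtype=bool)
    for k in range(K):
        k0, k1 = max(k - dist_freq, 0), min(k + dist_freq + 1, K)
        for n in range(N):
            n0, n1 = max(n - dist_time, 0), min(n + dist_time + 1, N)
            patch = spec[k0:k1, n0:n1]
            cmap[k, n] = bool(spec[k, n] > thresh and spec[k, n] == patch.max())
    return cmap

# Toy spectrogram with two isolated peaks.
spec = np.zeros((8, 10))
spec[2, 3] = 5.0
spec[6, 7] = 3.0
peaks = constellation_map(spec)
peak_list = sorted((int(k), int(n)) for k, n in zip(*np.nonzero(peaks)))
```

Only the peak coordinates are kept; the intensities are discarded, which is one reason the representation tolerates equalization and codec artifacts.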

Matching Fingerprints (Shazam)
- Database document (constellation map)
- Query document (constellation map)
1. Shift query across database document
2. Count matching peaks
3. High count indicates a hit (document ID & position)
[Figure: constellation maps of database and query (frequency in Hz over time) and the number of matching peaks as a function of the shift (seconds)]
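The shift-and-count procedure can be sketched directly on boolean constellation maps. The data below is random toy data, and the function name `match_counts` is made up for this illustration.

```python
import numpy as np

def match_counts(cmap_db, cmap_query):
    """Number of coinciding peaks for every time shift of the query."""
    N = cmap_db.shape[1]
    M = cmap_query.shape[1]
    counts = np.zeros(N - M + 1, dtype=int)
    for shift in range(N - M + 1):
        window = cmap_db[:, shift:shift + M]
        counts[shift] = np.sum(window & cmap_query)
    return counts

rng = np.random.default_rng(0)
db = rng.random((16, 200)) > 0.95        # sparse toy constellation map
query = db[:, 120:150].copy()            # excerpt starting at frame 120
query[:, ::7] = False                    # simulate peaks lost to distortion
counts = match_counts(db, query)
best_shift = int(np.argmax(counts))
```

This brute-force scan is linear in the database length, which motivates the hash-based indexing discussed next.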

Indexing (Shazam)
- Index the fingerprints using hash lists
- Hashes correspond to (quantized) frequencies
- Hash list consists of time positions (and document IDs)
- $N$ = number of spectral peaks
- $B$ = number of bits used to encode spectral peaks
- $2^B$ = number of hash lists
- $N / 2^B$ = average number of elements per list
Problem:
- Individual peaks are not characteristic
- Hash lists may be very long
- Not suitable for indexing
[Figure: frequency axis (Hz) quantized into Hash 1, Hash 2, ..., Hash $2^B$, with the list belonging to Hash 1]

Indexing (Shazam)
Idea: Use pairs of peaks to increase specificity of hashes
1. Peaks
2. Fix anchor point
3. Define target zone
4. Use pairs of points
5. Use every point as anchor point

Indexing (Shazam)
- New hash: consists of two frequency values and a time difference, $(f_1, f_2, \Delta t)$
- A hash is formed between an anchor point and each point in the target zone using two frequency values and a time difference.
- Fan-out (taking pairs of peaks) may cause a combinatorial explosion in the number of tokens. However, this can be controlled by the size of the target zone.
- Using more complex hashes increases specificity (leading to much smaller hash lists) and speed (making the retrieval much faster).
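A small sketch of pair hashing with an inverted index follows. The target zone is simplified here to "the next `fan_out` peaks within `max_dt` frames", and hits are found by voting on the time offset between database and query; both choices, like all names and toy peak lists, are assumptions made for illustration.

```python
from collections import defaultdict

def pair_hashes(peaks, fan_out=3, max_dt=20):
    """peaks: list of (time, freq). Returns [((f1, f2, dt), anchor_time), ...]."""
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break                      # left the (simplified) target zone
            if dt > 0:
                hashes.append(((f1, f2, dt), t1))
                paired += 1
                if paired == fan_out:
                    break
    return hashes

def build_index(docs):
    """Map each hash to the list of (document ID, anchor time) where it occurs."""
    index = defaultdict(list)
    for doc_id, peaks in docs.items():
        for h, t1 in pair_hashes(peaks):
            index[h].append((doc_id, t1))
    return index

def query_index(index, query_peaks):
    """Vote for (document ID, time offset); the highest count is the best hit."""
    votes = defaultdict(int)
    for h, tq in pair_hashes(query_peaks):
        for doc_id, td in index.get(h, []):
            votes[(doc_id, td - tq)] += 1
    return max(votes.items(), key=lambda kv: kv[1]) if votes else None

docs = {
    "A": [(0, 10), (3, 40), (5, 22), (9, 31), (12, 18), (15, 44)],
    "B": [(1, 12), (4, 35), (6, 27), (10, 30), (13, 19)],
}
index = build_index(docs)
query = [(t - 3, f) for t, f in docs["A"][1:5]]   # excerpt of A starting at time 3
best = query_index(index, query)                   # (("A", 3), 6)
```

Because each hash combines two frequencies and a time difference, a lookup returns far fewer candidates than a single-frequency hash list would.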

Indexing (Shazam)
Definitions:
- $N$ = number of spectral peaks
- $p$ = probability that a spectral peak can be found in the (noisy and distorted) query
- $F$ = fan-out of the target zone, e.g. $F = 10$
- $B$ = number of bits used to encode a spectral peak and a time difference, respectively
Consequences:
- $F \cdot N$ = number of tokens to be indexed
- $2^{B+B}$ = increase of specificity ($2^{B+B+B}$ instead of $2^{B}$)
- $p^2$ = probability of a hash to survive
- $p \cdot (1-(1-p)^F)$ = probability that at least one hash survives per anchor point
Example: $F = 10$ and $B = 10$
- Memory requirements: $F \cdot N = 10 \cdot N$
- Speedup factor: $2^{B+B} / F^2 \sim 10^6 / 10^2 = 10^4$ ($F$ times as many tokens in query and database, respectively)

Conclusions (Shazam)
Many parameters to choose:
- Temporal and spectral resolution in spectrogram
- Peak picking strategy
- Target zone and fan-out parameter
- Hash function
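The quantities above can be evaluated numerically. In this sketch $F$ and $B$ are taken from the example on the slide, while the peak survival probability `p = 0.5` is a made-up value used only to make the formulas concrete.

```python
F, B = 10, 10          # fan-out and bits, as in the example above
p = 0.5                # hypothetical peak survival probability (assumption)

specificity_gain = 2 ** (B + B)                 # 2^(B+B) = 1048576, roughly 10^6
pair_survival = p ** 2                          # both peaks of a pair must survive
anchor_survival = p * (1 - (1 - p) ** F)        # anchor survives and at least one partner does
speedup = specificity_gain / F ** 2             # roughly 10^6 / 10^2 = 10^4

print(f"pair survival:   {pair_survival:.4f}")
print(f"anchor survival: {anchor_survival:.4f}")
print(f"speedup:         {speedup:.0f}")
```

For this value of `p`, a single pair survives only a quarter of the time, but the fan-out raises the chance that an anchor contributes at least one usable hash to almost `p` itself.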

Fingerprints (Philips)
Steps:
1. Spectrogram (long window)
   - Efficiently computable, standard transform, robust
   - Coarse temporal resolution
   - Large overlap of windows
   - Robust to temporal distortion
2. Consider limited frequency range
   - 300-2000 Hz
   - Most relevant spectral range (perceptually)
3. Log-frequency (Bark scale)
   - 33 bands (roughly Bark scale)
   - Coarse frequency resolution
   - Robust to spectral distortions
4. Binarization
   - Local thresholding
   - Sign of energy difference (simultaneously along time and frequency axes)
   - Sequence of 32-bit vectors
[Figures: spectrogram (frequency in Hz, intensity), band representation (band, intensity), and bit representation (bit, state)]
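The binarization step can be sketched as below. The exact differencing scheme (difference between neighboring bands, then between consecutive frames, then take the sign) follows the commonly cited Haitsma/Kalker formulation and is an assumption here; the slides only state that the sign of the energy difference along both axes is used.

```python
import numpy as np

def sub_fingerprints(band_energy):
    """band_energy: (frames x 33) band energies. Returns (frames-1 x 32) bits."""
    E = np.asarray(band_energy, dtype=float)
    d_freq = E[:, :-1] - E[:, 1:]        # difference between neighboring bands
    d_both = d_freq[1:] - d_freq[:-1]    # difference of that quantity along time
    return (d_both > 0).astype(np.uint8)

rng = np.random.default_rng(1)
E = rng.random((257, 33))                # toy band energies, 257 frames
bits = sub_fingerprints(E)               # 256 sub-fingerprints of 32 bits each
```

With 33 bands, each frame yields exactly 32 bits, which matches the 32-bit sub-fingerprints described above.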

Fingerprints (Philips)
- Sub-fingerprint:
  - 32-bit vector
  - Not characteristic enough
- Fingerprint-block:
  - 256 consecutive sub-fingerprints
  - Covers roughly 3 seconds
  - Overlapping

Matching Fingerprints (Philips)
- Database document (fingerprint-blocks)
- Query document (fingerprint-block)
1. Shift query across database document
2. Calculate a block-wise bit-error-rate (BER)
3. Low BER indicates hit
[Figure: fingerprint-blocks of database and query (band, intensity) and the BER as a function of the shift (seconds)]

Indexing (Philips)
- Note:
  - Individual sub-fingerprints (32 bit) are not characteristic
  - Fingerprint blocks (256 sub-fingerprints, 8 kbit) are used
- Problem:
  - Computation of BER between query fingerprint-block and every database fingerprint-block is expensive
  - Chance that a complete fingerprint-block survives is low
  - Exact hashing problematic
- Strategy:
  - Only sub-fingerprints are indexed using hashing
  - Exact sub-fingerprint matches are used to identify candidate fingerprint-blocks in database
  - BER is only computed between query fingerprint-block and candidate fingerprint-blocks
  - Procedure is terminated when a database fingerprint-block is found where the BER falls below a certain threshold

Indexing (Philips)
- Database document (fingerprint-blocks)
- Query document (fingerprint-block)
1. Efficient search for exact matches of sub-fingerprints (anchor points)
2. Calculate BER only for blocks containing anchor points
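The anchor-point strategy can be sketched as follows: sub-fingerprints are hashed, exact matches propose candidate block positions, and the BER is computed only there. The threshold of 0.35, the 10% bit-flip rate, and all function names are assumptions for this toy example.

```python
import numpy as np
from collections import defaultdict

def pack(bits_row):
    """Pack a 32-element 0/1 vector into one integer used as hash key."""
    return int("".join(str(int(b)) for b in bits_row), 2)

def build_index(db_bits):
    """Map each sub-fingerprint value to the positions where it occurs."""
    index = defaultdict(list)
    for pos, row in enumerate(db_bits):
        index[pack(row)].append(pos)
    return index

def ber(block_a, block_b):
    """Bit error rate: fraction of differing bits between two blocks."""
    return float(np.mean(block_a != block_b))

def search(db_bits, index, query_block, threshold=0.35):
    """Return (start position, BER) of the first candidate below threshold."""
    L = len(query_block)
    for q_pos, row in enumerate(query_block):
        for db_pos in index.get(pack(row), []):      # exact anchor matches
            start = db_pos - q_pos
            if start < 0 or start + L > len(db_bits):
                continue
            err = ber(db_bits[start:start + L], query_block)
            if err < threshold:
                return start, err
    return None

rng = np.random.default_rng(2)
db_bits = rng.integers(0, 2, size=(2000, 32), dtype=np.uint8)
index = build_index(db_bits)
query_block = db_bits[700:956].copy()                # 256 sub-fingerprints
flip = rng.random(query_block.shape) < 0.1           # distort 10% of the bits
flip[::16] = False                                   # let a few sub-fingerprints survive intact
query_block[flip] ^= 1
hit = search(db_bits, index, query_block)
```

The search only succeeds if at least one sub-fingerprint of the query survives unchanged, which is why the slides call for additional fault-tolerance concepts.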

Conclusions (Philips)
- Comparing binary fingerprint-blocks expressing tempo-spectral changes
- Usage of some sort of shingling technique; see [Casey et al. 2008, IEEE-TASLP] for a similar approach applied to a more general retrieval task
- Acceleration using hash-based search for anchor points (sub-fingerprints)
- Concepts of fault tolerance are required to increase robustness
- Susceptible to distortions in specific frequency bands (e.g. equalization) or to superpositions with other sources

Conclusions (Audio Identification)
- Basic techniques used in the Shazam and Philips systems
- Many more ways to define robust audio fingerprints
- Delicate trade-off between specificity, robustness, and efficiency
- An audio recording is identified (not a piece of music)
- Does not allow for identifying a studio recording using a query taken from a live recording
- Does not generalize to identifying different interpretations or versions of the same piece of music

Part II: Audio Matching

Audio Matching
- Database: Audio collection containing
  - Several recordings of the same piece of music
  - Different interpretations by various musicians
  - Arrangements in different instrumentations
- Goal: Given a short query audio fragment, find all corresponding audio fragments of similar musical content.
- Notes:
  - Instance of fragment-based retrieval
  - Medium specificity
  - A single document may contain several hits
  - Cross-modal retrieval also feasible

Audio Matching
Beethoven's Fifth, various interpretations:
- Bernstein
- Karajan
- Scherbakov (piano)
- MIDI (piano)

Application Scenario
- Content-based retrieval
- Cross-modal retrieval

Literature (Audio Matching)
- Pickens et al. (ISMIR 2002)
- Müller/Kurth/Clausen (ISMIR 2005)
- Suyoto et al. (IEEE TASLP 2008)
- Casey et al. (IEEE TASLP 2008)
- Kurth/Müller (IEEE TASLP 2008)
- Yu et al. (ACM MM 2010)

Audio Matching
Two main ingredients:
1. Audio features
   - Robust but discriminating
   - Chroma-based features
   - Correlate to harmonic progression
   - Robust to variations in dynamics, timbre, articulation, local tempo
2. Matching procedure
   - Efficient
   - Robust to local and global tempo variations
   - Scalable using index structure

Audio Features
Example: Chromatic scale
- [Figure: spectrogram; axes: frequency (Hz), intensity (dB)]
- [Figure: log-frequency spectrogram; axes: MIDI pitch, intensity (dB)]
- [Figure: chroma representation; axes: chroma, intensity (dB)]
- [Figure: chroma representation (normalized, Euclidean); axes: chroma, intensity (normalized)]

Audio Features
Example: Beethoven's Fifth (Karajan, Scherbakov)
- Chroma representation (normalized, 10 Hz)
- Chroma representation (normalized, 2 Hz): smoothing (2 seconds) + downsampling (factor 5)
- Chroma representation (normalized, 1 Hz): smoothing (4 seconds) + downsampling (factor 10)

Audio Features
- There are many ways to implement chroma features
- Properties may differ significantly
- Appropriateness depends on respective application
- MATLAB implementations for various chroma variants
- ISMIR 2011, Poster Session (PS2), Tuesday 13-15
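The smoothing and downsampling step can be sketched as follows, assuming a 10 Hz chroma sequence as input. The use of a simple moving average is an assumption; the slides do not specify the window shape.

```python
import numpy as np

def smooth_downsample(chroma, win_len, factor):
    """chroma: (12 x N). Average over `win_len` frames, keep every
    `factor`-th frame, and normalize each column to unit Euclidean norm."""
    kernel = np.ones(win_len) / win_len          # moving-average window (assumed shape)
    smoothed = np.array([np.convolve(row, kernel, mode="same") for row in chroma])
    reduced = smoothed[:, ::factor]
    norms = np.linalg.norm(reduced, axis=0)
    norms[norms == 0] = 1.0                      # avoid division by zero for silent frames
    return reduced / norms

rng = np.random.default_rng(3)
chroma_10hz = rng.random((12, 600))              # toy data: 60 seconds at 10 Hz
chroma_2hz = smooth_downsample(chroma_10hz, win_len=20, factor=5)    # 2 s window
chroma_1hz = smooth_downsample(chroma_10hz, win_len=40, factor=10)   # 4 s window
```

Longer windows absorb more local tempo and articulation differences, at the price of a coarser temporal resolution.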

Matching Procedure
- Compute chroma feature sequences for the database (length N) and the query (length M)
- N very large (database size), M small (query size)
- Compare the query with each database subsequence of length M to obtain a matching curve

Matching Procedure
- Query: Beethoven's Fifth / Bernstein (first 20 seconds)
- DB: Bach, Beethoven/Bernstein, Beethoven/Sawallisch, Shostakovich
- [Figure: matching curve over the database; local minima mark the hits]

Matching Procedure
- Problem: How to deal with tempo differences?
- Karajan is much faster than Bernstein!
- For Beethoven/Karajan, the matching curve does not indicate any hits!
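A matching curve can be sketched as below. The cost used here (one minus the average inner product of unit-normalized chroma vectors) is one common choice and is an assumption, since the formula did not survive on the slide.

```python
import numpy as np

def matching_curve(db, query):
    """db: (12 x N), query: (12 x M), columns with unit Euclidean norm.
    Returns one cost per shift; values near 0 indicate a match."""
    N, M = db.shape[1], query.shape[1]
    curve = np.empty(N - M + 1)
    for n in range(N - M + 1):
        curve[n] = 1.0 - np.sum(db[:, n:n + M] * query) / M
    return curve

rng = np.random.default_rng(5)
db = rng.random((12, 300))
db /= np.linalg.norm(db, axis=0)          # normalize toy chroma vectors
query = db[:, 50:70].copy()               # 20-frame excerpt starting at frame 50
curve = matching_curve(db, query)
hit = int(np.argmin(curve))
```

The loop visits every shift, so the cost grows linearly with the database size N.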

Matching Procedure
Strategy 1: Usage of local warping
- Karajan is much faster than Bernstein!
- Warping strategies are computationally expensive and hard to index.

Matching Procedure
Strategy 2: Usage of multiple scaling
- Query resampling simulates tempo changes
- Minimize over all curves
- Resulting curve is similar to the warping curve
- [Figure: matching curves for Beethoven/Karajan]

Experiments
- Audio database: 110 hours, 16.5 GB
- Preprocessing: chroma features, 40.3 MB
- Query clip: 20 seconds
- Retrieval time: 10 seconds (using MATLAB)
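The multiple-scaling strategy can be sketched as follows: the query is resampled to several simulated tempi, one matching curve is computed per version, and the pointwise minimum is taken. The set of scaling factors, the linear interpolation, and the cost measure are assumptions made for this sketch.

```python
import numpy as np

def normalize(X):
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    return X / norms

def resample(query, factor):
    """Stretch (factor > 1) or compress (factor < 1) the query in time."""
    M = query.shape[1]
    M_new = max(2, int(round(M * factor)))
    src = np.linspace(0, M - 1, M_new)
    rows = [np.interp(src, np.arange(M), row) for row in query]
    return normalize(np.array(rows))

def matching_curve(db, query):
    N, M = db.shape[1], query.shape[1]
    return np.array([1.0 - np.sum(db[:, n:n + M] * query) / M
                     for n in range(N - M + 1)])

def multi_scale_curve(db, query, factors=(0.66, 0.81, 1.0, 1.22, 1.5)):
    """Pointwise minimum over the curves of all resampled queries."""
    curves = [matching_curve(db, resample(query, f)) for f in factors]
    length = min(len(c) for c in curves)      # curves differ in length
    return np.min([c[:length] for c in curves], axis=0)

rng = np.random.default_rng(4)
db = normalize(rng.random((12, 300)))
query = db[:, 100:120].copy()
plain = matching_curve(db, query)
multi = multi_scale_curve(db, query)
```

Each scaled query is still matched by a plain shift, so this strategy stays compatible with index-based retrieval, unlike local warping.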

Experiments
Query: Beethoven's Fifth / Bernstein (first 20 seconds)
Ranked hits, best first:
1. Beethoven's Fifth/Bernstein
2. Beethoven's Fifth/Bernstein
3. Beethoven's Fifth/Karajan
4. Beethoven's Fifth/Karajan
5. Beethoven (Liszt) Fifth/Scherbakov
6. Beethoven's Fifth/Sawallisch
7. Beethoven (Liszt) Fifth/Scherbakov
8. Schumann Op. 97,1/Levine

Experiments
Query: Shostakovich, Waltz / Chailly (first 21 seconds)
Expected hits: Shostakovich/Chailly, Shostakovich/Yablonsky
Ranked hits, best first:
1. Shostakovich/Chailly
2. Shostakovich/Chailly
3. Shostakovich/Chailly
4. Shostakovich/Yablonsky
5. Shostakovich/Yablonsky
6. Shostakovich/Yablonsky
7. Shostakovich/Chailly
8. Bach BWV 582/Chorzempa
9. Beethoven Op. 37,1/Toscanini
10. Beethoven Op. 37,1/Pollini

Indexing
- Matching procedure is linear in size of database
- Retrieval time was 10 seconds for 110 hours of audio
- Much too slow
- Does not scale to millions of songs
- Need of indexing methods

Indexing
General procedure:
- Convert database into feature sequence (chroma)
- Quantize features with respect to a fixed codebook
- Create an inverted file index
  - contains for each codebook vector an inverted list
  - each list contains feature indices in ascending order
[Kurth/Müller, IEEE-TASLP 2008]

Indexing
Quantization:
- Feature space (visualization in 3D)
- Codebook selection of suitable size R
- Quantization using nearest neighbors

Indexing
How to derive a good codebook?
- Codebook selection by unsupervised learning
  - Linde-Buzo-Gray (LBG) algorithm
  - similar to k-means
  - adjust algorithm to spheres
- Codebook selection based on musical knowledge

Indexing
LBG algorithm
Steps:
1. Initialization of codebook vectors
2. Assignment
3. Recalculation
4. Iteration (back to 2.) until convergence

Indexing
LBG algorithm for spheres
Example: 2D
- Assignment
- Recalculation
- Projection
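The LBG steps adapted to the sphere, followed by the construction of inverted lists, can be sketched as below. The random initialization, the fixed number of iterations, and the codebook size `R = 8` are assumptions made for this toy example.

```python
import numpy as np

def lbg_sphere(X, R, n_iter=20, seed=0):
    """X: (num_features x dim), rows with unit norm. Returns an (R x dim)
    codebook on the unit sphere and the final assignment of each feature."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=R, replace=False)].copy()  # 1. initialization
    for _ in range(n_iter):                                         # 4. iteration
        labels = np.argmax(X @ codebook.T, axis=1)                  # 2. assignment
        for r in range(R):
            members = X[labels == r]
            if len(members) == 0:
                continue                                            # keep empty cells unchanged
            mean = members.mean(axis=0)                             # 3. recalculation
            norm = np.linalg.norm(mean)
            if norm > 0:
                codebook[r] = mean / norm                           # projection onto the sphere
    labels = np.argmax(X @ codebook.T, axis=1)
    return codebook, labels

rng = np.random.default_rng(6)
X = rng.random((500, 12))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # toy normalized chroma features
codebook, labels = lbg_sphere(X, R=8)

# Inverted file index: one list of feature indices (ascending) per codebook vector.
inverted = {r: np.flatnonzero(labels == r).tolist() for r in range(8)}
```

On the unit sphere, the nearest codebook vector is the one with the largest inner product, which is why the assignment step uses `argmax` of `X @ codebook.T`.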

Indexing
Codebook using musical knowledge
- Observation: Chroma features capture harmonic information
- Examples: C major, C# major
- Experiments: For more than 95% of all chroma features, >50% of the energy lies in at most 4 components
- Choose codebook to contain n-chords for n = 1, 2, 3, 4

Indexing
Codebook using musical knowledge
- Additional consideration of harmonics in chord templates
- Example: 1-chord C

  Harmonic   1    2    3    4    5    6
  Pitch      C3   C4   G4   C5   E5   G5
  Chroma     C    C    G    C    E    G

  (The frequencies of the first two harmonics are 131 Hz and 262 Hz.)
- Replace the pure chord template by a template that also contains the chroma of the harmonics, with suitable weights for the harmonics

Indexing
Quantization
- Original chromagram and projections on codebooks
- [Figure: original, LBG-based, model-based]
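A chord template that includes harmonics can be sketched as follows. The geometric weights `decay ** (h - 1)` and the number of harmonics are hypothetical choices; the slides only ask for "suitable weights".

```python
import numpy as np

def chord_template(pitch_classes, n_harmonics=6, decay=0.6):
    """Chroma template for an n-chord (pitch classes 0 = C, ..., 11 = B),
    adding the chroma of each harmonic with an assumed geometric weight."""
    template = np.zeros(12)
    for pc in pitch_classes:
        for h in range(1, n_harmonics + 1):
            semitones = int(round(12 * np.log2(h)))   # interval of harmonic h above the fundamental
            template[(pc + semitones) % 12] += decay ** (h - 1)
    return template / np.linalg.norm(template)

c_template = chord_template([0])                      # 1-chord C
active = set(np.flatnonzero(c_template).tolist())     # chroma classes C, E, G
c_major = chord_template([0, 4, 7])                   # 3-chord C major
```

For the single note C, the six harmonics fall onto the chroma classes C, C, G, C, E, G, so the template has energy in C, E and G only.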

Indexing
Query and retrieval stage
- Query consists of a short audio clip (10-40 seconds)
- Specification of fault tolerance setting
  - fuzziness of query
  - number of admissible mismatches
  - tolerance to tempo variations
  - tolerance to modulations

Indexing
Retrieval results
- Medium-sized database
  - 500 pieces
  - 112 hours of audio
  - mostly classical music
- Selection of various queries
  - 36 queries
  - duration between 10 and 40 seconds
  - hand-labelled matches in database
- Indexing leads to a speed-up factor between 15 and 20 (depending on query length)
- Only small degradation in precision and recall

Indexing
Retrieval results
- [Table: average precision and average recall for no index, LBG-based index, and model-based index]

Indexing
Conclusions
- Described method suitable for medium-sized databases
  - index is assumed to be in main memory
  - inverted lists may be long
- Goal was to find all meaningful matches
  - high degree of fault tolerance required (fuzziness, mismatches)
  - number of intersections and unions may explode
- What to do when dealing with millions of songs?
- Can the quantization be avoided?
- Better indexing and retrieval methods needed!
  - kd-trees
  - locality sensitive hashing

Conclusions (Audio Matching)
Matching procedure
- Strategy: Exact matching and multiple scaled queries
  - simulate tempo variations by feature resampling
  - different queries correspond to different tempi
  - indexing possible
- Strategy: Dynamic time warping
  - subsequence variant
  - more flexible (in particular for longer queries)
  - indexing hard

Conclusions (Audio Matching)
Audio features
- Strategy: Absorb variations already at feature level
  - Chroma: invariance to timbre
  - Normalization: invariance to dynamics
  - Smoothing: invariance to local time deviations
- Message: There is no standard chroma feature! Variants can make a huge difference!

Feature Design
- Enhancement of chroma features
- Usage of audio matching framework for evaluating the quality of obtained audio features
- Usage of matching curves as mid-level representation to reveal a feature's robustness and discriminative capability
- M. Müller and S. Ewert (2010): Towards Timbre-Invariant Audio Features for Harmony-Based Music. IEEE Trans. on Audio, Speech & Language Processing, Vol. 18, No. 3, pp. ...
[Müller/Ewert, IEEE-TASLP 2010]

Motivation: Audio Matching
- Four occurrences of the main theme
- [Figure: first occurrence and third occurrence]

Chroma Features
- [Figure: chroma (chroma scale) of first occurrence and third occurrence]
- How to make chroma features more robust to timbre changes?
- Idea: Discard timbre-related information
[Müller/Ewert, IEEE-TASLP 2010]

MFCC Features and Timbre
- [Figure: MFCC coefficients]
- Lower MFCCs relate to timbre
- Idea: Discard lower MFCCs to achieve timbre invariance
[Müller/Ewert, IEEE-TASLP 2010]

Enhancing Timbre Invariance
Steps:
1. Log-frequency spectrogram (short-time pitch energy, pitch scale)
2. Log (amplitude) (log short-time pitch energy)
3. DCT (PFCC)
4. Discard lower coefficients [1:n-1], i.e. keep upper coefficients [n:120]
5. Inverse DCT
6. Chroma & normalization (chroma scale)
Result: CRP(n), Chroma DCT-Reduced Log-Pitch
[Müller/Ewert, IEEE-TASLP 2010]
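The six CRP steps can be sketched as below, starting from a given pitch-energy representation (step 1). The row layout (row i holds MIDI pitch i+1), the compression `log(1 + gamma * x)` with `gamma = 1000`, and the hand-built orthonormal DCT matrix are assumptions of this sketch rather than details confirmed by the slides.

```python
import numpy as np

def dct_matrix(size):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(size)[:, None]
    m = np.arange(size)[None, :]
    D = np.sqrt(2.0 / size) * np.cos(np.pi * (2 * m + 1) * k / (2 * size))
    D[0] /= np.sqrt(2.0)
    return D

def crp(pitch_energy, n=55, gamma=1000.0):
    """pitch_energy: (120 x frames), row i = MIDI pitch i+1 (assumed layout)."""
    P, num_frames = pitch_energy.shape
    D = dct_matrix(P)
    log_pitch = np.log(1.0 + gamma * pitch_energy)   # 2. log (amplitude)
    coeffs = D @ log_pitch                           # 3. DCT
    coeffs[:n - 1] = 0.0                             # 4. keep coefficients n..120 (1-based)
    reduced = D.T @ coeffs                           # 5. inverse DCT
    chroma = np.zeros((12, num_frames))
    for i in range(P):
        chroma[(i + 1) % 12] += reduced[i]           # 6. fold MIDI pitches into chroma
    norms = np.linalg.norm(chroma, axis=0)
    norms[norms == 0] = 1.0
    return chroma / norms                            #    and normalize

rng = np.random.default_rng(7)
pitch_energy = rng.random((120, 50))                 # toy pitch-energy features
features = crp(pitch_energy, n=55)
D = dct_matrix(120)
```

Zeroing the lower coefficients removes the slowly varying part of the log-pitch spectrum, which is the part most closely tied to timbre.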

Chroma versus CRP
- Shostakovich Waltz, first occurrence and third occurrence
- [Figure: standard chroma versus CRP(55), n = 55]
[Müller/Ewert, IEEE-TASLP 2010]

Quality: Audio Matching
- Query: Shostakovich, Waltz / Yablonsky (3rd occurrence)
- [Figure: matching curves over Shostakovich/Chailly and Shostakovich/Yablonsky for standard chroma (Chroma Pitch) and CRP(55)]

Quality: Audio Matching
- Query: Free in You / Indigo Girls (1st occurrence)
- [Figure: matching curves over Free in You/Indigo Girls and Free in You/Dave Cooley for standard chroma (Chroma Pitch) and CRP(55)]

Part III: Version identification


Version Identification
Goal: Given a music recording or a sufficiently long excerpt of it, find all corresponding music recordings within a collection that can be regarded as a kind of version or reinterpretation of the same piece.

Version Identification
Versions: Alternative recordings, renditions, arrangements or performances of the same musical piece.
- Live versions
- Cover songs
- Tributes
- Demos
- Parodies
- Contemporary versions of an old song
- Radically different interpretations of a musical piece

Version Identification
Motivation
- Search and organization of music collections
- Near-duplicate detection/discovery of music documents
- Musical rights management and licenses
- Radio broadcast monitoring, advertisements, ...
- Aid in court decisions/music copyright infringement (e.g. Müllensiefen & Pendzich, Mus. Sci., 2009)
- Creative application contexts
- Composers (e.g. assess originality of ideas)
- Musicologists (e.g. track origins of a motive)
- Understanding (what's the essence of a song?)
- Fun/leisure

Version Identification
Examples (original / version):
- Bob Dylan, Knockin' on Heaven's Door / Avril Lavigne, Knockin' on Heaven's Door
- Duke Ellington, Take the A Train / James Aebersold, Take the A Train
- Metallica, Enter Sandman / Apocalyptica, Enter Sandman
- Nirvana, Polly [Incesticide album] / Nirvana, Polly [Unplugged]
- Stan Getz, Garota de Ipanema / Rosa Passos & Ron Carter, Garota de Ipanema
- Black Sabbath, Paranoid / Cindy & Bert, Der Hund von Baskerville
- AC/DC, High Voltage / AC/DC, High Voltage [live]
- Queen, Bohemian Rhapsody / Molotov, Bohemian Rhapsody
What may change between versions: key, harmonization, timbre, tempo, timing, lyrics, recording conditions, structure

Version Identification
Everything changes, but there must be some essence that allows us to identify versions
- How to compare two different recordings?
- What is kept? What allows us to still identify such pieces?
- Tonal information: the played notes, their temporal succession, combinations and organization.

Version Identification
Main idea:
- Exploit common information (tonal information)
- Be invariant against potential changes (timbre, main tonality, tempo, structure, ...)

Version Identification: Approach A
[Figure: Main building blocks of a version identification system] [Serrà et al., New J. Phys., 2009]

Version Identification: Approach A
Feature extraction
- Chroma (PCP) features
- Correlate with the harmonic and melodic progression
- Robust to changes in timbre and instrumentation
- Normalization introduces invariance to dynamics
- Harmonic Pitch Class Profiles (HPCP) [Gómez, PhD, 2006]

Version Identification: Approach A
Key invariance
Example: Knockin' on Heaven's Door, Bob Dylan versus Avril Lavigne
- Compute average chroma vectors for each song
- Consider cyclic shifts of the chroma vectors to simulate transpositions
- Determine optimal shift indices so that the shifted chroma vectors are matched with minimal cost
- Transpose the songs accordingly

Cyclic Chroma Shifts
- Given chroma vectors x and y
- Fix a local cost measure
- Compute the cost between x and the shifted versions σ^k(y) of y for the shift indices k = 0, 1, 2, 3, ..., 11
[Figure: cost as a function of the shift index]
- Minimizing shift index in the example: 3, i.e. x is best matched by σ^3(y)
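The shift search can be illustrated with a short pure-Python sketch. The cosine distance used as local cost measure is an assumption (the slides leave the measure open), and the function names are hypothetical.

```python
import math

def cyclic_shift(y, k):
    """Return sigma^k(y): the chroma vector y rotated upward by k semitones."""
    y = list(y)
    k %= len(y)
    return y[-k:] + y[:-k] if k else y

def cosine_cost(x, y):
    """Local cost measure (an assumption): 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 1.0
    return 1.0 - dot / (nx * ny)

def optimal_shift(x, y):
    """Return the shift index minimizing the cost between x and sigma^k(y)."""
    costs = [cosine_cost(x, cyclic_shift(y, k)) for k in range(12)]
    best = min(range(12), key=costs.__getitem__)
    return best, costs
```

Applied to the average chroma vectors of two recordings, the returned index is the transposition to apply to the second recording before comparing frame-wise features.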

Version Identification: Approach A
Local similarity
- Phase space embedding (smoothing)
- Cross-recurrence plot (thresholding)
[Kantz & Schreiber, Camb. Univ. Press, 2004; Marwan et al., Phys. Rep., 2007]

Version Identification: Approach A
Local similarity: Phase space embedding (smoothing)
- Taking m consecutive samples
- Separated with some delay τ
- Concatenating them to form a high-dimensional vector
$$\hat{\mathbf{x}}_i = \left[\, \mathbf{x}_i,\ \mathbf{x}_{i-\tau},\ \mathbf{x}_{i-2\tau},\ \ldots,\ \mathbf{x}_{i-(m-1)\tau} \,\right]$$
[Kantz & Schreiber, Camb. Univ. Press, 2004]
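A minimal sketch of the delay embedding, assuming each frame is a plain list of feature values (for example a 12-dimensional chroma vector) and that only indices with a complete history are embedded. The function name is hypothetical.

```python
def delay_embed(frames, m, tau):
    """Concatenate frame i with its m-1 predecessors at delay tau.

    frames: list of feature vectors (lists of floats).
    Returns one embedded vector per index i >= (m - 1) * tau.
    """
    start = (m - 1) * tau
    embedded = []
    for i in range(start, len(frames)):
        vec = []
        for j in range(m):
            vec.extend(frames[i - j * tau])  # x_i, x_{i-tau}, x_{i-2tau}, ...
        embedded.append(vec)
    return embedded
```

With 12-dimensional chroma frames, each embedded vector has 12 * m components and summarizes a short stretch of the recording rather than a single frame.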

Version Identification: Approach A
Local similarity: Smoothing
[Figure: Local dissimilarity (cost) matrix]
- Tempo changes get blurred
- Filtering along 8 plausible directions and minimizing [Müller & Kurth, ICASSP, 2006]

Version Identification: Approach A
Local similarity: Cross recurrence plot (thresholding the local dissimilarity measure) [Marwan et al., Phys. Rep., 2007]
Thresholding options:
- Absolute threshold
- Percentage of the mean cost found
- Taking each frame's K nearest neighbors
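As an illustration of the third thresholding option, the sketch below turns a cost matrix into a binary cross recurrence plot by marking, for each frame of the first recording (each row), its K lowest-cost frames in the second recording. The row-wise rule and the function name are assumptions made for this example; ties at the threshold are all kept.

```python
def cross_recurrence(cost, k):
    """Binary cross recurrence plot from a cost matrix (list of rows).

    R[i][j] = 1 if cost[i][j] is among the k smallest values of row i.
    """
    plot = []
    for row in cost:
        threshold = sorted(row)[min(k, len(row)) - 1]
        plot.append([1 if c <= threshold else 0 for c in row])
    return plot
```

An absolute threshold would replace the per-row `threshold` by a single constant, which makes the density of the plot depend on the overall cost level of the pair of recordings.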

Alignment
- Classical DTW: global correspondence between X and Y
- Subsequence DTW: a subsequence of Y corresponds to X
- Local alignment: a subsequence of Y corresponds to a subsequence of X

Global Alignment
Dynamic Time Warping (DTW): An optimal warping between two time series (from beginning to end) is found by dynamic programming (accumulating local costs + backtracking)
$$D(n,m) = C(n,m) + \min\{\, D(n-1,m-1),\ D(n,m-1),\ D(n-1,m) \,\}$$
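The recursion and the backtracking step can be sketched in plain Python as follows. The cost matrix C is assumed to be given as a list of rows, and the function name is hypothetical.

```python
def dtw(C):
    """Classical DTW on a cost matrix C (N rows x M columns).

    Returns the total cost and the optimal warping path as a list of
    0-based index pairs from (0, 0) to (N - 1, M - 1).
    """
    N, M = len(C), len(C[0])
    INF = float("inf")
    # D has an extra border row and column so that D[0][0] = 0 starts the path
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n][m] = C[n - 1][m - 1] + min(
                D[n - 1][m - 1], D[n][m - 1], D[n - 1][m]
            )
    # Backtracking from the last cell to the first cell
    n, m = N, M
    path = [(n - 1, m - 1)]
    while (n, m) != (1, 1):
        _, n, m = min(
            (D[n - 1][m - 1], n - 1, m - 1),
            (D[n][m - 1], n, m - 1),
            (D[n - 1][m], n - 1, m),
        )
        path.append((n - 1, m - 1))
    path.reverse()
    return D[N][M], path
```

For example, aligning x = [1, 2, 3] with y = [1, 2, 2, 3] under the absolute difference as local cost gives total cost 0, with the second element of x matched to both 2s in y.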

Global Alignment
Dynamic Time Warping (DTW):
- We can apply global path constraints (Sakoe-Chiba band, Itakura parallelogram)
- Or local path constraints, replacing the step sizes in
$$D(n,m) = C(n,m) + \min\{\, D(n-1,m-1),\ D(n,m-1),\ D(n-1,m) \,\}$$
by
$$D(n,m) = C(n,m) + \min\{\, D(n-1,m-1),\ D(n-1,m-2),\ D(n-2,m-1) \,\}$$

Subsequence Alignment
Subsequence DTW
- Compute the cumulative dissimilarity matrix as before
- Backtrack from the lowest value in a given end region until the start region is reached
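A minimal sketch of subsequence DTW under stated assumptions: the rows of the cost matrix index the query X, the columns index the longer sequence Y, the start region is the first row (the match may begin at any position of Y at no cost) and the end region is the last row. The function name is hypothetical.

```python
def subsequence_dtw(C):
    """Find the subsequence of Y that best matches the whole query X.

    C: cost matrix with N rows (query frames) and M columns (frames of Y).
    Returns (cost, start, end) with 0-based, inclusive indices into Y.
    """
    N, M = len(C), len(C[0])
    INF = float("inf")
    # A border row of zeros lets the match start anywhere in Y for free
    D = [[0.0] * (M + 1)] + [[INF] * (M + 1) for _ in range(N)]
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n][m] = C[n - 1][m - 1] + min(
                D[n - 1][m - 1], D[n][m - 1], D[n - 1][m]
            )
    # End region: lowest accumulated cost in the last row
    end = min(range(1, M + 1), key=lambda col: D[N][col])
    # Backtrack until the first query frame (start region) is reached
    n, m = N, end
    while n > 1:
        _, n, m = min(
            (D[n - 1][m - 1], n - 1, m - 1),
            (D[n][m - 1], n, m - 1),
            (D[n - 1][m], n - 1, m),
        )
    return D[N][end], m - 1, end - 1
```

The only differences to classical DTW are the zero-initialized border row and the free choice of the end column.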

Local Alignment
Task: Let X = (x_1, ..., x_N) and Y = (y_1, ..., y_M) be two time series. Find the maximum similarity of a subsequence of X and a subsequence of Y.
Note: This problem is also known from bioinformatics. The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment, that is, for determining similar regions between two nucleotide or protein sequences. Similar concepts have been developed in parallel in the nonlinear time series analysis community (physics). Cross recurrence plots and their recurrence quantification measures are also a good starting point.
Strategy:
- Develop our own variant of these algorithms
- Specifically tackle structure changes and small temporal disruptions (both very important for version identification)

Local Alignment
Computation of the accumulated score matrix S from a binary similarity (reward) matrix R
Think positive! Instead of accumulating costs and minimizing,
$$D(n,m) = C(n,m) + \min\{\, D(n-1,m-1),\ D(n-1,m-2),\ D(n-2,m-1) \,\}$$
accumulate rewards and maximize:
$$S(n,m) = R(n,m) + \max\{\, S(n-1,m-1),\ S(n-1,m-2),\ S(n-2,m-1) \,\}$$
Differentiate between matches, R(n,m) > 0, and mismatches: for a mismatch, a penalty g is subtracted from each of the three predecessor scores, and the score never drops below zero.

Local Alignment
Computation of the accumulated score matrix S from a binary similarity (reward) matrix R
$$S(n,m) = \begin{cases} R(n,m) + \max\{\, S(n-1,m-1),\ S(n-1,m-2),\ S(n-2,m-1) \,\} & \text{if } R(n,m) > 0 \\ \max\{\, 0,\ S(n-1,m-1) - g,\ S(n-1,m-2) - g,\ S(n-2,m-1) - g \,\} & \text{otherwise} \end{cases}$$
- g penalizes "inserts" and "deletes" in the alignment
- The zero entry allows for jumping to any cell without penalty
- The best local alignment score is the highest value in S
- The best local alignment ends at the cell of highest value
- The start is obtained by backtracking to the first cell of value zero
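The recursion and the backtracking rule can be sketched in plain Python. The penalty value g = 0.5 and the function name are assumptions for illustration, and the sketch is not the authors' implementation.

```python
def local_alignment(R, g=0.5):
    """Accumulated score matrix for a binary reward matrix R.

    Returns (best_score, end_cell, path); cells are 0-based (n, m) pairs.
    """
    N, M = len(R), len(R[0])
    # Two border rows and columns of zeros cover the steps (1,1), (1,2), (2,1)
    S = [[0.0] * (M + 2) for _ in range(N + 2)]
    best, end = 0.0, None
    for n in range(N):
        for m in range(M):
            i, j = n + 2, m + 2
            prev = max(S[i - 1][j - 1], S[i - 1][j - 2], S[i - 2][j - 1])
            if R[n][m] > 0:
                S[i][j] = R[n][m] + prev       # match: collect the reward
            else:
                S[i][j] = max(0.0, prev - g)   # mismatch: pay the penalty g
            if S[i][j] > best:
                best, end = S[i][j], (n, m)
    if end is None:
        return 0.0, None, []
    # Backtrack from the best cell until a cell of value zero is reached
    path = []
    i, j = end[0] + 2, end[1] + 2
    while S[i][j] > 0:
        path.append((i - 2, j - 2))
        i, j = max(
            [(i - 1, j - 1), (i - 1, j - 2), (i - 2, j - 1)],
            key=lambda c: S[c[0]][c[1]],
        )
    path.reverse()
    return best, end, path
```

With a reward matrix that has ones on the diagonal except for one missing entry, the score keeps growing across the gap and only loses g once, which is the intended robustness to small temporal disruptions.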

Local Alignment Score
Example: Knockin' on Heaven's Door (Bob Dylan versus Guns and Roses)
[Figure: Accumulated score matrix; cell with max. score = 94.2; alignment path of maximal score; matching subsequences]

Version Identification: Conclusions
Main idea: Extract common, shared information and be invariant towards potential changes
- Common information: tonal information (notes)
- Potential changes: timbre, tempo/rhythm, key/harmonicity, structure, lyrics, ...
Example approach
- Chroma (PCP) features
- Transposition indices (circular shifts)
- Local similarity measure (subsequence + thresholding)
- Local alignment
Many other approaches exist!
- Casey et al., IEEE-TASLP, 2008: Shingling + Indexing
- Marolt, IEEE-TMM, 2008: Mid-level representation + 2D Spectrum
- Ellis & Poliner, ICASSP, 2007: Beat tracking + Correlation
(cf. Serrà, PhD, Literature Review, 2011)

Part IV: Category-based music retrieval

Category-based Retrieval
Goal: Find music recordings within a collection that share a specific, predefined characteristic or trait.
Humans naturally group objects with similar characteristics or traits into categories (or classes).
Category: Any word or concept that can be used to describe a set of recordings sharing some property. Also referred to as class, type or kind.

Category-based Retrieval
- Categories are one of the most basic ways to represent and structure human knowledge
- They encapsulate characteristic, intrinsic features of a group of items
- Prototypes can be built and potential category members can be compared to these
- They can overlap and have some kind of organization (e.g. fuzzy and hierarchical)
[Zbikowski, Oxford Univ. Press, 2007]

Category-based Retrieval
Category-based query examples:
- Genres (e.g. "bossa-nova")
- Moods (e.g. "sad")
- Specific instruments (e.g. "extremely distorted electric guitar")
- Instrument ensembles (e.g. "string quartet")
- Key/mode (e.g. "G minor pieces")
- Dynamics (e.g. "lots of loudness changes")
- Type of recording (e.g. "live recording")
- Epoch (e.g. "music from the 80s")

Category-based Retrieval
Objective: Label/annotate music documents (ideally with some confidence and probability measurements) in a way that makes sense to a user.
Categories usually imply a taxonomy:
- Based on the opinion of experts, on the wisdom of the crowd (folksonomy), on a single user (personomy), ...
- Taxonomies can be organized, their members can overlap, different relation types can be applied (ontologies)

Category-based Retrieval
Unsupervised: No taxonomy. Categories emerge from the data organization under a predefined similarity measure
Typical clustering algorithms: K-means, hierarchical clustering, SOM, growing hierarchical SOM, ...

Category-based Retrieval
Supervised: Use a taxonomy
Map features to categories (usually without describing the rules of this mapping explicitly)
Typical supervised learning algorithms: KNN, neural networks, LDA, SVM, trees, ...

Category-based Retrieval
Motivation
- Search and organization of music collections (discovery of music documents)
- Fun/leisure
- Playlist generation
- Music recommendation
- Music similarity
- ...
[Tversky, Psych. Revs., 1977; Bogdanov et al., IEEE-TMM, 2011]

Typical Approach: Automatic Annotation
[Figure: Processing pipeline. Training: labeled audio, feature extraction, data processing (feature transformation and selection), classifier training and validation. Annotation: unlabeled audio, feature extraction, data processing, classification, with output Annotation, Prob(Annotation) and Conf(Annotation)]

Typical Approach: Automatic Annotation
Feature extraction
Features depend on the task! (e.g. MFCC for key estimation?)
- Temporal: ZCR, energy, loudness, rise time, onset asynchrony, amplitude modulation, ...
- Spectral: MFCC, LPC, spectral centroid, spectral bandwidth, bark-band energies, ...
- Tonal: Chroma/PCP, melody, chords, key, ...
- Rhythmic: Beat histogram, IOIs, rhythm transform, FP, ...
Features are extracted on a short-time, frame-by-frame basis, but usually summarized to represent global characteristics of the whole audio excerpt:
- Component-wise means
- Component-wise variances
- Covariances
- Higher-order statistical moments (skewness, kurtosis, ...)
- Derivatives (1st, 2nd, ...)
- Max, min, percentiles, histograms, ... (+ more exotic summarizations)
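The simplest of these summarizations, component-wise means and variances, can be sketched as follows. The function name and the choice of the population variance are assumptions for illustration.

```python
def summarize(frames):
    """Summarize frame-wise features by component-wise means and variances.

    frames: list of equally long feature vectors, one per short-time frame.
    Returns one vector: all means followed by all (population) variances.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    variances = [
        sum((f[k] - means[k]) ** 2 for f in frames) / n for k in range(dim)
    ]
    return means + variances
```

A recording with T frames of 13 MFCCs is thereby reduced from a T x 13 matrix to a single 26-dimensional vector, at the price of discarding all temporal order.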

Typical Approach: Automatic Annotation
Data processing: Feature transformation and feature selection
- Rescale each feature so that all of them have the same influence (normalize, standardize, Gaussianize?)
- Detect highly correlated features (try to decorrelate them?)
- Remove uninformative features
- Reduce the dimensionality
- Learn about the features and the task!

Typical Approach: Automatic Annotation
Data processing: Feature transformation
Principal Component Analysis (PCA)
- Linear transform
- Eliminates redundancy
- Decorrelates variables
- Can be used to reduce the number of features
- Assumes some homogeneity of the distributions
- We lose the information of which specific feature is relevant
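Standardization and PCA can be sketched with NumPy as below. This is a generic textbook formulation via the eigendecomposition of the covariance matrix, with hypothetical function names; rows of X are items, columns are features.

```python
import numpy as np

def standardize(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant features unscaled
    return (X - mu) / sd

def pca(X, n_components):
    """Project X onto its n_components directions of largest variance.

    Returns the projected data and the variance along each kept direction.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order], eigvals[order]
```

Each output column is a linear mix of all input features, which is why the information about which specific feature is relevant is lost.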

Typical Approach: Automatic Annotation
Data processing: Feature selection
Fisher's ratio
- Ratio between inter- and intra-class scatter
- Measures the discriminative power of feature j between classes i and i'
- Ratio high when classes are well separated
- No standard way of performing this feature selection iteratively
- Sequential backward selection (removing bad features)
- Sequential forward selection (selecting good features)

Typical Approach: Automatic Annotation
Data processing: Feature selection
Many other feature selection algorithms:
- Correlation-based
- Chi-squared
- Consistency
- Gain ratio
- Information gain
- Bootstrapping
- Classifier-affine feature selection
[Witten & Frank, Elsevier, 2005]
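Fisher's ratio for a single feature and two classes can be sketched as follows. The slides do not spell out the formula, so the common definition used here (squared difference of the class means divided by the sum of the class variances) is an assumption, as is the function name.

```python
import statistics

def fisher_ratio(values_a, values_b):
    """Fisher's ratio of one feature for two classes.

    Assumed definition: (mean_a - mean_b)^2 / (var_a + var_b),
    using population variances.
    """
    mean_a = statistics.mean(values_a)
    mean_b = statistics.mean(values_b)
    scatter = statistics.pvariance(values_a) + statistics.pvariance(values_b)
    if scatter == 0:
        return float("inf") if mean_a != mean_b else 0.0
    return (mean_a - mean_b) ** 2 / scatter
```

Ranking all features by this ratio and keeping the top ones is a simple filter-style selection; it ignores interactions between features.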

Typical Approach: Automatic Annotation
Classifiers
- Bayesian/probabilistic classifiers (naïve: conditional independence) [Witten & Frank, Elsevier, 2005]
- Gaussian Mixtures [Witten & Frank, Elsevier, 2005]
- Neural Networks
- K Nearest Neighbors
- Trees
- Support Vector Machines
Many other classifiers exist:
- Logistic regression
- Perceptrons
- Random forest
- Hidden Markov models
- Boosting
- Ensembles of classifiers

Typical Approach: Automatic Annotation
[Figure: Machine learning / data mining software packages]

Category-based Retrieval
Classifiers: Issues
- Always perform a visual inspection of your features (check that there are no values such as inf or nan)
- Do not classify M items with D features when D is larger than or comparable to M. As a rule of thumb (personal): M > 20*D
- Be careful when comparing approaches that use a different number of features
- If possible, use the same number of instances per class: "normalize" by randomly sampling the most populated classes, and run several experiments (Monte Carlo)
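Balancing classes by random subsampling can be sketched as below. The function name is hypothetical; repeating the call with different seeds gives the several (Monte Carlo) experiments mentioned above.

```python
import random
from collections import defaultdict

def balance_by_subsampling(labels, seed=0):
    """Return item indices such that every class has equally many instances.

    Each class is randomly subsampled to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    chosen = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, n_min))
    return sorted(chosen)
```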

Category-based Retrieval
Classifiers: Issues
Avoid overfitting
[Figure: Two-class toy example. Diagonal line = ideal solution for the problem. Solving the same problem for a smaller dataset: classification of new data gives 3 errors, whereas the solution using a simple line gives 1 error]

Category-based Retrieval
Classifiers: Issues
How to avoid overfitting:
- Use learning algorithms that intrinsically, by design, generalize well
- Pursue simple classifiers
- Evaluate with the most suitable measure for your task (unbiased and low-variance statistical estimators)
- Estimate the performance of a model with data you did not use to construct the model

Category-based Retrieval
Classifiers: Issues
- Hold-out cross-validation
- Cross-fold validation
- LOOCV, stratified cross-validation, nested cross-validation, ...
- With all of them: ensure a sufficient number of repetitions

Category-based Retrieval
Classifiers: Issues
Check the default parameters of your classifier. Examples:
- Minimum number of instances per tree branch
- Number of NN in a KNN
- Complexity parameter in an SVM
- Distance-based classifier: which distance, effect of normalization/re-scaling
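A minimal sketch of stratified k-fold cross-validation, i.e. folds that keep the class proportions of the whole collection. The function names are hypothetical, and repeating the procedure with different seeds gives the repetitions asked for above.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split item indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Deal the items of each class round-robin over the folds
        for pos, idx in enumerate(shuffled):
            folds[pos % k].append(idx)
    return folds

def train_test_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices), one pair per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

The model is trained on `train` and evaluated on `test` for every pair, so that each item is used for testing exactly once per repetition.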

Category-based Retrieval
Classifiers: Issues
If you evaluate your features, demonstrate their usefulness/increment in accuracy
- Use several classifiers (include a one-rule classifier?)
- Use several datasets (music collections)
- Always compute/report random baselines (average accuracy, but also their std, min, and max values)
- Also include a baseline feature set in the evaluation
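A random baseline with its average, standard deviation, minimum and maximum can be estimated as sketched below. Guessing uniformly among the classes is one possible baseline and an assumption here (guessing according to the class frequencies is another); the function name is hypothetical.

```python
import random
import statistics

def random_baseline(labels, n_runs=100, seed=0):
    """Accuracy statistics of uniformly random guessing over n_runs trials."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    accuracies = []
    for _ in range(n_runs):
        guesses = [rng.choice(classes) for _ in labels]
        hits = sum(g == t for g, t in zip(guesses, labels))
        accuracies.append(hits / len(labels))
    return {
        "mean": statistics.mean(accuracies),
        "std": statistics.pstdev(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
    }
```

Reporting the spread matters on small collections: with few items, a single random run can land well above the expected chance level.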

Category-based Retrieval
Classifiers: Issues
Test for statistical significance!

Category-based Retrieval: Conclusions
Scenario: Retrieve music based on labels automatically annotated from audio
- Assign a label/class, a probability of that label, and a confidence value
Automatic annotation: three main steps for supervised learning
- Feature extraction
- Feature transformation and selection
- Classification
Issues with classification tasks
- Overfitting, statistical significance, number of features, test several methods/datasets, randomizations, ...

Wrap Up
- Part I: Audio identification
- Part II: Audio matching
- Part III: Version identification
- Part IV: Category-based music retrieval


Wrap Up
References (Audio Identification)
E. Allamanche, J. Herre, O. Hellmuth, B. Fröba, and M. Cremer. AudioID: Towards content-based identification of audio material. In Proc. 110th AES Convention.
P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of algorithms for audio fingerprinting. In Proc. MMSP.
P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of audio fingerprinting. Journal of VLSI Signal Processing Systems, 41(3):271-284, 2005.
E. Dupraz and G. Richard. Robust frequency-based audio fingerprinting. In Proc. IEEE ICASSP, pp. 281-284, 2010.
J. Haitsma and T. Kalker. A highly robust audio fingerprinting system. In Proc. ISMIR.
Y. Ke, D. Hoiem, and R. Sukthankar. Computer Vision for Music Identification. In Proc. IEEE CVPR, 2005.
