SMART SONGS SELECTION IN PLAYLISTS USING PARALLEL K-MEANS CLUSTERING


International Journal of Civil Engineering and Technology (IJCIET)
Volume 9, Issue 3, March 2018, Article ID: IJCIET_09_03_077
IAEME Publication, Scopus Indexed

Pradyun Manoj
Department of Computer Science, Christ [Deemed to be University], Hosur Road, Bhavani Nagar, Bengaluru, Karnataka, India

Saleema JS
Department of Computer Science, Christ [Deemed to be University], Hosur Road, Bhavani Nagar, Bengaluru, Karnataka, India

ABSTRACT

Songs today differ in tempo, pitch and time signature. In a music player application, the typical shuffle picks the succeeding or preceding song at random, with no parameters guiding the choice. Songs from different genres can have a tempo anywhere between forty and three hundred beats per minute. In this paper, the quick and efficient parallel k-means clustering algorithm is implemented in Hadoop on the million-song dataset subset to form clusters of songs based on tempo and pitch. The aim of this paper is to reduce the variation that occurs when a typical shuffle picks the succeeding song at random. This variation can be in the form of tempo or other parameters. The formation of clusters, and in turn the reduction in the variation of tempo, can be used in a new smart shuffle. After the clusters have been formed, the smart shuffle picks songs from within a specific cluster. This paper aims at reducing the variation by 50%. This would have many musical benefits and would also be more pleasing to the listener.
Keywords: Hadoop, Parallel K-Means Clustering, Pitch, Tempo, Variation

Cite this Article: Pradyun Manoj and Saleema JS, Smart Songs Selection in Playlists using Parallel K-Means Clustering, International Journal of Civil Engineering and Technology, 9(3), 2018.

1. INTRODUCTION

With the development of social media, business analytics, healthcare analysis and online shopping portals, the amount of data being produced per minute passes the petabyte threshold. The data collected by the various analytics sectors is unstructured and can differ in variety and volume. This is what is referred to today as big data. There are many scalable frameworks that process and structure this data efficiently. The MapReduce framework is one of the most widely used frameworks in big data analysis. It is usually used to process multi-terabyte datasets of different variety. A MapReduce job splits the input dataset into independent chunks, which map tasks then process in parallel. The map task receives a set of <key, value> pairs and outputs processed <key, value> pairs [1]. The goal of parallelism is to achieve speed without losing accuracy.

In a typical MapReduce job, once the input data is split into its respective blocks, the Mapper class applies a map function to do the required processing on the data. The output of the mapper is then sorted or combined and given as input to the reduce task. The reducer receives this input and aggregates the data according to the given <key, value> pairs [1]. The final output of the reducer is then written back to the HDFS (Hadoop Distributed File System) [2]. The output of the reducer can either be the final output of the job, or it can be returned to the MapReduce framework as the input of a secondary job. This process iterates until the desired solution is reached. The framework shares its basic idea with the divide and conquer approach.

Music is a form of art which comprises three dimensions, namely rhythm, melody and harmony. It is often referred to as organized sound. The elements that form the basis of music are pitch, rhythm, dynamics, timbre and texture. Pitch comprises melody and harmony. The concepts associated with rhythm are tempo, meter and articulation. Dynamics refers to softness and loudness, and timbre and texture refer to the tonal quality or color of a musical sound. The most prominent elements in a song are pitch and rhythm. The pitch of a song is a perceptual property of sounds that allows their ordering on a frequency-related scale [3]. More commonly, it is the quality that makes it possible to classify a certain song or sound as higher or lower [4].
A sound's pitch can only be determined if it is differentiable from noise, which is unclear and unstable [5]. In the theory of music, the ordered frequency or pitch of different musical notes, whether ascending or descending, forms a scale. Any scale in music is formed by two major components, the tonic and the interval pattern. The tonic is the starting point of the scale and the interval pattern is the type of scale [6]. There are a total of 12 notes in music, written as A, A#, B, C, C#, D, D#, E, F, F#, G, G#. In this paper, the notes are represented as numbers from 0 to 11.

The tempo of a song is measured in beats per minute and directly reflects the speed of the song. Tempo can influence the genre of a musical piece and the performer's interpretation. Most musical pieces have a specific tempo, which can range from 40 to 300 beats per minute. For example, a song could have a tempo of 80 beats per minute with a 4/4 time signature. A typical shuffle in a music player does not consider the tempo of a song as a parameter when picking the succeeding song. Hence, the succeeding song could have a tempo of 140 beats per minute. This could cause the listener a certain amount of discomfort, as each succeeding song would vary in genre or danceability. It could also influence the mood of the listener.

Hence, in this paper, the parallel k-means clustering algorithm is implemented to form clusters of songs based on pitch and tempo, in order to reduce the amount of variation produced by the transition between songs. The parallel k-means algorithm is chosen because a quick and efficient algorithm is necessary to form the clusters. This algorithm utilizes the parallelism offered by MapReduce. In a typical k-means clustering approach, which is an unsupervised learning technique, the clusters are determined by finding the shortest distance between a given point and the centroids.
The vector is then assigned to the closest centroid. The same is done for all the vectors. Once the first iteration over all the vectors is complete, the centroids are recalculated and the second iteration is performed. This process continues until there is little or no change in the cluster centroids. Using the final clusters, a conclusion is drawn. In this algorithm, the distance calculation is the most time-consuming step. Since the distance calculation between one vector and a centroid does not affect the outcome of the distance calculation between another vector and a centroid, the distance calculations can be executed in parallel. Hence, the Parallel K-Means algorithm, which parallelizes the distance calculation, is implemented in this paper.

2. LITERATURE REVIEW

Before parallel processing, programs were executed serially. That is, the instructions in a program were executed sequentially: one instruction would not run until the previous instruction was complete. Executing instructions and programs sequentially was time consuming. Further, each program executed on a single processor. To reduce the running time of serial execution, especially on large datasets, parallel processing was developed. The big advantage that parallel processing holds over serial processing is reduced execution time. Although expensive, it reduces the amount of time taken for a single program to finish executing. As long as the execution of one instruction does not depend on the execution of another, the two can be processed in parallel [7].

Hadoop's MapReduce is one framework that takes full advantage of parallel processing. The MapReduce framework uses non-local resources to process large volumes of data in a speedy and efficient manner. It is divided into two main classes, the mapper class and the reducer class. The Mapper class has a function called the map function, which receives the input data from the Hadoop Distributed File System, divides it further according to the given parameters, and stores it as small chunks of data in the form of <key, value> pairs [1,8].
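The map, group and reduce flow described above can be illustrated with a small single-process sketch in Python. This is not Hadoop code and is not taken from the paper: `run_mapreduce`, `map_fn`, `reduce_fn`, the 50 BPM bucket width and the sample songs are all hypothetical names and data chosen only to show how <key, value> pairs move from the mapper to the reducer.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, in-memory imitation of a MapReduce job (assumption: no Hadoop involved)."""
    groups = defaultdict(list)
    # Map phase: every record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)   # grouping by key stands in for shuffle and sort
    # Reduce phase: each key is aggregated independently of the others.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

def map_fn(record):
    # record is a hypothetical (title, tempo) tuple; the key is a 50 BPM tempo bucket.
    title, tempo = record
    yield int(tempo // 50) * 50, 1

def reduce_fn(key, values):
    # Count how many songs fell into this tempo bucket.
    return sum(values)

songs = [("a", 80.0), ("b", 95.0), ("c", 140.0), ("d", 210.5)]  # made-up sample data
counts = run_mapreduce(songs, map_fn, reduce_fn)
print(counts)
```

Because each call to `map_fn` looks at one record only, the map phase is the part a real framework can spread across many machines.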
Depending on the size of the data, the mapper size can vary. Similarly, the number of mappers can be increased or decreased accordingly. In some cases, a large number of mappers is unnecessary, because the dataset is not large and a small number of mappers is enough to process all the data [8]. The size of each mapper can also be modified according to the requirements of the job. The default size of each mapper depends on the machine the MapReduce job runs on. Once the data has been given its respective <key, value> pairs, the reducer class receives the input from the mapper class. The reducer class has a function called the reduce function. The main goal of the reduce function is to shuffle the data and finally reduce or aggregate it according to the <key, value> pairs [8].

To summarize the MapReduce framework: the mapper class receives the data from the Hadoop Distributed File System and processes it into a list of <key, value> pairs. The reducer class then receives this list and performs a shuffle and sort to produce a list of keys and aggregated values. Finally, the output is a list of one or more final <key, value> pairs [9].

G. Dunn, in "Music preferences based on audio features, and its relation to personality", conducted a study with 165 male and 189 female participants to find how music preferences linked to objective audio features relate to the personality of an individual. The method of the study was a Principal Component Analysis. The audio features were extracted and computationally derived from the audio clips. The results revealed that excitement-seeking was positively related to music with a greater number of percussive events and negatively related to music with fewer percussive events [10].

Weizhong Zhao et al., Parallel K-Means Clustering Based on MapReduce.
The authors adapted the existing k-means algorithm to the MapReduce framework as implemented in Hadoop. This substantially increases the clustering speed and efficiency and makes the algorithm applicable to large-scale data. By assigning the correct <key, value> pairs, the k-means algorithm can be executed in parallel. Only one kind of MapReduce job is required by the PK-Means algorithm. Three functions were implemented. For the map function of the mapper class, the input dataset is stored in the Hadoop Distributed File System (HDFS) as a sequence file of <key, value> pairs. After the dataset is split, it is globally broadcast to all the mappers. The distance is then calculated in parallel. According to the authors, the distance calculation is the most time-consuming step. Since the distance calculation between two vectors is independent of the distance calculation between two other vectors, the calculations can be performed in parallel. The combine function then combines the intermediate data from the same map task. The output of the combine function is received as input by the reduce function. The reduce function sums up all the samples and computes the total number of samples assigned to each cluster. The newly calculated centroids are used in the next iteration. This process continues until there is little or no change in the cluster centroids. In conclusion, the authors demonstrate the speed-up, scale-up and size-up of the algorithm. The algorithm performs better as the size of the dataset increases: the speed-up and size-up performance increases, and the algorithm also scales well. The results show that the algorithm can process large datasets quickly and efficiently on commodity hardware [11].

Against this background, no study has clustered songs on the audio features of tempo and pitch to reduce the variation in the transition between songs. The typical shuffle has an average variation that can lie anywhere between 40 and 150 beats per minute.
This paper aims at reducing the average variation by a 50% margin. A smooth transition between songs, where the variation is minimal, would be more pleasing and beneficial to the listener. In this paper, since the data size is relatively large, the parallel k-means clustering approach is adopted.

3. PARALLEL K-MEANS CLUSTERING

The analysis required a fast and effective clustering algorithm. The traditional k-means clustering approach will not produce the result required within the timeframe. Since the parallel k-means clustering approach can speed up, scale up and size up efficiently with the given data, this algorithm was implemented. The algorithm has three functions: map, combine and reduce. Algorithm 1 demonstrates the map function [11].

Algorithm 1: Map
1. Construct the sample instance from value;
2. Assign Double.MAX_VALUE to the minimum distance minDis;
3. index = -1;
4. for i = 0 to centers.length - 1 do
       dis = ComputeDist(instance, centers[i]);
       if dis < minDis {
           minDis = dis;
           index = i;
       }
5. End for
6. Take index as key;
7. Construct value as a string comprising the values of the different dimensions;
8. Output the < key, value > pair;
9. End
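Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Hadoop implementation: the comma-separated "pitch,tempo" value format, the function names `compute_dist` and `kmeans_map`, and the example centroids are all hypothetical.

```python
import math

def compute_dist(instance, center):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(instance, center)))

def kmeans_map(value, centers):
    """Follow Algorithm 1: return (index of the nearest center, the sample as a string)."""
    # Step 1: construct the sample instance from value (assumed format: "pitch,tempo").
    instance = [float(x) for x in value.split(",")]
    # Steps 2-3: float("inf") plays the role of Double.MAX_VALUE.
    min_dis = float("inf")
    index = -1
    # Steps 4-5: scan all centers and keep the closest one.
    for i, center in enumerate(centers):
        dis = compute_dist(instance, center)
        if dis < min_dis:
            min_dis = dis
            index = i
    # Steps 6-8: the cluster index is the key, the sample string is the value.
    return index, value

centers = [(2.0, 90.0), (7.0, 140.0)]    # hypothetical (pitch, tempo) centroids
print(kmeans_map("3,95.5", centers))
```

Each call depends only on one sample and the shared list of centers, which is why these calls can run as independent map tasks.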

In step 4 of Algorithm 1, the center point closest to the given sample is found; the ComputeDist function returns the distance between the center point centers[i] and instance. Once the map function outputs its < key, value > pairs, the result is sent to the combine function. Algorithm 2 demonstrates the combine function [11].

Algorithm 2: Combine
1. Initialize one array to record the sum of the values of each dimension of the samples contained in the same cluster, i.e. the samples in the list V;
2. Initialize a counter num = 0 to record the number of samples in the same cluster;
3. while (V.hasNext()) {
       Construct the sample instance from V.next();
       Add the values of the different dimensions of instance to the array;
       Increment num;
4. }
5. Take key as key;
6. Construct value as a string comprising the summed values of the different dimensions and num;
7. Output the < key, value > pair;
8. End

A combiner is used after each map task to combine the intermediate data from the same map task. The reduce function is then used to calculate the final output (the centroids); its input is the data received from the combine function of each host. Algorithm 3 demonstrates the reduce function [11].

Algorithm 3: Reduce
1. Initialize one array to record the sum of the values of each dimension of the samples contained in the same cluster, i.e. the samples in the list V;
2. Initialize a counter num = 0 to record the number of samples in the same cluster;
3. while (V.hasNext()) {
       Construct the sample instance from V.next();
       Add the values of the different dimensions of instance to the array;
       Increment num;
4. }
5. Divide the entries of the array by num to get the new center's coordinates;
6. Take key as key;
7. Construct value as a string comprising the center's coordinates;
8. Output the < key, value > pair;
9. End

The final output from the reduce function provides the calculated centroids and the cluster numbers.
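Algorithms 2 and 3 can be sketched in the same way. The following Python is an illustration only and rests on two assumptions that are not stated in the paper: partial results are passed as (sums, count) tuples instead of strings, and the reducer adds up the counts delivered by the combiners rather than counting records one by one. The names `kmeans_combine` and `kmeans_reduce` and the sample values are hypothetical.

```python
def kmeans_combine(key, values):
    """Algorithm 2: per-dimension sums and sample count for one cluster in one map task."""
    sums, num = None, 0
    for v in values:
        instance = [float(x) for x in v.split(",")]   # assumed "pitch,tempo" strings
        if sums is None:
            sums = [0.0] * len(instance)
        for d, x in enumerate(instance):
            sums[d] += x
        num += 1
    return key, (sums, num)

def kmeans_reduce(key, partials):
    """Algorithm 3: merge the partial sums of all combiners and average them."""
    total, num = None, 0
    for sums, n in partials:
        if total is None:
            total = [0.0] * len(sums)
        for d, s in enumerate(sums):
            total[d] += s
        num += n                                      # assumption: counts are added up
    # Dividing each summed dimension by the sample count gives the new center.
    return key, [s / num for s in total]

# Two hypothetical map tasks that both produced samples for cluster 0.
_, part_a = kmeans_combine(0, ["2,80", "4,100"])
_, part_b = kmeans_combine(0, ["6,120"])
new_center = kmeans_reduce(0, [part_a, part_b])
print(new_center)
```

Sending sums and counts instead of raw samples keeps the data moved between hosts small, which is the purpose of the combiner described above.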
The < key, value > pair output by the reduce function is used in the next iteration as the new centroids. The process repeats until there is little or no change in the cluster centroids. The distance calculation in the parallel k-means clustering algorithm is performed using the Euclidean distance. Equation (1) gives the Euclidean distance [11] between two points p and q in n-space:

$$ d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (1) $$

In this paper, a two-dimensional Euclidean distance formula is implemented [11]. Equation (2) represents the same:

$$ d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2} \qquad (2) $$

The parallel distance computation through the MapReduce framework is better explained diagrammatically. Fig. 1 demonstrates the parallel execution of each map task for calculating mutually independent distances.

Figure 1 Block diagram of generic parallel distance calculation with pitch and tempo

The distance calculation is executed in parallel in the map tasks. This greatly increases the speed of the clustering algorithm, as the distance calculation consumes the most time and each calculation can be performed independently. The minimum distance is calculated for all the vectors and the cluster number is assigned as the key. The value contains a string of the different dimensions. The reduce function recalculates the new centroids by finding the average of all the vectors in that cluster. The output is in the form of < key, value > pairs.

4. IMPLEMENTATION

4.1. Dataset Description

The Million Song Dataset [12] is a collection of audio features, consisting of timbre, texture or tonal quality, pitch, pitch confidence, tempo and other data, for a million popular contemporary music tracks. The size of the entire dataset is around 300GB. Although it does not contain any audio tracks, it does contain the extracted audio features and metadata of the respective tracks. For research purposes, this experiment is run on the million-song subset, which contains 10,000 songs, or 1% of the tracks in the entire dataset. This proves sufficient to demonstrate the reduction of variation in tempo with the parallel k-means clustering algorithm. The subset is a randomly generated set from the Million Song Dataset. The two audio features needed for the analysis, pitch and tempo, are extracted.
Table 1 gives the feature description.

Table 1 Extracted audio feature description

Feature    Type       Range
Pitch      Integer    0 to 11
Tempo      Float      from 0.0 (beats per minute)

The pitch is determined by the scale of a song. The 12 notes in music theory are A, A#, B, C, C#, D, D#, E, F, F#, G, G#, and each note is assigned an integer value from 0 to 11. Tempo is measured by the total number of beats per minute. In the dataset, the type is float and the tempo ranges from 0.0 to beats per minute.

4.2. Experimental Results

First, consider a situation where the clustering is not applied before the shuffle, and the variation in tempo is calculated. Seven songs are picked at random, as demonstrated in Table 2.

Table 2 Variation calculation without clustering (columns: randomly selected songs, Song 1 to Song 7; tempo in BPM; variation, Tempo_{i+1} - Tempo_i; last row: average variation)

The calculated average variation is bpm. The aim of this analysis is to reduce this variation by 50%. The data points are added to a scatter plot before the clustering is performed, as seen in Fig. 2.

Figure 2 Pitch and tempo data points before clustering
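The variation calculation of Table 2 can be sketched as follows. This is a hedged illustration: it assumes that the variation between consecutive songs is the absolute difference of their tempos and that the average is taken over the consecutive pairs, neither of which is spelled out in the text. The function names and the tempo values are hypothetical and are not the values of Table 2.

```python
import random

def average_variation(tempos):
    """Mean absolute tempo difference between consecutive songs (assumed definition)."""
    diffs = [abs(nxt - cur) for cur, nxt in zip(tempos, tempos[1:])]
    return sum(diffs) / len(diffs)

def typical_shuffle(tempos, n, seed=None):
    """Pick n tempos uniformly at random, ignoring every audio feature."""
    return random.Random(seed).sample(tempos, n)

library = [62.0, 80.0, 95.5, 120.0, 128.0, 140.0, 171.3, 200.0, 88.0, 104.0]  # made-up tempos
playlist = typical_shuffle(library, 7, seed=42)
print(playlist, average_variation(playlist))
```

With seven songs there are six consecutive transitions, so the average here is taken over six differences.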

For the parallel k-means implementation, k is set to 4. After applying the algorithm to form clusters based on the two parameters, it was found that the clusters formed have minimal variation in their tempo and pitch. The centroids of the clusters formed are given in Table 3.

Table 3 Resultant cluster centroids after parallel k-means execution on pitch and tempo (columns: cluster number, pitch, tempo)

If 7 songs were to be picked at random from cluster 1, the calculated average variation is represented in Table 4 and the corresponding scatter plot in Fig. 3.

Table 4 Variation calculation after shuffle on cluster 1 (columns: randomly selected songs, Song 1 to Song 7; tempo in BPM; variation, Tempo_{i+1} - Tempo_i; last row: average variation)

Figure 3 Scatter plot for cluster 1

The calculated variation from the randomly selected songs in cluster 1 is bpm. Samples of randomly selected songs within each cluster were generated and the reduction in variation was always greater than 50%. Another example, where songs were randomly selected from cluster 3, is shown in Table 5 and the corresponding scatter plot in Fig. 4.

Table 5 Variation calculation after shuffle on cluster 3 (columns: randomly selected songs, Song 1 to Song 7; tempo in BPM; variation, Tempo_{i+1} - Tempo_i; last row: average variation)

Figure 4 Scatter plot for cluster 3

The calculated average variation from the randomly selected songs in cluster 3 is substantially less than the calculated variation when the clustering was not applied. The results were better than expected. The aim was to achieve a 50% reduction in the variation, and the sample shows a reduction of greater than 50%. The same process of selecting songs at random within the same cluster was repeated, and in every case the recorded variation met the aim of this paper. An aggregated view of all clusters in a single scatter plot is represented in Fig. 5.

Figure 5 Scatter plot for aggregated clusters (legend: Cluster 1, Cluster 2, Cluster 3, Cluster 4)
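The smart shuffle described in this section, picking songs at random from within a single cluster, can be sketched as below. This is an assumption-laden illustration rather than the authors' implementation: `smart_shuffle`, `reduction_percent`, the (title, tempo) tuples and the cluster labels are hypothetical, and the percentage formula is the usual relative reduction, which the paper does not write out.

```python
import random
from collections import defaultdict

def smart_shuffle(songs, labels, cluster, n, seed=None):
    """Pick up to n songs at random from the given cluster only."""
    by_cluster = defaultdict(list)
    for song, label in zip(songs, labels):
        by_cluster[label].append(song)           # group songs by their cluster number
    pool = by_cluster[cluster]
    return random.Random(seed).sample(pool, min(n, len(pool)))

def reduction_percent(before, after):
    """Relative reduction in average variation, in percent (assumed formula)."""
    return 100.0 * (before - after) / before

songs = [("s1", 80.0), ("s2", 84.0), ("s3", 150.0), ("s4", 82.0)]  # made-up (title, tempo)
labels = [1, 1, 2, 1]                                               # made-up cluster numbers
picked = smart_shuffle(songs, labels, cluster=1, n=3, seed=0)
print(picked, reduction_percent(60.0, 6.0))
```

Restricting the random choice to one cluster is what bounds the tempo difference between consecutive songs; the random order inside the cluster is kept so the result still behaves like a shuffle.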

A total of 10 runs were conducted after the analysis to compare the reduction in variation, as seen in Table 6. Any cluster can be used to demonstrate the reduction in variation. In this comparison, results from cluster 1 were used.

Table 6 Comparisons of tempo variations for 7 randomly selected songs in smart shuffle (columns: run number, Run 1 to Run 10; variation before clustering; variation after clustering; reduction in variation %; last row: average reduction in variation, 88.94%)

The average reduction in variation is recorded at 88.94%. According to Fig. 5, songs can be selected either horizontally or vertically. If songs are selected horizontally, the succeeding songs would be of similar tempo and increasing pitch. If songs are selected vertically, the succeeding songs would be of similar pitch and increasing tempo.

5. CONCLUSION

The aim of this paper was to reduce the average variation by 50%. The calculated results show that the average reduction in variation over the 10 runs was 88.94%. The analysis was a success, as the reduction in variation was greater than 50%. Since the typical shuffle had a very high rate of variation, the mood and continuity for the listener were affected. The new smart shuffle reduces the variation by an average of 88.94%, which would be more pleasing to the listener and have numerous musical benefits. With the flexibility to change the parameters to meet different application requirements, there is vast opportunity to further develop the smart shuffle. It can be incorporated in any music player application looking to enhance its features or reach a specific type of musically inclined audience.

REFERENCES

[1] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51.1 (2008).
[2] Borthakur, Dhruba, The Hadoop distributed file system: Architecture and design, Hadoop Project Website (2007): 21.
[3] Klapuri, Anssi, Introduction to music transcription, Signal Processing Methods for Music Transcription, (Boston: Springer, 2006).
[4] Plack, C. J., A. J. Oxenham, and Richard R. Fay, Pitch: Neural Coding and Perception, (New York: Springer, 2005) 1-6.
[5] Randel, D. Michael, The Harvard dictionary of music, Harvard University Press.
[6] Hewitt, M. John, Musical Scales of the World, Note Tree.
[7] Logan, G. D., Parallel and serial processing, Stevens' handbook of experimental psychology (2002).

[8] Apache Software Foundation, MapReduce Tutorial.
[9] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proc. of Operating Systems Design and Implementation, San Francisco, CA, 2004.
[10] G. Dunn, Music preferences based on audio features and its relation to personality, ESCOM 2009: 7th Triennial Conference of European Society for the Cognitive Sciences of Music.
[11] Zhao, Weizhong, H. Ma, and Q. He, Parallel k-means clustering based on MapReduce, IEEE International Conference on Cloud Computing, (Berlin: Springer, 2009).
[12] Million Song Dataset, official website by Thierry Bertin-Mahieux.
[13] Chandra Das, Shilpi Bose, Matangini Chattopadhyay, Samiran Chattopadhyay, A Novel Distance Based Modified K-Means Clustering Algorithm for Estimation of Missing Values in Micro-Array Gene Expression Data, International Journal of Information Technology & Management Information System (IJITMIS), Volume 5, Issue 3, September - December (2014).
[14] Deepika Khurana and Dr. M.P.S Bhatia, Dynamic Approach to K-Means Clustering Algorithm, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, May-June (2013).
