San Jose State University. Math 285: Selected Topics of High Dimensional Data Modeling

Project Report on Ordinal MDS and Spectral Clustering on Students' Knowledge and Performance Status and Toy Data
San Jose State University
Math 285: Selected Topics of High Dimensional Data Modeling
Submitted by Yuntian Yang on 12/13/2015

I. Introduction

Non-metric multidimensional scaling (MDS) was first introduced by Shepard (1962a, 1962b). Kruskal (1964a, 1964b) expanded Shepard's ideas and introduced a loss function called stress. Non-metric MDS is also referred to as ordinal MDS. As the name implies, it preserves the order, or rank, of the original data points in the lower-dimensional space they are mapped to.

In this report, our primary goal is to demonstrate how ordinal MDS works by applying it to a set of toy data and to a set of real data on students' knowledge status. We also perform cluster analysis on the real data to gain some understanding of it, treating it as completely unprocessed or raw, as is standard procedure in data mining. To serve the purpose of this report, the original data on students' knowledge status is slightly modified: the labels and the sixth column (the true labels of knowledge level) are removed. The main goal of this project is to study and demonstrate non-metric MDS and to practice data mining. The data sets used here include self-created toy data, small data from a book example, and a real data set on students' knowledge status.

II. Ordinal (Non-metric) Multidimensional Scaling

Classical multidimensional scaling reduces the dimension of the data space while preserving the distances between every pair of data points in the configuration (a mapping of the original data space). Ordinal multidimensional scaling also reduces the dimension of the original data space, but it serves a different purpose: it preserves the order between pairs of data points. This method is particularly useful in social science, where data are often recorded and measured as categorical data.

For example, a customer's rating on a survey is usually ranked in categories (say, strongly disagree, disagree, neutral, agree, and strongly agree), where measurements of distance would be meaningless. The order, however, is very meaningful, and it is essentially the purpose of the survey. Another example is a patient's pain level, which doctors use to assess the severity of the patient's symptoms. A student's overall grade or knowledge level is a further example. A student's performance on an exam is given a numerical grade, but the student's overall performance in the class is recorded as a letter grade, and this is what the school system uses to assess and evaluate a student's knowledge of a subject. In fact, the main goal of this report is to demonstrate how ordinal MDS is applied to real data on students' knowledge.

In ordinal MDS, we use disparity as a measure of how well the distances in the data configuration match the dissimilarities. There are many choices for measuring dissimilarity. The book uses Minkowski's method, and we will use the city block method. We can view disparities as functions of the distances such that the order of the original dissimilarities is preserved by the disparities.

III. Algorithm of ordinal MDS

1. Compute the dissimilarity or similarity matrix from the data matrix.
2. Find the configuration of the data in lower dimension and the distance matrix of the points in the configuration.

3. Arrange the dissimilarities in ascending order and impose the same order on the corresponding distances from the configuration.
4. Compute the disparities, which requires the distances to be in ascending order.
5. Apply Kruskal's algorithm. The actual steps of this algorithm are not presented in this report, as it is used only to calculate stress, but it is included in the references. We will use MATLAB functions to compute stress.

IV. Demonstration of ordinal MDS using toy data

IV.1. Book example: demonstration of isotonic regression (Exploratory Data Analysis)

When we are dealing with ordinal MDS, sometimes the data matrix is not given, while the dissimilarity matrix and the distance matrix of the configuration in lower dimension are given. In such a scenario, one can try to recover the original order of the data points from the disparities obtained from the dissimilarity and distance matrices. The disparities will preserve the order of the original data points. Let us first demonstrate this using toy data. Six dissimilarities and six distances in the configuration of four items are given:

Dissimilarities 2.1, ;
Distances ;

The goal of ordinal MDS is to preserve the order of the data points. Since these points are not in order, we first rearrange the dissimilarities in ascending order and impose the same order on the corresponding distances. We then have:

Dissimilarities_sorted ; (in ascending order)
Distances_correspond ; (corresponds to the dissimilarities but is not in order)

If the distances turned out to be in order, we would not need to apply isotonic regression. However, since the distances from the configuration are not in ascending order, we apply isotonic regression to these points:

1. Compute the cumulative sums of the distances, $D_i = \sum_{j=1}^{i} d_j$, where $i = 1, \dots, N$ and $N$ is the number of dissimilarities.
2. Plot the cumulative sums against the index $i$, starting from the origin, and find the greatest convex minorant of this graph. The points where the greatest convex minorant touches the graph divide the distances into blocks.
3. Compute the disparities from these blocks by taking the average of the distances that belong to each block.

This algorithm and its proof are provided by Cox and Cox (2001).
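For readers who want the quantities in symbols, the block below restates the three isotonic regression steps and adds one common form of the stress, Kruskal's stress-1. The notation is assumed for this sketch rather than taken from the report: $d_k$ are the configuration distances listed in ascending order of dissimilarity, $\hat{d}_k$ the disparities, $B$ one block of consecutive indices, and $N$ the number of dissimilarities.

```latex
\[
D_i = \sum_{j=1}^{i} d_j , \qquad i = 1, \dots, N
\]
\[
\hat{d}_k = \frac{1}{|B|} \sum_{l \in B} d_l \qquad \text{for every } k \in B
\]
\[
S = \sqrt{ \frac{\sum_{k=1}^{N} \left( d_k - \hat{d}_k \right)^2}{\sum_{k=1}^{N} d_k^2} }
\]
```

A stress of zero means the distances are already in the same order as the dissimilarities, so every block has size one and each disparity equals its distance.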

This part of the isotonic regression code in MATLAB is provided by Exploratory Data Analysis (EDA) and is attached in the appendix and reference section. Now we have obtained the disparities, and they are:

Disparities ;

A graph (Figure 1) best describes what we mean by the greatest convex minorant. The black straight lines separate the cumulative sums of distances (blue lines) into blocks, which helps us find the convex minorant in each block.

%MATLAB code provided here:
%Book example: creating toy data to demonstrate isotonic regression
dissim = [ ]; %six dissimilarities between the four items
dists = [ ]; %six distances from the configuration
n = length(dissim);
[dissim,inds] = sort(dissim); %sort the dissimilarities.
dists = dists(inds); %impose the order of the dissimilarities onto the distances.
D = cumsum(dists); %cumulative sums of the distances
D = [0 D]; %add the origin as the first point.
slope = D(2:end)./(1:n); %slopes of the lines from the origin to each point
%find the smallest slopes, which lead us to the convex minorant
i = 1;
k = 1;
while i <= n
    value = min(slope(i:n));
    minpoint(k) = find(slope == value);
    i = minpoint(k) + 1;
    k = k + 1;
end
K = convhull(D, 0:n);
minpoint = intersect(minpoint + 1, K) - 1;
%divide the distances into blocks so we can find the disparities
%the disparities are defined as the averages of the distances over those blocks.
j = 1;
for i = 1:length(minpoint)
    dispar(j:minpoint(i)) = mean(dists(j:minpoint(i)));
    j = minpoint(i) + 1;
end
plot(1:max(size(D)), D);
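The block averages can also be reached without drawing the convex minorant, by the pool-adjacent-violators scheme, which is a different route to the same isotonic fit. The sketch below is not the EDA code; the vector d holds hypothetical distances, assumed to be already arranged in ascending order of their dissimilarities, and the names vals and wts are illustrative.

```matlab
% Pool-adjacent-violators sketch (base MATLAB, repelem needs R2015a or later).
% d: hypothetical distances, already ordered by ascending dissimilarity.
d = [3 1 2 5 4 6];

vals = zeros(size(d));   % running block means
wts  = zeros(size(d));   % running block sizes
m = 0;                   % number of blocks so far
for i = 1:numel(d)
    % start a new block holding only d(i)
    m = m + 1;
    vals(m) = d(i);
    wts(m)  = 1;
    % merge backwards while the previous block mean exceeds the current one
    while m > 1 && vals(m-1) > vals(m)
        vals(m-1) = (wts(m-1)*vals(m-1) + wts(m)*vals(m)) / (wts(m-1) + wts(m));
        wts(m-1)  = wts(m-1) + wts(m);
        m = m - 1;
    end
end

% expand block means back to one disparity per distance
dispar = repelem(vals(1:m), wts(1:m));
disp(dispar)
```

For these hypothetical values the first three distances are pooled into one block and the fourth and fifth into another, so the disparities are non-decreasing while their total equals the total of the distances.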

Figure 1: Greatest convex minorant, demo graph (cumulative sums D_j against index i).

IV.2. Own example: experimenting with isotonic regression and ordinal MDS on toy data

We first measure the distances between the data points in the original data matrix and construct a dissimilarity matrix whose diagonal entries are 0 and whose remaining entries are positive. The dissimilarity matrix measures how dissimilar two points are to each other, and it can be computed in several different ways. We could also use a similarity matrix. Then the diagonal entries would be 1, and the order of the disparities would be the reverse of the order in the original data.

For this toy data, we choose the city block (Manhattan) method to compute the dissimilarities. This gives different entry values in the dissimilarity matrix D of the original data and the distance matrix C obtained from the configuration. We could use the Euclidean method to compute the dissimilarities, but it would yield exactly the same values in both the D and C matrices, which would be less effective for the purpose of demonstration. We can also work with only the upper or lower triangle of the dissimilarity matrix, since the dissimilarity matrix is always symmetric.

Now we look at the toy data matrix X (created for demonstration and carrying no particular meaning) and its dissimilarity matrix, obtained by computing the city block distance between each pair of data points (rows of X). We refer to each entry of the dissimilarity matrix D as $\delta_{ij}$, where $i$ and $j$ are the row and column numbers, respectively. Since it is a dissimilarity matrix, we have X, D, C

And we have: 9, 8, 11, 9, 18, , , , , ,

Rearranging these entries in ascending order, together with their corresponding distances, we have:

8, 9, 9, 11, , , , , ,

Since the distances from the configuration are not in ascending order, we apply isotonic regression to these points. Using the code for isotonic regression, we obtain the convex minorant graph in Figure 2.

Figure 2: Greatest convex minorant, toy data graph (cumulative sums D_j against index i).

And the disparities are: , , , , , and

Now we check how well the isotonic regression works compared to the functions for performing non-metric MDS provided by MATLAB.

%MATLAB code for non-metric MDS.
[Y2,stress,disparity] = mdscale(dism,2);
nonzeros(tril(disparity))';

We then have disparity , compared to the dispar acquired above.

The two vectors are very close in value. The difference is that the MATLAB function computes the disparities of a dissimilarity matrix and keeps them in the original order of that matrix. With isotonic regression, we sorted the dissimilarities so that the disparities would appear in ascending order. If we had not sorted the dissimilarities, the vector dispar would appear in an order more similar to that of disparity.

The following two graphs plot disparities/distances against dissimilarities. Figure 3 (left) plots, in vector form, the disparities/distances against the dissimilarities obtained from isotonic regression. Figure 4 (right) plots, in matrix form, the disparities/distances against the dissimilarities obtained from the mdscale function in MATLAB. Both appear to preserve the order of the original dissimilarities.
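One way to put a number on how closely the distances follow the disparities is to compute the stress by hand. The sketch below uses base MATLAB only and hypothetical vectors d (distances) and dhat (disparities), both assumed to be listed in ascending order of dissimilarity. It evaluates Kruskal's stress-1, which, to the best of our knowledge, is the normalization that mdscale documents for its default 'stress' criterion. The variable names are illustrative.

```matlab
% Hypothetical distances and disparities, ordered by ascending dissimilarity.
d    = [3 1 2 5 4 6];
dhat = [2 2 2 4.5 4.5 6];

% Disparities must be non-decreasing to preserve the dissimilarity order.
isMonotone = all(diff(dhat) >= 0);

% Kruskal's stress-1: residual sum of squares over the sum of squared distances.
stress1 = sqrt(sum((d - dhat).^2) / sum(d.^2));

fprintf('monotone: %d, stress-1: %.4f\n', isMonotone, stress1);
```

A value near zero indicates that the configuration distances are already almost in the same order as the dissimilarities.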

Figure 3: Toy data with disparities obtained from isotonic regression (distances and disparities against dissimilarities).
Figure 4: Toy data with the MATLAB mdscale function (distances and disparities against dissimilarities).

The isotonic regression method is relatively traditional compared to the methods developed today. For the rest of this report, we use MATLAB functions for ordinal MDS on the real data on students' knowledge status. We also experiment with isotonic regression on the real data, but we only attempt to understand and draw conclusions about the data from the results obtained with the MATLAB functions.

IV.3. Real data example: ordinal MDS on students' knowledge level data

The data set used here is about students' knowledge status on Electrical DC Machines and is provided by the UCI Machine Learning Repository. In the 258 by 5 data matrix X, each data point represents the performance of one of 258 students in five areas. The five main criteria (the columns of the data matrix X) that we use to determine students' knowledge level are:

1. STG: the degree of study time for goal object materials (input value)
2. SCG: the degree of repetition number of user for goal object materials (input value)
3. STR: the degree of study time of user for related objects with goal object (input value)
4. LPR: the exam performance of user for related objects with goal object (input value)
5. PEG: the exam performance of user for goal objects (input value)

With this data set, we attempt to understand the relationship between students' behaviors, their test performances, and their knowledge level of the subject. We attempt to draw conclusions based on the outcomes obtained from ordinal (non-metric) MDS and spectral clustering. In this example, we compute two dissimilarity matrices, using the Euclidean and city block distances. The reason is that we do not know much about the data, and Euclidean distance may not be a meaningful measure.
%construct 1st dissimilarity matrix with Euclidean distance
dissimilarities = pdist2(X,X,'euclidean');
[Y,stress,disparities] = mdscale(dissimilarities,5);
distances = pdist2(Y,Y,'cityblock');
[dum,ord] = sortrows([disparities(:) dissimilarities(:)]);
figure;
plot(dissimilarities,distances,'bo',...
    dissimilarities(ord),disparities(ord),'r.-');
xlabel('dissimilarities');
ylabel('distances/disparities')
legend({'distances' 'Disparities'}, 'Location','NorthWest');

Stress: 1.95e-16.

%construct 2nd dissimilarity matrix with Manhattan distance
dissimilarities = pdist2(X,X,'cityblock');

Stress: .65.

Although we used different dissimilarity matrices, the order of the dissimilarities appears to be preserved in both figures. However, the disparities based on the Euclidean dissimilarity matrix yield a much smaller stress, 1.95e-16, than the disparities based on the city block matrix. The dissimilarity matrix obtained using Euclidean distance also seems more reasonable and linear. The explanation is that the original data matrix has values from 0 to 1 and the nonnumerical labels are removed. Each entry is measured as a percentage, so Euclidean distance is actually preferred. We cannot arrive at a ranking system and compare it to the true labels provided, because we do not know how the author weighted each criterion in the data matrix or how he assessed the knowledge level. Our primary goal of learning and demonstrating ordinal MDS is achieved here. Further study will be carried out to extend our knowledge of the subject and of data analysis and machine learning.

V. Cluster Analysis using spectral clustering

The spectral clustering method is used to cluster the data. The results should separate the data points into groups based on the students' five criteria. In a way, this separates the students by overall standing without specific labeling. Clusters help us understand students' standing based on their behaviors (all five criteria) and how they are associated.

Figures: 2 clusters, 8th NN; 2 clusters, 5th NN. Sum of point-to-centroid distances: .2173;

Figures: 2 clusters, 10th NN; 2 clusters, 20th NN. Sum of point-to-centroid distances: .2365; .2382

Using the 5th, 10th, and 20th nearest neighbors for sigma produces good clusters. The reason is likely the way the data set is structured: many data points toward the end tend to have a higher percentage on average. Also, all clusterings have very similar sums of point-to-centroid distances, but it does appear that the later the nearest neighbor we pick, the larger the sum of point-to-centroid distances becomes.

Figure: 3 clusters, 5th NN. Sum of point-to-centroid distances:

Figures: 3 clusters, 10th NN; 3 clusters, 20th NN. Sum of point-to-centroid distances: .119; .1211

The same trend appears when we attempt to separate the data points into 3 clusters.

Figures: 4 clusters, 10th NN; 4 clusters, 20th NN. Sum of point-to-centroid distances: .757; .772

We try to separate the data into 4 clusters because the true labels categorize students' knowledge status into 4 levels: very low, low, middle, and high. However, these clusters do not tell us about the knowledge level, because they simply separate students by overall performance (all five criteria). The problem is that we do not know how the assessment of knowledge level was made. The clusters do tell us, however, that separating the data points into 2 clusters produces the best cut. Examining the true labels, we find very low 50, low 122, middle 129, and high 130. There might be a typo on the website, because it lists low 129 and middle 122. The values are clustered very closely for low, middle, and high. On the other hand, very low is almost an outlier in the true labels. This is perhaps the reason why 2 clusters produce a larger sum of point-to-centroid distances.

VI. Conclusion

Ordinal MDS is very useful for preserving the order of dissimilarities while we examine the distances and disparities of the configuration in lower dimensions. It is helpful when we are dealing with categorical data. However, ordinal MDS alone is not enough to carry out the ambitious agenda of assessing and ranking each student's knowledge level. More knowledge and practice are necessary to meet the higher demands of real applications in industry. On the other hand, the concept and algorithms of ordinal MDS are demonstrated, and the report has served its purpose.

Appendix: Code:

%%============= Ordinal MDS with real data============
clear
close all
clc
X = xlsread('project_data_no_labels.xlsx');
%construct dissimilarity matrix
%dissimilarities = pdist2(X,X,'euclidean');
%dissimilarities = pdist2(X,X,'minkowski',5);
dissimilarities = pdist2(X,X,'cityblock');
%dissimilarities = pdist2(zscore(X), zscore(X), 'cityblock');
[Y,stress,disparities] = mdscale(dissimilarities,5);
distances = pdist2(Y,Y,'cityblock');
[dum,ord] = sortrows([disparities(:) dissimilarities(:)]);
figure;
plot(dissimilarities,distances,'bo',...
    dissimilarities(ord),disparities(ord),'r.-');
xlabel('dissimilarities');
ylabel('distances/disparities')
legend({'distances' 'Disparities'}, 'Location','NorthWest');
% plot(1:max(size(Y)), disparity);

%% Clustering
kncut(X,2,5)
kncut(X,2,10)
kncut(X,3,8)
kncut(X,3,5)
kncut(X,3,10)
kncut(X,3,20)
kncut(X,4,10)
kncut(X,4,20)

%Function for multiway NCut algorithm: kncut(data, k clusters, n-th NN).
function [SS] = kncut(X,k,jnn)
%X = data matrix, k = # of clusters, jnn = jth N.N.
n = size(X, 1);
%find the distance matrix
for i = 1:n
    for j = 1:n
        Diff = X(i,:) - X(j,:);
        Dist(i,j) = norm(Diff,2);
    end
end
%compute average distance for sigma
sort_Dist = sort(Dist,2);
%choose the average jth N.N.; sigma = scalar
%(column 1 of sort_Dist is the zero distance of each point to itself)
sigma = mean(sort_Dist(:,jnn));
%compute weight matrix
for i = 1:n
    for j = 1:n
        W(i,j) = exp(-((Dist(i,j)^2)/(2*sigma^2)));
    end
end
W_tr = W - eye(n); %set the diagonal of the weight matrix to zero
%compute the degree matrix
for i = 1:n
    D(i,i) = sum(W_tr(i,:),2); %add each row of the weight matrix.
end
D_inv = diag(1./diag(D));
%compute the Lrw = I - D^-1*W matrix
Lrw = eye(n) - D_inv*W_tr;
%compute eigenvectors and eigenvalues of Lrw
[U,V] = eig(Lrw);
%sort and return the positions of the eigenvalues in ascending order.
[values,indices] = sort(diag(V));
%reorder U according to 'indices' s.t. the 2nd column of U_tr belongs to
%the 2nd smallest eigenvalue
U_tr = U(:,indices);
%pick the eigenvectors corresponding to the 2nd through kth smallest eigenvalues
U_tr2 = U_tr(:,2:k);
%apply kmeans and output the cluster labels Y and centroids C.
[Y,C,Sumn] = kmeans(U_tr2,k,'Replicates',1);
SS = sum(Sumn);
figure;
gcplot(X,Y); %gcplot is a plotting helper that is not defined in this report
end
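The scale parameter and the random-walk Laplacian at the core of kncut can be checked on a small example. The sketch below is a compact variant in base MATLAB on six hypothetical 2-D points, not the report's function. It assumes sigma is taken from column jnn + 1 of the sorted distances, which skips each point's zero distance to itself; the appendix code indexes column jnn instead.

```matlab
% Hypothetical data: two tight groups of three points each.
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
jnn = 2;                       % use the 2nd nearest neighbor (excluding self)
n = size(X, 1);

% pairwise Euclidean distances, one row at a time
Dist = zeros(n);
for i = 1:n
    Dist(i, :) = sqrt(sum(bsxfun(@minus, X, X(i, :)).^2, 2))';
end

% sigma = average distance to the jnn-th nearest neighbor
sortDist = sort(Dist, 2);      % column 1 holds the self-distances (all zero)
sigma = mean(sortDist(:, jnn + 1));

% Gaussian weights with a zero diagonal, then Lrw = I - D^-1 * W
W = exp(-Dist.^2 ./ (2 * sigma^2));
W(1:n+1:end) = 0;
D = diag(sum(W, 2));
Lrw = eye(n) - D \ W;

disp(sigma)
```

Each row of Lrw sums to zero, so the constant vector is an eigenvector with eigenvalue 0, which is why kncut discards the first eigenvector and clusters on the 2nd through kth.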

Reference:

Links:
Knowledge Modeling. H. T. Kahraman, S. Sagiroglu, I. Colak, Developing intuitive knowledge classifier and modeling of users' domain dependent data in web, Knowledge Based Systems, vol. 37, pp ,
Multidimensional Scaling, Cox and Cox. and nonmetric multidimensionalscaling.html
Nonclassical and Nonmetric Multidimensional Scaling, MathWorks.

Textbooks:
Exploratory Data Analysis, 2nd Ed., Wendy Martinez, Angel Martinez, and Jeffrey Solka.
The analysis and interpretation of multivariate data for social scientists, Chapters 2 and 3, David Bartholomew, Fiona Steele, Irini Moustaki, and Jane Galbraith.
