Sampling Operations on Big Data


Vijay Gadepally, Taylor Herr, Luke Johnson, Lauren Milechin, Maja Milosavljevic, Benjamin A. Miller
Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, MA
{vijayg, taylor.herr, luke.johnson, lauren.milechin, maja.milosavljevic,

Abstract

The 3Vs of Big Data (Volume, Velocity and Variety) continue to pose a major challenge for systems and algorithms designed to store, process and disseminate information for discovery and exploration under real-time constraints. Common signal processing operations such as sampling and filtering, which have been used for decades to compress signals, are often undefined for data that is characterized by heterogeneity, high dimensionality, and lack of known structure. In this article, we describe and demonstrate an approach to sampling large datasets such as social media data. We evaluate the effect of sampling on a common predictive analytic, link prediction. Our results indicate that link prediction on a heavily sampled dataset can still yield meaningful results.

I. INTRODUCTION

The volume of information passing through today's data systems often outpaces our ability to compute meaningful results. Storing all of this data quickly stresses the limits of databases and storage systems. In the case of predictive analytics, where data is stored in memory, this increase in data volume affects the ability to perform meaningful computations. Up-front data reduction, such as sampling or filtering data, can help alleviate some of these issues. However, these data reduction techniques may impact the quality of results. The focus of this study is to explore the effects of a variety of sampling techniques on a specific predictive analytic and type of data: link prediction on large graphs.

Sampling techniques on large graphs have been investigated in numerous studies. However, the performance of these techniques has not been measured in the context of predictive analytics.
For an introduction to the variety of sampling techniques and how they affect various graph features such as degree distributions and singular value distributions, we refer the reader to [1]. In [2], the authors present a series of sampling techniques to perform frequent subgraph mining. Sampling techniques such as random walk and jump methods are discussed in [3] and have been applied to subgraph mining. Vertex-based sampling methods, such as random node sampling [1], [2] or wedge sampling [4], have also been shown to preserve various graph features including the graph clustering coefficient.

Distribution A: Public Release. This work is sponsored by the Assistant Secretary of Defense for Research & Engineering under Air Force Contract FA C. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Of growing interest in the graph community is the application of anticipatory analytics to large graphs such as those that describe social or computer networks. One particularly interesting anticipatory analytic is link prediction [5], where we predict whether two entities in a graph will form a relationship in the near future. Link prediction can allow analysts to anticipate new connections and react in a proactive rather than reactive manner. In [6], the authors describe link prediction in the context of web pages. In [7], the authors describe link prediction on social networks. Link prediction is a computationally heavy task and a motivating example of the need for up-front data reduction in large graphs and datasets. In this study, we evaluate the application of link prediction techniques to a large sampled social media graph derived from Twitter.

The article is structured as follows. We first describe the parallels we have noticed between signal processing and big data in Section II, followed by a description of the analytic environment D4M in Section III.
We then describe the types of sampling methods and the link prediction methodology in Section IV. Section V describes the evaluation framework and the results obtained by applying link prediction techniques to sampled datasets. We conclude the article in Section VI.

II. SIGNAL PROCESSING AND BIG DATA

Traditional signal processing deals with operations on multidimensional analog or digital signals, often represented as arrays of floating point or integer values. Traditional signal sources include radars, sensors such as GPS or LIDAR, imagery and video. Very often, a signal processing analysis pipeline attempts to perform signal acquisition and reconstruction for signals that are sampled or in which there is missing data. To support detection algorithms, one may use techniques to improve the signal-to-noise ratio (SNR) or signal quality. A signal processing pipeline may also perform operations to extract features or patterns that can be used to characterize new signals or extract signals of interest. Often, a combination of techniques such as signal compression, domain transformations and signal reconstruction steps is used to perform these operations.

Big Data analytics, often characterized as analytics applied to datasets that strain available resources, also supports similar operations. Unlike traditional signal processing algorithms, however, these operations are performed on large quantities of structured and/or unstructured data. Some of the common operations include error and anomaly detection to determine corrupted portions of a dataset, data compression, clustering or community detection, pattern recognition and model construction. Fundamentally, these operations are similar to signal processing operations but require a suitable representation for data that can be used across a wide variety of datasets and is as general as numeric arrays. Correlating heterogeneous sources of information requires a set of tools, data structures and mathematical foundations.

III. D4M AND ASSOCIATIVE ARRAYS

D4M, or the Dynamic Distributed Dimensional Data Model, is a software package developed at MIT Lincoln Laboratory that enables researchers to prototype solutions and analytics for large datasets. The D4M package contains support for a mathematical data type called associative arrays, a schema to convert structured and unstructured datasets to associative arrays, and a set of software tools to support mathematics on associative arrays and database operations.

A. Associative Arrays

Associative arrays are used to describe the relationship between multidimensional entities using numeric/string keys and numeric/string values. Associative arrays provide a generalization of sparse matrices. Formally, an associative array $A$ is a map from $d$ sets of keys $K_1 \times K_2 \times \dots \times K_d$ to a value set $V$ with a semi-ring structure

$$A : K_1 \times K_2 \times \dots \times K_d \rightarrow V$$

where $(V, \oplus, \otimes, 0, 1)$ is a semi-ring with addition operator $\oplus$, multiplication operator $\otimes$, additive-identity/multiplicative-annihilator 0, and multiplicative-identity 1. Furthermore, associative arrays have a finite number of non-zero values, which means their support $\mathrm{supp}(A) = A^{-1}(V \setminus \{0\})$ is finite. While associative arrays can have any number of dimensions, a common technique for using associative arrays in databases is to project the $d$-dimensional set into two dimensions as in

$$A : K_1 \times \{K_2 \cup K_3 \cup \dots \cup K_d\} \rightarrow V$$

where the $\cup$ operation indicates a union operation.
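To make the definition concrete, the following is a minimal sketch of a two-dimensional associative array in Python. It is not the D4M API: the class name `Assoc` and its methods are hypothetical, and the sketch assumes the ordinary (+, ×) semi-ring over numbers. Only non-zero values are stored, so the support stays finite, and rows and columns are addressed by string keys rather than integer indices.

```python
from collections import defaultdict


class Assoc:
    """Toy 2D associative array: (row_key, col_key) -> non-zero number.

    Hypothetical illustration only; this is not the D4M implementation.
    """

    def __init__(self, triples=()):
        self.data = {}
        for row, col, val in triples:
            self._accumulate(row, col, val)

    def _accumulate(self, row, col, val):
        # Semi-ring addition; entries that sum to the additive identity
        # are dropped so that only the support is stored.
        total = self.data.get((row, col), 0) + val
        if total == 0:
            self.data.pop((row, col), None)
        else:
            self.data[(row, col)] = total

    def __add__(self, other):
        result = Assoc()
        for (r, c), v in list(self.data.items()) + list(other.data.items()):
            result._accumulate(r, c, v)
        return result

    def transpose(self):
        return Assoc((c, r, v) for (r, c), v in self.data.items())

    def __matmul__(self, other):
        # Sparse matrix product: sum over the shared key k of A[r, k] * B[k, c].
        by_row = defaultdict(list)
        for (k, c), v in other.data.items():
            by_row[k].append((c, v))
        result = Assoc()
        for (r, k), v in self.data.items():
            for c, w in by_row.get(k, ()):
                result._accumulate(r, c, v * w)
        return result


if __name__ == "__main__":
    A = Assoc([("alice", "bob", 1), ("alice", "carol", 1), ("bob", "carol", 1)])
    sym = A + A.transpose()   # symmetric adjacency-style array
    two_hop = sym @ sym       # entry (u, v) counts paths of length two
    print(sorted(two_hop.data.items()))
```

Because keys are arbitrary strings, the same structure can hold a row per record and a column per field-value pair, which is the two-dimensional form that the database projection relies on.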
In this 2D representation, $K_1$ is often referred to as the row key and $\{K_2 \cup K_3 \cup \dots \cup K_d\}$ is referred to as the column key. A more thorough description of associative arrays can be found in [8], and examples of associative array operations in [9].

B. D4M Schema

The D4M schema [10] provides a four-table solution that can be used to represent many datasets of interest. Each table in this schema is represented by an associative array as described in the previous section. The schema allows arbitrary datasets, which may be structured or unstructured, to be represented in a common format so that linear algebraic operations can be applied to the data. The D4M schema converts a dense dataset into a sparse associative array by exploding each data entry into an element of an associative array in which each unique column-value pair is a column. This schema has been applied to a wide variety of datasets such as HPC scheduler logs [11], social media, and medical data [12].

C. D4M Software Tools

Associative arrays have a one-to-one relationship with many database data models such as key-value stores. The schema described in the previous section can convert an arbitrary dataset into an associative array that resembles key-value pairs. Since a majority of large datasets are indexed and stored in databases, D4M supports connectivity with a variety of back-end database technologies. Operations on database tables (which are represented as remote associative arrays) are similar to operations on local associative arrays. Currently, the D4M environment is supported in MATLAB and GNU Octave. We are currently developing a binding for the Julia language [13]. A more thorough description of the D4M software environment is provided in [12].

IV. SAMPLING AND LINK PREDICTION

In this section, we describe the sampling methods applied to large graphs and the techniques used for link prediction.

A. Sampling Methods

To best cover the different styles of sampling, we have selected sampling methods that fall into a number of categories. These include edge sampling methods, where edges are selected by a predetermined criterion; snowball sampling methods, where algorithms start with a small set of vertices and expand until a target graph size is reached; exploration algorithms, where an initial set of vertices and edges is selected and random walk techniques are applied to determine the sampled graph; and vertex sampling methods, where a set of vertices is selected and the sample graph contains the edges incident on these vertices. Specifically, we used the following sampling methods for a sampling factor $S$ applied to a graph $G$:

Random Edge Sampling: In this method, any edge in the graph is kept with a probability $p$, where $p = 1/S$. Thus the overall sampled graph size will be approximately $|G| \cdot p$.

Popularity Based Sampling: In this method, an edge is kept based on the inverse of the popularity, or degree, of the vertices it connects. Specifically, the probability of an edge being kept after sampling is proportional to $1/(d_i d_j)$, where $d_i$ and $d_j$ correspond to the degrees of vertices $v_i$ and $v_j$ respectively.

Random Area Sampling: In this sampling method, we begin with a random set of seed vertices. A vertex is chosen with probability $p/d_{avg}$, where $d_{avg}$ is the average degree of a vertex in the graph and $p$ is the inverse of the sampling factor $S$. The sampled graph consists of all edges and vertices connected to the selected vertices.

Random Jump Sampling: In this method, we randomly select a vertex $v$ in our graph. Until we reach the desired number of edges (approximately the number of edges in $G$ divided by $S$), we either randomly select an edge connected to $v$ and continue the walk or, with probability $p_{jump}$, jump to another vertex selected from a uniform distribution. Commonly, the value of $p_{jump}$ is selected to be

Wedge Sampling: In this method, we randomly select a number of vertices from a uniform distribution. To create a sampled wedge, we select two edges uniformly that are incident on the selected vertices.

B. Link Prediction

We perform link prediction using a number of similarity metrics applied to graphs. In such methods, we compute statistics of a graph up to a certain time point and score all possible non-observed edges in the graph. Predicting new edges can be done by selecting those new edges that satisfy a similarity threshold. In this study, we selected the top-$N$ scored edges instead. The value of $N$ is determined by computing the increase in the size of the graph from time step $t$ to time step $t + 1$. Thus, $N = |G(t + 1)| - |G(t)|$, where $|G(t)|$ is the number of edges in $G$ at time $t$.

Suppose that at time $t$ vertices $v_i$ and $v_j$ are not connected. We can compute the similarity, $Sim(v_i, v_j)$, between $v_i$ and $v_j$ as proportional to the number of common features they share. These features can be captured by local similarity metrics, global similarity metrics, or quasi-local metrics [5]. In our studies, we have found that local similarity metrics perform well [14].

The first local similarity metric is the common neighbors statistic. Common neighbors counts the number of neighbors two vertices have in common and can be found by computing the inner product of the graph's adjacency matrix with itself. The insight behind this method is that two vertices with many common neighbors have a high likelihood of being connected in the future. Using this measure, we can compute the similarity between $v_i$ and $v_j$ as $Sim(v_i, v_j) = |\Gamma(v_i) \cap \Gamma(v_j)|$, where $\Gamma(v_i)$ and $\Gamma(v_j)$ indicate the sets of neighbors of $v_i$ and $v_j$ respectively.

The second similarity metric is the Jaccard coefficient.
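Before turning to the remaining metrics, the steps described so far (random edge sampling with $p = 1/S$, common neighbors scoring of non-observed edges, and top-$N$ selection) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' D4M implementation: all function names are hypothetical, the graph is an undirected edge list, and candidate pairs are enumerated by brute force, which is only practical for small graphs (the matrix inner product described above is the scalable route).

```python
import random
from itertools import combinations


def random_edge_sample(edges, sampling_factor, rng):
    """Keep each edge independently with probability p = 1/S."""
    p = 1.0 / sampling_factor
    return [edge for edge in edges if rng.random() < p]


def neighbor_sets(edges):
    """Build Gamma(v), the neighbor set of each vertex, from an undirected edge list."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def common_neighbor_scores(edges):
    """Score each non-observed pair (u, v) by |Gamma(u) & Gamma(v)|; zero scores are omitted."""
    neighbors = neighbor_sets(edges)
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # edge already observed, so it is not a candidate
        shared = len(neighbors[u] & neighbors[v])
        if shared > 0:
            scores[(u, v)] = shared
    return scores


def top_n_predictions(scores, n):
    """Return the n highest-scoring candidate edges, breaking ties by vertex label."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:n]]


if __name__ == "__main__":
    edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
    sampled = random_edge_sample(edges, sampling_factor=2, rng=random.Random(0))
    print(top_n_predictions(common_neighbor_scores(sampled), n=2))
```

In the study's setting, `n` would be chosen as the growth in edge count between consecutive time steps, as defined above.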
Jaccard is an extension of the common neighbors method and normalizes the common neighbors score by taking into account the degrees of each vertex.

The third similarity metric is the low-rank approximation. In this method, we compute a rank-$k$ matrix approximation of the original graph adjacency matrix through a singular value decomposition of the original matrix. Potential new edges are then scored on the inner product of the low-rank approximation. This method is akin to the common neighbors method being applied to a low-rank approximation of the original adjacency matrix.

The final similarity metric used in this study is the Adamic-Adar measure [15]. The Adamic-Adar similarity metric increases the likelihood that an edge is predicted when a non-observed edge connects two vertices that are connected to a number of unique vertices. For example, in a social media graph we would rate two users closer together if they were both connected to a unique user or set of users. The similarity of $v_i$ and $v_j$ using this measure is given as

$$Sim(v_i, v_j) = \sum_{z \in \Gamma(v_i) \cap \Gamma(v_j)} \frac{1}{\log_{10} |\Gamma(z)|}$$

where $\Gamma(v_i)$ and $\Gamma(v_j)$ refer to the sets of vertices that are neighbors of $v_i$ and $v_j$ respectively. In the following section, we describe the evaluation framework and results.

V. EVALUATION AND RESULTS

In this section, we describe the evaluation framework and the results obtained when performing link prediction on sampled data. Twitter data was collected from the public API and converted to associative arrays using the D4M tools described in Section III.

A. Evaluation Framework

Each collected graph is temporally separated into 10 training steps and 5 testing steps. We apply the different sampling techniques to the training graphs and evaluate the quality of links predicted when using the different similarity metrics discussed in Section IV-B. Details of the process are discussed below:

1) We first sample the original graph using the techniques described in Section IV-A.
Sampling is performed on the integrated graph over 10 time periods. For each of the five sampling methods, we sample the graphs at rates where the number of edges is reduced by a power of two from 1/2 to 1/64. In the random jump algorithm, we set the number of seed vertices to be approximately 0.15E, where E is the number of edges in the sampled graph.
2) We perform link prediction using the similarity metrics discussed in the previous section on the sampled graphs. This process first takes the sampled graphs and evaluates the different prediction metrics for new potential non-observed links, and then predicts the top-N scored links. In the low-rank approximation, we use a rank-10 approximation of the original adjacency matrix of the graph, which is computed via a singular value decomposition.
3) We evaluate the predictive performance by comparing the predicted links against the ground truth determined from the testing data set. Predictive performance is calculated after controlling for edges that are removed by sampling and is limited to non-observed edges.
4) We evaluate the probability of detection and false alarm for predicted links at varying future time steps.

Each sampling and link prediction method runs on a square, symmetric adjacency matrix representation of the graph. The schema of Section III-B stores Twitter data in an incidence matrix, $E$. In order to construct the adjacency matrix for each time step, we first extract all edges that occur during that time step (an edge is a row in the incidence matrix). The incidence matrix contains many vertex types in its columns, such as username, location, time, and words used in the Tweet. Construction of the adjacency matrix from the incidence matrix can be done by the relation $A = W^T L$, where $A$ is the adjacency matrix and $W$ and $L$ are incidence matrix representations of a subset of the columns in $E$. In order to make the adjacency matrix square, we simply add the transpose of $A$ to itself such that $A = A + A^T$. This process is described in Figure 1.

Fig. 1. Generating the adjacency matrix from the incidence matrix representation of the dataset. First, a subset of columns we wish to sample is extracted from $E$, denoted by $L$ and $W$. Next, the adjacency matrix is created as $A = W^T L$. Finally, the adjacency matrix is made square by adding the transpose.

Most sampling methods on an adjacency matrix can be realized by randomly selecting row or column labels, and then selecting entries in those rows or columns. For example, a random area sample is performed by selecting random row/column indices, and then including the entire row and column with those indices.

Graph generation, sampling and prediction were done at the Massachusetts Institute of Technology Lincoln Laboratory using the MIT SuperCloud architecture [16]. The system consists of approximately 300 nodes with dual-socket 16-core AMD processors and 128 GB of RAM per node.

B. Results and Discussion

The probability of detection and false alarm rates of predicted links allow us to quickly determine the quality of predicted edges. In presenting the results, we use the common area under the receiver operating characteristic (ROC) curve (AUC). In this measurement, an AUC of 1 indicates perfect detection and an AUC of 0.5 indicates no discrimination power. For each of the 10 instantiations of the graph, performance is averaged over all samples. In addition, the error bars in the plots show performance at the 25th and 75th percentiles.

The results from performing link prediction are shown in Figure 2. Prediction performance tends to decrease as the sampling factor increases, as expected. However, some interesting results are observed. Random edge sampling and random walk sampling consistently perform better than other sampling methods.
We believe that this may be due to the fact that both of these methods favor the selection of edges connected to vertices with high degree. This notion is reinforced by the relatively low performance of the popularity based sampling approach, which essentially normalizes the degree distribution of the vertices. For all sampling methods, the power of link prediction degrades greatly at the 1/64 sampling rate, to an AUC of nearly 0.5. This is likely due to the high number of isolated vertices. For both random edge sampling and random walk sampling, the AUC remains above 0.9 until the sampling rate falls to approximately 25%. This result indicates that nearly 75% of a dataset can be removed at the cost of roughly a 10% reduction in overall link prediction performance. Such a reduction may be very helpful for big data systems where approximate prediction results are sufficient.

The results for link prediction through common neighbors, the Jaccard coefficient and Adamic-Adar are very similar at all sampling levels. This is likely due to the fact that, in all three, potential new edges are scored by a similarity metric that is based on the number of common neighbors. Link prediction through the low-rank approximation method, however, has vastly different results. This is likely due to the rank-10 approximation. Changing the rank should improve these results.

There are also interesting aspects of the running time results. For all the neighbor-based methods, the running time under random edge sampling seems to decrease at a slower rate than under the other sampling techniques. This may be due to the smaller number of isolated vertices left behind by the random sampling method. The low-rank approximation has very irregular timings, which we believe may be due to the effect of the different sampling methods on the distribution of eigenvalues. This change in eigenvalue distribution may cause the eigensolver to converge at very different rates under different sampling methods.

Fig. 2. Performance and time measurements for various link prediction methods applied to sampled Twitter data. The rows describe the results for the common neighbors, Jaccard's coefficient, low-rank, and Adamic-Adar methods, respectively. The first column describes the area under the ROC curve (AUC) for predicting links one time step out. The second column describes the AUC for predicting links five time steps out. The third column describes the overall time taken to perform link prediction under various sampling techniques.

VI. CONCLUSIONS

In this article, we applied a range of sampling methods to a real social media graph derived from Twitter and evaluated their effect on link prediction. From our experiments, we believe that sampling large datasets has significant potential to greatly reduce the run time and computational footprint of link prediction algorithms while still maintaining adequate prediction performance. As future work, we are interested in expanding the signal processing and big data parallels by incorporating filtering operations. Further, we would like to evaluate the effect of sampling on other types of analytics and look at whether it is possible to merge faster, less accurate samples such as those at higher sampling rates.

ACKNOWLEDGMENT

The authors wish to thank Jeremy Kepner and Albert Reuther for their assistance in developing the sampling and link prediction concepts. The authors also wish to thank the LLGrid team at MIT Lincoln Laboratory for their support in performing the experiments.

REFERENCES

[1] J. Leskovec and C. Faloutsos, "Sampling from large graphs," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.
[2] R. Zou and L. B. Holder, "Frequent subgraph mining on a single large graph using sampling techniques," in Proceedings of the Eighth Workshop on Mining and Learning with Graphs. ACM, 2010.
[3] M. Al Hasan and M. J. Zaki, "Output space sampling for graph patterns," Proceedings of the VLDB Endowment, vol. 2, no. 1.
[4] C. Seshadhri, A. Pinar, and T. G. Kolda, "Triadic measures on graphs: The power of wedge sampling," in Proceedings of the SIAM Conference on Data Mining (SDM). SIAM.
[5] L. Lü and T. Zhou, "Link prediction in complex networks: A survey," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6.
[6] B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller, "Link prediction in relational data," in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. MIT Press, 2004.
[7] D. Wang, D. Pedreschi, C. Song, F. Giannotti, and A.-L. Barabasi, "Human mobility, social ties, and link prediction," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2011.
[8] J. Kepner, J. Chaidez, V. Gadepally, and H. Jansen, "Associative arrays: Unified mathematics for spreadsheets, databases, matrices, and graphs," New England Database Day.
[9] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and J. Kepner, "Graphulo: Linear algebra graph kernels for NoSQL databases," in Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International. IEEE, 2015.
[10] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz et al., "Dynamic distributed dimensional data model (D4M) database and computation system," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012.
[11] V. Gadepally and J. Kepner, "Big data dimensional analysis," in Proc. IEEE High Performance Extreme Computing.
[12] V. Gadepally, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, L. Edwards, M. Hubbell, P. Michaleas, J. Mullen et al., "D4M: Bringing associative arrays to database engines," IEEE High Performance Extreme Computing.
[13] J. Bezanson, S. Karpinski, V. B. Shah, and A. Edelman, "Julia: A fast dynamic language for technical computing," arXiv preprint.
[14] L. Edwards, L. Johnson, M. Milosavljevic, V. Gadepally, and B. A. Miller, "Sampling large graphs for anticipatory analytics," IEEE High Performance Extreme Computing.
[15] L. A. Adamic and E. Adar, "Friends and neighbors on the web," Social Networks, vol. 25, no. 3.
[16] A. Reuther, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, M. Hubbell, P. Michaleas, J. Mullen, A. Prout et al., "LLSuperCloud: Sharing HPC systems for diverse rapid prototyping," in High Performance Extreme Computing Conference (HPEC), 2013 IEEE. IEEE, 2013, pp. 1-6.


Scott Philips, Edward Kao, Michael Yee and Christian Anderson. Graph Exploitation Symposium August 9 th 2011 Activity-Based Community Detection Scott Philips, Edward Kao, Michael Yee and Christian Anderson Graph Exploitation Symposium August 9 th 2011 23-1 This work is sponsored by the Office of Naval Research

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Hypergraph Exploitation for Data Sciences

Hypergraph Exploitation for Data Sciences Photos placed in horizontal position with even amount of white space between photos and header Hypergraph Exploitation for Data Sciences Photos placed in horizontal position with even amount of white space

More information

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as

More information
