Hadoop Based Link Prediction Performance Analysis

Size: px
Start display at page:

Download "Hadoop Based Link Prediction Performance Analysis"

Transcription

1 Hadoop Based Link Prediction Performance Analysis Yuxiao Dong, Casey Robinson, Jian Xu Department of Computer Science and Engineering University of Notre Dame Notre Dame, IN 46556, USA Abstract Link prediction is an important problem in social network analysis and has been applied in a variety of fields. Link prediction aims to estimate the likelihood of the existence of links between nodes by the known network structure. The time complexity of link prediction algorithms in huge-scale networks remains unexplored and unsolved, especially for sparse networks. In this project, we will explore how parallel computing speeds up link prediction in huge-scale networks. We implemented similarity based link prediction algorithms based on MapReduce, which have the time complexity of O(n) in sparse networks. We analyzed the performance of our algorithms on the Data Intensive Science Cluster at University of Notre Dame. Weevaluate the performance with different configurations, monitor the resource utilization of the distributed computation, and optimize accordingly. After analyzing the efficiency with different configurations, we present the fastest approach of performing parallelized link prediction, which is particularly suited for real-world big data. Index Terms Social network analysis, Link prediction, MapReduce, Parallelization I. INTRODUCTION Social networks are an important part of our society. These networks are in constant flux and understanding how nodes relate is of great interest. Barabási demonstrated that networks expand continuously by the addition of new vertices which preferentially attach to existing, well connected verticies [1]. Many researchers have studied the network evolution and modeling the dynamic network structure [2], [3], [4]. Link prediction is used to understand and identify the mechanisms of network growth and evolution. Link prediction aims to estimate the likelihood of the existence of links between nodes based on the known network structure information. The classical problem of link prediction is the prediction of existing yet unknown links - called missing links. Most of previous work on link prediction employs cross-validation by splitting the data into two sets: training and testing. Consider this motivating example. People in the real world meet new friends. The relationship is represented by the appearance of a new connection in his or her social graph. Through the new relationship, both people s social circle enlarges. Predicting these relationships before they are formed is vital to the success of a social networking service. Link prediction attempts to address the issue of discovering future connections. We can experience the results of link prediction through the friend recommendation engine on Facebook. However, there are now more than 1 billion users on Facebook. The massive scale is an impediment to successful prediction. A scalable and efficient solution is needed to accurately recommend friends. Challenge: The major challenge of link prediction stems from the sparse, yet gigantic, nature of social networks. A sparse network implies that the existing links between nodes represent a small fraction of the total possible links. The data size requires either years or many computers. To solve the strongly unbalanced data between unexisting links and existing links, we can undersample the holdout test set [5] or only sample negative instances in the test set [6]. Modifying the sampling method changes the data distribution so that it no longer presents the same challenges at the real-world distribution. Since the algorithm no longer reflects the capabilities and limitations of the prediction model, the results are uninterpretable [7]. Thus, parallelization is the only feasable and meaningful method for studying link formation and consequently providing the motivation for our work. Three designed patterns, based on MapReduce, have been proposed to speed up network analysis algorithms[8]. PEGA- SUS is a MapReduce based framework for graph mining which implements most of the classical graph mining algorithms[9]. Although a large amount research has been conducted on MapReduce based graph mining, no MapReduce framework exists for link prediction. Our goal in this project is to design, implement, and analyze the performance of similarity based link prediction algorithms on Data Intensive Science Cluster at University of Notre Dame. The experimental results on several large-scale datasets of variety network types show that the MapReduce based link prediction algorithm is more effective and scalable than traditional ones and its running time decreases with more compute units for appropriately sized data chunks. More work will be done to improve performance and gain further insight into these findings. II. RELATED WORK Our work is related to link prediction and graph mining in huge-scale networks. Link prediction has attracted considerable attention in recent years both from computer science and physics community. Existing work can be classified into two categories: unsupervised methods [10] and supervised methods [7], [11], [12], [13]. Most unsupervised

2 link prediction algorithms are based on a similarity measure between nodes of graph. A seminal work by Liben-Nowell and Kleinberg for unsupervised methods addresses the problem from an algorithmic point of view. The authors investigate how different proximity features can be exploited to predict the occurrence of new links in social networks [10]. For the supervised methods, Lichtenwalter et al. motivated the use of a binary classification framework and vertex collocation profiles [7], [11]. Place features can be exploited into the supervised model for link prediction on location-based social networks [12]. To recommend friends on Facebook, a supervised random walk is designed for link prediction and recommendation [13]. Existing work focuses on link prediction in a particular network without consideration for general parallelized design of large-scale networks. Recently, the focus of graph mining is huge-scale networks. In 2004, Google presented its MapReduce framework for large-scale data indexing and mining [14], which leads the direction of analyzing big data. Three design patterns, bsed on MapReduce, to speed up nework analysis algorithms have been proposed [8]. Moreover, Kang et al. propose two frameworks for huge-scale graph management and analysis: one is GBASE [15]: a scalable and general graph management and mining system based on MapReduce, the other one is a MapReduce based spectral analysis system in billion-scale graphs [16]. Then Yang et al. propose a Self Evolving Distributed Graph Management Environment for partition management of large graphs [17]. The existing work give solutions for general framework for graph mining in large-scale graph. Here we use MapReduce based methods to implement predicting links in huge-scale networks. III. OVERVIEW In this project, we demonstrate how link prediction algorithms benefit from parallel computing. We evaluate the performance of our Hadoop implementation on Data Intensive Science Cluster (DISC) at University of Notre Dame. We design the parallelized strategy for link prediction algorithms using MapReduce model, test the the validity and performance of our MapReduce implementations on DISC, and demonstrate how the number of Mapper tasks and Reducer tasks affect the overall performance. In addition, we explore how graph properties influence the performance of the link prediction problem by testing graphs of different size, type, and clustering coefficient. Finally, we seek to analyze the microscopic details of our implementations. We will monitor the resource utilization of the program, find computation bottlenecks, and attempt to improve our implementation. We will monitor the load balance by comparing completion time in different nodes. Communication time between nodes is an important factor which will also be explored. Big data is divided into small parts for distributed computation and parallel computation can be utilized to reduce the link prediction time. We disover issues that will affect the performance and propose the best approach of performing Fig. 1: The core MapReduce process for our algorithms. parallelized link prediction for big data using DISC. The core parallelized solution for our framework is shown in Figure 1. DISC contains 26 nodes, consisting of 32 GB RAM, 12 x 2 TB SATA disks, 2 x 8-core Intel Xeon E GHz, and Gigabit Ethernet, which is sufficient for big-data manipulation. The software required is Hadoop and the Java runtime environment, both are already installed. IV. EVALUATION SETUP The main thrust of our research is to investigate the performance of link prediction algorithms. Our evaluation approach is divided into three parts. Each part will be focused on reducing the number of variables to explore at the next level down. We will start with a macroscop analysis by treating our link prediction implementation as a black box. The lowest level looks at individual machines and attempts to find performance bottlenecks. We use five data sets as the basis for our evaluation. The data sets range from small (12,000 nodes, 237,000 connections, and 3MB) to large (4.8M nodes, 68M connections, and 1GB). The data sets each represent different types of graphs: citation networks, collaboration networks, social networks, and web networks. See Appendix A for a detailed description of each data set. We analyze the performance at a variety of levels; each providing a unique perspective of the system. To evaluate the macroscopic behavior of our link predictor we wrapped the entire system with a timer; allowing us to obtain a measurement of the completion time of each run. We chose to use this metric for performance because it is the most relevant to an end user. At the next level down, we measure the running time of each hadoop submission. Our implementation consists of ten consecutive Hadoop jobs as demonstrated in Figure 2. The breakdown of time provided at this level allows us to focus our detailed analysis at the next level. Focused testing is important because each evaluation run lasts many hours. The lowest level we are currently exploring is the performance of the most time consuming Hadoop submissions. In this level we analyze the disk i/o, network traffic, and CPU usage. The information gained from these tests will allow us to tune the performance of our implementation to the specific machines we are using.

3 Fig. 2: Steps involved in Link Prediction V. RESULTS In this section, we examine the scalability and efficiency of our MapReduce based link prediction framework from three aspects: overall performance, graph influence, and breakdown performance. A. Overall performance The highest level analyzes the total running time while varying the number of Reducers as well as the data set under observation. Before discussing the performance of our framework, we first analyze the tradeoff of the number of reducers. Fig. 3: Total running time with different number of Reducers, for ND Web data set. Tradeoff. By varying the number of Reducers from 1 to 50 for the ND Web data set, in Figure 3 we see that the average completion time follows three distinct trends. First is the rapid decrease occuring between 1 and 7 reducers. Here we witness the benefits of parallelization: with few reducers the work is under-parallelized. In other words, each Reducer is operating at maximum throughput. The second trend is a steady state where the average time does not increase or decrease. At this stage the benefits of additional Reducers is approximately equal to the additional overhead. On the right portion of the graph, 25 to 50 reducers, we observe an increase in completion time. Here, the amount of data in each chunk is sufficently small, causing the setup time to dominate the total run time. The performance for the ND Web data set has a large variance. We attribute the unpredictable behavior to overhead of disk access and network I/O in this distributed computation platform. This overhead is important for the smaller data sets since the data processing portion of our framework is much shorter than in the large data sets. However, while the overhead of Hadoop based link prediction still exists for a larger data set, the Live Journal, the variance of performance is much smaller, which is shown in Figure 4(a). Because the Live Journal data set is of magnitudes larger than the ND Web one, more time is spent on the actual computing the scores for potential links, thus the variation caused by the overhead of distributed computing is a small percentage of the total run time. As a consequence, our implementation of Hadoop based link prediction has been shown to be suitable for big data. Scalability. We use Live Journal data set as a representative of big data to see how our approach of parallel computing can help speed up the performance. Ideally, if the overhead of distributed computation can be ignored, and the job is evenly distributed to the computing nodes, the job will complete in t = T N = T N 1, where T is the time to complete the job on a single machine, and N is the number of Reducers which is no more than the number of computers. However, as the overhead of distributed computation such as distributing the job to different computers and collecting the results through network does exist, the completion of time cannot be as good as a inversely propotional function, in other words, the power of N cannot be ideally 1, but should lie between 1 and 0. With this observation, we fit the plots with Reducer count smaller than 25 using a power distribution t = T N α, and the power α = , which meets our analysis. Note that 1 < α < 0 delivers a concave curve, the speed-up increases first rapidly then slowly as the number of Reducers increases. This indicates that if we have multiple jobs running on the same distributed computing cluster, for each job we can set a relatively small number of Reducers for optimal overall performance. For example, if we want to simultaneously run link prediction on four data sets that are of similar size as the Live Journal data set, we can set the number of Reducers R = 6 for every job, so that every job sees the benefits of parallelization, while 4 R = 24 < 25 minimizes the overhead of running multiple reducing procedures on a single machine. We also examine the speedup of our algorithm. Theoretically we expect that the number of reducers can directly indicate the degree of concurrence. Thus the theoretical overall throughout by n reducers is denoted as n T, where T is the throughout by one reducer. In Figure 4(b), we plot speedup of our algorithms. Basically, we find our algorithm shows a much better speedup than linear speedup. The experiments on Live Journal network indicate that our MapReduce based link prediction framework has excellent scalability. B. The impact of graph properties Besides the Reducer parameter, we also investigate the impact of graph properties on computation time in the highest level. We control the variables by using fixed 25 Reducers and analyze the performance on small, medium and large networks, shown in Table I.

4 (a) Total running time for Live Journal dataset. (b) Speedup on Live Journal dataset. Fig. 4: Running time and speedup. TABLE I RUNNING TIME ON DIFFERENT DATA SETS WITH 25 REDUCERS Data set Time (s) Nodes Edges HepPh Collaboration 94.9 ± ND Web 1089 ± LiveJournal 6818 ± Graph Size. From Table I we see that the time consumed for different graphs is proportional to the number of nodes in the graph, or we can say, our approach has a time complexity of O(N). The reason is that the link prediction algorithm we used is based on common neighbors. As a consequence, if two nodes do not have a common neighbor, the score of connection between these two nodes must be 0. In other words, if node a and b have a common neighbor c, then a and b must simultaneously exist in Adj[c], otherwise a and b will never coexist in the adjacency list of any node. With this fact, we only have to calculate the scores of connections in the adjacency list, rather than to calculate the scores of connections between every two nodes. Time complexity analysis. The observation that we do not have to compute scores for every potential connection significantly reduces the time complexity of our approach. If the average degree of nodes in the network is k = 2E/N, then for the adjacency list of every node, the number of pairs to deal with is k(k 1) 2, and the total numbers of scores to compute for the whole network is k(k 1)N 2. As a result, the time complexity for link prediction based on common neighbors is O(Nk 2 ), and space complexity O(N k). Barabási and Albert proposes the power law distribution of degrees in real-life networks [1], i.e., the probability that a node has a degree of k is inversely propotional to k. As a consequence, the degrees of most nodes in empirical networks are small, giving us a small average degree. Therefore, for reallife networks which are big but sparse, the time complexity of link prediction based on common neighbors is O(N). In our Hadoop based link prediction implmentation, since the data will be distributed to multiple processing units, the time complexity is O(N/U), where U is the number of computing nodes. C. Breakdown of jobs To analyze which procedures take the most time in our implementation, we breakdown the jobs for a more insightful performance analysis. Our implmentation consists of ten consecutive Hadoop jobs which are shown in Figure 2. Again we control the variables by using fixed 25 Reducers and analyze the performance of the ten consecutive procedures for small, medium and large networks, as shown in Figure 5(a), 5(b), 5(c). Heavyweight jobs. In all three of these breakdown graphs, the seventh procedure in light blue, getlpscore, and the tenth procedure in light red, getauc, occupy the majority of the time. The prominence of these two procedures is within expectation: getlpscore is the procedure that actually computes the scores for potential connections with the algorithm based on common neighbors. As analyzed above, the time complexity of getlpscore is O(N). The reason that getauc takes another big share of total time is different from that of getlpscore. In this last step, the scores of potential links that are stored on the 25 machines need to be transfered via the network to the controlling machine, during which the overhead of disk operations and network communication is heavy and non-negligible. Also, after collecting and merging the results from those 25 machines, the calculation of AUC score must be completed on the controlling machine. The calculation of AUC score further makes the procedure getauc under-parallelized, therefore this step grows in superlinear time against the network size. In a word, the larger the network, the more proportion the last step getauc will take. Procedures 1 6 can be completed in constant time, such as splitdata, or in sublinear time, so getlpscore takes a larger proportion of time compared with procedures 1 6 when dealing with bigger data sets. Lightweight jobs. Based on the analysis above, procedures 1 6 can be completed in constant time or in sublinear

5 time, which coincides with the actual experiment running time. We can read more from the breakdown of jobs. For example, in data sets HepPh Collaboration and Live Journal, getadjlist (Procedure 6, the grey bar) is slightly higher than getdegreestats (Procedure 6, the purple bar), while in ND Web, getadjlist is noticeably lower than getdegreestats. This phenomenon is due to time complexity of getadjlist is O(E), while the complexity of getdegreestats is O(N). As the average degree is proportional to E/N, we can infer from the breakdown that the average degree in ND Web data set is smaller than that of the other two data sets. Our inference can be verified that the average degree of nodes of ND Web data set ( 4.6) is indeed less than that of HepPh Collaboration ( 19.7) and that of Live Journal ( 14.2). As we can see, the breakdown graphs reflects the nature of algorithms we used as well as the intrinsic properties of networks. Last but not least, the breakdown graphs confirm our approach works better on big data sets than on small ones. The error bars in the median-sized network ND Web and in the large-sized network Live Journal are good, while in the small-sized network HepPh Collaboration the variation is not acceptable. Also, the actual two steps of performing link prediction (Procedure 7 and 10) in the small data set comprises of less than half (44%) of the total time, indicating the efficiency of Hadoop based link prediction on small data sets is unsatisfactory, while on large networks such as Live Journal the efficiency is much higher (85%). VI. CONCLUSION AND FUTURE WORK In order to improve the performance we need to dive deeper into the implementation. We will add instrumentation designed to study specific elements of the system including: the I/O and network bandwidth of Hadoop distributed file system (HDFS), and the computing performance of CPU in different nodes. We will use the data to create a breakdown of time spent and find bottlenecks. Once the bottlenecks are found we will determine possible methods for reducing the effect, consequently improving performance. To evaluate our hypothesis that the I/O and communication among distributed computing nodes is the main bottleneck, we will reduce the CPU availability and measure the effect on total run time. Though we have been running jobs on DISC day and night for more than one month and collected considerable amount of data, we do not have statistically significant data for the Live Journal data set at most configurations. Performing link prediction on big data is not easy by nature, especially for the Live Journal data set which is larger than 1GB. We will keep running jobs on DISC clusters continuously and collect as much as data as possible. REFERENCES [1] A.-L. Barabási and R. Albert, Emergence of Scaling in Random Networks, Science, [2] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, Group formation in large social networks: membership, growth, and evolution, in KDD 06, 2006, pp [3] J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins, Microscopic evolution of social networks, in KDD 08, 2008, pp [4] J. E. Hopcroft, T. Lou, and J. Tang, Who will follow you back? reciprocal relationship prediction, in CIKM 11, [5] S. S. Mohammad AI Hasan, Vineet Chaoji and M. Zaki, Link prediction using supervised learning, in Workshop on LACS of SDM 06, 2006, pp [6] C. Wang, V. Satuluri, and S. Parthasarathy, Local probabilistic models for link prediction, in ICDM 07, 2007, pp [7] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, New perspectives and methods in link prediction, in KDD 10. ACM, [8] J. Lin and M. Schatz, Design patterns for efficient graph algorithms in MapReduce, in MLG 10, [9] U. Kang, C. E. Tsourakakis, and C. Faloutsos, Pegasus: A peta-scale graph mining system implementation and observations, in ICDM 09, [10] D. Liben-Nowell and J. Kleinberg, The link prediction problem for social networks, in CIKM 03. ACM, [11] R. N. Lichtenwalter and N. V. Chawla, Vertex collocation profiles: subgraph counting for link analysis and prediction, in WWW 12, [12] A. Scellato, Salvatore. Noulas and C. Mascolo, Exploiting place features in link prediction on location-based social networks, in KDD 11. ACM, [13] L. Backstrom and J. Leskovec, Supervised random walks: predicting and recommending links in social networks, in WSDM 11, 2011, pp [14] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, in OSDI 04, 2004, pp [15] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos, GBASE: a scalable and general graph management system, in KDD 11, [16] U. Kang, B. Meeder, and C. Faloutsos, Spectral analysis for billionscale graphs: discoveries and implementation, in PAKDD 11, [17] S. Yang, X. Yan, B. Zong, and A. Khan, Towards effective partition management for large graphs, in SIGMOD 12, APPENDIX The data used for evaluation are publicly available at Stanford Network Analysis Project (SNAP). 1 In the ND Web data set, nodes represent pages from University of Notre Dame (domain nd.edu) and directed edges represent hyperlinks between them. The data was collected in 1999 by Albert, Jeong and Barabasi. Live Journal is a free on-line community with almost 10 million members; a significant fraction of these members are highly active. (For example, roughly 300,000 update their content in any given 24-hour period.) Live Journal allows members to maintain journals, individual and group blogs, and it allows people to declare which other members are their friends they belong. Arxiv HEP-PH collaboration (High Energy Physics - Phenomenology) network is from the e-print arxiv and covers scientific collaborations between authors papers submitted to High Energy Physics - Phenomenology category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes. The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arxiv, and thus represents essentially the complete history of its HEP-PH section. Arxiv HEP-PH citation (high energy physics phenomenology ) graph is from the e-print arxiv and covers all the citations within a dataset of 34,546 papers with 421,578 edges. If a paper i cites paper j, the graph contains a directed edge 1

6 (a) HepPh Collaboration dataset (b) ND Web dataset (c) Live Journal dataset Fig. 5: Running time with 25 Reducers. from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arxiv, and thus represents essentially the complete history of its HEP-PH section. The data was originally released as a part of 2003 KDD Cup. Epinions social network is a who-trust-whom online social network of a a general consumer review site Epinions.com. Members of the site can decide whether to trust each other. All the trust relationships interact and form the Web of Trust which is then combined with review ratings to determine which reviews are shown to the user.

An Empirical Analysis of Communities in Real-World Networks

An Empirical Analysis of Communities in Real-World Networks An Empirical Analysis of Communities in Real-World Networks Chuan Sheng Foo Computer Science Department Stanford University csfoo@cs.stanford.edu ABSTRACT Little work has been done on the characterization

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Link Prediction and Recommendation across Heterogeneous Social Networks

Link Prediction and Recommendation across Heterogeneous Social Networks 2012 IEEE 12th International Conference on Data Mining Link Prediction and Recommendation across Heterogeneous Social Networks Yuxiao Dong,JieTang,SenWu, Jilei Tian, Nitesh V. Chawla, Jinghai Rao, Huanhuan

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

arxiv: v1 [cs.si] 12 Jan 2019

arxiv: v1 [cs.si] 12 Jan 2019 Predicting Diffusion Reach Probabilities via Representation Learning on Social Networks Furkan Gursoy furkan.gursoy@boun.edu.tr Ahmet Onur Durahim onur.durahim@boun.edu.tr arxiv:1901.03829v1 [cs.si] 12

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks

More information

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul 1 CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

More information

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization Pedro Ribeiro (DCC/FCUP & CRACS/INESC-TEC) Part 1 Motivation and emergence of Network Science

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

The Complex Network Phenomena. and Their Origin

The Complex Network Phenomena. and Their Origin The Complex Network Phenomena and Their Origin An Annotated Bibliography ESL 33C 003180159 Instructor: Gerriet Janssen Match 18, 2004 Introduction A coupled system can be described as a complex network,

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Similarity Joins in MapReduce

Similarity Joins in MapReduce Similarity Joins in MapReduce Benjamin Coors, Kristian Hunt, and Alain Kaeslin KTH Royal Institute of Technology {coors,khunt,kaeslin}@kth.se Abstract. This paper studies how similarity joins can be implemented

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

Research Article Apriori Association Rule Algorithms using VMware Environment

Research Article Apriori Association Rule Algorithms using VMware Environment Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,

More information

A Novel Parallel Hierarchical Community Detection Method for Large Networks

A Novel Parallel Hierarchical Community Detection Method for Large Networks A Novel Parallel Hierarchical Community Detection Method for Large Networks Ping Lu Shengmei Luo Lei Hu Yunlong Lin Junyang Zou Qiwei Zhong Kuangyan Zhu Jian Lu Qiao Wang Southeast University, School of

More information

Edge Classification in Networks

Edge Classification in Networks Charu C. Aggarwal, Peixiang Zhao, and Gewen He Florida State University IBM T J Watson Research Center Edge Classification in Networks ICDE Conference, 2016 Introduction We consider in this paper the edge

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS

COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS Annales Univ. Sci. Budapest., Sect. Comp. 43 (2014) 57 68 COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS Imre Szücs (Budapest, Hungary) Attila Kiss (Budapest, Hungary) Dedicated to András

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks CS224W Final Report Emergence of Global Status Hierarchy in Social Networks Group 0: Yue Chen, Jia Ji, Yizheng Liao December 0, 202 Introduction Social network analysis provides insights into a wide range

More information

Distance Estimation for Very Large Networks using MapReduce and Network Structure Indices

Distance Estimation for Very Large Networks using MapReduce and Network Structure Indices Distance Estimation for Very Large Networks using MapReduce and Network Structure Indices ABSTRACT Hüseyin Oktay 1 University of Massachusetts hoktay@cs.umass.edu Ian Foster, University of Chicago foster@anl.gov

More information

SMCCSE: PaaS Platform for processing large amounts of social media

SMCCSE: PaaS Platform for processing large amounts of social media KSII The first International Conference on Internet (ICONI) 2011, December 2011 1 Copyright c 2011 KSII SMCCSE: PaaS Platform for processing large amounts of social media Myoungjin Kim 1, Hanku Lee 2 and

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Predict Topic Trend in Blogosphere

Predict Topic Trend in Blogosphere Predict Topic Trend in Blogosphere Jack Guo 05596882 jackguo@stanford.edu Abstract Graphical relationship among web pages has been used to rank their relative importance. In this paper, we introduce a

More information

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING Amol Jagtap ME Computer Engineering, AISSMS COE Pune, India Email: 1 amol.jagtap55@gmail.com Abstract Machine learning is a scientific discipline

More information

Link Prediction and Anomoly Detection

Link Prediction and Anomoly Detection Graphs and Networks Lecture 23 Link Prediction and Anomoly Detection Daniel A. Spielman November 19, 2013 23.1 Disclaimer These notes are not necessarily an accurate representation of what happened in

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Measurements on (Complete) Graphs: The Power of Wedge and Diamond Sampling

Measurements on (Complete) Graphs: The Power of Wedge and Diamond Sampling Measurements on (Complete) Graphs: The Power of Wedge and Diamond Sampling Tamara G. Kolda plus Grey Ballard, Todd Plantenga, Ali Pinar, C. Seshadhri Workshop on Incomplete Network Data Sandia National

More information

CS 224W Final Report Group 37

CS 224W Final Report Group 37 1 Introduction CS 224W Final Report Group 37 Aaron B. Adcock Milinda Lakkam Justin Meyer Much of the current research is being done on social networks, where the cost of an edge is almost nothing; the

More information

A Parallel Algorithm for Finding Sub-graph Isomorphism

A Parallel Algorithm for Finding Sub-graph Isomorphism CS420: Parallel Programming, Fall 2008 Final Project A Parallel Algorithm for Finding Sub-graph Isomorphism Ashish Sharma, Santosh Bahir, Sushant Narsale, Unmil Tambe Department of Computer Science, Johns

More information

Characteristics of Preferentially Attached Network Grown from. Small World

Characteristics of Preferentially Attached Network Grown from. Small World Characteristics of Preferentially Attached Network Grown from Small World Seungyoung Lee Graduate School of Innovation and Technology Management, Korea Advanced Institute of Science and Technology, Daejeon

More information

Supervised classification of law area in the legal domain

Supervised classification of law area in the legal domain AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

More information

Chapter 1. Social Media and Social Computing. October 2012 Youn-Hee Han

Chapter 1. Social Media and Social Computing. October 2012 Youn-Hee Han Chapter 1. Social Media and Social Computing October 2012 Youn-Hee Han http://link.koreatech.ac.kr 1.1 Social Media A rapid development and change of the Web and the Internet Participatory web application

More information

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS Overview of Networks Instructor: Yizhou Sun yzsun@cs.ucla.edu January 10, 2017 Overview of Information Network Analysis Network Representation Network

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based

An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based Greg Madey Computer Science & Engineering University of Notre Dame UIUC - NSF Workshop

More information

arxiv: v1 [cs.si] 8 Jun 2012

arxiv: v1 [cs.si] 8 Jun 2012 Multi-Scale Link Prediction Donghyuk Shin Si Si Inderjit S. Dhillon arxiv:1206.1891v1 [cs.si] 8 Jun 2012 Abstract The automated analysis of social networks has become an important problem due to the proliferation

More information

Graph Structure Over Time

Graph Structure Over Time Graph Structure Over Time Observing how time alters the structure of the IEEE data set Priti Kumar Computer Science Rensselaer Polytechnic Institute Troy, NY Kumarp3@rpi.edu Abstract This paper examines

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

Online Social Networks and Media

Online Social Networks and Media Online Social Networks and Media Absorbing Random Walks Link Prediction Why does the Power Method work? If a matrix R is real and symmetric, it has real eigenvalues and eigenvectors: λ, w, λ 2, w 2,, (λ

More information

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b International Conference on Artificial Intelligence and Engineering Applications (AIEA 2016) Implementation of Parallel CASINO Algorithm Based on MapReduce Li Zhang a, Yijie Shi b State key laboratory

More information

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Seunghwa Kang David A. Bader 1 A Challenge Problem Extracting a subgraph from

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop Myoungjin Kim 1, Seungho Han 1, Jongjin Jung 3, Hanku Lee 1,2,*, Okkyung Choi 2 1 Department of Internet and Multimedia Engineering,

More information

ScienceDirect A NOVEL APPROACH FOR ANALYZING THE SOCIAL NETWORK

ScienceDirect A NOVEL APPROACH FOR ANALYZING THE SOCIAL NETWORK Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 48 (2015 ) 686 691 International Conference on Intelligent Computing, Communication & Convergence (ICCC-2015) (ICCC-2014)

More information

CLASSIFICATION FOR SCALING METHODS IN DATA MINING

CLASSIFICATION FOR SCALING METHODS IN DATA MINING CLASSIFICATION FOR SCALING METHODS IN DATA MINING Eric Kyper, College of Business Administration, University of Rhode Island, Kingston, RI 02881 (401) 874-7563, ekyper@mail.uri.edu Lutz Hamel, Department

More information

A Parallel Community Detection Algorithm for Big Social Networks

A Parallel Community Detection Algorithm for Big Social Networks A Parallel Community Detection Algorithm for Big Social Networks Yathrib AlQahtani College of Computer and Information Sciences King Saud University Collage of Computing and Informatics Saudi Electronic

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

arxiv: v1 [cs.si] 8 Apr 2016

arxiv: v1 [cs.si] 8 Apr 2016 Leveraging Network Dynamics for Improved Link Prediction Alireza Hajibagheri 1, Gita Sukthankar 1, and Kiran Lakkaraju 2 1 University of Central Florida, Orlando, Florida 2 Sandia National Labs, Albuquerque,

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

Topic mash II: assortativity, resilience, link prediction CS224W

Topic mash II: assortativity, resilience, link prediction CS224W Topic mash II: assortativity, resilience, link prediction CS224W Outline Node vs. edge percolation Resilience of randomly vs. preferentially grown networks Resilience in real-world networks network resilience

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

Parallel computation performances of Serpent and Serpent 2 on KTH Parallel Dator Centrum

Parallel computation performances of Serpent and Serpent 2 on KTH Parallel Dator Centrum KTH ROYAL INSTITUTE OF TECHNOLOGY, SH2704, 9 MAY 2018 1 Parallel computation performances of Serpent and Serpent 2 on KTH Parallel Dator Centrum Belle Andrea, Pourcelot Gregoire Abstract The aim of this

More information

Social & Information Network Analysis CS 224W

Social & Information Network Analysis CS 224W Social & Information Network Analysis CS 224W Final Report Alexandre Becker Jordane Giuly Sébastien Robaszkiewicz Stanford University December 2011 1 Introduction The microblogging service Twitter today

More information

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing

More information

Epilog: Further Topics

Epilog: Further Topics Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Epilog: Further Topics Lecture: Prof. Dr. Thomas

More information

Parallel Performance Studies for a Clustering Algorithm

Parallel Performance Studies for a Clustering Algorithm Parallel Performance Studies for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland,

More information

Link prediction in multiplex bibliographical networks

Link prediction in multiplex bibliographical networks Int. J. Complex Systems in Science vol. 3(1) (2013), pp. 77 82 Link prediction in multiplex bibliographical networks Manisha Pujari 1, and Rushed Kanawati 1 1 Laboratoire d Informatique de Paris Nord (LIPN),

More information

Network embedding. Cheng Zheng

Network embedding. Cheng Zheng Network embedding Cheng Zheng Outline Problem definition Factorization based algorithms --- Laplacian Eigenmaps(NIPS, 2001) Random walk based algorithms ---DeepWalk(KDD, 2014), node2vec(kdd, 2016) Deep

More information

Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction

Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction Takuya Akiba U Tokyo Takanori Hayashi U Tokyo Nozomi Nori Kyoto

More information

Extracting Information from Complex Networks

Extracting Information from Complex Networks Extracting Information from Complex Networks 1 Complex Networks Networks that arise from modeling complex systems: relationships Social networks Biological networks Distinguish from random networks uniform

More information

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM Apache Giraph: Facebook-scale graph processing infrastructure 3/31/2014 Avery Ching, Facebook GDM Motivation Apache Giraph Inspired by Google s Pregel but runs on Hadoop Think like a vertex Maximum value

More information

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

A Parallel Evolutionary Algorithm for Discovery of Decision Rules A Parallel Evolutionary Algorithm for Discovery of Decision Rules Wojciech Kwedlo Faculty of Computer Science Technical University of Bia lystok Wiejska 45a, 15-351 Bia lystok, Poland wkwedlo@ii.pb.bialystok.pl

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Graph Exploitation Testbed

Graph Exploitation Testbed Graph Exploitation Testbed Peter Jones and Eric Robinson Graph Exploitation Symposium April 18, 2012 This work was sponsored by the Office of Naval Research under Air Force Contract FA8721-05-C-0002. Opinions,

More information

Recommendation System for Location-based Social Network CS224W Project Report

Recommendation System for Location-based Social Network CS224W Project Report Recommendation System for Location-based Social Network CS224W Project Report Group 42, Yiying Cheng, Yangru Fang, Yongqing Yuan 1 Introduction With the rapid development of mobile devices and wireless

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

The Barnes-Hut Algorithm in MapReduce

The Barnes-Hut Algorithm in MapReduce The Barnes-Hut Algorithm in MapReduce Ross Adelman radelman@gmail.com 1. INTRODUCTION For my end-of-semester project, I implemented an N-body solver in MapReduce using Hadoop. The N-body problem is a classical

More information

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing Gautam Bhat, Rajeev Kumar Singh Department of Computer Science and Engineering Shiv Nadar University Gautam Buddh Nagar,

More information

Fault Identification from Web Log Files by Pattern Discovery

Fault Identification from Web Log Files by Pattern Discovery ABSTRACT International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 2 ISSN : 2456-3307 Fault Identification from Web Log Files

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 2321-3469 COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

More information

MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping

MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping 2014 IEEE International Conference on Big Data : Fast Billion-Scale Graph Computation on a PC via Memory Mapping Zhiyuan Lin, Minsuk Kahng, Kaeser Md. Sabrin, Duen Horng (Polo) Chau Georgia Tech Atlanta,

More information

Delegated Access for Hadoop Clusters in the Cloud

Delegated Access for Hadoop Clusters in the Cloud Delegated Access for Hadoop Clusters in the Cloud David Nuñez, Isaac Agudo, and Javier Lopez Network, Information and Computer Security Laboratory (NICS Lab) Universidad de Málaga, Spain Email: dnunez@lcc.uma.es

More information

Content-based Modeling and Prediction of Information Dissemination

Content-based Modeling and Prediction of Information Dissemination Content-based Modeling and Prediction of Information Dissemination Kathy Macropol Department of Computer Science University of California Santa Barbara, CA 9316 USA kpm@cs.ucsb.edu Ambuj Singh Department

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

The Oracle Database Appliance I/O and Performance Architecture

The Oracle Database Appliance I/O and Performance Architecture Simple Reliable Affordable The Oracle Database Appliance I/O and Performance Architecture Tammy Bednar, Sr. Principal Product Manager, ODA 1 Copyright 2012, Oracle and/or its affiliates. All rights reserved.

More information

SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS

SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS Ren Wang, Andong Wang, Talat Iqbal Syed and Osmar R. Zaïane Department of Computing Science, University of Alberta, Canada ABSTRACT

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

CHAPTER 4 ROUND ROBIN PARTITIONING

CHAPTER 4 ROUND ROBIN PARTITIONING 79 CHAPTER 4 ROUND ROBIN PARTITIONING 4.1 INTRODUCTION The Hadoop Distributed File System (HDFS) is constructed to store immensely colossal data sets accurately and to send those data sets at huge bandwidth

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Analyzing Dshield Logs Using Fully Automatic Cross-Associations

Analyzing Dshield Logs Using Fully Automatic Cross-Associations Analyzing Dshield Logs Using Fully Automatic Cross-Associations Anh Le 1 1 Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA, 92697, USA anh.le@uci.edu

More information