Improvement of path analysis algorithm in social networks based on HBase


J Comb Optim (2014) 28: DOI /s z

Yan Qiang, Bo Pei, Weili Wu, Juanjuan Zhao, Xiaolong Zhang, Yue Li, Lidong Wu

Published online: 17 November 2013. Springer Science+Business Media New York 2013

Y. Qiang, B. Pei, J. Zhao, W. Wu, X. Zhang (B), Y. Li
College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
xiaolong.zhang@gmail.com
Y. Qiang: qiangyan@tyut.edu.cn
B. Pei: peibo0332@tyut.edu.cn
J. Zhao: zhaojuanjuan@tyut.edu.cn98
Y. Li: ly0158@tyut.edu.cn

W. Wu, L. Wu
Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA
weiliwu@utdallas.edu
L. Wu: Lidong.wu@utdallas.edu

X. Zhang
College of Information Sciences and Technology, Pennsylvania State University, State College, PA, USA

Abstract When a social network reaches hundreds of millions of users, the analysis of data in social network services becomes very important. Understanding how nodes interconnect in large graphs is an essential problem in many fields. In order to find connecting nodes between two nodes or two groups of source nodes in huge graphs, we propose a parallelized data-mining algorithm that finds the shortest path between nodes in a social network, based on the HBase distributed key/value store. Our algorithm

can achieve the shortest path among different nodes in a network in a parallel environment. We first analyze the social network model with this algorithm, and then optimize the output from the cloud platform by using the intermediary degree (betweenness) and degree centrality algorithms. Finally, with a simulated social network, we validate the efficiency of the proposed algorithm. The experimental results indicate that our algorithm can improve the efficiency of parallel breadth-first search (BFS).

Keywords Social networks · HBase · Parallel BFS · The K-shortest paths · Intermediary degrees

1 Introduction

Social network services (SNS) make it possible to connect people who share interests and activities across political, economic, and geographic borders. We can interact with each other through virtual connections. However, it is not easy to analyze SNS data. In particular, user growth rapidly increases the data scale of popular SNS such as Facebook, and this poses challenges for effective data analysis. Traditional data analysis methods often target small-scale datasets and are inefficient on large-scale datasets such as SNS data. For instance, to know how any two nodes in a social network are connected and how close they are, we have to compute the shortest path between these two nodes. Because of the complex internal dependencies among network nodes, traditional methods based on the relational database model are not suitable for large social networks. To address this problem, researchers have proposed some new methods. We advocate NoSQL (Lakshman and Malik 2010) and graph databases (GDB) for this problem. Unlike traditional relational databases, these systems can store the topological structure of a network, which is what is needed to compute the shortest path. Recently, many researchers have begun to pay attention to cloud database techniques. Google's Bigtable (Chang et al. 2006) and Apache HBase (Taylor 2010) are representative cloud databases.

HBase is an open source, non-relational, column-oriented, distributed database. It is developed as part of the Apache Software Foundation's Apache Hadoop (Škrabálek et al. 2013) project and runs on top of the Hadoop Distributed File System (HDFS). Google's Bigtable is likewise a structured distributed data storage system, built on the Google File System. Both can handle large datasets. Based on HBase, we propose an algorithm that combines ranking methods and data clustering. First, we construct a set of paths between any two nodes in a social network, and then we rank the paths based on the importance of nodes and on user requirements.

The rest of this paper is structured as follows. Section 2 reviews previous work on shortest paths. Section 3 describes the path algorithms based on HBase in more detail, covering HBase itself, parallel BFS to construct a single-source shortest path tree, K-shortest paths, and optimization sorting. We describe the experiments and discuss the experimental results of our work in Sects. 4 and 5. Section 6 presents our conclusions.

2 Previous works

The small world network, proposed by Watts and Strogatz (1998), has been extensively studied. They argued that small world networks mathematically follow the power law distribution. This theory is also called the six degrees of separation theory (Missen 2008). According to this theory, although most nodes in a network are not connected directly, almost all nodes can be reached within a small number of hops. In a small world network, a person is treated as a node, and a connection between two nodes means that the two people know each other. Thus, the small world network reflects the small world phenomenon that strangers can be connected through people they mutually know. Barabási and Albert (1999), as well as Barabási et al. (1999), showed that the power law can be used to simulate small world networks. Gu et al. (2013) proposed a generalized small world network, which contains several small world network models, and proved mathematically that a generalized small-world network with a given expected number of edges has a large clustering coefficient and a small diameter. Lu et al. (2014) analyzed efficient influence spread estimation for influence maximization under the linear threshold model.

GDBs are now a viable alternative to relational database management systems (RDBMS). Social networking and recommendation engines are examples of applications whose data can be represented much more naturally as graphs. The social graph leverages information across a range of networks in order to quantify the relationships between individuals. To compute the shortest path in social networks, it is necessary to find an effective way to store network-oriented data. Unlike relational databases, which organize data as tables, a graph database's storage style allows more agile queries. Neo4j (Hoque and Gupta 2012) is an open-source, high-performance, enterprise-grade NoSQL graph database.

With the Neo4j network storage model, algorithms such as Dijkstra's (1959) can be used to solve the shortest path problem. However, methods of this kind cannot easily be scaled up to large networks and depend strongly on hardware. With the rapid rise of Web 2.0 sites, the traditional relational database has proved inadequate, especially for ultra-large-scale, highly concurrent, purely dynamic SNS-type Web 2.0 sites, and has exposed many problems that are hard to overcome. Non-relational (NoSQL) databases have developed rapidly to address this challenge. NoSQL denotes a class of database management systems identified by their non-adherence to the relational database model. Major internet companies, such as Google, Amazon, and Facebook, have been using these systems to deal with huge quantities of data. To manage large volumes of data, NoSQL systems do not necessarily follow a fixed schema. Apache HBase and Google Bigtable are examples of NoSQL systems. While Bigtable is a proprietary technology, HBase is an open source, non-relational distributed database. It runs on top of HDFS. Compared with general relational databases, HBase is better suited to unstructured data.

With the rapid development of cloud computing, many corresponding formulations based on cloud platforms have appeared. Lin and Dyer (2010) focused on MapReduce algorithm design. They designed a breadth-first search (BFS) algorithm for the MapReduce distributed platform, called parallel BFS. This parallel algorithm can make full use of the distributed computing resources of the platform to obtain the shortest path distance from a particular node in the network to all other nodes. McCubbin

and Perozzi (2011) proposed a K-shortest paths algorithm that computes the K shortest paths by using the sort phase of the MapReduce framework.

3 Path algorithm based on HBase

To run a traditional shortest path algorithm on the Hadoop cloud platform, we first need to transform it into MapReduce form. The parallel BFS algorithm by Lin and Dyer (2010) cannot make full use of the efficient distributed storage architecture of Hadoop. Our proposed algorithm uses the HBase database and the HBase coprocessor to improve the operating efficiency of parallel BFS.

3.1 HBase

HBase is a column-oriented database management system that runs on top of HDFS. This allows HBase to read the contents of large datasets easily. Unlike a traditional database, such a non-relational database does not support a structured query language like SQL. Facebook uses HBase as its storage infrastructure because of its particular characteristics: high write throughput, low-latency random reads, elasticity, low cost, fault tolerance, and strong consistency within a data center. An HBase system consists of a set of tables, each of which contains rows and columns. Data stored in a table is accessed through the table's primary key. In HBase, multiple attributes can be combined into column families, whose data elements are then stored together. Thus, when column families are used in HBase, the table schema and the column families must be predefined and specified. Like graph data in general, social network data can be stored in HBase. Every edge is associated with a weight. The weight of a path is the sum of the weights of its edges. HBase indexes each stored value by a key that is a 3-tuple of row, column family, and column qualifier.

Using the following schema, a graph can be represented as adjacency lists:

  row := node name
  column family := name of the edge type
  column qualifier := name of the incident node
  value := edge weight (floating point)

With this schema it is straightforward to store hypergraphs by simply using different column family identifiers. To indicate which particular graph elements should be considered, we pass a list of edge types to our algorithm as one of its parameters. For any particular source, we add a column to the HBase table to store the shortest path tree:

  row := node name
  column family := the string pointer
  column qualifier := the shortest path source
  value := a pointer string

The pointer string contains two comma-separated values: the total cost of the shortest path from the node to the source, which we refer to as cost(src, n), and the next hop on the shortest path to the source.
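The two schemas above can be mimicked with nested dictionaries to see how adjacency lists and pointer strings sit side by side. The following is a minimal sketch only: `table`, `put`, `add_edge`, `set_pointer`, `parse_pointer`, and `path_to_source` are hypothetical helper names, the edge type "friend" and the column family name "pointer" are assumptions, and the dictionary is an in-memory stand-in rather than the HBase client API.

```python
# In-memory stand-in for one HBase table: row -> column family -> qualifier -> value.
# This is NOT the HBase client API; all names here are illustrative assumptions.
table = {}

def put(row, family, qualifier, value):
    """Store a value under the (row, column family, column qualifier) key."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def add_edge(u, v, weight, edge_type="friend"):
    """Adjacency-list schema: family = edge type, qualifier = incident node."""
    put(u, edge_type, v, float(weight))

def set_pointer(node, source, cost, next_hop):
    """Shortest-path-tree schema: the value is the string 'cost,next_hop'."""
    put(node, "pointer", source, f"{cost},{next_hop}")

def parse_pointer(pointer):
    """Split a pointer string into (cost(src, n), next hop)."""
    cost, next_hop = pointer.split(",", 1)
    return float(cost), next_hop

def path_to_source(node, source):
    """Follow next-hop pointers from a node back to the source."""
    path = [node]
    while node != source:
        _, node = parse_pointer(table[node]["pointer"][source])
        path.append(node)
    return path

# Tiny example: a - b - c with weights 1.0 and 2.0, source "a".
add_edge("a", "b", 1.0)
add_edge("b", "a", 1.0)
add_edge("b", "c", 2.0)
add_edge("c", "b", 2.0)
set_pointer("b", "a", 1.0, "a")
set_pointer("c", "a", 3.0, "b")
route = path_to_source("c", "a")
```

Keeping the pointer under the same row as the adjacency list means that one row read returns both a node's neighbors and its current position in every stored shortest path tree.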

3.2 Parallel BFS to construct a single-source shortest path tree

We implement our parallel BFS algorithm on HBase. A social network is mapped to a graph of nodes. Each node has two attributes: the node ID and the distance from this node to the source node. The source node's distance to itself is initialized to 0, and the distances of all other nodes are initialized to infinity. The system traverses all nodes cyclically in BFS order. In the map phase, the system emits the source node and sends a message to its adjacent nodes. In real life, these adjacent nodes typically correspond to people close to us, such as family members. When these nodes receive the message, they update their distance to 1 (0 plus 1). The system then propagates each node's distance. In the reduce phase, the system checks the type of the received information to determine whether it is a node or a distance. It then reconstructs the network graph and finds the minimum distance. When each cycle finishes, the system checks whether all nodes have been processed. If not, the cycle repeats as before. If so, the cycle ends, and we obtain a tree annotated with each node's distance. We also apply the HBase coprocessor to the cycle. The reduce-phase work, including reconstructing the network graph and finding the minimum distance, can then be omitted, which reduces the time each cycle requires. The result is a simplified algorithm.

Its pseudo-code is as follows:

Algorithm 1 Pseudo-code of improved parallel BFS

  Class Mapper
    Method Map(nid n, node N)
      seen = {}
      For all cell i in N do
        cost = cell i.value.cost
        pointer = cell i.value.pointer
        neighbor = cell i.neighbornode
        seen[pointer] = (neighbor, cost)
      If seen.has(src) and seen.has(dst) then
        totalcost = seen[src].cost + seen[dst].cost
        Emit(totalcost, N.key)

  Class Reducer
    visited = {}
    pathlist = []
    Method Reduce(cost, cells)
      For all cell i in cells do
        If emittedpaths <= k then
          If not visited.has(cell i) then
            path = ReassemblePath(cell i)
            visited.addall(path)
            pathlist.append(cell i)
      For all path i in pathlist do
        Emit(path i)

As with Dijkstra's algorithm, assume that a connected and directed graph is stored as adjacency lists. The distance to each node is stored directly alongside the adjacency list of that node, and the distances of all nodes except the source node are initialized to ∞. As

shown in the pseudo-code, the node id (an integer) and the node's corresponding data structure (adjacency list and current distance) are represented by n and N respectively. The algorithm works by mapping over all nodes and emitting a key-value pair for each neighbor on the node's adjacency list. The key contains the node id of the neighbor, and the value holds the current distance to the node plus one. If we can reach node n with distance d, then we must be able to reach all the nodes that are connected to n with distance d + 1. After shuffling and sorting, the reducer receives keys corresponding to the destination node ids, together with the distances from all paths to that node. The reducer selects the smallest of these distances and then updates the distance in the node data structure. Parallel BFS is therefore an iterative algorithm, and each iteration corresponds to one MapReduce job. When the algorithm starts, the nodes connected to the source are identified. As the iterations proceed, further nodes connected to the already identified nodes are discovered together with their shortest distances. As in other MapReduce jobs that pass the graph structure along, the BFS method has both map and reduce processes. These processes are a burden on computational efficiency, so we improve BFS by leveraging HBase coprocessors. Coprocessors can be loaded globally on all tables and regions hosted by the region server; these are known as system coprocessors. The administrator can also specify, on a per-table basis, which coprocessors should be loaded on all regions of a table; these are known as table coprocessors. In the traditional BFS method, the map phase is responsible for transmitting the path values and the network structure; the coprocessor component architecture additionally allows us to complete the update of each node's shortest-path distance d within the map phase.
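The map and reduce steps of the baseline parallel BFS described above can be simulated in a few lines. This is a minimal single-process sketch under stated assumptions: unit edge weights, an adjacency-list dictionary in place of HBase rows, and hypothetical function names (`bfs_round`, `parallel_bfs`); it illustrates the iteration scheme of Lin and Dyer (2010), not the authors' implementation.

```python
import math

def bfs_round(graph, dist):
    """One MapReduce-style round of parallel BFS on an unweighted graph."""
    # Map phase: every reached node emits (neighbor, distance + 1).
    emitted = {}
    for node, neighbors in graph.items():
        d = dist[node]
        if math.isinf(d):
            continue  # unreached nodes have nothing useful to propagate
        for neighbor in neighbors:
            emitted.setdefault(neighbor, []).append(d + 1)
    # Reduce phase: each node keeps the smallest distance it received.
    new_dist = dict(dist)
    for node, candidates in emitted.items():
        new_dist[node] = min(new_dist[node], min(candidates))
    return new_dist

def parallel_bfs(graph, source):
    """Repeat rounds until no distance changes; return distances and round count."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    rounds = 0
    while True:
        new_dist = bfs_round(graph, dist)
        rounds += 1
        if new_dist == dist:
            return dist, rounds
        dist = new_dist

# Example: a chain a - b - c - d, searched from "a".
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
distances, rounds = parallel_bfs(graph, "a")
```

On this chain the frontier advances one hop per round, so three rounds set the distances and a fourth detects convergence. The coprocessor variant described above replaces the separate reduce step with an update made during the map phase.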
There are two advantages of this method. The first is that it only needs the map phase, so much of the time consumed by the sort-shuffle process is saved. The second is that the data storage mode of HBase helps to reduce the cost of the data transfer process. As a consequence, updates to the pointer tree are applied to some nodes before the mappers read their data. We refer to this property as pointer cascading. As a result, a single iteration of the algorithm can extend the effective frontier of the shortest paths by more than one hop. To reduce the number of iterations, HBase is forced to write the updated data immediately after mapping.

3.3 K-shortest paths

To find new nodes, our application needs paths that are cycle free and that also meet an additional requirement: each path in the set of K-shortest paths contains a node not present on any other path in the set. In this way, new nodes can be found. This property is called node uniqueness. Once the single-source shortest path trees produced by our algorithm in Sect. 3.2 are stored in the HBase database, the K-shortest paths in the graph can be determined with one additional pass. In the Map function, we check each node to see whether it has a shortest path pointer to both the source and the destination. If it does, the path from the source through the node to

the destination, formed by joining the two shortest paths together, is a candidate for a K-shortest path from the source to the destination. To inform the reducer of this fact, we emit a key/value pair in which the key is the sum of both pointers' costs and the value is the node's identifier. We have assumed that the graph is undirected; as a result, one of the two shortest paths can be reversed. Courtesy of the sort-shuffle phase, our reducer receives the candidate shortest paths in cost order. Ties are broken by the node numbering; the lowest numbered paths are discovered first. The reducer must, however, do more than simply choose the first k candidates. We are interested in unique shortest paths that have at least one interesting node. Therefore our reducer keeps a set of the nodes that have been seen on previous paths. It accepts a candidate only if the candidate contains at least one new node, and rejects candidates composed solely of nodes it has seen before. The reducer must also guard against cycles that may be produced by the shortest path pointer algorithm. In the omitted function ReassemblePath, we reconstruct each proposed path from HBase and check it for cycles before accepting it. The pseudo-code of this process is as follows:

Algorithm 2 Pseudo-code of K Shortest Paths

  Class Mapper
    Method Map(nid n, node N)
      seen = {}
      For all cell i in N do
        cost = cell i.value.cost
        pointer = cell i.value.pointer
        neighbor = cell i.neighbornode
        seen[pointer] = (neighbor, cost)
      If seen.has(src) and seen.has(dst) then
        totalcost = seen[src].cost + seen[dst].cost
        Emit(totalcost, N.key)

  Class Reducer
    visited = {}
    pathlist = []
    Method Reduce(cost, cells)
      For all cell i in cells do
        If emittedpaths <= k then
          If not visited.has(cell i) then
            path = ReassemblePath(cell i)
            visited.addall(path)
            pathlist.append(cell i)
      For all path i in pathlist do
        Emit(path i)

3.4 Optimization sorting

From the previous step, we obtain the set of K shortest paths from the source node to the destination node.
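As a concrete illustration of that previous step, the K-shortest-paths pass of Algorithm 2 can be sketched in plain Python. Everything here is an assumption made for illustration: the `pointers` dictionary stands in for the pointer columns read from HBase, `reassemble_path` is one possible reading of the omitted ReassemblePath function, and the loop stops once k paths have been accepted.

```python
def reassemble_path(pointers, node, src, dst):
    """Join the pointer chains node -> src and node -> dst into one src..dst path."""
    def walk(start, root):
        path = [start]
        while path[-1] != root:
            path.append(pointers[path[-1]][root][1])  # follow the next hop
        return path
    to_src = walk(node, src)
    to_dst = walk(node, dst)
    return to_src[::-1] + to_dst[1:]

def k_shortest_paths(pointers, src, dst, k):
    """Return up to k cycle-free paths, each adding at least one unseen node."""
    # Map phase: a node with pointers to both roots yields one candidate.
    candidates = []
    for node, ptrs in pointers.items():
        if src in ptrs and dst in ptrs:
            candidates.append((ptrs[src][0] + ptrs[dst][0], node))
    # Sort-shuffle: candidates arrive in cost order, ties broken by node id.
    visited, paths = set(), []
    for cost, node in sorted(candidates):
        if len(paths) >= k:
            break
        path = reassemble_path(pointers, node, src, dst)
        if len(set(path)) != len(path):
            continue  # reject paths that contain a cycle
        if set(path) <= visited:
            continue  # reject paths with no new node (node uniqueness)
        visited.update(path)
        paths.append((cost, path))
    return paths

# Diamond graph s - a - t and s - b - t with unit weights.
# pointers[node][root] = (cost to root, next hop towards root)
pointers = {
    "s": {"s": (0.0, "s"), "t": (2.0, "a")},
    "a": {"s": (1.0, "s"), "t": (1.0, "t")},
    "b": {"s": (1.0, "s"), "t": (1.0, "t")},
    "t": {"s": (2.0, "a"), "t": (0.0, "t")},
}
two_paths = k_shortest_paths(pointers, "s", "t", 2)
three_paths = k_shortest_paths(pointers, "s", "t", 3)
```

Asking for three paths on the diamond still yields two, because the candidates rooted at s and t only reproduce a path whose nodes have all been seen already.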
The question in this phase is whether the user can easily read the information in the path set. The answer is no, because the paths are ordered only by path cost and no other information is available. When calculating the paths, many

important paths may be missed if the value of K is too small; conversely, if K is too large, there might be too many paths. Thus, we should choose an optimal value of K to identify an appropriate number of paths. There are many methods for optimizing paths in a network graph, and different combinations of algorithms can be used according to the needs of users. Here we choose the intermediary degree and degree centrality algorithms to optimize the output of the cloud platform. In practice, we not only want to know which nodes are on a path, but also need to learn whether someone in the relationship chain plays a key role, because that key person may be the one who is known by both parties. Thus, we choose the intermediary degree (betweenness) to optimize the path set. Betweenness centrality, betweenness for short (Brandes 2001), is an important concept in social network analysis. For a given node, it measures the number of shortest paths between other nodes that pass through that node. Betweenness describes well how much the nodes on each path contribute to the connection between two nodes. The larger the betweenness of a node, the more paths go through the node and the higher the probability that the node connects other nodes. Consider a weighted directed (multi)graph $G = (V, E)$ with $n = |V|$ and $m = |E|$. Let $SP_{st}$ denote the set of shortest paths between source $s$ and target $t$, and $SP_{st}(v)$ the subset of $SP_{st}$ consisting of paths that have $v$ in their interior. Then the betweenness centrality of node $v$ is

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (1)$$

where $\sigma_{st} := |SP_{st}|$ and $\sigma_{st}(v) := |SP_{st}(v)|$.

In addition, we choose degree centrality as another optimization factor. The degree centrality of a node is defined as the number of edges that connect to this node (Qin and Li 2011). Generally, the higher the centrality of a node, the more popular or well connected the node is. In our algorithm, the degree centrality of node $v$ is written as

$$C_D(v) = \frac{\deg(v)}{n - 1} \qquad (2)$$

where $\deg(v)$ is the degree of $v$ and $n$ is the number of nodes.
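Equations (1) and (2) can be checked by hand on a small graph. The sketch below is an illustration under stated assumptions: it treats the graph as unweighted (the paper's graph is weighted), sums over ordered pairs (s, t) exactly as Eq. (1) is written, and uses a brute-force pairwise computation with hypothetical function names rather than the faster algorithm of Brandes (2001).

```python
from collections import deque

def bfs_counts(graph, s):
    """Return hop distances from s and the number of shortest paths to each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]  # every shortest path to u extends to w
    return dist, sigma

def betweenness(graph):
    """Eq. (1): sum over ordered pairs s != v != t of sigma_st(v) / sigma_st."""
    info = {s: bfs_counts(graph, s) for s in graph}
    cb = {v: 0.0 for v in graph}
    for s in graph:
        dist_s, sigma_s = info[s]
        for t in graph:
            if t == s or t not in dist_s:
                continue
            for v in graph:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v is interior to a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    cb[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return cb

def degree_centrality(graph):
    """Eq. (2): deg(v) / (n - 1), with n the number of nodes."""
    n = len(graph)
    return {v: len(graph[v]) / (n - 1) for v in graph}

# Star graph: b is connected to a, c and d.
star = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
cb = betweenness(star)
cd = degree_centrality(star)
```

In the star, every shortest path between two leaves runs through b, so b collects one unit for each of the six ordered leaf pairs, while its degree centrality is 3/3.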
4 The analysis of the experiment

We conducted an experiment to evaluate the performance of our algorithm. We built a cloud system and used simulated networks for the experiment.

4.1 Hardware configuration of the cloud system

We used Hadoop for the cloud platform configuration. The cloud cluster contains 10 nodes, and each node has a 2.6-GHz CPU, 8 GB of RAM, and 1 TB of hard disk.

Fig. 1 A network pattern we used to simulate social network

There are 128 mappers and 64 reducers by default. After a large number of tests, we set the ratio between the numbers of mappers and reducers that gave the most suitable running speed for the algorithm to 2:

4.2 Network model

It is hard to obtain real large-scale network data for our test. Thus, we used mathematical modeling to simulate network data. A social network is a random network with its own regularities. It is a small-world network: clustering appears throughout the network model, and it obeys the power-law distribution. We therefore used the mathematical model developed by Holme and Kim (2002) to build a network with acceptable network properties. We used NetworkX, a Python software package for the generation, manipulation, and study of the structure, dynamics, and function of complex networks. The structure of a social network is similar to a graph, in which people are represented by nodes and the relationships between people are represented by edges. NetworkX is therefore well suited to analyzing social network data. It can generate many kinds of network graphs, and we use one of these kinds to generate the network data. Figure 1 shows the network pattern we used to simulate a social network. NetworkX can generate networks that follow the power-law distribution. The simulated network we used is fairly large: it contains 100 million nodes and 600 million edges.

5 Results and discussion

5.1 Time complexity

Since each edge receives a new message during each iteration, the complexity with respect to node degree for each network graph is O(1), and its time complexity is O(log E). The total number of messages is O(E). The complexity of storing data in the HBase table structure is O(log E). In Fig. 2a, we compare the efficiency of our algorithm with that of parallel BFS. Our improved parallel BFS algorithm outperforms the traditional one. As shown in Fig. 2a, when the number of nodes is small, the two algorithms perform about the same. As the number of nodes in the social network increases, our improved algorithm becomes more efficient than the traditional algorithm. The time

complexity of the K-shortest paths algorithm is O(V log V), as is that of the BFS algorithm. The time complexity of the shuffle-sort phase is O(E log E). Figure 2b shows the running time of the K-shortest paths algorithm for different numbers of nodes. From 1 to 4, the running time curve is basically linear. From 5 to 10, the running time grows faster and eventually exceeds the time shown in Fig. 2a.

5.2 Discussion

As shown in Table 1, the cascade phenomenon has a significant influence on the number of iterations, especially when data is forced to be written into HBase immediately after mapping. For the large network with 100 million nodes and 600 million edges, our method takes 6 h and 30 min to compute the shortest paths, which is acceptable. The parallel BFS algorithm accounted for half of the running time, and the rest was spent running the K-shortest paths algorithm. When K is 20, the running time of the optimization algorithm is trivial compared with that of the other steps.

Fig. 2 a Efficiency comparison of two algorithms. b Running time of K-shortest paths algorithm of different nodes
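The effect of the cascade phenomenon on the iteration count discussed in Sect. 5.2 can be reproduced on a toy graph. The sketch below is an assumption-laden illustration, not the authors' coprocessor code: `rounds_synchronous` reads only the previous round's distances, while `rounds_cascading` writes each update immediately so that nodes scanned later in the same round already see it, which is the behaviour the text calls pointer cascading. Both function names and the unit edge weights are hypothetical.

```python
import math

def rounds_synchronous(graph, source):
    """Level-synchronous BFS: each round sees only the previous round's distances."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    rounds = 0
    while True:
        rounds += 1
        new_dist = dict(dist)
        for node in graph:
            for neighbor in graph[node]:
                if dist[node] + 1 < new_dist[neighbor]:
                    new_dist[neighbor] = dist[node] + 1
        if new_dist == dist:
            return dist, rounds
        dist = new_dist

def rounds_cascading(graph, source):
    """Updates are written at once, so later nodes in the same round use them."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    rounds = 0
    while True:
        rounds += 1
        changed = False
        for node in graph:  # scan order plays the role of the table's row order
            for neighbor in graph[node]:
                if dist[node] + 1 < dist[neighbor]:
                    dist[neighbor] = dist[node] + 1
                    changed = True
        if not changed:
            return dist, rounds

# Chain a - b - c - d scanned in the order a, b, c, d, searched from "a".
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
sync_dist, sync_rounds = rounds_synchronous(chain, "a")
casc_dist, casc_rounds = rounds_cascading(chain, "a")
```

Both variants return the same distances, but on this chain the cascading variant converges in two rounds instead of four. The gain depends on the scan order: if the rows were scanned from d to a, cascading would not shorten the run.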

Table 1 Comparison of the number of BFS iterations with different numbers of nodes
Node number | Traditional iteration times | Improved iteration times | Cascade iteration times | Path length

6 Conclusion

This paper proposed a parallel BFS algorithm, based on HBase, to compute the shortest path in a social network. The algorithm analyzes the social network model. It finds the shortest path between different nodes in a network in a parallel environment, and can optimize the output from the cloud platform by using the intermediary degree and degree centrality algorithms. We validated the efficiency of the proposed algorithm with a simulated large social network. The experimental results indicate that the proposed algorithm can improve the efficiency of parallel BFS.

Acknowledgments This study was supported by the National Natural Science Foundation of China (Grant No , , ); Natural Science Foundation of Shanxi Province (Grant No ) and Programs for Science and Technology Development of Shanxi Province (Grant No ). This work was also supported in part by the US National Science Foundation (NSF) under Grant no. CNS and CCF

References

Barabási A-L, Albert R (1999) Emergence of scaling in random networks. Science 286:509
Barabási A-L, Albert R, Jeong H (1999) Mean-field theory for scale-free random networks. Physica A 272:173
Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):
Chang F, et al. (2006) Bigtable: a distributed storage system for structured data. OSDI
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):
Dijkstra EW (1959) A note on two problems in connexion with graphs. Numerische Mathematik 1(1):
Gu L, Huang HL, Zhang XD (2013) The clustering coefficient and the diameter of small-world networks. Acta Mathematica Sinica, English Series 29:
Holme P, Kim BJ (2002) Growing scale free networks with tunable clustering. Phys Rev E 65:
Hoque I, Gupta IC (2012) Disk layout techniques for online social network data. IEEE Internet Computing
Lakshman A, Malik P (2010) Cassandra: a decentralized structured storage system. SIGOPS 44(2):35–40
Lin J, Dyer C (2010) Data-intensive text processing with MapReduce, ser. Synthesis lectures on human language technologies. Morgan and Claypool Publishers, Florida
Lu Z, Fan L, Wu W, Thuraisingham B, Yang K (2014) Efficient influence spread estimation for influence maximization under the linear threshold model. To appear in Computational Social Networks

McCubbin C, Perozzi B (2011) Finding the needle: locating interesting nodes using the K-shortest paths algorithm in MapReduce. IEEE International Conference on Data Mining Workshops
Missen MMSC (2008) The small world of web network graphs. International Multi Topic Conference on Wireless Networks, Information Processing and Systems, IMTIC
Qin L, Li H (2011) Centrality analysis of BBS reply networks. International Conference on Information Technology, Computer Engineering and Management Sciences, ICM
Škrabálek J, Kunc P, Nguyen F (2013) Towards effective social network system implementation/new trends in databases and information systems. Springer Berlin, Heidelberg, pp
Taylor RC (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform 11(Suppl 12):S1
Watts DJ, Strogatz SH (1998) Collective dynamics of small-world networks. Nature 393:440


More information

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. 1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map

More information

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters. Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008 Lesson 4 Random graphs Sergio Barbarossa Graph models 1. Uncorrelated random graph (Erdős, Rényi) N nodes are connected through n edges which are chosen randomly from the possible configurations 2. Binomial

More information

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Bumjoon Jo and Sungwon Jung (&) Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107,

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES Al-Badarneh et al. Special Issue Volume 2 Issue 1, pp. 200-213 Date of Publication: 19 th December, 2016 DOI-https://dx.doi.org/10.20319/mijst.2016.s21.200213 INDEX-BASED JOIN IN MAPREDUCE USING HADOOP

More information

Analyzing and Improving Load Balancing Algorithm of MooseFS

Analyzing and Improving Load Balancing Algorithm of MooseFS , pp. 169-176 http://dx.doi.org/10.14257/ijgdc.2014.7.4.16 Analyzing and Improving Load Balancing Algorithm of MooseFS Zhang Baojun 1, Pan Ruifang 1 and Ye Fujun 2 1. New Media Institute, Zhejiang University

More information

University of Maryland. Tuesday, March 2, 2010

University of Maryland. Tuesday, March 2, 2010 Data-Intensive Information Processing Applications Session #5 Graph Algorithms Jimmy Lin University of Maryland Tuesday, March 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Response Network Emerging from Simple Perturbation

Response Network Emerging from Simple Perturbation Journal of the Korean Physical Society, Vol 44, No 3, March 2004, pp 628 632 Response Network Emerging from Simple Perturbation S-W Son, D-H Kim, Y-Y Ahn and H Jeong Department of Physics, Korea Advanced

More information

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu NoSQL Databases MongoDB vs Cassandra Kenny Huynh, Andre Chik, Kevin Vu Introduction - Relational database model - Concept developed in 1970 - Inefficient - NoSQL - Concept introduced in 1980 - Related

More information

Advanced Database Technologies NoSQL: Not only SQL

Advanced Database Technologies NoSQL: Not only SQL Advanced Database Technologies NoSQL: Not only SQL Christian Grün Database & Information Systems Group NoSQL Introduction 30, 40 years history of well-established database technology all in vain? Not at

More information

CIB Session 12th NoSQL Databases Structures

CIB Session 12th NoSQL Databases Structures CIB Session 12th NoSQL Databases Structures By: Shahab Safaee & Morteza Zahedi Software Engineering PhD Email: safaee.shx@gmail.com, morteza.zahedi.a@gmail.com cibtrc.ir cibtrc cibtrc 2 Agenda What is

More information

Graph Data Processing with MapReduce

Graph Data Processing with MapReduce Distributed data processing on the Cloud Lecture 5 Graph Data Processing with MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, 2015 (licensed under Creation Commons Attribution

More information

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d 4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016) Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b,

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

Small World Properties Generated by a New Algorithm Under Same Degree of All Nodes

Small World Properties Generated by a New Algorithm Under Same Degree of All Nodes Commun. Theor. Phys. (Beijing, China) 45 (2006) pp. 950 954 c International Academic Publishers Vol. 45, No. 5, May 15, 2006 Small World Properties Generated by a New Algorithm Under Same Degree of All

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Research and Improvement of Apriori Algorithm Based on Hadoop

Research and Improvement of Apriori Algorithm Based on Hadoop Research and Improvement of Apriori Algorithm Based on Hadoop Gao Pengfei a, Wang Jianguo b and Liu Pengcheng c School of Computer Science and Engineering Xi'an Technological University Xi'an, 710021,

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

MapReduce and Friends

MapReduce and Friends MapReduce and Friends Craig C. Douglas University of Wyoming with thanks to Mookwon Seo Why was it invented? MapReduce is a mergesort for large distributed memory computers. It was the basis for a web

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

L22: NoSQL. CS3200 Database design (sp18 s2) 4/5/2018 Several slides courtesy of Benny Kimelfeld

L22: NoSQL. CS3200 Database design (sp18 s2)   4/5/2018 Several slides courtesy of Benny Kimelfeld L22: NoSQL CS3200 Database design (sp18 s2) https://course.ccs.neu.edu/cs3200sp18s2/ 4/5/2018 Several slides courtesy of Benny Kimelfeld 2 Outline 3 Introduction Transaction Consistency 4 main data models

More information

Available online at ScienceDirect. Procedia Computer Science 98 (2016 )

Available online at   ScienceDirect. Procedia Computer Science 98 (2016 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 98 (2016 ) 554 559 The 3rd International Symposium on Emerging Information, Communication and Networks Integration of Big

More information

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) 4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

CSE-E5430 Scalable Cloud Computing Lecture 9

CSE-E5430 Scalable Cloud Computing Lecture 9 CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay

More information

MapReduce Algorithms

MapReduce Algorithms Large-scale data processing on the Cloud Lecture 3 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License) Outline

More information

Constructing weakly connected dominating set for secure clustering in distributed sensor network

Constructing weakly connected dominating set for secure clustering in distributed sensor network J Comb Optim (01) 3:301 307 DOI 10.1007/s10878-010-9358-y Constructing weakly connected dominating set for secure clustering in distributed sensor network Hongjie Du Weili Wu Shan Shan Donghyun Kim Wonjun

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Research on Community Structure in Bus Transport Networks

Research on Community Structure in Bus Transport Networks Commun. Theor. Phys. (Beijing, China) 52 (2009) pp. 1025 1030 c Chinese Physical Society and IOP Publishing Ltd Vol. 52, No. 6, December 15, 2009 Research on Community Structure in Bus Transport Networks

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

A STUDY ON THE TRANSLATION MECHANISM FROM RELATIONAL-BASED DATABASE TO COLUMN-BASED DATABASE

A STUDY ON THE TRANSLATION MECHANISM FROM RELATIONAL-BASED DATABASE TO COLUMN-BASED DATABASE A STUDY ON THE TRANSLATION MECHANISM FROM RELATIONAL-BASED DATABASE TO COLUMN-BASED DATABASE Chin-Chao Huang, Wenching Liou National Chengchi University, Taiwan 99356015@nccu.edu.tw, w_liou@nccu.edu.tw

More information

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland Research Works to Cope with Big Data Volume and Variety Jiaheng Lu University of Helsinki, Finland Big Data: 4Vs Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html

More information

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Graph Algorithms. Revised based on the slides by Ruoming Kent State Graph Algorithms Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

A Security Audit Module for HBase

A Security Audit Module for HBase 2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and International Conference on Network and Communication Security (NCS 2016) ISBN: 978-1-60595-362-5

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Research Article Apriori Association Rule Algorithms using VMware Environment

Research Article Apriori Association Rule Algorithms using VMware Environment Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must

More information

1. Introduction to MapReduce

1. Introduction to MapReduce Processing of massive data: MapReduce 1. Introduction to MapReduce 1 Origins: the Problem Google faced the problem of analyzing huge sets of data (order of petabytes) E.g. pagerank, web access logs, etc.

More information

The Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian ZHENG 1, Mingjiang LI 1, Jinpeng YUAN 1

The Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian ZHENG 1, Mingjiang LI 1, Jinpeng YUAN 1 International Conference on Intelligent Systems Research and Mechatronics Engineering (ISRME 2015) The Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10 Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*

More information

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI 2017 International Conference on Electronic, Control, Automation and Mechanical Engineering (ECAME 2017) ISBN: 978-1-60595-523-0 The Establishment of Large Data Mining Platform Based on Cloud Computing

More information

Presented by Nanditha Thinderu

Presented by Nanditha Thinderu Presented by Nanditha Thinderu Enterprise systems are highly distributed and heterogeneous which makes administration a complex task Application Performance Management tools developed to retrieve information

More information

An Evolving Network Model With Local-World Structure

An Evolving Network Model With Local-World Structure The Eighth International Symposium on Operations Research and Its Applications (ISORA 09) Zhangjiajie, China, September 20 22, 2009 Copyright 2009 ORSC & APORC, pp. 47 423 An Evolving Network odel With

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Higher order clustering coecients in Barabasi Albert networks

Higher order clustering coecients in Barabasi Albert networks Physica A 316 (2002) 688 694 www.elsevier.com/locate/physa Higher order clustering coecients in Barabasi Albert networks Agata Fronczak, Janusz A. Ho lyst, Maciej Jedynak, Julian Sienkiewicz Faculty of

More information

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 11 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(11), 2014 [5368-5376] The study on magnanimous data-storage system based

More information

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos Instituto Politécnico de Tomar Introduction to Big Data NoSQL Databases Ricardo Campos Mestrado EI-IC Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2016 Part of the slides used in

More information

Presented by Sunnie S Chung CIS 612

Presented by Sunnie S Chung CIS 612 By Yasin N. Silva, Arizona State University Presented by Sunnie S Chung CIS 612 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. See http://creativecommons.org/licenses/by-nc-sa/4.0/

More information

A New HadoopBased Network Management System with Policy Approach

A New HadoopBased Network Management System with Policy Approach Computer Engineering and Applications Vol. 3, No. 3, September 2014 A New HadoopBased Network Management System with Policy Approach Department of Computer Engineering and IT, Shiraz University of Technology,

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Apache Spark and Hadoop Based Big Data Processing System for Clinical Research

Apache Spark and Hadoop Based Big Data Processing System for Clinical Research Apache Spark and Hadoop Based Big Data Processing System for Clinical Research Sreekanth Rallapalli 1,*, Gondkar R R 2 1 Research Scholar, R&D Centre, Bharathiyar University, Coimbatore, Tamilnadu, India.

More information

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India Volume 115 No. 7 2017, 105-110 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN Balaji.N 1,

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

Jumbo: Beyond MapReduce for Workload Balancing

Jumbo: Beyond MapReduce for Workload Balancing Jumbo: Beyond Reduce for Workload Balancing Sven Groot Supervised by Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 4-6-1 Komaba Meguro-ku, Tokyo 153-8505, Japan sgroot@tkl.iis.u-tokyo.ac.jp

More information

New research on Key Technologies of unstructured data cloud storage

New research on Key Technologies of unstructured data cloud storage 2017 International Conference on Computing, Communications and Automation(I3CA 2017) New research on Key Technologies of unstructured data cloud storage Songqi Peng, Rengkui Liua, *, Futian Wang State

More information

Image Filtering with MapReduce in Pseudo-Distribution Mode

Image Filtering with MapReduce in Pseudo-Distribution Mode Image Filtering with MapReduce in Pseudo-Distribution Mode Tharindu D. Gamage, Jayathu G. Samarawickrama, Ranga Rodrigo and Ajith A. Pasqual Department of Electronic & Telecommunication Engineering, University

More information

Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases

Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases Khalid Mahmood Shaheed Zulfiqar Ali Bhutto Institute of Science and Technology, Karachi Pakistan khalidmdar@yahoo.com

More information

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores Nikhil Dasharath Karande 1 Department of CSE, Sanjay Ghodawat Institutes, Atigre nikhilkarande18@gmail.com Abstract- This paper

More information

Research on the value of search engine optimization based on Electronic Commerce WANG Yaping1, a

Research on the value of search engine optimization based on Electronic Commerce WANG Yaping1, a 6th International Conference on Machinery, Materials, Environment, Biotechnology and Computer (MMEBC 2016) Research on the value of search engine optimization based on Electronic Commerce WANG Yaping1,

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 (Fall 2018) Part 4: Analyzing Graphs (1/2) October 4, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are

More information

A Parallel Community Detection Algorithm for Big Social Networks

A Parallel Community Detection Algorithm for Big Social Networks A Parallel Community Detection Algorithm for Big Social Networks Yathrib AlQahtani College of Computer and Information Sciences King Saud University Collage of Computing and Informatics Saudi Electronic

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY , pp-01-05 FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY Ravin Ahuja 1, Anindya Lahiri 2, Nitesh Jain 3, Aditya Gabrani 4 1 Corresponding Author PhD scholar with the Department of Computer Engineering,

More information

BESIII Physical Analysis on Hadoop Platform

BESIII Physical Analysis on Hadoop Platform BESIII Physical Analysis on Hadoop Platform Jing HUO 12, Dongsong ZANG 12, Xiaofeng LEI 12, Qiang LI 12, Gongxing SUN 1 1 Institute of High Energy Physics, Beijing, China 2 University of Chinese Academy

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b

Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b 1

More information

A Review to the Approach for Transformation of Data from MySQL to NoSQL

A Review to the Approach for Transformation of Data from MySQL to NoSQL A Review to the Approach for Transformation of Data from MySQL to NoSQL Monika 1 and Ashok 2 1 M. Tech. Scholar, Department of Computer Science and Engineering, BITS College of Engineering, Bhiwani, Haryana

More information

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life Ch.Srilakshmi Asst Professor,Department of Information Technology R.M.D Engineering College, Kavaraipettai,

More information