Big Data Using Hadoop


IEEE 2016-17 PROJECT LIST (JAVA): Big Data Using Hadoop

17ANSP-BD-001: Hadoop Performance Modeling for Job Estimation and Resource Provisioning

MapReduce has become a major computing model for data-intensive applications. Hadoop, an open-source implementation of MapReduce, has been adopted by a rapidly growing user community. Cloud computing service providers such as the Amazon EC2 Cloud offer Hadoop users the opportunity to lease a certain amount of resources and pay for their use. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for running a job in the cloud. This paper presents a Hadoop job performance model that accurately estimates job completion time and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model builds on historical job execution records and employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a job. Furthermore, it employs the Lagrange Multipliers technique for resource provisioning to satisfy jobs with deadline requirements. The proposed model is initially evaluated on an in-house Hadoop cluster and subsequently evaluated in the Amazon EC2 Cloud. Experimental results show that the accuracy of the proposed model in job execution estimation is in the range of 94.97 to 95.51 percent, and jobs are completed within the required deadlines following the resource provisioning scheme of the proposed model.
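The estimator named here, locally weighted linear regression over historical job records, is easy to illustrate in miniature. The sketch below is a minimal one-feature LWLR model that predicts job runtime from input size; the feature choice, the Gaussian kernel bandwidth tau, and the sample history are assumptions for illustration, not the paper's actual model.

```java
/** Minimal one-feature LWLR sketch: predict job runtime from input data size. */
public class LwlrJobEstimator {
    private final double[] x;   // historical input sizes in GB (assumed feature)
    private final double[] y;   // observed job runtimes in seconds
    private final double tau;   // Gaussian kernel bandwidth (assumed hyperparameter)

    LwlrJobEstimator(double[] x, double[] y, double tau) {
        this.x = x; this.y = y; this.tau = tau;
    }

    /** Closed-form weighted least squares around the query point (1-D case). */
    double predict(double xq) {
        double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - xq;
            double w = Math.exp(-d * d / (2 * tau * tau)); // nearby jobs weigh more
            sw += w; swx += w * x[i]; swy += w * y[i];
            swxx += w * x[i] * x[i]; swxy += w * x[i] * y[i];
        }
        double slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx);
        double intercept = (swy - slope * swx) / sw;
        return intercept + slope * xq;
    }

    public static void main(String[] args) {
        double[] sizes = {10, 20, 40, 80, 160};          // assumed job history
        double[] runtimes = {120, 210, 400, 790, 1580};  // assumed runtimes (s)
        LwlrJobEstimator model = new LwlrJobEstimator(sizes, runtimes, 30.0);
        System.out.printf("Estimated runtime for 100 GB: %.0f s%n", model.predict(100));
    }
}
```

Each prediction solves a small weighted least-squares problem centered on the query point, so history from similarly sized jobs dominates the estimate.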
17ANSP-BD-002: On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications

With the advent of the big data phenomenon in the world of data and its related technologies, developments in NoSQL databases are highly regarded. It has been claimed that these databases outperform their SQL counterparts. The aim of this study is to investigate this claim by evaluating the document-oriented MongoDB database against SQL in terms of the performance of common aggregate and non-aggregate queries. We designed a set of experiments with a huge number of operations, such as read, write, delete, and select, from various aspects in the two databases and on the same data for a typical e-commerce schema. The results show that MongoDB performs better for most operations, excluding some aggregate functions. The results can be a good source for commercial and non-commercial companies eager to change the structure of the database used to provide their line-of-business services.
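The experiment described in this abstract reduces to timing equivalent operations on both systems. Below is a minimal sketch of the MongoDB side using the Java sync driver, timing a plain find against a grouped aggregate; the connection string, the shop/orders collection, and the field names are assumed, and a serious benchmark would add warm-up runs, repetitions, and identical datasets on both systems.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

/** Crude wall-clock timing of a non-aggregate vs. an aggregate MongoDB query. */
public class MongoQueryTiming {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders"); // assumed schema

            long t0 = System.nanoTime();
            orders.find(Filters.eq("status", "shipped")).into(new ArrayList<>());
            long findMs = (System.nanoTime() - t0) / 1_000_000;

            t0 = System.nanoTime();
            orders.aggregate(List.of( // total order amount per customer
                    Aggregates.group("$customerId", Accumulators.sum("total", "$amount"))
            )).into(new ArrayList<>());
            long aggMs = (System.nanoTime() - t0) / 1_000_000;

            System.out.printf("find: %d ms, aggregate: %d ms%n", findMs, aggMs);
        }
    }
}
```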
17ANSP-BD-003: Dynamic Job Ordering and Slot Configurations for MapReduce Workloads

MapReduce is a popular parallel computing paradigm for large-scale data processing in clusters and data centers. A MapReduce workload generally contains a set of jobs, each of which consists of multiple map tasks followed by multiple reduce tasks. Because 1) map tasks can run only in map slots and reduce tasks can run only in reduce slots, and 2) map tasks are generally executed before reduce tasks, different job execution orders and map/reduce slot configurations for a MapReduce workload yield significantly different performance and system utilization. This paper proposes two classes of algorithms to minimize the makespan and the total completion time for an offline MapReduce workload. Our first class of algorithms focuses on job ordering optimization under a given map/reduce slot configuration. In contrast, our second class of algorithms considers the scenario in which the map/reduce slot configuration itself can also be optimized. We perform simulations as well as experiments on Amazon EC2 and show that our proposed algorithms produce results that are up to 15-80 percent better than currently unoptimized Hadoop, leading to significant reductions in running time in practice.
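Because every job's map phase must finish before its reduce phase, ordering jobs for makespan resembles a two-machine flow shop, for which Johnson's rule gives an optimal order. The sketch below applies Johnson's rule to jobs summarized by aggregate map and reduce times; the job list is invented, and the paper's algorithms handle the more general slotted setting rather than this idealization.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Johnson's rule for a two-stage (map, reduce) flow shop: minimizes makespan. */
public class JohnsonOrdering {
    record Job(String name, double mapTime, double reduceTime) {}

    static List<Job> order(List<Job> jobs) {
        List<Job> front = new ArrayList<>(); // jobs dominated by the map stage
        List<Job> back = new ArrayList<>();  // jobs dominated by the reduce stage
        for (Job j : jobs) {
            if (j.mapTime() <= j.reduceTime()) front.add(j); else back.add(j);
        }
        front.sort(Comparator.comparingDouble(Job::mapTime));              // ascending
        back.sort(Comparator.comparingDouble(Job::reduceTime).reversed()); // descending
        front.addAll(back);
        return front;
    }

    public static void main(String[] args) {
        List<Job> jobs = List.of(new Job("A", 3, 6), new Job("B", 5, 2), new Job("C", 1, 2));
        order(jobs).forEach(j -> System.out.println(j.name())); // prints C, A, B
    }
}
```

Jobs whose map phase is the lighter stage go first in increasing map time; the rest go last in decreasing reduce time, keeping both stages busy.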
17ANSP-BD-004: Distributed In-Memory Processing of All k Nearest Neighbor Queries

A wide spectrum of Internet-scale mobile applications, ranging from social networking, gaming, and entertainment to emergency response and crisis management, require efficient and scalable All k Nearest Neighbor (AkNN) computations over millions of moving objects every few seconds to be operational. Most traditional techniques for computing AkNN queries are centralized, lacking both scalability and efficiency. Only recently have distributed techniques for shared-nothing cloud infrastructures been proposed to achieve scalability for large datasets. These batch-oriented algorithms are suboptimal due to inefficient data-space partitioning and data replication among processing units. In this paper, we present Spitfire, a distributed algorithm that provides a scalable and high-performance AkNN processing framework. Our proposed algorithm deploys a fast load-balanced partitioning scheme along with an efficient replication-set selection algorithm to provide fast main-memory computation of the exact AkNN results in a batch-oriented manner. We evaluate, both analytically and experimentally, how the pruning efficiency of the Spitfire algorithm plays a pivotal role in reducing communication and response time by up to an order of magnitude compared to three other state-of-the-art distributed AkNN algorithms executed in distributed main memory.
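As a point of reference for what the distributed algorithms avoid, the naive centralized AkNN baseline below compares every object against every other, a quadratic cost that becomes untenable at millions of moving objects. The point set and k are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Naive O(n^2) AkNN baseline: the k nearest neighbors of every point. */
public class BruteForceAkNN {
    record Point(double x, double y) {}

    static List<List<Integer>> aknn(List<Point> pts, int k) {
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < pts.size(); i++) {
            // Max-heap on squared distance: the worst current candidate sits on top.
            PriorityQueue<double[]> heap =
                    new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]).reversed());
            for (int j = 0; j < pts.size(); j++) {
                if (i == j) continue;
                double dx = pts.get(i).x() - pts.get(j).x();
                double dy = pts.get(i).y() - pts.get(j).y();
                double d2 = dx * dx + dy * dy; // squared distance suffices for ranking
                if (heap.size() < k) heap.add(new double[]{d2, j});
                else if (d2 < heap.peek()[0]) { heap.poll(); heap.add(new double[]{d2, j}); }
            }
            List<Integer> nbrs = new ArrayList<>();
            for (double[] e : heap) nbrs.add((int) e[1]); // unordered neighbor ids
            result.add(nbrs);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Point> pts = List.of(new Point(0, 0), new Point(1, 0), new Point(5, 5), new Point(1, 1));
        System.out.println(aknn(pts, 2)); // e.g. point 0's neighbors are points 1 and 3
    }
}
```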
17ANSP-BD-005: Adaptive Replication Management in HDFS Based on Supervised Learning

The number of applications based on Apache Hadoop is increasing dramatically due to the robustness and dynamic features of this system. At the heart of Apache Hadoop, the Hadoop Distributed File System (HDFS) provides reliability and high availability for computation by applying a static replication factor by default. However, because of the characteristics of parallel operations at the application layer, the access rate of each data file in HDFS differs completely. Consequently, maintaining the same replication mechanism for every data file has detrimental effects on performance. By rigorously considering the drawbacks of HDFS replication, this paper proposes an approach to dynamically replicate data files based on predictive analysis. With the help of probability theory, the utilization of each data file can be predicted and a corresponding replication strategy created. Popular files can then be replicated according to their own access potential, while an erasure code is applied to the remaining low-potential files to maintain reliability. Hence, our approach simultaneously improves availability while keeping reliability in comparison with the default scheme. Furthermore, complexity reduction is applied to enhance the effectiveness of the prediction when dealing with Big Data.
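The knob this approach turns is exposed directly by the HDFS client API, which can change a file's replication factor at runtime. A minimal sketch follows, with a hypothetical predicted-access score and threshold policy; the paper derives its predictions from a supervised model and uses erasure coding for low-potential files rather than simply lowering the replica count.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: adjust per-file HDFS replication from a predicted access score. */
public class AdaptiveReplication {

    // Hypothetical policy mapping a predicted access probability to a replica count.
    static short replicasFor(double predictedAccess) {
        if (predictedAccess > 0.8) return 5;  // hot file: extra replicas
        if (predictedAccess > 0.3) return 3;  // HDFS default
        return 2;                             // cold file (the paper uses erasure coding here)
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/logs/part-00000"); // illustrative path
            double predictedAccess = 0.9; // would come from the trained access model
            fs.setReplication(file, replicasFor(predictedAccess));
        }
    }
}
```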
17ANSP-BD-006: Wide Area Analytics for Geographically Distributed Datacenters

Big data analytics, the process of organizing and analyzing data to extract useful information, is one of the primary uses of cloud services today. Traditionally, collections of data are stored and processed in a single datacenter. As the volume of data grows at a tremendous rate, it becomes less efficient, from a performance point of view, for a single datacenter to handle such large volumes of data. Large cloud service providers are deploying datacenters geographically around the world for better performance and availability. A widely used approach for analytics of geo-distributed data is the centralized approach, which aggregates all the raw data from local datacenters to a central datacenter. However, this approach has been observed to consume a significant amount of bandwidth, leading to worse performance. A number of mechanisms have been proposed to achieve optimal performance when data analytics are performed over geo-distributed datacenters. In this paper, we present a survey of representative mechanisms proposed in the literature for wide-area analytics. We discuss the basic ideas, present the proposed architectures and mechanisms, and discuss several examples to illustrate existing work. We point out the limitations of these mechanisms, give comparisons, and conclude with our thoughts on future research directions.
17ANSP-BD-007: A Big Data Clustering Algorithm for Mitigating the Risk of Customer Churn

As market competition intensifies, customer churn management is increasingly becoming an important means of competitive advantage for companies. However, when dealing with big data in industry, existing churn prediction models do not work very well. In addition, decision makers are always faced with imprecise operations management. In response to these difficulties, a new clustering algorithm called the semantic-driven subtractive clustering method (SDSCM) is proposed. Experimental results indicate that SDSCM has stronger clustering semantic strength than the subtractive clustering method (SCM) and fuzzy c-means (FCM). A parallel SDSCM algorithm is then implemented through the Hadoop MapReduce framework. In the case study, the proposed parallel SDSCM algorithm enjoys a fast running speed when compared with the other methods. Furthermore, we provide some marketing strategies in accordance with the clustering results, and a simplified marketing activity is simulated to ensure profit maximization.
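SDSCM extends the classical subtractive clustering method, in which the point with the highest density potential repeatedly becomes a cluster center and the potential around it is then suppressed. The sketch below implements plain SCM only (the semantic-driven extension is the paper's contribution and is not reproduced); the influence radius and stopping ratio are assumed parameter choices.

```java
import java.util.ArrayList;
import java.util.List;

/** Classical subtractive clustering (SCM), the base method that SDSCM extends. */
public class SubtractiveClustering {

    static List<double[]> centers(double[][] x, double ra, double stopRatio) {
        double alpha = 4 / (ra * ra);  // influence radius for the potentials
        double rb = 1.5 * ra;          // suppression radius (a common choice)
        double beta = 4 / (rb * rb);
        int n = x.length;
        double[] p = new double[n];
        for (int i = 0; i < n; i++)    // density potential of each point
            for (int j = 0; j < n; j++)
                p[i] += Math.exp(-alpha * dist2(x[i], x[j]));

        List<double[]> centers = new ArrayList<>();
        double firstPeak = -1;
        while (centers.size() < n) {
            int c = 0;
            for (int i = 1; i < n; i++) if (p[i] > p[c]) c = i;
            if (firstPeak < 0) firstPeak = p[c];
            if (p[c] < stopRatio * firstPeak) break; // remaining potential too low
            double pc = p[c];                        // snapshot before suppression
            centers.add(x[c]);
            for (int i = 0; i < n; i++)              // suppress potential near the center
                p[i] -= pc * Math.exp(-beta * dist2(x[i], x[c]));
        }
        return centers;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) { double t = a[d] - b[d]; s += t * t; }
        return s;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0.1, 0}, {0, 0.1}, {5, 5}, {5.1, 5}, {5, 5.1}};
        for (double[] c : centers(pts, 1.0, 0.15))
            System.out.println(c[0] + ", " + c[1]); // two centers, one per blob
    }
}
```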
17ANSP-BD-008: A Parallel Patient Treatment Time Prediction Algorithm and Its Applications in Hospital Queuing-Recommendation in a Big Data Environment

Effective patient queue management to minimize patient wait delays and patient overcrowding is one of the major challenges faced by hospitals. Unnecessary and annoying waits for long periods result in substantial human resource and time wastage and increase the frustration endured by patients. For each patient in the queue, the total treatment time of all the patients before him is the time that he must wait. It would be convenient and preferable if patients could receive the most efficient treatment plan and know the predicted waiting time through a mobile application that updates in real time. Therefore, we propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the waiting time for each treatment task for a patient. We use realistic patient data from various hospitals to obtain a treatment time model for each task. Based on this large-scale, realistic dataset, the treatment time for each patient in the current queue of each task is predicted. Based on the predicted waiting time, a Hospital Queuing-Recommendation (HQR) system is developed. HQR calculates and predicts an efficient and convenient treatment plan recommended for the patient. Because of the large-scale, realistic dataset and the requirement for real-time response, the PTTP algorithm and HQR system demand efficiency and low-latency response. We use an Apache Spark-based cloud implementation at the National Supercomputing Center in Changsha to achieve these goals. Extensive experimentation and simulation results demonstrate the effectiveness and applicability of our proposed model in recommending an effective treatment plan for patients to minimize their wait times in hospitals.
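The queuing principle stated above, that a patient's wait is the summed treatment time of everyone ahead in the queue, translates directly into code once a per-task time model is available. A minimal sketch with a hypothetical lookup-table model standing in for the paper's Spark-trained PTTP model:

```java
import java.util.List;
import java.util.Map;

/** Sketch: predicted wait = sum of predicted treatment times of patients ahead. */
public class WaitTimePredictor {
    // Hypothetical per-task model: mean treatment minutes keyed by (task, age band).
    static final Map<String, Double> MODEL = Map.of(
            "blood-test|adult", 4.0, "blood-test|senior", 6.5,
            "ct-scan|adult", 12.0, "ct-scan|senior", 15.0);

    record Patient(String ageBand) {}

    /** Wait for the last position in a task queue, in minutes. */
    static double predictedWait(String task, List<Patient> queueAhead) {
        return queueAhead.stream()
                .mapToDouble(p -> MODEL.getOrDefault(task + "|" + p.ageBand(), 10.0))
                .sum(); // each patient ahead contributes their predicted treatment time
    }

    public static void main(String[] args) {
        List<Patient> ahead = List.of(new Patient("adult"), new Patient("senior"), new Patient("adult"));
        System.out.printf("Predicted CT-scan wait: %.1f min%n", predictedWait("ct-scan", ahead));
        // 12.0 + 15.0 + 12.0 = 39.0 minutes
    }
}
```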
17ANSP-BD-009: FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Traditional parallel algorithms for mining frequent itemsets aim to balance load by partitioning data equally among a group of computing nodes. We start this study by discovering a serious performance problem of the existing parallel Frequent Itemset Mining algorithms: given a large dataset, the data partitioning strategies in existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating a similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by virtue of eliminating redundant transactions on Hadoop nodes, and improves the performance of the existing parallel frequent-pattern scheme by up to 31%, with an average of 18%.
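The Locality-Sensitive Hashing ingredient can be sketched with MinHash: the more similar two transactions are under the Jaccard measure, the more likely their signatures agree entry by entry, so bucketing on a short signature band tends to co-locate similar transactions. The hash count, band width, and partition count below are illustrative assumptions, not FiDoop-DP's actual Voronoi-based scheme; item ids are assumed non-negative.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

/** MinHash-based bucketing sketch: similar transactions tend to share a partition. */
public class MinHashPartitioner {
    private static final int PRIME = 2147483647; // 2^31 - 1, a Mersenne prime
    private final int[][] hashParams;            // (a, b) pairs for universal hashing

    MinHashPartitioner(int numHashes, long seed) {
        Random rnd = new Random(seed);
        hashParams = new int[numHashes][2];
        for (int[] p : hashParams) {
            p[0] = 1 + rnd.nextInt(PRIME - 1);
            p[1] = rnd.nextInt(PRIME);
        }
    }

    /** MinHash signature: per hash function, the minimum hash over the item set. */
    int[] signature(Set<Integer> items) {
        int[] sig = new int[hashParams.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int item : items)
            for (int h = 0; h < hashParams.length; h++) {
                long v = ((long) hashParams[h][0] * item + hashParams[h][1]) % PRIME;
                if (v < sig[h]) sig[h] = (int) v;
            }
        return sig;
    }

    /** Bucket on a two-entry signature band (requires at least two hash functions). */
    int partition(Set<Integer> items, int numPartitions) {
        int[] sig = signature(items);
        // Sets agreeing on the band land together; per-entry agreement probability
        // equals the Jaccard similarity of the two item sets.
        return Math.floorMod(31 * sig[0] + sig[1], numPartitions);
    }

    public static void main(String[] args) {
        MinHashPartitioner p = new MinHashPartitioner(8, 42L);
        System.out.println(p.partition(Set.of(1, 2, 3, 4), 10));
        // Agrees with the line above more often than a dissimilar set would.
        System.out.println(p.partition(Set.of(1, 2, 3, 5), 10));
    }
}
```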