CS 6343: CLOUD COMPUTING Term Project

Size: px
Start display at page:

Download "CS 6343: CLOUD COMPUTING Term Project"

Transcription

1 CS 6343: CLOUD COMPUTING Term Project Project Goal Explore existing Cloud storage systems Implement some components in Cloud storage systems to get a better understanding on the implementation issues in Cloud storage systems Preliminary Form your group May be adjusted by the instructor and TA 2-3 weeks later Determine a meeting time Your group should have your own meeting times each week to coordinate the efforts in the group Your group needs to meet with the TA and sometimes also the instructor every week The attendance for the weekly meetings is mandatory The group membership may be adjusted if the meeting time becomes a problem During the meeting, the group needs to discuss its progress and decides its subsequent schedule and goals Each member in the group should report her/his contributions and the subsequent goals Your group should submit a weekly report, discussing what each member has done for the project (just a simple list of items, cumulative over the weeks), problems encountered and how they are resolved Project assignment The instructor will assign one of the projects to the group In the following, we first discuss the common works to be done for all project, then we discuss the detailed steps for each project Facility Each group will work on a cluster of computers, with 4 machines and 2 monitors on 2 tables Project-A/B groups will be in ECSS and Project-C groups will be in ECSS (may change) Common Steps to All Projects Goal Install a few famous Cloud storage systems (file systems and object storages) and compare their features and performance Cloud storage systems considered HDFS and HBase: Cassandra: Redis: Implement a map-reduce program on HDFS or Cassandra Environment setting Install Linux (better to be CentOS) on each machine in the cluster Install VMM on each machine and start some VMs Set up KVM Explore libvirt to achieve VM creation, cloning, and scaling

2 Set up the networking in the VM environment and make sure the VMs can communicate with each other Cloud storage system exploration Install the Cloud storage systems given above on the cluster Identify a feature vector (what features should be considered if a user needs to select a file system to use) for evaluating the Cloud storage systems Look up time Access latency, access throughput, directory service latency, etc. Load balancing features and performance Consistency model and solutions, availability solutions, etc. Other special features that are unique to a certain file system Use YCSB or similar software to explore the features of the file systems and object storages based on the feature vector YCSB does not offer delete commands We may offer an IO request generator to provide a more complete evaluation Map-Reduce on HDFS or Cassandra Install Hadoop MapReduce on HDFS or Cassandra Input files A sanitized crime database from The source of this crime database was from An example input file: Write a MapReduce program to compute the total crime incidents of each crime type in each region Region definition Crime location is defined on a coordinate system (East, North) East and North are defined by a 5-digit numerical value Region definition 1: use the first digit of the coordinates only to define a region. (5xxxx, 7xxxx), (5xxxx, 3xxxx), (8xxxx, 6xxxx), each is one region. Supposedly there are 100 regions, but not all the numbers appear in the files Region definition 2: use the first three digits of the coordinates to define a region. (535xx, 726xx) is one region Consider other region definitions Crime types include: Anti-social behavior, Burglary, Criminal damage and arson, Drugs, Other theft, Public disorder and weapons, Robbery, Shoplifting, Vehicle crime, Violent crime, Other crime Study Hadoop behaviors from its logs How the file(s) are distributed over the nodes. One large input file or many small input files How the mapper and reducer tasks are distributed over the nodes The performance of map, reduce, and shuffling&sorting phases, including execution time, memory usage, etc. How the system handle errors Project A/B Specifics Goal

3 Develop DHT and the corresponding load balancing solutions in Cloud storage systems Develop various DHT schemes and explore their pros and cons Consider DHT maintenance, updates, as well as load balancing Project steps Study Study various DHT solutions in Cloud storage systems Swift, Cassandra, Redis, Ceph (CRUSH) Study the corresponding load balancing solutions in DHT based storage systems Implement various DHT solutions DHT solutions to be implemented DHT1: Cassandra and Swift DHT2: Ceph DHT3: Elastic DHT: Similar to Redis, having a central directory, but allow the hash table to change its size Maintain a membership list for physical storage nodes Use a centralized solution like Swift and Ceph (A) Use a gossip protocol based solution like Cassandra (B) A potential package that can be used for the gossip protocol: Implement a virtual node to physical node mapping Consider virtual nodes in all cases Maintain a virtual node to physical node mapping table based on DHT solutions Should have a system configuration file to specify Number of physical nodes and weight of each physical node Virtual node unit weight Initial Table (OSD map, Ring) Replication level Implement DHT access requests handling See the list of access requests in the simulated control client Implement the simulated storage servers Each storage server also maintains the membership information and virtual node mapping Storage servers use Epoch to sync the membership information Simulated storage servers handle the access requests from the simulate clients See the list of requests by the simulated client For read/write requests, check if the files/objects accessed are local, if so, simply return this check is only by hashing the file/object names, no need to keep a list of local files If the files/objects to be accessed are not local, or are locked, then an error message should be returned to the client For locking, check the file transfer protocol Implement the simulated clients Each client issues a sequence of requests (prepared in advance in a file) Should be a reasonable access sequence Should have a random waiting time in between subsequent requests (Poison distribution) Client requests Requests: read/write Generate the file/object name for the requests and determine which storage node should be accessed Client sends the read/write request to the storage server after lookup or after finding the serverid in its cache DHT caching at the client side

4 Cache the necessary information locally Maintain an Epoch number to support sync For the centralized membership table (A), only sync with the central server For the distributed membership table (B), sync with any storage server Report: Provide print out to show the correctness of the implementation Control client requests Table initialization Physical node insertion and removal Virtual node insertion and removal Load balance request (see load balance section) Report: Access the data structures and print them to show the correctness of the implementation Implement a simulated file transfer solution to transfer a set of files The unit for transferring is a hash bucket Not to transfer real files, just to simulate the transfer and implement the corresponding DHT update mechanisms (if relevant) Lock the hash bucket being transferred and disable accesses to it Otherwise, there may be updates to files/objects and may cause inconsistency This part can be omitted if time does nt Simulate a transfer time, depending on the bytes to be transferred After transfer, enable the accesses to the bucket at the new node Update the DHT Implement load balancing solutions Load balancing type LB1: Elastic DHT Centralized decision making, simply decide to move a hash bucket from storage node x to storage node y May need to decide whether to split or merge the hash node LB2: Swift, Ceph Centralized decision making Swift: decide whether to create new virtual nodes or remove virtual nodes Ceph: decide whether to change the weights of the storage nodes LB3: Cassandra Assume that the membership table also maintains load information A heavily loaded node changes its own hash value and update the membership table Load balance request for LB1 and LB2 Control client simply sends a load balancing request to the central server Central server decides virtual node changes and informs the involved storage servers Source storage server simulates the file transfer protocol and informs the central server after file transfer Central server updates the membership table and syncs with other nodes Load balance request for LB3 Control client sends a load balancing request to a randomly selected server The selected server moves one of its virtual node in the DHT by 1 position and start the simulated file transfer protocol The selected server updates its membership table after file transfer, and sync with other servers by the gossip protocol Compare the performance of various DHT solutions Start a large number of clients to access the Cloud storage system

5 Multiple client can be simulated by multiple processes as well as multiple threads in each process Obtain the statistical performance data Consider different control parameters Plot the performance data Project C Specifics Goal Develop directory structure maintenance solutions in Cloud file systems Develop various directory structure maintenance solutions and explore their pros and cons Consider directory maintenance, reads and updates Introduction In Unix system, each directory is a file by itself In many cloud file systems, files are treated as individual elements during placement among the distributed servers and directory structure maintenance is handled separately Dir1: Use a centralized server to store the entire directory Dir2: Ceph solution (split the overall directory dynamically) Dir3: Use Unix solution (treat directory files as regular files), but merge a subtree of directories into one file Let the #levels be a configurable parameter When #levels = 1, it is the Unix file system solution Implement simulated clients Each client may issue a sequence of directory access requests Consider directory creation, deletion, rename, ls operations The sequence should be reasonable Fir Dir1, simply send the directory requests to the central server For Dir2, send the directory requests to the correct server Each client knows the root server If the correct server is unknown, start from the root server, traverse till the correct server has been reached Cache the intermediate partitioning information. Can use caching packages such as Java s EhCache, Guava Caching Utilities, cache2k. Cache size should be limited When finding information from the cache, should try prefixes of the path For Dir3, needs to hash and determine which server has the directory file Hash the directory path (or prefix of it, depending on the #levels to be merged) to obtain the server ID Control client Start the directory server(s) For Dir2, consider partition change request Query the directory server data structures to show that the system is implemented correctly Implement the directory server(s) Handle the directory requests Each server creates a thread pool to allow parallel request handling Lock the directory node to ensure proper accesses Since there are parallel threads on each server, locking is needed so that there is no conflict directory accesses E.g., x deletes directory foo, and y performs ls on foo

6 Dir2 Consider 3 directory servers Consider dynamic partition changes Upon handling a partition change request, server x should send the directory information to server y Dir3 Implement the storage servers Implement real directory files and hash them to storage servers Implement read/update of directory files Evaluation Generate a huge directory structure May be provided when ready Start a large number of clients Use several processes and each process include a large number of threads, each thread is a client A simulated client needs to issue directory requests to the simulated directory server(s) Generate a sequence of requests in advance and store them in a file Should be a reasonable access sequence During simulation, the client reads one request at a time from the file and submit the request to the simulated server Should have a random waiting time in between subsequent requests (Poison distribution) Evaluate the performance of different directory management solutions Project Schedule Week 1 Week 2 Week 3 Week of 10/19 Week of 11/30 Team formation, explore Linux and KVM by individual students Computer and lab assignment, environment setup by each team, study the project description, setup meeting schedule Kickoff meeting and progress check Midterm demo and project report due - Finish cloud storage system installations and exploration - Implement a draft of the simplified project Final project preliminary inspection - Finish the final project and obtain comments for improvements - Finish the preliminary final report - Finish the MapReduce project 12/8-15 Final project demo and project report due - Date is determined during the preliminary inspection Project Submissions Mid-term submissions Your working code Though your code is not for the entire project, it should implement some functions required in the project and should be fully working Midterm project report (contents is the same as final report) Final submissions Your working code

7 Final project report Midterm report should be progressively revised into final report Teams that fall behind will have meetings every week to ensure the progress is caught up Submission guideline Each team only needs one submission, team leader should make the submission through e- learning <team-label>: A, B1, B2, C1, C2 During submission, attach the doc file and the file name has to be <team-label>.doc or <teamlabel>.docx (e.g., B1.doc) Do not submit pdf Zip or tar your code, file name has to be <team-label>.zip Attach the zip file during e-learning submission Required information in the report (can include more than listed) In the cover page, provide team-label, title, and team members Introduction Goal of the project What you would offer in your system and why what you offered is important Study of related work Summary of related works (in paper or similar products available) How your project is different from or is similar to some existing works Approach System architecture Activity diagram (workflow) Architecture diagrams (from high level to low level decomposition of the system) Description of the nodes in the architecture Detailed design Including the APIs of each important component, the algorithms used in some components, etc. Implementation details List the code files and VMs that are in your system Provide a structure of the code files and describe each code file and each VM in the list The relations of the code files and what they are doing should have been in the detailed design section, you can relate the code files to those in the design section Problems encountered and how they are resolved This is important, it is the basis for us to decide the efforts you have actually put in the project... Experimental results Clearly state the control parameters and the metrics for measurement Experimentation needs to be thorough and results need to be easy to read (e.g., use graphs and table) Installation manual, including existing open source used (no need to give installation guide for open sources, just their urls) and how to setup your programs User manual, discussing how to use your system Workload distribution Include a high level summary and the detailed list The detailed list include tasks done by each member in the team week by week (just attach the log you prepared every week)

CS 6343: CLOUD COMPUTING Term Project

CS 6343: CLOUD COMPUTING Term Project CS 6343: CLOUD COMPUTING Term Project For all projects Each group will be assigned a cluster of machines Each group should install VMM on each platform and use the VMs to simulate more machines See VMM

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Presented by Nanditha Thinderu

Presented by Nanditha Thinderu Presented by Nanditha Thinderu Enterprise systems are highly distributed and heterogeneous which makes administration a complex task Application Performance Management tools developed to retrieve information

More information

Deploying Software Defined Storage for the Enterprise with Ceph. PRESENTATION TITLE GOES HERE Paul von Stamwitz Fujitsu

Deploying Software Defined Storage for the Enterprise with Ceph. PRESENTATION TITLE GOES HERE Paul von Stamwitz Fujitsu Deploying Software Defined Storage for the Enterprise with Ceph PRESENTATION TITLE GOES HERE Paul von Stamwitz Fujitsu Agenda Yet another attempt to define SDS Quick Overview of Ceph from a SDS perspective

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

/ Cloud Computing. Recitation 10 March 22nd, 2016

/ Cloud Computing. Recitation 10 March 22nd, 2016 15-319 / 15-619 Cloud Computing Recitation 10 March 22nd, 2016 Overview Administrative issues Office Hours, Piazza guidelines Last week s reflection Project 3.3, OLI Unit 4, Module 15, Quiz 8 This week

More information

Cassandra - A Decentralized Structured Storage System. Avinash Lakshman and Prashant Malik Facebook

Cassandra - A Decentralized Structured Storage System. Avinash Lakshman and Prashant Malik Facebook Cassandra - A Decentralized Structured Storage System Avinash Lakshman and Prashant Malik Facebook Agenda Outline Data Model System Architecture Implementation Experiments Outline Extension of Bigtable

More information

Tools for Social Networking Infrastructures

Tools for Social Networking Infrastructures Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes

More information

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data

More information

ZHT A Fast, Reliable and Scalable Zero- hop Distributed Hash Table

ZHT A Fast, Reliable and Scalable Zero- hop Distributed Hash Table ZHT A Fast, Reliable and Scalable Zero- hop Distributed Hash Table 1 What is KVS? Why to use? Why not to use? Who s using it? Design issues A storage system A distributed hash table Spread simple structured

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

CS 677 Distributed Operating Systems. Programming Assignment 3: Angry birds : Replication, Fault Tolerance and Cache Consistency

CS 677 Distributed Operating Systems. Programming Assignment 3: Angry birds : Replication, Fault Tolerance and Cache Consistency CS 677 Distributed Operating Systems Spring 2013 Programming Assignment 3: Angry birds : Replication, Fault Tolerance and Cache Consistency Due: Tue Apr 30 2013 You may work in groups of two for this lab

More information

Making Non-Distributed Databases, Distributed. Ioannis Papapanagiotou, PhD Shailesh Birari

Making Non-Distributed Databases, Distributed. Ioannis Papapanagiotou, PhD Shailesh Birari Making Non-Distributed Databases, Distributed Ioannis Papapanagiotou, PhD Shailesh Birari Dynomite Ecosystem Dynomite - Proxy layer Dyno - Client Dynomite-manager - Ecosystem orchestrator Dynomite-explorer

More information

-Presented By : Rajeshwari Chatterjee Professor-Andrey Shevel Course: Computing Clusters Grid and Clouds ITMO University, St.

-Presented By : Rajeshwari Chatterjee Professor-Andrey Shevel Course: Computing Clusters Grid and Clouds ITMO University, St. -Presented By : Rajeshwari Chatterjee Professor-Andrey Shevel Course: Computing Clusters Grid and Clouds ITMO University, St. Petersburg Introduction File System Enterprise Needs Gluster Revisited Ceph

More information

CS-580K/480K Advanced Topics in Cloud Computing. Object Storage

CS-580K/480K Advanced Topics in Cloud Computing. Object Storage CS-580K/480K Advanced Topics in Cloud Computing Object Storage 1 When we use object storage When we check Facebook, twitter Gmail Docs on DropBox Check share point Take pictures with Instagram 2 Object

More information

ROCK INK PAPER COMPUTER

ROCK INK PAPER COMPUTER Introduction to Ceph and Architectural Overview Federico Lucifredi Product Management Director, Ceph Storage Boston, December 16th, 2015 CLOUD SERVICES COMPUTE NETWORK STORAGE the future of storage 2 ROCK

More information

Time Series Storage with Apache Kudu (incubating)

Time Series Storage with Apache Kudu (incubating) Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

CS / Cloud Computing. Recitation 11 November 5 th and Nov 8 th, 2013

CS / Cloud Computing. Recitation 11 November 5 th and Nov 8 th, 2013 CS15-319 / 15-619 Cloud Computing Recitation 11 November 5 th and Nov 8 th, 2013 Announcements Encounter a general bug: Post on Piazza Encounter a grading bug: Post Privately on Piazza Don t ask if my

More information

Distributed Storage with GlusterFS

Distributed Storage with GlusterFS Distributed Storage with GlusterFS Dr. Udo Seidel Linux-Strategy @ Amadeus OSDC 2013 1 Agenda Introduction High level overview Storage inside Use cases Summary OSDC 2013 2 Introduction OSDC 2013 3 Me ;-)

More information

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System GFS: Google File System Google C/C++ HDFS: Hadoop Distributed File System Yahoo Java, Open Source Sector: Distributed Storage System University of Illinois at Chicago C++, Open Source 2 System that permanently

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing

More information

Midterm Examination CSE 455 / CIS 555 Internet and Web Systems Spring 2009 Zachary Ives

Midterm Examination CSE 455 / CIS 555 Internet and Web Systems Spring 2009 Zachary Ives Midterm Examination CSE 455 / CIS 555 Internet and Web Systems Spring 2009 Zachary Ives Name: _Solution 6 questions, 100 pts, 80 minutes 1. (20 pts) Compare Hadoop (plus HDFS) to the Chord DHT. (a) What

More information

4. Note: This example has NFS version 3, but other settings such as NFS version 4 may also work better in some environments.

4. Note: This example has NFS version 3, but other settings such as NFS version 4 may also work better in some environments. Creating NFS Share 1. Mounting the NFS Share from VMware vsphere Mounting from Windows NFS Clients NFS and Firewall Settings NFS Client Mount from Linux NFS v4 and Authentication Considerations Common

More information

Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017

Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017 Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store Wei Xie TTU CS Department Seminar, 3/7/2017 1 Outline General introduction Study 1: Elastic Consistent Hashing based Store

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

virtual machine block storage with the ceph distributed storage system sage weil xensummit august 28, 2012

virtual machine block storage with the ceph distributed storage system sage weil xensummit august 28, 2012 virtual machine block storage with the ceph distributed storage system sage weil xensummit august 28, 2012 outline why you should care what is it, what it does how it works, how you can use it architecture

More information

Distributed Systems. 16. Distributed Lookup. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 16. Distributed Lookup. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 16. Distributed Lookup Paul Krzyzanowski Rutgers University Fall 2017 1 Distributed Lookup Look up (key, value) Cooperating set of nodes Ideally: No central coordinator Some nodes can

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

INTRODUCTION TO CEPH. Orit Wasserman Red Hat August Penguin 2017

INTRODUCTION TO CEPH. Orit Wasserman Red Hat August Penguin 2017 INTRODUCTION TO CEPH Orit Wasserman Red Hat August Penguin 2017 CEPHALOPOD A cephalopod is any member of the molluscan class Cephalopoda. These exclusively marine animals are characterized by bilateral

More information

Ceph Rados Gateway. Orit Wasserman Fosdem 2016

Ceph Rados Gateway. Orit Wasserman Fosdem 2016 Ceph Rados Gateway Orit Wasserman owasserm@redhat.com Fosdem 2016 AGENDA Short Ceph overview Rados Gateway architecture What's next questions Ceph architecture Cephalopod Ceph Open source Software defined

More information

Google File System. Arun Sundaram Operating Systems

Google File System. Arun Sundaram Operating Systems Arun Sundaram Operating Systems 1 Assumptions GFS built with commodity hardware GFS stores a modest number of large files A few million files, each typically 100MB or larger (Multi-GB files are common)

More information

DEMYSTIFYING BIG DATA WITH RIAK USE CASES. Martin Schneider Basho Technologies!

DEMYSTIFYING BIG DATA WITH RIAK USE CASES. Martin Schneider Basho Technologies! DEMYSTIFYING BIG DATA WITH RIAK USE CASES Martin Schneider Basho Technologies! Agenda Defining Big Data in Regards to Riak A Series of Trade-Offs Use Cases Q & A About Basho & Riak Basho Technologies is

More information

NOSQL DATABASE SYSTEMS: DECISION GUIDANCE AND TRENDS. Big Data Technologies: NoSQL DBMS (Decision Guidance) - SoSe

NOSQL DATABASE SYSTEMS: DECISION GUIDANCE AND TRENDS. Big Data Technologies: NoSQL DBMS (Decision Guidance) - SoSe NOSQL DATABASE SYSTEMS: DECISION GUIDANCE AND TRENDS h_da Prof. Dr. Uta Störl Big Data Technologies: NoSQL DBMS (Decision Guidance) - SoSe 2017 163 Performance / Benchmarks Traditional database benchmarks

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction

More information

Developing Enterprise Cloud Solutions with Azure

Developing Enterprise Cloud Solutions with Azure Developing Enterprise Cloud Solutions with Azure Java Focused 5 Day Course AUDIENCE FORMAT Developers and Software Architects Instructor-led with hands-on labs LEVEL 300 COURSE DESCRIPTION This course

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

CSE-E5430 Scalable Cloud Computing Lecture 9

CSE-E5430 Scalable Cloud Computing Lecture 9 CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay

More information

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store Oracle NoSQL Database A Distributed Key-Value Store Charles Lamb The following is intended to outline our general product direction. It is intended for information purposes only,

More information

Scale-out Storage Solution and Challenges Mahadev Gaonkar igate

Scale-out Storage Solution and Challenges Mahadev Gaonkar igate Scale-out Solution and Challenges Mahadev Gaonkar igate 2013 Developer Conference. igate. All Rights Reserved. Table of Content Overview of Scale-out Scale-out NAS Solution Architecture IO Workload Distribution

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

Armon HASHICORP

Armon HASHICORP Nomad Armon Dadgar @armon Distributed Optimistically Concurrent Scheduler Nomad Distributed Optimistically Concurrent Scheduler Nomad Schedulers map a set of work to a set of resources Work (Input) Resources

More information

Advanced Systems Lab

Advanced Systems Lab Advanced Systems Lab Autumn Semester 2017 Gustavo Alonso 1 Course Overview 1.1 Prerequisites This is a project based course operating on a self-study principle. The course relies on the students own initiative,

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Benchmarking Cloud Serving Systems with YCSB 詹剑锋 2012 年 6 月 27 日

Benchmarking Cloud Serving Systems with YCSB 詹剑锋 2012 年 6 月 27 日 Benchmarking Cloud Serving Systems with YCSB 詹剑锋 2012 年 6 月 27 日 Motivation There are many cloud DB and nosql systems out there PNUTS BigTable HBase, Hypertable, HTable Megastore Azure Cassandra Amazon

More information

9 May Swifta. A performant Hadoop file system driver for Swift. Mengmeng Liu Andy Robb Ray Zhang

9 May Swifta. A performant Hadoop file system driver for Swift. Mengmeng Liu Andy Robb Ray Zhang 9 May 2017 Swifta A performant Hadoop file system driver for Swift Mengmeng Liu Andy Robb Ray Zhang Our Big Data Journey One of two teams that run multi-tenant Hadoop ecosystem at Walmart Large, shared

More information

Performance Evaluation of Cassandra in a Virtualized environment

Performance Evaluation of Cassandra in a Virtualized environment Master of Science in Computer Science February 2017 Performance Evaluation of Cassandra in a Virtualized environment Mohit Vellanki Faculty of Computing Blekinge Institute of Technology SE-371 79 Karlskrona

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

File Management By : Kaushik Vaghani

File Management By : Kaushik Vaghani File Management By : Kaushik Vaghani File Concept Access Methods File Types File Operations Directory Structure File-System Structure File Management Directory Implementation (Linear List, Hash Table)

More information

2/26/2017. For instance, consider running Word Count across 20 splits

2/26/2017. For instance, consider running Word Count across 20 splits Based on the slides of prof. Pietro Michiardi Hadoop Internals https://github.com/michiard/disc-cloud-course/raw/master/hadoop/hadoop.pdf Job: execution of a MapReduce application across a data set Task:

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems Jargons, Concepts, Scope and Systems Key Value Stores, Document Stores, Extensible Record Stores Overview of different scalable relational systems Examples of different Data stores Predictions, Comparisons

More information

Indexing Large-Scale Data

Indexing Large-Scale Data Indexing Large-Scale Data Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook November 16, 2010

More information

Importing and Exporting Data Between Hadoop and MySQL

Importing and Exporting Data Between Hadoop and MySQL Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for

More information

NoSQL BENCHMARKING AND TUNING. Nachiket Kate Santosh Kangane Ankit Lakhotia Persistent Systems Ltd. Pune, India

NoSQL BENCHMARKING AND TUNING. Nachiket Kate Santosh Kangane Ankit Lakhotia Persistent Systems Ltd. Pune, India NoSQL BENCHMARKING AND TUNING Nachiket Kate Santosh Kangane Ankit Lakhotia Persistent Systems Ltd. Pune, India Today large variety of available NoSQL options has made it difficult for developers to choose

More information

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1 HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,

More information

Armon HASHICORP

Armon HASHICORP Nomad Armon Dadgar @armon Cluster Manager Scheduler Nomad Cluster Manager Scheduler Nomad Schedulers map a set of work to a set of resources Work (Input) Resources Web Server -Thread 1 Web Server -Thread

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Huge market -- essentially all high performance databases work this way

Huge market -- essentially all high performance databases work this way 11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch

More information

CSE 124: Networked Services Lecture-17

CSE 124: Networked Services Lecture-17 Fall 2010 CSE 124: Networked Services Lecture-17 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/30/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

A Solution for Geographic Regions Load Balancing in Cloud Computing Environment

A Solution for Geographic Regions Load Balancing in Cloud Computing Environment Chapter 5 A Solution for Geographic Regions Load Balancing in Cloud Computing Environment 5.1 INTRODUCTION Cloud computing is one of the most interesting way of distributing the data as well as to get

More information

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent

More information

Developing Microsoft Azure Solutions (MS 20532)

Developing Microsoft Azure Solutions (MS 20532) Developing Microsoft Azure Solutions (MS 20532) COURSE OVERVIEW: This course is intended for students who have experience building ASP.NET and C# applications. Students will also have experience with the

More information

UNIT-IV HDFS. Ms. Selva Mary. G

UNIT-IV HDFS. Ms. Selva Mary. G UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with

More information

Big Data Analytics. Rasoul Karimi

Big Data Analytics. Rasoul Karimi Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1 Outline

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

Chapter 10: File-System Interface

Chapter 10: File-System Interface Chapter 10: File-System Interface Objectives: To explain the function of file systems To describe the interfaces to file systems To discuss file-system design tradeoffs, including access methods, file

More information

Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist

Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist *ICSI Berkeley Zuse Institut Berlin 4/26/2010 Joos-Hendrik Boese Slide

More information

Dynamic Metadata Management for Petabyte-scale File Systems

Dynamic Metadata Management for Petabyte-scale File Systems Dynamic Metadata Management for Petabyte-scale File Systems Sage Weil Kristal T. Pollack, Scott A. Brandt, Ethan L. Miller UC Santa Cruz November 1, 2006 Presented by Jae Geuk, Kim System Overview Petabytes

More information

/ Cloud Computing. Recitation 13 April 12 th 2016

/ Cloud Computing. Recitation 13 April 12 th 2016 15-319 / 15-619 Cloud Computing Recitation 13 April 12 th 2016 Overview Last week s reflection Project 4.1 Quiz 11 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 21 Project 4.2

More information

SwiftStack Object Storage

SwiftStack Object Storage Integrating NetBackup 8.1.x with SwiftStack Object Storage July 23, 2018 1 Table of Contents Table of Contents 2 Introduction 4 SwiftStack Storage Connected to NetBackup 5 Netbackup 8.1 Support for SwiftStack

More information

Ceph Block Devices: A Deep Dive. Josh Durgin RBD Lead June 24, 2015

Ceph Block Devices: A Deep Dive. Josh Durgin RBD Lead June 24, 2015 Ceph Block Devices: A Deep Dive Josh Durgin RBD Lead June 24, 2015 Ceph Motivating Principles All components must scale horizontally There can be no single point of failure The solution must be hardware

More information

Scaling Out Key-Value Storage

Scaling Out Key-Value Storage Scaling Out Key-Value Storage COS 418: Distributed Systems Logan Stafman [Adapted from K. Jamieson, M. Freedman, B. Karp] Horizontal or vertical scalability? Vertical Scaling Horizontal Scaling 2 Horizontal

More information

IMPORTANT: Circle the last two letters of your class account:

IMPORTANT: Circle the last two letters of your class account: Spring 2011 University of California, Berkeley College of Engineering Computer Science Division EECS MIDTERM I CS 186 Introduction to Database Systems Prof. Michael J. Franklin NAME: STUDENT ID: IMPORTANT:

More information

SciSpark 201. Searching for MCCs

SciSpark 201. Searching for MCCs SciSpark 201 Searching for MCCs Agenda for 201: Access your SciSpark & Notebook VM (personal sandbox) Quick recap. of SciSpark Project What is Spark? SciSpark Extensions scitensor: N-dimensional arrays

More information

Distributed Systems. Hajussüsteemid MTAT Distributed File Systems. (slides: adopted from Meelis Roos DS12 course) 1/25

Distributed Systems. Hajussüsteemid MTAT Distributed File Systems. (slides: adopted from Meelis Roos DS12 course) 1/25 Hajussüsteemid MTAT.08.024 Distributed Systems Distributed File Systems (slides: adopted from Meelis Roos DS12 course) 1/25 Examples AFS NFS SMB/CIFS Coda Intermezzo HDFS WebDAV 9P 2/25 Andrew File System

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

Chapter 10: File-System Interface

Chapter 10: File-System Interface Chapter 10: File-System Interface Objectives: To explain the function of file systems To describe the interfaces to file systems To discuss file-system design tradeoffs, including access methods, file

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

GlusterFS Cloud Storage. John Mark Walker Gluster Community Leader, RED HAT

GlusterFS Cloud Storage. John Mark Walker Gluster Community Leader, RED HAT GlusterFS Cloud Storage John Mark Walker Gluster Community Leader, RED HAT Adventures in Cloud Storage John Mark Walker Gluster Community Leader November 8, 2013 Topics Usage and Performance GlusterFS

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Ceph Intro & Architectural Overview. Abbas Bangash Intercloud Systems

Ceph Intro & Architectural Overview. Abbas Bangash Intercloud Systems Ceph Intro & Architectural Overview Abbas Bangash Intercloud Systems About Me Abbas Bangash Systems Team Lead, Intercloud Systems abangash@intercloudsys.com intercloudsys.com 2 CLOUD SERVICES COMPUTE NETWORK

More information

Basics of Cloud Computing Lecture 2. Cloud Providers. Satish Srirama

Basics of Cloud Computing Lecture 2. Cloud Providers. Satish Srirama Basics of Cloud Computing Lecture 2 Cloud Providers Satish Srirama Outline Cloud computing services recap Amazon cloud services Elastic Compute Cloud (EC2) Storage services - Amazon S3 and EBS Cloud managers

More information

Architecture of a Real-Time Operational DBMS

Architecture of a Real-Time Operational DBMS Architecture of a Real-Time Operational DBMS Srini V. Srinivasan Founder, Chief Development Officer Aerospike CMG India Keynote Thane December 3, 2016 [ CMGI Keynote, Thane, India. 2016 Aerospike Inc.

More information

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10 Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*

More information

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads WHITE PAPER Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads December 2014 Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents

More information

Elastic Cloud Storage (ECS)

Elastic Cloud Storage (ECS) Elastic Cloud Storage (ECS) Version 3.1 Administration Guide 302-003-863 02 Copyright 2013-2017 Dell Inc. or its subsidiaries. All rights reserved. Published September 2017 Dell believes the information

More information

Kathleen Durant PhD Northeastern University CS Indexes

Kathleen Durant PhD Northeastern University CS Indexes Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical

More information

At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

At Course Completion Prepares you as per certification requirements for AWS Developer Associate. [AWS-DAW]: AWS Cloud Developer Associate Workshop Length Delivery Method : 4 days : Instructor-led (Classroom) At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

More information

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems CompSci 516 Data Intensive Computing Systems Lecture 21 (optional) NoSQL systems Instructor: Sudeepa Roy Duke CS, Spring 2016 CompSci 516: Data Intensive Computing Systems 1 Key- Value Stores Duke CS,

More information

Shen PingCAP 2017

Shen PingCAP 2017 Shen Li @ PingCAP About me Shen Li ( 申砾 ) Tech Lead of TiDB, VP of Engineering Netease / 360 / PingCAP Infrastructure software engineer WHY DO WE NEED A NEW DATABASE? Brief History Standalone RDBMS NoSQL

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information