CS 6343: CLOUD COMPUTING Term Project

Project Goal
- Explore existing Cloud storage systems
- Implement some components of Cloud storage systems to gain a better understanding of the implementation issues in Cloud storage systems

Preliminary
- Form your group
  - Groups may be adjusted by the instructor and TA 2-3 weeks later
- Determine a meeting time
  - Your group should have its own meeting times each week to coordinate the efforts within the group
  - Your group needs to meet with the TA, and sometimes also the instructor, every week
  - Attendance at the weekly meetings is mandatory
  - The group membership may be adjusted if the meeting time becomes a problem
  - During the meeting, the group discusses its progress and decides its subsequent schedule and goals
  - Each member of the group should report her/his contributions and subsequent goals
- Your group should submit a weekly report discussing what each member has done for the project (just a simple list of items, cumulative over the weeks), the problems encountered, and how they were resolved

Project assignment
- The instructor will assign one of the projects to each group
- In the following, we first discuss the work common to all projects, then the detailed steps for each project

Facility
- Each group will work on a cluster of computers, with 4 machines and 2 monitors on 2 tables
- Project-A/B groups will be in ECSS 3.217 and Project-C groups will be in ECSS 3.213 (may change)

Common Steps for All Projects

Goal
- Install a few well-known Cloud storage systems (file systems and object stores) and compare their features and performance
- Cloud storage systems considered
  - HDFS and HBase: http://hadoop.apache.org/
  - Cassandra: http://cassandra.apache.org/
  - Redis: https://redis.io/
- Implement a MapReduce program on HDFS or Cassandra

Environment setting
- Install Linux (CentOS is preferred) on each machine in the cluster
- Install a VMM on each machine and start some VMs
  - Set up KVM
  - Explore libvirt to achieve VM creation, cloning, and scaling
- Set up the networking in the VM environment and make sure the VMs can communicate with each other
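Once HDFS (listed above) is installed, a quick smoke test from any VM confirms both the networking and the HDFS setup. This is a minimal sketch; the NameNode URI and the test path are placeholders for your cluster's actual fs.defaultFS configuration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Quick HDFS smoke test: write a small file and list the root directory. */
public class HdfsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode URI -- replace with your cluster's fs.defaultFS setting.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-vm:9000"), conf);

    Path testFile = new Path("/tmp/hdfs-check.txt");
    try (FSDataOutputStream out = fs.create(testFile, true)) {
      out.write("hello from the cluster\n".getBytes(StandardCharsets.UTF_8));
    }

    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}
```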

Cloud storage system exploration
- Install the Cloud storage systems given above on the cluster
- Identify a feature vector (the features a user should consider when selecting a file system) for evaluating the Cloud storage systems
  - Lookup time
  - Access latency, access throughput, directory service latency, etc.
  - Load balancing features and performance
  - Consistency model and solutions, availability solutions, etc.
  - Other special features that are unique to a certain file system
- Use YCSB or similar software to explore the features of the file systems and object stores based on the feature vector
  - YCSB does not offer delete commands
  - We may offer an IO request generator to provide a more complete evaluation

MapReduce on HDFS or Cassandra
- Install Hadoop MapReduce on HDFS or Cassandra
  - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
- Input files
  - A sanitized crime database from http://utdallas.edu/~ilyen/course/cloud/forall/mr-data.zip
  - The source of this crime database is http://www.police.uk/data
  - An example input file: http://utdallas.edu/~ilyen/course/cloud/forall/mr-input.txt
- Write a MapReduce program to compute the total crime incidents of each crime type in each region (a sketch follows this section)
- Region definition
  - A crime location is defined on a coordinate system (East, North); East and North are 5-digit numerical values
  - Region definition 1: use only the first digit of each coordinate to define a region, e.g., (5xxxx, 7xxxx), (5xxxx, 3xxxx), and (8xxxx, 6xxxx) are each one region; there are nominally 100 regions, but not all combinations appear in the files
  - Region definition 2: use the first three digits of each coordinate to define a region, e.g., (535xx, 726xx) is one region
  - Consider other region definitions
- Crime types include: Anti-social behavior, Burglary, Criminal damage and arson, Drugs, Other theft, Public disorder and weapons, Robbery, Shoplifting, Vehicle crime, Violent crime, Other crime
- Study Hadoop behaviors from its logs
  - How the file(s) are distributed over the nodes; one large input file vs. many small input files
  - How the mapper and reducer tasks are distributed over the nodes
  - The performance of the map, reduce, and shuffle & sort phases, including execution time, memory usage, etc.
  - How the system handles errors
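As a starting point for the crime-count program, here is a hedged sketch using region definition 1. It assumes the input is one crime record per line with comma-separated fields; the column positions EAST_COL, NORTH_COL, and TYPE_COL are placeholders and must be adjusted to the actual layout of mr-input.txt.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrimeCount {

  public static class CrimeMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // Placeholder column positions -- adjust to the actual mr-input.txt layout.
    private static final int EAST_COL = 4, NORTH_COL = 5, TYPE_COL = 9;

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length <= TYPE_COL) {
        return; // skip header or malformed lines
      }
      String east = fields[EAST_COL].trim();
      String north = fields[NORTH_COL].trim();
      String type = fields[TYPE_COL].trim();
      if (east.isEmpty() || north.isEmpty() || type.isEmpty()) {
        return;
      }
      // Region definition 1: use the first digit of each coordinate.
      String region = "(" + east.charAt(0) + "xxxx," + north.charAt(0) + "xxxx)";
      context.write(new Text(region + "\t" + type), ONE);
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // (region, crime type) -> total incidents
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "crime count by region and type");
    job.setJarByClass(CrimeCount.class);
    job.setMapperClass(CrimeMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Switching to region definition 2 only changes how the region key is built (e.g., using east.substring(0, 3) and north.substring(0, 3)).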

Project A/B Specifics

Goal
- Develop DHTs and the corresponding load balancing solutions in Cloud storage systems
- Develop various DHT schemes and explore their pros and cons
- Consider DHT maintenance and updates as well as load balancing

Project steps

Study
- Study various DHT solutions in Cloud storage systems: Swift, Cassandra, Redis, Ceph (CRUSH)
- Study the corresponding load balancing solutions in DHT-based storage systems

Implement various DHT solutions
- DHT solutions to be implemented
  - DHT1: Cassandra and Swift
  - DHT2: Ceph
  - DHT3: Elastic DHT: similar to Redis, having a central directory, but allowing the hash table to change its size
- Maintain a membership list for the physical storage nodes
  - Use a centralized solution like Swift and Ceph (A)
  - Use a gossip protocol based solution like Cassandra (B)
  - A potential package that can be used for the gossip protocol: https://github.com/apache/incubator-gossip
- Implement a virtual node to physical node mapping (a sketch follows this section)
  - Consider virtual nodes in all cases
  - Maintain a virtual node to physical node mapping table based on the DHT solutions
  - Provide a system configuration file that specifies
    - The number of physical nodes and the weight of each physical node
    - The virtual node unit weight
    - The initial table (OSD map, Ring)
    - The replication level
- Implement DHT access request handling
  - See the list of access requests in the simulated control client

Implement the simulated storage servers
- Each storage server also maintains the membership information and the virtual node mapping
- Storage servers use an Epoch number to sync the membership information
- Simulated storage servers handle the access requests from the simulated clients
  - See the list of requests issued by the simulated client
  - For read/write requests, check whether the files/objects accessed are local; if so, simply return
    - This check is done only by hashing the file/object names; there is no need to keep a list of local files
  - If the files/objects to be accessed are not local, or are locked, an error message should be returned to the client
    - For locking, see the file transfer protocol
- Implement the simulated clients
  - Each client issues a sequence of requests (prepared in advance in a file)
    - The sequence should be a reasonable access sequence
    - There should be a random waiting time between subsequent requests (Poisson distribution)
  - Client requests
    - Requests: read/write
    - Generate the file/object name for each request and determine which storage node should be accessed
    - The client sends the read/write request to the storage server after a lookup, or after finding the server id in its cache
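For the DHT1-style virtual node to physical node mapping, a minimal sketch along the following lines can be used. MD5 hashing onto a ring and a fixed number of virtual nodes per physical node are design assumptions here, not requirements of the project.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

/** DHT1-style ring: virtual nodes hashed onto a ring, each mapped to a physical node. */
public class HashRing {
  private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // token -> physical node id
  private final int vnodesPerNode;

  public HashRing(int vnodesPerNode) {
    this.vnodesPerNode = vnodesPerNode;
  }

  public synchronized void addPhysicalNode(String nodeId) {
    for (int v = 0; v < vnodesPerNode; v++) {
      ring.put(hash(nodeId + "#vn" + v), nodeId); // one ring token per virtual node
    }
  }

  public synchronized void removePhysicalNode(String nodeId) {
    for (int v = 0; v < vnodesPerNode; v++) {
      ring.remove(hash(nodeId + "#vn" + v));
    }
  }

  /** Returns the physical node responsible for the given file/object name. */
  public synchronized String lookup(String objectName) {
    BigInteger token = hash(objectName);
    SortedMap<BigInteger, String> tail = ring.tailMap(token);
    return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
  }

  private static BigInteger hash(String key) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

Physical node weights can be honored by giving a heavier node proportionally more virtual nodes, and a replication level of r can be served by walking clockwise from the first owner to the next r-1 distinct physical nodes.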

DHT caching at the client side
- Cache the necessary information locally (a sketch follows this section)
- Maintain an Epoch number to support sync
  - For the centralized membership table (A), only sync with the central server
  - For the distributed membership table (B), sync with any storage server
- Report: provide printouts to show the correctness of the implementation

Control client requests
- Table initialization
- Physical node insertion and removal
- Virtual node insertion and removal
- Load balance request (see the load balancing section)
- Report: access the data structures and print them to show the correctness of the implementation

Implement a simulated file transfer solution to transfer a set of files (a sketch follows this section)
- The unit of transfer is a hash bucket
- Do not transfer real files; just simulate the transfer and implement the corresponding DHT update mechanisms (if relevant)
- Lock the hash bucket being transferred and disable accesses to it
  - Otherwise, there may be updates to the files/objects, which may cause inconsistency
  - This part can be omitted if time does not permit
- Simulate a transfer time that depends on the number of bytes to be transferred
- After the transfer, enable accesses to the bucket at the new node and update the DHT

Implement load balancing solutions
- Load balancing types
  - LB1: Elastic DHT
    - Centralized decision making; simply decide to move a hash bucket from storage node x to storage node y
    - May need to decide whether to split or merge hash buckets
  - LB2: Swift, Ceph
    - Centralized decision making
    - Swift: decide whether to create new virtual nodes or remove virtual nodes
    - Ceph: decide whether to change the weights of the storage nodes
  - LB3: Cassandra
    - Assume that the membership table also maintains load information
    - A heavily loaded node changes its own hash value and updates the membership table
- Load balance request for LB1 and LB2
  - The control client simply sends a load balancing request to the central server
  - The central server decides the virtual node changes and informs the involved storage servers
  - The source storage server simulates the file transfer protocol and informs the central server after the file transfer
  - The central server updates the membership table and syncs with the other nodes
- Load balance request for LB3
  - The control client sends a load balancing request to a randomly selected server
  - The selected server moves one of its virtual nodes in the DHT by 1 position and starts the simulated file transfer protocol
  - The selected server updates its membership table after the file transfer and syncs with the other servers via the gossip protocol
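The simulated file transfer described above can be as simple as the following sketch: lock the bucket, sleep for a time proportional to the simulated byte count, then update the mapping and unlock. The BYTES_PER_MS constant and the bucketToNode map are placeholders for whatever data structures your DHT actually uses.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Simulated transfer of one hash bucket from a source node to a target node. */
public class BucketTransfer {
  private final Set<Integer> lockedBuckets = ConcurrentHashMap.newKeySet();
  private final Map<Integer, String> bucketToNode;       // bucket id -> physical node id
  private static final double BYTES_PER_MS = 100_000.0;  // placeholder "bandwidth"

  public BucketTransfer(Map<Integer, String> bucketToNode) {
    this.bucketToNode = bucketToNode;
  }

  /** Returns true if the bucket is currently being moved; reads/writes should be rejected. */
  public boolean isLocked(int bucketId) {
    return lockedBuckets.contains(bucketId);
  }

  public void transfer(int bucketId, String targetNode, long simulatedBytes)
      throws InterruptedException {
    if (!lockedBuckets.add(bucketId)) {
      throw new IllegalStateException("bucket " + bucketId + " is already being transferred");
    }
    try {
      // Simulate the transfer time based on the number of bytes to move.
      Thread.sleep((long) (simulatedBytes / BYTES_PER_MS));
      bucketToNode.put(bucketId, targetNode);   // update the DHT mapping
    } finally {
      lockedBuckets.remove(bucketId);           // re-enable accesses at the new node
    }
  }
}
```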

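For the client-side DHT caching with an Epoch number described above, one hedged way to structure it is to stamp the cached mapping with the epoch at which it was fetched and refresh it whenever a server reports a newer epoch. The MembershipSnapshot type and the fetcher callback below are assumptions, standing in for whatever lookup message your protocol defines.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Client-side cache of the virtual-node-to-server mapping, tagged with an Epoch number. */
public class ClientDhtCache {
  private final Supplier<MembershipSnapshot> fetcher; // pulls from the central server (A) or any storage server (B)
  private volatile MembershipSnapshot snapshot;

  public static class MembershipSnapshot {
    public final long epoch;
    public final Map<String, String> virtualToPhysical; // virtual node id -> server id
    public MembershipSnapshot(long epoch, Map<String, String> map) {
      this.epoch = epoch;
      this.virtualToPhysical = new ConcurrentHashMap<>(map);
    }
  }

  public ClientDhtCache(Supplier<MembershipSnapshot> fetcher) {
    this.fetcher = fetcher;
    this.snapshot = fetcher.get();
  }

  /** Resolve a virtual node to a server using the cached mapping. */
  public String resolve(String virtualNodeId) {
    return snapshot.virtualToPhysical.get(virtualNodeId);
  }

  /** Called when a reply carries the server's epoch; refresh the cache if it is stale. */
  public void observeServerEpoch(long serverEpoch) {
    if (serverEpoch > snapshot.epoch) {
      snapshot = fetcher.get(); // sync with the central server (A) or any storage server (B)
    }
  }
}
```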
Compare the performance of the various DHT solutions
- Start a large number of clients to access the Cloud storage system
  - Multiple clients can be simulated by multiple processes, as well as multiple threads in each process
- Obtain statistical performance data
- Consider different control parameters
- Plot the performance data

Project C Specifics

Goal
- Develop directory structure maintenance solutions in Cloud file systems
- Develop various directory structure maintenance solutions and explore their pros and cons
- Consider directory maintenance, reads, and updates

Introduction
- In a Unix system, each directory is a file by itself
- In many cloud file systems, files are treated as individual elements during placement among the distributed servers, and directory structure maintenance is handled separately
  - Dir1: use a centralized server to store the entire directory
  - Dir2: the Ceph solution (split the overall directory dynamically)
  - Dir3: use the Unix solution (treat directory files as regular files), but merge a subtree of directories into one file
    - Let #levels be a configurable parameter
    - When #levels = 1, it is the Unix file system solution

Implement simulated clients
- Each client may issue a sequence of directory access requests
  - Consider directory creation, deletion, rename, and ls operations
  - The sequence should be reasonable
- For Dir1, simply send the directory requests to the central server
- For Dir2, send the directory requests to the correct server
  - Each client knows the root server
  - If the correct server is unknown, start from the root server and traverse until the correct server has been reached
  - Cache the intermediate partitioning information
    - Caching packages such as Java's EhCache, Guava Caching Utilities, or cache2k can be used
    - The cache size should be limited
    - When looking up information in the cache, try prefixes of the path
- For Dir3, hash the path to determine which server has the directory file (see the routing sketch after this section)
  - Hash the directory path (or a prefix of it, depending on the #levels to be merged) to obtain the server ID

Control client
- Start the directory server(s)
- For Dir2, consider partition change requests
- Query the directory server data structures to show that the system is implemented correctly

Implement the directory server(s)
- Handle the directory requests
- Each server creates a thread pool to allow parallel request handling
- Lock the directory node to ensure proper accesses
  - Since there are parallel threads on each server, locking is needed so that there are no conflicting directory accesses
  - E.g., x deletes directory foo while y performs ls on foo
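For the parallel request handling and per-directory locking just described, a minimal sketch could look like the following. The thread pool size and the per-path read/write lock granularity are design assumptions, not requirements.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Directory server skeleton: a thread pool plus one read/write lock per directory node. */
public class DirectoryServer {
  private final ExecutorService pool = Executors.newFixedThreadPool(16); // pool size is an assumption
  private final ConcurrentHashMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReadWriteLock lockFor(String path) {
    return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
  }

  /** ls: a read operation -- many can proceed concurrently on the same directory. */
  public void submitLs(String path, Runnable listAction) {
    pool.submit(() -> {
      ReadWriteLock lock = lockFor(path);
      lock.readLock().lock();
      try {
        listAction.run();
      } finally {
        lock.readLock().unlock();
      }
    });
  }

  /** create/delete/rename: write operations -- exclusive on the directory node. */
  public void submitUpdate(String path, Runnable updateAction) {
    pool.submit(() -> {
      ReadWriteLock lock = lockFor(path);
      lock.writeLock().lock();
      try {
        updateAction.run(); // e.g., x deleting "foo" cannot overlap y's ls on "foo"
      } finally {
        lock.writeLock().unlock();
      }
    });
  }
}
```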

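Returning to the Dir3 routing described above, the client can derive the server from a hash of the merged-subtree prefix of the path. In the sketch below, the rule that the first #levels path components identify the merged directory file, and the simple modulo placement over the server list, are both placeholders; the exact subtree-boundary rule is up to your design.

```java
import java.util.List;

/** Dir3 client-side routing: hash a prefix of the path (the merged subtree) to pick a server. */
public class Dir3Router {
  private final int levels;            // #levels of directories merged into one directory file
  private final List<String> servers;  // directory/storage server ids

  public Dir3Router(int levels, List<String> servers) {
    this.levels = levels;
    this.servers = servers;
  }

  /** Returns the path prefix that identifies the merged directory file (placeholder rule). */
  String mergedPrefix(String path) {
    String[] parts = path.split("/");   // leading "/" yields an empty first element
    StringBuilder prefix = new StringBuilder();
    int kept = 0;
    for (String part : parts) {
      if (part.isEmpty()) continue;
      prefix.append('/').append(part);
      if (++kept == levels) break;      // keep at most #levels components
    }
    return kept == 0 ? "/" : prefix.toString();
  }

  /** Hash the prefix to obtain the server that holds the directory file. */
  public String serverFor(String path) {
    int h = mergedPrefix(path).hashCode();
    return servers.get(Math.floorMod(h, servers.size()));
  }
}
```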
Dir2
- Consider 3 directory servers
- Consider dynamic partition changes
  - Upon handling a partition change request, server x should send the directory information to server y

Dir3
- Implement the storage servers
- Implement real directory files and hash them to the storage servers
- Implement read/update of the directory files

Evaluation
- Generate a huge directory structure
  - May be provided when ready
- Start a large number of clients
  - Use several processes, each including a large number of threads; each thread is a client
  - A simulated client needs to issue directory requests to the simulated directory server(s)
  - Generate a sequence of requests in advance and store them in a file
    - It should be a reasonable access sequence
  - During the simulation, the client reads one request at a time from the file and submits it to the simulated server
  - There should be a random waiting time between subsequent requests (Poisson distribution)
- Evaluate the performance of the different directory management solutions

Project Schedule
- Week 1: Team formation; individual students explore Linux and KVM
- Week 2: Computer and lab assignment, environment setup by each team, study of the project description, setup of the meeting schedule
- Week 3: Kickoff meeting and progress check
- Week of 10/19: Midterm demo and project report due
  - Finish the cloud storage system installations and exploration
  - Implement a draft of the simplified project
- Week of 11/30: Final project preliminary inspection
  - Finish the final project and obtain comments for improvements
  - Finish the preliminary final report
  - Finish the MapReduce project
- 12/8-15: Final project demo and project report due
  - The date is determined during the preliminary inspection

Project Submissions
- Mid-term submissions
  - Your working code
    - Though your code does not cover the entire project, it should implement some functions required in the project and should be fully working
  - Midterm project report (contents are the same as the final report)

- Final submissions
  - Your working code
  - Final project report
    - The midterm report should be progressively revised into the final report
- Teams that fall behind will have meetings every week to ensure that progress is caught up

Submission guideline
- Each team only needs one submission; the team leader should make the submission through e-learning
- <team-label>: A, B1, B2, C1, C2
- During submission, attach the doc file; the file name has to be <team-label>.doc or <team-label>.docx (e.g., B1.doc)
  - Do not submit a PDF
- Zip or tar your code; the file name has to be <team-label>.zip
  - Attach the zip file during the e-learning submission

Required information in the report (you can include more than what is listed)
- On the cover page, provide the team-label, title, and team members
- Introduction
  - Goal of the project
  - What you offer in your system and why what you offer is important
- Study of related work
  - Summary of related works (in papers or similar available products)
  - How your project differs from or is similar to some existing works
- Approach
  - System architecture
    - Activity diagram (workflow)
    - Architecture diagrams (from high-level to low-level decomposition of the system)
    - Description of the nodes in the architecture
  - Detailed design
    - Including the APIs of each important component, the algorithms used in some components, etc.
  - Implementation details
    - List the code files and VMs that are in your system
    - Provide a structure of the code files and describe each code file and each VM in the list
    - The relations of the code files and what they do should already appear in the detailed design section; you can relate the code files to those in the design section
  - Problems encountered and how they were resolved
    - This is important; it is the basis for us to judge the effort you have actually put into the project
- Experimental results
  - Clearly state the control parameters and the metrics for measurement
  - Experimentation needs to be thorough and results need to be easy to read (e.g., use graphs and tables)
- Installation manual, including the existing open-source software used (no need to give installation guides for open-source software, just their URLs) and how to set up your programs
- User manual, discussing how to use your system
- Workload distribution
  - Include a high-level summary and a detailed list
  - The detailed list includes the tasks done by each member of the team week by week (just attach the log you prepared every week)