CS 6343: CLOUD COMPUTING Term Project

Project Goal
- Explore existing Cloud storage systems
- Implement some components of Cloud storage systems to gain a better understanding of the implementation issues in Cloud storage systems

Preliminary
- Form your group
  - Groups may be adjusted by the instructor and TA 2-3 weeks later
- Determine a meeting time
  - Your group should have its own meeting times each week to coordinate the efforts within the group
  - Your group needs to meet with the TA, and sometimes also the instructor, every week
  - Attendance at the weekly meetings is mandatory
  - The group membership may be adjusted if the meeting time becomes a problem
  - During the meeting, the group discusses its progress and decides its subsequent schedule and goals
  - Each member of the group should report her/his contributions and subsequent goals
- Your group should submit a weekly report discussing what each member has done for the project (just a simple list of items, cumulative over the weeks), the problems encountered, and how they were resolved

Project assignment
- The instructor will assign one of the projects to each group
- In the following, we first discuss the work common to all projects, then the detailed steps for each project

Facility
- Each group will work on a cluster of computers, with 4 machines and 2 monitors on 2 tables
- Project-A/B groups will be in ECSS 3.217 and Project-C groups will be in ECSS 3.213 (may change)

Common Steps for All Projects

Goal
- Install a few well-known Cloud storage systems (file systems and object stores) and compare their features and performance
- Cloud storage systems considered
  - HDFS and HBase: http://hadoop.apache.org/
  - Cassandra: http://cassandra.apache.org/
  - Redis: https://redis.io/
- Implement a MapReduce program on HDFS or Cassandra

Environment setting
- Install Linux (CentOS is preferred) on each machine in the cluster
- Install a VMM on each machine and start some VMs
  - Set up KVM
  - Explore libvirt to achieve VM creation, cloning, and scaling
- Set up the networking in the VM environment and make sure the VMs can communicate with each other
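Once HDFS (listed above) is installed, a quick smoke test from any VM confirms both the networking and the HDFS setup. This is a minimal sketch; the NameNode URI and the test path are placeholders for your cluster's actual fs.defaultFS configuration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Quick HDFS smoke test: write a small file and list the root directory. */
public class HdfsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode URI -- replace with your cluster's fs.defaultFS setting.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-vm:9000"), conf);

    Path testFile = new Path("/tmp/hdfs-check.txt");
    try (FSDataOutputStream out = fs.create(testFile, true)) {
      out.write("hello from the cluster\n".getBytes(StandardCharsets.UTF_8));
    }

    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}
```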

Cloud storage system exploration
- Install the Cloud storage systems given above on the cluster
- Identify a feature vector (the features a user should consider when selecting a file system) for evaluating the Cloud storage systems
  - Lookup time
  - Access latency, access throughput, directory service latency, etc.
  - Load balancing features and performance
  - Consistency model and solutions, availability solutions, etc.
  - Other special features that are unique to a certain file system
- Use YCSB or similar software to explore the features of the file systems and object stores based on the feature vector
  - YCSB does not offer delete commands
  - We may offer an IO request generator to provide a more complete evaluation

MapReduce on HDFS or Cassandra
- Install Hadoop MapReduce on HDFS or Cassandra
  - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
- Input files
  - A sanitized crime database from http://utdallas.edu/~ilyen/course/cloud/forall/mr-data.zip
  - The source of this crime database is http://www.police.uk/data
  - An example input file: http://utdallas.edu/~ilyen/course/cloud/forall/mr-input.txt
- Write a MapReduce program to compute the total crime incidents of each crime type in each region (a sketch follows this section)
- Region definition
  - A crime location is defined on a coordinate system (East, North); East and North are 5-digit numerical values
  - Region definition 1: use only the first digit of each coordinate to define a region, e.g., (5xxxx, 7xxxx), (5xxxx, 3xxxx), and (8xxxx, 6xxxx) are each one region; there are nominally 100 regions, but not all combinations appear in the files
  - Region definition 2: use the first three digits of each coordinate to define a region, e.g., (535xx, 726xx) is one region
  - Consider other region definitions
- Crime types include: Anti-social behavior, Burglary, Criminal damage and arson, Drugs, Other theft, Public disorder and weapons, Robbery, Shoplifting, Vehicle crime, Violent crime, Other crime
- Study Hadoop behaviors from its logs
  - How the file(s) are distributed over the nodes; one large input file vs. many small input files
  - How the mapper and reducer tasks are distributed over the nodes
  - The performance of the map, reduce, and shuffle & sort phases, including execution time, memory usage, etc.
  - How the system handles errors
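As a starting point for the crime-count program, here is a hedged sketch using region definition 1. It assumes the input is one crime record per line with comma-separated fields; the column positions EAST_COL, NORTH_COL, and TYPE_COL are placeholders and must be adjusted to the actual layout of mr-input.txt.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrimeCount {

  public static class CrimeMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // Placeholder column positions -- adjust to the actual mr-input.txt layout.
    private static final int EAST_COL = 4, NORTH_COL = 5, TYPE_COL = 9;

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length <= TYPE_COL) {
        return; // skip header or malformed lines
      }
      String east = fields[EAST_COL].trim();
      String north = fields[NORTH_COL].trim();
      String type = fields[TYPE_COL].trim();
      if (east.isEmpty() || north.isEmpty() || type.isEmpty()) {
        return;
      }
      // Region definition 1: use the first digit of each coordinate.
      String region = "(" + east.charAt(0) + "xxxx," + north.charAt(0) + "xxxx)";
      context.write(new Text(region + "\t" + type), ONE);
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // (region, crime type) -> total incidents
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "crime count by region and type");
    job.setJarByClass(CrimeCount.class);
    job.setMapperClass(CrimeMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Switching to region definition 2 only changes how the region key is built (e.g., using east.substring(0, 3) and north.substring(0, 3)).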

Project A/B Specifics

Goal
- Develop DHTs and the corresponding load balancing solutions in Cloud storage systems
- Develop various DHT schemes and explore their pros and cons
- Consider DHT maintenance and updates as well as load balancing

Project steps

Study
- Study various DHT solutions in Cloud storage systems: Swift, Cassandra, Redis, Ceph (CRUSH)
- Study the corresponding load balancing solutions in DHT-based storage systems

Implement various DHT solutions
- DHT solutions to be implemented
  - DHT1: Cassandra and Swift
  - DHT2: Ceph
  - DHT3: Elastic DHT: similar to Redis, having a central directory, but allowing the hash table to change its size
- Maintain a membership list for the physical storage nodes
  - Use a centralized solution like Swift and Ceph (A)
  - Use a gossip protocol based solution like Cassandra (B)
  - A potential package that can be used for the gossip protocol: https://github.com/apache/incubator-gossip
- Implement a virtual node to physical node mapping (a sketch follows this section)
  - Consider virtual nodes in all cases
  - Maintain a virtual node to physical node mapping table based on the DHT solutions
  - Provide a system configuration file that specifies
    - The number of physical nodes and the weight of each physical node
    - The virtual node unit weight
    - The initial table (OSD map, Ring)
    - The replication level
- Implement DHT access request handling
  - See the list of access requests in the simulated control client

Implement the simulated storage servers
- Each storage server also maintains the membership information and the virtual node mapping
- Storage servers use an Epoch number to sync the membership information
- Simulated storage servers handle the access requests from the simulated clients
  - See the list of requests issued by the simulated client
  - For read/write requests, check whether the files/objects accessed are local; if so, simply return
    - This check is done only by hashing the file/object names; there is no need to keep a list of local files
  - If the files/objects to be accessed are not local, or are locked, an error message should be returned to the client
    - For locking, see the file transfer protocol
- Implement the simulated clients
  - Each client issues a sequence of requests (prepared in advance in a file)
    - The sequence should be a reasonable access sequence
    - There should be a random waiting time between subsequent requests (Poisson distribution)
  - Client requests
    - Requests: read/write
    - Generate the file/object name for each request and determine which storage node should be accessed
    - The client sends the read/write request to the storage server after a lookup, or after finding the server id in its cache
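For the DHT1-style virtual node to physical node mapping, a minimal sketch along the following lines can be used. MD5 hashing onto a ring and a fixed number of virtual nodes per physical node are design assumptions here, not requirements of the project.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

/** DHT1-style ring: virtual nodes hashed onto a ring, each mapped to a physical node. */
public class HashRing {
  private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // token -> physical node id
  private final int vnodesPerNode;

  public HashRing(int vnodesPerNode) {
    this.vnodesPerNode = vnodesPerNode;
  }

  public synchronized void addPhysicalNode(String nodeId) {
    for (int v = 0; v < vnodesPerNode; v++) {
      ring.put(hash(nodeId + "#vn" + v), nodeId); // one ring token per virtual node
    }
  }

  public synchronized void removePhysicalNode(String nodeId) {
    for (int v = 0; v < vnodesPerNode; v++) {
      ring.remove(hash(nodeId + "#vn" + v));
    }
  }

  /** Returns the physical node responsible for the given file/object name. */
  public synchronized String lookup(String objectName) {
    BigInteger token = hash(objectName);
    SortedMap<BigInteger, String> tail = ring.tailMap(token);
    return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
  }

  private static BigInteger hash(String key) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

Physical node weights can be honored by giving a heavier node proportionally more virtual nodes, and a replication level of r can be served by walking clockwise from the first owner to the next r-1 distinct physical nodes.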

DHT caching at the client side
- Cache the necessary information locally (a sketch follows this section)
- Maintain an Epoch number to support sync
  - For the centralized membership table (A), only sync with the central server
  - For the distributed membership table (B), sync with any storage server
- Report: provide printouts to show the correctness of the implementation

Control client requests
- Table initialization
- Physical node insertion and removal
- Virtual node insertion and removal
- Load balance request (see the load balancing section)
- Report: access the data structures and print them to show the correctness of the implementation

Implement a simulated file transfer solution to transfer a set of files (a sketch follows this section)
- The unit of transfer is a hash bucket
- Do not transfer real files; just simulate the transfer and implement the corresponding DHT update mechanisms (if relevant)
- Lock the hash bucket being transferred and disable accesses to it
  - Otherwise, there may be updates to the files/objects, which may cause inconsistency
  - This part can be omitted if time does not permit
- Simulate a transfer time that depends on the number of bytes to be transferred
- After the transfer, enable accesses to the bucket at the new node and update the DHT

Implement load balancing solutions
- Load balancing types
  - LB1: Elastic DHT
    - Centralized decision making; simply decide to move a hash bucket from storage node x to storage node y
    - May need to decide whether to split or merge hash buckets
  - LB2: Swift, Ceph
    - Centralized decision making
    - Swift: decide whether to create new virtual nodes or remove virtual nodes
    - Ceph: decide whether to change the weights of the storage nodes
  - LB3: Cassandra
    - Assume that the membership table also maintains load information
    - A heavily loaded node changes its own hash value and updates the membership table
- Load balance request for LB1 and LB2
  - The control client simply sends a load balancing request to the central server
  - The central server decides the virtual node changes and informs the involved storage servers
  - The source storage server simulates the file transfer protocol and informs the central server after the file transfer
  - The central server updates the membership table and syncs with the other nodes
- Load balance request for LB3
  - The control client sends a load balancing request to a randomly selected server
  - The selected server moves one of its virtual nodes in the DHT by 1 position and starts the simulated file transfer protocol
  - The selected server updates its membership table after the file transfer and syncs with the other servers via the gossip protocol
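The simulated file transfer described above can be as simple as the following sketch: lock the bucket, sleep for a time proportional to the simulated byte count, then update the mapping and unlock. The BYTES_PER_MS constant and the bucketToNode map are placeholders for whatever data structures your DHT actually uses.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Simulated transfer of one hash bucket from a source node to a target node. */
public class BucketTransfer {
  private final Set<Integer> lockedBuckets = ConcurrentHashMap.newKeySet();
  private final Map<Integer, String> bucketToNode;       // bucket id -> physical node id
  private static final double BYTES_PER_MS = 100_000.0;  // placeholder "bandwidth"

  public BucketTransfer(Map<Integer, String> bucketToNode) {
    this.bucketToNode = bucketToNode;
  }

  /** Returns true if the bucket is currently being moved; reads/writes should be rejected. */
  public boolean isLocked(int bucketId) {
    return lockedBuckets.contains(bucketId);
  }

  public void transfer(int bucketId, String targetNode, long simulatedBytes)
      throws InterruptedException {
    if (!lockedBuckets.add(bucketId)) {
      throw new IllegalStateException("bucket " + bucketId + " is already being transferred");
    }
    try {
      // Simulate the transfer time based on the number of bytes to move.
      Thread.sleep((long) (simulatedBytes / BYTES_PER_MS));
      bucketToNode.put(bucketId, targetNode);   // update the DHT mapping
    } finally {
      lockedBuckets.remove(bucketId);           // re-enable accesses at the new node
    }
  }
}
```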

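For the client-side DHT caching with an Epoch number described above, one hedged way to structure it is to stamp the cached mapping with the epoch at which it was fetched and refresh it whenever a server reports a newer epoch. The MembershipSnapshot type and the fetcher callback below are assumptions, standing in for whatever lookup message your protocol defines.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Client-side cache of the virtual-node-to-server mapping, tagged with an Epoch number. */
public class ClientDhtCache {
  private final Supplier<MembershipSnapshot> fetcher; // pulls from the central server (A) or any storage server (B)
  private volatile MembershipSnapshot snapshot;

  public static class MembershipSnapshot {
    public final long epoch;
    public final Map<String, String> virtualToPhysical; // virtual node id -> server id
    public MembershipSnapshot(long epoch, Map<String, String> map) {
      this.epoch = epoch;
      this.virtualToPhysical = new ConcurrentHashMap<>(map);
    }
  }

  public ClientDhtCache(Supplier<MembershipSnapshot> fetcher) {
    this.fetcher = fetcher;
    this.snapshot = fetcher.get();
  }

  /** Resolve a virtual node to a server using the cached mapping. */
  public String resolve(String virtualNodeId) {
    return snapshot.virtualToPhysical.get(virtualNodeId);
  }

  /** Called when a reply carries the server's epoch; refresh the cache if it is stale. */
  public void observeServerEpoch(long serverEpoch) {
    if (serverEpoch > snapshot.epoch) {
      snapshot = fetcher.get(); // sync with the central server (A) or any storage server (B)
    }
  }
}
```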
Compare the performance of the various DHT solutions
- Start a large number of clients to access the Cloud storage system
  - Multiple clients can be simulated by multiple processes, as well as multiple threads in each process
- Obtain statistical performance data
- Consider different control parameters
- Plot the performance data

Project C Specifics

Goal
- Develop directory structure maintenance solutions in Cloud file systems
- Develop various directory structure maintenance solutions and explore their pros and cons
- Consider directory maintenance, reads, and updates

Introduction
- In a Unix system, each directory is a file by itself
- In many cloud file systems, files are treated as individual elements during placement among the distributed servers, and directory structure maintenance is handled separately
  - Dir1: use a centralized server to store the entire directory
  - Dir2: the Ceph solution (split the overall directory dynamically)
  - Dir3: use the Unix solution (treat directory files as regular files), but merge a subtree of directories into one file
    - Let #levels be a configurable parameter
    - When #levels = 1, it is the Unix file system solution

Implement simulated clients
- Each client may issue a sequence of directory access requests
  - Consider directory creation, deletion, rename, and ls operations
  - The sequence should be reasonable
- For Dir1, simply send the directory requests to the central server
- For Dir2, send the directory requests to the correct server
  - Each client knows the root server
  - If the correct server is unknown, start from the root server and traverse until the correct server has been reached
  - Cache the intermediate partitioning information
    - Caching packages such as Java's EhCache, Guava Caching Utilities, or cache2k can be used
    - The cache size should be limited
    - When looking up information in the cache, try prefixes of the path
- For Dir3, hash the path to determine which server has the directory file (see the routing sketch after this section)
  - Hash the directory path (or a prefix of it, depending on the #levels to be merged) to obtain the server ID

Control client
- Start the directory server(s)
- For Dir2, consider partition change requests
- Query the directory server data structures to show that the system is implemented correctly

Implement the directory server(s)
- Handle the directory requests
- Each server creates a thread pool to allow parallel request handling
- Lock the directory node to ensure proper accesses
  - Since there are parallel threads on each server, locking is needed so that there are no conflicting directory accesses
  - E.g., x deletes directory foo while y performs ls on foo
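For the parallel request handling and per-directory locking just described, a minimal sketch could look like the following. The thread pool size and the per-path read/write lock granularity are design assumptions, not requirements.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Directory server skeleton: a thread pool plus one read/write lock per directory node. */
public class DirectoryServer {
  private final ExecutorService pool = Executors.newFixedThreadPool(16); // pool size is an assumption
  private final ConcurrentHashMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReadWriteLock lockFor(String path) {
    return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
  }

  /** ls: a read operation -- many can proceed concurrently on the same directory. */
  public void submitLs(String path, Runnable listAction) {
    pool.submit(() -> {
      ReadWriteLock lock = lockFor(path);
      lock.readLock().lock();
      try {
        listAction.run();
      } finally {
        lock.readLock().unlock();
      }
    });
  }

  /** create/delete/rename: write operations -- exclusive on the directory node. */
  public void submitUpdate(String path, Runnable updateAction) {
    pool.submit(() -> {
      ReadWriteLock lock = lockFor(path);
      lock.writeLock().lock();
      try {
        updateAction.run(); // e.g., x deleting "foo" cannot overlap y's ls on "foo"
      } finally {
        lock.writeLock().unlock();
      }
    });
  }
}
```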

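Returning to the Dir3 routing described above, the client can derive the server from a hash of the merged-subtree prefix of the path. In the sketch below, the rule that the first #levels path components identify the merged directory file, and the simple modulo placement over the server list, are both placeholders; the exact subtree-boundary rule is up to your design.

```java
import java.util.List;

/** Dir3 client-side routing: hash a prefix of the path (the merged subtree) to pick a server. */
public class Dir3Router {
  private final int levels;            // #levels of directories merged into one directory file
  private final List<String> servers;  // directory/storage server ids

  public Dir3Router(int levels, List<String> servers) {
    this.levels = levels;
    this.servers = servers;
  }

  /** Returns the path prefix that identifies the merged directory file (placeholder rule). */
  String mergedPrefix(String path) {
    String[] parts = path.split("/");   // leading "/" yields an empty first element
    StringBuilder prefix = new StringBuilder();
    int kept = 0;
    for (String part : parts) {
      if (part.isEmpty()) continue;
      prefix.append('/').append(part);
      if (++kept == levels) break;      // keep at most #levels components
    }
    return kept == 0 ? "/" : prefix.toString();
  }

  /** Hash the prefix to obtain the server that holds the directory file. */
  public String serverFor(String path) {
    int h = mergedPrefix(path).hashCode();
    return servers.get(Math.floorMod(h, servers.size()));
  }
}
```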
Dir2
- Consider 3 directory servers
- Consider dynamic partition changes
  - Upon handling a partition change request, server x should send the directory information to server y

Dir3
- Implement the storage servers
- Implement real directory files and hash them to the storage servers
- Implement read/update of the directory files

Evaluation
- Generate a huge directory structure
  - May be provided when ready
- Start a large number of clients
  - Use several processes, each including a large number of threads; each thread is a client
  - A simulated client needs to issue directory requests to the simulated directory server(s)
  - Generate a sequence of requests in advance and store them in a file
    - It should be a reasonable access sequence
  - During the simulation, the client reads one request at a time from the file and submits it to the simulated server
  - There should be a random waiting time between subsequent requests (Poisson distribution)
- Evaluate the performance of the different directory management solutions

Project Schedule
- Week 1: Team formation; individual students explore Linux and KVM
- Week 2: Computer and lab assignment, environment setup by each team, study of the project description, setup of the meeting schedule
- Week 3: Kickoff meeting and progress check
- Week of 10/19: Midterm demo and project report due
  - Finish the cloud storage system installations and exploration
  - Implement a draft of the simplified project
- Week of 11/30: Final project preliminary inspection
  - Finish the final project and obtain comments for improvements
  - Finish the preliminary final report
  - Finish the MapReduce project
- 12/8-15: Final project demo and project report due
  - The date is determined during the preliminary inspection

Project Submissions
- Mid-term submissions
  - Your working code
    - Though your code does not cover the entire project, it should implement some functions required in the project and should be fully working
  - Midterm project report (contents are the same as the final report)

- Final submissions
  - Your working code
  - Final project report
    - The midterm report should be progressively revised into the final report
- Teams that fall behind will have meetings every week to ensure that progress is caught up

Submission guideline
- Each team only needs one submission; the team leader should make the submission through e-learning
- <team-label>: A, B1, B2, C1, C2
- During submission, attach the doc file; the file name has to be <team-label>.doc or <team-label>.docx (e.g., B1.doc)
  - Do not submit a PDF
- Zip or tar your code; the file name has to be <team-label>.zip
  - Attach the zip file during the e-learning submission

Required information in the report (you can include more than what is listed)
- On the cover page, provide the team-label, title, and team members
- Introduction
  - Goal of the project
  - What you offer in your system and why what you offer is important
- Study of related work
  - Summary of related works (in papers or similar available products)
  - How your project differs from or is similar to some existing works
- Approach
  - System architecture
    - Activity diagram (workflow)
    - Architecture diagrams (from high-level to low-level decomposition of the system)
    - Description of the nodes in the architecture
  - Detailed design
    - Including the APIs of each important component, the algorithms used in some components, etc.
  - Implementation details
    - List the code files and VMs that are in your system
    - Provide a structure of the code files and describe each code file and each VM in the list
    - The relations of the code files and what they do should already appear in the detailed design section; you can relate the code files to those in the design section
  - Problems encountered and how they were resolved
    - This is important; it is the basis for us to judge the effort you have actually put into the project
- Experimental results
  - Clearly state the control parameters and the metrics for measurement
  - Experimentation needs to be thorough and results need to be easy to read (e.g., use graphs and tables)
- Installation manual, including the existing open-source software used (no need to give installation guides for open-source software, just their URLs) and how to set up your programs
- User manual, discussing how to use your system
- Workload distribution
  - Include a high-level summary and a detailed list
  - The detailed list includes the tasks done by each member of the team week by week (just attach the log you prepared every week)