Finding a needle in Haystack: Facebook's photo storage


Finding a needle in Haystack: Facebook's photo storage

The paper was written at Facebook and describes an object storage system called Haystack. Because Facebook handles an enormous volume of photos (20 petabytes in total, with 60 terabytes added per week), it needs an efficient, high-performance system. The main goal of the system is to provide an alternative to a conventional file system that performs better under Facebook's workload. The results are presented as an evaluation of the system. Its effectiveness is reported on its own rather than compared against alternatives, which gives less insight to readers who are not already familiar with the context. The evaluation shows the efficiency of Haystack's directory and cache, as well as the performance of the storage under synthetic and production workloads. The motivation of the paper is to present the system Facebook uses for photo storage. The tool is not open source, so the paper does not offer the tool per se; instead it gives insight into how Facebook handles a huge number of photos efficiently. Open tools based on the work in this paper are available.

The paper focuses on Facebook's photo-serving stack and the effectiveness of its caching. The research tracked over 77 million requests for more than 1 million unique photos. The elements studied include traffic patterns, cache access patterns, the geolocation of clients and servers, and the correlation between properties of the content and its traffic. The data was collected over a month-long period. One of the most important results of the paper is that it identifies areas for future investigation. It points to caching options such as geographic collaborative caching and the possible adoption of the S4LRU eviction algorithm at the Edge and Origin layers. It also suggests increasing browser cache sizes for active clients to improve client performance, and enabling local photo resizing for less active clients.
The paper also points to future work, such as the placement of resizing functions in the stack and the design of better caching algorithms. The motivation is to present how Facebook handles caching under the huge workload of its service. Because Facebook is such a large service, an analysis from this company can be of great use to others, given its large number of users and the amount of data in use at any time.

Review of: An Analysis of Facebook Photo Caching

Photo caching at Facebook happens at several layers: the browser cache, the edge cache and the origin cache. Images can be served from two stacks, Akamai and Facebook; the paper focuses on the latter. The Facebook photo-serving infrastructure was instrumented and data was gathered over one month. By gathering data from the web browser and from the edge and origin caches, the researchers were able to follow requests through the entire stack. The analysis shows that 65.5% of all traffic was served from the browser cache; because the browser cache is the layer closest to the client and sees every request, its hit rate was also 65.5%. Edge caches served 20% of the traffic with a 58% hit rate. Origin caches served 4.6% of the traffic with a 31.8% hit rate. The backend servers handled the remaining 9.9% of the traffic. Through simulation, the paper identified S4LRU as a better alternative to the FIFO eviction algorithm used on the edge and origin caches, with, for instance, an 8.5% improvement on the edge cache and 13.9% on the origin cache.
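To make the S4LRU idea concrete, here is a minimal sketch of a four-level segmented LRU. It assumes the usual description of the policy: a miss inserts at level 0, a hit promotes the item one level (capped at level 3), and an overflowing level demotes its least recently used item to the level below, with level 0 evicting from the cache. The class name `S4LRU`, the method `access`, and the item-count capacity are illustrative choices, not taken from the reviews; a production cache would more likely size the levels in bytes.

```python
from collections import OrderedDict


class S4LRU:
    """Sketch of a segmented LRU with four equal levels (capacity in items)."""

    LEVELS = 4

    def __init__(self, capacity_per_level):
        self.capacity_per_level = capacity_per_level
        # One OrderedDict per level; the last entry is the most recently used.
        self.levels = [OrderedDict() for _ in range(self.LEVELS)]

    def _find_level(self, key):
        for index, level in enumerate(self.levels):
            if key in level:
                return index
        return None

    def _insert(self, level_index, key):
        self.levels[level_index][key] = True
        # Cascade overflow downward: the least recently used item of a full
        # level moves to the head of the level below; level 0 evicts for good.
        i = level_index
        while i >= 0 and len(self.levels[i]) > self.capacity_per_level:
            victim, _ = self.levels[i].popitem(last=False)
            if i > 0:
                self.levels[i - 1][victim] = True
            i -= 1

    def access(self, key):
        """Return True on a hit and False on a miss, updating the levels."""
        found = self._find_level(key)
        if found is None:
            self._insert(0, key)
            return False
        del self.levels[found][key]
        self._insert(min(found + 1, self.LEVELS - 1), key)
        return True

    def __contains__(self, key):
        return self._find_level(key) is not None
```

With one slot per level, an item that has been requested twice sits at level 1 and survives a stream of one-off requests that only churn level 0, which is the property that distinguishes this policy from FIFO or plain LRU.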

Further improvements to the photo cache involve research on better cache eviction algorithms and on where to place the photo resizing functionality.

Review of: Finding a needle in Haystack: Facebook's photo storage

With over 60 terabytes of photos uploaded each week and a million photos served per second at peak times, Facebook saw the need for a better approach. The solution was Haystack, a storage system designed to perform metadata lookups in memory, so that disk operations are used only to read the actual data. The paper defines some key requirements for Haystack: high throughput and low latency, fault tolerance, cost-effectiveness and simplicity. The Haystack system consists of three components: the Directory, the Cache and the Store. The Directory is responsible for mapping volumes, the Cache caches photos that were recently written and are not requested from a CDN, and the Store is responsible for reading, writing and deleting files in Haystack. Files written to Haystack are appended to logical volumes and located by offset, which further reduces the per-file metadata held in memory and speeds up reads. The paper claims that Haystack, compared to the earlier NFS-based system, reduced the cost of each usable terabyte by roughly 28% and was able to process roughly four times as many reads per second.

This paper presents the design and implementation of Haystack, an object storage system for storing Facebook's photos. Facebook currently stores over 260 billion images, which makes it the biggest photo-sharing website in the world. Haystack was designed to serve the long tail of requests seen when sharing photos in a large social network, at lower cost and higher performance than the previous approach. The authors believe Haystack provides a fault-tolerant and simple solution to photo storage that is incrementally scalable, a necessary quality now that users upload hundreds of millions of photos each week.
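The append-with-offset design described in these reviews can be sketched as follows. This is a simplified illustration, not Haystack's actual on-disk format: the class `VolumeSketch`, its method names, and the use of an in-memory `io.BytesIO` in place of a large volume file are all assumptions made for the example. The point is only that the index lives in memory, so a read costs one seek and one read of the data itself.

```python
import io


class VolumeSketch:
    """Append-only volume with an in-memory index (hypothetical layout)."""

    def __init__(self, backing=None):
        # BytesIO stands in for a large file on disk in this sketch.
        self.backing = backing if backing is not None else io.BytesIO()
        self.index = {}       # photo key -> (offset, size), kept in memory
        self.deleted = set()  # keys flagged as deleted; bytes stay in place

    def write(self, key, data):
        """Append the photo and remember where it landed."""
        offset = self.backing.seek(0, io.SEEK_END)
        self.backing.write(data)
        self.index[key] = (offset, len(data))
        self.deleted.discard(key)
        return offset

    def read(self, key):
        """Look up the offset in memory, then do a single seek and read."""
        if key in self.deleted or key not in self.index:
            return None
        offset, size = self.index[key]
        self.backing.seek(offset)
        return self.backing.read(size)

    def delete(self, key):
        """Flag the photo as deleted without rewriting the volume."""
        if key in self.index:
            self.deleted.add(key)
```

For example, writing `b"cat"` and then `b"sunset"` places the second photo at offset 3, and reading it back needs no metadata access beyond the dictionary lookup.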
This paper explores the dynamics of the full Facebook photo-serving stack, from the client browser to Facebook's Haystack storage server, looking at the effectiveness of the many layers of caching it employs. The authors present an overview of the Facebook photo-serving stack, highlighting their instrumentation points. They instrumented Facebook's photo-serving infrastructure, gathered a month-long trace, and then analyzed that trace using batch processing. The analysis examines more than 70 TB of data, all corresponding to client-initiated requests. They also explore traffic between clients and Edge Caches, how traffic is routed between the Edge Caches and the Origin Cache, and how Backend requests are routed. Using simulation, they evaluate the effect of different cache sizes, algorithms and strategies. Two properties, (1) the age of a photo and (2) the number of Facebook followers associated with its owner, are expected to be strongly associated with photo traffic. This is the first paper to systematically instrument and analyze a real-world workload at the scale of Facebook, and to successfully trace such a high volume of events throughout a massively distributed stack.

This paper describes Haystack, an object storage system designed for Facebook's Photos application. Haystack was designed to serve the long tail of requests seen when sharing photos in a large social network. The key insight is to avoid disk operations when accessing metadata. Haystack provides a fault-tolerant and simple solution to photo storage at dramatically lower cost and higher throughput than a traditional approach using NAS appliances. Furthermore, Haystack is incrementally scalable, a necessary quality as users upload hundreds of millions of photos each week.

In this paper the authors instrumented the entire Facebook photo-serving stack, obtaining traces representative of Facebook's full workload. There are some valuable findings, including the workload pattern, traffic distribution and geographic system dynamics, yielding insights that should be helpful to future system designers. They also identified an opportunity to improve client performance by increasing browser cache sizes for very active clients and by enabling local photo resizing for less active clients.

Finding a needle in Haystack: Facebook's photo storage

This paper describes Facebook's photo storage system, called Haystack. First it describes the goals and motivations behind the creation of such a system. One of the main goals was to reduce latency by reducing disk operations, and one of the challenges was dealing with the long tail (older photos). Haystack has three main components: the Store, the Directory and the Cache. The Store keeps all the actual data in the form of huge files. The Directory keeps metadata about the images, and the Cache functions as an internal cache that shelters the Store from the most frequent requests. Next the authors describe individual details, such as insertions, deletions, the mapping of image locations and recovery from failures. Finally, they evaluate the whole system and show some trends in how users request photos (the most frequently requested image size is small, and images are most active, meaning they are modified or deleted, a short time after their insertion). They close with a list of related work and compare it to Haystack.

Analysis of Facebook Photo Caching

The second paper examines the workload of the Haystack system described above. It focuses on the different layers of caching Facebook uses (browser cache, Edge cache, Origin cache, Haystack) and how they are utilized.
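As a worked check on how these layers are utilized, the sketch below takes the traffic shares quoted in the reviews (65.5% browser, 20% edge, 4.6% origin, 9.9% backend) and derives each layer's hit rate as the fraction of requests reaching that layer which it serves. The function name `conditional_hit_rates` and the dictionary layout are illustrative assumptions. The results land close to the quoted 58% and 31.8%; the small gap at the origin layer is consistent with the shares having been rounded.

```python
# Share of all requests served by each layer, as quoted in the reviews.
TRAFFIC_SHARE = {"browser": 65.5, "edge": 20.0, "origin": 4.6, "backend": 9.9}


def conditional_hit_rates(shares, cache_layers=("browser", "edge", "origin")):
    """Hit rate per cache layer: requests served / requests that reached it."""
    remaining = sum(shares.values())  # traffic arriving at the first layer
    rates = {}
    for layer in cache_layers:
        rates[layer] = shares[layer] / remaining
        remaining -= shares[layer]    # misses fall through to the next layer
    return rates


if __name__ == "__main__":
    for layer, rate in conditional_hit_rates(TRAFFIC_SHARE).items():
        print(f"{layer}: {rate:.1%}")
```

The same shares also show the sheltering effect directly: the three cache layers together absorb about 90% of requests, leaving roughly a tenth for the backend.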
The main goal of caching is traffic sheltering. The authors also briefly describe the process of gathering data. Next, they show how the popularity of a photo affects the hit ratio of individual layers: less popular photos have a higher local cache hit rate, and more popular ones have a higher shared cache hit rate. They also describe how the traffic is distributed geographically. At the end, the authors present a few improvements for the system (better geographical caching, local photo resizing).

Photo sharing is an essential function of a social website. Many well-known social websites, such as Facebook, LinkedIn and Instagram, provide this service. According to data from Facebook, users have uploaded more than 65 billion photos. Because of this huge volume, traditional photo storage systems based on a file system cannot support the service well. This paper offers a new perspective based on traces from Facebook. It first introduces Haystack, the photo storage system used at Facebook, and then gives a performance analysis. Haystack has some interesting features. To reduce disk operations, it stores multiple photos in a single file, which reduces the memory needed for file system metadata. The Haystack directory and cache are also specially designed to make full use of the cache. A structure called a needle is used to map logical space in memory to the physical volume. It supports photo reads, writes and deletes. It even helps with recovery after a failure or reboot. Before the performance analysis of Haystack, the paper introduces the characteristics of photo requests. New photos generate more requests than older photos, and requests for small images account for 84.4%. The experimental results from evaluating Haystack show that the system performs well at storing photos. For example, a read-only workload delivers 85% of the raw throughput of the device with 17% higher latency. The results also show that Haystack suits Facebook because write operations are always multi-writes. This paper provides a detailed description of Haystack and characterizes the data in ways that can guide other researchers in further work. The paper is well structured and clearly states its challenges and motivation, which is helpful for readers.

An analysis of Facebook photo caching

The most common way to speed up requests is to add a cache. However, no single caching method fits all applications. Facebook, the biggest photo-sharing website, has to deal with millions of photos every day and needs its own caching system. This paper tries to demystify the photo caching system at Facebook. It introduces Facebook's entire Internet image-serving infrastructure and considers many aspects, including the relationship between browser caches, edge caches, the origin cache and backend servers, the popularity distribution, the geographic traffic distribution and possible improvements. Based on this analysis, the paper reports many interesting findings. First, browser caches, edge caches and the origin cache handle an aggregated 90% of requests. Second, the popularity distribution is Zipfian, although Haystack has a comparatively smaller Zipf coefficient. Third, content is often served across a large distance rather than locally. For example, the traffic from Miami was distributed among several edge caches, with 50% handled in San Jose, Palo Alto and LA and 24% in Miami. Fourth, geographic-scale collaborative caching at the edge servers and advanced eviction algorithms are two possible ways to improve cache hit rates. The former yields an improvement of 17% and the latter 21.9%.
Fifth, content popularity drops rapidly with age, following a Pareto distribution, and is conditionally dependent on the owner's social connectivity. Two problems are left as future research areas in this paper. One is the placement of resizing functionality along the stack. The other is designing better caching algorithms. It is good to see that the paper gives a figure describing the structure of the whole caching system and describes the methods used to collect and sample data. However, some data could be explained further. For example, the introduction contains the sentence "the most popular 0.03% of content, cache hit rates neared 100%". This sentence should be explained or restated in the corresponding section with a proper figure.

This paper studies the workload and effectiveness of Facebook's multi-layer photo caching stack. The hierarchical storage system consists of browser caches, edge caches, the origin cache and backend storage. Facebook also uses Akamai for additional caching. The whole storage system is spread across several data centers. The authors captured over 77M photo requests from 13.2M user browsers for more than 1.3M unique photos. Of the 77M requests, 65.5% were satisfied by browser caches, 20% by edge caches, 4.6% by the origin cache and 9.9% by the backend storage. Photo popularity distributions are approximately Zipfian. The plots show that more than 89% of requests for the hundred thousand most popular images can be served by browser and edge caches. Although every edge cache receives a majority of its requests from nearby cities, the largest share does not necessarily go to the nearest neighbor. This is because Facebook's routing policy is based on a combination of latency and peering cost. If a photo is not found in an edge cache, consistent hashing is used to locate an origin cache. Most of the time, an origin cache will retrieve the photo from backend storage within the same data center, but sometimes it retrieves photos from other data centers because of misdirected resizing traffic or failed local fetches. The authors also examined the potential performance improvement from other caching algorithms and increased cache sizes. The results show that the Clairvoyant and S4LRU algorithms are the most effective. They can improve the hit ratio considerably or reduce the cache size required while maintaining the same hit ratio. The analysis also shows that new content draws attention and accounts for the majority of traffic.

This paper presents the design, implementation and evaluation of Facebook's Haystack, a new backend storage system. Facebook is the biggest photo-sharing website in the world. It hosts 65 billion photos, which translates to over 20 PB of data. Haystack is tailored to Facebook photos, which are written once, read often, never modified and rarely deleted. The previous NFS-based storage system involved 3 to 10 disk operations to fetch a photo, which caused significant delay. Haystack aims to reduce this to a single disk operation. Facebook uses a CDN to serve popular images and Haystack for unpopular ones. Haystack consists of three components: the Haystack directory, the Haystack cache and the Haystack store. The Haystack directory provides the mapping from logical to physical volumes, load-balances writes among logical volumes, determines whether a request should be handled by the CDN or by the cache, and identifies read-only volumes. The Haystack cache is organized as a DHT and caches images that cannot be found in the CDN. The Haystack store maintains physical volumes. Each volume contains a superblock and several needles (images). An index file is created in memory that stores similar contents to a physical volume, except for the image content, to speed up image retrieval. The file system used is XFS.
The evaluation shows that the directory's hashing policy distributes reads and writes very well. The cache achieves a hit rate of around 80%. Reads are more frequent than writes and deletes. The latency of multi-writes is fairly low and stable. Read latency on read-only machines is stable and lower than on write-enabled machines.
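The directory responsibilities listed in this review (logical-to-physical mapping, balancing writes across logical volumes, and tracking read-only volumes) can be illustrated with a small sketch. Everything named here is hypothetical: the class `DirectorySketch`, the machine names, and in particular the uniform random choice among write-enabled volumes, which merely stands in for whatever balancing policy the real directory uses.

```python
import random


class DirectorySketch:
    """Toy directory: logical volume -> physical replicas, plus write state."""

    def __init__(self, seed=None):
        self.physical = {}     # logical volume id -> list of store machines
        self.writable = set()  # logical volumes still accepting writes
        self.rng = random.Random(seed)

    def add_volume(self, logical_id, machines):
        """Register a logical volume and the machines holding its replicas."""
        self.physical[logical_id] = list(machines)
        self.writable.add(logical_id)

    def mark_read_only(self, logical_id):
        """Stop routing new writes to a volume, for example once it is full."""
        self.writable.discard(logical_id)

    def pick_for_write(self):
        """Choose a write-enabled logical volume (assumed: uniform random)."""
        if not self.writable:
            raise RuntimeError("no write-enabled volumes available")
        return self.rng.choice(sorted(self.writable))

    def locate(self, logical_id):
        """Return the physical replicas that can serve reads for a volume."""
        return self.physical[logical_id]
```

Once a volume is marked read-only it still answers `locate` for reads, but uploads are steered to the remaining write-enabled volumes, which matches the review's observation that read-only machines see steadier read latency.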

Finding a Needle in a Haystack. Facebook s Photo Storage Jack Hartner

Finding a Needle in a Haystack. Facebook s Photo Storage Jack Hartner Finding a Needle in a Haystack Facebook s Photo Storage Jack Hartner Paper Outline Introduction Background & Previous Design Design & Implementation Evaluation Related Work Conclusion Facebook Photo Storage

More information

CSE 124: Networked Services Lecture-17

CSE 124: Networked Services Lecture-17 Fall 2010 CSE 124: Networked Services Lecture-17 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/30/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Software Infrastructure in Data Centers: Distributed File Systems 1 Permanently stores data Filesystems

More information

Today s Papers. Array Reliability. RAID Basics (Two optional papers) EECS 262a Advanced Topics in Computer Systems Lecture 3

Today s Papers. Array Reliability. RAID Basics (Two optional papers) EECS 262a Advanced Topics in Computer Systems Lecture 3 EECS 262a Advanced Topics in Computer Systems Lecture 3 Filesystems (Con t) September 10 th, 2012 John Kubiatowicz and Anthony D. Joseph Electrical Engineering and Computer Sciences University of California,

More information

Ambry: LinkedIn s Scalable Geo- Distributed Object Store

Ambry: LinkedIn s Scalable Geo- Distributed Object Store Ambry: LinkedIn s Scalable Geo- Distributed Object Store Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware

More information

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Efficiency at Scale. Sanjeev Kumar Director of Engineering, Facebook

Efficiency at Scale. Sanjeev Kumar Director of Engineering, Facebook Efficiency at Scale Sanjeev Kumar Director of Engineering, Facebook International Workshop on Rack-scale Computing, April 2014 Agenda 1 Overview 2 Datacenter Architecture 3 Case Study: Optimizing BLOB

More information

Overview Computer Networking Lecture 16: Delivering Content: Peer to Peer and CDNs Peter Steenkiste

Overview Computer Networking Lecture 16: Delivering Content: Peer to Peer and CDNs Peter Steenkiste Overview 5-44 5-44 Computer Networking 5-64 Lecture 6: Delivering Content: Peer to Peer and CDNs Peter Steenkiste Web Consistent hashing Peer-to-peer Motivation Architectures Discussion CDN Video Fall

More information

Decentralized Distributed Storage System for Big Data

Decentralized Distributed Storage System for Big Data Decentralized Distributed Storage System for Big Presenter: Wei Xie -Intensive Scalable Computing Laboratory(DISCL) Computer Science Department Texas Tech University Outline Trends in Big and Cloud Storage

More information

CSE 124: Networked Services Fall 2009 Lecture-19

CSE 124: Networked Services Fall 2009 Lecture-19 CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems GFS (The Google File System) 1 Filesystems

More information

Goals. Facebook s Scaling Problem. Scaling Strategy. Facebook Three Layer Architecture. Workload. Memcache as a Service.

Goals. Facebook s Scaling Problem. Scaling Strategy. Facebook Three Layer Architecture. Workload. Memcache as a Service. Goals Memcache as a Service Tom Anderson Rapid application development - Speed of adding new features is paramount Scale Billions of users Every user on FB all the time Performance Low latency for every

More information

15-440/15-640: Homework 3 Due: November 8, :59pm

15-440/15-640: Homework 3 Due: November 8, :59pm Name: 15-440/15-640: Homework 3 Due: November 8, 2018 11:59pm Andrew ID: 1 GFS FTW (25 points) Part A (10 points) The Google File System (GFS) is an extremely popular filesystem used by Google for a lot

More information

Volley: Automated Data Placement for Geo-Distributed Cloud Services

Volley: Automated Data Placement for Geo-Distributed Cloud Services Volley: Automated Data Placement for Geo-Distributed Cloud Services Authors: Sharad Agarwal, John Dunagen, Navendu Jain, Stefan Saroiu, Alec Wolman, Harbinder Bogan 7th USENIX Symposium on Networked Systems

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

Load Dynamix Enterprise 5.2

Load Dynamix Enterprise 5.2 DATASHEET Load Dynamix Enterprise 5.2 Storage performance analytics for comprehensive workload insight Load DynamiX Enterprise software is the industry s only automated workload acquisition, workload analysis,

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

From Internet Data Centers to Data Centers in the Cloud

From Internet Data Centers to Data Centers in the Cloud From Internet Data Centers to Data Centers in the Cloud This case study is a short extract from a keynote address given to the Doctoral Symposium at Middleware 2009 by Lucy Cherkasova of HP Research Labs

More information

ACCELERATE YOUR ANALYTICS GAME WITH ORACLE SOLUTIONS ON PURE STORAGE

ACCELERATE YOUR ANALYTICS GAME WITH ORACLE SOLUTIONS ON PURE STORAGE ACCELERATE YOUR ANALYTICS GAME WITH ORACLE SOLUTIONS ON PURE STORAGE An innovative storage solution from Pure Storage can help you get the most business value from all of your data THE SINGLE MOST IMPORTANT

More information

CLOUD-SCALE INFORMATION RETRIEVAL

CLOUD-SCALE INFORMATION RETRIEVAL 1 CLOUD-SCALE INFORMATION RETRIEVAL Ken Birman, CS5412 Cloud Computing Styles of cloud computing 2 Think about Facebook We normally see it in terms of pages that are imageheavy But the tags and comments

More information

CONFIGURATION GUIDE WHITE PAPER JULY ActiveScale. Family Configuration Guide

CONFIGURATION GUIDE WHITE PAPER JULY ActiveScale. Family Configuration Guide WHITE PAPER JULY 2018 ActiveScale Family Configuration Guide Introduction The world is awash in a sea of data. Unstructured data from our mobile devices, emails, social media, clickstreams, log files,

More information

Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017

Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017 Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store Wei Xie TTU CS Department Seminar, 3/7/2017 1 Outline General introduction Study 1: Elastic Consistent Hashing based Store

More information

Squirrel case-study. Decentralized peer-to-peer web cache. Traditional centralized web cache. Based on the Pastry peer-to-peer middleware system

Squirrel case-study. Decentralized peer-to-peer web cache. Traditional centralized web cache. Based on the Pastry peer-to-peer middleware system Decentralized peer-to-peer web cache Based on the Pastry peer-to-peer middleware system Traditional centralized web cache 1 2 Decentralized caching of web pages use the resources of peers (web browsers/clients)

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

416 Distributed Systems. March 23, 2018 CDNs

416 Distributed Systems. March 23, 2018 CDNs 416 Distributed Systems March 23, 2018 CDNs Outline DNS Design (317) Content Distribution Networks 2 Typical Workload (Web Pages) Multiple (typically small) objects per page File sizes are heavy-tailed

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

HTRC Data API Performance Study

HTRC Data API Performance Study HTRC Data API Performance Study Yiming Sun, Beth Plale, Jiaan Zeng Amazon Indiana University Bloomington {plale, jiaazeng}@cs.indiana.edu Abstract HathiTrust Research Center (HTRC) allows users to access

More information

SoftNAS Cloud Performance Evaluation on Microsoft Azure

SoftNAS Cloud Performance Evaluation on Microsoft Azure SoftNAS Cloud Performance Evaluation on Microsoft Azure November 30, 2016 Contents SoftNAS Cloud Overview... 3 Introduction... 3 Executive Summary... 4 Key Findings for Azure:... 5 Test Methodology...

More information

Topics in P2P Networked Systems

Topics in P2P Networked Systems 600.413 Topics in P2P Networked Systems Week 4 Measurements Andreas Terzis Slides from Stefan Saroiu Content Delivery is Changing Thirst for data continues to increase (more data & users) New types of

More information

Website Designs Australia

Website Designs Australia Proudly Brought To You By: Website Designs Australia Contents Disclaimer... 4 Why Your Local Business Needs Google Plus... 5 1 How Google Plus Can Improve Your Search Engine Rankings... 6 1. Google Search

More information

HPC Growing Pains. IT Lessons Learned from the Biomedical Data Deluge
