Use Distributed File System as a Storage Tier. Fabrizio Manfredi Furuholmen

2 Agenda: Introduction (Next Generation Data Center, Distributed File System); Distributed File Systems (OpenAFS, GlusterFS, HDFS, Ceph); Case Studies; Conclusion. 6/23/10

3 Class Exam: What do you know about DFS? How can you create petabyte-scale storage? How can you build a centralized system log? How can you allocate space for your users or systems when you have thousands of them? How can you retrieve data from everywhere?

4 Introduction. Next Generation Data Center: the FABRIC. Key categories: continuous data protection and disaster recovery; file and block data migration across heterogeneous environments; server and storage virtualization; encryption for data in flight and at rest. In other words: a cloud data center.

5 Introduction. Storage Tier in the FABRIC: high performance, scalability, simplified management, security, high availability. Solutions: Storage Area Network, Network Attached Storage, distributed file system.

6 Introduction. What is a distributed file system? A distributed file system takes advantage of the interconnected nature of the network by storing files on more than one computer in the network and making them accessible to all of them.

7 Introduction. What do you expect from a distributed file system? Uniform access: global file name support. Security: global authentication/authorization. Reliability: the elimination of every single point of failure. Availability: administrators perform routine maintenance while the file server is in operation, without disrupting the users' routines. Scalability: handle terabytes of data. Standard conformance: some IEEE POSIX file system semantics. Performance: high performance.

8 Part II: Implementations. How many DFSs do you know?

9 OpenAFS: introduction. OpenAFS is the open source implementation of the Andrew File System from IBM. Key ideas: make clients do work whenever possible; cache whenever possible; exploit file usage properties and understand them (one-third of Unix files are temporary); minimize system-wide knowledge and change; do not hardwire locations; trust the fewest possible entities; do not trust workstations; batch operations where possible.

10 OpenAFS: design (figure)

11 OpenAFS: components. Cell: a cell is a collection of file servers and workstations; the directories under /afs are cells, forming one unique tree; a file server contains volumes. Volumes: volumes are "containers", or sets of related files and directories; they have a size limit; there are three types: rw, ro, backup. Mount point directory: access to a volume is provided through a mount point; a mount point looks just like a static directory. (Figure labels: Server A, Server A+B, Server C.)
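The relationship between mount points, volumes and file servers can be illustrated with a small Python sketch. This is not OpenAFS code and does not use any OpenAFS API: the cell name, the volume names, the server names and the functions `resolve` and `move_volume` are all hypothetical. It only shows why a path stays the same when a volume moves to another server.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Volume:
    name: str
    kind: str    # one of "rw", "ro", "backup"
    server: str  # file server currently holding the volume


# Hypothetical cell contents; every name below is made up for illustration.
volumes = {
    "root.cell": Volume("root.cell", "ro", "serverA"),
    "user.alice": Volume("user.alice", "rw", "serverB"),
    "proj.web": Volume("proj.web", "rw", "serverC"),
}

# A mount point is a directory in the tree that refers to a volume by name.
mount_points = {
    "/afs/example.org": "root.cell",
    "/afs/example.org/home/alice": "user.alice",
    "/afs/example.org/proj/web": "proj.web",
}


def resolve(path):
    """Return the volume whose mount point is the longest prefix of path."""
    matches = [m for m in mount_points
               if path == m or path.startswith(m + "/")]
    if not matches:
        raise FileNotFoundError(path)
    return volumes[mount_points[max(matches, key=len)]]


def move_volume(name, new_server):
    """Relocate a volume; mount points refer to names, so paths do not change."""
    volumes[name] = replace(volumes[name], server=new_server)
```

Because clients look up the volume by name and only then ask where it lives, calling `move_volume("user.alice", "serverC")` changes the answer of `resolve` without touching any path.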

12 OpenAFS: performance (figure; series: OpenAFS, OpenAFS OSD 2 servers)

13 OpenAFS: features. Uniform name space: same path on all workstations. Security: based on krb4/krb5, extended ACLs, traffic encryption. Reliability: read-only replication, HA database, read/write replicas in the OSD version. Availability: maintenance tasks without stopping the service. Scalability: server aggregation. Administration: delegation of administration. Performance: persistent client-side disk-based cache, high ratio of clients per server.

14 OpenAFS: who uses it? Morgan Stanley IT: internal usage; storage: 450 TB (ro) + 15 TB (rw); clients: [value missing]. Pictage, Inc.: online picture album; storage: 265 TB (planned growth to 425 TB in twelve months); volumes: 800,000; files: [value missing]. Embian: Internet shared folder; storage: 500 TB; servers: 200 storage servers, 300 app servers. RZH: internal usage; 210 TB.

15 OpenAFS: good for... Good: wide area networks; heterogeneous systems; more read operations than write operations; large numbers of clients/systems; direct usage by end users; federation. Bad: locking; databases; Unicode; large files; some limitations on..

16 GlusterFS. Gluster can manage data in a single global namespace on commodity hardware. Keys: Lower storage cost: open source software runs on commodity hardware. Scalability: linearly scales to hundreds of petabytes. Performance: no metadata server means no bottlenecks. High availability: data mirroring and real-time self-healing. Virtual storage for virtual servers: simplifies storage and keeps VMs always on. Simplicity: complete web-based management suite.

17 GlusterFS: design (figure)

18 GlusterFS: components. Volume: the volume is the basic element for data export; volumes can be stacked for extension. Capabilities: specific options (features) can be enabled for each volume (cache, prefetch, etc.); custom extensions are simple to create through the API interface. Services: access to a volume is provided through services such as TCP, Unix sockets and InfiniBand. Example volume definitions:

    volume posix1
      type storage/posix
      option directory /home/export1
    end-volume

    volume brick1
      type features/posix-locks
      option mandatory
      subvolumes posix1
    end-volume

    volume server
      type protocol/server
      option transport-type tcp
      option transport.socket.listen-port 6996
      subvolumes brick1
      option auth.addr.brick1.allow *
    end-volume
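The idea that volumes can be stacked, each one adding a feature on top of its subvolumes, can be sketched in a few lines of Python. This is an illustration of the layering pattern only, not GlusterFS code: the class names `PosixStore`, `Replicate` and `ReadCache` are hypothetical, and a dictionary stands in for an exported directory.

```python
class PosixStore:
    """Bottom of the stack: holds the data (a dict stands in for a directory)."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class Replicate:
    """Mirrors every write to all subvolumes; reads from the first that has the file."""

    def __init__(self, *subvolumes):
        self.subvolumes = subvolumes

    def write(self, path, data):
        for sub in self.subvolumes:
            sub.write(path, data)

    def read(self, path):
        for sub in self.subvolumes:
            try:
                return sub.read(path)
            except KeyError:
                continue
        raise KeyError(path)


class ReadCache:
    """Adds a read cache in front of a single subvolume."""

    def __init__(self, subvolume):
        self.subvolume = subvolume
        self.cache = {}
        self.hits = 0

    def write(self, path, data):
        self.cache.pop(path, None)  # drop any stale cached copy
        self.subvolume.write(path, data)

    def read(self, path):
        if path in self.cache:
            self.hits += 1
            return self.cache[path]
        data = self.subvolume.read(path)
        self.cache[path] = data
        return data


# Stack: cache -> replicate -> two storage volumes.
store_a, store_b = PosixStore(), PosixStore()
volume = ReadCache(Replicate(store_a, store_b))
```

Each layer exposes the same `read`/`write` interface as the layer below it, which is what lets features be combined in any order.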

19 Gluster: components (figure)

20 Gluster: performance (figure)

21 Gluster: characteristics. Uniform name space: same path on all workstations. Reliability: RAID-1 replication, asynchronous replication for disaster recovery. Availability: no system downtime for maintenance (better in the next release). Scalability: truly linear scalability. Administration: self-healing, centralized logging and reporting, appliance version. Performance: files striped across dozens of storage blocks, automatic load balancing, per-volume I/O tuning.

22 Gluster: who uses it? Avail TVN (USA): 400 TB for video on demand, video storage. Fido Film (Sweden): visual FX and animation studio. University of Minnesota (USA): 142 TB, supercomputing. Partners Healthcare (USA): 336 TB, integrated health system. Origo (Switzerland): open source software development and collaboration platform.

23 Gluster: good for... Good: large amounts of data; access with different protocols; direct access from applications (API layer); disaster recovery (better in the next release); SAN replacement, VM storage. Bad: user space; low granularity in security settings; high volumes of operations on the same file.

24 Implementations. Old way: metadata and data in the same place; a single stream per file. New way: multiple streams are parallel channels through which data can flow; files are striped across a set of nodes in order to facilitate parallel access; OSD: separation of file metadata management (MDS) from the storage of file data.
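Striping a file across nodes can be sketched as simple offset arithmetic. The Python below is a minimal illustration under assumed parameters (round-robin striping with a fixed stripe size); the function names `stripe` and `locate` and the node names are hypothetical, and real systems add metadata, replication and failure handling on top.

```python
def stripe(data, stripe_size, nodes):
    """Distribute data round-robin over nodes in stripe_size chunks."""
    parts = {node: bytearray() for node in nodes}
    for start in range(0, len(data), stripe_size):
        node = nodes[(start // stripe_size) % len(nodes)]
        parts[node] += data[start:start + stripe_size]
    return {node: bytes(buf) for node, buf in parts.items()}


def locate(offset, stripe_size, nodes):
    """Map a byte offset in the file to (node, offset inside that node's part)."""
    stripe_index = offset // stripe_size
    node = nodes[stripe_index % len(nodes)]
    local_stripe = stripe_index // len(nodes)
    return node, local_stripe * stripe_size + offset % stripe_size


nodes = ["node-a", "node-b", "node-c"]
data = bytes(range(10))
parts = stripe(data, 2, nodes)
```

Since consecutive stripes live on different nodes, a client reading a large range can fetch from all of them in parallel, which is the "multiple streams" point of the slide.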

25 HDFS: Hadoop. HDFS is part of the Apache Hadoop project, which develops open-source software for reliable, scalable, distributed computing. Hadoop was inspired by Google's MapReduce and Google File System.

26 HDFS: Google File System. Design of a file system for a different environment, where the assumptions of a general-purpose file system do not hold; it is interesting to see how new assumptions lead to a different type of system. Key ideas: component failures are the norm; huge files (not just the occasional file); append rather than overwrite is typical; co-design of application and file system API; specialization, for example relaxed consistency.

27 HDFS: MapReduce. Moving computation is cheaper than moving data. Map: the input is split and mapped into key-value pairs. Combine: for efficiency reasons, the combiner works directly on the map operation outputs. Reduce: the files are then merged, sorted and reduced.
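The three phases can be illustrated with a word count written in plain Python. This is a single-process sketch of the programming model, not Hadoop code and not the Hadoop API; the function names and the sample input are made up for illustration.

```python
from collections import defaultdict
from itertools import groupby


def map_phase(split):
    """Map: turn one input split into (key, value) pairs."""
    return [(word.lower(), 1) for word in split.split()]


def combine(pairs):
    """Combine: pre-aggregate the output of one map task before it is shipped."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_phase(pairs):
    """Reduce: merge and sort all pairs by key, then sum the values per key."""
    merged = sorted(pairs)
    return {key: sum(value for _, value in group)
            for key, group in groupby(merged, key=lambda kv: kv[0])}


def word_count(splits):
    combined = [combine(map_phase(split)) for split in splits]
    return reduce_phase(pair for part in combined for pair in part)


counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
```

In a real cluster each `map_phase` and `combine` call would run on the node that already stores the split, so only the small combined pairs cross the network.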

28 HDFS: goals. Scalable: can reliably store and process petabytes. Economical: distributes the data and processing across clusters of commonly available computers. Efficient: can process data in parallel on the nodes where the data is located. Reliable: automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

29 HDFS: design (figure)

30 HDFS: components. NameNode: an HDFS cluster consists of a single NameNode; it is a master server that manages the file system namespace and regulates access to files by clients. DataNodes: a DataNode manages the storage attached to the system it runs on and applies the map rule of MapReduce. Blocks: a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
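The split between a NameNode that tracks blocks and DataNodes that store them can be sketched as follows. This is a simplified Python model, not HDFS code: the round-robin placement is an assumption chosen for brevity and is not the placement policy HDFS actually uses, and the block size, replication factor and node names are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return a list of (block_id, length) covering file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks


def place_blocks(blocks, datanodes, replication):
    """Build the NameNode-side table: block id -> DataNodes holding a replica.

    Placement here is plain round-robin (an assumption, not the HDFS policy).
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id, _ in blocks
    }


datanodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(file_size=150, block_size=64)
table = place_blocks(blocks, datanodes, replication=3)
```

The NameNode only needs to keep `table`; the file contents themselves never pass through it, which is why a single NameNode can front many DataNodes.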

31 HDFS: features. Uniform name space: same path on all workstations. Reliability: rw replication, re-balancing, copies in different locations. Availability: hot deploy. Scalability: server aggregation. Administration: HOD. Performance: grid computation, parallel transfer.

32 HDFS: who uses it? Major players: Yahoo!, A9.com, AOL, Booz Allen Hamilton, eHarmony, Facebook, Freebase, Fox Interactive Media, IBM, ImageShack, ISI, Joost, Last.fm, LinkedIn, Metaweb, Meebo, Ning, Powerset (now part of Microsoft), Proteus Technologies, The New York Times, Rackspace, Veoh, Twitter.

33 HDFS: good for... Good: task distribution (basic grid infrastructure); distribution of content (high throughput of data access); archiving; heterogeneous environments. Bad: not a general-purpose file system; not POSIX compliant; low granularity in security settings; Java.

34 Ceph. Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file or write to the same directory, usage scenarios that bring typical enterprise storage systems to their knees. Keys: Seamless scaling: the file system can be seamlessly expanded by simply adding storage nodes (OSDs); unlike most existing file systems, Ceph proactively migrates data onto new devices in order to maintain a balanced distribution of data. Strong reliability and fast recovery: all data is replicated across multiple OSDs; if any OSD fails, data is automatically re-replicated to other devices. Adaptive MDS: the Ceph metadata server (MDS) is designed to dynamically adapt its behavior to the current workload.

35 Ceph: design (figure; components: client, metadata cluster, object storage cluster, OSD)

36 Ceph: features. Dynamic distributed metadata: metadata storage, dynamic subtree partitioning, traffic control. Reliable autonomic distributed object storage: data distribution, replication, data safety, failure detection, recovery and cluster updates.

37 Ceph: features. Pseudo-random data distribution function (CRUSH); reliable object storage service (RADOS); Extent and B-tree based Object File System (EBOFS; today btrfs).
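The idea of computing object locations with a pseudo-random function, instead of looking them up in a central table, can be sketched in Python. The sketch below is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a much simpler stand-in that shares two properties with it, namely that any client can compute the placement on its own and that adding or removing a device moves only a small share of the data. The OSD names and the `place` function are hypothetical.

```python
import hashlib


def _score(object_id, osd):
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_id, osds, replicas=2):
    """Return the OSDs that hold object_id: the `replicas` highest-scoring ones.

    Rendezvous hashing, used here as a simplified stand-in for CRUSH.
    """
    ranked = sorted(osds, key=lambda osd: _score(object_id, osd), reverse=True)
    return ranked[:replicas]


osds = ["osd0", "osd1", "osd2", "osd3"]
objects = [f"object-{i}" for i in range(200)]
before = {obj: place(obj, osds) for obj in objects}
after = {obj: place(obj, osds + ["osd4"]) for obj in objects}
```

Comparing `before` and `after` shows that replicas only ever move onto the new `osd4`; no data is shuffled between the existing devices.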

38 Ceph: features. Splay replication: a final commit notification is sent to the client only after the data has been safely committed to disk.

39 Ceph: good for... Good: scientific applications, high throughput of data access; heavy read/write operations; it is the most advanced distributed file system. Bad: young (Linux ); Linux only; complex.

40 Others: Lustre, PVFS, MooseFS, Cloudstore (Kosmos), pNFS, XtreemFS, Tahoe-LAFS. Search Wikipedia..

41 Part III: Case Studies

42 Class Exam: What can DFS do for you? How can you create petabyte-scale storage? How can you build a centralized system log? How can you allocate space for your users or systems when you have thousands of them? How can you retrieve data from everywhere?

43 File sharing. Problem: share documents across a wide area network; share home folders across different terminal servers. Solution: OpenAFS; Samba. Results: single ID, Kerberos/LDAP; single file system. Usage: 800 users; 15 branch offices; file sharing; /home dir.

44 Web Service. Problem: big storage on a little budget. Solution: Gluster. Results: high-availability data storage; low price. Usage: 100 TB image archive; multimedia content for the web site.

45 Internet Disk: mys3. Problems: data from everywhere; disaster recovery. Solution: mys3; Hadoop / OpenAFS. Results: high availability; access through the HTTP protocol (REST interface); disaster recovery. Usage: user backups; application backend; 200 users; 6 TB.

46 Log concentrator. Problem: log concentrator. Solution: Hadoop cluster; Syslog-NG. Results: high availability; fast search; storage without limits. Usage: security audit and access control.

47 Private cloud. Problems: low-cost VM storage; VM self-provisioning. Solution: GlusterFS; OpenAFS; custom provisioning. Results: auto-provisioning; low cost; flexible solution. Usage: development environment; production environment.

48 Conclusion: problems. Do you have enough bandwidth? Failure: for 10 PB of storage, you will have an average of 22 consumer-grade SATA drives failing per day. Read/write time: each of the 2 TB drives takes approximately 24,390 seconds, best case, to be read and written over the network. Data replication: data replication is the number of the disk drives, plus difference.
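The figures on this slide can be checked with a little arithmetic. The Python below is a back-of-the-envelope sketch under stated assumptions: decimal units (1 TB = 10^12 bytes), raw capacity without replication, and the reading that every failed drive has to be fully re-copied over the network. The slide does not state the throughput it assumes; the value computed here is only what 2 TB in 24,390 seconds implies.

```python
TB = 10**12  # decimal terabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)

capacity = 10 * PB
drive_size = 2 * TB
seconds_per_drive = 24_390   # figure quoted on the slide
failures_per_day = 22        # figure quoted on the slide

# Number of 2 TB drives needed for 10 PB of raw capacity (no replication).
drive_count = capacity // drive_size

# Throughput implied by reading or writing one drive in 24,390 seconds.
implied_rate = drive_size / seconds_per_drive  # bytes per second

# Time to copy a single drive, in hours.
hours_per_drive = seconds_per_drive / 3600

# Sustained network traffic needed just to re-copy the drives that fail each day.
rebuild_rate = failures_per_day * drive_size / 86_400  # bytes per second

print(f"drives: {drive_count}")
print(f"implied throughput: {implied_rate / 1e6:.0f} MB/s")
print(f"time per drive: {hours_per_drive:.1f} h")
print(f"rebuild traffic: {rebuild_rate / 1e6:.0f} MB/s")
```

Under these assumptions the copy time corresponds to roughly 82 MB/s per drive, and replacing 22 drives a day keeps about 509 MB/s of network bandwidth busy around the clock, which is the point behind the question "Do you have enough bandwidth?".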

49 Conclusion. Environment analysis: there is no true generic DFS; it is not simple to move 800 TB between different solutions. Dimension: start with the right size; the number of servers is related to the speed needed and the number of clients; network for replication. Divide the system into classes of service: different disk types; different computer types. System management: monitoring tools; system/software deployment tools.

50 Conclusion: next step

51 Links: OpenAFS; Gluster; Hadoop; Ceph; Hadoop.apache.org; Isabel Drost; ceph.newdream.net; Publication; Mailing list.

52 I look forward to meeting you. XVII European AFS Meeting 2010, Pilsen, Czech Republic, September 13-15. Who should attend: everyone interested in deploying a globally accessible file system; everyone interested in learning more about real-world usage of Kerberos authentication in single-realm and federated single sign-on environments; everyone who wants to share their knowledge and experience with other members of the AFS and Kerberos communities; everyone who wants to find out about the latest developments affecting AFS and Kerberos. More info:

53 Thank you!
