TFS (Tianwang File System): Performance Gain with Variable Chunk Size in GFS-like File Systems
Authors: Zhifeng Yang, Qichen Tu, Kai Fan, Lei Zhu, Rishan Chen, Bo Peng
Speaker: Hongfei Yan, School of EECS, Peking University, 4/13/2008

Outline
- Introduction (what is it all about)
- Tianwang File System
- Experiments
- Conclusion

Distributed File Systems
- Support access to files on remote servers
- Must support concurrency
- Make varying guarantees about locking, who wins with concurrent writes, etc.
- Must gracefully handle dropped connections
- Can offer support for replication and local caching
- Different implementations sit in different places on the complexity/feature scale

Motivations (1/3)
[Timeline figure, 1996-2007. Key ideas: web pages vanish easily, so preserve web pages and FTP files; web resources grow exponentially; knowledge discovery. Milestones: Tianwang 1.0, Bingle 1.0, Tianwang 2.0, Web InfoMall 1.0, CDAL 1.0, Web Digest, HisTrace, Web InfoMall 2.0, etc.]

Motivations (2/3)
- Data: 3 billion web pages, 30 TB compressed; URL list, IP list, link graph, anchor text, etc.; search engine log of about 40 GB
- Test collections: CWT100G, CWT200G, CCT2006, CWT70th, CDAL16th

Motivations (3/3)
- Software: large-scale web crawler, web page deduplication, web page classifier, index and search, TB-level data management, retrieval performance evaluation, link analysis, shallow NLP, information extraction
- Hardware: 80 machines (PCs, Dell 2850, etc.)
Issues
- Data accessibility: data is distributed among machines and not easily opened and shared; it is difficult to construct, deploy, and run web data analysis programs; communication failures must be detected and handled
- Machine usability: disk failure is a disaster, but common
- Inefficiency

Some real-world data points
- Sources: "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?", Bianca Schroeder and Garth A. Gibson (FAST '07); "Failure Trends in a Large Disk Drive Population", Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso (FAST '07)

Google File System
- Solution: divide files into large 64 MB chunks, and distribute/replicate chunks across many servers
- A couple of important details:
  - The master maintains only a (file name, chunk server) table in main memory, so it does minimal I/O
  - Files are replicated using a primary-backup scheme; the master is kept out of the loop
- Atomic record append; concurrent write/append; secondary master; chunk replication; re-replication; re-balancing

Hadoop Project
- A module in the Lucene/Nutch project
- DFS + MapReduce
- Write-once/read-many
- Does not support concurrent atomic record appends
- Google & IBM cloud computing initiative for universities (Oct 8, 2007)

Kosmos File System
- Does not support atomic record appends
- Supports concurrent writes
- Re-replication and re-balancing
- Integrated with Hadoop
- POSIX file interface; FUSE (Filesystem in Userspace) binding
- The master is a single point of failure
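As a rough illustration of the GFS master's role (not the actual GFS code; the class and method names below are hypothetical), the in-memory metadata can be sketched as two dictionaries: (file name, chunk index) -> chunk handle, and chunk handle -> replica locations. A client resolves a byte offset with a single master round trip and then reads directly from a chunkserver:

```python
# Minimal sketch of GFS-style master metadata (hypothetical names, not the real API).
# The master keeps all mappings in memory: (file, chunk index) -> chunk handle,
# and chunk handle -> list of chunkservers holding replicas.

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

class MasterMetadata:
    def __init__(self):
        self.chunk_table = {}   # (file_name, chunk_index) -> chunk_handle
        self.locations = {}     # chunk_handle -> [chunkserver addresses]
        self.next_handle = 0

    def add_chunk(self, file_name, chunk_index, servers):
        handle = self.next_handle
        self.next_handle += 1
        self.chunk_table[(file_name, chunk_index)] = handle
        self.locations[handle] = list(servers)
        return handle

    def lookup(self, file_name, byte_offset):
        """Translate a byte offset into (chunk handle, replica locations)."""
        chunk_index = byte_offset // CHUNK_SIZE   # fixed size => pure arithmetic
        handle = self.chunk_table[(file_name, chunk_index)]
        return handle, self.locations[handle]

# Example: one file with two chunks, each replicated on three chunkservers.
master = MasterMetadata()
master.add_chunk("/crawl/pages.dat", 0, ["cs1", "cs2", "cs3"])
master.add_chunk("/crawl/pages.dat", 1, ["cs2", "cs4", "cs5"])
print(master.lookup("/crawl/pages.dat", 70 * 1024 * 1024))  # offset falls in chunk 1
```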
Outline
- Introduction (what is it all about)
- Tianwang File System
- Experiments
- Conclusion

Web Infrastructure / Cloud Computing
- Storage
  - Fault-tolerant: it can recover from component failures
  - Scalable: it can operate correctly even as some aspect of the system is scaled to a larger size
  - Transparent: the ability of a distributed system to act like a non-distributed system
- Computing
  - Easy and efficient parallel computing
  - Data processing model: mostly sequential access; MapReduce

Assumptions
- Component failures are the norm; inexpensive commodity components
- Files are huge (multi-GB)
- Appending rather than overwriting: once written, files are only read, often sequentially, and multiple clients append concurrently
- Applications and the file system are co-designed
- High sustained bandwidth is more important than low latency

TFS Design Decisions
- Files consist of chunks; chunks are regular files on the local file system
- Chunk replication: 3 replicas
- One master manages metadata (heartbeat, operation log)
- Note: there are big differences between TFS and GFS due to the different chunk size

Chunk Size
[Figure comparing fixed chunk size in GFS with variable chunk size in TFS: with a fixed size, an application's read offset maps directly to a chunk ID, but overwrites and record appends can require padding and produce duplicates; in TFS the chunk size is a property of each chunk, appends create small chunks, and the variable size gives flexibility.]
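The addressing difference can be sketched as follows (a hypothetical illustration, not TFS or GFS code): with a fixed 64 MB chunk, a byte offset maps to a chunk index by integer division, while with variable-size chunks the lookup must consult the per-file list of chunk lengths:

```python
import bisect

CHUNK_SIZE = 64 * 1024 * 1024  # GFS-style fixed chunk size

def chunk_index_fixed(offset):
    # Fixed size: offset -> chunk index is pure arithmetic.
    return offset // CHUNK_SIZE

def chunk_index_variable(offset, chunk_lengths):
    # Variable size (TFS-style sketch): chunk size is a property of each chunk,
    # so the lookup binary-searches the cumulative chunk start offsets.
    starts = [0]
    for length in chunk_lengths:
        starts.append(starts[-1] + length)
    if offset >= starts[-1]:
        raise ValueError("offset past end of file")
    return bisect.bisect_right(starts, offset) - 1

# A file whose chunks were produced by record appends of varying sizes.
lengths = [10_000_000, 3_500_000, 48_000_000]
print(chunk_index_fixed(70 * 1024 * 1024))        # -> 1
print(chunk_index_variable(12_000_000, lengths))  # -> 1 (inside the second chunk)
```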
Read Operation
- GFS: the client caches chunk information and goes back to the master when the cache fails
- TFS: the client gets the chunk information once at open; data appended after the open is invisible to that reader

Mutation Operation in GFS; Append Operation in TFS
[Figure slides.]

Record Append Operation
- GFS: at-least-once; padding and fragments; application-level checksums; duplicates are handled with application-level record IDs
- TFS: at-most-once; each record becomes a small chunk; writes are delayed and then flushed

Implications for Applications
- GFS: append rather than overwrite; read sequentially; applications need checksums and record IDs
- TFS: append rather than overwrite; read sequentially; a file is simply a sequence of records

Outline
- Introduction (what is it all about)
- Tianwang File System
- Experiments
- Conclusion
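A minimal sketch of what the two append semantics mean for readers (hypothetical code, not the GFS or TFS client libraries): under GFS's at-least-once append, the application filters padding and duplicates using record IDs; under TFS's at-most-once append, each successful record is its own small chunk and the file reads back as a clean sequence of records:

```python
def read_records_gfs(records):
    """GFS record append is at-least-once: the application embeds a record ID
    in each record and filters duplicates and padding on read."""
    seen = set()
    for rec in records:
        if rec is None:            # padding / fragment inserted by the system
            continue
        record_id, payload = rec
        if record_id in seen:      # duplicate produced by a retried append
            continue
        seen.add(record_id)
        yield payload

def read_records_tfs(chunks):
    """TFS record append is at-most-once: each successful append becomes a
    small chunk, so the file is already a clean sequence of records."""
    for chunk in chunks:
        yield chunk

# GFS-style stream with padding (None) and a duplicated record ID 2.
gfs_stream = [(1, b"a"), None, (2, b"b"), (2, b"b"), (3, b"c")]
print(list(read_records_gfs(gfs_stream)))            # [b'a', b'b', b'c']
print(list(read_records_tfs([b"a", b"b", b"c"])))    # [b'a', b'b', b'c']
```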
Experimental Deployment in Tianwang
- 10 nodes in a cluster: one master and nine chunkservers
- Each node has two 2.8 GHz processors, 2 GB RAM, and 100 GB+ of SCSI disk space

Master Operation in TFS; Read Buffer Size in TFS; Aggregate Read Rate in TFS; Aggregate Append Rate in TFS
[Figure slides: measured master operation rate, effect of read buffer size, and aggregate read and append rates in TFS.]

Performance: GFS vs. TFS
- GFS: 200 to 500 master operations per second; aggregate read rate about 75% of the network limit; aggregate append rate about 50% of the network limit, limited by the network bandwidth of the chunkservers that store the last chunk of the file
- TFS: 3,400 master operations per second; aggregate read rate about 72% of the network limit; aggregate append rate about 75% of the network limit, easily exceeding 380 MB/s with multiple client machines, limited by the aggregate bandwidth between clients and chunkservers
TFS Shell; Sample Application; Source Lines of Code
[Figure/table slides: the TFS shell, a sample application, and source-lines-of-code counts.]

Conclusion
- TFS demonstrates how to support large-scale processing workloads on commodity hardware: it is designed to tolerate frequent component failures and optimized for huge files that are mostly appended to and then read
- The key design choice is that the chunk size is variable and the record append operation works at the chunk level, which differs from GFS
- This significantly improves record append performance, by 25%

References
- TFS Project, http://tianwang.grids.cn/projects/tplatform, 2008.
- [Ghemawat et al., 2003] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," SIGOPS Oper. Syst. Rev., vol. 37, pp. 29-43, 2003.
- [Dean and Ghemawat, 2004] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004.
- [Chang et al., 2006] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006.
- Hadoop Project, http://lucene.apache.org/hadoop/, 2007.
CS402 Mass Data Processing / Cloud Computing (Summer 2008, in preparation)
http://net.pku.edu.cn/~course/cs402/
Course description: Full-text web indexing, mirror page deduplication, spam filtering, weather simulation, galaxy simulation, sorting hundreds of millions of strings... Would you like to learn how to do these things on a large distributed network by writing only a small amount of problem-specific code? These applications can be built with MapReduce distributed computing, which is already widely used at Google. In this five-week course you will learn: 1) the relevant background on distributed systems; 2) MapReduce theory and practice, including understanding how MapReduce fits distributed computing, which applications it suits and which it does not, and practical tips and tricks; 3) hands-on experience with distributed programming through several programming exercises and a course project. The exercises and project will use Hadoop (an open-source implementation of MapReduce). The cluster is provided by the network lab; students should bring laptops with wireless access to connect to the cluster. We will try to arrange classrooms with wireless coverage and to secure hands-on lab time for everyone.

Dynamo: Amazon's Highly Available Key-Value Store
- Dynamo's techniques originate in the operating systems and distributed systems research of the past years: DHTs, consistent hashing, versioning, vector clocks, quorums, anti-entropy based recovery, etc.
- As far as I know, Dynamo is the first production system to use the synthesis of all these techniques, and there are quite a few lessons learned from doing so.

Invocation Semantics
Fault tolerance measures and the resulting invocation semantics:

Retransmit request message | Duplicate filtering | Re-execute procedure or retransmit reply | Invocation semantics
No                         | Not applicable      | Not applicable                           | Maybe
Yes                        | No                  | Re-execute procedure                     | At-least-once
Yes                        | Yes                 | Retransmit reply                         | At-most-once

Sun RPC provides at-least-once call semantics.
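The at-most-once row of the table can be made concrete with a small sketch (a hypothetical server, not Sun RPC or any real RPC stack): the server filters duplicate requests by request ID and retransmits the cached reply instead of re-executing the procedure:

```python
# Sketch of at-most-once invocation semantics via duplicate filtering
# (hypothetical names; not a real RPC implementation).

class AtMostOnceServer:
    def __init__(self, procedure):
        self.procedure = procedure
        self.reply_cache = {}          # request_id -> cached reply

    def handle(self, request_id, args):
        if request_id in self.reply_cache:   # duplicate request: retransmit reply
            return self.reply_cache[request_id]
        reply = self.procedure(*args)        # execute the procedure at most once
        self.reply_cache[request_id] = reply
        return reply

counter = {"n": 0}
def increment(step):
    counter["n"] += step
    return counter["n"]

server = AtMostOnceServer(increment)
print(server.handle("req-1", (5,)))   # 5
print(server.handle("req-1", (5,)))   # 5 again: the retried request is not re-executed
print(server.handle("req-2", (5,)))   # 10
```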