

TFS: Tianwang File System - Performance Gain with Variable Chunk Size in GFS-like File Systems
Authors: Zhifeng Yang, Qichen Tu, Kai Fan, Lei Zhu, Rishan Chen, Bo Peng
Speaker: Hongfei Yan, School of EECS, Peking University
4/13/2008

Outline
- Introduction (what's it all about)
- Tianwang File System
- Experiments
- Conclusion

Distributed File Systems
- Support access to files on remote servers
- Must support concurrency
- Make varying guarantees about locking, who wins with concurrent writes, etc.
- Must gracefully handle dropped connections
- Can offer support for replication and local caching
- Different implementations sit in different places on the complexity/feature scale

Motivations (1/3)
[Timeline figure; year marks: 1996, 1999, 2000, 2002, 2004, 2005, 2007]
- Key ideas: Web pages; FTP files; exponential growth; preserving vanishing pages; easier preservation of web resources; knowledge discovery
- Milestones: Tianwang 1.0; Bingle 1.0; Tianwang 2.0; Web InfoMall 1.0; CDAL, 1.0 Web Digest; HisTrace; Web InfoMall 2.0; etc.

Motivations (2/3)
Data
- Web pages: 3 billion, 30 TB compressed
- URL list, IP list, link graph, anchor text, etc.
- Search engine log: about 40 GB
- Test collections: CWT100G, CWT200G, CCT2006, CWT70th, CDAL16th

Motivations (3/3)
Software
- Large-scale web crawler
- Web page deduplication
- Web page classifier
- Index and search
- TB-level data management
- Retrieval performance evaluation
- LinkAnalysis, ShallowNLP, Information Extraction
Hardware
- 80 machines (PC, Dell2850, etc.)

Issues
Data accessibility
- Data is distributed among machines
- Data is not easily opened up and shared
- Web data analysis programs are difficult to construct, deploy, and run
- Communication failure, error detection
Machine usability
- Disk failure is a disaster, but common
- Inefficiency

Some real-world datapoints
Sources:
- "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?", Bianca Schroeder and Garth A. Gibson (FAST '07)
- "Failure Trends in a Large Disk Drive Population", Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso (FAST '07)

Google File System
Solution: divide files into large 64 MB chunks, and distribute/replicate the chunks across many servers.
A couple of important details:
- The master maintains only a (file name, chunk server) table in main memory => minimal I/O
- Files are replicated using a primary-backup scheme; the master is kept out of the loop

Google File System
- Atomic record append
- Concurrent write/append
- Secondary master
- Chunk replication
- Re-replication
- Re-balancing

Hadoop Project
- A module in the Lucene/Nutch project
- DFS + MapReduce
- Create-once/read-many
- Does not support concurrent atomic record appends
- Google and IBM cloud computing initiative for universities (Oct 8, 2007)

Kosmos File System
- Does not support atomic record appends
- Supports concurrent writes
- Re-replication
- Re-balancing
- Integrated with Hadoop
- POSIX file interface
- FUSE (Filesystem in Userspace) binding
- The master is a single point of failure
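The fixed 64 MB chunking on the Google File System slide lets a client turn a byte offset into a chunk index with plain integer arithmetic before it asks the master where that chunk lives. The following is a minimal sketch of that idea only. The table layout (keyed by file name and chunk index), the file name, the server names, and the function `locate` are all hypothetical and are not taken from GFS itself.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated on the slide

# Hypothetical in-memory master table (an assumption for illustration):
# (file name, chunk index) -> chunkservers holding a replica of that chunk.
chunk_table = {
    ("/logs/crawl.dat", 0): ["cs1", "cs4", "cs7"],
    ("/logs/crawl.dat", 1): ["cs2", "cs4", "cs9"],
}


def locate(file_name, offset):
    """Map a byte offset in a file to (chunk index, offset in chunk, servers)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    index = offset // CHUNK_SIZE           # which fixed-size chunk
    offset_in_chunk = offset % CHUNK_SIZE  # position inside that chunk
    servers = chunk_table[(file_name, index)]
    return index, offset_in_chunk, servers
```

Because the chunk size is a constant, only the final dictionary lookup needs the master's table; the index computation can be done on the client.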

Outline
- Introduction (what's it all about)
- Tianwang File System
- Experiments
- Conclusion

Web Infrastructure/Cloud Computing
Storage
- Fault-tolerant: it can recover from component failures
- Scalable: it can operate correctly even as some aspect of the system is scaled to a larger size
- Transparent: the ability of a distributed system to act like a non-distributed system
Computing
- Easy/efficient parallel computing
- Data processing model: mostly sequential access; MapReduce

Assumptions
- Component failures are the norm (inexpensive commodity components)
- Files are huge (multi-GB)
- Appending rather than overwriting
- Once written, files are only read, often sequentially
- Multiple clients append concurrently
- Co-design of the applications and the system
- High sustained bandwidth is more important than low latency

TFS Design Decisions
- Files consist of chunks
- Chunks are regular files on the local file system
- Chunk replication: 3 replicas
- One master to manage metadata
- Heartbeat
- Operation log
Note: there are big differences between TFS and GFS due to the different chunk size.

Chunk Size
[Figure: an application issuing read, overwrite, and append operations on chunks, with steps numbered 1 to 4]

Chunk Size
Fixed size in GFS
- Padding
- Duplicates
Variable size in TFS
- Flexibility
- A property of the chunk
- Offset -> chunk ID
- Small chunks
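With variable-size chunks, the "Offset -> chunk ID" step listed on the Chunk Size slide can no longer be a single division, because each chunk carries its own size. One plausible way to do the lookup is a binary search over the chunks' starting offsets. The sketch below assumes that approach; the class `VariableChunkFile`, its methods, and the chunk IDs are hypothetical names chosen for illustration, not TFS's actual data structures.

```python
from bisect import bisect_right


class VariableChunkFile:
    """Hypothetical per-file metadata: an ordered list of variable-size chunks."""

    def __init__(self):
        self.starts = []  # file offset at which each chunk begins (sorted)
        self.ids = []     # chunk ID of each chunk, in file order
        self.length = 0   # total file length in bytes

    def add_chunk(self, chunk_id, size):
        """Append a chunk of the given size (in bytes) to the end of the file."""
        if size <= 0:
            raise ValueError("chunk size must be positive")
        self.starts.append(self.length)
        self.ids.append(chunk_id)
        self.length += size

    def locate(self, offset):
        """Return (chunk ID, offset inside that chunk) for a file offset."""
        if not 0 <= offset < self.length:
            raise IndexError("offset outside file")
        # Last chunk whose starting offset is <= the requested offset.
        i = bisect_right(self.starts, offset) - 1
        return self.ids[i], offset - self.starts[i]
```

The lookup costs O(log n) in the number of chunks instead of O(1), which is the price of letting small chunks sit next to large ones without padding.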

Read Operation
GFS
- Cache chunk info
- Communicate with the master when the cache fails
TFS
- Get chunk information once
- New data after open is invisible

Mutation Operation in GFS
[Figure]

Append Operation in TFS
[Figure]

Record Append Operation
GFS
- At least once
- Padding and fragments: handled by application checksums
- Duplicates: handled by application record IDs
TFS
- At most once
- Small chunk
- Delay write, flush

Implications for Applications
GFS
- Appending rather than overwriting
- Read sequentially
- Checksums
- Record ID
TFS
- Appending rather than overwriting
- Read sequentially
- Sequence of records

Outline
- Introduction (what's it all about)
- Tianwang File System
- Experiments
- Conclusion
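The Record Append Operation slide notes that, under GFS's at-least-once semantics, the application itself must cope with padding, fragments, and duplicates by using checksums and record IDs. The sketch below shows one way a reader could do that. The record framing (length, record ID, CRC32, payload) and the function names are assumptions made for this example, and the byte-by-byte resynchronisation is a simplification; it is not the format used by GFS or TFS.

```python
import struct
import zlib

# Hypothetical record header: payload length (u32), record ID (u64),
# CRC32 of the payload (u32), little-endian with no alignment padding.
HEADER = struct.Struct("<IQI")


def encode_record(record_id, payload):
    """Frame a payload so a reader can validate and de-duplicate it."""
    return HEADER.pack(len(payload), record_id, zlib.crc32(payload)) + payload


def read_records(data):
    """Yield (record ID, payload) once per ID, skipping padding and fragments."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, record_id, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        valid = (length > 0 and len(payload) == length
                 and zlib.crc32(payload) == crc)
        if valid:
            if record_id not in seen:  # drop duplicates left by retried appends
                seen.add(record_id)
                yield record_id, payload
            pos = start + length
        else:
            pos += 1                   # padding or fragment: resynchronise
```

Under the at-most-once semantics listed for TFS, the reader sees a plain sequence of records, so this validation layer is the part of the application that the slide suggests can be dropped.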

Experimental Deployment in Tianwang
- 10 nodes in a cluster
- One master, nine chunkservers
- Each with two 2.8 GHz processors, 2 GB RAM, 100 GB+ SCSI disk space

Master Operation in TFS
[Figure]

Read Buffer Size in TFS
[Figure]

Aggregate Read Rate in TFS
[Figure]

Aggregate Append Rate in TFS
[Figure]

Performance: GFS vs. TFS
GFS
- 200 to 500 operations per second
- Aggregate read rate: 75% of the network limit
- Aggregate append rate: 50% of the network limit
- Limited by the network bandwidth of the chunkserver that stores the last chunk of the file
TFS
- 3400 operations per second
- Aggregate read rate: about 72% of the network limit
- Aggregate append rate: 75% of the network limit
- Aggregate append rate can easily exceed 380 MB/s with multiple client machines
- Limited by the aggregate bandwidth between clients and chunkservers

TFS Shell
[Figure]

Sample Application
[Figure]

Source Lines of Code
[Figure]

Conclusion
- TFS demonstrates how to support large-scale processing workloads on commodity hardware
  - Designed to tolerate frequent component failures
  - Optimized for huge files that are mostly appended to and read
- The key design choice, which differs from GFS, is that the chunk size is variable and the record append operation works at the chunk level
- Significantly improves the record append performance by 25%

References
- TFS Project, http://tianwang.grids.cn/projects/tplatform, 2008.
- [Ghemawat, et al., 2003] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., vol. 37, pp. 29-43, 2003.
- [Dean and Ghemawat, 2004] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," presented at OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004.
- [Chang, et al., 2006] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," presented at OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006.
- Hadoop Project, http://lucene.apache.org/hadoop/, 2007.

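CS402 Mass Data Processing/Cloud Computing (Summer 2008, preparing)
http://net.pku.edu.cn/~course/cs402/
Course description
Full-text indexing of web pages, deduplication of mirrored web pages, spam filtering, weather simulation, galaxy simulation, sorting hundreds of millions of strings: would you like to learn how to do these things on a large distributed network while writing only a small amount of problem-specific code? These applications can be implemented with MapReduce distributed computing, which is already widely used at Google.
In this five-week course you will learn:
1) Background knowledge on distributed systems
2) MapReduce theory and practice, including how MapReduce applies to distributed computing, which applications it suits and which it does not, and practical tips and tricks
3) Hands-on experience with distributed programming, gained through several programming exercises and a course project
The exercises and the project will use Hadoop (an open-source implementation of MapReduce). The cluster is provided by the Network Lab. Students need to bring their own laptop with wireless access (to connect to and operate the cluster). We will try to schedule a classroom with wireless access and to secure hands-on lab time for everyone.

Dynamo: Amazon's Highly Available Key-Value Store
The techniques used in Dynamo originate in the operating systems and distributed systems research of the past years: DHTs, consistent hashing, versioning, vector clocks, quorum, anti-entropy based recovery, etc. As far as I know, Dynamo is the first production system to use the synthesis of all these techniques, and there are quite a few lessons learned from doing so.

Invocation semantics

Fault tolerance measures                                                           Invocation semantics
Retransmit request message   Duplicate filtering   Re-execute procedure or retransmit reply
No                           Not applicable        Not applicable                             Maybe
Yes                          No                    Re-execute procedure                       At-least-once
Yes                          Yes                   Retransmit reply                           At-most-once

Sun RPC provides at-least-once call semantics.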
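The last two rows of the invocation semantics table differ only in what the server does when a retransmitted request arrives: re-execute the procedure (at-least-once) or filter the duplicate and retransmit the saved reply (at-most-once). The sketch below contrasts the two server behaviours. The class names, the `(client_id, request_id)` key, and the in-memory reply cache are hypothetical choices for this example; a real RPC system would also need to bound the cache and handle server crashes.

```python
class AtLeastOnceServer:
    """No duplicate filtering: every (re)transmitted request is executed."""

    def __init__(self, procedure):
        self.procedure = procedure

    def handle(self, client_id, request_id, *args):
        return self.procedure(*args)


class AtMostOnceServer:
    """Duplicate filtering plus reply retransmission."""

    def __init__(self, procedure):
        self.procedure = procedure
        self.replies = {}  # (client_id, request_id) -> reply already sent

    def handle(self, client_id, request_id, *args):
        key = (client_id, request_id)
        if key in self.replies:
            # Duplicate request: resend the saved reply without re-executing.
            return self.replies[key]
        reply = self.procedure(*args)
        self.replies[key] = reply
        return reply
```

The difference only matters for non-idempotent procedures: a retried "add 10 to the balance" is applied twice by the first server and once by the second.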