1 Distributed Storage Rob Quick Indiana University Slides courtesy of Derek Weitzel University of Nebraska Lincoln
2 Outline Storage Patterns in Grid Applications Storage Architecture for Grid Sites Strategies for Managing Data Reliability and the Cost of Complexity Prestaging Data on the OSG Advanced Data Management Architectures 2
3 Storage Patterns There are many variations in how storage is handled. Not only variations from different workflows, but even within a workflow. Important: Pick a proper storage strategy for your application 3
4 Common Storage Patterns I m going to highlight a few of the storage patterns seen at Grid sites. It s important to note that moving files around is relatively easy. The hard part of storage management are large datasets and error cases. 4
5 Usage Patterns Ask yourself, for each job: - Events in a job s life: INPUT: What should be available when the job is starting? RUNTIME: What data are needed while the job runs? OUTPUT: What output is produced? - Important quantities: FILES: How many files? DATA: How big is your dataset? How big is the sum of all the files? RATES: How much data is consumed on average? At peak?
6 Why we care about quantities? FILES: How many files? - Many file systems cannot handle lots of small files DATA: - How much space do you need on the file server to store the input? Output? 6
7 Why we care about quantities? RATES: How much data is consumed on average? At peak? - Even professionals have a hard time figuring this out - Rates determine how you should stage your data (talk about staging shortly) 7
8 Examples of Storage Usage Simulation - Small input, big output Searches - Big input, small output Data processing - About the same input and output Analysis - Big input, small output 8
9 Simulation Based on different input configurations, generate the physical response of a system. - Input: The Grid application must manage many input files; one per job. - Runtime: An executable reads the input and later produces output. Sometimes, temporary scratch space is necessary for intermediate results. - Output: These outputs can be small [KB] (total energy and configuration of a single molecule) or large [GB] (electronic readout of a detector for hundreds of simulated particle collisions). Huge outputs [TB] are currently not common in Grid applications.
10 Searches Given a database, calculate a solution to a given problem. - Input: A database (possibly several GB) shared for all jobs, and an input file per job. - Runtime: Job reads the configuration file at startup, and accesses the database throughout the job s runtime. - Output: Typically small (order of a few MB); the results of a query/search.
11 Data Processing Transform dataset(s) into new dataset(s). The input dataset might be re-partitioned into new logical groupings, or changed into a different data format. - Input: Configuration file; input dataset - Runtime: Configuration is read at startup; input dataset is read through, one file at a time. Scratch space used for staging output. - Output: Output dataset; similar in size to the input dataset.
12 Analysis Given some dataset, analyze and summarize its contents. - Input: Configuration file and data set. - Runtime: Configuration file is read, then the process reads through the files in the dataset, one at a time (approximately constant rate). Intermediate output written to scratch area. - Output: Summary of dataset; smaller than input dataset.
13 OSG Storage Patterns Just as we want to identify common successful patterns, we want to identify common patterns that are unsuccessful on OSG. - Files larger than >5GB. - Requiring a (POSIX) shared file system. - Lots of small files (more than 1 file per minute of computation time). - Jobs consuming more than 10GB per hour, or needing scratch space more than 5GB. - Locking, appending, or updating files. When using OSG resources opportunistically, the effective limitations may be even more restrictive than above.
14 Exercise to help identify problems Remember, we want to identify: - Input data How much data is needed for the entire workflow? - Working set How much data is needed to run 1 unit of work? Including possible temporary files. - Output data How much data is output to the workflow. 14
15 Questions? Questions? Comments? - Feel free to ask us questions 15
16 Lecture 2: Using Remote Storage Systems Rob Slides courtesy of Derek Weitzel
17 Storage Architectures on the OSG Let s look at how storage is structured on the OSG. This will give you an idea of where the bottlenecks can be. Can also help when forming data solutions on your own.
18 Original Storage at OSG CEs LAN OSG CE Worker Node Worker Node Worker Node Worker Node NFS Server All OSG sites have some kind of shared, POSIX-mounted storage (typically NFS).* This is almost never a distributed or high-performance file system This is mounted and writable on the CE.* This is readable (sometimes read-only) from the OSG worker nodes. *Exceptions apply; Sites ultimately decide 18
19 Why Not? This setup is called the classic SE setup, because this is how the grid worked circa Why didn t this work? Scalability issues. High-performance filesystems not reliable or cheap enough. Difficult to manage space. 19
20 Storage Elements In order to make storage and transfers scalable, sites set up a separate system for storage (the Storage Element). Most sites have an attached SE, but on wide range of scales. These are separated from the compute cluster; normally, you interact it via a get or put of the file. - Not necessarily POSIX. 20
21 Storage Elements on the OSG User point of View SRM WAN Storage Element GridFTP Local Compute Cluster dcap dcap dcap Worker Node Worker Node Worker Node Worker Node 21
22 SE Internals Storage Element Namespace Service Auxiliary Services Disk Server Disk Server Disk Server Disk Server Disk Server GridFTP Xrootd SRM Only care about 22
23 GridFTP in One Slide A set of extensions to the classic FTP protocol. Two most important extensions: - Security: Authentication, encryption, and integrity provided by GSI/X509. Use proxies instead of username/password. - Improved Scalability: Instead of transferring a file over one TCP connection, multiple TCP connections may be used.
24 SRM SRM = Storage Resource Management SRM is a web-services-based protocol for doing: - Metadata Operations - Load-balancing - Space management This allows us to access storage remotely, and to treat multiple storage implementations in a homogeneous manner.
25 Data Access Methods A few common approaches to implementing solutions: - Job Sandbox - Prestaging - Caching - Remote I/O A realistic solution might combine multiple methods. Descriptions to follow discuss common cases; there are always exceptions or alternates.
26 Job Sandbox Data sent with jobs. The user generates the files on the submit machines. These files are copied from submit node to worker node. - With Condor, a job won t even start if the sandbox can t be copied first. You can always assume it is found locally.
27 Prestaging Data placed into local SE where the jobs are run. By increasing the number of locations of the data and doing intelligent placement, we increase scalability. - In some cases, prestaging = making a copy at each site where the job is run. - Not always true; a single storage element may be able to support the load of many sites.
28 Caching A cache transparently stores data close to the user jobs. Future requests can be served from the cache. - User must place file/object at a source site responsible for keeping the file permanently. Cache must be able to access the file at its source. - Requesting a file will bring it into the cache. - Once the cache becomes full, an eviction policy is invoked to decide what files to remove. - The cache, not the user, decides what files will be kept in the cache.
29 Remote I/O Application data is streamed from a remote site upon demand. I/O calls are transformed into network transfers. - Can be done transparently so the user application doesn t know the I/O isn t local. - Typically, only the requested bytes are moved. Can be effectively combined with caching to scale better.
30 Scalability For each of the four previous methods (sandboxes, prestaging, caching, remote I/O), how well do they scale? Depends on: - How big are the data sets? - What resources are consumed per job?
31 The Cost of Reliability For each of the four previous methods, what must be done to create a reliable system? Depends on: - What recovery must be done on failures? What must be done manually? Ideally, want to automate failure recovery as much as possible
32 Prestaging Data on the OSG Prestaging is currently the most popular data management method on the OSG, but requires the most work. - Requires a transfer system to move a list of files between two sites. Example systems: Globus Online, Stork, FTS. - Requires a bookkeeping system to record the location of datasets. Examples: LFC, DBS. - Requires a mechanism to verify the current validity of file locations in the bookkeeping systems.
33 Previous example In the first exercise, you formulated a data plan for BLAST. Now, you are going to implement the data movement of blast using prestaging. 33
34 Lecture 3: Caching and Remote I/O Horst Severini Slides courtesy of Derek Weitzel
35 HTTP Caching is easy on the OSG Widely deployed caching servers (required by other experiments) Tools are easy to use - curl, wget Servers are easy to setup - Just need http server, somewhere 35
36 HTTP Caching In practice, you can cache almost any URL Sites are typically configured to cache any URL, but only accept requests from WN s Sites control the size and policy for caches, keep this in mind 36
37 HTTP Caching - Pitfalls curl Have to add special argument to use caching: - curl -H "Pragma: <url>! Services like Dropbox and Google Docs explicitly disable caching - Dropbox: cache-control: max-age=0! - Google Docs: Cache-Control: private, max-age=0! 37
38 HTTP Caching The HTTP protocol is perhaps the most popular application-layer protocol on the planet. Built-in to HTTP 1.1 are elaborate mechanisms to use HTTP through a proxy and for caching. There are mature, widely-used commandline clients for the HTTP protocol on Linux. - Curl and wget are the most popular; we will use wget for the exercises. Oddly enough, curl disables caching by default.
39 HTTP Caching: What do you need? All you need is a public webserver somewhere. Will pull down files to the worker node through squid cache. 39
40 HTTP Cache Dataflow Source Site For this class, we will use: - Source site: osg-ss-submit.chtc.wisc.edu - HTTP cache: osg-ss-se.chtc.wisc.edu HTTP Get First Request (Proxied) HTTP Cache (Stores response) Execution Site HTTP Get First Request HTTP Get Second Request Worker Node User Process
41 Proxy Request Here s the HTTP headers for the request that goes to the proxy osg-ssse.cs.wisc.edu: GET HTTP/ 1.0! User-Agent: Wget/ Red Hat modified! Accept: */*! Host: osg-ss-submit.chtc.wisc.edu!
42 Proxy Request Response: HTTP/ OK! Date: Tue, 19 Jun :53:26 GMT! Server: Apache/2.2.3 (Scientific Linux)! Last-Modified: Tue, 19 Jun :26:56 GMT! ETag: "136d0f-59-4c2d9f120e800"! Accept-Ranges: bytes! Content-Length: 89! Content-Type: text/html; charset=utf-8! X-Cache: HIT from osg-ss-se.chtc.wisc.edu! X-Cache-Lookup: HIT from osg-ss-se.chtc.wisc.edu:3128! Via: 1.0 osg-ss-se.chtc.wisc.edu:3128 (squid/2.6.stable21)! Proxy-Connection: close!! 42
43 HTTP Proxy HTTP Proxy is very popular for small VO s Easy to manage, easy to implement Highly recommend HTTP Proxy for files less than 100MB 43
44 Now on to Remote I/O Do you only read parts of the data (skip around in the file?) Do you have data that is only available on the submit host? Do you have an application that requires a specific directory structure? 44
47 Remote I/O We will perform Remote I/O with the help of Parrot from ND. It redirects I/O requests back to the submit host. Can pretend that a directory is locally accessible. 47
48 Examples of Remote I/O What does Remote IO look like in practice? submit host: $ echo hello from submit host > newfile! script on the worker node (in Nebraska?) $ cat /chirp/condor/home/dweitzel/newfile! hello from submit host! $ 48
49 Celebrate! 49
50 Downsides of Remote I/O What are some downsides to Remote I/O? All I/O goes through the submit host. Distributed remote I/O is hard (though can be done, see XrootD) 50
The HTCondor CacheD Derek Weitzel, Brian Bockelman University of Nebraska Lincoln Today s Talk Today s talk summarizes work for my a part of my PhD Dissertation Also, this work has been accepted to PDPTA
Map-Reduce Marco Mura (firstname.lastname@example.org) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
Worldwide Production Distributed Data Management at the LHC Brian Bockelman MSST 2010, 4 May 2010 At the LHC http://op-webtools.web.cern.ch/opwebtools/vistar/vistars.php?usr=lhc1 Gratuitous detector pictures:
High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San
COSC 6397 Big Data Analytics Distributed File Systems Edgar Gabriel Spring 2017 What is a file system A clearly defined method that the OS uses to store, catalog and retrieve files Manage the bits that
Data Management 1 Grid data management Different sources of data Sensors Analytic equipment Measurement tools and devices Need to discover patterns in data to create information Need mechanisms to deal
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage
AstroGrid D Meeting at MPE 14 15. November 2006 Garching dcache A scalable storage element and its usage in HEP Martin Radicke Patrick Fuhrmann Introduction to dcache 2 Project overview joint venture between
I Tier-3 di CMS-Italia: stato e prospettive Claudio Grandi Workshop CCR GRID 2011 Outline INFN Perugia Tier-3 R&D Computing centre: activities, storage and batch system CMS services: bottlenecks and workarounds
CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes
Introduction to SRM Grid Storage Resource Manager Riccardo Zappi 1 1 INFN-CNAF, National Center of INFN (National Institute for Nuclear Physic) for Research and Development into the field of Information
Proxying p.1/24 Proxying Why and How Alon Altman email@example.com Haifa Linux Club Proxying p.2/24 Definition proxy \Prox"y\, n.; pl. Proxies. The agency for another who acts through the agent; authority
CMSC 332 Computer Networking Web and FTP Professor Szajda CMSC 332: Computer Networks Project The first project has been posted on the website. Check the web page for the link! Due 2/2! Enter strings into
Bootstrapping a (New?) LHC Data Transfer Ecosystem Brian Paul Bockelman, Andy Hanushevsky, Oliver Keeble, Mario Lassnig, Paul Millar, Derek Weitzel, Wei Yang Why am I here? The announcement in mid-2017
MP 1: HTTP Client + Server Due: Friday, Feb 9th, 11:59pm Please read all sections of this document before you begin coding. In this assignment, you will implement a simple HTTP client and server. The client
and the GridKa mass storage system / GridKa [Tape TSM] staging server 2 Introduction Grid storage and storage middleware dcache h and TSS TSS internals Conclusion and further work 3 FZK/GridKa The GridKa
Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why the Grid? Science is becoming increasingly digital and needs to deal with increasing amounts of
Distributed File Systems Today l Basic distributed file systems l Two classical examples Next time l Naming things xkdc Distributed File Systems " A DFS supports network-wide sharing of files and devices
Triton file systems - an introduction slide 1 of 28 File systems Motivation & basic concepts Storage locations Basic flow of IO Do's and Don'ts Exercises slide 2 of 28 File systems: Motivation Case #1:
The EPIKH Project (Exchange Programme to advance e-infrastructure Know-How) glite Grid Services Overview Antonio Calanducci INFN Catania Joint GISELA/EPIKH School for Grid Site Administrators Valparaiso,
MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department
Grid services Dusan Vudragovic firstname.lastname@example.org Scientific Computing Laboratory Institute of Physics Belgrade, Serbia Sep. 19, 2008 www.eu-egee.org Set of basic Grid services Job submission/management
CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 24 File Systems Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Questions from last time How
Lecture 20 -- 11/20/2017 BigTable big picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures what does paper say Google
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single
Distributed System Gang Wu Spring,2018 Lecture7:DFS What is DFS? A method of storing and accessing files base in a client/server architecture. A distributed file system is a client/server-based application
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems GFS (The Google File System) 1 Filesystems
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma email@example.com Fall 2015 004395771 Overview Google file system is a scalable distributed file system
Application Protocols and HTTP 14-740: Fundamentals of Computer Networks Bill Nace Material from Computer Networking: A Top Down Approach, 6 th edition. J.F. Kurose and K.W. Ross Administrivia Lab #0 due
CS 555: DISTRIBUTED SYSTEMS [DYNAMO & GOOGLE FILE SYSTEM] Frequently asked questions from the previous class survey What s the typical size of an inconsistency window in most production settings? Dynamo?
MapReduce: Recap Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Sequentially read a lot of data Why? Map: extract something we care about map (k, v)
Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the
Distributed production managers meeting Armando Fella on behalf of Italian distributed computing group Distributed Computing human network CNAF Caltech SLAC McGill Queen Mary RAL LAL and Lyon Bari Legnaro
Parallel File Systems John White Lawrence Berkeley National Lab Topics Defining a File System Our Specific Case for File Systems Parallel File Systems A Survey of Current Parallel File Systems Implementation
Chapter 2 CommVault Data Management Concepts 10 - CommVault Data Management Concepts The Simpana product suite offers a wide range of features and options to provide great flexibility in configuring and
HTTP Protocol and Server-Side Basics Web Programming Uta Priss ZELL, Ostfalia University 2013 Web Programming HTTP Protocol and Server-Side Basics Slide 1/26 Outline The HTTP protocol Environment Variables
Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented
Data Storage Paul Millar dcache Overview Introducing storage How storage is used Challenges and future directions 2 (Magnetic) Hard Disks 3 Tape systems 4 Disk enclosures 5 RAID systems 6 Types of RAID
A Federated Grid Environment with Replication Services Vivek Khurana, Max Berger & Michael Sobolewski SORCER Research Group, Texas Tech University Grids can be classified as computational grids, access
Storage Virtualization Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan Storage Virtualization In computer science, storage virtualization uses virtualization to enable better functionality
Computer Networks Wenzhong Li Nanjing University 1 Chapter 8. Internet Applications Internet Applications Overview Domain Name Service (DNS) Electronic Mail File Transfer Protocol (FTP) WWW and HTTP Content
CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but
Grid Data Management Week #4 Hardi Teder firstname.lastname@example.org University of Tartu March 6th 2013 Overview Grid Data Management Where the Data comes from? Grid Data Management tools 2/33 Grid foundations 3/33
Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)
INF5750 RESTful Web Services Recording Audio from the lecture will be recorded! Will be put online if quality turns out OK Outline REST HTTP RESTful web services HTTP Hypertext Transfer Protocol Application
Int'l Conf Par and Dist Proc Tech and Appl PDPTA'15 341 Distributed Caching Using the Derek Weitzel, Brian Bockelman, and David Swanson Computer Science and Engineering University of Nebraska Lincoln Lincoln,
Mobile Application Development Higher Diploma in Science in Computer Science Produced by Eamonn de Leastar (email@example.com) Department of Computing, Maths & Physics Waterford Institute of Technology
Data Movement and Storage 04/07/09 www.cac.cornell.edu 1 Data Location, Storage, Sharing and Movement Four of the seven main challenges of Data Intensive Computing, according to SC06. (Other three: viewing,
Filesystem Disclaimer: some slides are adopted from book authors slides with permission 1 Recap Directory A special file contains (inode, filename) mappings Caching Directory cache Accelerate to find inode
Journal of Physics: Conference Series PAPER OPEN ACCESS File Access Optimization with the Lustre Filesystem at Florida CMS T2 To cite this article: P. Avery et al 215 J. Phys.: Conf. Ser. 664 4228 View
Simplifying Wide-Area Application Development with WheelFS Jeremy Stribling In collaboration with Jinyang Li, Frans Kaashoek, Robert Morris MIT CSAIL & New York University Resources Spread over Wide-Area
INTERNET ENGINEERING HTTP Protocol Sadegh Aliakbary Agenda HTTP Protocol HTTP Methods HTTP Request and Response State in HTTP Internet Engineering 2 HTTP HTTP Hyper-Text Transfer Protocol (HTTP) The fundamental
Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing
AMGA metadata catalogue system Hurng-Chun Lee ACGrid School, Hanoi, Vietnam www.eu-egee.org EGEE and glite are registered trademarks Outline AMGA overview AMGA Background and Motivation for AMGA Interface,
BOSCO Architecture Derek Weitzel University of Nebraska Lincoln Goals We want an easy to use method for users to do computational research It should be easy to install, use, and maintain It should be simple
ARC integration for CMS ARC integration for CMS Erik Edelmann 2, Laurence Field 3, Jaime Frey 4, Michael Grønager 2, Kalle Happonen 1, Daniel Johansson 2, Josva Kleist 2, Jukka Klem 1, Jesper Koivumäki
HTTP Reading: Section 9.1.2 and 9.4.3 COS 461: Computer Networks Spring 2013 1 Recap: Client-Server Communication Client sometimes on Initiates a request to the server when interested E.g., Web browser
Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction
Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony
Map Reduce Erasmus+ @ Yerevan firstname.lastname@example.org Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate
Distributed Monte Carlo Production for Joel Snow Langston University DOE Review March 2011 Outline Introduction FNAL SAM SAMGrid Interoperability with OSG and LCG Production System Production Results LUHEP
Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google 2017 fall DIP Heerak lim, Donghun Koo 1 Agenda Introduction Design overview Systems interactions Master operation Fault tolerance
Installation of CMSSW in the Grid DESY Computing Seminar May 17th, 2010 Wolf Behrenhoff, Christoph Wissing Wolf Behrenhoff, Christoph Wissing DESY Computing Seminar May 17th, 2010 Page 1 Installation of
Outline Introduce Socket Programming Domain Name Service (DNS) Standard Application-level Protocols email (SMTP) HTTP HyperText Transfer Protocol Defintitions A web page consists of a base HTML-file which
COSC 2206 Internet Tools The HTTP Protocol http://www.w3.org/protocols/ What is TCP/IP? TCP: Transmission Control Protocol IP: Internet Protocol These network protocols provide a standard method for sending
XRAY Grid TO BE OR NOT TO BE? 1 I was not always a Grid sceptic! I started off as a grid enthusiast e.g. by insisting that Grid be part of the ESRF Upgrade Program outlined in the Purple Book : In this
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints
Cloud background Google File System! Warehouse scale systems " 10K-100K nodes " 50MW (1 MW = 1,000 houses) " Power efficient! Located near cheap power! Passive cooling! Power Usage Effectiveness = Total
1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems
Contents About this course 1 Recommended chapters.............................................. 1 A note about solutions............................................... 2 Exercises 2 Your first script (recommended).........................................
Persistence & State SWE 432, Fall 2016 Design and Implementation of Software for the Web Today What s state for our web apps? How do we store it, where do we store it, and why there? For further reading:
Textual Description of webbioc Colin A. Smith October 13, 2014 Introduction webbioc is a web interface for some of the Bioconductor microarray analysis packages. It is designed to be installed at local
Google File System goals monitoring, fault tolerance, auto-recovery (thousands of low-cost machines) focus on multi-gb files handle appends efficiently (no random writes & sequential reads) co-design GFS
Efficient HTTP based I/O on very large datasets for high performance computing with the Libdavix library Authors Devresse Adrien (CERN) Fabrizio Furano (CERN) Typical HPC architecture Computing Cluster
Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able
Seminar on By Sai Rahul Reddy P 2/2/2005 Web Caching 1 Topics covered 1. Why Caching 2. Advantages of Caching 3. Disadvantages of Caching 4. Cache-Control HTTP Headers 5. Proxy Caching 6. Caching architectures