Clustering and Reclustering HEP Data in Object Databases

Koen Holtman
CERN EP division, CH-1211 Geneva 23, Switzerland

We formulate principles for the clustering of data, applicable both to sequential HEP applications and to farming HEP applications with a high degree of concurrency. We make the case for the reclustering of HEP data on the basis of performance measurements and briefly discuss a prototype automatic reclustering system.

1 Introduction

As part of the CMS contribution to the RD45 [1] collaboration, database clustering and reclustering have been under investigation for some time. The clustering of objects in an object database is the mapping of objects to locations on physical storage media like disk farms and tapes. The performance of the database, and of the physics application on top of it, depends crucially on having a good match between the object clustering and the database access operations performed by the physics application.

The principles for clustering discussed in this paper are based on, and illustrated with, a set of performance measurements. The measurements shown in this paper were all performed on a Sun Ultra Enterprise server running SunOS Release 5.5.1, with hard disks in a SPARCstorage array (no striping, no other RAID-type processing; 2.1-GB, 7200-rpm, fast-wide SCSI-2 Seagate ST-32550W drives). These disks can be considered typical for the high end of the 1994 commodity disk market. All performance results were cross-validated on at least one other hardware/OS configuration, and most results on at least two other configurations. The object database system used was always Objectivity/DB version 4 [2].

2 HEP data clustering basics

Most I/O-intensive physics analysis systems, no matter what the implementation method, and no matter whether they are tape or disk based, use the following simple principles to optimise performance (a small sketch illustrating them in code is given at the end of this section):

1. Divide the set of all events into fairly large chunks (in most current systems a chunk is a run or a part of a physics stream [3]).
2. Implement farming (both for disks and CPUs) at the chunk level.
3. Make sure that (sub)jobs always iterate through the events in a chunk in the same order.
4. Cluster the event data in a chunk in the iteration order.

Though object databases make it perhaps easier than ever to build physics analysis systems which do not follow the principles above, we believe that these principles are currently still the most viable basis for designing a performant production system.

Principle 1 above, dividing the event set into chunks, involves coarse-grained clustering decisions: strategies like dividing events into physics streams [3] are often used here, and newer strategies are a topic of active research [4]. At the chunk level, reducing tape mounts is a very important goal, and an important constraint is that it is not feasible to recluster data often.
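
As an illustration of principles 1-4, the sketch below divides an event set into chunks, farms whole chunks out to nodes, and clusters each chunk in the one fixed order in which all (sub)jobs will later iterate over it. It is a minimal Python sketch with invented names and sizes; the test and production systems discussed in this paper are based on Objectivity/DB, not on this code.

    # Sketch of principles 1-4: chunking, farming at the chunk level,
    # a fixed per-chunk iteration order, and clustering in that order.
    # All names and sizes are illustrative.

    CHUNK_SIZE = 10000  # events per chunk (assumed value)

    def make_chunks(event_ids):
        """Principle 1: divide the event set into fairly large chunks."""
        return [event_ids[i:i + CHUNK_SIZE]
                for i in range(0, len(event_ids), CHUNK_SIZE)]

    def assign_chunks_to_nodes(chunks, n_nodes):
        """Principle 2: farm out whole chunks to disk/CPU nodes."""
        return {node: chunks[node::n_nodes] for node in range(n_nodes)}

    def iteration_order(chunk):
        """Principle 3: every (sub)job uses the same, fixed order."""
        return sorted(chunk)

    def cluster_chunk(storage, chunk, read_event_data):
        """Principle 4: write the event data in the iteration order, so
        that later jobs read it back near-sequentially."""
        for event_id in iteration_order(chunk):
            storage.append(event_id, read_event_data(event_id))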

Most of this paper deals with the refinement of principle 4 above, that is, with clustering and reclustering decisions at the sub-chunk level. At this level, an important goal is to achieve near-sequential reading on the disk or disk farm, and frequent reclustering is feasible as a strategy for optimising performance and reducing disk space occupancy. The clustering techniques discussed in this paper can be used with both the Objectivity/DB [2] and Versant [5] object databases, though Objectivity offers far more direct and convenient support for them. Note that clustering and reclustering are not problems specific to commercial object databases. At the core, the clustering problem is one of reducing disk seeks, tape seeks and tape mounts, and this problem exists equally in physics analysis systems not based on object databases, even though other systems may use different terminology to describe it.

3 Type-based clustering

The most obvious way to refine principle 4 above is to cluster data as shown in Fig. 1. For each event in the chunk, the event data is split into several objects of different types; for example, one type can hold all data for a single subdetector. These objects are then grouped by type into collections. Inside each collection, the objects for the different events are clustered in the iteration order. This way, a job which only needs one type of data per event automatically performs sequential reading over a single collection, which contains exactly the needed data, yielding the maximum achievable I/O performance.

Figure 1: Clustering of a chunk into collections.

There are, however, some performance pitfalls for jobs which need to read two or more collections from the same disk or disk array. A job reading two collections has a logical object reading pattern as shown on the left in Fig. 2. To achieve near-sequential throughput for such a job, the logical pattern needs to be transformed into the physical pattern shown on the right in Fig. 2. We found that this transformation was not performed by Objectivity/DB, the database on which we implemented our test system, nor by the operating system (we tested both SunOS and HP-UX), nor by the disk hardware of various commodity brands. The result was a significant performance degradation, especially when reading more than two collections; see the solid line in Fig. 3.

Figure 2: Reading two collections: logical pattern (left) and preferred physical pattern (right).

Figure 3: One client reading multiple collections of 8 KB objects from a single disk: read performance (MB/s) versus the number of collections, without read-ahead and with read-ahead of different sizes.

We eliminated the performance degradation by extending the collection indexing/iteration class to read ahead objects into the database cache. This extension could be made without affecting the end-user physics analysis code. Measurements (see Fig. 3) showed that when 800 KB worth of objects were read ahead for each collection, the I/O throughput approached that of sequential reading (3.9 MB/s).
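
The read-ahead extension itself lives inside the collection iteration classes of the Objectivity/DB test system and is not reproduced here. The sketch below only illustrates the mechanism in Python, with invented names: each collection iterator fetches its next batch of objects, up to a configurable byte budget, in one sequential request, so that a job interleaving several iterators still issues large per-collection sequential reads.

    # Illustrative read-ahead iterator (invented API, not Objectivity/DB).
    # Objects of one collection are fetched in batches of roughly
    # READ_AHEAD_BYTES, so that a job reading several collections still
    # performs large sequential reads per collection instead of
    # alternating single-object reads.

    from collections import deque

    READ_AHEAD_BYTES = 800 * 1024   # the best setting found in Fig. 3

    class ReadAheadIterator:
        def __init__(self, collection):
            # 'collection' is assumed to support len(), object_size(i)
            # and fetch_range(begin, end).
            self.collection = collection
            self.next_index = 0
            self.buffer = deque()

        def _fill_buffer(self):
            size = 0
            end = self.next_index
            # Grow the batch until ~READ_AHEAD_BYTES of objects are covered.
            while end < len(self.collection) and size < READ_AHEAD_BYTES:
                size += self.collection.object_size(end)
                end += 1
            # One sequential request for the whole batch.
            self.buffer.extend(self.collection.fetch_range(self.next_index, end))
            self.next_index = end

        def __iter__(self):
            return self

        def __next__(self):
            if not self.buffer:
                if self.next_index >= len(self.collection):
                    raise StopIteration
                self._fill_buffer()
            return self.buffer.popleft()

A job which zips two such iterators together then produces, per collection, something close to the preferred physical pattern on the right of Fig. 2.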

Keeping all collections on different disks would of course be an alternative to the read-ahead optimisation. That approach would however create a load-balancing problem: for optimal performance one has to make sure that all disks are kept busy, even for jobs which only read one or a few collections. The problem can be solved to some extent by mapping the collections of different chunks to disks as shown in Fig. 4. This produces load balancing for any number of collections, assuming that the subjobs running in parallel on each chunk are about equally heavy. A problem with this solution is that it requires a higher degree of connectedness between all disks and all CPUs. We therefore prefer to use the read-ahead optimisation: by devoting a modest amount of memory (given current RAM prices) to read-ahead buffering, we can keep the objects for one event together on the same disk, which gives us greater fault tolerance and decreases the dependence on subjob scheduling.

Figure 4: Load-balancing arrangement as an alternative to using a read-ahead optimisation: the type X, Y and Z collections of successive chunks are rotated over disks 1, 2 and 3.
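
The arrangement of Fig. 4 amounts to rotating the collection types over the available disks from one chunk to the next. A minimal sketch of such a mapping (illustrative only, and assuming roughly equally heavy subjobs per chunk):

    # Sketch of the Fig. 4 arrangement: rotate the collections of
    # successive chunks over the disks, so that a job which reads only
    # one collection type still spreads its load over all disks.

    def disk_for(chunk_index, type_index, n_disks):
        return (chunk_index + type_index) % n_disks

    # With 3 disks and types X=0, Y=1, Z=2: chunk 0 stores X on disk 0,
    # Y on disk 1 and Z on disk 2; chunk 1 shifts each by one disk, etc.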

4 Random database access

For small objects, a good clustering is more important than for large objects. This is illustrated by Fig. 5, which plots the ratio between the speed of sequential reading and that of random reading for different object sizes. Fig. 5 shows a curve for 1994 disks and a curve for disks projected several years ahead, based on an analysis of hard disk technology trends [6]. The performance of sequential reading is the performance of the best possible clustering arrangement; that of random reading is the performance of the worst possible clustering arrangement. Fig. 5 therefore also plots the worst-case performance loss in the case of bad clustering. We see that currently, for objects larger than 64 KB, clustering is not that important: the performance loss for bad clustering is never more than a factor of two. The projected curve shows, however, that the importance of good clustering will increase in the future.

Figure 5: Performance ratio between sequential and random reading, as a function of average object size, for 1994 disks and for projected future disks.

5 Selective reading

Physics analysis jobs usually do not read all objects in a collection: they only iterate through a subset of the collection, corresponding to those events which satisfy a certain cut predicate. We call this iteration through a subset selective reading. The selectivity is the percentage of objects in the collection which is needed by the job. In tests we found that, as the selectivity decreases, the throughput of selective reading drops rapidly, only to level out at the throughput of random reading. This is shown, for a collection of 8 KB objects with an 8 KB database page size, in Fig. 6. In other tests we found that the curve in Fig. 6 does not change much as a function of the page size.

Figure 6: Selective reading of 8 KB objects: bandwidth (MB/s) as a function of selectivity, with the sequential and random reading rates marked.

The curve in Fig. 6 has two distinct parts. In the part running from 100% selectivity down to the boundary value visible in Fig. 6, the decrease in throughput exactly mirrors the decrease in selectivity: if we had sequentially read all objects and then thrown away the unneeded ones, the job would have taken the same time. Thus, in this part of the curve, selective reading is useless as an optimisation device if the job is disk-bound. Selective reading will, however, decrease the load on the CPU, the cache, and (if applicable) the network connection to the remote disk. This reduction in load depends largely on the selectivity on database pages, not on objects; see [7] for a discussion of page selectivity.

In the part of the curve between the boundary and 0%, selective reading is faster than sequential reading and then throwing data away. On the other hand, it is not faster than random reading. We found that the location of the boundary between the two parts of the curve depends on the average object size; the boundary is visualised in Fig. 7.

Figure 7: Selective reading performance boundary in the plane of average object size versus selectivity, separating the region where selective reading is only as fast as sequential reading from the region where it is faster.

From Fig. 7 we can conclude that for collections of small objects, selective reading may not be worth the extra indexing complexity over sequential reading and throwing the unneeded data away. A corollary of this is that one could pack several small logical objects into larger physical objects without a loss of performance, even for jobs which need only a small fraction of the objects. For collections of large objects, a selective reading mechanism can be useful as a means of ensuring that the performance never drops below that of random reading.
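
The boundary in Fig. 7 follows from the same seek-versus-transfer argument that underlies the ratio in Fig. 5: if each selectively read object costs, at worst, one random access, then selective reading beats sequentially scanning the whole collection roughly when the selectivity is below the ratio of the random and sequential reading rates. The sketch below works this out for assumed, illustrative disk parameters; it does not reproduce the measured values behind Figs. 5-7.

    # Back-of-the-envelope model for the selective-reading boundary.
    # The disk parameters are assumed, illustrative values.

    SEEK_TIME_S = 0.012      # average seek plus rotational latency
    TRANSFER_MB_S = 5.0      # sustained sequential transfer rate

    def random_rate_mb_s(object_kb):
        """Effective rate when every object costs a seek plus its transfer."""
        object_mb = object_kb / 1024.0
        return object_mb / (SEEK_TIME_S + object_mb / TRANSFER_MB_S)

    def boundary_selectivity(object_kb):
        """Selectivity below which reading only the selected objects beats
        sequentially reading everything and discarding the rest."""
        return random_rate_mb_s(object_kb) / TRANSFER_MB_S

    for kb in (1, 8, 64):
        print(kb, "KB objects: boundary at about",
              round(100 * boundary_selectivity(kb), 1), "% selectivity")

Under these assumed parameters the boundary moves from a few percent for 1 KB objects to several tens of percent for 64 KB objects, which is the qualitative behaviour described above.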

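The packing corollary can be illustrated in the same style. In the sketch below (an invented container layout, not a feature of any particular database product), consecutive small logical objects are stored inside larger physical objects in the iteration order, and the cut predicate is applied only after unpacking; some bandwidth is wasted on unneeded objects, but for small objects this costs no more than selective reading would.

    # Sketch of packing small logical objects into larger physical objects.
    # PACK_FACTOR and the container layout are illustrative assumptions.

    PACK_FACTOR = 64   # logical objects per physical container object

    def pack(logical_objects):
        """Group consecutive logical objects (in iteration order) into
        larger physical container objects."""
        return [logical_objects[i:i + PACK_FACTOR]
                for i in range(0, len(logical_objects), PACK_FACTOR)]

    def selective_scan(containers, cut):
        """Read whole containers near-sequentially and apply the cut
        after unpacking; unneeded logical objects are simply discarded."""
        for container in containers:
            for obj in container:
                if cut(obj):
                    yield obj
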
6 Reclustering

To avoid the performance degradation associated with decreasing selectivity which we discussed above, reclustering could be used. Reclustering, that is, re-arranging the objects in the database, is an expensive operation, but just letting the performance drop as the selectivity decreases can easily be more costly, especially for collections in which the objects are small. The simplest form of reclustering is to copy only those objects which are actually wanted in a particular analysis effort to a new collection at the start of the effort. The creation of a data summary tape or an ntuple file is an example of this simple form of reclustering.

Much more advanced forms of reclustering are feasible in a system based on an object database. Automatic reclustering, in which the system reacts to changing access patterns without any user hints beforehand, is feasible whenever there are sequences of jobs which access the same event set. We have prototyped an automatic reclustering system ([8], [9]) which

- performs reclustering transparently to the user code,
- can optimise clustering for four different analysis efforts at the same time, and
- keeps the use of storage space within bounds by avoiding the duplication of data.

We refer the reader to [8] for a discussion of the architecture of our automatic reclustering system. Fig. 8 illustrates its performance under a simple physics analysis scenario in which 40 subsequent jobs are run, with a new cut predicate being added after every 10 jobs.

Figure 8: Performance of 40 subsequent jobs without (left) and with (right) automatic reclustering. Each pair of bars represents a job: the black bar represents the number of objects accessed, the grey bar the (wall clock) job run time. The right-hand plot also marks the batch reclustering operations.

Reclustering is an important research topic in the object database community (see [8] for some references). However, this research is directed at typical object database workloads such as CAD workloads. Physics analysis workloads are highly atypical: they are mostly read-only, transactions routinely access millions of objects, and, most importantly, the workloads lend themselves to streaming-type optimisations. It is conceivable that vendors will bundle general-purpose automatic reclustering systems with future versions of object database products, but we do not expect that such products will be able to provide efficient reclustering for physics workloads. As far as reclustering is concerned, physics analysis is too atypical to be provided for by the market. We therefore conclude that the HEP community will have to develop its own reclustering systems.
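
This simplest form of reclustering is just a filtered, order-preserving copy made at the start of an analysis effort. A minimal sketch with invented names (in practice this role is played by data summary tape or ntuple production):

    # Simplest form of reclustering: at the start of an analysis effort,
    # copy only the objects that pass the effort's cut into a new
    # collection, preserving the iteration order so that later jobs can
    # read the new collection sequentially.

    def recluster(source_collection, cut, new_collection):
        copied = 0
        for obj in source_collection:        # one sequential pass
            if cut(obj):
                new_collection.append(obj)   # written in iteration order
                copied += 1
        return copied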

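The architecture of the automatic reclustering prototype is described in [8] and is not reproduced here. Purely as an illustration of the kind of bookkeeping such a system needs, the toy fragment below records which subset of a collection each job touched and schedules a batch reclustering operation once the same subset has been requested often enough to pay back the copying cost; the threshold and data structures are invented and are not those of the prototype.

    # Toy trigger for automatic reclustering; not the system of [8].
    # Access patterns are tracked per collection (identified by name),
    # and a batch reclustering pass is scheduled once a pattern repeats
    # often enough to amortise the cost of copying.

    REPEAT_THRESHOLD = 3     # invented value
    access_counts = {}       # (collection name, object id subset) -> count

    def record_job_access(collection_name, object_ids, schedule_reclustering):
        key = (collection_name, frozenset(object_ids))
        access_counts[key] = access_counts.get(key, 0) + 1
        if access_counts[key] == REPEAT_THRESHOLD:
            # The same subset keeps being read: storing it contiguously
            # (without duplicating the rest of the data) should pay off.
            schedule_reclustering(collection_name, object_ids)
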
7 Conclusions

We have discussed principles for the clustering and reclustering of HEP data. The performance graphs in this paper can be used to decide, for a given physics analysis scenario, whether certain clustering techniques can be ignored without too much loss of performance, whether they need to be considered, or whether they are indispensable.

We have shown performance measurements mainly for the single-client, single-disk case. In additional performance tests ([6], [10]) we have verified that the techniques described above are also applicable to a system with disk and processor farming. Specifically, if a client is optimised to access the database with a good clustering efficiency, then it is possible to run many such clients concurrently, all accessing the same disk farm, without any significant performance degradation. Furthermore, the operating system will ensure that each client gets an equal share of the available disk resources. For a detailed discussion of the scalability of farming configurations with hundreds of clients, we refer the reader to [10].

References

[1] RD45, A Persistent Storage Manager for HEP.
[2] Objectivity/DB.
[3] D. Baden et al., Joint DØ/CDF/CD Run II Data Management Needs Assessment, CDF/DOC/COMP UPG/PUBLIC/400, DØ Note 397, March 1997.
[4] Grand Challenge Application on HENP Data.
[5] The Versant object database.
[6] K. Holtman, Prototyping of CMS Storage Management, CMS Note.
[7] The RD45 collaboration, Using an object database and mass storage system for physics analysis, CERN/LHCC 97-9, April 1997.
[8] K. Holtman, P. van der Stok, I. Willers, Automatic Reclustering of Objects in Very Large Databases for High Energy Physics, Proc. of IDEAS'98, Cardiff, UK, IEEE, 1998.
[9] Reclustering Object Store Library for LHC++.
[10] K. Holtman, J. Bunn, Scalability to Hundreds of Clients in HEP Object Databases, Proc. of CHEP'98, Chicago, USA.
