Clustering and Reclustering HEP Data in Object Databases


Koen Holtman
CERN EP division, CH-1211 Geneva 23, Switzerland

We formulate principles for the clustering of data, applicable both to sequential HEP applications and to farming HEP applications with a high degree of concurrency. We make the case for the reclustering of HEP data on the basis of performance measurements and briefly discuss a prototype automatic reclustering system.

1 Introduction

As part of the CMS contribution to the RD45 [1] collaboration, database clustering and reclustering have been under investigation for the past few years. The clustering of objects in an object database is the mapping of objects to locations on physical storage media like disk farms and tapes. The performance of the database, and of the physics application on top of it, depends crucially on having a good match between the object clustering and the database access operations performed by the physics application.

The principles for clustering discussed in this paper are based on, and illustrated with, a set of performance measurements. These measurements were all performed on a Sun Ultra Enterprise server, running SunOS Release 5.5.1, on hard disks in a SPARCstorage array (no striping, no other RAID-type processing; 2.1 GB, 7200 rpm, fast-wide SCSI-2 Seagate ST-32550W drives). These disks can be considered typical for the high end of the 1994 commodity disk market. All performance results were cross-validated on at least one other hardware/OS configuration, most results on at least two other configurations. The object database system used was always Objectivity/DB version 4 [2].

2 HEP data clustering basics

Most I/O intensive physics analysis systems, no matter what the implementation method, and no matter whether tape or disk based, use the following simple principles to optimise performance:

1. Divide the set of all events into fairly large chunks (in most current systems a chunk is a run or a part of a physics stream [3]).
2. Implement farming (both for disks and CPUs) at the chunk level.
3. Make sure that (sub)jobs always iterate through the events in a chunk in the same order.
4. Cluster the event data in a chunk in the iteration order.

Though object databases make it perhaps easier than ever to build physics analysis systems which do not follow the principles above, we believe that these principles are currently still the most viable basis for designing a performant production system (a code sketch of the principles closes this section).

Principle 1 above, dividing the event set into chunks, involves coarse-grained clustering decisions: strategies like dividing events into physics streams [3] are often used here, and newer strategies are a topic of active research [4]. At the chunk level, reducing tape mounts is a very important goal, and an important constraint is that it is not feasible to recluster data often.
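As an illustration, here is a minimal Python sketch of the four principles above. It is not from the original paper: the chunk size, the helper names (read_event, analyse) and the farming mechanism are all assumptions made for the example.

```python
from multiprocessing import Pool

CHUNK_SIZE = 100_000  # events per chunk; "fairly large" (value illustrative)

def make_chunks(event_ids):
    """Principle 1: divide the event set into fairly large chunks."""
    return [event_ids[i:i + CHUNK_SIZE]
            for i in range(0, len(event_ids), CHUNK_SIZE)]

def read_event(event_id):
    """Stub for reading one event's objects; in a real system this is a
    database access that is sequential when principle 4 holds."""
    return {"id": event_id}

def analyse(event):
    """Stub for the user's analysis code."""
    return event["id"]

def process_chunk(chunk):
    """Principles 3 and 4: every (sub)job walks its chunk in one fixed
    order, which is also the order in which the event data is clustered
    on disk, so the chunk is read near-sequentially."""
    return [analyse(read_event(eid)) for eid in chunk]

def run_analysis(event_ids, n_workers=4):
    """Principle 2: farm the work out at the chunk level, one subjob per chunk."""
    with Pool(n_workers) as pool:
        return pool.map(process_chunk, make_chunks(event_ids))

if __name__ == "__main__":
    print(len(run_analysis(list(range(250_000)))))  # 3 chunks
```

The essential point is that process_chunk always visits events in the stored order, so a well-clustered chunk is read near-sequentially no matter how many subjobs run in parallel.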

Most of this paper deals with the refinement of principle 4 above, that is, with clustering and reclustering decisions at the sub-chunk level. At this level, an important goal is to achieve near-sequential reading on the disk or disk farm, and frequent reclustering is feasible as a strategy for optimising performance and reducing disk space occupancy.

The clustering techniques discussed in this paper can be used with both the Objectivity/DB [2] and Versant [5] object databases, though Objectivity offers far more direct and convenient support for them. Note that clustering and reclustering are not problems which are specific to commercial object databases. At the core, the clustering problem is one of reducing disk seeks, tape seeks and tape mounts, and this problem exists equally well in physics analysis systems not based on object databases, even though other systems may use different terminology to describe it.

3 Type-based clustering

The most obvious way to refine principle 4 above is to cluster data as shown in Fig. 1. For each event in the chunk, the event data is split into several objects of different types; for example, one type can hold all data for a single subdetector. Then these objects are grouped by type into collections. Inside each collection, the objects for the different events are clustered in the iteration order. This way, a job which only needs one type of data per event automatically performs sequential reading over a single collection, which exactly contains the needed data, yielding the maximum achievable I/O performance.

Figure 1: Clustering of a chunk into collections. The objects of events 1..N (object types: Detector X, Detector Y, Detector Z, reconstructed P's, event summary tags) are grouped by type into collections.

There are some performance pitfalls, however, for jobs which need to read two or more collections from the same disk or disk array. A job reading two collections has a logical object reading pattern as shown on the left in Fig. 2. To achieve near-sequential throughput for such a job, the logical pattern needs to be transformed into the physical pattern on the right in Fig. 2.

Figure 2: Reading two collections: logical pattern (left) and preferred physical pattern (right).

We found that this transformation was not performed by Objectivity/DB, the database on which we implemented our test system, nor by the operating system (we tested both SunOS and HP-UX), nor by the disk hardware, for various commodity brands. The result was a significant performance degradation, especially when reading more than two collections from a single disk; see the solid line in Fig. 3.

Figure 3: One client reading multiple collections of 8 KB objects from a single disk. Read performance (MB/s) versus number of collections, with 800 KB read-ahead, 160 KB read-ahead, and no read-ahead.

We eliminated the performance degradation by extending the collection indexing/iteration class to read ahead objects into the database cache.
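The sketch below shows the idea of such a read-ahead iterator. It is a reconstruction, not the actual Objectivity/DB extension: the collection interface (an iterator yielding object/size pairs) and the batch-fetch strategy are assumptions.

```python
from collections import deque

READ_AHEAD_BYTES = 800 * 1024  # per-collection read-ahead; 800 KB sufficed in Fig. 3

class ReadAheadIterator:
    """Iterates over one collection, but pulls objects from the underlying
    store in large batches. When several such iterators over co-located
    collections are interleaved by a job, each batch is one long
    near-sequential sweep instead of many alternating short reads."""

    def __init__(self, collection):
        self.collection = iter(collection)  # yields (object, size_in_bytes)
        self.buffer = deque()

    def _fill(self):
        """Fetch roughly READ_AHEAD_BYTES worth of objects in one sweep."""
        nbytes = 0
        while nbytes < READ_AHEAD_BYTES:
            try:
                obj, size = next(self.collection)
            except StopIteration:
                break
            self.buffer.append(obj)
            nbytes += size

    def __iter__(self):
        return self

    def __next__(self):
        if not self.buffer:
            self._fill()
        if not self.buffer:
            raise StopIteration
        return self.buffer.popleft()

# A job reading two collections per event (db_iter is a hypothetical
# function returning an (object, size) iterator for a stored collection):
# for x, y in zip(ReadAheadIterator(db_iter("chunk1/DetX")),
#                 ReadAheadIterator(db_iter("chunk1/DetY"))):
#     process(x, y)
```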

This extension could be made without affecting the end-user physics analysis code. Measurements (see Fig. 3) showed that when 800 KB worth of objects was read ahead for each collection, the I/O throughput approached that of sequential reading (3.9 MB/s).

Keeping all collections on different disks would of course be an alternative to the read-ahead optimisation. That approach would, however, create a load balancing problem: for optimal performance one has to make sure that all disks are kept busy, even for jobs which only read one or a few collections. The problem can be solved to some extent by mapping the collections in different chunks to disks as shown in Fig. 4. This will produce load balancing for any number of collections, assuming that the subjobs running in parallel on each chunk are about equally heavy. A problem with this solution is that it requires a higher degree of connectedness between all disks and all CPUs. We therefore prefer to use the read-ahead optimisation: by devoting a modest amount of memory (given current RAM prices) to read-ahead buffering, we can keep the objects for one event together on the same disk, which gives us greater fault-tolerance and decreases the dependence on subjob scheduling.

Figure 4: Load-balancing arrangement as an alternative to using a read-ahead optimisation. Disk 1 holds chunk 1 type X, chunk 2 type Z, chunk 3 type Y; disk 2 holds chunk 1 type Y, chunk 2 type X, chunk 3 type Z; disk 3 holds chunk 1 type Z, chunk 2 type Y, chunk 3 type X.

4 Random database access

For small objects, a good clustering is more important than for large objects. This is illustrated by Fig. 5, which plots the ratio between the speed of sequential reading and that of random reading for different object sizes. Fig. 5 shows a curve for 1994 disks and one for disks in the year 2005, based on an analysis of hard disk technology trends [6]. The performance of sequential reading is the performance of the best possible clustering arrangement; that of random reading is the performance of the worst possible clustering arrangement. Fig. 5 therefore also plots the worst-case performance loss in the case of bad clustering.

Figure 5: Performance ratio between sequential and random reading, for average object sizes from 32 bytes to 64 KB, for 1994 disks and 2005 disks (estimated).

We see that currently, for objects larger than 64 KB, clustering is not that important: the performance loss for bad clustering is never more than a factor 2. The 2005 curve shows, however, that the importance of good clustering will increase in the future.
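The shape of the curves in Fig. 5 follows from a simple disk model: a random read pays the seek and rotational latency on every object, while a sequential read pays only the transfer time. The following back-of-the-envelope calculation is ours, with assumed parameter values that roughly match the drives described in the introduction, not the exact figures behind Fig. 5.

```python
# Ratio of random-read time to sequential-read time per object, i.e. the
# worst-case performance loss caused by bad clustering.
SEEK_MS = 9.0                    # assumed average seek time
ROTATE_MS = 60_000 / 7200 / 2    # average rotational latency at 7200 rpm
SEQ_MB_S = 3.9                   # sequential rate measured in Fig. 3

def seq_random_ratio(object_bytes):
    transfer_ms = object_bytes / (SEQ_MB_S * 1024 * 1024) * 1000
    return (SEEK_MS + ROTATE_MS + transfer_ms) / transfer_ms

for size in [128, 1024, 8 * 1024, 64 * 1024]:
    print(f"{size:>6} B objects: ratio ~ {seq_random_ratio(size):7.1f}")
```

With these assumptions the ratio is about 1.8 at 64 KB but several hundred at 128 bytes, matching the qualitative behaviour of the 1994 curve in Fig. 5.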

5 Selective reading

Physics analysis jobs usually don't read all objects in a collection: they only iterate through a subset of the collection, corresponding to those events which satisfy a certain cut predicate. We call this iteration through a subset selective reading. The selectivity is the percentage of objects in the collection which is needed by the job. In tests we found that, as the selectivity drops below 100%, the throughput of selective reading falls rapidly, only to level out at the throughput of random reading. This is shown, for a collection of 8 KB objects with an 8 KB database page size, in Fig. 6. In other tests we found that the curve in Fig. 6 does not change much as a function of the page size.

Figure 6: Selective reading of 8 KB objects. Bandwidth (MB/s) versus selectivity (%), both on logarithmic scales; the sequential reading rate (2.4 MB/s) and the random reading rate (0.7 MB/s) are marked.

The curve in Fig. 6 has two distinct parts. In the part covering selectivity values from 100% down to roughly 25%, the decrease in throughput exactly mirrors the decrease in selectivity: if we had sequentially read all objects, and then thrown away the unneeded ones, the job would have taken the same time. Thus, in this part of the curve, selective reading is useless as an optimisation device if the job is disk-bound. However, selective reading will decrease the load on the CPU, the cache, and (if applicable) the network connection to the remote disk. This reduction in load depends largely on the selectivity on database pages, not on objects; see [7] for a discussion of page selectivity.

In the part of the curve between roughly 25% and 0%, selective reading is faster than sequential reading and then throwing data away. On the other hand, it is not faster than random reading. We found that the boundary between the two parts of the curve, located at roughly 25% in Fig. 6, depends on the average object size. This boundary is visualised in Fig. 7.

Figure 7: Selective reading performance boundary as a function of the average object size (32 bytes to 64 KB). For selectivities above the boundary, selective reading is only as fast as sequential reading; below the boundary, selective reading is faster than sequential reading.

From Fig. 7 we can conclude that for collections of small objects, selective reading may not be worth the extra indexing complexity over sequential reading and throwing the unneeded data away. A corollary of this is that one could pack several small logical objects into larger physical objects without a loss of performance, even for high selectivities. For collections of large objects, a selective reading mechanism can be useful as a means of ensuring that the performance never drops below that of random reading.
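The boundary in Fig. 7 can be understood with the same disk model as before: reading a fraction s of the objects individually costs s random reads per object slot, while scan-and-discard always costs one sequential read. The break-even point is where the two are equal. The sketch below is our illustrative model with assumed parameters, not the computation used for Fig. 7.

```python
SEEK_MS = 9.0 + 4.2    # assumed seek plus rotational latency per random read
SEQ_MB_S = 2.4         # sequential rate of the Fig. 6 setup

def transfer_ms(object_bytes):
    return object_bytes / (SEQ_MB_S * 1024 * 1024) * 1000

def boundary_selectivity(object_bytes):
    """Break-even selectivity: s * t_random == t_sequential per object,
    so s = t_transfer / (t_seek + t_transfer). Below this selectivity,
    reading the wanted objects individually beats scan-and-discard."""
    t = transfer_ms(object_bytes)
    return t / (SEEK_MS + t)

for size in [128, 8 * 1024, 64 * 1024]:
    print(f"{size:>6} B objects: boundary ~ {boundary_selectivity(size):.1%}")
```

With these assumptions the boundary is below 1% for 128-byte objects, around 20% for 8 KB objects, and over 60% for 64 KB objects, reproducing the trend of Fig. 7: the larger the objects, the sooner selective reading pays off.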

6 Reclustering

To avoid the performance degradation at low selectivities which we discussed above, reclustering can be used. Reclustering, the re-arranging of the objects in the database, is an expensive operation, but just letting the performance drop with decreasing selectivity can easily be more costly, especially for collections in which the objects are small. The simplest form of reclustering is to copy only those objects which are actually wanted in a particular analysis effort to a new collection at the start of the effort (a sketch of this is given at the end of this section). The creation of a data summary tape or an ntuple file is an example of this simple form of reclustering.

Much more advanced forms of reclustering are feasible in a system based on an object database. Automatic reclustering, in which the system reacts to changing access patterns without any user hints beforehand, is feasible whenever there are sequences of jobs which access the same event set. We have prototyped an automatic reclustering system ([8], [9]) which

- performs reclustering transparently to the user code,
- can optimise clustering for four different analysis efforts at the same time, and
- keeps the use of storage space within bounds by avoiding the duplication of data.

We refer the reader to [8] for a discussion of the architecture of our automatic reclustering system. Fig. 8 illustrates its performance under a simple physics analysis scenario in which 40 subsequent jobs are run, with a new cut predicate being added after every 10 jobs.

Figure 8: Performance of 40 subsequent jobs without (left) and with (right) automatic reclustering; batch reclustering operations are marked in the right-hand plot. Each pair of bars represents a job: the black bar represents the number of objects accessed, the grey bar the (wall clock) job run time.

Reclustering is an important research topic in the object database community (see [8] for some references). However, this research is directed at typical object database workloads like CAD workloads. Physics analysis workloads are highly atypical: they are mostly read-only, transactions routinely access millions of objects, and, most importantly, the workloads lend themselves to streaming-type optimisations. It is conceivable that vendors will bundle general-purpose automatic reclustering systems with future versions of object database products, but we do not expect that such products will be able to provide efficient reclustering for physics workloads. As far as reclustering is concerned, physics analysis is too atypical to be provided for by the market. Therefore, we conclude that the HEP community will have to develop its own reclustering systems.
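For concreteness, here is a minimal sketch of the simple copy-based form of reclustering described at the start of this section. The collection API is hypothetical; a real implementation would also have to maintain indices and handle storage allocation.

```python
def recluster_for_effort(source, cut):
    """Simplest form of reclustering: at the start of an analysis effort,
    copy the objects passing the effort's cut predicate into a new
    collection, in iteration order. Later jobs of the effort then read
    the new collection sequentially, at 100% selectivity."""
    new_collection = []
    for obj in source:          # one sequential scan of the old collection
        if cut(obj):
            new_collection.append(obj)
    return new_collection

# Hypothetical usage: keep only events with at least two muons.
# summaries = db.collection("chunk17/EventSummary")   # assumed API
# hot_set = recluster_for_effort(summaries, lambda e: e.n_muons >= 2)
```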

7 Conclusions

We have discussed principles for the clustering and reclustering of HEP data. The performance graphs in this paper can be used to decide, for a given physics analysis scenario, whether certain clustering techniques can be ignored without too much loss of performance, whether they need to be considered, or whether they are indispensable.

We have shown performance measurements mainly for the single-client, single-disk case. In additional performance tests ([6], [10]) we have verified that the techniques described above are also applicable to a system with disk and processor farming. Specifically, if a client is optimised to access the database with a good clustering efficiency, then it is possible to run many such clients concurrently, all accessing the same disk farm, without any significant performance degradation. Furthermore, the operating system will ensure that each client gets an equal share of the available disk resources. For a detailed discussion of the scalability of farming configurations with hundreds of clients, we refer the reader to [10].

References

[1] RD45, A Persistent Storage Manager for HEP. http://wwwcn.cern.ch/asd/cernlib/rd45/
[2] Objectivity/DB. http://www.objy.com/
[3] D. Baden et al., Joint DØ/CDF/CD Run II Data Management Needs Assessment, CDF/DOC/COMP UPG/PUBLIC/400, DØ Note 397, March 1997.
[4] Grand Challenge Application on HENP Data. http://www-rnc.lbl.gov/gc/
[5] The Versant object database. http://www.versant.com/
[6] K. Holtman, Prototyping of CMS Storage Management, CMS NOTE/1997-074.
[7] The RD45 collaboration, Using an Object Database and Mass Storage System for Physics Analysis, CERN/LHCC 97-9, April 1997.
[8] K. Holtman, P. van der Stok, I. Willers, Automatic Reclustering of Objects in Very Large Databases for High Energy Physics, Proc. of IDEAS '98, Cardiff, UK, IEEE, 1998.
[9] Reclustering Object Store Library for LHC++. Available from http://wwwcn.cern.ch/~kholtman/
[10] K. Holtman, J. Bunn, Scalability to Hundreds of Clients in HEP Object Databases, Proc. of CHEP '98, Chicago, USA.