Clustering and Reclustering HEP Data in Object Databases


Koen Holtman
CERN EP division, CH-1211 Geneva 23, Switzerland

We formulate principles for the clustering of data, applicable both to sequential HEP applications and to farming HEP applications with a high degree of concurrency. We make the case for the reclustering of HEP data on the basis of performance measurements and briefly discuss a prototype automatic reclustering system.

1 Introduction

As part of the CMS contribution to the RD45 [1] collaboration, database clustering and reclustering have been under investigation for the past few years. The clustering of objects in an object database is the mapping of objects to locations on physical storage media like disk farms and tapes. The performance of the database, and of the physics application on top of it, depends crucially on having a good match between the object clustering and the database access operations performed by the physics application.

The principles for clustering discussed in this paper are based on, and illustrated with, a set of performance measurements. These measurements were all performed on a Sun Ultra Enterprise server, running SunOS Release 5.5.1, on hard disks in a SPARCstorage array (no striping, no other RAID-type processing; 2.1 GB, 7200 rpm, fast-wide SCSI-2 Seagate ST-32550W drives). These disks can be considered typical for the high end of the 1994 commodity disk market. All performance results were cross-validated on at least one other hardware/OS configuration, most results on at least two other configurations. The object database system used was always Objectivity/DB version 4 [2].

2 HEP data clustering basics

Most I/O intensive physics analysis systems, no matter what the implementation method, and no matter whether tape or disk based, use the following simple principles to optimise performance:

1. Divide the set of all events into fairly large chunks (in most current systems a chunk is a run or a part of a physics stream [3]).
2. Implement farming (both for disks and CPUs) at the chunk level.
3. Make sure that (sub)jobs always iterate through the events in a chunk in the same order.
4. Cluster the event data in a chunk in the iteration order.

Though object databases make it perhaps easier than ever to build physics analysis systems which do not follow the principles above, we believe that these principles are currently still the most viable basis for designing a performant production system (a code sketch of the principles closes this section).

Principle 1 above, dividing the event set into chunks, involves coarse-grained clustering decisions: strategies like dividing events into physics streams [3] are often used here, and newer strategies are a topic of active research [4]. At the chunk level, reducing tape mounts is a very important goal, and an important constraint is that it is not feasible to recluster data often.
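As an illustration, here is a minimal Python sketch of the four principles above. It is not from the original paper: the chunk size, the helper names (read_event, analyse) and the farming mechanism are all assumptions made for the example.

```python
from multiprocessing import Pool

CHUNK_SIZE = 100_000  # events per chunk; "fairly large" (value illustrative)

def make_chunks(event_ids):
    """Principle 1: divide the event set into fairly large chunks."""
    return [event_ids[i:i + CHUNK_SIZE]
            for i in range(0, len(event_ids), CHUNK_SIZE)]

def read_event(event_id):
    """Stub for reading one event's objects; in a real system this is a
    database access that is sequential when principle 4 holds."""
    return {"id": event_id}

def analyse(event):
    """Stub for the user's analysis code."""
    return event["id"]

def process_chunk(chunk):
    """Principles 3 and 4: every (sub)job walks its chunk in one fixed
    order, which is also the order in which the event data is clustered
    on disk, so the chunk is read near-sequentially."""
    return [analyse(read_event(eid)) for eid in chunk]

def run_analysis(event_ids, n_workers=4):
    """Principle 2: farm the work out at the chunk level, one subjob per chunk."""
    with Pool(n_workers) as pool:
        return pool.map(process_chunk, make_chunks(event_ids))

if __name__ == "__main__":
    print(len(run_analysis(list(range(250_000)))))  # 3 chunks
```

The essential point is that process_chunk always visits events in the stored order, so a well-clustered chunk is read near-sequentially no matter how many subjobs run in parallel.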

Most of this paper deals with the refinement of principle 4 above, that is, with clustering and reclustering decisions at the sub-chunk level. At this level, an important goal is to achieve near-sequential reading on the disk or disk farm, and frequent reclustering is feasible as a strategy for optimising performance and reducing disk space occupancy.

The clustering techniques discussed in this paper can be used with both the Objectivity/DB [2] and Versant [5] object databases, though Objectivity offers far more direct and convenient support for them. Note that clustering and reclustering are not problems which are specific to commercial object databases. At the core, the clustering problem is one of reducing disk seeks, tape seeks and tape mounts, and this problem exists equally well in physics analysis systems not based on object databases, even though other systems may use different terminology to describe it.

3 Type-based clustering

The most obvious way to refine principle 4 above is to cluster data as shown in Fig. 1. For each event in the chunk, the event data is split into several objects of different types; for example, one type can hold all data for a single subdetector. Then these objects are grouped by type into collections. Inside each collection, the objects for the different events are clustered in the iteration order. This way, a job which only needs one type of data per event automatically performs sequential reading over a single collection, which exactly contains the needed data, yielding the maximum achievable I/O performance.

Figure 1: Clustering of a chunk into collections. The objects of events 1..N (object types: Detector X, Detector Y, Detector Z, reconstructed P's, event summary tags) are grouped by type into collections.

There are some performance pitfalls, however, for jobs which need to read two or more collections from the same disk or disk array. A job reading two collections has a logical object reading pattern as shown on the left in Fig. 2. To achieve near-sequential throughput for such a job, the logical pattern needs to be transformed into the physical pattern on the right in Fig. 2.

Figure 2: Reading two collections: logical pattern (left) and preferred physical pattern (right).

We found that this transformation was not performed by Objectivity/DB, the database on which we implemented our test system, nor by the operating system (we tested both SunOS and HP-UX), nor by the disk hardware, for various commodity brands. The result was a significant performance degradation, especially when reading more than two collections from a single disk; see the solid line in Fig. 3.

Figure 3: One client reading multiple collections of 8 KB objects from a single disk. Read performance (MB/s) versus number of collections, with 800 KB read-ahead, 160 KB read-ahead, and no read-ahead.

We eliminated the performance degradation by extending the collection indexing/iteration class to read ahead objects into the database cache.
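The sketch below shows the idea of such a read-ahead iterator. It is a reconstruction, not the actual Objectivity/DB extension: the collection interface (an iterator yielding object/size pairs) and the batch-fetch strategy are assumptions.

```python
from collections import deque

READ_AHEAD_BYTES = 800 * 1024  # per-collection read-ahead; 800 KB sufficed in Fig. 3

class ReadAheadIterator:
    """Iterates over one collection, but pulls objects from the underlying
    store in large batches. When several such iterators over co-located
    collections are interleaved by a job, each batch is one long
    near-sequential sweep instead of many alternating short reads."""

    def __init__(self, collection):
        self.collection = iter(collection)  # yields (object, size_in_bytes)
        self.buffer = deque()

    def _fill(self):
        """Fetch roughly READ_AHEAD_BYTES worth of objects in one sweep."""
        nbytes = 0
        while nbytes < READ_AHEAD_BYTES:
            try:
                obj, size = next(self.collection)
            except StopIteration:
                break
            self.buffer.append(obj)
            nbytes += size

    def __iter__(self):
        return self

    def __next__(self):
        if not self.buffer:
            self._fill()
        if not self.buffer:
            raise StopIteration
        return self.buffer.popleft()

# A job reading two collections per event (db_iter is a hypothetical
# function returning an (object, size) iterator for a stored collection):
# for x, y in zip(ReadAheadIterator(db_iter("chunk1/DetX")),
#                 ReadAheadIterator(db_iter("chunk1/DetY"))):
#     process(x, y)
```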

This extension could be made without affecting the end-user physics analysis code. Measurements (see Fig. 3) showed that when 800 KB worth of objects was read ahead for each collection, the I/O throughput approached that of sequential reading (3.9 MB/s).

Keeping all collections on different disks would of course be an alternative to the read-ahead optimisation. That approach would, however, create a load balancing problem: for optimal performance one has to make sure that all disks are kept busy, even for jobs which only read one or a few collections. The problem can be solved to some extent by mapping the collections in different chunks to disks as shown in Fig. 4. This will produce load balancing for any number of collections, assuming that the subjobs running in parallel on each chunk are about equally heavy. A problem with this solution is that it requires a higher degree of connectedness between all disks and all CPUs. We therefore prefer to use the read-ahead optimisation: by devoting a modest amount of memory (given current RAM prices) to read-ahead buffering, we can keep the objects for one event together on the same disk, which gives us greater fault-tolerance and decreases the dependence on subjob scheduling.

Figure 4: Load-balancing arrangement as an alternative to using a read-ahead optimisation. Disk 1 holds chunk 1 type X, chunk 2 type Z, chunk 3 type Y; disk 2 holds chunk 1 type Y, chunk 2 type X, chunk 3 type Z; disk 3 holds chunk 1 type Z, chunk 2 type Y, chunk 3 type X.

4 Random database access

For small objects, a good clustering is more important than for large objects. This is illustrated by Fig. 5, which plots the ratio between the speed of sequential reading and that of random reading for different object sizes. Fig. 5 shows a curve for 1994 disks and one for disks in the year 2005, based on an analysis of hard disk technology trends [6]. The performance of sequential reading is the performance of the best possible clustering arrangement; that of random reading is the performance of the worst possible clustering arrangement. Fig. 5 therefore also plots the worst-case performance loss in the case of bad clustering.

Figure 5: Performance ratio between sequential and random reading, for average object sizes from 32 bytes to 64 KB, for 1994 disks and 2005 disks (estimated).

We see that currently, for objects larger than 64 KB, clustering is not that important: the performance loss for bad clustering is never more than a factor 2. The 2005 curve shows, however, that the importance of good clustering will increase in the future.
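The shape of the curves in Fig. 5 follows from a simple disk model: a random read pays the seek and rotational latency on every object, while a sequential read pays only the transfer time. The following back-of-the-envelope calculation is ours, with assumed parameter values that roughly match the drives described in the introduction, not the exact figures behind Fig. 5.

```python
# Ratio of random-read time to sequential-read time per object, i.e. the
# worst-case performance loss caused by bad clustering.
SEEK_MS = 9.0                    # assumed average seek time
ROTATE_MS = 60_000 / 7200 / 2    # average rotational latency at 7200 rpm
SEQ_MB_S = 3.9                   # sequential rate measured in Fig. 3

def seq_random_ratio(object_bytes):
    transfer_ms = object_bytes / (SEQ_MB_S * 1024 * 1024) * 1000
    return (SEEK_MS + ROTATE_MS + transfer_ms) / transfer_ms

for size in [128, 1024, 8 * 1024, 64 * 1024]:
    print(f"{size:>6} B objects: ratio ~ {seq_random_ratio(size):7.1f}")
```

With these assumptions the ratio is about 1.8 at 64 KB but several hundred at 128 bytes, matching the qualitative behaviour of the 1994 curve in Fig. 5.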

5 Selective reading

Physics analysis jobs usually don't read all objects in a collection: they only iterate through a subset of the collection, corresponding to those events which satisfy a certain cut predicate. We call this iteration through a subset selective reading. The selectivity is the percentage of objects in the collection which is needed by the job. In tests we found that, as the selectivity drops below 100%, the throughput of selective reading falls rapidly, only to level out at the throughput of random reading. This is shown, for a collection of 8 KB objects with an 8 KB database page size, in Fig. 6. In other tests we found that the curve in Fig. 6 does not change much as a function of the page size.

Figure 6: Selective reading of 8 KB objects. Bandwidth (MB/s) versus selectivity (%), both on logarithmic scales; the sequential reading rate (2.4 MB/s) and the random reading rate (0.7 MB/s) are marked.

The curve in Fig. 6 has two distinct parts. In the part covering selectivity values from 100% down to roughly 25%, the decrease in throughput exactly mirrors the decrease in selectivity: if we had sequentially read all objects, and then thrown away the unneeded ones, the job would have taken the same time. Thus, in this part of the curve, selective reading is useless as an optimisation device if the job is disk-bound. However, selective reading will decrease the load on the CPU, the cache, and (if applicable) the network connection to the remote disk. This reduction in load depends largely on the selectivity on database pages, not on objects; see [7] for a discussion of page selectivity.

In the part of the curve between roughly 25% and 0%, selective reading is faster than sequential reading and then throwing data away. On the other hand, it is not faster than random reading. We found that the boundary between the two parts of the curve, located at roughly 25% in Fig. 6, depends on the average object size. This boundary is visualised in Fig. 7.

Figure 7: Selective reading performance boundary as a function of the average object size (32 bytes to 64 KB). For selectivities above the boundary, selective reading is only as fast as sequential reading; below the boundary, selective reading is faster than sequential reading.

From Fig. 7 we can conclude that for collections of small objects, selective reading may not be worth the extra indexing complexity over sequential reading and throwing the unneeded data away. A corollary of this is that one could pack several small logical objects into larger physical objects without a loss of performance, even for high selectivities. For collections of large objects, a selective reading mechanism can be useful as a means of ensuring that the performance never drops below that of random reading.
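The boundary in Fig. 7 can be understood with the same disk model as before: reading a fraction s of the objects individually costs s random reads per object slot, while scan-and-discard always costs one sequential read. The break-even point is where the two are equal. The sketch below is our illustrative model with assumed parameters, not the computation used for Fig. 7.

```python
SEEK_MS = 9.0 + 4.2    # assumed seek plus rotational latency per random read
SEQ_MB_S = 2.4         # sequential rate of the Fig. 6 setup

def transfer_ms(object_bytes):
    return object_bytes / (SEQ_MB_S * 1024 * 1024) * 1000

def boundary_selectivity(object_bytes):
    """Break-even selectivity: s * t_random == t_sequential per object,
    so s = t_transfer / (t_seek + t_transfer). Below this selectivity,
    reading the wanted objects individually beats scan-and-discard."""
    t = transfer_ms(object_bytes)
    return t / (SEEK_MS + t)

for size in [128, 8 * 1024, 64 * 1024]:
    print(f"{size:>6} B objects: boundary ~ {boundary_selectivity(size):.1%}")
```

With these assumptions the boundary is below 1% for 128-byte objects, around 20% for 8 KB objects, and over 60% for 64 KB objects, reproducing the trend of Fig. 7: the larger the objects, the sooner selective reading pays off.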

6 Reclustering

To avoid the performance degradation at low selectivities which we discussed above, reclustering can be used. Reclustering, the re-arranging of the objects in the database, is an expensive operation, but just letting the performance drop with decreasing selectivity can easily be more costly, especially for collections in which the objects are small. The simplest form of reclustering is to copy only those objects which are actually wanted in a particular analysis effort to a new collection at the start of the effort (a sketch of this is given at the end of this section). The creation of a data summary tape or an ntuple file is an example of this simple form of reclustering.

Much more advanced forms of reclustering are feasible in a system based on an object database. Automatic reclustering, in which the system reacts to changing access patterns without any user hints beforehand, is feasible whenever there are sequences of jobs which access the same event set. We have prototyped an automatic reclustering system ([8], [9]) which

- performs reclustering transparently to the user code,
- can optimise clustering for four different analysis efforts at the same time, and
- keeps the use of storage space within bounds by avoiding the duplication of data.

We refer the reader to [8] for a discussion of the architecture of our automatic reclustering system. Fig. 8 illustrates its performance under a simple physics analysis scenario in which 40 subsequent jobs are run, with a new cut predicate being added after every 10 jobs.

Figure 8: Performance of 40 subsequent jobs without (left) and with (right) automatic reclustering; batch reclustering operations are marked in the right-hand plot. Each pair of bars represents a job: the black bar represents the number of objects accessed, the grey bar the (wall clock) job run time.

Reclustering is an important research topic in the object database community (see [8] for some references). However, this research is directed at typical object database workloads like CAD workloads. Physics analysis workloads are highly atypical: they are mostly read-only, transactions routinely access millions of objects, and, most importantly, the workloads lend themselves to streaming-type optimisations. It is conceivable that vendors will bundle general-purpose automatic reclustering systems with future versions of object database products, but we do not expect that such products will be able to provide efficient reclustering for physics workloads. As far as reclustering is concerned, physics analysis is too atypical to be provided for by the market. Therefore, we conclude that the HEP community will have to develop its own reclustering systems.
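For concreteness, here is a minimal sketch of the simple copy-based form of reclustering described at the start of this section. The collection API is hypothetical; a real implementation would also have to maintain indices and handle storage allocation.

```python
def recluster_for_effort(source, cut):
    """Simplest form of reclustering: at the start of an analysis effort,
    copy the objects passing the effort's cut predicate into a new
    collection, in iteration order. Later jobs of the effort then read
    the new collection sequentially, at 100% selectivity."""
    new_collection = []
    for obj in source:          # one sequential scan of the old collection
        if cut(obj):
            new_collection.append(obj)
    return new_collection

# Hypothetical usage: keep only events with at least two muons.
# summaries = db.collection("chunk17/EventSummary")   # assumed API
# hot_set = recluster_for_effort(summaries, lambda e: e.n_muons >= 2)
```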

7 Conclusions

We have discussed principles for the clustering and reclustering of HEP data. The performance graphs in this paper can be used to decide, for a given physics analysis scenario, whether certain clustering techniques can be ignored without too much loss of performance, whether they need to be considered, or whether they are indispensable.

We have shown performance measurements mainly for the single-client, single-disk case. In additional performance tests ([6], [10]) we have verified that the techniques described above are also applicable to a system with disk and processor farming. Specifically, if a client is optimised to access the database with a good clustering efficiency, then it is possible to run many such clients concurrently, all accessing the same disk farm, without any significant performance degradation. Furthermore, the operating system will ensure that each client gets an equal share of the available disk resources. For a detailed discussion of the scalability of farming configurations with hundreds of clients, we refer the reader to [10].

References

[1] RD45, A Persistent Storage Manager for HEP. http://wwwcn.cern.ch/asd/cernlib/rd45/
[2] Objectivity/DB. http://www.objy.com/
[3] D. Baden et al., Joint DØ/CDF/CD Run II Data Management Needs Assessment, CDF/DOC/COMP UPG/PUBLIC/400, DØ Note 397, March 1997.
[4] Grand Challenge Application on HENP Data. http://www-rnc.lbl.gov/gc/
[5] The Versant object database. http://www.versant.com/
[6] K. Holtman, Prototyping of CMS Storage Management, CMS NOTE/1997-074.
[7] The RD45 collaboration, Using an Object Database and Mass Storage System for Physics Analysis, CERN/LHCC 97-9, April 1997.
[8] K. Holtman, P. van der Stok, I. Willers, Automatic Reclustering of Objects in Very Large Databases for High Energy Physics, Proc. of IDEAS '98, Cardiff, UK, IEEE, 1998.
[9] Reclustering Object Store Library for LHC++. Available from http://wwwcn.cern.ch/~kholtman/
[10] K. Holtman, J. Bunn, Scalability to Hundreds of Clients in HEP Object Databases, Proc. of CHEP '98, Chicago, USA.