
CDF/DOC/COMP UPG/PUBLIC/4522
March 9, 1998
Version 1.0

AN INTERIM STATUS REPORT FROM THE OBJECTIVITY/DB WORKING GROUP

W. David Dagenhart, Kristo Karr, Simona Rolli and Krzysztof Sliwa
Tufts University, Department of Physics and Astronomy, Medford, Massachusetts, USA

Ruth Pordes and Mingshen Gao
Fermi National Accelerator Laboratory, Batavia, Illinois

ABSTRACT

We present some preliminary results of benchmarks of the Objectivity database system obtained by our group. We show results using prototype "loader" and "reader" modules that run under AC++, which provide the necessary interface between the Objectivity/DB database and the TRYBOS/YBOS records. We also summarize the highlights from the 1998 CERN RD45 Status Report. The selected items are relevant to the architecture, scalability and performance issues of the large volume data management system based on Objectivity/DB, as we propose for CDF Run II data.

1 Introduction

We present some preliminary results of benchmarks of the Objectivity database system. First, we show results derived using prototype modules that run under AC++. Second, we show tests of simpler standalone processes that write and read arrays. Finally, there is a brief summary of the 1998 CERN RD45 Status Report. The selected items are relevant to the architecture, scalability and performance issues of the large volume data management system based on Objectivity/DB, as we propose for CDF Run II data.

2 AC++ Prototype Module Results

In this section, we present results derived using prototype modules that run under AC++ (Framework). One module reads banks from a TRYBOS record and copies them to Objectivity persistent objects. We call it the "loader". The second module is an input module. It reads the data files created by the "loader", then uses functions to recreate the banks in a TRYBOS record, and passes that record to Framework. Other modules can process that TRYBOS record in a completely transparent fashion.

At present, 36 Run 1 banks have been redefined as persistent classes and can be loaded and read. The persistent objects store the data field and the bank number from the bank. The rest of the bank header and the type header are not stored, but can be recreated by a function. The following results were measured using 150 MBytes of data from a Run 1 Monte Carlo file. These results are preliminary.

At present, the granularity of the objects is one bank to one object. We are already working on composite objects, which contain multiple bank-sized objects or arrays of them. When these are implemented, we expect performance to improve significantly. These composite objects are not flat arrays of bytes, but contain other objects. In addition, we have done studies of large flat arrays and are considering using them only for the storage of raw data, not reconstructed or higher level objects. For both composite objects and flat arrays, we expect the performance to improve, because the persistent objects will be larger.

The database includes event objects and tag objects. All objects in an event are connected to these through associations. The input module iterates over the tag objects. Associations are used to access all the other objects in the event. The reading benchmarks reflect reading 100% of the data in the database. They do not show the benefit one would see when reading nonsequentially. It is important to keep in mind that the following results will change as our data model evolves. Results of benchmarks done with TRYBOS are included for comparison.
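As an illustration of this data model, the following self-contained C++ sketch mimics the loader and reader described above using plain STL containers in place of Objectivity persistent classes and associations. All class and member names are hypothetical; this is not the actual AC++ or Objectivity/DB code, only a picture of the one-bank-to-one-object granularity and of the tag-object/association navigation.

// Minimal stand-in for the loader/reader data model described above.
// Plain C++ containers replace Objectivity persistent classes and
// associations; all names are hypothetical.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// One Run 1 bank mapped to one persistent object: only the bank
// number and the data field are stored; the headers are recreated
// by a function on the reading side.
struct PersistentBank {
    std::string name;            // bank name, e.g. "LRIH"
    int bankNumber = 0;
    std::vector<int32_t> data;   // the bank data field
};

// Tag object: holds summary quantities plus associations (here,
// shared_ptr stands in for Objectivity associations) to the banks.
struct EventTag {
    int eventNumber = 0;
    double sumEt = 0.0;          // example tag quantity
    std::vector<std::shared_ptr<PersistentBank>> banks;
};

// "Loader": copy the banks of one event into persistent objects.
EventTag loadEvent(int eventNumber,
                   const std::vector<PersistentBank>& rawBanks) {
    EventTag tag;
    tag.eventNumber = eventNumber;
    for (const auto& b : rawBanks)
        tag.banks.push_back(std::make_shared<PersistentBank>(b));
    return tag;
}

// "Reader": iterate over the tag objects and rebuild a (fake) record
// by following the associations.
void readAll(const std::vector<EventTag>& tags) {
    for (const auto& tag : tags) {
        std::size_t words = 0;
        for (const auto& bank : tag.banks) words += bank->data.size();
        std::cout << "event " << tag.eventNumber << ": rebuilt "
                  << tag.banks.size() << " banks, " << words
                  << " data words\n";
    }
}

int main() {
    std::vector<PersistentBank> raw = {{"LRIH", 1, {1, 2, 3}},
                                       {"TRKS", 1, {4, 5}}};
    std::vector<EventTag> db = {loadEvent(42, raw)};
    readAll(db);
}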

Objectivity space overhead = 31%
TRYBOS space overhead = 33%
Objectivity Load and Write speed = 0.43 MBytes/sec
TRYBOS Bank Copy and Write speed = 1.8 MBytes/sec
TRYBOS Write speed = 4.1 MBytes/sec
Objectivity Read speed (objects into memory) = 3.7 MBytes/sec
Objectivity Read speed (with copy to TRYBOS record) = 2.8 MBytes/sec
TRYBOS Read speed = 8.3 MBytes/sec

The rest of this section gives details of how the tests were done. Skip to the next section if you are not interested in the details.

The space overhead is calculated by dividing the size of the Objectivity output files after ootidy (147 MBytes) by the size of the data fields in the banks (112 MBytes), then subtracting 1 and converting to a percentage. For TRYBOS, first a bank copy program is run that copies out the banks which the loader can translate. The size of the original TRYBOS file is 238 MBytes. The size of the file with only the selected banks is 150 MBytes. This 150 MByte file was used for all the TRYBOS tests. The input file is /simona/simona/cdfloader/wbb q.trk on b0it04.

The write speed is calculated by dividing the size of the output file by the time to create it. For Objectivity, the time includes the time to run the process (373 sec) plus the ootidy time (63 sec) minus the time it takes to read in the data using the standard TRYBOS input module (91 sec for the 238 MByte input file). For TRYBOS, there are two ways to measure the write speed, depending on what you want to compare. In the first method, there is one TRYBOS record that is filled by the input module; the output module then takes a second record, copies in the data one bank at a time, and writes the second record to disk. This is analogous to loading the objects and then writing. In the second method, the record delivered by the input module is written directly to a new disk file; there is no copy of banks to a new record. For TRYBOS, the time is the process time minus the time to read in the data.

The read speed is the size of the database files divided by the time to read through them. All the data is read into memory, including the contents of the oovarrays in the case of Objectivity. Only the input module is run; there is no processing done on the event. For Objectivity, it takes 52 seconds to read 147 MBytes when the data is copied into a TRYBOS record, and 39.4 seconds when the data is only read into memory and not copied to a TRYBOS record. For TRYBOS, it takes 18.1 seconds to read in the 150 MByte file. This is using a modified version of the FrameMods input module that does not reallocate a new event record every event. Using the standard module the read rate is 2.6 MBytes per second; in this case, most of the time is spent allocating new records, so this is not a very interesting number.
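The short C++ program below simply re-derives the quoted Objectivity numbers from the inputs given above (147 MByte output after ootidy, 112 MBytes of bank data, 373 sec process time, 63 sec ootidy, 91 sec input time, 39.4 and 52 second read times); it adds nothing new and is only a worked example of the formulas described in the text.

// Re-derivation of the quoted Objectivity numbers from the inputs
// given in the text (sizes in MBytes, times in seconds).
#include <cstdio>

int main() {
    const double dbSize   = 147.0;  // Objectivity files after ootidy
    const double dataSize = 112.0;  // size of the bank data fields
    const double procTime = 373.0;  // loader process time
    const double tidyTime = 63.0;   // ootidy time
    const double readTime = 91.0;   // TRYBOS input of the 238 MByte file

    // Space overhead: output size over payload size, minus 1.
    std::printf("overhead    = %.0f%%\n", (dbSize / dataSize - 1.0) * 100.0);

    // Write speed: output size over (process + ootidy - input) time.
    std::printf("write speed = %.2f MB/s\n",
                dbSize / (procTime + tidyTime - readTime));

    // Read speeds: database size over the measured read times.
    std::printf("read (memory only)    = %.1f MB/s\n", dbSize / 39.4);
    std::printf("read (copy to TRYBOS) = %.1f MB/s\n", dbSize / 52.0);
}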

The results depend heavily on the hardware used. In all cases, b0it04 was used. b0it04 is a Silicon Graphics 195 MHz R10000 machine. The disk was local to the machine. It was a Seagate ST19171N with a rated "Internal Formatted Transfer Rate" of 7.86 to 12.2 MBytes/sec. The disk was tested, using low level I/O, to write at least 4 MBytes/sec. The disk read speed was tested with the UNIX command "time cat wbb11.try" piped to /dev/null, where wbb11.try is the 150 MByte disk file; the measured rate was 8.4 MBytes/second. All the benchmarks were run when no other significant processes were running, as determined using the UNIX "top" utility.

Note that about 30% of the time to write data was taken by the ootidy process, but ootidy only reduces the size of the output file by 1% or 2%. It is not clear that the size savings is worth the extra write time. We define 1 MByte/sec to be 1,000,000 bytes per second.

3 Simpler Standalone Results

3.1 Reading/Writing Many Identical Fixed Size Arrays

First, we show tests of a process that writes a large number of identical persistent objects to the database. Each object contains one array of a fixed size. The test is repeated with different array sizes. The arrays are filled with random garbage. To keep it simple, there are no oovarrays or associations. A second process is used to read through the data created by the first process. See figure 1 and table 1.

The details of the test are very similar to the test described in the first section. The same hardware was used. The size of the data is roughly 160 MBytes in every case. The write speed is the size of the output file divided by the time to create it (including the time for ootidy). The read speed is the size of the input file divided by the time to read it all into memory. Objects are located using the Objectivity scan function to initialize the iterator. The hardware read speed was calculated using the UNIX command "time cat filename" piped to /dev/null. The hardware read speed was recalculated for every file to account for the variation caused by which part of the disk the file is written to. For the larger objects, the reading speed is roughly 90% as fast as the hardware can go. The hardware write speed was tested using low level I/O to write 1 MByte blocks to a file. The hardware write speed is very roughly 4 MBytes per second. The writing speed is less than 1/4 as fast as the hardware will go. Benchmarks at different page sizes are in progress.

The plot in figure 1 shows space overhead as a function of object size. The space overhead is the sum of two components. One component varies like 1/(object size). This is important for smaller objects. This component will increase proportionally as associations, oovarrays and other things are added to objects. In the simple case tested here, the overhead is 14 bytes per object. The other component depends on the page size.
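As a much simplified stand-in for the write test just described, the sketch below writes roughly 160 MBytes of fixed-size records filled with random bytes and reports the achieved rate. Plain fwrite() to a local file (the file name bench.dat is hypothetical) replaces Objectivity here, so it only illustrates the measurement procedure, not the database overheads being studied.

// Standalone write benchmark in the spirit of section 3.1: write
// ~160 MB of fixed-size records filled with random bytes and report
// the achieved rate. Plain fwrite() stands in for Objectivity.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    const std::size_t objectSize = argc > 1 ? std::atoi(argv[1]) : 8000;
    const std::size_t totalBytes = 160u * 1000u * 1000u;   // ~160 MBytes
    const std::size_t nObjects   = totalBytes / objectSize;

    std::vector<unsigned char> buf(objectSize);
    for (auto& b : buf) b = static_cast<unsigned char>(std::rand());

    std::FILE* f = std::fopen("bench.dat", "wb");
    if (!f) return 1;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < nObjects; ++i)
        std::fwrite(buf.data(), 1, buf.size(), f);
    std::fclose(f);
    auto t1 = std::chrono::steady_clock::now();

    // 1 MByte/sec is defined as 1,000,000 bytes per second, as in the text.
    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%zu objects of %zu bytes: %.2f MB/s\n",
                nObjects, objectSize, nObjects * objectSize / sec / 1e6);
}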

Table 1: Read and write speeds as a function of object size (with page size 8192 bytes). Errors on the speeds are roughly 10%. The columns give the size of the objects (bytes), the space overhead, the write speed (MBytes/s), the read speed (MBytes/s), and the hardware read speed (MBytes/s).

The saw-tooth part of the curve in figure 1 relates to space wasted at the beginning/end of the pages. Changing the page size would be like rescaling the x axis in figure 1. The height and shape of the sawtooth peaks would not change. The peaks are located at (pagesize)/n and (n+1/2)*(pagesize), for n = 1, 2, 3, ...

It is a little difficult to compare the results in section 1 to these results. The average bank size for the results in section 1 was 160 bytes, but the overhead per object was larger, because there are associations, oovarrays, tag objects, event objects and other things that make the overhead per object larger. These things also take a significant amount of time to create and a little bit more time to read.

Figure 1: Space overhead as a function of the size of the persistent objects. The overhead is calculated by dividing the size of the database output file by the size of the input data (number of objects times the size of the data in each object), then subtracting 1 and converting to a percentage. The curve is a prediction based on our understanding of how Objectivity works. The points show our test results. The top plot covers objects of size 32 to 1,000 bytes, the lower left plot 1,000 to 9,000 bytes, and the lower right plot 9,000 to 90,000 bytes. The overhead is a sum of contributions from two sources. First, there are 14 bytes of overhead per object; this varies like 1/x and is only significant in the top plot. Second, the sawtooth function reflects the space wasted at the end/beginning of each page.
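The function below is one possible reading of this two-component model (14 bytes per object, plus page-boundary waste for whole objects packed into 8192 byte pages). It is our interpretation for illustration only; the actual Objectivity page layout may differ, which is why the measured points and the predicted curve in figure 1 do not have to agree exactly.

// Predicted space overhead versus object size, under the simple model
// described above: 14 bytes of per-object overhead plus space wasted
// when whole objects do not fill a page exactly. This is our reading
// of the behaviour, not the actual Objectivity page layout.
#include <cstdio>

double predictedOverhead(double objectBytes, double pageBytes = 8192.0) {
    const double perObject = 14.0;                  // bytes of overhead per object
    const double stored = objectBytes + perObject;  // what one object costs on a page
    if (stored <= pageBytes) {
        // Small objects: an integer number of objects per page,
        // the remainder of the page is wasted.
        int perPage = static_cast<int>(pageBytes / stored);
        return pageBytes / (perPage * objectBytes) - 1.0;
    }
    // Large objects: an object spans an integer number of pages,
    // the tail of the last page is wasted.
    int pages = static_cast<int>((stored + pageBytes - 1.0) / pageBytes);
    return pages * pageBytes / objectBytes - 1.0;
}

int main() {
    const double sizes[] = {80.0, 500.0, 2000.0, 9000.0, 50000.0};
    for (double size : sizes)
        std::printf("%8.0f bytes -> %5.1f%% overhead\n",
                    size, 100.0 * predictedOverhead(size));
}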

3.2 Bank Loader Program Using Flat Arrays

In this test, an alternate schema was tried. It shows that we can expect performance to improve with larger objects. It is also a schema we are considering using for the raw data. This schema uses large flat variable-sized arrays. "Flat" means that the type information is not stored within the schema; it is just an array of bytes. This approach requires another package (TRYBOS) to do the type conversions and access the data. The schema just gives persistent storage for arrays. In this test, all the banks in an event with the same name were stored in the same array (one might also store entire TRYBOS records this way). We used the same hardware, the same measurement assumptions, and the same 150 MBytes of Monte Carlo data used in the test described in section 1. The overhead was measured to be 15%. The writing speed was measured to be 0.89 MBytes/sec. Reading is not implemented yet for this alternate schema.

3.3 Nonsequential Reading of Data

One can expect to save a lot of time by reading data nonsequentially. We demonstrate this in a simple test. The process reads data from events which were written into the database by our loader program. There is a cut based on data in the tag object. The rest of the data in the event is only accessed if the cut is passed. The numbers below show that a nonsequential read is faster. In each case there are 10,000 total events.

Events passing cut = 9997, time = 152 seconds
Events passing cut = 7510, time = 138 seconds
Events passing cut = 5241, time = 104 seconds
Events passing cut = 4055, time = 87 seconds
Events passing cut = 2276, time = 59 seconds
Events passing cut = 707, time = 19 seconds
Events passing cut = 553, time = 17 seconds

The results above will change as we change the event structure in our database. The only conclusion one should draw is that nonsequential reading is faster. One should not presume that the relationship between time and the fraction of events selected will be the same in the final data model.
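The access pattern used in this test can be sketched as follows in plain C++; shared_ptr stands in for Objectivity associations, and the tag quantity, cut and event content are invented for illustration only.

// Sketch of the nonsequential read pattern: iterate over the small
// tag objects, apply the cut there, and only follow the association
// into the full event data when the cut is passed.
#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

// The bulk of an event: only touched when the tag cut is passed.
struct EventData {
    std::vector<double> tracks;
};

// Small tag object holding summary quantities and an association
// (here a shared_ptr) to the full event data.
struct EventTag {
    double sumEt = 0.0;
    std::shared_ptr<EventData> event;
};

int main() {
    // Build 10,000 toy events, as in the test above.
    std::vector<EventTag> tags(10000);
    for (std::size_t i = 0; i < tags.size(); ++i) {
        tags[i].sumEt = static_cast<double>(i % 100);   // fake tag quantity
        tags[i].event = std::make_shared<EventData>();
        tags[i].event->tracks.assign(50, 1.0);
    }

    // Nonsequential read: cut on the tag, follow the association only
    // for events that pass, so most of the data is never read.
    const double etCut = 90.0;
    std::size_t passed = 0;
    double sum = 0.0;
    for (const auto& tag : tags) {
        if (tag.sumEt < etCut) continue;                // cheap: only the tag is touched
        ++passed;
        for (double t : tag.event->tracks) sum += t;    // expensive part
    }
    std::printf("events passing cut: %zu, track sum: %.0f\n", passed, sum);
}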

4 Highlights from the 1998 RD45 Status Report

For the year 1998 the RD45 collaboration has been asked to reach the following milestones:

- demonstrate that an ODBMS can satisfy the requirements of a typical simulation, reconstruction and analysis scenario with a data volume of up to 1 TB
- investigate the impact on the everyday work of the end-user physicist when using an ODBMS for event data storage; the work should address issues related to individual developers' schema and collections for simulation, reconstruction and analysis
- demonstrate the feasibility of using an ODBMS and MSS at data rates sufficient for the ATLAS and CMS 1997 test beam requirements

Milestone I

The main requirement for reconstruction is that the ODBMS should be able to keep up with the rate at which data is acquired: 100 MB/sec for ATLAS/CMS and 1.5 GB/sec for ALICE. (For CDF this number is of order 60 MB/sec, and it is the reconstruction rate with which the input/output module should keep up.) It is assumed that reconstruction is done using a farm. The NA45 experiment has already demonstrated that up to 32 streams can write into a single Objectivity/DB federation using a lock-free strategy. We think that an extrapolation to 60 streams is not unreasonable. This would give a rate of 1 MB/second per stream. The conclusion is that I/O rates for reconstruction purposes are not considered to be a problem.

For analysis it is assumed that some 150 users will be performing analysis concurrently at any one time. The following setup has been developed at Caltech: a 256-processor system was built from 16 nodes, each with 16 processors. Each node was connected to a 4-way striped disk array capable of delivering 22 MB/s. The nodes were connected by a fast switching fabric. Data were put on two nodes (for a total of 2 x 10 GB), with the I/O-intensive clients on the nodes holding the data; the CPU-intensive clients were located on the other nodes. The following assumption was made: 1 physics job executes as N clients (or client threads), each client doing mostly sequential reading. The load of M physics jobs was simulated by putting M*N Objectivity clients on the machine. Each client does mostly sequential reading (traversing containers): 1/3 of all clients read 10 KB objects, 1/3 read 100 KB objects and computed for 0.1 sec per object, and 1/3 read 500 KB objects and computed for 10 sec per object.

The system got up to 158 clients running in parallel. The conclusions are that HP Exemplar I/O scales to 100+ readers for data on 2 nodes. There was no I/O performance degradation with 100+ readers. There were no crashes of the lockserver (570 entries in the lockserver table). The combined throughput of all clients was about 18 MB/second, essentially constant for up to 100 concurrent clients. These results suggest that scaling to 150 concurrent analyses is achievable today, without resorting to replication.
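As a back-of-the-envelope cross-check of the reconstruction figures quoted at the start of this milestone, the snippet below converts the required acquisition rates into numbers of 1 MB/second writing streams (the per-stream rate and all figures are taken from the text above; nothing new is assumed).

// Number of 1 MB/s writing streams implied by the quoted rates.
#include <cstdio>

int main() {
    const double perStreamMB = 1.0;   // MB/second per stream, from the text
    struct { const char* expt; double rateMB; } req[] = {
        {"ATLAS/CMS", 100.0}, {"ALICE", 1500.0}, {"CDF Run II", 60.0}};
    for (const auto& r : req)
        std::printf("%-10s %6.0f MB/s -> %4.0f streams\n",
                    r.expt, r.rateMB, r.rateMB / perStreamMB);
    // NA45 has demonstrated 32 concurrent streams; CDF would need about 60.
}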

4.0.2 Scalability

In version 4 of Objectivity it was not possible to create individual databases (DBs) larger than 2 GB, which, combined with the limit of 2^16 databases allowed in a federation, limited the maximum volume of data one could store in a federated database. In version 5, RD45 has verified that the 2 GB limit no longer exists; files up to 25 GB in size have been created. The file size is now only limited by the underlying filesystem: it is practically unlimited on 64-bit systems. The largest federation created to date is of order 0.5 TB (limited only by disk space). Many federations containing over 1000 DBs have been built. Using 25 GB databases, just 40 are needed to build a federation of 1 TB. In practice, building federations larger than of the order of 1 TB requires an interface to a mass storage system. In version 6 of Objectivity, due at the end of 1998, a mapping between containers and files will be possible. With this modification, one could build very large federations using small files, of the order of 1 GB, which may be advantageous over using large files. Attempts to build very large federations using HPSS-managed storage are currently under way, as CERN plans to use Objectivity/HPSS for production with about 300 TB of data for two experiments, COMPASS and NA45.

4.0.3 Data reclustering

The data reclustering issue has been investigated both in ATLAS and CMS. To study the potential performance gains of reclustering, a prototype has been developed in CMS, based on a mechanism for clustering data into collections and accessing the collections with read-ahead optimization. The read-ahead optimization allows the clustering of different types of objects to be managed in an independent way, and also makes it possible for the batch reclustering operation to conserve the database size while preserving optimal throughput. The schedule only needs to be computed once per job, and this allows the optimizer to use fairly complex computations.

4.0.4 Data Import/Export

In Objectivity/DB it is possible to copy a database and attach it to another federation. It is required that the target federation be compatible, i.e. have a consistent schema for at least the subset of objects in the database that are copied, and share database parameters such as the database page size. A copied database may be (re-)attached with a new database ID, in which case the object identifiers of all contained objects are automatically updated. A more complete solution, however, is to provide a deep copy utility, which copies objects and the objects that they reference. Such a tool has been developed by BaBar to assist in their data import/export. An associated problem is that of maintaining consistency between federations. BaBar intend to use a simple database-id allocation scheme which ensures that the database IDs used by different federations are compatible.
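The sketch below illustrates the kind of non-conflicting database-id allocation just mentioned: each federation draws its database IDs from a disjoint range, so a database copied from one federation can be attached to another without an ID collision. The scheme, the range size and the federation names are purely hypothetical; the text does not specify BaBar's actual implementation.

// Illustration of a non-conflicting database-id allocation scheme.
#include <cstdio>
#include <map>
#include <string>

class DbIdAllocator {
    static constexpr int kRangeSize = 4096;  // ids reserved per federation (arbitrary)
    std::map<std::string, int> next_;        // next free id in each federation's range
public:
    explicit DbIdAllocator(const std::map<std::string, int>& rangeIndex) {
        for (const auto& kv : rangeIndex)
            next_[kv.first] = kv.second * kRangeSize + 1;
    }
    int allocate(const std::string& federation) {
        return next_.at(federation)++;       // ids never overlap across federations
    }
};

int main() {
    // Hypothetical federation names, each given its own id range.
    DbIdAllocator alloc({{"prompt-reco", 0}, {"physics-analysis", 1}});
    std::printf("prompt-reco db id:      %d\n", alloc.allocate("prompt-reco"));
    std::printf("physics-analysis db id: %d\n", alloc.allocate("physics-analysis"));
}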

4.0.5 Production database services

The federation catalogue is crucial for the database, and it should be protected from possible corruption and failures. A number of production database services will be established at CERN during 1998, most importantly a dedicated server on which the lockserver for the primary partition would run. This machine would also contain a copy of the federated database catalog and schema. These servers would have mirrored filesystems for the operating system and database data, in order to offer maximum reliability. The data servers on which the Objectivity/DB servers (AMS servers) would run would have several hundred GB of disk space, typically managed by HPSS. At least one data server per experiment would be non-HPSS managed, for data that must reside permanently online, such as calibration data, production control and data collections of event tags. Such services will be established for ALICE, NA45, ATLAS, CMS and COMPASS.

At SLAC the BaBar experiment will take some 200 TB of data per year starting in 1999. They intend to use a combination of Objectivity/DB and HPSS on which to base their event store. The details of the data model are given elsewhere.

4.1 Milestone II

Access to a given Objectivity/DB federation is determined by setting an environment variable. Helper classes, distributed as part of the HepODBMS class libraries, reduce the amount of knowledge needed about the details of frequently performed database operations, for example initialization et cetera. At the interactive analysis stage, browsers are being developed to navigate through the database, find an appropriate collection of events and analyse it. The system allows a user to access the data as a logically single entity, without the need to know the physical location of the data, or the details of the staging system or of a book-keeping system.

Collections

Collections of persistent objects, even collections of collections, are of obvious importance for the users and the database administrator. Database administrators will try to optimize the overall system performance by redefining the physical clustering of event collections shared by one or more physics analysis groups. A workshop was held in February 1998 at CERN, and the key requirements have been defined:

- a single class for the user interface
- an STL-like interface, including a forward iterator
- support for very large collections of events
- support for a "description" of the collection
- set-style operations based on a unique event identifier
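A minimal C++ sketch of what such a collection interface might look like is given below. It is purely illustrative (plain STL underneath, hypothetical names); it is not the RD45 prototype, only a concrete rendering of the requirements listed above.

// Illustrative event collection: unique event identifiers, a
// description, an STL-style forward iterator, and a set-style operation.
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>
#include <string>

class EventCollection {
    std::string description_;
    std::set<long> ids_;          // unique event identifiers
public:
    explicit EventCollection(std::string description)
        : description_(std::move(description)) {}

    const std::string& description() const { return description_; }
    void add(long eventId) { ids_.insert(eventId); }

    // Forward iteration over the event identifiers.
    using const_iterator = std::set<long>::const_iterator;
    const_iterator begin() const { return ids_.begin(); }
    const_iterator end() const { return ids_.end(); }

    // Set-style operation based on the unique event identifier.
    EventCollection intersect(const EventCollection& other,
                              const std::string& description) const {
        EventCollection result(description);
        std::set_intersection(ids_.begin(), ids_.end(),
                              other.ids_.begin(), other.ids_.end(),
                              std::inserter(result.ids_, result.ids_.end()));
        return result;
    }
};

int main() {
    EventCollection dimuon("events with two muons");
    EventCollection highEt("events with large sum Et");
    const long a[] = {1, 2, 3, 5};
    const long b[] = {2, 5, 8};
    for (long id : a) dimuon.add(id);
    for (long id : b) highEt.add(id);

    EventCollection both = dimuon.intersect(highEt, "two muons and large sum Et");
    std::printf("%s:", both.description().c_str());
    for (long id : both) std::printf(" %ld", id);
    std::printf("\n");
}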

The goal is to develop a first prototype of the event collection classes in time for the RD45 workshop at the end of April 1998, and to incorporate a version of these classes in the 98A release of LHC++, scheduled for June/July 1998.

LHC++ analysis model

A new approach to analysis is possible if the data is stored in an ODBMS. Compared to the NTUPLEs with which most HEP users are quite familiar, it is no longer necessary to repeat the entire NTUPLE-generation stage if an additional variable is required in the NTUPLE. This is possible because of the associations provided by the ODBMS. A scheme for performing interactive data analysis has been developed in the context of LHC++.

Schema Handling Issues

Each persistent-capable class is given a type number, allocated sequentially. Maintaining the type-numbering scheme of an application (or library) in agreement with the target federated database schema is therefore an essential requirement. To remove the type number coupling between different packages introduced by the sequential type number allocation, Objectivity offers the so-called "named schema" feature. This feature allows the type number space to be divided into named subsets when running the DDL processor. Each of these individual named schemata reserves a range of 64K type numbers, allowing the developers of different packages to avoid mutual conflicts. Some 16 schema names have already been allocated for the various LHC++ packages. User schemata are then possible, allocating a named schema for each user on demand.

4.2 Milestone III

4.2.1 ODBMS-MSS interface

The production version is scheduled to be delivered by the end of 1998. A prototype of the interface has been produced for IBM AIX systems, the only system on which HPSS is currently officially supported. The interface is provided in such a way that end-user sites are able to optimize the I/O layer, even substituting a different mass storage system, provided that a compatible interface is written. Objectivity/DB applications will be unaware that the associated data resides in HPSS-managed storage. When an object is accessed it will be returned immediately if the corresponding database is already on disk; if not, the client will block on the implicit database open whilst the server, through HPSS, causes the necessary file to be reloaded from tape.

The current HPSS-NFS and "simple API" interfaces, in which one block is read at a time, are efficient for very large block sizes, between 1 and 10 MB. Unfortunately, databases typically transfer much smaller amounts of data (the page size we are considering for the Run II CDF Objectivity/DB database is 8 kB). A better strategy would be to read multiple blocks at a time and hence minimize the interaction with the HPSS server.
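The sketch below illustrates such a multi-block (read-ahead) strategy in ordinary C++, with a local file standing in for an HPSS-resident database: many 8 kB page requests are satisfied from one larger fetch, reducing the number of round trips to the storage layer. The class, parameter values and file name are hypothetical; this is not the Objectivity or HPSS API.

// Read-ahead page reader: fetch a window of pages per request to the
// (slow) storage layer and serve individual 8 kB page reads from it.
#include <cstddef>
#include <cstdio>
#include <vector>

class PageReader {
    std::FILE* file_;
    std::size_t pageSize_;
    std::size_t window_;              // pages fetched per request
    std::vector<char> buffer_;
    long firstCached_ = -1;           // first page number held in the buffer
public:
    PageReader(std::FILE* f, std::size_t pageSize = 8192, std::size_t window = 128)
        : file_(f), pageSize_(pageSize), window_(window),
          buffer_(pageSize * window) {}

    // Return a pointer to page n, refilling the window only when the
    // requested page falls outside the cached range.
    const char* page(long n) {
        if (firstCached_ < 0 || n < firstCached_ ||
            n >= firstCached_ + static_cast<long>(window_)) {
            std::fseek(file_, n * static_cast<long>(pageSize_), SEEK_SET);
            std::size_t got = std::fread(buffer_.data(), 1, buffer_.size(), file_);
            (void)got;                // short reads near EOF are fine for this sketch
            firstCached_ = n;         // one large request instead of many small ones
        }
        return buffer_.data() + (n - firstCached_) * pageSize_;
    }
};

int main(int argc, char** argv) {
    // Any local file stands in for the tape-backed database here.
    std::FILE* f = std::fopen(argc > 1 ? argv[1] : "bench.dat", "rb");
    if (!f) return 1;
    PageReader reader(f);
    long checksum = 0;
    for (long p = 0; p < 1000; ++p) checksum += reader.page(p)[0];
    std::printf("checksum of first bytes of 1000 pages: %ld\n", checksum);
    std::fclose(f);
}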

However, the performance implications of the current prototype are not yet well understood, and it is expected that stress testing over the coming months will suggest areas where improvements are required. An alternative solution would be to use HPSS as a conventional staging system and let Objectivity/DB read and write directly to standard UNIX filesystems. This would avoid the performance overheads associated with reading from (or writing to) HPSS-managed disk storage, but would require additional space management of the disk pool. At CERN this direction is now being pursued; the existing tape staging software already provides such a capability and is being interfaced to HPSS. This is currently considered the most viable short term solution.

4.2.2 OBJY-HPSS installation at CERN

In the current HPSS test configuration at CERN, the various HPSS components are distributed across multiple systems. For example, the tape mover(s), disk mover(s) and HPSS nameserver all run on different systems. In addition, an IBM system is currently being used to evaluate the Objy-HPSS prototype interface. As such, this system runs both the Objy server (AMS) and the HPSS disk mover, together with the rest of the environment required by HPSS, such as DCE.

4.2.3 OBJY-HPSS configuration at SLAC

Unlike CERN, SLAC currently plans to run the various HPSS components and the Objectivity/DB server on a single powerful system. Although such a scenario has the advantage of reducing the network overhead involved in the inter-module communication, it is inherently less scalable; it is nevertheless well-suited to the environment at SLAC, where the system will be used to support a single experiment.

4.2.4 Functionality tests

The basic functionality required by the proof of concept prototype has been demonstrated. Tests have also been made to access tape-resident databases. This area needs a significant amount of further study and is going to be the subject of primary attention in 1998. The target for a production-quality interface remains the end of 1998, and it is scheduled for inclusion in Objectivity/DB Version 6.0. The schedule is not unrealistic given the fact that two large volume experiments, BaBar and COMPASS, will start taking data in 1999.

4.2.5 CMS test beam experiences

There has been no integration with HPSS (given the small amount of data, less than 100 GB). The data rates involved were well below 1 MB/sec. The two test beam activities can be considered a production demonstration of the overall LHC++ environment, from data taking to analysis.

The H2 test-beam

Online (event data recording):
- DAQware (ODBMS unaware)
- Objectivity/DB formatter (Objy-dependent)
- Control system (could use ODBMS)
- CDR (dependent on the Objy fault tolerant option)
- Asynchronous data recording (Objy dependent)

Offline data processing:
- reconstruction framework (ODBMS-based)
- interface to simulation
- user persistent classes (Objy dependent)

Interactive analysis environment:
- data browser
- HisOOgrams (ODBMS-based)
- HisOOgrams visualizer (ODBMS-aware)

After a few days of running, the system ran essentially unattended without major problems. The only manual operation was to change the output disk every 9 GB. Further development is expected in 1998.

The X5B test-beam

Online (data recording):
- DAQ/conversion of ZEBRA files
- Objectivity/DB formatter
- Central Data Recording
- Online monitoring/data quality

Offline (data processing):
- simulation framework (interface to GEANT-4)
- analysis and reconstruction framework

Interactive analysis tools:
- HisOOgrams
- HisOOgrams visualizer (HEPExplorer/HEPInventor)

The Objy reformatter performs the following operations: it gets the data from the ZEBRA server using the proxy pattern, creates the databases and containers, creates the event structure, and populates the database. It is clear that the reformatter is very similar in nature to our CdfLoader module.

4.3 Database Administration tools

A tool for monitoring and administering an Objectivity/DB federated database has been developed. A first version has been built using the Objectivity/DB Java binding. Using this tool the database administrator is able to observe, control and manage the basic federated database functionality as well as the autonomous partition and data replication options. The functionality of this tool is divided into three major groups: configuration, handling the functionality of the autonomous partition and data replication options (in other words, it allows the administrator to create or delete partitions, replicate database images, vary partitions on/offline, resynchronise images and so on); control, allowing an administrator to monitor and control the database servers; and statistics, offering the possibility to run a number of tests to check the data transfer throughput of a given autonomous partition.

4.4 Tests with large numbers of images

The data replication option has been tested with up to 100 images, the limit coming from the number of nodes that could conveniently be used for this purpose and not from any limitation in Objectivity/DB. The time taken to both create persistent objects and commit the corresponding transaction increases with the number of images involved. This is expected: not only does the transaction not complete until the data involved has been safely written to disk on all servers, but more network traffic is involved. There have been wide area tests, with two images at CERN and one at Caltech. The data rate is strongly correlated with the hour of the day; the measurements varied from 2 Kbit/second to 20 Kbit/second. The conclusion is that the basic functionality offered by the Data Replication Option behaves as documented. It is important to stress that the required network bandwidth must be made available. Offline replication remains the most appropriate option for large data volumes, with the networks typically used in HEP today. Replication is still a viable solution for small data volumes, such as calibration data.

5 Use of OBJECTIVITY/DB in HEP

AMS (Alpha Magnetic Spectrometer) is an experiment that will take data on the NASA Space Shuttle (to be launched in May 1998) and later on the International Space Station. The AMS collaboration has been using Objectivity/DB in tests and plans to use it to store their production data, slow control parameters and NASA auxiliary data.

ALEPH - started an exercise to convert their ADAMO-based mini-DST to Objectivity/DB.

ALICE - the ALICE offline team is currently focusing on GEANT-4 (which uses Objectivity/DB as its standard output). They plan to study Objectivity/DB based solutions, but only in the context of GEANT-4.

ATLAS - is developing a number of prototype applications using Objectivity/DB, both off-line and on-line.

BaBar - plans to use Objectivity/DB to store their data starting in 1999. They expect to collect about 200 TB of data per year, all of which will be stored in the federated Objectivity/DB database. The HPSS storage manager will also be used.

BELLE - will start taking data in the fall of 1998; they plan to use Objectivity/DB to store detector constants, and later the mini/micro-DST for rapid data analysis.

CHORUS - is using Objectivity/DB for an on-line emulsion scanning database. They are also evaluating Objectivity/DB as a potential solution for the proposed TOSCA experiment.

CMS - is using Objectivity/DB for a number of prototype applications, including the test beam activities. The current baseline assumption is that (as ATLAS) they will use Objectivity/DB coupled with HPSS to store their data as persistent objects in an object database.

NA45 - they have been using Objectivity/DB in production. A number of production runs have been performed, with a total data volume of 30 GB. For 1998 their plan is to make tests of Objectivity/DB together with the central data recording and HPSS, in preparation for the 1999 data run, where a data volume at the TB scale is anticipated.

NA-48 - maintains the detector configuration database in Objectivity/DB. Recently, they initiated a project to store their micro-DST (and perhaps more) in Objectivity/DB (following ZEUS).

RHIC - the RHIC experiments at Brookhaven plan to adopt a common strategy

for their data storage. The current plan is to use Objectivity/DB and HPSS. Experiments involved include BRAHMS, PHENIX, PHOBOS and STAR. Data volumes for both PHENIX and STAR are expected to be at the TB/year scale.

ZEUS - they built a tag database in Objectivity/DB and have been using it in production. This database was built from the physics data in the ADAMO database. The new system is considered significantly more flexible than the old one, and it offers much improved performance, by allowing users to access only the data they need.

6 Comments on comparisons quoted by the ROOT team

The difference in the size of the files comes from two factors: 1) no optimization of the Objy database page size was performed, which can account for perhaps 1/3 of the difference; 2) ROOT uses data compression, and LHC++ at present does not. (It is possible to use data compression with Objectivity/DB. One can compress physical entities (files, containers) or logical entities (VArrays or objects). Compression has been tried by RD45 using compression based on zlib (gunzip), with the default compression method. Tests performed with the ATLAS Production Ntuple showed possible gains in file sizes. One should remember that the gains depend on the data used and on the access patterns, as compression increases the imbalance between sequential and random access speeds. It also increases the CPU load on the server or client.)
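As an illustration of the kind of zlib test mentioned above, the short program below compresses a buffer with zlib's default method and reports the size reduction. The synthetic payload and the build line are our own choices; real gains depend entirely on the data, as the text stresses.

// Compress a synthetic "ntuple-like" buffer with zlib's default method.
// Build with: g++ compress_test.cc -lz
#include <cmath>
#include <cstdio>
#include <vector>
#include <zlib.h>

int main() {
    // Synthetic payload: smoothly varying floats (compresses well;
    // real detector data will behave differently).
    std::vector<float> ntuple(1000000);
    for (std::size_t i = 0; i < ntuple.size(); ++i)
        ntuple[i] = std::sin(0.001 * static_cast<double>(i));

    const uLong srcLen = ntuple.size() * sizeof(float);
    std::vector<Bytef> out(compressBound(srcLen));
    uLongf outLen = out.size();

    // Default compression method and level, as in the RD45 tests.
    if (compress(out.data(), &outLen,
                 reinterpret_cast<const Bytef*>(ntuple.data()), srcLen) != Z_OK)
        return 1;

    std::printf("%lu -> %lu bytes (factor %.2f)\n",
                static_cast<unsigned long>(srcLen),
                static_cast<unsigned long>(outLen),
                static_cast<double>(srcLen) / outLen);
}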

The difference in the time to traverse the database: this was measured by doing the comparison in a very skewed way, basically comparing apples and oranges. The database given to the ROOT team by RD45 was built to demonstrate the ease of traversing from the "tag" to the rest of the data. It was not intended to be "efficient", just to demonstrate the functionality. The difference in database traversal speed went away completely when the Objy database and LHC++ were configured more properly. It is basically impossible to find out from cdf4497 how the various comparisons were made. This is too bad, as there is room for comparisons; however, they should be conducted in a sensible way, to demonstrate that the architecture of the system will scale to our requirements. This is the issue at hand.

The difference in the 1-D histogramming speed: after a few simple changes to make the comparison more fair and meaningful, LHC++ was demonstrated to deliver similar performance to ROOT. A large factor in the histogramming speed was due to the old and unoptimized version used by the ROOT team together with the Objy database (ROOT was 30x faster than the "old" LHC++ histograms; the new version of the LHC++ histogram code is 25x faster than the "old" code). The LHC++ team plan to completely replace the current histograms with a new templated design. LHC++ gave the ROOT team the library which was available at the time the request was made, in order not to slow down the exercise, which was meant as an attempt to show that it is possible to access and manipulate NTUPLEs stored in Objectivity using ROOT.

7 References

"A solution for data handling based on an Object Oriented Database System and its Applications to CDF RUN II", CDF/DOC/COMP UPG/PUBLIC/4346

"Flows and controls for a data handling solution based on an Object Oriented Database System", CDF/DOC/COMP UPG/PUBLIC/4493

"Gedanken experiments for a data handling solution based on an Object Oriented Database System and its Applications to CDF RUN II", CDF/DOC/COMP UPG/PUBLIC/4492

"Implementation plan for a data handling solution based on an Object Oriented Database System", CDF/DOC/COMP UPG/PUBLIC/4503

"RD45 - A persistent Object Manager for HEP", CERN/LHCC

"Object databases and mass storage systems: the prognosis", CERN/LHCC

"ATLAS Computing Technical Proposal", CERN/LHCC

"Object databases and their impact on storage-related aspects of HEP computing", CERN/LHCC 97-7

"Object database features and HEP data management", CERN/LHCC 97-8

"Using an object database and mass storage system for physics analysis", CERN/LHCC

"Status Report of the RD45 project", CERN/LHCC 97-6

"Status Report of the RD45 project", CERN/LHCC 98-x (not yet available to the general public)


More information

A L I C E Computing Model

A L I C E Computing Model CERN-LHCC-2004-038/G-086 04 February 2005 A L I C E Computing Model Computing Project Leader Offline Coordinator F. Carminati Y. Schutz (Editors on behalf of the ALICE Collaboration) i Foreword This document

More information

Parallel Pipeline STAP System

Parallel Pipeline STAP System I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,

More information

Evaluation of the computing resources required for a Nordic research exploitation of the LHC

Evaluation of the computing resources required for a Nordic research exploitation of the LHC PROCEEDINGS Evaluation of the computing resources required for a Nordic research exploitation of the LHC and Sverker Almehed, Chafik Driouichi, Paula Eerola, Ulf Mjörnmark, Oxana Smirnova,TorstenÅkesson

More information

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Operating Systems Lecture 7.2 - File system implementation Adrien Krähenbühl Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Design FAT or indexed allocation? UFS, FFS & Ext2 Journaling with Ext3

More information

Computing at Belle II

Computing at Belle II Computing at Belle II CHEP 22.05.2012 Takanori Hara for the Belle II Computing Group Physics Objective of Belle and Belle II Confirmation of KM mechanism of CP in the Standard Model CP in the SM too small

More information

Geant4 Computing Performance Benchmarking and Monitoring

Geant4 Computing Performance Benchmarking and Monitoring Journal of Physics: Conference Series PAPER OPEN ACCESS Geant4 Computing Performance Benchmarking and Monitoring To cite this article: Andrea Dotti et al 2015 J. Phys.: Conf. Ser. 664 062021 View the article

More information

An Introduction to Software Architecture. David Garlan & Mary Shaw 94

An Introduction to Software Architecture. David Garlan & Mary Shaw 94 An Introduction to Software Architecture David Garlan & Mary Shaw 94 Motivation Motivation An increase in (system) size and complexity structural issues communication (type, protocol) synchronization data

More information

Technical Paper. Performance and Tuning Considerations for SAS on Dell EMC VMAX 250 All-Flash Array

Technical Paper. Performance and Tuning Considerations for SAS on Dell EMC VMAX 250 All-Flash Array Technical Paper Performance and Tuning Considerations for SAS on Dell EMC VMAX 250 All-Flash Array Release Information Content Version: 1.0 April 2018 Trademarks and Patents SAS Institute Inc., SAS Campus

More information

Technology Insight Series

Technology Insight Series IBM ProtecTIER Deduplication for z/os John Webster March 04, 2010 Technology Insight Series Evaluator Group Copyright 2010 Evaluator Group, Inc. All rights reserved. Announcement Summary The many data

More information

NetVault Backup Client and Server Sizing Guide 2.1

NetVault Backup Client and Server Sizing Guide 2.1 NetVault Backup Client and Server Sizing Guide 2.1 Recommended hardware and storage configurations for NetVault Backup 10.x and 11.x September, 2017 Page 1 Table of Contents 1. Abstract... 3 2. Introduction...

More information

Deduplication File System & Course Review

Deduplication File System & Course Review Deduplication File System & Course Review Kai Li 12/13/13 Topics u Deduplication File System u Review 12/13/13 2 Storage Tiers of A Tradi/onal Data Center $$$$ Mirrored storage $$$ Dedicated Fibre Clients

More information

CERN and Scientific Computing

CERN and Scientific Computing CERN and Scientific Computing Massimo Lamanna CERN Information Technology Department Experiment Support Group 1960: 26 GeV proton in the 32 cm CERN hydrogen bubble chamber 1960: IBM 709 at the Geneva airport

More information

IBM V7000 Unified R1.4.2 Asynchronous Replication Performance Reference Guide

IBM V7000 Unified R1.4.2 Asynchronous Replication Performance Reference Guide V7 Unified Asynchronous Replication Performance Reference Guide IBM V7 Unified R1.4.2 Asynchronous Replication Performance Reference Guide Document Version 1. SONAS / V7 Unified Asynchronous Replication

More information

LHC-B. 60 silicon vertex detector elements. (strips not to scale) [cm] [cm] = 1265 strips

LHC-B. 60 silicon vertex detector elements. (strips not to scale) [cm] [cm] = 1265 strips LHCb 97-020, TRAC November 25 1997 Comparison of analogue and binary read-out in the silicon strips vertex detector of LHCb. P. Koppenburg 1 Institut de Physique Nucleaire, Universite de Lausanne Abstract

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

The CMS data quality monitoring software: experience and future prospects

The CMS data quality monitoring software: experience and future prospects The CMS data quality monitoring software: experience and future prospects Federico De Guio on behalf of the CMS Collaboration CERN, Geneva, Switzerland E-mail: federico.de.guio@cern.ch Abstract. The Data

More information

5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1

5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1 5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1 5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks Amdahl s law in Chapter 1 reminds us that

More information

An SQL-based approach to physics analysis

An SQL-based approach to physics analysis Journal of Physics: Conference Series OPEN ACCESS An SQL-based approach to physics analysis To cite this article: Dr Maaike Limper 2014 J. Phys.: Conf. Ser. 513 022022 View the article online for updates

More information

RealDB: Low-Overhead Database for Time-Sequenced Data Streams in Embedded Systems

RealDB: Low-Overhead Database for Time-Sequenced Data Streams in Embedded Systems RealDB: Low-Overhead Database for Time-Sequenced Data Streams in Embedded Systems Project Report Submitted to the Faculty of the Rochester Institute of Technology, Computer Science Department In partial

More information

Specifying Storage Servers for IP security applications

Specifying Storage Servers for IP security applications Specifying Storage Servers for IP security applications The migration of security systems from analogue to digital IP based solutions has created a large demand for storage servers high performance PCs

More information

Mass-Storage Structure

Mass-Storage Structure Operating Systems (Fall/Winter 2018) Mass-Storage Structure Yajin Zhou (http://yajin.org) Zhejiang University Acknowledgement: some pages are based on the slides from Zhi Wang(fsu). Review On-disk structure

More information

File Management. Chapter 12

File Management. Chapter 12 File Management Chapter 12 Files Used for: input to a program Program output saved for long-term storage Terms Used with Files Field basic element of data contains a single value characterized by its length

More information

Multi-version Data recovery for Cluster Identifier Forensics Filesystem with Identifier Integrity

Multi-version Data recovery for Cluster Identifier Forensics Filesystem with Identifier Integrity Multi-version Data recovery for Cluster Identifier Forensics Filesystem with Identifier Integrity Mohammed Alhussein, Duminda Wijesekera Department of Computer Science George Mason University Fairfax,

More information

The ATLAS EventIndex: an event catalogue for experiments collecting large amounts of data

The ATLAS EventIndex: an event catalogue for experiments collecting large amounts of data The ATLAS EventIndex: an event catalogue for experiments collecting large amounts of data D. Barberis 1*, J. Cranshaw 2, G. Dimitrov 3, A. Favareto 1, Á. Fernández Casaní 4, S. González de la Hoz 4, J.

More information

ALICE ANALYSIS PRESERVATION. Mihaela Gheata DASPOS/DPHEP7 workshop

ALICE ANALYSIS PRESERVATION. Mihaela Gheata DASPOS/DPHEP7 workshop 1 ALICE ANALYSIS PRESERVATION Mihaela Gheata DASPOS/DPHEP7 workshop 2 Outline ALICE data flow ALICE analysis Data & software preservation Open access and sharing analysis tools Conclusions 3 ALICE data

More information

Contents Overview of the Compression Server White Paper... 5 Business Problem... 7

Contents Overview of the Compression Server White Paper... 5 Business Problem... 7 P6 Professional Compression Server White Paper for On-Premises Version 17 July 2017 Contents Overview of the Compression Server White Paper... 5 Business Problem... 7 P6 Compression Server vs. Citrix...

More information

Memory - Paging. Copyright : University of Illinois CS 241 Staff 1

Memory - Paging. Copyright : University of Illinois CS 241 Staff 1 Memory - Paging Copyright : University of Illinois CS 241 Staff 1 Physical Frame Allocation How do we allocate physical memory across multiple processes? What if Process A needs to evict a page from Process

More information

Technical Documentation Version 7.4. Performance

Technical Documentation Version 7.4. Performance Technical Documentation Version 7.4 These documents are copyrighted by the Regents of the University of Colorado. No part of this document may be reproduced, stored in a retrieval system, or transmitted

More information

Chapter 10: Mass-Storage Systems. Operating System Concepts 9 th Edition

Chapter 10: Mass-Storage Systems. Operating System Concepts 9 th Edition Chapter 10: Mass-Storage Systems Silberschatz, Galvin and Gagne 2013 Objectives To describe the physical structure of secondary storage devices and its effects on the uses of the devices To explain the

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

OS-caused Long JVM Pauses - Deep Dive and Solutions

OS-caused Long JVM Pauses - Deep Dive and Solutions OS-caused Long JVM Pauses - Deep Dive and Solutions Zhenyun Zhuang LinkedIn Corp., Mountain View, California, USA https://www.linkedin.com/in/zhenyun Zhenyun@gmail.com 2016-4-21 Outline q Introduction

More information