
CHARACTERISTICS OF SCIENTIFIC DATABASES

Arie Shoshani, Frank Olken, and Harry K.T. Wong
Computer Science Research Department
University of California, Lawrence Berkeley Laboratory
Berkeley, California

The purpose of this paper is to examine the kinds of data and usage of scientific databases and to identify common characteristics among the different disciplines. Most scientific databases do not use general purpose database management systems (DBMS). The main reason is that they have data structures and usage patterns that cannot be easily accommodated by existing DBMSs. It is the purpose of this paper to identify the special database management needs of scientific databases, and to point out directions for further research specifically oriented to these needs. We discuss the different types of scientific databases, and list the properties identified for them. Example applications are then analyzed with respect to the types of data and their characteristics, and summarized in two tables. Conclusions are drawn as to the preferable data management methods needed in support of scientific databases.

1. INTRODUCTION

This document is a result of numerous interviews with scientists, mostly from Lawrence Berkeley Laboratory, spanning several different scientific disciplines. The purpose of these interviews was to examine the kinds of data and usage of scientific databases in order to identify common characteristics among the different disciplines. In the past, we have studied statistical databases, which are databases that are primarily collected for statistical analysis purposes. A summary of work in statistical databases can be found in [Shoshani 82]. We expected that some of the observations and techniques developed for statistical databases would be useful for scientific databases. Indeed, we found this to be the case. The similarities are pointed out throughout this document where appropriate.
It should not be surprising that common characteristics exist, because many scientific databases are often subject to statistical analysis. However, as discussed below, scientific databases have additional stages of data collection and analysis that introduce more complexity and challenges.

In section 2, different types of scientific databases are described. In order to describe the common properties between several example applications, a list of characteristics for the different types of data is given in section 3. In section 4, a representative example of a scientific application is delineated with respect to the list of characteristics. Additional examples have been similarly analyzed, but because of space limitations are not described in this paper. The description and analysis of these example applications are given in [Shoshani et al 84]. The characteristics of these example applications are summarized in two tables, which appear at the end of this document. In section 5 we discuss the implications of our observations for desirable database techniques for scientific databases, and propose areas for further investigation. Section 6 is a short summary section.

2. TYPES OF SCIENTIFIC DATA

The scientific databases described to us during the interviews were analyzed in order to identify similar data structures, data characteristics and data usage among different applications. We found it convenient to distinguish between different types of scientific data. The important features for each type were identified, and different examples of scientific data were categorized accordingly. In this section we describe the data types and their main features.

Proceedings of the Tenth International Conference on Very Large Data Bases. Singapore, August, 1984

2.1. EXPERIMENT DATA

Most scientific data result from experiments and simulations. Data from experiments are usually measurements of some physical phenomena, such as the collision of particle beams, or the spectra generated by molecules in a strong magnetic field. Data from simulations typically result from complex computations derived by using values from the previous time interval. Both experiment and simulation data have similar characteristics, and therefore are considered jointly. In order to simplify the terminology used here, we refer to such data as experiment data, regardless of whether they are experiment or simulation data.

Experiment data can be classified according to three characteristics: regularity, density, and time variation.

Regularity refers to the pattern of the points or coordinates for which values are measured or computed. For example, in physics experiments, detectors are placed in a specific configuration. If the configuration describes a regular grid or some other geometric structure, the experiment is said to have (spatial) regularity. Similarly, many simulations assume some regular grid for which values are computed, and therefore have spatial regularity. In addition, if values are measured or computed at regular time intervals, then time can be considered as another regular coordinate of the data. In general, regularity implies that a mapping between the coordinates of measured values and the storage locations of these values can be made by means of a computation (such as array linearization, which is simply a mapping from multi-dimensional space to linear space, similar to FORTRAN array mapping). Therefore, in such cases it is not necessary to store the coordinate values with each measured data value, resulting in storage savings and fast random access. On the other hand, when spatial irregularity exists it is necessary to enumerate the data points, and store their identifiers with the data values.

Density indicates whether all the potential data points have actual values associated with them.
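
The contrast between computed and stored coordinates can be illustrated with a minimal sketch. It assumes a small regular grid with two spatial dimensions and one time dimension; the function names, the grid shape, and the sample value are hypothetical and are not taken from any of the applications studied. The first part linearizes a grid index in column-major order (the FORTRAN convention), so no coordinates need to be stored. The second part stores identifiers explicitly with the values, as is needed when the points are irregular or only a few of them have values.

```python
from typing import Dict, Sequence, Tuple


def linearize(index: Sequence[int], shape: Sequence[int]) -> int:
    """Map a multi-dimensional grid index to a linear storage offset.

    Column-major order is used (the first index varies fastest),
    which is the FORTRAN array mapping convention.
    """
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same number of dimensions")
    offset = 0
    stride = 1
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} is out of range for a dimension of size {n}")
        offset += i * stride
        stride *= n
    return offset


def delinearize(offset: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Inverse mapping: recover the grid index from a linear offset."""
    index = []
    for n in shape:
        index.append(offset % n)
        offset //= n
    return tuple(index)


# Regular and dense: a hypothetical 4 x 3 grid observed at 2 time steps.
# Values live in a flat list; coordinates are never stored.
shape = (4, 3, 2)
dense = [0.0] * (shape[0] * shape[1] * shape[2])
dense[linearize((1, 2, 1), shape)] = 7.5

# Irregular or sparse: identifiers are stored together with the values.
sparse: Dict[Tuple[int, int, int], float] = {(1, 2, 1): 7.5}
```

In the dense case a random access costs one arithmetic computation; in the enumerated case each value carries its identifier, which costs storage but avoids reserving space for points that have no value.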
For example, simulation data of fluid motion computed on a regular grid would have data values (for velocity, direction, etc.) computed for each point of the grid, and therefore the data is considered dense. On the other hand, in many experiments a large number of measurements that are below a certain threshold are discarded and never recorded. In fact, the level of sparseness can be quite high, i.e. only a small fraction of the potential data points have recorded values. For example, in physics experiments of colliding particle beams, the measured data is only for resulting sub-particles, which occur over a small portion of the detectors that are distributed in space. Sparsity implies a large number of null values which may be compressed out. The compression technique chosen should depend on the access patterns to the data, such as whether the data are accessed sequentially or randomly. Access patterns are discussed in the next section.

Time variation refers to the change of coordinates over time; i.e. the points for which data values are measured or computed change their position from one time unit to another. For example, consider some material that is bent in the course of an experiment. Before the experiment starts a set of points is selected for measuring the material's behavior (such as stress, voltage, temperature). During the experiment the selected points may change their position as a result of the bending action. Time variation is a characteristic found mostly in simulations where a mesh of points is allowed to change its position over time during the simulation process. These simulation methods are generally called adaptive mesh techniques. Time variation adds an important requirement. In addition to storing the coordinates of points for every time interval, it is necessary to maintain the relationships between the points as they existed in the original mesh.
This is needed in order to be able to reconstruct the time sequence of points that correspond to the same original point, and in order to find neighboring points to a given point at any given time.

2.2. ASSOCIATED DATA

In addition to the experiment data discussed above, there exist data in support of the experiments, and data that are generated from the experiment data. Support data fall into two types which we call configuration data and instrumentation data. Similarly, generated data fall into three types: analyzed data, summary data, and property data. These types are discussed below. To distinguish these additional data types from the experiment data, we refer to them collectively as associated data.

2.2.1. Configuration data

Configuration data are data that describe the initial structure of an experiment or simulation. For example, in simulating heat transfer through buildings, the building layout has to be described. Similarly, the configuration of an experiment describes the position of different devices and detectors. The configuration layout actually determines the regularity (or irregularity) of the experiment data mentioned above. Usually, it does not change in the course of the experiment or simulation. However, it can change between experiments or simulations. It is important to keep track of these changes and to associate the correct configuration data with the corresponding experiment data.

2.2.2. Instrumentation data

Instrumentation data consist of descriptions of the different instruments and substances used in an experiment, and their changes over time. These data are crucial for the correct analysis of the experiment data. They include information such as the pressure and temperature of a gas used in an experiment and their changes over time, drift of voltage over time, and the characteristics of detectors and devices as measured before each experiment or a series of experiments. They also include the log of experiment operations, such as the time that a defective analog-to-digital converter was replaced, and who was in charge of it. Unfortunately, some of this information is collected into unrelated files and log books, thus making its association with the experiment data a tedious task that is prone to errors.

2.2.3. Analyzed data

The previous two data types are essential in order to support the analysis of experiment data. The analysis process produces many databases that also need to be managed along with their relationships to the experiment data they were derived from and to each other. The analysis process may require several steps. For example, in physics experiments of colliding particle beams, a preliminary histogram over the experiment data can be done in order to estimate parameters that are later used to interpret the calibration data of detectors in the next step of the analysis. For each collision, called an event, the tracks of sub-particles produced are reconstructed and kept in a database. From the track data, another database for the event data can be derived, describing the kind of sub-particles produced and their characteristics. Additional steps use databases from this and earlier stages to generate yet more data.
It is important to capture the analysis process, the input and output databases of each step, and the relationships between the steps.

2.2.4. Summary data

Similar to statistical databases, which deal with statistical summaries (aggregations) of data sets, scientific databases are often aggregated. For example, in experiments of heat transfer in buildings, the amount of heat lost or gained can be averaged over several points of a wall, summed over entire rooms, or aggregated over days into months. Another example is the generation of histograms from many experiments to determine the likelihood of a certain phenomenon. As in the case of statistical databases, there is a need to organize, search and browse collections of summary data, and to preserve their relationship to the lower level data from which they were derived.

2.2.5. Property data

In any scientific field, the summary of information learned over the years is useful to the community at large. There is a substantial amount of work devoted to the organization and classification of properties of materials, substances, and particles. For example, there are several systems devoted to the storage and retrieval of chemical substance properties. Many property databases cannot now be accessed on-line. The data is only available in periodically published books, and may not be up-to-date. Property data is nonuniform: it contains numeric, text, and bibliographic data, as well as images and graphs. This is one of the reasons that for each scientific area special purpose systems have been developed. Data management systems that can deal with such diversity of data types are not generally available. In addition, because of the complex terminology involved with such data, sophisticated search and browsing capabilities are needed.

3. CHARACTERISTICS IDENTIFIED FOR SCIENTIFIC DATA

Using the classifications of data types described in the previous section, it was easier to identify common characteristics and usage of the data.
For each classification we have looked for certain characteristics that seem to exist across scientific applications. These characteristics are described in this section. Since the characteristics of experiment data are not necessarily the same as those of associated data, they are described separately. In the next section, we describe example applications in terms of these characteristics. The terms that are used for each characteristic are shown in italics in the text below. The reader may refer to the leftmost columns of table 1 and table 2 for the list of characteristics of experiment data and associated data, respectively.

3.1. CHARACTERISTICS OF EXPERIMENT DATA

1) Identifier

The identifier is that part of the data that identifies each data point uniquely (also called a key). In the case of experiment data the identifier is usually a composite key of spatial coordinates and a time coordinate. Since the identifier has multiple dimensions, the characteristics of regularity, sparsity, and time variation, discussed in the previous section, apply naturally. The concept of a multi-dimensional identifier is similar to that of category attributes in statistical databases [Shoshani 82]. This concept is quite dominant in experiment data (as was the case with statistical databases) because the data is mostly accessed with respect to its identifier. We expand on this point below in the section on access patterns. The identifier is said to be regular if each of its dimensions is ordered in regular intervals. It is sparse if only a fraction of the points in the cross product of the dimensions have data associated with them; otherwise it is dense. Time variation implies that the coordinates of the identifier, regardless of whether it is regular or irregular, vary over time.

2) Access pattern

Access pattern refers to the most typical forms of data access. For example, an analysis program may follow a track of a sub-particle, or a simulation program may need the nearest neighbors of a point in order to calculate the next data point. Note that in these examples the access of points is relative to the (spatial) identifier coordinates, and not the measured or calculated data values. This is typical of the access pattern of experiment data. The reason for distinguishing between the different types of access patterns is that they imply different requirements for physical database organizations, as discussed below in the implications section. We distinguish between two aspects of the access pattern. The access type is the type of access of a single query (or a step of the computation). The access sequence refers to the relationship between queries, i.e. whether the selection of a query depends on previous queries.

2a) Access type

There are three access types that we found useful to identify. An exact match means that the identifier of a point was specified precisely in the query. A range type implies that a range of possible points was identified. Since the identifier is multi-dimensional, each dimension is involved in the specification of the range. A proximity type indicates that the neighboring points around a given point are desired.

2b) Access sequence

Given a query of a particular type, the access sequence indicates whether the identifier(s) of the next query relate to the identifier(s) of the previous query.
A local access sequence implies that the identifier(s) of the current query are close to the identifier(s) of the previous query. For example, following a particle track involves a local access sequence, since each successive point is close to the previous point. A non-local access sequence means that there is no relationship between the identifiers of successive queries. In a linear access sequence, the sequence of the identifiers of successive queries follows successive intervals of the dimensions of the identifier. For example, following the points of a mesh according to the regular intervals of the dimensions of the mesh is considered a linear access. An arbitrary access sequence indicates that the order of processing the data points is unimportant. Such access is usually used when the entire data set (or some large subset) needs to be processed for analysis or summary statistics. Access sequence should be thought of in conjunction with access type. For example, searching for a particular point in space where this point is not related to points of the previous query implies an exact access type and a non-local access sequence. However, searching for a collection of points in the same neighborhood while following a certain path implies a proximity access type and a local access sequence.

3) Database size

Experiments are often repeated in order to verify a certain phenomenon, to determine the statistical behavior of the experiment, or to discover a rare event that occurs only in a small fraction of the experiments. In many cases the results of each experiment can be processed independently. We call each independent part of an experiment a unit. An example of an independent unit is a single collision (event) in particle physics, or a single time step calculation of a simulation. It is important to identify such units and to determine their size because they can be processed independently of other units and often in parallel.
In addition, if units are small enough they can be processed entirely in main memory, rather than brought piecewise from secondary storage. Analysis and summarization of experiment data is usually performed over a collection of experimental units. The size of a collection is significant because it refers to the quantity of data that analysis queries may need to access. Such queries may select a portion of the collection, or may process the entire set to derive summaries or statistics. A collection may be very large, as is the case with experiments that are run over a period of months because the desired event is rare, because a large number of runs is desirable for statistical analysis, or because extensive parametric studies are desirable. There is no logical limit to the total amount of data that can be collected by repeating experiments and simulations. The limitations are usually cost and resources. Nevertheless, it is interesting to identify the total amount of data that scientists keep active and available. This category is simply referred to as total size. All size figures shown in table 1 are only intended to show order of magnitude.
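
As a recap of the identifier and access pattern characteristics above, the three access types and the local versus non-local distinction can be sketched over a sparse, multi-dimensional identifier. Everything here is an illustrative assumption: the identifier is taken to be a triple of grid indices (x, y, t), "neighboring" is taken to mean within a Chebyshev distance of one grid step, and the linear scans stand in for whatever physical organization a real system would use.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[int, int, int]   # hypothetical identifier: (x, y, t) grid indices
Store = Dict[Point, float]     # sparse store: only non-null points are kept


def exact_match(data: Store, point: Point) -> Optional[float]:
    """Exact match: the identifier is specified precisely in the query."""
    return data.get(point)


def range_query(data: Store, low: Point, high: Point) -> Store:
    """Range: every dimension takes part in the range specification."""
    return {
        p: v for p, v in data.items()
        if all(lo <= c <= hi for c, lo, hi in zip(p, low, high))
    }


def proximity(data: Store, center: Point, radius: int = 1) -> Store:
    """Proximity: stored neighbors of a point (Chebyshev distance is assumed)."""
    return {
        p: v for p, v in data.items()
        if p != center and max(abs(c - q) for c, q in zip(p, center)) <= radius
    }


def is_local(previous: Point, current: Point, radius: int = 1) -> bool:
    """Access sequence: is the current query close to the previous one?"""
    return max(abs(c - q) for c, q in zip(current, previous)) <= radius


readings: Store = {
    (0, 0, 0): 1.0, (1, 0, 0): 2.0, (1, 1, 0): 3.0,
    (1, 1, 1): 5.0, (5, 5, 0): 4.0,
}
```

Following a track would issue a proximity query at each step with `is_local` holding between successive steps, whereas jumping to an unrelated point corresponds to an exact match in a non-local sequence.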

4) Associated data

The different categories of associated data shown in Table 1 (configuration, instrumentation, analyzed, and summary) simply indicate whether such data exists for the different example applications. Note that property data is not mentioned since property data is not usually associated with a single experiment, but rather summarizes data over many experiments.

3.2. CHARACTERISTICS OF ASSOCIATED DATA

We chose to emphasize somewhat different characteristics for associated data, because their structure and usage are different from those of experiment data. The access pattern and size characteristics are similar to those of experiment data, but the identifier characteristics are more diverse. They are described as part of the data modeling characteristics. We also added usage characteristics and non-standard data types.

1) Access pattern

The access pattern characteristics of associated data fall into similar categories as those of experiment data. However, while access patterns of experiment data refer to accessing data points with respect to their identifiers, the access patterns of associated data are with respect to any attributes, whether they are thought of as identifiers or measured data. The reason is that in associated data the concept of an identifier (or category attributes) is not so dominant. For example, when the experiment data of a particle physics experiment are analyzed, the resulting database represents tracks and events rather than the individual data points. The identifiers of the original data points no longer exist in the analyzed data. Instead the tracks and events may be given an identifying number, or some combination of the measured values (such as mass and momentum) may be thought of as the identifier. The categories assigned to access patterns of experiment data above apply to access patterns of associated data as well.
However, we found it necessary to add a partial access type, because it is common to access associated data (especially analyzed and summary data) by specifying predicates (selection criteria) only on part of the attributes. An example is finding all particles with a mass in a certain range that generated a certain number of sub-particles.

2) Data modeling

The data modeling capabilities chosen here are either common to many examples of associated data, or are included because of their importance. Geometric modeling is the capability to describe the geometry of an object (such as an airplane wing), or a collection of objects (such as the position of detectors). The term entities refers to the need to distinguish between multiple entities, which is a basic assumption in all database models (such as relational, hierarchical, etc.). There are situations where the concepts of entities are not naturally applicable, such as with summary data (e.g., a co-variance matrix). The terms hierarchical and networks refer to relationships between entities. A hierarchical characteristic obviously implies a one-to-many relationship between entities of successive levels of the hierarchy, but also implies the possibility that the identifiers (keys) of higher levels propagate down to lower levels. For example, a particle identifier usually propagates down to its sub-particles level, and is concatenated with the sub-particle identifier to form a unique key. A network characteristic indicates the existence of a many-to-many relationship between entities. We use the term generalization in the sense described in [Smith & Smith 77]. Briefly, it is the capability of describing generically the properties that apply to an entire set of objects. For example, the common properties that describe all analog-to-digital converters of a certain type used in a certain experiment should be described only once.
Each individual converter can have its own specific properties, but the generalization capability allows the common properties to be inherited by each individual converter. The existence of multi-dimensional data was explained before in the context of the identifier of experiment data. Although not as common in associated data, the capability to support multi-dimensional data is nevertheless important, especially for analyzed and summary data. We refer to this characteristic as N-dimensional. Meta-data refers to the information necessary to describe the data. However, the intent here is to emphasize the information that is beyond the usual data definition capability provided by most data management systems. An example of such additional information is the source from which an analyzed database was derived, and the person who derived it.

3) Usage

It is often necessary in the analysis process to change the definition of the database schema, such as to add new attributes (columns) or to calculate new attributes from previous attributes. The ability to support such changes dynamically is referred to here as SChawLa v&h. Supporting historical data implies the maintenance of the history of changes made to the database (not only the latest updated version). In the implications section we discuss the different aspects of historical data needed for associated data. An important characteristic of a database is its stability, i.e. infrequent updates. In physical database design there is usually a trade off between the efficiency of retrieval and the efficiency of updating. One can take advantage of stable databases to employ more efficient retrieval algorithms in exchange for slower updating.

4) Non-standard data types

The results of the analysis of scientific data are often presented as graphs. By text we mean not only the usual ability to support character strings of limited size, but also support of unlimited text, such as article abstracts or manual information. The time series data type is important in scientific databases (as it is for statistical databases) because special statistical analysis techniques can be applied to time series. The ability to represent the molecular structure of materials is a special requirement of scientific data. It cannot be thought of as graphs or images, because it is necessary to be able to refer to the details of the structure, such as double bonds between certain atoms. We did not include this category in Table 2 because our examples did not have such a requirement, but it is a well known requirement for chemical property data as can be found in many chemical property publications (e.g. The Journal of Chemical Information and Computer Sciences). There is also a need to represent special symbols, which requires the support of a large character set. By non-scalar data type we mean vectors, matrices, and combinations of these. The ability to refer to such objects by name, to refer to particular elements of the objects (such as the i,j element), and to store newly generated non-scalar objects as part of the database is an essential capability for scientific data. Non-standard data types are discussed in [Hampel & Ries].

5) Database size

The size figures shown in Table 2 are intended to show the amount of associated data that is required to support or is generated from a collection (described in section 3.1, part 3) above) of experiment data.
The bytes figures represent an approximate upper bound, and the percentage figures show the approximate size percentage relative to the size of the experiment data collection.

4. EXAMPLES OF SCIENTIFIC DATABASES

In this section we describe a representative scientific application with respect to its characteristics as defined in the previous section. Altogether, we analyzed ten example applications, but because of space limitations they are described elsewhere [Shoshani et al 84]. We have tried to select these examples so that they cover a diverse range of applications. They include simulations, experiments, as well as property data. In order to give an idea of the kind of applications we analyzed, we describe them briefly below. Then one representative application is described in detail.

The Time Projection Chamber application is designed to record the tracks of sub-particles that result from particle beam collisions. The Limited Track Reconstruction application also deals with sub-particles, but it is designed to collect high resolution measurements on their properties, rather than record their tracks. Hydrodynamics applications are concerned with modeling the flow of fluids, usually using grid methods. Nuclear Magnetic Resonance (NMR) spectroscopy experiments are used to investigate chemical structures. The Heavy Ion Spectrometer application studies the break up process in catastrophic collisions involving heavy ions. The passive solar experiment involves the simulation of heat transfer to study the solar energy performance of residential and industrial buildings. Turbulent flow studies are another example of hydrodynamics modeling, but rather than using grid methods, which require a large number of data points, particle methods are used to model the vorticity of the turbulence. The purpose of the Laser Isotope Separation experiment is to develop a technique for recovering the reusable isotopes from nuclear waste materials.
In addition to these experiment and simulation applications, two examples of property data were also examined. The function of the Particle Data Group project is to compile particle properties data in a highly evaluated and summarized form. The Nuclear Structure Data project is concerned with recording, evaluating, and tabulating data about the structure of atomic nuclei, and the reactions by which nuclei decay from one state to another. Next, we describe the Time Projection Chamber application with respect to the characteristics described in section 3.

4.1. Time Projection Chamber

1) Description

The Time Projection Chamber (TPC) is a device used in high energy physics experiments to record the behavior of sub-particles resulting from particle beam collisions. In a typical experiment, two particle beams collide after they are accelerated to very high speeds. Each such collision, called an event, may produce sub-particles that scatter in different directions at different speeds. Often the particles only graze each other and do not produce the sub-particles desired. Because some events are very rare, and because of the need to be statistically accurate, collision experiments are repeated millions of times.

It is not important here to describe the details of the TPC device, but it is important to understand its operation in order to describe the data generated by it. The TPC is essentially a large cylinder filled with a certain gas. The collisions occur in the center of the cylinder. When particles (or sub-particles) travel through the gas they ionize the gas, leaving tracks where they pass. In order to distinguish between positive, negative, and neutral particles, the TPC is subjected to a magnetic field which causes the charged particles to travel in circular patterns which depend on their charge. At the two ends of the cylinder electrostatic fields are applied and cause the ionized tracks to drift to the ends. Special detectors detect the position and time of the drifting tracks, and measure the charge of ions reaching them. From the position of the detector, the x and y coordinates are determined. From the recorded drift time, the z coordinate can later be calculated. The data is collected through special hardware in a binary form onto tapes.

2) The experiment data

2a) Identifier

Each data point of the experiment data consists of a pulse measurement of a certain detector at a certain time. The identifier of each data point consists, therefore, of the position of the detector and the time. The reasons for considering the identifier as having regularity and sparsity are explained next. The detectors are placed at the two ends of the TPC cylinder on concentric circles at regular intervals. Because the intervals are regular one can compute the actual x-y position of the detectors by knowing the concentric circle number and the ordinal number of the detector on the circle. The identifier points are said to be regular, because their position can be computed from ordinal numbers, similar to what can be done for a mesh of points. Readings exist only for the points representing the tracks of the event.
Thus, most of the detector readings are null (in reality, below a very low threshold). Only about one percent of the potential data points have readings. Thus, the identifier is said to be sparse. There are several techniques that can be used to store identifier data that is regular and sparse. They are discussed in the implications section. The most obvious technique is to throw away the null points and to store the identifier of the non-null points with the data values. This is indeed what is currently done for the TPC experiment.

2b) Access pattern

The first step required before the data can be analyzed is to reconstruct the tracks from the experiment data. The method used is to compute each potential track path, and to verify that data points exist for it. The process of verification involves a search of points along the presumed path. For each such point, the neighboring points are also needed because a pulse has a certain width (for each pulse about 44 neighboring data points exist). The above process exhibits the following access pattern. The access type is exact match and proximity search, because for each pulse one looks for a particular point and a collection of points around it. The access sequence is mostly non-local, because each successive collection of points (representing a pulse) is not necessarily close to the previous search. Once a few points are found, the rest of the points are searched along the presumed path. In this case, the access sequence is local, because each successive collection of points would be close to the previous collection.

2c) Size

In a typical six month period, data for about 4 million events are collected. An event is run about once per second, and generates an average of about 25k bytes. A (particularly interesting) large event may generate about 120k bytes. Thus, the total volume of data for a six month period is about 10^11 bytes, which is stored on about 1350 magnetic tapes.
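
The size estimate can be checked with a few lines of arithmetic. The sketch below reads the figures quoted above as about 4 million events of about 25k bytes each, and assumes that "k" means 1000; the variable names are illustrative only.

```python
# Figures quoted in the text; all are approximate orders of magnitude.
EVENTS_PER_PERIOD = 4_000_000   # about 4 million events in six months
AVG_EVENT_BYTES = 25_000        # about 25k bytes per event (k = 1000 is assumed)
TAPES = 1350                    # magnetic tapes holding the collection

total_bytes = EVENTS_PER_PERIOD * AVG_EVENT_BYTES
bytes_per_tape = total_bytes / TAPES

# If one event per second were sustained, this many days of running
# would be needed to collect the events (an inference, not a quoted figure).
running_days = EVENTS_PER_PERIOD / 86_400

print(f"total volume : {total_bytes:.1e} bytes")
print(f"per tape     : {bytes_per_tape / 1e6:.0f} million bytes")
print(f"running time : {running_days:.0f} days at one event per second")
```

The product comes to 10^11 bytes, which works out to roughly 74 million bytes per tape, and the event count corresponds to roughly 46 days of running at one event per second within the six month period.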
The main difficulty in dealing with such a large volume of data is the mounting and management of tapes for processing. A mass storage system would be most useful for such an application. Since the data for every event can be analyzed independently from the other events, it can be considered a separate unit. The process of track reconstruction needs only a single unit at a time. However, as discussed later there are other processes that need to be run over a large number of events.

3) The associated data

3a) Configuration data

Although the configuration data does not exist explicitly as a database, it nevertheless exists in the programs analyzing the experiment data. This data corresponds to the description of the physical configuration of the detectors on the TPC device. It consists of mapping information between the identifiers of detectors as stored with each data value and their x-y coordinates. It also includes the mapping of the time measurement to the z coordinate.
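The kind of mapping that the configuration data encodes can be sketched as follows. This is only an illustration: the ring spacing, the number of detectors per circle, and the constant drift velocity are hypothetical values chosen for the example, and the real TPC geometry and calibration are certainly more involved.

```python
import math

# Hypothetical geometry constants: the source gives no numeric values.
RING_SPACING_CM = 1.0            # radial distance between concentric circles
DETECTORS_PER_RING = 16          # detectors per circle (assumed equal on all rings)
DRIFT_VELOCITY_CM_PER_US = 5.0   # assumed constant drift velocity


def detector_xy(ring: int, ordinal: int) -> tuple[float, float]:
    """x-y position of detector number `ordinal` on concentric circle `ring`.

    Rings are numbered from 1 outward; ordinals from 0, counter-clockwise.
    """
    radius = ring * RING_SPACING_CM
    angle = 2.0 * math.pi * ordinal / DETECTORS_PER_RING
    return radius * math.cos(angle), radius * math.sin(angle)


def drift_z(drift_time_us: float) -> float:
    """z coordinate recovered from the recorded drift time."""
    return DRIFT_VELOCITY_CM_PER_US * drift_time_us


def to_xyz(ring: int, ordinal: int, drift_time_us: float) -> tuple[float, float, float]:
    """Full spatial position of one reading from its stored identifier."""
    x, y = detector_xy(ring, ordinal)
    return x, y, drift_z(drift_time_us)
```

Because the positions are computed from two small integers, only the circle number, the ordinal number, and the time need to be stored with each data value.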

3b) Instrumentation data

The instrumentation data is quite extensive and has many components. There is calibration information for each of the 16,000 channels associated with each of the detectors. This information is used to adjust the readings of the detectors. There is other information representing the distortions due to imperfections in the magnetic field, the changes in the electric fields over time, etc. All this information is necessary in order to calibrate the experiment data. The total amount of instrumentation data is a few megabytes. It is not very large to manage, but it is complex since it contains many components. It is not obvious how to best organize such information in a database management environment.

3c) Analyzed data

The analysis process has many steps that necessitate a number of passes over the experiment data. Each step generates data files that are used in later steps. For example, one of the passes generates histograms over the experiment data. These histograms are used to determine constants for further analysis. A set of (multi-dimensional) histograms is taken over a collection of about 2000 events, and occupies about 400 kbytes. There are about 2000 such sets over the experiment data. These histograms are examples of non-standard data types that require the capability of characterizing and managing an entire data set as a single item. The final result of this analysis process is to produce summaries about tracks that belong to events. These summaries form the databases that need to be searched for interesting phenomena. Typically, the access type is a range search over some particle measures such as mass and momentum. The access sequence is non-local since there is no a priori correlation between successive queries.

3d) Summary data

Further analysis over the track and event data usually produces graphs and histograms. These data sets need to be managed as non-standard data types.

5. IMPLICATIONS

The implications derived in this section can be best followed by referring to Table 1 for experiment data and Table 2 for associated data. The organization of these tables was designed after the information on the different applications was collected in order to clarify its presentation. However, we believe that these table structures can be used to classify additional applications. Once the appropriate entries are filled for an application, one could quickly draw conclusions on its requirements and the possible data management techniques to support it, along the lines discussed below.

EXPERIMENT DATA

We discuss the entries of Table 1 by referring to its rows because the rows represent observations about each characteristic. The sections below are organized according to the row groups in the tables. The first row labeled experiment/simulation is merely to identify whether each example is an experiment or simulation.

1) Identifier

Identifiers in scientific databases are typically multi-dimensional, where the dimensions may be spatial coordinates, time steps, or varying experimental conditions such as temperature or magnetic field changes. An important issue is the efficient storage and access of identifier data which are affected by the regularity, density, and time-variation characteristics. Identifiers whose dimensions have a regular structure are quite common. The main reason is that simpler algorithms can be developed for them, and that the data can be organized in an orderly fashion. The simplest case exists when the configuration of the experiment or simulation forms a multi-dimensional mesh. In such a case there is no need to store the identifiers of the data because the position of each data point can be calculated using the array linearization technique mentioned in section 2. Indeed, the array capabilities of programming languages have been used extensively by scientific applications.
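Array linearization for a regular mesh can be sketched in a few lines. The example assumes row-major ordering, which is one common convention and not necessarily the one intended in section 2; the function names and the mesh shape are illustrative.

```python
from typing import Sequence


def linearize(index: Sequence[int], shape: Sequence[int]) -> int:
    """Row-major offset of a multi-dimensional index in a mesh of `shape`."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same rank")
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dimension of size {n}")
        offset = offset * n + i
    return offset


def delinearize(offset: int, shape: Sequence[int]) -> tuple[int, ...]:
    """Inverse of `linearize`: recover the index from a linear offset."""
    index = []
    for n in reversed(shape):
        offset, i = divmod(offset, n)
        index.append(i)
    return tuple(reversed(index))


# A 4 x 5 x 6 mesh stored as one flat list: no identifiers are kept.
shape = (4, 5, 6)
values = [0.0] * (4 * 5 * 6)
values[linearize((2, 3, 1), shape)] = 9.81
```

The identifier of a point is implied by its position in the flat list, so locating a point costs one short computation and no identifier storage.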
This suggests that an array linearization access method would be most desirable in a scientific data management system. The advantage of such an access method is that it requires no storage for the identifiers and provides very efficient random access (a simple computation) to the data points. The situation is more complex when the configuration is not simple, such as representing an airplane wing or the shape of a combustion chamber. In such cases a mesh that covers the entire configuration can be imposed, and all the points outside the configuration boundaries are considered null. This approach introduces a certain level of sparsity in the data points. We will discuss sparsity below. Other forms of regularity may exist. One is the regular placement of points along some geometric shape, such as concentric circles. Another occurs when two kinds of regular structures co-exist, such as having a finer mesh in certain regions of the configuration. In such cases the mapping algorithm of logical points into a linear sequence is more complex than array linearization, but it still provides storage savings and, more importantly, fast random access to the data points. As can be seen from Table 1 there are several examples of regular identifiers.

At first glance, it seems that identifiers that consist of irregular dimensions, such as the numbers identifying rooms in a building, have to be explicitly stored with each corresponding data value. Such an approach wastes space since each dimension value has to be repeatedly stored with the data. Rather, the irregular dimensions can be enumerated and stored only once. Thereafter, the identifiers can be calculated using array linearization over the enumerations. Irregular dimensions are most common in statistical databases (such as state, race, sex, and cause of death for mortality data), where the enumeration of each dimension and array linearization over them is a most effective method.

Data sparsity means that only a fraction of the points in the full cross product of the dimensions have actual values associated with them. There are basically two options: either to store the identifiers of the valid data points, or to compress out the non-valid (null) data points. Compression methods, such as run length encoding (which introduces a count into the data stream in place of each sequence of null points), can be quite effective, especially when the null points are clustered to form long sequences. However, such compression methods require sequential scanning of the data in order to select a particular point randomly. Indexing methods require too much space for large databases and may be prohibitive. In [Eggers & Shoshani 80] a compression technique, called header compression, which provides fast (logarithmic) access was proposed for statistical databases. It basically organizes the run length counts into a separate header, in such a way that the header can be searched in logarithmic time with respect to the number of counts.
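The idea of keeping run information in a separate, searchable header can be illustrated with a simplified sketch. It is not the header layout of the cited paper: here the header simply records where each run of non-null values starts, both in the logical sequence and in the packed value list, and is searched with the standard `bisect` module. The class and attribute names are hypothetical.

```python
from bisect import bisect_right


class HeaderCompressed:
    """Null-suppressed sequence with a separately searchable header."""

    def __init__(self, data, null=None):
        self.null = null
        self.length = len(data)
        self.values = []        # non-null values only, in logical order
        self.run_starts = []    # header: logical position where each non-null run begins
        self.phys_starts = []   # header: index in `values` where that run begins
        in_run = False
        for pos, v in enumerate(data):
            if v == null:
                in_run = False
                continue
            if not in_run:
                self.run_starts.append(pos)
                self.phys_starts.append(len(self.values))
                in_run = True
            self.values.append(v)

    def __getitem__(self, pos):
        if not 0 <= pos < self.length:
            raise IndexError(pos)
        # Binary search of the header: last run starting at or before `pos`.
        k = bisect_right(self.run_starts, pos) - 1
        if k < 0:
            return self.null
        phys = self.phys_starts[k] + (pos - self.run_starts[k])
        if k + 1 < len(self.phys_starts):
            run_end = self.phys_starts[k + 1]
        else:
            run_end = len(self.values)
        # Past the end of the run means `pos` lies in the null gap after it.
        return self.values[phys] if phys < run_end else self.null


logical = [None, None, 3.2, 4.1, None, None, None, 7.5, None]
packed = HeaderCompressed(logical)
```

A lookup costs one binary search over the runs rather than a scan of the data, and the header grows with the number of runs, not with the number of points.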
This technique can be applicable for sparse scientific data as well, since it can be used effectively with multi-dimensional data.

Time varying applications are not as common as other applications, but they represent an important class of modeling techniques. When the identifiers are time varying there is no choice but to store them, since they change from one time step to the next. In the case that the data is also regular, there is an additional requirement that the original relationship between the points is maintained. To see this point, one can imagine a mesh of points connected by rubber strings. The entire structure can then be stretched and compressed in successive time steps. The maintenance of these relationships can be achieved with techniques applicable to regular data. When data is irregular and time varying, the relationship between the data points changes from one time step to the next, and has to be deduced from the stored identifiers.

2) Access pattern

From Table 1 it can be seen that the access types of exact match and proximity search are important. Exact match implies, in general, the need to access specific data points randomly. To accommodate such a requirement some kind of indexing or hashing technique is required. Fortunately, one can take advantage of the multi-dimensionality of the data. The mapping of multi-dimensional space to linear space discussed above (e.g. array linearization) provides a key-to-address mapping that is equivalent to hashing. In addition, some multi-dimensional to linear mappings provide advantages for proximity search as discussed below. To support proximity search it is necessary to preserve logical locality in physical storage. That is, when points are logically close to each other in the multi-dimensional space, it is desirable that they are physically close in physical storage, so that they can be brought into memory from secondary storage with a minimum number of accesses.
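One way to keep logically close points physically close is to group the mesh into fixed-size cells and store each cell contiguously. The sketch below shows such an address mapping for a two-dimensional mesh and counts the cells a small proximity search would have to read. The mesh size, the cell extent, and the function names are hypothetical.

```python
from itertools import product

CELL = (8, 8)      # hypothetical cell extent along each dimension
MESH = (64, 64)    # hypothetical mesh size, assumed to be a multiple of CELL


def cell_of(point):
    """Cell coordinates of the cell that holds `point`."""
    return tuple(p // c for p, c in zip(point, CELL))


def address(point):
    """(cell number, offset inside the cell), both in row-major order."""
    cells_per_dim = [m // c for m, c in zip(MESH, CELL)]
    cell_no = 0
    offset = 0
    for p, c, n in zip(point, CELL, cells_per_dim):
        cell_no = cell_no * n + p // c
        offset = offset * c + p % c
    return cell_no, offset


def cells_touched(center, radius):
    """Distinct cells that a square proximity search around `center` must read."""
    ranges = [range(max(0, x - radius), min(m, x + radius + 1))
              for x, m in zip(center, MESH)]
    return {cell_of(p) for p in product(*ranges)}
```

With these sizes a 5 x 5 neighborhood in the interior of a cell touches a single cell, and at most four cells near a cell corner. A plain row-major layout with 64-point pages would spread the same neighborhood over five pages, one per mesh row.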
This suggests the organization of physical storage into cells along the dimensions of the identifier. The data points within a cell will satisfy the proximity requirement. For elements on the borders of cells it is necessary to access adjacent cells, and therefore the placement of cells in physical storage is also important. The mapping of multi-dimensional space to linear space mentioned above works well with such a cellular organization because it does not disturb the logical proximity of the data points. An arbitrary hash mapping would place data points into cells (buckets) which would not necessarily preserve logical proximity. The optimal partitioning of cells, especially in the case of sparse data, is an interesting problem that should be further investigated.

The range access type does not seem to be as important. Nevertheless, the cell organization should benefit range access on the dimensions of the cells.

Referring again to Table 1 it seems that local access sequence is also important. The cell organization is also helpful here because local points are likely to be in the same cell. The question of how to organize the cells arises here again. If the paths of local access sequences are known or predictable, then the cells should be organized along these paths. The benefits of such ideas need to be investigated.

Non-local access sequence is not as prevalent as local access sequence. However, it can be supported well with cell organization. The reason is that it complements the requirements of exact match, since it implies the need for a random access of the data points. Linear access sequence conflicts with the idea of a cell organization, because the linear sequencing of the data is broken. However, it does not seem to be an important requirement. If data were organized linearly to accommodate this requirement, then proximity search and local access sequence would be performed less efficiently.

An arbitrary access sequence is quite common. It usually implies that the entire data set needs to be processed, and that the order of points is irrelevant. This suggests that parallel processing can be performed over the data. This only complements the cell organization approach, since the cells could be placed on parallel devices for parallel processing.

In summary, it seems that the cell approach is most desirable since it accommodates the most important requirements. The organization of cells should be along the dimensions of the identifier, since they preserve logical locality. The approach of mapping the multi-dimensional space into linear space complements this cell organization. There are several papers that discuss the organization of data into cells [e.g. Nievergelt et al 84]. However, the access requirements mentioned here, such as proximity search and local access sequence, were not explicitly addressed.

3) Size

The most important observation that can be made from the size figures in Table 1 is that although scientific databases are large, they can often be partitioned into small independent units. The units are small enough that much of the processing can be done in main memory. In general, experimental units can be processed in parallel, since they are independent of each other. Simulation units (time steps), on the other hand, usually follow each other in sequence. Note that simulation units are typically larger than experimental units. Unit processing is only one part of the analysis process. Other types of processing need to search and access entire collections. As can be seen from the collection figures in Table 1, some collections are so large that they cannot be practically stored on magnetic disks.
In such cases, the data is currently stored on tapes and the mounting of those tapes becomes a major problem. Current solutions are to process the data sequentially once, to collect interesting subsets, or to break the data into redundant smaller sections. It is obvious that larger secondary storage devices (such as optical disks) could be helpful.

4) Associated data

Associated data is discussed in the next section. The different types of associated data were included in Table 1 in order to point out their importance and prevalence. Nearly all applications have all types of associated data. The obvious exception is that simulations do not have instrumentation data.

ASSOCIATED DATA

Table 2 summarizes our observations on the different types of associated data. We could discuss these observations by row for each class of characteristics or by column for each type of associated data. A close observation of Table 2 reveals that there are many similarities between the configuration and instrumentation columns, and between the analyzed data and summary data columns. This is not very surprising since these two groups represent support data and generated data and should have similar characteristics. In fact, early on we did not make this finer distinction, but later we found that it helped in sorting out the different aspects of scientific data. Accordingly, we will discuss characteristics in Table 2 in three parts: the support data (configuration and instrumentation data), the generated data (analyzed and summary data), and property data.

1) Support data

The access type for support data is mostly exact match. A typical access involves finding a particular configuration point and the particular instrument associated with it. Proximity search is sometimes needed. For example, if a certain instrument failed, the configuration data may be consulted to find the instruments in neighboring locations.
The access sequence is mostly non-local, which indicates that successive queries are unrelated. Thus, the access requirement for support data is mainly random access.

The data modeling requirements are fairly conventional, i.e. modeling of entities that have hierarchical or network relationships. The relationships between the different instruments and detectors are part of the configuration data. Generalization is an important modeling tool for instruments, as generic information can be represented once and inherited by each particular instrument in that class. An important exception to the conventional modeling requirements mentioned above is geometric modeling of configuration data. In many examples the geometry is quite regular and could probably be modeled with simple types (points, lines, circles, etc.). However, geometric shapes may be complex enough to require special modeling techniques similar to those required in engineering databases [Lorie 82].

Another major requirement is for the support of historical data. Instrumentation data change continuously over time, and the entire history of changes must be recorded. In addition, logs of the operation, such as when an instrument failed, who was in charge at the time, etc., also need to be recorded. The time element can be thought of as another dimension orthogonal to the structure of the database. It requires special storage techniques and special operators such as after and during. Several recent works have dealt with this topic [e.g. Anderson 81, Bolour et al 82]. The history of configuration data changes also needs to be recorded, but not as often as instrumentation data because they usually occur only between experiments.

Support data may have some text that describes procedural instructions or configuration descriptions. Instrument data are usually polled at regular time intervals, and could benefit from a time series data type. The size of the data is relatively small, and constitutes only about 1% of the experiment data.

In conclusion, we believe that support data can be managed for the most part with conventional data management techniques. The databases are relatively small. The requirement for random access can be accommodated with conventional indexing or hashing methods. The two most important exceptions that require special attention are historical data support and geometric modeling.

2) Generated data

The access pattern of generated data is similar to that of statistical databases. That is, it is mostly range and partial match queries. As with statistical databases, the generated data is repeatedly analyzed in order to discover patterns, statistical behavior, or a rare event. Many subsets are generated and need to be kept track of. The access sequence is mostly non-local, although locality exists when analysts refine their queries. From time to time an entire set of analyzed data is processed to generate summaries. This is indicated as an arbitrary access sequence in Table 2.

The most prominent data modeling characteristic is that generated data is multi-dimensional.
Unlike experiment data where the dimensions are mostly spatial coordinates, the dimensions of generated data are the properties of the data (e.g. charge, temperature, mass). Thus, the number of dimensions can be in the order of ten, which presents a special challenge for its efficient support. In some instances it is useful to view analyzed data as entities and hierarchical relationships (for example, events and their corresponding subparticles).

Another important modeling requirement is for meta-data. The requirements of meta-data management include data definition facilities not only for field descriptors (such as type, size, and acronym), but also the description of the origin of the data, how it was collected, when it was generated or modified, and the identity of the person responsible for its collection. Facilities to describe complex data types such as time series, matrices, and multi-dimensional categorical data are also needed. It is necessary to organize and manage meta-data, just as is the case with data. One should be able to retrieve and search meta-data, index keywords, and browse through the meta-data structures. A system that supports such operations for statistical databases is described in [Chan & Shoshani 81]. Meta-data is also necessary for keeping track of the different subsets produced, dates of their creation, methods used, etc. The management of subsets also requires that their historical aspects are maintained. It is necessary to record and maintain the ancestors of each subset produced. The analysis process, similar to statistical analysis, can be modeled as a tree or a directed graph structure. The analyst can generate subsets, observe their patterns, and choose to go back to a previous set and follow another path of analysis. The above requirements are similar to many aspects of the meta-data management for statistical databases [McCarthy 82].
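The subset bookkeeping described above can be pictured as a small directed graph of meta-data records. The sketch below is illustrative only: the record fields, the subset names, the dates, and the analyst label are all hypothetical and merely mirror the kinds of items the text lists (method used, creation date, responsible person, ancestors).

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SubsetRecord:
    """Meta-data kept for one derived subset (field names are illustrative)."""
    name: str
    method: str        # how the subset was produced
    created: date
    creator: str
    parents: list["SubsetRecord"] = field(default_factory=list)

    def ancestors(self) -> list[str]:
        """Names of all ancestors, nearest first, each listed once."""
        seen: list[str] = []
        queue = deque(self.parents)
        while queue:
            node = queue.popleft()
            if node.name not in seen:
                seen.append(node.name)
                queue.extend(node.parents)
        return seen


# Hypothetical analysis path with one branch back to an earlier set.
tracks = SubsetRecord("all_tracks", "track reconstruction", date(1984, 1, 10), "analyst A")
high_p = SubsetRecord("high_momentum", "momentum above a cut", date(1984, 2, 1), "analyst A", [tracks])
rare = SubsetRecord("rare_candidates", "mass window on high_momentum", date(1984, 2, 15), "analyst A", [high_p])
low_m = SubsetRecord("low_mass", "mass below a cut", date(1984, 3, 1), "analyst A", [tracks])

print(rare.ancestors())
print(low_m.ancestors())
```

Because each record keeps references to its parents, going back to a previous set and following another path of analysis amounts to attaching a new record to an existing node.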
It is often useful in the analysis process to add new fields to the database or to compute new fields from other fields. This is referred to in Table 2 as schema variation. However, except for such additions during the analysis process, the generated databases are quite stable. The support of non-standard data types is most important. Generated data can be expressed as graphs, vectors, matrices, and time series.

Finally, the size of the data is substantial, and although it is small enough to fit on disks, it is sufficiently large to benefit from data management techniques that minimize disk storage and access time. The total amount of generated data may be of the same order of magnitude as the experiment data it was derived from, because a large number of subsets are usually produced.

In conclusion, generated data have many characteristics in common with statistical databases. We believe that special techniques for the management of multi-dimensional data developed for statistical databases could be applied to support analyzed data.

Since property data is a summary over many experiments and contains general knowledge of a subject area, it has characteristics more akin to bibliographic
