
University of Ljubljana
Biotechnical Faculty
Slovenian National AGRIS Centre
Head: Tomaž Bartol, M.Sc.

UDC: 002.6.01/.08:63:014.3:05 "1993-1995"(497.12)(045)=20
Research paper (Raziskovalno delo)

SUBJECT CATEGORY-BASED ANALYSIS OF DESCRIPTORS OF SLOVENIAN PLANT SCIENCE DOCUMENTS IN THE AGRIS DATABASE IN THE PERIOD 1993-1995

Tomaž BARTOL a

ABSTRACT

With the support of subject category codes, we identified all Slovenian documents related to plant science and production and to plant protection in the international agricultural database AGRIS for the period 1993-1995. For these documents we separately downloaded all indexer-assigned and computer-assigned descriptors. We sorted the descriptors, calculated the frequency of each descriptor and created a separate list for each of the two descriptor groups. The lists can be used as a controlled lexical source by authors writing English-language titles, abstracts or articles. They serve as an indicator and scientometric tool for estimating the presence of narrow and broad concepts in the documents over the above period. They can provide the indexer with a reference to her or his previous indexing of similar material and can thus enhance indexing consistency and reduce ambiguity. The general aim of presenting the collected controlled terms in this way is to enable more consistent use of terms for the same concepts by authors and indexers alike and in turn, as a result of a higher degree of database consistency, to improve information retrieval by end-users.

KEYWORDS: plant science, data collection, data analysis, information processing, databases, newspapers, documentation, information science

ANALIZA DESKRIPTORJEV, NA OSNOVI PREDMETNIH KATEGORIJ, SLOVENSKIH DOKUMENTOV O RASTLINAH V ZBIRKI AGRIS V OBDOBJU 1993-1995

ABSTRACT (SLOVENIAN VERSION)

Using subject category codes, we selected from the international database AGRIS, for the period 1993-1995, those Slovenian documents that dealt with plant production and plant protection. For these documents we separately downloaded the descriptors assigned by the indexer and the computer-assigned broader descriptors. We sorted the descriptors into two separate lists and calculated their frequencies of occurrence. The lists can be used as a source of controlled terms for writing English titles, abstracts or articles. They can serve as an indicator and scientometric tool for assessing narrower and broader concepts in documents over a given period of time. They also allow the indexer to quickly review and check which terms were used most frequently in previous years, which can contribute to better indexing consistency and thus reduce ambiguity. The general aim of such a presentation of controlled terms is to enable more consistent use of the same controlled terms for the same concepts by authors and indexers alike, which should allow users of information systems and databases to search for relevant information more successfully.

KEYWORDS: plant science, data collection, data analysis, information processing, databases, newspapers, documentation, information science

-------------------
a M. Sc. Agr., B. Sc. Zoot., 1111 Ljubljana, Jamnikarjeva 101, P. O. Box 95

1 INTRODUCTION

Information is increasingly transmitted over networks; however, paper journals and other printed documents still remain relevant indicators of research activity (Suraud et al., 1995). Quantitative bibliometric techniques offer useful tools for the analysis of documents. Content analysis of documents is one of the most commonly used data analysis techniques for such quantitative research (Westbrook, 1994). As the bulk of data in the form of proceedings or journal articles is huge, wider visibility is possible almost only through electronic services, so the majority of significant journals seek ways of becoming indexed by a major information service. The importance of the databases produced by such services is still growing. They not only allow the storage of an enormous amount of information but also offer standardized access to this information as a result of systematic vocabulary control. All bibliographic databases are thus in principle potential sources for different bibliometric or scientometric analyses (Braun, 1995). These analyses frequently deal with different indexing terms, such as controlled vocabulary terms, classification codes, subject headings, indexer-selected terms, natural language phrases, keywords, automatically generated index terms, etc. (Rajashekar, 1995). The objective of the following analysis is to explore the possibilities offered by a compilation of controlled (indexer-selected) indexing terms (descriptors) which were assigned in the process of data analysis (indexing) to the Slovenian documents in the international agricultural database AGRIS. One of the assets of a systematic collection of terminological data can also be the mediation of information between different languages (Ananiadou, 1995), so we hold that such a list of collected descriptors may be used as a lexical resource by authors.
As a single descriptor frequently stands for several synonyms chosen by different authors, such a list can, if consulted, increase the semantic consistency of the words used in articles for the same concept. The list of descriptors and codes, furnished with additional frequency data, may also serve as an indicator and scientometric tool for following the presence of concepts over a certain period of time, in connection with our previous research (Bartol, 1995). Further, the work of the individual indexer is highly subjective and open to personal interpretation, because the decisions he or she makes involve judgements of the value of what is presented (Quinn, 1994). Such a list can therefore provide an indexer with a reference to his or her previous indexing of similar material and thus increase indexing consistency and indexing quality. Communication between the end-user and the information retrieval system is problematic if the indexer and the user do not use the same words for the same concept (Collantes, 1995). Ambiguity also grows with the vast amount of object information from different sources (Bower, 1993), so indexing consistency can significantly improve the end-user's retrieval recall.

2 METHODS

Since the beginning of 1994 we have been collecting bibliographic data based on agricultural scientific or professional documents produced and published in Slovenia since 1993. The data were compiled into a database established as a branch of the international database AGRIS. Our data in this database reached the number of thousand documents by mid-1996. Each document figures as a separate entry and contains up to three indexer-assigned broader category codes and roughly between five and twelve indexer-assigned English descriptors taken from the thesaurus AGROVOC. In the process of including existing English abstracts we sometimes also made use of lexical terms found in these abstracts. However, author-assigned English titles and abstracts frequently served only as a pointer towards the controlled terms of the thesaurus. The indexer-assigned codes and descriptors thus feature as an analytical semantic representation of the concepts presented in the documents and may serve purposes beyond mere information retrieval.

It is well known that one of the difficulties of a bibliographic analysis lies in the way one constitutes the set of data to be processed (Suraud, 1995). As the set of data in our analysis we will take from the Slovenian AGRIS segment all documents which are primarily related to plant science, production, protection or plant-related postharvest technology. To do so we will identify the broader subject categories (concepts) which define the area. Information systems very frequently offer the possibility of classifying documents with a few selected significant category concepts (Pathak, 1994), so such broad categories can be very helpful in identifying relevant data (Christensen, 1995). A similar set of content-specific elements, such as keywords or classification codes, indicates cognitive resemblance of documents (Peters, 1995). We can then infer that such documents represent work within the same research field or speciality (Braam, 1991). After we have selected all relevant documents of the plant science and production area, we will download all existing descriptors that have been assigned to the documents.
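The selection step described above can be illustrated with a minimal sketch. It assumes that each database entry is available as a dictionary with the hypothetical fields `categories` and `descriptors`; the field names, category codes and sample records are illustrative assumptions, not the actual AGRIS export format. Only the leading letters F (Plant science and production) and H (Plant protection) are checked here; deciding whether a J (Postharvest technology) document is plant-related would need an additional test that is not sketched.

```python
# Hypothetical record layout: field names, codes and sample data are
# illustrative assumptions, not the actual AGRIS export format.
records = [
    {"id": 1, "categories": ["F30", "H20"],
     "descriptors": ["MALUS", "FUNGAL DISEASES"]},
    {"id": 2, "categories": ["L10"],
     "descriptors": ["CATTLE", "MILK YIELD"]},
    {"id": 3, "categories": ["K10", "H10"],
     "descriptors": ["PICEA ABIES", "PESTS OF PLANTS"]},
]

# Leading letters of the broad plant-related categories (F and H).
PLANT_CATEGORY_PREFIXES = ("F", "H")


def is_plant_related(record):
    """True if at least one of the (up to three) category codes is plant-related."""
    return any(code.startswith(PLANT_CATEGORY_PREFIXES)
               for code in record["categories"])


def select_plant_records(all_records):
    """Keep only the records that carry a plant-related category code."""
    return [r for r in all_records if is_plant_related(r)]


def collect_descriptors(selected_records):
    """Flatten the descriptors of the selected records into one list."""
    return [d for r in selected_records for d in r["descriptors"]]


selected = select_plant_records(records)
descriptors = collect_descriptors(selected)
print([r["id"] for r in selected])
print(descriptors)
```

Because a document may carry up to three category codes, a record is kept as soon as any one of its codes is plant-related, which is also why descriptors from neighbouring fields can enter the resulting list.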
Even though some authors considered in their research only those descriptors with a high frequency for the total period of the analysis (Rikken, 1995), we will include all descriptors, as we want to define the major as well as the minor aspects of our selected documents. After the descriptors have been downloaded to a computer file, we will align and sort them with the paragraph-sorting procedures in Microsoft Word. We will then use the Statistica package to calculate the frequencies of those descriptors. We will present the descriptors in the form of a list along with the frequency of occurrence of each descriptor.

3 RESULTS

We searched for and identified all Slovenian documents which were assigned, in the AGRIS database, the broader plant-related subject category codes F (Plant science and production), H (Plant protection) and plant-related J (Postharvest technology). We downloaded all the descriptors taken from those documents. There are two groups of descriptors assigned to each document. The first group (Table 1) is composed of indexer-assigned descriptors, which are the result of the indexer's cognitive analysis of the documents. The second group (Table 2) is composed of computer-assigned (automatically generated) broader descriptors, which are the result of an automated enrichment of the indexer's descriptors in order to present hierarchically higher concepts in the documents. In Picture 1 we present typical descriptor clusters from the thesaurus AGROVOC, whereby the first descriptor is assigned by the indexer and all other broader descriptors (BT, broader term) are automatically assigned by a computer.
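The sorting and frequency calculation described in the Methods were carried out with Microsoft Word and Statistica. As an illustration only, the same two steps can be sketched in Python; the function name, the normalization of spelling variants to upper case and the sample data are assumptions of this sketch, not part of the original procedure.

```python
from collections import Counter


def descriptor_frequencies(documents):
    """Return (descriptor, number of documents) pairs in alphabetical order.

    `documents` is a list of descriptor lists, one list per document.
    """
    counts = Counter()
    for descriptors in documents:
        # A set ensures that a descriptor is counted once per document,
        # so the result is a document frequency.
        counts.update({d.strip().upper() for d in descriptors})
    return sorted(counts.items())


# Illustrative sample data, not taken from the AGRIS database.
sample = [
    ["Endosperm", "Zea mays"],
    ["zea mays", "Photosystems"],
    ["ZEA MAYS "],
]

for descriptor, frequency in descriptor_frequencies(sample):
    print(f"{descriptor}\t{frequency}")
```

The output is an alphabetical list with one frequency per descriptor, which corresponds to the layout of the descriptor lists presented in the tables.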

Picture 1: Examples of two hierarchical clusters of descriptors in the AGROVOC thesaurus

ENDOSPERM
  BT1 kernels
    BT2 seeds
      BT3 plant developmental stages
        BT4 developmental stages
      BT3 plant reproductive organs
        BT4 plant anatomy

PHOTOSYSTEMS
  BT1 photosynthesis
    BT2 biosynthesis
      BT3 biochemical reactions
        BT4 chemical reactions
    BT2 energy metabolism
      BT3 metabolism

The basic descriptor and the hierarchically higher descriptors never overlap in the same field. We further present both groups, with each descriptor being supplied with the number of documents in which the descriptor appeared between 1993 and 1995. For the sake of brevity we excluded most taxonomic terms from the second group (Table 2), since they are represented narrowly enough by the first group (Table 1). We also present separately those broader descriptors which were computer-assigned to ten or more documents (Table 3).

Table 1: List of indexer-assigned descriptors of the plant science and production- and plant protection-related Slovenian documents in the international agricultural database AGRIS for the period 1993-1995

Table 2: List of computer-assigned (broader) descriptors of the plant science and production- and plant protection-related Slovenian documents in the international agricultural database AGRIS for the period 1993-1995

Table 3: Descriptors computer-assigned to ten or more documents

4 DISCUSSION AND CONCLUSIONS

In the AGRIS database we searched for all those Slovenian documents which were indexed between the years 1993 and 1995 with plant science-, production- and protection-related broader subject categories. This in turn enabled us to pick out all the descriptors associated with these categories. In the AGRIS database there are two groups of descriptors: indexer-assigned and computer-assigned descriptors. The second group represents broader subject aspects than the first one. Each descriptor in the second group is derived from at least one narrower descriptor in the first group.
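The relation between the two descriptor groups can be made concrete with a short sketch. It assumes a dictionary `BROADER` that holds only the direct broader-term links as read from the two clusters in Picture 1; the function names and the threshold parameter are hypothetical, and the real AGROVOC thesaurus is of course far larger than this fragment.

```python
from collections import Counter

# Direct broader-term (BT) links as read from Picture 1 (lower-cased).
BROADER = {
    "endosperm": ["kernels"],
    "kernels": ["seeds"],
    "seeds": ["plant developmental stages", "plant reproductive organs"],
    "plant developmental stages": ["developmental stages"],
    "plant reproductive organs": ["plant anatomy"],
    "photosystems": ["photosynthesis"],
    "photosynthesis": ["biosynthesis", "energy metabolism"],
    "biosynthesis": ["biochemical reactions"],
    "biochemical reactions": ["chemical reactions"],
    "energy metabolism": ["metabolism"],
}


def broader_terms(descriptor):
    """Collect every term above `descriptor` in the hierarchy."""
    found = set()
    stack = [descriptor]
    while stack:
        for parent in BROADER.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def computer_assigned(indexer_descriptors):
    """Broader descriptors of one document, excluding the indexer's own terms."""
    assigned = set(indexer_descriptors)
    expanded = set()
    for descriptor in assigned:
        expanded |= broader_terms(descriptor)
    # The two groups never overlap within the same document.
    return expanded - assigned


def frequent_broader_descriptors(documents, threshold=10):
    """Broader descriptors assigned to at least `threshold` documents."""
    counts = Counter()
    for indexer_descriptors in documents:
        counts.update(computer_assigned(indexer_descriptors))
    return {term: n for term, n in counts.items() if n >= threshold}


print(sorted(broader_terms("endosperm")))
```

With `threshold=10`, the last function mirrors the selection criterion of Table 3; a lower value can be passed when experimenting with a small sample.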
The indexer need not worry about finding a descriptor broad enough to present a concept in a document. No matter how narrow the indexer's decision, such a document can also be retrieved with all the broader terms that are assigned in the thesaurus to the respective narrower descriptor. Each document can be assigned up to three different broad subject categories, so for certain interdisciplinary articles there can be interaction between plant and animal, food or forest categories. That is why there are a number of non-"plant" descriptors in Table 1. Plant categories are assigned to crops as well as to other plants such as forest trees, so the latter also feature in our analysis if the article in question deals with the physiology or a disease of a tree. Plant categories are not assigned to general articles on forestry or forestry production. We excluded certain descriptors from Table 2 in order to condense the list of computer-assigned descriptors. There is also an important distinction between the Latin and the English name for a plant, where the former stands for the plant before harvest and the latter for the plant and its derivatives (food, feed, etc.) after it has been harvested. However, a certain ambiguity remains with regard to foods or feeds. If this was the primary aspect of an article, such an article was probably assigned Q (Food and feed science and technology and related) categories only and was thus not included in our analysis. Certain forest-related concepts do feature in our lists, whereas certain food aspects, which are certainly more closely related to crop husbandry than forest trees are, are absent. This shows how difficult it is to organize a logical semantic representation of broader categories if relevant data are not to be lost, but also if the data are not to become too "inclusive" to provide any logical subject limits. We wanted to present both groups of descriptors to show as many keyword-level concepts in Slovenian plant-related documents as logically possible according to some of the AGRIS Categorization Scheme criteria.
With limited space it was impossible to present all the hierarchical and associative relations found in the thesaurus, so the second group (Table 2) can serve only as information on selected concepts and their frequencies. We can see that only a few descriptors were computer-assigned to ten or more documents (Table 3), whereas more than half of the descriptors were assigned only once or twice, from which we can to some extent infer that there is large variability amongst the documents. However, it has to be taken into consideration that a computer-assigned broader descriptor may in another document feature as an indexer-assigned one, given the broader primary aspect of that document, so real descriptor searching should usually be performed in both groups. Also, indexing is a highly individual and subjective process, so the results can be quantified or qualified only to a limited degree. We hold that the selected descriptors could still be used effectively by authors in writing English-language documents or in composing English-language document titles, abstracts and keywords. However, much better guidance would of course be obtained directly from a thesaurus, where all relations between thousands of different descriptors are defined in detail.

Another important aim of our presentation was to give ourselves a reference tool for maintaining as high a level of indexing consistency as possible in our further indexing work. We will be able to consult the list frequently for our own reference in order to look up the concepts that have already been presented in previous years. This will be especially helpful as we might have assigned a good and informative descriptor which might, however, later have slipped from our memory and would therefore no longer be assigned consistently. In this way we intend to keep a higher level of indexing consistency, which will in turn improve end-user retrieval and thus contribute to the overall quality of the information system.

5 REFERENCES

1. Ananiadou, S.; McNaught, J.: Terms are not alone: term choice and choice terms.- Aslib Proceedings, 47(1995)2, p. 47-60
2. Bartol, T.: Subject analysis of Slovenian agricultural published documents in the years 1993 and 1994.- Zbornik Biotehniške fakultete - Kmetijstvo, 1995, no. 65, p. 143-155
3. Bower, J.M.: Vocabulary control and the virtual database.- Knowledge Organization, 20(1993)1, p. 4-7
4. Braam, R.R. et al.: Mapping of science by combined co-citation and co-word analysis. Part I: Structural aspects.- Journal of the American Society for Information Science, 42(1991), p. 233-251
5. Braun, T. et al.: "Hyphenation" of databases in building scientometric indicators.- Scientometrics, 33(1995)2, p. 131-148
6. Christensen, F.H.; Ingwersen, P.: Fundamental methodological issues of data set creation online for the analyses of research publications.- In: Fifth International Conference of the International Society for Scientometrics and Informetrics, River Forest, June 7-10, 1995, Proceedings, Rosary College, 1995, p. 103-112
7. Collantes, L.Y.: Degree of agreement in naming objects and concepts for information retrieval.- Journal of the American Society for Information Science, 46(1995)2, p. 116-132
8. Pathak, L.P.; Binwal, J.C.: Identification of main concepts used in sociology and their categorization.- Knowledge Organization, 21(1994)2, p. 69-74
9. Peters, H.P.F. et al.: Cognitive resemblance and citation relations in chemical engineering publications.- Journal of the American Society for Information Science, 46(1995)1, p. 9-22
10. Quinn, B.: Recent theoretical approaches in classification and indexing.- Knowledge Organization, 21(1994)3, p. 140-147
11. Rajashekar, T.B.; Croft, W.B.: Combining automatic and manual index representations in probabilistic retrieval.- Journal of the American Society for Information Science, 46(1995)4, p. 272-283
12. Rikken, F. et al.: Mapping the dynamics of adverse drug reactions in subsequent time periods using Indscal.- Scientometrics, 33(1995)3, p. 367-380
13. Suraud, M.G. et al.: On the significance of data bases keywords for a large scale bibliometric investigation in fundamental physics.- Scientometrics, 33(1995)1, p. 41-63
14. Westbrook, L.: Qualitative research methods: A review of major stages, data analysis techniques, and quality controls.- LISR, 16(1994), p. 241-254