How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments?


A. Hossein Farajpahlou
Professor, Dept. Lib. and Info. Sci., Shahid Chamran University, Ahvaz, Iran.

Faeze Tabatabai
MLS graduate, Dept. Lib. and Info. Sci., Shahid Chamran University, Ahvaz, Iran.

Abstract

Purpose - The purpose of this research was to examine the indexing quality and ranking of XML content objects containing Dublin Core and MARC 21 metadata elements in dynamic online information environments by general search engines such as Google and Yahoo!.

Design/methodology/approach - 100 XML content objects were analyzed in two groups: those with DCXML elements and those with MARCXML elements. They were published on the website from late July 2009 till June, and the website was introduced to the Google and Yahoo! search engines. Data were collected during April 2010 by means of a checklist. Google was able to retrieve all the content objects fully during the study period through their Dublin Core and MARC 21 metadata elements; Yahoo!, however, did not respond at all. The indexing quality of the metadata elements embedded in content objects in a dynamic online information environment, and their indexing and ranking capabilities, were compared and examined.

Findings - The results showed that all Dublin Core and MARC 21 metadata elements were indexed by Google, and that no difference was observed between DCXML and MARCXML metadata elements in indexing quality or ranking in dynamic online information environments as performed by Google. All in all, neither the XML-based Dublin Core metadata initiative nor MARC 21 shows any preference over the other with regard to access in dynamic online information environments through the Google search engine.
Practical implications - The results of the present study provide useful information and a basis for search engine designers who are involved in the creation of indexing software.

Originality/value - The present study was conducted for the first time in dynamic environments using XML-based metadata elements. It can therefore provide ground for further studies of this kind.

Keywords - Dublin Core, MARC 21, Indexing, ranking, dynamic online environments, Google, XML.

Introduction

In line with recent developments in information and communication technology, especially the World Wide Web, on the one hand, and the global increase in the production of scientific information on the other, we are witnessing increasing growth and improvement in dynamic online information databases. They contain content objects and up-to-date scientific sources in different branches of human knowledge. The classification of knowledge and information, a basic issue in librarianship, therefore remains a central concern for experts in this field. As a result, a great deal of research and development has gone into metadata initiatives and standards that meet the present-day needs of various domains. In other words, the application of metadata initiatives and standards is now unavoidably tied to ongoing developments in digital libraries and dynamic online information databases.

One of the most important metadata schemes that has adapted itself to the dynamic environment for the identification, classification, and retrieval of Web resources and content objects is the MARC metadata format. The Dublin Core Metadata Initiative (DCMI) is another major international metadata initiative, originally created for the identification, retrieval and classification of Web content objects. It should be noted that applying a metadata initiative for the identification, retrieval and classification of resources and content objects, and for facilitating their exchange on the Web, is only one side of the coin in the fast and relevant retrieval of content objects. The other side is closer attention to the most important and widely used internet search tools, since the majority of internet users rely on general search engines to search for and retrieve the sources they need, especially the content objects available in dynamic online information environments. The interoperability of these tools with metadata schemes is a further major issue to be considered.
Another major issue in the identification and effective retrieval of content objects consisting of metadata elements relates to their semantic environment and platform. The move of the Dublin Core metadata initiative and MARC 21 towards XML (Extensible Markup Language) is due to this technology's high capacity to increase the interoperability of the most important and frequently used internet search tools with these metadata initiatives. Implementing MARC 21 and Dublin Core metadata elements on the XML platform has added value to both initiatives. One advantage is that the indexing software of search engines is able to index the elements of these metadata initiatives thoroughly in XML. The findings of Taheri (2008), obtained in a static information environment, emphasize this point.

However, the fast tracking, retrieval and storing of information in dynamic online environments remains an issue that should be addressed. Experts propose metadata usage as a solution. On the other hand, as mentioned earlier, the majority of web users use common general search engines to search for and retrieve content objects in dynamic online information environments. Suppliers are therefore usually keen on interoperability between these tools and metadata schemes, so as to be able to provide more efficient tools for searching and retrieving the content objects available in such environments. Hence, the present study mainly aimed at examining the indexing quality and ranking of content objects consisting of XML-based MARC 21 and Dublin Core metadata elements in dynamic online information environments by general search engines. It also seeks to answer questions about how the indexing software of general search engines such as Yahoo and Google relates to and interacts with XML-based content objects in dynamic online environments.

Another aim was to find out which of the two metadata initiatives is more efficient for the indexing and ranking of XML-based content objects in dynamic online information environments by Yahoo and Google. As most information and scientific content objects reside in dynamic online information environments, the importance of this research lies in revealing whether the indexing software of Yahoo and Google, as public search engines, can connect and interact with XML-based content objects containing MARC 21 and Dublin Core metadata elements in such environments. Having identified the search engine that indexes and ranks XML-based content objects more efficiently in dynamic online information environments, designers could use the relevant metadata elements in their schemes for indexing and ranking search results. This would also enable designers of content objects in dynamic online information environments to apply metadata initiatives appropriately.
Research Objectives

To achieve the main goal of the present research, the following objectives were pursued:

examining the difference in the indexing capabilities of the Google and Yahoo search engines regarding content objects that contain XML-based MARC 21 and Dublin Core metadata elements in dynamic online information environments;
examining the difference in the ranking capabilities of the Google and Yahoo search engines regarding content objects consisting of XML-based MARC 21 and Dublin Core metadata elements in dynamic online information environments;
observing and examining the reaction of the indexing software (robots) of the Yahoo and Google search engines to XML-based content objects in dynamic online information environments with both flat (or tree) and hierarchical (or family) structures;
observing and examining the reaction of the indexing software (robots) of the Yahoo and Google search engines to metadata initiatives both with language-based tags (Dublin Core) and without language-based tags (MARC 21);
selecting, with regard to the Yahoo and Google search engines, the metadata initiative that is likely to be more appropriate for the organization of content objects in dynamic online information environments.

The Research questions

This research sought answers to the following seven questions:

1. What is the indexing quality of content objects containing XML-based Dublin Core metadata elements in dynamic online information environments as performed by the Yahoo and Google search engines?
2. What is the indexing quality of content objects containing XML-based MARC 21 metadata elements in dynamic online information environments as performed by the Yahoo and Google search engines?
3. What is the difference between the indexing quality of the three main elements (title, author and subject) of content objects containing XML-based Dublin Core and MARC 21 metadata elements in dynamic online information environments as performed by the Yahoo and Google search engines?
4. What is the difference in the ranking of content objects containing XML-based MARC 21 and Dublin Core metadata elements in dynamic online information environments as performed by the Yahoo and Google search engines?
5. How do the Yahoo and Google search engines react to content objects in dynamic online information environments containing XML-based metadata elements with a flat structure (Dublin Core) and a hierarchical structure (MARC 21)?
6. How do the Yahoo and Google search engines react to metadata initiatives with language-based tags (Dublin Core) and without language-based tags (MARC 21)?
7. Which of the MARC 21 and Dublin Core metadata initiatives is more suitable for the classification of XML-based content objects in dynamic online information environments with regard to access through the Google and Yahoo search engines?

Research background

A review of the literature reveals that more research on the present subject has been done elsewhere than in Iran. Some examples are described below.

Turner and Brackbill (1998) looked at the ways in which access to HTML documents could be improved. In their experimental research on Hypertext Markup Language (HTML) meta-tags and the retrieval of web documents by search engines, they found that assigning the description attribute alone did not improve access, whereas the keywords attribute did.

Sokvitne (2000) studied the websites of 20 large Australian educational and government organizations, aiming to identify how well key elements of the Dublin Core metadata initiative, such as title, publisher, author, and subject, support retrieval.
The results revealed that, because of inconsistency in the formats of the content records, elements such as author, publisher and co-author, which could be useful in searching for and retrieving objects, remained useless. Since the subject element was not used properly and the title content was the same as the content of the HTML title tag, these elements were not effective in the retrieval process.

Henshaw and Valauskas (2001) conducted an experimental study on selected pages of the electronic journal First Monday. Two groups of pages were included. One group served as the control group, with no metadata elements; the other group had Dublin Core metadata elements as well as HTML keywords and description meta-tags. The results revealed that metadata alone had no impact on the probability of the sources being indexed or ranked highly in search engine results.

Zhang and Dimitroff (2004), in an article entitled "Internet search engine's response to metadata Dublin Core implementation", reported an experimental study of seven main search engines. The test pages were categorized into two groups: a target group and a control group. The target group contained the subject element of the Dublin Core metadata scheme as well as the keyword element from HTML. The control group lacked any such elements. The results showed a significant difference between the two groups in terms of visibility to search engines; that is, six of the seven search engines responded positively to metadata elements.

Quevedo-Torrero (2004) devoted his doctoral thesis to improving retrieval from the web by means of HTML tags, using an experimental research method. The main purpose was to improve search quality and the retrieval of web pages by inserting keywords into HTML meta-tags as metadata. The research was conducted on a selection of search results from search engines such as Google and AltaVista. It formulated and suggested strategies for improving the ranking of search results on the basis of using HTML meta-tags as metadata and clustering web pages according to their link structures.

Zhang and Dimitroff (2005a) examined the effect of web page content features on visibility and inclusion in search engine results. Their research asked how authors or developers of pages or websites could improve the ranking of a page or a site in search engine results. The results revealed that repeating keywords in the title as well as in the full-text body improves the visibility of pages in search engine results. Factors such as color and font size proved to have no effect on visibility.

Zhang and Dimitroff (2005b) conducted another experimental study to examine the effect of implementing metadata on the appearance of web pages in search engine results. For this purpose they introduced 40 test web pages to 19 search engines.
The results showed that metadata is an appropriate and effective mechanism for improving the appearance and ranking of web pages in search engine results. Moreover, keywords extracted from web pages, especially from the title and full-text body, proved to be very effective in ranking.

Mohamed (2006) investigated the effect of metadata usage on the ranking and retrieval of web pages. The research was conducted in two parts. In part one, the effect of a metadata initiative on access to content objects was examined. In part two, metadata elements were added to web pages, and the extent of their indexing was measured along with the effect of metadata on page ranking. The results showed that description elements and keywords play a significant role in page ranking.

A couple of relevant studies have also been conducted in Iran. Safari (2005), in a study of 16 articles published in the web version of the Iranian International Journal of Science, examined the effect of Dublin Core metadata elements (4 of the 15 elements of this initiative) on the discovery of web sources and the improvement of their ranks in three search engines: Google, AltaVista and Lycos. The results of this experimental study showed no significant difference between the ranking of pages that contained Dublin Core metadata elements and that of the control group, which lacked such elements. Nor was any significant impact seen on the retrieval of pages.

Taheri (2008), in his Master's thesis, conducted a comparative study of the indexing quality and ranking of content objects containing XML-based Dublin Core and MARC 21 metadata elements by general search engines. Taheri's research shows that there is no significant difference between the indexing quality of content objects containing XML-based Dublin Core and MARC 21 metadata elements as performed by the Google and Yahoo search engines. No significant difference was observed in Google between the rankings of content objects containing the two metadata initiatives; in Yahoo, however, there was a significant difference in their ranking status. Finally, neither of the two XML-based metadata initiatives has preference over the other with regard to access through general search engines.

The research method

The present research tests the interoperability of content objects containing XML-based Dublin Core and MARC 21 metadata elements in dynamic online information environments with general search engines. 100 content objects, i.e. ebooks, were selected from the California digital library source set. These ebooks were selected using the URL, focusing on the subject "theory of knowledge". The content objects were divided into two groups. The first group contained XML-based Dublin Core metadata elements, and the second group contained XML-based MARC 21 metadata elements. Both groups were mounted on the research website and introduced to the Yahoo and Google search engines from late July 2009 till June. The data were collected in April 2010.

The website was introduced to the Google search engine through "Webmaster Tools", using the "XML Sitemap" option and "Suggest a site". Introduction to Yahoo! was done using the "Yahoo! Search URL Status Review Form" and "ROR & Text Sitemap" under the same conditions. Google could retrieve all the content objects fully by their Dublin Core and MARC 21 metadata elements. Yahoo, however, despite many follow-ups, did not respond at all, neither by the deadline nor, as far as we know, up to the time of writing. Therefore, the researchers had to rely only on the Google results.
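To make the two record structures concrete, the following is a minimal sketch of how the same title, author and subject might be expressed in a flat Dublin Core XML record and in a hierarchical MARCXML record. The sample records, the wrapper element `metadata`, and the helper names `dc_fields` and `marc_fields` are assumptions made for illustration; they are not taken from the study's website. The mapping of MARC 21 fields 245, 100 and 650 to title, author and subject follows common cataloguing practice.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
MARC_NS = "http://www.loc.gov/MARC21/slim"

# Hypothetical sample records, invented for illustration only.
DC_RECORD = f"""<metadata xmlns:dc="{DC_NS}">
  <dc:title>An Essay on Knowledge</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>Knowledge, Theory of</dc:subject>
</metadata>"""

MARC_RECORD = f"""<record xmlns="{MARC_NS}">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="3">
    <subfield code="a">An Essay on Knowledge</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Knowledge, Theory of</subfield>
  </datafield>
</record>"""


def dc_fields(xml_text):
    """Flat structure: each element name itself says what it holds."""
    root = ET.fromstring(xml_text)
    names = {"title": "title", "author": "creator", "subject": "subject"}
    return {key: root.findtext(f"{{{DC_NS}}}{name}") for key, name in names.items()}


def marc_fields(xml_text):
    """Hierarchical structure: numeric tags, with values nested in subfields."""
    root = ET.fromstring(xml_text)
    tags = {"title": "245", "author": "100", "subject": "650"}
    fields = {}
    for key, tag in tags.items():
        path = (f"{{{MARC_NS}}}datafield[@tag='{tag}']/"
                f"{{{MARC_NS}}}subfield[@code='a']")
        fields[key] = root.findtext(path)
    return fields


if __name__ == "__main__":
    print(dc_fields(DC_RECORD))
    print(marc_fields(MARC_RECORD))
```

Both helpers return the same three values, which mirrors the point of the comparison: the content is identical, and only the tag vocabulary (language-based versus numeric) and the nesting depth differ.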
The data were collected by means of a checklist devised according to the research questions and requirements. Data gathering was conducted with the query: "keyphrase"site:marcdcmi.ir. The data collected through the checklist were transferred to worksheets in which + and - signs indicated that an element was indexed or not indexed, respectively. These marks received the values 1 and 0 respectively for calculation purposes. The sums of these values were then used in the analyses and in answering the research questions. The data were then entered into SPSS and analyzed according to the research questions.

Research findings

As mentioned above, the Yahoo search engine never responded to the retrieval of metadata elements before the deadline. What follows is therefore an account of the results obtained from the Google search engine. Table 1 addresses the first and second research questions. Its contents indicate that Google was able to index the Dublin Core metadata elements (9 elements) as well as the MARC 21 elements (10 elements).
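The checklist scoring described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the sample marks are hypothetical, the query string simply reproduces the form quoted in the method section, and the original analysis was carried out in SPSS rather than in code like this.

```python
def site_query(keyphrase, site="marcdcmi.ir"):
    """Build a site-restricted query in the form quoted in the method section."""
    return f'"{keyphrase}"site:{site}'


def score(marks):
    """Convert checklist marks to numbers: '+' (indexed) -> 1, '-' (not indexed) -> 0."""
    values = {"+": 1, "-": 0}
    return [values[mark] for mark in marks]


def indexing_percentage(checklist):
    """Share of checks marked '+', over all elements and content objects."""
    scores = [value for marks in checklist.values() for value in score(marks)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical checklist: one string of marks per metadata element,
    # one character per content object that was checked.
    sample = {"title": "++", "author": "+-"}
    print(site_query("theory of knowledge"))
    print(indexing_percentage(sample))
```

Summing the 1 and 0 values and dividing by the number of checks gives the kind of indexing percentage reported per metadata initiative.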

Therefore, the XML-based content objects embedded in the dynamic online information environment of this research proved to be retrievable. In fact, the indexing quality of the selected elements by the Google search engine is suitable.

Table 1: The indexing quality of Dublin Core and MARC 21 metadata initiative elements in XML-based content objects in dynamic online information environments by the Google search engine
[Columns: metadata initiative; number of studied elements; number of content objects; indexing percentage on the website (in Google). Rows: Dublin Core; MARC 21. Surviving value: 100%.]

Table 2 illustrates the indexing quality of the Google search engine with regard to the title, author and subject elements in both the Dublin Core and MARC 21 metadata schemes. Its contents answer the third research question: what is the difference between the indexing quality of the three main elements (title, author and subject) of content objects containing XML-based Dublin Core and MARC 21 metadata elements in dynamic online information environments as performed by the Yahoo and Google search engines? The data reveal that Google is able to index the title, author, and subject elements in both the Dublin Core and MARC 21 metadata initiatives. Therefore, there is no difference between these elements in this regard.

Table 2: The indexing quality of the title, author and subject elements of Dublin Core and MARC 21 embedded in XML-based content objects in dynamic information environments by the Google search engine
[Columns: studied main element; metadata initiative; number of content objects; points obtained by content objects on the website (in Google). Rows: title (Dublin Core, MARC 21); author (Dublin Core, MARC 21); subject (Dublin Core, MARC 21).]

Table 3 is used to answer the fourth research question, regarding the quality of ranking.
As the contents of Table 3 indicate, the Google search engine follows the same policy in ranking XML-based content objects in dynamic online information environments. That is, of the content objects containing MARC 21 metadata elements, it places only 25 higher than the content objects containing Dublin Core elements. In other words, the share of XML-based content objects placed higher is the same, 25 objects in each group, for both Dublin Core and MARC 21.

Table 3: The ranking output of XML-based content objects containing Dublin Core and MARC 21 in dynamic online information environments by the Google search engine
[Columns: metadata initiative; total number of content objects; points of content objects placed higher on the website (in Google). Rows: Dublin Core; MARC 21.]

The contents of Table 2 are useful in answering the fifth and sixth questions. The data in Table 2 show that Google's indexing software does not discriminate between content objects with a flat structure and language-based tags and those with a hierarchical structure and no language-based tags.

Finally, the seventh research question seeks to determine the more suitable metadata initiative for the organization of XML-based content objects in dynamic online information environments in terms of accessibility through the Google search engine. In answer to this question, one would conclude that neither the XML-based Dublin Core metadata initiative nor MARC 21 shows any preference over the other in this regard. In other words, both metadata schemes are appropriate for the organization of XML-based content objects in dynamic online information environments, as far as accessibility through Google is concerned.

Discussion and Conclusion

According to the above data and discussion, it can be concluded that XML, as the syntactic ground for implementing the metadata elements of Dublin Core and MARC 21, can, in comparison with HTML, be effective in both static and dynamic environments; therefore, it seems to be more appropriate.
This is because it maximizes the interoperability between search engines and the metadata initiatives that aim at the identification, description, location and retrieval of content objects in static and dynamic online environments. Therefore, as is clear from the answers to questions 1 and 2, both metadata initiatives, i.e. Dublin Core and MARC 21, can be regarded as appropriate for making different content objects accessible in dynamic online environments via the Google search engine. On the other hand, neither of the two metadata initiatives proved to have a clear superiority over the other with regard to indexing. As a result, once these elements are embedded in XML-based content objects in online information environments, the objects can be accessed and retrieved easily.

As for the ranking of the content objects under study, it was found that the Google search engine does not discriminate between the two metadata initiatives, just as it does not when indexing the metadata elements of both initiatives. That is, it follows a similar pattern and policy in ranking the content objects containing these two initiatives. Also, from the answer to the third research question it was clear that the structure, whether flat or hierarchical, does not affect the quality of indexing of the content objects available in online information environments. The same holds for the reaction of the Google indexing software to metadata initiatives with and without language-based tags (Dublin Core and MARC 21 respectively). That is, the software indexes both kinds of content objects in XML-based dynamic online information environments. All in all, it can be concluded that both the Dublin Core initiative and MARC 21 are suitable for the organization of XML-based content objects in dynamic online information environments.

References

Henshaw, Robin; Valauskas, Edward J. (2001). "Metadata as a Catalyst: Experiments with Metadata and Search Engines in the Internet Journal, First Monday". Libri, 51 (2). [online] [5 Nov. 2009].

Mohamed, Khaled A. F. (2006). "The impact of metadata in web resources discovering". Online Information Review, 30 (2).

Quevedo-Torrero, J. U. (2004). "Improving Web Retrieval by Mining the HTML tags for Keywords and Exploring the Hyperlink Structures of Web Pages" [Abstract]. Doctoral dissertation, University of Houston. [online] [23 Oct. 2009].

Safari, Mahdi (2005). "Search Engines and Resource Discovery on the Web". Webology, 2 (2). [online] [13 Nov. 2009].

Sokvitne, Lloyd (2000). "An Evaluation of the Effectiveness of Current Dublin Core Metadata for Retrieval". [online] [13 Nov. 2009].

Taheri, Mahdi (2008). "A comparative survey on the indexing quality and ranking of content objects containing Dublin Core and MARC 21 metadata elements XML based by general search engines". Thesis in library and information science, Islamic Azad University of Tehran.

Turner, Thomas P.; Brackbill, Lise (1998). "Rising to the Top: Evaluating the Use of the HTML Meta tag To Improve Retrieval of World Wide Web Documents through Internet Search Engines". Library Resources and Technical Services, 42 (4). [online] [25 Sep. 2009].

Zhang, Jin; Dimitroff, Alexandra (2004). "Internet search engine's response to metadata Dublin Core implementation". Journal of Information Science, 30 (4). [online] [15 Nov. 2009].

Zhang, Jin; Dimitroff, Alexandra (2005a). "The impact of Webpage content characteristics on webpage visibility in search engine result (Part I)". Information Processing & Management, 41 (3). [online] [15 Nov. 2009]. doi: /j.ipm

Zhang, Jin; Dimitroff, Alexandra (2005b). "The impact of metadata implementation on Webpage visibility in search engine result (Part II)". Information Processing & Management, 41 (3). [online].
