PLATFORM OF TRANSCRIPTION THE OLD ARABIC MANUSCRIPTS


Noureddine EL MAKHFI
Laboratoire de Transmission et Traitement de l'Information (LTTI), UFR: Signaux Systèmes et Télécommunications, Faculty of Sciences and Technology, University of Sidi Mohamed Ben Abdellah, Fez, Morocco
n.elmakhfi@gmail.com

Rachid BENSLIMANE
Laboratoire de Transmission et Traitement de l'Information (LTTI), University of Sidi Mohamed Ben Abdellah, Fez, Morocco
r.benslimane1@gmail.com

ISSN : 0975-5462 Vol. 3 No. 6 June 2011 5358

Abstract: Old manuscripts kept in libraries are among the richest parts of the cultural heritage and legacy of civilizations. This heritage is difficult for users to access: access to national heritage manuscripts is restricted because physical handling accelerates their degradation. The widely adopted solution, which takes these access limits into account while preserving the originals, relies partly on digitizing the manuscripts and partly on developing platforms to manage and disseminate the digitized knowledge. In this paper we propose a platform for transcription and establishment of the text through annotation of manuscript images; the annotations conform to an XML model. Search within the images of a handwritten document, rich functionality, an intuitive user interface, portability, extensibility, and the power of XML technology make this platform an ideal explorer for specialists and readers of ancient Arabic manuscripts.

Keywords: Establishment; Transcription; Manuscripts; Digitalization; Arabic search engine; Metadata; Annotation.

1. Introduction

Searching for information in Arabic manuscripts is not a simple process. It faces several constraints: the cursive nature of Arabic script, the heavy presence of diacritical marks, and the overlap between words and lines in Arabic manuscripts.

Many research works and projects have been devoted to the processing of ancient and modern manuscripts [1]:

The BAMBI (Better Access to Manuscripts and Browsing of Images) project [2] produced a hypermedia system that allows historians, and more particularly codicologists and philologists, to read manuscripts, transcribe them, write annotations, and navigate between the words of the transcription and the matching region of the digitized image of the manuscript.

The DEBORA (Digital access to Books Of RenAissance) project [3] sought to define the needs related to the use of virtual libraries and electronic books. Its aim was to develop tools for digitizing and accessing a selection of 16th-century books.

The EAMMS (Electronic Access to Medieval Manuscripts) project [4] developed guidelines for encoding and storing catalog descriptions of medieval and Renaissance manuscripts in electronic form.

The MASTER (Manuscript Access Through Standards for Electronic Records) project [5] defined a description standard that is generic, robust, and flexible enough to be applied in different areas of manuscript description. The chosen technology is based on the international standards SGML (Standard Generalized Markup Language) and XML (Extensible Markup Language).

Access to heritage documents, including old Arabic manuscripts, involves creating a database of images of these manuscripts. Building such a database requires a scanning operation followed by the creation of metadata specific to these manuscripts. Our work in this field aims to facilitate information retrieval by identifying Arabic manuscripts through metadata and through annotations on the page images.

In a recent work [6], entitled "Searching in Arab manuscripts using metadata and annotations", we developed a method for handling old manuscripts based on the SDX platform [7]. The drawback of this method is that the XML files associated with the manuscripts are created manually: generating them requires the collaborative effort of several experts to annotate, transcribe, and establish the content of these old documents before they are published on the web. To overcome this drawback, we propose in this paper a new platform that facilitates the transcription and establishment of old Arabic handwritten text in the form of annotations conforming to an XML model, so that rare manuscripts become available.

2. Structure of handwritten documents

2.1. Metadata

Librarians have long used indexes to describe existing documents. Records are data used to qualify other data, such as book contents; that is why they are called metadata. They let us encode essential information about a document in a clear fashion: title, author, date of publication, keywords, etc. Table 1 summarizes the categories of metadata used in our work, as reported in [8].

Table 1. Metadata.

N   Adopted metadata          Values
1   Author                    Copyist; Name of possessor
2   Title                     Title of manuscripts; Title of chapters; Title of sub-chapters
3   Studied period            Islamic Jerusalem; Pre-Islamic; IV-X centuries; VI-VII centuries; Medieval Islamic (VII-XV); All periods
4   Category of manuscripts   Arabo-Christian; Arabo-Islamic
5   Material study            Types and dating of the paper of the document
6   Type of manuscript        Koran; Other religious texts; Scientists; Medicines; Literary; Moslem juridical; Philosophical; Histories; Etc.
7   Study of the writing      Style
8   Decoration of the texts   Illuminations; Illustrations; Frontispiece

2.2. Schema Diagram

2.2.1. Root Element

The root element of the XML file is called msdescription. It may reasonably appear either within the body or within the header of a TEI (Text Encoding Initiative) [9] conformant document. In the former case, where the document being encoded is essentially a collection of manuscript descriptions, the <msdescription> element may be placed anywhere a paragraph might appear. In the latter case, the description forms part of the metadata associated with a digital representation of some manuscript original, whether a transcription, a collection of digital images, or some combination of the two. The <msdescription> element has the following components, of which only the first is mandatory. It has seven child elements, each with its own complex type.
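The skeleton described above can be sketched in Python. This is only an illustration under stated assumptions: it takes the seven child elements to be the ones detailed in Sections 2.2.2 to 2.2.8, uses the lowercase element names as written in this paper, and stores plain text directly in each child as a simplification. The function name build_msdescription and the sample values are hypothetical; the actual content model is the one given by the schema diagrams.

```python
import xml.etree.ElementTree as ET

# Child elements of <msdescription>, in the order of Sections 2.2.2 to 2.2.8.
# Assumption: these are the "seven elements" the text refers to.
CHILD_ELEMENTS = (
    "msidentifier", "physdesc", "history", "mscontents",
    "logicstruct", "admininfo", "additional",
)


def build_msdescription(parts):
    """Build a <msdescription> tree from a dict {element name: text}.

    Only <msidentifier> is required, mirroring the statement that
    only the first component is mandatory.
    """
    if "msidentifier" not in parts:
        raise ValueError("msidentifier is mandatory")
    unknown = set(parts) - set(CHILD_ELEMENTS)
    if unknown:
        raise ValueError(f"unknown elements: {sorted(unknown)}")
    root = ET.Element("msdescription")
    for name in CHILD_ELEMENTS:  # emit children in schema order
        if name in parts:
            # Simplification: plain text instead of the real complex type.
            ET.SubElement(root, name).text = parts[name]
    return root


if __name__ == "__main__":
    # Hypothetical placeholder values, not taken from a real record.
    record = build_msdescription({
        "history": "EXAMPLE-HISTORY-NOTE",
        "msidentifier": "EXAMPLE-SHELFMARK",
    })
    print(ET.tostring(record, encoding="unicode"))
```

Keeping the element order in one tuple means a record is always serialized in the same order, whatever order the form fields were filled in.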

Fig. 1. Schema diagram for msdescription Element.

2.2.2. Msidentifier Element

This element includes all the elements which allow the identification of the manuscript or fragment of manuscript.

Fig. 2. Schema diagram for msidentifier Element.

2.2.3. Physdesc Element

Under the general heading of 'physical description' we subsume a large number of different aspects generally regarded as useful in the description of a given manuscript. These include aspects of the form, support, extent, and quire structure of the manuscript object; aspects of the writing...

Fig. 3. Schema diagram for physdesc Element.

2.2.4. History Element

This element groups the elements describing the full history of a manuscript or manuscript part.

Fig. 4. Schema diagram for history Element.

2.2.5. Mscontents Element

The <mscontents> element is used to describe the intellectual content of a manuscript or manuscript part. It comprises either a series of informal prose paragraphs or a series of more structured <msitem> elements, each of which provides a more detailed description of a single item contained within the manuscript.

Fig. 5. Schema diagram for mscontents Element.

2.2.6. Logicstruct Element

This element describes the logical structure of the document: contents, parts, chapters, etc.

Fig. 6. Schema diagram for logicstruct Element.

2.2.7. Admininfo Element

A variety of information relating to the curation and management of a manuscript may be recorded as simple prose narrative tagged using the standard <p> element.

Fig. 7. Schema diagram for admininfo Element.

2.2.8. Additional Element

This element groups other related information about a manuscript, in particular administrative information relating to its current location, additional materials associated with it, etc.

Fig. 8. Schema diagram for additional Element.

3. Proposed method for indexing and annotating manuscripts

3.1. Principle of the proposed method

The platform offers customizable interfaces for manuscript images and provides search and browsing functionality for users of Arabic manuscripts. It is based on an XML representation of the metadata, the annotations, and the index of the manuscript pages.

Fig. 9. Digitization, processing, indexing and searching in handwritten documents.

3.2. Adding new documents

The page images of the manuscript document to be created are stored in folders; the browse button selects the path to the pages. The form lets the user fill in the values of the metadata fields (library, book title, copyist, support...). The topic selection panel chooses a topic, or creates it if it does not yet exist, according to the XML schema [6].

Fig. 10. Adding books according to an XML schema.

3.3. Technique for Image Document Annotations

To allow information retrieval by content, readers can contribute text annotations by selecting areas of the image of a manuscript page [10]. They can also associate links and attach files (pdf, doc, XML, wav, mp3...).

Fig. 11. Schema diagram for annotation.

Annotated pages with different icons

Fig. 12. Annotation of a page from a handwritten document [11].
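As an illustration of how such an annotation could be stored and then used for retrieval by content, here is a minimal Python sketch. It assumes an annotation is a rectangular area of a page image with a text, optional links, and optional attached files. The element and attribute names (annotation, region, text, link, attachment), the helper names, and the sample values are all hypothetical, since the annotation schema of Fig. 11 is not reproduced here; the search is a plain substring match, not the platform's actual search engine.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A text annotation attached to a rectangular area of a page image."""
    page: str      # page image file name
    x: int         # left edge of the selected area, in pixels
    y: int         # top edge of the selected area, in pixels
    width: int
    height: int
    text: str      # transcription or comment typed by the reader
    links: list = field(default_factory=list)        # associated URLs
    attachments: list = field(default_factory=list)  # attached files (pdf, wav...)

    def to_xml(self):
        # Hypothetical element and attribute names.
        node = ET.Element("annotation", page=self.page)
        ET.SubElement(node, "region", x=str(self.x), y=str(self.y),
                      width=str(self.width), height=str(self.height))
        ET.SubElement(node, "text").text = self.text
        for url in self.links:
            ET.SubElement(node, "link", href=url)
        for path in self.attachments:
            ET.SubElement(node, "attachment", src=path)
        return node


def save_annotations(annotations):
    """Group annotations under one root element, ready to be written to a file."""
    root = ET.Element("annotations")
    root.extend([a.to_xml() for a in annotations])
    return root


def search_annotations(root, query):
    """Return (page, region attributes) for annotations whose text contains query."""
    hits = []
    for node in root.iter("annotation"):
        if query in (node.findtext("text") or ""):
            hits.append((node.get("page"), dict(node.find("region").attrib)))
    return hits


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    notes = [Annotation("page_001.jpg", 40, 120, 300, 45, "kitab nakhla")]
    tree = save_annotations(notes)
    print(ET.tostring(tree, encoding="unicode"))
    print(search_annotations(tree, "nakhla"))
```

Returning the region together with the page name lets an interface highlight the matching area of the image rather than only listing the page.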

3.3.1. Textual annotation of pages

A created annotation can be saved in XML format; it can then serve as supporting data for searching the manuscript by content.

Fig. 13. Results of a search by textual annotation.

3.3.2. Graphical annotation of pages

We propose to support document search based on geometric annotations on the same footing as textual annotations. In addition, we have created relationships between the geometric annotations and the text information specific to each type of annotation. This information is saved in XML format to serve as search keywords.

Fig. 14. Graphical annotation interface for a page of a manuscript document [11].

4. Conclusion

In this paper we presented a new platform for transcription and establishment of the text through annotation of images of Arabic manuscripts; the annotations conform to an XML model. The platform supports search within the images of Arabic handwritten documents using both metadata and annotations. The results obtained could be improved by using word spotting, which is a semi-automatic annotation method.

References

[1] Stéphane Nicolas, Thierry Paquet, Laurent Heutte, 2003. Digitizing Cultural Heritage Manuscripts: the Bovary Project. In ACM Symposium on Document Engineering, ACM Doc Eng 2003, Grenoble, France, pp. 55-57.
[2] Sylvie Calabretto, Andrea Bozzi, Jean-Marie Pinon, décembre 1999. Numérisation des manuscrits médiévaux : le projet européen BAMBI. In: Actes du colloque Vers une nouvelle érudition : numérisation et recherche en histoire du livre, Rencontres Jacques Cartier, Lyon.
[3] DEBORA: projet européen n. LB 5608 A. Coordinateur R. Bouché, juin 2000. 179 pages.
[4] http://www.hmml.org/eamms/index.html
[5] Lou Burnard, Peter Robinson, 1999. Vers un standard européen de description des manuscrits : le projet Master. Document numérique, vol. 3, no. 1-2, pp. 151-169.
[6] O. El Bannay, R. Benslimane, N. El Makhfi, N. Rais, 2009. Searching in Arab Manuscripts Using Metadata and Annotation. European Journal of Scientific Research, ISSN 1450-216X, Vol. 28, No. 1, pp. 155-164. EuroJournals.
[7] http://sdx.archivesdefrance.culture.gouv.fr/gpl/navimages/
[8] Hala Kaileh, 2004. L'accès à distance aux manuscrits arabes numérisés en mode image. Thèse présentée devant l'université Lumière Lyon II.
[9] http://www.tei-c.org/index.xml
[10] Bertrand Coüasnon, Ivan Leplumey, 2003. A Generic Recognition System for Making Archives Documents accessible to Public. Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), IEEE.
[11] Kitab nakhla, G582, Edition av 1862, National library, Rabat, Morocco.