Unicode Encoding. The TITUS Project
Unicode Encoding and Online Data Access
Ralf Gehrke / Jost Gippert
The TITUS Project (Thesaurus indogermanischer Text- und Sprachmaterialien), since 1987/1993
Scope of the TITUS project: electronic retrieval engine covering the textual heritage of all ancient Indo-European languages
Present retrieval task: documentation of the usage of all word forms occurring in the texts, in their respective contexts

Survey of the parts of the text database

Data formats (since 1995):
Text formats:
- WordCruncher text format (8-bit)
- HTML (UTF-8, Unicode 4.0)
- (Plain 7-bit ASCII format)
Database format:
- MS Access (relational, Unicode-based)
- Retrieval via SQL
Original scripts covered:
- Latin (with all kinds of diacritics), incl. variants*
- Greek
- Slavic (Cyrillic and Glagolitic*)
- Armenian
- Georgian
- Devanāgarī
- Other Brāhmī scripts (Tocharian, Khotanese)*
- Avestan*
- Middle Persian (Pahlavī)*
- Manichean*
- Arabic (incl. Persian)
- Runic
- Ogham
- and many more
* not yet encodable (as such) in Unicode

Example 1a: Donelaitis (Lithuanian: formatted text incl. diacritics; 8-bit version)
Example 1b: Donelaitis (Lithuanian: formatted text incl. diacritics; Unicode version)
Example 2a: Catechism (Old Prussian: formatted text incl. diacritics; 8-bit version, special TITUS font)
Example 2b: Catechism (Old Prussian: formatted text incl. diacritics; Unicode version, no special font)
Example 3a: Codex Suprasliensis (Old Church Slavonic Cyrillic text in tentative Unicode encoding; TITUS font)
Example 3b: Kiev folia (Old Church Slavonic Glagolitic text in substitutional Unicode encoding; TITUS font)
Example 4a: Rigveda (Sanskrit text in Unicode encoding, Roman transcription)
Example 4b: Rigveda (same, other MS Windows 2000 font)
Example 4c: Rigveda (Sanskrit text in Unicode encoding, Devanāgarī script; MS Windows 2000 font)
Example 4d: Rigveda (same, other MS Windows 2000 font)
Example 5a: Vīs u Rāmīn (Early New Persian text in Unicode encoding, Roman transcription)
Example 5b: Vīs u Rāmīn (Early New Persian text in Unicode encoding, original script; MS Windows 2000 font)
Example 5c: Vīs u Rāmīn (same, other MS Windows 2000 font)
Basis of online retrieval: multilevel referencing system defining
Text structure levels:
- Texts
- Chapters
- Paragraphs
Representation structure levels:
- Pages
- Lines
- Formatting types (headers, catchwords etc.)
- Language / script specific encoding

Query preliminaries: manual query entry via form
Features:
- Preselection of languages / varieties
- Text-independent search
- Preselection of query type
- Combined search of up to 4 word forms
- 7-bit based manual entry of word forms
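The two groups of levels above can be pictured as one reference record per word-form occurrence. The following is only a minimal sketch: the class name, field names, types, and sample values are hypothetical and are not taken from the TITUS database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical record locating one word-form occurrence."""

    # Text structure levels
    text: str
    chapter: int
    paragraph: int
    # Representation structure levels
    page: str
    line: int


# Illustrative values only; they do not describe a real database entry.
hit = Reference(text="Rigveda", chapter=1, paragraph=1, page="1", line=1)
print(hit)
```

Keeping both groups in one record lets a query result be cited either by its position in the text or by its position in a particular edition.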
Query form: Language preselection
Query form: Type preselection
Query form: Combined search
Query form: Result
Query preliminaries: user input
Feature: alternate 7-bit (ASCII) based manual entry of word forms
Purpose: cross-platform compatibility
Problem: unavailability and/or inapplicability of national keyboards
Precondition: English keyboard available and accessible everywhere

Query form: Character input
TITUS bibliography: An example
Query form: Character input. Example 1: Latin special characters
Query form: Character input. Example 2: Ancient Greek characters
Query form: Character input. Example 3: Slavonic (Cyrillic) characters
ALCTS "Library Catalogs and Non-

Query preliminaries: data transfer
Feature: 7-bit (ASCII) based transmission of data in query strings
Purpose: secure cross-platform compatibility
Problem: unavailability and/or inapplicability of Unicode data in data transmission via HTTP
Precondition: representation of non-ASCII characters by hex strings

Query form: Character input. Example 1: Latin special characters
Sanskrit īḷe is represented by 2B01371Ee. Each non-ASCII character is written as four hex digits, low byte first:
ī = U+012B (2B01)
ḷ = U+1E37 (371E)
e = ASCII e
N.B. Wherever a precomposed character is encodable as such, this is used in the text database.
titusinx.asp?lxlang=22035&lxword=2b01371ee&LCPL=0&TCPL=0&C=A
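The hex strings in these examples are consistent with UTF-16 code units written low byte first, with ASCII characters passed through unchanged. A minimal sketch of such an encoder follows; the function name is hypothetical and the byte-order rule is inferred from the examples, not from TITUS source code.

```python
def encode_word(word: str) -> str:
    """Encode a word form as a 7-bit string: ASCII stays, the rest becomes hex."""
    out = []
    for ch in word:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII is passed through unchanged
        else:
            # UTF-16 little-endian bytes, written as uppercase hex digits
            out.append(ch.encode("utf-16-le").hex().upper())
    return "".join(out)


# U+012B, U+1E37, then ASCII "e"
print(encode_word("\u012b\u1e37e"))  # prints 2B01371Ee
```

The slides do not say how the server tells an ASCII letter in the range a to f apart from a hex digit when decoding, so this sketch covers only the encoding direction.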
Query form: Character input. Example 2: Romanized Devanāgarī
Sanskrit īḷe is also represented by 2B01371Ee:
ī = U+012B (2B01)
ḷ = U+1E37 (371E)
e = ASCII e
titusinx.asp?lxlang=23059&lxword=2b01371ee&LCPL=0&TCPL=0&C=D
Query form: Character input. Example 3: Greek characters
Greek ἄνδρα is represented by a)ndra or 041FBD03B403C103B103:
ἄ = U+1F04
ν = U+03BD
δ = U+03B4
ρ = U+03C1
α = U+03B1
titusinx.asp?lxlang=8&lxword=041fbd03b403C103B103&LCPL=0&TCPL=0&C=H
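Because the word form is already reduced to 7-bit characters, the query string can be assembled without any further escaping. The sketch below rebuilds the Greek example with Python's standard library. The parameter names are copied from the URL on the slide. What LCPL, TCPL, and C mean is not stated there, so their values are simply reused as given.

```python
from urllib.parse import urlencode

# Parameter names and values as they appear in the example URL.
params = {
    "lxlang": 8,                       # language preselection code
    "lxword": "041FBD03B403C103B103",  # hex-encoded word form
    "LCPL": 0,
    "TCPL": 0,
    "C": "H",
}

query = "titusinx.asp?" + urlencode(params)
print(query)
# prints titusinx.asp?lxlang=8&lxword=041FBD03B403C103B103&LCPL=0&TCPL=0&C=H
```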
Query form: Character input. Example 4: Optional disregard of diacritics
Greek ἄνδρα is represented by andra or B103BD03B403C103B103:
α = U+03B1
ν = U+03BD
δ = U+03B4
ρ = U+03C1
α = U+03B1
titusinx.asp?lxlang=8&lxword=b103bd03b403C103B103&LCPL=0&TCPL=0&C=H
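One way to obtain the diacritic-free form is to decompose each character and drop the combining marks. This is a sketch using Python's `unicodedata` module, not a description of how the TITUS server does it; the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return the word with all combining marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# U+1F04 (alpha with psili and oxia) followed by nu, delta, rho, alpha
plain = strip_diacritics("\u1f04\u03bd\u03b4\u03c1\u03b1")
print(plain.encode("utf-16-le").hex().upper())  # prints B103BD03B403C103B103
```

The printed hex string matches the one shown in Example 4 for the search that disregards diacritics.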
Database properties:
- Unicode-specific treatment of diacritics
- Software-specific treatment of diacritics
- TITUS-specific treatment of diacritics

Database properties: Unicode-specific treatment of diacritics
Precomposed characters vs. sequences of characters and diacritics
Correct treatment must be warranted by software
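The difference between a precomposed character and a base-plus-diacritic sequence can be seen in a few lines. This sketch uses Unicode normalization from Python's standard library to show why software has to treat the two forms as equivalent explicitly. The example character is U+1E37, which also occurs in the Sanskrit query examples.

```python
import unicodedata

precomposed = "\u1e37"  # LATIN SMALL LETTER L WITH DOT BELOW
sequence = "l\u0323"    # "l" followed by COMBINING DOT BELOW

# A naive comparison treats the two spellings as different strings.
print(precomposed == sequence)  # prints False

# Normalizing both sides to one form makes them comparable.
print(unicodedata.normalize("NFC", sequence) == precomposed)  # prints True
print(unicodedata.normalize("NFD", precomposed) == sequence)  # prints True
```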
Database properties: software-specific treatment of diacritics (MS Access 2000 / XP)
SQL query for <a> yields <a, á, à, â> etc., while SQL query for <á> yields only <á>
Special functions depending on modern languages

Database properties: TITUS language-specific treatment of diacritics
SQL query for Lithuanian <s$> yields <š, sch, sz> etc., while SQL query for <sch> yields only <sch>
Special functions depending on cross-historical orthographic properties of languages
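The language-specific behaviour can be pictured as expanding a generic query symbol into its historical spellings before the database is searched. The sketch below is an assumption about how such an expansion might look: the table holds only the three Lithuanian variants named above (the slide's "etc." implies more), and the function and table names are hypothetical.

```python
from itertools import product

# Generic query symbol -> spellings it should match (Lithuanian, partial).
LITHUANIAN_VARIANTS = {"s$": ["š", "sch", "sz"]}


def expand_query(query: str, variants: dict[str, list[str]]) -> list[str]:
    """Return every spelling that a query with generic symbols stands for."""
    parts = []
    i = 0
    while i < len(query):
        for symbol, spellings in variants.items():
            if query.startswith(symbol, i):
                parts.append(spellings)  # generic symbol: all variants
                i += len(symbol)
                break
        else:
            parts.append([query[i]])     # ordinary character: itself only
            i += 1
    return ["".join(combo) for combo in product(*parts)]


print(expand_query("s$", LITHUANIAN_VARIANTS))   # prints ['š', 'sch', 'sz']
print(expand_query("sch", LITHUANIAN_VARIANTS))  # prints ['sch']
```

The expansion runs in one direction only, which mirrors the asymmetry on the slide: the generic symbol matches every spelling, while a concrete spelling matches only itself.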
Database properties: Example SARDS
SARDS = South Asia Research Documentation Services
Part 1 covers the years and contains more than citations of research papers (no monographs) on Indology and South Asia Studies

TUSTEP encoding: Some diacritics
SARDS in TUSTEP encoding
SARDS in Unicode encoding

A question to librarians
How is Unicode changing the cataloguing of books? Are authors and titles entered in original script or in transcriptions? Or will both methods be used in parallel?
Bibliography in original script and transcription: Example

UniTeNS
UniTeNS = Unified Text Numbering System
A new proposal for an identification system for texts
Each text is awarded a 48-digit number, where the number reflects author, language, era, sort of text etc.
This number is independent of publication in print or electronic form or manuscripts

UniTeNS
All texts should be catalogued according to a complete classification scheme
A central institution should keep track of publications of each text in printed, electronic or other form
All publishers of texts should notify this institution about each publication

Text numbering system: Example