Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization


Katsuya Masuda*, Makoto Tanji**, and Hideki Mima***

* Center for Research and Development of Higher Education, University of Tokyo
** Center for Knowledge Structuring, University of Tokyo
*** Graduate School of Engineering, University of Tokyo

Journal of the Japanese Association for Digital Humanities, vol. 1, 37

Abstract
This study proposes a framework for accessing the modern history of Japanese philosophy using natural language processing (NLP) and visualization. Discovering new knowledge in massive amounts of information requires the support of information technologies. To support knowledge discovery from a vast number of books, we developed an OCR-based automatic book-digitizing framework, together with a system that visualizes documents and the relationships among them computed with NLP techniques. We applied the framework to the Japanese journal Shisō ("Thought"), published by Iwanami Shoten, and we show an example of a knowledge structure extracted from Shisō with our visualization system.

1. Introduction
The purpose of this study is to provide access to the modern history of Japanese philosophy using natural language processing (NLP) and visualization. Knowledge(1) has been increasing at an exponential rate with recent advances in science and technology, producing amounts of information that are extremely difficult to process manually. It is therefore important to use information technologies (IT) to support the discovery of new knowledge from vast resources, such as literature, that are now available digitally. To implement the study, we have developed:
1) an automatic digitization framework for historical documents,
2) a computational model for extracting an ontology from the digitized corpus, and
3) an interactive user interface (UI) to support the discovery of new knowledge.

(1) Although the definition of knowledge is domain-specific, we define knowledge here as the particles represented by an ontology, that is, the (hierarchical) collection and classification of (technical) terms used to recognize their semantic relevance.

We chose the Japanese journal Shisō ("Thought") by the Japanese publisher Iwanami

Shoten as our target corpus. Shisō is one of the most representative journals of philosophy in Japan, with a history of over ninety years, from 1921 to the present; it comprises about 10,000 papers and around 200,000 pages of text. The first step of this study was to develop a technology for digitizing such a large amount of textual data from physical books (semi-)automatically. Because the corpus was too large to digitize manually (i.e., by typing), a rapid, accurate, and low-cost approach was required. We therefore developed an OCR-based (semi-)automatic book-digitizing framework that integrates three processes: (1) book scanning, (2) OCR, and (3) automatic document style recognition. The inputs to the framework are physical books, and the output is digitized text with metadata (titles, authors, page numbers, and dates). Because we employ machine learning techniques for document style recognition, the digitizing framework can be applied to other styles of documents.

The next step is to use the digitized text to support the discovery of new knowledge. We developed an ontology extraction system based on NLP technology, and a system that visualizes the documents and the relationships among them derived from the extracted ontologies.

2. Automatic Digitizing System
We have developed an automatic digitizing and document analysis system. The flow of the entire process is as follows:
- Scanning: scan the books to generate an image file of each page.
- Character Recognition: recognize the characters and text blocks by applying an OCR process.
- Logical Layout Analysis: estimate the logical layout, that is, the logical types of the text blocks, such as body, title, and author.
- Reading Order Estimation: estimate the order in which the text blocks should be read.
- Text Extraction: extract the text by collecting the text blocks in reading order.

Figure 1 shows an overview of our digitizing flow. In the Scanning step, we apply a non-destructive book scanner to the historical books and create an image of each page.
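The five-step flow above can be sketched as a small pipeline. All of the function bodies below are stand-in stubs of our own (the real system uses a non-destructive scanner, a commercial OCR engine, and the learned models of sections 2.1 and 2.2); only the shape of the data flow is taken from the text.

```python
# Stub pipeline mirroring the five steps: Scanning has already produced
# page images; each later step consumes the previous step's output.
def ocr(page_image):                      # Character Recognition (stub)
    return page_image["blocks"]

def classify_types(blocks):               # Logical Layout Analysis (stub)
    return [dict(b, type="Text Body") for b in blocks]

def reading_order(blocks):                # Reading Order Estimation (stub)
    return sorted(blocks, key=lambda b: b["y"])

def extract_text(blocks):                 # Text Extraction
    return "".join(b["text"] for b in blocks if b["type"] == "Text Body")

# A fake scanned page whose two blocks arrive out of reading order.
page = {"blocks": [{"y": 2, "text": "world"}, {"y": 1, "text": "hello "}]}
print(extract_text(reading_order(classify_types(ocr(page)))))  # hello world
```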
In the Character Recognition step, we apply a customized, commercially available OCR system to the scanned image files. We customized the OCR system to output XML files that contain not only the text and the text blocks themselves but also additional information used in later steps, such as character size, character position, and block position. In the next two steps, Logical Layout Analysis and Reading Order Estimation, we estimate the logical types and the reading order of the text blocks automatically using machine learning techniques. In the last step, Text Extraction, we extract the text from the OCR results using the logical types and reading order estimated in the previous steps. In the

rest of this section, we describe the details of the Logical Layout Analysis and Reading Order Estimation steps.

Figure 1. Overview of Digitizing Flow

2.1. Logical Layout Analysis
For the Logical Layout Analysis step, we propose a method that identifies the logical types of the blocks using a machine learning technique. To estimate the logical types, we employ a Support Vector Machine (SVM) (Cortes and Vapnik 1995), a machine learning model for classification, with various features that characterize each block based on the text and the additional information in the OCR results. The features we selected are as follows:
- Position: the x and y coordinates of the block
- Blank Space Length: the length of the blank space in each of the four directions (above, below, left, and right)
- Block Size: the width and height of the block
- Character Size: the average width and height of the characters in the block
- Noun: the percentage of nouns in the text of the block
- Personal Name: the percentage of personal-name words in the text of the block

The values of the last two features, Noun and Personal Name, are calculated from the output of the Japanese morphological analyzer MeCab.(2)

(2) MeCab: Japanese Morphological Analyzer. http://code.google.com/p/mecab/.

We evaluated the proposed approach against an existing rule-based approach, which classifies the blocks using hand-written rules based on position and character size, using the OCR data of the journal Shisō. In the experiments, both methods classified blocks into five logical types: Title, Author, Header, Page Number, and Text Body. The proposed approach was trained on a small set of OCR results whose correct logical types had been annotated by hand. Table 1 shows the accuracy of logical type estimation by the two approaches. The results show that the proposed

approach classified the blocks more accurately than the rule-based approach, especially for the Title and Author blocks, which are important for recognizing the boundaries of an article.

Table 1. Accuracy (%) of Logical Layout Analysis

                Proposed approach      Rule-based approach
                Precision   Recall     Precision   Recall
Title             97.6       94.5        95.7       87.3
Author            99.6       97.4        98.7       94.5
Header            97.4       98.4        98.2       96.7
Page Number       99.4       99.6        99.7       99.2
Text Body         99.3       99.2        97.9       99.2

2.2. Reading Order Estimation
A page contains a number of text blocks in the OCR results, and we need to know their order to extract the whole text body correctly. For reading order estimation, we introduce a Page Splitting method. The method splits a page vertically or horizontally into two areas and splits those areas recursively until each area contains a single block, as shown in figure 2. The reading order is then derived from predefined orientations, for example, right to left and top to bottom for Shisō documents. Different split rules lead to different reading orders; in our method, a split rule is determined by a score function (the inner product of a weight vector and the feature vector of a split candidate). To learn the split rule, we use Differential Evolution (DE), an optimization method, which searches for the weight vector value that yields the optimal split rule. The features used here are the following:
- the type of split (vertical or horizontal),
- the number of blocks that lie under the split line,
- the position of the split line,
- the width of the split, and
- the types of the blocks (e.g., whether the split area contains Title or Author blocks).
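Stepping back to section 2.1, the SVM block classification can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation; the feature values and training samples are invented for the example. Each block is reduced to a numeric vector roughly matching the feature list of section 2.1: position, block size, character size, and the noun and personal-name ratios.

```python
# Minimal sketch of SVM-based logical layout analysis (assumed details).
# Feature vector per block: [x, y, width, height, avg_char_size,
#                            noun_ratio, personal_name_ratio]
from sklearn.svm import SVC

# Tiny hand-made training set: one annotated block per logical type.
train_X = [
    [0.5, 0.05, 0.80, 0.06, 14.0, 0.9, 0.0],  # Title: top, large characters
    [0.5, 0.12, 0.30, 0.03, 10.0, 0.5, 0.9],  # Author: name-heavy text
    [0.5, 0.55, 0.90, 0.70,  8.0, 0.3, 0.1],  # Text Body: large central block
    [0.5, 0.97, 0.05, 0.02,  7.0, 0.0, 0.0],  # Page Number: tiny, at the foot
]
train_y = ["Title", "Author", "Text Body", "Page Number"]

clf = SVC(kernel="linear")
clf.fit(train_X, train_y)

# An unseen block that looks like body text is classified accordingly.
print(clf.predict([[0.5, 0.50, 0.85, 0.65, 8.0, 0.35, 0.05]])[0])
```

In the real system the hand-annotated training set is larger than this, and the last two feature values come from MeCab's morphological analysis.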

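The Page Splitting recursion itself can be sketched as below. The details are assumed simplifications: instead of a learned score function, this toy version splits at the first gap that no block crosses, and it hard-codes the Shisō reading orientations (right to left, top to bottom).

```python
# Recursive page splitting (assumed, simplified). A block is
# (label, x0, y0, x1, y1) with the origin at the top-left corner.
def split_order(blocks):
    if len(blocks) <= 1:
        return list(blocks)
    # Vertical split: an x-gap crossed by no block; right area reads first.
    xs = sorted(blocks, key=lambda b: b[1])
    for i in range(1, len(xs)):
        if max(b[3] for b in xs[:i]) <= xs[i][1]:
            return split_order(xs[i:]) + split_order(xs[:i])
    # Horizontal split: a y-gap crossed by no block; top area reads first.
    ys = sorted(blocks, key=lambda b: b[2])
    for i in range(1, len(ys)):
        if max(b[4] for b in ys[:i]) <= ys[i][2]:
            return split_order(ys[:i]) + split_order(ys[i:])
    return list(blocks)  # no clean split: keep the given order

# A page with a full-width title above two columns of body text.
page = [
    ("left column",  0.00, 0.15, 0.45, 0.90),
    ("title",        0.00, 0.00, 1.00, 0.10),
    ("right column", 0.55, 0.15, 1.00, 0.90),
]
print([b[0] for b in split_order(page)])  # title, then right, then left
```

The learned score function replaces the fixed "first gap" choice above: when several candidate splits exist, the weight vector trained by DE decides which one to take.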
Figure 2. Page Splitting Method

The proposed approach was evaluated on the OCR results of Shisō documents, after being trained on a small set of OCR results with correct, human-annotated reading orders. For the Shisō documents, the approach estimated the reading order of the blocks with a Spearman distance of about 0.04. The experimental results for the above two steps show sufficient accuracy for the knowledge extraction that follows. Moreover, since we employ machine learning techniques to estimate the logical types and the reading order of the blocks, our approach can be applied to documents with other types of layout, given correct data for training the estimation systems.

3. Visualization System
We constructed a visualization system for the set of digitized documents. Its main objective is to facilitate knowledge acquisition from the documents, and the generation of ideas, through terminology-based real-time calculation of document similarities and their visualization in an interactive UI. Figure 3 outlines the visualization of knowledge structures for papers relevant to the keyword shisō (思想, "thought") in the 1930s. The system extracts terminological information from the text in advance using NLP techniques (Mima and Ananiadou 2000). It then constructs a graph that structures the knowledge: the nodes (dots) represent the papers relevant to the keyword, and the links between the nodes represent semantic similarities that are

calculated from the extracted terminological information. Additionally, the locations of all the nodes are calculated and optimized when the graph is drawn, so that the distance between two nodes reflects how close they are in meaning. Cluster recognition is also carried out by detecting groups of papers in which every pair of included papers is strongly linked (i.e., their similarity exceeds a threshold). As seen in figure 3, several clusters are automatically recognized, and category names such as Marxism, Socialism, and right-wing thought are automatically assigned to the clusters to give an overview of the thought discussed in these papers.

Figure 3. Visualization of knowledge structure
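The similarity links and threshold clustering behind figure 3 can be illustrated as follows. This is our own minimal sketch: the real system derives its terminological information with the C/NC-value method, whereas here each paper is just a bag of invented keywords compared by cosine similarity.

```python
# Toy similarity graph (assumed details): papers as term-count vectors,
# linked when their cosine similarity exceeds a threshold.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

papers = {
    "p1": Counter(["marxism", "class", "class", "labor"]),
    "p2": Counter(["marxism", "class", "capital"]),
    "p3": Counter(["aesthetics", "poetry", "form"]),
}

THRESHOLD = 0.5
links = [(i, j) for i in papers for j in papers
         if i < j and cosine(papers[i], papers[j]) > THRESHOLD]
print(links)  # only p1 and p2 share enough terms to be linked
```

In the visualization, groups in which every such pairwise link is present become clusters, and the node layout pulls strongly linked papers close together.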

4. Conclusion
We have described a framework for revealing aspects of the modern history of Japanese philosophy. It consists of the digitization of documents to extract text, the extraction of terminological information from the text with NLP techniques, and the visualization of the target documents based on that terminological information. Our current target is the journal Shisō ("Thought"), one of the major journals of philosophy in Japan. We processed the Shisō documents with our framework and constructed a search and visualization system that helps researchers view the whole document set and discover new knowledge. By digitizing and analyzing huge amounts of historical textual data with the system, we expect to discover new knowledge about the historical flow of Japanese thought during one of its most important eras, from before World War II to the present day. Although the framework described in this paper was constructed specifically for Shisō, it can be applied to any type of document that can be recognized by an OCR system. As a next step, we will apply the framework to documents in other areas in order to enable knowledge discovery in those areas as well.

References
Cortes, Corinna, and Vladimir Vapnik. 1995. "Support-Vector Networks." Machine Learning 20 (3): 273-97. doi:10.1007/bf00994018.
Mima, Hideki, and Sophia Ananiadou. 2000. "An Application and Evaluation of the C/NC-value Approach for the Automatic Term Recognition of Multi-word Units in Japanese." International Journal on Terminology 6 (2): 175-94. doi:10.1075/term.6.2.04mim.