Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization

Katsuya Masuda*, Makoto Tanji**, and Hideki Mima***

Abstract

This study proposes a framework for accessing the modern history of Japanese philosophy using natural language processing (NLP) and visualization. Discovering new knowledge from massive amounts of information requires the support of information technologies. To support knowledge discovery from a vast number of books, we developed an OCR-based automatic book-digitizing framework and a system that visualizes documents together with the relationships among them, calculated using NLP techniques. We applied the framework to the Japanese journal Shisō (Thought), published by Iwanami Shoten, and show an example of the knowledge structure extracted from Shisō using our visualization system.

1. Introduction

The purpose of this study is to provide access to the modern history of Japanese philosophy using natural language processing (NLP) and visualization. Knowledge[1] has been increasing at an exponential rate with advances in science and technology in recent years, resulting in massive amounts of information that are extremely difficult to process manually. It is therefore important to use information technologies (IT) to support the discovery of new knowledge from vast resources, such as literature, that are now available digitally. To implement the study, we have developed: 1) an automatic digitization framework for historical documents, 2) a computational model for extracting ontology from the digitized corpus, and 3) an interactive user interface (UI) to support the discovery of new knowledge.
* Center for Research and Development of Higher Education, University of Tokyo
** Center for Knowledge Structuring, University of Tokyo
*** Graduate School of Engineering, University of Tokyo

[1] Although the definition of knowledge is domain-specific, we define knowledge here as the particles represented by ontology, which is the (hierarchical) collection and classification of (technical) terms used to recognize their semantic relevance.

Journal of the Japanese Association for Digital Humanities, vol. 1, 37

We chose the Japanese journal Shisō (Thought) by the Japanese publisher Iwanami
Shoten as our target corpus. This is one of the most representative journals of philosophy in Japan, with a history of over ninety years, from 1921 to the present. It comprises about 10,000 papers and around 200,000 pages of textual data.

The first step in this study is to develop a technology for digitizing such large amounts of textual data from physical books (semi-)automatically. Because the corpus was too large to digitize manually (i.e., by typing), a rapid, accurate, and low-cost approach was required. We therefore developed an OCR-based (semi-)automatic book-digitizing framework that integrates three processes: (1) book scanning, (2) OCR, and (3) automatic document style recognition. The inputs to the framework are physical books, and the output is digitized text with metadata (titles, authors, page numbers, and dates). Because we employ machine learning techniques for document style recognition, our digitizing framework can be applied to other styles of documents.

The next step is to use the digitized text to support the discovery of new knowledge. We developed an ontology extraction system using NLP technology, and a system that visualizes the documents and the relationships among them based on the extracted ontologies.

2. Automatic Digitizing System

We have developed an automatic digitizing and document analysis system. The flow of the entire process is as follows:

Scanning: Scan books to generate image files of each page.
Character Recognition: Recognize the text characters and text blocks by applying an OCR process.
Logical Layout Analysis: Estimate the logical layout, i.e., the logical types of text blocks, such as body, title, and author.
Reading Order Estimation: Estimate the order in which text blocks are to be read.
Text Extraction: Extract the text by collecting text blocks in reading order.

Figure 1 shows an overview of our digitizing flow. In the Scanning step, we apply a non-destructive book scanner to historical books and create images of each page.
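The five-step flow above can be sketched as a simple pipeline. The functions below are hypothetical stand-ins for the components described in this section (OCR, SVM layout analysis, page splitting), not the system's actual API; the block data structure is likewise invented for illustration.

```python
# Illustrative sketch of the five-step digitizing pipeline.
# Every helper here is a toy stand-in for a component described
# in the text, not the real system's interface.

def run_ocr(image):
    # Stand-in for step (2): the real system runs a customized
    # commercial OCR that emits XML with text, positions, and sizes.
    return image["blocks"]

def classify_blocks(blocks):
    # Stand-in for step (3), Logical Layout Analysis (SVM in the paper).
    return [(b["type"], b["text"]) for b in blocks]

def estimate_reading_order(labeled):
    # Stand-in for step (4), the recursive Page Splitting method;
    # here the blocks are simply assumed to arrive already ordered.
    return labeled

def extract_text(ordered):
    # Step (5): keep only Text Body blocks, concatenated in reading order.
    return "".join(text for typ, text in ordered if typ == "body")

def digitize(page_images):
    """Turn scanned page images into one text string per page."""
    return [extract_text(estimate_reading_order(
            classify_blocks(run_ocr(img)))) for img in page_images]
```

In the real system each stage is a trained model rather than a stub, but the data flow (images in, ordered labeled blocks, body text out) follows this shape.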
In the Character Recognition step, we apply a customized, commercially available OCR system to the scanned image files. We customized the OCR system to output XML files that contain not only the text itself and the text blocks, but also additional information used in later steps, such as character size, character position, and block position. In the next steps, Logical Layout Analysis and Reading Order Estimation, we estimate the logical types and reading order of the text blocks automatically using machine learning techniques. In the last step, Text Extraction, we extract the text from the OCR results using the logical types and reading order estimated in the previous steps. In the
rest of this section, we describe the details of the Logical Layout Analysis and Reading Order Estimation steps.

Figure 1. Overview of Digitizing Flow

2.1. Logical Layout Analysis

For the Logical Layout Analysis step, we propose a method that identifies the logical types of the blocks using a machine learning technique. To estimate the logical types of the blocks, we employed the Support Vector Machine (SVM) (Cortes and Vapnik 1995), a machine learning model for classification, with various features derived from the text and the additional information in the OCR results. The features we selected are as follows:

Position: x and y coordinates of the block
Blank Space Length: the length of the blank space in each of four directions (upper, lower, left, and right)
Block Size: the width and height of the block
Character Size: the average width and height of the characters in the block
Noun: the percentage of noun words in the text of the block
Personal Name: the percentage of personal-name words in the text of the block

Values for the last two features, Noun and Personal Name, are calculated from the output of the Japanese morphological analyzer MeCab (http://code.google.com/p/mecab/). We evaluated the proposed approach against an existing rule-based approach, which classifies blocks using hand-crafted rules based on position and character size, using the OCR data of the journal Shisō. In the experiments, both methods classified blocks into five logical types: Title, Author, Header, Page Number, and Text Body. The proposed approach was trained on a small set of OCR results whose correct logical types were annotated by hand. Table 1 shows the accuracy of logical type estimation by the two approaches. The results show that the proposed
approach classified blocks more accurately than the rule-based approach, especially in identifying Title and Author blocks, which are important for recognizing the unit of an article.

Table 1. Accuracy Percentage of Logical Layout Analysis

              Proposed approach       Rule-based approach
              Precision   Recall      Precision   Recall
Title            97.6       94.5         95.7       87.3
Author           99.6       97.4         98.7       94.5
Header           97.4       98.4         98.2       96.7
Page Number      99.4       99.6         99.7       99.2
Text Body        99.3       99.2         97.9       99.2

2.2. Reading Order Estimation

A page contains a number of text blocks in the OCR results, and we need to know their order in order to extract the whole text body correctly. For reading order estimation, we introduce a Page Splitting method. This method splits a page vertically or horizontally into two areas, and splits those areas recursively until each area contains a single block, as shown in figure 2. The reading order is then organized using predefined orientations, for example, right to left and top to bottom for Shisō documents. Different split rules lead to different reading orders. In our method, a split rule is determined by a score function (the inner product of a weight vector and the feature vector of a split candidate). To learn the split rule, we used Differential Evolution (DE), a machine learning optimization method that searches for the optimal weight vector determining the split rule. The features used here are the following:

Type of split (vertical or horizontal)
The number of blocks that lie under a split line
Position of the split line
Width of the split
Types of blocks (e.g., whether the split area contains Title or Author blocks)
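The recursive splitting with a learned score function can be sketched as follows. This is a toy version: the weight vector is hand-set rather than optimized by DE, and the full feature set above is reduced to two features (split orientation and the number of blocks a split line crosses); blocks are axis-aligned rectangles given as dicts.

```python
# Toy sketch of the recursive Page Splitting method. In the real
# system WEIGHTS would be found by Differential Evolution and the
# feature vector would include all the features listed in the text.
WEIGHTS = [1.0, -10.0]  # hypothetical: mild preference for vertical
                        # splits, strong penalty for crossing a block

def _end(b, axis):
    # Far edge of a block along an axis ("x" uses width, "y" height).
    return b[axis] + (b["w"] if axis == "x" else b["h"])

def score(blocks, axis, pos):
    # Inner product of the weight vector and a toy feature vector.
    crossed = sum(1 for b in blocks if b[axis] < pos < _end(b, axis))
    feat = [1.0 if axis == "x" else 0.0, float(crossed)]
    return sum(w * f for w, f in zip(WEIGHTS, feat))

def reading_order(blocks):
    """Recursively split the page; right-to-left, top-to-bottom."""
    if len(blocks) <= 1:
        return list(blocks)
    # Candidate split lines at block edges that actually separate blocks.
    cands = [(score(blocks, a, _end(b, a)), a, _end(b, a))
             for a in ("x", "y") for b in blocks
             if 0 < sum(c[a] >= _end(b, a) for c in blocks) < len(blocks)]
    if not cands:
        return list(blocks)
    _, axis, pos = max(cands)
    low = [b for b in blocks if b[axis] < pos]
    high = [b for b in blocks if b[axis] >= pos]
    # Japanese layout: right area first on a vertical split,
    # top area first on a horizontal split.
    first, second = (high, low) if axis == "x" else (low, high)
    return reading_order(first) + reading_order(second)
```

For a page with a full-width title above two columns, this yields title first, then the right column, then the left, matching the Shisō orientation rules.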
Figure 2. Page Splitting Method

The proposed approach was evaluated on the OCR results of Shisō documents. It was trained on a small set of OCR results with human-annotated correct reading orders. For the Shisō documents, the approach estimated the reading order of blocks with a Spearman distance of about 0.04. The experimental results for the above two steps show sufficient accuracy for extracting knowledge in the later step. Moreover, since we employ machine learning techniques to estimate the logical types and reading order of blocks, our approach can be applied to documents with other types of layout, given correct data for training our estimation systems.

3. Visualization System

We constructed a visualization system for the set of digitized documents. The main objective of the system is to facilitate knowledge acquisition from documents and to generate ideas through terminology-based real-time calculation of document similarities and their visualization in an interactive UI. Figure 3 outlines the visualization of knowledge structures for papers relevant to the keyword shisō (思想, thought) in the 1930s. The system extracts terminological information from the text in advance using NLP techniques (Mima and Ananiadou 2000). The system constructs a graph to structure knowledge, in which the nodes (dots) represent papers relevant to the keyword, and the links between the nodes reflect semantic similarities that are
calculated based on the extracted terminological information. Additionally, the locations of all the nodes are calculated and optimized when the graph is drawn: the distance between nodes depends on how close they are in meaning. Cluster recognition is also carried out by detecting groups of papers in which every pair of included papers is strongly linked (i.e., their similarity exceeds a threshold). As seen in figure 3, several clusters are automatically recognized, and category names such as Marxism, Socialism, and right-wing thought are automatically assigned to the clusters to facilitate an overview of the thought discussed in these papers.

Figure 3. Visualization of knowledge structure
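The clustering criterion just described (every pair in a group exceeding a similarity threshold) can be illustrated with a toy sketch. The term-weight vectors, the cosine similarity measure, and the greedy grouping here are illustrative assumptions; the actual system derives its similarities from C/NC-value terminology extraction and may group papers differently.

```python
# Toy sketch of all-pairs threshold clustering on a document
# similarity graph. Term vectors, cosine similarity, and the greedy
# grouping strategy are assumptions for illustration only.
import math

def cosine(u, v):
    # Cosine similarity of two sparse term-weight vectors (dicts).
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def clusters(papers, threshold=0.5):
    # Greedy grouping: a paper joins a cluster only if it is strongly
    # linked to EVERY current member (the all-pairs criterion).
    groups = []
    for name, vec in papers.items():
        for g in groups:
            if all(cosine(vec, papers[m]) > threshold for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Two papers sharing heavily weighted terms (say, both dominated by "marxism") land in one cluster, while a paper with disjoint terminology starts a cluster of its own; assigning a category name to each cluster could then reuse its highest-weighted shared terms.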
4. Conclusion

We have described a framework for revealing aspects of the modern history of Japanese philosophy, which consists of digitizing documents to extract text, extracting terminological information from the text with NLP techniques, and visualizing the target documents based on that terminological information. Our current target is the journal Shisō (Thought), one of the major journals of philosophy in Japan. We processed the Shisō documents with our framework and constructed a search and visualization system that helps researchers view the whole document set and discover new knowledge. By digitizing and analyzing huge amounts of historical textual data with the system, we expect to discover new knowledge about the historical flow of Japanese thought during one of its most important eras, from before World War II to the present day. Although we constructed the framework described in this paper specifically for Shisō, the framework itself can be applied to any type of document that can be recognized by an OCR system. As a next step, we will apply the framework to documents in other areas in order to enable knowledge discovery in those areas.

References

Cortes, Corinna, and Vladimir Vapnik. 1995. "Support-Vector Networks." Machine Learning 20(3): 273-97. doi:10.1007/bf00994018.

Mima, Hideki, and Sophia Ananiadou. 2000. "An Application and Evaluation of the C/NC-value Approach for the Automatic Term Recognition of Multi-word Units in Japanese." International Journal on Terminology 6(2): 175-94. doi:10.1075/term.6.2.04mim.