Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization

Katsuya Masuda *, Makoto Tanji **, and Hideki Mima ***

Abstract

This study proposes a framework for accessing the modern history of Japanese philosophy using natural language processing (NLP) and visualization. Discovering new knowledge from massive amounts of information requires the support of information technologies. To support knowledge discovery from a vast number of books, we developed an OCR-based automatic book-digitizing framework and a system that visualizes documents together with the relationships among them, which are calculated using NLP techniques. We applied the framework to the Japanese journal Shisō (Thought), published by Iwanami Shoten. We show an example of a knowledge structure extracted from Shisō using our visualization system.

1. Introduction

The purpose of this study is to provide access to the modern history of Japanese philosophy using natural language processing (NLP) and visualization. Knowledge [1] has been increasing at an exponential rate with advances in science and technology in recent years, resulting in massive amounts of information that are extremely difficult to process manually. Thus, it is important to use information technologies (IT) to support new discoveries of knowledge from vast resources, such as literature, that are now available digitally.

To implement the study, we have developed:
1) An automatic digitization framework for historical documents,
2) A computational model for extracting ontology from the digitized corpus, and
3) An interactive user interface (UI) to support discoveries of new knowledge.
We chose the Japanese journal Shisō (Thought), published by Iwanami Shoten, as our target corpus. It is one of the most representative journals of philosophy in Japan, with a history of more than ninety years, from 1921 to the present. It comprises about 10,000 papers and around 200,000 pages of textual data.

The first step in this study is to develop a technology to digitize such large amounts of textual data from physical books (semi-)automatically. Because the corpus was too large to digitize manually (i.e., by typing), a rapid, accurate, and low-cost approach was required. Thus, we developed an OCR-based (semi-)automatic book-digitizing framework that integrates three processes: (1) book scanning, (2) OCR, and (3) automatic document style recognition. The inputs to the framework are physical books, and the output is digitized text with metadata (titles, authors, page numbers, and dates). Because we employed machine learning techniques for document style recognition, our digitizing framework can be applied to other styles of documents.

The next step is to use the digitized text to support discoveries of new knowledge. We developed an ontology extraction system using NLP technology and a system that visualizes the documents together with the relationships among them, based on the extracted ontologies.

* Center for Research and Development of Higher Education, University of Tokyo
** Center for Knowledge Structuring, University of Tokyo
*** Graduate School of Engineering, University of Tokyo

[1] Although the definition of knowledge is domain-specific, we define knowledge here as the particles represented by ontology, which is the (hierarchical) collection and classification of (technical) terms used to recognize their semantic relevance.

Journal of the Japanese Association for Digital Humanities, vol. 1, 37

2. Automatic Digitizing System

We have developed an automatic digitizing and document analysis system. The flow of the entire process is as follows:

- Scanning: Scan books to generate an image file of each page.
- Character Recognition: Recognize the text characters and text blocks by applying an OCR process.
- Logical Layout Analysis: Estimate the logical layout, that is, the logical types of text blocks, such as body, title, and author.
- Reading Order Estimation: Estimate the order in which the text blocks should be read.
- Text Extraction: Extract the text by collecting text blocks in the reading order.

Figure 1 shows an overview of our digitizing flow.
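The final Text Extraction step in the flow above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the `Block` class, its field names, the logical type labels, and the choice of which types to keep are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One OCR text block; every field name here is hypothetical."""
    text: str
    logical_type: str   # e.g. "title", "author", "header", "page_number", "body"
    reading_order: int  # position assigned by the reading order estimation step


def extract_text(blocks, keep_types=("title", "author", "body")):
    """Concatenate the kept blocks in their estimated reading order."""
    kept = [b for b in blocks if b.logical_type in keep_types]
    kept.sort(key=lambda b: b.reading_order)
    return "\n".join(b.text for b in kept)


# Toy page: the blocks arrive in arbitrary OCR order.
page = [
    Block("second paragraph", "body", 3),
    Block("37", "page_number", 4),
    Block("A Title", "title", 0),
    Block("An Author", "author", 1),
    Block("first paragraph", "body", 2),
]
print(extract_text(page))
```

Dropping headers and page numbers at this point is a design choice made for the sketch; they could equally be routed into the metadata output.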
In the Scanning step, we apply a non-destructive book scanner to historical books and create an image for each page. In the Character Recognition step, we apply a customized, commercially available OCR system to the scanned image files. We customized the OCR system to output XML files that contain not only the text and the text blocks but also additional information used in the later steps, such as character size, character position, and block position. In the next steps, Logical Layout Analysis and Reading Order Estimation, we automatically estimate the logical types and the reading order of text blocks using machine learning techniques. In the last step, Text Extraction, we extract the text from the OCR results using the logical types and reading order estimated in the previous steps. In the rest of this section, we describe the details of the Logical Layout Analysis and Reading Order Estimation steps.

Figure 1. Overview of Digitizing Flow

2.1. Logical Layout Analysis

For the Logical Layout Analysis step, we propose a method that identifies the logical types of the blocks using a machine learning technique. To estimate the logical types of the blocks, we employed a Support Vector Machine (SVM) (Cortes and Vapnik 1995), a machine learning model for classification, with various features derived from the text and the additional information in the OCR results. The features we selected are as follows:

- Position: x and y coordinates of the block
- Blank Space Length: the length of the blank space in four directions (upper, lower, left, and right)
- Block Size: the width and height of the block
- Character Size: average width and height of characters in the block
- Noun: percentage of nouns in the text of the block
- Personal Name: percentage of personal names in the text of the block

Values for the last two features, Noun and Personal Name, are calculated from the output of the Japanese morphological analyzer MeCab. [2]

[2] MeCab: Japanese Morphological Analyzer.

Using the OCR data of the journal Shisō, we evaluated the proposed approach against an existing rule-based approach, which classifies the blocks with hand-written rules based on position and character size. In the experiments, both methods classified blocks into five logical types: Title, Author, Header, Page Number, and Text Body. The proposed approach was trained on a small set of OCR results annotated by hand with the correct logical types. Table 1 shows the accuracy of logical type estimation by the two approaches. The results show that the proposed approach classified blocks more accurately than the rule-based approach, especially in the identification of Title and Author blocks, which are important for recognizing the unit of an article.

Table 1. Accuracy Percentage of Logical Layout Analysis

              Proposed approach       Rule-based approach
              Precision   Recall      Precision   Recall
Title
Author
Header
Page Number
Text Body

2.2. Reading Order Estimation

A page contains a number of text blocks produced by OCR. We need to know their order to extract the whole text body correctly. For reading order estimation, we introduced a Page Splitting method. This method splits a page vertically or horizontally into two areas and then splits those areas recursively until each area contains a single block, as shown in figure 2. A reading order is then assembled using predefined orientations, for example, right to left and top to bottom for Shisō documents. Different split rules lead to different reading orders. In our method, a split rule is determined by a score function, the product of a weight vector and the feature vector of a split candidate. To learn the split rule, we used an optimization method, Differential Evolution (DE). DE searches for the value of the weight vector that determines the optimal split rule. The features used here are the following:

- Type of split (vertical or horizontal)
- The number of blocks that exist under a split line
- Position of a split line
- Width of a split
- Types of blocks (e.g., whether the split area contains Title or Author blocks)
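The recursive Page Splitting idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' system: the `Block` fields, the three-element feature vector (a simplified stand-in for the feature list above), and the hand-set weights are all hypothetical, whereas in the paper the weights are found by Differential Evolution. Coordinates assume y grows downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Bounding box of one OCR block (hypothetical fields, y grows downward)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def candidates(blocks):
    """Yield (axis, position, gap_width) for every line that cuts no block."""
    for axis, lo, hi in (("v", "x0", "x1"), ("h", "y0", "y1")):
        spans = sorted((getattr(b, lo), getattr(b, hi)) for b in blocks)
        reach = spans[0][1]
        for start, end in spans[1:]:
            if start > reach:  # empty gap between the blocks seen so far and the rest
                yield axis, (reach + start) / 2, start - reach
            reach = max(reach, end)


def features(axis, gap):
    """Simplified feature vector: [is vertical split, is horizontal split, gap width]."""
    return [1.0 if axis == "v" else 0.0, 1.0 if axis == "h" else 0.0, gap]


def score(weights, feats):
    """Linear score: inner product of the weight vector and the feature vector."""
    return sum(w * f for w, f in zip(weights, feats))


def reading_order(blocks, weights):
    """Split the area recursively at the best-scoring line and concatenate the halves."""
    if len(blocks) <= 1:
        return list(blocks)
    cands = list(candidates(blocks))
    if not cands:  # no clean cut: fall back to top-to-bottom, right-to-left
        return sorted(blocks, key=lambda b: (b.y0, -b.x1))
    axis, pos, _ = max(cands, key=lambda c: score(weights, features(c[0], c[2])))
    if axis == "v":  # right part is read before left part
        first = [b for b in blocks if b.x0 > pos]
        second = [b for b in blocks if b.x1 < pos]
    else:            # upper part is read before lower part
        first = [b for b in blocks if b.y1 < pos]
        second = [b for b in blocks if b.y0 > pos]
    return reading_order(first, weights) + reading_order(second, weights)


# A 2x2 grid of blocks: both a vertical and a horizontal cut are possible.
grid = [
    Block("a", 60, 0, 100, 40), Block("b", 0, 0, 40, 40),
    Block("c", 60, 60, 100, 100), Block("d", 0, 60, 40, 100),
]
prefer_horizontal = [0.0, 1.0, 0.01]
prefer_vertical = [1.0, 0.0, 0.01]
print([b.name for b in reading_order(grid, prefer_horizontal)])  # ['a', 'b', 'c', 'd']
print([b.name for b in reading_order(grid, prefer_vertical)])    # ['a', 'c', 'b', 'd']
```

The two printed orders show how the weight vector alone changes the reading order on the same page, which is the quantity a search procedure such as DE would tune against hand-annotated orders.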

Figure 2. Page Splitting Method

The proposed approach was evaluated on the OCR results of Shisō documents. It was trained on a small set of OCR results with human-generated correct reading orders. For the Shisō documents, the approach estimated the reading order of blocks with a Spearman Distance of about [...]

The experimental results for the two steps above show sufficient accuracy for extracting knowledge in the later step. Moreover, since we employed machine learning techniques to estimate the logical types and reading order of blocks, our approach can be applied to documents with other types of layout, given correct data for training our estimation systems.

3. Visualization System

We constructed a visualization system for the set of digitized documents. The main objective of the system is to facilitate knowledge acquisition from documents and to generate ideas through terminology-based real-time calculation of document similarities and their visualization in an interactive UI. Figure 3 outlines the visualization of knowledge structures for papers relevant to the keyword shisō (思想, thought) in the 1930s. The system extracts terminological information from the text in advance using NLP techniques (Mima and Ananiadou 2000). The system constructs a graph to structure knowledge, in which the nodes (dots) represent papers relevant to the keyword and the links between the nodes reflect semantic similarities calculated from the extracted terminological information. Additionally, the locations of all the nodes are calculated and optimized when the graph is drawn, so that the distances between nodes depend on how close the papers are in meaning. Cluster recognition is also carried out by detecting groups of papers in which every pair of papers is strongly linked (i.e., their similarity exceeds a threshold). As seen in figure 3, several clusters are automatically recognized, and category names such as Marxism, Socialism, and right-wing thought are automatically assigned to the clusters to facilitate an overview of the thoughts discussed in these papers.

Figure 3. Visualization of knowledge structure
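The graph construction and cluster recognition described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: Jaccard overlap of term sets stands in for the paper's terminology-based similarity, the paper identifiers, term sets, and threshold are invented toy values, and the automatic naming of clusters is not covered. Groups in which every pair of papers is linked are maximal cliques of the graph, enumerated here with the basic Bron-Kerbosch procedure.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two term sets (an assumed stand-in for the paper's similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(terms_by_paper, threshold):
    """Link two papers when their term similarity exceeds the threshold."""
    graph = {p: set() for p in terms_by_paper}
    for p, q in combinations(terms_by_paper, 2):
        if jaccard(terms_by_paper[p], terms_by_paper[q]) > threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal groups in which every pair of papers is linked."""
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(clique)
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return found


# Toy term sets; real input would come from automatic term recognition.
papers = {
    "A": {"marxism", "class", "capital"},
    "B": {"marxism", "class", "labor"},
    "C": {"marxism", "capital", "labor"},
    "D": {"nation", "state", "tradition"},
    "E": {"nation", "state", "class"},
}
graph = build_graph(papers, threshold=0.3)
clusters = sorted(sorted(c) for c in maximal_cliques(graph) if len(c) > 1)
print(clusters)  # [['A', 'B', 'C'], ['D', 'E']]
```

Requiring every pair in a cluster to be linked is strict: paper E shares the term "class" with A and B, but that single overlap stays below the threshold, so E joins only D.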

4. Conclusion

We have described a framework for revealing aspects of the modern history of Japanese philosophy. It consists of digitizing documents to extract text, extracting terminological information from the text with NLP techniques, and visualizing the target documents based on that terminological information. Our current target is the journal Shisō (Thought), one of the major journals of philosophy in Japan. We processed the Shisō documents with our framework and constructed a search and visualization system that helps researchers view the whole document set and discover new knowledge. By digitizing and analyzing huge amounts of historical textual data with the system, we expect to discover new knowledge about the historical flow of Japanese thinking during one of its most important eras, from before World War II to the present day.

Although we constructed the framework described in this paper specifically for Shisō, the framework itself can be applied to any type of document that can be recognized by an OCR system. As the next step, we will apply the framework to documents in other areas to enable the discovery of knowledge in those areas.

References

Cortes, Corinna, and Vladimir Vapnik. 1995. Support-vector Networks. Machine Learning 20(3): doi: /bf

Mima, Hideki, and Sophia Ananiadou. 2000. An Application and Evaluation of the C/NC-value Approach for the Automatic Term Recognition of Multi-word Units in Japanese. International Journal on Terminology 6(2): doi: /term mim.


More information

Patent Image Retrieval

Patent Image Retrieval Patent Image Retrieval Stefanos Vrochidis IRF Symposium 2008 Vienna, November 6, 2008 Aristotle University of Thessaloniki Overview 1. Introduction 2. Related Work in Patent Image Retrieval 3. Patent Image

More information

Table Identification and Information extraction in Spreadsheets

Table Identification and Information extraction in Spreadsheets Table Identification and Information extraction in Spreadsheets Elvis Koci 1,2, Maik Thiele 1, Oscar Romero 2, and Wolfgang Lehner 1 1 Technische Universität Dresden, Germany 2 Universitat Politècnica

More information

Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao

Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao Large Scale Chinese News Categorization --based on Improved Feature Selection Method Peng Wang Joint work with H. Zhang, B. Xu, H.W. Hao Computational-Brain Research Center Institute of Automation, Chinese

More information

Support Vector Machines for Mathematical Symbol Recognition

Support Vector Machines for Mathematical Symbol Recognition Support Vector Machines for Mathematical Symbol Recognition Christopher Malon 1, Seiichi Uchida 2, and Masakazu Suzuki 1 1 Engineering Division, Faculty of Mathematics, Kyushu University 6 10 1 Hakozaki,

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Overview of Web Mining Techniques and its Application towards Web

Overview of Web Mining Techniques and its Application towards Web Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous

More information

MATRIX BASED INDEXING TECHNIQUE FOR VIDEO DATA

MATRIX BASED INDEXING TECHNIQUE FOR VIDEO DATA Journal of Computer Science, 9 (5): 534-542, 2013 ISSN 1549-3636 2013 doi:10.3844/jcssp.2013.534.542 Published Online 9 (5) 2013 (http://www.thescipub.com/jcs.toc) MATRIX BASED INDEXING TECHNIQUE FOR VIDEO

More information

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 02, February -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Survey

More information

Interactive Machine Learning (IML) Markup of OCR Generated Text by Exploiting Domain Knowledge: A Biodiversity Case Study

Interactive Machine Learning (IML) Markup of OCR Generated Text by Exploiting Domain Knowledge: A Biodiversity Case Study Interactive Machine Learning (IML) Markup of OCR Generated by Exploiting Domain Knowledge: A Biodiversity Case Study Several digitization projects such as Google books are involved in scanning millions

More information

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, 2014 ISSN 2278 5485 EISSN 2278 5477 discovery Science Comparative Study of Classification Algorithms Using Data Mining Akhila

More information

Optical Character Recognition Based Speech Synthesis System Using LabVIEW

Optical Character Recognition Based Speech Synthesis System Using LabVIEW Optical Character Recognition Based Speech Synthesis System Using LabVIEW S. K. Singla* 1 and R.K.Yadav 2 1 Electrical and Instrumentation Engineering Department Thapar University, Patiala,Punjab *sunilksingla2001@gmail.com

More information

A Robot Recognizing Everyday Objects

A Robot Recognizing Everyday Objects A Robot Recognizing Everyday Objects -- Towards Robot as Autonomous Knowledge Media -- Hideaki Takeda Atsushi Ueno Motoki Saji, Tsuyoshi Nakano Kei Miyamato The National Institute of Informatics Nara Institute

More information

Indian Multi-Script Full Pin-code String Recognition for Postal Automation

Indian Multi-Script Full Pin-code String Recognition for Postal Automation 2009 10th International Conference on Document Analysis and Recognition Indian Multi-Script Full Pin-code String Recognition for Postal Automation U. Pal 1, R. K. Roy 1, K. Roy 2 and F. Kimura 3 1 Computer

More information

Text-Tracking Wearable Camera System for the Blind

Text-Tracking Wearable Camera System for the Blind 2009 10th International Conference on Document Analysis and Recognition Text-Tracking Wearable Camera System for the Blind Hideaki Goto Cyberscience Center Tohoku University, Japan hgot @ isc.tohoku.ac.jp

More information

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Richa Jain 1, Namrata Sharma 2 1M.Tech Scholar, Department of CSE, Sushila Devi Bansal College of Engineering, Indore (M.P.),

More information

Enterprise Data Catalog for Microsoft Azure Tutorial

Enterprise Data Catalog for Microsoft Azure Tutorial Enterprise Data Catalog for Microsoft Azure Tutorial VERSION 10.2 JANUARY 2018 Page 1 of 45 Contents Tutorial Objectives... 4 Enterprise Data Catalog Overview... 5 Overview... 5 Objectives... 5 Enterprise

More information

XETA: extensible metadata System

XETA: extensible metadata System XETA: extensible metadata System Abstract: This paper presents an extensible metadata system (XETA System) which makes it possible for the user to organize and extend the structure of metadata. We discuss

More information

Automatic Reader. Multi Lingual OCR System.

Automatic Reader. Multi Lingual OCR System. Automatic Reader Multi Lingual OCR System What is the Automatic Reader? Sakhr s Automatic Reader transforms scanned images into a grid of millions of dots, optically recognizes the characters found in

More information

DATA EMBEDDING IN TEXT FOR A COPIER SYSTEM

DATA EMBEDDING IN TEXT FOR A COPIER SYSTEM DATA EMBEDDING IN TEXT FOR A COPIER SYSTEM Anoop K. Bhattacharjya and Hakan Ancin Epson Palo Alto Laboratory 3145 Porter Drive, Suite 104 Palo Alto, CA 94304 e-mail: {anoop, ancin}@erd.epson.com Abstract

More information

Visual Concept Detection and Linked Open Data at the TIB AV- Portal. Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017

Visual Concept Detection and Linked Open Data at the TIB AV- Portal. Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017 Visual Concept Detection and Linked Open Data at the TIB AV- Portal Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017 Agenda 1. TIB and TIB AV-Portal 2. Automated Video Analysis 3. Visual

More information

Narrowing It Down: Information Retrieval, Supporting Effective Visual Browsing, Semantic Networks

Narrowing It Down: Information Retrieval, Supporting Effective Visual Browsing, Semantic Networks Clarence Chan: clarence@cs.ubc.ca #63765051 CPSC 533 Proposal Memoplex++: An augmentation for Memoplex Browser Introduction Perusal of text documents and articles is a central process of research in many

More information

An Improving for Ranking Ontologies Based on the Structure and Semantics

An Improving for Ranking Ontologies Based on the Structure and Semantics An Improving for Ranking Ontologies Based on the Structure and Semantics S.Anusuya, K.Muthukumaran K.S.R College of Engineering Abstract Ontology specifies the concepts of a domain and their semantic relationships.

More information

Semantic Clickstream Mining

Semantic Clickstream Mining Semantic Clickstream Mining Mehrdad Jalali 1, and Norwati Mustapha 2 1 Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran 2 Department of Computer Science, Universiti

More information

6. Applications - Text recognition in videos - Semantic video analysis

6. Applications - Text recognition in videos - Semantic video analysis 6. Applications - Text recognition in videos - Semantic video analysis Stephan Kopf 1 Motivation Goal: Segmentation and classification of characters Only few significant features are visible in these simple

More information

Leave-One-Out Support Vector Machines

Leave-One-Out Support Vector Machines Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm

More information

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ??? @ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON

More information

A Survey of Problems of Overlapped Handwritten Characters in Recognition process for Gurmukhi Script

A Survey of Problems of Overlapped Handwritten Characters in Recognition process for Gurmukhi Script A Survey of Problems of Overlapped Handwritten Characters in Recognition process for Gurmukhi Script Arwinder Kaur 1, Ashok Kumar Bathla 2 1 M. Tech. Student, CE Dept., 2 Assistant Professor, CE Dept.,

More information

Federated Search: Results Clustering. JR Jenkins, MLIS Group Product Manager Resource Discovery

Federated Search: Results Clustering. JR Jenkins, MLIS Group Product Manager Resource Discovery Federated Search: Results Clustering JR Jenkins, MLIS Group Product Manager Resource Discovery Why Federated Search? The Web has changed how we deliver and consume information The paradigm shift from physical

More information

MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS. By Angela Carabelli SSLMIT, Trieste

MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS. By Angela Carabelli SSLMIT, Trieste MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS By SSLMIT, Trieste The availability of teaching materials for training interpreters and translators has always been an issue of unquestionable

More information

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task Xiaolong Wang, Xiangying Jiang, Abhishek Kolagunda, Hagit Shatkay and Chandra Kambhamettu Department of Computer and Information

More information

Context Based Web Indexing For Semantic Web

Context Based Web Indexing For Semantic Web IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 12, Issue 4 (Jul. - Aug. 2013), PP 89-93 Anchal Jain 1 Nidhi Tyagi 2 Lecturer(JPIEAS) Asst. Professor(SHOBHIT

More information

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus Outline Evaluation of browsing behaviour and automated subject classification: examples from KnowLib Subject browsing Automated subject classification Koraljka Golub, Knowledge Discovery and Digital Library

More information

Automatic Metadata Extraction for Archival Description and Access

Automatic Metadata Extraction for Archival Description and Access Automatic Metadata Extraction for Archival Description and Access WILLIAM UNDERWOOD Georgia Tech Research Institute Abstract: The objective of the research reported is this paper is to develop techniques

More information

Information Retrieval System Based on Context-aware in Internet of Things. Ma Junhong 1, a *

Information Retrieval System Based on Context-aware in Internet of Things. Ma Junhong 1, a * Information Retrieval System Based on Context-aware in Internet of Things Ma Junhong 1, a * 1 Xi an International University, Shaanxi, China, 710000 a sufeiya913@qq.com Keywords: Context-aware computing,

More information

OCR For Handwritten Marathi Script

OCR For Handwritten Marathi Script International Journal of Scientific & Engineering Research Volume 3, Issue 8, August-2012 1 OCR For Handwritten Marathi Script Mrs.Vinaya. S. Tapkir 1, Mrs.Sushma.D.Shelke 2 1 Maharashtra Academy Of Engineering,

More information

IMPLEMENTING ON OPTICAL CHARACTER RECOGNITION USING MEDICAL TABLET FOR BLIND PEOPLE

IMPLEMENTING ON OPTICAL CHARACTER RECOGNITION USING MEDICAL TABLET FOR BLIND PEOPLE Impact Factor (SJIF): 5.301 International Journal of Advance Research in Engineering, Science & Technology e-issn: 2393-9877, p-issn: 2394-2444 Volume 5, Issue 3, March-2018 IMPLEMENTING ON OPTICAL CHARACTER

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information

Early-Modern Printed Character Recognition using Ensemble Learning

Early-Modern Printed Character Recognition using Ensemble Learning 288 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'17 Early-Modern Printed Character Recognition using Ensemble Learning Kaori Fujimoto, Yu Ishikawa, Masami Takata, Kazuki Joe Department of Advanced

More information

Ahmed Samir Bibliotheca Alexandrina El Shatby Alexandria, Egypt

Ahmed Samir Bibliotheca Alexandrina El Shatby Alexandria, Egypt DIGITAL PRESERVATION: HANDLING LARGE COLLECTIONS CASE STUDY: DIGITIZING EGYPTIAN PRESS ARCHIVE AT CENTRE FOR ECONOMIC, JUDICIAL, AND SOCIAL STUDY AND DOCUMENTATION (CEDEJ) Ahmed Samir Bibliotheca Alexandrina

More information

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department

More information

3 Background technologies 3.1 OntoGen The two main characteristics of the OntoGen system [1,2,6] are the following.

3 Background technologies 3.1 OntoGen The two main characteristics of the OntoGen system [1,2,6] are the following. ADVANCING TOPIC ONTOLOGY LEARNING THROUGH TERM EXTRACTION Blaž Fortuna (1), Nada Lavrač (1, 2), Paola Velardi (3) (1) Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia (2) University of Nova

More information

Mining User - Aware Rare Sequential Topic Pattern in Document Streams

Mining User - Aware Rare Sequential Topic Pattern in Document Streams Mining User - Aware Rare Sequential Topic Pattern in Document Streams A.Mary Assistant Professor, Department of Computer Science And Engineering Alpha College Of Engineering, Thirumazhisai, Tamil Nadu,

More information