Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization
Katsuya Masuda *, Makoto Tanji **, and Hideki Mima ***

* Center for Research and Development of Higher Education, University of Tokyo
** Center for Knowledge Structuring, University of Tokyo
*** Graduate School of Engineering, University of Tokyo

Journal of the Japanese Association for Digital Humanities, vol. 1, 37

Abstract

This study proposes a framework for accessing the modern history of Japanese philosophy using natural language processing (NLP) and visualization. Discovering new knowledge from massive amounts of information requires the support of information technologies. To support knowledge discovery from a vast number of books, we developed an OCR-based automatic book-digitizing framework and a system that visualizes documents together with the relationships among them, which are calculated using NLP techniques. We applied the framework to the Japanese journal Shisō (Thought), published by Iwanami Shoten. We show an example of a knowledge structure extracted from Shisō using our visualization system.

1. Introduction

The purpose of this study is to provide access to the modern history of Japanese philosophy using natural language processing (NLP) and visualization. Knowledge [1] has been increasing at an exponential rate with advances in science and technology in recent years, resulting in massive amounts of information that are extremely difficult to process manually. Thus, it is important to utilize information technologies (IT) to support new discoveries of knowledge from vast resources, such as literature, that are now available digitally. To implement the study, we have developed:

1) An automatic digitization framework for historical documents,
2) A computational model for extracting ontology from the digitized corpus, and
3) An interactive user interface (UI) to support discoveries of new knowledge.

We chose the Japanese journal Shisō (Thought), published by Iwanami Shoten, as our target corpus. It is one of the most representative journals of philosophy in Japan, with a history of over ninety years, from 1921 to the present. It comprises about 10,000 papers and around 200,000 pages of textual data.

The first step in this study is to develop a technology to digitize such large amounts of textual data from physical books (semi-)automatically. Because the corpus was too large to digitize manually (i.e., by typing), a rapid, accurate, and low-cost approach was required. Thus, we developed an OCR-based (semi-)automatic book-digitizing framework, in which we integrated three processes: (1) book scanning, (2) OCR, and (3) automatic document style recognition. The inputs for the framework are physical books, and the output is digitized text with metadata (titles, authors, page numbers, and dates). Because we employed machine learning techniques for document style recognition, our digitizing framework can be applied to other styles of documents.

The next step is to use the digitized text to support discoveries of new knowledge. We developed an ontology extraction system using an NLP technology, and a system that visualizes the documents and the relationships among them based on the extracted ontologies.

[1] Although the definition of knowledge is domain-specific, we define knowledge here as the particles represented by ontology, which is the (hierarchical) collection and classification of (technical) terms used to recognize their semantic relevance.

2. Automatic Digitizing System

We have developed an automatic digitizing and document analysis system. The flow of the entire process is as follows:

- Scanning: Scan books to generate image files of each page.
- Character Recognition: Recognize the text characters and text blocks by applying an OCR process.
- Logical Layout Analysis: Estimate the logical layout, that is, the logical types of text blocks, such as body, title, and author.
- Reading Order Estimation: Estimate the order in which text blocks are read.
- Text Extraction: Extract texts by collecting text blocks in the reading order.

Figure 1 shows an overview of our digitizing flow.
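The final Text Extraction step of the flow above can be illustrated with a minimal Python sketch. It assumes that each OCR block already carries a logical type and a reading-order index from the two preceding steps. The `Block` class, the `extract_text` function, and the sample page are hypothetical names and data introduced only for illustration; they are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """One OCR text block (hypothetical structure for this sketch)."""
    text: str
    logical_type: str  # e.g. "Title", "Author", "Header", "Page Number", "Text Body"
    order: int         # index in the estimated reading order


def extract_text(blocks: List[Block]) -> Dict[str, str]:
    """Collect blocks in reading order and group them by logical type."""
    parts: Dict[str, List[str]] = {"Title": [], "Author": [], "Text Body": []}
    for block in sorted(blocks, key=lambda b: b.order):
        # Headers and page numbers are skipped: they are not part of the article text.
        if block.logical_type in parts:
            parts[block.logical_type].append(block.text)
    return {
        "title": " ".join(parts["Title"]),
        "author": " ".join(parts["Author"]),
        "body": "".join(parts["Text Body"]),
    }


# Illustrative page: blocks arrive in arbitrary order from the OCR output.
page = [
    Block("Second paragraph.", "Text Body", 3),
    Block("37", "Page Number", 4),
    Block("On Thought", "Title", 0),
    Block("A. Author", "Author", 1),
    Block("First paragraph. ", "Text Body", 2),
]
article = extract_text(page)
```

Keeping the logical type and the reading order as separate fields mirrors the separation of the two estimation steps, so either estimator could be replaced without touching the extraction code.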
In the Scanning step, we apply a non-destructive book scanner to historical books and create an image for each page. In the Character Recognition step, we apply a customized, commercially available OCR system to the scanned image files. We customized the OCR system to output XML files that contain not only the text itself and the text blocks, but also additional information used in the later steps, such as character size, character position, and block position. In the next steps, Logical Layout Analysis and Reading Order Estimation, we estimate the logical types and reading order of text blocks automatically by using machine learning techniques. In the last step, Text Extraction, we extract the text from the OCR results by using the logical types and reading order estimated in the previous steps. In the rest of this section, we describe the details of the Logical Layout Analysis and Reading Order Estimation steps.

Figure 1. Overview of Digitizing Flow

2.1. Logical Layout Analysis

For the Logical Layout Analysis step, we propose a method to identify the logical types of the blocks using a machine learning technique. To estimate the logical types of the blocks, we employed a Support Vector Machine (SVM) (Cortes and Vapnik 1995), a machine learning model for classification, with various features derived from the text and the additional information in the OCR results. The features we selected are as follows:

- Position: x and y coordinates of the block
- Blank Space Length: the length of the blank space in four directions (upper, lower, left, and right)
- Block Size: the width and height of the block
- Character Size: average width and height of characters in the block
- Noun: percentage of noun words in the text of the block
- Personal Name: percentage of personal name words in the text of the block

Values for the last two features, Noun and Personal Name, are calculated from the output of the Japanese morphological analyzer MeCab [2].

We compared the proposed approach with an existing rule-based approach, which classifies blocks using hand-written rules based on position and character size, on the OCR data of the journal Shisō. In the experiments, both methods classified blocks into five logical types: Title, Author, Header, Page Number, and Text Body. The proposed approach was trained on a small set of OCR results annotated by hand with the correct logical types. Table 1 shows the accuracy of logical type estimation by the two approaches. The results show that the proposed approach classified blocks more accurately than the rule-based approach, especially in the identification of Title and Author blocks, which are important for recognizing the unit of an article.

Table 1. Accuracy Percentage of Logical Layout Analysis

                 Proposed approach       Rule-based approach
                 Precision   Recall      Precision   Recall
Title
Author
Header
Page Number
Text Body

[2] MeCab: Japanese Morphological Analyzer.

2.2. Reading Order Estimation

A page of OCR results contains a number of text blocks. We need to know their order to extract the whole text body correctly. For reading order estimation, we introduced a Page Splitting method. This method splits a page vertically or horizontally into two areas and splits those areas recursively until each area contains a single block, as shown in figure 2. The reading order is then assembled using predefined orientations, for example, right to left and top to bottom for Shisō documents. Different split rules lead to different reading orders. In our method, the split rule is determined by a score function (weight vector · feature vector of a split candidate). To learn the split rule, we used an optimization method, Differential Evolution (DE). DE searches for an optimal value of the weight vector that determines the split rule. The features used here are the following:

- Type of split (vertical or horizontal)
- The number of blocks under a split line
- Position of a split line
- Width of a split
- Types of blocks (e.g., whether the split area contains Title or Author blocks)
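The recursive splitting described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' system: blocks are axis-aligned rectangles with y growing downward, a split candidate is any blank gap that no block crosses, only three of the listed features are used (split type, gap width, and number of blocks on the first side), and the weight vectors are hand-picked rather than learned with DE. All names (`Block`, `split_candidates`, `score`, `reading_order`) are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Block(NamedTuple):
    name: str
    x0: float
    y0: float
    x1: float
    y1: float  # y grows downward, so y0 is the top edge


Candidate = Tuple[str, float, float, List[Block], List[Block]]


def split_candidates(blocks: Sequence[Block]) -> List[Candidate]:
    """Find every vertical or horizontal blank gap that no block crosses."""
    found: List[Candidate] = []
    for axis in ("vertical", "horizontal"):
        lo = (lambda b: b.x0) if axis == "vertical" else (lambda b: b.y0)
        hi = (lambda b: b.x1) if axis == "vertical" else (lambda b: b.y1)
        ordered = sorted(blocks, key=lo)
        reach = hi(ordered[0])  # farthest edge covered so far
        for i in range(1, len(ordered)):
            start = lo(ordered[i])
            if start >= reach:  # a gap separates ordered[:i] from ordered[i:]
                found.append((axis, reach, start, ordered[:i], ordered[i:]))
            reach = max(reach, hi(ordered[i]))
    return found


def score(cand: Candidate, weights: Sequence[float]) -> float:
    """Score = weight vector . feature vector of the split candidate."""
    axis, gap_start, gap_end, first, _ = cand
    features = (1.0 if axis == "vertical" else 0.0,  # type of split
                gap_end - gap_start,                 # width of the split
                float(len(first)))                   # blocks on the first side
    return sum(w * f for w, f in zip(weights, features))


def reading_order(blocks: Sequence[Block], weights: Sequence[float]) -> List[str]:
    """Recursively split the area and read right to left, top to bottom."""
    if len(blocks) <= 1:
        return [b.name for b in blocks]
    cands = split_candidates(blocks)
    if not cands:  # overlapping blocks: fall back to a simple positional sort
        return [b.name for b in sorted(blocks, key=lambda b: (b.y0, -b.x1))]
    axis, _, _, first, second = max(cands, key=lambda c: score(c, weights))
    if axis == "vertical":
        first, second = second, first  # right-hand area is read first
    return reading_order(first, weights) + reading_order(second, weights)


# A 2x2 grid of blocks: the weights decide which split is applied first.
grid = [Block("a", 55, 0, 100, 45), Block("b", 0, 0, 45, 45),
        Block("c", 55, 55, 100, 100), Block("d", 0, 55, 45, 100)]
columns_first = reading_order(grid, (1.0, 0.0, 0.0))
rows_first = reading_order(grid, (-1.0, 0.0, 0.0))
```

With the first weight vector the vertical split wins and the page is read column by column; with the second, the horizontal split wins and it is read row by row. This is the sense in which different split rules lead to different reading orders, and it is the weight vector that an optimizer such as DE would tune against hand-annotated orders.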
Figure 2. Page Splitting Method

The proposed approach was evaluated on the OCR results of Shisō documents. It was trained on a small set of OCR results with a human-generated correct reading order. For the Shisō documents, the approach estimated the reading order of blocks with a Spearman Distance of about [value missing].

The experimental results for the above two steps show sufficient accuracy for extracting knowledge in the later step. Moreover, since we employed machine learning techniques for estimating the logical types and reading order of blocks, our approach can be applied to documents with other types of layout, given correct data for training our estimation systems.

3. Visualization System

We constructed a visualization system for the set of digitized documents. The main objective of the system is to facilitate knowledge acquisition from documents and to generate ideas through terminology-based real-time calculation of document similarities and their visualization with an interactive UI. Figure 3 outlines the visualization of knowledge structures for papers relevant to the keyword shisō (思想, "thought") in the 1930s.

The system extracts terminological information from the text in advance by using NLP techniques (Mima and Ananiadou 2000). The system constructs a graph to structure knowledge, in which the nodes (dots) represent papers relevant to the keyword, and the links between the nodes reflect semantic similarities that are calculated from the extracted terminological information. Additionally, the locations of all the nodes are calculated and optimized when the graph is drawn. The distances between nodes depend on how close they are in meaning. Cluster recognition is also carried out by detecting groups of papers in which every pair of papers is strongly linked (i.e., their similarity exceeds a threshold). As seen in figure 3, several clusters are automatically recognized, and category names such as Marxism, Socialism, and right-wing thought are also automatically assigned to the clusters to facilitate an overview of the thoughts discussed in these papers.

Figure 3. Visualization of knowledge structure
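The graph construction and cluster recognition described in this section can be sketched in Python as follows. The paper does not specify the similarity measure, so this sketch assumes that each paper is represented by a set of extracted terms and uses Jaccard similarity; clusters are taken to be maximal groups in which every pair is linked, found with the basic Bron-Kerbosch procedure. The function names, the threshold, and the sample term sets are hypothetical and purely illustrative.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity measure: shared terms over all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_links(papers: Dict[str, Set[str]], threshold: float) -> Dict[str, Set[str]]:
    """Link two papers when their term similarity exceeds the threshold."""
    links: Dict[str, Set[str]] = {name: set() for name in papers}
    for p, q in combinations(papers, 2):
        if jaccard(papers[p], papers[q]) > threshold:
            links[p].add(q)
            links[q].add(p)
    return links


def clusters(links: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Maximal groups in which every pair is linked (Bron-Kerbosch, no pivoting)."""
    found: List[FrozenSet[str]] = []

    def expand(group: Set[str], candidates: Set[str], excluded: Set[str]) -> None:
        if not candidates and not excluded:
            if len(group) > 1:  # ignore isolated papers
                found.append(frozenset(group))
            return
        for v in list(candidates):
            expand(group | {v}, candidates & links[v], excluded & links[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(links), set())
    return found


# Illustrative term sets; real input would come from automatic term extraction.
papers = {
    "p1": {"marxism", "class", "labor"},
    "p2": {"marxism", "class", "capital"},
    "p3": {"marxism", "class", "labor", "capital"},
    "p4": {"zen", "nothingness"},
    "p5": {"zen", "nothingness", "kyoto"},
}
links = build_links(papers, threshold=0.4)
groups = clusters(links)
```

Requiring every pair in a group to be linked is stricter than taking connected components, so a chain of loosely related papers is not merged into one cluster. Node placement and automatic cluster naming, which the system also performs, are outside the scope of this sketch.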
4. Conclusion

We have described a framework for revealing aspects of the modern history of Japanese philosophy. It consists of digitization of documents to extract text, extraction of terminological information from the text with NLP techniques, and visualization of target documents based on the terminological information. Our current target is the journal Shisō (Thought), one of the major journals of philosophy in Japan. We processed the Shisō documents with our framework and constructed a search and visualization system that helps researchers view the whole document set and discover new knowledge. By digitizing and analyzing huge amounts of historical textual data with the system, we expect to discover new knowledge on the historical flow of Japanese thinking during one of its most important eras, from before World War II to the present day.

Although we constructed the framework described in this paper specifically for use with Shisō, the framework itself can be applied to any type of document that can be recognized by an OCR system. As the next step, we will apply the framework to documents in other areas to enable discovery of knowledge in those areas.

References

Cortes, Corinna, and Vladimir Vapnik. 1995. Support-vector Networks. Machine Learning 20(3): doi: /bf

Mima, Hideki, and Sophia Ananiadou. 2000. An Application and Evaluation of the C/NC-value Approach for the Automatic Term Recognition of Multi-word Units in Japanese. International Journal on Terminology 6(2): doi: /term mim.
More informationEnterprise Data Catalog for Microsoft Azure Tutorial
Enterprise Data Catalog for Microsoft Azure Tutorial VERSION 10.2 JANUARY 2018 Page 1 of 45 Contents Tutorial Objectives... 4 Enterprise Data Catalog Overview... 5 Overview... 5 Objectives... 5 Enterprise
More informationXETA: extensible metadata System
XETA: extensible metadata System Abstract: This paper presents an extensible metadata system (XETA System) which makes it possible for the user to organize and extend the structure of metadata. We discuss
More informationAutomatic Reader. Multi Lingual OCR System.
Automatic Reader Multi Lingual OCR System What is the Automatic Reader? Sakhr s Automatic Reader transforms scanned images into a grid of millions of dots, optically recognizes the characters found in
More informationDATA EMBEDDING IN TEXT FOR A COPIER SYSTEM
DATA EMBEDDING IN TEXT FOR A COPIER SYSTEM Anoop K. Bhattacharjya and Hakan Ancin Epson Palo Alto Laboratory 3145 Porter Drive, Suite 104 Palo Alto, CA 94304 e-mail: {anoop, ancin}@erd.epson.com Abstract
More informationVisual Concept Detection and Linked Open Data at the TIB AV- Portal. Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017
Visual Concept Detection and Linked Open Data at the TIB AV- Portal Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017 Agenda 1. TIB and TIB AV-Portal 2. Automated Video Analysis 3. Visual
More informationNarrowing It Down: Information Retrieval, Supporting Effective Visual Browsing, Semantic Networks
Clarence Chan: clarence@cs.ubc.ca #63765051 CPSC 533 Proposal Memoplex++: An augmentation for Memoplex Browser Introduction Perusal of text documents and articles is a central process of research in many
More informationAn Improving for Ranking Ontologies Based on the Structure and Semantics
An Improving for Ranking Ontologies Based on the Structure and Semantics S.Anusuya, K.Muthukumaran K.S.R College of Engineering Abstract Ontology specifies the concepts of a domain and their semantic relationships.
More informationSemantic Clickstream Mining
Semantic Clickstream Mining Mehrdad Jalali 1, and Norwati Mustapha 2 1 Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran 2 Department of Computer Science, Universiti
More information6. Applications - Text recognition in videos - Semantic video analysis
6. Applications - Text recognition in videos - Semantic video analysis Stephan Kopf 1 Motivation Goal: Segmentation and classification of characters Only few significant features are visible in these simple
More informationLeave-One-Out Support Vector Machines
Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm
More informationRPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???
@ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON
More informationA Survey of Problems of Overlapped Handwritten Characters in Recognition process for Gurmukhi Script
A Survey of Problems of Overlapped Handwritten Characters in Recognition process for Gurmukhi Script Arwinder Kaur 1, Ashok Kumar Bathla 2 1 M. Tech. Student, CE Dept., 2 Assistant Professor, CE Dept.,
More informationFederated Search: Results Clustering. JR Jenkins, MLIS Group Product Manager Resource Discovery
Federated Search: Results Clustering JR Jenkins, MLIS Group Product Manager Resource Discovery Why Federated Search? The Web has changed how we deliver and consume information The paradigm shift from physical
More informationMULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS. By Angela Carabelli SSLMIT, Trieste
MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS By SSLMIT, Trieste The availability of teaching materials for training interpreters and translators has always been an issue of unquestionable
More informationCIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task
CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task Xiaolong Wang, Xiangying Jiang, Abhishek Kolagunda, Hagit Shatkay and Chandra Kambhamettu Department of Computer and Information
More informationContext Based Web Indexing For Semantic Web
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 12, Issue 4 (Jul. - Aug. 2013), PP 89-93 Anchal Jain 1 Nidhi Tyagi 2 Lecturer(JPIEAS) Asst. Professor(SHOBHIT
More informationOutline. Structures for subject browsing. Subject browsing. Research issues. Renardus
Outline Evaluation of browsing behaviour and automated subject classification: examples from KnowLib Subject browsing Automated subject classification Koraljka Golub, Knowledge Discovery and Digital Library
More informationAutomatic Metadata Extraction for Archival Description and Access
Automatic Metadata Extraction for Archival Description and Access WILLIAM UNDERWOOD Georgia Tech Research Institute Abstract: The objective of the research reported is this paper is to develop techniques
More informationInformation Retrieval System Based on Context-aware in Internet of Things. Ma Junhong 1, a *
Information Retrieval System Based on Context-aware in Internet of Things Ma Junhong 1, a * 1 Xi an International University, Shaanxi, China, 710000 a sufeiya913@qq.com Keywords: Context-aware computing,
More informationOCR For Handwritten Marathi Script
International Journal of Scientific & Engineering Research Volume 3, Issue 8, August-2012 1 OCR For Handwritten Marathi Script Mrs.Vinaya. S. Tapkir 1, Mrs.Sushma.D.Shelke 2 1 Maharashtra Academy Of Engineering,
More informationIMPLEMENTING ON OPTICAL CHARACTER RECOGNITION USING MEDICAL TABLET FOR BLIND PEOPLE
Impact Factor (SJIF): 5.301 International Journal of Advance Research in Engineering, Science & Technology e-issn: 2393-9877, p-issn: 2394-2444 Volume 5, Issue 3, March-2018 IMPLEMENTING ON OPTICAL CHARACTER
More informationOptimizing Search Engines using Click-through Data
Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches
More informationEarly-Modern Printed Character Recognition using Ensemble Learning
288 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'17 Early-Modern Printed Character Recognition using Ensemble Learning Kaori Fujimoto, Yu Ishikawa, Masami Takata, Kazuki Joe Department of Advanced
More informationAhmed Samir Bibliotheca Alexandrina El Shatby Alexandria, Egypt
DIGITAL PRESERVATION: HANDLING LARGE COLLECTIONS CASE STUDY: DIGITIZING EGYPTIAN PRESS ARCHIVE AT CENTRE FOR ECONOMIC, JUDICIAL, AND SOCIAL STUDY AND DOCUMENTATION (CEDEJ) Ahmed Samir Bibliotheca Alexandrina
More informationEmpirical Analysis of Single and Multi Document Summarization using Clustering Algorithms
Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department
More information3 Background technologies 3.1 OntoGen The two main characteristics of the OntoGen system [1,2,6] are the following.
ADVANCING TOPIC ONTOLOGY LEARNING THROUGH TERM EXTRACTION Blaž Fortuna (1), Nada Lavrač (1, 2), Paola Velardi (3) (1) Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia (2) University of Nova
More informationMining User - Aware Rare Sequential Topic Pattern in Document Streams
Mining User - Aware Rare Sequential Topic Pattern in Document Streams A.Mary Assistant Professor, Department of Computer Science And Engineering Alpha College Of Engineering, Thirumazhisai, Tamil Nadu,
More information