Content analysis with Apache Tika
- Rosaline Hamilton
- 5 years ago
Transcription
1 Content analysis with Apache Tika or -
2 Main challenge: Lucene index
3 Other challenges: Licenses; Dependencies; Efforts breaking up; Custom solution limits
4 What is Tika? Another Indian Lucene project? No.
5 What is Tika? It is a toolkit. Doing what? Detection, metadata extraction, structured text content extraction. Where? Various documents. How? Using existing parser libs.
6 Current coverage: 77 MIME types (+ 36 aliases); 179 glob patterns; 18 magic patterns; 66 supported types; 15 parsers
7 A brief history of Tika: Sponsored by the Apache Lucene PMC. Milestones: Incubator (March 07); 0.1 release (December 07); 0.2 release; graduating.
8 Tika organization: 7 committers. Issue tracking: se/tika. Mailing lists: tika-dev@incubator.apache.org. Changing after graduation.
9 Getting Tika and contributing. Download: 0.1-incubating src (tar.gz) or svn.apache.org. Build: mvn install. Have fun: mvn eclipse:eclipse.
10 Tika Design: The Parser interface; Document input stream; XHTML SAX events; Document metadata; Parser implementations
11 The Parser interface
    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;
  Document + Metadata -> parse() -> XHTML + Metadata
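The parse() contract on this slide can be tried out without Tika on the classpath. The following is only a sketch under stated assumptions: PlainTextSketchParser and extractText are hypothetical names, a plain Map stands in for Tika's Metadata class, and the "Content-Type" key is assumed to correspond to Metadata.CONTENT_TYPE. It uses only the JDK's SAX classes and shows the idea of turning a document stream into XHTML SAX events plus metadata.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for the slide's parse() contract; this is not Tika's own code. */
final class PlainTextSketchParser {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Reads the stream line by line and reports each line as an XHTML p element. */
    static void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Assumed key name; a Map replaces Tika's Metadata class in this sketch.
        metadata.put("Content-Type", "text/plain");
        AttributesImpl none = new AttributesImpl();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        String line;
        while ((line = reader.readLine()) != null) {
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Convenience for trying the sketch: collects character events, one line per paragraph. */
    static String extractText(String document, Map<String, String> metadata) {
        StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    text.append('\n');
                }
            }
        };
        try {
            byte[] bytes = document.getBytes(StandardCharsets.UTF_8);
            parse(new ByteArrayInputStream(bytes), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }
}
```

The caller chooses the ContentHandler, so the same parse() call can feed an indexer, a text collector, or an XML serializer without the parser knowing which one it is.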
13 Document input stream: InputStream not readable -> IOException; InputStream not parsable -> TikaException; InputStream OK -> PARSED!
15 XHTML SAX events
    <html xmlns="...">
      <head> <title>...</title> </head>
      <body>... </body>
    </html>
  parse() -> SAX events -> ContentHandler
16 Why XHTML? It reflects the structured text content of the document without recreating the low-level details. For low-level details, use low-level parser libs.
17 ContentHandler (CH) and Decorators (CHD)
  XHTMLContentHandler: CHD that produces XHTML events
  BodyContentHandler: CHD that only passes the body to a CH
  TextContentHandler: CHD that only passes characters to a CH
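The decorator idea on this slide can be illustrated with the JDK's SAX classes alone. This is a rough sketch, not Tika's BodyContentHandler: BodyOnlySketchHandler and bodyText are hypothetical names, and the JDK's default (non-namespace-aware) SAX parser is used only to generate events for the demonstration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical decorator: forwards character events to a delegate only inside body. */
final class BodyOnlySketchHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private int bodyDepth = 0; // 0 means we are outside the body element

    BodyOnlySketchHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK parser is not namespace-aware, so qName carries the tag name.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (bodyDepth > 0) {
            delegate.characters(ch, start, length);
        }
    }

    /** Parses a small XHTML string and returns only the text found inside body. */
    static String bodyText(String xhtml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)),
                           new BodyOnlySketchHandler(sink));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Because the decorator is itself a ContentHandler, decorators can be stacked: one that keeps only the body can wrap one that keeps only characters, which in turn wraps the final sink.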
19 Document metadata
  Metadata.RESOURCE_NAME_KEY: the name of the file or resource that contains the document
  Metadata.CONTENT_TYPE: the content type the document was parsed as
  Metadata.TITLE: set if the document format contains an explicit title field
  Metadata.AUTHOR: set if the document format contains an explicit author field
20 More metadata: Apache POI HPSF. TITLE - Title; SUBJECT - Subject; AUTHOR - Author; KEYWORDS - Keywords; COMMENTS - Comments; TEMPLATE - Template; LAST_SAVED - Last Saved By; REVISION_NUMBER - Revision Number; LAST_PRINTED - Last Printed; LAST_SAVED - Last Saved Time/Date; PAGE_COUNT - Number of Pages; WORD_COUNT - Number of Words; CHARACTER_COUNT - Number of Characters; APPLICATION_NAME - Name of Creating Application
22 Parser implementations. 3rd-party libraries: PDFBox, Apache POI. Tika parsers: PDFParser, OfficeParser.
23 The AutoDetectParser: encapsulates all Tika functionality; can handle any type of document. Type detection, then parsing.
24 Type Detection: MimeType type = types.getMimeType( );
  Magic markers in prefix: type.matchesXML(data), magic.eval(data)
  Resource name: MimeType type = patterns.matches(name)
  Metadata: Metadata.CONTENT_TYPE
  Default type: application/octet-stream
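The fallback chain on this slide can be sketched in a few lines of plain Java. This is an illustration only: TypeDetectionSketch, detect and the tiny suffix table are hypothetical, the order simply follows the slide (magic bytes, resource name, metadata hint, default), and Tika's real implementation is driven by tika-mimetypes.xml and may weigh the sources differently.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the detection order: magic, name, metadata hint, default. */
final class TypeDetectionSketch {
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Assumed, deliberately tiny suffix table; a real registry is far larger.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();
    static {
        SUFFIXES.put(".tgz", "application/x-gzip");
        SUFFIXES.put(".gz", "application/x-gzip");
        SUFFIXES.put(".pdf", "application/pdf");
    }

    static String detect(byte[] prefix, String resourceName, String contentTypeHint) {
        // 1. Magic markers in the first bytes of the stream (gzip: 0x1f 0x8b).
        if (startsWith(prefix, 0x1f, 0x8b)) {
            return "application/x-gzip";
        }
        // 2. Resource name, matched here by suffix instead of full glob patterns.
        if (resourceName != null) {
            String name = resourceName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
                if (name.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        // 3. A content type supplied by the caller as metadata.
        if (contentTypeHint != null && !contentTypeHint.isEmpty()) {
            return contentTypeHint;
        }
        // 4. Nothing matched.
        return DEFAULT_TYPE;
    }

    private static boolean startsWith(byte[] data, int... magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xff) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Checking the bytes first means a mislabeled file (for example a gzip archive named unknown.bin) is still recognized by its content.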
25 tika-mimetypes.xml. An example: Gzip
    <mime-type type="application/x-gzip">
      <magic priority="40">
        <match value="\037\213" type="string" offset="0" />
      </magic>
      <glob pattern="*.tgz" />
      <glob pattern="*.gz" />
      <glob pattern="*-gz" />
    </mime-type>
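The match value in the Gzip entry is written with octal escapes: \037 is byte 0x1f and \213 is byte 0x8b. The sketch below shows one way such a value could be decoded and compared against the start of a file. MagicMatchSketch, decode and matches are hypothetical names; how Tika itself decodes and evaluates magic values is not shown on the slide and is not claimed here.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper: decodes an octal-escaped magic value and tests it at an offset. */
final class MagicMatchSketch {

    /** Turns backslash-octal escapes such as \037 into raw bytes; other chars pass through. */
    static byte[] decode(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '\\') {
                // Read up to three octal digits after the backslash.
                int j = i + 1;
                int v = 0;
                while (j < value.length() && j < i + 4
                        && value.charAt(j) >= '0' && value.charAt(j) <= '7') {
                    v = v * 8 + (value.charAt(j) - '0');
                    j++;
                }
                if (j > i + 1) {
                    out.write(v & 0xff);
                    i = j;
                    continue;
                }
            }
            out.write(c & 0xff);
            i++;
        }
        return out.toByteArray();
    }

    /** Returns true when data contains the magic bytes starting exactly at offset. */
    static boolean matches(byte[] data, int offset, byte[] magic) {
        if (data == null || offset < 0 || data.length - offset < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the Gzip entry above, decode yields the two bytes 0x1f 0x8b, and matches(data, 0, magic) is true for any stream that starts with them.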
26 Supported formats: Excel, Word, PowerPoint, PDF, plain text, RTF, Outlook, Gzip, Bzip2, XML, HTML, images, Java class & jar, MP3, OpenDocument, Tar, ZIP, others
27 A really simple example
    InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    new OfficeParser().parse(input, handler, metadata);
    String contentType = metadata.get(Metadata.CONTENT_TYPE);
    String title = metadata.get(Metadata.TITLE);
    String content = handler.toString();
28 Future goals: OCR, speech recognition, XMP integration, parser configurability
29 Who uses Tika? Apache Nutch, Apache Jackrabbit, Apache Droids, Apache UIMA