Cross Corpora Discovery Via Minimal Spanning Tree Exploration
Transcription
1 Cross Corpora Discovery Via Minimal Spanning Tree. Jeff Solka, Edward J. Wegman, and Avory Bryant. 5/28/2004
2 In a Nutshell?
o What are we trying to do? Develop a semi-automated system to facilitate the discovery of articles from disparate corpora that may contain subtle relationships.
o What is our approach predicated on? The synthesis of methodologies from statistics, mathematics, and visualization.
o What is our test case? Roughly 1200 Science News abstracts that have been precategorized into 8 categories.
3 The Science News Corpus
o 1117 documents.
o Obtained from the SN website in December 2002 using wget.
o Each article ranges from half a page to roughly a page in length.
o The corpus HTML/XML code was subsequently parsed into plain text.
o The corpus was read through and categorized into 8 categories.
4 The Science News Corpus Breakdown
o Anthropology and Archeology (48).
o Astronomy and Space Sciences (124).
o Behavior (88).
o Earth and Environmental Sciences (164).
o Life Sciences (174).
o Mathematics and Computers (65).
o Medical Sciences (10).
o Physical Sciences and Technology (144).
5 Our Approach to be Discussed Today (5/26/04)
Multi-Discipline Document Set -> Feature Extraction (Denoising, Stemming, BPM, TPM) -> Interpoint Distance Calculation -> Minimal Spanning Tree (MST) Calculation -> MST Layout Via Spring Based Models -> Cross Corpora Discovery Via MST
6 Denoising and Stemming
o These steps are performed prior to the subsequent feature extraction steps.
o Denoising consists of removing all words that appear on a stop-word (noise-word) list, e.g., the, a, an.
o Stemming transforms a given word into its base form: walking -> walk, walked -> walk.
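The two preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop list contains only the three words shown on the slide, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as the Porter algorithm mentioned in the backup slides.

```python
# Illustrative stop list: only the three examples given on the slide.
STOP_WORDS = {"the", "a", "an"}

# Hypothetical suffix rules; a real system would use a full stemmer.
SUFFIXES = ("ing", "ed", "s")


def denoise(tokens):
    """Drop every token that appears on the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, denoise, then stem."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return [crude_stem(t) for t in denoise(tokens)]
```

For example, `preprocess("The man walked a dog")` yields `["man", "walk", "dog"]`. The crude rules over-stem many real words, which is one reason a dedicated stemmer is normally used instead.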
7 Net Algorithmic Complexity
o Let p be the average number of word pairs or triplets in each document.
o The net algorithmic complexity is $O(n^2 p)$.
o It would be easy to formulate parallel computation strategies in both n and p.
o Note that the computational complexity associated with the actual calculation of the BPM is not included here.
8 Feature Extraction (Bigram Proximity Matrix (BPM) & Trigram Proximity Matrix (TPM))
Example sentence: The wise young man sought his father in the crowd.
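As a rough sketch of what the example sentence might look like as features, the code below counts adjacent word pairs and triplets. It assumes that a BPM entry for (word i, word j) counts how often word i is immediately followed by word j, and it stores only the nonzero entries in a hash table keyed by the word pair. The function names are hypothetical, and denoising and stemming are skipped here to keep the example short.

```python
from collections import Counter


def bigram_counts(tokens):
    """Sparse BPM: maps (word, next_word) to its number of occurrences."""
    return Counter(zip(tokens, tokens[1:]))


def trigram_counts(tokens):
    """Sparse TPM: maps (word, next_word, next_next_word) to its count."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


sentence = "The wise young man sought his father in the crowd."
tokens = [t.strip(".").lower() for t in sentence.split()]

bpm = bigram_counts(tokens)
tpm = trigram_counts(tokens)
```

Because the counts are ordered, `("the", "wise")` has count 1 while `("wise", "the")` is absent, so the corresponding dense matrix would not be symmetric.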
9 Evidence That BPM and TPM Capture Semantic Content
o Angel Martinez, A Framework for the Representation of Semantics, Ph.D. dissertation under the direction of Edward Wegman, October.
o Supervised learning.
o Hypothesis tests (sets of tests).
o Unsupervised learning.
o Supervised learning in a reduced dimension space.
10 Similarity Measures and Pseudometrics on the BPM
o Following Martinez (2002), we propose the use of the Ochiai measure in the case of the BPM: $S(X, Y) = \frac{|X \text{ and } Y|}{\sqrt{|X|\,|Y|}}$
o This is converted to a distance via: $d(X, Y) = \sqrt{2 - 2S(X, Y)}$
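A small worked example may help fix the scale of these quantities. It takes the Ochiai measure to be $S = |X \text{ and } Y| / \sqrt{|X|\,|Y|}$ and the distance to be $d = \sqrt{2 - 2S}$; the set sizes below are made up for illustration.

```latex
% Hypothetical sizes: |X| = 4, |Y| = 9, |X and Y| = 3
\[
S(X, Y) = \frac{3}{\sqrt{4 \cdot 9}} = \frac{3}{6} = 0.5,
\qquad
d(X, Y) = \sqrt{2 - 2(0.5)} = 1 .
\]
% Limiting cases:
\[
X = Y \;\Rightarrow\; S = 1,\; d = 0;
\qquad
X \text{ and } Y \text{ empty} \;\Rightarrow\; S = 0,\; d = \sqrt{2} .
\]
```

Under these assumptions every distance lies between 0 and $\sqrt{2}$, with documents that share no word pairs sitting at the maximum.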
11 Interpoint Distance Complexity Issues
o Let n be the number of documents in the corpus. The interpoint distance matrix involves $n(n-1)/2$ comparisons, which results in an $O(n^2)$ operation in the number of documents n.
o It will pay to make each of these comparisons as efficient as possible.
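The quadratic cost is easy to see in a sketch of the pairwise loop. The code below is an illustration under assumptions: each document is represented as a Python set of word pairs, the distance is taken to be $\sqrt{2 - 2S}$ with the Ochiai similarity S, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def ochiai_distance(x, y):
    """Distance sqrt(2 - 2S) between two sets of word pairs."""
    if not x or not y:
        return sqrt(2.0)  # treat an empty document as maximally distant
    s = len(x & y) / sqrt(len(x) * len(y))
    return sqrt(max(0.0, 2.0 - 2.0 * s))  # guard against rounding below 0


def interpoint_distances(docs):
    """Fill a symmetric n x n matrix using n(n-1)/2 comparisons."""
    n = len(docs)
    dist = [[0.0] * n for _ in range(n)]
    comparisons = 0
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = ochiai_distance(docs[i], docs[j])
        comparisons += 1
    return dist, comparisons
```

Each unordered pair is visited once, so 4 documents need 6 comparisons and roughly 1100 documents need about 600,000. Since the pairs are independent, the loop could be split across workers.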
12 Similarity and Distance Measure Complexities
o Let x be the set of word pairs or triplets in Article X, and let y be the set of word pairs or triplets in Article Y. Then Article X AND Article Y is the intersection of the sets x and y.
o The sets x and y are represented as hash tables in which the key (a word pair or triplet) maps to its number of occurrences in the article.
o The size of the intersection is the number of keys in x that are also in y. The containsKey function is close to O(1), so computing X AND Y is close to O(size of x): for each key in x, check whether y contains that key.
o The value of |Article X| |Article Y| is (size of x)(size of y).
o $S(X, Y) = \frac{|X \text{ and } Y|}{\sqrt{|X|\,|Y|}}$ and $d(X, Y) = \sqrt{2 - 2S(X, Y)}$
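The hash-table intersection described on this slide might be sketched as below, using Python dictionaries in place of the hash tables. The function names are hypothetical. Iterating over the smaller table is an added optimization that the slide does not mention.

```python
from math import sqrt


def intersection_size(x, y):
    """Count keys present in both tables with O(1) expected lookups."""
    if len(x) > len(y):
        x, y = y, x  # assumption: loop over the smaller table
    return sum(1 for key in x if key in y)


def ochiai_similarity(x, y):
    """|X and Y| / sqrt(|X| |Y|), where sizes are numbers of distinct keys."""
    if not x or not y:
        return 0.0
    return intersection_size(x, y) / sqrt(len(x) * len(y))


# Hypothetical tables: word pair -> number of occurrences in the article.
article_x = {("wise", "man"): 2, ("man", "sought"): 1}
article_y = {
    ("wise", "man"): 1,
    ("old", "man"): 1,
    ("man", "sought"): 3,
    ("sought", "gold"): 1,
}
similarity = ochiai_similarity(article_x, article_y)
```

Note that only key membership enters the measure here; the stored occurrence counts are not used, which matches the set-based description on the slide.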
13 What Do We Have at This Point in The Process?
o An interpoint distance matrix between each of the articles. Hopefully articles that are close semantically will be close in this matrix, and articles that are far apart semantically will be far apart in this matrix.
o We also have a previously obtained categorization of the articles, obtained via human feedback or an automatic process.
14 How Do We Exploit This Interpoint Distance Matrix?
o First order exploitation. Look for the closest points between each pair of categories (corpora).
o Second order exploitation. Look for those articles that are along the boundary that separates the two categories.
o Third order exploitation. Look for those articles that have the same relationship to the discriminant boundary.
o Fourth order exploitation. Allow the user to drive the interpoint distance geometry via identification of first, second, and third order interesting relationships and subsequent regeometrization.
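First order exploitation can be sketched directly against the distance matrix. The code below is a hypothetical brute-force version that lists the k closest article pairs between two categories; the function name, the labels, and the toy distances are illustrative assumptions, not data from the talk.

```python
def closest_cross_pairs(dist, labels, cat_a, cat_b, k=1):
    """Return the k smallest (distance, i, j) with i in cat_a and j in cat_b."""
    pairs = [
        (dist[i][j], i, j)
        for i, label_i in enumerate(labels)
        if label_i == cat_a
        for j, label_j in enumerate(labels)
        if label_j == cat_b
    ]
    return sorted(pairs)[:k]


# Toy example: two astronomy articles and two mathematics articles.
labels = ["astro", "astro", "math", "math"]
dist = [
    [0.0, 0.5, 0.9, 0.3],
    [0.5, 0.0, 0.8, 0.7],
    [0.9, 0.8, 0.0, 0.4],
    [0.3, 0.7, 0.4, 0.0],
]
best = closest_cross_pairs(dist, labels, "astro", "math")
```

Here the closest cross-category pair is article 0 with article 3 at distance 0.3. Repeating the call for every pair of categories gives one candidate list per pair.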
15 The Minimal Spanning Tree (MST): A Strategy for Effective Exploitation of the Interpoint Distance Matrix
o Definition (Minimal Spanning Tree (MST)): the collection of edges that join all of the points in a set together, with the minimum possible sum of edge values.
o The edge values used here are the distances stored in our interpoint distance matrix.
o Figure: a complete graph and its associated MST.
16-23 Calculation of the MST: Kruskal's Algorithm
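Kruskal's algorithm sorts all edges by weight and adds each edge that does not close a cycle. A minimal Python sketch over a dense distance matrix is given below; it uses a union-find structure for the cycle check, which is a common choice but an assumption about the implementation used in the talk.

```python
def kruskal_mst(dist):
    """Return the MST of a complete graph as a list of (i, j, weight) edges."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component root, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All n(n-1)/2 edges of the complete graph, cheapest first.
    edges = sorted(
        (dist[i][j], i, j) for i in range(n) for j in range(i + 1, n)
    )

    mst = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two different components
            parent[root_i] = root_j
            mst.append((i, j, weight))
            if len(mst) == n - 1:
                break
    return mst


dist = [
    [0.0, 0.5, 0.9, 0.3],
    [0.5, 0.0, 0.8, 0.7],
    [0.9, 0.8, 0.0, 0.4],
    [0.3, 0.7, 0.4, 0.0],
]
tree = kruskal_mst(dist)
```

On this toy matrix the tree consists of the edges (0, 3), (2, 3), and (0, 1) with total weight 1.2. Sorting the edges dominates the cost, so the run time is $O(n^2 \log n)$ for n documents.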
24 MST Classifier Complexity Characterization
o Previous work had suggested that the number of cross class edges can be used as a surrogate for classification complexity.
o These cross class (corpora) edges will be used in our scheme to facilitate the cross-corpora discovery process.
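Given an MST and the category of each article, the cross class edges are simply the tree edges whose endpoints carry different labels. The sketch below assumes the tree is a list of (i, j, weight) tuples; the function name and the toy data are hypothetical.

```python
def cross_class_edges(mst_edges, labels):
    """Keep only the MST edges that connect articles from different classes."""
    return [(i, j, w) for i, j, w in mst_edges if labels[i] != labels[j]]


# Toy tree over four articles, two per category.
labels = ["astro", "astro", "math", "math"]
mst_edges = [(0, 3, 0.3), (2, 3, 0.4), (0, 1, 0.5)]

crossing = cross_class_edges(mst_edges, labels)
fraction_crossing = len(crossing) / len(mst_edges)
```

In this example only the edge between articles 0 and 3 crosses categories. A low fraction of crossing edges suggests well separated classes, and the crossing edges themselves are the candidates to inspect for cross-corpora relationships.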
25 Implementation Issues (The Devil in the Details)
o BPM extraction and interpoint distance calculation: implemented in C#.
o BPM similarity and distance calculation: implemented in C#.
o MST calculation: implemented using Kruskal's algorithm in Java.
o Visualization environment: implemented in Java. Graph layout facilitated using TouchGraph.
26 TouchGraph
o TouchGraph is a General Public License, Java-based library for the visualization of graphs.
o Graph layout in TouchGraph: when a graph is first loaded, nodes start out at the center with slightly random positions, and then spread out because of node-node repulsions.
o Graph manipulation tools provided by TouchGraph: zooming, rotation, hyperbolic manipulation, graph dragging.
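The layout behavior described here resembles a generic force-directed scheme. The sketch below is not TouchGraph's code or API; it is a hypothetical, minimal spring model in which all node pairs repel, tree edges act as springs, and nodes start near the center with small random offsets. Every parameter value is an illustrative assumption.

```python
import math
import random


def spring_layout(n, edges, iterations=200, seed=0, rest_length=1.0,
                  stiffness=0.05, repulsion=0.1, max_step=0.1):
    """Return 2-D positions for n nodes after a simple force simulation."""
    rng = random.Random(seed)
    # Nodes start at the center with slightly random positions.
    pos = [[rng.uniform(-0.01, 0.01), rng.uniform(-0.01, 0.01)]
           for _ in range(n)]
    for _ in range(iterations):
        force = [[0.0, 0.0] for _ in range(n)]
        # Node-node repulsion, inversely proportional to squared distance.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) + 1e-9
                f = repulsion / (d * d)
                force[i][0] += f * dx / d
                force[i][1] += f * dy / d
                force[j][0] -= f * dx / d
                force[j][1] -= f * dy / d
        # Edge springs pull or push endpoints toward the rest length.
        for i, j in edges:
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            d = math.hypot(dx, dy) + 1e-9
            f = stiffness * (d - rest_length)
            force[i][0] += f * dx / d
            force[i][1] += f * dy / d
            force[j][0] -= f * dx / d
            force[j][1] -= f * dy / d
        # Move each node, capping the step so early iterations stay stable.
        for i in range(n):
            fx, fy = force[i]
            mag = math.hypot(fx, fy)
            if mag > max_step:
                fx, fy = fx * max_step / mag, fy * max_step / mag
            pos[i][0] += fx
            pos[i][1] += fy
    return pos
```

Calling `spring_layout(4, [(0, 1), (1, 2), (2, 3)])` spreads a four-node path out from the center. The pairwise repulsion loop costs $O(n^2)$ per iteration, so a layout of about a thousand articles would benefit from a spatial index or a smaller neighborhood cutoff.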
27 The Environment (Opening Screen)
28 The Environment (MST) Blue is anthropology and archaeology. Pink is behavior.
29 The Environment (The Comparison File)
30 The Demo
31 Wrap-up
o Demonstrated a new method for cross corpora document discovery.
o Method predicated on the use of the BPM and the MST as a convenient foil for the exploration of the cross corpora relationships.
o This work represents the tip of the iceberg of a new area that is not only of strategic importance to the United States but is also highly relevant to all who are currently conducting research in any discipline.
32 Backup Slides
33 An Alternate Approach
Multi-Discipline Document Set -> Exemplar Term Production Via Synonym Analysis -> Feature Extraction (BPM, TPM) -> Dimensionality Reduction ISOMAP/LLE -> Model-Based Clustering With Adaptive Mixtures Initialization -> Serendipity Identification and Visualization
34 A Paradigm
"You don't reach Serendip by plotting a course for it. You have to set out in good faith for elsewhere and lose your bearings serendipitously." -- John Barth, The Last Voyage of Somebody the Sailor
35 Acknowledgements
o Jim Gentle (Opportunity to speak)
o Algotek (Funding and Program Management): Anna Tsao
o Algotek Team (Helpful discussions and encouragement): Carey Priebe, David Marchette
36 The Porter Stemming Algorithm
o The Porter stemming algorithm (or "Porter stemmer") is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalization process that is usually done when setting up Information Retrieval systems. (Source: official home page for distribution of the Porter Stemming Algorithm.)
More informationStats fest Multivariate analysis. Multivariate analyses. Aims. Multivariate analyses. Objects. Variables
Stats fest 7 Multivariate analysis murray.logan@sci.monash.edu.au Multivariate analyses ims Data reduction Reduce large numbers of variables into a smaller number that adequately summarize the patterns
More informationAn Introduction to Content Based Image Retrieval
CHAPTER -1 An Introduction to Content Based Image Retrieval 1.1 Introduction With the advancement in internet and multimedia technologies, a huge amount of multimedia data in the form of audio, video and
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationJava Archives Search Engine Using Byte Code as Information Source
Java Archives Search Engine Using Byte Code as Information Source Oscar Karnalim School of Electrical Engineering and Informatics Bandung Institute of Technology Bandung, Indonesia 23512012@std.stei.itb.ac.id
More informationComputer vision: models, learning and inference. Chapter 10 Graphical Models
Computer vision: models, learning and inference Chapter 10 Graphical Models Independence Two variables x 1 and x 2 are independent if their joint probability distribution factorizes as Pr(x 1, x 2 )=Pr(x
More informationInformation System Architecture. Indra Tobing
Indra Tobing What is IS Information architecture is the term used to describe the structure of a system, i.e the way information is grouped, the navigation methods and terminology used within the system.
More informationVector Semantics. Dense Vectors
Vector Semantics Dense Vectors Sparse versus dense vectors PPMI vectors are long (length V = 20,000 to 50,000) sparse (most elements are zero) Alternative: learn vectors which are short (length 200-1000)
More informationCopyright. Anna Marie Bouboulis
Copyright by Anna Marie Bouboulis 2013 The Report committee for Anna Marie Bouboulis Certifies that this is the approved version of the following report: Poincaré Disc Models in Hyperbolic Geometry APPROVED
More informationRegion-based Segmentation
Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.
More informationText Mining. Representation of Text Documents
Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,
More information