Heterogeneous (Information) Networks. CS 6604: Data Mining Large Networks and Time-Series Paper Presentation Prashant Chandrasekar 11/01/17

1 Heterogeneous (Information) Networks CS 6604: Data Mining Large Networks and Time-Series Paper Presentation Prashant Chandrasekar 11/01/17

2 Overview
Topic: Heterogeneous Information Networks (HINs)
Outline:
- A paper introducing the field of HIN mining
- Two really cool applications of HINs
Objective/Takeaway: Pique your interest in the field and, more importantly, show how HINs can be part of your personal research, class project, or hobbies.

3 Heterogeneous Information Network Analysis Paper: A Survey of Heterogeneous Information Network Analysis. Authors: Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and Philip S. Yu. IEEE Transactions on Knowledge and Data Engineering, 29(1):17-37, January 2017.

4 Background
- Real systems have a large number of interactions between multi-typed components.
- Information networks are a ubiquitous way to model and represent interacting components.
- Mining such networks is related to work in link analysis, network analysis, network science, and graph mining.
- Contemporary information network analyses are restricted to single-typed objects/nodes and/or links/edges.
- HIN: allows fusing more information and gives a richer semantic representation.

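5 Concepts and Definitions Def 1: Information Network - a directed graph $G = (V, E)$ with an object type mapping function $\phi: V \to A$ and a link type mapping function $\psi: E \to R$. Each object $v \in V$ belongs to exactly one object type $\phi(v) \in A$. Each link $e \in E$ belongs to exactly one relation type $\psi(e) \in R$. If two links belong to the same relation type, they share the same starting and ending object types.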

6 Concepts and Definitions Def 1: Heterogeneous/Homogeneous Information Network - a directed graph $G = (V, E)$ with an object type mapping function $\phi: V \to A$ and a link type mapping function $\psi: E \to R$. Each object belongs to exactly one object type, and each link belongs to exactly one relation type. If two links belong to the same relation type, they share the same starting and ending object types. The network is heterogeneous if $|A| > 1$ or $|R| > 1$; otherwise it is homogeneous.
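
The definition above can be sketched as a small data structure. This is a minimal illustration, not code from the survey: the class name `InformationNetwork`, its methods, and the toy bibliographic data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InformationNetwork:
    """Typed directed graph (hypothetical helper, not from the survey)."""
    object_type: dict = field(default_factory=dict)  # phi: object -> object type
    links: set = field(default_factory=set)          # (source, relation, target)

    def add_object(self, obj, obj_type):
        self.object_type[obj] = obj_type

    def add_link(self, source, relation, target):
        # Both endpoints must already be typed objects.
        if source not in self.object_type or target not in self.object_type:
            raise KeyError("add both objects before linking them")
        self.links.add((source, relation, target))

    def object_types(self):
        return set(self.object_type.values())

    def relation_types(self):
        return {relation for _, relation, _ in self.links}

    def is_heterogeneous(self):
        # Heterogeneous if |A| > 1 or |R| > 1, otherwise homogeneous.
        return len(self.object_types()) > 1 or len(self.relation_types()) > 1

    def is_consistent(self):
        # Links of one relation type must share start and end object types.
        signature = {}
        for source, relation, target in self.links:
            ends = (self.object_type[source], self.object_type[target])
            if signature.setdefault(relation, ends) != ends:
                return False
        return True


def toy_bibliographic_network():
    """Made-up example: two authors, one paper, one venue."""
    net = InformationNetwork()
    net.add_object("alice", "Author")
    net.add_object("bob", "Author")
    net.add_object("p1", "Paper")
    net.add_object("kdd", "Venue")
    net.add_link("alice", "writes", "p1")
    net.add_link("bob", "writes", "p1")
    net.add_link("p1", "published_in", "kdd")
    return net
```

Calling `toy_bibliographic_network().is_heterogeneous()` returns `True` because the toy network has three object types and two relation types.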

7 HIN Example: Bibliographic dataset

8 HIN: Meta-Paths Key difference from homogeneous networks: two objects can be connected via different paths, and each path can have its own meaning.

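9 Meta-Path Definition Given a network schema $S = (A, R)$ (remember from the previous slide), a meta-path $P$ is of the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$. It defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between objects of types $A_1$ and $A_{l+1}$. If there are no multiple relations between two object types, the meta-path can be represented via object types alone: $P = (A_1 A_2 \cdots A_{l+1})$. For example, for bibliographic data we have the 2-length meta-path Author -> Paper -> Author, or APA for short.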
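
To make the definition concrete, here is a minimal sketch that counts path instances along a meta-path by following one relation per step. The function `follow_meta_path` and the toy `WRITES` data are hypothetical and only illustrate the APA example.

```python
from collections import Counter


def follow_meta_path(start, steps):
    """Count path instances from `start` along a meta-path.

    `steps` is a list of adjacency dicts, one per relation in the meta-path;
    each maps an object to the list of objects it links to.
    Returns a Counter: end object -> number of path instances.
    """
    frontier = Counter({start: 1})
    for adjacency in steps:
        next_frontier = Counter()
        for obj, count in frontier.items():
            for neighbor in adjacency.get(obj, []):
                next_frontier[neighbor] += count
        frontier = next_frontier
    return frontier


# Hypothetical toy data: Author -> Paper ("writes").
WRITES = {"alice": ["p1", "p2"], "bob": ["p1"], "carol": ["p2", "p3"]}

# Inverse relation: Paper -> Author ("written by").
WRITTEN_BY = {}
for author, papers in WRITES.items():
    for paper in papers:
        WRITTEN_BY.setdefault(paper, []).append(author)

# APA: Author -> Paper -> Author, starting from alice.
apa_from_alice = follow_meta_path("alice", [WRITES, WRITTEN_BY])
```

Here `apa_from_alice` maps each author to the number of papers shared with alice (alice herself is reached once per paper she wrote).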

10 Meta-Path: Bibliographic Dataset Question/Challenge: Would the output of a task depend on the meta-path? For example, when finding similar authors, would the result be different if we chose meta-path (a) rather than meta-path (b)?

11 Related network types
- Homogeneous network: $|A| = 1$, $|R| = 1$. A special case of HIN; can be derived from an HIN through network projection. Its analysis techniques are not directly applicable to HINs.
- Multi-relational network: $|A| = 1$, $|R| > 1$. A special case of HIN.
- Multi-dimensional/mode network: same as multi-relational network.
- Composite network: users in the network have various relationships, behave differently in each subnetwork, and share latent variables. Same as multi-relational network.
- Complex network: has non-trivial topological features. Fields of study include math, physics, biology, CS, etc. Real-world networks (such as social and biological networks) are complex networks, and real-world HINs might be complex networks.

12 Common HIN Network Schemas
- Multi-relational network with single-typed objects: Facebook, Xiaonei, etc.
- Bipartite: user-item, document-word (extended to k-partite graphs).
- Star-schema: bibliographic data, movie data, US patent data. Typically derived from DB tables; the most popular schema.
- Multi-hub network: most common for bioinformatics data.

13 Complex HINs Multiple HINs (e.g. studying connections across two social networks). Schema-rich networks (based on ontologies written with semantic web standards), such as the Knowledge Graph.

14 Summary of Research Work on HIN

15 Summary of Research Work on HIN mining papers analyzed
Seven main data mining tasks:
- Similarity Measure
- Clustering
- Classification
- Link Prediction
- Recommendation
- Others

16 HIN Data Mining: Similarity Measure
Two traditional approaches:
- Link-based (Personalized PageRank [54], SimRank [55], etc.)
- Attribute-based (feature value comparison using the Jaccard coefficient, cosine similarity, etc.)
Similarity on HIN: considers the meta-path along with structural similarity. Two different meta-paths have different semantic meanings.
Example: find the authors most similar to Christos Faloutsos. APA says his students are most similar; APVPA (correctly) shows the most similar authors in the same field.
Other works:
- PathSim uses symmetric meta-paths [14]
- RelSim uses meta-paths to measure similarity between relations [59]
- HeteSim measures the relevance of multi-typed objects using an arbitrary meta-path [13][62]
- Social influence using object similarity plus influence in an HIN [67]
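
As an illustration of meta-path-based similarity, the sketch below computes PathSim under the symmetric meta-path APVPA. It relies on the standard PathSim definition, $s(x, y) = 2 M_{xy} / (M_{xx} + M_{yy})$, where $M$ counts path instances between objects. The author-by-venue counts in `AUTHOR_VENUE` are made up for the example.

```python
def commuting_matrix(w):
    """M = W W^T for an author-by-venue count matrix W (meta-path APVPA)."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in w]
            for row_i in w]


def pathsim(m, x, y):
    """PathSim score between objects x and y given the path-count matrix m."""
    denominator = m[x][x] + m[y][y]
    return 2 * m[x][y] / denominator if denominator else 0.0


# Hypothetical publication counts: rows are authors, columns are venues.
AUTHOR_VENUE = [
    [2, 1],    # author 0: a few papers, mostly in venue 0
    [50, 20],  # author 1: same venues, far more papers
    [2, 0],    # author 2: a few papers, only in venue 0
]

M = commuting_matrix(AUTHOR_VENUE)
score_0_1 = pathsim(M, 0, 1)  # similar profile, very different visibility
score_0_2 = pathsim(M, 0, 2)  # similar profile and similar visibility
```

In this toy data `score_0_2` is much higher than `score_0_1`, reflecting that the normalization favors peers with comparable path counts rather than simply the most prolific author.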

17 HIN Data Mining: Clustering
- Traditionally based on object features and done on homogeneous networks. Heterogeneity in object types makes the task harder.
- Example: cluster a bibliographic dataset. The result is a set of sub-network clusters, each pertaining to a particular research/CS domain. Clustering this way preserves information.
Rich information in an HIN helps clustering by integrating additional information and/or improving learning tasks:
- Attribute information integration: handling attribute incompleteness, vertex attributes, random fields, etc.
- Text information integration, e.g. topic models of contents, with clusters based on topics.
- Integration with other mining tasks such as ranking, e.g. ranking-based clustering, where ranking and clustering mutually enhance each other.
- Other information: social-influence-based clustering using connections and social activities.

18 HIN Data Mining: Classification
- Traditionally done on objects assumed to be IID (which may not hold in an HIN).
- Classification in HIN: can classify multiple types of objects simultaneously. Meta-paths are widely used.
- Example: four types of objects interlinked. Classification is a process of knowledge propagation, deriving correlations among objects.
Other approaches:
- Represent meta-paths in a latent space to label multiple nodes
- Model mutual influence for multi-label classification
- Mine multiple relationships for multi-label classification
- Meta-paths as feature generators (GNetMine [21], HetPathMine [99])
- Meta-path-based dependencies for collective classification

19 HIN Data Mining: Link Prediction
- Challenge with HIN: the links to be predicted are of different types, so multiple types of links need to be predicted collectively.
- Meta-path-based approaches: a two-step process: 1) extract meta-path-based features; 2) train a regression/classification model to predict links [23][24][110][111][112]. PathPredict solves co-authorship prediction using meta-paths and logistic regression [23]. Path-based features have also been used to predict a company's organizational chart.
- Probabilistic-model-based approaches: predict links by modeling influence propagation between heterogeneous relationships.
- Other work includes link prediction across multiple HINs and dynamic link prediction, such as predicting the evolution of community members.
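
The two-step process can be sketched as follows, assuming step 1 has already produced one feature vector of meta-path instance counts per candidate author pair. This is a plain gradient-descent logistic regression written for illustration only; it is not the PathPredict implementation, and the features and labels are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_link_probability(weights, bias, x):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Step 1 (assumed done): hypothetical path counts per author pair,
# e.g. [number of APA instances, number of APVPA instances].
PAIR_FEATURES = [[3, 2], [2, 3], [4, 4], [0, 1], [1, 0], [0, 0]]
# Label 1 means the pair co-authored later, 0 means it did not.
PAIR_LABELS = [1, 1, 1, 0, 0, 0]

# Step 2: train the model on the meta-path features.
WEIGHTS, BIAS = train_logistic_regression(PAIR_FEATURES, PAIR_LABELS)
```

A pair with many connecting path instances, such as `[4, 4]`, then receives a higher predicted link probability than a pair with none.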

20 HIN Data Mining: Recommendation
- The richer information and semantics of an HIN make for better recommendations.
- Constructing an HIN for recommendation helps fuse all the information that could potentially be used for the task.
- Meta-paths are well suited to exploring relations between objects. HeteRecom finds similarities between movies based on the semantic information of meta-paths [43]. SemRec is a personalized recommender system that builds a weighted HIN by using movie ratings on links [48].
- Fusing heterogeneous information to help with recommendation: context-dependent matrix factorization models.

21 HIN Data Mining: Others
Information fusion: the process of merging information from heterogeneous sources with different conceptual, contextual, and typographical representations.
- Seen in tasks such as data schema integration in DW, protein-protein interaction networks, and ontology mapping in the semantic web.
- Related work includes social network matching and various solutions to the alignment problem.
- Intuition: fusing HINs improves the previously covered tasks (more contextual data).
Application systems: systems whose design is based on HINs.
- A system for exploring and analyzing a topical hierarchy constructed from an HIN
- An online social media spam detection system for social network security
- Malware detection (details in the following paper)

22 Shortcomings / Future Mining
The field of HIN and HIN mining is relatively young. Future considerations include:
- Integrating attribute values to build weighted HINs: real networks may carry attribute values on links, and these values may contain important information.
- Dynamic HINs: to represent and model time-series data.
- Network construction for complex data: semantically rich RDF-based graphs (managing objects and relations with so many types and meta-paths).
- HINs with more descriptive meta-paths.
- Methods to optimize/rank the selection of meta-paths for data mining tasks.

23 Heterogeneous Information Network to detect Malware in Android Apps Paper: HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network Authors: Hou, S., Ye, Y., Song, Y., & Abdulhayoglu, M. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2017

24 Background Example app: Locker.apk. Malicious: once installed, the victim is locked out of the phone and asked to pay a ransom. Goal: predict which apps published for Android are malicious. Key methodology: feature extraction and learning via HIN and meta-paths.

25 Preliminary concepts Android app: compiled and packaged as a single .apk file, which includes the app code (.dex file), resources, assets, and the manifest file. The dex (Dalvik executable) file format holds the compiled code (unreadable). Smali is a .dex assembler/disassembler; it provides the code in Smali form.

26 Smali Code

27 Feature Extraction: Common APIs across Apps For each app, note the APIs called in its Smali code (parse the Smali code to extract API calls). Represent occurrence as a matrix $A$, where $a_{ij} = 1$ if app $i$ contains API $j$ and $a_{ij} = 0$ otherwise. Source: des_ye.pdf
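
A rough sketch of this extraction step is shown below. It assumes API calls appear as Smali `invoke-*` instructions with a class-type receiver; the regular expression, the function names, and the two toy Smali snippets are illustrative assumptions, not the authors' parser.

```python
import re

# Matches e.g.: invoke-virtual {v0}, Ljava/io/FileOutputStream;->write([B)V
# Assumption: only class-type receivers (L...;) are of interest.
INVOKE_RE = re.compile(
    r"^\s*(invoke-[\w/-]+)\s+\{[^}]*\},\s*(L[^;]+;)->([^\s(]+)\("
)


def extract_api_calls(smali_text):
    """Return the set of API identifiers 'Lpkg/Class;->method' in the text."""
    apis = set()
    for line in smali_text.splitlines():
        match = INVOKE_RE.match(line)
        if match:
            _, cls, method = match.groups()
            apis.add(f"{cls}->{method}")
    return apis


def build_app_api_matrix(app_to_smali):
    """Return (apps, api_index, A) where A[i][j] = 1 if app i calls API j."""
    app_apis = {app: extract_api_calls(text) for app, text in app_to_smali.items()}
    api_index = sorted(set().union(*app_apis.values()))
    apps = sorted(app_apis)
    matrix = [[1 if api in app_apis[app] else 0 for api in api_index]
              for app in apps]
    return apps, api_index, matrix


# Hypothetical Smali snippets for two apps.
TOY_SMALI = {
    "a.apk": """
.method public lock()V
    invoke-virtual {v0, v1}, Ljava/io/FileOutputStream;->write([B)V
    invoke-static {}, Ljava/lang/Runtime;->getRuntime()Ljava/lang/Runtime;
.end method
""",
    "b.apk": """
.method public run()V
    invoke-static {}, Ljava/lang/Runtime;->getRuntime()Ljava/lang/Runtime;
.end method
""",
}

APPS, API_INDEX, A = build_app_api_matrix(TOY_SMALI)
```

Each row of `A` is the bag-of-APIs vector for one app, with columns ordered by `API_INDEX`.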

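28 FE: Relationship with APIs -> Code Block Find APIs that occur in the same code block. A Smali code block is delimited by .method ... .end method. Represent co-occurrence as a matrix $B$, where $b_{ij} = 1$ if API $i$ and API $j$ occur in the same code block and $b_{ij} = 0$ otherwise. Source: des_ye.pdf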

29 FE: Relationship with APIs -> Package Intuition: API calls belonging to the same package show similar intent. For example, API-[1-4] are part of Package 1 and API-[5-8] are part of Package 2. Represent this as a matrix $P$, where $p_{ij} = 1$ if API $i$ and API $j$ belong to the same package and $p_{ij} = 0$ otherwise. Source: des_ye.pdf

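30 FE: Relationship with APIs -> InvokeMethods Intuition: API calls that use the same invoke method are like words that have the same part of speech. Represent this as a matrix $I$, where $i_{ij} = 1$ if API $i$ and API $j$ use the same invoke method and $i_{ij} = 0$ otherwise. Source: des_ye.pdf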
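
The three API-to-API relations (same code block, same package, same invoke method) can all be built with one grouping routine. The sketch below assumes each extracted API call has been stored as a record with hypothetical keys `api`, `block`, `package`, and `invoke`; setting the diagonal to 1 for APIs that appear is also an assumption of this sketch.

```python
from itertools import combinations


def group_by(records, key):
    """Group API names by the value of `key` (e.g. 'block' or 'package')."""
    groups = {}
    for record in records:
        groups.setdefault(record[key], set()).add(record["api"])
    return list(groups.values())


def cooccurrence_matrix(api_index, groups):
    """Symmetric 0/1 matrix: entry (i, j) is 1 if APIs i and j share a group."""
    position = {api: k for k, api in enumerate(api_index)}
    size = len(api_index)
    matrix = [[0] * size for _ in range(size)]
    for group in groups:
        members = sorted(position[api] for api in group)
        for i in members:
            matrix[i][i] = 1  # assumption: an API is related to itself
        for i, j in combinations(members, 2):
            matrix[i][j] = matrix[j][i] = 1
    return matrix


# Hypothetical records, one per extracted API call.
RECORDS = [
    {"api": "Ljava/io/FileOutputStream;->write", "block": "a.apk:lock",
     "package": "Ljava/io", "invoke": "invoke-virtual"},
    {"api": "Ljava/lang/Runtime;->getRuntime", "block": "a.apk:lock",
     "package": "Ljava/lang", "invoke": "invoke-static"},
    {"api": "Ljava/lang/Runtime;->exec", "block": "b.apk:run",
     "package": "Ljava/lang", "invoke": "invoke-virtual"},
]

API_NAMES = sorted({record["api"] for record in RECORDS})
B = cooccurrence_matrix(API_NAMES, group_by(RECORDS, "block"))    # same code block
P = cooccurrence_matrix(API_NAMES, group_by(RECORDS, "package"))  # same package
I = cooccurrence_matrix(API_NAMES, group_by(RECORDS, "invoke"))   # same invoke method
```

With `API_NAMES` sorted as write, exec, getRuntime, the off-diagonal ones in `B`, `P`, and `I` mark the pairs (write, getRuntime), (exec, getRuntime), and (write, exec) respectively.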

31 HIN Construction Rationale: A = {App, API}; R = {contains, codeblock, package, invokemethod}

32 HIN Source:

33 Meta-paths: Revision There can be multiple API calls satisfying this particular meta-path constraint

34 Computing Similarities: Commuting Matrices

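35 Commuting Matrices Example Let $G_{App,API}$ be the adjacency matrix between apps and API calls, i.e. the matrix $A$. For the meta-path $App \xrightarrow{contains} API \xrightarrow{contains^{-1}} App$, the commuting matrix of apps is $G_{App,API}\, G_{API,App}$, which is equal to $AA^T$. Given this matrix, the similarity between app $a_i$ and app $a_j$ is $a_i^T a_j$, the dot product of two feature vectors, where $a_i$ is the $i$-th row of $A$ written as a column vector. Each feature vector for this meta-path is simply the bag-of-APIs for an app.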
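
The commuting-matrix computation can be sketched with plain list-based matrix products. The matrices `A` (apps by APIs) and `B` (API co-occurrence in a code block) below are made-up toy values; the second product, $ABA^T$, corresponds to the longer meta-path App -> API -> (same code block) -> API -> App.

```python
def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    right_t = transpose(right)
    return [[sum(a * b for a, b in zip(row, column)) for column in right_t]
            for row in left]


def commuting_matrix(*factors):
    """Multiply the adjacency matrices along a meta-path, left to right."""
    result = factors[0]
    for factor in factors[1:]:
        result = matmul(result, factor)
    return result


# Toy data: 3 apps x 3 APIs, and API-API code-block co-occurrence.
A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 0]]
B = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]

AAT = commuting_matrix(A, transpose(A))       # App-API-App
ABAT = commuting_matrix(A, B, transpose(A))   # App-API-API-App via code block
```

In this toy example apps 1 and 2 share no API, so their entry in `AAT` is 0, yet their entry in `ABAT` is 1 because an API of app 1 co-occurs in a code block with an API of app 2. This is the kind of higher-level relatedness the longer meta-paths are meant to capture.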

36 Possible Meta-Paths Source:

39 Experiment Setup
Data: two datasets from Comodo Cloud Security Center.
- Android apps from Jan to Feb: 1834 training samples (920 benign, 914 malicious) and 500 test samples (198 benign, 302 malicious).
- One month of data: 30,000 Android apps, split into benign and malicious.
Experiments: 1) evaluate the performance of the proposed method; 2) compare the system against other classification models; 3) compare against commercial mobile security products; and 4) evaluate on a large, real sample from industry.

44 Impact
- HinDroid has already been incorporated into the scanning tool of Comodo's Mobile Security Product.
- HinDroid has been used to predict the daily sample collection from Comodo Cloud Security Center.
- HinDroid has been deployed and tested on the real daily sample collection for around half a year (about 2,700,000 Android apps in total have been either trained on or tested).
- In practice, an anti-malware analyst has to spend at least 8 hours to manually analyze 40 Android apps for malware detection. Using HinDroid, the analysis of about 15,000 file samples can be performed within minutes with multiple servers. Cost effective!

45 Bottom Line As stated in the paper: "... HinDroid we use more expressive representation for the data, and build the connection between the higher-level semantics of the data and the final results." and "... more labels is not as important as the need of more expressive representations of data."

46 Heterogeneous Networks to Study Critical Infrastructure Failure Cascades Paper: HotSpots: Failure Cascades on Heterogeneous Critical Infrastructure Networks Authors: Chen, L., Xu, X., Lee, S., Duan, S., Tarditi, A. G., Chinthavali, S., & Prakash, B. A. ACM International Conference on Information and Knowledge Management (CIKM) 2017

47 HN to study critical infrastructure failure cascades
- Domain: critical infrastructure (CI) systems that provide power/electricity, water, communication, etc., each of which depends on the others in some way.
- Overall objective: study how the failure of one such system cascades to the others when CIs fail (more likely during a crisis such as a blackout or hurricane).
- Specific task: find the k CIs whose failure would maximize failures across ALL CIs (or CI networks).
- Problems with current efforts: 1) they work on one CI; 2) they don't consider the dynamics of the system; 3) they use relatively simple models.

48 Study Design: Network Creation
Representing CI interdependencies as a heterogeneous network:
- Deliver natural gas as fuel to generate power
- Send compressed natural gas via pipelines
- Send generated power via transmission lines
- Move power to substations
- Distribute power to local natural gas compressors
*CI components extracted from the HSIP and EIA datasets on power systems and natural gas systems. [1][3]

49 Cascade Model: F-CAS
Rules of cascade per CI. Vulnerable to local failures being amplified system-wide (further, greater failures). Problem: hard to model behavior.
- Substation: fails when there is no path to a power plant.
- Gas compressor: fails when its connected substation fails.
- Power plant: fails when its connected gas compressors fail.
- Pipeline: the connection between power plants and gas compressors. Doesn't depend on anybody; no cascading effects in failures.
- Transmission:
  - Naive: build a co-parent network, then apply the IC (independent cascade) model.
  - Real: probability of failure based on the failure of the parent.
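
The substation rule (a substation fails when it has no path to a power plant) can be sketched as a reachability check. This covers only that one deterministic rule, not the full F-CAS model; the function name, the edge-list representation, and the toy grid are assumptions made for illustration.

```python
from collections import deque


def failed_substations(edges, power_plants, substations, failed_nodes):
    """Return substations with no surviving path from any power plant.

    `edges` is a list of directed (source, target) pairs describing power
    flow; `failed_nodes` are removed from the network before the search.
    """
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    failed = set(failed_nodes)
    # Breadth-first search from every power plant that is still working.
    queue = deque(plant for plant in power_plants if plant not in failed)
    reachable = set(queue)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in failed and neighbor not in reachable:
                reachable.add(neighbor)
                queue.append(neighbor)
    return {s for s in substations if s not in reachable}


# Hypothetical toy grid: one plant, three transmission nodes, three substations.
TOY_EDGES = [
    ("plant", "t1"), ("t1", "t2"), ("t1", "t3"),
    ("t2", "s1"), ("t3", "s2"), ("t2", "s3"), ("t3", "s3"),
]
TOY_SUBSTATIONS = ["s1", "s2", "s3"]

after_t2_fails = failed_substations(TOY_EDGES, ["plant"], TOY_SUBSTATIONS, {"t2"})
after_t1_fails = failed_substations(TOY_EDGES, ["plant"], TOY_SUBSTATIONS, {"t1"})
```

In the toy grid, losing `t2` only cuts off `s1` because `s3` is still fed through `t3`, while losing `t1` cuts off every substation.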

50 Problem Definition Problem 1 (Max-Sub): Given a heterogeneous network $G$, the F-CAS model, and a value $k$, find the best set $S^*$ of $k$ transmission nodes to fail such that the expected number of finally failed substations is maximized: $S^* = \arg\max_{|S| = k} \mathbb{E}[\#s \mid S]$, where $\#s$ is the number of substations that would eventually fail given the initial failure set $S$.

51 Problem Definition Problem 2 (Max-SubBus): Given a heterogeneous network $G$, the F-CAS model, and a value $k$, find the best set $S^*$ of $k$ transmission nodes to fail such that the expected number of finally failed substations and transmission nodes/lines is maximized: $S^* = \arg\max_{|S| = k} \mathbb{E}[\#s + \#t \mid S]$, where $\#t$ is the number of transmission nodes/lines that would eventually fail given the initial failure set $S$. Note: both Max-Sub and Max-SubBus are NP-hard.

52 Approach/Methodology Two scenarios: 1) no loop in the failure cascade; 2) loop in the failure cascade. Estimating $\Pr(s_i \mid S)$ empirically makes the objective hard to optimize. Solution: dominator tree.

53 Approach/Methodology: Scenario 1 Scenario 1: no loop in the failure cascade (power plant -> substation failure). The estimate of $\Pr(s_i \mid S)$ is based on the probability of any transmission node on the dominator tree path failing: if any such $t_i$ fails, $s_i$ is bound to fail. Given that, Objective function for Max-Sub: Objective function for Max-SubBus: Main contribution: with the dominator-tree-based estimate, the problem can be solved near-optimally using a greedy algorithm (otherwise not possible).
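
The idea of combining dominators with a greedy selection can be sketched in a simplified, deterministic form. A node dominates a substation if every path from the source to that substation passes through it, so its failure alone guarantees the substation fails. The sketch below computes dominator sets with a basic iterative algorithm and then greedily picks the k candidates that cover the most substations. It is not the HotSpots algorithm: it ignores failure probabilities and counts only substations guaranteed to fail by a single chosen node. The virtual source `src` and the toy grid are assumptions.

```python
def dominators(edges, root):
    """Map each node reachable from root to the set of nodes dominating it."""
    successors, predecessors = {}, {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        predecessors.setdefault(target, []).append(source)
    # Collect the nodes reachable from the root.
    reachable, stack = {root}, [root]
    while stack:
        for nxt in successors.get(stack.pop(), []):
            if nxt not in reachable:
                reachable.add(nxt)
                stack.append(nxt)
    # Iterate dom(n) = {n} | intersection of dom(p) over predecessors p.
    dom = {node: set(reachable) for node in reachable}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for node in reachable - {root}:
            preds = [p for p in predecessors[node] if p in reachable]
            new = set.intersection(*(dom[p] for p in preds)) | {node}
            if new != dom[node]:
                dom[node], changed = new, True
    return dom


def greedy_critical_nodes(edges, root, candidates, substations, k):
    """Greedily pick up to k candidates that dominate the most substations."""
    dom = dominators(edges, root)
    covers = {t: {s for s in substations if s in dom and t in dom[s]}
              for t in candidates}
    chosen, covered = [], set()
    for _ in range(k):
        best = max((t for t in candidates if t not in chosen),
                   key=lambda t: len(covers[t] - covered), default=None)
        if best is None:
            break
        chosen.append(best)
        covered |= covers[best]
    return chosen, covered


# Hypothetical grid: virtual source -> two plants -> transmission -> substations.
GRID = [
    ("src", "p1"), ("src", "p2"), ("p1", "t1"), ("p2", "t2"),
    ("t1", "s1"), ("t1", "s2"), ("t2", "s3"),
    ("t1", "t3"), ("t2", "t3"), ("t3", "s4"),
]
CHOSEN, COVERED = greedy_critical_nodes(
    GRID, "src", ["t1", "t2", "t3"], ["s1", "s2", "s3", "s4"], k=2)
```

In the toy grid the greedy pass picks `t1` first because it dominates two substations, then `t2`. Ties are broken by the order of the candidate list.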

54 Experiment 1: Effectiveness Dataset: HSIP Gold data and EIA data for the states TN, PA, FL, and OH.

55 Experiment 2: Scalability How the method scales as the number of seeds $k$ and the size of the network $|V|$ change.

56 Case study 1: Estimate/Predict damage of hurricane Setup: overlay the path of Hurricane Sandy on the heterogeneous network G. Estimate: immediate impact/damage and predicted damage. The study's results, which predict cascading loop trends, complement existing hurricane assessment tools by including the cascade effect.

57 Case study 2: 2003 NE Blackout Background: an initial study showed that within a couple of hours of the first transmission line failure, many more lines failed, causing a cascade of failures throughout southeastern Canada and 8 NE states. Case study: use the heterogeneous network G overlapping Ohio to identify the top 5 vulnerable/critical nodes.

58 Case study 2: Results
Insights:
- On the OH map, the top-right node identified was indeed critical.
- The nodes identified should be either on large generation plants or on transmission lines.
- As seen in the figure, the identified nodes corresponded to areas with several converging lines or high-voltage lines.

59 Study Extendability via User Interface - UI provided to run simulations on finding critical nodes in various other maps. - This involves: - Generating the Heterogeneous networks. - Running cascade simulations - Getting real-time failure statistics via visualizations.

60 Impact and Bottom Line
- The HotSpots algorithm, the HIN generation toolkit, and the F-CAS model are a first attempt to analyze up to 5 different critical infrastructures.
- Adding additional components is easy.
- The methods capture path-based and neighbor-based failure conditions.
- Path-based failure cascading is not restricted to transmission networks and is applicable to a wide range of CI systems.

61 Reflections: Lessons Learnt
- What heterogeneous networks are: 1) network structure and 2) rich semantic meaning of the structural types of objects and links
- Types of datasets that have been represented via HINs
- Various graph mining algorithms that have been designed for HINs
- More specifically, how heterogeneous representation has helped to: predict/classify malicious Android apps, and identify a subset of critical infrastructures whose failure would have the biggest catastrophic impact on the availability of vital resources

HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network

HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network Shifu Hou 1, Yanfang Ye 1, Yangqiu Song 2, Melih Abdulhayoglu 3 1. Department of CSEE, West

More information

WE know that most real systems usually consist of a

WE know that most real systems usually consist of a IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 29, NO. 1, JANUARY 2017 17 A Survey of Heterogeneous Information Network Analysis Chuan Shi, Member, IEEE, Yitong Li, Jiawei Zhang, Yizhou Sun,

More information

Chapter 1 Introduction

Chapter 1 Introduction Chapter 1 Introduction Abstract In this chapter, we introduce some basic concepts and definitions in heterogeneous information network and compare the heterogeneous information network with other related

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks Yu Shi, Huan Gui, Qi Zhu, Lance Kaplan, Jiawei Han University of Illinois at Urbana-Champaign (UIUC) Facebook Inc. U.S. Army Research

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Mining Trusted Information in Medical Science: An Information Network Approach

Mining Trusted Information in Medical Science: An Information Network Approach Mining Trusted Information in Medical Science: An Information Network Approach Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign Collaborated with many, especially Yizhou

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Recommender Systems II Instructor: Yizhou Sun yzsun@cs.ucla.edu May 31, 2017 Recommender Systems Recommendation via Information Network Analysis Hybrid Collaborative Filtering

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Graph Classification in Heterogeneous

Graph Classification in Heterogeneous Title: Graph Classification in Heterogeneous Networks Name: Xiangnan Kong 1, Philip S. Yu 1 Affil./Addr.: Department of Computer Science University of Illinois at Chicago Chicago, IL, USA E-mail: {xkong4,

More information

Integrating Meta-Path Selection with User-Preference for Top-k Relevant Search in Heterogeneous Information Networks

Integrating Meta-Path Selection with User-Preference for Top-k Relevant Search in Heterogeneous Information Networks Integrating Meta-Path Selection with User-Preference for Top-k Relevant Search in Heterogeneous Information Networks Shaoli Bu bsl89723@gmail.com Zhaohui Peng pzh@sdu.edu.cn Abstract Relevance search in

More information

Chapter 2 Survey of Current Developments

Chapter 2 Survey of Current Developments Chapter 2 Survey of Current Developments Abstract Heterogeneous information network (HIN) provides a new paradigm to manage networked data. Meanwhile, it also introduces new challenges for many data mining

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

An overview of Graph Categories and Graph Primitives

An overview of Graph Categories and Graph Primitives An overview of Graph Categories and Graph Primitives Dino Ienco (dino.ienco@irstea.fr) https://sites.google.com/site/dinoienco/ Topics I m interested in: Graph Database and Graph Data Mining Social Network

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as

More information

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH Abstract We analyze the factors contributing to the relevance of a web-page as computed by popular industry web search-engines. We also

More information

Lily: Ontology Alignment Results for OAEI 2009

Lily: Ontology Alignment Results for OAEI 2009 Lily: Ontology Alignment Results for OAEI 2009 Peng Wang 1, Baowen Xu 2,3 1 College of Software Engineering, Southeast University, China 2 State Key Laboratory for Novel Software Technology, Nanjing University,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

More information

Citation Prediction in Heterogeneous Bibliographic Networks

Citation Prediction in Heterogeneous Bibliographic Networks Citation Prediction in Heterogeneous Bibliographic Networks Xiao Yu Quanquan Gu Mianwei Zhou Jiawei Han University of Illinois at Urbana-Champaign {xiaoyu1, qgu3, zhou18, hanj}@illinois.edu Abstract To

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Creating a Recommender System. An Elasticsearch & Apache Spark approach Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused

More information

Image Similarity Measurements Using Hmok- Simrank

Image Similarity Measurements Using Hmok- Simrank Image Similarity Measurements Using Hmok- Simrank A.Vijay Department of computer science and Engineering Selvam College of Technology, Namakkal, Tamilnadu,india. k.jayarajan M.E (Ph.D) Assistant Professor,

More information

Understanding policy intent and misconfigurations from implementations: consistency and convergence

Understanding policy intent and misconfigurations from implementations: consistency and convergence Understanding policy intent and misconfigurations from implementations: consistency and convergence Prasad Naldurg 1, Ranjita Bhagwan 1, and Tathagata Das 2 1 Microsoft Research India, prasadn@microsoft.com,

More information

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá INTRODUCTION TO DATA MINING Daniel Rodríguez, University of Alcalá Outline Knowledge Discovery in Datasets Model Representation Types of models Supervised Unsupervised Evaluation (Acknowledgement: Jesús

More information

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety Abhishek

More information

Method to Study and Analyze Fraud Ranking In Mobile Apps

Method to Study and Analyze Fraud Ranking In Mobile Apps Method to Study and Analyze Fraud Ranking In Mobile Apps Ms. Priyanka R. Patil M.Tech student Marri Laxman Reddy Institute of Technology & Management Hyderabad. Abstract: Ranking fraud in the mobile App

More information

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges
