Heterogeneous (Information) Networks. CS 6604: Data Mining Large Networks and Time-Series Paper Presentation Prashant Chandrasekar 11/01/17


1 Heterogeneous (Information) Networks CS 6604: Data Mining Large Networks and Time-Series Paper Presentation Prashant Chandrasekar 11/01/17

2 Overview Topic: Heterogeneous Information Networks (HINs). Outline: a paper introducing the field of HIN mining, then two really cool applications of HINs. Objective/Takeaway: pique your interest in the field and, more importantly, show how HINs can be a part of your personal research, class project, or hobbies.

3 Heterogeneous Information Network Analysis Paper: A Survey of Heterogeneous Information Network Analysis. Authors: Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and Philip S. Yu. IEEE Transactions on Knowledge and Data Engineering, 29(1):17-37, Jan. 2017.

4 Background Real systems involve a large number of interactions between multi-typed components. Information networks are a ubiquitous way to model and represent such interacting components. Mining them relates to work in link analysis, network analysis, network science, and graph mining. Most contemporary information network analyses are restricted to single-typed objects (nodes) and/or links (edges). HIN: allows fusing more information and gives a richer semantic representation.

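5 Concepts and Definitions Def 1: Information Network. A directed graph $G = (V, E)$ with an object type mapping function $\phi: V \to A$ and a link type mapping function $\psi: E \to R$. Each object belongs to exactly one object type, $\phi(v) \in A$. Each link belongs to exactly one relation type, $\psi(e) \in R$. If two links belong to the same relation type, they share the same starting and ending object types.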

6 Concepts and Definitions Def 1: Heterogeneous/Homogeneous Information Network. A directed graph $G = (V, E)$ with an object type mapping function $\phi: V \to A$ and a link type mapping function $\psi: E \to R$. Each object belongs to exactly one object type. Each link belongs to exactly one relation type. If two links belong to the same relation type, they share the same starting and ending object types. The network is heterogeneous if $|A| > 1$ or $|R| > 1$; otherwise it is homogeneous.
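The definition above can be made concrete with a small data structure. The following is a minimal sketch, not code from the survey: the class name `InformationNetwork`, the names `phi` and `psi` for the two mapping functions, and the toy bibliographic objects are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InformationNetwork:
    """Directed graph G = (V, E) with an object type map and a link type map."""
    phi: dict = field(default_factory=dict)  # object -> its single object type
    psi: dict = field(default_factory=dict)  # (source, target) -> its single relation type

    def add_object(self, obj, obj_type):
        self.phi[obj] = obj_type

    def add_link(self, src, dst, relation):
        # Simplification: at most one link per ordered pair of objects.
        self.psi[(src, dst)] = relation

    def is_heterogeneous(self):
        # Heterogeneous if more than one object type OR more than one relation type.
        object_types = set(self.phi.values())
        relation_types = set(self.psi.values())
        return len(object_types) > 1 or len(relation_types) > 1


# Toy bibliographic network (hypothetical data).
bib = InformationNetwork()
bib.add_object("a1", "Author")
bib.add_object("p1", "Paper")
bib.add_object("v1", "Venue")
bib.add_link("a1", "p1", "writes")
bib.add_link("p1", "v1", "published_in")

# Toy friendship network with one object type and one relation type.
friends = InformationNetwork()
friends.add_object("u1", "User")
friends.add_object("u2", "User")
friends.add_link("u1", "u2", "friend_of")

print(bib.is_heterogeneous(), friends.is_heterogeneous())
```

Keeping the two type maps separate from the graph structure mirrors the definition: the same edge list becomes homogeneous or heterogeneous depending only on how many distinct types the maps produce.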

7 HIN Example: Bibliographic dataset

8 HIN: Meta-Paths Key difference from homogeneous networks: two objects can be connected via different paths, and each path can have its own meaning.

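9 Meta-Path Definition Given a network schema $S = (A, R)$ (remember from the previous slide), a meta-path $P$ is of the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$. It defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between objects of types $A_1$ and $A_{l+1}$. If there are no multiple relations between the same pair of object types, the meta-path can be written using object types alone: $P = (A_1 A_2 \ldots A_{l+1})$. For example, for bibliographic data we have the length-2 meta-path $\text{Author} \xrightarrow{\text{write}} \text{Paper} \xrightarrow{\text{written-by}} \text{Author}$, or APA for short.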
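To illustrate how a meta-path constrains concrete paths, here is a minimal sketch that enumerates all path instances matching a sequence of object types. The function name `path_instances`, the adjacency-dict representation, and the toy authors and papers are assumptions for illustration only.

```python
def path_instances(phi, links, meta_path):
    """Return every concrete path whose object types follow meta_path.

    phi:       dict mapping each object to its type
    links:     dict mapping each object to its neighbours (both directions listed)
    meta_path: sequence of object types, e.g. ("Author", "Paper", "Author")
    """
    # Start from every object of the first type in the meta-path.
    paths = [[obj] for obj, obj_type in phi.items() if obj_type == meta_path[0]]
    for next_type in meta_path[1:]:
        extended = []
        for path in paths:
            for neighbour in links.get(path[-1], ()):
                if phi[neighbour] == next_type:
                    extended.append(path + [neighbour])
        paths = extended
    return paths


# Hypothetical data: a1 and a2 wrote p1; a2 and a3 wrote p2.
phi = {"a1": "Author", "a2": "Author", "a3": "Author", "p1": "Paper", "p2": "Paper"}
links = {
    "a1": ["p1"], "a2": ["p1", "p2"], "a3": ["p2"],
    "p1": ["a1", "a2"], "p2": ["a2", "a3"],
}

apa = path_instances(phi, links, ("Author", "Paper", "Author"))
for instance in apa:
    print("-".join(instance))
```

Note that the enumeration includes paths that return to the starting author (such as a1-p1-a1); whether to keep those depends on the mining task.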

10 Meta-Path: Bibliographic Dataset Question/Challenge: Would the output of a task depend on the meta-path? For example, when finding similar authors, would the result be different if we chose meta-path (a) rather than meta-path (b)?

11 Related network types Homogeneous Network: $|A| = 1$, $|R| = 1$. Special case of HIN. Can be derived from an HIN through network projection. Its analysis techniques are not directly applicable to HINs. Multi-Relational Network: $|A| = 1$, $|R| > 1$. Special case of HIN. Multi-Dimensional/Mode Network: same as a Multi-Relational Network. Composite Network: users in the network have various relationships, behave differently in each subnetwork, and share latent variables; same as a Multi-Relational Network. Complex Network: has non-trivial topological features. Fields of study include math, physics, biology, CS, etc. Real-world networks (such as social and biological networks) are complex networks. Real-world HINs might be complex networks.

12 Common HIN Network Schemas Multi-relational network with single-typed objects: Facebook, Xiaonei, etc. Bipartite: user-item, document-word (extended to k-partite). Star schema: bibliographic data, movie data, US patent data (typically derived from DB tables; most popular). Multi-hub network: most common for bioinformatics data.

13 Complex HINs Multiple HINs (studying connections across two social networks). Schema-rich networks (based on ontologies written with Semantic Web standards), such as Knowledge Graph.

14 Summary of Research Work on HIN

15 Summary of Research Work on HIN mining - 100 papers analyzed - Seven main data mining tasks: - Similarity Measure - Clustering - Classification - Link Prediction - Recommendation - Others

16 HIN Data Mining: Similarity Measure - Two approaches: link-based (Personalized PageRank [54], SimRank [55], etc.) and attribute-based (feature-value comparison using the Jaccard coefficient, cosine similarity, etc.). - Similarity on HIN: considers the meta-path along with structural similarity, since two different meta-paths carry different semantic meanings. - Example: find the authors most similar to Christos Faloutsos. APA says his students are most similar; APVPA (correctly) shows the most similar authors in the same field. Other works: - PathSim uses symmetric meta-paths [14] - RelSim uses meta-paths to measure similarity between relations [59] - HeteSim measures the relevance of multi-typed objects using an arbitrary meta-path [13][62] - Social influence using object similarity + influence in HIN [67]
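As an illustration of meta-path-based similarity, the sketch below computes PathSim scores for a symmetric meta-path such as APVPA. PathSim scores a pair as twice the number of path instances between the two objects, divided by the sum of the path instances from each object back to itself. The helper names and the author-by-venue paper counts are hypothetical toy data, not values from the survey.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def pathsim(M, x, y):
    """PathSim on a commuting matrix M of a symmetric meta-path."""
    denom = M[x][x] + M[y][y]
    return 2 * M[x][y] / denom if denom else 0.0


# Hypothetical author-by-venue paper counts (rows: authors, columns: venues).
# Author 1 is far more prolific than authors 0 and 2 in the same venues.
author_venue = [
    [2, 1],
    [50, 20],
    [4, 2],
]

# Commuting matrix for Author-Paper-Venue-Paper-Author.
M = matmul(author_venue, transpose(author_venue))

print(pathsim(M, 0, 1))  # similarity of author 0 to the prolific author 1
print(pathsim(M, 0, 2))  # similarity of author 0 to the peer author 2
```

The normalization by self-paths is what keeps a highly prolific author from dominating every ranking: in this toy data author 0 scores closer to author 2, who publishes at a similar volume, than to author 1.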

17 HIN Data Mining: Clustering - Traditionally based on object features and done on homogeneous networks. Heterogeneity in object types makes the task harder. - Example: clustering a bibliographic dataset results in sub-network clusters, each pertaining to a particular research/CS domain. Clustering this way preserves information. The rich information in an HIN helps clustering by integrating additional information and/or improving learning tasks. - Attribute information integration: using attribute incompleteness, vertex attributes, random fields, etc. - Text information integration: e.g. a topic model of the contents, with clusters based on topics. - Integration with other mining tasks, such as ranking: e.g. ranking-based clustering, where ranking and clustering mutually enhance each other. - Other information: social-influence-based clustering using connections and social activities.


18 HIN Data Mining: Classification - Traditionally done on objects satisfying the IID assumption (which may not hold in an HIN). - Classification in HIN: can classify multiple types of objects simultaneously; meta-paths are widely used in HIN classification. - Example: 4 types of interlinked objects. Classification = a process of knowledge propagation that derives correlations among objects. Other approaches: - Represent meta-paths in a latent space to label multiple nodes - Model mutual influence for multi-label classification - Mine multiple relationships for multi-label classification - Meta-paths as feature generators (GNetMine [21], HetPathMine [99]) - Meta-path-based dependencies for collective classification

19 HIN Data Mining: Link Prediction - Challenge with HIN: the links to be predicted are of different types, so multiple types of links need to be predicted collectively. - Meta-path-based approaches: a two-step process: 1) extract meta-path-based features; 2) train a regression/classification model to predict links. [23][24][110][111][112] PathPredict solves co-authorship prediction using meta-paths and logistic regression. [23] Path-based features have also been used to predict a company's organizational chart. - Probabilistic-model-based approaches: predict links by modeling influence propagation between heterogeneous relationships. - Some work includes link prediction across multiple HINs and dynamic link prediction, such as predicting the evolution of community membership.
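The two-step process above (meta-path features, then a classifier) can be sketched in a few lines. This is a toy illustration under stated assumptions, not PathPredict itself: the feature vectors are hypothetical path counts for candidate author pairs under two unnamed meta-paths, and the logistic regression is a plain gradient-descent implementation written here for self-containment.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression with batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Step 1 (assumed already done): path counts per candidate pair under two
# meta-paths. All numbers are hypothetical.
features = [[3, 2], [2, 3], [4, 1], [0, 0], [1, 0], [0, 1]]
# 1 = the pair later formed a link, 0 = it did not (hypothetical labels).
labels = [1, 1, 1, 0, 0, 0]

# Step 2: train a classifier on the meta-path features.
w, b = train_logistic(features, labels)

print(predict_proba(w, b, [3, 3]))  # well-connected pair
print(predict_proba(w, b, [0, 0]))  # pair with no connecting paths
```

In practice the learned weights also indicate which meta-paths carry predictive signal for a given link type.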

20 HIN Data Mining: Recommendation - The richer information/semantics of an HIN make for better recommendations. - Constructing an HIN for recommendation helps fuse all the information that could potentially be used for the task. - Meta-paths are well suited to exploring relations between objects. HeteRecom finds similarities between movies based on the semantic information in meta-paths. [43] SemRec is a personalized recommender system that builds a weighted HIN by using movie ratings on links. [48] - Fusing heterogeneous information to help with recommendation: context-dependent matrix factorization models.

21 HIN Data Mining: Others Information Fusion: the process of merging information from heterogeneous sources with different conceptual, contextual, and typographical representations, as seen in tasks such as data schema integration in data warehouses, protein-protein interaction networks, and ontology mapping in web semantics. Related work includes social network matching and various solutions for the alignment problem. Intuition: fusing HINs improves the previously covered tasks (more contextual data). Application Systems: systems whose design is based on HINs: a system for exploring and analyzing a topical hierarchy constructed from an HIN; an online social media spam detection system for social network security; malware detection (details in the following paper).

22 Shortcomings / Future Mining The field of HIN and HIN mining is relatively young. Future considerations include: Integrating attribute values to build weighted HINs: real networks may contain attribute values on links, and these values may carry important information. Dynamic HINs: to represent and model time-series data. Network construction for complex data: semantically rich RDF-based graphs (managing objects and relations with so many types and meta-paths). HINs with more descriptive meta-paths. Methods to optimize/rank the selection of meta-paths for data mining tasks.

23 Heterogeneous Information Network to detect Malware in Android Apps Paper: HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network Authors: Hou, S., Ye, Y., Song, Y., & Abdulhayoglu, M. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2017

24 Background Example App: Locker.apk Malicious: Once installed, victim is locked out from phone and is asked to pay a ransom. Goal: Predict malicious apps published in Android. Key methodology: Feature extraction and learning via HIN and meta-paths

25 Preliminary concepts Android app: compiled and packaged as a single .apk file. Includes app code (.dex file), resources, assets, and the manifest file. The Dex (Dalvik Executable) file format holds the compiled code (unreadable). Smali is a .dex assembler/disassembler that provides the code as readable Smali code.

26 Smali Code

27 Feature Extraction: Common APIs across Apps For each app, note down the APIs called in its Smali code. - Parse Smali code for API call extraction. Represent occurrence as a matrix A, where $a_{ij} = 1$ if app $i$ contains API $j$ and 0 otherwise. Source: des_ye.pdf

28 FE: Relationship with APIs -> Code Block Find APIs that occur in the same code block. - Smali code block markup: .method ... .end method. Represent co-occurrence as a matrix B, where $b_{ij} = 1$ if API $i$ and API $j$ co-exist in the same code block and 0 otherwise. Source: des_ye.pdf
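The extraction steps on the last two slides can be sketched with a small parser. This is a rough illustration, not HinDroid's extractor: the regular expression, the function names, and the embedded Smali snippet are assumptions, and the pattern only handles plain `invoke-*` instructions whose target is a class method.

```python
import re
from itertools import combinations

# Matches the target of a Smali invoke instruction, for example:
#   invoke-virtual {v0, v1}, Ljava/lang/Runtime;->exec(Ljava/lang/String;)Ljava/lang/Process;
INVOKE_RE = re.compile(r"^\s*(invoke-[\w/-]+)\s+\{[^}]*\},\s*(L[^;]+;->[^\s(]+)")


def extract_code_blocks(smali_text):
    """Return one set of API calls per .method ... .end method block."""
    blocks, current = [], None
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            current = set()
        elif stripped.startswith(".end method"):
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None:
            match = INVOKE_RE.match(line)
            if match:
                current.add(match.group(2))
    return blocks


def app_api_set(blocks):
    """All APIs an app calls: one row of the app-by-API matrix."""
    return set().union(*blocks) if blocks else set()


def same_block_pairs(blocks):
    """Unordered API pairs that co-occur in at least one code block."""
    return {pair for block in blocks for pair in combinations(sorted(block), 2)}


# Hypothetical Smali fragment with two methods.
SMALI = """
.method public lock()V
    invoke-static {}, Ljava/lang/Runtime;->getRuntime()Ljava/lang/Runtime;
    invoke-virtual {v0, v1}, Ljava/lang/Runtime;->exec(Ljava/lang/String;)Ljava/lang/Process;
.end method
.method public read()V
    invoke-virtual {v0}, Ljava/io/FileInputStream;->read()I
.end method
"""

blocks = extract_code_blocks(SMALI)
print(sorted(app_api_set(blocks)))
print(same_block_pairs(blocks))
```

The first capture group of the pattern holds the invoke kind (such as `invoke-static`), which is the information the invoke-method relation would need.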

29 FE: Relationship with APIs -> Package Intuition: API calls belonging to the same package show similar intent. - Example: API-[1-4] are part of Package 1; API-[5-8] are part of Package 2. Represent the relationship as a matrix P, where $p_{ij} = 1$ if API $i$ and API $j$ belong to the same package and 0 otherwise. Source: des_ye.pdf

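30 FE: Relationship with APIs -> InvokeMethods Intuition: API calls that use the same invoke method are like words that have the same part of speech. Represent the relationship as a matrix I, where $i_{ij} = 1$ if API $i$ and API $j$ use the same invoke method and 0 otherwise. Source: des_ye.pdf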

31 HIN Construction Rationale: A = {App, API}; R = {contains, codeblock, package, invokemethod}

32 HIN [Figure]

33 Meta-paths: Revision There can be multiple API calls satisfying this particular meta-path constraint

34 Computing Similarities: Commuting Matrices

35 Commuting Matrices Example Let $G_{App,API}$ be the adjacency matrix between apps and API calls. For the meta-path $App \xrightarrow{contains} API \xrightarrow{contains^{-1}} App$, the commuting matrix of apps is $G_{App,API}\,G_{API,App} = G_{App,API}\,G_{App,API}^T$, which is equal to $AA^T$. Therefore, given this matrix $AA^T$, the similarity between app $a_i$ and app $a_j$ is $a_i^T a_j$, the dot product of two feature vectors. Each feature vector for this meta-path is simply the bag-of-APIs for an app.
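A small numeric sketch of the commuting-matrix computation follows. The matrices are hypothetical toy data; `A` plays the role of the app-by-API matrix and `B` the API-by-API same-code-block matrix, with the diagonal of `B` set to 1 as an assumption (an API trivially co-occurs with itself).

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


# Two apps, three APIs: app 0 calls APIs 0 and 1, app 1 calls APIs 1 and 2.
A = [
    [1, 1, 0],
    [0, 1, 1],
]

# APIs 0 and 1 co-occur in some code block; API 2 does not co-occur with them.
B = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]

# Meta-path App -> API -> App: entry (i, j) counts the APIs shared by apps i and j.
AAT = matmul(A, transpose(A))

# Meta-path App -> API -> (same code block) -> API -> App.
ABAT = matmul(matmul(A, B), transpose(A))

print(AAT)
print(ABAT)
```

Each commuting matrix is symmetric and can be used directly as a similarity (kernel) matrix over apps, which is why different meta-paths yield different views of app similarity.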


36 Possible Meta-Paths [Figure]


39 Experiment Setup Data: two datasets from Comodo Cloud Security Center. 1) Android apps from Jan to Feb: 1834 training samples (920 benign, 914 malicious) and 500 test samples (198 benign, 302 malicious). 2) One month of data: 30,000 Android apps, split between benign and malicious. Experiments: 1) evaluate the performance of the proposed method; 2) compare the system against other classification models; 3) compare against commercial mobile security products; and 4) evaluate on a large, real sample from industry.


44 Impact HinDroid has already been incorporated into the scanning tool of Comodo's Mobile Security product. HinDroid has been used to predict the daily sample collection from Comodo Cloud Security Center. HinDroid has been deployed and tested on the real daily sample collection for around half a year (about 2,700,000 Android apps in total have been either trained on or tested). In practice, an anti-malware analyst has to spend at least 8 hours to manually analyze 40 Android apps for malware detection. Using the developed system HinDroid, the analysis of about 15,000 file samples can be performed within minutes with multiple servers. Cost effective!

45 Bottom Line As stated in the paper, in HinDroid "we use more expressive representation for the data, and build the connection between the higher-level semantics of the data and the final results" and "... more labels is not as important as the need of more expressive representations of data."

46 Heterogeneous Networks to Study Critical Infrastructure Failure Cascades Paper: HotSpots: Failure Cascades on Heterogeneous Critical Infrastructure Networks Authors: Chen, L., Xu, X., Lee, S., Duan, S., Tarditi, A. G., Chinthavali, S., & Prakash, B. A. ACM International Conference on Information and Knowledge Management (CIKM) 2017

47 HN to study critical infrastructure failure cascades Domain: critical infrastructure (CI) systems that provide power/electricity, water, communication, etc., each of which depends on the others in some way. Overall objective: study the cascading effects that the failure of such systems has on one another (failures are more likely during a crisis such as a blackout or hurricane). Specific task: find the k CIs whose failure would maximize failures across ALL CIs (or CI networks). Problems with current efforts: 1) they work on one CI; 2) they don't consider the dynamics of the system; 3) they use relatively simple models.

48 Study Design: Network Creation Representing CI interdependencies as a heterogeneous network: DELIVER NATURAL GAS AS FUEL TO GENERATE POWER; SEND COMPRESSED NATURAL GAS VIA PIPELINES; SEND GENERATED POWER VIA TRANSMISSION LINES; MOVE POWER TO SUBSTATIONS; DISTRIBUTE POWER TO LOCAL NATURAL GAS COMPRESSORS. *CIs/components extracted from the HSIP and EIA datasets for power systems and the natural gas system. [1][3]

49 Cascade Model: F-Cas RULES OF CASCADE PER CI - SUBSTATION: FAIL WHEN NO PATH TO POWER PLANT - GAS COMPRESSORS: FAIL WHEN CONNECTED SUBSTATION FAILS - POWER PLANTS: FAIL WHEN CONNECTED GAS COMPRESSORS FAIL - PIPELINE: CONNECTION BETWEEN POWER PLANTS AND GAS COMPRESSORS. DON'T DEPEND ON ANYBODY. NO CASCADING EFFECTS IN FAILURES. - TRANSMISSION: VULNERABLE TO LOCAL FAILURES BEING AMPLIFIED SYSTEM-WIDE (FURTHER, GREATER FAILURES). PROBLEM: HARD TO MODEL BEHAVIOR. - NAIVE: BUILD CO-PARENT NETWORK, THEN IC MODEL - REAL: PROB. OF FAIL BASED ON FAIL OF PARENT
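The deterministic rules above can be sketched as a fixed-point simulation. This is a simplified illustration, not the F-Cas model from the paper: the probabilistic transmission rule is left out, a power plant is assumed to fail only when all of its supplying compressors have failed, and every name and the toy topology below are hypothetical.

```python
from collections import deque


def reachable_from_plants(adj, plants, failed):
    """Nodes connected to at least one live power plant through live nodes."""
    seen = {p for p in plants if p not in failed}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in failed and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen


def simulate_cascade(adj, plants, substations, power_of, fuel_of, initial_failures):
    """Apply the failure rules repeatedly until no new component fails.

    power_of: gas compressor -> substation that powers it
    fuel_of:  power plant -> gas compressors that fuel it
    """
    failed = set(initial_failures)
    changed = True
    while changed:
        changed = False
        live = reachable_from_plants(adj, plants, failed)
        for s in substations:  # substation: fails with no path to a plant
            if s not in failed and s not in live:
                failed.add(s)
                changed = True
        for c, s in power_of.items():  # compressor: fails with its substation
            if c not in failed and s in failed:
                failed.add(c)
                changed = True
        for p, comps in fuel_of.items():  # plant: assumed to need one live compressor
            if p not in failed and comps and all(c in failed for c in comps):
                failed.add(p)
                changed = True
    return failed


# Toy power network: plant1 feeds s1 via t1 and s2 via t2; plant2 feeds s3 via t3.
adj = {
    "plant1": ["t1", "t2"], "t1": ["plant1", "s1"], "s1": ["t1"],
    "t2": ["plant1", "s2"], "s2": ["t2"],
    "plant2": ["t3"], "t3": ["plant2", "s3"], "s3": ["t3"],
}
result = simulate_cascade(
    adj,
    plants=["plant1", "plant2"],
    substations=["s1", "s2", "s3"],
    power_of={"c1": "s1"},        # compressor c1 is powered by substation s1
    fuel_of={"plant2": ["c1"]},   # plant2 is fuelled by compressor c1
    initial_failures={"t1"},
)
print(sorted(result))
```

In the toy run, failing a single transmission node takes down a substation, the compressor it powers, the plant that compressor fuels, and finally a substation in a different part of the grid, which is the cross-infrastructure loop the slides describe.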

50 Problem Definition Problem 1 (Max-Sub): Given a heterogeneous network G, the F-CAS model, and a value k, find the best set S* of k transmission nodes to fail such that the expected number of finally failed substations is maximized: $S^* = \arg\max_{S} E[\#s \mid S]$, where #s is the number of substations that would eventually fail given the initial failure set S.

51 Problem Definition Problem 2 (Max-SubBus): Given a heterogeneous network G, the F-CAS model, and a value k, find the best set S* of k transmission nodes to fail such that the expected number of finally failed substations and transmission nodes/lines is maximized: $S^* = \arg\max_{S} E[\#s + \#t \mid S]$, where #t is the number of transmission nodes/lines that would eventually fail given the initial failure set S. Note: both Max-Sub and Max-SubBus are NP-hard.

52 Approach/Methodology Two scenarios: 1) no loop in the failure cascade; 2) loop in the failure cascade. Estimating $\Pr(s_i \mid S)$ empirically is hard to optimize. Solution: dominator tree.
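To show why dominators are useful here: a node u dominates v if every path from the source to v passes through u, so a failed dominator cuts v off from the source. The sketch below computes dominator sets with the simple iterative set-intersection method. It is an illustration under assumptions, not the paper's procedure: a single source node stands in for the power plant, and the toy grid is hypothetical.

```python
def dominators(succ, source):
    """Return, for each node, the set of nodes that dominate it.

    succ: dict mapping a node to the list of nodes it feeds (directed edges).
    """
    nodes = set(succ) | {v for targets in succ.values() for v in targets}
    preds = {n: set() for n in nodes}
    for u, targets in succ.items():
        for v in targets:
            preds[v].add(u)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[source] = {source}
    changed = True
    while changed:
        changed = False
        for n in nodes - {source}:
            pred_doms = [dom[p] for p in preds[n]]
            new = set.intersection(*pred_doms) if pred_doms else set()
            new = new | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Toy grid: the plant feeds t1 and t3; both feed t2; t2 feeds s1; t1 also feeds s2.
grid = {
    "plant": ["t1", "t3"],
    "t1": ["t2", "s2"],
    "t3": ["t2"],
    "t2": ["s1"],
}
dom = dominators(grid, "plant")
print(sorted(dom["s1"]))  # nodes whose failure disconnects s1
print(sorted(dom["s2"]))  # nodes whose failure disconnects s2
```

Here s1 survives the loss of t1 or t3 alone because t2 can be reached either way, but it cannot survive the loss of t2, which is exactly what its dominator set records.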

53 Approach/Methodology: Scenario 1 Scenario 1: no loop in the failure cascade (power plant -> substation failure). The estimate of $\Pr(s_i \mid S)$ is based on the probability of any transmission node on the dominator-tree path failing: if any $t_i$ fails, $s_i$ is bound to fail. Given that, objective functions are defined for Max-Sub and for Max-SubBus. Main contribution: with the dominator-tree-based estimation, the problem can be solved near-optimally using a greedy algorithm (otherwise, this is not possible).

54 Experiment 1: Effectiveness Dataset: HSIP Gold data and EIA data for the states TN, PA, FL, and OH.

55 Experiment 2: Scalability How the method scales as the number of seeds k and the size of the network |V| change.

56 Case study 1: Estimate/Predict damage of hurricane Setup: overlay Hurricane Sandy's path on the heterogeneous network G. Estimate: immediate impact/damage and predicted damage. Result: predicting cascading loop trends complemented existing hurricane assessment tools by including the cascade effect.

57 Case study 2: 2003 NE Blackout Background: an initial study showed that, over the couple of hours following the first transmission line failure, many more lines failed, causing a cascade of failures throughout southeastern Canada and 8 NE states. Case study: a heterogeneous network G covering Ohio, used to identify the top 5 vulnerable/critical nodes.

58 Case study 2: Results Insights - On the OH map, the top-right node identified was indeed critical. - Identified nodes should be either on large generation plants or on transmission lines. - As seen in the figure, the identified nodes corresponded to areas with several converging lines or high-voltage lines.

59 Study Extendability via User Interface - UI provided to run simulations on finding critical nodes in various other maps. - This involves: - Generating the Heterogeneous networks. - Running cascade simulations - Getting real-time failure statistics via visualizations.

60 Impact and Bottom Line - The HotSpots algorithm, the HIN generation toolkit, and the F-CAS model are a first attempt to analyze up to 5 different critical infrastructures. Adding additional components is easy. The methods capture path-based and neighbor-based failure conditions. Path-based failure cascading is not restricted to transmission networks and is applicable to a wide range of CI systems.

61 Reflections: Lessons Learnt - What heterogeneous networks are: 1) network structure and 2) rich semantic meaning of the types of objects and links - Types of datasets that have been represented via HINs - Various graph mining algorithms that have been designed for HINs - More specifically, how a heterogeneous representation has helped: predict/classify malicious Android apps, and identify a subset of critical infrastructures whose failure would have the most catastrophic impact on the availability of vital resources
