Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

Similar documents
Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Holistic Analysis of Multi-Source, Multi-Feature Data: Modeling and Computation Challenges

HUBify: Efficient Estimation of Central Entities across Multiplex Layer Compositions

BUBBLE RAP: Social-Based Forwarding in Delay-Tolerant Networks

Big Data Management and NoSQL Databases

COMP 465: Data Mining Still More on Clustering

Social Data Exploration

An overview of Graph Categories and Graph Primitives

Data Mining Concepts & Tasks

Edge Classification in Networks

Jure Leskovec, Cornell/Stanford University. Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Large Scale Graph Algorithms

An Empirical Analysis of Communities in Real-World Networks

Graph Mining and Social Network Analysis

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Socrates: A System for Scalable Graph Analytics C. Savkli, R. Carr, M. Chapman, B. Chee, D. Minch

Hybrid Recommendation System Using Clustering and Collaborative Filtering

Epilog: Further Topics

Exploring graph mining approaches for dynamic heterogeneous networks

Topology Enhancement in Wireless Multihop Networks: A Top-down Approach

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

Query Independent Scholarly Article Ranking

Broad Learning via Fusion of Heterogeneous Information

Numeric Ranges Handling for Graph Based Knowledge Discovery Oscar E. Romero A., Jesús A. González B., Lawrence B. Holder

Introduction to Big-Data

Understanding Clustering Supervising the unsupervised

Introduction to Linked List: Review. Source:

Link Prediction for Social Network

Web Structure Mining Community Detection and Evaluation

Know your neighbours: Machine Learning on Graphs

On the Permanence of Vertices in Network Communities. Tanmoy Chakraborty Google India PhD Fellow IIT Kharagpur, India

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

Data Mining Concepts & Tasks

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

KDD 10 Tutorial: Recommender Problems for Web Applications. Deepak Agarwal and Bee-Chung Chen Yahoo! Research

Efficient Mining Algorithms for Large-scale Graphs

This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight

Overview of Clustering

Tie strength, social capital, betweenness and homophily. Rik Sarkar

L1 - Introduction. Contents. Introduction of CAD/CAM system Components of CAD/CAM systems Basic concepts of graphics programming

CSE 701: LARGE-SCALE GRAPH MINING. A. Erdem Sariyuce

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

A Process-Centric Data Mining and Visual Analytic Tool for Exploring Complex Social Networks

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data

Interestingness Measurements

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

A PROPOSED HYBRID BOOK RECOMMENDER SYSTEM

Unsupervised Learning

Community detection algorithms survey and overlapping communities. Presented by Sai Ravi Kiran Mallampati

3 Data, Data Mining. Chengkai Li

Parallel repartitioning and remapping in

Based on Big Data: Hype or Hallelujah? by Elena Baralis

Sublinear Algorithms for Big Data Analysis

CS249: ADVANCED DATA MINING

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

Fast Inbound Top- K Query for Random Walk with Restart

OLAP Introduction and Overview

DATA MINING - 1DL105, 1DL111

arxiv: v1 [cs.si] 17 Sep 2016

Web Personalization & Recommender Systems

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks

Introduction to Data Mining

Mining Web Data. Lijun Zhang

Data Partitioning and MapReduce

Part I: Data Mining Foundations

Network visualization techniques and evaluation

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

An Introduction to Big Data Formats

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

PoP Level Mapping And Peering Deals

Commercial Data Intensive Cloud Computing Architecture: A Decision Support Framework

Clustering. (Part 2)

GraphCEP Real-Time Data Analytics Using Parallel Complex Event and Graph Processing

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Shen, Tang, Yang, and Chu

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

The DataBridge: A Social Network for Long Tail Science Data!

Section 7.12: Similarity. By: Ralucca Gera, NPS

arxiv: v1 [physics.soc-ph] 10 Jun 2014

Chapter 3: AIS Enhancements Through Information Technology and Networks

Clustering in Data Mining

Comparison of Online Record Linkage Techniques

IBM System Storage DCS3700

A data-driven framework for archiving and exploring social media data

Web Personalization & Recommender Systems

Addressing the Variety Challenge to Ease the Systemization of Machine Learning in Earth Science

ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS

CTL.SC4x Technology and Systems

Mining Data that Changes. 17 July 2015

Global Register Allocation - 2

University of Florida CISE department Gator Engineering. Clustering Part 5

Lecture Note: Computation problems in social. network analysis

Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Recommender Systems: Practical Aspects, Case Studies. Radek Pelánek

Transcription:

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety Abhishek Santra 1 and Sanjukta Bhowmick 2 1 Information Technology Laboratory, CSE Department, University of Texas at Arlington, Arlington, Texas, USA 2 CSE Department, University of Nebraska at Omaha, Omaha, Nebraska, USA Email: 1 abhishek.santra@mavs.uta.edu, 2 sbhowmick@unomaha.edu Which class of big data problems are we looking into? Interaction amongaset of people Most popular or socially active group of people? Most influential set of people? 1

Airline connectivity among a set of US cities Similarity amongaset of traffic accidents Airline connectivity among a set of Indian cities Highly central cities (hubs)? Light Conditions Weather Conditions Road Surface Conditions Accident Prone Regions? Most dominant feature? Order of features? Connectivity among scientists, cities and conferences Research Collaboration Direct Flights Attendance Residences Venues Domains Best city to hold a workshop? Which research domains are prevalent in which state? Explicit relationships (friends, flights ) or based on similarity metric (numeric, location, video, time ) Holistic or Flexible Analysis: Study effect of different combinations of features or perspectives Two Challenges Modeling Processing/Computations 2

Traditional Approach: Monoplex Modeling Single Edge Monoplex Entities as Nodes Traditional Approach: Monoplex Modeling Multi Edge Monoplex Entities as Colored Nodes Relationships as Edges Weights for Strength Direction for Information Flow Drawbacks Need to be generated for every feature combination Repeated Dataset Scanning Repeated Similarity Metric Computation Relationships as Colored Edges Drawbacks Loss of partial analysis results w.r.t. different feature combinations Repeated Graph Traversals to extract any feature combination subgraph Convoluted Representation Multiple Monoplexes = Multiplex Modeling Non Suitability of Monoplex Modeling Computations will obscure information Computationally Expensive Convoluted Representations Use Multiplexes or Multilayer Networks A network of networks Each layer/network represents a single perspective or feature Differentiated into 3 types based on type of entities Multiplex Modeling (Same Entities, Different Relationships) Homogeneous Multiplex Multiple relationships among same type of entities Similarity of disasters (accidents, storms etc.) based on factors Interaction among people via various media (social media, calls etc.) Connectivity among cities based on different airlines Accident Multiplex 3

Multiplex Modeling (Different Entities, Different Relationships) Heterogeneous Multiplex Multiple relationships among different types of entities Residence, venue and attendance connectivity among city airline, scientist collaboration and similar conference networks Hybrid Multiplex Any layer of the heterogeneous multiplex can be expanded to a homogenous multiplex Different friendship networks among scientists City Scientist Conference Multiplex Benefits of Multiplexes Flexible analysis Mixing required layers in arbitrary ways Parallel algorithms can be leveraged Ease of handling the dataset incrementally Addition of new entities (nodes), relationship with existing entities (edges) and features/perspectives (layers) Efficient maintenance of the latest dataset snapshot Deletion of obsolete entities (nodes), relationships (edges) and features/perspectives (layers) Multi Feature Computations using Multiplexes Multiplex based analysis is at a nascent stage Layers considered individually or all layers aggregated together, in specific sub disciplines Hardly any work on mining and querying multiplexes Existing algorithms for monoplexes can be leveraged Need to generate, store and analyze each layer combination N individual layers O(2 N ) layer combinations! Exponential overall computational costs (storage and time), if N is large Multiplex based Holistic Analysis (Need of the Hour) Develop cost effective and robust algorithms for flexibly analyzing any layer combination to Compute network characteristics, Mine interesting hidden patterns, and Query Specific computations and related challenges vary with the type of multiplex under consideration 4

(Requirements) Devise efficient techniques for generating interesting network structures with respect to any combination of multiplex layers Communities: Tightly connected group of nodes Effectiveness of accident prevention techniques Variation of accident prone regions over time Hubs: Highly central nodes Maximize the reach of an advertisement Most influential people across social media (Requirements) Develop methods to compute the relative ordering and correlation among different feature (or layer) combinations based on their importance Allocation of funds for accident prevention Ordering factors based on impact on accident occurrence Identify metrics to quantify the importance of a layer Alternatives: Density, Number of Influential Nodes (Hubs), Core Periphery Structure, Local and Global Clustering Coefficient (Challenges) Need to reduce the exponential complexity Formulation of efficient aggregation functions that can combine the results from N individual layers to compute the results of any layer combination Diverse formats of layer wise results: Substructures (communities), Real Numbers (density, clustering coefficients), Sets (high centrality nodes, nodes in inner core) Analyze performance of aggregation functions Accuracy using NMI, ARI, Purity and Jaccard Index Obtain Confidence Intervals for different characteristics of layer wise results Savings in computational time and storage space by eliminating the need to generate, store and analyze each layer combination (Preliminary Work Boolean Composition) Proposed Multiplex Layer Compositions though Boolean Operations AND, OR, NOT AND Composition Relationships present in all layers NOT Composition Relationships not present in a layer OR Composition Relationships present in at least one layer 5

(Preliminary Work Recreating Communities) Proposed Multiplex Layer Compositions though Boolean Operations AND, OR, NOT Proposed an accurate intersection based community recreation technique for any AND composed multiplex layer using layer wise communities* Reduced the overall computation time by over 40% (with real life multi feature datasets traffic accidents, storms) Accident Multiplex with 3 individual layers (Preliminary Work Estimating Hubs) Hubs defined as the high degree or closeness centrality vertices Proposed efficient heuristics to estimate the high centrality vertices for any AND composed multiplex layer using the layer wise hub sets Overall average accuracy of at least 70 80%, Reduced the overall computation time by over 30% (with real life multifeature datasets traffic accidents, IMDb) Performance of degree centrality based heuristic (DC1) on IMDb Multiplex for co actors with Comedy (m 1), Action (m 2) and Drama (m 3) individual genre based layers Performance of closeness centrality based heuristic (CC1) on Accident Multiplex for similar accidents with Light (a 1), Weather (a 2) and Road Surface (a 3) Conditions individual layers *with individual layers having self preserving communities (Related Publications) Scalable Holistic Analysis of Multi Source, Data Intensive Problems Using Multilayered Networks CoRR abs 2016 Efficient Community Re creation in Multilayer Networks Using Boolean Operations ICCS 2017 HUBify: Efficient Estimation of Central Entities across Multiplex Layer Compositions DaMNet@ICDM 2017 (Requirements and Challenges) Devise efficient ways to compute high centrality vertices across multiple connected layers Best city to organize a workshop Aggregation functions need to take into account results of the bipartite graph formed by the inter layer edges Layers may be connected not just sequentially one after another (i.e. layer A connected to layer B connected to layer C) but can be connected in different directions (i.e. layers A, B and C can be connected to each other in a triangle) Vertex Hotspot for Cities Vertex Hotspot for Cities with respect to the Scientists 6

(Requirements and Challenges) Develop algorithms for mining on multiplexes Define the notion of subgraphs and patterns in a multiplex Establish metrics like MDL and frequency Define Exact and Similar (or in exact) substructures Devise strategies to partition a multiplex Example of a frequent city collaboration pattern (Requirements and Challenges) Develop query processing algorithms for queries on multiplexes Node degree based: Cities where scientists attending most number of conferences reside Community based: Cities where scientists belonging to largest group of collaborators reside Path based: Best possible city where a well connected group of collaborators can meet up by taking the minimum number of flights (Requirements and Challenges) Develop query processing algorithms for queries on multiplexes Determine the order (in parallel or as a partial order) to process layers for efficiency Generating metric to evaluate alternate query plans Evaluate the suitability of an index based or substructure expansion based approach Identify query processing requirements in terms of the graph properties (Related Publications) Heterogeneous Community Detection using Network Composition KDD 2018 (under preparation) 7

Conclusions Questions? Multiplexes have modeling and computational advantages for holistically analyzing multi source, multi feature data Computational challenges identified in this paper are being addressed by the research community Efficient techniques as solutions to these challenges will enrich the data analytics repertoire making it easier to analyze such a class of big data problems Sharma Chakravarthy Professor sharma@cse.uta.edu Abhishek Santra PhD Student abhishek.santra@mavs.uta.edu Sanjukta Bhowmick Associate Professor sbhowmick@unomaha.edu For more information visit: http://itlab.uta.edu 8