Big Data analytics and Visualization

Similar documents
Enriching knowledge graphs with text processing techniques

Evolution of Database Systems

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019

Qlik. 10 key elements of a successful data strategy and modern analytics platform. February 2019 Julie Kae Executive Director, Qlik.

Informatica Enterprise Information Catalog

Ontology and Hyper Graph Based Dashboards in Data Warehousing Systems

Progress DataDirect For Business Intelligence And Analytics Vendors

An Introduction to Big Data Formats

custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader

Chapter 6. Foundations of Business Intelligence: Databases and Information Management VIDEO CASES

An overview of Graph Categories and Graph Primitives

High Performance Oracle Endeca Designs for Retail. Technical White Paper 24 June

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Analytics Fundamentals by Mark Peco

Patent Image Retrieval

The Definitive Guide to Preparing Your Data for Tableau

Welcome to the topic of SAP HANA modeling views.

What is Gluent? The Gluent Data Platform

BIG DATA INDUSTRY PAPER

The Hadoop Paradigm & the Need for Dataset Management

E6895 Advanced Big Data Analytics Lecture 4:

Some Big Data Challenges

OLAP Introduction and Overview

Securing Your Digital Transformation

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations

Accelerate your SAS analytics to take the gold

HANA Performance. Efficient Speed and Scale-out for Real-time BI

Oracle and Tangosol Acquisition Announcement

DATA FORMATS FOR DATA SCIENCE Remastered

Optimize Your Databases Using Foglight for Oracle s Performance Investigator

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E

Construction Change Order analysis CPSC 533C Analysis Project

Giovanni Lamanna LAPP - Laboratoire d'annecy-le-vieux de Physique des Particules, Université de Savoie, CNRS/IN2P3, Annecy-le-Vieux, France


Technology In Action, Complete, 14e (Evans et al.) Chapter 11 Behind the Scenes: Databases and Information Systems

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

Level 4 Diploma in Computing

Where do these data come from? What technologies do they use?? Whatever they use, they need models (schemas, metadata, )

DB2 for z/os: Programmer Essentials for Designing, Building and Tuning

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

USC Viterbi School of Engineering

Data Mining Concepts & Techniques

Chapter 3: Google Penguin, Panda, & Hummingbird

QLIKVIEW ARCHITECTURAL OVERVIEW

Basics of Data Management

Feature Scope Description Document Version: CUSTOMER. SAP Analytics Hub. Software version 17.09

Analytics Driven, Simple, Accurate and Actionable Cyber Security Solution CYBER ANALYTICS

Inge Van Nieuwerburgh OpenAIRE NOAD Belgium. Tools&Services. OpenAIRE EUDAT. can be reused under the CC BY license

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

Integrate MATLAB Analytics into Enterprise Applications

At the heart of Europe s ICT ecosystem

Building a Data Strategy for a Digital World

Microsoft Power BI for O365

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009

Compact Muon Solenoid: Cyberinfrastructure Solutions. Ken Bloom UNL Cyberinfrastructure Workshop -- August 15, 2005

SAP Sybase SQL Anywhere Manage enterprise data in remote and mobile locations. Speaker s Name/Department (delete if not needed) Month 00, 2012

CORPORATE PROFILE. AI based Search Interface for your. B2B Marketing Campaigns

Architectures for Scalable Media Object Search

Massive Scalability With InterSystems IRIS Data Platform

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Microservice Layout in Netflix

SAP Agile Data Preparation Simplify the Way You Shape Data PUBLIC

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Step-by-step data transformation

The Rules of Subsurface Analytics Jane McConnell, Practice Partner Oil and Gas, Teradata DEJ KL, 4 October 2017

End to End Analysis on System z IBM Transaction Analysis Workbench for z/os. James Martin IBM Tools Product SME August 10, 2015

A Fast and High Throughput SQL Query System for Big Data

Data-Transformation on historical data using the RDF Data Cube Vocabulary

Think & Work like a Data Scientist with SQL 2016 & R DR. SUBRAMANI PARAMASIVAM (MANI)

Embedded Technosolutions

New Approach to Graph Databases

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

STRATEGIC INFORMATION SYSTEMS IV STV401T / B BTIP05 / BTIX05 - BTECH DEPARTMENT OF INFORMATICS. By: Dr. Tendani J. Lavhengwa

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life

Data Warehouse and Data Mining

A Data Warehouse Implementation Using the Star Schema. For an outpatient hospital information system

#mstrworld. Analyzing Multiple Data Sources with Multisource Data Federation and In-Memory Data Blending. Presented by: Trishla Maru.

Approaching the Petabyte Analytic Database: What I learned

Module - 17 Lecture - 23 SQL and NoSQL systems. (Refer Slide Time: 00:04)

TIBCO Spotfire Statement of Direction. Spotfire Product Management

Integrate MATLAB Analytics into Enterprise Applications

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment

Visualisation of Abstract Information

Beyond Query/400: Leap into Business Intelligence with DB2 Web Query

Test On Line: reusing SAS code in WEB applications Author: Carlo Ramella TXT e-solutions

Open Research Online The Open University s repository of research publications and other research outputs

GEO-SPATIAL METADATA SERVICES ISRO S INITIATIVE

Application of machine learning and big data technologies in OpenAIRE system

Qlik s Associative Model

Overview. About CERN 2 / 11

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered.

Introduction to Data Management. Lecture #1 (Course Trailer )

OPEN. INTELLIGENT. Laser Scanning Software Solutions

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

NTP Software VFM Task Service for NetApp

Qlik Sense Desktop. Data, Discovery, Collaboration in minutes. Qlik Sense Desktop. Qlik Associative Model. Get Started for Free

Migrate from Netezza Workload Migration

Enterprise Knowledge Map: Toward Subject Centric Computing. March 21st, 2007 Dmitry Bogachev

The Emerging Data Lake IT Strategy

Transcription:

Big Data analytics and Visualization MTA Cloud symposium A. Agocs, D. Dardanis, R. Forster, J.-M. Le Goff, X. Ouvrard CERN MTA Head quarters, Budapest, 17 February 2017 1

Background information Collaboration Spotting (CS) Platform (V2) used to process examples CS is a Visual Analytics tool originally developed to analyse the technology landscape of key enabling technologies for the Particle Physics programme at CERN Using Publications and Patent metadata The CS Platform has been used to visualize other datasets: CERN procurement data Ceased assets in collaborations with the UN-UNCRI Neuro-science data in collaboration with Wigner CS Platform V3 2

Characteristics of Big Data Huge quantity Processing and storage Distributed sources Access rights, security Complexity Interconnectivity Valuable information may be hidden behind complexity Unravelling new knowledge Data scientists are instrumental to analytics Domain experts are at the heart of the reasoning process 3

Big Data is organised in networks Big Data is distributed Document systems with metadata in Database Database tables with metadata in schema Big Data is strongly interconnected Connectivity not materialised due to the distributed nature of data sources Connectivity relates to the understanding of the data Technical challenges of using Big Data analytics 4

Big Data Intrinsic vs additional value The additional value of Big Data comes from its interconnectivity Discrete data Connected data Relational DBMS No-SQL Graph DB Conventional analytics Conventional + visual analytics Technical challenges of using Big Data analytics 5

Two Criteria: Bottom-up VS Top Down Discrete data VS highly interconnected data Technical challenges of using Big Data analytics 6

Top-Down VS Bottom-up Process driven Hypothesis Simulation software Validation with real data Review hypothesis Data driven Extract features from data Generate hypothesis Run what-if scenario Validate with data Experiments Compare results with simulation Typically hard sciences Big Data Software for domain expert to make sense out of Big Data Empirical approach Discrete data Connected data Relational DBMS No-SQL Graph DB Domain Expert Technical challenges of using Big Data Analytics 7

Domain expert vs Data scientist Domain expert Software developer Cycle is managed by Data Scientist Software developer Domain expert Source: JIOX: Intelligence Tradecraft & Analysis 8

Challenge Bring domain experts at the centre of the visual analytics cycle Experts have the knowledge Data scientists have the skills Bring analytics to experts Understand results of analytics Instruct computers to perform analytics according to findings Domain expert Data scientists to build platforms that enable experts to perform analytics by themselves 9

What is required? Network Data and Domain independent Scalable and flexible Easily accessible and navigable to Experts Enhance value of Data Network for Experts Support interconnectivity Support Cross Domain applications Smart Data management concepts and tools Support any combination of data sources Support any combination of data structures Support visualisation of network content Support visualisation of analysis results Smart graphic management concepts and tools Support navigation of network content Support queries of network content A Domain independent platform Technical challenges of using Big Data analytics 10

Smart Data Management Directed graphs are natural representations of large and interconnected datasets Complexity Interconnectivity Scalability Multi dimensional Schema is embedded in the data Nodes labels Compact graph structure Graph query language No schema evolution Graphs of connected elements constitute multi-dimensional networks Schema: labels and edges (interconnectivity) Labels Graph dimensions Edges Directed relationships between Labels Data graph: vertices and edges Vertices: data instances and dimension instances Edges: Directed relationships between vertices Graph Databases offer a natural support for storing network information Label property graph data model 11

Building a Network from two data sources (Pub/Pat) Document metadata Graph of data types SCat: Journal category, Kw: Keyword, Org: Organisation, Cny: Country, Tech: Technology Technical challenges of using Big Data analytics 12

Graph of Network T: Technologies, A: Pub/Pat, K: keywords, O: Organisations, C: Countries 13

Data Graph & Graph Schema Graph of data network Reachability Graph Technical challenges of using Big Data analytics 14

Building multi-dimensional networks File Systems Tables in DB Graphs in DB Processing/Populating/Labelling/Organising Multi-dimensional Network in GraphDB No limitations on the sixe of a network! Technical challenges of using Big Data analytics 15

Combining data sources Enriching networks More interconnectivity Data sources Publications/Patents Citations Institutions/Companies Data sources EU projects Financial data Geolocation data Schema: Graph of datatypes/labels Dimension: a datatype i.e. a node in the graph schema No limitations on the extension of the network s schema! 16

Smart Graphic Concepts and Management tools Graphs are excellent for visualising networks Retain complexity Singularities Clusters/communities/patterns Graphs contain many visual information Vertex label, shape, size and colour to visualise properties of datasets Edges colours to highlight clusters Visualisation enhances the perceptual reasoning potential of analytics 17

Smart Graphic Concepts and Management tools(2) Maximizing human understanding Selecting network dimensions Traversing network dimensions Graphical queries Time/Frequency evolution Enhancing reasoning Viewing multiple data sources Looking for collaborations Sorting communities Contextual visualisation & analytics Technical challenges of using Big Data analytics 18

Selecting Visualisation dimensions (Tech) (Pub) (Pat) (SCat) (Kw) (Org) (Cny) Reference dimensions for Analytics Pub: Publications, Pat: Patents (Attributes: Title and abstract are used for semantic searches) Visualisation dimensions of Analytics results: SCat: Journal category, Kw: Keyword, Org: Organisation and Cny: Country) Technology Search: Czochralski Silicon wafer Pub/Pat: documents found in search results 19

Traversing Dimensions (Tech) (Pub) (Pat) (SCat) (Kw) (Org) (Cny) Technology Search: Czochralski Silicon wafer Pub/Pat: documents found in search results 20

How to scale up the graph approach for very large multidimensional networks? Technical challenges of using Big Data analytics 21

Visual analytics features Visual analytics does not replace Big Data analytics Visualize results Maintain visual perception quality and user interactivity No matter the size No matter the diversity (dimensions) No matter the interconnectivity Data sampling & filtering Visualize subsets of network dimensions View data from different perspectives Technical challenges of using Big Data analytics 22

Visual analytics Needs Visualize part of data network with respect to particular references and from different perspectives Reference: Data dimensions (labels) Perspective: Visual dimensions (labels) Need to navigate across visual dimensions => Visual queries Need to get contextual statistics In the context of a particular view Need to change Data Reference while navigating Queries adapted to change of reference Technical challenges of using Big Data analytics 23

Visual analytics Features (2) Structural vs Behavioural Understand from the data how something is working Visualization Maximum number of collaborations that can be processed (~100k) to feed visualization Maximum number of vertices and edges one can visualize within a graph (~ 10k) Maximum number of Clusters one can visualize within a graph (~10k) Data quality Can the data be trusted? How complete is the dataset under study? Technical challenges of using Big Data analytics 24

Visual analytics Needs(2) Need to visualize processes, interactions in addition to structure of data network Connectivity graphs AND Causality graphs directed edges For large graphs: Replace vertices with communities in complex graphs Compound graph approach For graphs built out of large collaborations Replace 2-adic calculations with m-adic Example Neuro science: paths of length 2 to visualize input/process/target flows Technical challenges of using Big Data analytics 25

Reduce visual complexity & faster graph processing: Hyperedges vs edges Organisation landscape graph view Organisation landscape hypergraph view Technology search: BGO Crystals Pub/Pat: documents found in search results Edges vs hyper-edges 26

Tailor visualisation to data STRATEGY: Combining various techniques to support quality visual perception and user interactions according to data and graph sizes Statistics Data sampling & Reduction Compound graphs 2-adic vs n-adic node-link graph representation Technical challenges of using Big Data analytics 27

Combining techniques for visualisation Visual Data Analysis: Compute Collaborations # collaborations too large? Yes No Community Processing: Compute clusters Can graph be layered? No Yes Yes # clusters too large? No No Visualize Data anyway? Yes Visual Data Analysis: Data Reduction/Sampling Community Processing: Compute Compound Graph Visual Data Analysis: Build statistics No # vertices too large Yes Visual Data Analysis: Filter/Reduce dataset Visual Analysis: Display Graph with vertices Visual Analysis: Display Graph with clusters as vertices Objective: Reduce dataset and graph content with very minimal loss in visual perception 28

Computing requirements for visualization Service users within a few seconds Heavy computing at the backend to process clusters, optimize layout and support visual navigation Need for Cloud computing Using machines with 4 CPU cores (8 threads), 8 GB of memory CPU vs GPU Comparing them using consumer level hardware (Intel Core i7, GeForce GTX 980) CS Platform V3 29

Computing requirements for visualization Computation on the CPU Graphs with tens of thousands of nodes and hundreds of thousands of edges, computing requires ~17 seconds. Further optimization can be achieved by further distributing the computation among multiple machines Computation on the GPU Same graphs compute ~8 times faster (~2 seconds) Distribution among multiple GPUs is a further possible optimization CS Platform V3 30

Runtime (s) Computing requirements for visualization 25 Computation time on CPU vs GPU 20 15 10 5 0 CT 3D Database Silicon CPU GPU CS Platform V3 31

The macaque case g2 is too large for visual perception Communities 172 clusters 10668 edges g 0 : directed graph of brain area interconnectivity* (42 vertices = areas, 601 edges= interactions) g 2 : directed graph of cortical interactions* (Input/Processing/Target) (9869 vertices = IPT flows, 166219 edges = common interactions) *Data/slide: L. Négyessy, A. Fülöp Technical challenges of using Big Data analytics 32

Constructed Reachability Graph Brain Area Modality Target Area Processing Area Input Area Cerebral lobe L2_path ProcessType InterLobe g 0 edges g 2 edges g 2 g 0 connections Are type of processing and Interactive lobe - Vertex attributes? - Visual dimensions? Macaque brain network data: optimal for navigation 33

g 0 graph Technical Challenge of Using Big Data Analytics 34

g 2 (with intercluster edges) CS platform concepts V3 35

g 2 (paths of length 2 CS platform concepts V3 36

Community_61 Egocentric 37

Community_61 CS platform concepts V3 38

Community_61 39

CS platform concepts V3 40

Conclusion To visualize Big Data Analytics output you need: Graphs to store your data networks and their schema Graphs to view network structure through selected dimensions Graphs to navigate across dimensions to provide contextual data to visualisation tools To maintain visual perception you need to combine various techniques Statistics, sampling, compound graph, layered graph To support structural and behavioural visualisation you need to explore Clustering algorithms supporting directed edges Processes, interactions in relation with the data Technical challenges with Big Data Analysis 41

Thank you for your attention!