Analysis of Big Data and other sources

Outline
1. Introduction to big data
2. A survey on tools
3. Data storage in depth
4. Data processing
5. Practice:
   a. Word count with Spark
   b. Graph analysis with Neo4J

Introduction to Big Data
There are different working areas in big data:
- Data storage
- Data processing
- Data mining
- Data visualisation
- Business Intelligence Systems

A Survey on Tools - Data storage
- Documents: MongoDB, CouchDB, Riak
- Key/value: Riak, Voldemort, Redis, Memcached, Membase, DynamoDB
- Columns: Google Bigtable, HBase, Cassandra, Sybase IQ, Hypertable
- Graphs: FlockDB, OrientDB, AllegroGraph, Neo4J

A Survey on Tools - Data processing
- Batch: acquisition (HDFS commands, Sqoop, Flume); storage (HDFS, HBase); analysis (MapReduce, Spark, Spark SQL, Hive, Pig, Cascading)
- Streaming: Flume, Kafka, Kestrel, RabbitMQ, AWS SQS, Storm, Trident, Spark Streaming, Samza
- Hybrid: Lambda, Kappa, Summingbird, Lambdoop, Apache Flink

A Survey on Tools - Data mining
- Open: Weka, Rapid Miner, Mahout, Gate, NLTK, KMine, OpenNN, Scikit-learn, Carrot2, R, Torch
- Proprietary: SPSS, RapidMiner, IBM Watson, SAS Enterprise Miner, Statistica Data Miner, Oracle Data Miner, Microsoft Analysis Services, LIONSolver, ClaraBridge

A Survey on Tools - Data visualisation
Vis.js, D3.js, CartoDB, Plot.ly, Tableau, QlikView, R, HighCharts

A Survey on Tools - Business Intelligence
Pentaho, Actuate, SpagoBI, JasperReports, Tableau, QlikView, Palo Tactic, IBM Cognos, MicroStrategy, Microsoft PowerBI, Plot.ly

Data Storage in Depth - SQL vs. NoSQL
SQL databases limitations:
- Fixed structure and integrity restrictions
- Inefficiency with a large number of insertions, modifications and deletions
- High complexity to model real-life relationships
NoSQL databases:
- NoSQL = Not only SQL
- Store large volumes of data in short periods of time

Data Storage in Depth - NoSQL types
There are basically four types of NoSQL databases, although some of them share characteristics of more than one type:
- Document oriented: the basic unit is the document (e.g. XML, JSON).
- Key/value: any object is identified by a key and described by a set of attributes (values). Also known as hash stores.
- Column oriented: data are stored in tables with predefined families of columns, which facilitates OLAP operations.
- Graph databases: they store not only objects but also the relationships among them, forming graphs of information.

Data Storage in Depth - Document oriented
- The basic unit is the document
- A document can have an arbitrary number of fields
- Each field can be of a different type and size
- Each field can store multiple values
- Examples of documents are XML, JSON, or similar
- Document databases do not need a fixed document schema
- Each document can have different fields from the other documents in the database
- Security is assigned at document level
- Full-text search capabilities with high performance

Data Storage in Depth - Document oriented
JSON document example
- Unlike in the key/value model, the id is part of the document
- Full-text search is provided over the whole document
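
The two points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the documents, the `_id` field name and the `full_text_search` helper are hypothetical, and the search is a naive substring scan rather than how a real document database indexes text.

```python
import json

# Hypothetical documents: each one carries its own identifier ("_id"),
# and documents in the same collection need not share the same fields.
collection = [
    {"_id": 1, "name": "Alice", "tags": ["graphs", "spark"],
     "bio": "Works on graph analysis"},
    {"_id": 2, "name": "Bob", "email": "bob@example.org",
     "bio": "Likes word counts with Spark"},
]


def full_text_search(docs, term):
    """Return the ids of documents whose serialised content contains the term."""
    term = term.lower()
    # Naive scan over the whole serialised document (field names and values).
    return [doc["_id"] for doc in docs if term in json.dumps(doc).lower()]


if __name__ == "__main__":
    print(full_text_search(collection, "spark"))
```

Note that the second document has an `email` field that the first one lacks, which is what the absence of a fixed schema allows.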

Data Storage in Depth - Key/value stores
- Stores that hold information of any kind and type
- Objects are identified by a unique key
- Objects are defined by an arbitrary set of attributes
- There is neither structure nor restrictions
- They are also known as hash stores
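
A minimal sketch of the idea, assuming an in-memory hash table stands in for the store. The `KeyValueStore` class and the keys used below are hypothetical and do not correspond to the API of any of the products listed earlier.

```python
class KeyValueStore:
    """Toy in-memory key/value store (hypothetical, for illustration only)."""

    def __init__(self):
        # A hash table mapping each unique key to an arbitrary value.
        self._data = {}

    def put(self, key, value):
        """Store any value under the key, replacing a previous one."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value stored under the key, or the default."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove the key if it exists; do nothing otherwise."""
        self._data.pop(key, None)


store = KeyValueStore()
# No structure is imposed: the two values have different attributes.
store.put("user:1", {"name": "Alice", "age": 30})
store.put("session:abc", {"user": "user:1", "ttl": 3600, "cart": ["book"]})
```

The store knows nothing about the contents of a value; it can only look objects up by key.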

Data Storage in Depth - Column oriented
- Unlike SQL databases, which are organised as rows, column-oriented databases are organised around columns
- Tables are defined as families of columns
- It is easy to implement OLAP operations: drill-down, roll-up, slice and dice, pivot
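
To make the row versus column contrast concrete, here is a small Python sketch. The sales data and the `roll_up` and `slice_` helpers are hypothetical, and real column stores add compression, column families and distribution that are not shown here.

```python
# Row-oriented layout: one record per row (hypothetical sales data).
rows = [
    {"region": "North", "product": "A", "amount": 10},
    {"region": "South", "product": "A", "amount": 20},
    {"region": "North", "product": "B", "amount": 5},
]

# Column-oriented layout: one list per column, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def roll_up(cols, by, measure):
    """Aggregate a measure by one dimension, reading only two columns."""
    totals = {}
    for key, value in zip(cols[by], cols[measure]):
        totals[key] = totals.get(key, 0) + value
    return totals


def slice_(cols, column, value):
    """Keep only the positions where a column equals the given value."""
    keep = [i for i, v in enumerate(cols[column]) if v == value]
    return {name: [vals[i] for i in keep] for name, vals in cols.items()}


if __name__ == "__main__":
    print(roll_up(columns, "region", "amount"))
    print(slice_(columns, "product", "A"))
```

The roll-up touches only the `region` and `amount` columns, which is the property that makes this layout convenient for OLAP-style aggregation.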

Data Storage in Depth - Graph databases
Relational databases lack relationships
- Bob's friends
- What about big data? Alice's friends-of-friends

Data Storage in Depth - Graph databases
NoSQL databases also lack relationships
Relationships can be emulated by aggregated fields, but:
- They must be maintained (updated and deleted) programmatically.
- Aggregated links are not reflexive: there is no pointer backward (e.g. to find out who bought a product).

Data Storage in Depth - Graph databases
- A graph is a collection of vertices representing entities and edges representing the relationships among them.
- In a property graph, both nodes and relationships can have properties.
- A graph data model means that data are modelled as a graph.
- A (property) graph database is an online database management system with Create, Read, Update and Delete methods that expose a (property) graph data model.

Data Storage in Depth - Graph databases
Property graph
- Relationship with a property whose value is Follows
- Node with a property whose value is Harry
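
A property graph of this kind can be sketched with plain Python dictionaries. Everything beyond the Harry node and the Follows relationship is an assumption made for illustration: the second node (Sally), the property name `label` and the helper functions are hypothetical.

```python
# Nodes and relationships both carry properties (a property graph).
nodes = {
    1: {"name": "Harry"},
    2: {"name": "Sally"},  # hypothetical second node
}
relationships = [
    # Directed relationship 1 -> 2 with its own property.
    {"start": 1, "end": 2, "properties": {"label": "Follows"}},
]


def _id_of(name):
    """Return the id of the node whose name property matches."""
    return next(i for i, props in nodes.items() if props["name"] == name)


def follows(name):
    """Names reached by outgoing Follows relationships."""
    src = _id_of(name)
    return [nodes[r["end"]]["name"] for r in relationships
            if r["start"] == src and r["properties"]["label"] == "Follows"]


def followers(name):
    """Names found by walking Follows relationships backwards."""
    dst = _id_of(name)
    return [nodes[r["start"]]["name"] for r in relationships
            if r["end"] == dst and r["properties"]["label"] == "Follows"]
```

Because relationships are stored as first-class records, the same data answers both the forward question (`follows`) and the backward one (`followers`).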

Data Storage in Depth - Graph databases
- Cypher is an expressive graph database query language.
- Cypher is designed to be easily read and understood by developers, database professionals and business stakeholders.
- The key feature of Cypher is that it lets us find data matching a specific pattern, following our intuition of describing graphs with diagrams.

Data Storage in Depth - Graph databases
[Figure: annotated Cypher pattern showing nodes, relation type and direction, and separation among subgraphs]

Data Storage in Depth - Graph databases
The simplest query: a START clause followed by a MATCH clause and a RETURN clause

Data Storage in Depth - Graph databases
- START: specifies the starting point(s) in the graph (e.g. nodes or relationships).
- MATCH: describes the pattern by example, using characters to represent nodes and relationships, in order to draw the data we are interested in.
- RETURN: defines the nodes, relationships and/or attributes that should be returned.
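
The roles of the three clauses can be illustrated by emulating a friends-of-friends query over an in-memory graph in Python. This is a sketch under stated assumptions: the Cypher text in the comment is written in the START-based style described above with a hypothetical index name (`user`) and relationship type (`FRIEND`), and the Python function only mimics what such a query would do; it does not talk to Neo4J.

```python
# Cypher query being emulated (hypothetical index and relationship names):
#   START a=node:user(name='Alice')
#   MATCH (a)-[:FRIEND]->(b)-[:FRIEND]->(c)
#   RETURN c

# Hypothetical directed FRIEND relationships, stored as adjacency lists.
FRIEND = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave", "Erin"],
    "Dave": [],
    "Erin": [],
}


def friends_of_friends(start):
    """Emulate START / MATCH / RETURN for a two-hop FRIEND pattern."""
    # START: fix the starting node of the pattern.
    a = start
    results = []
    # MATCH: bind b and c for every path a -> b -> c.
    for b in FRIEND.get(a, []):
        for c in FRIEND.get(b, []):
            # RETURN: keep only the part of the match we asked for.
            results.append(c)
    return results


if __name__ == "__main__":
    print(friends_of_friends("Alice"))
```

One result is produced per matched path, so a node reachable through two different friends appears twice; deduplicating is a separate step.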

Data Storage in Depth - Graph databases
Other Cypher clauses
- WHERE: provides criteria for filtering.
- CREATE (UNIQUE): for the creation of nodes and relationships.
- DELETE: removes nodes, relationships and properties.
- SET: sets property values on nodes and relationships.
- FOREACH: performs an updating action for each element of a list.
- UNION: merges results from different queries.
- WITH: pipes results from one query to the next.

Data Processing - Types
- Batch processing (volume): for large volumes of information (e.g. DNA sequencing).
- Streaming processing (velocity): for rapidly generated data (e.g. Twitter).
- Hybrid processing (volume and velocity): for large volumes of rapidly generated data (e.g. in-depth analysis of tweets).

Data Processing - Processing steps
1. Data acquisition
2. Data storage
3. Data analysis

Data Processing - In-depth analysis of a Twitter stream
https://www.youtube.com/watch?v=yrqmen-5pi8
[Figure: analysis tasks arranged by processing rate, from tweets/second through tweets/minute and tweets/hour to tweets/day: retrieve and store; evolution; words and topics; labelling; hashtags; people; locations; brands; polarity, stance; users, relationships; gender, age; author profile...]
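
The rates named in the figure (tweets per minute, hour or day) amount to counting events per time window. A minimal Python sketch follows, assuming hypothetical timestamps and a hypothetical `tweets_per` helper; a real stream would be consumed incrementally rather than from a list.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet timestamps (ISO 8601 strings).
timestamps = [
    "2016-05-01T10:00:05",
    "2016-05-01T10:00:40",
    "2016-05-01T10:01:10",
    "2016-05-01T11:30:00",
]

# Each unit truncates the timestamp to the start of its window.
FORMATS = {
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
}


def tweets_per(unit, stamps):
    """Count tweets in each window of the given unit."""
    fmt = FORMATS[unit]
    return Counter(datetime.fromisoformat(s).strftime(fmt) for s in stamps)


if __name__ == "__main__":
    for unit in ("minute", "hour", "day"):
        print(unit, dict(tweets_per(unit, timestamps)))
```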

Data Processing - Batch processing
Map/Reduce paradigm:
- Map: the Map process divides the data into subsets and sends them to each processing node in key-value format <K, V>.
- Reduce: each node returns its result in key-list of values format <K, L(V)>, and the results are combined to produce the final result.
Example of counting words in a text:
- Map: a line of text is sent to each node, where the key K is the line number and the value V is the line of text, <nline, text>. The result of the task is a list of pairs <word, 1>, one for each word in the text.
- Reduce: it collects all the outputs of the Map processes as pairs <key, value>, i.e. <word, 1>, and groups them into pairs <word, occurrences> by adding the ones of each word.

Data Processing - Batch processing
function Map (key, values) {
    for each word w in values {
        emit (w, 1)
    }
}
function Reduce (word, list_of_values) {
    total = 0
    for each value v in list_of_values {
        total += v
    }
    return (word, total)
}
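
The same word count can be written as runnable Python. This is a single-process sketch of the paradigm, not Hadoop or Spark code: the function names and the explicit `shuffle` step (which a framework would perform between the two phases) are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line_number, text):
    """Map: emit a <word, 1> pair for every word in one line of text."""
    # The key (line number) is not needed to count words.
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group the emitted values by key: <word, [1, 1, ...]>."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, values):
    """Reduce: add up the ones of each word."""
    return word, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    pairs = []
    for number, line in enumerate(lines, start=1):
        pairs.extend(map_phase(number, line))
    return dict(reduce_phase(w, v) for w, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["to be or not", "to be"]))
```

In a distributed setting each call to `map_phase` could run on a different node, and the grouping by key is what lets each reducer work on one word independently.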

Data Processing - Batch processing
Stages: acquisition, storage, processing

Data Processing - Stream processing
autoritas Cosmos-intelligence
Stages: acquisition, storage, processing
Tools shown: Kestrel, Trident

Data Processing - Hybrid processing
Summingbird
