Analysis of Big Data and other sources
Outline
1. Introduction to big data
2. A survey on tools
3. Data storage in depth
4. Data processing
5. Practice:
   a. Word count with Spark
   b. Graph analysis with Neo4J
Introduction to Big Data
Introduction to Big Data
There are different working areas in big data:
- Data storage
- Data processing
- Data mining
- Data visualisation
- Business Intelligence Systems
A Survey on Tools - Data storage
- Documents: MongoDB, CouchDB, Riak
- Key/value: Riak, Voldemort, Redis, Memcached, Membase, DynamoDB
- Columns: Google Bigtable, HBase, Cassandra, Sybase IQ, Hypertable
- Graphs: FlockDB, OrientDB, AllegroGraph, Neo4J
A Survey on Tools - Data processing
Stages: acquisition, storage, analysis
- Batch: HDFS commands, Sqoop, Flume, HDFS, HBase, MapReduce, Spark, Spark SQL, Hive, Pig, Cascading
- Streaming: Flume, Kafka, Kestrel, RabbitMQ, AWS SQS, Storm, Trident, Spark Streaming, Samza
- Hybrid: Lambda, Kappa, Summingbird, Lambdoop, Apache Flink
A Survey on Tools - Data mining
- Open: Weka, RapidMiner, Mahout, Gate, NLTK, KNIME, OpenNN, Scikit-learn, Carrot2, R, Torch
- Proprietary: SPSS, RapidMiner, IBM Watson, SAS Enterprise Miner, Statistica Data Miner, Oracle Data Miner, Microsoft Analysis Services, LIONSolver, ClaraBridge
A Survey on Tools - Data visualisation
Vis.js, D3.js, CartoDB, Plot.ly, Tableau, QlikView, R, HighCharts
A Survey on Tools - Business Intelligence
Pentaho, Actuate, SpagoBI, JasperReports, Tableau, QlikView, Palo, Tactic, IBM Cognos, MicroStrategy, Microsoft PowerBI, Plot.ly
Data Storage in Depth - SQL vs. NoSQL
SQL database limitations:
- Fixed structure and integrity restrictions
- Inefficiency with large numbers of insertions, modifications and deletions
- High complexity when modelling real-life relationships
NoSQL databases:
- NoSQL = Not only SQL
- Store large volumes of data in short periods of time
Data Storage in Depth - NoSQL types
There are basically four types of NoSQL databases, although some products share characteristics of more than one type:
- Document oriented: the basic unit is the document (e.g. XML, JSON).
- Key/value: any object is identified by a key and described by a set of attributes (values). Also known as hash stores.
- Column oriented: data are stored in tables with predefined column families, which favours OLAP operations.
- Graph databases: they store not only objects but also the relationships among them, forming graphs of information.
Data Storage in Depth - Document oriented
- The basic unit is the document
- A document can have an arbitrary number of fields
- Each field can be of a different type and size
- Each field can store multiple values
- Examples of document formats are XML, JSON, or similar
- Document databases do not need a fixed document schema
- Each document can have different fields from the other documents in the database
- Security is assigned at document level
- Full-text search capabilities with high performance
Data Storage in Depth - Document oriented
JSON document example
- Unlike the key/value model, the id is part of the document
- Full-text search is provided over the whole document
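The two points above can be illustrated with a minimal Python sketch that uses only the standard json module. The field names and values are hypothetical, and full_text_match is a naive substring search for illustration only; it does not reflect how a document database indexes text.

```python
import json

# Hypothetical document: the identifier is a field of the document itself.
document = {
    "_id": "user-001",
    "name": "Alice",
    "tags": ["big data", "graphs"],      # a field can hold multiple values
    "address": {"city": "Valencia"},     # a field can hold a nested document
}

def full_text_match(doc, term):
    """Naive full-text search: look for the term anywhere in the serialised document."""
    return term.lower() in json.dumps(doc).lower()

# Round trip through JSON text, as a document store would persist it.
serialised = json.dumps(document)
restored = json.loads(serialised)
```

Because the id travels inside the document, restored["_id"] is available after deserialisation without any external key.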
Data Storage in Depth - Key/value stores
- Stores that can hold information of any kind and type
- Objects are identified by a unique key
- Objects are defined by an arbitrary set of attributes
- There is neither a fixed structure nor restrictions
- They are also known as hash stores
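A minimal sketch of the idea, assuming an in-memory Python dictionary as the underlying hash structure. The class KeyValueStore and the keys used are hypothetical and do not correspond to the API of any product listed in the survey.

```python
class KeyValueStore:
    """Toy in-memory key/value store: values are opaque objects addressed only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # No schema is checked: any value can be stored under any key.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
store.put("counter:visits", 1024)
```

The two stored values have unrelated shapes, which is the point: the store imposes neither structure nor restrictions, and the unique key is the only access path.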
Data Storage in Depth - Column oriented
- Unlike SQL databases, which are organised as rows, column-oriented databases are organised around columns
- Tables are defined as families of columns
- It is easy to implement OLAP operations: drill-down, roll-up, slice and dice, pivot
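The difference between the two layouts can be sketched in plain Python. The sales records and the helper functions total, roll_up and slice_by are hypothetical; they only illustrate why aggregations over a single column are cheap when each column is stored together.

```python
# Hypothetical sales records, first in a row-oriented layout.
rows = [
    {"region": "north", "product": "A", "amount": 10},
    {"region": "south", "product": "A", "amount": 20},
    {"region": "north", "product": "B", "amount": 5},
]

# Column-oriented layout: one list per column, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def total(cols, measure):
    """Aggregate a measure by reading a single column."""
    return sum(cols[measure])

def roll_up(cols, dimension, measure):
    """Roll-up: total of the measure for each value of the dimension."""
    totals = {}
    for key, value in zip(cols[dimension], cols[measure]):
        totals[key] = totals.get(key, 0) + value
    return totals

def slice_by(cols, dimension, wanted):
    """Slice: keep only the positions where the dimension equals the wanted value."""
    keep = [i for i, v in enumerate(cols[dimension]) if v == wanted]
    return {name: [values[i] for i in keep] for name, values in cols.items()}
```

Note that total touches only the "amount" column, whereas the row layout would require visiting every complete record.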
Data Storage in Depth - Graph databases
Relational databases lack relationships
- Bob's friends
- What about big data?
- Alice's friends-of-friends
Data Storage in Depth - Graph databases
NoSQL databases also lack relationships
Relationships can be emulated with aggregated fields, but:
- They must be maintained (updated and deleted) programmatically.
- Aggregated links are not reflexive: there is no pointer backward (e.g. to find out who bought a product).
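Both drawbacks can be seen in a small Python sketch. The user documents, product ids and function names are hypothetical; the aggregated field is the "purchases" list stored inside each user document.

```python
# Hypothetical user documents with an aggregated field of product ids.
users = {
    "alice": {"purchases": ["p1", "p2"]},
    "bob": {"purchases": ["p2"]},
    "carol": {"purchases": []},
}

def products_of(user_id):
    """Forward direction: a direct lookup on the aggregated field."""
    return users[user_id]["purchases"]

def buyers_of(product_id):
    """Backward direction: no link exists, so every user document is scanned."""
    return sorted(u for u, doc in users.items() if product_id in doc["purchases"])

def delete_product(product_id):
    """Integrity is maintained by application code, document by document."""
    for doc in users.values():
        doc["purchases"] = [p for p in doc["purchases"] if p != product_id]
```

The forward lookup is a single access, while buyers_of grows with the number of users, and delete_product has to be remembered and run by the application because the store does not do it.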
Data Storage in Depth - Graph databases
- A graph is a collection of vertices (nodes) representing entities and edges representing the relationships among them.
- In a property graph, both nodes and relationships can have properties.
- A graph data model means that data are modelled as a graph.
- A (property) graph database is an online database management system with Create, Read, Update and Delete methods that expose a (property) graph data model.
Data Storage in Depth - Graph databases
Property graph example:
- A node with a property whose value is Harry
- A relationship with a property whose value is Follows
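A property graph of this kind can be sketched with two small Python dataclasses. The second node (Sally), the property keys "name" and "type", and the helper followed_by are assumptions made for illustration; only the values Harry and Follows come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: int          # id of the source node
    end: int            # id of the target node
    properties: dict = field(default_factory=dict)

# Hypothetical data: Harry follows Sally.
harry = Node(1, {"name": "Harry"})
sally = Node(2, {"name": "Sally"})
follows = Relationship(harry.node_id, sally.node_id, {"type": "Follows"})

nodes = {n.node_id: n for n in (harry, sally)}
relationships = [follows]

def followed_by(name):
    """Names of the nodes reached through a 'Follows' relationship from the given node."""
    return [
        nodes[r.end].properties["name"]
        for r in relationships
        if r.properties.get("type") == "Follows"
        and nodes[r.start].properties.get("name") == name
    ]
```

The relationship is a first-class object with its own properties and a direction, so the graph is traversed through it rather than through a field copied into either node.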
Data Storage in Depth - Graph databases
- Cypher is an expressive graph database query language.
- Cypher is designed to be easily read and understood by developers, database professionals and business stakeholders.
- The key feature of Cypher is that it finds data matching a specific pattern, following our intuition of describing graphs with diagrams.
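Data Storage in Depth - Graph databases
Cypher pattern notation:
- Nodes are written in parentheses, e.g. (a)
- Relationship type and direction are written in square brackets with an arrow, e.g. (a)-[:KNOWS]->(b)
- A comma separates subgraphs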
Data Storage in Depth - Graph databases
The simplest query:
- a START clause followed by a MATCH clause and a RETURN clause
Data Storage in Depth - Graph databases
- START: specifies the starting point(s) in the graph (e.g. nodes or relationships).
- MATCH: specification by example; it uses characters to represent nodes and relationships, drawing the pattern of the data we are interested in.
- RETURN: defines the nodes, relationships and/or attributes that should be returned.
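The roles of the three clauses can be mimicked in plain Python. This is only an analogy, not Cypher and not the Neo4j API: the KNOWS data, the names, and the functions start, match and return_ are all hypothetical, and the pattern matched is the triangle (a)-[:KNOWS]->(b)-[:KNOWS]->(c) with (a)-[:KNOWS]->(c).

```python
# Hypothetical KNOWS relationships, stored as (source, target) pairs.
knows = [
    ("Michael", "Anna"),
    ("Anna", "Carlos"),
    ("Michael", "Carlos"),
    ("Anna", "Dana"),
]

def start(name):
    """START: pick the starting node (here, simply by name)."""
    return name

def match(a):
    """MATCH: find (a)-[:KNOWS]->(b)-[:KNOWS]->(c) where (a)-[:KNOWS]->(c) also holds."""
    edges = set(knows)
    return [
        (b, c)
        for (x, b) in knows if x == a
        for (y, c) in knows if y == b and (a, c) in edges
    ]

def return_(bindings):
    """RETURN: project the bound nodes b and c."""
    return [{"b": b, "c": c} for b, c in bindings]

result = return_(match(start("Michael")))
```

In a graph database the pattern is declared once and the engine performs the traversal; the nested loops here are what the declarative pattern spares the developer from writing.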
Data Storage in Depth - Graph databases
OTHER CYPHER CLAUSES
- WHERE: provides criteria for filtering.
- CREATE (UNIQUE): creates nodes and relationships.
- DELETE: removes nodes, relationships and properties.
- SET: sets property values on nodes and relationships.
- FOREACH: performs an updating action for each element of a list.
- UNION: merges results from different queries.
- WITH: pipes results from one query to the next.
Data Processing - Types
- Batch processing (volume): for large volumes of information (e.g. DNA sequencing)
- Streaming processing (velocity): for rapidly generated data (e.g. Twitter)
- Hybrid processing (volume and velocity): for large volumes generated rapidly (e.g. in-depth analysis of tweets)
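The difference between the first two types can be sketched in Python with a word count. The messages and the function names are hypothetical, and everything runs in a single process; the sketch only contrasts "all data first, then one result" with "one result after every message".

```python
from collections import Counter

# Hypothetical messages; in practice they would come from a file or a live feed.
messages = ["big data", "data stream", "big stream"]

def batch_count(all_messages):
    """Batch: the whole data set is available before processing starts."""
    return Counter(word for m in all_messages for word in m.split())

def streaming_count(message_iter):
    """Streaming: state is updated per message and a result is available at any time."""
    running = Counter()
    for m in message_iter:
        running.update(m.split())
        yield dict(running)

snapshots = list(streaming_count(iter(messages)))
```

Once the stream has been fully consumed, the last snapshot equals the batch result; the streaming version simply makes intermediate results available earlier.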
Data Processing - Processing steps
DATA ACQUISITION, DATA STORAGE, DATA ANALYSIS
Data Processing - Types
In-depth analysis of a Twitter stream (https://www.youtube.com/watch?v=yrqmen-5pi8)
- Time scale: tweets/second, tweets/minute, tweets/hour, tweets/day
- Tasks: retrieve and store; evolution; words and topics; labelling; hashtags; people; locations; brands; polarity, stance; users, relationships; gender, age; author profile...
Data Processing - Batch processing
Map/Reduce paradigm:
- Map: the Map process divides the data into subsets and sends them to the processing nodes in key-value format <K, V>.
- Reduce: each node returns its result in key-list-of-values format <K, L(V)>, and the results are combined to produce the final result.
Example of counting words in a text:
- Map: a line of text is sent to each node, where the key K is the line number and the value V is the line of text, <nline, text>. The result of the task is a list of pairs <word, 1>, one for each word in the text.
- Reduce: it collects the outputs of all Map processes as pairs <key, value>, i.e. <word, 1>, and groups them into pairs <word, occurrences> by adding up the ones for each word.
Data Processing - Batch processing
function Map (key, values) {
    for each word w in values {
        emit (w, 1)
    }
}
function Reduce (word, list_of_values) {
    total = 0
    for each value v in list_of_values {
        total += v
    }
    return (word, total)
}
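A runnable counterpart of the word count can be sketched in plain Python. It runs sequentially in one process, so it only illustrates the data flow, not the distribution across nodes. The explicit shuffle step, which groups the <K, V> pairs into <K, L(V)>, and all function names are assumptions added for illustration.

```python
from collections import defaultdict

def map_phase(line_number, text):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Group the emitted values by key: <K, V> pairs become <K, L(V)>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, values):
    """Reduce: add up the ones collected for a word."""
    return word, sum(values)

def word_count(lines):
    pairs = []
    # Each <line number, text> pair would be sent to a different node.
    for n, text in enumerate(lines, start=1):
        pairs.extend(map_phase(n, text))
    return dict(reduce_phase(w, vs) for w, vs in shuffle(pairs).items())

counts = word_count(["to be or not to be", "to do"])
```

The line number is passed to map_phase to mirror the <nline, text> key-value format, even though the word count itself does not need it.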
Data Processing - Batch processing
ACQUISITION, STORAGE, PROCESSING
Data Processing - Stream processing autoritas Cosmos-intelligence
Data Processing - Stream processing
ACQUISITION, STORAGE, PROCESSING
Kestrel, Trident
Data Processing - Hybrid processing SUMMINGBIRD