Big Data Analytics
Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data
"The world is crazy. But at least it's getting regular analysis."
Big Data Explosion
- 1.8 ZB of information world-wide (2011)
Big Data Becomes Real Data
- Big Data became real in 2013
- Obama, "the Big Data President"
- Oscar prediction
- Finding and telling data-driven stories in billions of Tweets
The 5 V's
- Volume: large amounts of data generated every second (emails, Twitter messages, videos, sensor data...)
- Velocity: the speed at which data moves in and out of data management systems (videos going viral...); often processed on-the-fly
- Variety: different data formats, structured or unstructured (80%)
- Value: the insights we can reveal within the data
- Veracity: the trustworthiness of the data
Big Data
Extremely large datasets that are hard to deal with using traditional (relational) databases:
- Storage/Cost
- Search/Performance
- Analytics and Visualization
Big Data
Different types of data:
- Structured data (relational tables)
- Semi-structured data (XML, JSON, log files)
- Unstructured data (free text, web pages)
- Graph data (social networks, Semantic Web)
- Streaming data
Typical operations:
- Aggregation & statistics: data warehouses, OLAP
- Indexing, searching, querying: keyword-based search, pattern matching
- Knowledge discovery: data mining, statistical modeling
Big Data
Old model, "query the world": data acquisition is coupled to a specific hypothesis.
New model, "download the world": data acquisition supports many hypotheses.
Examples:
- E-commerce: transactions, customer tracking, social graphs
- Security: log files, anomaly detection
- Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, Pan-STARRS)
- Biology: lab automation, high-throughput sequencing
- Oceanography: high-resolution models, cheap sensors, satellites
MapReduce
Parallel and Distributed Programming Paradigms
- Partitioning: 1. computation, 2. data
- Mapping: assign computation and data parts to resources
- Synchronisation
- Communication: send intermediate data between workers
- Scheduling
A paradigm is an abstraction that hides the implementation of these issues from the user.
MapReduce
An abstraction for performing computations on data:
- automatic parallelization of computations
- large-scale data distribution
- simple yet powerful interface
- user-transparent fault tolerance
- runs on commodity hardware
Introduced by Google in 2004: both a paradigm and an implementation.
Motivation: Common Operations on Data
1. Iterate over a large number of records
2. Extract something of interest from each
3. Shuffle and sort intermediate results
4. Aggregate intermediate results
5. Generate final output
MapReduce provides a functional abstraction for two of these operations: extracting (map) and aggregating (reduce).
MapReduce Programming Model
Input & output: each a set of key/value pairs.
The programmer specifies two functions:
- map(in_key, in_value) -> list(out_key, intermediate_value)
  processes input key/value pairs and produces a set of intermediate pairs
- reduce(out_key, list(intermediate_value)) -> list(out_value)
  combines all intermediate values for a particular key and produces a set of merged output values (usually just one)
Inspired by primitives of functional programming languages such as Lisp, Scheme and Haskell.
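The execution model can be sketched as a toy, single-process Python simulation (illustrative only; a real MapReduce runtime distributes each phase across many machines):

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Toy single-process sketch of the MapReduce execution model."""
    # Map phase: apply map_fn to every input key/value pair;
    # each call emits a list of (out_key, intermediate_value) pairs.
    intermediate = []
    for in_key, in_value in inputs:
        intermediate.extend(map_fn(in_key, in_value))
    # Shuffle/sort phase: group intermediate values by key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)
    # Reduce phase: combine all values for each key.
    return {out_key: reduce_fn(out_key, groups[out_key])
            for out_key in sorted(groups)}
```

Any job then reduces to supplying the two functions, e.g. a map that emits `(character, 1)` per character and a reduce that sums.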
What is MapReduce Used For?
At Google:
- index construction for Google Search
- article clustering for Google News
- statistical machine translation
At Yahoo!:
- web map powering Yahoo! Search
- spam detection for Yahoo! Mail
At Facebook:
- data mining
- ad optimization
- spam detection
Example: Word Count
The reduce phase is optional: jobs can be map-only.
Example: Word Count
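The classic word-count job can be sketched as a toy single-machine Python program (function names are illustrative, not the Hadoop API): map emits a `(word, 1)` pair per word, the shuffle groups the pairs by word, and reduce sums the counts.

```python
from collections import defaultdict

def wc_map(doc_name, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield (word.lower(), 1)

def wc_reduce(word, counts):
    # Sum the partial counts for one word.
    return sum(counts)

def word_count(documents):
    # Shuffle/sort: group the 1s emitted by the mappers by word.
    groups = defaultdict(list)
    for name, text in documents.items():
        for word, one in wc_map(name, text):
            groups[word].append(one)
    # Reduce: one call per distinct word.
    return {word: wc_reduce(word, counts) for word, counts in groups.items()}
```

For example, `word_count({"d1": "the cat", "d2": "the dog"})` yields `{"the": 2, "cat": 1, "dog": 1}`.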
How Does It Work?
Key Characteristics
Parallelism:
- map() and reduce() functions run in parallel, each working on different data
- the reduce phase cannot start until the map phase has completely finished
Locality:
- the master assigns tasks based on the location of the data: it tries to schedule map() tasks on the same machine as the physical file data, or at least on the same rack
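The barrier between the two phases can be seen in a small threaded driver (a single-machine sketch using Python threads, not Hadoop's actual scheduler):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def parallel_mapreduce(map_fn, reduce_fn, splits, workers=4):
    # Map tasks run in parallel, each on its own input split.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        map_outputs = list(pool.map(map_fn, splits))
    # Implicit barrier: pool.map returns only after *all* map tasks
    # finish, so the reduce phase never sees partial map output.
    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)
    # Reduce tasks can also run in parallel, one per distinct key.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        keys = sorted(groups)
        values = pool.map(lambda k: reduce_fn(k, groups[k]), keys)
        return dict(zip(keys, values))
```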
The Hadoop Project
Hadoop Tool Suite
HDFS - Distributed File System
- An open-source implementation of the Google File System
- Data is split into chunks of fixed (configurable) size, 64 MB by default
- Two server types:
  1. Namenode - keeps the metadata
  2. Datanode - stores the data
- Failures are handled through chunk-level replication; 3 replicas: local, same rack, different rack
- Write-once-read-many access pattern; files are append-only
- Optimized for large files and sequential reads
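The replica-placement policy stated above can be sketched in Python (a toy illustration; the data-structure shapes and function name are assumptions, not HDFS code):

```python
import random

def place_replicas(writer_node, nodes_by_rack):
    # Toy sketch of the 3-replica policy on the slide:
    # replica 1 on the writer's own node, replica 2 on another
    # node in the same rack, replica 3 on a node in a different rack.
    # nodes_by_rack maps rack id -> list of datanode names (assumed shape;
    # assumes each rack holds at least two nodes and >= 2 racks exist).
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    same_rack = [n for n in nodes_by_rack[local_rack] if n != writer_node]
    other_racks = [n for r, ns in nodes_by_rack.items()
                   if r != local_rack for n in ns]
    return [writer_node, random.choice(same_rack), random.choice(other_racks)]
```

Spreading the third replica to a different rack is what lets a chunk survive the loss of an entire rack, not just a single machine.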
HDFS - Hadoop Distributed File System
HDFS Design
HDFS Design
The Namenode:
- manages the filesystem namespace: the filesystem tree and the metadata for all files and directories in the tree, stored persistently on local disk
- keeps the chunk placement on datanodes, reconstructed from the datanodes when the system starts
- is a single point of failure: if the Namenode fails, the filesystem is no longer usable; the metadata can be stored on a remote disk so that the namespace can be reconstructed after a failure
Datanodes periodically report the list of chunks they store.
The Namenode front page is at http://namenode-name:50070/ and offers basic cluster statistics and filesystem browsing.
HDFS File-Based Data Structures
- SequenceFiles: a data structure for binary key-value pairs
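The idea of a flat file of binary key-value records can be sketched in Python with length-prefixed records (a simplified illustration, not the real SequenceFile on-disk format, which also carries headers, sync markers and compression):

```python
import struct

def write_records(fileobj, pairs):
    # Each record: 4-byte key length, 4-byte value length (big-endian),
    # followed by the raw key bytes and value bytes.
    for key, value in pairs:
        fileobj.write(struct.pack(">II", len(key), len(value)))
        fileobj.write(key)
        fileobj.write(value)

def read_records(fileobj):
    # Yield (key, value) byte pairs until the file is exhausted.
    while True:
        header = fileobj.read(8)
        if len(header) < 8:
            return
        klen, vlen = struct.unpack(">II", header)
        yield fileobj.read(klen), fileobj.read(vlen)
```

Because records are self-delimiting, such a file can be read sequentially without any external index, matching HDFS's preference for large sequential reads.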
HDFS Configuration
- fs.default.name: set to hdfs://localhost/ (when no port is given, the Namenode listens on the default HDFS port, 8020); HDFS clients use this property to work out where the Namenode is running so they can connect to it
- dfs.replication: the chunk replication factor, set to 3 by default
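In a typical Hadoop setup these properties appear in the XML configuration files, conventionally core-site.xml and hdfs-site.xml (a minimal sketch with the values from the slide):

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```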
HDFS Command Line Interface
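Typical interactions with HDFS use the `hadoop fs` subcommands, which mirror familiar Unix file operations (paths here are illustrative):

```
hadoop fs -mkdir /user/alice            # create a directory in HDFS
hadoop fs -put data.txt /user/alice/    # copy a local file into HDFS
hadoop fs -ls /user/alice               # list a directory
hadoop fs -cat /user/alice/data.txt     # print a file's contents
hadoop fs -get /user/alice/data.txt .   # copy a file back to local disk
hadoop fs -rm /user/alice/data.txt      # delete a file
```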
The Hadoop MapReduce Framework
Some Hadoop Terminology
- Job: a program that executes map and reduce processing across a data set
- Task: an execution of a Mapper or a Reducer on a slice of data; also called a Task-In-Progress (TIP)
- Task Attempt: a particular instance of an attempt to execute a task on a machine
Hadoop Internals
- Hadoop uses its own RPC protocol
- Communications are initiated by the TaskTracker nodes via a heartbeat mechanism
- A single JobTracker per cluster accepts job requests from clients
- A job is a Java jar file plus an XML file containing program configuration options; the job client places these files into HDFS and notifies the TaskTrackers where to retrieve the relevant program code
- A single TaskTracker instance runs on each slave node; the TaskTracker forks a separate Java process for each task instance