Challenges for Data Driven Systems

Similar documents
Various Faces of Data Centric Networking and Systems

CISC 7610 Lecture 2b The beginnings of NoSQL

Microsoft Big Data and Hadoop

Introduction to NoSQL Databases

CIB Session 12th NoSQL Databases Structures

Presented by Sunnie S Chung CIS 612

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Stages of Data Processing

Introduction to Graph Databases

DATABASE DESIGN II - 1DL400

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

Big Data Architect.

MapReduce and Friends

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

Embedded Technosolutions

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

Hadoop An Overview. - Socrates CCDH

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

CS-580K/480K Advanced Topics in Cloud Computing. NoSQL Database

A BigData Tour HDFS, Ceph and MapReduce

Big Data with Hadoop Ecosystem

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

CS November 2017

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Certified Big Data and Hadoop Course Curriculum

Non-Relational Databases. Pelle Jakovits

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Chapter 24 NOSQL Databases and Big Data Storage Systems

The Datacenter Needs an Operating System

CS November 2018

COSC 304 Introduction to Database Systems. NoSQL Databases. Dr. Ramon Lawrence University of British Columbia Okanagan

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Databases 2 (VU) ( / )

Introduction to Hadoop and MapReduce

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

CSE 344 JULY 9 TH NOSQL

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Webinar Series TMIP VISION

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

A Review Paper on Big data & Hadoop

Overview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

CS 655 Advanced Topics in Distributed Systems

Distributed Databases: SQL vs NoSQL

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

STATE OF MODERN APPLICATIONS IN THE CLOUD

CompSci 516: Database Systems

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Analytics using Apache Hadoop and Spark with Scala

A Study of NoSQL Database

Next-Generation Cloud Platform

Getting to know. by Michelle Darling August 2013

CSE 444: Database Internals. Lecture 23 Spark

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

Certified Big Data Hadoop and Spark Scala Course Curriculum

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

Large-Scale Web Applications

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Big Data Hadoop Course Content

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Big Data and Cloud Computing

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Big Data Analytics. Rasoul Karimi

DIVING IN: INSIDE THE DATA CENTER

How to survive the Data Deluge: Petabyte scale Cloud Computing

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414

Lecture 11 Hadoop & Spark

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

Database Systems CSE 414

Hadoop Development Introduction

CompSci 516 Database Systems

Oracle Big Data Connectors

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

Big Data Hadoop Stack

Data Analytics with HPC. Data Streaming

Data Partitioning and MapReduce

Distributed File Systems II

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Oracle NoSQL Database Enterprise Edition, Version 18.1

Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis

CSE6331: Cloud Computing

MapReduce, Hadoop and Spark. Bompotas Agorakis

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Distributed Data Store

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

/ Cloud Computing. Recitation 10 March 22nd, 2016

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

PRESENTATION TITLE GOES HERE. Understanding Architectural Trade-offs in Object Storage Technologies

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Transcription:

Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data centric Data as communication token Integration of complex data processing with programming, networking and storage A key vision for future computing 2

Big Data Increase of Storage Capacity Increase of Processing Capacity Availability of Data Hardware and software technologies can manage ocean of data 3 Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data centric Data as communication token Integration of complex data processing with programming, networking and storage A key vision for future computing 4

Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Operations on big data Analytics Realtime Analytics 5 Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Operations on big data Analytics Realtime Analytics 6

Distributed Infrastructure Google AppEngine MS Azure Amazon WS Zookeeper, Chubby Manage Access Processing Pig, Hive, DryadLinq, Java MapReduce (Hadoop, Google MR), Dryad Streaming Haloop Semi- Structured Storage HDFS, GFS, Dynamo HBase, BigTable, Cassandra 7 Distributed Infrastructure Computing + Storage transparently Cloud computing, Web 2.0 Scalability and fault tolerance Distributed servers Amazon EC2, Google App Engine, Elastic, Azure Pricing? Reserved, on-demand, spot, geography System? OS, customisations Sizing? RAM/CPU based on tiered model Storage? Quantity, type Distributed storage Amazon S3 Hadoop Distributed File System (HDFS) Google File System (GFS), BigTable Hbase 8

Challenges Distribute and shard parts over machines Still fast traversal and read to keep related data together Scale out instead scale up Avoid naïve hashing for sharding Do not depend of the number of node But difficult add/remove nodes Trade off data locality, consistency, availability, read/write/search speed, latency etc. Analytics requires both real time and post fact analytics and incremental operation 9 Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Operations on big data Analytics Realtime Analytics 10

Data Model/Indexing Support large data Fast and flexible access to data Operate on distributed infrastructure Is SQL Database sufficient? 11 NoSQL (Schema Free) Database NoSQL database Operate on distributed infrastructure (e.g. Hadoop) Based on key-value pairs (no predefined schema) Fast and flexible Pros: Scalable and fast Cons: Fewer consistency/concurrency guarantees and weaker queries support Implementations MongoDB CouchDB Cassandra Redis BigTable Hibase Hypertable 12

Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Stream processing Operations on big data Analytics Realtime Analytics 13 Distributed Processing Non standard programming models Use of cluster computing No traditional parallel programming models (e.g. MPI) E.g. MapReduce Data (flow) parallel programming (e.g. MapReduce, Dryad/LINQ, CIEL, NAIAD) 14

MapReduce Target problem needs to be parallelisable Split into a set of smaller code (map) Next small piece of code executed in parallel Finally a set of results from map operation get synthesised into a result of the original problem (reduce) 15 CIEL: Dynamic Task Graph Data-dependent control flow CIEL: Execution engine for dynamic task graphs (D. Murray et al. CIEL: a universal execution engine for distributed data-flow computing, NSDI 2011) 16

Stream Data Processing Stream Data Processing Stream: infinite sequence of {tuple, timestamp} pairs Continuous query is result of a query in an unbounded stream Data stream processing emerged from the database community (90 s) Database systems and Data stream systems Database Mostly static data, ad-hoc one-time queries Store and query Data stream Mostly transient data, continuous queries 17 Real-Time Data Departure from traditional static web pages New time-sensitive data is generated continuously Rich connections between entities Challenges: High rate of updates Continuous data mining - Incremental data processing Data consistency 18

Big Data: Technologies Distributed infrastructure Cloud (e.g. Infrastructure as a service) Storage Distributed storage (e.g. Amazon S3) Data model/indexing High-performance schema-free database (e.g. NoSQL DB) Programming Model Distributed processing (e.g. MapReduce) Operations on big data Analytics Realtime Analytics 19 Techniques for Analysis Applying these techniques: larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones Classification Cluster analysis Crowd sourcing Data fusion/integration Data mining Ensemble learning Genetic algorithms Machine learning NLP Neural networks Network analysis Optimisation Pattern recognition Predictive modelling Regression Sentiment analysis Signal processing Spatial analysis Statistics Supervised learning Simulation Time series analysis Unsupervised learning Visualisation 20

Do we need new Algorithms? Can t always store all data Online/streaming algorithms Memory vs. disk becomes critical Algorithms with limited passes N 2 is impossible Approximate algorithms 21 Typical Operation with Big Data Smart sampling of data Reducing original data with maintaining statistical properties Find similar items efficient multidimensional indexing Incremental updating of models support streaming Distributed linear algebra dealing with large sparse matrices Plus usual data mining, machine learning and statistics Supervised (e.g. classification, regression) Non-supervised (e.g. clustering..) 22

Sorting Easy Cases Google 1 trillion items (1PB) sorted in 6 Hours Searching Hashing and distributed search Random split of data to feed M/R operation Not all algorithms are parallelisable 23 More Complex Case: Stream Data Have we seen x before? Rolling average of previous K items Sliding window of traffic volume Hot list most frequent items seen so far Probability start tracking new item Querying data streams Continuous Query 24

Big Graph Data Bipartite graph of appearing phrases in documents Social Networks Airline Graph Gene expression data Protein Interactions [genomebiology.com] 25 How to Process Big Graph Data? Data-Parallel (MapReduce, DryadLINQ) Generalisation of NoSQL can be found in commodity architecture: Large datasets are partitioned across several machines and replicated No efficient random access to data Graph algorithms are not fully parallelisable Parallel DB Tabular format providing ACID properties Allow data to be partitioned and processed in parallel Graph does not map well to tabular format Moden NoSQL Allow flexible structure (e.g. graph) Trinity, Neo4J In-memory graph store for improving latency (e.g. Redis, Scalable Hyperlink Store (SHS)) Expensive for petabyte scale workload 26

Big Graph Data Processing MapReduce is ill-suited for graph processing Many iterations are needed for parallel graph processing Intermediate results at every MapReduce iteration harm performance Graph specific data parallel Multiple iterations needed to explore entire graph Tool Box CC Iterative algorithms common in Machine Learning, graph analysis BFS SSSP 27 Data Centric Networking 28

Data Centric Networking Shift to Content Based Networking Original Internet 70s technology, conversational pipes, end-to-end Now, Internet use (>90%): Content retrieval & Service access Request & Delivery of named data - access content Shift to a content-centric view: Content-awareness and massive storage Existing approach e.g. Publish/Subscribe overlay 29 Content Centric Networking Network delivers content from closest location Integrates a variety of transport mechanisms Integrated caching (short-term memory) Search for related information Verify authenticity and control access 4WARD 2009 30

Delay Tolerant Networks Delay Tolerant Networks (DTN) Network holds data Path existing over time Store and forward paradigm Weak and episodic connectivity - Eventual connectivity Non-Internet-like networks Stochastic mobility Periodic/predictable mobility Exotic links Deep space [40+ min RTT; episodic connectivity] Underwater [acoustics: low capacity, high error rates & latencies] 31 Session 1: Introduction Topic Areas Session 2: Programming in Data Centric Environment Session 3: Processing Models of Large-Scale Graph Data Session 4: Map/Reduce Hands-on Tutorial with EC2 Session 5: Graph Data Processing in Resource Limited Environment + Guest lecture Session 6: Stream Data Processing + Guest lecture Session 7: Data Centric Networking Session 8: Project study presentation 32

Summary R212 course web page: http://www.cl.cam.ac.uk/~ey204/teaching/acs/r212 _2013_2014 Enjoy the course! 33