Big Data Analytics. Rasoul Karimi

Similar documents
Ghislain Fourny. Big Data 5. Column stores

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

CISC 7610 Lecture 2b The beginnings of NoSQL

Typical size of data you deal with on a daily basis

Ghislain Fourny. Big Data 5. Wide column stores

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Data Informatics. Seon Ho Kim, Ph.D.

Introduction to NoSQL

CS November 2018

CS November 2017

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Comparing SQL and NOSQL databases

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Overview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL

CA485 Ray Walshe NoSQL

Presented by Sunnie S Chung CIS 612

Big Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering

CIB Session 12th NoSQL Databases Structures

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

BigTable: A Distributed Storage System for Structured Data

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Fattane Zarrinkalam کارگاه ساالنه آزمایشگاه فناوری وب

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

CS 655 Advanced Topics in Distributed Systems

CISC 7610 Lecture 4 Approaches to multimedia databases. Topics: Document databases Graph databases Metadata Column databases

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Embedded Technosolutions

Webinar Series TMIP VISION

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

A Glimpse of the Hadoop Echosystem

Hadoop An Overview. - Socrates CCDH

Big Data Analytics using Apache Hadoop and Spark with Scala

Column-Family Databases Cassandra and HBase

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Chapter 24 NOSQL Databases and Big Data Storage Systems

Distributed Data Store

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /

50 Must Read Hadoop Interview Questions & Answers

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

Distributed Databases: SQL vs NoSQL

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Challenges for Data Driven Systems

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

DATA MIGRATION METHODOLOGY FROM SQL TO COLUMN ORIENTED DATABASES (HBase)

Microsoft Big Data and Hadoop

Introduction to BigData, Hadoop:-

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

Hadoop Development Introduction

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

COSC 6339 Big Data Analytics. NoSQL (II) HBase. Edgar Gabriel Fall HBase. Column-Oriented data store Distributed designed to serve large tables

Rule 14 Use Databases Appropriately

Advanced Database Technologies NoSQL: Not only SQL

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Distributed File Systems II

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

A Fast and High Throughput SQL Query System for Big Data

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

Non-Relational Databases. Pelle Jakovits

NOSQL DATABASE SYSTEMS: DECISION GUIDANCE AND TRENDS. Big Data Technologies: NoSQL DBMS (Decision Guidance) - SoSe

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Study of NoSQL Database Along With Security Comparison

ADVANCED DATABASES CIS 6930 Dr. Markus Schneider

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

NoSQL Concepts, Techniques & Systems Part 1. Valentina Ivanova IDA, Linköping University

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

A brief history on Hadoop

Big Data Architect.

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Introduction to NoSQL Databases

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414

CISC 7610 Lecture 4 Approaches to multimedia databases. Topics: Graph databases Neo4j syntax and examples Document databases

Database Systems CSE 414

Safe Harbor Statement

Architekturen für die Cloud

Sources. P. J. Sadalage, M Fowler, NoSQL Distilled, Addison Wesley

Cmprssd Intrduction To

Intro Cassandra. Adelaide Big Data Meetup.

HBase... And Lewis Carroll! Twi:er,

VOLTDB + HP VERTICA. page

CompSci 516 Database Systems

Big Data with Hadoop Ecosystem

Introduction to NoSQL by William McKnight

COMP9321 Web Application Engineering

Hadoop. copyright 2011 Trainologic LTD

Importing and Exporting Data Between Hadoop and MySQL

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

CSE-E5430 Scalable Cloud Computing Lecture 9

Database Evolution. DB NoSQL Linked Open Data. L. Vigliano

Evolution of Database Systems

Transcription:

Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1

Outline HBase (Hadoop database) Big Data Analytics 1 / 1

Outline HBase (Hadoop database) Big Data Analytics 1 / 1

Distributed File System (DFS) A DFS (Hadoop, GFS,...) is the opposite of a virtual machine. A virtual machine chops one machine into multiple virtual machines. A distributed file system combines many virtual machines into one virtual machine. It includes a NoSQL database Relational databases are like sports cars, Hadoop is like a freight train. Relational database is for online transaction processing (OLTP), ACID transactions, etc. DSF is good for analyzing massive amounts of data, handling unstructured data, handling very complex analyses, etc. Big Data Analytics 1 / 1

Relational Databases Advantages of relational DB. persistence integration transaction reporting The disadvantage of relational databases is that one object in the program is split into many tables in the data base ( impedance mismatch). Big Data Analytics 2 / 1

Object-Oriented Databases The motivation was to solve the impedance mismatch issue. It could not replace the relational data bases because the integration aspect of relational data bases was very powerful. Therefore, relational databases dominated the market for 20 years (1990-2010). Big Data Analytics 3 / 1

The Rise of Internet Solution 1: centralized solution Problems: cost limited Solution 2: Distributed solution SQL does not work well for big distributed databases. NoSQL databases were innovated for big distributed databases. Big Data Analytics 4 / 1

Characteristics of Non-relational Cluster-friendly Open source Web-oriented Schema-less Big Data Analytics 5 / 1

What is missing in No joint operation No complex transaction No constrains Big Data Analytics 6 / 1

The objectives of horizontal scalability availability high performance Big Data Analytics 7 / 1

When to use Large amount of data Relationships are not important Dealing with growing list of elements The data is not structured or the structure is changing Prototype and fast applications No need for constraints or validation rules in data bases Big Data Analytics 8 / 1

When NOT to use Complex transactions Joint Constrains in data bases Big Data Analytics 9 / 1

Key-value Document Column Graph Big Data Analytics 10 / 1

Key Value stores Key Value stores use the associative array as their fundamental data model. In this model, data is represented as a collection of key value pairs. The key value model is one of the simplest non-trivial data models. Example: { Great Expectations : John, Pride and Prejudice : Alice, Wuthering Heights : Alice } Big Data Analytics 11 / 1

Document Databases Documents are addressed in the database via a unique key that represents that document. Documents can be retrieved by their key content Documents are organized through Collections Tags Non-visible Metadata Directory hierarchies Buckets Big Data Analytics 12 / 1

Elements of Column Databases Column Column Family KeySpace Big Data Analytics 13 / 1

Column Columns are a tuple (a key-value pair) consisting of three elements: Unique name: Used to reference the column Value: The content of the column. Timestamp: The system timestamp used to determine the valid content. Example: { } street: name: street, value: 1234 x street, timestamp: 123456789, city: name: city, value: san francisco, timestamp: 123456789, zip: name: zip, value: 94107, timestamp: 123456789, Big Data Analytics 14 / 1

Column Family A set of column sets that are about related data. It also includes a key-value pair, where the key is mapped to a value that is a column set. In analogy with relational databases, a standard column family is as a table, each key-value pair being a row. Big Data Analytics 15 / 1

KeySpace A keyspace is an object that holds together all column families in a design. It resembles to the schema concept in Relational database management systems Column Family 1: UserProfile = { Cassandra = { emailaddress: casandra@apache.org, age: 20 } TerryCho = { emailaddress: terry.cho@apache.org, gender: male } Cath = { emailaddress: cath@apache.org, age: 20, gender: female, address: Seoul } } Big Data Analytics 16 / 1

KeySpace Column Family 2: Products = { book = { name: introduction to Java, year: 2001 } movie = { name: The Lord of the Rings, director: Peter Jackson } Shirt = { Brand: lacoste, color: blue, price: 100 } } Big Data Analytics 17 / 1

KeySpace < KeyspaceName = Company > < KeysCachedFraction > 0.01 < /KeysCachedFraction > < ColumnFamilyCompareWith = UTF 8Type Name = UserProfiles / > < ColumnFamilyCompareWith = UTF 8Type Name = Products / > < /Keyspace > Big Data Analytics 18 / 1

Aggregation in Column Databases As an example, a relational table could consist of the columns UID, first name, surname, birthdate, gender. Suppose that we are searching people who are male and were born between 1950 and 1960. In the relational database, all the table has to be read. In Column datasets, we only read the two columns. The Columns in a family are stored together in a low level storage file known. Big Data Analytics 19 / 1

Graph Database Key-value, Document, and Column databases are aggregate databases. Graph databases are not an aggregate database. Graph databases are used for the data that is not aggregated and there are many relationships Graph databases are schema-less like aggregate databases. Graph databases support transactions. Big Data Analytics 20 / 1

Summary of Databases Aggregate databases (Key-value, Document, and Column). Graph databases Relational Databases Big Data Analytics 21 / 1

ACID transactions In relational databases, ACID transactions guarantee that the database is consistent. Aggregate databases do not have ACID transactions. Graph databases have ACID transactions. In aggregate databases, time stamps are used to resolve the conflicts. Big Data Analytics 22 / 1

Big data and consistency Replication leads to more availability More nodes are able to execute requests If one nodes goes down, other nodes can still process requests But it also causes more problems in consistence Consistence or availability? which one is more important? Big Data Analytics 23 / 1

Example There is an international booking website with several servers all over the world. Two persons from different continents request a particular room of a hotel in the same time. Two servers communicate and in the end the room is assigned to one of the persons. Big Data Analytics 24 / 1

Example What about if the connection between the servers go down? If consistency is more important, both requests are rejected. If availability is more important, both requests are accepted. The choice between availability and consistency depends on the business rules of the application. Big Data Analytics 25 / 1

CAP theorem It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency (all nodes see the same data at the same time) Availability (a guarantee that every request receives a response about whether it was successful or failed) Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system) Big Data Analytics 26 / 1

Consistency vs. Availability The choice is not binary, but a spectrum. In practice, the trade off is not always between consistency and availability or partitioning, but consistency and response time. Big Data Analytics 27 / 1

The reasons of going for NoSQL databases Big Data Easy and rapid development The importance of integration issue has deceased due to service-oriented programming. Analytics Big Data Analytics 28 / 1

Challenges of NoSQL databases Organizational change Immaturity Eventual consistency SQL database will still play important role in databases at least for one decade. Big Data Analytics 29 / 1

HBase (Hadoop database) Outline HBase (Hadoop database) Big Data Analytics 30 / 1

HBase (Hadoop database) HBase (Hadoop database) it is a distributed column-oriented database, which lays between HDFS and MapReduce. It is open-source implementation of Google s Big Table database. has some built-in features such as Scalability Versioning Compression Garbage collection Fault tolerance Big Data Analytics 30 / 1

HBase (Hadoop database) The architecture of HBase Big Data Analytics 31 / 1

HBase (Hadoop database) HMaster Performing Administration Managing and Monitoring the Cluster Assigning Regions to the Region Servers Controlling the Load Balancing and Failover Big Data Analytics 32 / 1

HBase (Hadoop database) HRegionServer Hosting and managing Regions Splitting the Regions automatically Handling the read/write requests Communicating with the Clients directly Big Data Analytics 33 / 1

HBase (Hadoop database) HRegionServer Each Region Server contains a Log (called HLog) and multiple Regions. Each Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The data lives in these StoreFiles in the form of Column Families. The MemStore holds in-memory modifications to the Store (data). Big Data Analytics 34 / 1

HBase (Hadoop database) Column vs. row-oriented database Big Data Analytics 35 / 1

HBase (Hadoop database) Characteristics of Row-oriented database Data is stored and retrieved one row at a time and hence could read unnecessary data if only some of the data in a row is required. Easy to read and write records Well suited for OLTP systems Not efficient in performing operations applicable to the entire dataset and hence aggregation is an expensive operation Typical compression mechanisms provide less effective results than those on column-oriented data stores In the hard disk, each row is stored together. Big Data Analytics 36 / 1

HBase (Hadoop database) Characteristics of column-oriented database Data is stored and retrieved in columns and hence can read only relevant data if only some data is required Read and Write are typically slower operations Well suited for OLAP systems Can efficiently perform operations applicable to the entire dataset and hence enables aggregation over many rows and columns In the hard disk, each column is stored together. Big Data Analytics 37 / 1

HBase (Hadoop database) RDBMS vs. HBase HBase is schema-less, but RDBMS is based on a fixed schema HBase is for big data (million columns, billion rows), but RDBMS was not initially created to store big data HBase stores denormalized data, but RDBMS stores normalized data HBase supports automatic partitioning, but RDBMS does not. Big Data Analytics 38 / 1

HBase (Hadoop database) HBase vs. HDFS HBase is is built for Low Latency operations, but HDFS is suited for High Latency operations batch processing HBase Data is accessed through shell commands, Client APIs in Java, etc., but in HDFS Data is primarily accessed through MapReduce HBase Provides access to single rows from billions of records, but HDFS is designed for batch processing and hence doesn t have a concept of random reads/writes Big Data Analytics 39 / 1

HBase (Hadoop database) Powered by HBase Facebook uses HBase to power their Messages infrastructure. Twitter runs HBase across its entire Hadoop cluster Yahoo! uses HBase to store document fingerprint. Adobe use HBase in several areas from social services to structured data and processing for internal use. Big Data Analytics 40 / 1