Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0


1 Evolution of the Logging Service: Hands-on Hadoop Proof of Concept for CALS-2.0. Chris Roderick, Marcin Sobieszek, Piotr Sowinski, Nikolay Tsvetkov, Jakub Wozniak. Courtesy of IT-DB.

2 Agenda: Intro to the CALS System; Hadoop Ecosystem / IT Support at CERN; Data Formats; Proof of Concept; Data Ingestion; Data Extraction; Performance Results; Demo; Challenges / Benefits / Conclusions; Questions? 5/20/2016 BE-CO-DS 2

3 (Short) Introduction to CALS

4 CERN Accelerator Logging Service. Started in 2001 by R. Billen & M. Peryt; continued by C. Roderick, M. Gourber-Pace, G. Kruk and others; later taken over by J. Wozniak. Mandate: information for accelerator performance improvement; decision support system for management; avoids duplicate logging efforts.

5 CALS Architecture. [Diagram] Extraction layer: Timber and other apps on top of extraction servers. Persistence layer: MDB and LDB (Oracle). Providers layer: logging processes fed by middleware, WinCC (PVSS), Scadar and QPSR.

6 CALS in Numbers. MDB: 8,200 devices, 16,000 properties, 260,000 variables. LDB: 1,482,000 variables. Data points (dp): 5,000,000,000 dp/day (1.6E12 dp/year); 6,000,000 extraction requests per day. Storage: MDB -> ~40 TB total, 3 months of data (~700 GB daily); LDB -> ~430 TB total (~570 GB daily).

7 Storage Evolution. [Chart: size in GB/day over time.] The system was designed for 1 TB/year.

8 CALS-2.0 Motivations. A changing landscape brings new challenges: make analysis of bigger data sets over longer time windows possible; provide analytical functionality; increase bandwidth (16-machine cluster => 20x); use the right tools for the job (the BigData toolset), which is very difficult with the current Oracle setup as it is stretched to its very limits; limit external logging efforts; allow better API integration with the outside community (Python); limit the extensive Oracle expertise currently required; limit costs (by around 30%) with commodity hardware; renovate the aging system to meet evolving requirements; improve configuration / maintenance / system monitoring.

9 CALS-2.0 Proof of Concept. Team of roughly people, with help from IT-DB-SAS (many thanks!). Work ran from early February 2016 until early May 2016 (~3 months). Mandate: explore and learn the Hadoop technology; implement a PoC for CALS-2.0 (store, extract); present the results to the community.

10 Hadoop

11 Hadoop. Open-source framework for large-scale data processing: distributed storage and processing; shared-nothing architecture that scales horizontally; optimized for high throughput on sequential data access. [Diagram: nodes with CPU, memory and disks connected by an interconnect network.]

12 Hadoop. Good for: parallel processing of large amounts of data; analytics at big scale; dealing with diverse data (structured, semi-structured, unstructured). But not optimal for: random reads and real-time access; small datasets; updates / appends.

13 Hadoop Service in IT. Joint work of IT-DB and IT-ST: set up and run the infrastructure; provide consultancy; build the community (as per Oracle support).

14 Hadoop Clusters in IT (Oct 2015). lxhadoop (22 nodes): general-purpose cluster (mainly used by ATLAS), stable software setup, recent hardware. analytix (56 nodes): for analysis of monitoring data, varied hardware specifications, the biggest in terms of number of nodes. hadalytic (14 nodes, 224 cores, 768 GB of RAM): general-purpose cluster with additional services, recent hardware.


16 Hadoop Ecosystem. HDFS: Hadoop Distributed File System. YARN: cluster resource manager. MapReduce: large-scale data processing. Spark. Hive: SQL. Impala: SQL. Pig: scripting. HBase: NoSQL columnar store. Sqoop: data exchange with RDBMS. Flume: log data collector. Zookeeper: coordination.

17 Data Formats

18 File Formats: Apache Avro. Row-oriented; compact, fast, binary serialization format; rich data structures (scalars, arrays, maps, structs, rows, etc.); can use compression.

19 File Formats: Apache Parquet. Based on the Google Dremel white paper. Columnar storage with very efficient compression algorithms: delta encodings, binary (bit) packing, dictionary encoding. Pushdowns: projection pushdown and predicate pushdown give very efficient reads (avoid reading unwanted data).

20 Parquet. [Diagrams: columnar storage and pushdowns.]

21 Parquet vs. Avro, Our Experience. Entropy in the data -> better compression. Avro file: 6 GB; the Parquet file holding the same data was more than 4x smaller on disk. Reads from Parquet are 4x-5x faster (Spark, with column projection). Our PoC scenario: data comes in as Avro; a data compactor converts it to Parquet, creating big files from many small ones; the Parquet files are then read by Impala / Spark.
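The compaction step above can be sketched with Spark's DataFrame API. This is a minimal sketch, not the PoC implementation: it assumes Spark 1.x with the spark-avro package on the classpath, and all paths and names are illustrative.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class AvroToParquetCompactor {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cals-compactor");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Read all small Avro files collected for one partition day
        DataFrame day = sqlContext.read()
            .format("com.databricks.spark.avro")   // spark-avro package (Spark 1.x era)
            .load("/user/cals/avro/BCTDC/2/Acquisition/2016-05-20");

        // coalesce(1) merges the many small inputs into one big Parquet file
        day.coalesce(1)
           .write()
           .parquet("/user/cals/parquet/BCTDC/2/Acquisition/2016-05-20");

        sc.stop();
    }
}
```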

22 PoC (Lambda) Architecture. [Diagram] Logging processes feed a data ingestion layer (Kafka? Flume? Gobblin? push vs. pull?), which writes to a speed layer (HBase) and, through a compactor, to a batch layer on HDFS (storage?). Data extraction (Spark? Impala?) sits on top, with a schema/partition provider backed by the CCDB. The slide also shows how to write an Avro file to HDFS (createAvroSchema and createRecord are helper methods from the slide):

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// 1 - create HDFS file system
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(conf);

// 2 - create Avro schema
Schema schema = createAvroSchema();

// 3 - create Avro file writer
DataFileWriter<GenericRecord> dataFileWriter =
    new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
FSDataOutputStream outputStream = fileSystem.create(new Path("/user/cals/data.tmp"));
dataFileWriter.create(schema, outputStream);

// 4 - create record
GenericRecord record = createRecord(..., schema);

// 5 - write record
dataFileWriter.append(record);
```

23 Data Ingestion

24 Data Ingestion Objectives. Acquire data from different data sources and store it in persistent storage. Data latency <~30 s (from being published to being available to users). No data losses possible: data must be kept in the ingestion layer (for some time) in case of storage-layer unavailability (i.e. maintenance). Provide data transformation features: enhancing by adding context info (e.g. beam mode); filtering out (distinct until changed, sampling, etc.).
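The "distinct until changed" filter mentioned above can be sketched in a few lines of plain Java. The class name and its in-memory map are illustrative, not the PoC implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class DistinctUntilChanged {
    private final Map<String, Object> lastSeen = new HashMap<>();

    /** Keep the sample only if its value differs from the last one stored for this device. */
    public boolean keep(String device, Object value) {
        Object previous = lastSeen.put(device, value);
        return !Objects.equals(previous, value);
    }

    public static void main(String[] args) {
        DistinctUntilChanged filter = new DistinctUntilChanged();
        System.out.println(filter.keep("LHC.BCT/Acquisition", 42)); // true: first sample
        System.out.println(filter.keep("LHC.BCT/Acquisition", 42)); // false: unchanged, filtered out
        System.out.println(filter.keep("LHC.BCT/Acquisition", 43)); // true: value changed
    }
}
```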

25 Ingestion Architecture. [Diagram: logging processes publish every 100 ms to Kafka (speed path); Gobblin pulls every 30 s into HBase and every 7 min into HDFS (batch path), where the compactor runs; a schema/partition provider is backed by the CCDB; storage choice still open.]

26 Data Collection Overview. 1. Acquire data from JAPC. 2. Convert to Avro: get the schema for the given device and property from the schema provider (CCDB). 3. Serialize and send asynchronously to Kafka (100 ms batches). [Diagram: acquire -> transform -> publish to Kafka.]

27 APV -> Avro Conversion. [Diagram: APV data converted to an Avro record using an Avro schema.]

28 Apache Kafka on 2 slides. A distributed, partitioned, replicated message broker: high throughput, low latency, scalable, centralized, awesome. Originally developed at LinkedIn in 2011; graduated from the Apache incubator.

29 Apache Kafka on 2 slides. Producers, brokers, topics, consumers. Topics are broken into (replicated) partitions. Messages are assigned a sequential ID called the offset. Messages are retained with a configurable SLA and stored on the file system. Optimized OS operations: page cache, sendfile(), zero copy. Replication of partitions as the default design approach guarantees fault tolerance.

30 CALS-2.0 Kafka Setup. Two Kafka brokers organized in one cluster using the broadly available Hadoop Zookeeper. Each holds 7 topics with 10 partitions; one topic maps to one logging process. Each topic is created with replication factor = 2. Each device/property is always stored in the same partition. Producers send messages in async mode with 100 ms batches.
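A producer configuration matching this setup might look as follows. This is a hedged sketch: the broker hostnames and serializer choices are assumptions; `linger.ms` implements the 100 ms batching, and keying by device/property is what pins each device/property to one partition.

```java
import java.util.Properties;

public class CalsProducerConfig {
    public static Properties create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092"); // the two brokers (hostnames assumed)
        props.put("acks", "all");       // wait for both replicas (replication factor = 2)
        props.put("linger.ms", "100");  // batch messages for up to 100 ms before sending
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        return props;
    }
}
```

Usage would require the kafka-clients library, roughly: `new KafkaProducer<String, byte[]>(CalsProducerConfig.create())`, sending records keyed by the device/property name.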

31 Ingestion Architecture. [Diagram: as before, with Gobblin pull intervals of 1 min into HBase and 7 min into HDFS.]

32 LinkedIn Gobblin. Universal data ingestion / ETL framework, composed of: Source (intelligent task-partition assignments into work units); Extractor (pulls data from the source); Converter (filtering, projection, type conversion, etc.); Quality Checker (schema compatibility, unique keys); Writer (one per task/work unit); Data Publisher (moves data to final directories). [Diagram: work units flow as tasks through extractor, converter, quality checker and writer into the data publisher.]

33 CALS-2.0 Gobblin Setup. 2 standalone instances, each running 7 jobs pulling data from a dedicated topic. Each job is composed of 10 tasks (one per partition). Custom extractor to convert Kafka records to Avro. Custom writers to write data to HBase and HDFS. Custom data partitioners for: HDFS class/version/property/yyyy-mm-dd directories; HBase class_version_property tables.

34 CALS-2.0 Data Partitioning. Data partitioned by class/version/property/yyyy-mm-dd. Pros: convenient for gathering data statistics (e.g. used space per client/system); convenient to move/backup/restore on demand; more optimal for scanning (less data to process). Cons: we need the history of changes for a given device to know the proper location of its data over time.
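The class/version/property/yyyy-mm-dd layout can be illustrated with a small path-building helper. The base directory and all names here are assumptions for illustration:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class PartitionPath {
    /** Build the HDFS directory for a record, following the
     *  class/version/property/yyyy-mm-dd layout described above. */
    public static String of(String deviceClass, int version, String property, Instant timestamp) {
        LocalDate day = timestamp.atZone(ZoneOffset.UTC).toLocalDate(); // partition by UTC day
        return String.format("/user/cals/data/%s/%d/%s/%s", deviceClass, version, property, day);
    }

    public static void main(String[] args) {
        System.out.println(of("BCTDC", 2, "Acquisition", Instant.parse("2016-05-20T10:15:30Z")));
        // → /user/cals/data/BCTDC/2/Acquisition/2016-05-20
    }
}
```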

35 Data Ingestion Summary. The system ran for ~2 months without any problems. Data load between 7k and 10k events/second. Big chunks of data handled smoothly: big records with ~100k fields; big records at 1 GB/min. No data losses observed.

36 Data Extraction

37 Impala

38 What Is Cloudera Impala? MPP query engine running on Apache Hadoop. Low-latency SQL queries on data in HDFS (Parquet/Avro) and Apache HBase. For big data processing and analytics, directly via SQL or business intelligence tools. Supports the most common SQL-92 features of HiveQL. Supports Hadoop security (Kerberos, Sentry). Provides JDBC/ODBC drivers.
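Since Impala speaks the HiveServer2 protocol, a JDBC client can be sketched as below. Host, table and the no-auth setting are assumptions, and the Hive JDBC driver must be on the classpath; 21050 is Impala's default JDBC port.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ImpalaQuery {
    public static void main(String[] args) throws SQLException {
        // HiveServer2-compatible endpoint of an impalad (hostname assumed)
        String url = "jdbc:hive2://hadalytic:21050/default;auth=noSasl";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT stamp, value FROM bctdc_2_acquisition WHERE stamp >= '2016-05-01'")) {
            while (rs.next()) {
                System.out.println(rs.getTimestamp(1) + " " + rs.getDouble(2));
            }
        }
    }
}
```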

39 Extraction API: what we implemented.

40 Extraction API: What Had To Be Done. Getting HDFS and HBase table names for a variable. Creating references to the tables on the fly (including all the partitions). Handling of recent and long-term data (union of different data sources). Basic filtering implementation. Use of analytic and aggregation functions. Effective use of partitioning for query optimization.
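The "union of different data sources" can be illustrated by a query builder that stitches the recent (HBase-backed) and long-term (Parquet-backed) tables together with UNION ALL, split at the compaction boundary. Table and column names are hypothetical:

```java
public class ExtractionQuery {
    /** Build a query spanning the recent and long-term tables for one variable. */
    public static String build(String variable, String from, String to, String boundary) {
        return "SELECT stamp, value FROM " + variable + "_recent "
             + "WHERE stamp >= '" + boundary + "' AND stamp <= '" + to + "' "
             + "UNION ALL "
             + "SELECT stamp, value FROM " + variable + "_longterm "
             + "WHERE stamp >= '" + from + "' AND stamp < '" + boundary + "' "
             + "ORDER BY stamp";
    }

    public static void main(String[] args) {
        System.out.println(build("bctdc_2_acquisition", "2016-04-01", "2016-05-20", "2016-05-13"));
    }
}
```

In a real API the boundary would come from the compactor's bookkeeping rather than being passed in by the caller.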

41 Cache Performance. Test performed: 7.1 GB of data distributed over 14 machines; cache management in HDFS compared with the OS buffer cache. Query execution time between 26 and 29 seconds. Conclusion: CPU-bound (decoding); time for I/O is irrelevant. [Table: operation times in seconds (real execution time, CPU time, decoding across all threads, I/O across all threads) for caching in both, OS only, and HDFS only.]

42 What Is Still Missing in Impala. Support for the array type: the specific implementation results in a cartesian product; it can only be used with tables in the Parquet file format (HBase issue: arrays not implemented); it introduces some overhead (at most a 2x slowdown compared to tables without complex types). Cloudera contacted; development foreseen, but no dates. Outer joins not fully implemented (work on one node only).

43 Next Steps Regarding Impala. Find a solution for arrays / joins. Build supporting batch jobs (statistics, metadata refresh, etc.).

44 Spark

45 What Is Spark? Distributed data processing framework. Easy to use (compared to its predecessor, MapReduce). In-memory and fast (10-100x faster than MapReduce). General purpose: unified platform for different types of data processing jobs; supports several APIs (Scala, Python, Java, R). Scalable: increase the data processing capacity by extending the cluster. Fault tolerant: automatically handles failure of nodes in the cluster.

46 High-level Architecture. A Spark application involves five key entities: driver program, cluster manager, workers, executors and tasks. Cluster managers: Standalone, YARN, Mesos. [Diagram: the driver program talks to the cluster manager, which schedules executors running tasks on worker nodes.]

47 RDD. RDD (Resilient Distributed Dataset) is the main Spark abstraction: a collection of rows partitioned across the nodes of the cluster that can be operated on in parallel. There are two ways of creating an RDD: parallelizing a collection, or reading from an external source (HDFS, HBase, Amazon S3, etc.). RDDs support transformations and actions. [Diagram: tasks operating on RDD partitions across nodes.]
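A minimal sketch of the two creation paths and a transformation/action pair; it assumes an already configured JavaSparkContext, and the file path is illustrative:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExamples {
    static void examples(JavaSparkContext sc) {
        // 1) parallelizing a collection
        JavaRDD<Integer> fromCollection = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // 2) reading from an external source
        JavaRDD<String> fromHdfs = sc.textFile("hdfs:///user/cals/some-file.txt");

        // transformations are lazy; an action such as count() triggers the computation
        long evens = fromCollection.filter(x -> x % 2 == 0).count();
        System.out.println(evens);
    }
}
```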

48 Spark SQL and DataFrame API. Higher-level abstraction for processing structured data. Makes Spark easier to use by providing a SQL interface. Increases the performance of Spark applications. [Diagram: DataFrames and SQL/HiveQL sit on Spark SQL, alongside Spark Streaming and MLlib, all on top of the RDD API and Spark Core, reading from data sources.]

49 Spark in CALS-2.0. Data compaction. Data extraction: implemented representative CALS methods (as with Impala); support for complex types.

Spark SQL:

```java
DataFrame df = sqlContext.read().load("examples/src/main/resources/users.parquet");
df.registerTempTable("people");
DataFrame names = sqlContext.sql("SELECT name FROM people WHERE age > 21");
Row[] result = names.collect();
```

DataFrames:

```java
DataFrame df = sqlContext.read().load("examples/src/main/resources/users.parquet");
DataFrame res = df.filter(df.col("age").gt(21)).select(df.col("name"));
List<Row> resultList = res.collectAsList();
```

50 Data Analysis in CALS-2.0. Spark opens many doors for data analysis over logging data: it could be used for direct access to the HDFS data; supports Python, R and Scala; is a possible replacement of the Logging API; and provides access through open-source data analysis notebooks. Spark notebooks: web interface with built-in Spark integration; data visualisation (tables, charts, etc.); dynamic input forms and data widgets; support for collaborative work and publishing results online.

51 Jupyter Notebook

52 Apache Zeppelin Notebook

53 Extraction Performance Tests

54 Performance Tests Setup. Real data distribution: KB/day -> 35% of data logged; MB/day -> 60%; GB/day -> 5%. Generated data: GB/day, 10 devices with 1.2 s updates -> 263 million rows/year; 1 day = 1.36 GB, 0.5 TB of total data per year. Shared hadalytic cluster: Impala (max ~224 cores, adaptive); Spark (120 cores, fixed). No in-depth optimization done so far.

55 Real Data Scan (MB/day). [Chart: time in seconds vs. number of records, for Oracle, Impala and Spark.]

56 1 Year Scan on 0.5 TB Demo Data. [Chart: time in seconds vs. number of records, for Oracle, Impala and Spark.]

57 Real Data Arrays Scan. [Chart: time in seconds vs. number of records, for Oracle and Spark.]

58 Performance Results. Hadoop outperforms Oracle for BigData queries. Improvement needed for SmallData queries (HBase?). Spark seems a little bit better than Impala. Arrays / structs are not a problem (with Spark).

59 Demo

60 Challenges / Benefits / Conclusions

61 Hadoop Challenges. Changing ecosystem: new products being actively developed; missing documentation; Kudu for all (no hybrid lambda architecture)? Impala's unimplemented features. Micro-service / hybrid architecture: solid monitoring required. Kerberos issues.

62 CALS Challenges. Data migration from the old CALS. WinCC modelling, as our biggest client. CCDB history required for tracking changes. Configuration tools. Redesign of the Extraction API.

63 Benefits. Opens the door to BigData queries, tools & techniques. More generic & accommodating architecture: avoids many scattered logging systems. Spark analytics (Python, machine learning, R): MD/faults/performance/decision analysis; share results with others via Hadoop. Horizontal scaling to cover ever-growing needs. Cost & maintenance reduction. Attractive technology -> easier to recruit for.

64 Conclusion / Next Steps. It is technically possible to replace the old system while satisfying the known requirements better. Next steps: gather additional input; formal management discussion; if accepted, have it ready by LS2.


More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

Hadoop. Introduction to BIGDATA and HADOOP

Hadoop. Introduction to BIGDATA and HADOOP Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL

More information

Techno Expert Solutions An institute for specialized studies!

Techno Expert Solutions An institute for specialized studies! Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Time Series Storage with Apache Kudu (incubating)

Time Series Storage with Apache Kudu (incubating) Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry

More information

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG Prof R.Angelin Preethi #1 and Prof J.Elavarasi *2 # Department of Computer Science, Kamban College of Arts and Science for Women, TamilNadu,

More information

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Cy Erbay Senior Director Striim Executive Summary Striim is Uniquely Qualified to Solve the Challenges of Real-Time

More information

Oracle Big Data Fundamentals Ed 2

Oracle Big Data Fundamentals Ed 2 Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies

More information

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

Principal Software Engineer Red Hat Emerging Technology June 24, 2015 USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging

More information

Backtesting with Spark

Backtesting with Spark Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O

More information

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable

More information

Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd + MongoDB + Spark = Awesome Sauce Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Hadoop course content

Hadoop course content course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail

More information

Oracle Big Data Fundamentals Ed 1

Oracle Big Data Fundamentals Ed 1 Oracle University Contact Us: +0097143909050 Oracle Big Data Fundamentals Ed 1 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big Data

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

Technical Sheet NITRODB Time-Series Database

Technical Sheet NITRODB Time-Series Database Technical Sheet NITRODB Time-Series Database 10X Performance, 1/10th the Cost INTRODUCTION "#$#!%&''$!! NITRODB is an Apache Spark Based Time Series Database built to store and analyze 100s of terabytes

More information

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES 1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

Data Analytics Job Guarantee Program

Data Analytics Job Guarantee Program Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations

More information

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Shark: Hive (SQL) on Spark

Shark: Hive (SQL) on Spark Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce

More information

Big Data Infrastructure at Spotify

Big Data Infrastructure at Spotify Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Data Lake Based Systems that Work

Data Lake Based Systems that Work Data Lake Based Systems that Work There are many article and blogs about what works and what does not work when trying to build out a data lake and reporting system. At DesignMind, we have developed a

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016 Khadija Souissi Auf z Systems 07. 08. November 2016 @ IBM z Systems Mainframe Event 2016 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.

More information

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI / Index A ACID, 251 Actor model Akka installation, 44 Akka logos, 41 OOP vs. actors, 42 43 thread-based concurrency, 42 Agents server, 140, 251 Aggregation techniques materialized views, 216 probabilistic

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

HDInsight > Hadoop. October 12, 2017

HDInsight > Hadoop. October 12, 2017 HDInsight > Hadoop October 12, 2017 2 Introduction Mark Hudson >20 years mixing technology with data >10 years with CapTech Microsoft Certified IT Professional Business Intelligence Member of the Richmond

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

10 Million Smart Meter Data with Apache HBase

10 Million Smart Meter Data with Apache HBase 10 Million Smart Meter Data with Apache HBase 5/31/2017 OSS Solution Center Hitachi, Ltd. Masahiro Ito OSS Summit Japan 2017 Who am I? Masahiro Ito ( 伊藤雅博 ) Software Engineer at Hitachi, Ltd. Focus on

More information

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing

More information

Oracle GoldenGate for Big Data

Oracle GoldenGate for Big Data Oracle GoldenGate for Big Data The Oracle GoldenGate for Big Data 12c product streams transactional data into big data systems in real time, without impacting the performance of source systems. It streamlines

More information

Apache Kylin. OLAP on Hadoop

Apache Kylin. OLAP on Hadoop Apache Kylin OLAP on Hadoop Agenda What s Apache Kylin? Tech Highlights Performance Roadmap Q & A http://kylin.io What s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information