UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

Similar documents
Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio

Warehouse- Scale Computing and the BDAS Stack

Accelerate Big Data Insights

Bringing Data to Life

An Introduction to Apache Spark

The Evolution of Big Data Platforms and Data Science

Fast Big Data Analytics with Spark on Tachyon

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

WHITEPAPER. MemSQL Enterprise Feature List

DATA SCIENCE USING SPARK: AN INTRODUCTION

Spark, Shark and Spark Streaming Introduction

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Cloud Computing & Visualization

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Security and Performance advances with Oracle Big Data SQL

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

Flash Storage Complementing a Data Lake for Real-Time Insight

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Oracle Big Data Fundamentals Ed 2

Lambda Architecture for Batch and Stream Processing. October 2018

Microsoft Big Data and Hadoop

Scalable Tools - Part I Introduction to Scalable Tools

Introduction to Big-Data

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Apache Spark 2.0. Matei

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data and Object Storage

Big data systems 12/8/17

Apache Spark 2 X Cookbook Cloud Ready Recipes For Analytics And Data Science

Benchmarks Prove the Value of an Analytical Database for Big Data

Optimizing Apache Spark with Memory1. July Page 1 of 14

Digital Enterprise Platform for Live Business. Kevin Liu SAP Greater China, Vice President General Manager of Big Data and Platform BU

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

5 Fundamental Strategies for Building a Data-centered Data Center

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Hadoop. Introduction / Overview

Kinetic Open Storage Platform: Enabling Break-through Economics in Scale-out Object Storage PRESENTATION TITLE GOES HERE Ali Fenn & James Hughes

The Datacenter Needs an Operating System

Apache Spark Graph Performance with Memory1. February Page 1 of 13

BIG DATA TESTING: A UNIFIED VIEW

Apache Flink: Distributed Stream Data Processing

CSE 444: Database Internals. Lecture 23 Spark

Innovatus Technologies

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Cisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Analyze Big Data Faster and Store It Cheaper

A Survey on Big Data

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies

Modern Data Warehouse The New Approach to Azure BI

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage

Combine Native SQL Flexibility with SAP HANA Platform Performance and Tools

Microsoft Perform Data Engineering on Microsoft Azure HDInsight.

The Technology of the Business Data Lake. Appendix

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

Data Platforms and Pattern Mining

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Getting Started with Spark

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Modernizing Business Intelligence and Analytics

Integrating Splunk with AWS services:

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Case Study: Tata Communications Delivering a Truly Interactive Business Intelligence Experience on a Large Multi-Tenant Hadoop Cluster

Databricks, an Introduction

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

Overview of Data Services and Streaming Data Solution with Azure

Improved Solutions for I/O Provisioning and Application Acceleration

Cloud Analytics and Business Intelligence on AWS

Cloudline Autonomous Driving Solutions. Accelerating insights through a new generation of Data and Analytics October, 2018

powered by Cloudian and Veritas

Learn-Memorize-Recall-Reduce: A Robotic Cloud Computing Paradigm

Approaching the Petabyte Analytic Database: What I learned

Time Series Storage with Apache Kudu (incubating)

New Approach to Unstructured Data

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1

Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms

Data Analytics and Storage System (DASS) Mixing POSIX and Hadoop Architectures. 13 November 2016

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet

Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_

Next Generation Storage for The Software-Defned World

Oracle Exadata: The World s Fastest Database Machine

Backtesting with Spark

Capture Business Opportunities from Systems of Record and Systems of Innovation

Apache Kylin. OLAP on Hadoop

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

April Copyright 2013 Cloudera Inc. All rights reserved.

Transcription:

UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017

HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in early 2016 Open Sourced in 2013 Apache License 2.0 Latest Stable Release: Alluxio 1.4.0 Alluxio 1.5.0 Planned For Q2, 2017 2

BIG DATA ECOSYSTEM YESTERDAY 3

BIG DATA ECOSYSTEM TODAY 3

BIG DATA ECOSYSTEM ISSUES 3

BIG DATA ECOSYSTEM WITH ALLUXIO Native File System Hadoop Compatible File System Native Key-Value Interface FUSE Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface 3

BIG DATA ECOSYSTEM WITH ALLUXIO Native File System Hadoop Compatible File System Native Key-Value Interface FUSE Compatible File System Enabling Application to Access Data from any Storage System at Memory-speed HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface 3

4

5

FASTEST-GROWING BIG DATA PROJECT 6

FASTEST-GROWING BIG DATA PROJECT Formerly named Tachyon, born in the AMPLab 500+ contributors from 100+ organizations Running world s largest production clusters 6

WHY ALLUXIO Co-located compute and data with memory-speed access to data Virtualized across different storage systems under a unified namespace Scale-out architecture File system API, software only 7

ALLUXIO BENEFITS Unification Performance Flexibility New workflows across any data in any storage system Orders of magnitude improvement in run time Choice in compute and storage grow each independently, buy only what is needed 8

ALLUXIO DEPLOYMENTS 9

ALLUXIO USE CASES Managing data across disparate storage systems On-Demand Analytics & Accelerating I/O to and from remote storage Sharing data across workloads at memory speed 10

MANAGE DATA ACROSS STORAGE SYSTEMS Qunar uses real-time machine learning for their website ads ALLUXIO 200+ nodes deployment 6 billion logs (4.5 TB) daily Mix of Memory + HDD We ve been running in production for over 9 months, Alluxio s enabled different applications & frameworks to easily interact with data from different storage systems RESULTS Data sharing among Spark Streaming, Spark batch and Flink jobs provide efficient data sharing Improved the performance of their system with 15x 300x speedups Tiered storage feature manages storage resources including memory, SSD and disk 11

ON-DEMAND ANALYTICS & ACCELERATE I/O TO/FROM REMOTE STORAGE PMs run interactive queries to gain insights into their products & business ALLUXIO Baidu File System 200+ nodes deployment 2+ petabytes of storage Mix of memory + HDD The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. RESULTS Data queries are now 30x faster with Alluxio Alluxio cluster runs stably, providing over 50TB of RAM space By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds 12

SHARE DATA ACROSS JOBS @ MEMORY SPEED Barclays uses query & machine learning to train models for risk management 6 node deployment 1TB of storage Memory only ALLUXIO Relational Database: Teradata Thanks to Alluxio, we now have the raw data immediately available at every iteration & can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. RESULTS Barclays workflow iteration time decreased from hours to seconds Alluxio enabled workflows that were impossible before By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds 13

Thank you! Contact: {haoyuan}@alluxio.com or info@alluxio.com 14