UNIFY DATA AT MEMORY SPEED
Haoyuan (HY) Li, CEO @ Alluxio Inc.
VAULT Conference 2017
March 2017
HISTORY
Started at UC Berkeley AMPLab in summer 2012
Originally named Tachyon; rebranded to Alluxio in early 2016
Open sourced in 2013 under Apache License 2.0
Latest stable release: Alluxio 1.4.0
Alluxio 1.5.0 planned for Q2 2017
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Enabling applications to access data from any storage system at memory speed
Interfaces for applications: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System
Under-storage interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface
FASTEST-GROWING BIG DATA PROJECT
Formerly named Tachyon, born in the AMPLab
500+ contributors from 100+ organizations
Running the world's largest production clusters
WHY ALLUXIO
Co-located compute and data with memory-speed access to data
Virtualized across different storage systems under a unified namespace
Scale-out architecture
File system API, software only
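The "unified namespace" point can be illustrated with a small sketch. This is not Alluxio's implementation or API; `MountTable`, its methods, and the example URIs are hypothetical names chosen only to show the idea of mapping one logical path space onto several storage systems by longest-prefix match.

```python
from dataclasses import dataclass, field


@dataclass
class MountTable:
    """Hypothetical unified namespace: logical prefixes map to storage URIs."""
    mounts: dict = field(default_factory=dict)

    def mount(self, prefix: str, under_uri: str) -> None:
        # Normalize so "/s3/" and "/s3" are treated as the same mount point.
        self.mounts[prefix.rstrip("/") or "/"] = under_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Pick the longest mount prefix that covers the requested path.
        best = None
        for prefix in self.mounts:
            covers = path == prefix or path.startswith(prefix.rstrip("/") + "/")
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {path}")
        # The root mount keeps the whole path; other mounts drop their prefix.
        rest = path if best == "/" else path[len(best):]
        return self.mounts[best] + rest


# Example URIs are placeholders, not real endpoints.
table = MountTable()
table.mount("/", "hdfs://namenode:9000")
table.mount("/s3", "s3://example-bucket/data")

print(table.resolve("/warehouse/events.parquet"))
print(table.resolve("/s3/logs/2017-03-01.log"))
```

An application sees a single path space, while each path resolves to whichever storage system is mounted at the longest matching prefix.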
ALLUXIO BENEFITS
Unification: New workflows across any data in any storage system
Performance: Orders-of-magnitude improvement in run time
Flexibility: Choice in compute and storage; grow each independently, buy only what is needed
ALLUXIO DEPLOYMENTS
ALLUXIO USE CASES
Managing data across disparate storage systems
On-demand analytics & accelerating I/O to and from remote storage
Sharing data across workloads at memory speed
MANAGE DATA ACROSS STORAGE SYSTEMS
Qunar uses real-time machine learning for their website ads
[Diagram: ALLUXIO]
200+ node deployment
6 billion logs (4.5 TB) daily
Mix of memory + HDD
"We've been running in production for over 9 months. Alluxio's enabled different applications & frameworks to easily interact with data from different storage systems."
RESULTS
Efficient data sharing among Spark Streaming, Spark batch, and Flink jobs
Improved the performance of their system with 15x to 300x speedups
Tiered storage feature manages storage resources including memory, SSD, and disk
ON-DEMAND ANALYTICS & ACCELERATE I/O TO/FROM REMOTE STORAGE
PMs run interactive queries to gain insights into their products & business
[Diagram: ALLUXIO, Baidu File System]
200+ node deployment
2+ petabytes of storage
Mix of memory + HDD
"The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds."
RESULTS
Data queries are now 30x faster with Alluxio
Alluxio cluster runs stably, providing over 50 TB of RAM space
By using Alluxio, batch queries usually lasting over 15 minutes were transformed into interactive queries taking less than 30 seconds
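The timings quoted on this slide can be turned into speedup ratios with simple arithmetic. The sketch below uses only the numbers given above; the `speedup` helper is a hypothetical name, and pairing the endpoints of the quoted ranges is an assumption made for illustration.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of the old run time to the new run time."""
    return before_seconds / after_seconds


# Interactive queries: 100-150 s with Spark SQL alone, 10-15 s with Alluxio.
# Pairing the range endpoints this way gives the widest plausible spread.
interactive_low = speedup(100, 15)
interactive_high = speedup(150, 10)

# Batch queries: over 15 minutes before, under 30 seconds after,
# so this ratio is a lower bound.
batch_min = speedup(15 * 60, 30)

print(f"interactive: {interactive_low:.1f}x to {interactive_high:.1f}x")
print(f"batch: at least {batch_min:.0f}x")
```

Under these assumptions the interactive figures work out to roughly 7x to 15x, while the batch figures give at least 30x, which appears to be where the "30x faster" headline comes from.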
SHARE DATA ACROSS JOBS @ MEMORY SPEED
Barclays uses query & machine learning to train models for risk management
[Diagram: ALLUXIO, Relational Database: Teradata]
6-node deployment
1 TB of storage
Memory only
"Thanks to Alluxio, we now have the raw data immediately available at every iteration & can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity."
RESULTS
Barclays' workflow iteration time decreased from hours to seconds
Alluxio enabled workflows that were impossible before
By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Thank you!
Contact: {haoyuan}@alluxio.com or info@alluxio.com