Flash Storage Complementing a Data Lake for Real-Time Insight
- Alice Dickerson
Transcription
1 Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018
2 Agenda
- Delivering insight along the entire spectrum from edge to core
- Coupling big data with fast data: complementing a data lake for real-time insight
- Stream ingestion to ActiveScale object storage system as a component within a data lake
- IoT speed layer implementation
- A real-world IoT use case
3 Evolving Landscape from Core to Edge
4 Evolving Landscape from Core to Edge: consumer platforms & systems; embedded & removable; solid state & hard disk drives
5 Delivering Value to Data from Core to Edge
Enable tight coupling between Big Data and Fast Data.
[Figure labels: Big Data; Fast Data; Data Aggregation; Batch Analytics; Streaming Analytics; Machine Learning; Artificial Intelligence; Modeling; Scale; Performance; Insight; Prediction; Prescription; Mobility; Real-Time Results; Smart Machines]
6 Delivering Value to Data from Core to Edge
Enable data sharing and utilization of system resources across diverse applications.
Past: data held captive by a single application.
Current and future: data pooled and shared by multiple applications.
[Figure labels: create, capture, store, transform, access, deliver]
8 A Unified Workflow for Analytics on the 3rd Platform
(Flash Memory Summit 2018, Santa Clara, CA)
[Figure: Streaming event data from business processes (vehicles, transactions, ...) is captured and feeds three layers.
- Batch Layer (big data): raw data; Hadoop store + MapReduce; batch views.
- Speed Layer (fast data): transformed data; streaming analytics; incremental real-time views.
- Serving Layer: merged views for visualization, machine learning, queries, models and metrics.]
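The merged views in the workflow above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that both views are per-key event counts and that the real-time view covers only events newer than the last batch run; the key names and the additive merge rule are hypothetical and do not come from the slides.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine a batch view with an incremental real-time view.

    Assumption (not from the slides): both views map a key to an event
    count, and the real-time view holds only events that arrived after
    the last batch run, so the counts can simply be added.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds the counts key by key
    return dict(merged)


# Hypothetical per-vehicle event counts.
batch_view = {"vehicle-1": 1200, "vehicle-2": 950}
realtime_view = {"vehicle-2": 7, "vehicle-3": 3}

merged_view = merge_views(batch_view, realtime_view)
print(merged_view)
```

A serving layer built this way answers queries from `merged_view`, so results reflect both the full history and the most recent events.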
9 An Enhanced Workflow for Analytics on the 3rd Platform
ActiveScale object storage system as the repository of all raw data, metadata & process data.
[Figure: Streaming event data from business processes (vehicles, transactions, ...) and other data sources (images, backups, documents, ..., business process models) feed the following components.
- Ingest: Apache Kafka (index/tag/search/ML) on rack-scale flash (shared PCIe all-flash enclosure, IntelliFlash), NVMe SSDs: memory latency, I/O throughput.
- Data Lake: ActiveScale (S3 / NFS) holding all raw data, written through the Apache Kafka S3 connector; data ingest / ML / transformation / process data persistence.
- Batch Layer (big data): Hadoop YARN through the ActiveScale S3A connector; batch views. Other non-MapReduce apps can use ActiveScale.
- Speed Layer (fast data): streaming analytics on transformed data; incremental real-time views; rack-scale flash (shared PCIe all-flash enclosure, IntelliFlash), NVMe SSDs: memory latency, IOPS.
- Serving Layer: merged views (batch view + real-time view); real-time & batch queries, search, machine learning, dashboards; smart analytics.]
10 Aggregated vs. Disaggregated Architectures for Analytics
Aggregated (co-located compute and storage): Applications communicate with an integrated compute and storage tier (HDD/flash) and with a back-end object storage system. Homogeneous across resources.
Disaggregated: Applications communicate with a disaggregated pool of compute, HDDs, shared flash and object storage. Independent scaling of resources. Heterogeneous and interoperable across resources.
[Figure labels: Application Tier; Data Store Tier; Data Network; Pool of compute; Pool of HDDs; Shared pool of flash; Object storage system]
12 Stream Ingestion from Apache Kafka to ActiveScale: Architecture
[Figure: Concurrent clients (Client Node 1 to Client Node 8) send streaming data over GigE connections at 8 GB/s to an Apache Kafka cluster (Kafka Node 1 to Kafka Node 4), with 2 NVMe SSDs in a shared PCIe all-flash enclosure allocated to each Kafka node. An Apache Kafka Connect cluster (Worker 1 to Worker 6, Connector 1 to Connector 192, 32 connectors per worker) writes over GigE connections to ActiveScale (Scaler 1, Scaler 2, Scaler 3).]
For each benchmark run on 8 clients:
- Record size: 500 bytes/message.
- Sends 800 M messages at 8 GB/s.
- Publishes to a Kafka topic, 32 M messages at a time.
For each benchmark run, the Kafka cluster can process 16 M messages/sec at a message size of 500 bytes. The average latency of acknowledgment to the clients is 75 ms/message.
Each connector in the Kafka Connect cluster works as a subscriber to a Kafka topic in the Kafka cluster and sinks the data to ActiveScale. A total of 192 connectors write to ActiveScale and saturate the write throughput of a single ActiveScale system. The achieved average and maximum parallel write throughputs to ActiveScale are ~4.28 GB/s and 6 GB/s, respectively. Throughput is projected to increase by 2x with 12 worker nodes in the Kafka Connect cluster.
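The benchmark figures on this slide are mutually consistent, which a short Python check makes explicit. This is a sketch under stated assumptions: gigabytes are taken as decimal (10^9 bytes), the worker count of 6 is read from the figure labels, and the run duration and per-connector throughput are derived here rather than quoted from the slide.

```python
# Figures quoted on the slide.
MESSAGE_BYTES = 500
MESSAGES_PER_SEC = 16_000_000
TOTAL_MESSAGES = 800_000_000
CONNECTORS = 192
WORKERS = 6            # Worker 1 to Worker 6 in the figure
AVG_WRITE_GBPS = 4.28  # average parallel write throughput to ActiveScale

GB = 1_000_000_000     # assumption: decimal gigabytes

# Ingest rate implied by message size and message rate.
ingest_gbps = MESSAGE_BYTES * MESSAGES_PER_SEC / GB

# Volume of one benchmark run and the time to send it at that rate (derived).
total_gb = MESSAGE_BYTES * TOTAL_MESSAGES / GB
run_seconds = total_gb / ingest_gbps

# Connector layout and average share of the write throughput (derived).
connectors_per_worker = CONNECTORS // WORKERS
avg_mb_per_connector = AVG_WRITE_GBPS * 1000 / CONNECTORS

print(f"ingest rate:           {ingest_gbps:.1f} GB/s")
print(f"data per run:          {total_gb:.0f} GB")
print(f"send time per run:     {run_seconds:.0f} s")
print(f"connectors per worker: {connectors_per_worker}")
print(f"avg per connector:     {avg_mb_per_connector:.1f} MB/s")
```

The first result matches the 8 GB/s stated for the clients; the remaining values are derived only to show the scale of each component.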
13 Apache Kafka S3 Connector to ActiveScale: Performance
[Chart: Kafka-S3 connector scalability with number of connectors (one connector per partition). Y-axis: write throughput (GB/s), average and maximum. Higher is better.]
An optimal throughput is achieved with a total of 192 connectors writing to a single ActiveScale system.
[Chart: Kafka-S3 connector scalability with S3 upload part size (tick labels recovered: 64 MB, 128 MB, 256 MB) at a commit size of 1 million messages (500 MB). Y-axis: write throughput (GB/s), average and maximum. Higher is better.]
[Chart: Kafka S3 sink connector scalability with file/object commit size (MB) at an S3 upload part size of 64 MB. Y-axis: write throughput (GB/s), average and maximum. Higher is better.]
An optimal and stable performance is achieved with an S3 upload part size of 128 MB and a commit size of 1 million messages (500 MB), irrespective of the number of connector partitions. 48 connectors were used for these performance tests.
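The tuning result above (128 MB upload part size, commit size of 1 million messages) can be written down as a connector configuration. The following is a minimal sketch, assuming the Confluent S3 sink connector and its property names; the slides do not say which connector implementation was used. The connector name, topic, bucket and endpoint are hypothetical, 128 MB is taken as 128 MiB, and using `tasks.max` as a stand-in for the slide's connector count is also an assumption.

```python
import json

MIB = 1024 * 1024


def s3_sink_config(name, topic, bucket, endpoint, tasks):
    """Build a Kafka Connect S3 sink configuration as a plain dict.

    Assumption: property names follow the Confluent S3 sink connector.
    All argument values passed below are hypothetical placeholders.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat",
            "topics": topic,
            "tasks.max": str(tasks),
            "s3.bucket.name": bucket,
            "store.url": endpoint,                # S3-compatible endpoint
            "s3.part.size": str(128 * MIB),       # upload part size from the slide
            "flush.size": str(1_000_000),         # records per committed object
        },
    }


config = s3_sink_config(
    name="activescale-sink",                  # hypothetical
    topic="sensor-events",                    # hypothetical
    bucket="raw-data",                        # hypothetical
    endpoint="https://activescale.example.com",  # hypothetical
    tasks=48,
)
print(json.dumps(config, indent=2))
```

The resulting JSON has the shape accepted by the Kafka Connect REST interface when creating a connector; values would need to be adjusted to the actual deployment.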
14 Apache Kafka Performance on Flash vs. HDDs
With 16 M messages/sec at a message size of 500 bytes:
[Chart: Kafka performance per node, flash vs. HDD. Y-axis: acknowledgment latency (ms). Lower is better. Configurations: HDDs (5 HDDs/LUN, 2 LUNs) - 6 TB; 1 SSD - 3 TB; 2 SSDs - 6 TB; 3 SSDs - 9 TB. Data labels recovered: 160, 75, 60.]
The average latency of acknowledgment to the clients by Kafka is 75 ms/message with 2 NVMe SSDs in a shared PCIe all-flash enclosure allocated to each of the Kafka nodes.
The low latency of Kafka with 2 SSDs allocated to each node of a 4-node cluster made it possible to achieve a rate of 16 M messages/s. With 1 SSD equivalent to 10 HDDs in terms of latency, it would be necessary to use 20 HDDs to achieve the above message rate with Kafka.
16 IoT Speed Layer: Stream Processing and Storing
Requirements: Stream Processing
- Process events in near-real time: low latency; high velocity; guaranteed processing; fault-tolerant; scales easily.
Choices
- Spark Streaming: low latency, high throughput, fault-tolerant; uses Spark core for large-scale stream processing; in-flight messages; component of the Apache Spark ecosystem; uses: ETL on streaming data, operational dashboards, anomaly & fraud detection, NLP analysis, sensor data analytics; high adoption rate.
- Others: Apache Storm, Flink, Apache Apex, Apache Samza, Apache NiFi, etc.
Requirements: Storing
- Fast/real-time lookup storage; support for write-intensive workloads; support for concurrent reads and writes of multiple data sizes within a minimal response time.
Choices (NoSQL is a good option)
- Cassandra: best-in-class, scalable NoSQL data management system; parallel, distributed, high throughput, 100% fault-tolerant; no dependency on Hadoop; has a Spark connector; typical applications are IoT, fraud detection, recommendation engines, messaging apps.
- Others: HBase, Redis, Accumulo, MongoDB, etc.
17 Apache Spark Streaming Performance
Spark Streaming is implemented on 4 nodes (24c, 48t, 96 GB) with 2 NVMe SSDs in a shared PCIe all-flash enclosure allocated to each node.
[Chart: Spark Streaming scalability with the rate of incoming events. X-axis: incoming events per sec (10K, 20K, 40K, 50K, 60K, 80K, 100K, 120K). Y-axis: processing time (ms). Lower is better.]
The processing time of a single streaming application reaches a threshold at 100K events/s.
[Chart: Spark Streaming scalability with the number of Kafka partitions/streams at 100K events/sec. X-axis: number of streams/partitions (1p/1s, 4p/4s, 8p/8s, 96p/96s). Y-axis: processing time (ms). Data labels recovered: 45,901; 46,928; 49,122; 55,189. Lower is better.]
Increasing the number of partitions or streams for a single application does not improve the processing time, due to the overhead of aggregating results across multiple partitions.
[Chart: Scalability of concurrent streaming applications on 4 nodes (24c, 48t, 96 GB/node). Series: concurrent processing time (secs) and CPU (%)/node. Categories: 1 app, 8 apps, 32 apps (projected), each at 100K/s/app. Data labels recovered: 47 s, 48 s, 55 s; 2%, 15%, 75%.]
From measured data, it is estimated that multiple streaming applications could be executed concurrently (a total of 3.2 M events) within a 55 sec response time and CPU usage of 75-80%.
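The 3.2 M figure follows from the per-application rate and the number of concurrent applications. The short Python check below assumes that "3.2M events" refers to the aggregate events per second across 32 applications; that reading is an interpretation, not something the slide states.

```python
EVENTS_PER_SEC_PER_APP = 100_000  # per-application rate from the slide


def aggregate_rate(apps):
    """Aggregate incoming events per second across concurrent applications."""
    return apps * EVENTS_PER_SEC_PER_APP


# Application counts shown in the concurrency chart (32 is a projection).
for apps in (1, 8, 32):
    print(f"{apps:>2} apps -> {aggregate_rate(apps):,} events/s")
```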
19 Real-world Use Case: Unified Streaming and Batch Analytics on Climate IoT Data
ActiveScale object storage system as the storage repository of all raw data.
[Figure:
- Ingest: raw data feeds and event-driven data from weather stations; Apache Kafka on a shared PCIe all-flash enclosure.
- Data Lake: all raw data in ActiveScale, written through the Apache Kafka S3 connector.
- Batch Layer (big data): Hadoop YARN and Apache Hive through the ActiveScale S3A connector for Hadoop; batch views; tornadoes past history.
- Speed Layer (fast data): Spark Streaming on a shared PCIe all-flash enclosure; streaming analytics on transformed data; incremental real-time views; tornado alerts.
- Serving Layer: Cassandra on a shared PCIe all-flash enclosure; merged views (batch view + real-time view); tornadoes present and past, forecasting; smart analytics.]
20 Summary and Best Practices
For efficient coupling between big data and fast data for IoT use cases, it is recommended to:
- Implement the ActiveScale object storage system as a component within a data lake, for storing all the incoming raw streaming data from the ingestion layer.
- Tune the number of connectors configurable in the Kafka Connect cluster, the S3 upload part size, and the file/object commit size (also known as flush size), for ingesting streaming data from Apache Kafka to the ActiveScale object storage system with an optimal write throughput.
- Leverage the ActiveScale S3A connector for Hadoop to perform various long-running batch analytics and ad-hoc queries against the petabyte-scale data stored in ActiveScale, without any physical data movement.
- Implement a shared all-flash storage for the speed layer. Allocate the shared flash storage to compute nodes, for a balanced throughput, IOPS and capacity, as required by the applications in the speed layer, including ingestion, stream processing and real-time store.
- Implement the shared all-flash storage to complement an ActiveScale object storage system to achieve a disaggregation of compute and storage for the whole solution, without compromising on long-term data persistence, centralization or quality, durability, high availability and cost.
- Tune the number of concurrent applications, the number of streams per application, and the allocation of compute and JVM memory to achieve an optimal performance in the speed layer.
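The S3A recommendation above amounts to pointing Hadoop at the object store instead of copying data into HDFS. The following Python sketch generates the corresponding Hadoop properties. It assumes the ActiveScale S3A connector is configured through the standard Hadoop S3A properties (`fs.s3a.endpoint`, `fs.s3a.access.key`, `fs.s3a.secret.key`, `fs.s3a.path.style.access`); the slides do not confirm this, and the endpoint, credentials and the choice of path-style access are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def s3a_core_site(endpoint, access_key, secret_key):
    """Render Hadoop S3A settings as a core-site.xml style document.

    Assumption: the object store is reached through the standard Hadoop
    S3A properties; path-style access is assumed for illustration.
    """
    props = {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.path.style.access": "true",
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical endpoint and credentials.
core_site_xml = s3a_core_site(
    "https://activescale.example.com", "EXAMPLE_KEY", "EXAMPLE_SECRET"
)
print(core_site_xml)
```

With settings of this kind in place, batch jobs would address the data through `s3a://` paths, which is what allows queries to run without physical data movement.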
21 Western Digital, the Western Digital logo, ActiveScale, and IntelliFlash are registered trademarks or trademarks of Western Digital Corporation or its affiliates in the US and/or other countries. Apache, Apache Apex, Apache Hive, Apache Samza, Apache Spark, Apache Storm, Accumulo, Cassandra, Flink, Hadoop, HBase, and Kafka are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. The NVMe word mark is a trademark of NVM Express, Inc. All other marks are the property of their respective owners. 21
More informationContainer 2.0. Container: check! But what about persistent data, big data or fast data?!
@unterstein @joerg_schad @dcos @jaxdevops Container 2.0 Container: check! But what about persistent data, big data or fast data?! 1 Jörg Schad Distributed Systems Engineer @joerg_schad Johannes Unterstein
More informationStages of Data Processing
Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,
More informationOracle Exadata: Strategy and Roadmap
Oracle Exadata: Strategy and Roadmap - New Technologies, Cloud, and On-Premises Juan Loaiza Senior Vice President, Database Systems Technologies, Oracle Safe Harbor Statement The following is intended
More informationCONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM
CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications
More informationIsilon: Raising The Bar On Performance & Archive Use Cases. John Har Solutions Product Manager Unstructured Data Storage Team
Isilon: Raising The Bar On Performance & Archive Use Cases John Har Solutions Product Manager Unstructured Data Storage Team What we ll cover in this session Isilon Overview Streaming workflows High ops/s
More informationDown the event-driven road: Experiences of integrating streaming into analytic data platforms
Down the event-driven road: Experiences of integrating streaming into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Meetup Munich, 8.10.2018 Integrate
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationRealizing the Next Generation of Exabyte-scale Persistent Memory-Centric Architectures and Memory Fabrics
Realizing the Next Generation of Exabyte-scale Persistent Memory-Centric Architectures and Memory Fabrics Zvonimir Z. Bandic, Sr. Director, Next Generation Platform Technologies Western Digital Corporation
More information@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS
@unterstein @dcos @bedcon #bedcon Operating microservices with Apache Mesos and DC/OS 1 Johannes Unterstein Software Engineer @Mesosphere @unterstein @unterstein.mesosphere 2017 Mesosphere, Inc. All Rights
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationConfiguring and Deploying Hadoop Cluster Deployment Templates
Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page
More informationThe Impact of SSD Selection on SQL Server Performance. Solution Brief. Understanding the differences in NVMe and SATA SSD throughput
Solution Brief The Impact of SSD Selection on SQL Server Performance Understanding the differences in NVMe and SATA SSD throughput 2018, Cloud Evolutions Data gathered by Cloud Evolutions. All product
More informationIntroduction to Apache Apex
Introduction to Apache Apex Siyuan Hua @hsy541 PMC Apache Apex, Senior Engineer DataTorrent, Big Data Technology Conference, Beijing, Dec 10 th 2016 Stream Data Processing Data Delivery
More informationIBM System Storage DCS3700
IBM System Storage DCS3700 Maximize performance, scalability and storage density at an affordable price Highlights Gain fast, highly dense storage capabilities at an affordable price Deliver simplified
More informationApproaching the Petabyte Analytic Database: What I learned
Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may
More informationOracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data
Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous
More informationMAPR DATA GOVERNANCE WITHOUT COMPROMISE
MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance
More informationUNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017
UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in
More informationUsing the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver
Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data
More informationLambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
More informationApache Spark Graph Performance with Memory1. February Page 1 of 13
Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing
More informationStorage for HPC, HPDA and Machine Learning (ML)
for HPC, HPDA and Machine Learning (ML) Frank Kraemer, IBM Systems Architect mailto:kraemerf@de.ibm.com IBM Data Management for Autonomous Driving (AD) significantly increase development efficiency by
More informationOverview of Data Services and Streaming Data Solution with Azure
Overview of Data Services and Streaming Data Solution with Azure Tara Mason Senior Consultant tmason@impactmakers.com Platform as a Service Offerings SQL Server On Premises vs. Azure SQL Server SQL Server
More informationNVM Express over Fabrics Storage Solutions for Real-time Analytics
NVM Express over Fabrics Storage Solutions for Real-time Analytics Presented by Paul Prince, CTO Santa Clara, CA 1 NVMe Over Fabrics NVMf Why do we need NVMf? What is it? How does it fit in the Market?
More informationIBM Data Replication for Big Data
IBM Data Replication for Big Data Highlights Stream changes in realtime in Hadoop or Kafka data lakes or hubs Provide agility to data in data warehouses and data lakes Achieve minimum impact on source
More informationIBM Spectrum Scale IO performance
IBM Spectrum Scale 5.0.0 IO performance Silverton Consulting, Inc. StorInt Briefing 2 Introduction High-performance computing (HPC) and scientific computing are in a constant state of transition. Artificial
More information