Pagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft's ConvergDB

Pagely is the market leader in managed WordPress hosting, and an AWS Advanced Technology, SaaS, and Public Sector partner. We provide various tiers of high-performance WordPress hosting services for enterprise-level customers like BMC, Unicef, Northwestern University, and the City of Boston, offering flexibility in our solutions and the industry's best expert-only, tier-less support. Pagely utilizes a proprietary tech stack that accelerates WordPress sites through the use of our own ARES Web Application gateway, PressCACHE and PressCDN technologies, as well as open source tools such as Redis and Nginx.

In order to answer usage, billing, and other customer questions, our service team requires access to the logs created by the application servers. Historically we relied on a shell script that gathered basic statistics on demand. The job to process the logs for our largest customer ran for 8 hours or more for a single report, sometimes crashing due to resource limitations. Instead of putting more effort into fixing a legacy process, we decided it was time to implement a proper analytics platform.

Amazon Athena allows us to run SQL queries directly against the logs, which are stored as compressed JSON files in Amazon S3. This approach is great because there is no need for us to prepare the data: simply define the table and query away.

While JSON is a supported format for Amazon Athena, it is not the most efficient format at query time. JSON files must be read in their entirety, even if you are only using 1 or 2 fields from each row of data. Besides not being cost effective, the inefficiency of processing JSON causes longer query times. Querying the logs of our largest customers was not ideal with Athena, as we ran into the 30-minute query timeout limit. This limit can be increased, but the query was already taking longer than we wanted.
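To make the cost of reading whole JSON rows concrete, here is a minimal Python sketch. It is not Athena's implementation; the record fields and the per-field layout are assumptions made up for illustration. It sums one field from row-oriented JSON lines, then from a layout where each field is stored as its own contiguous block, and counts the bytes each approach has to touch.

```python
import json


def make_records(n):
    """Build n synthetic log records (field names are hypothetical)."""
    return [
        {
            "ts": 1_500_000_000 + i,
            "host": f"app-{i % 4}",
            "status": 200,
            "bytes_sent": 1000 + i,
            "path": f"/page/{i}",
            "agent": "Mozilla/5.0 (example)",
        }
        for i in range(n)
    ]


def sum_row_oriented(lines, field):
    """Row-oriented JSON: every line is read and parsed in full."""
    total = 0
    scanned = 0
    for line in lines:
        scanned += len(line.encode("utf-8"))
        total += json.loads(line)[field]
    return total, scanned


def to_columns(records):
    """Store each field as one contiguous JSON-encoded block."""
    return {name: json.dumps([r[name] for r in records]) for name in records[0]}


def sum_columnar(columns, field):
    """Per-field layout: only the block for the requested field is read."""
    block = columns[field]
    return sum(json.loads(block)), len(block.encode("utf-8"))


records = make_records(1000)
lines = [json.dumps(r) for r in records]

row_total, row_scanned = sum_row_oriented(lines, "bytes_sent")
col_total, col_scanned = sum_columnar(to_columns(records), "bytes_sent")

print("totals match:", row_total == col_total)
print("bytes scanned (rows):", row_scanned)
print("bytes scanned (one field):", col_scanned)
```

Both approaches return the same total, but the row-oriented pass touches every byte of every record, while the per-field pass touches only the one block it needs.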
Partitioning and Columnar Formats

The best practice for structuring the data in S3 is twofold: partitioning and a columnar file structure. Partitioning is the process of splitting data into different prefixes or folders on S3, with a naming convention suited to efficient retrieval of the data. This allows Athena to skip over data that is not relevant to the particular query being executed.

Apache Parquet is a columnar file format popular with tools in the Hadoop ecosystem. Parquet stores the columns of the data in separate, contiguous regions in the file. Directed by metadata footers, tools like Athena can read only the sections of the file that are needed to fulfill the query, eliminating a large portion of the IO and network transfer.

Reducing IO through partitioning and Parquet files not only increases query performance, but can also dramatically reduce the cost of using Athena.

Engaging Beyondsoft

We knew that we needed to transform our data into partitioned Parquet in order to make it performant with Athena, but being a lean shop, we didn't have the bandwidth to dive into the technologies. In order to bridge the gap, we engaged Beyondsoft, an AWS Advanced Partner, to optimize our data lake using their open source tool, ConvergDB.

ConvergDB

ConvergDB is a DevOps-friendly approach to managing serverless data lakes. Tables are defined using technology-agnostic schema definitions, which are then deployed to concrete cloud services (such as Glue and Athena) through the use of HashiCorp Terraform. The schema and deployment definitions provide a single point of management for the structure and behavior of the data as it flows through the cloud. ConvergDB does not require servers to operate; it is used either locally on a user's machine or in a CI/CD pipeline.

The appeal of managing our data with ConvergDB is that we can design our data lake by defining only the important elements. The schema files are used to define tables, including field-level SQL expressions that are used to transform the incoming data as it is being loaded. This makes it easy to derive calculated fields, as well as the fields used for data partitioning.

Once the schema is defined, the deployment file allows us to place the tables into an ETL job that is used to manage them. The ETL job schedule is specified in the deployment file, as well as optional fields such as the target S3 bucket and the number of Glue DPUs to use at run time.

ConvergDB is a command line binary and does not need to be installed on a server. All of the artifacts are files that can be managed with source control. This makes ConvergDB easy to integrate into CI/CD pipelines created with the tooling of your choice.
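As an illustration of deriving a partition field from incoming data and using it to skip irrelevant prefixes, here is a minimal Python sketch. It is not ConvergDB schema syntax. The bucket name, the `ts` field, and the `dt=YYYY-MM-DD` Hive-style naming are assumptions chosen for the example.

```python
from datetime import datetime, timezone

TABLE_ROOT = "s3://example-bucket/gateway_logs"  # hypothetical location


def partition_value(record):
    """Derive the dt partition value (UTC day) from an epoch timestamp."""
    moment = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


def partition_prefix(record):
    """Hive-style key=value prefix the record's file would be written under."""
    return f"{TABLE_ROOT}/dt={partition_value(record)}/"


def prune(prefixes, wanted_days):
    """Return only the prefixes a query filtered on dt would need to read."""
    wanted = {f"dt={day}" for day in wanted_days}
    return [p for p in prefixes if p.rstrip("/").rsplit("/", 1)[-1] in wanted]


records = [
    {"ts": 1530403200, "path": "/"},         # 2018-07-01 00:00:00 UTC
    {"ts": 1530489600, "path": "/about"},    # 2018-07-02 00:00:00 UTC
    {"ts": 1530493200, "path": "/contact"},  # 2018-07-02 01:00:00 UTC
]

prefixes = sorted({partition_prefix(r) for r in records})
to_scan = prune(prefixes, ["2018-07-02"])

print(prefixes)
print(to_scan)
```

A query that filters on a single day only needs the files under that day's prefix; every other prefix is skipped without being read.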
The ConvergDB binary takes in all of the configuration files, then outputs a Terraform configuration containing all of the artifacts necessary to deploy the data lake, such as ETL scripts, table and database definitions, IAM policies necessary to run the jobs, SNS notification topics, and even a CloudWatch dashboard showing the volume of data processed by ConvergDB ETL jobs.

Speed Bumps

No implementation goes perfectly. The next sections are provided by Jeremy Winters, a Beyondsoft engineer, explaining the problems they ran into and how they were addressed.

Small File Problem

A classic issue encountered with Hadoop ecosystem tools is known as the "small file problem". Processing a large number of small files creates a lot of overhead for the system, causing job execution times to skyrocket and jobs to potentially fail. Pagely had approximately 4TB of history across 30 million files. […] million of these files only represented 1.2TB of the data in S3.

In order to analyze this issue, we enabled S3 inventory reporting on the source data bucket. The report is delivered daily in ORC format. From there it is very easy to create an Athena table to analyze the bucket contents with SQL. We used Athena to identify S3 prefixes that were "hot spots", meaning prefixes with a large number of small files. We identified prefixes with less than 1GB of data that we could consolidate. So […] million files consolidated into […] files.

The following query is a way to identify small file hot spots. The group by expression can be adapted to your data. The example shows a way of grouping by the first folder in the bucket.

    select
      -- we are looking at the first string in a / delimited path
      -- if the key is path_to_data/ json.. it will group on path_to_data
      split_part(key,'/',1) as prefix
      -- calculate the total size in mb for all files in prefix
      ,sum(size)/cast(1024*1024 as double) as mb
      -- count of objects in the prefix
      ,count(*) as object_count
    from
      pagely_gateway_logs
    where
      -- assumes that versioning is disabled
      -- you should use the latest date after
      -- refreshing all partitions
      dt = ' '
    group by
      1
    having
      -- only return prefixes with a total size of less than 1 gb
      -- and a file count of 8 or more
      sum(size)/cast(1024*1024 as double) < 1024
      and count(*) >= 8

The results show prefixes in your object paths that can, and should, be consolidated. Anything less than 1GB with 8 or more files can then be consolidated into a single object, replacing the originals.

To perform the actual consolidation, we ran a containerized script using Fargate, the serverless Docker container feature of ECS. Each worker container instance processed the files for a given S3 key prefix. A governor container managed the lifecycle of the workers, limiting concurrency and keeping track of which jobs succeeded. Using Fargate, we were able to perform the consolidation of all the small files for $27.

Historical Data

Daily data volumes for Pagely logs are in the tens of GB per day, easily handled by the smallest AWS Glue configuration. Transforming the 4TB compressed (~28TB uncompressed) of historical data was a bit more challenging. For example, if you are 20 hours into a data transformation and the job tries to process a file with an incorrect S3 ACL, the entire job will fail, resulting in 20 hours of wasted compute resources.

ConvergDB mitigates the risk of wasting compute resources by batching the data into smaller chunks. In the case of a 20 hour job failing, only the last batch will be lost, resulting in around one hour of compute being lost. ConvergDB uses its own state tracking mechanism to communicate the failure to the next run of the job, which will clean up any mess before trying to process the batch again. Batching is an automatic feature of the ETL job created by ConvergDB, based upon the size of the Glue cluster.

Post-deployment at Pagely

Now that our data lake is in production, our legacy report for a medium-size application, which took 91 seconds with the legacy process, takes 5 seconds when run from Athena, a gain of 18x. Our largest data set breaks our legacy process and is not performant when querying the JSON directly with Athena, but the new tables enable completion of the analysis in 24 seconds.

                      Legacy Process    Athena with JSON    Athena with Parquet
    Medium Customer   1m 31s            1m 6s               5s
    Largest Customer  > 8 hours         > 30 min            24s

While these numbers are obviously important, the biggest advantage is that we no longer have to worry about performance and cost, and the engineer can focus on solving problems: 15 minutes of writing queries, and the entire team now has access to new data. I was able to upgrade the legacy process with queries dispatched to Athena through the AWS SDK. This process can now run on any lightweight machine (like my laptop) while Athena does the heavy lifting.

About Beyondsoft Consulting, Inc.

Beyondsoft Consulting, Inc. is a leading cloud consulting, services, and technology company. Beyondsoft delivers solutions and services globally and across many verticals. Our team of highly skilled professionals, coupled with our focus on customer success, truly separates us as an Amazon Web Services Advanced Partner.
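As a rough illustration of the Athena dispatch described under "Post-deployment at Pagely", here is a minimal Python sketch of submitting a query and polling until it finishes. It is not Pagely's actual code. The helper `run_athena_query`, the table and database names, and the output location are hypothetical. The two client calls mirror the `start_query_execution` and `get_query_execution` operations of the boto3 Athena client, and a stub client stands in so the sketch runs without AWS credentials.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, poll until it reaches a terminal state, return (id, state)."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)


class StubAthenaClient:
    """Offline stand-in for boto3.client("athena"); replays a list of states."""

    def __init__(self, states):
        self._states = iter(states)
        self.submitted = []

    def start_query_execution(self, **kwargs):
        self.submitted.append(kwargs)
        return {"QueryExecutionId": "stub-query-1"}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": next(self._states)}}}


stub = StubAthenaClient(["QUEUED", "RUNNING", "SUCCEEDED"])
query_id, state = run_athena_query(
    stub,
    "select count(*) from gateway_logs",   # hypothetical table
    "example_db",                          # hypothetical database
    "s3://example-bucket/athena-results/", # hypothetical output location
    poll_seconds=0,
)
print(query_id, state)
```

Against a real account, the stub would be replaced by `boto3.client("athena")`. The machine running the script only submits the SQL and waits, which is why a laptop is enough.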

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

How can you implement this through a script that a scheduling daemon runs daily on the application servers?

How can you implement this through a script that a scheduling daemon runs daily on the application servers? You ve been tasked with implementing an automated data backup solution for your application servers that run on Amazon EC2 with Amazon EBS volumes. You want to use a distributed data store for your backups

More information

Lambda Architecture for Batch and Stream Processing. October 2018

Lambda Architecture for Batch and Stream Processing. October 2018 Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Corriendo R sobre un ambiente Serverless: Amazon Athena

Corriendo R sobre un ambiente Serverless: Amazon Athena Corriendo R sobre un ambiente Serverless: Amazon Athena Mauricio Muñoz Solutions Architect, AWS Chile April, 2017 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Web Services

More information

Experiences with Serverless Big Data

Experiences with Serverless Big Data Experiences with Serverless Big Data AWS Meetup Munich 2016 Markus Schmidberger, Head of Data Service Munich, 17.10.16 Key Components of our Data Service Real-Time Monitoring Enable our development teams

More information

How to go serverless with AWS Lambda

How to go serverless with AWS Lambda How to go serverless with AWS Lambda Roman Plessl, nine (AWS Partner) Zürich, AWSomeDay 12. September 2018 About myself and nine Roman Plessl Working for nine as a Solution Architect, Consultant and Leader.

More information

Containers or Serverless? Mike Gillespie Solutions Architect, AWS Solutions Architecture

Containers or Serverless? Mike Gillespie Solutions Architect, AWS Solutions Architecture Containers or Serverless? Mike Gillespie Solutions Architect, AWS Solutions Architecture A Typical Application with Microservices Client Webapp Webapp Webapp Greeting Greeting Greeting Name Name Name Microservice

More information

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes AN UNDER THE HOOD LOOK Databricks Delta, a component of the Databricks Unified Analytics Platform*, is a unified

More information

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION The process of planning and executing SQL Server migrations can be complex and risk-prone. This is a case where the right approach and

More information

Exam Questions

Exam Questions Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure

More information

DevOps Tooling from AWS

DevOps Tooling from AWS DevOps Tooling from AWS What is DevOps? Improved Collaboration - the dropping of silos between teams allows greater collaboration and understanding of how the application is built and deployed. This allows

More information

AWS Service Catalog. User Guide

AWS Service Catalog. User Guide AWS Service Catalog User Guide AWS Service Catalog: User Guide Copyright 2017 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade dress may not be used in

More information

AWS Course Syllabus. Linux Fundamentals. Installation and Initialization:

AWS Course Syllabus. Linux Fundamentals. Installation and Initialization: AWS Course Syllabus Linux Fundamentals Installation and Initialization: Installation, Package Selection Anatomy of a Kickstart File, Command line Introduction to Bash Shell System Initialization, Starting

More information

PUBLIC SAP Vora Sizing Guide

PUBLIC SAP Vora Sizing Guide SAP Vora 2.0 Document Version: 1.1 2017-11-14 PUBLIC Content 1 Introduction to SAP Vora....3 1.1 System Architecture....5 2 Factors That Influence Performance....6 3 Sizing Fundamentals and Terminology....7

More information

Autonomous Database Level 100

Autonomous Database Level 100 Autonomous Database Level 100 Sanjay Narvekar December 2018 1 Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and

More information

Security and Performance advances with Oracle Big Data SQL

Security and Performance advances with Oracle Big Data SQL Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

4 Effective Tools for Docker Monitoring. By Ranvijay Jamwal

4 Effective Tools for Docker Monitoring. By Ranvijay Jamwal 4 Effective Tools for Docker Monitoring By Ranvijay Jamwal CONTENT 1. The need for Container Technologies 2. Introduction to Docker 2.1. What is Docker? 2.2. Why is Docker popular? 2.3. How does a Docker

More information

AWS Administration. Suggested Pre-requisites Basic IT Knowledge

AWS Administration. Suggested Pre-requisites Basic IT Knowledge Course Description Amazon Web Services Administration (AWS Administration) course starts your Cloud Journey. If you are planning to learn Cloud Computing and Amazon Web Services in particular, then this

More information

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing

More information

BI ENVIRONMENT PLANNING GUIDE

BI ENVIRONMENT PLANNING GUIDE BI ENVIRONMENT PLANNING GUIDE Business Intelligence can involve a number of technologies and foster many opportunities for improving your business. This document serves as a guideline for planning strategies

More information

About Intellipaat. About the Course. Why Take This Course?

About Intellipaat. About the Course. Why Take This Course? About Intellipaat Intellipaat is a fast growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over

More information

Accenture Cloud Platform Serverless Journey

Accenture Cloud Platform Serverless Journey ARC202 Accenture Cloud Platform Serverless Journey Tom Myers, Sr. Cloud Architect, Accenture Cloud Platform Matt Lancaster, Lightweight Architectures Global Lead November 29, 2016 2016, Amazon Web Services,

More information

Modern Data Warehouse The New Approach to Azure BI

Modern Data Warehouse The New Approach to Azure BI Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics

More information

AALOK INSTITUTE. DevOps Training

AALOK INSTITUTE. DevOps Training DevOps Training Duration: 40Hrs (8 Hours per Day * 5 Days) DevOps Syllabus 1. What is DevOps? a. History of DevOps? b. How does DevOps work anyways? c. Principle of DevOps: d. DevOps combines the best

More information

Automating Elasticity. March 2018

Automating Elasticity. March 2018 Automating Elasticity March 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only. It represents AWS s current product

More information

Research at PNNL: Powered by AWS NLIT 2018

Research at PNNL: Powered by AWS NLIT 2018 Research at PNNL: Powered by AWS NLIT 2018 RALPH PERKO AND MIKE GIARDINELLI Pacific Northwest National Laboratory Reference herein to any specific commercial product, process, or service by trade name,

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Energy Management with AWS

Energy Management with AWS Energy Management with AWS Kyle Hart and Nandakumar Sreenivasan Amazon Web Services August [XX], 2017 Tampa Convention Center Tampa, Florida What is Cloud? The NIST Definition Broad Network Access On-Demand

More information

Low Friction Data Warehousing WITH PERSPECTIVE ILM DATA GOVERNOR

Low Friction Data Warehousing WITH PERSPECTIVE ILM DATA GOVERNOR Low Friction Data Warehousing WITH PERSPECTIVE ILM DATA GOVERNOR Table of Contents Foreword... 2 New Era of Rapid Data Warehousing... 3 Eliminating Slow Reporting and Analytics Pains... 3 Applying 20 Years

More information

Project Direction Proven ability to lead and manage a wide variety of design and development projects in team and independent situations.

Project Direction Proven ability to lead and manage a wide variety of design and development projects in team and independent situations. + Mike Hall Software Developer Email: mike@just3ws.com Telephone: (847) 877-3825 LinkedIn: linkedin.com/in/just3ws Skills API design Designed and refactored many application interfaces for use as libraries

More information

CloudExpo November 2017 Tomer Levi

CloudExpo November 2017 Tomer Levi CloudExpo November 2017 Tomer Levi About me Full Stack Engineer @ Intel s Advanced Analytics group. Artificial Intelligence unit at Intel. Responsible for (1) Radical improvement of critical processes

More information

DURATION : 03 DAYS. same along with BI tools.

DURATION : 03 DAYS. same along with BI tools. AWS REDSHIFT TRAINING MILDAIN DURATION : 03 DAYS To benefit from this Amazon Redshift Training course from mildain, you will need to have basic IT application development and deployment concepts, and good

More information

Integrate MATLAB Analytics into Enterprise Applications

Integrate MATLAB Analytics into Enterprise Applications Integrate Analytics into Enterprise Applications Aurélie Urbain MathWorks Consulting Services 2015 The MathWorks, Inc. 1 Data Analytics Workflow Data Acquisition Data Analytics Analytics Integration Business

More information

Tour of Database Platforms as a Service. June 2016 Warner Chaves Christo Kutrovsky Solutions Architect

Tour of Database Platforms as a Service. June 2016 Warner Chaves Christo Kutrovsky Solutions Architect Tour of Database Platforms as a Service June 2016 Warner Chaves Christo Kutrovsky Solutions Architect Bio Solutions Architect at Pythian Specialize high performance data processing and analytics 15 years

More information

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Best Practices and Performance Tuning on Amazon Elastic MapReduce Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 2016, Amazon Web Services, Inc. or

More information

Gabriel Villa. Architecting an Analytics Solution on AWS

Gabriel Villa. Architecting an Analytics Solution on AWS Gabriel Villa Architecting an Analytics Solution on AWS Cloud and Data Architect Skilled leader, solution architect, and technical expert focusing primarily on Microsoft technologies and AWS. Passionate

More information

Real-time Streaming Applications on AWS Patterns and Use Cases

Real-time Streaming Applications on AWS Patterns and Use Cases Real-time Streaming Applications on AWS Patterns and Use Cases Paul Armstrong - Solutions Architect (AWS) Tom Seddon - Data Engineering Tech Lead (Deliveroo) 28 th June 2017 2016, Amazon Web Services,

More information

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo Evolution of Big Data Architectures@ Facebook Architecture Summit, Shenzhen, August 2012 Ashish Thusoo About Me Currently Co-founder/CEO of Qubole Ran the Data Infrastructure Team at Facebook till 2011

More information

Integrate MATLAB Analytics into Enterprise Applications

Integrate MATLAB Analytics into Enterprise Applications Integrate Analytics into Enterprise Applications Dr. Roland Michaely 2015 The MathWorks, Inc. 1 Data Analytics Workflow Access and Explore Data Preprocess Data Develop Predictive Models Integrate Analytics

More information

Migrate from Netezza Workload Migration

Migrate from Netezza Workload Migration Migrate from Netezza Automated Big Data Open Netezza Source Workload Migration CASE SOLUTION STUDY BRIEF Automated Netezza Workload Migration To achieve greater scalability and tighter integration with

More information

70-532: Developing Microsoft Azure Solutions

70-532: Developing Microsoft Azure Solutions 70-532: Developing Microsoft Azure Solutions Exam Design Target Audience Candidates of this exam are experienced in designing, programming, implementing, automating, and monitoring Microsoft Azure solutions.

More information

STATE OF MODERN APPLICATIONS IN THE CLOUD

STATE OF MODERN APPLICATIONS IN THE CLOUD STATE OF MODERN APPLICATIONS IN THE CLOUD 2017 Introduction The Rise of Modern Applications What is the Modern Application? Today s leading enterprises are striving to deliver high performance, highly

More information

QLIK INTEGRATION WITH AMAZON REDSHIFT

QLIK INTEGRATION WITH AMAZON REDSHIFT QLIK INTEGRATION WITH AMAZON REDSHIFT Qlik Partner Engineering Created August 2016, last updated March 2017 Contents Introduction... 2 About Amazon Web Services (AWS)... 2 About Amazon Redshift... 2 Qlik

More information

What is Gluent? The Gluent Data Platform

What is Gluent? The Gluent Data Platform What is Gluent? The Gluent Data Platform The Gluent Data Platform provides a transparent data virtualization layer between traditional databases and modern data storage platforms, such as Hadoop, in the

More information

IBM Big SQL Partner Application Verification Quick Guide

IBM Big SQL Partner Application Verification Quick Guide IBM Big SQL Partner Application Verification Quick Guide VERSION: 1.6 DATE: Sept 13, 2017 EDITORS: R. Wozniak D. Rangarao Table of Contents 1 Overview of the Application Verification Process... 3 2 Platform

More information

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without

More information

Monitoring and Operating Cisco Prime Service Catalog Reports

Monitoring and Operating Cisco Prime Service Catalog Reports Monitoring and Operating Cisco Prime Service Catalog Reports This chapter contains the following topics: Configuring Cognos Memory Usage, page 1 Refreshing the Standard Reports Package, page 2 Refreshing

More information

SALESFORCE DEVELOPER LIMITS AND ALLOCATIONS QUICK REFERENCE

SALESFORCE DEVELOPER LIMITS AND ALLOCATIONS QUICK REFERENCE SALESFORCE DEVELOPER LIMITS AND ALLOCATIONS QUICK REFERENCE Summary Find the most critical limits for developing Lightning Platform applications. About This Quick Reference This quick reference provides

More information

Important DevOps Technologies (3+2+3days) for Deployment

Important DevOps Technologies (3+2+3days) for Deployment Important DevOps Technologies (3+2+3days) for Deployment DevOps is the blending of tasks performed by a company's application development and systems operations teams. The term DevOps is being used in

More information

Oracle Exadata: Strategy and Roadmap

Oracle Exadata: Strategy and Roadmap Oracle Exadata: Strategy and Roadmap - New Technologies, Cloud, and On-Premises Juan Loaiza Senior Vice President, Database Systems Technologies, Oracle Safe Harbor Statement The following is intended

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Integrating Splunk with AWS services:

Integrating Splunk with AWS services: Integrating Splunk with AWS services: Using Redshi+, Elas0c Map Reduce (EMR), Amazon Machine Learning & S3 to gain ac0onable insights via predic0ve analy0cs via Splunk Patrick Shumate Solutions Architect,

More information

WHITEPAPER. MemSQL Enterprise Feature List

WHITEPAPER. MemSQL Enterprise Feature List WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure

More information

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide FAQs 1. What is the browser compatibility for logging into the TCS Connected Intelligence Data Lake for Business Portal? Please check whether you are using Mozilla Firefox 18 or above and Google Chrome

More information

Architectural challenges for building a low latency, scalable multi-tenant data warehouse

Architectural challenges for building a low latency, scalable multi-tenant data warehouse Architectural challenges for building a low latency, scalable multi-tenant data warehouse Mataprasad Agrawal Solutions Architect, Services CTO 2017 Persistent Systems Ltd. All rights reserved. Our analytics

More information

C ibm IBM C Foundations of IBM Cloud Reference Architecture V5 Version 1.0

C ibm  IBM C Foundations of IBM Cloud Reference Architecture V5 Version 1.0 C5050-287.ibm Number: C5050-287 Passing Score: 800 Time Limit: 120 min File Version: 1.0 IBM C5050-287 Foundations of IBM Cloud Reference Architecture V5 Version 1.0 Exam A QUESTION 1 Which IT methodology

More information

ARCHITECTING WEB APPLICATIONS FOR THE CLOUD: DESIGN PRINCIPLES AND PRACTICAL GUIDANCE FOR AWS
