Pagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft's ConvergDB


Vernon Patrick
5 years ago

Pagely is the market leader in managed WordPress hosting, and an AWS Advanced Technology, SaaS, and Public Sector partner. We provide various tiers of high-performance WordPress hosting services for enterprise-level customers like BMC, UNICEF, Northwestern University, and the City of Boston, offering flexibility in our solutions and the industry's best expert-only, tier-less support. Pagely utilizes a proprietary tech stack that accelerates WordPress sites through the use of our own ARES Web Application gateway, PressCACHE, and PressCDN technologies, as well as open source tools such as Redis and Nginx.

In order to answer usage, billing, and other customer questions, our service team requires access to the logs created by the application servers. Historically we relied on a shell script that gathered basic statistics on demand. The job to process the logs for our largest customer ran for 8 hours or more for a single report, sometimes crashing due to resource limitations. Instead of putting more effort into fixing a legacy process, we decided it was time to implement a proper analytics platform.

Amazon Athena allows us to run SQL queries directly against the logs, which are stored as compressed JSON files in Amazon S3. This approach is great because there is no need for us to prepare the data: simply define the table and query away. While JSON is a supported format for Amazon Athena, it is not the most efficient format at query time. JSON files must be read in their entirety, even if you are only using 1 or 2 fields from each row of data. Besides not being cost effective, the inefficiency of processing JSON causes longer query times. Querying the logs of our largest customers was not ideal with Athena, as we ran into the 30 minute query timeout limit. This limit can be increased, but the query was already taking longer than we wanted.
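To make the cost of row-oriented JSON concrete, here is a minimal Python sketch. It is only an illustration: the record layout and the field names (`host`, `path`, `status`, `bytes_sent`) are invented for the example and are not Pagely's actual log schema. It shows that summing a single field from gzip-compressed JSON lines still means decompressing and parsing every byte of every record.

```python
import gzip
import io
import json


def write_json_lines(records):
    """Serialize records as gzip-compressed JSON lines and return the bytes."""
    buffer = io.BytesIO()
    with gzip.open(buffer, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return buffer.getvalue()


def sum_field(blob, field):
    """Sum one field; also return how many decompressed bytes had to be parsed."""
    total = 0
    bytes_parsed = 0
    with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as handle:
        for line in handle:
            # Every line is parsed in full, even though only one field is used.
            bytes_parsed += len(line.encode("utf-8"))
            total += json.loads(line)[field]
    return total, bytes_parsed


# Hypothetical log records; the field names are assumptions for this sketch.
records = [
    {"host": "example.com", "path": f"/page/{i}", "status": 200, "bytes_sent": 100 + i}
    for i in range(1000)
]
blob = write_json_lines(records)
total, parsed = sum_field(blob, "bytes_sent")
print(f"sum={total}, compressed={len(blob)} bytes, parsed={parsed} bytes")
```

The parsed byte count covers the whole file regardless of how many fields the query touches, which is the behaviour a columnar format avoids.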
Partitioning and Columnar Formats

The best practice for structuring the data in S3 is twofold: partitioning and a columnar file structure. Partitioning is the process of splitting data into different prefixes or folders on S3 with a naming convention that's most suitable to efficient retrieval of data. This allows Athena to skip over data that is not relevant to the particular query being executed.

Apache Parquet is a columnar file format popular with tools in the Hadoop ecosystem. Parquet stores the columns of the data in separate, contiguous regions in the file. Directed by metadata footers, tools like Athena can read only the sections of the file that are needed to fulfill the query, eliminating a large portion of the IO and network transfer. Reducing IO through partitioning and Parquet files not only increases query performance, but can also dramatically reduce the cost of using Athena.

Engaging Beyondsoft...

We knew that we needed to transform our data into partitioned Parquet in order to make it performant with Athena, but being a lean shop, we didn't have the bandwidth to dive into the technologies. In order to bridge the gap, we engaged Beyondsoft, an AWS Advanced Partner, to optimize our data lake using their open source tool, ConvergDB.

ConvergDB

ConvergDB is a DevOps-friendly approach to managing serverless data lakes. Tables are defined using technology-agnostic schema definitions, which are then deployed to concrete cloud services (such as Glue and Athena) through the use of HashiCorp Terraform. The schema and deployment definitions provide a single point of management for the structure and behavior of the data as it flows through the cloud. ConvergDB does not require servers to operate; it is used either locally on a user's machine or in a CI/CD pipeline.

The appeal of managing our data with ConvergDB is that we can design our data lake by defining only the important elements. The schema files are used to define tables, including field-level SQL expressions that are used to transform the incoming data as it is being loaded. This makes it easy to derive calculated fields, as well as the fields used for data partitioning. Once the schema is defined, the deployment file allows us to place the tables into an ETL job that is used to manage them. The ETL job schedule is specified in the deployment file, as well as optional fields such as the target S3 bucket and the number of Glue DPUs to use at run time.

ConvergDB is a command line binary and does not need to be installed on a server. All of the artifacts are files that can be managed with source control. This makes ConvergDB easy to integrate into CI/CD pipelines created with the tooling of your choice.
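The idea of deriving a partition field at load time can be illustrated with a short Python sketch. This is not ConvergDB schema syntax, which is not shown in this article. The table name `gateway_logs`, the daily `dt` partition naming, and the Unix-timestamp input are all assumptions made for the example.

```python
from datetime import datetime, timezone


def partition_value(epoch_seconds):
    """Derive a daily partition value (YYYY-MM-DD, UTC) from a Unix timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def partition_prefix(table, epoch_seconds):
    """Build a Hive-style key prefix of the form 'table/dt=YYYY-MM-DD/'."""
    return f"{table}/dt={partition_value(epoch_seconds)}/"


def prune(keys, table, day):
    """Keep only the object keys stored under the partition for the given day."""
    prefix = f"{table}/dt={day}/"
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical object keys; a query filtered on one day only touches one prefix.
keys = [
    partition_prefix("gateway_logs", 0) + "part-0.parquet",
    partition_prefix("gateway_logs", 86400) + "part-0.parquet",
    partition_prefix("gateway_logs", 86400) + "part-1.parquet",
]
print(prune(keys, "gateway_logs", "1970-01-02"))
```

Because the partition value is encoded in the key prefix, a query engine that filters on `dt` can list and read only the matching prefix and ignore everything else.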
The ConvergDB binary takes in all of the configuration files, then outputs a Terraform configuration containing all of the artifacts necessary to deploy the data lake, such as ETL scripts, table and database definitions, IAM policies necessary to run the jobs, SNS notification topics, and even a CloudWatch dashboard showing the volume of data processed by ConvergDB ETL jobs.

Speed Bumps

No implementation goes perfectly. The next sections are provided by Jeremy Winters, a Beyondsoft engineer, explaining the problems they ran into and how they were addressed.

Small File Problem

A classic issue encountered with Hadoop ecosystem tools is known as the "small file problem". Processing a large number of small files creates a lot of overhead for the system, causing job execution times to skyrocket and jobs to potentially fail. Pagely had approximately 4TB of history across 30 million files; … million of these files represented only 1.2TB of the data in S3.

In order to analyze this issue, we enabled S3 inventory reporting on the source data bucket. The report is delivered daily in ORC format. From there it is very easy to create an Athena table to analyze the bucket contents with SQL. We used Athena to identify S3 prefixes that were "hot spots", meaning prefixes holding a large number of small files. We identified prefixes with less than 1GB of data that we could consolidate. So … million files consolidated into … files.

The following query is a way to identify small file hot spots. The group by expression can be adapted to suit your data. The example shows a way of grouping by the first folder in the bucket.

    select
      -- we are looking at the first string in a / delimited path
      -- if the key is path_to_data/ json.. it will group on path_to_data
      split_part(key,'/',1) as prefix
      -- calculate the total size in mb for all files in prefix
      ,sum(size)/cast(1024*1024 as double) as mb
      -- count of objects in the prefix
      ,count(*) as object_count
    from
      pagely_gateway_logs
    where
      -- assumes that versioning is disabled
      -- you should use the latest date after
      -- refreshing all partitions
      dt = ' '
    group by
      1
    having
      -- only return prefixes with a total size of less than 1 gb
      -- and a file count of 8 or more
      sum(size)/cast(1024*1024 as double) < 1024
      and count(*) >= 8

The results show prefixes in your object paths that can, and should, be consolidated. Any prefix holding less than 1GB across 8 or more files can then be consolidated into a single object, replacing the originals.

To perform the actual consolidation, we ran a containerized script using Fargate, the serverless Docker container feature of ECS. Each worker container instance processed the files for a given S3 key prefix. A governor container managed the lifecycle of the workers, limiting concurrency and keeping track of which jobs succeeded. Using Fargate, we were able to perform the consolidation of all the small files for $27.

Historical Data

Daily data volumes for Pagely logs are in the 10s of GB per day, easily handled by the smallest AWS Glue configuration. Transforming the 4TB compressed (~28TB uncompressed) of historical data was a bit more challenging. For example, if you are 20 hours into a data transformation and the job tries to process a file with an incorrect S3 ACL, the entire job will fail, resulting in 20 hours of wasted compute resources.

ConvergDB mitigates the risk of wasting compute resources by batching the data into smaller chunks. In the case of a 20 hour job failing, only the last batch will be lost, resulting in around one hour of compute being lost. ConvergDB uses its own state tracking mechanism to communicate the failure to the next run of the job, which will clean up any mess before trying to process the batch again. Batching is an automatic feature of the ETL job created by ConvergDB, based upon the size of the Glue cluster.

Post-deployment at Pagely

Now that our data lake is in production, our legacy report for a medium-size application, which took 91 seconds with the legacy process, takes 5 seconds when run from Athena, a gain of 18x. Our largest data set breaks our legacy process and is not performant when querying the JSON directly with Athena, but the new tables enable completion of the analysis in 24 seconds.

                      Legacy Process    Athena with JSON    Athena with Parquet
    Medium Customer   1m 31s            1m 6s               5s
    Largest Customer  > 8 hours         > 30 min            24s

While these numbers are obviously important, the biggest advantage is that we no longer have to worry about performance and cost, and the engineer can focus on solving problems. After 15 minutes of writing queries, the entire team has access to new data. I was able to upgrade the legacy process with queries dispatched to Athena through the AWS SDK. This process can now run on any lightweight machine (like my laptop) while Athena does the heavy lifting.

About Beyondsoft Consulting, Inc.

Beyondsoft Consulting, Inc. is a leading Cloud consulting, services, and technology company. Beyondsoft delivers solutions and services globally and across many verticals. Our team of highly skilled professionals, coupled with our focus on customer success, truly separates us as an Amazon Web Services Advanced Partner.
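As a closing illustration of the post-deployment note about dispatching queries to Athena through the AWS SDK, the sketch below shows one way such a dispatcher might look in Python. It is not Pagely's actual code. The function name `run_athena_query` is hypothetical, and the client is passed in so the example runs without AWS credentials. In real use the client would typically be created with `boto3.client("athena")`, whose `start_query_execution` and `get_query_execution` calls are the ones used here.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, poll until it finishes, and return (query_id, final_state).

    `client` is assumed to behave like a boto3 Athena client.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        # Athena reports QUEUED and RUNNING while the query is still in flight.
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

The calling machine only submits the SQL and polls for status, so it needs very little memory or CPU; the scan itself runs inside Athena, and results are written to the S3 output location given in the request.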
