Redshift Queries Playbook


scalable analytics built for growth

Redshift Queries Playbook
SQL queries for understanding user behavior
Updated June 23, 2015

This playbook shows you how to use Amplitude's Amazon Redshift database to answer common questions about user behavior in your app. All queries are written in the PostgreSQL syntax used by Redshift and can be executed directly from the Redshift prompt. You can use Amazon's Redshift documentation to help you understand the supported functions.

Quick and Essential Tips

1. Data for each app is kept in its own schema (namespaces/packages in Redshift).
By default, every Redshift command you run operates under the public schema. You can select a different schema to work under with the SET search_path command:

SET search_path = app123;
SELECT COUNT(*) FROM events;

Or you can include the schema as a prefix to the table:

SELECT COUNT(*) FROM app123.events;

2. Query directly from each app's table instead of the entire events table when possible.
The events from each of your Amplitude apps are stored in their own tables. The table name for each app is 'events###', where ### is the app number, which you can find in the URL of the Amplitude dashboard. The union of each app's events appears in a table called 'events'. Selecting FROM events### when possible will make your queries faster and more efficient.

3. Custom event properties and custom user properties associated with an event_type are pulled into their own columns in the respective event_type table.
Custom user properties are prefixed with 'u_' and custom event properties are prefixed with 'e_'. Note: there is a limit of 400 user properties and 50 event properties that will be pulled into their own columns. Anything past the limit will still require the JSON_EXTRACT_PATH_TEXT function.
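For properties beyond those limits, the value can still be parsed out of the raw JSON columns at query time. As a sketch (the property name 'coupon_code' and the table app123.purchase here are hypothetical examples, not columns or tables from the schema above):

```sql
-- Extract a custom event property that was not pulled into its own column.
-- event_properties holds a JSON string, so we parse it per row.
SELECT JSON_EXTRACT_PATH_TEXT(event_properties, 'coupon_code') AS coupon_code,
       COUNT(*)
FROM app123.purchase
WHERE DATE(event_time) = '2015-03-01'
GROUP BY coupon_code;
```

JSON_EXTRACT_PATH_TEXT returns an empty string when the key is absent, so events without the property will fall into one bucket.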

4. Always include a date range in the WHERE clause of your query.
Our Redshift tables do not have a primary key but are sorted by the event_time column. Adding a date range in the WHERE clause of your query will significantly increase query speeds. We recommend using the DATE() function with event_time as the input.

5. Avoid SELECT * queries when possible.
The more columns you select, the slower your query will be. Selecting only relevant columns, as opposed to all (*) columns, will significantly increase query speeds and show only relevant data.
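Taken together, tips 2, 4, and 5 suggest a query shape like the following sketch (the schema name, app number, dates, and column list are placeholders to adapt to your own app):

```sql
-- Per-app table, date-range filter on event_time, and only the needed columns.
SET search_path = app123;

SELECT amplitude_id, event_type, event_time
FROM events123
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-07';
```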

Table of Contents

0. Schema Description
1. Active Users
   - Count the active users on a given day
2. New Users
   - Count the new users on a given day
3. Composition
   - Show the breakdown of devices for users in a two-week period
4. Sessions
   - Show the distribution of session lengths on a specific date
   - Show the average session length per segment
5. Events
   - Show the distribution of event property totals
   - Count the number of users who did an event at least twice
   - Count the number of events done by a specific set of users who did another event
   - Show the distribution of users who have done an event by number of times done
   - Find out the last three events a user does before churning
6. Funnels
   - Obtain a list of users for each step of a funnel
   - Adding steps to a funnel
   - Getting the list of users who did (or did not) reach a step in a funnel
   - Funnels where users did event X, then Y, with no other events in between
   - Funnels where users did event Y after event X, within 24 hours of event X

7. Revenue
   - Obtain the number of paying users and total revenue
   - Obtain a list of top paying users
8. User Properties
   - Obtain the most common values for a given user property
   - Obtain a list of users who have certain properties
   - Obtain the most common advertising referral networks for users
   - Obtain the number of users whose current level is greater than 7 but less than 10
9. Event Properties
   - Obtain the list of items bought and how frequently each item was purchased
   - Obtain the number of users who placed a bet and wagered between 100-500 credits
Additional Resources

0. Schema Description

Below is the list of the columns in the table for the event type played_song, along with each column's type and a brief description.

Column                Type                        Description
id                    bigint                      A deprecated column
app                   integer                     App ID from the dashboard
amplitude_id          bigint                      Internal ID used to count unique users
device_id             character varying (256)     Device-specific identifier
user_id               character varying (256)     A readable ID specified by you
event_time            timestamp w/o time zone     Event time (UTC) after reconciliation
client_event_time     timestamp w/o time zone     Local event time
client_upload_time    timestamp w/o time zone     Local upload time
server_upload_time    timestamp w/o time zone     Server time when event was received
event_id              integer                     Counter distinguishing events
session_id            bigint                      Session start time in milliseconds since epoch
event_type            character varying (256)     A unique identifier for your event
amplitude_event_type  character varying (256)     Amplitude-specific identifiers based on event
first_event           boolean                     True if event is first for a given amplitude_id
version_name          character varying (256)     App version
os_name               character varying (256)     OS name
os_version            character varying (256)     OS version
device_brand          character varying (256)     Device brand
device_manufacturer   character varying (256)     Device manufacturer
device_model          character varying (256)     Device model
device_carrier        character varying (256)     Device carrier
country               character varying (256)     Country
language              character varying (256)     Language
revenue               double precision            Revenue generated by a revenue event
product_id            character varying (256)     Product ID of a revenue event
quantity              integer                     Quantity of a revenue event
price                 double precision            Price of a revenue event
location_lat          double precision            Latitude
location_lng          double precision            Longitude
ip_address            character varying (256)     IP address
event_properties      character varying (65535)   JSON string of event properties
user_properties       character varying (65535)   JSON string of user properties
region                character varying (256)     Region
city                  character varying (256)     City
dma                   character varying (256)     Designated Marketing Area (DMA)
device_family         character varying (256)     Device family
device_type           character varying (256)     Device type
platform              character varying (256)     Platform (iOS, Android, or Web)
e_type                character varying (2048)    Custom event property 'type'
e_length              character varying (2048)    Custom event property 'length'
u_age                 character varying (2048)    Custom user property 'age'
u_gender              character varying (2048)    Custom user property 'gender'

1. Active Users

The number of active users that an app has over a given period of time is one of the most basic and important metrics for measuring an app's level of user engagement. This metric counts the number of distinct users who performed at least one tracked event during the specified time period. A basic example of an active user count query is:

Query Objective: Count the active users on a given day

SELECT COUNT(DISTINCT amplitude_id)
FROM events123
WHERE DATE(event_time) = '2015-03-01';

Explanation

This query returns the number of users who logged at least one event on March 1, 2015. The table name and date in the query above should be adjusted to your specific case.

amplitude_id vs. device_id vs. user_id

Notice that amplitude_id is used in the query above; this is the most accurate field for identifying unique users, as it combines information from device_id and user_id. Still, results based on either user_id or amplitude_id will usually be similar, so you can use either one in most cases. Further, in certain situations (see below) device_id and user_id are more useful because they contain information usable outside of Amplitude - e.g. user_id can be used for contacting users by email (as user_ids are often users' email addresses) and device_id can be used for push notifications. For more discussion of ID types and to understand how we count unique users, see our documentation.

Modifications

Time Zones

Dates and times are in UTC (formatted yyyy-mm-dd hh:mm:ss), so if you are interested in getting active user counts for different time zones, forgo the DATE() function and offset the full timestamps by the appropriate differential. For example, to obtain the number of daily active users in the 24-hour period corresponding to March 1st Pacific Time, modify the event_time part of the query above to:

WHERE event_time >= '2015-03-01 08:00:00'
AND event_time < '2015-03-02 08:00:00'

Users Who Did Specific Events

The basic query above counts users who did any event as active users. If you are instead interested in users who did (or did not do) certain event types, you can easily modify the query to do so. For example, if you only want users who did the 'sentmessage' event, just modify the WHERE part of the query to:

WHERE event_type = 'sentmessage'
AND DATE(event_time) = '2015-03-01';

Similarly, you can query for users who logged events other than certain events. For example, if your app tracks passive events such as push notifications, an active user might be best defined as a user who performs some active action. So, if the event you want to exclude is called 'receivedpush', modify the query to:

WHERE event_type != 'receivedpush'
AND DATE(event_time) = '2015-03-01';

Obtaining a List of Users

If you want to see who the active users are, rather than simply how many, you can obtain the list of user ids (which, depending on your app, may be a list of user email addresses, log-in names, etc.). The query is the same except for the beginning:

SELECT DISTINCT user_id FROM events123 ...

Note that we use user_id instead of amplitude_id because user_id is the identifier that your app recognizes (e.g. user email addresses, log-in names, etc.) while amplitude_id is Amplitude's internal id for users, which is not meaningful outside of Amplitude.

Saving Output to a File

The modification above returns a table with one user id per row, so if your app has thousands (or more) users per day, this can be a very long table. It is often more useful to save the results of the query to a file instead of just viewing them in the Redshift terminal. To do this, simply type the following command at the Redshift prompt:

\o your_file_name.csv

All query results for the remainder of your Redshift session will be written to the file your_file_name.csv on your local machine. To stop writing queries to the file, quit your session with:

\q

A variety of SQL UI tools exist that let you save tables generated from queries to Excel directly. A couple of these programs are SQL Workbench/J and Navicat.

2. New Users

Another fundamental metric of app performance is the number of new users (per day, week, month, etc.). New users for a given day are the users whose first Amplitude-recorded event occurred on that day. The basic query for a new user count is:

Query Objective: Count the new users on a given day

SELECT COUNT(amplitude_id)
FROM events123
WHERE first_event = 'True'
AND DATE(event_time) = '2015-03-01';

Explanation

The query above returns the number of users who logged their first event (specified by first_event = 'True'), and hence were new users, on March 1, 2015. The table name and date in the query above should be adjusted to your specific case.

Modifications

Time Zones

Dates and times are in UTC (formatted yyyy-mm-dd hh:mm:ss), so if you are interested in getting new user counts for different time zones, forgo the DATE() function and offset the full timestamps by the appropriate differential.

For example, to obtain the number of new users in the 24-hour period corresponding to March 1st Pacific Time, modify the event_time part of the query above to:

WHERE event_time >= '2015-03-01 08:00:00'
AND event_time < '2015-03-02 08:00:00'

Number of Users Who Did a Specific Event

The basic query above counts users who did any event as their first event. If you are instead interested in users whose first event was a certain event type, you can easily modify the query to do so. For example, if you only want users who did the 'signedup' event, just query the signedup event table:

SELECT COUNT(amplitude_id)
FROM app123.signedup
WHERE first_event = 'True'
AND DATE(event_time) = '2015-03-01';

Obtaining a List of Users

Just as with active users, it is often useful to obtain a list of the actual new user ids in addition to the count:

SELECT DISTINCT amplitude_id FROM events123 ...

3. Composition

Grouping your users by user properties will give you insight into who is using your app.

Query Objective: Show the breakdown of devices for users in a two-week period

SELECT device_model, COUNT(DISTINCT amplitude_id)
FROM events123
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-14'
GROUP BY device_model
ORDER BY COUNT DESC;

Explanation

The query above counts the number of distinct users by device for the first two weeks of March. It's worth noting that if a user does events on multiple devices during the time period, she will be counted in each device bucket. The table name and dates in the query above should be adjusted to your specific case.

Modifications

Filter on another user property

If you want to filter on another user property, add it to the WHERE clause:

WHERE country = 'India'

The query will now only include users in India.

4. Sessions

You can see the duration of time people are using your app. On the dashboard, session lengths are calculated by subtracting session_id (which is the session start time in milliseconds since epoch) from MAX(client_event_time).

Query Objective: Show the distribution of session lengths on a specific date

SELECT DATEDIFF('milliseconds', timestamp 'epoch' + session_id / 1000.0 * INTERVAL '1 second', max) AS diff_millisec
FROM
  (SELECT session_id, amplitude_id, MIN(client_event_time) AS min, MAX(client_event_time) AS max
   FROM events123
   WHERE session_id != -1
   AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
   GROUP BY session_id, amplitude_id)
WHERE DATE(min) = '2015-03-01'
ORDER BY diff_millisec ASC;

Explanation

The inner SELECT chooses distinct pairs of session_id and amplitude_id as well as the minimum and maximum timestamps per unique pair. The outer SELECT uses the DATEDIFF function to subtract the session start from MAX(client_event_time) by turning the session_id into a timestamp: it divides by 1000 (to get seconds), multiplies by the 1-second interval, and adds the result to the epoch timestamp (which is 0).

The final WHERE clause restricts the calculation to sessions that started on March 1 (because they could have extended into March 2). The table name and dates in the query above should be adjusted to your specific case.

Query Objective: Show the average session length per segment

SELECT
  (SELECT SUM(length)
   FROM
     (SELECT DISTINCT session_id, amplitude_id,
        DATEDIFF('milliseconds', timestamp 'epoch' + session_id / 1000.0 * INTERVAL '1 second', max) AS length
      FROM
        (SELECT amplitude_id, session_id,
           MAX(client_event_time) OVER (PARTITION BY session_id ORDER BY amplitude_id, client_event_time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max,
           MIN(client_event_time) OVER (PARTITION BY session_id ORDER BY amplitude_id, client_event_time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS min
         FROM events123
         WHERE country = 'United States'
         AND DATE(event_time) BETWEEN '2015-01-01' AND '2015-01-02'
         AND session_id != '-1')
      WHERE DATE(min) = '2015-01-01'))
  /
  (SELECT CAST(COUNT(DISTINCT session_id) AS float)
   FROM events123
   WHERE session_id != '-1'
   AND DATE(event_time) = '2015-01-01'
   AND country = 'United States')
  / 1000 AS average

Explanation

In the innermost subquery, we select amplitude_id, session_id, and the MAX and MIN values of client_event_time within each session, looking only at users from the United States on January 1st. We partition the table by session_id; PARTITION BY groups rows without aggregating them (each row with the same amplitude_id stays independent), and within each partition, the client_event_time is sorted from earliest to latest.

The subquery around it selects the distinct session_ids, amplitude_ids, and the difference between the maximum and minimum client_event_time, which gives you each session's length in milliseconds.

The subquery around that sums the lengths of the sessions, giving you the total time across all sessions. The subquery after the division sign gives you the number of distinct sessions from users who were in the United States on January 1st.

Finally, the outer query divides the total session time by the number of sessions, giving you the average session length; we then divide by 1000 to get the average in seconds. The country, dates, and table name can be adjusted for your specific case.

5. Events

Analyzing custom events will help you understand what users are actually doing when they're in your app. There are many different types of questions you can ask, so we'll provide some examples below.

Query Objective: Show the distribution of event property totals

SELECT DATE(event_time) AS date, e_type, COUNT(*)
FROM app123.signup
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-07'
GROUP BY date, e_type
ORDER BY date, COUNT DESC;

Explanation

The query shows the distribution of the 'type' property of the signup event for each day in the first week of March. Because event properties are pulled into their own columns, we can query the event property 'type' directly and use GROUP BY to capture each property value on each day. The table name, property column, and dates in the query above should be adjusted to your specific case.

Query Objective: Count the number of users who did an event at least twice on a specific date

SELECT amplitude_id, COUNT(*) AS total
FROM app123.game_initiated
WHERE DATE(event_time) = '2015-03-01'
GROUP BY amplitude_id
HAVING COUNT(*) >= 2;

Explanation

The query above finds the users who did the Game Initiated event two or more times on March 1. The GROUP BY builds a table of users and how many times each did the Game Initiated event, and the HAVING clause keeps only those who did it two or more times. The table name and date in the query above should be adjusted to your specific case.

Query Objective: Count the number of events done by a specific set of users who did another event

Specifically, we will count the number of sentmessage events done by people who did the signup event in California during the first two weeks of March. There will be two steps. First we need to get the set of users who did signup in California from March 1 through March 14. The query below gets us this set; it will be an intermediate query that we use in the final query.

SELECT DISTINCT(amplitude_id)
FROM app123.signup
WHERE region = 'California'
AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-14';

The table name, region, and dates in the query above should be adjusted to your specific case. The next step is to answer the question: for users in this set, how many sentmessage events happened during the same time period? There are two ways to get this: one uses IN and the other uses a JOIN. Both require the intermediate query defined above. We'll explain both so you can choose whichever you feel more comfortable using.

Using an IN

SELECT COUNT(*)
FROM app123.sentmessage
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-14'
AND amplitude_id IN
  (SELECT DISTINCT(amplitude_id)
   FROM app123.signup
   WHERE region = 'California'
   AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-14');

Explanation

The outer SELECT counts the number of sentmessage events. The condition amplitude_id IN (...) means it will only select rows where the amplitude_id is in the set of users returned by the subquery. So we place our intermediate query inside the IN, so that we only count messages from users who did signup in California from March 1 to March 14. The table names, region, and dates in the query above should be adjusted to your specific case.

Using a JOIN

CREATE OR REPLACE VIEW CalisignUp0301to0314 AS
  SELECT DISTINCT(amplitude_id)
  FROM app123.signup
  WHERE region = 'California'
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-14';

SELECT COUNT(*)
FROM app123.sentmessage
INNER JOIN CalisignUp0301to0314
  ON app123.sentmessage.amplitude_id = CalisignUp0301to0314.amplitude_id
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-14';

Explanation

The first part of the query is the intermediate query we defined above. We have turned it into a view (CalisignUp0301to0314) to make the query cleaner. The second part is a JOIN: we join the events table with the created view. The JOIN selects the amplitude_ids that appear in both tables (the users who did signup), and the rest of the query only picks from these rows. The table names, region, and dates in the query above should be adjusted to your specific case.

Query Objective: Show the distribution of users who have done an event by number of times done

SELECT messages, COUNT(*) AS users
FROM
  (SELECT amplitude_id, COUNT(*) AS messages
   FROM app123.sentmessage
   WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-14'
   GROUP BY amplitude_id)
GROUP BY messages
ORDER BY messages ASC;

Explanation

The query's output is a table of message counts and the number of users who sent that number of messages in the first two weeks of March. Here is a sample of the output:

messages | users
---------+------
       1 | 29588
       2 | 12625
       3 |  6151
       4 |  3568
       5 |  2469
       6 |  1808
       7 |  1363
     etc.

The inner SELECT creates a table of unique users and how many messages each logged during the first two weeks of March. The outer SELECT creates a table of each message count and the number of users who fell into that bucket.

Query Objective: Find out the last three events a user does before churning

We'll limit the analysis to people who used the app the month before last.

1. Define churned users as people who were active in January but have not logged an event in February:

CREATE VIEW churned AS (
  SELECT DISTINCT(amplitude_id)
  FROM events123
  WHERE DATE(event_time) BETWEEN '2015-01-01' AND '2015-01-31'
  AND amplitude_id NOT IN (
    SELECT DISTINCT(amplitude_id)
    FROM events123
    WHERE DATE(event_time) BETWEEN '2015-02-01' AND '2015-02-28'
  )
);

2. Fetch the last three events per churned user:

CREATE TEMPORARY TABLE last3 AS (
  SELECT *
  FROM (
    SELECT amplitude_id, event_type,
      row_number() OVER (PARTITION BY amplitude_id ORDER BY event_time DESC)
    FROM events123
    WHERE event_type NOT IN ('session_start', 'session_end')
    AND amplitude_id IN (SELECT amplitude_id FROM churned)
  )
  WHERE row_number <= 3
);

3. Join the tables to combine the three events into one row:

CREATE VIEW last3joined AS (
  SELECT a.amplitude_id,
         a.event_type AS e1,
         b.event_type AS e2,
         c.event_type AS e3
  FROM (SELECT * FROM last3 WHERE row_number = 1) AS a
  JOIN (SELECT * FROM last3 WHERE row_number = 2) AS b
    ON a.amplitude_id = b.amplitude_id
  JOIN (SELECT * FROM last3 WHERE row_number = 3) AS c
    ON a.amplitude_id = c.amplitude_id
);

4. What were the last three events before the user churned?

SELECT e1 || ',' || e2 || ',' || e3 AS last3, COUNT(*)
FROM last3joined
GROUP BY last3
ORDER BY count DESC;

Explanation

This query shows the last three events users did, out of the set of users who were active in January but not in February. The table names and dates in the queries above should be adjusted to your specific case.
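As a side note, a NOT IN subquery over a large events table can be expensive. The same churned set can also be written as a LEFT JOIN anti-join, the same pattern used later for funnel drop-off; this is a sketch under the assumption that amplitude_id is never NULL (the view name churned_alt is ours):

```sql
-- Anti-join formulation: keep January users with no matching February row.
CREATE VIEW churned_alt AS (
  SELECT DISTINCT jan.amplitude_id
  FROM events123 jan
  LEFT JOIN (
    SELECT DISTINCT amplitude_id
    FROM events123
    WHERE DATE(event_time) BETWEEN '2015-02-01' AND '2015-02-28'
  ) feb
    ON jan.amplitude_id = feb.amplitude_id
  WHERE DATE(jan.event_time) BETWEEN '2015-01-01' AND '2015-01-31'
  AND feb.amplitude_id IS NULL
);
```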

6. Funnels

For almost any app, there are key sequences of events that users should progress through in order to successfully begin or continue using the app; such a sequence is commonly called a funnel. For example, for a messaging app, the key initial funnel might have three steps:

(1) The openapp event
(2) The viewmessage event
(3) The sendmessage event

Note: Tracking the number of users who make it (and don't make it) to each stage in a funnel is crucial, as it identifies which parts of your app's user experience flow are smooth, and which parts are bottlenecks that need improvement.

In this section, we'll demonstrate how to do funnel analysis in Redshift, using the three-step messaging app funnel described above as an example. To do this, we will create each step in the funnel as a SQL view - essentially a saved query that we can reuse without retyping it.

Query Objective: Obtain a list of users for each step of a funnel

CREATE VIEW Funnel_Step_1 AS (
  SELECT DISTINCT user_id
  FROM app123.openapp
  WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
);

This view, which we name Funnel_Step_1, captures the users who opened the app during March 1st and 2nd. Next, we use the Funnel_Step_1 view to construct the view for the second step in the funnel:

CREATE VIEW Funnel_Step_2 AS (
  SELECT DISTINCT app123.viewmessage.user_id
  FROM app123.viewmessage
  INNER JOIN Funnel_Step_1
    ON app123.viewmessage.user_id = Funnel_Step_1.user_id
  WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
);

Funnel_Step_2 captures the subset of the users from Funnel_Step_1 who also did the viewmessage event during the first two days of March; that is, the users who did both openapp and viewmessage. Finally, we use Funnel_Step_2 to construct the view for the third step of the funnel:

CREATE VIEW Funnel_Step_3 AS (
  SELECT DISTINCT app123.sendmessage.user_id
  FROM app123.sendmessage
  INNER JOIN Funnel_Step_2
    ON app123.sendmessage.user_id = Funnel_Step_2.user_id
  WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
);

Funnel_Step_3 captures the subset of the users from Funnel_Step_2 (which, recall, is itself a subset of the users from Funnel_Step_1) who also did the sendmessage event during the first two days of March.

Now that we have created the views for our funnel, we can analyze each step. First we can look at the count of users who made it to steps 1, 2, and 3, respectively, using the queries:

SELECT count(*) FROM Funnel_Step_1;
SELECT count(*) FROM Funnel_Step_2;
SELECT count(*) FROM Funnel_Step_3;

Query Objective: Adding steps to a funnel

While our example funnel here has three steps, you can add as many steps to your funnel as you'd like. Let's add a step to the funnel above:

CREATE VIEW Funnel_Step_4 AS (
    SELECT DISTINCT app123.next_event.user_id
    FROM app123.next_event
    INNER JOIN Funnel_Step_3
        ON app123.next_event.user_id = Funnel_Step_3.user_id
    WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
);

Query Objective: Getting the list of users who did (or did not) reach a step in a funnel

In addition to getting the counts of users for each step in the funnel, you can also get the list of user_ids for the users who did (or did not) reach a given step. To get the list of users who reached step X but did not reach step X+1 -- referred to as users who dropped off the funnel at step X+1 -- use the query below. Here we obtain the users who reached step 2 of our example funnel (they did the openapp and viewmessage events) but did not reach step 3 (they did not do the sendmessage event):

SELECT Funnel_Step_2.user_id
FROM Funnel_Step_2
LEFT JOIN Funnel_Step_3
    ON Funnel_Step_2.user_id = Funnel_Step_3.user_id
WHERE Funnel_Step_3.user_id IS NULL;

Query Objective: Funnels where users did event X, then Y, with no other events in between

In our dashboard, users are counted as converted as long as they complete the next funnel step on the same day or up to 30 days after they have entered the funnel. To get a list of users who did the first step in the funnel and immediately proceeded to do the next event, we will need to start using window (partition) functions. Let's say we are looking at a funnel with the events openapp and viewmessage, and we only want to count the users who did viewmessage immediately after openapp, with no other events in between. In this case, we must query the combined events table instead of the individual event tables, because an individual event table does not tell us which event immediately follows. To count the users who did the openapp event immediately followed by the viewmessage event, use the query:

SELECT COUNT(DISTINCT amplitude_id)
FROM (
    SELECT amplitude_id,
           event_type,
           event_time,
           LEAD(event_type, 1) OVER (PARTITION BY amplitude_id
                                     ORDER BY event_time) AS next_event_type
    FROM events123) sub
WHERE next_event_type = 'viewmessage'
  AND event_type = 'openapp'
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02';

Explanation

The inner subquery selects amplitude_id, event_type, and event_time along with the window function. PARTITION BY is similar to GROUP BY, but it does not aggregate the rows (each row with the same amplitude_id stays independent), and within each partition we have chosen to order by event_time, from earliest to latest. The LEAD function with offset 1 returns the value from the row one after the current row, and the AS keyword names that column next_event_type. Note that LEAD only works within a partition (see the null values in the sample table below). From this resulting table, we select only the rows where next_event_type has the value 'viewmessage' (the second event in the funnel), event_type has the value 'openapp' (the first event in the funnel), AND the event occurred on March 1st or 2nd.

A simplified example of the partitioned table can be seen below:

amplitude_id    event_time    event_type     next_event_type
a               1:01          openapp        viewmessage
a               1:03          viewmessage    viewmessage
a               1:05          viewmessage    null
b               1:06          openapp        viewmessage
b               1:10          viewmessage    null
c               1:12          openapp        null

In the example above, two rows and two users satisfy this requirement (users a and b, rows 1 and 4).
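The same adjacent-event query can also be written with the subquery factored out as a named common table expression via the WITH clause (which Redshift supports), which some find easier to read. A sketch of the equivalent form:

```sql
-- Same adjacent-event funnel, with the ordered-events subquery named in a CTE
WITH ordered_events AS (
    SELECT amplitude_id,
           event_type,
           event_time,
           LEAD(event_type, 1) OVER (PARTITION BY amplitude_id
                                     ORDER BY event_time) AS next_event_type
    FROM events123
)
SELECT COUNT(DISTINCT amplitude_id)
FROM ordered_events
WHERE event_type = 'openapp'
  AND next_event_type = 'viewmessage'
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02';
```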

Query Objective: Funnels where users did event Y after event X, within 24 hours of event X

CREATE OR REPLACE VIEW openapp_funnel1 AS
SELECT * FROM (
    SELECT amplitude_id,
           event_time,
           row_number() OVER (PARTITION BY amplitude_id
                              ORDER BY event_time ASC)
    FROM app123.openapp
    WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02') sub
WHERE row_number = 1;

The inner SELECT creates a table with the Amplitude ID and the time at which the user did the openapp event. The table is partitioned by amplitude_id, and within each partition the event times are sorted from least to greatest. Each row in each partition is given a row number. The outer SELECT picks only the first row of each partition - this is the first time the user did the openapp event in the given window.

The inner SELECT makes a table that looks like this:

amplitude_id    event_time    row_number()
a               1:00          1
a               1:30          2
b               1:04          1
c               1:05          1
c               1:10          2
c               1:15          3

The outer SELECT makes a table that looks like this:

amplitude_id    event_time    row_number()
a               1:00          1
b               1:04          1
c               1:05          1

CREATE OR REPLACE VIEW openapp_funnel2 AS
SELECT DISTINCT amplitude_id FROM (
    SELECT openapp_funnel1.amplitude_id,
           DATEDIFF('milliseconds', openapp_funnel1.event_time,
                    app123.viewmessage.event_time) AS dt
    FROM openapp_funnel1
    INNER JOIN app123.viewmessage
        ON openapp_funnel1.amplitude_id = app123.viewmessage.amplitude_id
    WHERE DATE(app123.viewmessage.event_time) BETWEEN '2015-03-01' AND '2015-03-02') sub
WHERE dt > 0 AND dt <= 86400000;

The inner SELECT JOINs the funnel1 table with the events table on the Amplitude ID. It selects the ID and the difference in time between the 2nd event and the 1st event ('dt'). For the time difference we have to use the DATEDIFF() function, because Redshift doesn't recognize intervals (the output you would get if you simply subtracted the dates). In the WHERE clause, the upper bound of the date range extends one day later because the second event could happen during the next day. The outer SELECT picks just the IDs where the difference is greater than 0 milliseconds (meaning the 2nd event happened after the 1st event) and at most 86400000 milliseconds (1 day).

SELECT COUNT(*) FROM openapp_funnel1;
SELECT COUNT(*) FROM openapp_funnel2;

To get the conversion rate, divide the 2nd value by the 1st value.
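Rather than running the two counts separately and dividing by hand, the rate can be computed in a single statement; a sketch, assuming the two views above exist (the * 1.0 avoids integer division):

```sql
-- Conversion rate from openapp_funnel1 to openapp_funnel2 in one query
SELECT (SELECT COUNT(*) FROM openapp_funnel2) * 1.0 /
       (SELECT COUNT(*) FROM openapp_funnel1) AS conversion_rate;
```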

7. Revenue

If your app tracks revenue-generating events through Amplitude, such as in-app purchases, you can query for users and actions based on these revenue-generating events in Redshift.

Note: Amplitude offers highly accurate revenue tracking by verifying purchases with Apple iTunes and Google Play. This section assumes that your app has instrumented revenue verification.

Verified revenue events are stored in Amplitude's Redshift database with event_type verified_revenue, and the actual monetary amount for each purchase is stored in the revenue column.

Query Objective: Obtain the number of paying users and total revenue

A very useful summary query finds the number of distinct users who spent money on purchases over a period of time and the total amount of money they spent:

SELECT count(distinct amplitude_id), sum(revenue)
FROM app123.verified_revenue
WHERE revenue > 0
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02';
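The same summary can also be broken out per day over the window, which is often how revenue is monitored; a sketch using the same table and filters:

```sql
-- Paying users and total revenue per day over the period
SELECT DATE(event_time) AS day,
       count(distinct amplitude_id) AS paying_users,
       sum(revenue) AS total_revenue
FROM app123.verified_revenue
WHERE revenue > 0
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
GROUP BY DATE(event_time)
ORDER BY day;
```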

Query Objective: Obtain a list of top paying users

Next, we can obtain a list of our app's so-called whales (i.e. users who are highly engaged and are the highest spenders on in-app purchases). The following query returns the user_ids of paying users and the total amount they have spent over a specified time period, in descending order (highest paying users first):

SELECT user_id, sum(revenue) AS totalspent
FROM events123
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
  AND revenue IS NOT NULL
GROUP BY user_id
ORDER BY totalspent DESC;
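If you only want the top of this list - say, the ten biggest spenders - the same query can be capped with a LIMIT clause; a minimal variant:

```sql
-- Top 10 spenders over the period
SELECT user_id, sum(revenue) AS totalspent
FROM events123
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
  AND revenue IS NOT NULL
GROUP BY user_id
ORDER BY totalspent DESC
LIMIT 10;
```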

8. User Properties

A very common query is selecting users who satisfy some property intrinsic to them - their country, language, device platform (iOS or Android), the ad network that directed them to the app, etc. Amplitude tracks all of this data, so finding the users who satisfy user properties is a simple query in Redshift.

There are two primary types of user properties: properties tracked automatically by Amplitude, and custom-defined user properties. Each requires different query syntax, which we will go over below.

Properties tracked automatically by Amplitude

These properties are stored for every event in their own Redshift column and include:

version: the version of your app being used (e.g. 3.4.2)
country: the country as set on the user's device
city: the city of the user
region: the region of the user (states within the United States, provinces in other countries)
DMA: designated marketing area, a marketing area that shares media
language: the language as set on the user's device
platform: the operating system type, e.g. Android, iOS, Chrome, etc.
OS: the version number of the operating system, e.g. Android 4.4.2
device family: e.g. Samsung, Casio, Kyocera, Acer
device type: e.g. iPhone 6, Galaxy
carrier: e.g. Verizon, Vodafone, AT&T

It is often useful to first look at the most common values for a given user property. For example, perhaps we are interested in knowing the countries in which our app has the most users. To do this, use the following query:

Query Objective: Obtain the most common values for a given user property

SELECT country, count(distinct amplitude_id) AS count
FROM events123
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
GROUP BY country
ORDER BY count DESC;

This will return a table with the country name as the first column and the number of distinct users from that country as the second column; the 'ORDER BY count DESC' clause at the end will list the countries from the highest number of users to the lowest.

Once we have a sense of the relevant property values, we can then query for the list of users who have certain user properties. For example, if we are interested in getting a list of active users on March 1st and 2nd who are from either Canada or the United Kingdom, we can perform the following query:

Query Objective: Obtain a list of users who have certain properties

SELECT DISTINCT user_id, country, platform
FROM events123
WHERE (country = 'Canada' OR country = 'United Kingdom')
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02';

Here we return all of the relevant columns (user_id, country, platform) so we can see the corresponding property values for each user satisfying the query; however, you can choose to return just the user_id, or you can return other column values that are not part of the WHERE clause.

Custom-defined User Properties

In addition to the user properties automatically tracked by Amplitude, your app can specify additional user-level properties. User properties are pulled into their own columns in each event table; there is a limit of 400 user properties that can be put into their own columns. All other properties are saved in JSON format in a single Redshift column called user_properties. Possible examples include the advertising network the user was referred from, the number of photos the user has saved in the app, the amount of in-game currency the user has, etc. Conceptually, these are very similar to the Amplitude-tracked user properties discussed above; they track one aspect of the current state of a user, and they are not event-specific (so the same user properties and values appear on all events for a user at a given point in time).

As an example, say we want to see the most common advertising referral networks for users, and we have stored this value in the user_properties column under the key 'Referral'. Then the query is:

Query Objective: Obtain the most common advertising referral networks for users

SELECT JSON_EXTRACT_PATH_TEXT(user_properties, 'Referral') AS Referral_Type,
       count(distinct amplitude_id) AS count
FROM events123
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
GROUP BY Referral_Type
ORDER BY count DESC;

This will return a table with the referral network name as the first column (which we have chosen to call Referral_Type, but you can name it anything you want) and the number of associated distinct users as the second column; the 'ORDER BY count DESC' clause will list the referral network names in descending order from the highest number of users to the lowest.
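Once you know the common values, you can also filter down to the users who came from one particular network; a sketch, where 'network_xyz' is a hypothetical value to be replaced with one of your own:

```sql
-- Distinct users whose Referral user property equals a specific network
-- ('network_xyz' is a placeholder value)
SELECT DISTINCT user_id
FROM events123
WHERE JSON_EXTRACT_PATH_TEXT(user_properties, 'Referral') = 'network_xyz'
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02';
```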

Numerical Custom-defined User Properties

If the user property you are interested in has numerical values instead of text, you can query for ranges of values. For example, below we query for the number of users whose 'Current Level' in our game app is greater than 7 but less than 10 (i.e. level 8 or 9):

Query Objective: Obtain the number of users whose current level is greater than 7 but less than 10

SELECT count(DISTINCT amplitude_id)
FROM events123
WHERE NULLIF(JSON_EXTRACT_PATH_TEXT(user_properties, 'Current Level'), '')::int > 7
  AND NULLIF(JSON_EXTRACT_PATH_TEXT(user_properties, 'Current Level'), '')::int < 10
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02';

Be sure to use the syntax above - specifically the NULLIF() function (which converts empty strings '' to the special SQL value NULL) and the ::int cast (which converts strings to integers). This is necessary for numerical property values, since the JSON_EXTRACT_PATH_TEXT() function returns strings.
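The same range can equivalently be expressed with BETWEEN, whose bounds are inclusive (here 8 and 9); an equivalent sketch:

```sql
-- Equivalent query using BETWEEN (inclusive: levels 8 and 9)
SELECT count(DISTINCT amplitude_id)
FROM events123
WHERE NULLIF(JSON_EXTRACT_PATH_TEXT(user_properties, 'Current Level'), '')::int BETWEEN 8 AND 9
  AND DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02';
```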

9. Event Properties

In addition to user properties, Amplitude also allows tracking of event properties, which provide deeper data on user actions, specific to the type of event that occurred. For example, in a gambling game app, when the user does a 'BET' event on a hand of cards, an event property called 'amount' can capture the amount of in-game currency they wagered. Or in a shopping app, when a user purchases an item, triggering a 'PURCHASE' event, an event property called 'item_name' can capture the name of the specific item that was purchased.

Amplitude stores these event-based properties in Redshift in their own individual columns for each event type. There is a limit of 50 event properties that can be pulled out into their own columns; all other event properties are stored in a special JSON column called event_properties. To query the latter, we use the same syntax that we use for custom-defined user properties (as described in section 8), based on the Redshift JSON_EXTRACT_PATH_TEXT() function.

Taking the shopping app example from the previous paragraph, the following query finds the names of the items bought and the count of how many times each item was purchased, over a period of time, ordered by the count:

Query Objective: Obtain the list of items bought and how frequently each item was purchased

SELECT item_name, count(*) AS count
FROM app123.purchase
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
GROUP BY item_name
ORDER BY count DESC;

Numerical Event Properties

If the event property you are interested in has numerical values instead of text, you can query for ranges of values. Taking the gambling game app example from the last section, we can query for the number of users who, when doing a 'BET' event, wagered between 100 and 500 credits:

Query Objective: Obtain the number of users who did BET and wagered between 100-500 credits

SELECT count(distinct amplitude_id) AS count
FROM app123.bet
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
  AND NULLIF(e_credits, '')::int >= 100
  AND NULLIF(e_credits, '')::int <= 500;
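Beyond counting users, the same cast lets you aggregate the property itself; a sketch that totals and averages the credits wagered on qualifying bets (using the same e_credits column as above):

```sql
-- Total and average credits wagered across bets of 100-500 credits
SELECT sum(NULLIF(e_credits, '')::int) AS total_wagered,
       avg(NULLIF(e_credits, '')::int) AS avg_wagered
FROM app123.bet
WHERE DATE(event_time) BETWEEN '2015-03-01' AND '2015-03-02'
  AND NULLIF(e_credits, '')::int BETWEEN 100 AND 500;
```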

Additional Resources

Amplitude Docs
Case Studies
Blog
Amazon Redshift Docs

Questions? support@amplitude.com

Scalable analytics built for growth