BI For Beginners

How to run cohort analyses in Excel or Tableau

Cohort analyses are critical for understanding how key KPIs perform across different segments. The challenge is the complexity of going from raw data to insightful visuals.

In this post I'm going to take you step-by-step through the process I use to create insightful cohort analyses for my clients using Excel and Tableau (check out my "What is Tableau?" post).

If you have no experience with Tableau then focus on the principles I cover in this post and replicate the entire process in Excel.

Framing the cohort analysis

Since cohort analysis is boundless in its applications, let's pick an example to work with. For the rest of this post, let's assume we are trying to answer the following question: "Did the recent change to our onboarding email flow improve user retention?"

Let's assume the onboarding email flow changed from version 1 to version 2 as shown in the diagrams below. Let's also assume version 2 went live on 1st October and only users that signed up from that date were exposed to the new flow.

Whichever analysis you run, make sure you understand all the details up front so a specific nuance doesn't end up wasting your precious hours.

Since the goal mentioned above is a bit too vague for our liking, let's compare the following specific metrics:

  • % of users who returned to the service within 30 days of signup
  • % of users with multiple login days within the first 30 days, showing the distribution in buckets of 1, 2, 3-5, 6-8, 9-12 and >12.

Version #1 of the onboarding email flow

Onboarding email flow version 1

Version #2 of the onboarding email flow

Onboarding email flow version 2

The visuals above were created using Lucidchart.

We can see that the main differences in the flows are:

  • New email added on day 10
  • New versions of emails #1, #2 and #5.
  • Email #3 in version 1 was replaced with email #4
  • The email sent on day 30 is different in version 2.

The next step is to identify which data we are going to use for our analysis. Since we are dealing with user retention, we will need event-based data. Events let us view the history of our users, which is exactly what we need to analyze. We will use the "Logged In" event, which is sent each time a user logs into the service.

At this point we are done framing our analysis and are ready to start preparing our data set. Before we do that, let's look at a quick summary of where we stand.

  • The goal: Compare the retention rates of users exposed to email onboarding flow v1 with the retention rates of users exposed to email onboarding flow v2. We are going to look specifically at return rate and the distribution of multiple login days.
  • Cohorts: Users that signed up before 1st October and users that signed up after 1st October. Since the onboarding takes 30 days, we will need to give users some time to go through the new flow.
  • The data: The "Logged In" event.

Preparing the data for the analysis

Now for the fun part: diving into the data. Depending on where you send your event data, you may need to do a lot of manual work to get everything into Excel or Tableau. If you are fortunate enough to have your event data in a database, you can connect to it directly from Tableau, or run the needed query and export the results to a CSV.

Do what you need to do to create a table like the one shown below, either in Excel or Tableau.

Event data for cohort analysis

Your table must contain only the relevant event (do all the heavy filtering at the beginning), the main ID of the user / organization / member that triggered the event, and the timestamp of the event.
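If you'd rather script this step than click through it, here's a minimal pandas sketch of the same filtering. The file name and column names (raw_events.csv, event_name, user_id, event_sent_at) are assumptions for illustration, so swap in whatever your export actually uses.

# Minimal sketch of the filtering step in pandas. File and column
# names are assumptions -- adjust them to match your own event export.
import pandas as pd

events = pd.read_csv("raw_events.csv", parse_dates=["event_sent_at"])

# Do the heavy filtering first: keep only the relevant event
# and only the columns the analysis needs.
logins = events.loc[
    events["event_name"] == "Logged In",
    ["user_id", "event_sent_at"],
]

logins.to_csv("logged_in_events.csv", index=False)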

The next step is to enrich our events with relevant dimensions from our users that we may want to look at during our analysis. This step is optional since we want to stay focused on the goals we set, but it is usually quick and can help with a deeper dive later. The enrichment can be done by connecting the user_id of the event to the user_id in the users table, either with a VLOOKUP in Excel or with a left join within Tableau.
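For reference, here is the same enrichment as a pandas left join; users.csv and the example dimension columns are assumptions for illustration.

# The enrichment step as a pandas left join. The users file and the
# example dimensions (country, plan) are assumptions for illustration.
import pandas as pd

logins = pd.read_csv("logged_in_events.csv", parse_dates=["event_sent_at"])
users = pd.read_csv("users.csv", parse_dates=["creation_date"])

# A left join keeps every login event, even if a user row is missing.
enriched = logins.merge(
    users[["user_id", "creation_date", "country", "plan"]],
    on="user_id",
    how="left",
)
enriched.to_csv("enriched_events.csv", index=False)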

Lastly, we need to add a column to our table which we can use to group the rows by the relevant time-based dimension. In our case that is the user's signup date, since this is when the onboarding process starts. If you instead need the first event triggered by each user as your dimension, you can use a Tableau FIXED level-of-detail (LOD) expression to isolate the first event.

The formula is: {FIXED [user_id] : MIN([event_sent_at])}
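If you're scripting the prep instead, the pandas equivalent of that FIXED expression is a grouped transform; column names follow the earlier sketches.

# Pandas equivalent of the FIXED LOD expression: the earliest event
# per user, broadcast back onto every row for that user.
import pandas as pd

enriched = pd.read_csv("enriched_events.csv", parse_dates=["event_sent_at"])
enriched["first_event_at"] = (
    enriched.groupby("user_id")["event_sent_at"].transform("min")
)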

In the end your table should look like this.

event data with dimensions for cohort analysis

If you got this far prepping your data in Excel, it's time to import the file into Tableau. You can do this by clicking "Microsoft Excel" on the loading screen.

Importing Excel file in Tableau

Once in Tableau we can easily create a DATEDIFF calculation between the creation date (our starting point) and when the login event took place.

The formula is: DATEDIFF('day',[creation_date],[event_sent_at])

Once you've created the calculated field, convert it to a dimension.

How to convert a measure to a dimension in Tableau
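If you're working in pandas rather than Tableau for this step, here is a sketch of the same DATEDIFF calculation, with column names as in the earlier sketches.

# Whole days between signup and each login event, matching Tableau's
# DATEDIFF('day', ...), which counts day boundaries -- hence the
# normalize() calls that truncate both timestamps to midnight.
import pandas as pd

df = pd.read_csv(
    "enriched_events.csv", parse_dates=["creation_date", "event_sent_at"]
)
df["days_since_signup"] = (
    df["event_sent_at"].dt.normalize() - df["creation_date"].dt.normalize()
).dt.days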

That's it: that's all the prep needed, and now you can start building your cohort views in Tableau.

Building the data visuals

Step 1 of building the visual is to place the DATEDIFF dimension in the columns and our starting-point dimension, in this case the user's creation date, in the rows. Filter out the noise by isolating the period we are interested in (the first 30 days). I also decided to group creation date by week rather than by day so the table isn't too large; this also gives us large enough samples in each row. You should end up with something like the table shown below.

Cohort table in Tableau

Next, let's create a COUNTD(user_id) measure and drag it into the view.

Cohort table with data in Tableau

So let's try and understand what we are looking at.

The blue rectangle that runs top to bottom represents the total unique users who signed up to our service in the given week, listed in the leftmost column. So, for example, 370 unique users joined our service in the week of 1st October 2017.

The columns represent the number of days from the signup date, with zero being the same day. If we focus on the blue rectangle that runs from left to right, we can see the unique number of users who logged into the service 1 day after signup, 2 days after signup and so on. We see that 69 of the original 370 users logged in on day 15 after signup. This is also the week in which the new onboarding went live: every row above this one represents users exposed to version 1, while this row and every row below it represent users exposed to the new onboarding.
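For reference, here is how the same cohort table could be reproduced in pandas, continuing the earlier sketches (the column names are still assumptions).

# Distinct users per signup week (rows) and days since signup
# (columns) -- a pandas version of the Tableau cohort table.
import pandas as pd

df = pd.read_csv(
    "enriched_events.csv", parse_dates=["creation_date", "event_sent_at"]
)
df["days_since_signup"] = (
    df["event_sent_at"].dt.normalize() - df["creation_date"].dt.normalize()
).dt.days
df["signup_week"] = df["creation_date"].dt.to_period("W").dt.start_time

cohort = df[df["days_since_signup"].between(0, 30)].pivot_table(
    index="signup_week",
    columns="days_since_signup",
    values="user_id",
    aggfunc="nunique",  # the equivalent of COUNTD(user_id)
)
print(cohort)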

Absolute numbers are useful, but percentage of total is much more useful. To get there we need to create a new calculated field using the WINDOW_MAX function.

The formula is: COUNTD([user_id]) / WINDOW_MAX(COUNTD([user_id]))

Once you've created this new field, use it to replace COUNTD(user_id) and you should now see the data in percentage terms.
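Continuing the pandas sketch, the same normalization is a one-liner: divide every cell by the largest count in its row (day 0, i.e. the week's total signups).

# Percentage-of-total view: each cell divided by the row's maximum,
# mirroring WINDOW_MAX in Tableau. "cohort" is the pivot built above.
retention_pct = cohort.div(cohort.max(axis=1), axis=0)
print((retention_pct * 100).round(1))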

Cohort data should be viewed in percentage terms

Lastly, customize the view however you please to highlight the trends. I like to drag the measure onto Color and use a red-green diverging palette. After some quick customization you should get something similar to the table shown below.

You should use color scales to show trends in cohort tables

At this stage you can already see some interesting trends, but since this table shows daily retention it's actually too granular for our questions. We need to create some groupings.

We can use an IF formula for this. The first grouping I'll create uses the following formula:

IF [Datediff(d):Creation -->Event Sent at] = 0 THEN "Same Day"
ELSE "After Signup Day"
END

I can then replace the columns field with my new field, which gives me the table below. This table answers our first question: what % of users returned to the service within 30 days of signup?
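And for completeness, here are rough pandas equivalents of both groupings, continuing the earlier sketches; the bucket edges follow the buckets we defined when framing the analysis.

# Same-day vs. later logins, mirroring the Tableau IF formula, plus
# the bucketed distribution of distinct login days per user.
# Continues the "df" DataFrame from the earlier sketches.
import numpy as np
import pandas as pd

df["return_group"] = np.where(
    df["days_since_signup"] == 0, "Same Day", "After Signup Day"
)

# Distinct login days per user within the first 30 days, bucketed
# into 1, 2, 3-5, 6-8, 9-12 and >12.
login_days = (
    df[df["days_since_signup"].between(0, 30)]
    .groupby("user_id")["days_since_signup"]
    .nunique()
)
buckets = pd.cut(
    login_days,
    bins=[0, 1, 2, 5, 8, 12, np.inf],
    labels=["1", "2", "3-5", "6-8", "9-12", ">12"],
)
print(buckets.value_counts(normalize=True).sort_index())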

Since this is dummy data I'm not going to spend time analyzing the visuals. If you made it this far then you'll be able to group and manipulate the % table we created earlier to answer all the questions you set out to answer.

Summary

Tableau and Excel are incredible analysis tools which can be used to quickly and easily take well-structured event data and turn it into powerful cohort-based visualizations. If you grasped the main principles I covered in this guide then you can replicate this process to run cohort analyses on any KPI for which you are saving historical data.

If you got stuck at any step in this process then feel free to comment below or email me at justin@projectbi.net and I'll be happy to help where I can.