BI For Beginners

How to enrich event data to help you answer all your questions

Imagine for a moment that you work for a gaming company. Imagine you've just been assigned a task to explore the data of one of the company's action games.

Now wouldn't be awesome if you could answer all your questions with one data set? In this post I'm going to share with you how to enrich event data so that your typical data set can be used for so much more.

The image below shows a sample of the data that you have to work with. Assume this data is sitting in a MySQL database.

event data sample

Write down a list of the questions you need answered

So the first thing you do is write down a list of questions you'd like to answer.

This is what you come up with:

  • What is the trend of new signups?
  • What is the percentage of users which fired at least one shot within 14 days of signup?
  • Which weapon is the most popular?
  • Which stage experiences the most shot's fired?
  • Are male or female players more active?
  • What is the percentage of users which switch characters at least 3 times?

I acknowledge that there are a lot more interesting questions you could ask about this data set. Test yourself by writing your own list and then see how you'd answer them.

The easy questions you can answer with event data

Let's start by getting the easy questions out of the way.

What is the trend of new signups?

To calculate this I'd simply filter the signed_up event and group by whichever time period I'm interested in.

In SQL it would look something like this:

SELECT
count (distinct user_id) as user_count,
date_trunct('month', event_sent_at) as signup_month
FROM
events
WHERE
event_name = "signed_up"
GROUP BY
2
ORDER BY
2 desc

Which weapon is the most popular?

The first thing I'd do to answer this question is enrich the data set with the weapon name. The weapon id is present so we can either do a vlookup (if we preferred to work in Excel) or a left join (in the case of Tableau or standard SQL).

Once I've got the weapon name as part of my data set I'd filter out all events but shot_fired and group by weapon name.

SELECT
count (distinct id) as event_count,
weapons.weapon_name
FROM
events
LEFT JOIN
weapons on events.weapon_id=weapons.id
WHERE
event_name = "shot_fired"
GROUP BY
2
ORDER BY
1 desc

Which stage experiences the most shot's fired?

This question is almost identical to the one above. Instead of enriching the data with weapon name, you'll bring stage name and then group by that dimension.

The tougher questions

Now for the fun part, the tougher questions.

What is the percentage of users which fired at least one shot within 14 days of signup?

Most analysts would look at this problem and opt for building a data source where the primary table is the users table (this is assuming the creation date of the user is in that table). A lot of SQL would then be written to try and calculate if each user fired at least one shot, and did it within 14 days of signup.

To do this in Tableau I'd follow the steps below:

  • I'd first want to add the signup date to every event. I can do this by either doing a left join between the events table and the users table, or by creating a duplicate field for the event_sent_at field if the event_name = signup_up.The first formula in Tableau would read like this: iif(event_name = "signed_up",event_created_at,null). Assume the name of this new field was signup_date.
    The second formula would be: {fixed user_id: max(signup_date)}. Assume the name of this new field was master_signup_date. Learn more about Tableau's fixed formula here.
  • I'd then create a new boolean field which is positive if the event_name = "shot_fired" and the datediff of the event was within 14 days of the signup date.The formula in Tableau would be: iif(event_name = "shot_fired" AND datediff('day',event_created_at,master_signup_date) <= 14,user_id,null). Assume the name of this field is, isShotFiredWithin14D
  • That's it really. To then build the viz I'd type countd(isShotFiredWithin14D)/countd(user_id). The reason this would work is because I'm doing a distinct count of the user_ids of users which have at least one row in the table where a shot fired event occurred, AND it happened within 14 days of the signup date of that user. I then divide that number by the unique count of users in my sample.
  • I'd admit that this is a dirty way of doing it and there are some nuanced issues you may run into but believe me, this is much quicker to pull off than trying to do it in standard SQL.

Are male or female players more active?

This one can be tricky because it really depends on how you define active. Let's assume that active is defined as creating 10 or more shot fired events in each of the first 3 months after signup.

Here is what I'd do:

  • I'd enrich the data by bringing the gender of the user from the users table via a left join. I'd user the user_id field for my join.
  • In Tableau I'd create a number of helper columns.
  • event_created_between_1_30_d = iif(datediff('day',master_signup_date,event_sent_at) <30,1,0)
  • event_created_between_31_60_d = iif(datediff('day',master_signup_date,event_sent_at) >29 and datediff('day',master_signup_date,event_sent_at) <60,1,0)
  • event_created_between_61_90_d = iif(datediff('day',master_signup_date,event_sent_at) >59 and datediff('day',master_signup_date,event_sent_at) <90,1,0)
  • shot_fired_in_month_1 = {fixed user_id: sum(event_created_between_1_30_d)}
  • shot_fired_in_month_2 = {fixed user_id: sum(event_created_between_31_60_d)}
  • shot_fired_in_month_3 = {fixed user_id: sum(event_created_between_61_90_d)}
  • I'd now create a calculated field which becomes my master flag. The formula would be iif(shot_fired_in_month_1 >=10 AND shot_fired_in_month_2 >=10 AND shot_fired_in_month_3>=10,user_id,null). Assume the name of this field is isActiveUserId
  • I'd now simply create a viz where I countd(isActiveUserID)/countd(user_id) and group by gender. I'd then compare the percentages of active users between the two genders, the one with the higher percentage would be the more active gender.

What is the percentage of users which switch characters at least 3 times?

This one is actually simpler than it looks.

I'd use the fixed formula to do a distinct count of character id on the user_id level.

The formula would look like this: {fixed user_id: countd(character_id)}.

I can now use that field as a filter, set it to >=3 and then distinct count user_id for that sample and divide by all users, similarly to how I did it in the question above.

Summary

Working with event data is a lot of fun but often you simply don't have all the data you need to answer key questions. This results in creating multiple data sources and a lot of wasted time.

I hope this post taught you that it is possible to enrich event data so you can manipulate the structure of the data set to answer almost any question.

It might not be the cleanest and standard way but as long as you understand the nuances involved in your data set you should be able to work quicker and provide more value.