Most entrepreneurs I meet have ambitious goals of trying to change the world. When you ask them about their road map it almost always includes plans to build a monolithic company with hundreds if not thousands of employees.
The road to build such a company is a long one with critical milestones along the way. Achieving milestones like product-market fit, growing a user base and bringing innovation to the market requires cutting corners and sacrifices which bring short term gains at the cost of long-term pains. One of these areas is the BI-friendliness of your company's databases.
In this post I'm going to share 22 tips that you should share with your R&D team so that they can make your databases more BI-friendly. This will save the company a huge amount of time and money in the long-run.
Let's get started.
The first tip and arguably one of the most important in this list is to be consistent. Imagine if you went into a store and the price of every third item was in a random currency of some far away land. It would be a disaster.
Your business analysts, developers and many others will be either writing code, queries or communicating to fellow colleagues about your databases an unmeasurable number of times in the history of your company. Make everyone's lives easier by being consistent with your naming conventions and formatting.
If you have a field in a table which can be stored as a timestamp rather than a date, do it. Your analysts can easily convert a timestamp into a date but can't convert a date into a specific second, minute or hour.
An analyst that needs to find how long it takes users to reach a certain milestone will only be able to calculate the number of days, instead of a more useful hours or minutes measurement.
I can't tell you how many hours of my life I've wasted trying to get queries to work because I assumed the name of a field was X when in fact it went against the naming convention rules and was actually Y.
Save your company all that wasted time by defining very clear naming conventions for your databases. Write these rules down and make sure every developer in the company is aware of them.
Below is a list of the most important naming convention rules I recommend to my clients.
If a field is a timestamp make sure to end the name with "at". So the timestamp for when a user was created in the system should be user_created_at.
For cases where you only have the date then end the name of the field with "date". An example of a user's birth date in your database would be user_birth_date. Another example may be "team_creation_date", for when an object like team is created in your database.
I recommend that you start the name of every boolean field with "is". "is_active_user", "is_decision_maker" and "is_owner" are just some examples.
This naming convention rule makes it easy for everyone to know if a field is a boolean or not.
Keep everything lowercase. This saves time when typing and since the names of tables and fields are not shared with the outside world, there is no repercussions with keeping the names of your tables and fields all lowercase.
For names which contain more than 1 word, put an underscore between words. Instead of writing userhistory, write user_history. This makes it easier to read, especially for non-native English speakers.
This recommendation falls into the "nice to have" category. I've yet to come across a production database which is set up like this. Generally this is only something you will do in your database but I like the idea of naming all columns in your tables with the name of the object in the name.
So for example, if you have a users table with id, first name, last name and domain, then instead use, user_id, user_first_name, user_last_name and user_domain.
When it comes to querying, this approach makes it much simpler and easier to debug complex queries.
The one exception to this rule is with foreign keys. If say I have a team_id in my user's table then I shouldn't name the field user_team_id, team_id is the better option.
If you are recording usage data of any kind then make sure you are saving the data in the lowest reasonable denomination.
Let's say for example you have a game where users make purchases with virtual coins they earn while playing the game. Instead of saving the total amount value of coins earned by users in dollars, save the value in cents. Your analysts can easily convert the amount into dollars and this will add more flexibility.
Another example is when recording time-based usage. Save this kind of information in seconds or milliseconds. Your analysts can convert the amount of time into minutes, hours or days with little effort.
Say you have an app which requires users to enter in their height and weight. If you had a field which was called "height" which had a number like 114 as a value, you'd need to do extra mental work to understand what this 114 represented. If the name of the field was "height_in_cms" you'd immediately know that the values in this field are the heights of users in centimeters.
If you have certain tables which are simply aggregations of other fact tables, say a daily aggregation of events per user, make sure to include "agg" to the names of these tables.
An example would be user_events_agg, or user_history_agg.
You might find that there are tables in your database which have nothing to do with your product. These might be admin tables for other internal tools which your R&D have built, or tables for a brand new product which R&D are planning.
In order to keep things organized I recommend including a prefix to these tables, something like "internal", or "alpha" so your analysts can identify irrelevant tables.
This tip is an obvious one. You want to keep the names of your tables and fields as short and clear as possible. The longer the names the more time it takes to write them down. Consider skipping helper words like "the", "for", etc when naming fields and tables.
I see this often with my clients. The person who is building your databases is often someone who has been in the company a long time and is very familiar with all the internal lingo. This lingo is foreign to every new employee.
Try and avoid using abbreviations and shortcodes when naming fields and tables, even if it means breaking tip #3.10.
This tip makes sense to every techie but its almost never a priority. As your product gets bloated so does your database and this means more noise for everyone.
Make sure to include cleaning up your database as part of your ongoing maintenance.
If you are in the habit of pivoting your products then you might decide to rebuild your database from scratch or at the very least add new tables.
You will be tempted to call this new set of tables "new_X" and "new_Y" for say, "new_programs" and "new_properties" to differentiate the tables from "programs" and "properties".
Don't do this. The issue is that "new" has little context and what do you do when you pivot again, use "new_new_X"?
A better approach is to use versioning, like "products_v2" and "properties_v2".
A strong BI manager or COO will understand where the company's analysts are wasting their time when it comes to creating data sets.
Do an 80/20 analysis and identify the common, heavy joins and grunt work that the team is doing and find engineering solutions for them.
This often means creating new aggregation or smart tables which include common calculations and information from numerous tables.
Creating these kinds of tables will result in a huge time saving for your analysts which can then spend their time helping to identify major risk factors to the business. Everyone wins.
As a general rule you want to try and avoid deleting your data. You never know what analyses you will need to run in the future and once the data is gone there is little you can do about it. Even if you create backups it can be a huge headache to try and access data which was deleted from production a year ago.
A way to get around this potential problem is to add "is_X_deleted" columns to all your non-helper dimension and fact tables. I'm talking about the tables where you store all the data you've collected from your users, or from your internal operations.
Let's say you have a scheduling app where users can schedule meetings with contacts. If a user deletes a scheduled meeting then the "is_meeting_deleted" field would change from zero to one, instead of the row being deleted.
Try your best to correctly spell the names of your tables and columns. A misspelled field or table will create unnecessary frustration for your analysts who need to know the misspelled version of the name, instead of just writing the name as it should be spelled.
I know what you're thinking. This Justin has crazy OCD. You aren't far off but there is actually a logical explanation behind this recommendation.
As a business analyst I couldn't remember the names of every field and table in every database so frequently I would run my eyes down a list.
Try and find a specific name in a long list and it will take you longer than you'd like. Imagine doing that dozens of times a month and the unnecessary waste of time and frustration starts to take a toll.
By ordering the fields in your tables alphabetically, your analysts will find what they are looking for much quicker and with a lot less frustration.
I previously mentioned the benefits of creating aggregation as a way to help your analysts. Another recommendation is to create helper fields to certain tables to save your analysts from running unnecessary logic.
An example might be calculating the difference in minutes between two timestamps in the table. If this is a common calculation for your analysts then let a data engineer create this once so time is saved in the long-run.
One of the toughest parts of being an analyst is trying to wrap your head around all the main objects in the company's database. Complex products may have users, members, organizations, companies and partners as separate objects in the database.
Make everyone's lives easier by being very clear and strict with how you arrange these objects, and the names you use to describe them.
This can be challenging for businesses which started out with their user as an organization and then added multi-user and multi-organization support. It may require more development time to rename all these objects in your code but its worth it in the long-run.
Another tip is to have a senior developer and product manager own a schematic of the key objects and their interactions. I recommend Lucidchart (I'm an affiliate of the service) as a great tool for creating such a schematic.
More often than not a company is not actively saving changes made by their users as historical data. This is very dangerous and dramatically limits your analysts from understanding important trends.
Some companies use event-tracking solutions like Segment, Mixpanel or Kissmetrics which is great but without moving this data into a data warehouse or at least a MySQL or Postgres, your analysts will be limited.
Start saving every relevant change a user can make to his account. This includes any changes to your product's feature set, his payment history, and his accessibility to different components of the service.
Thanks for reading.