Primary or secondary data analysis: which method should you choose?

Business analysts will conduct many analyses in their careers. Many of these analyses will be made possible by the hard work someone else did to collect the data. In this post you'll learn the difference between primary and secondary data analysis.

What is the difference between primary and secondary data analyses?

Primary data is data collected by a researcher or group of researchers for a specific analysis.

Let's say, for example, that you want to run an analysis in your company to determine the health levels of your employees. You create a survey and hand it out to each employee. The data you collect would be classified as primary data.

Now let's say that after you've collected this data another analyst in your company uses it for a totally different analysis. In that case the second analyst would be using secondary data in a secondary data analysis.

Advantages of primary data analysis

There are a number of key advantages to using primary data for an analysis.

You can validate the reliability of the data

One of the biggest advantages of conducting a primary data analysis is that you know the source of the data. You collected the data yourself, so you have detailed knowledge of the collection methodology and of whether the data is reliable.

You have what you need for your analysis

Before you conduct any analysis you need to know what information is necessary to complete the analysis. In the case of primary data, you specifically gather the exact information that you need for your analysis.

The same can be said for volume. If you prefer a large sample size or a very specific mix of data then you can go out into the world and gather the exact volume and mix that you need.

Disadvantages of primary data analysis

Primary data analyses are expensive

Going out and manually collecting data, or paying a research company to do it for you, is very expensive.

There will be specific analyses where you'll have no choice but to gather the data yourself. You'll need to plan well and budget accordingly so that you gather exactly what you need.

You will need to do your own data prep

Once you've gathered the primary data you'll need to go through a multi-stage process to clean and organize your data and to check its statistical validity.

This is a difficult and time-consuming process which will add to the costs of your analysis.
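To give a feel for what that prep involves, here is a minimal Python sketch of a cleaning pass over the employee health survey from earlier. The rows, the field being measured (hours of exercise per week) and the 40-hour plausibility cut-off are all made up for illustration; a real survey would need rules that fit its own questions.

```python
from statistics import mean

# Hypothetical raw survey rows: (employee_id, hours of exercise per week as typed)
raw_responses = [
    ("e01", "3"),
    ("e02", ""),       # missing answer
    ("e03", "5"),
    ("e01", "3"),      # duplicate submission
    ("e04", "400"),    # implausible value, probably a typo
    ("e05", " 2 "),    # stray whitespace
]

MAX_PLAUSIBLE_HOURS = 40  # assumed cut-off, chosen only for this example


def clean_responses(rows):
    """Return {employee_id: hours} without blanks, non-numbers, outliers or duplicates."""
    cleaned = {}
    for employee_id, answer in rows:
        answer = answer.strip()
        if not answer:
            continue  # drop missing answers
        try:
            hours = float(answer)
        except ValueError:
            continue  # drop answers that are not numbers
        if not 0 <= hours <= MAX_PLAUSIBLE_HOURS:
            continue  # drop implausible values
        cleaned.setdefault(employee_id, hours)  # keep the first response per employee
    return cleaned


cleaned = clean_responses(raw_responses)
print(cleaned)
print(round(mean(cleaned.values()), 2))
```

Half of the six rows are discarded here, which is exactly the kind of decision you have to make, document and defend yourself when the data is your own.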

In the case of secondary data, often this work has already been done by the researchers that gathered the data in the first place.

Advantages of secondary data analysis

Secondary data analyses are quicker and (often) cheaper

Secondary data analyses are for the most part significantly cheaper and quicker to complete than primary data analyses because you're not collecting the data yourself. The data has often been prepped and validated statistically and can be used immediately. These two skipped steps will save you many hours of work.

Depending on the scope and subject matter of the analysis, it may be significantly cheaper to purchase a data source than to collect the data yourself. In these cases a secondary data analysis will not only be quicker (time is money) but the costs involved in getting your hands on the necessary data will be lower.

In the case of an online business, you may want to purchase a piece of software which will gather the data for you. Such a tool may allow your analysts to conduct dozens of ad hoc secondary data analyses.

Wide range of data sources available for secondary data analysis

More and more countries, organizations and companies are publishing large studies and useful data sources which can be used free of charge for secondary data analyses. Check the license attached to each dataset before you rely on it.

My favorite site for finding useful data sources is Kaggle. Kaggle has one of the largest collections of resources and data sets for data scientists and analysts. Don't believe me? Check out this up-to-date dataset of nearly 40,000 international football matches.

Disadvantages of secondary data analyses

Data validity and coverage

If you're not collecting the data yourself, how will you know for sure that it's valid? There is always a risk when using a secondary dataset that the data is not reliable: it may have been faked, collected using a flawed methodology (a biased sample, for example) or manipulated for political reasons.

The other issue with secondary data is that it may not contain exactly what you need. You may be forced to come up with a proxy method for measuring a specific variable or combine a number of datasets to get around a missing variable. This may result in a higher than acceptable margin of error or the entire analysis being scrapped.

In the case of primary data, you know exactly what you're getting since you collected it yourself.

You don't control the structure of the data in a secondary data analysis

Since you're using a dataset that you didn't construct yourself, you won't necessarily have the data in the format that you'd like.

You may want to analyze your data by city but all you have is the latitude and longitude of each point. Now you'll need to find a way to turn those data points into cities.
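One simple way to do that conversion is to assign each point to the nearest city in a reference list. The Python sketch below assumes a tiny hand-made table of three cities and uses the haversine formula for distance; the table and function names are illustrative, and a real analysis would more likely use a proper reverse geocoding service or a much larger gazetteer.

```python
from math import asin, cos, radians, sin, sqrt

# Small hypothetical reference table: city -> (latitude, longitude) in degrees
CITY_COORDS = {
    "London": (51.5074, -0.1278),
    "Paris": (48.8566, 2.3522),
    "Berlin": (52.5200, 13.4050),
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_city(lat, lon):
    """Return the name of the closest city in CITY_COORDS."""
    return min(CITY_COORDS, key=lambda city: haversine_km(lat, lon, *CITY_COORDS[city]))


print(nearest_city(48.80, 2.13))  # a point just west of Paris
```

Note that this always returns some city, even for a point in the middle of the ocean, so in practice you would also want a maximum distance beyond which a point is left unassigned.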

Another example is grouping of data. For example, you may have salary ranges for employees instead of their exact salaries. This would prevent you from accurately calculating the average, minimum and maximum salaries in your sample.
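With banded salaries the best you can do is estimate. A common workaround is to treat everyone as earning the midpoint of their band, which gives an approximate average, and to report the range the true average could fall in. The Python sketch below uses made-up bands and headcounts purely for illustration.

```python
# Hypothetical banded data: (lower bound, upper bound, number of employees)
salary_bands = [
    (30_000, 40_000, 10),
    (40_000, 50_000, 20),
    (50_000, 60_000, 10),
]


def estimate_mean_from_bands(bands):
    """Estimate the mean by assuming every employee earns the midpoint of their band."""
    total_people = sum(count for _, _, count in bands)
    weighted_sum = sum((low + high) / 2 * count for low, high, count in bands)
    return weighted_sum / total_people


def mean_bounds(bands):
    """Return the lowest and highest mean that are consistent with the bands."""
    total_people = sum(count for _, _, count in bands)
    lowest = sum(low * count for low, _, count in bands) / total_people
    highest = sum(high * count for _, high, count in bands) / total_people
    return lowest, highest


print(estimate_mean_from_bands(salary_bands))  # 45000.0
print(mean_bounds(salary_bands))               # (40000.0, 50000.0)
```

The estimate of 45,000 could be off by up to 5,000 in either direction, and the exact minimum and maximum salaries cannot be recovered at all: you only know which bands they fall in.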

Which is better, primary or secondary data analysis?

Primary and secondary data analyses have different pros and cons.

If you have a tight budget and need to deliver the analysis quickly then secondary data would be the way to go.

If you have a very specific analysis in mind that needs to have a very high degree of accuracy then primary data would serve you best.

As an analyst, you'll need to determine which approach makes sense considering all the variables involved.