Data: Why More is Sometimes Worse

by Jesse Ringer | Jan 31, 2020

Being data-driven is only worth something if the data you work with is good data.

Which raises the question: how do we know if our data is good? And more importantly, how do we make sure that it is?

Ask any data scientist or analyst what the most annoying part of their work is, and they will give you the same answer:

Data Cleaning.

Data cleaning is the process of removing and/or modifying data to ensure it is accurate, complete and relevant.

It is by far the most tedious and least enjoyable step. It is also an important one.

Once data is cleaned, it can be analyzed to provide insights that will ultimately be used to make strategic decisions.

It does not matter how good your analysis is, or how pretty your charts look. If your data is improperly cleaned, it could very well lead to poor decision making.

If you do miraculously find success from using inaccurate data to form your decision(s), please realize that YOU WERE LUCKY. And you would be naive to expect to replicate your success this way in the future.

“Just because you experienced a positive outcome, does not mean you made the right decision, and vice versa.”
– A Smart Person (I forget who)

One need look no further than any past lottery winner to fully grasp this point.

Now, of course, more data is not inherently bad. There are plenty of reasons why more data can be better. But it does increase the likelihood of issues and the time spent on data cleaning.

An Example:

A common practice amongst SEOs is to track keyword rankings on Google in order to measure the progress of SEO efforts on a website.

Keyword ranking data is used to gather insights, such as whether average rank has improved over time, or whether a website now ranks in the top 3 positions on Google for more keywords than it did last month.

On the surface, it would appear that tracking more keywords would be better. And it may be tempting to track hundreds, if not thousands, of short and long-tail keywords.

This can create problems.

Keywords should not be treated equally.

For starters, some keywords matter more than others. The more keywords tracked, the more noise there will be because there will inevitably be more keywords that are less important or relevant.

For a barbershop website, its ranking performance for “cheap men’s haircuts” matters more than “men’s haircuts” since the latter might be an educational search rather than a transactional one.

Both those keywords matter more than “men’s haircuts 2020”.

Suppose we are tracking all three keywords, and our website’s ranking for “cheap men’s haircuts” improves by five positions, “men’s haircuts” stays where it is, and “men’s haircuts 2020” worsens by six. On paper, we are worse off because the average rank for our tracked keywords has increased (a higher rank number means a lower position in the results, so an increase is bad).

I am sure you will agree that this would be the wrong conclusion to draw.
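To make the arithmetic concrete, here is a minimal Python sketch of the scenario above. The starting positions and the importance weights are made up for illustration; only the changes (five positions better, six positions worse) come from the example, and the third keyword is assumed not to move. It compares a plain average rank with one possible alternative, an importance-weighted average.

```python
# Hypothetical positions before and after. Only the changes (-5 and +6)
# come from the example; the starting values are invented.
before = {"cheap men's haircuts": 12, "men's haircuts": 20, "men's haircuts 2020": 15}
after = {"cheap men's haircuts": 7, "men's haircuts": 20, "men's haircuts 2020": 21}

# Hypothetical importance weights (higher = matters more to the business).
weights = {"cheap men's haircuts": 5, "men's haircuts": 3, "men's haircuts 2020": 1}


def average_rank(ranks):
    """Plain average: every keyword counts the same."""
    return sum(ranks.values()) / len(ranks)


def weighted_average_rank(ranks, weights):
    """Average where each keyword's rank counts in proportion to its weight."""
    total_weight = sum(weights[keyword] for keyword in ranks)
    return sum(ranks[keyword] * weights[keyword] for keyword in ranks) / total_weight


# A positive change means the average rank number went up, which is worse.
plain_change = average_rank(after) - average_rank(before)
weighted_change = (
    weighted_average_rank(after, weights) - weighted_average_rank(before, weights)
)

print(f"Plain average change:    {plain_change:+.2f}")
print(f"Weighted average change: {weighted_change:+.2f}")
```

With these made-up numbers, the plain average gets slightly worse while the weighted average improves, which matches the intuition that the gain on the important keyword should outweigh the loss on the unimportant one.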

It is easy to spot errors like this with fewer keywords, but imagine trying to do so when thousands of keywords are being tracked.

That’s a whole lot of noise.

Cutting Down On The Data Noise

One solution is to use keyword factors such as relevance, search volume, competition and location to categorize your keywords into different levels of importance.

This allows us to separate primary, secondary and tertiary keywords in our analysis in order to gauge performance more effectively.
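As a rough illustration of that idea, the sketch below sorts keywords into tiers and reports the average rank change per tier instead of one blended number. Everything in it is an assumption for the sake of the example: the `assign_tier` rule, its thresholds, the hand-assigned 1 to 5 relevance scores, and the search volumes are all hypothetical, and a real setup would likely weigh competition and location as well.

```python
from collections import defaultdict
from statistics import mean


def assign_tier(relevance, monthly_searches):
    """Hypothetical rule of thumb; the thresholds are illustrative only."""
    if relevance >= 4 and monthly_searches >= 100:
        return "primary"
    if relevance >= 3:
        return "secondary"
    return "tertiary"


# (keyword, relevance 1-5, monthly searches, rank change); all values invented.
# A negative rank change means the keyword moved up the results page.
tracked = [
    ("cheap men's haircuts", 5, 900, -5),
    ("men's haircuts", 3, 5000, 0),
    ("men's haircuts 2020", 1, 300, 6),
]


def rank_change_by_tier(rows):
    """Group rank changes by tier and average them within each tier."""
    changes = defaultdict(list)
    for _keyword, relevance, searches, change in rows:
        changes[assign_tier(relevance, searches)].append(change)
    return {tier: mean(values) for tier, values in changes.items()}


summary = rank_change_by_tier(tracked)
for tier, change in summary.items():
    print(f"{tier}: average rank change {change:+}")
```

Reading the tiers separately keeps a drop in a tertiary keyword from masking an improvement in a primary one.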

But doing so leads to a lot more work when setting up the keyword tracking and when analyzing the keywords in the future.

Moreover, the incremental value derived from each additional keyword you track decreases, and at some point turns negative.

In layman’s terms, going from one to two keywords will without a doubt provide more value than going from 500 to 501. And that will provide more value than going from 1000 to 1001.

At some point, having more keywords ceases to provide you with any valuable insights, and may even prove detrimental.

If you don’t have the time to go through thousands of keywords, it may be better to track and measure the progress of fewer, but highly relevant, keywords.

Ask yourself, what do I really want to know? And what data do I actually need to get the answer?

Having more data is useful if it aids in answering the questions you have, and if you have the time to properly clean and analyze it. If you don’t have the time, then it may be better to focus on what matters most.