The Omnipotent and Omnipresent Normal Distribution
Let’s keep this section short and sweet.
The Normal (Gaussian) distribution is the most widely known probability distribution. Here are a couple of articles describing its power and wide applicability:
- How to Dominate the Statistics Portion of Your Data Science Interview
- What’s So Important about the Normal Distribution?
Because of its appearance in various domains and the Central Limit Theorem (CLT), this distribution occupies a central place in data science and analytics.
So, what’s the problem?
This is all hunky-dory, so what is the issue?
The issue is that the distribution of your specific data set often may not satisfy Normality, i.e. the properties of a Normal distribution. Yet because of the over-dependence on the assumption of Normality, most business analytics frameworks are tailor-made for working with Normally distributed data sets.
It is almost ingrained in our subconscious mind.
Let’s say you are asked to check if a new batch of data from some process (engineering or business) makes sense. By ‘making sense’, you mean whether the new data belongs to the same process, i.e. whether it falls within the ‘expected range’.
What is this ‘expectation’? How to quantify the range?
Automatically, as if directed by a subconscious drive, we measure the mean and the standard deviation of the sample dataset and proceed to check if the new data falls within a certain number of standard deviations.
If we have to work with a 95% confidence bound, then we are happy to see the data fall within 2 standard deviations. If we need a stricter bound, we check 3 or 4 standard deviations. We calculate Cpk, or we follow Six Sigma guidelines for ppm (parts-per-million) levels of quality.
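In code, that habitual check might look like the sketch below (the historical sample and the new point are hypothetical stand-ins):

```python
import numpy as np

# Hypothetical historical sample and a new observation to vet
rng = np.random.default_rng(42)
sample = rng.normal(loc=100.0, scale=5.0, size=500)
new_point = 112.0

mu = sample.mean()
sigma = sample.std(ddof=1)

# The habitual check: is the new point within k standard deviations?
k = 2  # "~95% coverage"... IF the data is truly Normal
within = abs(new_point - mu) <= k * sigma
print(f"mean={mu:.2f}, std={sigma:.2f}, within {k} std devs: {within}")
```

Note that the “~95%” comment on `k = 2` is exactly the implicit Normality assumption this article is about.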
All these calculations are based on the implicit assumption that the population data (NOT the sample) follows a Gaussian distribution, i.e. that the fundamental process from which all the data has been generated (in the past and at present) is governed by the familiar bell-curve pattern.
But what happens if the data follows some other pattern entirely, one that looks nothing like a bell curve?
Is there a more universal bound when the data is NOT Normal?
At the end of the day, we will still need a mathematically sound technique to quantify our confidence bound, even if the data is not Normal. That means our calculation may change a little, but we should still be able to say something like this:
“The probability of observing a new data point at a certain distance from the average is such and such…”
Obviously, we need to seek a more universal bound than the cherished Gaussian 68–95–99.7 rule (corresponding to 1/2/3 standard deviations’ distance from the mean).
Fortunately, there is one such bound, called the “Chebyshev Bound”.
What is Chebyshev Bound and how is it useful?
Chebyshev’s inequality (also called the Bienaymé-Chebyshev inequality) guarantees that, for a wide class of probability distributions, no more than a certain fraction of values can be more than a certain distance from the mean.
Specifically, no more than 1/k² of the distribution’s values can be more than k standard deviations away from the mean (or equivalently, at least 1 − 1/k² of the distribution’s values are within k standard deviations of the mean). Formally, for any k > 1, P(|X − μ| ≥ kσ) ≤ 1/k².
It applies to virtually any probability distribution with a finite mean and variance, and rests on a much more relaxed assumption than Normality.
How does it work?
Even if you don’t know anything about the secret process behind your data, there is a good chance you can say the following,
“I am confident that 75% of all data should fall within 2 standard deviations away from the mean”,
Or,
“I am confident that 89% of all data should fall within 3 standard deviations away from the mean”.
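Both numbers drop straight out of the 1 − 1/k² formula, and a two-line check confirms them:

```python
# Chebyshev guarantee: at least 1 - 1/k^2 of the data lies within k std devs
for k in (2, 3):
    print(f"k={k}: at least {100 * (1 - 1/k**2):.1f}% within {k} std devs")
# k=2: at least 75.0% within 2 std devs
# k=3: at least 88.9% within 3 std devs
```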
Here is how it plays out for an arbitrary-looking distribution.
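A minimal empirical sketch, using a lognormal distribution purely as a stand-in for skewed, non-Normal data:

```python
import numpy as np

# Deliberately non-Normal (skewed, heavy right tail) data
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

mu, sigma = data.mean(), data.std()
for k in (2, 3, 4):
    observed = np.mean(np.abs(data - mu) <= k * sigma)
    guaranteed = 1 - 1 / k**2
    print(f"k={k}: observed {observed:.3f} >= guaranteed {guaranteed:.3f}")
```

The observed fractions always sit at or above the Chebyshev guarantee, no matter how un-Normal the data looks.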
How to apply it?
As you can guess by now, the basic mechanics of your data analysis do not need to change a bit. You will still gather a sample of the data (the larger the better), compute the same two quantities you are used to calculating (the mean and the standard deviation), and then apply the new bounds instead of the 68–95–99.7 rule.
The table looks like the following (here k denotes that many standard deviations away from the mean):

| k | Minimum % of data within k standard deviations (1 − 1/k²) |
|---|---|
| 2 | 75% |
| 3 | 88.9% |
| 4 | 93.75% |
| 5 | 96% |
| 6 | 97.2% |
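The table comes straight from the formula, and inverting the formula tells you how many standard deviations you need for a target confidence. A minimal sketch:

```python
import math

def chebyshev_min_fraction(k: float) -> float:
    """Minimum fraction of data within k std devs of the mean."""
    return 1 - 1 / k**2

def chebyshev_k_for(confidence: float) -> float:
    """Smallest k whose Chebyshev guarantee reaches the target confidence."""
    return 1 / math.sqrt(1 - confidence)

print(f"{chebyshev_min_fraction(3):.3f}")  # 0.889
print(f"{chebyshev_k_for(0.95):.2f}")      # ~4.47 std devs for 95%
```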
What’s the catch? Why don’t people use this ‘more universal’ bound?
The catch is obvious from the table or from the mathematical definition: the Chebyshev rule is much weaker than the Gaussian rule at putting bounds on the data.
It follows a 1/k² pattern as compared to an exponentially falling pattern for the Normal distribution.
For example, to bound anything with 95% confidence, you need to include data up to ~4.5 standard deviations away from the mean, vs. only 2 standard deviations for the Normal.
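To see how quickly the gap grows, here is a side-by-side comparison of the k required for two-sided coverage under each rule (assuming SciPy is available for the Normal quantile):

```python
from scipy.stats import norm

for conf in (0.90, 0.95, 0.99):
    k_normal = norm.ppf(1 - (1 - conf) / 2)  # two-sided Gaussian quantile
    k_cheby = (1 - conf) ** -0.5             # Chebyshev: 1/sqrt(1 - conf)
    print(f"{conf:.0%}: Normal k={k_normal:.2f}, Chebyshev k={k_cheby:.2f}")
# 90%: Normal k=1.64, Chebyshev k=3.16
# 95%: Normal k=1.96, Chebyshev k=4.47
# 99%: Normal k=2.58, Chebyshev k=10.00
```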
But it can still save the day when the data looks nothing like a Normal distribution.
Is there anything better?
There is another bound, called the “Chernoff Bound” (and the closely related Hoeffding inequality), which gives an exponentially decaying tail bound (as compared to 1/k²) for sums of independent random variables.
This can also be used in lieu of the Gaussian distribution when the data does not look Normal, but only when we have a high degree of confidence that the underlying process is composed of sub-processes which are completely independent of each other.
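For intuition, here is a minimal sketch of the two-sided Hoeffding bound on the sample mean of n independent observations, each bounded in [a, b] (the function name is illustrative):

```python
import math

def hoeffding_tail(n: int, t: float, a: float, b: float) -> float:
    """Upper bound on P(|sample mean - true mean| >= t) for n
    independent observations, each bounded in [a, b]."""
    return 2 * math.exp(-2 * n * t**2 / (b - a) ** 2)

# 1,000 independent observations in [0, 1]: the chance that the sample
# mean is off by more than 0.05 is at most ~1.3%
print(hoeffding_tail(n=1000, t=0.05, a=0.0, b=1.0))  # ~0.0135
```

Note the exponential decay in n and t², which is the “exponentially sharp” behavior mentioned above.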
Unfortunately, in many social and business cases, the final data is the result of an extremely complicated interaction of many sub-processes which may have strong inter-dependency.
Summary
In this article, we learned about a particular type of statistical bound that can be applied to the widest possible variety of data distributions, independent of the assumption of Normality. It comes in handy when we know very little about the true source of the data and cannot assume it follows a Gaussian distribution. The bound follows a power law instead of an exponential decay (like the Gaussian) and is therefore weaker. But it is an important tool to have in your repertoire for analyzing any arbitrary kind of data distribution.
Contact
kingdavidoheb@gmail.com