Mastering Z-Scores: From Basics to Practical Applications
Introduction
In previous articles, I explained how to visualize a collection of data in various ways (e.g., histogram, line plot, density curve, etc.). These visualizations show us how the data is distributed and help identify trends, patterns, or outliers in the overall dataset.
However, sometimes, you may be interested in knowing more about a specific value within a distribution. Where does it fall in relation to the rest of the data points? Is it above or below the average? How far away is the value from the mean of the dataset? Is it an outlier?
The z-score can help us answer these questions, especially when working with normally distributed data.
In this article, I will explain the concept of the z-score using simple, relatable examples. We’ll start with the basics of the z-score and how to calculate it. Then, we’ll learn about using z-scores to compare data points across similar distributions. And finally, I’ll show you how to detect outliers using z-scores.
Let’s get started!
What Is a Z-Score?
The z-score tells us the position of a value within a data distribution using the center and spread of that distribution. Specifically, the z-score measures how many standard deviations a value is from the mean. We can use the formula below to calculate it:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the value, $\mu$ is the mean, and $\sigma$ is the standard deviation of the distribution.
Calculating the z-score this way is called standardization because it gives us the distance between the value and the mean in terms of standard deviations.
A positive z-score tells us that the value is greater than the mean. In contrast, a value below the mean will produce a negative z-score.
Let’s take an example. Suppose a newborn elephant weighed 250 pounds. You did some research and found that the average birth weight of an elephant is 200 pounds, and the standard deviation is 50 pounds.
We can use this information to calculate the z-score of the newborn elephant:

$$z = \frac{250 - 200}{50} = 1$$

The baby elephant was born with a weight that’s exactly one standard deviation above the mean weight.
Note that the z-score of the mean value will always be 0. You can verify that by plugging the mean weight (200) into the z-score equation:

$$z = \frac{200 - 200}{50} = 0$$
You can also do the reverse calculation to get the value for a given z-score. Simply multiply the z-score by the standard deviation and then add the mean. Here’s the formula:

$$x = z \cdot \sigma + \mu$$

Suppose the birth weight of another elephant had a z-score of 0.5. What’s the actual weight in pounds? Let’s find out using the above formula:

$$x = 0.5 \times 50 + 200 = 225 \text{ pounds}$$

You might ask, “So what if I can convert individual values to z-scores, and vice versa? What do we gain by doing that?”
The importance of the z-score becomes apparent when we use it with certain statistical distributions. Let’s look at one such example.
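The two conversions described above are easy to sketch in Python. Here is a minimal version, using the elephant numbers from this section (the helper names are my own, chosen for illustration):

```python
# Convert a raw value to a z-score, and a z-score back to a raw value.

def z_score(value, mean, std):
    """How many standard deviations `value` lies from `mean`."""
    return (value - mean) / std

def value_from_z(z, mean, std):
    """Reverse the standardization to recover the raw value."""
    return z * std + mean

# The elephant example: 250 lb newborn, population mean 200 lb, std 50 lb
print(z_score(250, 200, 50))       # 1.0 -> one standard deviation above the mean
print(value_from_z(0.5, 200, 50))  # 225.0 -> the weight behind a z-score of 0.5
```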
Z-Scores and the Normal Distribution
The normal distribution has some unique properties that make it well-suited for analysis using z-scores.
When we convert all values of a normally distributed variable to z-scores, we create what is known as a standard normal distribution. This new distribution will have a mean of 0 and a standard deviation of 1. Let’s look at a few properties of this distribution.
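You can verify this claim numerically. The sketch below simulates a normal sample with NumPy (the mean and standard deviation reuse the elephant example and are purely illustrative) and standardizes it:

```python
import numpy as np

# Simulate a normally distributed variable (mean 200, std 50), then standardize it.
rng = np.random.default_rng(0)
weights = rng.normal(loc=200, scale=50, size=10_000)

z = (weights - weights.mean()) / weights.std()

# The standardized values have a mean of (essentially) 0 and a std of 1
print(z.mean(), z.std())
```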
Center at Z-Score = 0
The normal distribution is centered at the mean. Since the standard normal distribution has a mean of 0, the z-score of 0 lies at its center.
Symmetry
The mean splits the normal distribution into two symmetrical halves that mirror each other. Thus, the area under the curve from the center to a positive z-score, such as $[0, +1.5]$, is equal to the area between the center and the corresponding negative z-score of the same magnitude, $[-1.5, 0]$.
The area under the curve represents the probability of a value falling within a specific interval. Therefore, the two shaded z-score intervals have the same probability. In other words, if you select a random value from a normal distribution, it is equally likely to fall within the interval $[0, +1.5]$ as within $[-1.5, 0]$.
Cumulative Probability and Percentile
Cumulative probability is the probability of selecting a random value that is less than or equal to a specific value.
How can we find the cumulative probability of a given z-score in a normal distribution? It’s equal to the proportion of the area under the curve that lies to the left of the z-score. The graph below highlights this area for a z-score of $0.25$:
It’s roughly $60\%$ of the total area. Hence, the cumulative probability for a z-score of $0.25$ is about $0.60$.
We can interpret this conclusion in another way: approximately $60\%$ of the values have a z-score lower than $0.25$. Or we could say that a z-score of $0.25$ represents the $60^{th}$ percentile in a normal distribution.
In the example above, I estimated the shaded area to find the cumulative probability and the percentile. You can get precise values for any given z-score by using a z-score table or Python with the SciPy library.
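For instance, here is how you could look up the exact values with SciPy, where `norm` is the standard normal distribution:

```python
from scipy.stats import norm

# Cumulative probability: area under the standard normal curve left of z = 0.25
p = norm.cdf(0.25)
print(round(p, 4))  # 0.5987 -> roughly the 60th percentile, as estimated above

# The inverse lookup: the z-score that sits at a given percentile
print(round(norm.ppf(0.60), 4))  # 0.2533
```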
Assignment: Review the empirical rule, which tells us the percentage of values that fall within 1, 2, and 3 standard deviations of the mean in a normal distribution. Can you restate this rule in terms of z-scores?
Comparing Values Using Z-Scores
Z-scores can help us compare values from two different distributions. Let me illustrate this with an example.
Let’s say you and your cousin, Sunny, are in the same grade but attend different schools. You both got results back from a recent science exam: you scored 80, and Sunny scored 65.
At first glance, it may seem that you performed better because you scored higher. However, it is possible that Sunny’s exam was tougher or his teacher was stricter in grading it. How can we compare these two scores more accurately?
We need to look at how the scores are distributed in each class. Specifically, we can compare how far each score is from the mean of its respective distribution.
Suppose both sets of scores are normally distributed. The mean and standard deviation of the scores in your class were 75 and 10, respectively. In Sunny’s class, the mean score was 55 with a standard deviation of 10.
Let’s calculate the z-score for each of your scores:

$$z_{you} = \frac{80 - 75}{10} = 0.5$$

$$z_{Sunny} = \frac{65 - 55}{10} = 1.0$$

Even though Sunny scored lower than you, his score is farther from the mean of his class than your score is from the mean of yours. Specifically, Sunny’s z-score is 1, while yours is only 0.5. Thus, relative to his classmates, Sunny performed better than you did relative to yours.
We can look at the cumulative probabilities and percentiles to reinforce this point further. As per the z-table, z-scores of 0.5 and 1.0 have cumulative probabilities of 0.6915 and 0.8413, respectively.
That means you scored higher than about 69% of your classmates, while Sunny’s score was higher than about 84% of his peers. So don’t be surprised if he starts claiming to be the brains of the family 😉.
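The whole comparison can be sketched in a few lines of Python, again using SciPy’s standard normal CDF:

```python
from scipy.stats import norm

def z_score(value, mean, std):
    """How many standard deviations `value` lies from `mean`."""
    return (value - mean) / std

your_z = z_score(80, mean=75, std=10)    # 0.5
sunny_z = z_score(65, mean=55, std=10)   # 1.0

# Cumulative probabilities: the fraction of classmates each score beats
print(round(norm.cdf(your_z), 4))   # 0.6915
print(round(norm.cdf(sunny_z), 4))  # 0.8413
```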
Finding Outliers Using Z-Scores
An outlier is an extreme value that occurs far away from the bulk of the observations. Such values are either too big or too small compared to the rest of the values.
Outliers can negatively affect data analysis: they can exert undue influence on statistics such as the mean and standard deviation and give you a distorted picture of how the data is distributed. Hence, outlier detection is one of the most crucial steps in data analysis.
The question is: how can we detect outliers? As per the empirical rule, 99.7% of the values of a normal distribution fall within 3 standard deviations of the mean. Any value outside this range is considered an outlier because the probability of getting such a value is only 0.3%, a rarity.
Let me explain this using an example. Suppose you collect data on 200 basketball players for a season and calculate the average number of points each scored per game. You then plot the average points per game as a histogram:
The average points per game are approximately normally distributed; a slight deviation from the theoretical, smooth curve is expected with real-world data. The distribution has a mean of 15 and a standard deviation of 5. Thus, most players score in the vicinity of 15 points per game.
We can also convert the points per game to z-scores. Thus, the mean (15) will have a z-score of 0, 20 will translate to a z-score of 1, and so on. The graph shows the z-scores below the raw points per game on the x-axis.
Almost all the values fall within 3 standard deviations of the mean, or within the z-score range of $[-3.0, +3.0]$. However, one player finished the season with $32.5$ points per game, which translates to a z-score of $3.5$. His scoring is far from the rest of the players and lies beyond 3 standard deviations from the mean. Such a performance is very rare, so we consider it an outlier.
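Here is a sketch of the 3-standard-deviation rule in NumPy. I use the distribution’s mean of 15 and standard deviation of 5 from this example; the player values other than 32.5 are made up for illustration:

```python
import numpy as np

mean, std = 15, 5  # points-per-game distribution from the example

# A handful of hypothetical players, plus the 32.5 points-per-game standout
points = np.array([12.0, 15.5, 20.0, 9.5, 18.0, 32.5])

z = (points - mean) / std
outliers = points[np.abs(z) > 3]  # empirical rule: |z| > 3 is an outlier

print(z)         # 32.5 maps to a z-score of 3.5
print(outliers)  # only 32.5 is flagged
```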
Once we’ve found outliers, we can handle them in many different ways. We can either exclude them from the analysis or use measures such as the median, which are not unduly affected by outliers. See my related post on this topic.
Summary
This article introduced you to the concept of the z-score using practical, real-life examples. Let’s quickly recap what you learned today:

- What a z-score is, and how it helps us understand the relative position of an observation within a distribution.
- The special properties of the normal distribution (symmetry, predictable area under the curve), and how we can use z-scores and these properties to calculate probabilities and percentiles.
- How we can use z-scores to compare values from different yet similar distributions.
- What outliers are, why it is important to identify them, and how z-scores can help us detect outliers in normally distributed data.
This in-depth knowledge of z-scores provides you with a solid foundation for more advanced concepts in statistics. Watch out for my next article on the correlation coefficient, where we’ll see how the z-score plays a crucial role in defining the relationship between two variables.