## Introduction

Let’s say you’re working with a new data set. If you had to choose **one value that summarizes and captures the essence of the entire set**, what value would you pick?

This post will examine four measures that you can use to represent such a central value:

- **Mean**
- **Median**
- **Weighted Mean**
- **Mode**

By the end of this post, you’ll know how to calculate them, how they differ, and why you may prefer one over another in specific circumstances.

Let’s dive in with the first one!

Image Credit: clicjeroen

## Mean

**Mean** is simply the mathematical **average** of all the given values. To calculate it, we add all the values and divide the sum by the number of values.

Here’s the formula to calculate the mean:

$$ Mean = \frac{Sum \ of \ All \ Values}{Number \ of \ Values}$$

Let’s understand this using an example. Imagine you’re planning to fly from New York to Chicago. Here are the prices (in US dollars) for each day of the last week:

$$294, \ 321, \ 368, \ 215, \ 422, \ 253, \ 507$$

So what’s the mean price? Let’s use the formula to find out:

$$ Mean = \frac{294+321+368+215+422+253+507}{7} = \frac{2380}{7} = 340 $$

Thus the mean airfare for the last week is $340. That gives you a ballpark amount you can expect to pay for your ticket.
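If you'd like to check this in code, here is a minimal Python sketch. The variable names (`prices`, `manual_mean`, `library_mean`) are my own, and `statistics.mean` comes from Python's standard library; treat it as one way to reproduce the calculation rather than the only one.

```python
from statistics import mean

# Airfares (USD) for the last seven days, copied from the example above.
prices = [294, 321, 368, 215, 422, 253, 507]

# Mean = sum of all values / number of values.
manual_mean = sum(prices) / len(prices)

# The standard library computes the same quantity.
library_mean = mean(prices)

print(manual_mean)  # 340.0
```

Both approaches give the same value, so in practice you can pick whichever reads better in your code.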

## Trouble with the Mean

Mean is the most widely used metric to summarize data sets. It has a serious flaw, though. It doesn’t do well with data sets containing extreme values (outliers).

Let’s see this in action. Consider the list of values below:

$$ \ 6, \ 97, \ 98, \ 99, \ 100 $$

The value \({\color{Red} 6}\) is an outlier. All the other values are much higher and are bunched together between \({\color{Blue}97 - 100}\).

What do you think will be the mean of this list? Let’s find out:

$$ Mean = \frac{6+97+98+99+100}{5} = \frac{400}{5} = {\color{Red} 80} $$

The mean, \({\color{Red} 80}\), is significantly lower than four out of the five values (\({\color{Blue}97, 98, 99, 100}\)). The outlier, \({\color{Red} 6}\), exerted an undue influence and pulled the mean away from most values.

Thus, **when a data set contains outliers, the mean may not represent the central or the typical value.**

So what’s the alternative? We’ll find out in the next section!

## Median

The **median** splits a data set into two halves - there will be an equal number of data points below and above the median.

Finding the median is a two-step process:

- First, **sort the given data points** in ascending order.
- Then find a **value that splits the data set in half**.

The second step works differently for lists with even or odd numbers of data points. Let’s cover them one by one.

### Median for odd-sized lists

Let’s say you have an odd-sized list with \({ \large N} \) observations (\({ \large N} \) is odd). Assuming the list is sorted, **the observation at the \( { \Large \frac{(N+1)}{2}}\) position will be the list’s median**.

We can apply this to the (already sorted) list from the last section:

$$ \ 6, \ 97, \ 98, \ 99, \ 100 $$

There are 5 observations, so \({ \large N = 5} \). Let’s find the position of the median:

$$ Position \ of \ Median = \frac{N+1}{2} = \frac{5+1}{2} = 3 $$

Therefore, the value at the 3rd position, \({\color{Blue} 98}\), is the median. It fits the definition: \({\color{Blue} 98}\) divides the sorted list into two halves, with two observations below it and two above it.

Also, recall that the mean of this list was \({\color{Red} 80}\). The median, \({\color{Blue} 98}\), does a better job of representing the majority of the data points. **The outlier (value \({\color{Red} 6}\)) didn’t affect the median at all!**

Therefore, when you have a **data set containing outliers, the median will be a better metric to represent its central or typical value**.
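To see the contrast side by side, here is a small Python sketch that computes both metrics for the same list. It relies on `statistics.median` from the standard library; the variable names are just illustrative.

```python
from statistics import median

# The sorted list from above, with 6 as the outlier.
values = [6, 97, 98, 99, 100]

# The mean is dragged down by the outlier...
mean_value = sum(values) / len(values)

# ...while the median stays with the bulk of the data.
median_value = median(values)

print(mean_value)    # 80.0
print(median_value)  # 98
```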

### Median for even-sized lists

For data sets with an even number of observations, we must **calculate the median by taking the mean of the two middle values.**

Why is that? Let me explain using the airfare data again. Suppose you have ticket prices for the last *eight* days:

$$210, \ 215, \ 253, {\color{Blue} \ \large \underline{294}, \ \large \underline{321}}, \ 368, \ 422, \ 507$$

There’s no single value that splits these sorted prices in half. Instead, you have two values at the 4th and 5th positions (294 and 321) in the middle of the list.

Neither 294 nor 321 can be the median on its own, though, because neither of them splits the list into equal halves. In this case, we find the median by taking the mean of these two middle values:

$$ Median = \frac{294+321}{2} = 307.5$$

Thus the median airfare for the last eight days is $307.50.

Let’s generalize this rule. Suppose you have an even-sized list with \({ \large N} \) observations (\({ \large N} \) is even). Assuming the list is sorted, **the median will be equal to the mean of the observations at the positions \( { \Large \frac{N}{2}}\) and \( ({ \Large \frac{N}{2} }+ 1)\)**.
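Both positional rules (odd and even) can be folded into one small function. The sketch below is a hand-rolled illustration; the name `median_of` is hypothetical, and in real code you would more likely reach for `statistics.median`. Note that the formulas above use 1-based positions, while Python lists are 0-based.

```python
def median_of(values):
    """Return the median using the positional rules described above."""
    ordered = sorted(values)  # Step 1: sort in ascending order.
    n = len(ordered)
    if n == 0:
        raise ValueError("median_of() needs at least one value")

    middle = n // 2
    if n % 2 == 1:
        # Odd N: 1-based position (N + 1) / 2 is 0-based index N // 2.
        return ordered[middle]

    # Even N: mean of 1-based positions N/2 and N/2 + 1,
    # which are 0-based indices N // 2 - 1 and N // 2.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_of([6, 97, 98, 99, 100]))                      # 98
print(median_of([210, 215, 253, 294, 321, 368, 422, 507]))  # 307.5
```

The function sorts a copy of the input, so the caller's list is left untouched.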

## Weighted Mean

We treated all the data values equally when we calculated the mean. We added all of them and divided the sum by the number of observations.

However, you may run into situations where **some data points have more importance or weight than others**.

We can’t use the simple mean formula in these cases. Instead, we must find the **weighted mean**.

Allow me to illustrate this with an example.

Suppose you’ve enrolled in a Statistics course. Your final score depends on your performance on a quiz, a project, and two exams.

Not all of them have equal importance, though. Your midterm and final exam scores will have a greater influence than your quiz or project scores. Therefore your professor has assigned higher weights to them:

| Assignment Type | Weight |
|---|---|
| Quiz | 20% |
| Project | 20% |
| Midterm Exam | 30% |
| Final Exam | 30% |

Let’s assume you get the scores below (the last column) for each assignment:

| Assignment Type | Weight | Score |
|---|---|---|
| Quiz | 20% | 95 |
| Project | 20% | 100 |
| Midterm Exam | 30% | 85 |
| Final Exam | 30% | 90 |

Your final score will be the weighted mean of all of your scores. So how can you find it?

**Here’s how you can calculate the weighted mean**:

- Multiply each score by the corresponding weight. The product will give you a **weighted score**.
- Add up all the weighted scores to get the **total weighted score**.
- Divide the total weighted score by the **sum of all the weights**.

We can translate these steps into the weighted mean formula:

$$ Weighted \ Mean = \frac{Sum \ of \ Weighted \ Scores}{Sum \ of \ Weights}$$

Let’s apply this formula to your scores:

| Assignment Type | Weight (%) | Score | Weighted Score (Weight × Score) |
|---|---|---|---|
| Quiz | 20 | 95 | 1900 |
| Project | 20 | 100 | 2000 |
| Midterm Exam | 30 | 85 | 2550 |
| Final Exam | 30 | 90 | 2700 |

$$ \displaystyle Weighted \ Mean = \frac{1900+2000+2550+2700}{20+20+30+30} $$

$$ \displaystyle = \frac{9150}{100} = 91.50 $$

Thus your final score for the course will be **91.50**.
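The three steps translate almost line for line into Python. In the sketch below, the `weighted_mean` function and the list of `(weight, score)` pairs are my own illustrative choices, not part of any library.

```python
def weighted_mean(pairs):
    """Weighted mean of a list of (weight, score) pairs."""
    # Steps 1 and 2: multiply each score by its weight, then add them up.
    total_weighted_score = sum(weight * score for weight, score in pairs)
    # Step 3: divide by the sum of all the weights.
    total_weight = sum(weight for weight, _ in pairs)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return total_weighted_score / total_weight


# (weight in %, score) for the quiz, project, midterm exam, and final exam.
course_scores = [(20, 95), (20, 100), (30, 85), (30, 90)]

print(weighted_mean(course_scores))  # 91.5
```

Because the result is divided by the sum of the weights, it doesn't matter whether you write the weights as percentages (20, 30) or as fractions (0.2, 0.3).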

That’s quite impressive! I wish I had that score in my first statistics course 😞.

## Mode

**Mode is the value that occurs most frequently** in a dataset.

Consider the list of numbers below:

$$ 9, 7, {\color{blue} \underline{10}}, 13, {\color{blue} \underline{10}}, 11, 20, 11, {\color{blue} \underline{10}}, 8 $$

The number 10 occurs more times than any other value. Hence 10 is the mode of the above list.

Moreover, the list is **unimodal** because there’s only one mode - the value 10 is repeated the most often (three times). All the other values appear either twice or just once.

Contrast it with the list below:

$$ 9, 7, {\color{blue} \underline{10}}, 13, {\color{blue} \underline{10}}, {\color{red} \underline{11}}, 20, {\color{red} \underline{11}}, 12, 8 $$

This list is **bimodal** because it contains two modes - both 10 and 11 appear twice.

Similarly, a list is **trimodal** if it contains three modes, and lists with four or more modes are usually just called **multimodal** (a term that, strictly speaking, covers any list with more than one mode).
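Counting occurrences by hand gets tedious for longer lists, so here is a minimal Python sketch that returns every mode of a list. The helper name `modes` is hypothetical; it builds on `collections.Counter` from the standard library.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency."""
    counts = Counter(values)          # value -> number of occurrences
    highest = max(counts.values())    # raises ValueError on an empty list
    return [value for value, count in counts.items() if count == highest]


print(modes([9, 7, 10, 13, 10, 11, 20, 11, 10, 8]))  # [10]
print(modes([9, 7, 10, 13, 10, 11, 20, 11, 12, 8]))  # [10, 11]
```

The length of the returned list tells you whether the data is unimodal (one mode), bimodal (two), and so on. If you are on Python 3.8 or later, `statistics.multimode` offers similar behavior out of the box.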

## Summary

Today you learned how to calculate **mean, median, weighted mean,** and **mode** for any given data set.

These four metrics are known as the **Measures of Central Tendency** because they represent a data set’s center or typical value.

We also covered a few other key concepts:

- Why the **median works better than the mean** when the data set has **outliers**.
- How to find the central value when **certain values are assigned more weight** than others.

## Next Steps

Here’s how you can build upon the knowledge you’ve gained today:

- Knowing the center of a data set is only half the story. The other half is the distance between the values and the center. Please read the next post, **Measures of Spread**, where we discuss that.
- Read **how mean and median play a crucial role in feature scaling**.