Five-Number Summary

The five-number summary is a way to describe the spread of a distribution around its median. If we sort the values from smallest to largest and then chunk the data into quartiles (four groups with equal numbers of data points), we get the five-number summary by identifying the borders:


  1. Q0 (or min): the minimum value in a data set 

  2. Q1: the border between the lowest and 2nd lowest quartile 

  3. Q2: the median; the border between the two middle quartiles (or if you split up the data into two equal groups of data points, the border between the lower half and upper half)

  4. Q3: the border between the highest and 2nd highest quartile

  5. Q4 (or max): the maximum value in a data set
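As a small illustration of the five borders, here is a sketch using base R's quantile() function, which with no extra arguments returns the 0%, 25%, 50%, 75%, and 100% quantiles. The scores data are hypothetical and were chosen so that every border lands exactly on a data point; with other data sets, different calculation methods can disagree about Q1 and Q3.

```r
# Hypothetical data: nine values, chosen so each border falls exactly on a value
scores <- c(7, 1, 5, 3, 9, 11, 13, 15, 17)

# Step 1: sort from smallest to largest
sorted_scores <- sort(scores)
sorted_scores

# Step 2: the borders Q0 (min), Q1, Q2 (median), Q3, Q4 (max)
five_numbers <- quantile(scores)
five_numbers
```

Here the sorted values are 1, 3, 5, 7, 9, 11, 13, 15, 17, and the five borders are 1, 5, 9, 13, and 17.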



The five-number summary can be visualized with boxplots. For more information, see the articles on boxplots and gf_boxplot().

Additional Notes and Information:

I hand-calculated the five-number summary and it’s not the same as what I got from favstats(). What’s going on?

Many Calculation Methods  

You might have tried calculating the five-number summary by hand from a small data set and then discovered that it does not match the output of favstats(). This discrepancy exists because statisticians do not universally agree on how to calculate quantiles (quartiles, in the case of the five-number summary). The hand calculation method is not necessarily the best way to calculate quartiles because it doesn't generalize to other quantiles (e.g., tertiles, quintiles, deciles). We will contrast the hand calculation method with the method used by favstats() in the sections below.

Hand Calculation Method

Here is how we usually tell students to hand calculate the five-number summary:

  1. Sort values from smallest to largest.

  2. Find median (Q2).

    1. For a data set with an odd number of values, the median is the middle number such that the number of values above the median is the same as the number of values below the median.

    2. For a data set with an even number of values, the median is the average of the two middle numbers.

  3. Q1 is the median of the set of numbers below the median.

  4. Q3 is the median of the set of numbers above the median.
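The steps above can be sketched in R as follows. The function name hand_five_number is hypothetical (it is not part of base R or any package), and this is only one reading of the hand method: for an odd number of values it leaves the median out of both halves, so its Q1 and Q3 may differ from those of fivenum(), which includes the median in both halves in that case.

```r
# Hypothetical helper (not from any package): the hand calculation method
hand_five_number <- function(values) {
  stopifnot(length(values) >= 2)
  sorted <- sort(values)                    # Step 1: sort smallest to largest
  n <- length(sorted)
  half <- n %/% 2                           # how many values sit on each side of the median
  lower_half <- sorted[seq_len(half)]       # positions below the median
  upper_half <- sorted[(n - half + 1):n]    # positions above the median
  c(min    = sorted[1],
    Q1     = median(lower_half),            # Step 3: median of the lower half
    median = median(sorted),                # Step 2: median of all values
    Q3     = median(upper_half),            # Step 4: median of the upper half
    max    = sorted[n])
}

mynums <- c(10, 30, 35, 40, 40, 45, 50, 50, 55, 100)
hand_result <- hand_five_number(mynums)
hand_result
```

For mynums, which has an even number of values, this returns 10, 35, 42.5, 50, and 100.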


In the book, we prefer the five-number summary generated by the favstats() function (the section that follows the fivenum() example explains why). If you would like to replicate the hand calculation method in R, though, you can use the function fivenum():


mynums <- c(10, 30, 35, 40, 40, 45, 50, 50, 55, 100)

fivenum(mynums)


output of mynums and fivenum() function

A Better Five Number Summary → favstats()

When the number of values is even, favstats() can result in a slightly different Q1 and Q3 than the hand calculation method. There are actually 9 different ways to calculate the quartiles of a five-number summary (WHOA!), but this particular one is important for 2 reasons: (1) we value the five-number summary that connects most directly to boxplots (and in general, most statisticians do too), and favstats() returns the one that is commonly used to create boxplots; (2) this method of creating quantiles generalizes beyond the five-number summary. The five-number summary depends on determining the quartiles of the data (25% increments), but this method can extend to any quantile (e.g., quintiles, tertiles, etc.).
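To see the different methods side by side, here is a sketch using base R's quantile() function, whose type argument selects one of nine quantile definitions. It assumes that favstats() gets its quartiles from quantile() with the default setting (type = 7); that assumption is consistent with the worked example in this article but is not confirmed here.

```r
mynums <- c(10, 30, 35, 40, 40, 45, 50, 50, 55, 100)

# Q1 under each of the nine quantile definitions offered by quantile()
q1_by_type <- sapply(1:9, function(type) {
  unname(quantile(mynums, probs = 0.25, type = type))
})
names(q1_by_type) <- paste0("type", 1:9)
q1_by_type

# The same function extends beyond quartiles, e.g., to deciles (10% increments)
deciles <- quantile(mynums, probs = seq(0, 1, by = 0.1))
deciles
```

With the default (type 7), Q1 for mynums is 36.25 rather than the 35 found by hand.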

 

Here is how favstats calculates the five number summary:

  1. Sort values from smallest to largest. Consider them to be $x_1, x_2, \ldots, x_n$. For example, in the mynums data below, $x_1 = 10$ and $x_{10} = 100$.

  2. Assume that the smallest number is the 0% quantile and the largest number is the 100% quantile. 

 

table of mynums and quantile

 

  3. The position of the 25% quantile is at $1 + 0.25(n - 1)$; note that you can generalize this to any q% quantile with the more general equation $1 + \frac{q}{100}(n - 1)$.

    1. For example, with the mynums data, $1 + 0.25(10 - 1) = 3.25$. We’ll call this number position.

    2. If position is an integer, then that gives you the position of Q1. (This happens when $n - 1$ is divisible by 4, for example with 5, 9, or 13 values in a data set.)

    3. If position is not an integer but a decimal number: 

      1. Consider the two integers immediately below and above that position. These integers will be called the floor and ceiling respectively. For example, if position = 3.25, then the floor = 3 and the ceiling = 4.

      2. The 25% quantile can then be found as $x_{\text{floor}} + (\text{position} - \text{floor})(x_{\text{ceiling}} - x_{\text{floor}})$. So for our data it would be $x_3 + (3.25 - 3)(x_4 - x_3) = 35 + 0.25(40 - 35) = 36.25$.

 

One way to think about this method is that it maps the sorted values onto a scale that runs from the 0% quantile to the 100% quantile. When the 25% quantile falls at a position that does not exist in the data, the method asks what value would be there if the space between the existing values were a continuous number line.


graphic comparing ordinal position to mynums and quantile

 

For example, since there is no value at the 3.25 position in this set of data, we consider the space between the 3rd and 4th values and ask: if this were a tiny number line, what value would be at 3.25?
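The "tiny number line" idea can be sketched in R as follows. The function name quantile_by_position is hypothetical (it is not part of mosaic or base R); it simply walks through the position, floor, and ceiling steps described above and is meant as an illustration rather than the actual code behind favstats().

```r
# Hypothetical helper (not part of mosaic or base R): quantile by interpolated position
quantile_by_position <- function(values, prob) {
  sorted <- sort(values)
  n <- length(sorted)
  position <- 1 + prob * (n - 1)     # e.g., 1 + 0.25 * (10 - 1) = 3.25 for mynums
  lower <- floor(position)           # integer position just below (or equal)
  upper <- ceiling(position)         # integer position just above (or equal)
  fraction <- position - lower       # how far along the tiny number line to go
  sorted[lower] + fraction * (sorted[upper] - sorted[lower])
}

mynums <- c(10, 30, 35, 40, 40, 45, 50, 50, 55, 100)
q1 <- quantile_by_position(mynums, 0.25)
q3 <- quantile_by_position(mynums, 0.75)
c(Q1 = q1, Q3 = q3)
```

For mynums, Q1 sits a quarter of the way from 35 (the 3rd value) to 40 (the 4th value), which gives 36.25. Q3 sits at position 7.75, between two values that are both 50, so it is 50.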

 

output of mynums and favstats() function 

 

Here is a slightly more formal explanation that may help clarify the difference: https://chemicalstatistician.wordpress.com/2013/08/12/exploratory-data-analysis-the-5-number-summary-two-different-methods-in-r-2/

