Q0 (or min): the minimum value in a data set
Q1: the border between the lowest and 2nd lowest quartile
Q2: the median; the border between the two middle quartiles (or if you split up the data into two equal groups of data points, the border between the lower half and upper half)
Q3: the border between the highest and 2nd highest quartile
Q4 (or max): the maximum value in a data set
You might have tried calculating the 5-number summary by hand from a small data set and then discovered that your result does not match the output of favstats(). This discrepancy arises because statisticians do not universally agree on how to calculate quantiles (quartiles, in the case of the 5-number summary). The hand calculation method is not necessarily the best way to calculate quartiles because it doesn't generalize to other quantiles (e.g., tertiles, quintiles, deciles). We will contrast the hand calculation method with the method used by favstats() in the sections below.
Here is how we usually tell students to calculate the five-number summary by hand:
Sort values from smallest to largest.
Find the median (Q2).
For a data set with an odd number of values, the median is the middle number such that the number of values above the median is the same as the number of values below the median.
For a data set with an even number of values, the median is the average of the two middle numbers.
Q1 is the median of the set of numbers below the median.
Q3 is the median of the set of numbers above the median.
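The steps above can be sketched in R as follows. This is only an illustration, not a function from the book or from any package: the name hand_five_num is hypothetical, and for an odd number of values the sketch assumes the median itself is left out of both halves. It uses the ten values from this section's mynums example.

```r
# Hypothetical helper (not part of any package): five-number summary
# following the hand-calculation steps.
hand_five_num <- function(x) {
  x <- sort(x)                 # Step 1: sort from smallest to largest
  n <- length(x)
  half <- floor(n / 2)         # count of values on each side of the median
  lower <- head(x, half)       # values below the median
  upper <- tail(x, half)       # values above the median
  c(min = x[1], Q1 = median(lower), median = median(x),
    Q3 = median(upper), max = x[n])
}

mynums <- c(10, 30, 35, 40, 40, 45, 50, 50, 55, 100)
hand_result <- hand_five_num(mynums)
print(hand_result)
```

For these ten values, the lower half is 10, 30, 35, 40, 40 and the upper half is 45, 50, 50, 55, 100, so Q1 and Q3 are simply the middle values of those halves.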
In the book, we prefer the 5-number summary generated by the favstats() function (the section that follows the fivenum() example below explains why). If you would like to replicate the hand calculation method in R, however, you can use the function fivenum():
mynums <- c(10, 30, 35, 40, 40, 45, 50, 50, 55, 100)
fivenum(mynums)
When the number of values is even, favstats() returns a slightly different Q1 and Q3 than the hand calculation method. R's quantile() function alone offers 9 different methods for calculating quantiles, and so 9 possible five-number summaries (WHOA!), but the one favstats() uses is important for 2 reasons: (1) we value the five-number summary that connects most directly to boxplots (and in general, most statisticians do too), and favstats() returns the one that is commonly used to create boxplots; (2) this method of creating quantiles generalizes beyond the five-number summary. The five-number summary depends on determining quartiles of the data (25% increments), but this method extends to any quantile (e.g., quintiles, tertiles, etc.).
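To see the two results side by side without installing extra packages, the sketch below uses base R's quantile() as a stand-in for favstats(). This rests on the assumption that favstats() computes its quartiles with quantile()'s default method (type 7); if your version behaves differently, the numbers may not match.

```r
mynums <- c(10, 30, 35, 40, 40, 45, 50, 50, 55, 100)

# Hand-calculation-style summary
hinges <- fivenum(mynums)

# Quartiles from quantile() with its default method (type 7),
# assumed here to be what favstats() reports
quartiles <- quantile(mynums, probs = c(0, 0.25, 0.5, 0.75, 1))

print(hinges)
print(quartiles)

# The same call generalizes to other quantiles, e.g. quintiles (20% steps)
quintiles <- quantile(mynums, probs = seq(0, 1, by = 0.2))
print(quintiles)
```

In this data set only Q1 differs between the two methods (35 versus 36.25); Q3 comes out the same because the 7th and 8th sorted values are both 50.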
Here is how favstats calculates the five number summary:
Sort values from smallest to largest. Consider them to be $x_1, x_2, \ldots, x_n$. For example, in the mynums data, $x_1 = 10$ and $x_{10} = 100$.
Assume that the smallest number is the 0% quantile and the largest number is the 100% quantile.
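The position of the 25% quantile is at $1 + (n - 1) \times 0.25$, where $n$ is the number of values; note that you can generalize this to any q% quantile with the more general equation $1 + (n - 1) \times \frac{q}{100}$. For the mynums data, $n = 10$, so the 25% quantile sits at position $1 + 9 \times 0.25 = 3.25$.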
One way to think about this method is that it maps the sorted values onto a scale running from the 0% quantile to the 100% quantile. When the 25% quantile position is not a position that exists in the data, the method asks what value would be there if the space between the existing values were a continuous number line.
For example, since there is no value at the 3.25 position in this set of data, consider the space between the 3rd and 4th values (35 and 40) and ask: if this were a tiny number line, what value would be at 3.25? It is a quarter of the way from 35 to 40, so Q1 is $35 + 0.25 \times (40 - 35) = 36.25$.
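The position-and-interpolation steps above can be written out directly in R. This is a sketch under the assumption that favstats() follows the position formula 1 + (n - 1) * q (with q as a proportion) and interpolates linearly between neighboring values; position_quantile is a hypothetical name, not a function from any package.

```r
# Hypothetical helper: quantile via the position 1 + (n - 1) * q
# and linear interpolation between neighboring sorted values.
position_quantile <- function(x, q) {
  x <- sort(x)
  n <- length(x)
  pos <- 1 + (n - 1) * q       # e.g. 3.25 for n = 10, q = 0.25
  lo <- floor(pos)             # index of the value at or just below the position
  hi <- ceiling(pos)           # index of the value at or just above the position
  frac <- pos - lo             # how far along the "tiny number line" we are
  x[lo] + frac * (x[hi] - x[lo])
}

mynums <- c(10, 30, 35, 40, 40, 45, 50, 50, 55, 100)
q1 <- position_quantile(mynums, 0.25)   # 35 + 0.25 * (40 - 35)
q3 <- position_quantile(mynums, 0.75)   # position 7.75, between 50 and 50
print(c(Q1 = q1, Q3 = q3))

# Cross-check against base R's default quantile method (type 7)
builtin <- quantile(mynums, c(0.25, 0.75), type = 7)
print(builtin)
```

The cross-check at the end is there so you can confirm for yourself that the hand-written interpolation and quantile() agree on this data set.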
Here is a slightly more formal explanation that may help clarify the difference: https://chemicalstatistician.wordpress.com/2013/08/12/exploratory-data-analysis-the-5-number-summary-two-different-methods-in-r-2/