Why are outliers defined like THAT?!
My own imagination asks:
Why is an outlier defined as 1.5 interquartile ranges outside of each quartile?
Great question, imagination!
The simple answer, I think, is that it’s a nice and easy thing to work out, and 1.5 interquartile ranges is quite a long way from the central box (if there’s no skew, it’s roughly three times half of the box in the box-and-whiskers plot).
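To show just how easy the rule is to work out, here is a minimal Python sketch. The function names `iqr_fences` and `outliers` and the sample data are made up for illustration, and the quartiles come from the standard library’s default method; other quartile conventions (including the one in your textbook) may give slightly different fences.

```python
from statistics import quantiles

def iqr_fences(data, k=1.5):
    """Return (lower, upper) cut-offs: k IQRs below Q1 and above Q3."""
    # quantiles(..., n=4) returns [Q1, median, Q3] (default "exclusive" method)
    q1, _median, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers(data, k=1.5):
    """Return the values lying outside the fences."""
    lower, upper = iqr_fences(data, k)
    return [x for x in data if x < lower or x > upper]

# Hypothetical sample, invented purely for this example
sample = [2, 3, 4, 5, 5, 6, 6, 7, 8, 30]
print(iqr_fences(sample))
print(outliers(sample))
```

Keeping the multiplier as the parameter `k` makes it easy to try other cut-offs than 1.5.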
But why 1.5?
My suspicion is that it’s something to do with the normal distribution, because everything is something to do with the normal distribution (well, nearly everything; that’s why it’s called normal).
The $z$-scores for the quartiles are, inexplicably, not in the standard table EdExcel kindly give you for your A-level exams. However, I have a $z$-score calculator on my machine which tells me they’re $\pm 0.6745$ (to four significant figures).
The interquartile range is the distance between those two $z$-scores, $2 \times 0.6745 = 1.349$. One-and-a-half interquartile ranges above the upper quartile would therefore take us to a $z$-score of $0.6745 + 1.5 \times 1.349 = 2.698$, which corresponds to an upper-tail probability of 0.0035.
That means, assuming the normal distribution is the true underlying model for your observations, the probability of a given observation lying in the ‘outlier’ zone (counting both tails) is between 0.5% and 1% - in fact, about 1/143, if R isn’t telling me porkies.
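If you’d like to check those numbers yourself, the following is one way to do it: a sketch in Python using the standard library’s `statistics.NormalDist`, rather than the R session mentioned above. The variable names are my own, and it assumes a standard normal model throughout.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

z_q3 = std_normal.inv_cdf(0.75)      # z-score of the upper quartile
iqr = 2 * z_q3                       # width of the box, in z units
upper_fence = z_q3 + 1.5 * iqr       # 1.5 IQRs above the upper quartile

one_tail = 1 - std_normal.cdf(upper_fence)  # beyond the upper fence only
both_tails = 2 * one_tail                   # the lower fence is symmetric

print(round(z_q3, 4), round(upper_fence, 3))
print(round(one_tail, 4), round(both_tails, 4))
print("about 1 in", round(1 / both_tails))
```

Doubling the tail works because the normal distribution is symmetric, so the lower ‘outlier’ zone is exactly as likely as the upper one.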
A rabbit-hole about significance
Any time you do a statistical test, such as a medical diagnosis, you’re almost certain to misclassify things on occasion. You can have false positives (you wrongly tell someone they have an incurable disease) or false negatives (you wrongly tell someone they’re in fine fettle when they’re really dying.)
You could correctly diagnose every case of cancer in the country by simply telling everyone they had cancer - but you’d have a lot of false positives. You could correctly diagnose every case of non-cancer in the country by simply telling nobody they had cancer - but you’d have a lot of false negatives. In reality, doctors perform a delicate balancing act: they talk about unusual results and the need for more tests, in the hope of reducing false positives, while the tests themselves are designed to reduce or ideally eliminate false negatives (it’s not ideal to tell a healthy person they may be ill, but it’s much worse to tell an ill person there’s nothing to worry about).
Was that significant?
In a normal distribution, outliers can occur by chance - but they’re unusual. In a school of a thousand people, you’d expect about seven students to be classed as outliers, for example, under normal circumstances. Calling them outliers doesn’t mean the normal distribution is inappropriate; it just means their results are unusual - using an arbitrary cut-off for unusualness.
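The school example can be checked with a short binomial calculation. This is a sketch in Python under two assumptions of mine: each student’s result is an independent draw from the same normal distribution, and the helper `binom_pmf` is my own illustration rather than a library function.

```python
from math import comb
from statistics import NormalDist

std_normal = NormalDist()
# Two-tailed probability of landing beyond the 1.5 IQR fences;
# the upper fence sits at 4 times the upper-quartile z-score
p = 2 * (1 - std_normal.cdf(4 * std_normal.inv_cdf(0.75)))
n = 1000  # school size from the example

expected = n * p  # mean number of students flagged as outliers

def binom_pmf(k, n, p):
    """Probability of exactly k outliers among n independent students."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_none = binom_pmf(0, n, p)  # chance that nobody at all is flagged

print(round(expected, 1))
print(round(1 - p_none, 3))
```

In other words, a school with no outliers at all would be far more surprising than a school with a handful of them.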
The 1.5 IQR cut-off is chosen to balance two things: firstly, it needs to be easy to calculate; secondly, it needs to make outliers unusual but not too unusual - and somewhere about 1% is a pretty good choice.
Intriguingly, $\sqrt{2}$ IQRs above and below the quartiles gives an outlier probability of very close to 1% - but that’s just a bit harder to remember, a bit harder to work out, and - in the end - just as arbitrary.
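To compare the two multipliers side by side, here is a small Python sketch, again assuming a standard normal model. The helper names `outlier_probability` and `multiplier_for` are hypothetical, chosen for this example.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
z_q3 = std_normal.inv_cdf(0.75)  # z-score of the upper quartile

def outlier_probability(k):
    """Two-tailed probability of lying more than k IQRs beyond a quartile."""
    fence = z_q3 * (1 + 2 * k)  # z_q3 + k * IQR, with IQR = 2 * z_q3
    return 2 * (1 - std_normal.cdf(fence))

def multiplier_for(prob):
    """Multiplier k whose fences give the requested two-tailed probability."""
    fence = std_normal.inv_cdf(1 - prob / 2)
    return (fence / z_q3 - 1) / 2

for k in (1.5, sqrt(2)):
    print(f"k = {k:.4f}: outlier probability {outlier_probability(k):.4%}")
print(f"k for exactly 1%: {multiplier_for(0.01):.4f}")
```

The last line inverts the calculation, which shows how close $\sqrt{2}$ comes to the multiplier that would give exactly 1%.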