There was a Soldier
If you know anything about Scotland, you should have heard of Andy Stewart and his very famous song the Scottish Soldier:
- There was a soldier
- A Scottish soldier
- Who wandered far away
- And soldiered far away …
I like the song and can sing along to all of the words! When I came across the data we are about to work on, I couldn’t resist the Scottish soldier link.
For a few years now I have presented some data on the chest sizes of Scottish military men taken in the 19th century. Why these data have been passed down is a mystery but I’ve got them and here they are!
The question I want to answer is this: are these grouped data normally distributed? That is, do they look like the following graph?
Using the =NORMDIST() function, I took the raw data above and fitted a normal curve to them to prepare the above graph; in fact, this gives me my first point of analysis. This is what the distribution of measurements should look like if it were normal.
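For anyone who would rather follow along outside Excel, here is a minimal Python sketch of that step. It assumes the 16 counts are the commonly published version of Quetelet's table for chest sizes of 33 to 48 inches (an assumption on my part, since the table itself lives in the spreadsheet). It also assumes the curve is the normal density, the equivalent of =NORMDIST(x, mean, sd, FALSE), scaled by the sample size so that it can be compared with the counts. The variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Assumed data: chest sizes in inches and the number of men at each size.
chest_inches = list(range(33, 49))
counts = [3, 18, 81, 185, 420, 749, 1073, 1079,
          934, 658, 370, 92, 50, 21, 4, 1]

total = sum(counts)

# Mean and (population) standard deviation of the grouped distribution.
mean = sum(x * f for x, f in zip(chest_inches, counts)) / total
variance = sum(f * (x - mean) ** 2 for x, f in zip(chest_inches, counts)) / total
sd = sqrt(variance)

# Normal density at each chest size, scaled up to the sample size.
fitted = NormalDist(mean, sd)
expected_counts = [total * fitted.pdf(x) for x in chest_inches]

for x, actual, expected in zip(chest_inches, counts, expected_counts):
    print(f"{x} in: actual {actual:5d}, normal curve {expected:8.1f}")
```

Plotting expected_counts against chest_inches should give a smooth bell curve to set beside the actual counts.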
I drew the graph of the actual data next:
Does that look normal to you? Well, yes, but then again the peak of the curve is not so smooth, and the bottom right of the curve is definitely not so smooth.
However, the data might still behave in such a way as to be near normal since it doesn’t look to be too far away from the normalised curve. Then I did the following:
That is, I forced the data to be symmetrical as I think you can see here. I made point 16, the last point, equal to the first point at 3; then I made point 15 equal to the second point at 18 and so on until the table was perfectly symmetrical. It doesn’t matter whether I started at 16 and worked upwards or at one and moved downwards, the effect would have been the same.
I then took the differences between the original and revised data and, as you can see, I got differences all the way from point nine to point 16. Relative to the actual data, these differences were sometimes very large, at 357% in one case and averaging 205%.
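The symmetry check can be sketched in a few lines of Python. Again, the counts are the commonly published version of the table, which I am assuming matches the spreadsheet, and the names are illustrative. The idea is simply to mirror points 1 to 8 onto points 16 to 9 and express each change as a percentage of the actual count.

```python
# Assumed data: the 16 counts, from the smallest chest size to the largest.
counts = [3, 18, 81, 185, 420, 749, 1073, 1079,
          934, 658, 370, 92, 50, 21, 4, 1]

half = len(counts) // 2

# Force symmetry: keep points 1-8 and mirror them onto points 16-9.
revised = counts[:half] + counts[:half][::-1]

# Percentage difference of the revised value from the actual value.
pct_diff = [(r - c) / c * 100 for c, r in zip(counts, revised)]

# Only the mirrored half (points 9-16) can differ from the original.
upper = pct_diff[half:]

for point, (c, r, d) in enumerate(zip(counts, revised, pct_diff), start=1):
    print(f"point {point:2d}: actual {c:5d}, symmetrical {r:5d}, diff {d:6.1f}%")

print(f"largest difference: {max(upper):.0f}%")
print(f"average difference: {sum(upper) / len(upper):.0f}%")
```

Points 1 to 8 show no difference by construction, so all of the departure from symmetry shows up in points 9 to 16.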
So the data are not so neat and tidy, and my next step was to calculate the following:
- Data: these are the raw data and they must be sorted in ascending order
- Mean: the arithmetic mean of the grouped distribution
- Standard deviation: the standard deviation of the grouped distribution
- Cumulative distribution (CD) Factors: =1/(2*n) for data point one, then =point one+2/(2*n) for the second, =point two+2/(2*n) for the third, and so on for subsequent points, where n is the number of data points
- Expected Values: =NORM.INV(CD Factor point by point, the mean, the standard deviation)
- Z Values: =NORM.S.INV(CD Factor point by point)
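The last three items in the list can be sketched in Python as follows. This is only an illustration under stated assumptions: normal_scores is a hypothetical helper name, and the mean of 39.83 and standard deviation of 2.05 are roughly what the commonly published counts give, so substitute the figures from your own sheet. Adding 2/(2*n) at each step is the same as using (2*i - 1)/(2*n) for point i, which is what the code does to avoid accumulating rounding error.

```python
from statistics import NormalDist


def normal_scores(n, mean, sd):
    """Return CD factors, expected values and z values for n sorted points."""
    # CD factors: 1/(2n), 3/(2n), 5/(2n), ... i.e. each step adds 2/(2n).
    cd_factors = [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]

    fitted = NormalDist(mean, sd)   # counterpart of =NORM.INV(p, mean, sd)
    standard = NormalDist()         # counterpart of =NORM.S.INV(p)

    expected = [fitted.inv_cdf(p) for p in cd_factors]
    z_values = [standard.inv_cdf(p) for p in cd_factors]
    return cd_factors, expected, z_values


# Illustrative inputs (assumptions): 16 points, mean 39.83, sd 2.05.
cd_factors, expected, z_values = normal_scores(16, 39.83, 2.05)

for p, e, z in zip(cd_factors, expected, z_values):
    print(f"CD factor {p:.5f}  expected {e:6.2f}  z {z:6.3f}")
```

Each expected value is just the mean plus the z value times the standard deviation, so the two columns carry the same information on different scales.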
This gave me:
Plotting these on a graph should give me two curves which should exactly overlap each other if the data were perfectly normal or be a little way away from each other in parts if the data were nearly normal. I got this:
The conclusion is that for points 1 to 8 the data behave normally, whereas for points 9 to 16 they are significantly abnormal.
This is interesting because any statistician will tell you that such a large sample as 5,738 measurements of people should be normal: heights, weights, waist sizes … and chest measurements. Here is an exception that proves the rule. Whoever these militiamen were, the taller ones were extraordinary!
These data come from:
Chest Measurements of 5738 Scottish Militiamen
Source: Data from A. Quetelet, Lettres à S.A.R. le Duc Régnant de Saxe-Cobourg et Gotha, sur la Théorie des Probabilités, Appliquée aux Sciences Morales et Politiques (Brussels: M. Hayez, 1846), p. 400.
Title: Applications, Basics and Computing of Exploratory Data Analysis
Authors: Velleman, Paul F.
Publisher: Duxbury Press
Download my working Excel file scottish_soldier_chest