Introduction

This page is one of the type that I sometimes write: an introduction to a concept or topic or idea and an Excel page. If you don’t want all of the background, go straight to the Excel file (link at the bottom of the page) I have provided and play around with it. Otherwise, read on!

The purpose of this page is to introduce and explain the concept o the skewness of data. Initially, we will discuss the descriptive statistics utility in Excel, in which we find a value for skewness. Once we have appreciated one way to find that value, we will explore the concept in much greater detail.

Why Worry about Skewness?

If the world were perfectly round and green cheese from the Moon were easy to obtain free of charge, we needn’t worry about skewness. In a perfect universe, all calculations are simple, relationships are linear and conclusions are easy to draw. However, life’s not like that and in many cases, the normal distribution does not apply. Skewness in a data set shows this situation very well.

By skewness we mean that the mean, the median and the mode are not equal to each other, as is the case with the normal distribution. In some cases, the mean is greater than the median and in other cases the median is greater than the mean. When the mean and the median are not the same, it is likely that some level of skewness is present in the data set we are analysing.

The following section shows how we can start to see whether our data set is skewed.

Data Analysis ToolPak: descriptive statistics

When you click on the Data Tab in the Excel for Windows ribbon you should see the phrase Data Analysis on the extreme right of the tab. If it’s there, click on it, choose Descriptive Statistics and complete the dialogue box you will be presented with. If Data Analysis is not there:

  • Click the office button (top left of the window)
  • Click Options
  • Select Add-Ins
  • At the bottom is says, Manage Add-Ins: click Go
  • In the new dialogue box click ToolPak: it should be the first option
  • Click OK

The Data Analysis ToolPak should now install and you might have to restart Excel to see it.

You can analyse one or more columns of data at one time and the output looks like this:

desc_stats

For each data series you choose, you will be presented with these two columns: it’s best if you select the Labels in first row option to give a meaningful name to each data series in this output.

You can see a value for skewness just over half way down this list and that is what we are looking for.

Please note: the descriptive statistics output from the ToolPak is static, non volatile. What this means is that if you change your data in any way or even change one data series for another one in the same input range, these outputs will not change. To update these results, you would have to run the descriptive statistics utility again.

What the Normal and Skewed Distributions Look Like

When we come to analyse a set of data, whether it’s the entire population of data or just a sample, we cannot always guarantee that it will look perfectly normal, like this:

norm_dist

In this case, we see that the mean, the median and the mode are all equal to each other and if we slice the diagram down the centre from top to bottom, the left hand side will be equal to the right hand side: the normal distribution.

In reality, we might find that our data set looks like one of these:

 positive  negative
Positive Skew Negative Skew
Skewed Left: long tail points left Skewed Right: long tail points right

That is, in the data that are skewed left, there are more values larger than there are smaller and vice versa for the data that are skewed right.

We find a value for skewness from Excel when we use either the Data Analysis ToolPak’s descriptive statistics utility and when we use the SKEW() function.

Note: Excel for Mac 2011 does not have the descriptive statistics utility installed so we have already outlined how to use the ToolPak in this context, assuming you are using Excel on a Windows machine. Now let’s see how to use it the SKEW() function which is useful for Windows users by necessary for Mac users.

The SKEW() Function

The SKEW() function is volatile so if and when you change your input data, it will give you a new result. Here is an example for you:

Enter the following in, say, cells A7:A14 with the heading Data in cell A6:

2, 3, -1, 3, 4, 5, 0, 2

In cell E5 enter =SKEW(A7:A14) and press enter. You should see the value -0.4587 in that cell now. Change any of the input values and see how the value of skewness changes too: change one value or more or all of them. Once you have experimented with this feature of the function, go back to the data set above.

The skewness value of -0.4587 tells us that these data are negatively skewed; and since the median is greater than the mean, that confirms the negative skewness (see below). Does the histogram below confirm that? Hmm, it’s a bit tricky in this case, so read on!

small_histo

There is an Excel work book to accompany this page that is free for you to download, see the link at the bottom of the page. On the tab larger_skew, you will find a data set comprising 50 data points. Take a look at that work sheet to see everything I have done there. In the end, you are presented with this graph of the data, in the form of a histogram: before you read on, try to estimate the SKEW value from this graph.

larger_histo

SKEW() gives us the value of 0.1175: that is, positive and not such a large value. Again appearances are deceptive: just look at the data carefully, check the calculations to see that skewness is 0.1175, positive and that the mean > median. However, the histogram appears to show a negative skewness. Sorry, it’s just the way these data are appearing and it helps to demonstrate why most graphs need to be supported by calculations rather than just left as standalone evidence. Take a look at the next section, general measures of skewness which helps us here. The value of 0.1175 for skewness shows that the data are relatively symmetric so we need to be ready to be deceived by a graph!!

When you get to the Simulation of Skewness section of this page you can see that you can also look at the simul_large worksheet and press the F9 key you can see as many examples as you like of what positive and negative skewness can look like: including skewness values between -1 and +1.

General Measures of Skewness

There are three general measures of skewness as the following three values help to illustrate:

  • a skew greater than +1 indicates a high degree of positive skew
  • a skew less than -1 indicates a high degree of negative skew
  • a skew between -1 and +1 indicates a relatively symmetric data set

Confirmation: the Mean and the Median tell us the Direction of Skew

I have used the relationship between the mean and the median in our skewness analysis already; but here is the confirmation of that relationship.

In the case of the first example above, you will find:

mean    = 2.2500

median  = 2.5000

In this case the median is greater than the mean so we know we are dealing with negative skewness

In the case of the second example above, you will find:

mean    = 2,570.32

median  = 2,304.50

In this case the mean is greater than the median so we know we are dealing with positive skewness

If the mean = the median, this means … ? What do you think?

Mean v Median: the King Kong Effect

While we are talking about the mean and the median, there is always the question of which one is the better? Here is the short answer:

essentially, the mean is the best measure of central tendency if the data set does not show excessive skewness. Otherwise, you should use the median as the measure of central tendency. The reason here is that extreme skewness distorts the mean and when this happens, the median is the better alternative. There is the example of what is called the King Kong Effect to illustrate the distortion of the mean: this example works as you see below.

The following graph shows the heights and weights of 20 gorillas in a zoo:

gorilla_0KK

In this case, the mean and median are 46.05 and 47 respectively and the skewness coefficient is -0.3980. Note the graph is an XY scatter graph and not a histogram so we cannot use it to predict or confirm the direction of skewness.

If we introduce King Kong to the zoo and to our XY scatter graph, this happens:

gorilla_KK

Now, the mean and median are 50.05 and 47 respectively; but the skewness coefficient is a huge 3.4594 meaning that skewness has gone from negative 0.34980 to positive 3.4594 and all because we added King Kong to the list, just one more data point.

Simulation of Skewness

On the third worksheet in the work book we are working on here, I have used the NORMINV() function to simulate the second of the two examples above. We begin with the mean and standard deviations of the example:

mean (AVERAGE())                   = 2,570.32

standard deviation (STDEV())    = 1,329.70

And then in each of the 50 cells for which we want to simulate, we enter:

=NORMINV(E8,B$3,B$4)

In this case, this function is entered in cell D8 and then filled down to D57.

This simulation assumes we are dealing with the normal distribution and it generates values based on that idea together with the mean and standard deviation of our first example or sample.

In cells E8:E57, which are “helper” cells, is the function =RAND() which we see in the NORMINV() function as the first element, probability: in this case the probability can be any value from 0 to 1 or 0 to 100%. Since these numbers are random and volatile, every time you press the F9, calculation, key, you will generate a completely new simulation, based on the mean and standard deviation provided.

I built this model specifically so that you can press the F9 key over and over again to help you to learn what different values of skewness look like. In fact, the title of the graph is linked to cell H10 in which you will see this: =F10&” “&ROUND(G10,4).

Three results of this simulation are:

left_skew little_skew  right_skew
Skewness = 0.2863: positive, left Skewness = -0.0131: negative, right Skewness = -0.317: negative, right

I have labelled the middle graph as negative, right; but in reality it is almost skew free because the value is almost zero … go back to the second example, the larger_skew worksheet, above to confirm that we are dealing with similar situations here.

Open that spreadsheet file, go to the simul_large tab and press the F9 key over and over to see a very wide range of values of skewness.

Symmetry and Skewness

What the above discussion tells us is that we can use skewness as a measure of symmetry. If the skewness = 0 then the distribution represented is perfectly symmetric and the mean = the median.

As we have seen, when the distribution is positively skewed the mean > median and when the distribution is negatively skewed the mean < median: see the graphs and discussion above again as examples.

Optional Text: The formula that Excel uses to calculate skewness prior to Excel 2013 is SKEW():

skew_xl2010

where is the mean and s is the standard deviation of the data we are analysing.

Excel 2013: SKEW.P()

skew_xl2013

In Excel 2013, there is also a population version of the skewness given by the formula: SKEW.P(), which is based on this formula:

We can compare the results of SKEW.P and SKEW as follows:

SKEW.P(R) = SKEW(R)(n–2)/SQRT(n(n–1))

where R is the range of data we are analyzing

n = COUNT(R)

Example

Using the first of our examples above, use the data 2, 5, -1, 3, 4, 5, 0, 2:

SKEW(2, 5, -1, 3, 4, 5, 0, 2)       = -0.4587

SKEW.P(R)                               = -0.3677

Further Diagrams

The following two diagrams will prove useful as additional forms of presentation of skewness for anyone interested in studying this topic further:

  • Box and Whisker Plots

Here is the boxplot for the larger skew example discussed above, where the skewness value is 0.1175:

boxplot

This boxplot is included in the Excel spreadsheet that accompanies this page and a description of how to construct one is given in my Excel book, reference at the bottom of the page.

  • Dot Plots

Using the same data as for the boxplot, the following dot plot has been constructed using the REPT() function:

dotplot

This dot plot is included in the Excel spreadsheet that accompanies this page and a description of how to construct one is given here in this blog: it’s an advance on the REPT() function way of graphing data without the need for a graph!!

Conclusions

This page has shown how to calculate the value of skewness for a data set using two main techniques: the Data Analysis ToolPak and the SKEW() function.

Skewness tells us by how much a data set might deviate from the normal distribution and it is a vital aspect in the analysis of data since we often assume data are normally distributed when, as we see here, sometimes that is not a valid assumption.

Download the fileskewness

References

1 David P Doane Oakland University, Lori E Seward University of Colorado (2011) Measuring Skewness: A Forgotten Statistic? Journal of Statistics Education Volume 19, Number 2, www.amstat.org/publications/jse/v19n2/doane.pdf

2 http://www.real-statistics.com/descriptive-statistics/symmetry-skewness-kurtosis/

3 Duncan Williamson (2012) The Excel Project: Excel for accountants, business people … from the beginning Duncan Williamson Ltd August

4 Spyros Makridakis, Steven C Wheelwright and Rob J Hyndman (1998) Forecasting Methods and Applications 3/e John Wiley & Sons

5 Wayne L Winston (2011) Microsoft Excel 2010: data analysis and business modelling Microsoft Press Chapter 42

3 Responses to “Skewness of Data”

  1. Carole Says:

    thanks, just what I needed to understand.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: