This guide to statistics is not designed to make you a statistician, but rather to give you some basic tools with which to make better decisions as an HRD practitioner. Statistics can be very complex, but it can also be simple and straightforward. I prefer the latter, so let's get started.
As you work your way through the following material, you may find it helpful to visit the Virtual Statistics Lab at Rice University, VassarStats at Vassar, or Statpages.net.
Levels of Measurement
There are four levels of measurement. They are:
Nominal Level (Grouping)
Nominal level data is generally referred to as the "lowest" level of measure. Data is limited to groups and categories. No numerical data is ever provided.
Catholic, Protestant, Jewish, Muslim, Other
Ordinal Level (Grouping and Ranking)
Ordinal level data can be grouped and ranked. With Ordinal data, you can say that a measure is higher or lower than another measure, but you cannot say how much higher or lower.
Preferred Flavors of Ice Cream
Interval Level (Grouping, ranking, and includes the exact distance between measures)
Interval level data can be grouped and ranked, and include the exact distance between measures. Note: Interval measures do not have an absolute zero ( 0 ) as a starting point.
Sam is 2" taller than Bill and 3" taller than Steve.
Bill is 2" shorter than Sam and 1" taller than Steve.
Steve is 3" shorter than Sam and 1" shorter than Bill.
Note: We do not know how tall Sam, Bill or Steve are. We only know the exact difference in their heights when compared to one another.
Ratio Level (Grouping, ranking, exact distance between measurement, and contains an absolute "0")
Ratio level data are said to be at the highest level and can be grouped, ranked, and the exact distance between measures determined. Also, Ratio level measures contain an absolute "0". By having an absolute "0" in your measurement "scale", you are able to describe data in terms of ratios.
You could say that Jack, who weighs 200 lbs., is twice as heavy as Mary who weighs 100 lbs. (twice as heavy is a ratio statement).
It should be noted that with social science data, there are rarely any outside standard scales, such as a yardstick to measure height. Therefore, social research rarely generates data that goes beyond Interval level measures.
The question "How large should my sample be?" is a common one, and one with no simple answer. While there are a number of elegant approaches to answering this question, for our purposes, several "rules of thumb" will serve us better.
Rule of Thumb #1
Use sample groups larger than 30 for interval level measures.
Rule of Thumb #2
If the total population that you are examining is less than 30, use all of them.
Rule of Thumb #3
You should have a sample size of 30 for every relationship you measure.
30+ people ----------------> compared against "X" OK
15 Women ------------------> compared against "X" Not OK
15 Men --------------------> compared against "X" Not OK
30+ Women -----------------> compared against "X" OK
30+ Men -------------------> compared against "X" OK
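If you like, Rule of Thumb #3 can be checked with a few lines of Python. This is only a minimal sketch: the function name and the group sizes are made up for illustration, and the cut-off of 30 is simply the figure used in the rules above.

```python
MIN_GROUP_SIZE = 30  # cut-off taken from the rules of thumb above


def groups_large_enough(group_sizes, minimum=MIN_GROUP_SIZE):
    """Map each group name to True if the group meets the minimum size."""
    return {name: size >= minimum for name, size in group_sizes.items()}


# Hypothetical group sizes, for illustration only
small_groups = groups_large_enough({"Women": 15, "Men": 15})
large_groups = groups_large_enough({"Women": 32, "Men": 31})

print(small_groups)
print(large_groups)
```

Any group flagged False is too small to compare against "X" on its own.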
Rule of Thumb #4
Consult this table
To select a random sample, use a table of Random Numbers or use a computerized random number generator.
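As one example of a computerized random number generator, Python's standard library can draw a random sample for you. The sketch below assumes a hypothetical list of 200 employee ID numbers. The function name is my own, and the "use everyone" branch simply applies Rule of Thumb #2.

```python
import random


def draw_random_sample(population, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    If the population is no larger than the requested sample,
    return everyone (Rule of Thumb #2).
    """
    members = list(population)
    if len(members) <= sample_size:
        return members
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(members, sample_size)


# Hypothetical population: employee IDs 1 through 200
employee_ids = list(range(1, 201))
sample = draw_random_sample(employee_ids, 30, seed=42)
print(sorted(sample))
```

Leave the seed out if you want a different sample each time you run it.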
Note: For a more detailed discussion of sample size, see pages 385-387 in your textbook.
Measures of Central Tendency
If you want to find a Yak-Yak bird, the first question you might ask yourself is, "Where do most Yak-Yak birds live?" In other words, "where would I have to go to have the greatest chance of finding a Yak-Yak bird?" Measures of Central Tendency tell you where most of whatever you are measuring can be found.
Mean = Average
All scores are added up and divided by the number of scores.
Median = Middle score
Arrange the scores in order from lowest to highest and count the total number of scores. The one in the middle is the median.
Note: If there is an even number of scores, select the middle two and average them. This will give you the median.
Mode = Most common score
The mode is the score that occurs most often.
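If you would rather let a computer do the counting, here is a short sketch using Python's built-in statistics module. The ten scores are an illustrative set, not data from a real study.

```python
import statistics

# Illustrative set of ten scores
scores = [7, 8, 9, 10, 10, 10, 11, 11, 12, 13]

mean = statistics.mean(scores)      # add up all scores, divide by how many
median = statistics.median(scores)  # middle score (averages the middle two if even)
mode = statistics.mode(scores)      # most common score

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
```

Notice that the scores do not have to be sorted first. The median function takes care of the ordering.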
Range is the difference between the highest and lowest scores. You should only use the range to describe interval or ratio level data. To calculate the range, subtract the lowest score from the highest score.
Note: Some statistics books define range as the High Score minus the Low Score, plus one (1). This is an inclusive measure of range, rather than a measure of the difference between two scores. For example, the inclusive range for data ranging from 6 to 10 would be 5.
For our purposes, we will define the range as the difference between the highest and lowest scores.
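The two definitions are easy to compare side by side. The sketch below uses scores running from 6 to 10, as in the note above. The function names are my own.

```python
def score_range(scores):
    """Range as used in this guide: highest score minus lowest score."""
    return max(scores) - min(scores)


def inclusive_range(scores):
    """Inclusive range used in some books: high minus low, plus one."""
    return max(scores) - min(scores) + 1


scores = [6, 7, 8, 9, 10]
print("Range:", score_range(scores))
print("Inclusive range:", inclusive_range(scores))
```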
In a normal, or bell-shaped, distribution of scores, the mean, median and mode are equal.
Scores: 7, 8, 9, 9, 10, 10, 10, 11, 11, 12,
We should note at this point that a normal distribution (Bell Curve) is an important concept for statisticians because it gives them a "theoretical standard" by which to compare data that may not form a perfect bell curve.
If you have data where the mean, median and mode are quite different, the scores are said to be skewed.
Scores: 7, 8, 9, 10, 11, 11, 12, 12, 12,
Scores that are "bunched" at the right or high end of the scale are said to have a negative skew.
In a positive skew, scores are bunched near the left or low end of a scale.
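One way to check the direction of skew without drawing the curve is to compute a skewness coefficient: negative values point to a negative skew, positive values to a positive skew. The sketch below uses the usual moment-based coefficient. The function names and the cut-off of 0.1 for "roughly symmetric" are my own assumptions, and the scores are the skewed set as listed above.

```python
import math


def skewness(scores):
    """Moment-based skewness: average cubed deviation divided by SD cubed."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)
    return sum((x - mean) ** 3 for x in scores) / n / sd ** 3


def describe_skew(scores, tolerance=0.1):
    """Label the direction of skew; the tolerance is an arbitrary cut-off."""
    g = skewness(scores)
    if g < -tolerance:
        return "negative skew"
    if g > tolerance:
        return "positive skew"
    return "roughly symmetric"


scores = [7, 8, 9, 10, 11, 11, 12, 12, 12]  # bunched at the high end
print(describe_skew(scores))
```

For these scores the mean falls below the median, which in turn falls below the mode. That ordering is a common sign of negative skew.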
A bi-modal distribution occurs when the data forms two clumps
When this happens, it is a good idea to look for common characteristics within the two data clumps, and for differences between the two data clumps. For example, you might generate a bi-modal distribution if you asked people, "On a scale from 1 to 10, do you like romantic movies that explore feelings and relationships?" Men would probably provide low scores, while women would provide scores that were high.
Kurtosis refers to how tall or flat your curve is as compared to a normal curve. Curves taller than a normal curve are called Leptokurtic. Curves that are flatter than a normal curve are called Platykurtic.
Standard Deviation is a measure of dispersion, i.e., the extent to which data (scores) are "spread out" from the average or Mean. Other measures of dispersion are the range and variance. Since variance is not often used, and range has already been discussed, let's focus upon standard deviation. Knowing the standard deviation for a set of scores is important because it is a good indicator for judging the Mean as a representation of the "average" response.
Let's look at this concept graphically. If I have a large standard deviation, it indicates that my scores are widely dispersed, and therefore, my Mean doesn't really tell me that much about the average score.
On the other hand, if my standard deviation is small, it indicates that my scores are close to the Mean, and therefore, the Mean is a good indicator of the "average" score.
To calculate the standard deviation of a set of scores, use the following formula.
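In words, the usual calculation is: find the Mean, square each score's distance from the Mean, average those squares, and take the square root. The sketch below follows those steps. Some books divide by the number of scores (n) and others by n minus 1, so the function offers both. Which convention to use is an assumption you will need to settle for your own work. The ten scores are an illustrative set.

```python
import math


def standard_deviation(scores, sample=False):
    """Standard deviation computed step by step.

    sample=False divides by n (population formula);
    sample=True divides by n - 1 (sample formula).
    """
    n = len(scores)
    mean = sum(scores) / n
    squared_deviations = [(x - mean) ** 2 for x in scores]
    divisor = n - 1 if sample else n
    return math.sqrt(sum(squared_deviations) / divisor)


scores = [7, 8, 9, 10, 10, 10, 11, 11, 12, 13]  # illustrative scores
print("Dividing by n:    ", round(standard_deviation(scores), 2))
print("Dividing by n - 1:", round(standard_deviation(scores, sample=True), 2))
```

With only a handful of scores the two versions differ noticeably. With large groups they become nearly identical.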
The following should help. Let's say the data below represents the test scores for 10 trainees. (The top score anyone could make was 50).
Scores (n) = 10
Mean = 400/10 = 40
Median = 40.5
Mode = 41
Range = 16
So what does the standard deviation tell you? It tells you that most trainees made 40, give or take 5 points. And, that in this case, you can say with confidence that the Mean is a very good indicator of the "average" score.
By establishing the standard deviation for a set of scores, you will be able to describe accurately how various scores differ from one another and from the Mean (average).
Note: In a Normal Distribution:
When we know the standard deviation for a set of scores, it is possible to compare our data at a glance with a Normal Distribution to determine the degree of dispersion.
In the above example, we can see that all of our scores fell within two standard deviations of the Mean, which again affirms that our Mean is a very good indicator of the "average" score.
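For reference, the share of scores that a Normal Distribution places within one, two, or three standard deviations of the Mean can be worked out with Python's math module. This is a small sketch based on the standard normal curve. The function name is my own.

```python
import math


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"Within {k} SD of the Mean: {share_within(k):.1%}")
```

These come out to roughly 68%, 95% and 99.7%, which is the yardstick you compare your own scores against.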
A Quick and Dirty Way for Determining the Standard Deviation for a Set of Scores
If your Mean, Median and Mode are very close to one another, you can estimate the Standard Deviation as follows:
1. Determine the Range
2. Divide the Range by four
3. The resulting number will approximately equal the true standard deviation
Scores: 7, 8, 9, 10, 10, 10, 11, 11, 12, 13
True Standard Deviation: 1.70 (dividing by n; about 1.79 when dividing by n - 1)
Q&D Standard Deviation: 6/4 = 1.5
CAUTION: Only use this technique IF your Mean, Median and Mode are very similar.
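The quick and dirty steps above can be sketched in a few lines and checked against a full calculation. The function name is my own, and the comparison uses the divide-by-n (population) standard deviation from Python's statistics module, which is an assumption about which formula you want.

```python
import statistics


def quick_and_dirty_sd(scores):
    """Approximate the standard deviation as the range divided by four."""
    return (max(scores) - min(scores)) / 4


scores = [7, 8, 9, 10, 10, 10, 11, 11, 12, 13]

estimate = quick_and_dirty_sd(scores)
full_calculation = statistics.pstdev(scores)  # population SD (divides by n)

print("Quick and dirty:", estimate)
print("Full calculation:", round(full_calculation, 2))
```

Running both side by side is a handy way to see how far off the shortcut is for your own data.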
Measures of Association
Interval Level Measure
There will be many times when you will want to know if two variables are "related," and to what extent. For example, you may want to find out the relationship between Yearly Income and Education. To do this, you would randomly select 10 individuals. To help you "sort out" your data, you would construct the following table.
With this data, you can now use the Pearson's r or product moment correlation coefficient formula.
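A common computational form of Pearson's r is r = (nΣxy - ΣxΣy) / sqrt[(nΣx² - (Σx)²)(nΣy² - (Σy)²)]. The sketch below follows that form. The education and income figures are hypothetical, made up purely to show the mechanics, and are not the data from the table.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation using the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data for 10 individuals (illustration only)
years_of_education = [8, 10, 12, 12, 14, 16, 16, 18, 20, 21]
income_in_thousands = [18, 22, 30, 28, 35, 45, 48, 55, 70, 72]

r = pearson_r(years_of_education, income_in_thousands)
print("Pearson's r:", round(r, 2))
```

Values near +1 or -1 indicate a strong relationship, and values near 0 indicate little or none.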
Using the following table, you can see that there is a very high correlation between Income and Years of Education.
Measures of Association
Spearman's Rank Order
Ordinal Level Measure
Once in a while, it is useful to compare rankings between two people or groups. For example, let's say a group of employees and a group of managers want to find out if there is a difference in the workplace values held by each group. In this example, each group ranks 10 workplace values.
In order to calculate a Spearman Rank Order, you must first construct the following table.
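When neither group has tied ranks, Spearman's Rank Order is usually computed as rho = 1 - 6Σd² / (n(n² - 1)), where d is the difference between the two ranks given to each item. The sketch below follows that formula. The two sets of rankings are hypothetical and are not the rankings from the table.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank order correlation for two rankings with no ties."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Hypothetical rankings of 10 workplace values (illustration only)
employee_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
manager_ranks = [3, 1, 2, 6, 4, 5, 10, 7, 8, 9]

rho = spearman_rho(employee_ranks, manager_ranks)
print("Spearman's rho:", round(rho, 2))
```

As with Pearson's r, +1 means the two groups ranked everything identically, -1 means the rankings are exact opposites, and 0 means no relationship.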
Again, using the following table, you can see that this time there is no correlation between the two rankings.
Test of Significance
Nominal Level Measure
Let's say you wanted to know if there was any significant difference between the production rates for departments which had trained supervisors versus those departments whose supervisors were not trained.
Let's begin by collecting some data.
From this data, you can construct the following cross-tabs table.
Note: We have included the totals for rows and columns in our table.
We must now construct a table which describes what would have happened if training did not impact production. (We call this table an Expectancy Table).
A = 9 departments with above standard production
B = 11 departments with standard production
C = 10 departments with below standard production
M = Number of departments with trained supervisors (17)
N = Number of departments with untrained supervisors (13)
T = Total number of departments (30)
We are now ready to calculate the Chi-square using the following formula.
Chi-square = Σ (O - E)² / E, summed over every cell in the table
Where O = the actual production record for each cell in the table
Where E = the expected production record for each cell in the table
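Here is a sketch of the whole calculation. The expected count for each cell is its row total times its column total, divided by the grand total, and the Chi-square adds up (O - E)² / E over every cell. The row and column totals match the ones listed above (17 and 13; 9, 11 and 10), but the individual observed counts are my assumption, chosen only to be consistent with those totals. The function names are my own as well.

```python
def expected_counts(observed):
    """Expectancy table: row total * column total / grand total for each cell."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def chi_square(observed):
    """Sum of (O - E)^2 / E over every cell in the table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(observed, expected)
        for o, e in zip(observed_row, expected_row)
    )


# Columns: above standard, standard, below standard production.
# ASSUMED observed counts, consistent with the totals given in the text.
observed = [
    [7, 8, 2],  # departments with trained supervisors (17)
    [2, 3, 8],  # departments with untrained supervisors (13)
]

for row in expected_counts(observed):
    print([round(e, 1) for e in row])
print("Chi-square:", round(chi_square(observed), 3))
```

With these assumed counts and unrounded expected values, the statistic comes out a little below the 8.416 quoted in this guide. Rounding each expected count to one decimal place before calculating is enough to produce a difference of that size.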
A Chi-square by itself is not of much value. You have to use a Chi-square table, which you can find in most statistics books. However, before you can use the table, you must first determine the degrees of freedom of your table. To do this, blot out one row and one column. The number of cells remaining will be your degrees of freedom.
In our example table, we can see that we have 2 cells that are not filled in, which means we have 2 degrees of freedom (df).
Now, using a Chi-square table, find where the 8.416 falls on the line of numbers for df 2. (Online Chi Square Table)
What this tells you is that you can say with 98%+* confidence (probability) that supervisory training is responsible for increased production.
Note: You can use Chi-square with cross-tab tables that have up to 30 degrees of freedom. (The maximum for most Chi-square tables).
* .02 = 2%
100% - 2% = 98%
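If you do not have a Chi-square table handy, both steps can be sketched in Python. Blotting out one row and one column is the same as multiplying (rows - 1) by (columns - 1). For a table with exactly 2 degrees of freedom, the probability of a Chi-square at least as large as yours has a simple closed form, exp(-Chi-square / 2). The sketch below works only for df = 2. For other tables you would need a printed table or a statistics package. The function names are my own.

```python
import math


def degrees_of_freedom(rows, columns):
    """Cells left after blotting out one row and one column."""
    return (rows - 1) * (columns - 1)


def p_value_df2(chi_square_value):
    """Right-tail probability of a Chi-square value, valid ONLY for df = 2."""
    return math.exp(-chi_square_value / 2)


df = degrees_of_freedom(rows=2, columns=3)
p = p_value_df2(8.416)  # Chi-square value quoted in this guide

print("Degrees of freedom:", df)
print("Probability:", round(p, 3))
print("Confidence:", f"{1 - p:.1%}")
```

The probability falls below .02, which agrees with the 98%+ confidence read from the table.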