
Statistical Society of Canada



Basic Introduction to Statistics





Introduction

The following paragraphs give a brief overview of the main tools of statistical inference: intervals and tests. These are followed by a few examples, aimed at illustrating the importance of choosing the statistical technique or graphical display which is most appropriate for the situation.

Intervals


When a set of data is used to estimate a population characteristic, it is essential to quantify the doubt associated with that estimate. Some standard techniques are available as part of the school curriculum. Once again, mastery of these techniques is not essential to win an SSC prize. A good ``feeling'' for the doubt is sufficient.

Suppose you were to collect height data (in centimetres) from female students in your grade. You have learned various methods to summarize the data. Three of them are:

  1. a histogram,
  2. the average, usually denoted by $\bar{x}$,
  3. the standard deviation, usually denoted by s.

In most cases, the histogram of heights will look fairly symmetric around the average and about 95% of the data will lie within two standard deviations of the average. (If there is one unusually short or tall student, then the standard deviation will be very large, and more than 95% of the data will lie within two standard deviations of the average.)

If you hear that a new female student is about to join the class, then you can predict her height (with 95% confidence) to be

$$\bar{x} \pm 2s. \qquad (1)$$

(This may be a frustratingly long interval.)

Not all data follow this nice symmetric pattern. If you want to see some asymmetric data, try collecting some of these: student weights, number of weekdays between school cancellations (snow, holiday, in-service), number of red Smarties© in a small box, or number of green Rockets© in a roll.

Although it is difficult to predict individual observations, it is usually easier to say what happens on average. In order to estimate the mean (or expected) height of all female students in your grade level, using a sample of 30 students from the school, proceed as follows:
n = # sampled = 30.
$\bar{x}$ = average height of those 30 students.
s = standard deviation for those 30 students.
Approximate 95% confidence interval for mean height:

$$\bar{x} \pm 2\,\frac{s}{\sqrt{n}}. \qquad (2)$$

Notice the important role played by the sample size: as n increases, then the interval gets narrower (and more precise). So, a large sample allows one to ``tie down'' a mean. However, when predicting the height of individuals, the length of the interval does not decrease with larger n.

Suppose that data were collected on height (in cm) of 55 young adult males, giving average height 179 cm, and standard deviation 6.6 cm. Assume that the 55 men were a random sample from the population to which we wish to generalize. Then, with 95% confidence, we estimate the mean height of the population to be
$179 \pm 1.8$ cm (since $2 \times 6.6/\sqrt{55} \approx 1.8$), i.e. between 177 and 181 cm (5'9.7" to 5'11.3").
But, if we want to predict the height of another individual, with 95% confidence, we give a much wider range:
$179 \pm 13.2$ cm (since $2 \times 6.6 = 13.2$), i.e. 165.8 to 192.2 cm.
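The two calculations above can be checked with a few lines of Python. This is a minimal sketch that simply applies the formulas $\bar{x} \pm 2s/\sqrt{n}$ and $\bar{x} \pm 2s$ to the summary figures quoted in the text; the function names are made up for illustration.

```python
import math

def mean_interval(xbar, s, n, multiplier=2.0):
    """Approximate 95% confidence interval for a population mean."""
    half_width = multiplier * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def individual_interval(xbar, s, multiplier=2.0):
    """Approximate 95% range for the value of one new individual."""
    half_width = multiplier * s
    return xbar - half_width, xbar + half_width

# Summary statistics quoted in the text: 55 men, mean 179 cm, s = 6.6 cm.
n, xbar, s = 55, 179.0, 6.6

ci_low, ci_high = mean_interval(xbar, s, n)
pi_low, pi_high = individual_interval(xbar, s)

print(f"mean height:       {ci_low:.1f} to {ci_high:.1f} cm")   # 177.2 to 180.8 cm
print(f"individual height: {pi_low:.1f} to {pi_high:.1f} cm")   # 165.8 to 192.2 cm
```

Only the first interval depends on n. Increasing the sample size narrows the interval for the mean but leaves the interval for an individual unchanged.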

Now for an example with discrete data. Suppose I interview 20 students and learn that 13 of them prefer cola to orange pop. So, 65% of the sample prefer cola. A different sample would probably give some other estimate of the proportion of students who prefer cola. What is the true proportion of all students in the population who prefer cola? (By the way, what is the population of interest?) Here is a standard procedure for calculating an approximate 95% confidence interval for that elusive population proportion:

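n = # sampled = 20.

$\hat{p}$ = proportion of the sample who prefer cola = 13/20 = 0.65.

Approximate 95% confidence interval for proportion of population who prefer cola:

$$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \qquad (3)$$

That is,

$$0.65 \pm 2\sqrt{\frac{0.65 \times 0.35}{20}} = 0.65 \pm 0.21.$$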

So, with the information available (responses from 20 students), and assuming those students have been selected randomly from the population they represent, we are 95% `certain' that somewhere between 44% and 86% of students prefer cola to orange. Since the interval straddles 50%, we really cannot answer the interesting question: on average, do students prefer one of these two types of pop? If the interval (3) gives silly answers (for example, an endpoint below 0% or above 100%), then the sample size is too small.
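The cola interval can be reproduced as follows. This is a small sketch of the approximate formula $\hat{p} \pm 2\sqrt{\hat{p}(1-\hat{p})/n}$ applied to the 13-out-of-20 survey; the function name is invented for this example.

```python
import math

def proportion_interval(successes, n, multiplier=2.0):
    """Approximate 95% confidence interval for a population proportion."""
    p_hat = successes / n
    half_width = multiplier * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - half_width, p_hat + half_width

# 13 of the 20 students interviewed prefer cola.
p_hat, low, high = proportion_interval(13, 20)
print(f"p_hat = {p_hat:.2f}, interval = {low:.2f} to {high:.2f}")
# p_hat = 0.65, interval = 0.44 to 0.86
```

Calling the same function with a larger n (and the same sample proportion) shows how the interval tightens as more students are interviewed.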

Statisticians construct confidence intervals all the time. Suppose that a statistician calculates 160 confidence intervals, each at the 95% level, in the course of a week's work. Then about 152 of them (95% of 160) will have provided accurate information and the others would have been totally misleading. The statistician's dilemma is this: he or she does not know which intervals are bad!
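A simulation makes this long-run behaviour visible. The sketch below assumes a hypothetical normal population (mean 165 cm, standard deviation 7 cm, both invented for the illustration), draws many samples of 30, builds the interval $\bar{x} \pm 2s/\sqrt{n}$ from each, and counts how often the true mean is captured. In a simulation we know the true mean, which is exactly what the working statistician does not.

```python
import math
import random
import statistics

def interval_covers(true_mean, sample, multiplier=2.0):
    """Return True if xbar +/- multiplier * s / sqrt(n) contains true_mean."""
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)
    half_width = multiplier * s / math.sqrt(len(sample))
    return xbar - half_width <= true_mean <= xbar + half_width

random.seed(1)  # fixed seed so that repeated runs give the same answer

# Hypothetical population and simulation settings (assumptions, not data).
TRUE_MEAN, TRUE_SD = 165.0, 7.0
SAMPLE_SIZE, REPETITIONS = 30, 2000

hits = 0
for _ in range(REPETITIONS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    if interval_covers(TRUE_MEAN, sample):
        hits += 1

coverage = hits / REPETITIONS
print(f"{coverage:.1%} of the {REPETITIONS} intervals contain the true mean")
```

The proportion printed should be close to 95%, with the remaining intervals missing the true mean entirely.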

Do we have to be 95% confident? No. Many research fields seem to be content with 95% confidence, possibly because 95% confidence uses that simple number 2 in equations (1), (2), (3). For 90% confidence, replace the 2 by 1.65. For 99% confidence, replace the 2 by 2.58.

The justification can be found in any discussion of standard normal tables, where you will learn that the number two we have been using is actually a nice approximation to 1.96. The values 2, 1.65 and 2.58 work well in all kinds of situations where you might want to estimate an average. They do not work so well when you want to predict a new observation from a non-symmetric distribution ... but a simple histogram gives a fair bit of information about individual variation.
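Instead of looking them up in a table, the multipliers can be obtained from the standard normal distribution in Python's `statistics` module. A short sketch (the helper name `multiplier` is made up here):

```python
from statistics import NormalDist

def multiplier(confidence):
    """Standard normal value leaving (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: multiplier {multiplier(level):.3f}")
# 90% confidence: multiplier 1.645
# 95% confidence: multiplier 1.960
# 99% confidence: multiplier 2.576
```

These agree with the rounded values 1.65, 2 and 2.58 used in the text.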

Statistical Tests


Suppose that Fred (from Fredericton, New Brunswick) tosses a coin 20 times, and it comes down ``Heads'' sixteen of those times. Is this sufficient evidence for him to conclude that the coin is not fair? A statistician handles the problem as follows. Let p be the probability that this coin will fall ``Heads''.


The null hypothesis is $H_0$: The coin is fair.

The alternate hypothesis is $H_a$: The coin is not fair.

In symbols:
The null hypothesis is $H_0$: $p = 1/2$.
The alternate hypothesis is $H_a$: $p \neq 1/2$.

The test statistic is X = number of Heads in 20 tosses. (Fred has observed X = 16). Now, here is the crunch.
Assuming the null hypothesis to be correct, we expect X = 10. What is the probability of observing X = 16 or X even further away from 10?

We can answer this question by looking up Binomial tables. The resulting probability is called the P-value of the test. In this example,

first calculation of the P-value is $P(X \ge 16) = 0.006$.

But, if 16 heads (and 4 tails) is an unusual outcome, then surely 4 heads (and 16 tails) would be equally unusual. Thus, the P-value is really twice as big as our first guess:

second calculation of the P-value = $P(X \ge 16) + P(X \le 4) = 2 \times 0.006 = 0.012$.

If Fred persists in believing the null hypothesis, he now has to explain why something very unusual (probability 1.2%) has occurred. It is time to reject the null hypothesis in favour of the alternate. That is, reject $H_0$ in favour of $H_a$, P-value = 0.012.
Conclusion: If the coin were fair, there would be only a 1.2% chance of observing such unusual data, so he concludes that the coin is not fair.
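The Binomial table lookup can be replaced by a direct calculation. The following sketch computes the exact probabilities for 20 tosses of a fair coin; `binom_pmf` is a helper written for this example, not a library function.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, observed = 20, 0.5, 16

# First calculation: 16 or more heads.
upper_tail = sum(binom_pmf(k, n, p) for k in range(observed, n + 1))
# Equally unusual outcomes on the other side: 4 or fewer heads.
lower_tail = sum(binom_pmf(k, n, p) for k in range(0, n - observed + 1))

p_value = upper_tail + lower_tail
print(f"P(X >= 16)        = {upper_tail:.3f}")   # 0.006
print(f"two-sided P-value = {p_value:.3f}")      # 0.012
```

Because the coin is assumed fair under the null hypothesis, the two tails are equal, which is why doubling the first calculation works here.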

All statistical tests work in a way similar to that outlined above. The null hypothesis is an innocuous, boring, trivial statement. The alternate hypothesis is an interesting result (perhaps world-shattering), which should be announced beforehand. That is:

$H_a$: The result that the researcher is trying to demonstrate.

For example:
$H_0$: People cannot tell Coke and Pepsi apart.
$H_a$: People can tell Coke and Pepsi apart.

$H_0$: The response is the same for the treatment as for the placebo.
$H_a$: The response is different for the treatment than for the placebo.

$H_0$: The machinery is working fine.
$H_a$: The machinery should be repaired.

People will only listen to our world-shattering result if we can demonstrate that there is very little chance that we have made a mistake, that is, if the P-value is small. Nobody will listen when P-values are above 0.05 (or 5%). Most people will listen when the P-value is less than 0.01 (or 1%).

When calculating P-values, we do not always use Binomial tables. Sometimes we look up standard normal, sometimes T, or F tables. It is often easier to test
$H_a$: The response is different for the treatment than for the placebo.
than to test
$H_a$: The response is better for the treatment than for the placebo.

When a researcher claims to have `tested at level of significance 0.05 (or 5%),' then he/she means that the P-value had to be less than 0.05 in order to be called `small'.

For Experts only.
The next few paragraphs address a difficult concept.
Feel free to jump to the next section Examples.

Look back at the calculations of the P-value for Fred's experiment. If Fred had decided to test at level of significance 0.05, then the P-value would have to be less than 0.05 before he could say: `This coin is not fair.' So, the first calculation of the P-value would have had to be less than 0.025. (Remember this point when reading through the next example.)

The problem $H_0$: this is a fair coin, equivalently $H_0$: $p = 1/2$, is special, because the number of heads is then just as likely to fall above 10 as below it. We now show the general procedure. One day, Fred wanted to see whether a particular die (1 die, 2 dice) was fair, and delegated his little brother to roll the die 20 times. Unfortunately, Fred did not give precise instructions to his little brother. At the end of the experiment, the little brother reported: ``I rolled the die 20 times and there was exactly 1 six.'' No other information was available.

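So, Fred wanted to test

$$H_0: p = \tfrac{1}{6} \quad \text{versus} \quad H_a: p \neq \tfrac{1}{6},$$

where p = Prob(the die lands ``six''). Here are the probabilities of obtaining 0, 1, 2, ..., 20 sixes when a fair die is rolled 20 times (after a while, the probabilities are so close to 0 that we have truncated the table):

    Number of sixes   Probability
           0            0.0261
           1            0.1043
           2            0.1982
           3            0.2379
           4            0.2022
           5            0.1294
           6            0.0647
           7            0.0259
           8            0.0084
           9            0.0022
          10            0.0005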

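In order to calculate the P-value, we first have to notice whether the observed number of sixes is above or below what we would expect under $H_0$.
(We would expect $20 \times \frac{1}{6} \approx 3.33$ sixes.) Since $1 < 3.33$, the P-value is

$$P(X \le 1) = P(X = 0) + P(X = 1) \approx 0.026 + 0.104 = 0.130.$$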


This P-value is large by anyone's standards, so Fred says ``there is no evidence that the die is unfair.''

For a two-sided test at level of significance $\alpha$, the null hypothesis is rejected only if the P-value is less than $\alpha/2$. Why? Well, suppose the little brother had rolled 20 times and obtained 8 sixes (more than the 3.33 we would expect). Then

$$\text{P-value} = P(X \ge 8) \approx 0.011$$

and Fred would want to declare the die to be unfair. This procedure is consistent with that used to test the coin.
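The die calculations can be reproduced without tables. The sketch below computes the probability of each number of sixes in 20 rolls of a fair die, then the one-tailed P-values for the two outcomes discussed (1 six and 8 sixes). It assumes, as in the discussion above, that each one-tailed P-value is compared with half the level of significance; `binom_pmf` is a helper defined here, not a library function.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 1 / 6
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# Table of probabilities, truncated once the values are negligible.
print(" k  P(X = k)")
for k in range(11):
    print(f"{k:2d}  {pmf[k]:.4f}")

expected = n * p                      # about 3.33 sixes
p_one_six = sum(pmf[:2])              # P(X <= 1): observed count below expected
p_eight_sixes = sum(pmf[8:])          # P(X >= 8): observed count above expected

alpha = 0.05
print(f"expected sixes: {expected:.2f}")
print(f"1 six:   P-value {p_one_six:.3f}, reject at 5%: {p_one_six < alpha / 2}")
print(f"8 sixes: P-value {p_eight_sixes:.3f}, reject at 5%: {p_eight_sixes < alpha / 2}")
```

With one six the tail probability is far above 0.025, so there is no evidence against the die; with eight sixes it falls below 0.025, so the die would be declared unfair.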

Suppose the little brother had rolled the die 30 or 40 times and kept track of all the information: how many ``ones'', ``twos'',..., ``sixes''. Then Fred could have used a goodness of fit test to see whether the die appeared to be fair. But this is beyond the school curriculum.

Examples


The following examples may help students to understand what is important.

Matched Pairs versus independent samples, continuous data.

Suppose a student is looking for evidence that caffeine affects blood pressure. The student measures blood pressure on subjects before and after they drink a cup of coffee or cola. A histogram of ``before'' blood pressures superimposed on a histogram of ``after'' blood pressures will not tell much of a story, because blood pressures vary greatly from individual to individual.

However, these data are recorded as matched pairs. If the student were to record the change in pressure (blood pressure after minus blood pressure before), most of those differences will be positive. So, the histogram of individual changes in blood pressure may tell a story.

If an experiment is designed so that data are recorded as matched pairs, triples, etc., then you should not lose that information.
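To see why the pairing matters, consider the sketch below. The blood pressure readings are entirely hypothetical, invented only to illustrate the idea; they are not measurements from any study. The point is that the spread among individuals swamps the change, while the within-person differences are small and mostly positive.

```python
import statistics

# HYPOTHETICAL systolic readings (mm Hg) for eight subjects, made up for
# illustration. Position i in each list refers to the same person.
before = [112, 125, 108, 140, 118, 131, 122, 104]
after = [116, 127, 113, 143, 117, 136, 127, 108]

# Matched-pairs view: one change per person (after minus before).
changes = [a - b for a, b in zip(after, before)]
positive = sum(1 for c in changes if c > 0)

print("changes:", changes)
print(f"{positive} of {len(changes)} changes are positive")
print(f"mean change:             {statistics.mean(changes):.1f}")
print(f"sd of changes:           {statistics.stdev(changes):.1f}")
print(f"sd of 'before' readings: {statistics.stdev(before):.1f}")
```

Comparing the last two lines of output shows how much variation is removed by working with each person's change rather than with two separate sets of readings.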

If a student compares blood pressure for male and female subjects, then there is no obvious matching.

Means versus individuals.

By collecting enough data, it is easy enough to show that on average adult males are taller than adult females.

On the other hand, we all know a few ``short'' men and a few ``tall'' women. So, individuals vary considerably from (average) trends.

Discrete data.

Techniques for handling discrete (count) data are generally different from those for continuous (measurement) data.

Suppose a student checks attendance at a school dance and discovers that, of the 210 students attending, 50 were from grade 9, 103 were from grade 10, 25 were from grade 11 and 32 were from grade 12.

Does it make much sense to report the average grade (10.2)? No. That is because the data are discrete. It would be better to describe attendance with a bar chart or pie diagram.
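As a rough illustration, the attendance counts can be summarized and displayed as a text bar chart. This is only a sketch: the scale of one `#` per five students is an arbitrary choice.

```python
# Attendance counts from the text, by grade.
counts = {9: 50, 10: 103, 11: 25, 12: 32}

total = sum(counts.values())
mean_grade = sum(grade * count for grade, count in counts.items()) / total

print(f"total attending: {total}")
print(f"'average grade': {mean_grade:.1f}  (not a very meaningful summary)")

# Text bar chart: one '#' for every five students (arbitrary scale).
for grade, count in counts.items():
    bar = "#" * round(count / 5)
    print(f"Grade {grade:2d} | {bar} {count} ({count / total:.0%})")
```

The bar chart shows at a glance that grade 10s dominate the attendance, something the single number 10.2 hides.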

Suppose the student wants to say: Grade 10s are most likely to attend a dance. Well, the student would need more information, such as enrolment in each grade. Suppose the data look like this.

We notice immediately that almost half the students in the school are grade 10s. This fact alone is interesting. Once we acknowledge that grade 10s are rather abundant in this particular school, a $\chi^2$ test leads to the conclusion that there is no evidence that the probability of attending the dance is related to grade level.

Matched Pairs versus independent samples, discrete data.

Sometimes the usual $\chi^2$ test is not correct. Suppose that home-room teachers gather information from grade 10 students only, asking each student two questions.

Then we are not surprised to see large numbers (70 and 67) on the ``main diagonal''; some students try to go to every dance; others avoid dances. The interesting numbers here are the ``off diagonals'', since they indicate a change of minds. Of the 83 students who changed their mind, 50/83 or 60% were students who went to last year's dance but decided not to go this year. The interesting question is: were the mind-changes attributable to chance, or do they indicate a trend? (i.e. Is 60% ``significantly greater than'' 40% in this context?) For further information, read about McNemar's Test.
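For readers who want to try it, here is a minimal sketch of McNemar's test applied to the 83 students who changed their minds (50 in one direction, hence 33 in the other). Under the null hypothesis of no trend, each mind-change is equally likely to go either way, so the count in one direction behaves like the number of heads in 83 tosses of a fair coin. Both the exact binomial version and the usual chi-square statistic are shown; the function name is made up for this example.

```python
from math import comb

def exact_mcnemar_p_value(b, c):
    """Two-sided exact P-value based only on the discordant counts b and c."""
    n = b + c
    larger = max(b, c)
    # P(X >= larger) when X is Binomial(n, 1/2).
    upper_tail = sum(comb(n, k) for k in range(larger, n + 1)) / 2**n
    return min(1.0, 2 * upper_tail)

went_last_year_only = 50               # went last year, not this year
went_this_year_only = 83 - 50          # the other 33 who changed their minds

p_value = exact_mcnemar_p_value(went_last_year_only, went_this_year_only)

# Chi-square form of McNemar's statistic (1 degree of freedom).
chi_square = (went_last_year_only - went_this_year_only) ** 2 / (
    went_last_year_only + went_this_year_only
)

print(f"exact two-sided P-value:   {p_value:.3f}")
print(f"McNemar chi-square (1 df): {chi_square:.2f}")
```

Note that the large ``main diagonal'' counts (70 and 67) play no part in the calculation: only the students who changed their minds carry information about a trend.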

Graphing Data


Suppose we record two pieces of information on each subject. How can the relationship between these two measurements be displayed? The answer depends on the type of data collected.

  1. (Continuous, Continuous)

    Span of writing hand (distance from tip of thumb to tip of little finger, when hand is spread out) versus height of individual. A scatter plot of span versus height (each point on the graph representing one person's measurements) should tell the story: generally, taller people have bigger hands, but there is a lot of variation from individual to individual.

    Suppose the two numbers are height and weight. Which plot makes more sense:
    Weight versus height, or log (Weight) versus log (Height)?
    Why?
    Hint: Our eyes do a better job of interpreting data if scatter plots look like straight lines rather than curves.

    It is important to keep track of information that may be `correlated' with the measurements that interest us. If data are collected over time, a plot of measurement versus time observed often shows a trend.

  2. (Discrete, Continuous) or (Continuous, Discrete)

    Height versus gender. A scatter plot would be difficult to interpret. Two histograms or boxplots, one for each gender, and with the same scale, should tell the story.

  3. (Discrete, Discrete)

    Pass/fail on a course and gender. A bar chart or pie diagram for each gender should tell the story.

    Pie diagrams are used most frequently to describe allocations of expenditures, or sources of income. In most other situations, bar charts are used, often with different coloured bars to represent different groups. Look in a few news magazines.


More information about basic statistics can be found on a web page located at the
Department of Mathematics and Statistics at the University of New Brunswick.
The site was prepared by Professor William Knight, and contains many innovative lecture notes, exercises, and problems.  
There are also pointers to other good web sites that discuss basic statistical concepts.


Prepared by M. Tingley