Statistics II, Contents

Chi-square tests of nominal (frequency) data

Introduction

Very often, the results of experiments involve frequency tables of nominal data. What is observed is the number of occurrences of particular combinations of characteristics, e.g., whether different types of target words are stressed or not, depending on the structure of the text or on whether the words are "new".

The results of such an experiment can be summarized as a table in which every cell contains the number of times a specific combination of characteristics is observed (e.g., stressed new words at the start of a sentence). The question to be answered is whether or not the observed frequencies deviate from a known distribution or, alternatively, whether subsets from the table (rows, columns) have identical frequency distributions.

In principle, it is possible to enumerate all possible frequency distributions and to calculate an exact level of significance under the null hypothesis (H0). However, this is rarely practical, because the number of possible frequency distributions can become very large. A more efficient approach uses the fact that the individual observations follow a Binomial or multinomial distribution, which can be approximated with a Normal distribution. Because the sum of the squared, standardized deviations of the observed numbers from their expected values then follows (approximately) a Chi-square distribution, this yields a very robust and versatile family of tests: the Chi-square tests.
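
To give a feeling for how quickly exact enumeration gets out of hand, the sketch below counts the number of distinct frequency tables that are possible when N observations are spread over k categories. It uses the standard "stars and bars" count, C(N + k - 1, k - 1); the function name and the example sizes are chosen here for illustration only.

```python
from math import comb


def count_frequency_tables(n_observations: int, n_categories: int) -> int:
    """Number of ways to distribute n_observations over n_categories cells.

    This is the number of non-negative integer solutions of
    N1 + ... + Nk = N, i.e. C(N + k - 1, k - 1).
    """
    return comb(n_observations + n_categories - 1, n_categories - 1)


# Illustrative sizes (not taken from any real experiment)
for n, k in [(10, 2), (20, 4), (100, 5)]:
    print(f"N = {n:3d}, k = {k}: {count_frequency_tables(n, k)} possible tables")
```

For two categories the count is only N + 1, but it grows rapidly with the number of categories.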

A Binomial example

We start with the simplest example: a binomial case. There are N1 observations of case 1 and N2 observations of case 2. The total number of observations is fixed:

N = N1 + N2

We want to know how likely this result would be under the null hypothesis that the probability of observing case 1 is p and of case 2 is 1 - p. We start with the known facts of the binomial distribution:

E(N1) = N * p, E(N2) = N * (1 - p), Var(N1) = Var(N2) = N * p * (1 - p)

Define X1 and X2 as:

X1 = (N1 - N * p) / SQRT( N * p * (1 - p) )
X2 = (N2 - N * (1 - p) )/ SQRT( N * p * (1 - p) )
X1 = - X2
(i.e., (N2 - N * (1 - p) ) = (N - N1 - N + N * p) = -(N1 - N * p) )

For large values of N1 and N2, X1 and X2 will approximately follow a Standard Normal distribution. If N1 and N2 were independent, the summed variance of the observed frequencies around their expected values would be X1^2 + X2^2. However, they are not independent: X1 = -X2, so X1^2 = X2^2. The summed variance is therefore only half this sum, which is more conveniently written as a weighted sum (X^2 is called Chi-square):

X^2 = (1-p) * X1^2 + p * X2^2

The choice of the weighting will not be motivated here, but it will prove very convenient.
If we write this equation out in full, in terms of the expected values, we get:

X^2 = (1-p) * (N1 - N * p)^2 / (N * p * (1 - p)) + p * (N2 - N * (1 - p) )^2 / (N * p * (1 - p))
becomes
X^2 = (N1 - N * p)^2 / (N * p) + (N2 - N * (1 - p) )^2 / (N * (1 - p))
and finally:
X^2 = (N1 - E(N1) )^2 / E(N1) + (N2 - E(N2) )^2 / E(N2)

For large values of E(N1) and E(N2), the values of X^2 follow a Chi-square distribution with 1 degree of freedom. This degree of freedom is the sum of the weighting factors used to calculate X^2, i.e., p + (1-p) = 1. It takes into account that there is only one value that can be chosen freely, i.e., either N1 or N2. The other value is then fixed by the fact that N = N1 + N2. H0 can be tested using standard tables for the Chi-square distribution.

The last formulation of X^2 is very convenient. There is no explicit value of p to choose, because the expected values generally follow directly from H0. More importantly, this formulation can be used unaltered for more complex cases.
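
The binomial derivation can be checked numerically. The sketch below computes X^2 both as the weighted sum of X1^2 and X2^2 and in the final "expected value" form, and confirms that the two agree. The significance level uses the fact that a Chi-square variable with 1 degree of freedom is the square of a Standard Normal variable, so the upper tail can be written with the complementary error function. The function name and the counts (60 and 40) are made up for illustration.

```python
import math


def binomial_chi_square(n1: int, n2: int, p: float):
    """Return (X^2, significance level) for counts n1, n2 under H0: P(case 1) = p."""
    n = n1 + n2
    e1, e2 = n * p, n * (1 - p)            # E(N1), E(N2)
    sd = math.sqrt(n * p * (1 - p))        # common standard deviation

    x1 = (n1 - e1) / sd
    x2 = (n2 - e2) / sd                    # equals -x1

    weighted = (1 - p) * x1 ** 2 + p * x2 ** 2
    pearson = (n1 - e1) ** 2 / e1 + (n2 - e2) ** 2 / e2
    assert math.isclose(weighted, pearson)  # both forms give the same X^2

    # Upper tail of Chi-square with 1 degree of freedom:
    # P(X^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(pearson / 2))
    return pearson, p_value


# Hypothetical counts: 60 vs 40 observations, H0: p = 0.5
chi_square, p_value = binomial_chi_square(60, 40, 0.5)
print(f"X^2 = {chi_square:.2f}, p = {p_value:.4f}")  # X^2 = 4.00, p about 0.0455
```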

The Multinomial case

The case where there are more than two categories is a straightforward extension of the binomial case. There are k categories with probabilities p1, ..., pk and observations N1, ..., Nk:

Sum i=1,k (pi) = 1
Sum i=1,k (Ni) = N

For each category we find:

E(Ni) = N * pi, Var(Ni) = N * pi * (1 - pi)

Define:

Xi = (Ni - N * pi) / SQRT( N * pi * (1 - pi) )

All Xi can be approximated with a Standard Normal distribution. The summed variance can be calculated as:

X^2 = Sum i=1,k ( (1-pi) * Xi^2 )

These weighting factors ensure that the variance sums up correctly. Again, X^2 will follow a Chi-square distribution, but now with k-1 degrees of freedom (the sum of all (1-pi) factors, i.e., k - Sum(pi) = k - 1). The same result is obtained when X^2 is rewritten in terms of E(Ni):

X^2 = Sum i=1,k ( (Ni - E(Ni))^2 / E(Ni) )

Note: the correct derivation of these formulas is based on (the inverse of) the covariance matrix; no weighting factors are involved. The "derivation" given here is only meant to give some intuition for the result.
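
As an illustration of the multinomial case, the sketch below calculates X^2 and its k-1 degrees of freedom for a list of observed counts and H0 probabilities. To stay within the Python standard library, the significance level is computed with a simple series expansion of the regularized incomplete gamma function; this helper (chi_square_sf) is an assumption of this example and loses precision far out in the tail, so a statistics library routine would normally be preferred. The counts are hypothetical.

```python
import math


def chi_square_sf(x: float, dof: int) -> float:
    """Upper tail P(X^2 > x) of a Chi-square distribution with dof degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function
    P(a, y) with a = dof / 2 and y = x / 2, and returns 1 - P(a, y).
    """
    if x <= 0:
        return 1.0
    a, y = dof / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-16 * total:
        term *= y / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(y) - y - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def goodness_of_fit(observed, probabilities):
    """Return (X^2, degrees of freedom, significance level)."""
    n = sum(observed)
    expected = [n * p for p in probabilities]          # E(Ni) = N * pi
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1
    return x2, dof, chi_square_sf(x2, dof)


# Hypothetical counts for three equally likely categories
x2, dof, p_value = goodness_of_fit([30, 20, 10], [1 / 3, 1 / 3, 1 / 3])
print(f"X^2 = {x2:.2f}, DoF = {dof}, p = {p_value:.4f}")
```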

General frequency tables

In the most general case, there is a table of frequencies. Here there is more than one index, generally two: the row number, i, and the column number, j. Each cell in the table contains Nij observations. In general, the expected values, Eij, are derived from the row and column totals under the null hypothesis that rows and columns are independent, i.e.,
Eij = Row Total i * Column Total j / Ntotal.

The procedure to calculate X^2 is the same as before: sum all terms ( (Nij - Eij)^2 / Eij ), i.e.,

X^2 = Sum i,j ( (Nij - Eij)^2 / Eij )

The number of degrees of freedom is more difficult to determine. It is the number of table cell values that can be chosen freely while keeping all the row and column totals fixed. In most cases, this is:

Degrees of Freedom = (Number of Rows - 1) * (Number of Columns - 1)

For a 2*2 table, the degrees of freedom would be 1.
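
The procedure for a general table can be sketched as follows: the expected values are derived from the row and column totals, the X^2 terms are summed over all cells, and the degrees of freedom follow from the table dimensions. The function name and the 2*2 table are hypothetical, chosen only to show the mechanics.

```python
def contingency_chi_square(table):
    """Return (X^2, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n_total = sum(row_totals)

    x2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            # Eij = Row Total * Column Total / Ntotal
            e_ij = row_totals[i] * column_totals[j] / n_total
            x2 += (n_ij - e_ij) ** 2 / e_ij

    dof = (len(row_totals) - 1) * (len(column_totals) - 1)
    return x2, dof


# Hypothetical 2*2 table, e.g. rows = new/old words, columns = stressed/unstressed
table = [[10, 20],
         [20, 10]]
x2, dof = contingency_chi_square(table)
print(f"X^2 = {x2:.3f}, DoF = {dof}")  # every Eij is 15 here, so X^2 = 4 * 25/15
```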

Continuity corrections and accuracy

The Normal distribution is an approximation of the real, multinomial, distribution. An important question is how accurate this approximation is. First of all, the Chi-square distribution used in the tests is a continuous distribution, but the frequencies are discrete. This means that we should apply a continuity correction before calculating the Chi-square sums. This continuity correction is done by reducing each absolute difference by 0.5, i.e., by using (|Nij - Eij| - 0.5)^2. However, this correction is only rarely used.

Even with the continuity correction, most textbooks advise using the Chi-square approximation only when each expected value is larger than 5, i.e., all Eij > 5.

To assess the accuracy, we compare the exact results of a two-sided Sign test with the results of a Chi-square approximation of this test. Below, the significance levels are tabulated side by side (without and with a continuity correction).

DoF = 1, E(N+) = E(N-) = N/2
   Sign Test     |    Chi-square      | With Continuity Correction
N+  N-  N   p<=  |  E     X^2    p<=  | X^2    p<= 
-----------------+--------------------+------------
 6  0   6  0.031 |  3     6     0.014 | 4.17  0.041
 7  0   7  0.016 |  3.5   7     0.008 | 5.14  0.023
 8  1   9  0.039 |  4.5   5.44  0.020 | 4     0.046
 9  1  10  0.022 |  5     6.4   0.011 | 4.9   0.027
10  2  12  0.039 |  6     5.33  0.021 | 4.08  0.043
11  2  13  0.023 |  6.5   6.43  0.011 | 4.92  0.027
11  3  14  0.057 |  7     4.57  0.033 | 3.5   0.061
12  3  15  0.035 |  7.5   5.4   0.020 | 4.27  0.039
13  4  17  0.049 |  8.5   4.76  0.029 | 3.76  0.052
14  4  18  0.031 |  9     5.56  0.018 | 4.5   0.034
15  5  20  0.041 | 10     5     0.025 | 4.05  0.044

It is evident that the continuity correction is indispensable for small numbers of observations or Degrees of Freedom. In this example, the difference between the results of the exact Sign test and the approximation with the Chi-square test (with continuity correction) becomes smaller than 0.005 when the expected number of observations becomes larger than 5. Moreover, the Chi-square test with continuity correction is more conservative than the exact test (i.e., in this table, p calculated with the corrected Chi-square test is always larger than p calculated with the Sign test), which makes this a "safe" approximation (you err on the side of caution).
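
The rows of the table can be recomputed with a short script. The sketch below assumes a two-sided exact Sign test (twice the binomial tail for the smaller count, with p = 1/2) and a Chi-square test with 1 degree of freedom, optionally with the 0.5 continuity correction. The function names are chosen here for illustration; clamping the corrected difference at zero is an added safeguard that does not affect these rows.

```python
import math


def sign_test_p(n_plus: int, n_minus: int) -> float:
    """Exact two-sided Sign test: 2 * P(count <= min(N+, N-)) with p = 1/2."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def chi_square_p(n_plus: int, n_minus: int, correction: bool = False):
    """Return (X^2, p) with E(N+) = E(N-) = N/2 and 1 degree of freedom."""
    expected = (n_plus + n_minus) / 2
    shift = 0.5 if correction else 0.0
    x2 = sum(
        max(0.0, abs(observed - expected) - shift) ** 2 / expected
        for observed in (n_plus, n_minus)
    )
    # Upper tail of Chi-square with 1 degree of freedom
    return x2, math.erfc(math.sqrt(x2 / 2))


print("N+  N-   Sign    Chi-square   corrected")
for n_plus, n_minus in [(6, 0), (9, 1), (15, 5)]:
    exact = sign_test_p(n_plus, n_minus)
    _, plain = chi_square_p(n_plus, n_minus)
    _, corrected = chi_square_p(n_plus, n_minus, correction=True)
    print(f"{n_plus:2d}  {n_minus:2d}   {exact:.3f}   {plain:.3f}        {corrected:.3f}")
```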

