McNemar's test and Simpson's Paradox (and the "hot hand" in basketball)

(I wrote this paper in 2007 for a Statistics class I took while trying to do a Ph.D. I am sharing it here for posterity.)

McNemar’s test is a non-parametric method used on nominal data to determine whether the row and column marginal frequencies are equal. It is applied to 2×2 contingency tables of a dichotomous trait measured on matched pairs of subjects.

Simpson’s paradox is a statistical paradox in which the successes of several groups seem to be reversed when the groups are combined. This seemingly impossible result is encountered often in social science statistics and occurs when a weighting variable, which is not relevant to the individual group assessment, must be used in the combined assessment.

The paper evaluates the potential effect of Simpson’s paradox in McNemar’s test results and conclusions.

Theory

McNemar’s test for the significance of changes

The test is named after Quinn McNemar, who introduced it in 1947.

The McNemar test can be introduced as a variation of the sign test for the case when the data is nominal, and thus can be expressed as “0’s” (zeroes) and “1’s” (ones).[1]

Consider a study of N subjects (i = 1..N) in which treatments X and Y are applied to each subject and the resulting characteristic is recorded as the pair (Xi, Yi). The set of values Xi and the set of values Yi then constitute two paired samples, where each Xi and Yi can only take a value of 0 or 1.

Contingency table

The data can be presented in a 2 x 2 contingency table with the form:

Table 1. 2×2 contingency table

	Yi = 0	Yi = 1
Xi = 0	A = number of (0,0)	B = number of (0,1)
Xi = 1	C = number of (1,0)	D = number of (1,1)

Where A+B+C+D = N
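
As a minimal sketch of how Table 1 is filled in, the following Python snippet counts the four (Xi, Yi) outcomes from two paired 0/1 samples. The function name `contingency_table` and the sample data are illustrative assumptions, not part of the original paper.

```python
from collections import Counter

def contingency_table(x, y):
    """Count the four (Xi, Yi) outcomes of Table 1 from paired 0/1 samples."""
    if len(x) != len(y):
        raise ValueError("x and y must be paired samples of equal length")
    counts = Counter(zip(x, y))
    a = counts[(0, 0)]  # A = number of (0,0)
    b = counts[(0, 1)]  # B = number of (0,1)
    c = counts[(1, 0)]  # C = number of (1,0)
    d = counts[(1, 1)]  # D = number of (1,1)
    return a, b, c, d

# Made-up illustrative data (hypothetical, not from the paper).
x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 1, 1, 0, 1, 1, 0, 1]
a, b, c, d = contingency_table(x, y)
print(a, b, c, d)  # the four counts always add up to N = len(x)
```

Only the discordant counts B and C carry information about a change between the two treatments; A and D count the subjects whose value did not change.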

Assumptions

The data consists of the characteristics resulting from treatments on N randomly selected subjects, denoted as (Xi, Yi).

The pairs (Xi, Yi) are mutually independent and the measurement scale is nominal. This results in four possible categories presented above as (0,0), (0,1), (1,0), and (1,1).

Hypotheses

We want to test the hypothesis that the treatments make a difference in the incidence of the characteristic, thus the null hypothesis will state that the treatment does not change the incidence of the characteristic or, what is the same, that the incidence of the characteristic is the same for both treatments[2].

Thus, we test:

H₀: P(Xi = 0) = P(Yi = 0), or [1]

H₀: P(Xi = 1) = P(Yi = 1)

Expressed in terms of proportions[3] the null hypothesis is:

H₀: p₁ = p₂ [2]

In summary, all possible hypotheses expressed in proportions are:

H₀: p₁ = p₂	H₀: p₁ ≥ p₂	H₀: p₁ ≤ p₂	[3]
H₁: p₁ ≠ p₂	H₁: p₁ < p₂	H₁: p₁ > p₂

Test statistic

Since we are testing for p₁ = p₂, we can rewrite the hypothesis as a test for p₁ – p₂ = 0.

Using the equivalence between [1] and [2] and the values in Table 1, we say:

p₁ = (A+B)/N [4]

p₂ = (A+C)/N [5]

Therefore

p₁ – p₂ = (B-C)/N

McNemar showed that, under the null hypothesis and when B+C > 10, B-C is approximately normally distributed with mean 0 and standard deviation √(B+C), so the appropriate test statistic is:

Z = (B-C)/√(B+C) [6]

As Conover (1999) points out, the two-tailed test of Z is comparable (for big enough values of B+C) to the one-tailed (upper-tail) test of Z² = (B-C)²/(B+C), using a chi-squared distribution with 1 degree of freedom.
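
The statistic in [6] and its chi-squared counterpart can be sketched in a few lines of Python. The function name `mcnemar_z` and the discordant counts B = 25, C = 11 are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def mcnemar_z(b, c):
    """Normal-approximation McNemar statistic Z = (B - C) / sqrt(B + C)."""
    if b + c == 0:
        raise ValueError("B + C must be positive")
    return (b - c) / math.sqrt(b + c)

# Hypothetical discordant counts (not from the paper); B + C = 36 > 10.
b, c = 25, 11
z = mcnemar_z(b, c)
chi2 = z ** 2  # same value as (B - C)^2 / (B + C), 1 degree of freedom
print(z, chi2)
```

Note that A and D do not appear: the statistic depends only on the pairs whose value changed.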

Decision rule

For the two-sided test, the p-value is two times the probability of finding a Z greater than the absolute value of the Z found. We reject the null hypothesis if the p-value is less than the level of significance desired.

For the one-sided test, the p-value is the probability of finding a Z more extreme than the Z found in the direction of the alternative hypothesis (greater than the Z found for H₁: p₁ > p₂, less than the Z found for H₁: p₁ < p₂). We reject the null hypothesis if the p-value is less than the level of significance desired.
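
A minimal sketch of the decision rule under the normal approximation follows, using only the Python standard library. The helper names (`normal_sf`, `p_two_sided`, `p_upper`, `p_lower`) and the 0.05 significance level are assumptions made for illustration.

```python
import math

def normal_sf(z):
    """Upper-tail probability P(Z > z) of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def p_two_sided(z):
    """p-value for H1: p1 != p2."""
    return 2 * normal_sf(abs(z))

def p_upper(z):
    """p-value for H1: p1 > p2 (large positive Z is evidence against H0)."""
    return normal_sf(z)

def p_lower(z):
    """p-value for H1: p1 < p2 (large negative Z is evidence against H0)."""
    return normal_sf(-z)

alpha = 0.05  # assumed significance level
z = 1.96      # illustrative value of the test statistic
reject_two_sided = p_two_sided(z) < alpha
print(p_two_sided(z), reject_two_sided)
```

For a positive Z the two-sided p-value is exactly twice the upper-tail p-value, which is why a result can be significant one-sided but not two-sided.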

Simpson’s Paradox

Simpson’s paradox is the common name for a situation that may occur when two populations are analyzed with respect to the frequency of some characteristic: if the populations are separated into two categories, the population with higher frequency might show a lower frequency within each category.

The paradox arises when the following counter-intuitive relationships are true:

a/b < A/B

c/d < C/D, and [7]

(a+c)/(b+d) > (A+C)/(B+D)

A simple illustrative example (adapted from Shapiro, 1982):

Table of the success rate of two treatments on men, women, and both:

Table 2. Paradox example

	Men	Women	Both Sexes
Treatment 1	60/80=0.75	40/120=0.33	100/200=0.50
Treatment 2	100/150=0.67	10/40=0.25	110/190=0.58

Looking at the data for men and for women separately, Treatment 1 seems more effective; looking at men and women combined, Treatment 2 seems more effective.
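
The reversal in Table 2 can be checked with exact fractions. This is a small sketch; the dictionary layout and the names `data`, `rate`, `by_sex`, and `combined` are illustrative choices, while the counts are those of Table 2.

```python
from fractions import Fraction

# (successes, patients) per treatment and sex, taken from Table 2.
data = {
    "Treatment 1": {"Men": (60, 80), "Women": (40, 120)},
    "Treatment 2": {"Men": (100, 150), "Women": (10, 40)},
}

def rate(cells):
    """Pooled success rate over an iterable of (successes, patients) pairs."""
    cells = list(cells)
    successes = sum(s for s, _ in cells)
    patients = sum(n for _, n in cells)
    return Fraction(successes, patients)

by_sex = {
    sex: {t: rate([data[t][sex]]) for t in data}
    for sex in ("Men", "Women")
}
combined = {t: rate(data[t].values()) for t in data}

# Treatment 1 wins within each sex but loses once the sexes are pooled.
wins_within = all(
    by_sex[sex]["Treatment 1"] > by_sex[sex]["Treatment 2"] for sex in by_sex
)
wins_combined = combined["Treatment 1"] > combined["Treatment 2"]
print(wins_within, wins_combined)
```

The pooled rate is a weighted average of the per-sex rates, and the two treatments are weighted very differently: Treatment 1 was given mostly to women (120 of 200 patients), Treatment 2 mostly to men (150 of 190).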

Strictly speaking, the above constitutes a 2x2x2 contingency table with three variables: treatment, sex, and outcome (success or failure). Simpson (1951) states that besides the interactions between attributes (characteristics) in pairs, the statistical paradox is caused either by the “second-order” interaction of the three taken together or by the dependence of the collapsed variable with respect to the other variables.

Aggregated contingency tables affected by Simpson’s paradox

Special care should be taken when testing and drawing conclusions from data that are analyzed and presented as a 2×2 contingency table when they are really the aggregation (collapse) of a 2x2x2 contingency table.

The second-order interaction among the three variables, or the dependence of the collapsed variable on the others, can change the result of the overall test and lead to misleading conclusions.

Concretely, Simpson (1951) presents the 2x2x2 contingency situation in the following form:

Table 3. 2x2x2 contingency table

	CB	CB̅	C̅B	C̅B̅
A	a	c	e	g
A̅	b	d	f	h

Where a+b+c+d+e+f+g+h=1

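In the example given earlier, if A is treatment 1, B is men, and C is success, then a corresponds to 60, b to 100, and so forth: c = 40, d = 10, e = 20, f = 50, g = 80, and h = 30. These are counts; dividing each by the total of 390 gives proportions that sum to 1.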

According to Bartlett, as cited by Simpson, the condition for a zero second-order interaction is:

ad/bc = eh/fg [8]

In the example given this is true: ad/bc = (60×10)/(100×40) = 0.15 and eh/fg = (20×30)/(50×80) = 0.15.

According to Simpson, the second condition, assuming zero second-order interaction, is that the collapsed variable, “sex”, is independent of treatment for both success and failure, or that it is independent of success for both treatments. Mathematically, the condition is:

af = be, or ag = ce [9]

In the example given neither equality is true: with a = 60, b = 100, c = 40, e = 20, f = 50, and g = 80 read from Table 2, af = 3000 while be = 2000, and ag = 4800 while ce = 800.
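
Conditions [8] and [9] can be checked numerically for the treatment example. The sketch below assumes the cell assignment implied by Table 3 with A = treatment 1, B = men, and C = success; the failure counts are derived from Table 2 by subtraction, and counts are used instead of proportions because the common factor N cancels in every condition.

```python
# Cell counts in the layout of Table 3, read from Table 2
# (A = treatment 1, B = men, C = success).
a, b = 60, 100  # successes among men: treatment 1, treatment 2
c, d = 40, 10   # successes among women
e, f = 20, 50   # failures among men (80 - 60, 150 - 100)
g, h = 80, 30   # failures among women (120 - 40, 40 - 10)

# Bartlett's condition [8], ad/bc = eh/fg, cross-multiplied to stay in integers.
zero_second_order = (a * d * f * g == b * c * e * h)

# Simpson's conditions [9], exactly as written in the text.
cond_af_be = (a * f == b * e)
cond_ag_ce = (a * g == c * e)

print(zero_second_order, cond_af_be, cond_ag_ce)  # True False False
```

So the example has no second-order interaction, yet collapsing over sex is still unsafe because neither condition in [9] holds.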

A practical example

Wardrop (1995) studies the effect of Simpson’s paradox on the perception of the existence of the “hot hand” in basketball: the fans believe that making a shot will influence a player to make the following shot.

He tests each player’s data and the overall shooting data (two consecutive free throws) using the McNemar test and finds that the overall results support the “hot hand”, leading the fans to believe in it even though the results for individual players might not indicate the same.

Data is as follows:

Table 4. Hot-hand summary

	Larry Bird				Rick Robey				Total
	Second shot				Second shot				Second shot
First shot	Hit	Miss	Tot	First shot	Hit	Miss	Tot	First shot	Hit	Miss	Tot
Hit	251	34	285	Hit	54	37	91	Hit	305	71	376
Miss	48	5	53	Miss	49	31	80	Miss	97	36	133
Tot	299	39	338	Tot	103	68	171	Tot	402	107	509

The analysis of the probability of a hit after a hit (p_hh) and after a miss (p_mh), and the p-value of the McNemar test for p_hh = p_mh[4] yields:

Table 5. Test results

	Larry Bird	Rick Robey	Total
p_hh	0.881	0.593	0.811
p_mh	0.906	0.613	0.729
p-value	0.061	0.098	0.022

The overall p-value of 0.022 supports rejecting the null hypothesis that the probabilities are the same, which leads one to believe in the hot-hand phenomenon. The data for the individual players contradict this conclusion: neither player’s p-value is below 0.05.
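
The conditional proportions and test statistics can be recomputed from the counts in Table 4. In this sketch the names `players`, `summarize`, and `results` are illustrative, and the p-value is the one-sided (lower-tail) normal approximation of statistic [6]; that choice is an assumption, made only because it matches the tabulated p-values to three decimals.

```python
import math

# (hit-hit, hit-miss, miss-hit, miss-miss) counts from Table 4.
players = {
    "Larry Bird": (251, 34, 48, 5),
    "Rick Robey": (54, 37, 49, 31),
}
total = tuple(sum(col) for col in zip(*players.values()))
players["Total"] = total

def summarize(hh, hm, mh, mm):
    """Return (p_hh, p_mh, Z, lower-tail p-value) for one 2x2 table."""
    p_hh = hh / (hh + hm)  # P(second hit | first hit)
    p_mh = mh / (mh + mm)  # P(second hit | first miss)
    z = (hm - mh) / math.sqrt(hm + mh)          # statistic [6], B = hm, C = mh
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z < z), standard normal
    return p_hh, p_mh, z, p_lower

results = {name: summarize(*cells) for name, cells in players.items()}
for name, (p_hh, p_mh, z, p) in results.items():
    print(f"{name}: p_hh={p_hh:.3f} p_mh={p_mh:.3f} Z={z:.3f} p={p:.3f}")
```

For each player p_hh is below p_mh, while for the pooled data p_hh = 305/376 is above p_mh = 97/133, which is the reversal that characterizes Simpson’s paradox.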

Conclusion

Evaluating frequency data that might include collapsed variables can lead to erroneous conclusions. Special care should be used in analyzing the presence of multiple contingency situations or in analyzing the conditions defined by Bartlett and Simpson to prevent the emergence of the statistical paradox.

Bibliography

Conover, W. (1999), “Practical Nonparametric statistics”, Third Edition, John Wiley & Sons.

Daniel, W. (1990), “Applied non-parametric statistics”, Second Edition, Duxbury Press, Pacific Grove, CA.

McNemar, Q. “Note on the sampling error of the difference between correlated proportions or percentages”, Psychometrika, 12 (1947) 153-157

Simpson, E. H. (1951). “The Interpretation of Interaction in Contingency Tables”. Journal of the Royal Statistical Society, Ser. B 13: 238-241.

Shapiro, S. “Collapsing contingency tables – A geometric approach”. The American Statistician, February 1982, Vol. 36, No. 1

Wardrop, R. “Simpson’s Paradox and the Hot Hand in Basketball”, The American Statistician, February 1995, Vol. 49, No. 1

[1] This is formally a representation of Yes/No data or any nominal data with two categories.

[2] Two different one-sided tests are also possible, testing to see whether the incidence of the characteristic is either increased or reduced after a treatment.

[3] McNemar’s test is also called the “test for related samples when data consists of frequencies”, so it makes sense to use proportions.

[4] Wardrop justifies not using a one-sided test in the overwhelming evidence against p_hh > p_mh
