Categorical Data Analysis: Fundamentals and Perspective Applications in Health Sciences

Nilima¹, Veerendra Nayak², Vasudeva Guddattu³

¹ Senior Lecturer, Indian Institute of Public Health, Delhi NCR, Gurugram, Haryana, India (Current); Assistant Professor, Department of Statistics, Manipal Academy of Higher Education, Manipal, Karnataka, India (Previous).
² Student, Department of Statistics, Manipal Academy of Higher Education, Manipal, Karnataka, India.
³ Associate Professor, Department of Statistics, Manipal Academy of Higher Education, Manipal, Karnataka, India.

NAME, ADDRESS, E-MAIL ID OF THE CORRESPONDING AUTHOR: Ms. Nilima, Senior Lecturer, Indian Institute of Public Health, Delhi NCR, Sector-44, Gurugram-122002, Haryana, India.
E-mail: nilima3012@gmail.com

This paper introduces statistical methods for testing differences between paired categorical responses. Health science researchers are often observed applying independent-sample tests when analysing paired data. Four common tests are described in detail for identifying specific differences between pairs of groups. The situation in which each test applies is discussed, both in general and in comparison with the others. Almost all statistical analysis techniques involve assumptions about the data to be analysed. Paired tests, including the paired t-test and repeated measures analysis of variance, require the distribution of the differences to be approximately normal; the unpaired t-test, on the other hand, requires normality to hold separately for both groups of observations. A data analysis technique also requires an assumption regarding the data generation process. Categorical data analysis approaches provide a series of statistical methods that require limited assumptions about the data. The more commonly used tests are McNemar’s and Cochran’s Q, while others, such as the Stuart Maxwell McNemar’s test and the Cochran Mantel Haenszel correlation method, are not so widely reported.

Cochran Mantel Haenszel correlation test, Cochran’s Q test, McNemar’s test, Paired categorical response, Stuart Maxwell McNemar’s test

Introduction

A cross tabulation, in which the frequency of each possible combination of categories is noted, is often used when analysing categorical variables. A contingency table is a table of joint counts for the various combinations of categories of two cross-classified categorical variables. The order of a contingency table, R×C, indicates the number of levels of the two categorical variables under consideration. When the response is categorical, the data can arise in a contingency table in two ways, viz., unrestricted sampling and restricted sampling with a fixed total sample size. In a restricted sampling scheme, it is assumed that the marginal or grand total is fixed. This sampling scheme is also referred to as a binomial or multinomial sampling scheme. The unrestricted sampling scheme is also referred to as a Poisson sampling scheme. We often assume that data have been generated from Poisson, binomial or multinomial sampling schemes [1]. All three sampling schemes lead to the same estimated expected cell values [2]. In the present paper, the analysis of contingency tables generated using categorical data from a complex sampling scheme is discussed. A complex sampling scheme yields data with between-observation dependence, which makes the multinomial sampling scheme invalid. These departures from multinomial sampling affect Pearson’s chi-squared statistic and hence make this test unsuitable when between-observation dependence exists [3]. A categorical variable has a measurement scale consisting of a set of categories [1], assigning each individual to a particular group based on some qualitative property. Categorical data are the counts corresponding to a set of non-overlapping classes of a qualitative variable. Categorical scales are pervasive in the biomedical sciences to measure outcomes such as whether a treatment is successful or not [1].

The analysis of the data collected depends on the measurement scale. The measurement scale is nominal if the categories are meant just for identification, such as “male” or “female.” A variable is said to be measured on the ordinal scale if the categories exhibit a natural ordering, for example, severity of disease with categories “mild”, “moderate” and “severe.” A comparison of independent samples includes the variability in response along with the variability between subjects. However, when the data are paired, we look at the data within each subject. The paired comparison is not affected by the way subjects differ [4], which removes between-subject variability from the comparison. Researchers are observed to use the chi-square test of association or Fisher’s exact test [5] on paired data, which is not appropriate because these tests treat the observations as independent of each other. If a paired study is undertaken, a paired analysis must be used [6]. In this paper, we discuss in detail the tests suitable for paired nominal data.
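To illustrate why an independent-sample test answers a different question on paired data, the following minimal Python sketch applies Pearson’s chi-squared test of association and the paired (McNemar) statistic to the same 2×2 counts. The counts are those of Case 1 in [Table/Fig-2]; the function names are illustrative assumptions and are not part of the original analysis.

```python
def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for association in a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def mcnemar_chi2(b, c):
    """Paired (McNemar) chi-squared statistic, based on the discordant cells only."""
    return (b - c) ** 2 / (b + c)


# Counts from [Table/Fig-2]: rows = baseline, columns = after 3 months
a, b, c, d = 70, 130, 30, 154

association = pearson_chi2_2x2(a, b, c, d)  # is baseline status related to follow-up status?
change = mcnemar_chi2(b, c)                 # did the proportion of smokers change?

print(f"Pearson chi-squared (association): {association:.2f}")
print(f"McNemar chi-squared (change):      {change:.2f}")
```

The two statistics differ because Pearson’s test asks whether baseline and follow-up status are associated, whereas the paired statistic asks whether the marginal proportion of smokers has changed.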

Categorical Data Analysis

McNemar’s Test

McNemar’s test is used on paired nominal data. It is often used to test marginal homogeneity, which is said to hold if the row and corresponding column marginal frequencies are equal. The test applies to studies where cases serve as their own controls, or to studies with a “before and after” design, specifically when the variable of interest is dichotomous [4]. In such situations, parametric tests cannot be applied, since they require the variable to be measured on at least an interval scale.

Consider a dichotomous variable measured at two different time points. The researcher is interested in investigating whether there is any change in the response over time. A 2×2 contingency table for this situation is illustrated in [Table/Fig-1].

[Table/Fig-1]:

Data layout for McNemar’s test.

	Time point 2
		Level A	Level B	Total
Time point 1	Level A	a	b	a+b
	Level B	c	d	c+d
	Total	a+c	b+d	n=a+b+c+d
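The layout above can be produced from subject-level records by cross-tabulating the two measurements. The sketch below is one possible way to do this in Python; the records are the baseline and one-year columns of [Table/Fig-8], and treating the value 1 as Level A is an assumption made only for this illustration.

```python
from collections import Counter


def paired_2x2(first, second, level_a=1):
    """Cross-tabulate paired binary responses into the cells a, b, c, d of the layout."""
    counts = Counter((x == level_a, y == level_a) for x, y in zip(first, second))
    a = counts[(True, True)]    # Level A at both time points (concordant)
    b = counts[(True, False)]   # Level A, then Level B (discordant)
    c = counts[(False, True)]   # Level B, then Level A (discordant)
    d = counts[(False, False)]  # Level B at both time points (concordant)
    return a, b, c, d


# Subject-level records taken from [Table/Fig-8]
baseline = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
after_1_year = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

a, b, c, d = paired_2x2(baseline, after_1_year)
print(f"a={a}, b={b}, c={c}, d={d}, n={a + b + c + d}")
```

Each subject contributes exactly one count to the table, which is what distinguishes this paired layout from a table of two independent samples.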

The cells with counts a and d are called concordant cells, as they represent individuals with no change in response status over time [4]. As the cell counts b and c indicate a change in the response over time, they are known as discordant cells. We hypothesize that there is a significant change in the response between the two time points, that is, the proportion of individuals with response A at the first and B at the second time point differs from the proportion of individuals with response B at the first and A at the second time point, i.e., π_AB≠π_BA. This can be simplified to π_A+≠π_+A, which implies that the marginal proportions are not equal. The corresponding null hypothesis, that the proportion of individuals with response A at the first time point does not differ from that at the second time point, is known as the hypothesis of marginal homogeneity [7].
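The step from π_AB≠π_BA to π_A+≠π_+A follows from writing each marginal proportion as a sum of cell proportions. The short derivation below uses only the notation of [Table/Fig-1], with π_AA denoting the proportion in the concordant cell a (a symbol introduced here for illustration).

```latex
\begin{aligned}
\pi_{A+} &= \pi_{AA} + \pi_{AB}, \qquad \pi_{+A} = \pi_{AA} + \pi_{BA},\\
\pi_{A+} - \pi_{+A} &= \pi_{AB} - \pi_{BA}.
\end{aligned}
```

The marginal proportions are therefore equal exactly when the two discordant cell proportions are equal, which is why only the discordant cells carry information about change.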

The test statistic,

$$\chi^2 = \frac{(b-c)^2}{b+c} \tag{1}$$

follows the chi-squared distribution with 1 degree of freedom under the null hypothesis of no change.

Case 1: A program to create awareness of the side effects of smoking was conducted among college students at regular intervals of three months. Three contact programs were organised. Data on smoking status and socio-demographic profile were collected at baseline and after completion of the program. We hypothesize that the intervention was effective. The aggregated data are illustrated in [Table/Fig-2].

[Table/Fig-2]:

Setup for the study requiring a binary categorical response at two time points on the same set of individuals.

	After 3 months
		Smokers	Non-smokers	Total
Baseline	Smokers	70	130	200
	Non-smokers	30	154	184
	Total	100	284	384

The McNemar’s test statistic is calculated as χ² = (130 − 30)²/(130 + 30) = 62.5. A significant difference in the proportion of smokers after three months was observed (χ²=62.5, df=1, p<0.001). There is enough evidence to conclude that the awareness program was effective. [Table/Fig-3] summarises the McNemar’s test.

[Table/Fig-3]:

Summary of McNemar’s test.

	General	Case 1
Hypothesis	The proportion of individuals with response A at the first and B at the second time point differs significantly from the proportion with response B at the first and A at the second time point	The proportion of smokers at baseline differs significantly from the proportion of smokers at three months
Test statistic	χ² = (b − c)²/(b + c)	χ²=62.5, p<0.001
Decision rule	If χ² ≥ χ²(0.05, 1) = 3.84 or p≤0.05, reject H₀	χ²=62.5 ≥ 3.84, p≤0.05: reject H₀
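The Case 1 calculation can be reproduced in a few lines. The following is a minimal Python sketch, assuming the uncorrected chi-squared form of McNemar’s statistic; the function name `mcnemar_test` is illustrative. The p-value uses the closed form available for 1 degree of freedom.

```python
import math


def mcnemar_test(b, c):
    """Return the McNemar chi-squared statistic (1 df) and its p-value.

    b and c are the two discordant cell counts of the paired 2x2 table.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    chi2 = (b - c) ** 2 / (b + c)
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)), since X is a squared standard normal
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Discordant cells of [Table/Fig-2]: 130 smokers became non-smokers, 30 the reverse
chi2, p_value = mcnemar_test(b=130, c=30)
print(f"chi-squared = {chi2:.1f}, df = 1, p = {p_value:.2e}")
```

The concordant cells (70 and 154) do not enter the statistic, which is consistent with the test being driven only by individuals whose status changed.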

Points to Ponder: McNemar’s test was used because only the baseline and the final time point observations were of interest. To investigate the change in smoking status at each time point (contact programs 1, 2 and 3), Cochran’s Q test would be used instead. A major limitation of McNemar’s test is that it cannot be used if the variable of interest has more than two levels or is measured at more than two time points. In such situations, one should use alternative tests such as the Stuart Maxwell test or the Cochran Mantel Haenszel correlation test, as discussed in this paper.

In Case 1, suppose the variable smoking status has more than two levels, say non-smokers, 1-10 cigarettes per day and more than ten cigarettes per day. With this modification in the response levels, McNemar’s test cannot be applied to test for marginal homogeneity.

Stuart Maxwell McNemar’s Test of Marginal Homogeneity

Stuart Maxwell McNemar’s test is an extension of McNemar’s test for two dependent samples when the response has three or more categories [8]. If the variable of interest has I categories and is put in a contingency table, an I×I table is generated. Here we hypothesize that π_i+≠π_+i for at least one category i. The data layout is shown in [Table/Fig-4].

[Table/Fig-4]:

Data layout for Stuart Maxwell McNemar’s test.

	Time point 2
		Level 1	Level 2	…	Level I	Total
Time point 1	Level 1	n₁₁	n₁₂	…	n_1I	n₁₊
	Level 2	n₂₁	n₂₂	…	n_2I	n₂₊
	...	...	...	...	...	...
	Level I	n_I1	n_I2	…	n_II	n_I+
	Total	n₊₁	n₊₂	…	n_+I	n₊₊

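The Stuart Maxwell McNemar’s test statistic is

$$\chi^2 = \mathbf{d}^{\top}\hat{\mathbf{V}}^{-1}\mathbf{d} = \sum_{i=1}^{I-1}\sum_{j=1}^{I-1} d_i\,\hat{V}^{ij}\,d_j \tag{2}$$

The test statistic follows the chi-squared distribution with (I–1) degrees of freedom. Here d_i = n_i+ − n_+i is the difference in the corresponding marginal totals [7], and V̂ is the estimated variance covariance matrix of these differences for any I–1 of the categories [9], with diagonal elements v_ii = n_i+ + n_+i − 2n_ii and off-diagonal elements v_ij = −(n_ij + n_ji). V̂^ij denotes the (i, j)^th element of the inverse of V̂.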

Case 2: Consider smoking status with three categories, say non-smokers, 1–10 cigarettes per day and >10 cigarettes per day. The aggregated data are illustrated in [Table/Fig-5].

[Table/Fig-5]:

Setup for the study requiring a multinomial categorical response at two time points on the same set of individuals.

	After 3 months
		Non-smokers	1-10 cigarettes per day	>10 cigarettes per day	Total
Baseline	Non-smokers	45	37	28	110
	1-10 cigarettes per day	55	32	11	98
	>10 cigarettes per day	105	18	53	176
	Total	205	87	92	384

The Stuart Maxwell McNemar’s test statistic indicates enough evidence to reject the null hypothesis and conclude that the intervention was effective in creating awareness among college students. [Table/Fig-6] summarises the Stuart Maxwell McNemar’s test.

[Table/Fig-6]:

Summary of Stuart Maxwell McNemar’s test.

	General	Case 2
Hypothesis	There is no marginal homogeneity.	The distribution of smoking status at baseline differs significantly from that at 3 months
Test statistic	χ² = d′V̂⁻¹d	p<0.001
Decision rule	If χ² ≥ χ²(0.05, I–1) or p≤0.05, reject H₀	p≤0.05: reject H₀
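The statistic for Case 2 can be computed directly from [Table/Fig-5]. The sketch below is a minimal Python illustration, assuming the standard Stuart Maxwell form d′V⁻¹d with the last category dropped; the helper names `solve_linear` and `stuart_maxwell` are illustrative, and the p-value line uses a closed form that holds only for 2 degrees of freedom.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(x) for x in row] + [float(v)] for row, v in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def stuart_maxwell(table):
    """Stuart Maxwell statistic for a square table (rows: time 1, columns: time 2)."""
    size = len(table)
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(size)) for j in range(size)]
    keep = range(size - 1)  # drop the last category: the differences sum to zero
    d = [row[i] - col[i] for i in keep]
    V = [[row[i] + col[i] - 2 * table[i][i] if i == j else -(table[i][j] + table[j][i])
          for j in keep] for i in keep]
    stat = sum(di * xi for di, xi in zip(d, solve_linear(V, d)))
    return stat, size - 1


# Counts from [Table/Fig-5]: rows = baseline, columns = after 3 months
case2 = [[45, 37, 28], [55, 32, 11], [105, 18, 53]]
stat, df = stuart_maxwell(case2)
p_value = math.exp(-stat / 2)  # upper-tail probability, valid only for df = 2
print(f"chi-squared = {stat:.2f}, df = {df}, p = {p_value:.2e}")
```

Under these assumptions the sketch gives a statistic of about 49.4 on 2 degrees of freedom, which agrees with the reported p<0.001. Applied to a 2×2 table, the same function reduces to McNemar’s statistic.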

Points to Ponder: Stuart Maxwell McNemar’s test is suitable only if we have a square table. It should not be used when the response is measured at more than two time points, nor in situations where k(>2) interventions are given to the same individual. In these situations, the Cochran Mantel Haenszel correlation test should be used instead.

Let us suppose that in Case 1, the smoking status was recorded at more than two time points. McNemar’s test is not suitable for a situation where a dichotomous response is observed at more than two time points.

Cochran’s Q Test

Cochran’s Q is a test for analysing data on three or more dependent samples where the response variable is binary [8,10]. It is an extension of McNemar’s test for related samples and provides a method for testing the differences between three or more matched sets or three or more time points. The test can also be used to compare two or more interventions on the same set of individuals, with sufficient washout time to ensure no carryover effect of the previous intervention. In such a case, each subject is treated as a block. Suppose a binary response is measured at K time points on b individuals, where each individual is a block. The data layout is shown in [Table/Fig-7].

[Table/Fig-7]:

Data layout for Cochran’s Q test.

Subjects	Time point
	1	2	…	K
1	X₁₁	X₁₂	...	X_1K
2	X₂₁	X₂₂	...	X_2K
3	X₃₁	X₃₂	...	X_3K
…	...	...	...	...
b	X_b1	X_b2	...	X_bK

In such a case, we are interested in testing whether the proportion of responses with X_ij=1 is the same at each time point. Here X_ij is the categorical response of the i^th subject at the j^th time point. Each X_ij takes the value 0 or 1, where 0 implies non-occurrence and 1 implies occurrence of the event. Then X_+j represents the sum for the j^th column (time point) and X_i+ represents the sum for the i^th row (individual). Let N be the total number of successes.

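The test statistic is

$$T = \frac{K(K-1)\sum_{j=1}^{K}\left(X_{+j}-\frac{N}{K}\right)^{2}}{\sum_{i=1}^{b}X_{i+}\left(K-X_{i+}\right)} \tag{3}$$

which follows a chi-squared distribution with K–1 degrees of freedom under the null hypothesis of no change.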

Case 3: Let us consider a modification of Case 1. The smoking status was measured at three time points, say baseline, after one year and after two years. The table setup is given in [Table/Fig-8].

[Table/Fig-8]:

Table setup for the study requiring a binary categorical response at multiple time points on the same set of individuals.

Subject	Baseline	After 1 year	After 2 years
1	1	1	0
2	0	1	0
3	0	1	0
4	0	1	0
5	1	1	0
6	0	0	1
7	1	0	1
8	1	0	1
9	0	0	0
10	0	1	0

In this case, we hypothesize that the proportion of smokers decreases significantly over time. Here K=3, b=10, X₊₁=4, X₊₂=6, X₊₃=3, X₁₊=2, X₂₊=1, X₃₊=1, ..., X₁₀₊=1 and N=13. The test statistic is T=1.55 (df=2, p=0.459), which indicates that there is not enough evidence to reject the null hypothesis. There is therefore no evidence that the intervention was effective in reducing the number of smokers over time. [Table/Fig-9] summarises the Cochran’s Q test.

[Table/Fig-9]:

Summary of Cochran’s Q test.

	General	Case 3
Hypothesis	Proportion of success differs significantly for at least one group (time point)	Proportion of smokers differs significantly for at least one time point.
Test statistic	T = K(K–1)Σ_j(X_+j – N/K)² / Σ_i X_i+(K – X_i+)	T=1.55, p=0.459
Decision rule	If T ≥ χ²(0.05, K–1) or p≤0.05, reject H₀	T=1.55 < χ²(0.05, 2) = 5.99, p>0.05: not enough evidence to reject H₀
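The Case 3 result can be checked from the subject-level data in [Table/Fig-8]. The following minimal Python sketch assumes the usual form of Cochran’s Q; the function name `cochran_q` is illustrative, and the p-value line uses a closed form that holds only for 2 degrees of freedom.

```python
import math


def cochran_q(data):
    """Cochran's Q for a subjects x time-points matrix of 0/1 responses."""
    k = len(data[0])
    col_totals = [sum(row[j] for row in data) for j in range(k)]  # X_+j
    row_totals = [sum(row) for row in data]                       # X_i+
    n_success = sum(row_totals)                                   # N
    numerator = k * (k - 1) * sum((cj - n_success / k) ** 2 for cj in col_totals)
    denominator = sum(ri * (k - ri) for ri in row_totals)
    if denominator == 0:
        raise ValueError("every subject has a constant response: Q is undefined")
    return numerator / denominator, k - 1


# Rows of [Table/Fig-8]: baseline, after 1 year, after 2 years
case3 = [
    [1, 1, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0], [1, 1, 0],
    [0, 0, 1], [1, 0, 1], [1, 0, 1], [0, 0, 0], [0, 1, 0],
]
q, df = cochran_q(case3)
p_value = math.exp(-q / 2)  # upper-tail probability, valid only for df = 2
print(f"T = {q:.3f}, df = {df}, p = {p_value:.3f}")
```

Subjects whose response never changes (such as subject 9) contribute nothing to the denominator, in the same way that concordant pairs drop out of McNemar’s statistic.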

Points to Ponder: Cochran’s Q test is equivalent to McNemar’s test when K=2 [8]. For a similar design with an ordinal or continuous response, Friedman’s test is used instead. Post-hoc analysis for Cochran’s Q uses McNemar’s test for each pair of time points, with the Bonferroni-Dunn method of correction [8].
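The post-hoc procedure can be sketched as follows for the data of [Table/Fig-8]. This is an illustrative Python example, assuming the uncorrected chi-squared form of McNemar’s statistic and a simple Bonferroni multiplication of each p-value by the number of pairs; the function and variable names are assumptions. With so few discordant pairs the chi-squared approximation is rough, so the output should be read as a demonstration of the procedure only.

```python
import math
from itertools import combinations


def mcnemar_from_pairs(x, y):
    """McNemar chi-squared statistic and p-value (1 df) from two paired 0/1 lists."""
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    if b + c == 0:
        raise ValueError("no discordant pairs")
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Columns of [Table/Fig-8]
columns = {
    "Baseline": [1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
    "After 1 year": [1, 1, 1, 1, 1, 0, 0, 0, 0, 1],
    "After 2 years": [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
}

pairs = list(combinations(columns, 2))
results = {}
for first, second in pairs:
    chi2, p = mcnemar_from_pairs(columns[first], columns[second])
    adjusted = min(1.0, p * len(pairs))  # Bonferroni adjustment for three comparisons
    results[(first, second)] = (chi2, adjusted)
    print(f"{first} vs {second}: chi-squared = {chi2:.3f}, adjusted p = {adjusted:.3f}")
```

None of the adjusted p-values falls below 0.05, in line with the non-significant overall Q test for Case 3.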

Cochran Mantel Haenszel (CMH) Correlation Test

This method is used when we have paired nominal data with more than two levels measured at more than two time points. McNemar’s test and Stuart Maxwell McNemar’s test are special cases of the CMH correlation test [11]. Each subject is treated as a stratum. Within each stratum, rows represent time points and columns represent categories of response [12]. For the k^th subject, the partial table is represented in [Table/Fig-10].

[Table/Fig-10]:

Data layout of the k^th stratum for the CMH correlation test.

	Response category
Time point	1	2	…	C	Total
1	n_k11	n_k12	…	n_k1C	1
2	n_k21	n_k22	…	n_k2C	1
…	…	…	…	…	…
T	n_kT1	n_kT2	…	n_kTC	1
Total	n_k+1	n_k+2	…	n_k+C	T

Each n_kij, where i=1,2,...,T and j=1,2,...,C, takes the value 0 or 1 depending on the status of the k^th subject at a particular time point, such that each row sum is equal to one. Thus, with n subjects, we have n such partial tables. The CMH test statistic is used to test conditional independence (two variables are said to be conditionally independent if they are independent in each partial table).

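For an i×j×k table, the CMH test statistic is given by

$$\mathrm{CMH} = \left[\sum_{k}\left(\mathbf{n}_k-\boldsymbol{\mu}_k\right)\right]^{\top}\left[\sum_{k}\mathbf{V}_k\right]^{-1}\left[\sum_{k}\left(\mathbf{n}_k-\boldsymbol{\mu}_k\right)\right] \tag{4}$$

The test statistic follows the chi-squared distribution with (T–1)×(C–1) degrees of freedom [13]. In the k^th stratum, n_k is the vector of (T–1)×(C–1) cell counts, μ_k is the vector of expected frequencies of the same (T–1)×(C–1) cells, with elements n_ki+ n_k+j / n_k++, and V_k is the variance covariance matrix with elements

$$\operatorname{Cov}\left(n_{kij},\,n_{ki'j'}\right) = \frac{n_{ki+}\left(\delta_{ii'}\,n_{k++}-n_{ki'+}\right)\,n_{k+j}\left(\delta_{jj'}\,n_{k++}-n_{k+j'}\right)}{n_{k++}^{2}\left(n_{k++}-1\right)} \tag{5}$$

where δ_ab=1 if a=b and δ_ab=0 if a≠b. Equation (5) gives the variance covariance matrix.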

Case 4: Let us consider a modification of Case 1, where smoking status has three levels as discussed in Case 2 (non-smokers, 1-10 cigarettes per day and >10 cigarettes per day) and is measured at more than two time points as in Case 3 (baseline, after three months and after six months). The table setup for the k^th subject is given in [Table/Fig-11].

[Table/Fig-11]:

Table setup for the study requiring a multinomial categorical response at three time points on the k^th individual.

	Response category
	Non-smoker	1-10 cigarettes per day	>10 cigarettes per day	Total
Baseline	0	0	1	1
After 3 months	0	1	0	1
After 6 months	0	1	0	1

Data for all 384 subjects were analysed using SAS University Edition. The evidence was not enough to reject the null hypothesis (χ²=0.189, df=4, p=0.909). There is therefore no evidence that the intervention was effective in reducing the number of smokers over time. The results of the CMH correlation test are summarised in [Table/Fig-12].

[Table/Fig-12]:

Summary of CMH correlation test.

	General	Case 4
Hypothesis	There is a linear association between X and Y in at least one stratum.	Proportion of smokers differs significantly for at least one time point.
Test statistic	CMH = [Σ_k(n_k – μ_k)]′[Σ_k V_k]⁻¹[Σ_k(n_k – μ_k)]	χ² = 0.1893, p=0.909
Decision rule	If CMH ≥ χ²(0.05, (T–1)(C–1)) or p≤0.05, reject H₀	p>0.05: not enough evidence to reject H₀
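Tail probabilities for the statistics reported in Cases 1, 3 and 4 can be recomputed from the statistic and its degrees of freedom. The sketch below is a minimal Python illustration that relies on closed forms available for 1 degree of freedom and for even degrees of freedom; the function name `chi2_sf` is an assumption, and the inputs are the rounded values quoted in the text.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate, for df = 1 or any even df."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df % 2 == 0:
        half = x / 2
        return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))
    raise ValueError("closed form implemented only for df = 1 or even df")


# (statistic, degrees of freedom) as quoted in the text
reported = {"Case 1": (62.5, 1), "Case 3": (1.55, 2), "Case 4": (0.1893, 4)}
p_values = {case: chi2_sf(x, df) for case, (x, df) in reported.items()}
for case, p in p_values.items():
    print(f"{case}: p = {p:.4g}")

# The same Case 4 statistic referred to 2 degrees of freedom, for comparison
p_case4_df2 = chi2_sf(0.1893, 2)
print(f"Case 4 on 2 df: p = {p_case4_df2:.4g}")
```

With these closed forms, a statistic of 0.1893 on 4 degrees of freedom corresponds to a tail probability of about 0.996, whereas a value close to the reported p=0.909 is obtained on 2 degrees of freedom. The degrees of freedom behind the reported Case 4 p-value may therefore be worth confirming against the software output.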

Points to Ponder: When the response is binary and is measured at more than two time points, we use Cochran’s Q test instead. When the response has more than two levels and is measured at two time points, we use Stuart Maxwell McNemar’s test. When the response is binary and is measured at two time points, we use McNemar’s test.
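The relationship between these tests can be checked numerically. The subject-level data of Case 4 are not given in the text, so the Python sketch below instead expands the aggregated tables of Case 1 and Case 2 into one stratum per subject and applies a generalised CMH statistic. It assumes the multiple hypergeometric covariance commonly used for this statistic, with every row total equal to one; the names `generalized_cmh`, `expand` and `solve_linear` are illustrative.

```python
from itertools import product


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(x) for x in row] + [float(v)] for row, v in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def generalized_cmh(profiles, n_categories):
    """profiles: one tuple per subject with the 0-based category at each time point."""
    T, C = len(profiles[0]), n_categories
    cells = list(product(range(T - 1), range(C - 1)))  # the (T-1)(C-1) retained cells
    diff = [0.0] * len(cells)
    V = [[0.0] * len(cells) for _ in cells]
    for profile in profiles:
        col = [profile.count(j) for j in range(C)]  # column totals n_k+j of the stratum
        for a, (i, j) in enumerate(cells):
            diff[a] += (profile[i] == j) - col[j] / T  # observed minus expected count
            for b, (i2, j2) in enumerate(cells):
                V[a][b] += ((T * (i == i2) - 1) * col[j] * (T * (j == j2) - col[j2])
                            / (T ** 2 * (T - 1)))
    stat = sum(d * x for d, x in zip(diff, solve_linear(V, diff)))
    return stat, (T - 1) * (C - 1)


def expand(table):
    """Turn an aggregated time-1 x time-2 table into one (time 1, time 2) tuple per subject."""
    return [(i, j) for i, row in enumerate(table) for j, n in enumerate(row) for _ in range(n)]


cmh_case1, df_case1 = generalized_cmh(expand([[70, 130], [30, 154]]), 2)
cmh_case2, df_case2 = generalized_cmh(expand([[45, 37, 28], [55, 32, 11], [105, 18, 53]]), 3)
print(f"Case 1: CMH = {cmh_case1:.2f}, df = {df_case1}")
print(f"Case 2: CMH = {cmh_case2:.2f}, df = {df_case2}")
```

For the Case 1 table the sketch returns the McNemar value of 62.5, and for the Case 2 table it returns the same value as the Stuart Maxwell form, which illustrates the sense in which both tests are special cases of the CMH approach.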

A tabular comparison summarising each situation and the appropriate choice of test is shown in [Table/Fig-13].

[Table/Fig-13]:

Comparison of tests discussed in the article.

Test	Response levels	Time points
McNemar’s	2	2
Stuart Maxwell McNemar’s	>2	2
Cochran’s Q	2	>2
Cochran Mantel Haenszel	>2	>2
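The decision rule in the table can be written as a small helper. The function below is an illustrative Python sketch only; its name and error handling are assumptions, and it encodes nothing beyond the four rows of the table.

```python
def choose_paired_test(response_levels, time_points):
    """Return the test suggested by the comparison table for paired nominal data."""
    if response_levels < 2 or time_points < 2:
        raise ValueError("need at least two response levels and two time points")
    binary = response_levels == 2
    two_points = time_points == 2
    if binary and two_points:
        return "McNemar's test"
    if two_points:
        return "Stuart Maxwell McNemar's test"
    if binary:
        return "Cochran's Q test"
    return "Cochran Mantel Haenszel correlation test"


# The four cases discussed in the article
for levels, points in [(2, 2), (3, 2), (2, 3), (3, 3)]:
    print(levels, points, "->", choose_paired_test(levels, points))
```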

Conclusion

The statistical test to be used on paired data depends on the number of levels of the categorical response and the number of time points at which measurements are taken. Use of independent-sample techniques on paired data results in loss of information and unreliable results. Therefore, it is recommended to study the characteristics of the data before deciding on the statistical test suitable for the data collected.

Financial or Other Competing Interests

None.

References

[1]. Agresti A. An introduction to categorical data analysis. Vol. 135. Wiley, New York; 1996.

[2]. Fienberg SE. The analysis of cross-classified categorical data. Springer Science & Business Media; 2007. doi:10.1007/978-0-387-72825-4. PMID: 21687832.

[3]. Porteous B. The mutual independence hypothesis for categorical data in complex sampling schemes. Biometrika. 1987;74(4):857-62. doi:10.1093/biomet/74.4.857.

[4]. Goodman MS. Biostatistics for Clinical and Public Health Research. Routledge; 2017. doi:10.4324/9781315155661.

[5]. Chun HK, Kim KM, Park HR. Effects of hand hygiene education and individual feedback on hand hygiene behaviour, MRSA acquisition rate and MRSA colonization pressure among intensive care unit nurses. International Journal of Nursing Practice. 2015;21(6):709-15. doi:10.1111/ijn.12288. PMID: 25354985.

[6]. Dallal G. Paired data - In theory. http://www.jerrydallal.com/lhsp/paired.htm

[7]. Sun X, Yang Z. Generalized McNemar’s test for homogeneity of the marginal distributions. In: SAS Global Forum 2008. 2008:1-10.

[8]. Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. CRC Press; 2003. doi:10.1201/9781420036268. PMID: 12636158.

[9]. Stuart A. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika. 1955;42(3/4):412-16. doi:10.1093/biomet/42.3-4.412.

[10]. Bhapkar VP. On Cochran’s Q test and its modification. In: Random Counts in Scientific Work, Volume 2. Pennsylvania University Press, University Park; 1970.

[11]. Zhang J, Boos DD. Generalized Cochran-Mantel-Haenszel test statistics for correlated categorical data. Communications in Statistics - Theory and Methods. 1997;26(8):1813-37. doi:10.1080/03610929708832016.

[12]. Agresti A. Categorical data analysis. Vol. 482. John Wiley & Sons; 2003.

[13]. Davis CS. Statistical methods for the analysis of repeated measurements. Springer Science & Business Media; 2002.