# Group Analysis with Unbalanced Designs

Equal sample sizes in an experimental design have four advantages:

- It guarantees that each cell contributes equally to the analysis, and it reduces problems associated with violations of the assumptions underlying group analysis (independence, identical distribution within and between cells, homogeneity of variance, and normality).
- The test statistic is relatively insensitive to small deviations from the assumption of equal variances across the factor levels.
- The power of the statistical tests is maximized.
- It carries the orthogonality property of main effects and interactions. This orthogonality also provides independent (nonoverlapping) pieces of information, making the factor effects independently estimable.

However, differences in sample size at the group analysis level do happen in FMRI studies:

(1) subjects are lost accidentally, as in nonexperimental designs where assignment to groups is beyond the investigator's control;

(2) the inequalities are intrinsic to the design, as with genotypes; or

(3) some subjects fail to perform some of the tasks.

Nonorthogonality among factors in a group analysis with unequal sample sizes among cells (factor level combinations) makes the analysis much more difficult than for balanced designs. It raises two problems: the calculations are much more complicated, and the equivalence among factor levels is lost.

Of course the simplest way to deal with an unbalanced design is to trim down some cells, randomly deleting data from the analysis until balance is achieved. However, the cost involved in FMRI data collection is so substantial that the researcher is unwilling to throw away any data. For example, if a subject in a rare disease/genotype group failed to perform some of the tasks, the investigator might not have the luxury of discarding that subject in the group analysis.

The problem of nonorthogonality, from the perspective of calculation, is the loss of additivity: with an unbalanced design, the sums of squares do not add up the way they would in a design with equal sample sizes. The between-cell sum of squares is no longer equal to the sum of the sums of squares for all the terms, and likewise the sums of squares for all the terms plus the error do not add up to the total sum of squares. The unequal sample sizes cause all the effects to be partially confounded with each other, and this nonorthogonality is an intrinsic characteristic of unbalanced designs.
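The confounding above can be seen directly in the design matrix. The following minimal sketch (not part of any AFNI program; the data layout is hypothetical) effect-codes two 2-level factors A and B and shows that their design columns, which are orthogonal when every cell has the same number of subjects, become correlated as soon as one cell loses a subject:

```python
def design_columns(cell_sizes):
    """Effect-coded columns (a, b) for the cells (A1B1, A1B2, A2B1, A2B2),
    with cell_sizes giving the number of subjects in each cell."""
    codes = [(+1, +1), (+1, -1), (-1, +1), (-1, -1)]
    a, b = [], []
    for (ca, cb), n in zip(codes, cell_sizes):
        a.extend([ca] * n)
        b.extend([cb] * n)
    return a, b

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

a, b = design_columns([2, 2, 2, 2])   # balanced: 2 subjects per cell
print(dot(a, b))                      # 0: A and B columns are orthogonal

a, b = design_columns([2, 2, 2, 1])   # one subject missing from cell A2B2
print(dot(a, b))                      # -1: the effects are partially confounded
```

Once the columns are non-orthogonal, the sum of squares attributed to A depends on whether B is already in the model, which is exactly why the additivity described above breaks down.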

Here we discuss two categories of unbalanced designs: unequal numbers of subjects, and missing data in some subjects.

`Unequal number of subjects`

If the unequal number of subjects (cases (1) and (2) above) results from a **random** (or ignorable) loss, the loss usually does not depend on how those subjects would have performed.

If there is only one factor of interest, designs with unequal numbers of subjects can be analyzed with either **3dttest** (two levels only) or **3dANOVA** (two or more levels).
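As a hedged sketch of why unequal group sizes pose no special problem in the one-factor case, the following computes the kind of pooled two-sample t statistic that a program such as **3dttest** evaluates at each voxel; the group data here are made-up numbers, not FMRI output:

```python
from math import sqrt
from statistics import mean, variance  # variance() is the unbiased sample variance

def pooled_t(x, y):
    """Pooled-variance two-sample t statistic; n1 and n2 may differ."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

group1 = [2.1, 1.8, 2.5, 2.0, 2.3]             # e.g. 5 controls (hypothetical)
group2 = [1.2, 1.5, 0.9, 1.4, 1.1, 1.3, 1.0]   # e.g. 7 patients (hypothetical)
print(round(pooled_t(group1, group2), 3))
```

The unequal n1 and n2 simply enter the standard error term; no rebalancing of the groups is needed.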

The following three kinds of unbalanced designs are currently allowed in the Matlab package GroupAna:

(1) All factors are fixed

1-way ANOVA;

2-way ANOVA type 1: AXB; and

3-way ANOVA type 1: AXBXC

This case is usually useful for ANCOVA, in which subjects are treated as repeated samples. See the ANCOVA section for more discussion.

(2) When a random factor (subject) is nested within another factor A, each level of factor A contains a unique and unequal number of subjects

3-way ANOVA type 3: BXC(A);

4-way ANOVA type 3: BXCXD(A).

In this case the imbalance is that each level of factor A (gender, age or disease group, genotypes, between-subjects task, ...) contains unique and unequal levels of the random factor (subject).

(3) In the following design

4-way ANOVA type 5: CXD(AXB)

each level of both factors A and B has a unique set of subjects.

It was thought that **3dRegAna** could handle unbalanced nested designs to some extent, as exemplified by Example 4 of the 3dRegAna manual on page 22, but that is valid only if data from different subjects are treated as repeated observations instead of being pooled from different levels of a random factor. In other words, the unbalanced design in Example 4 is a 2-way crossed ANOVA AXB with subjects treated as duplicated observations, which might be more appropriately analyzed as BXC(A) if enough subjects are available.

Theoretically speaking, unbalanced nested designs with a random factor (subject), such as the two design types - BXC(A) and BXCXD(A) - handled by the Matlab package, could still be analyzed with **3dRegAna** by reassembling the output, but it would be too strenuous and impractical, if not impossible, for general users to go through such a tunnel.

Running an unbalanced group analysis with unequal numbers of subjects in this package is basically the same as running an ANOVA, except for a couple of input lines regarding the imbalance. Please refer to the instructions for running ANOVA for details.

Type I sums of squares are currently used in this package, for reasons discussed in Types of Sums of Squares.

`Missing Data in Within-Subject and Between-Subjects Designs`

In this case the loss of data for some subjects is usually **nonrandom** (or nonignorable): it depends on how those subjects would have performed. The analysis and inferences might have to be compromised somehow, which transcends concerns about the inequality of sample sizes.

Ongoing research in this field might provide more solutions, but currently there are three approaches to handling missing data, and the differences among them are small when only a few cells are missing. To make the approaches more concrete and palpable, I will use an example, 3-way ANOVA type 3 BXC(A): A - Group (Normal and Patient), B - Condition (4 tasks), and C - Subject (24 subjects, 12 in each group). Let's assume subject #12 in the patient group failed to perform task #4.

**Complete-case analysis** or **casewise deletion**: throw away the subjects with missing data.

This is the simplest, most reasonable, and probably safest approach when only one or two subjects have missing data. Another advantage of discarding subjects with missing data is that the analysis is balanced in the sense that all remaining subjects contribute equally to all cells. In the above example, we would run an unbalanced 3-way ANOVA BXC(A).

Minuses:

- Doesn't compensate for the sampling bias due to the loss of data;
- May discard a considerable amount of data;
- Can cost some power and efficiency;
- If the original design was balanced, the effect of deleting a subject can be substantial.
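The casewise deletion described above can be sketched in a few lines; the data layout and numbers here are hypothetical, standing in for the patient group of the example:

```python
# data[subject] = responses for tasks 1-4; None marks a missing response
patients = {
    10: [1.1, 0.9, 1.3, 1.0],
    11: [0.8, 1.2, 1.1, 0.7],
    12: [1.0, 1.1, 0.9, None],   # subject 12 missed task 4
}

# Complete-case analysis: keep only subjects with no missing responses
complete = {s: v for s, v in patients.items() if None not in v}
print(sorted(complete))          # [10, 11] -- subject 12 is dropped entirely
```

All of subject 12's data are discarded, including the three tasks that subject did perform, which is the main cost of this approach.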

**Available-case analysis**: use as much of the available data as possible.

With a massive analysis like FMRI, it is very unwieldy to truly use all the available data in a single analysis, as can be done in the relatively simple calculations of the social sciences, but we can still salvage as much information from the available data as possible. This approach may be comforting when the investigator does not have the luxury of discarding any subjects yet has no sophisticated methods available for handling the missing data.

With the above example, we can run separate analyses for different contrasts. For any contrast excluding task #4, run a balanced 3-way ANOVA BXC(A) on the 3 remaining tasks of factor B with all 24 subjects; for any contrast involving task #4, settle for an unbalanced 3-way ANOVA BXC(A) with 23 subjects.
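The per-contrast subsetting can be sketched as follows (hypothetical data layout; the helper name is made up for illustration):

```python
def subjects_for_contrast(data, tasks):
    """Subjects with complete data on every task entering the contrast."""
    return sorted(s for s, v in data.items()
                  if all(v[t] is not None for t in tasks))

# data[subject] = responses for tasks 1-4; None marks a missing response
patients = {10: [1, 1, 1, 1], 11: [1, 1, 1, 1], 12: [1, 1, 1, None]}

print(subjects_for_contrast(patients, [0, 1, 2]))  # [10, 11, 12]: all subjects
print(subjects_for_contrast(patients, [0, 3]))     # [10, 11]: subject 12 excluded
```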

The disadvantage of this approach is quite obvious: piecemeal analyses. Some of the inferences are based on different sets of subjects, and when a large amount of data is missing, the results can be inconsistent.

**Model fitting**: Data imputation

To use all the information available when data are missing, a more coherent approach is the general linear (cell-means) model, fitting all the data at once. There are several methods in this category, and new research is still under development. One of them is data imputation: fill in the missing data from the rest of the cells, then run a complete-case analysis. Most people are comfortable with data imputation, and it can handle a small-to-moderate amount of missing data.

The trouble with the linear model in the presence of missing data is that any term involving the missing cells, such as the interaction between that subject and the other factors and the error term corresponding to that subject, has to be estimated from the available data. Such estimation involves complicated techniques such as **maximum-likelihood estimation**.

Based on an assessment of the cost/benefit ratio at this point, and given the effectiveness of complete-case and available-case analysis, I have decided to suspend any consideration of implementing the model-fitting approach in the package, and recommend that users consider complete-case or available-case analysis instead.



Last modified: May 25, 2005
