# Multiple testing correction

In FMRI studies data analysis is usually done voxel-wise with all statistical tests conducted separately and simultaneously. Although these voxel-by-voxel tests increase the precision of the conclusions in terms of clusters, they also lead to the increase of the chance that at least one of them is wrong. Therefore a family of statistical tests suffer one serious problem: the probability --traditionally called alpha - of at least one error (type I, or false positive) is greater than that of an error on an individual test. To control the severity of this problem of **alpha escalation**, some measure of **multiple testing correction (**similar to** multiple comparisons correction** in the traditional sense) is desirable during group analysis.

There are three occurrences of multiple comparisons in FMRI analysis: individual subject analysis, group analysis, and conjunction analysis. Compared to the number of EPI voxels (in the order of 10^{4} inside the brain out of total ~10^{5} voxels), conjunction analysis with a few contrasts is much less a severe problem than the case for individual subject and group analysis, and can be simply dealt with Bonferroni correction if the researcher is willing to.

If an analysis fails to survive for correction, a lot of factors could contribute to the failure since the analysis is such a long chain of steps, but one possibility is the power of the analysis. If statistical power is an issue, then you might consider optimizing the experiment design or increasing the number of subjects.

**Familywise approach**

Familywise approach fixes alpha for the whole family (brain) of tests. For example in a brain with 25,000 voxels, a fixed type I error of 0.05 would lead to false detection of 1,250 active voxels simply by chance. By significantly lowering individual type I error, we could achieve the control of total type I error. This is usually done with the consideration of cluster size in the brain, in which case a corrected type I error of *p* means that among 100 such brain activation maps on average there would have 100*p% of them having false detection. However the downside is that the cost of the approach is the loss of power on the individual tests, increasing type II error (beta).

The Bonferroni correction (also called Fisher's method of alpha splitting) is such a familywise multiple-comparison correction used when multiple dependent or independent statistical tests are being performed simultaneously, but it is overly conservative in the case of fMRI analysis as it doesn't consider cluster size and spatial correlation among neighboring voxels. It works well if the brain behaves like Brownian motion of water molecules. Of course the brain has various well-organized structures.

`Monte Carlo simulations`

One approach to dealing with this problem in AFNI is to run Monte Carlo simulations with AlphaSim to obtain corrected type I error for the statistical analysis in the whole brain or an ROI. Such a dual thresholding of both type I error (i.e., alpha) and cluster size gives a reasonable correction for simultaneous tests during the group analysis.

As data are acquired in original voxel size, do make sure that Monte Carlo simulations are done in original voxel size (but not original brain shape - you will see the difference later) instead of higher resolution such as tlrc space. This is to make sure cluster formation is properly simulated.

The following are general steps to run Monte Carlo simulations for correcting multiple testings of group analysis with a hypothetical resolution of 3x3x3 mm^3.

Correction for individaul subject analysis is basically the same except for skipping the step of creating a group mask.

(1) Create a common mask or ROI among all subjects

3dAutomask and 3dcalc are the primary tools to make such a mask. You might have obtained such a mask for each subject during individual subject analysis with 3dDeconvovle. Create a group mask by warping all the masks to talairach space, summing them, and thresholding by n-1 (n = number of subjects); Or, multiplying all individual talairach masks. Applying this mask to group analysis, you would avoid some spurious activations along the edge of the brain. If a template was used in spatial normalization, you can obtain a group mask by running 3dAutomask on the template.

The mask will be used for multiple comparisons correction: With a restriction on the region of interest, less voxels would be considered for correction, thus avoiding unnecessary penalty. If you have problem obtaining survival clusters in the end, consider smaller mask. One possibility is to restrict the correction only on the grey matter. You may use the segmentation tool FAST in FSL to create such a mask for correction.

In case you need to convert the mask to hypothetical voxel size, run

3dresample -dxyz 3 3 3 -rmode 'NN' -inset OriginalMask+tlrc -prefix GroupMask

(2) Obtain spatial correlation (smoothness)

If you used 3dBlurToFWHM during preprocessing for all subjects, then simply adopt the numbers (under option -FWHM or -FWHMxy) from there. If you didn't, obtain the residual time series for each subject (through -errts in 3dDeconvolve), and then use 3dFHWMx on the residual time series in EPI resolution:

3dFWHMx -dset errts+orig -mask ... -out _fwhm.txt

Average the numbers (they should not vary much) across subjects. Using the input of 3dDeconvolve may work well too in the above command line (add option -detrend).

(3) Calculate connectivity radius

There are 3 different ways to define how voxels are considered neighbors: two voxels touch one point (vertex), one edge, or one face. Correspondingly you can choose a number slightly larger than the diagonal of a voxel, or the in-plane diagonal, or one dimension of the voxel. This means any two voxels with distance smaller than the connectivity radius are treated as neighbors in a cluster. For example, with a voxel size of 3 mm X 3 mm X 3 mm, the connectivity radius can be defined as 3.1 mm (> 3), 4.3 mm (> sqrt(3^{2} + 3^2)), or 5.2 mm (> sqrt(3^2 + 3^2 + 3^2)). **The smaller the connectivity radius, more difficult to form clusters. **

(4) Individual voxel thresholding *p* value

Basically you need to compromise among overall significance level (alpha), minimum cluster size, and individual voxel significance level, and may have to vary different *p* value to find the compromise you are interested. Keep in mind about the compromise between *p* and minimum cluster size: smaller the *p* value, smaller the minimum cluster size when other parameters are fixed, but it is not true that smaller the *p* value the better. Any *p* from 0.05 and less is appropriate, and the appropriate the compromise depends on each circumstance.

(5) Run Marte Carlo simulations to obtain corrected *p* value

AlphaSim -mask ...+orig -iter 1000 -rmm ... -pthr ... -fwhmx -fwhmy -fwhmz -out ...

There are 6 columns in the output table, but you only need to pay attention to the first (cluster size) and last (corrected *p* value) columns. The relationship between the two is basically a trade-off: the smaller the minimum cluster size, the bigger the corrected *p* value. You might have to try a few different individual voxel *p* values to find the desired pair of (mini cluster size, corrected *p*).

(6) If **3dclust** (or **3dmerge**) is used later, the user has to be careful about the unit (mm^{3} for the option -vmul in **3dclust**: cluster size multiplied by the EPI resolution!

Report correction parameters in paper:

**Voxel-level approach**

In stead of controlling false positives in conjunction with minimum cluster size, an alternative is to put a limit on the total number of false positive at individual voxel level. In other words, a corrected type I error of *p* indicates that among those active voxels, proportion *p* of them are false positives. The conventional method is determining the false discovery rate with the program 3dFDR, and focuses on the total number of detected voxels in a brain without considering minimum cluster size. Thus spatial filtering (smoothing) for FDR is not as important as for family-wise error correction.

A natural question is: Which method should be used, family-wise error correction or false discovery rate? Practically this might boil down to whichever method gives "nice" active maps, but the choice depends more on the nature of the study: If you want some extent of certainty (sensitivity) on cluster detection, go with family-wise error correction; If specificity combined with the control of false positives is more important, choose FDR.

`False Discovery Rate (FDR)`