TRR
Welcome to TRR
Test-Retest Reliability Program through Bayesian Multilevel Modeling
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Version 1.0.5, March 13, 2023
Author: Gang Chen (gangchen@mail.nih.gov)
Website - https://afni.nimh.nih.gov/gangchen_homepage
SSCC/NIMH, National Institutes of Health, Bethesda, MD 20892
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Usage:
TRR performs test-retest reliability analysis for behavioral data as well as
region-based neuroimaging data. If a dataset does not involve multiple
trials, use the conventional intraclass correlation (ICC) with, for
example, 3dICC for neuroimaging data. However, when there are multiple
trials for each condition, the traditional intraclass correlation may
underestimate TRR to varying extents. 3dLMEr could be utilized with the
option -TRR to estimate test-retest reliability with trial-level data for
whole-brain analysis; however, it may only work for data with strong
effects such as a single effect (e.g., one condition or average across
conditions).
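The underestimation mentioned above can be illustrated with a small
simulation. The sketch below is not the TRR model: it assumes Gaussian
subject-level effects with a known between-session correlation, adds
independent trial-level noise, and uses the Pearson correlation of the
per-session trial averages as a stand-in for a conventional ICC. All function
names and parameter values are hypothetical.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def attenuation_factor(n_trials, sd_subj, sd_trial):
    """Classical shrinkage of a correlation between two trial averages."""
    return sd_subj ** 2 / (sd_subj ** 2 + sd_trial ** 2 / n_trials)


def simulate_averaged_correlation(n_subj, n_trials, rho, sd_subj, sd_trial, seed=1):
    """Correlate per-session trial averages across simulated subjects."""
    rng = random.Random(seed)
    avg1, avg2 = [], []
    for _ in range(n_subj):
        # Standardized true effects for the two sessions, correlated at rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        for z, avg in ((z1, avg1), (z2, avg2)):
            # Each trial is the true effect plus independent trial-level noise.
            trials = [sd_subj * z + rng.gauss(0.0, sd_trial) for _ in range(n_trials)]
            avg.append(sum(trials) / n_trials)
    return pearson(avg1, avg2)


true_rho = 0.8
observed = simulate_averaged_correlation(2000, 10, true_rho, sd_subj=1.0, sd_trial=5.0)
expected = true_rho * attenuation_factor(10, 1.0, 5.0)
print(f"true {true_rho:.2f}, expected after averaging {expected:.2f}, "
      f"simulated {observed:.2f}")
```

Under these assumptions the correlation of the averages shrinks as the
trial-level noise grows relative to the between-subject spread, which is the
situation the trial-level model is meant to address.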
The input data for the program TRR must be at the trial level without
any summarization at the condition level. The TRR estimation is conducted
through a Bayesian multilevel model with a shell script (as shown in the
examples below). The input data should be formulated in a pure-text table
that codes all the variables.
Citation:
If you want to cite the modeling approach for TRR, consider the following:
Chen G, et al., Beyond the intraclass correlation: A hierarchical modeling
approach to test-retest assessment.
https://www.biorxiv.org/content/10.1101/2021.01.04.425305v1
Read the following carefully!
A data table in pure text format is needed as input for a TRR script. The
data table should contain at least 3 (with a single condition) or 4 (with
two conditions) columns that specify the information about subjects,
sessions and response variable values:
Subj session Y
S1 T1 0.2643
S1 T2 0.3762
Subj condition session Y
S1 happy T1 0.2643
S1 happy T2 0.3762
S1 sad T1 0.3211
S1 sad T2 0.3341
0) Through Bayesian analysis, a whole TRR distribution is presented at the
end as a density plot in PDF. In addition, the distribution is
summarized with a mode (peak) and a highest density interval, which are
stored in a text file named with the string specified through -prefix plus
the suffix .txt.
1) Avoid using pure numbers to code the labels for categorical variables. The
column order does not matter. You can specify those column names as you
prefer, but it saves a little scripting if you adopt the default naming
for subjects ('Subj'), sessions ('sess') and response variable ('Y').
2) Sampling error for the trial-level effects can be incorporated into the
model. This is especially applicable to neuroimaging data where the
trial-level effects are typically estimated through time series regression
with GLS (e.g., 3dREMLfit in AFNI); thus, the standard error or t-statistic
can be provided as part of the input through an extra column in the data
table and through the option -se (or -tstat) in the TRR script.
3) If there are more than 4 CPUs available, one could take advantage of
within-chain parallelization through the option -WCP. However, extra steps
are required: both 'cmdstan' and 'cmdstanr' have to be installed. To install
'cmdstanr', execute the following command in R:
install.packages('cmdstanr', repos = c('https://mc-stan.org/r-packages/', getOption('repos')))
Then install 'cmdstan' using the following command in R:
cmdstanr::install_cmdstan(cores = 2)
4) The results from TRR can differ slightly between executions, computers,
and R package versions due to the randomness involved in Monte Carlo
simulations, but the differences should be negligible unless numerical
failure occurs.
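Note 0 above mentions a highest density interval. As a rough illustration
(not TRR's own implementation, which is not shown in this help), such an
interval can be taken as the shortest window that contains a given share of
the posterior draws. The function name, the 95% default, and the sample values
below are assumptions made for the example.

```python
import math


def highest_density_interval(draws, mass=0.95):
    """Return the shortest interval containing at least `mass` of the draws."""
    xs = sorted(draws)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one draw")
    k = min(n, max(1, math.ceil(mass * n)))  # number of draws inside the interval
    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


draws = [0.10, 0.50, 0.52, 0.55, 0.58, 0.60, 0.61, 0.63, 0.65, 0.90]
print(highest_density_interval(draws, mass=0.8))
```

Unlike an equal-tailed interval, this window is free to sit off-center, which
matters for skewed distributions such as a reliability estimate near 1.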
Installation requirements:
In addition to R installation, the R packages "brms", "coda" and "ggplot2" are
required for TRR. Make sure you have a recent version of R. To install these
packages, run the following command at the terminal:
rPkgsInstall -pkgs "brms,coda,ggplot2" -site "http://cran.us.r-project.org"
Alternatively you may install them in R:
install.packages("brms")
install.packages("coda")
install.packages("ggplot2")
To take full advantage of parallelization, install both 'cmdstan' and 'cmdstanr'
and use the option -WCP in TRR (see comments above).
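As a small sketch of how the value for -WCP might be chosen, the helper below
splits the available CPUs evenly across chains, matching the 24-CPU, 4-chain
example in this help (6 threads per chain). The function names are
hypothetical, and the rule of skipping -WCP below 2 threads per chain is an
assumption, not something TRR enforces.

```python
import os


def wcp_threads(n_cpus, n_chains=4):
    """Threads per chain if the CPUs are split evenly across chains."""
    if n_chains < 1:
        raise ValueError("n_chains must be at least 1")
    return n_cpus // n_chains


def wcp_option(n_cpus, n_chains=4):
    """Return a '-WCP k' fragment, or '' with fewer than 2 threads per chain."""
    k = wcp_threads(n_cpus, n_chains)
    return f"-WCP {k}" if k >= 2 else ""


print(wcp_option(24))                    # the 24-CPU case described in this help
print(wcp_option(os.cpu_count() or 1))   # whatever the current machine offers
```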
Running:
Once the TRR command script is constructed and saved as a text file, for
example, called myTRR.txt, execute it with either of the following (assuming
the tcsh shell):
nohup tcsh -x myTRR.txt > diary.txt &
nohup tcsh -x myTRR.txt |& tee diary.txt &
The progress of the analysis is stored in the text file diary.txt and can
be examined later. The 'nohup' command allows the analysis to keep running in
the background even if the terminal is closed.
Examples:
Example 1 --- TRR estimation for a single effect - simple scenario: one
condition, two sessions. Trial level effects are the input
from each subject, and test-retest reliability between two sessions is
the research focus.
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -dataTable myData.tbl
If the number of CPUs on a computer is a multiple of 4 (e.g., 8, 16, 24,
...), a speedup feature can be adopted through within-chain parallelization
with the option -WCP. For example, the following script assumes a
computer with 24 CPUs (6 CPUs per chain):
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -WCP 6 -dataTable myData.tbl
If the data are skewed or have outliers, use exGaussian or Student's t:
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -distY exgaussian -dataTable myData.tbl
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -distY student -dataTable myData.tbl
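The input file 'myData.tbl' is a data table in pure text format as below (the
response column is labeled Y here, so with this header the option would be
'-Y Y' rather than '-Y RT'):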
Subj sess Y
S01 sess1 0.162
S01 sess1 0.212
S01 sess2 -0.598
S01 sess2 0.327
S02 sess1 0.249
S02 sess1 0.568
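Before launching a run, it can help to confirm that the table really is at the
trial level and that every subject has both repetitions. The following is a
minimal sketch under the assumption of a whitespace-delimited table with a
header line, as in the excerpt above; the helper names are hypothetical and
are not part of TRR.

```python
from collections import Counter

EXAMPLE_TABLE = """\
Subj sess Y
S01 sess1 0.162
S01 sess1 0.212
S01 sess2 -0.598
S01 sess2 0.327
S02 sess1 0.249
S02 sess1 0.568
"""


def trial_counts(table_text, subject="Subj", repetition="sess"):
    """Count trials (rows) per (subject, repetition) cell of a long-format table."""
    rows = [line.split() for line in table_text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    s_col, r_col = header.index(subject), header.index(repetition)
    return Counter((row[s_col], row[r_col]) for row in body)


def subjects_missing_a_repetition(counts):
    """List subjects that lack data for at least one repetition level."""
    subjects = sorted({subj for subj, _ in counts})
    levels = sorted({rep for _, rep in counts})
    return [s for s in subjects if any((s, r) not in counts for r in levels)]


counts = trial_counts(EXAMPLE_TABLE)
levels = sorted({rep for _, rep in counts})
print("repetition levels:", levels)  # TRR currently allows exactly two
print("incomplete subjects:", subjects_missing_a_repetition(counts))
```

On the truncated excerpt the check flags S02, simply because its second
session is not listed there; a complete table should yield an empty list.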
Example 2 --- TRR estimation for a contrast between two conditions. Input
data include trial-level effects for two conditions during two sessions.
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -condition cond -dataTable myData.tbl
A version with within-chain parallelization through option '-WCP 6' on a
computer with 24 CPUs:
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -condition cond -WCP 6 \
-dataTable myData.tbl
Another version with the assumption of Student's t-distribution:
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -condition cond -distY student -dataTable myData.tbl
The input file 'myData.tbl' is a data table in pure text format as below:
Subj sess cond Y
S01 sess1 C1 0.162
S01 sess1 C1 0.212
S01 sess1 C2 0.262
S01 sess1 C2 0.638
S01 sess2 C1 -0.598
S01 sess2 C1 0.327
S01 sess2 C2 0.249
S01 sess2 C2 0.568
Example 3 --- TRR estimation for a contrast between two conditions. Input
data include trial-level effects plus their t-statistic or standard error
values for two conditions during two sessions.
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -condition cond -tstat tvalue -dataTable myData.tbl
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -condition cond -se SE -dataTable myData.tbl
A version with within-chain parallelization through option '-WCP 6' on a
computer with 24 CPUs:
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -condition cond -tstat tvalue -WCP 6 \
-dataTable myData.tbl
Another version with the assumption of Student's t-distribution:
TRR -prefix myTRR -chains 4 -iterations 1000 -Y RT -subject Subj \
-repetition sess -condition cond -tstat tvalue -distY student \
-dataTable myData.tbl
The input file 'myData.tbl' is a data table in pure text format as below:
Subj sess cond tvalue Y
S01 sess1 C1 2.315 0.162
S01 sess1 C1 3.212 0.341
S01 sess1 C2 1.262 0.234
S01 sess1 C2 0.638 0.518
S01 sess2 C1 -2.598 -0.213
S01 sess2 C1 3.327 0.423
S01 sess2 C2 4.249 0.791
S01 sess2 C2 3.568 0.351
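When only t-statistics are at hand but a standard-error column is preferred
(for use with -se), the conversion described under the -se option is the
effect estimate divided by its t-statistic. The sketch below applies that to a
few rows of the excerpt above; the helper name and the use of abs() as a guard
are assumptions made for the example.

```python
EXAMPLE_TABLE = """\
Subj sess cond tvalue Y
S01 sess1 C1 2.315 0.162
S01 sess1 C1 3.212 0.341
S01 sess2 C1 -2.598 -0.213
"""


def standard_errors(table_text, y="Y", tstat="tvalue"):
    """Return one standard error per row, computed as estimate / t-statistic."""
    rows = [line.split() for line in table_text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    y_col, t_col = header.index(y), header.index(tstat)
    result = []
    for row in body:
        beta, t = float(row[y_col]), float(row[t_col])
        if t == 0.0:
            raise ValueError("a t-statistic of 0 gives an undefined standard error")
        # Estimate and t-statistic normally share a sign; abs() is only a guard.
        result.append(abs(beta / t))
    return result


for se in standard_errors(EXAMPLE_TABLE):
    print(f"{se:.4f}")
```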
Options:
Options in alphabetical order:
-chains N: Specify the number of Markov chains. Make sure there are enough
processors available on the computer. Most of the time 4 cores are good
enough. However, a larger number of chains (e.g., 8, 12) may help achieve
higher accuracy for the posterior distribution. Choose 1 for a
single-processor computer, which is practical only for simple models.
-condition var_name: var_name is used to specify the column name that is
designated as the condition variable. Currently TRR can only handle
two conditions. Note that when this option is not invoked, no
condition variable is assumed to be present, and the TRR analysis
will proceed with a single effect instead of a contrast between two
conditions.
-cVars variable_list: Identify categorical (qualitative) variables (or
factors) with this option. A list with more than one variable
has to be separated with commas (,) without any other characters such
as spaces, and should be enclosed in (single or double) quotes.
For example, -cVars "sex,site"
-dataTable TABLE: List the data structure in a table of long format (cf. wide
format) in R with a header as the first line.
NOTE:
1) There should be at least three columns in the table. These minimum
three columns can be in any order but must carry the fixed, reserved labels
'Subj', 'ROI', and 'Y'. The column 'ROI' is meant to code the regions
that are associated with each value under the column Y. More columns can
be added in the table for explanatory variables (e.g., groups, age, site)
if applicable. Only subject-level (or between-subjects) explanatory variables
are allowed at the moment. The labels for the columns of 'Subj' and 'ROI'
can be any identifiable characters including numbers.
2) Each row is associated with one and only one 'Y' value, which is the
response variable in the table of long format (cf. wide format) as
defined in R. With n subjects and m regions, there should be mn rows in
total, assuming no missing data.
3) It is fine to have variables (or columns) in the table that are not used
in the current analysis.
4) The content of the table can be saved as a separate file, e.g., called
table.txt. In the script specify the data with '-dataTable table.txt'.
This option is useful when: (a) there are many rows in the table so that
the program complains with an 'Arg list too long' error; (b) you want to
try different models with the same dataset.
-dbgArgs: This option will enable R to save the parameters in a file called
.TRR.dbg.AFNI.args in the current directory so that debugging can be
performed.
-distY distr_name: Use this option to specify the distribution for the response
variable. The default is Gaussian when this option is not invoked. When
skewness or outliers occur in the data, consider adopting the Student's
t-distribution, exGaussian, log-normal etc. by using this option with
'student', 'exgaussian', 'lognormal' and so on.
-help: this help message
-iterations N: Specify the number of iterations per Markov chain. Choose 1000 (default)
for simple models (e.g., one or no explanatory variables). If a convergence
problem occurs, as indicated by Rhat being greater than 1.1, increase the number of
iterations (e.g., 2000) for complex models, which will lengthen the runtime.
Unfortunately there is no way to predict the optimum iterations ahead of time.
-model FORMULA: This option specifies the effects associated with explanatory
variables. By default (without user input) the model is specified as
1 (Intercept). Currently only between-subjects factors (e.g., sex,
patients vs. controls) and quantitative variables (e.g., age) are
allowed. When no between-subject factors are present, simply put 1
(default) for FORMULA. The expression FORMULA with more than one
variable has to be enclosed in (single or double) quotes (e.g.,
'1+sex', '1+sex+age'). Variable names in the formula should be consistent
with the ones used in the header of the data table. A+B represents the
additive effects of A and B, A:B is the interaction between A
and B, and A*B = A+B+A:B. Subject as a variable should not occur in
the model specification here.
-PDP width height: Specify the layout of the posterior distribution plot (PDP).
The size of the figure window is specified through the two parameters
width and height, in inches.
-prefix PREFIX: Prefix is used to specify output file names. The main output is
a text file, named with the prefix plus .txt, that stores inference
information for effects of interest in a tabulated format depending on
selected options. The prefix will also be used for other output files such
as visualization plots and saved R data in binary format. The .RData file
can be used for post hoc processing such as customized processing and
plotting. Remove the .RData file to save disk space once you deem it no
longer useful.
-qVars variable_list: Identify quantitative variables (or covariates) with
this option. A list with more than one variable has to be
separated with commas (,) without any other characters such as
spaces, and should be enclosed in (single or double) quotes.
For example, -qVars "Age,IQ"
-repetition var_name: var_name is used to specify the column name that is
designated as the repetition variable, such as session. The default
(when this option is not invoked) is 'repetition'. Currently it only allows
two repetitions in a test-retest scenario.
-se: This option indicates that standard error for the response variable is
available as input, and a column is designated for the standard error
in the data table (e.g., '-se SE' as in Example 3). If effect estimates and
their t-statistics are the output from preceding analysis, standard errors
can be obtained by dividing the effect estimates ('betas') by their
t-statistics. The default assumes that standard error is not part of the
input.
-show_allowed_options: list of allowed options
-subject var_name: var_name is used to specify the column name that is
designated as the subject variable. The default (when this option
is not invoked) is 'subj'.
-tstat var_name: var_name is used to specify the column name that lists
the t-statistic values, if available, for the response variable 'Y'.
In the case where standard errors are available for the effect
estimates of 'Y', use the option -se.
-verb VERB: Specify verbose level.
-WCP k: This option will invoke within-chain parallelization to speed up runtime.
To take advantage of this feature, you need the following: 1) at least 8
CPUs; 2) install 'cmdstan'; 3) install 'cmdstanr'. The value 'k'
is the number of threads per chain that is requested. For example, with 4
chains on a computer with 24 CPUs, you can set 'k' to 6 so that each
chain will be assigned 6 threads.
-Y var_name: var_name is used to specify the column name that is designated
as the response/outcome variable. The default (when this option is not
invoked) is 'Y'.
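For repeated runs with different option combinations, the command line can be
assembled programmatically and then written into the script file. The helper
below is a hypothetical convenience, not part of TRR; it only uses options
listed above and quotes arguments with Python's shlex, which is assumed to be
adequate for simple values under tcsh.

```python
import shlex


def build_trr_command(prefix, data_table, y="Y", subject="Subj",
                      repetition="sess", condition=None, tstat=None,
                      dist_y=None, wcp=None, chains=4, iterations=1000):
    """Assemble a single-line TRR command from the options documented above."""
    args = ["TRR", "-prefix", prefix, "-chains", str(chains),
            "-iterations", str(iterations), "-Y", y,
            "-subject", subject, "-repetition", repetition]
    if condition:
        args += ["-condition", condition]
    if tstat:
        args += ["-tstat", tstat]
    if dist_y:
        args += ["-distY", dist_y]
    if wcp:
        args += ["-WCP", str(wcp)]
    args += ["-dataTable", data_table]
    # Quote each argument so values with special characters stay intact.
    return " ".join(shlex.quote(a) for a in args)


command = build_trr_command("myTRR", "myData.tbl", y="RT", condition="cond",
                            tstat="tvalue", wcp=6)
print(command)
```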