16.2. Elementary OpenMP in AFNI
16.2.1. Overview
These are notes by Bob Cox (AKA Zhark the Parallelizer) from May, 2009.
Several programs in AFNI are written using OpenMP, which provides support for parallelization of tasks. Some of the programs that use OpenMP are:
3dAllineate, 3dAutoTcorrelate, 3dBandpass, 3dBlurToFWHM, 3dClustSim,
3dDegreeCentrality, 3dDWUncert, 3dECM, 3dFWHMx, 3dGroupInCorr,
3dLFCD, 3dLocalACF, 3dLocalHistog, 3dLocalPV, 3dLocalstat,
3dMSE, 3dNwarpAdjust, 3dNwarpApply, 3dNwarpCalc, 3dNwarpCat,
3dNwarpFuncs, 3dNwarpXYZ, 3dQwarp, 3dREMLfit, 3dTcorr1D,
3dTcorrMap, 3dUnifize, 3dXClustSim, …
The above is not an exhaustive list (though I sure am tired of reading it).
Note that there are some other scripts or programs in AFNI that wrap
around one or more of the above programs (e.g., 3dQwarp is used within
auto_warp.py, @SSwarper and @animal_warper, among others).
16.2.2. Introduction
In the last couple months, I have explored using the parallelization
toolkit OpenMP in AFNI. OpenMP is built in to several compilers,
including gcc-4.2. OpenMP comprises a set of #pragmas, functions, etc.,
that are used to control parallel threads on a shared-memory multi-CPU
system (e.g., my octo-core Mac).
OpenMP should not be confused with Open MPI, which is a completely different piece of software for parallelization! MPI (= Message Passing Interface) lets you distribute work across multiple CPUs and multiple systems by explicitly passing messages back and forth between processes/threads. MPI is more flexible than OpenMP in that it allows a whole network of computers to be harnessed into one job.
MPI is also harder to use than OpenMP, in that the programmer has to manage the inter-thread data flow very explicitly. In OpenMP, the data model is that some data is explicitly in shared memory, visible to be read and written by each thread by normal C statements, and some data is explicitly private to each thread, and when modified in one thread that change will not be visible to the other threads. In OpenMP, the principal burden on the programmer is to make sure that the shared data is written to in a coherent manner by the different threads. In MPI, the burden on the programmer is to explicitly determine when data held in one thread is communicated to another thread. Converting an existing program to use OpenMP is generally much simpler than making it work with MPI.
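As a toy illustration of that data model (this is not from any AFNI program): in the sketch below, the array data[] is shared and read by every thread, each thread's partial sum is private because it is declared inside the parallel block, and the single write to the shared total is guarded by a critical directive. Compile with the OpenMP flag (e.g., -fopenmp for gcc).

#include <stdio.h>
#include <omp.h>

#define NVAL 1000000

static float data[NVAL] ;  /* file scope, hence shared: every thread reads it */

int main(void)
{
   double total = 0.0 ;    /* shared: written only inside the critical block */
   int ii ;

   for( ii=0 ; ii < NVAL ; ii++ ) data[ii] = 1.0f ;

#pragma omp parallel
 { int jj ; double mysum = 0.0 ;  /* declared inside the block, hence private */
#pragma omp for
   for( jj=0 ; jj < NVAL ; jj++ ) mysum += data[jj] ;  /* reading shared data is safe */
#pragma omp critical
   { total += mysum ; }           /* writing shared data must be coordinated */
 }

   printf("total = %g using at most %d threads\n" , total , omp_get_max_threads() ) ;
   return 0 ;
}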
16.2.3. Memory Concepts
The two most important things to realize about OpenMP’s parallel model for computation are
OpenMP is shared memory computing; unless you take care to make certain variables and arrays private to each thread, all the data is shared between threads.
You are responsible for managing the access to shared memory so that conflicts don’t occur (e.g., two threads don’t try to write to the same shared variable at the same time). The compiler does not even attempt to help you with analyzing the code to detect this potential problem.
In particular, if your parallel code is writing results into a shared memory array (as is almost certain to be happening so that you can get some results out), then you have to make sure that different threads won’t be writing to the same elements of the array at the same time, where one thread potentially destroys the results from another thread.
If your parallel code is summing up some value (say), you have to make sure that different threads don’t conflict when they modify this value.
If your parallel code is using some temporary variable or array, you probably want to make sure that each thread gets a separate copy of that object.
None of these things are hard, once you think about the concepts of parallel access to shared memory for a while, and realize the potential for conflicts between threads trying to update the same location more or less at the same instant. In particular, assigning to static variables is a real potential problem – and this might occur inside some function call that you haven’t looked at. (Hint: look at the functions you call from inside a parallelized part of your code!)
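A small sketch of those three points (an invented function, not AFNI code): each thread malloc-s its own scratch array, and the threads write only to disjoint elements of the shared output array.

#include <stdlib.h>

/* Copy each column of in[] to out[] through a per-thread scratch buffer;
   the real work would happen where the scratch buffer is processed.       */
void process_columns( int ncol , int nrow , const float *in , float *out )
{
#pragma omp parallel
 { float *work = (float *)malloc(sizeof(float)*nrow) ;  /* private scratch: one per thread */
   int cc , rr ;
#pragma omp for
   for( cc=0 ; cc < ncol ; cc++ ){                      /* threads get disjoint sets of cc */
     for( rr=0 ; rr < nrow ; rr++ ) work[rr] = in[cc*nrow+rr] ;
     /* ... per-column computation on work[] would go here ... */
     for( rr=0 ; rr < nrow ; rr++ ) out[cc*nrow+rr] = work[rr] ;  /* disjoint writes: no conflict */
   }
   free(work) ;                                         /* each thread frees its own scratch */
 }
}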
16.2.4. Compiling to use OpenMP
In your AFNI Makefile, you have to set the variable OMPFLAG to the
value of the flag(s) that the compiler/linker needs for dealing with
OpenMP, plus you have to enable OpenMP inside AFNI code by defining
the C macro USE_OMP. For gcc-4.2, my Makefile contains the single
extra line:
OMPFLAG = -fopenmp -DUSE_OMP
Some AFNI Makefile.*-s now contain definitions of OMPFLAG so that the binary builds include some parallelized programs. At this moment, these programs are:
Program : OpenMP use
3dBlurInMask : parallelized across sub-bricks
3dTcorrMap : parallelized across the inner voxel loop
3dDespike : parallelized across voxels
3dAllineate : parallelized across voxels in the interpolation functions (Example #1 below)
AlphaSim : parallelized across simulated 3D volumes
3dREMLfit : parallelized across voxels in the REML estimation loop, and across ARMA(1,1) (a,b) parameter pairs in the REML matrix setup loop
3dLocalPV : parallelized across voxels
3dLocalstat : parallelized across voxels
3dBandpass : the …
3dGroupInCorr : computations of correlations are parallelized across time series datasets; computations of t-tests are parallelized across voxels
3dTcorr1D : computations are parallelized across columns of the input 1D file
3dClustSim : (the new and improved version of AlphaSim, which has been deprecated for a long time) parallelized across simulated 3D volumes, and with fewer …
3dAutoTcorrelate : parallelized across the outer voxel loop; to get any decent speedup required converting the input dataset to a ‘vectim’ struct (time order first rather than last): otherwise, thrashing through the input dataset time points over and over was grossly slow.
In each case, I chose to invoke OpenMP at the simplest place that did
a lot of work that was independent between components - voxels or
sub-bricks. In the case of 3dBlurInMask, parallelizing across voxels
would be quite difficult, due to the structure of the algorithm - but
parallelizing across sub-bricks was essentially trivial, since the
blurring algorithm is completely independent for each volume of data.
All of the above programs, except for 3dGroupInCorr [Dec 2009] and
3dClustSim [Jul 2010], were originally written as serial programs,
and later converted to OpenMP. Some of the problems arising in these
conversions are outlined below. It is much easier to write a program
from scratch to use OpenMP than to retrofit it later!
In AFNI’s Makefile.INCLUDE, files that are to be compiled/linked with
OpenMP use CCOMP
instead of CC
or CCFAST
for the
compilation/linking. You’ll have to add an extra make rule for each
file that needs this special handling. NB: Any program that calls
an OpenMP-enabled function, even if the main program knows nothing
about OpenMP, must be linked with CCOMP
to get the proper libraries
included.
At the top of any C source file that’s going to use OpenMP #pragmas
or function calls, you should put a code block like so:
#ifdef USE_OMP
#include <omp.h>
#endif
By default, OpenMP will use all CPUs available in any parallel block
of code. This behavior can be changed by setting the environment
variable OMP_NUM_THREADS
to some smaller integer value. (There are
also OpenMP library functions that let you control this from within a
program, but I’ve not used these.)
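For instance, a trivial test program (not part of AFNI) can report what OpenMP will actually use; build it with the OpenMP flag and try different OMP_NUM_THREADS settings:

#include <stdio.h>
#ifdef USE_OMP
#include <omp.h>
#endif

int main(void)
{
#ifdef USE_OMP
   printf("CPUs available      = %d\n", omp_get_num_procs()   ) ;
   printf("max threads allowed = %d\n", omp_get_max_threads() ) ;  /* honors OMP_NUM_THREADS */
#else
   printf("not compiled with OpenMP\n") ;
#endif
   return 0 ;
}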
16.2.5. Example #1 – A basic case
My first example is a function from mri_genalign_util.c. The function
below does linear interpolation from the input image fim at npp output
points whose index coordinates are given in input arrays ip, jp, and
kp, storing the results into user-allocated array vv. (Rather than
make up a trivial sample case, I’m showing a real piece of code that’s
used in 3dAllineate.) The two OpenMP directives are shown in the lines
starting with #pragma omp ..:
#define FAR(i,j,k) far[(i)+(j)*nx+(k)*nxy]
#define CLIP(mm,nn) if(mm < 0)mm=0; else if(mm > nn)mm=nn
void GA_interp_linear( MRI_IMAGE *fim ,
int npp, float *ip, float *jp, float *kp, float *vv )
{
ENTRY("GA_interp_linear") ;
#pragma omp parallel if(npp > 9999)
 {
   int nx=fim->nx , ny=fim->ny , nz=fim->nz , nxy=nx*ny , pp ;
   float nxh=nx-0.501f , nyh=ny-0.501f , nzh=nz-0.501f , xx,yy,zz ;
   float fx,fy,fz ;
   float *far = MRI_FLOAT_PTR(fim) ;
   int nx1=nx-1,ny1=ny-1,nz1=nz-1 ;
   float ix,jy,kz ;
   int ix_00,ix_p1 ;    /* interpolation indices */
   int jy_00,jy_p1 ;
   int kz_00,kz_p1 ;
   float wt_00,wt_p1 ;  /* interpolation weights */
   float f_j00_k00, f_jp1_k00, f_j00_kp1, f_jp1_kp1, f_k00, f_kp1 ;

#pragma omp for
   for( pp=0 ; pp < npp ; pp++ ){
     xx = ip[pp] ; if( xx < -0.499f || xx > nxh ){ vv[pp]=outval; continue; }
     yy = jp[pp] ; if( yy < -0.499f || yy > nyh ){ vv[pp]=outval; continue; }
     zz = kp[pp] ; if( zz < -0.499f || zz > nzh ){ vv[pp]=outval; continue; }

     ix = floorf(xx) ; fx = xx - ix ;   /* integer and       */
     jy = floorf(yy) ; fy = yy - jy ;   /* fractional coords */
     kz = floorf(zz) ; fz = zz - kz ;

     ix_00 = ix ; ix_p1 = ix_00+1 ; CLIP(ix_00,nx1) ; CLIP(ix_p1,nx1) ;
     jy_00 = jy ; jy_p1 = jy_00+1 ; CLIP(jy_00,ny1) ; CLIP(jy_p1,ny1) ;
     kz_00 = kz ; kz_p1 = kz_00+1 ; CLIP(kz_00,nz1) ; CLIP(kz_p1,nz1) ;

     wt_00 = 1.0f-fx ; wt_p1 = fx ;  /* weights for ix_00 and ix_p1 points */

#undef  XINT
#define XINT(j,k) wt_00*FAR(ix_00,j,k)+wt_p1*FAR(ix_p1,j,k)

     /* interpolate to location ix+fx at each jy,kz level */

     f_j00_k00 = XINT(jy_00,kz_00) ; f_jp1_k00 = XINT(jy_p1,kz_00) ;
     f_j00_kp1 = XINT(jy_00,kz_p1) ; f_jp1_kp1 = XINT(jy_p1,kz_p1) ;

     /* interpolate to jy+fy at each kz level */

     wt_00 = 1.0f-fy ; wt_p1 = fy ;
     f_k00 = wt_00 * f_j00_k00 + wt_p1 * f_jp1_k00 ;
     f_kp1 = wt_00 * f_j00_kp1 + wt_p1 * f_jp1_kp1 ;

     /* interpolate to kz+fz to get output */

     vv[pp] = (1.0f-fz) * f_k00 + fz * f_kp1 ;
   }
 } /* end OpenMP */
EXRETURN ;
}
The directive:
#pragma omp parallel if(npp > 9999)
is at the head of a “structured block” of code that will be executed in parallel. (A “structured block” in OpenMP-lingo means a C {…} block that contains no way out (e.g., no return) except to fall through the bottom.) You should imagine that all the code inside this block will be executed in parallel on multiple CPUs – even code that does exactly the same thing. To get different things done on different CPUs, we need the second directive, that will specify the “work-sharing”.
In the above code, I’ve declared all the internal variables used in the function inside the parallel block. This means that these variables are all private to each thread. Assignments to any one of these in one thread will have no impact on the other threads. Declaring variables like this is the easy way to make sure they are thread-private and won’t accidentally conflict. It is also possible to declare outside variables to be thread-private in the parallel #pragma, but I’d rather skip that – doing it with private declarations, as above, is simpler to think about and to program. Thus, for example, the pointer:
float *far = MRI_FLOAT_PTR(fim) ;
will have N identical copies spread around amongst the N threads. This is slightly inefficient with respect to memory usage – since far is never changed after the initial assignment, it could be a shared variable declared and initialized outside the parallel block – but unless the amount of memory duplication is huge, the rule:
For most variables used and assigned to in the parallel block,
declare them inside the parallel block
is the simplest to code with. The only exception would be variables
whose values you wish to preserve when the parallel
block ends and
normal (sequential) program execution resumes – typically, these
variables are output arrays.
Note that the if(npp > 9999)
part of the parallel
directive
means that the code will actually only be parallelized if the number
of points to interpolate at one shot is 10,000 or more. There is no
point in parallelizing at too fine a level – the thread startup and
management overhead will be too large to get any net program speedup
from the OpenMP-ization. (You might well ask where I got the number
9999 from – the answer is that I just made it up – I didn’t actually
test the function to see where the breakeven point for npp
might
lie.)
The directive:
#pragma omp for
indicates that the next for statement should be parallelized across
the threads that were started with the parallel directive – that is,
that different threads should get different subsets of the index
pp
as it ranges over 0..npp-1
. In this code, pp
is the
voxel index into the input arrays ip
, jp
, and kp
, and into
the output array vv
. The goal of the loop body is to compute
vv[pp]
. Each different value of pp
writes to a different
output location, so there is no conflict possible even if two threads
were executing the same statement at exactly the same time (something
you always have to think about).
For a for
statement to be parallelizable, the number of iterations
must be easily determinable when the loop is started. In this case, it
is obviously npp
. For example, a loop of the form:
for( pp=0 ; pp < npp && vv[pp] != 666.0f ; pp++ ){ ... }
cannot be parallelized by OpenMP since the terminal condition is not
determinable when the loop is started. Similarly, the loop body cannot
contain any break
statements.
This function (and its analogs for NN, cubic, quintic, and wsinc5
interpolation) was pretty easy to adapt to OpenMP. I simply moved all
the local variables into the parallel block, and that was about it.
The only write to a variable visible outside the parallel block is to
vv[pp], and there is obviously no possible thread conflict there. The
only external functions called are in the C library, and these are
pretty much all supposed to be “thread-safe” (the technical term is
re-entrant), unless the man page specifies otherwise.
Note that the parallel for
loop will not be executed in sequential
order of the control variable pp
, even within a single
thread. OpenMP chooses the order of execution, so the externally
visible results (in this example, the vv[pp]
values) should not
depend on the order in which the pp
values are chosen. (It is
possible for the programmer to have some control over the division and
sequence of labor in the different threads, but I’ve not used this
feature of OpenMP, nor do I plan to.)
When parallelizing 3dDespike
, I chose to parallelize the voxel
loop in the main program. The first thing was to identify all the
variables that receive assignments in this loop, and move the
declarations of those that are purely internal to the loop (not
containing output data) from the top of main
down to be inside the
parallel
block. These variables include a number of work arrays
for processing the voxel time series. Each thread gets its own
instance of each of these work arrays, since the malloc
call was
moved inside the parallel
block. I simply had to examine the code
carefully to ensure that every variable that received an assignment
was local to the thread – since this loop wasn’t self-contained
inside a function, the work of scrutinizing the code was a little more
tedious than for the interpolation functions described above. At the
end of the voxel loop, the results are put into the output
dataset. Each voxel calculation and assignment is independent of all
others, so there is no potential thread conflict. However, a couple
more issues arose after I got the code running – these are described
below in Examples #3 and #4.
16.2.6. Example #2 – A Little Bit of Restructuring
My second example comes from program 3dTcorrMap.c
and shows how a
small change to the logic of the program helped OpenMP-ize it. In this
program, the innermost loop is computing the sum (or other
combination) of a lot of calculations. Clearly, when the code is
adding the new result into the accumulating sum, there is the
potential for conflict between two threads executing this summation at
the same time. One way to avoid this conflict is to mark this
statement as being “critical” – to be only executed by one thread at
a time. The other way to avoid this conflict is to modify the code to
put each iteration’s result into a temporary array, and then add the
results up afterwards, outside the parallel
block. The second way is
what I chose to do. Here is the parallelized code:
   float *ccar = (float *)malloc(sizeof(float)*nmask) ; /* temporary array */

   for( ii=0 ; ii < nmask ; ii++ ){          /* outer loop over voxels:       */
                                             /* time series to correlate with */
     xsar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,ii) ) ;  /* ii-th time series */

#pragma omp parallel
 { int vv,uu ; float *ysar ; float qcc ;
#pragma omp for
   for( vv=0 ; vv < nmask ; vv++ ){          /* inner loop over voxels */
     if( vv==ii ){ ccar[vv] = 0.0f ; continue ; }
     ysar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,vv) ) ;
     /** dot products (unrolled by 2 on 29 Apr 2009) **/
     if( isodd ){
       for( qcc=xsar[0]*ysar[0],uu=1 ; uu < ntime ; uu+=2 )
         qcc += xsar[uu]*ysar[uu] + xsar[uu+1]*ysar[uu+1] ;
     } else {
       for( qcc=0.0f,uu=0 ; uu < ntime ; uu+=2 )
         qcc += xsar[uu]*ysar[uu] + xsar[uu+1]*ysar[uu+1] ;
     }
     ccar[vv] = qcc ;   /* save correlation in ccar for later (OpenMP mod) */
   } /* end of inner loop over voxels (vv) */
 } /* end OpenMP */

     /* below here, combine results from ccar[] to get output for voxel #ii */

   } /* end of outer loop over voxels (ii) */
Note that all other variables (besides ccar[]
) on the receiving
end of an assignment inside the parallel
block are local variables
inside that block, and so are private to each thread. Note also that
only the loop over vv
is parallelized – the innermost loops over
uu
run sequentially in each thread. In principle, you can nest
parallel
blocks, but I have not tried this. OpenMP version 2.5
does not require an implementation to support nested parallelism, and
I’ve not bothered to try to use this feature.
The original code for the above fragment read like so:
   for( ii=0 ; ii < nmask ; ii++ ){  /* time series to correlate with */
     xsar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,ii) ) ;
     Tcount = Mcsum = Zcsum = Qcsum = 0.0f ;
     for( jj=0 ; jj < nmask ; jj++ ){  /* loop over other voxels, correlate w/ ii */
       if( jj==ii ) continue ;
       ysar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,jj) ) ;
       /** dot products (unrolled by 2 on 29 Apr 2009) **/
       if( isodd ){
         for( cc=xsar[0]*ysar[0],kk=1 ; kk < ntime ; kk+=2 )
           cc += xsar[kk]*ysar[kk] + xsar[kk+1]*ysar[kk+1] ;
       } else {
         for( cc=0.0f,kk=0 ; kk < ntime ; kk+=2 )
           cc += xsar[kk]*ysar[kk] + xsar[kk+1]*ysar[kk+1] ;
       }
       Mcsum += cc ;
       Zcsum += 0.5f * logf((1.0001f+cc)/(1.0001f-cc));
       Qcsum += cc*cc ;
       if( fabsf(cc) >= Thresh ) Tcount++ ;
     } /* end of loop over jj */
     if( Mar != NULL ) Mar[indx[ii]] = Mcsum / (nmask-1.0f) ;
     if( Zar != NULL ) Zar[indx[ii]] = tanh( Zcsum / (nmask-1.0f) ) ;
     if( Qar != NULL ) Qar[indx[ii]] = sqrt( Qcsum / (nmask-1.0f) ) ;
     if( Tar != NULL ) Tar[indx[ii]] = Tcount ;
   } /* end of loop over ii */
Note the summation into Mcsum
and other variables inside the
jj
loop – these are not thread-safe. Instead of storing the
components of Mcsum
into an array ccar
as done in the
OpenMP-ization above, another way would be to force the OpenMP thread
manager to ensure that only one thread at time updates these
variables. This could be done with the following construction:
#pragma omp critical (TcorrMap)
{ Mcsum += cc ;
Zcsum += 0.5f * logf((1.0001f+cc)/(1.0001f-cc));
Qcsum += cc*cc ;
if( fabsf(cc) >= Thresh ) Tcount++ ;
}
This OpenMP construction would block more than one thread at a time
from entering the structured block that follows the critical
directive. So why didn’t I use this approach to modifying
3dTcorrMap.c
? The answer is simple – I wasn’t really aware of
critical
when I made the changes to the code in the distant past
(3+ weeks ago now). (I probably would have tried that first, but I’m
not going to go back and patch things up just for fun.) Of course,
it’s important not to put too much code inside a critical
block,
or that will slow the program down as threads are forced to wait.
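(For completeness: OpenMP also provides a reduction clause that is designed for exactly this accumulate-into-shared-scalars pattern. The AFNI code above does not use it; the toy function below, with invented names that merely mirror the Mcsum/Qcsum/Tcount sums, just sketches the idea. Note that a reduction adds the partial sums in an order that differs from the serial code, so floating-point results can differ in the last bits, whereas the ccar[] approach reproduces the serial summation order exactly.)

#include <math.h>

/* Toy illustration of the OpenMP reduction clause (not used in 3dTcorrMap):
   each thread accumulates into private copies of the sum variables, and
   OpenMP combines them when the loop ends.                                 */
void accumulate_stats( int n , const float *cc , float thresh ,
                       float *msum , float *qsum , float *tcount )
{
   float Mcsum=0.0f , Qcsum=0.0f , Tcount=0.0f ; int jj ;

#pragma omp parallel for reduction(+:Mcsum,Qcsum,Tcount)
   for( jj=0 ; jj < n ; jj++ ){
     Mcsum += cc[jj] ;
     Qcsum += cc[jj]*cc[jj] ;
     if( fabsf(cc[jj]) >= thresh ) Tcount++ ;
   }

   *msum = Mcsum ; *qsum = Qcsum ; *tcount = Tcount ;
}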
16.2.7. Example #3 – OpenMP and malloc
The C library functions malloc
etc. are re-entrant, so that’s good
– they can be used inside parallel
blocks with no real
worries. However, the tracking wrappers I wrote, mcw_malloc etc., are
not re-entrant, since they use and modify a static structure to keep
track of what’s been allocated. Therefore, any OpenMP-ized function
that might call (even indirectly in another function) an mcw_malloc
function must put this call inside a critical block. This is pretty
annoying. You can find some
examples of this in 3dDespike.c
and other places. However, later
(about 1 day) I realized that this is unneeded. All you need to do is
not turn on mcw_malloc
in OpenMP-enabled main programs. At the top
of many AFNI main programs is code like so:
#ifdef USING_MCW_MALLOC
enable_mcw_malloc() ;
#endif
This code snippet should be replaced with:
#if defined(USING_MCW_MALLOC) && !defined(USE_OMP)
enable_mcw_malloc() ;
#endif
In this way, a program that gets compiled with OpenMP will not turn on
the mcw_malloc
functions, and life will be cool.
Unfortunately, the same issue arises with the NI_malloc
functions
in the NIML library. I’ve not decided what to do about those just
yet. These functions (in sub-directory niml
) are designed to be
compiled without reference to other AFNI code – my (unfulfilled)
dream was that other people might like to use these functions for
inter-process communication and data storage, so I didn’t want them to
be dependent on the complicated forest of AFNI headers, macros, etc.
16.2.8. Example #4 – OpenMP and static variables
Writing to static
variables is a potential thread conflict
problem. When OpenMP-izing 3dDespike
, I found that there were
differences in the output between the old and new versions in a tiny
number of voxels. This was pretty annoying. After some thought, I
traced the problem down to the Fortran-to-C translated function in
cl1.c
. All the local variables in the translated code are declared
static
, since that is the semantics of Fortran-77 (local variable
values persist across function calls, unlike in C). However, the
function doesn’t actually re-use any values from old calls, so I
removed all the static
local variable declarations in this translated
code. The two versions of 3dDespike
now agreed exactly.
Lesson A: when OpenMP-izing, scrutinize all functions being called
for static
variables that might get written into. If you find any,
you’ll have to de-static
the code, or critical
-ize the call
to that function. (In the case of cl1.c
, the latter choice wasn’t
an option, since 3dDespike
spends most of its CPU time in that
function.)
Lesson B: Be sure to run the program with and without OpenMP and
diff
the result files to make sure they are identical. Any differences
need to be investigated, understood, and fixed.
A potential static
problem arises with the function call traceback
macros ENTRY
and RETURN
/EXRETURN
defined in
debugtrace.h
. These macros use a static
stack to keep track of
function calls. Clearly, this stack could get confused with OpenMP. To
‘fix’ this problem, I’ve defined a macro that should be inserted as the
first executable statement in a parallel
block, and another as the
last executable statement; for example:
#pragma omp parallel if(npp > 9999)
{
int fred ; /* other declarations ... */
AFNI_OMP_START ;
/* parallel code ... */
AFNI_OMP_END ;
} /* end OpenMP */
At present, all these macros do is stop and re-start (respectively)
the RETURN
/EXRETURN
stack operations. In the future, they
may do more things, which is why I wrote them as macros, to be more
flexible.
16.2.9. Example #5 – OpenMP and random numbers – AlphaSim
A problem arose when I tried to OpenMP-ize the AlphaSim
program. This code generates a lot of random 3D volumes and processes
them to get some statistics. As it turns out, it spends most of its time
(80+%) generating the random numbers. The routine that is used boils
down to the C library function drand48()
and its
ilk. Parallelizing across instances of the volumes seems pretty
straightforward – but the program becomes much slower. After some
playing around, I found the reason: the drand48()
set of functions
uses an internal “state” to generate the sequence of random variates
– and this state data is static. To be thread-safe, these library
functions simply block on re-entrancy – that is, they are all
critical
. Not very good for speedup!
But there is a solution that doesn’t involve writing my own
thread-safe random number generator (which was my first thought, and
which gave me a headache). Instead of drand48()
to get a
U(0,1)
random variate, I used the function erand48(unsigned
short x[3])
, where the random generation state is stored in the
array x[]
. I create a separate state array for each thread (at the
start of the parallel
block, before the parallelized for
), and
initialize them separately (so each thread gets a different sequence
of random variates!). This solves the problem, and the program now
runs with nearly perfect speedup. For the tedious details, see the
AlphaSim.c
source code – including the file zgaussian.c
,
where the Gaussian N(0,1)
random number generator resides.
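A heavily simplified sketch of that idea (a hypothetical function, much shorter than the real AlphaSim.c code): each thread gets its own erand48() state array, seeded differently, so the threads draw independent random sequences.

#include <stdlib.h>
#include <omp.h>

/* Fill vol[nvox] with uniform random numbers, one independent stream per thread. */
void random_volume( int nvox , float *vol , long seed )
{
#pragma omp parallel
 { unsigned short xran[3] ;                  /* private generator state for this thread */
   int qq , tid = omp_get_thread_num() ;
   xran[0] = (unsigned short)(seed      ) ;  /* seed each thread differently, so the */
   xran[1] = (unsigned short)(seed >> 16) ;  /* threads produce different sequences  */
   xran[2] = (unsigned short)(1013 + 31*tid) ;
#pragma omp for
   for( qq=0 ; qq < nvox ; qq++ ) vol[qq] = (float)erand48(xran) ;
 }
}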
16.2.10. Example #6 – OpenMP and malloc() and memcpy() – 3dREMLfit
Parallelizing 3dREMLfit
effectively and correctly was much harder
than the programs above. The first problem was where to parallelize. I
tried various places, but none of them gave much speedup – in some
cases, OpenMP slowed the program down significantly. After a long
time, I finally thought of using Apple’s Shark profiler program, and
figured out that the program was spending a vast amount of time
spinning threads waiting for malloc
and free
to do their
work. Aha! The big CPU time sink in the “REML voxel loop”,
REML_func(), allocates and frees workspace arrays. But this
function is called hundreds of times per voxel, so the threads were
interfering with each other – because the system memory allocator is
thread-safe simply by blocking – it is not parallelized itself. I
didn’t want to install a parallelized version of malloc
, so I
rearranged the code to provide the workspace from outside, so that it
wouldn’t constantly be created and destroyed. Speedup!!!
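The general shape of that change (a schematic sketch with invented names, not the actual 3dREMLfit code): the per-thread workspace is allocated once per thread, outside the hot per-voxel function.

#include <stdlib.h>

/* Hypothetical stand-in for the per-voxel computation, taking its
   workspace as an argument instead of allocating it internally.     */
static double fit_one_voxel( const float *ts , int ntime , double *workspace )
{
   int ii ; double s = 0.0 ;
   for( ii=0 ; ii < ntime ; ii++ ){ workspace[ii] = ts[ii] ; s += workspace[ii] ; }
   return s / ntime ;
}

void fit_all_voxels( int nvox , int ntime , float **ts , double *result )
{
#pragma omp parallel
 { double *ws = (double *)malloc(sizeof(double)*ntime) ;  /* one workspace per thread, */
   int vv ;                                               /* allocated exactly once    */
#pragma omp for
   for( vv=0 ; vv < nvox ; vv++ )
     result[vv] = fit_one_voxel( ts[vv] , ntime , ws ) ;  /* disjoint writes to result */
   free(ws) ;
 }
}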
… But … In a sample run with about 133K voxels, about 40-100
voxels did not agree with the single-threaded case. Arrrgghhh!!!
After printing out vast amounts of information from each thread’s loop
through the voxels, I finally found that these traitorous voxels in
fact had the wrong data passed to REML_func()
. This data time
series vector was copied using memcpy
. For whatever reason, this
wasn’t completely thread-safe – I don’t know why, and can’t find any
other info about this on the Web. But by critical
-izing this
statement, the problems vanished. Simple .. after 2.5 days of
frustration, that is.
Other minor details in parallelizing this program: the -usetemp
option writes intermediate data to disk and then re-reads it later,
with the assumption that things are written in voxel order. This won’t
be true for OpenMP, so a special simpler loop, with all the
-usetemp
stuff deleted, was written for the parallel section of
the code. This change makes it easier to deal with the
parallelization, but of course means that any changes to the “REML
voxel loop” must be made twice in the future, since there are now two
different versions of this loop’s code extant. C’est la soft-guerre.
16.2.11. Example #7 – Looping over voxels in 3dAutoTcorrelate
This program correlates the time series from all voxels inside a mask
with all other voxels (inside the same mask, or all voxels). In the
original sequential code, the outer loop over voxel index ii
just
tested the voxel index to see if it was in the mask – if not, then it
just continue
-d to the next voxel. But in OpenMP, this means
that some parallel sub-loops get more work than others, if they happen
to fall into a denser part of the mask. Speedup was not all it should
be. So by pre-making an index table imap
of voxel indexes in the
mask, I could just loop over that table and be sure each loop
iteration was doing the same amount of labor.
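Schematically, the change looks like this (a sketch with invented names, not the actual 3dAutoTcorrelate code):

#include <stdlib.h>

/* Build a table of the voxel indexes that are inside the mask, then loop over
   the table, so every parallel iteration does roughly the same amount of work. */
void process_masked( int nvox , const unsigned char *mask ,
                     void (*do_voxel)(int) )
{
   int ii , nmask = 0 ;
   int *imap = (int *)malloc(sizeof(int)*nvox) ;
   for( ii=0 ; ii < nvox ; ii++ )
     if( mask[ii] ) imap[nmask++] = ii ;    /* serial: build the index table */

#pragma omp parallel
 { int qq ;
#pragma omp for
   for( qq=0 ; qq < nmask ; qq++ )          /* every iteration is real work */
     do_voxel( imap[qq] ) ;
 }
   free(imap) ;
}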
Another issue that arose, that had a bigger impact, was the fact that
extracting the time series from the space-then-time ordered input
dataset was inefficient – since each voxel time series would be
extracted many many times as the inner voxel loop was
traversed. Inverting the input dataset into an MRI_vectim
struct
(time-then-space ordered) provided a big speedup, making memory
accesses much more local. Also, each time series could be
pre-detrended, avoiding repeating this operation many many times.
With these 2 changes, 3dAutoTcorrelate
became much faster, and in
fact it now spends a significant fraction of its time just writing the
lengthy results to disk. I suppose this would be the next target for
speedup – probably by writing each completed sub-brick to the output
*.BRIK
file via mmap()
when it is completed.
16.2.12. OpenMP Library Functions
I’ve only used a couple of these – see the specification for the real scoop.
omp_get_num_procs() returns the number of processors OpenMP thinks is available – see the PRINT_AFNI_OMP_USAGE macro in mrilib.h, for example.
omp_get_max_threads() returns the maximum number of threads that are allowed to be used – usually, the value from OMP_NUM_THREADS or the number of CPUs on the system.
omp_get_num_threads() returns the number of threads executing in the current parallel block – see AlphaSim.c for an example of how this can be used to allocate workspaces, combined with the OpenMP master and barrier pragmas.
omp_get_thread_num() returns the thread number of the executing thread (counting starts at 0 for the “master thread”) – this is also used in AlphaSim.c to sum output values into the thread-specific workspaces (to avoid collisions between threads).
These latter 2 functions should only be used inside parallel blocks, since they don’t have much utility outside a parallelized piece of code.
See the OpenMP 2.5 specification for more details and a complete list of all such functions, etc.
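A toy sketch of the workspace trick that the last two functions enable (loosely modeled on the AlphaSim.c idea, with invented names): the master thread allocates one workspace per thread, an explicit barrier makes the allocation visible to everyone, and each thread then accumulates into its own workspace.

#include <stdlib.h>
#include <omp.h>

/* Accumulate per-thread partial histograms, then combine them serially. */
void histogram_omp( int niter , int nbin , int *total )
{
   int nthr = 1 , tt , bb ;
   int **thrcount = NULL ;          /* shared: table of per-thread workspaces */

#pragma omp parallel
 {
#pragma omp master                   /* only the master thread allocates */
   { int mm ;
     nthr = omp_get_num_threads() ;
     thrcount = (int **)malloc(sizeof(int *)*nthr) ;
     for( mm=0 ; mm < nthr ; mm++ ) thrcount[mm] = (int *)calloc(nbin,sizeof(int)) ;
   }
#pragma omp barrier                  /* master has no implied barrier, so wait here */
   { int qq , *mine = thrcount[ omp_get_thread_num() ] ;  /* this thread's workspace */
#pragma omp for
     for( qq=0 ; qq < niter ; qq++ ) mine[qq%nbin]++ ;    /* stand-in for real work  */
   }
 }

   for( bb=0 ; bb < nbin ; bb++ ) total[bb] = 0 ;          /* serial: combine results */
   for( tt=0 ; tt < nthr ; tt++ ){
     for( bb=0 ; bb < nbin ; bb++ ) total[bb] += thrcount[tt][bb] ;
     free(thrcount[tt]) ;
   }
   free(thrcount) ;
}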
16.2.13. Printing -help for OpenMP-ized programs
In mrilib.h
, the macro PRINT_AFNI_OMP_USAGE(pnam,extra)
is
defined. When USE_OMP is defined (i.e., on the compile command line
via CCOMP
), then this macro prints out some extra help relevant to
OpenMP usage. The argument pnam
is intended to be the program
name, and the argument extra
is any extra text to be put at the
end of the standard boilerplate (extra
can be NULL
). If
USE_OMP is not defined, then PRINT_AFNI_OMP_USAGE(pnam,extra)
prints a message that the program is OpenMP-compatible, but this
binary copy is not compiled with OpenMP. Thus, this macro is designed
to be inserted at the end of the -help
output section in any
OpenMP-ized program (cf. AlphaSim.c
).