16.2. Elementary OpenMP in AFNI
16.2.1. Overview
These are notes by Bob Cox (AKA Zhark the Parallelizer) from May, 2009.
Several programs in AFNI are written using OpenMP, which provides support for parallelization of tasks. Some of the programs that use OpenMP are:
3dAllineate, 3dAutoTcorrelate, 3dBandpass, 3dBlurToFWHM, 3dClustSim,
3dDegreeCentrality, 3dDWUncert, 3dECM, 3dFWHMx, 3dGroupInCorr,
3dLFCD, 3dLocalACF, 3dLocalHistog, 3dLocalPV, 3dLocalstat,
3dMSE, 3dNwarpAdjust, 3dNwarpApply, 3dNwarpCalc, 3dNwarpCat,
3dNwarpFuncs, 3dNwarpXYZ, 3dQwarp, 3dREMLfit, 3dTcorr1D,
3dTcorrMap, 3dUnifize, 3dXClustSim, …
The above is not an exhaustive list (though I sure am tired of reading it).
Note that there are some other scripts or programs in AFNI that wrap
around one or more of the above programs (e.g., 3dQwarp is used within
auto_warp.py, @SSwarper and @animal_warper, among others).
16.2.2. Introduction
In the last couple months, I have explored using the parallelization
toolkit OpenMP in AFNI. OpenMP is built in to several compilers,
including gcc-4.2. OpenMP comprises a set of #pragmas, functions, etc.,
that are used to control parallel threads on a shared-memory multi-CPU
system (e.g., my octo-core Mac).
OpenMP should not be confused with Open MPI, which is a completely different piece of software for parallelization! MPI (= Message Passing Interface) lets you distribute work across multiple CPUs and multiple systems by explicitly passing messages back and forth between processes/threads. MPI is more flexible than OpenMP in that it allows a whole network of computers to be harnessed into one job.
MPI is also harder to use than OpenMP, in that the programmer has to manage the inter-thread data flow very explicitly. In OpenMP, the data model is that some data is explicitly in shared memory, visible to be read and written by each thread by normal C statements, and some data is explicitly private to each thread, and when modified in one thread that change will not be visible to the other threads. In OpenMP, the principal burden on the programmer is to make sure that the shared data is written to in a coherent manner by the different threads. In MPI, the burden on the programmer is to explicitly determine when data held in one thread is communicated to another thread. Converting an existing program to use OpenMP is generally much simpler than making it work with MPI.
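As a toy illustration of that data model (this is not from any AFNI program): in the sketch below, the array data[] is shared and read by every thread, each thread's partial sum is private because it is declared inside the parallel block, and the single write to the shared total is guarded by a critical directive. Compile with the OpenMP flag (e.g., -fopenmp for gcc).

#include <stdio.h>
#include <omp.h>

#define NVAL 1000000

static float data[NVAL] ;  /* file scope, hence shared: every thread reads it */

int main(void)
{
   double total = 0.0 ;    /* shared: written only inside the critical block */
   int ii ;

   for( ii=0 ; ii < NVAL ; ii++ ) data[ii] = 1.0f ;

#pragma omp parallel
 { int jj ; double mysum = 0.0 ;  /* declared inside the block, hence private */
#pragma omp for
   for( jj=0 ; jj < NVAL ; jj++ ) mysum += data[jj] ;  /* reading shared data is safe */
#pragma omp critical
   { total += mysum ; }           /* writing shared data must be coordinated */
 }

   printf("total = %g using at most %d threads\n" , total , omp_get_max_threads() ) ;
   return 0 ;
}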
16.2.3. Memory Concepts
The two most important things to realize about OpenMP’s parallel model for computation are
OpenMP is shared memory computing; unless you take care to make certain variables and arrays private to each thread, all the data is shared between threads.
You are responsible for managing the access to shared memory so that conflicts don’t occur (e.g., two threads don’t try to write to the same shared variable at the same time). The compiler does not even attempt to help you with analyzing the code to detect this potential problem.
In particular, if your parallel code is writing results into a shared memory array (as is almost certain to be happening so that you can get some results out), then you have to make sure that different threads won’t be writing to the same elements of the array at the same time, where one thread potentially destroys the results from another thread.
If your parallel code is summing up some value (say), you have to make sure that different threads don’t conflict when they modify this value.
If your parallel code is using some temporary variable or array, you probably want to make sure that each thread gets a separate copy of that object.
None of these things are hard, once you think about the concepts of parallel access to shared memory for a while, and realize the potential for conflicts between threads trying to update the same location more or less at the same instant. In particular, assigning to static variables is a real potential problem – and this might occur inside some function call that you haven’t looked at. (Hint: look at the functions you call from inside a parallelized part of your code!)
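A small sketch of those three points (an invented function, not AFNI code): each thread malloc-s its own scratch array, and the threads write only to disjoint elements of the shared output array.

#include <stdlib.h>

/* Copy each column of in[] to out[] through a per-thread scratch buffer;
   the real work would happen where the scratch buffer is processed.       */
void process_columns( int ncol , int nrow , const float *in , float *out )
{
#pragma omp parallel
 { float *work = (float *)malloc(sizeof(float)*nrow) ;  /* private scratch: one per thread */
   int cc , rr ;
#pragma omp for
   for( cc=0 ; cc < ncol ; cc++ ){                      /* threads get disjoint sets of cc */
     for( rr=0 ; rr < nrow ; rr++ ) work[rr] = in[cc*nrow+rr] ;
     /* ... per-column computation on work[] would go here ... */
     for( rr=0 ; rr < nrow ; rr++ ) out[cc*nrow+rr] = work[rr] ;  /* disjoint writes: no conflict */
   }
   free(work) ;                                         /* each thread frees its own scratch */
 }
}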
16.2.4. Compiling to use OpenMP
In your AFNI Makefile, you have to set the variable OMPFLAG to the
value of the flag(s) that the compiler/linker needs for dealing with
OpenMP, plus you have to enable OpenMP inside AFNI code by defining
the C macro USE_OMP. For gcc-4.2, my Makefile contains the single
extra line:
OMPFLAG = -fopenmp -DUSE_OMP
Some AFNI Makefile.*-s now contain definitions of OMPFLAG so that the binary builds include some parallelized programs. At this moment, these programs are:
Program : OpenMP use
3dBlurInMask : parallelized across sub-bricks
3dTcorrMap : parallelized across the inner voxel loop
3dDespike : parallelized across voxels
3dAllineate : parallelized across voxels in the interpolation functions (Example #1 below)
AlphaSim : parallelized across simulated 3D volumes
3dREMLfit : parallelized across voxels in the REML estimation loop, and across ARMA(1,1) (a,b) parameter pairs in the REML matrix setup loop
3dLocalPV : parallelized across voxels
3dLocalstat : parallelized across voxels
3dBandpass : the …
3dGroupInCorr : computations of correlations are parallelized across time series datasets; computations of t-tests are parallelized across voxels
3dTcorr1D : computations are parallelized across columns of the input 1D file
3dClustSim : (the new and improved version of AlphaSim, which has been deprecated for a long time) parallelized across simulated 3D volumes, and with fewer …
3dAutoTcorrelate : parallelized across the outer voxel loop; to get any decent speedup required converting the input dataset to a ‘vectim’ struct (time order first rather than last): otherwise, thrashing through the input dataset time points over and over was grossly slow.
In each case, I chose to invoke OpenMP at the simplest place that did
a lot of work that was independent between components - voxels or
sub-bricks. In the case of 3dBlurInMask, parallelizing across voxels
would be quite difficult, due to the structure of the algorithm - but
parallelizing across sub-bricks was essentially trivial, since the
blurring algorithm is completely independent for each volume of data.
All of the above programs, except for 3dGroupInCorr [Dec 2009] and
3dClustSim [Jul 2010], were originally written as serial programs,
and later converted to OpenMP. Some of the problems arising in these
conversions are outlined below. It is much easier to write a program
from scratch to use OpenMP than to retrofit it later!
In AFNI’s Makefile.INCLUDE, files that are to be compiled/linked with
OpenMP use CCOMP
instead of CC
or CCFAST
for the
compilation/linking. You’ll have to add an extra make rule for each
file that needs this special handling. NB: Any program that calls
an OpenMP-enabled function, even if the main program knows nothing
about OpenMP, must be linked with CCOMP
to get the proper libraries
included.
At the top of any C source file that’s going to use OpenMP #pragmas
or function calls, you should put a code block like so:
#ifdef USE_OMP
#include <omp.h>
#endif
By default, OpenMP will use all CPUs available in any parallel block
of code. This behavior can be changed by setting the environment
variable OMP_NUM_THREADS
to some smaller integer value. (There are
also OpenMP library functions that let you control this from within a
program, but I’ve not used these.)
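For instance, a trivial test program (not part of AFNI) can report what OpenMP will actually use; build it with the OpenMP flag and try different OMP_NUM_THREADS settings:

#include <stdio.h>
#ifdef USE_OMP
#include <omp.h>
#endif

int main(void)
{
#ifdef USE_OMP
   printf("CPUs available      = %d\n", omp_get_num_procs()   ) ;
   printf("max threads allowed = %d\n", omp_get_max_threads() ) ;  /* honors OMP_NUM_THREADS */
#else
   printf("not compiled with OpenMP\n") ;
#endif
   return 0 ;
}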
16.2.5. Example #1 – A basic case
My first example is a function from mri_genalign_util.c. The function
below does linear interpolation from the input image fim at npp output
points whose index coordinates are given in input arrays ip, jp, and
kp, storing the results into user-allocated array vv. (Rather than
make up a trivial sample case, I’m showing a real piece of code that’s
used in 3dAllineate.) The two OpenMP directives are shown in the lines
starting with #pragma omp ..:
#define FAR(i,j,k) far[(i)+(j)*nx+(k)*nxy]
#define CLIP(mm,nn) if(mm < 0)mm=0; else if(mm > nn)mm=nn
void GA_interp_linear( MRI_IMAGE *fim ,
int npp, float *ip, float *jp, float *kp, float *vv )
{
ENTRY("GA_interp_linear") ;
#pragma omp parallel if(npp > 9999)
 {
   int nx=fim->nx , ny=fim->ny , nz=fim->nz , nxy=nx*ny , pp ;
   float nxh=nx-0.501f , nyh=ny-0.501f , nzh=nz-0.501f , xx,yy,zz ;
   float fx,fy,fz ;
   float *far = MRI_FLOAT_PTR(fim) ;
   int nx1=nx-1,ny1=ny-1,nz1=nz-1 ;
   float ix,jy,kz ;
   int ix_00,ix_p1 ;    /* interpolation indices */
   int jy_00,jy_p1 ;
   int kz_00,kz_p1 ;
   float wt_00,wt_p1 ;  /* interpolation weights */
   float f_j00_k00, f_jp1_k00, f_j00_kp1, f_jp1_kp1, f_k00, f_kp1 ;

#pragma omp for
   for( pp=0 ; pp < npp ; pp++ ){
     xx = ip[pp] ; if( xx < -0.499f || xx > nxh ){ vv[pp]=outval; continue; }
     yy = jp[pp] ; if( yy < -0.499f || yy > nyh ){ vv[pp]=outval; continue; }
     zz = kp[pp] ; if( zz < -0.499f || zz > nzh ){ vv[pp]=outval; continue; }

     ix = floorf(xx) ; fx = xx - ix ;   /* integer and       */
     jy = floorf(yy) ; fy = yy - jy ;   /* fractional coords */
     kz = floorf(zz) ; fz = zz - kz ;

     ix_00 = ix ; ix_p1 = ix_00+1 ; CLIP(ix_00,nx1) ; CLIP(ix_p1,nx1) ;
     jy_00 = jy ; jy_p1 = jy_00+1 ; CLIP(jy_00,ny1) ; CLIP(jy_p1,ny1) ;
     kz_00 = kz ; kz_p1 = kz_00+1 ; CLIP(kz_00,nz1) ; CLIP(kz_p1,nz1) ;

     wt_00 = 1.0f-fx ; wt_p1 = fx ;  /* weights for ix_00 and ix_p1 points */

#undef  XINT
#define XINT(j,k) wt_00*FAR(ix_00,j,k)+wt_p1*FAR(ix_p1,j,k)

     /* interpolate to location ix+fx at each jy,kz level */

     f_j00_k00 = XINT(jy_00,kz_00) ; f_jp1_k00 = XINT(jy_p1,kz_00) ;
     f_j00_kp1 = XINT(jy_00,kz_p1) ; f_jp1_kp1 = XINT(jy_p1,kz_p1) ;

     /* interpolate to jy+fy at each kz level */

     wt_00 = 1.0f-fy ; wt_p1 = fy ;
     f_k00 = wt_00 * f_j00_k00 + wt_p1 * f_jp1_k00 ;
     f_kp1 = wt_00 * f_j00_kp1 + wt_p1 * f_jp1_kp1 ;

     /* interpolate to kz+fz to get output */

     vv[pp] = (1.0f-fz) * f_k00 + fz * f_kp1 ;
   }
 } /* end OpenMP */
EXRETURN ;
}
The directive:
#pragma omp parallel if(npp > 9999)
is at the head of a “structured block” of code that will be executed in parallel. (A “structured block” in OpenMP-lingo means a C {…} block that contains no way out (e.g., no return) except to fall through the bottom.) You should imagine that all the code inside this block will be executed in parallel on multiple CPUs – even code that does exactly the same thing. To get different things done on different CPUs, we need the second directive, that will specify the “work-sharing”.
In the above code, I’ve declared all the internal variables used in the function inside the parallel block. This means that these variables are all private to each thread. Assignments to any one of these in one thread will have no impact on the other threads. Declaring variables like this is the easy way to make sure they are thread-private and won’t accidentally conflict. It is also possible to declare outside variables to be thread-private in the parallel #pragma, but I’d rather skip that – doing it with private declarations, as above, is simpler to think about and to program. Thus, for example, the pointer:
float *far = MRI_FLOAT_PTR(fim) ;
will have N identical copies spread around amongst the N threads. This is slightly inefficient with respect to memory usage – since far is never changed after the initial assignment, it could be a shared variable declared and initialized outside the parallel block – but unless the amount of memory duplication is huge, the rule:
For most variables used and assigned to in the parallel block,
declare them inside the parallel block
is the simplest to code with. The only exception would be variables
whose values you wish to preserve when the parallel
block ends and
normal (sequential) program execution resumes – typically, these
variables are output arrays.
Note that the if(npp > 9999)
part of the parallel
directive
means that the code will actually only be parallelized if the number
of points to interpolate at one shot is 10,000 or more. There is no
point in parallelizing at too fine a level – the thread startup and
management overhead will be too large to get any net program speedup
from the OpenMP-ization. (You might well ask where I got the number
9999 from – the answer is that I just made it up – I didn’t actually
test the function to see where the breakeven point for npp
might
lie.)
The directive:
#pragma omp for
indicates that the next for statement should be parallelized across
the threads that were started with the parallel directive – that is,
that different threads should get different subsets of the index
pp
as it ranges over 0..npp-1
. In this code, pp
is the
voxel index into the input arrays ip
, jp
, and kp
, and into
the output array vv
. The goal of the loop body is to compute
vv[pp]
. Each different value of pp
writes to a different
output location, so there is no conflict possible even if two threads
were executing the same statement at exactly the same time (something
you always have to think about).
For a for
statement to be parallelizable, the number of iterations
must be easily determinable when the loop is started. In this case, it
is obviously npp
. For example, a loop of the form:
for( pp=0 ; pp < npp && vv[pp] != 666.0f ; pp++ ){ ... }
cannot be parallelized by OpenMP since the terminal condition is not
determinable when the loop is started. Similarly, the loop body cannot
contain any break
statements.
This function (and its analogs for NN, cubic, quintic, and wsinc5
interpolation) was pretty easy to adapt to OpenMP. I simply moved all
the local variables into the parallel block, and that was about it.
The only write to a variable visible outside the parallel block is to
vv[pp], and there is obviously no possible thread conflict there. The
only external functions called are in the C library, and these are
pretty much all supposed to be “thread-safe” (the technical term is
re-entrant), unless the man page specifies otherwise.
Note that the parallel for
loop will not be executed in sequential
order of the control variable pp
, even within a single
thread. OpenMP chooses the order of execution, so the externally
visible results (in this example, the vv[pp]
values) should not
depend on the order in which the pp
values are chosen. (It is
possible for the programmer to have some control over the division and
sequence of labor in the different threads, but I’ve not used this
feature of OpenMP, nor do I plan to.)
When parallelizing 3dDespike
, I chose to parallelize the voxel
loop in the main program. The first thing was to identify all the
variables that receive assignments in this loop, and move the
declarations of those that are purely internal to the loop (not
containing output data) from the top of main
down to be inside the
parallel
block. These variables include a number of work arrays
for processing the voxel time series. Each thread gets its own
instance of each of these work arrays, since the malloc
call was
moved inside the parallel
block. I simply had to examine the code
carefully to ensure that every variable that received an assignment
was local to the thread – since this loop wasn’t self-contained
inside a function, the work of scrutinizing the code was a little more
tedious than for the interpolation functions described above. At the
end of the voxel loop, the results are put into the output
dataset. Each voxel calculation and assignment is independent of all
others, so there is no potential thread conflict. However, a couple
more issues arose after I got the code running – these are described
below in Examples #3 and #4.
16.2.6. Example #2 – A Little Bit of Restructuring
My second example comes from program 3dTcorrMap.c
and shows how a
small change to the logic of the program helped OpenMP-ize it. In this
program, the innermost loop is computing the sum (or other
combination) of a lot of calculations. Clearly, when the code is
adding the new result into the accumulating sum, there is the
potential for conflict between two threads executing this summation at
the same time. One way to avoid this conflict is to mark this
statement as being “critical” – to be only executed by one thread at
a time. The other way to avoid this conflict is to modify the code to
put each iteration’s result into a temporary array, and then add the
results up afterwards, outside the parallel
block. The second way is
what I chose to do. Here is the parallelized code:
   float *ccar = (float *)malloc(sizeof(float)*nmask) ; /* temporary array */

   for( ii=0 ; ii < nmask ; ii++ ){          /* outer loop over voxels:       */
                                             /* time series to correlate with */
     xsar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,ii) ) ;  /* ii-th time series */

#pragma omp parallel
 { int vv,uu ; float *ysar ; float qcc ;
#pragma omp for
   for( vv=0 ; vv < nmask ; vv++ ){          /* inner loop over voxels */
     if( vv==ii ){ ccar[vv] = 0.0f ; continue ; }
     ysar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,vv) ) ;
     /** dot products (unrolled by 2 on 29 Apr 2009) **/
     if( isodd ){
       for( qcc=xsar[0]*ysar[0],uu=1 ; uu < ntime ; uu+=2 )
         qcc += xsar[uu]*ysar[uu] + xsar[uu+1]*ysar[uu+1] ;
     } else {
       for( qcc=0.0f,uu=0 ; uu < ntime ; uu+=2 )
         qcc += xsar[uu]*ysar[uu] + xsar[uu+1]*ysar[uu+1] ;
     }
     ccar[vv] = qcc ;   /* save correlation in ccar for later (OpenMP mod) */
   } /* end of inner loop over voxels (vv) */
 } /* end OpenMP */

     /* below here, combine results from ccar[] to get output for voxel #ii */

   } /* end of outer loop over voxels (ii) */
Note that all other variables (besides ccar[]
) on the receiving
end of an assignment inside the parallel
block are local variables
inside that block, and so are private to each thread. Note also that
only the loop over vv
is parallelized – the innermost loops over
uu
run sequentially in each thread. In principle, you can nest
parallel
blocks, but I have not tried this. OpenMP version 2.5
does not require an implementation to support nested parallelism, and
I’ve not bothered to try to use this feature.
The original code for the above fragment read like so:
   for( ii=0 ; ii < nmask ; ii++ ){  /* time series to correlate with */
     xsar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,ii) ) ;
     Tcount = Mcsum = Zcsum = Qcsum = 0.0f ;
     for( jj=0 ; jj < nmask ; jj++ ){  /* loop over other voxels, correlate w/ ii */
       if( jj==ii ) continue ;
       ysar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,jj) ) ;
       /** dot products (unrolled by 2 on 29 Apr 2009) **/
       if( isodd ){
         for( cc=xsar[0]*ysar[0],kk=1 ; kk < ntime ; kk+=2 )
           cc += xsar[kk]*ysar[kk] + xsar[kk+1]*ysar[kk+1] ;
       } else {
         for( cc=0.0f,kk=0 ; kk < ntime ; kk+=2 )
           cc += xsar[kk]*ysar[kk] + xsar[kk+1]*ysar[kk+1] ;
       }
       Mcsum += cc ;
       Zcsum += 0.5f * logf((1.0001f+cc)/(1.0001f-cc));
       Qcsum += cc*cc ;
       if( fabsf(cc) >= Thresh ) Tcount++ ;
     } /* end of loop over jj */
     if( Mar != NULL ) Mar[indx[ii]] = Mcsum / (nmask-1.0f) ;
     if( Zar != NULL ) Zar[indx[ii]] = tanh( Zcsum / (nmask-1.0f) ) ;
     if( Qar != NULL ) Qar[indx[ii]] = sqrt( Qcsum / (nmask-1.0f) ) ;
     if( Tar != NULL ) Tar[indx[ii]] = Tcount ;
   } /* end of loop over ii */
Note the summation into Mcsum
and other variables inside the
jj
loop – these are not thread-safe. Instead of storing the
components of Mcsum
into an array ccar
as done in the
OpenMP-ization above, another way would be to force the OpenMP thread
manager to ensure that only one thread at time updates these
variables. This could be done with the following construction:
#pragma omp critical (TcorrMap)
{ Mcsum += cc ;
Zcsum += 0.5f * logf((1.0001f+cc)/(1.0001f-cc));
Qcsum += cc*cc ;
if( fabsf(cc) >= Thresh ) Tcount++ ;
}
This OpenMP construction would block more than one thread at a time
from entering the structured block that follows the critical
directive. So why didn’t I use this approach to modifying
3dTcorrMap.c
? The answer is simple – I wasn’t really aware of
critical
when I made the changes to the code in the distant past
(3+ weeks ago now). (I probably would have tried that first, but I’m
not going to go back and patch things up just for fun.) Of course,
it’s important not to put too much code inside a critical
block,
or that will slow the program down as threads are forced to wait.
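(For completeness: OpenMP also provides a reduction clause that is designed for exactly this accumulate-into-shared-scalars pattern. The AFNI code above does not use it; the toy function below, with invented names that merely mirror the Mcsum/Qcsum/Tcount sums, just sketches the idea. Note that a reduction adds the partial sums in an order that differs from the serial code, so floating-point results can differ in the last bits, whereas the ccar[] approach reproduces the serial summation order exactly.)

#include <math.h>

/* Toy illustration of the OpenMP reduction clause (not used in 3dTcorrMap):
   each thread accumulates into private copies of the sum variables, and
   OpenMP combines them when the loop ends.                                 */
void accumulate_stats( int n , const float *cc , float thresh ,
                       float *msum , float *qsum , float *tcount )
{
   float Mcsum=0.0f , Qcsum=0.0f , Tcount=0.0f ; int jj ;

#pragma omp parallel for reduction(+:Mcsum,Qcsum,Tcount)
   for( jj=0 ; jj < n ; jj++ ){
     Mcsum += cc[jj] ;
     Qcsum += cc[jj]*cc[jj] ;
     if( fabsf(cc[jj]) >= thresh ) Tcount++ ;
   }

   *msum = Mcsum ; *qsum = Qcsum ; *tcount = Tcount ;
}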
16.2.7. Example #3 – OpenMP and malloc
The C library functions malloc
etc. are re-entrant, so that’s good
– they can be used inside parallel
blocks with no real
worries. However, the tracking wrappers I wrote, mcw_malloc etc., are
not re-entrant, since they use and modify a static structure to keep
track of what’s been allocated. Therefore, any OpenMP-ized function
that might call (even indirectly in another function) an mcw_malloc
function must put this call inside a critical block. This is pretty
annoying. You can find some
examples of this in 3dDespike.c
and other places. However, later
(about 1 day) I realized that this is unneeded. All you need to do is
not turn on mcw_malloc
in OpenMP-enabled main programs. At the top
of many AFNI main programs is code like so:
#ifdef USING_MCW_MALLOC
enable_mcw_malloc() ;
#endif
This code snippet should be replaced with:
#if defined(USING_MCW_MALLOC) && !defined(USE_OMP)
enable_mcw_malloc() ;
#endif
In this way, a program that gets compiled with OpenMP will not turn on
the mcw_malloc
functions, and life will be cool.
Unfortunately, the same issue arises with the NI_malloc
functions
in the NIML library. I’ve not decided what to do about those just
yet. These functions (in sub-directory niml
) are designed to be
compiled without reference to other AFNI code – my (unfulfilled)
dream was that other people might like to use these functions for
inter-process communication and data storage, so I didn’t want them to
be dependent on the complicated forest of AFNI headers, macros, etc.
16.2.8. Example #4 – OpenMP and static variables
Writing to static
variables is a potential thread conflict
problem. When OpenMP-izing 3dDespike
, I found that there were
differences in the output between the old and new versions in a tiny
number of voxels. This was pretty annoying. After some thought, I
traced the problem down to the Fortran-to-C translated function in
cl1.c
. All the local variables in the translated code are declared
static
, since that is the semantics of Fortran-77 (local variable
values persist across function calls, unlike in C). However, the
function doesn’t actually re-use any values from old calls, so I
removed all the static
local variable declarations in this translated
code. The two versions of 3dDespike
now agreed exactly.
Lesson A: when OpenMP-izing, scrutinize all functions being called
for static
variables that might get written into. If you find any,
you’ll have to de-static
the code, or critical
-ize the call
to that function. (In the case of cl1.c
, the latter choice wasn’t
an option, since 3dDespike
spends most of its CPU time in that
function.)
Lesson B: Be sure to run the program with and without OpenMP and
diff
the result files to make sure they are identical. Any differences
need to be investigated, understood, and fixed.
A potential static
problem arises with the function call traceback
macros ENTRY
and RETURN
/EXRETURN
defined in
debugtrace.h
. These macros use a static
stack to keep track of
function calls. Clearly, this stack could get confused with OpenMP. To
‘fix’ this problem, I’ve defined a macro that should be inserted as the
first executable statement in a parallel
block, and another as the
last executable statement; for example:
#pragma omp parallel if(npp > 9999)
{
int fred ; /* other declarations ... */
AFNI_OMP_START ;
/* parallel code ... */
AFNI_OMP_END ;
} /* end OpenMP */
At present, all these macros do is stop and re-start (respectively)
the RETURN
/EXRETURN
stack operations. In the future, they
may do more things, which is why I wrote them as macros, to be more
flexible.
16.2.9. Example #5 – OpenMP and random numbers – AlphaSim
A problem arose when I tried to OpenMP-ize the AlphaSim
program. This code generates a lot of random 3D volumes and processes
them to get some statistics. As it turns out, it spends most of its time
(80+%) generating the random numbers. The routine that is used boils
down to the C library function drand48()
and its
ilk. Parallelizing across instances of the volumes seems pretty
straightforward – but the program becomes much slower. After some
playing around, I found the reason: the drand48()
set of functions
uses an internal “state” to generate the sequence of random variates
– and this state data is static. To be thread-safe, these library
functions simply block on re-entrancy – that is, they are all
critical
. Not very good for speedup!
But there is a solution that doesn’t involve writing my own
thread-safe random number generator (which was my first thought, and
which gave me a headache). Instead of drand48()
to get a
U(0,1)
random variate, I used the function erand48(unsigned
short x[3])
, where the random generation state is stored in the
array x[]
. I create a separate state array for each thread (at the
start of the parallel
block, before the parallelized for
), and
initialize them separately (so each thread gets a different sequence
of random variates!). This solves the problem, and the program now
runs with nearly perfect speedup. For the tedious details, see the
AlphaSim.c
source code – including the file zgaussian.c
,
where the Gaussian N(0,1)
random number generator resides.
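A heavily simplified sketch of that idea (a hypothetical function, much shorter than the real AlphaSim.c code): each thread gets its own erand48() state array, seeded differently, so the threads draw independent random sequences.

#include <stdlib.h>
#include <omp.h>

/* Fill vol[nvox] with uniform random numbers, one independent stream per thread. */
void random_volume( int nvox , float *vol , long seed )
{
#pragma omp parallel
 { unsigned short xran[3] ;                  /* private generator state for this thread */
   int qq , tid = omp_get_thread_num() ;
   xran[0] = (unsigned short)(seed      ) ;  /* seed each thread differently, so the */
   xran[1] = (unsigned short)(seed >> 16) ;  /* threads produce different sequences  */
   xran[2] = (unsigned short)(1013 + 31*tid) ;
#pragma omp for
   for( qq=0 ; qq < nvox ; qq++ ) vol[qq] = (float)erand48(xran) ;
 }
}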
16.2.10. Example #6 – OpenMP and malloc() and memcpy() – 3dREMLfit
Parallelizing 3dREMLfit
effectively and correctly was much harder
than the programs above. The first problem was where to parallelize. I
tried various places, but none of them gave much speedup – in some
cases, OpenMP slowed the program down significantly. After a long
time, I finally thought of using Apple’s Shark profiler program, and
figured out that the program was spending a vast amount of time
spinning threads waiting for malloc
and free
to do their
work. Aha! The big CPU time sink in the “REML voxel loop”,
REML_func(), allocates and frees workspace arrays. But this
function is called hundreds of times per voxel, so the threads were
interfering with each other – because the system memory allocator is
thread-safe simply by blocking – it is not parallelized itself. I
didn’t want to install a parallelized version of malloc
, so I
rearranged the code to provide the workspace from outside, so that it
wouldn’t constantly be created and destroyed. Speedup!!!
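The general shape of that change (a schematic sketch with invented names, not the actual 3dREMLfit code): the per-thread workspace is allocated once per thread, outside the hot per-voxel function.

#include <stdlib.h>

/* Hypothetical stand-in for the per-voxel computation, taking its
   workspace as an argument instead of allocating it internally.     */
static double fit_one_voxel( const float *ts , int ntime , double *workspace )
{
   int ii ; double s = 0.0 ;
   for( ii=0 ; ii < ntime ; ii++ ){ workspace[ii] = ts[ii] ; s += workspace[ii] ; }
   return s / ntime ;
}

void fit_all_voxels( int nvox , int ntime , float **ts , double *result )
{
#pragma omp parallel
 { double *ws = (double *)malloc(sizeof(double)*ntime) ;  /* one workspace per thread, */
   int vv ;                                               /* allocated exactly once    */
#pragma omp for
   for( vv=0 ; vv < nvox ; vv++ )
     result[vv] = fit_one_voxel( ts[vv] , ntime , ws ) ;  /* disjoint writes to result */
   free(ws) ;
 }
}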
… But … In a sample run with about 133K voxels, about 40-100
voxels did not agree with the single-threaded case. Arrrgghhh!!!
After printing out vast amounts of information from each thread’s loop
through the voxels, I finally found that these traitorous voxels in
fact had the wrong data passed to REML_func()
. This data time
series vector was copied using memcpy
. For whatever reason, this
wasn’t completely thread-safe – I don’t know why, and can’t find any
other info about this on the Web. But by critical
-izing this
statement, the problems vanished. Simple .. after 2.5 days of
frustration, that is.
Other minor details in parallelizing this program: the -usetemp
option writes intermediate data to disk and then re-reads it later,
with the assumption that things are written in voxel order. This won’t
be true for OpenMP, so a special simpler loop, with all the
-usetemp
stuff deleted, was written for the parallel section of
the code. This change makes it easier to deal with the
parallelization, but of course means that any changes to the “REML
voxel loop” must be made twice in the future, since there are now two
different versions of this loop’s code extant. C’est la soft-guerre.
16.2.11. Example #7 – Looping over voxels in 3dAutoTcorrelate
This program correlates the time series from all voxels inside a mask
with all other voxels (inside the same mask, or all voxels). In the
original sequential code, the outer loop over voxel index ii
just
tested the voxel index to see if it was in the mask – if not, then it
just continue
-d to the next voxel. But in OpenMP, this means
that some parallel sub-loops get more work than others, if they happen
to fall into a denser part of the mask. Speedup was not all it should
be. So by pre-making an index table imap
of voxel indexes in the
mask, I could just loop over that table and be sure each loop
iteration was doing the same amount of labor.
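Schematically, the change looks like this (a sketch with invented names, not the actual 3dAutoTcorrelate code):

#include <stdlib.h>

/* Build a table of the voxel indexes that are inside the mask, then loop over
   the table, so every parallel iteration does roughly the same amount of work. */
void process_masked( int nvox , const unsigned char *mask ,
                     void (*do_voxel)(int) )
{
   int ii , nmask = 0 ;
   int *imap = (int *)malloc(sizeof(int)*nvox) ;
   for( ii=0 ; ii < nvox ; ii++ )
     if( mask[ii] ) imap[nmask++] = ii ;    /* serial: build the index table */

#pragma omp parallel
 { int qq ;
#pragma omp for
   for( qq=0 ; qq < nmask ; qq++ )          /* every iteration is real work */
     do_voxel( imap[qq] ) ;
 }
   free(imap) ;
}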
Another issue that arose, that had a bigger impact, was the fact that
extracting the time series from the space-then-time ordered input
dataset was inefficient – since each voxel time series would be
extracted many many times as the inner voxel loop was
traversed. Inverting the input dataset into an MRI_vectim
struct
(time-then-space ordered) provided a big speedup, making memory
accesses much more local. Also, each time series could be
pre-detrended, avoiding repeating this operation many many times.
With these 2 changes, 3dAutoTcorrelate
became much faster, and in
fact it now spends a significant fraction of its time just writing the
lengthy results to disk. I suppose this would be the next target for
speedup – probably by writing each completed sub-brick to the output
*.BRIK
file via mmap()
when it is completed.
16.2.12. OpenMP Library Functions
I’ve only used a couple of these – see the specification for the real scoop.
omp_get_num_procs() returns the number of processors OpenMP thinks is available – see the PRINT_AFNI_OMP_USAGE macro in mrilib.h, for example.
omp_get_max_threads() returns the maximum number of threads that are allowed to be used – usually, the value from OMP_NUM_THREADS or the number of CPUs on the system.
omp_get_num_threads() returns the number of threads executing in the current parallel block – see AlphaSim.c for an example of how this can be used to allocate workspaces, combined with the OpenMP master and barrier pragmas.
omp_get_thread_num() returns the thread number of the executing thread (counting starts at 0 for the “master thread”) – this is also used in AlphaSim.c to sum output values into the thread-specific workspaces (to avoid collisions between threads).
These latter 2 functions should only be used inside parallel blocks, since they don’t have much utility outside a parallelized piece of code.
See the OpenMP 2.5 specification for more details and a complete list of all such functions, etc.
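A toy sketch of the workspace trick that the last two functions enable (loosely modeled on the AlphaSim.c idea, with invented names): the master thread allocates one workspace per thread, an explicit barrier makes the allocation visible to everyone, and each thread then accumulates into its own workspace.

#include <stdlib.h>
#include <omp.h>

/* Accumulate per-thread partial histograms, then combine them serially. */
void histogram_omp( int niter , int nbin , int *total )
{
   int nthr = 1 , tt , bb ;
   int **thrcount = NULL ;          /* shared: table of per-thread workspaces */

#pragma omp parallel
 {
#pragma omp master                   /* only the master thread allocates */
   { int mm ;
     nthr = omp_get_num_threads() ;
     thrcount = (int **)malloc(sizeof(int *)*nthr) ;
     for( mm=0 ; mm < nthr ; mm++ ) thrcount[mm] = (int *)calloc(nbin,sizeof(int)) ;
   }
#pragma omp barrier                  /* master has no implied barrier, so wait here */
   { int qq , *mine = thrcount[ omp_get_thread_num() ] ;  /* this thread's workspace */
#pragma omp for
     for( qq=0 ; qq < niter ; qq++ ) mine[qq%nbin]++ ;    /* stand-in for real work  */
   }
 }

   for( bb=0 ; bb < nbin ; bb++ ) total[bb] = 0 ;          /* serial: combine results */
   for( tt=0 ; tt < nthr ; tt++ ){
     for( bb=0 ; bb < nbin ; bb++ ) total[bb] += thrcount[tt][bb] ;
     free(thrcount[tt]) ;
   }
   free(thrcount) ;
}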
16.2.13. Printing -help for OpenMP-ized programs
In mrilib.h
, the macro PRINT_AFNI_OMP_USAGE(pnam,extra)
is
defined. When USE_OMP is defined (i.e., on the compile command line
via CCOMP
), then this macro prints out some extra help relevant to
OpenMP usage. The argument pnam
is intended to be the program
name, and the argument extra
is any extra text to be put at the
end of the standard boilerplate (extra
can be NULL
). If
USE_OMP is not defined, then PRINT_AFNI_OMP_USAGE(pnam,extra)
prints a message that the program is OpenMP-compatible, but this
binary copy is not compiled with OpenMP. Thus, this macro is designed
to be inserted at the end of the -help
output section in any
OpenMP-ized program (cf. AlphaSim.c
).