Notes on Parallelizing AFNI Programs

May 2003 -- RWCox

Parallelizing via fork() and Shared Memory:
3dDeconvolve is the first AFNI program to be parallelized. This advance was relatively easy due to the highly structured way in which Doug Ward wrote the code. In particular, all the results are first calculated into a set of output arrays that are allocated in a single routine, zero_fill_volume(). After these arrays are filled up by the major calculation loop in calculate_results(), they are used to construct the output dataset. Furthermore, like most AFNI programs, the input dataset is readonly -- the memory allocated for it is not written into after the dataset is loaded from disk.

The strategy I developed for a quick parallelization of 3dDeconvolve relied on the semantics of the Unix fork() function. When fork() is called, a new process is created that is a copy of the old one. In particular, all the memory values set prior to fork() are duplicated. However, in modern virtual memory systems, actual copies of the memory aren't made -- the processes actually use the same memory locations, unless one of them writes into the memory. At that point, the writing process gets a copy instead of the original, and it is that copy that is modified. Thus, processes share readonly memory, but cannot communicate by writing into memory. Since the largest chunk of data is the input dataset, which is never modified after it is input, this behavior is ideal. Each child process can read from the input dataset without duplicating the data and without interfering with the other processes.
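As a minimal illustration of this copy-on-write behavior (not from the AFNI source; the array size and names here are made up for the sketch), consider:

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <unistd.h>
   #include <sys/types.h>
   #include <sys/wait.h>

   int main( void )
   {
      size_t ii , nval = 50000000 ;          /* ~200 MB stand-in "dataset" */
      float *data ; double sum = 0.0 ; pid_t pid ;

      data = (float *) malloc( sizeof(float)*nval ) ;  /* filled BEFORE fork() */
      memset( data , 0 , sizeof(float)*nval ) ;

      pid = fork() ;
      if( pid == 0 ){                        /* child: reads the parent's data; */
        for( ii=0 ; ii < nval ; ii++ )       /* no physical copy is made, since */
          sum += data[ii] ;                  /* these pages are only ever read  */
        printf("child sum = %g\n",sum) ;
        _exit(0) ;
      }

      data[0] = 1.0f ;           /* parent writes ==> only that one page gets  */
                                 /* copied; the child still sees the old data  */
      waitpid( pid , NULL , 0 ) ;
      free(data) ; return 0 ;
   }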

To communicate results back from the child processes, something special must be done. The simple solution was to put the output arrays all into a shared memory segment. This is a block of memory that is shared between processes that are "attached" to it. When a process creates a shared memory segment and later uses fork(), both the parent and child process are automatically attached to the segment. Since Doug Ward centralized the creation of the output arrays in one function, it was straightforward to modify that function to create a shared memory segment to store all these arrays (instead of using malloc()), and then set the pointers to these arrays to point to the various sub-arrays in the segment. All this is done before the first fork(), so that the values of pointers to the output arrays are inherited by the child processes.
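A stripped-down sketch of that pattern (illustrative only; the real allocation goes through shm_create()/shm_attach() in thd_iochan.c, described further below) looks like:

   #include <stdio.h>
   #include <string.h>
   #include <unistd.h>
   #include <sys/types.h>
   #include <sys/ipc.h>
   #include <sys/shm.h>
   #include <sys/wait.h>

   int main( void )
   {
      int ii , nvox = 1000 , shmid ;
      float *out ; pid_t pid ;

      /* create the segment and attach to it BEFORE fork(), so that the
         child automatically shares the attachment (and the pointer)    */

      shmid = shmget( IPC_PRIVATE , sizeof(float)*nvox , IPC_CREAT | 0600 ) ;
      out   = (float *) shmat( shmid , NULL , 0 ) ;
      memset( out , 0 , sizeof(float)*nvox ) ;

      pid = fork() ;
      if( pid == 0 ){                                    /* child's half  */
        for( ii=nvox/2 ; ii < nvox ; ii++ ) out[ii] = 1.0f ;
        _exit(0) ;
      }
      for( ii=0 ; ii < nvox/2 ; ii++ ) out[ii] = 2.0f ;  /* parent's half */

      waitpid( pid , NULL , 0 ) ;        /* both halves now visible here  */
      printf("out[0]=%g  out[%d]=%g\n" , out[0] , nvox-1 , out[nvox-1] ) ;

      shmdt( out ) ; shmctl( shmid , IPC_RMID , NULL ) ; /* detach & delete */
      return 0 ;
   }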

One of the main problems in many parallel programs is making sure that the parallel processes don't try to modify the same resource (e.g., memory location) at the same time. This is easily ensured in this program since each voxel calculation is independent. The only shared resource is the shared memory arrays of output results, visible to all processes. But each process has assigned to it an independent block of voxels, so each process will only write into the memory locations that correspond to its task. There can be no overlap (e.g., job #1 and job #2 trying to write to the same shared memory location), so the mutual exclusion problem doesn't arise.

All this relies on the "copy-on-write" strategy being used by fork(), which I know is true in Linux, Solaris, and Mac OS X, and is probably true in most other modern Unix-ish systems. It also relies on symmetric multi-processing (SMP), so that separate processes will automatically be offloaded to separate CPUs when available. (Back around 1990, SMP was exotic, unreliable technology. Now it is "easy" and commonplace.) No use is made of MPI, PVM, OpenMP, Data Parallel C, or other external libraries/software for parallelization. This may be considered a strength or a weakness.

A number of other bookkeeping points had to be managed:

  • "Saving up" all output array allocations until they are all ready, so that the shared memory segment can be allocated at the right size (there is no way to change a segment size -- no analog to realloc()).
  • Setting the job count back to 1 if "-nodata" or "-input1D" are used.
  • Avoiding the use of free() on arrays stored in shared memory.
  • Making sure the shared memory segment is deleted if the program crashes.
    • This is done by installing a signal handler to catch the most common fatal error signals (e.g., segfault), and an atexit() handler to catch any calls to exit().
    • The children finish up with _exit(), which will not invoke the atexit() handler.
  • Dividing the work up among the child processes.
    • Since each voxel is an independent calculation, this is easily done by giving each process a range of voxel indexes upon which to operate.
    • The only slightly tricky part here is when a mask dataset is used. Then it is necessary to scan the mask to make sure the work is divided evenly.
  • When the parent process ("job #0") finishes its set of voxels, it must wait for the children to finish before continuing on to write the output dataset(s). This is done using the Unix function waitpid().

Some operating systems may have trouble when fork()-ing a process with a large readonly memory region (such as the 3dDeconvolve input dataset). The reason is that the system doesn't know that the processes won't write to that memory, and so it may try to set aside a disk swap buffer for the dataset memory for each process. If you have a 1 GB dataset and 8 jobs, this would require 8 GB of swap space, none of which would be used. If you run into this problem, I'm not sure what to tell you. As I understand it, SGI IRIX may have this problem, but can alleviate it by enabling something called "virtual swap". Not having any experience with this concept, I won't comment further.

Shared Memory Allocation:
When you run 3dDeconvolve with the "-jobs" option, you may get an error message of the form

  ** Can't create shared memory of size XXXXXXXXXX!
where "XXXXXXXXXX" is replaced by the number of bytes needed for the output arrays described earlier. This error is usually caused by the operating system settings not being large enough. The tuning information below can be used to try to cure this situation.

It is possible that the program could crash and leave the shared memory segment behind. (A Unix shared memory segment can sit in memory forever, even if it is not attached to any process.) As I mentioned above, the program is careful to try to avoid this. However, if the problem occurs, then you can look at the shared memory segments in use on the system with the command ipcs and remove any unwanted ones with the ipcrm command. For more information, see the man pages for these commands.
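For example, on a typical Linux system, "ipcs -m" lists the current shared memory segments along with their numeric IDs, and "ipcrm -m 123456" would remove the unwanted segment whose ID is 123456; the exact option syntax varies a little between Unix flavors, so check the man pages on your system.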

The actual allocation of shared memory is done via AFNI library functions in thd_iochan.c, which is compiled into libmri.a.
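The essential idea of those functions is a thin wrapper around the standard System V calls shmget() and shmat(). A rough sketch of that idea (not the actual thd_iochan.c code; the key-derivation step and the "_sketch" names are invented for illustration):

   #include <sys/types.h>
   #include <sys/ipc.h>
   #include <sys/shm.h>

   /* create a segment of 'size' bytes, using a key derived from the
      unique ID string 'kstr'; returns the segment ID, or -1 on error */

   int shm_create_sketch( char *kstr , int size )
   {
      key_t key = 0 ; int ii ;
      for( ii=0 ; kstr[ii] != '\0' ; ii++ )   /* hypothetical hash of */
        key = 37*key + (key_t)kstr[ii] ;      /* the ID string        */
      return shmget( key , (size_t)size , IPC_CREAT | 0777 ) ;
   }

   /* attach the calling process to an existing segment;
      returns a pointer to the segment, or NULL on failure */

   char * shm_attach_sketch( int shmid )
   {
      void *adr = shmat( shmid , NULL , 0 ) ;
      return ( adr == (void *)-1 ) ? NULL : (char *)adr ;
   }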


3dDeconvolve Results

Tom Ross of NIDA ran a not-entirely-trivial 3dDeconvolve computation on an 8 processor Linux Xeon system. The results were

   Jobs   Elapsed time  177.87/Jobs  Elapsed*Jobs
   ----   ------------  -----------  ------------
     1      177.87        177.87        177.87
     2       92.75         88.935       185.5
     3       64.03         59.29        192.09
     4       50.55         44.4675      202.2
     5       42.8          35.574       214
     6       36.49         29.645       218.94
     7       33.71         25.41        235.97
     8       30.62         22.2338      244.96
These curves are plotted in the graph below:

[Graph: Black = Elapsed time from table; Red = 177.87/Jobs = Perfect speedup]

Perfect speedup would have Elapsed*Jobs be a constant (177.87 in this case), and have Elapsed(Jobs)=177.87/Jobs. For this problem, speedup isn't "perfect", but is still pretty good. For example, the 8 jobs case finishes in about the same time that 6 "perfect" processors would have. The dataset input, scaling of results from floats to shorts, and dataset(s) output are all done in the master job (job #0), so these computations are not parallelized. A better fit to the Elapsed data is Elapsed(Jobs)=9.6+168.3/Jobs, indicating the amount of non-parallelized overhead (9.6 s) "necessary" to run this calculation (which has a 270 MB input dataset).
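(That fit predicts 9.6 + 168.3/8 ≈ 30.6 s for the 8-job case, essentially the measured 30.62 s.)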

[Later] Tom Ross also ran a larger job, with a single-CPU time of 898 s. The 8-job run took 132 s, a speedup of 6.8 times, which is better than the shorter example above, whose 8-job speedup is 177.9/30.6=5.8. This result supports the usual parallelizer's intuition that larger calculations benefit more from multiple CPUs.

For further speed tests, and for some data/scripts to test your machine, see this page.


Tuning Operating System Shared Memory Parameters

N.B.: THIS SECTION IS MOSTLY OBSOLETE. 3dDeconvolve now uses mmap() to create shared memory -- this process does not involve the shmem system. [RWCox: Oct 2005]

Many Unix-ish operating systems come configured with a low maximum value for the size of a shared memory segment (shmem). Since we use a large shmem to communicate results from the child processes back to the parent, you may have to tune the operating system parameters to increase this maximum value. The methods for doing this vary considerably between systems. The following notes are adapted from the PostgreSQL Administrator's Guide.

There is a small set of parameters that control the shared memory configuration. Their names and meanings are listed below. (Not all Unix-ish systems use all these parameters, and their names may be lower- or upper-case.)

  • SHMMAX = Maximum size of shared memory segment (bytes)
  • SHMMIN = Minimum size of shared memory segment (bytes)
  • SHMALL = Total amount of shared memory available (bytes or pages, depending on OS)
    • Pages are usually 4096 bytes on modern Unix-ish systems (but this varies)
  • SHMSEG = Maximum number of shared memory segments per process
  • SHMMNI = Maximum number of shared memory segments system-wide (all processes)
If you have to increase SHMMAX, you may also have to increase SHMALL to (at least) match.

The following system administration information is to be taken with extreme caution! For most systems, you have to reboot after making these changes (Linux is an exception). In all cases, only a superuser (e.g., root) can do these things. Please be careful, make backup copies of any configuration files that you edit, and read the documentation for your system to make sure that you understand what you are doing.

  • Linux
    The default shared memory limit (both SHMMAX and SHMALL) is 32 MB in 2.2 kernels, but it can be changed in the proc file system (without reboot). For example, to allow 128 MB, issue the following commands:

          echo 134217728 >/proc/sys/kernel/shmall
          echo 134217728 >/proc/sys/kernel/shmmax
    
    You could put these commands into a script run at boot-time (e.g., /etc/rc.d/rc.local). To see the current value of SHMMAX, for example, issue this command:
          cat /proc/sys/kernel/shmmax
    
    Alternatively, you can use sysctl, if available, to control these parameters. Look for a file called /etc/sysctl.conf and add lines like the following to it:
          kernel.shmall = 134217728
          kernel.shmmax = 134217728
    
    This file is usually processed at boot time, but sysctl can also be called explicitly later.
  • Mac OS/X (Darwin)

    • For Jaguar (OS/X 10.2), you need to edit the file /System/Library/StartupItems/SystemTuning/SystemTuning.
    • For Panther (OS/X 10.3), you need to edit the file /etc/rc.

    These files can only be altered by a user with admin privileges. Edit the following values:

          sysctl -w kern.sysv.shmmax
          sysctl -w kern.sysv.shmmin
          sysctl -w kern.sysv.shmmni
          sysctl -w kern.sysv.shmseg
          sysctl -w kern.sysv.shmall
    
    These parameters have the same meanings as those listed above. Note that shmall is in pages (4096 bytes each), while shmmax is in bytes. To see the current value of SHMMAX, for example, issue the following command:
          sysctl -n kern.sysv.shmmax
    
    An alternative command that will show you all the shared memory kernel parameters at once is
          sysctl -a | grep shm
    
    The following are the values that I set in my startup file (SystemTuning or /etc/rc):
          sysctl -w kern.sysv.shmall=10240
          sysctl -w kern.sysv.shmmax=41943040
          sysctl -w kern.sysv.shmmin=12
          sysctl -w kern.sysv.shmmni=320
          sysctl -w kern.sysv.shmseg=80
    
    I found that changing just shmall and shmmax didn't always work well, but these values seem to work OK.
  • Solaris
    The default maximum size of a shared memory segment is pretty low (1 MB on the systems I've seen). The relevant settings can be changed in /etc/system, for example:

          set shmsys:shminfo_shmmax=0x2000000
          set shmsys:shminfo_shmmin=1
          set shmsys:shminfo_shmmni=256
          set shmsys:shminfo_shmseg=256
    
    You need to reboot for the changes to take effect. The following command will show the current value of SHMMAX:
          /usr/sbin/sysdef | grep SHMMAX
    
  • BSD/OS
    By default, only 4 MB of shared memory is supported. Keep in mind that shared memory is not pageable; it is locked in RAM. A SHMALL value of 1024 represents 4 MB of shared memory. The following increases the maximum shared memory area to 32 MB:

          options "SHMALL=8192"
          options "SHMMAX=\(SHMALL*PAGE_SIZE\)"
    
    For those running 4.1 or later, just make the above changes, recompile the kernel, and reboot. For those running earlier releases, use bpatch to find the sysptsize value in the current kernel. This is computed dynamically at boot time.
          $ bpatch -r sysptsize
          0x9 = 9
    
    Next, add SYSPTSIZE as a hard-coded value in the kernel configuration file. Increase the value you found using bpatch. Add 1 for every additional 4 MB of shared memory you desire.
          options "SYSPTSIZE=16"
    
    sysptsize cannot be changed by sysctl.
  • FreeBSD, NetBSD, OpenBSD
    The option SYSVSHM needs to be enabled when the kernel is compiled. (It is by default.) The maximum size of shared memory is determined by the option SHMMAXPGS (in pages). The following shows an example of how to set the various parameters:

          options         SYSVSHM
          options         SHMMAXPGS=4096
          options         SHMSEG=256
    
    (On NetBSD and OpenBSD the keyword is actually "option" singular.)

    More on configuring shared memory in FreeBSD 5.x here.

  • HP-UX
    System parameters can be set in the System Administration Manager (SAM) under "Kernel Configuration->Configurable Parameters". Hit "Create A New Kernel" when you're done. However, the defaults are probably big enough (since HP-UX 10 was released, anyway).

  • SGI IRIX
    The default value of SHMMAX is so large that it seems unlikely you'll have to change it. If you need to try this, the relevant file is /var/sysgen/mtune/kernel.

  • SCO OpenServer
    In the default configuration, only 512 kB of shared memory per segment is allowed. To increase the setting, first change directory to /etc/conf/cf.d. To display the current value of SHMMAX, in bytes, run

          ./configure -y SHMMAX
    
    To set a new value for SHMMAX, run:
          ./configure SHMMAX=value
    
    where value is the new value you want to use (in bytes). After setting SHMMAX, rebuild the kernel:
          ./link_unix
    
  • UnixWare
    On UnixWare 7, the maximum size for shared memory segments is 512 kB in the default configuration. To display the current value of SHMMAX, run

          /etc/conf/bin/idtune -g SHMMAX
    
    which displays the current, default, minimum, and maximum values, in bytes. To set a new value for SHMMAX, run:
          /etc/conf/bin/idtune SHMMAX value
    
    where value is the new value you want to use (in bytes). After setting SHMMAX, rebuild the kernel
          /etc/conf/bin/idbuild -B
    
    and reboot.
  • Cygwin
    At present, Cygwin does not support Unix-ish shared memory, so this feature is not available on the Cygwin port of AFNI. I understand that shared memory support is being implemented into Cygwin; however, the fork() in Cygwin does not do copy-on-write, so the parallelizing strategy used here is not a good idea for Cygwin.


Details of Code Changes

In the programs changed thus far (3dDeconvolve.c and 3dNLfim.c), most of the variables that were added start with the string "proc_", so searching for that will take you to the modifications. The code fragments below are from 3dDeconvolve.c (as of 08 May 2003), but are virtually identical to those in 3dNLfim.c.

Global Variables
The declarations below define the global variables that will be used to control the parallel processes. PROC_MAX is defined as the max number of processes to be allowed; also, if this macro is defined, then later code segments will be included.

   #if !defined(DONT_USE_SHM) && !defined(DONT_USE_FORK)

   # include "thd_iochan.h"                /* prototypes for shm stuff */

   # define PROC_MAX   32                  /* max num processes */

     static int proc_numjob        = 1   ; /* num processes */
     static pid_t proc_pid[PROC_MAX]     ; /* IDs of processes */
     static int proc_shmid         = 0   ; /* shared memory ID */
     static char *proc_shmptr      = NULL; /* pointer to shared memory */
     static int proc_shmsize       = 0   ; /* total size of shared memory */

     static int proc_shm_arnum     = 0   ; /* num arrays in shared memory */
     static float ***proc_shm_ar   = NULL; /* *proc_shm_ar[i] = ptr to array #i */
     static int *proc_shm_arsiz    = NULL; /* proc_shm_arsiz[i] = floats in #i */

     static int proc_vox_bot[PROC_MAX]   ; /* 1st voxel to use in each process */
     static int proc_vox_top[PROC_MAX]   ; /* last voxel (+1) in each process */

     static int proc_ind                 ; /* index of THIS job */

   #else   /* can't use multiple processes */

   # define proc_numjob 1   /* flag that only 1 process is in use */
   # define proc_ind    0   /* index of THIS job */

   #endif
If parallel processing isn't allowed (because shared memory isn't available), then proc_numjob is defined to be 1 and proc_ind to be 0. These values are tested at a few places in the code to determine whether this is a parallel run (proc_numjob > 1) and whether this is the master process (proc_ind == 0). Having them always defined, even in a non-parallel compilation, avoids the need for special-case code in those places.
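For example, a guard of the following form (illustrative, not a literal excerpt from 3dDeconvolve.c) compiles and behaves correctly in both the parallel and the non-parallel builds:

      if( proc_numjob > 1 && proc_ind == 0 ){
        /* things only the master process of a parallel run should do,  */
        /* e.g., waiting for the children and writing output dataset(s) */
      }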

-jobs Command Line Option
This code (in function get_options) sets the proc_numjob variable that determines how many processes will be used.

         if( strcmp(argv[nopt],"-jobs") == 0 ){   /* RWCox */
           nopt++ ;
           if (nopt >= argc)  DC_error ("need J parameter after -jobs ");
   #ifdef PROC_MAX
           proc_numjob = strtol(argv[nopt],NULL,10) ;
           if( proc_numjob < 1 ){
             fprintf(stderr,"** setting number of processes to 1!\n") ;
             proc_numjob = 1 ;
           } else if( proc_numjob > PROC_MAX ){
             fprintf(stderr,"** setting number of processes to %d!\n",PROC_MAX);
             proc_numjob = PROC_MAX ;
           }
   #else
             fprintf(stderr,"** -jobs not supported in this version\n") ;
   #endif
           nopt++; continue;
         }

Memory Allocation
Results variables are stored in arrays with names ending in "_vol". These arrays are float and have dimension nxyz = number of voxels. They are allocated and zeroed out in zero_fill_volume().

The code inside "if( proc_numjob == 1 ){" is the original malloc()-based code by Doug Ward. The code inside "#ifdef PROC_MAX" is the shared memory modification. Since shared memory segments can't be changed in size, what this new code does is accumulate a list of pointers that must be set to point to arrays (the input *fvol). These are accumulated in the proc_shm_ar array, and the size of each one (in floats) in the proc_shm_arsiz array. (They all have the same size, but I wrote this function to be generic, in case arrays of different sizes are needed someday.) proc_shm_arnum holds the number of arrays that will have to go into the shared memory segment.

   void zero_fill_volume (float ** fvol, int nxyz)
   {
     int ixyz;

     if( proc_numjob == 1 ){ /* 1 process ==> allocate locally */

       *fvol  = (float *) malloc (sizeof(float) * nxyz);   MTEST(*fvol);
       for (ixyz = 0;  ixyz < nxyz;  ixyz++)
         (*fvol)[ixyz]  = 0.0;

     }
   #ifdef PROC_MAX
      else {             /* multiple processes ==> prepare for shared memory */
                         /*                        by remembering what to do */

       proc_shm_arnum++ ;
       proc_shm_arsiz = (int *)  realloc( proc_shm_arsiz ,
                                          sizeof(int)     *proc_shm_arnum ) ;
       proc_shm_ar = (float ***) realloc( proc_shm_ar ,
                                          sizeof(float **)*proc_shm_arnum ) ;
       proc_shm_arsiz[proc_shm_arnum-1] = nxyz ;
       proc_shm_ar[proc_shm_arnum-1]    = fvol ;

       /* actual allocation and pointer assignment (to *fvol)
          will take place in function proc_finalize_shm_volumes() */
     }
   #endif
   }

Shared Memory Creation
This section of code includes several functions, which I will discuss in turn.

The functions immediately below are designed to ensure that the shared memory segment is deleted when the program exits. The signal handler will (hopefully) catch crashes. The atexit handler will deal with normal calls to exit(). (The child processes (proc_ind > 0) will finish with _exit(), which will not call the atexit handler.)

#ifdef PROC_MAX

   /*** signal handler for fatal errors -- make sure shared memory is deleted ***/

   #include <signal.h>

   void proc_sigfunc(int sig)
   {
      char * sname ; int ii ;
      static volatile int fff=0 ;
      if( fff ) _exit(1); else fff=1 ;
      switch(sig){
        default:      sname = "unknown" ; break ;
        case SIGHUP:  sname = "SIGHUP"  ; break ;
        case SIGTERM: sname = "SIGTERM" ; break ;
        case SIGILL:  sname = "SIGILL"  ; break ;
        case SIGKILL: sname = "SIGKILL" ; break ;
        case SIGPIPE: sname = "SIGPIPE" ; break ;
        case SIGSEGV: sname = "SIGSEGV" ; break ;
        case SIGBUS:  sname = "SIGBUS"  ; break ;
        case SIGINT:  sname = "SIGINT"  ; break ;
        case SIGFPE:  sname = "SIGFPE"  ; break ;
      }
      if( proc_shmid > 0 ){
        shmctl( proc_shmid , IPC_RMID , NULL ) ; proc_shmid = 0 ;
      }
      fprintf(stderr,"Fatal Signal %d (%s) received in job #%d\n",
              sig,sname,proc_ind) ;
      exit(1) ;
   }

   void proc_atexit( void )  /*** similarly - atexit handler ***/
   {
     if( proc_shmid > 0 )
       shmctl( proc_shmid , IPC_RMID , NULL ) ;
   }
The function below is where the shared memory segment is actually created. At the end, the pointers from zero_fill_volume() are actually assigned, which is the main purpose. Along the way, the signal and atexit handlers are installed. Creation of shared memory is actually quite easy; a call to shm_create() and then a call to shm_attach(). Most of the code is dealing with or allowing for error conditions.
   /*---------------------------------------------------------------*/
   /*** This function is called to allocate all output
        volumes at once in shared memory, and set their pointers ***/

   void proc_finalize_shm_volumes(void)
   {
      char kstr[32] ; int ii ;

      if( proc_shm_arnum == 0 ) return ;   /* should never happen */

      proc_shmsize = 0 ;                       /* add up sizes of */
      for( ii=0 ; ii < proc_shm_arnum ; ii++ ) /* all arrays for */
        proc_shmsize += proc_shm_arsiz[ii] ;   /* shared memory */

      proc_shmsize *= sizeof(float) ;          /* convert to byte count */

      /* create shared memory segment */

      UNIQ_idcode_fill( kstr ) ;               /* unique string "key" */
      proc_shmid = shm_create( kstr , proc_shmsize ) ; /* thd_iochan.c */
      if( proc_shmid < 0 ){
        fprintf(stderr,"\n** Can't create shared memory of size %d!\n"
                         "** Try re-running without -jobs option!\n" ,
                proc_shmsize ) ;

        /** if failed, print out some advice on how to tune SHMMAX **/

        { char *cmd=NULL ;
   #if defined(LINUX)
          cmd = "cat /proc/sys/kernel/shmmax" ;
   #elif defined(SOLARIS)
          cmd = "/usr/sbin/sysdef | grep SHMMAX" ;
   #elif defined(DARWIN)
          cmd = "sysctl -n kern.sysv.shmmax" ;
   #endif
          if( cmd != NULL ){
           FILE *fp = popen( cmd , "r" ) ;
           if( fp != NULL ){
            unsigned int smax=0 ;
            fscanf(fp,"%u",&smax) ; pclose(fp) ;
            if( smax > 0 )
              fprintf(stderr ,
                      "\n"
                      "** POSSIBLY USEFUL ADVICE:\n"
                      "** Current max shared memory size = %u bytes.\n"
                      "** For information on how to change this, see\n"
                      "**   http://afni.nimh.nih.gov/afni/parallize.htm\n"
                      "** and also contact your system administrator.\n"
                      , smax ) ;
           }
          }
        }
        exit(1) ;
      }

      /* set a signal handler to catch most fatal errors and
         delete the shared memory segment if program crashes */

      signal(SIGPIPE,proc_sigfunc) ; signal(SIGSEGV,proc_sigfunc) ;
      signal(SIGINT ,proc_sigfunc) ; signal(SIGFPE ,proc_sigfunc) ;
      signal(SIGBUS ,proc_sigfunc) ; signal(SIGHUP ,proc_sigfunc) ;
      signal(SIGTERM,proc_sigfunc) ; signal(SIGILL ,proc_sigfunc) ;
      signal(SIGKILL,proc_sigfunc) ; signal(SIGPIPE,proc_sigfunc) ;
      atexit(proc_atexit) ;   /* or when the program exits */

      fprintf(stderr , "++ Shared memory: %d bytes at id=%d\n" ,
              proc_shmsize , proc_shmid ) ;

      /* get pointer to shared memory segment we just created */

      proc_shmptr = shm_attach( proc_shmid ) ; /* thd_iochan.c */
      if( proc_shmptr == NULL ){
        fprintf(stderr,"\n** Can't attach to shared memory!?\n"
                         "** This is bizarre.\n" ) ;
        shmctl( proc_shmid , IPC_RMID , NULL ) ;
        exit(1) ;
      }

      /* clear all shared memory */

      memset( proc_shmptr , 0 , proc_shmsize ) ;

      /* fix the local pointers to arrays in shared memory */

      *proc_shm_ar[0] = (float *) proc_shmptr ;
      for( ii=1 ; ii < proc_shm_arnum ; ii++ )
        *proc_shm_ar[ii] = *proc_shm_ar[ii-1] + proc_shm_arsiz[ii] ;
   }
I think the comments explain the purpose of this next function. The "_vol" arrays are free()-ed at the end of the program. But if they are really in the shared memory segment, free() won't work, since they weren't malloc()-ed.
   /*-------------------------------------------------------------*/
   /*** This function replaces free();
        it won't try to free things stored in the shared memory ***/

   void proc_free( void *ptr )
   {
      int ii ;

      if( ptr == NULL ) return ;
      if( proc_shmid == 0 ){ free(ptr); return; }  /* no shm */
      for( ii=0 ; ii < proc_shm_arnum ; ii++ )
        if( ((float *)ptr) == *proc_shm_ar[ii] ) return;
      free(ptr); return;
   }

   #undef  free            /* replace use of library free() */
   #define free proc_free  /* with proc_free() function     */

   #endif /* PROC_MAX */

Dividing up the Work and Starting up the Child Processes
The first part is to figure out what range of voxels will be assigned to each process. If no mask is used, then the division of labor is easy: each process gets an even share of the voxels. But if a mask is used, this might not be a fair division. In the mask case, we have to count the number of voxels in the mask, then find subdivisions of them that are approximately equal in size. These subdivisions are computed into the proc_vox_bot and proc_vox_top arrays.

   #ifdef PROC_MAX
      if( proc_numjob > 1 ){    /*---- set up multiple processes ----*/
        int vv , nvox=nxyz , nper , pp , nv ;
        pid_t newpid ;

        /* count number of voxels to compute with into nvox */

        if( mask_vol != NULL ){
          for( vv=nvox=0 ; vv < nxyz ; vv++ )
            if( mask_vol[vv] != 0 ) nvox++ ;
        }

        if( nvox < proc_numjob ){  /* too few voxels for multiple jobs? */

          proc_numjob = 1 ;

        } else {                   /* prepare jobs */

          /* split voxels between jobs evenly */

          nper = nvox / proc_numjob ;  /* # voxels per job */
          if( mask_vol == NULL ){
            proc_vox_bot[0] = 0 ;
            for( pp=0 ; pp < proc_numjob ; pp++ ){
              proc_vox_top[pp] = proc_vox_bot[pp] + nper ;
              if( pp < proc_numjob-1 ) proc_vox_bot[pp+1] = proc_vox_top[pp] ;
            }
            proc_vox_top[proc_numjob-1] = nxyz ;
          } else {
            proc_vox_bot[0] = 0 ;
            for( pp=0 ; pp < proc_numjob ; pp++ ){
              for( nv=0,vv=proc_vox_bot[pp] ;         /* count ahead until */
                   nv < nper && vv < nxyz  ; vv++ ){  /* find nper voxels */
                if( mask_vol[vv] != 0 ) nv++ ;        /* inside the mask */
              }
              proc_vox_top[pp] = vv ;
              if( pp < proc_numjob-1 ) proc_vox_bot[pp+1] = proc_vox_top[pp] ;
            }
            proc_vox_top[proc_numjob-1] = nxyz ;
          }

          /* make sure dataset is in memory before forks */

          DSET_load(dset) ;  /* so dataset will be common */
The immediate next step is to actually create the child processes. In reading the following code (which directly follows the code above), recall that each child process receives a copy of the program as it was at the call to fork(). For example, the code below assigns the voxel index range for each process to variables ixyz_bot and ixyz_top, just before the call to fork(). After the fork(), the child process never changes these. The parent process changes them before fork()-ing the next child, but the previous child will not see those changes -- the processes have completely separate memory spaces (except for the explicit shared memory segment).
          /* start processes */

          fprintf(stderr,"++ Voxels in dataset: %d\n",nxyz) ;
          if( nvox < nxyz )
          fprintf(stderr,"++ Voxels in mask:    %d\n",nvox) ;
          fprintf(stderr,"++ Voxels per job:    %d\n",nper) ;

          for( pp=1 ; pp < proc_numjob ; pp++ ){
            ixyz_bot = proc_vox_bot[pp] ;   /* these 3 variables   */
            ixyz_top = proc_vox_top[pp] ;   /* are for the process */
            proc_ind = pp ;                 /* we're about to fork */
            newpid   = fork() ;
            if( newpid == -1 ){
              fprintf(stderr,"** Can't fork job #%d! Error exit!\n",pp);
              exit(1) ;
            }
            if( newpid == 0 ) break ;   /* I'm the child */
            proc_pid[pp] = newpid ;     /* I'm the parent */
            iochan_sleep(10) ;
          }
          if( pp == proc_numjob ){       /* only in the parent */
            ixyz_bot = proc_vox_bot[0] ; /* set the 3 control */
            ixyz_top = proc_vox_top[0] ; /* variables needed */
            proc_ind = 0 ;               /* below           */
          }
          fprintf(stderr,"++ Job #%d: processing voxels %d to %d; elapsed time=%.3f\n",
                  proc_ind,ixyz_bot,ixyz_top-1,COX_clock_time()) ;
        }
      }
   #endif /* PROC_MAX */
The loop over voxels in each child looks something like this:
    for (ixyz = ixyz_bot;  ixyz < ixyz_top;  ixyz++)
      {
        /*----- Apply mask? -----*/
        if (mask_vol != NULL)
          if (mask_vol[ixyz] == 0)  continue;

        /*..... get data from voxel ixyz, calculate results into *_vol[ixyz] .....*/
      }

Parallel Process Rundown
At the end of the ixyz loop, the calculations are done. The child processes (proc_ind > 0) exit. The parent process (proc_ind == 0) will continue on to create the output dataset(s) from the _vol arrays. But before the parent can continue from the end of its calculation loop, it must be sure that all the children are done, too (so that the _vol arrays are complete). This is done by using the Unix function waitpid(), which will wait on the process ID (in proc_pid[]) of each child.

   #ifdef PROC_MAX
     if( proc_numjob > 1 ){
       if( proc_ind > 0 ){                          /* death of child */
         fprintf(stderr,"++ Job #%d finished; elapsed time=%.3f\n",proc_ind,COX_clock_time()) ;
         _exit(0) ;

       } else {                      /* parent waits for children */
         int pp ;
         fprintf(stderr,"++ Job #0 waiting for children to finish; elapsed time=%.3f\n",COX_clock_time()) ;
         for( pp=1 ; pp < proc_numjob ; pp++ )
           waitpid( proc_pid[pp] , NULL , 0 ) ;
         fprintf(stderr,"++ Job #0 now finishing up; elapsed time=%.3f\n",COX_clock_time()) ;
       }

       /* when get to here, only parent process is left alive,
          and all the results are in the shared memory segment arrays */
     }
   #endif

Possible Improvements
When a mask (e.g., from 3dAutomask) is used, usually only 20-30% of the voxels are actually in the brain. So the amount of shared memory required could be cut down considerably by creating an index array ixyz_to_vol of length nxyz, so that ixyz_to_vol[ixyz] would be the index in the _vol arrays of the ixyz-th voxel. In the notation above, nvox is the number of voxels in the mask. Then each _vol array would only have to be nvox long. To make this change would require modifying the _vol array to dataset conversion code -- basically, each _vol array would have to be expanded into a full volume so that it could be placed into the output dataset. These modifications would be straightforward but tedious. We'll see how things go.
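A rough sketch of how such an index array might be built (purely illustrative; this code is not in AFNI, and the names ixyz_to_vol, mask_vol, nxyz, and nvox just follow the notation above):

   int *ixyz_to_vol = (int *) malloc( sizeof(int)*nxyz ) ;
   int ixyz , nvox = 0 ;

   for( ixyz=0 ; ixyz < nxyz ; ixyz++ ){
     if( mask_vol != NULL && mask_vol[ixyz] == 0 )
       ixyz_to_vol[ixyz] = -1 ;        /* outside the mask ==> no slot     */
     else
       ixyz_to_vol[ixyz] = nvox++ ;    /* inside the mask ==> next compact */
   }                                   /* index into the shortened arrays  */

   /* a result for voxel ixyz would then be stored in vol[ ixyz_to_vol[ixyz] ],
      where each _vol array is only nvox floats long; at output time each
      such array would be expanded back to a full nxyz-length volume
      (zero outside the mask) before being placed into the dataset */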

Created by Robert Cox
Last modified 2006-10-17 15:33
 
