3dDeconvolve Speed Test

A speed test of the AFNI program 3dDeconvolve is available for download at http://afni.nimh.nih.gov/pub/dist/data/speed_test.tgz -- instructions on how to run the test are in the README file within.

This is a fairly lengthy run (250 MB of EPI data, 8 tasks, 1424 time points -- 47+ minutes of imaging, 8 concatenated runs).
About 26 billion 64-bit floating point operations are required, plus disk I/O, memory access, etc.
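
To put the operation count in perspective, an elapsed time can be converted into a rough effective floating point rate. The following is a minimal sketch in Python; the function name is hypothetical, the two sample times are the fastest and slowest 1 CPU double precision entries in the first table below, and the rates understate what the hardware can do because the elapsed times include disk I/O.

```python
# Approximate operation count quoted above for the original speed test.
TOTAL_FLOP = 26e9


def effective_gflops(elapsed_seconds, total_flop=TOTAL_FLOP):
    """Return the effective rate in GFLOP/s for a given elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return total_flop / elapsed_seconds / 1e9


if __name__ == "__main__":
    # Sample elapsed times (seconds) taken from the timing table.
    for label, seconds in [("32 s run", 32.0), ("658 s run", 658.0)]:
        print(f"{label}: {effective_gflops(seconds):.3f} GFLOP/s")
```

Because the elapsed time is in the denominator, halving the run time doubles the effective rate; this makes it easy to compare machines on one scale.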

Some timing results (2 CPU results used the "-jobs 2" flag to 3dDeconvolve; for more details, see this):

System Description         1 CPU             2 CPUs            System Name    1 CPU             2 CPUs
                           double precision  double precision                 single precision  single precision

1.4 GHz Altix (Intel cc)   32 s              29 s              hurin (new)    21 s              16 s
2.5 GHz Mac G5 (OS X)      45 s              --                eomer          29 s              --
2 GHz Opteron (Linux)      48 s              39 s              manwe          33 s              25 s
2 GHz Mac G5 (OS X)        56 s              36 s              ilmarin        45 s              --
3 GHz Xeon (Linux)         62 s              36 s              aurum          --                --
1.8 GHz Opteron (Linux)    62 s              47 s              Aspen          --                --
1050 MHz SPARCv9 (Sun C)   77 s              55 s              kampos (new)   40 s              29 s
1050 MHz SPARCv9 (gcc)     134 s             65 s              kampos (new)   --                --
2 GHz Athlon (Linux)       232 s             124 s             hurin (old)    --                --
1.67 GHz G4 PowerBook      266 s             --                estel          146 s             --
867 MHz G4 (PowerBook)     355 s             --                boromir        236 s             --
1 GHz Athlon (Linux)       387 s             --                orome          --                --
700 MHz G4 (iMac)          434 s             --                thalion        270 s             --
250 MHz R10000 (SGI)       501 s             307 s             finrod         405 s             242 s
336 MHz SPARCv9 (Sun)      658 s             334 s             kampos (old)   --                --

These are elapsed times, including disk I/O.
All compilers are gcc (SGI and Sun excepted).
No hand optimizations were made for processor-specific vector units, except AltiVec for single precision on the Macs and BLAS-1 on the SGI Altix.
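
An elapsed (wall-clock) time of this kind can be measured by wrapping the command in a timer. The following is a minimal Python sketch; the helper name is hypothetical and the command shown is only a placeholder, not the command line given in the README.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (elapsed wall-clock seconds, exit status)."""
    start = time.perf_counter()
    completed = subprocess.run(argv)
    return time.perf_counter() - start, completed.returncode


if __name__ == "__main__":
    # Placeholder command: substitute the command line from the README.
    placeholder = [sys.executable, "-c", "print('replace with the speed test command')"]
    elapsed, status = time_command(placeholder)
    print(f"elapsed: {elapsed:.3f} s (exit status {status})")
```

Wall-clock timing includes time spent waiting on disk and memory, which is why it suits a comparison of whole systems rather than of CPUs alone.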

Conclusions:
The Apple Mac G5, Intel Xeon, and AMD Opteron systems are all contenders for a place in a lab that wants to process data quickly.* The choice among them would depend on your preferences in other software (e.g., Linux vs. Mac OS X), cost, etc.

I attribute these systems' speedup (relative to my old desktop hurin) primarily to their higher bandwidth to memory and large caches. An important secondary effect is their fast floating point units.

Single precision (i.e., 3dDeconvolve_f) can give a 50-100% speedup over double precision. However, you should make sure the calculations are stable when using the lower precision. Stability of the numerical solution can be tested by examining the matrix condition number and inverse average error, both output by the program when it starts up. For the sample problem, these output lines are



++ (1424x24) Matrix condition [X]:  32.1314

++ Matrix inverse average error = 1.24046e-07

These values are good -- single precision is about 7 decimal places of accuracy, and that is reflected in the inverse error. A value much larger (e.g., near 1.0e-4) would be cause for concern, since that would mean 3 decimal places of accuracy were being lost in the calculations. In such a case, to be cautious, you should re-run the analysis in double precision (3dDeconvolve) and compare the results.
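
The same kind of check can be imitated outside the program. The following is a minimal sketch in Python with NumPy, not the code 3dDeconvolve uses: it assumes that the "inverse average error" is the mean absolute deviation of (X^T X)(X^T X)^-1 from the identity matrix, and it uses a random stand-in design matrix of the same shape rather than the one from the speed test.

```python
import numpy as np


def stability_report(X, dtype):
    """Return (condition number of X, mean |A @ inv(A) - I|) with A = X^T X.

    The inverse and the residual are computed in the requested precision;
    the condition number is always evaluated in double precision.
    """
    Xd = np.asarray(X, dtype=dtype)
    cond = float(np.linalg.cond(Xd.astype(np.float64)))
    A = Xd.T @ Xd
    A_inv = np.linalg.inv(A)
    residual = A @ A_inv - np.eye(A.shape[0], dtype=dtype)
    return cond, float(np.mean(np.abs(residual)))


if __name__ == "__main__":
    # Hypothetical design matrix: 1424 time points, 24 regressors.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1424, 24))
    for dtype in (np.float32, np.float64):
        cond, err = stability_report(X, dtype)
        print(f"{np.dtype(dtype).name}: condition = {cond:.4g}, inverse error = {err:.3g}")
```

For a well-conditioned matrix the single precision error should sit near the single precision rounding level and the double precision error many orders of magnitude below it; an error that grows well beyond that level is the warning sign described above.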

Altix [01 Mar 2005]: We just got this system (the new 'hurin'). By modifying the matrix-vector multiply functions used in 3dDeconvolve to call the SGI BLAS-1 functions, we got a significant speedup over the default straight-out C code (execution times cut in half). This was with gcc; we are waiting for the Intel compiler. See the MCW results (below) for their results on a similar system.

Newly available are the speed test results from the Medical College of Wisconsin. These results include a 4 CPU Itanium-2 system, and a lot more.

Bigger Speed Test [March 2006]

For fast computers, the above speed test is just too easy a job. A longer-running speed test is now available at http://afni.nimh.nih.gov/pub/dist/data/speed_test2.tgz -- instructions on how to run the test are in the README file within. This job requires about 1 GB of RAM to run; it takes much longer than the original speed test above because it is a deconvolution calculation rather than a simple regression model; there are 770 time points and 116 regressor columns. Approximately 300 billion elementary numerical operations are carried out in this analysis. There are essentially no differences between the single and double precision results -- matrix inverse average errors of 5e-08 and 9e-16, respectively, are so small as to warrant no concern whatsoever.

System Description           1 CPU             2 CPUs            System Name    1 CPU             2 CPUs
                             double precision  double precision                 single precision  single precision

1.4 GHz SGI Altix (Itanium)  268.331 s         240.971 s         hurin          239.972 s         149.591 s
2 GHz iMac (Intel Duo)       461.900 s         307.577 s         fingol         337.871 s         194.883 s
2 GHz Macbook Pro            488.523 s         315.956 s         aule           345.342 s         204.407 s
IBM T60 (Intel Duo)          494.352 s         312.279 s         bodurka        301.515 s         193.480 s
2.7 GHz Mac (G5)             523.590 s         358.721 s         eomer          297.558 s         209.448 s
2 GHz Opteron 246 (2-way)    568.512 s         450.349 s         manwe          422.898 s         265.800 s
2.2 GHz Opteron 875 (8-way)  659.059 s         365.159 s         thalamos       434.617 s         243.562 s
2 GHz Mac (G5)               701.200 s         508.952 s         ilmarin        414.922 s         280.611 s
1 GHz Sun (Sparc)            1539.645 s        851.657 s         kampos         986.579 s         520.101 s
1.67 GHz G4 PowerBook        3081.265 s        ---               estel          1867.216 s        ---

The following table is for the few multi-CPU machines available to me. The jobs are all in single precision (3dDeconvolve_f):

System                    -jobs 1  -jobs 2  -jobs 3  -jobs 4  -jobs 5  -jobs 6  -jobs 7

hurin (Itanium 4-way)     234 s    145 s    133 s    124 s    -----    -----    -----
ilmarin (Mac Xeon 4-way)  245 s    140 s    110 s    98 s     -----    -----    -----
thalamos (Opteron 8-way)  435 s    243 s    216 s    208 s    199 s    191 s    190 s
kampos (Sparc 6-way)      970 s    520 s    373 s    310 s    -----    -----    -----

Interestingly, the 2 Linux (Itanium and Opteron CPUs) machines and the Mac machine (2 dual-core Xeon CPUs) in the above multi-CPU list don't show a lot of speedup past 2 CPUs, whereas the Solaris/Sparc machine does. I don't know if this effect is due to different memory management hardware or the OS software on these systems -- recall that '-jobs' is implemented using shared memory (for the input and output data arrays) between the subprocesses, so one can imagine some contention issues.
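
The scaling behavior can be made explicit by converting the multi-CPU timings into speedup (time with one job divided by time with N jobs) and parallel efficiency (speedup divided by N). The following is a minimal Python sketch using the times from the multi-CPU table; the dictionary and function names are hypothetical.

```python
# Elapsed times in seconds for -jobs 1, 2, 3, ... from the multi-CPU table.
TIMES = {
    "hurin (Itanium 4-way)": [234, 145, 133, 124],
    "ilmarin (Mac Xeon 4-way)": [245, 140, 110, 98],
    "thalamos (Opteron 8-way)": [435, 243, 216, 208, 199, 191, 190],
    "kampos (Sparc 6-way)": [970, 520, 373, 310],
}


def scaling(times):
    """Return a list of (jobs, speedup, efficiency) relative to -jobs 1."""
    base = times[0]
    return [(n, base / t, base / t / n) for n, t in enumerate(times, start=1)]


if __name__ == "__main__":
    for system, times in TIMES.items():
        print(system)
        for jobs, speedup, efficiency in scaling(times):
            print(f"  -jobs {jobs}: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

An efficiency near 100% means each added job contributes almost a full CPU's worth of work; a sharp drop after 2 jobs is the pattern described above for the Itanium, Opteron, and Xeon machines.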

Tentative conclusions*:

Good speedup on multi-CPU systems (past N=2) is hard to get.
The first Mac quad-core machines aren't too shabby, though. And cheaper than the marginally faster Itanium system.

* Nothing in this Web page should be taken as an endorsement or disparagement of any commercial product.
** N.B.: A change was made in the matrix-vector multiply routines in early 2006, unrolling the inner loop by 4 instead of by 2. This speeded up 3dDeconvolve by 5-20%, depending on the platform.

RWCox - 17 Oct 2006

Created by Robert Cox
Last modified 2006-10-16 13:20
