Sunday, December 21, 2008

testing, testing, 123...

The CUDA SDK provides a nice set of 20+ examples, some of which can also serve as benchmarks or as constant heavy loads (when run in many copies on a given gpu).

useful options: add --device=0 to the command line to run most example programs on gpu 0 (the main one), --device=1 to run on gpu nr. 1 (slot nr 1, the furthest from the cpu on my mobo), and so on.

bandwidth tests

~/cuda/bin/linux/release$ bandwidthTest --device={0,1,2,all} --memory={pageable,pinned}
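The braces in the command above list alternatives; they are not meant to be typed literally (a shell would brace-expand them into several arguments at once). As a minimal sketch, here is one way to spell out every combination as its own command line. The binary name and the two flags come from the line above; the function name `bandwidth_commands` is only an illustration, and the script prints the commands rather than running them.

```python
from itertools import product


def bandwidth_commands(devices=("0", "1", "2", "all"),
                       memory_modes=("pageable", "pinned"),
                       binary="bandwidthTest"):
    """Return one argument list per (device, memory mode) combination.

    `binary` is assumed to be on the PATH or in the current directory,
    e.g. when run from ~/cuda/bin/linux/release.
    """
    return [[binary, f"--device={dev}", f"--memory={mem}"]
            for dev, mem in product(devices, memory_modes)]


if __name__ == "__main__":
    for cmd in bandwidth_commands():
        # Printing only; hand cmd to subprocess.run() to actually launch it.
        print(" ".join(cmd))
```

Running the combinations one at a time keeps each reading tied to a single card and a single memory mode.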

very curiously, I once got this funny, misconfigured slot 3:
the size of data packet transferred is 33MB, about a minimum for best efficiency of transfers.
Host to Device Bandwidth for pinned memory
gpu 0: from cpu 5.2 GB/s PCIe: 1x16 (2.0) theor: <8GB/s
gpu 1: from cpu 5.7 GB/s PCIe: 1x16 (2.0) theor: <8GB/s
gpu 2: from cpu 0.79 GB/s PCIe: 1x4 (1.0)* theor: <1GB/s
* = as shown by nvidia-settings utility
inside the cards, the bandwidth is always as it should be: 128 GB/s or so!

the bandwidth between the cpu and the gpu in slot 3 is intriguing. I have yet to understand why my third slot was configured as PCIe x4 rather than x16 during this test!
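The "theor" column in the table follows from the per-lane PCIe rates: about 0.25 GB/s per lane in each direction for PCIe 1.0 and 0.5 GB/s for 2.0, after the 8b/10b encoding overhead. A small sketch of that arithmetic, with the measured values copied from the table; the function names are made up for illustration.

```python
# Usable per-lane rate in one direction, after 8b/10b encoding overhead.
GB_PER_S_PER_LANE = {"1.0": 0.25, "2.0": 0.5}


def pcie_peak_gb_s(lanes, generation):
    """Theoretical one-way bandwidth of a PCIe link in GB/s."""
    return lanes * GB_PER_S_PER_LANE[generation]


def efficiency(measured_gb_s, lanes, generation):
    """Fraction of the theoretical peak that a measurement reaches."""
    return measured_gb_s / pcie_peak_gb_s(lanes, generation)


if __name__ == "__main__":
    # (label, measured GB/s from cpu, lanes, PCIe generation), from the table
    readings = [("gpu 0", 5.2, 16, "2.0"),
                ("gpu 1", 5.7, 16, "2.0"),
                ("gpu 2", 0.79, 4, "1.0")]
    for label, measured, lanes, gen in readings:
        peak = pcie_peak_gb_s(lanes, gen)
        eff = efficiency(measured, lanes, gen)
        print(f"{label}: {measured} of {peak:.0f} GB/s ({eff:.0%} of peak)")
```

Seen this way, the x4 (1.0) slot is not slow for what it is: it sits closer to its own ceiling than the two x16 slots do to theirs.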

anyway, normal readings are now better. to achieve them, I overclocked the PCI buses of slots 1, 2 and 3 from the automatic setting of 100 MHz to 115, 115 and 120 MHz, respectively, and the SPP-MCP comm speed from the automatic 200 MHz to 240 MHz. I also changed the PCI latency timer from 128 to 100 CLK.

the size of data packet transferred is 33MB, about a minimum for best efficiency of transfers.
Host to Device (from cpu) and Device to Host (to cpu) Bandwidth for pinned memory
gpu 0: from cpu 5.874 GB/s, to cpu 5.875; PCIe: 1x16 (2.0) theor: <8GB/s
gpu 1: from cpu 5.875 GB/s, to cpu 5.876; PCIe: 1x16 (2.0) theor: <8GB/s
gpu 2: from cpu 2.075 GB/s, to cpu 2.017; PCIe: 1x16 (1.0)* theor: <4GB/s
gpu 0-2 cumulatively: from cpu 13.8 GB/s, to cpu 13.8 GB/s, internally 383 GB/s.
* = as shown by nvidia-settings utility
I must stress that, unless the benchmark is cheating, the throughput of the 3 cards is additive, i.e., they transfer data w/o mutual interference. the bandwidthTest shows a cumulative throughput of 13.8 GB/s each way. [edit: yes, I think the benchmark is cheating.. but I wasn't able to run a concurrent bandwidth test with 3 cards.. :-( maybe the cpu can't handle 14 GB/s concurrently.]
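One quick sanity check is to compare the cumulative figure with the plain sum of the per-card readings in the table. This is only a sketch of the arithmetic, and the variable and function names are illustrative. Agreement with the sum does not by itself show whether the transfers ran concurrently or were just added up after the fact.

```python
def combined_gb_s(per_card_rates):
    """Sum of per-card bandwidths, i.e. the total if transfers do not interfere."""
    return sum(per_card_rates)


# Per-card pinned-memory readings in GB/s, copied from the table.
from_cpu = [5.874, 5.875, 2.075]
to_cpu = [5.875, 5.876, 2.017]

if __name__ == "__main__":
    print(f"from cpu: {combined_gb_s(from_cpu):.3f} GB/s")
    print(f"to cpu:   {combined_gb_s(to_cpu):.3f} GB/s")
```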

it is revealing to compare the EVGA 790i nForce SLI FTW with the X58 motherboards from ASUS (P6T X58 deluxe) and Gigabyte (GA-EX58-extreme or UD5). The latter have 3 physical x16 slots just like the EVGA but, unlike it, do not provide enough bandwidth to use them all simultaneously: you get full (PCIe 2.0) x16 throughput only on the first two slots, and if you want a 3rd card, the second slot drops to x8. Thus, according to their documentation, they are WORSE in this respect than the EVGA board/chipset/cpu. well, it's sad. The ASUS Striker II Extreme is a socket 775 board similar to my EVGA, and from the documentation it seems to have very similar PCIe capabilities: two slots at full (2.0) x16 speed and one middle slot at (1.0) x16 speed.

thermal tests

the tests described above ran ok. however, these workloads and PCI clock settings produce conditions close to thermal instability of the motherboard.

temperatures depend a lot on applications. I ran a fluidsGL simulation of 'stable fluids'.
It's a Fourier-based, incompressible, implicit hydrocode; I modified its source code to use a much larger array of 2048^2 cells and a much lower viscosity coefficient (which doesn't affect the speed). iterations on this enlarged grid, using calls to the cufft FFT library, yielded the following stabilized temperatures: T ~ 69/58 C (gpu0, chip/card), 66/55 C (gpu1), 64/54 C (gpu2).

the Zalman system was showing 43/35 C (coolant/box temperature).

it is quite easy to thermally destabilize the motherboard by increasing the number of tasks run on each gpu from one to a few. the temperatures cited above are close to the maximum for long-term runs. everything depends on the type of application, of course: for instance, running many small fluid grids instead of one equivalent large grid tends to increase the demand on the SPP/MCP and raise temps. ideally, I should be running just one big simulation per gpu, or even one simulation on 3 gpus, with no intensive output to the monitor via gpu0, so I should(?) be fine.

I guess the thermal issues will stay with us for the foreseeable future, whether we have 65nm, 55nm or, one day, 25nm technology, since we're always going to push the performance to the limit.

computing power tests

the SDK 2.1 tests were done successfully.

the fluidsGL test was running at 17 fps on one gpu, or ~10 fps in 3 copies on 3 gpus, at resolution (2K)^2. at the standard resolution of 512^2, the frame rate was 266 fps on one gpu, or 180+180+90 fps (=450 fps combined) on 3 gpus. let me comment on this. the algorithm, in addition to lots of interpolation (to perform the advection step), also does at least two FFTs of the 2-D, multi-variable data. 200 timesteps per second is thus equivalent to almost 10^3 FFTs per second on a 512^2 array. in several seconds, the simulated fluid can travel across the computational grid. CFD hydrocodes run on a cpu, admittedly different but not necessarily more computationally intensive, would easily take a full coffee break to accomplish this.
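The numbers in this paragraph hang together under simple scaling, as the sketch below shows. Two things in it are assumptions rather than measurements: that the cost per frame grows linearly with the number of cells (ignoring the log factor of the FFT), and that each timestep does 4 FFTs (a forward and an inverse transform for each of two velocity components). The function names are illustrative.

```python
def cells(n):
    """Number of cells in an n x n grid."""
    return n * n


def scaled_fps(fps_small, n_small, n_large):
    """Frame rate expected on the larger grid if cost is linear in cell count."""
    return fps_small * cells(n_small) / cells(n_large)


def ffts_per_second(steps_per_second, ffts_per_step):
    """Total FFT count per second for a given timestep rate."""
    return steps_per_second * ffts_per_step


if __name__ == "__main__":
    # 2048^2 has 16x the cells of 512^2; compare with the measured 17 fps.
    print(f"expected fps at 2048^2: {scaled_fps(266, 512, 2048):.1f}")
    # ffts_per_step=4 is an assumption, see the lead-in.
    print(f"FFTs per second at 200 steps/s: {ffts_per_second(200, 4)}")
    print(f"combined fps on 3 gpus: {sum([180, 180, 90])}")
```

The linear estimate lands close to the measured single-gpu frame rate on the enlarged grid, which suggests the large run is not losing much to overheads.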

the nbody test showed that there is no problem with having a sustained combined processing power (e.g., the sum of FLOPs in many concurrently running nbody calculations) equal to 1.53 TFLOP. the theoretical sum is closer to 3 TFLOPs. a 30000-particle N-body system oscillates, turns and evolves on the screen in a matter of seconds. (rendered by openGL in a game-like way, somewhat nicer than what we normally do in science :-). btw, somewhere here I said that the Zalman cooling system never goes into high gear. well, that's no longer true: in this 3 x nbody test it did go to 1400 rpm fans and 1.5 l/min flow rate. but check out the frame rate of those N-body simulations: about 20 fps each, which means smooth real-time video during which each simulation computes N^2 ~ 10^9 gravitational interactions per frame. to add to the insanity of this calculation, by dragging the mouse over the screen you can turn the simulated 3D objects in space to get a different view, whether the simulation is running or paused.
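To put the interaction count in numbers, here is a short sketch using only the figures quoted above (30000 particles, about 20 fps, 3 simulations). It counts all N^2 pairs, as the text does; the function names are illustrative, and no FLOP-per-interaction figure is assumed.

```python
def pair_interactions(n_bodies):
    """All-pairs interaction count per frame, counted as N^2 as in the text."""
    return n_bodies * n_bodies


def interactions_per_second(n_bodies, fps, simulations=1):
    """Interactions per second across one or more concurrent simulations."""
    return pair_interactions(n_bodies) * fps * simulations


if __name__ == "__main__":
    n, fps = 30000, 20
    print(f"per frame:             {pair_interactions(n):.1e}")
    print(f"per second, 1 sim:     {interactions_per_second(n, fps):.1e}")
    print(f"per second, 3 sims:    {interactions_per_second(n, fps, 3):.1e}")
```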

I have to find the actual top FLOPs in my own applications this winter or spring (2009). I believe they may be higher than 1.53 TFLOP, because I will try not to give device 0 as much graphics to display as the examples do with their forced high frame rates. this will also help the Zalman remove heat from the motherboard chips such as the northbridge/southbridge.
