Tuesday, July 10, 2012

Welcome to the GPU supercomputing blog

Here you will find descriptions of several experimental GPU installations with a total computing power of >10 TFLOPS. They are part of Pawel Artymowicz's computer lab at UofT.

Please also see the description of a summer project in 2010, in which a team of UofT students constructed and programmed GPU mini-supercomputers using Nvidia cards and the CUDA language.

At present, 5 cudak machines exist, and all but those in brackets are currently online & operational.  Their motherboards are ASUS P6T6/P6T7 Workstation/Supercomputer, and the CPUs are Intel i7 920...960.
They're housed in Thermaltake high-tower cases and powered by Thermaltake and Silverstone 1.2-1.5 kW power supplies.
They run the following cards and achieve a rough FLOP count as follows, assuming a realistic, tested 1 TFLOPS/card (rather than the higher advertised Nvidia numbers):

[cudak1:          2 * gtx 280 watercooled, O/C, <1 TFLOPS] *)
cudak2:           2 * (gtx 480 1.5GB RAM), 2*~1 TFLOPS
cudak3:           3 * (gtx 480 1.5GB RAM), 3*~1 TFLOPS
cudak4:           3 * (gtx 480 1.5GB RAM), 3*~1 TFLOPS
cudak5:           4 * (gtx 580   3GB RAM),  4*~1 TFLOPS, (a.k.a. seti)

*) in prep. for installation of Tesla cards.

There is one 24-port InfiniBand 20 Gbps interconnect, currently not in use,
as we are still learning the capabilities of communication via the PCIe bus.
We use CUDA version 4.2 with Fedora and other Linux OSes.

It looks like we have ~12 TFLOPS available for computation.
Of course, Jeffrey Fung and I are not computing 24/7; most of the time is spent on code development.  Jeffrey mastered the art of CUDA C and recently ported a 3-D high-resolution hydrocode of the PPMLR ilk to our machines.
Previously all our CFD codes were 2-D.  It runs about as fast as ~40 CPU cores on cudak5/seti, at the St George campus of UofT. We also dabble in CUDA Fortran (PGI).

We're anxiously awaiting the new Tesla-like Kepler GPUs meant for computation, which should have improved bandwidth (nominally 320 GB/s), as we're usually bandwidth-limited. Unfortunately, the consumer Kepler (GK104) GPUs are not much different from what we already have. We're waiting for the GK110s (to be shown in Q4 2012). They'll have 2500+ CUDA cores, but only a 384-bit memory interface and too little bandwidth to satisfy us.
Oh well... small N-body programs, for instance statistical simulations of many planetary systems at once, will run at high speed. CFD will see only a very moderate improvement.

Tuesday, March 17, 2009

a welcome to two baby brothers


well, I am definitely not given to moderation. Actually, now that I've spent all the money I am :-)

two new machines:

* * *

cudak2:

very similar to cudak1, described below; you wouldn't tell them apart from the outside, so I'm not posting pictures today.
the same overall profile of a quiet, small, personal supercomputer. an office machine on which to learn cuda and develop applications, running 64-bit linux fedora 10 and cuda, as well as doing everything else you expect of a desktop workstation. (I crammed like 100 audio cd's into my rhythmbox files.)

major differences: X58 architecture with a 4-core (8-thread) Nehalem Intel chip (2.68 GHz, I believe). only slightly overclocked gpus: 720 cores on 3 nVidia GeForce GTX280 H2O cards (as opposed to H2OC in the original cudak1). different communication bandwidths via the PCI-e bus on a P6T6 workstation board by Asus. it has a multiplexer preventing the permanent degradation of the link to the middle graphics card seen in the 790i architecture. That doesn't mean you can get 3 x (5-6) GB/s flow concurrently to all 3 cards, but maybe that's an unlikely request in practice. the cpu simply cannot handle 3 cards at max speed. at least all three cards are now on a more level playing field. they reach the same peak bandwidth of 5.8 GB/s, which Asus calls "true 3-way SLI". whatever. these motherboards are rare though, I understand.


the new Intel cpu also includes its own memory controller, offloading tasks from the previously overworked, or at least overheating, northbridge. and that means: finally no noisy, microscopic northbridge fan on the P6T6 mobo.

a nice surprise: you open up a system monitor and there are 8 separate cpus reported and graphed (twice the number of actual cores on Nehalem thanks to hyperthreading).

at some point I may describe one technical mod which I made to thermally stabilize the motherboard. I reversed all of the Zalman's fans and am now blowing the hot air from the radiator outside, as God intended. higher coolant and card temps (still comfortably low vis-a-vis specs) but, importantly, lower component and air temps inside the box --> no thermal hang-ups of the motherboard.

* * *

cudak3:

this one is a step-brother, not a twin brother, of cudak1 a.k.a. the Z-machine. It's an air-cooled monster
in a Thermaltake Armour full-tower case, with watercooling applied to the Nehalem cpu but with 3 air-cooled GTX295 dual-gpu cards. So 6 gpus this time, not 3, although each is a little slower (main clock ~579 MHz, as opposed to the overclocked H2OC @ 680 MHz). That translates to a theoretical peak performance of 5+ TFLOPS. Benchmarks heat the gpus up to 91 C, and the system becomes a bit noisy.
oh well... the noise is a fair price to pay for the theoretical performance of a small campus-scale computing center. (I made a small mechanical modification to improve air flow, preemptively. So far I have not been able to crash this system thermally, but I haven't tried extra hard: "just" 6 or 8 benchmarks running at the same time..)


total system cost of cudak2 and 3 was on the order of $5.5k CAD each; that is currently something like $4.4k USD. more info later.

a big can of Ooops.

the Z-machine described previously (cudak1) had an accident after just 2 months of duty.
its flow meter, that little paddle wheel which sends signals to the controlling pcb logic of the Zalman, got stuck unobserved at night. the cooling system panicked and switched itself off instead of pumping even more coolant. if you wonder what could happen next.. do the words chernobyl & tsunami mean anything to you? Zalman apparently handles emergencies badly and uses badly designed, faulty flow meters. what a shame, a very nice box as I said. in an emergency it should absolutely switch off and protect the computer, not just itself.

anyway, I won't spend time describing this accident. my doctor says it's bad for my blood pressure. a reconfigured system is running again. limping a bit but alive. I cleaned the !@#$&^% flow meter.
:-(

Sunday, December 21, 2008

the road ahead


my plans for using the ZMachine for scientific work include:

  • porting some serious hydrodynamics to cuda. maybe PPM or symmetric high-order Kurganov schemes. it's not clear if I should use fortran bindings or switch to c/c++. (fortran would be soo cool and easy :-)

  • run 3D, variable resolution simulations of disk-planet interaction. study formation of extrasolar planets.

  • learn openGL and use it for visualization, especially of 3D flow lines.

  • create particle codes to study dust disks in extrasolar planetary systems

  • etc.






some of the challenges:

  • porting some serious hydrodynamics to cuda. :-)

  • using those 16 kB of shared mem skillfully (a minimal sketch follows this list)

  • using single precision as much as possible: double precision isn't a strong suit of cuda.

  • multi-gpu computing

  • clustering with MPI (no urge to do it right now! actually it may be a bad idea altogether)

  • prevailing over the ruthless Amdahl's law
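
to make the shared-memory challenge concrete, here is a minimal sketch of the tiling pattern that the 16 kB of shared memory per multiprocessor forces on you. it is just an illustrative 1-D smoothing kernel of my own, not any of the codes mentioned above; single precision throughout, and n is assumed to be a multiple of the tile size for brevity.

// minimal shared-memory tiling sketch: each block stages a tile of the input
// (plus halo cells) in on-chip shared memory, then every thread reads its
// neighbours from there instead of from slow global memory.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE   256   // threads per block; (TILE+2)*4 B ~ 1 kB of the 16 kB budget
#define RADIUS 1

__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float s[TILE + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + RADIUS;                    // local index inside the tile

    if (g < n) s[l] = in[g];                         // main body of the tile
    if (threadIdx.x < RADIUS) {                      // halo cells at both ends
        s[l - RADIUS] = (g >= RADIUS)  ? in[g - RADIUS] : in[g];
        s[l + TILE]   = (g + TILE < n) ? in[g + TILE]   : in[n - 1];
    }
    __syncthreads();                                 // tile is now complete

    if (g < n)
        out[g] = (s[l - 1] + s[l] + s[l + 1]) / 3.0f;   // 3-point average
}

int main()
{
    const int n = 1 << 20;                           // multiple of TILE
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    // ... fill d_in via cudaMemcpy ...
    smooth<<<n / TILE, TILE>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}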



cudak_1 (as I now tend to call the machine; cudak = eccentric/geek in Polish) may soon have siblings cudak_2 & cudak_3, using i7 Nehalems and X58 mobos (asus P6T6). I'm currently (Jan 09) waiting for parts, incl. three evga gtx 295 cards.
watch this space, it should be interesting & v. challenging - 6 gpus in one cudak (read: heterogeneous computing).

so far so good

the ZMachine is a hardware success! (excuse my enthusiasm - this is the first computer I've built completely from scratch.) In a midtower with a 790i mini-atx format motherboard, I have 3 powerful, o.c.'d gtx280 GPUs (680 MHz core clock, up from the 602 MHz stock). they pack 720 cores, waiting to be used in parallel. the quad-core cpu keeps the gpus busy and isn't choking, handling the workload ok.

hydraulic assembly was fun, and it turns out you CAN achieve zero leaks ;-) Zalman is a bit cramped inside, but after several hours you can figure out how to route all those power connectors and tubes. it's really a high-end box made of 4-5 mm aluminum plates and doors. it looks and sounds good.

Zalman's 3 l/min pump (max rating, in practice half that value with all those waterblocks) is doing fine. in fact, when run on automatic setting, the cooling system never goes beyond the minimum fan rpm (1000) and flow rate (~1 l/min), since the cpu and gpus aren't able to heat the coolant to more than about 46 C (I think Zalman alarms and does strange things like shutting off above 60 C). This is one of the reasons the system (if not running cuda) is very quiet. Max load, producing up to ~1000 W of heat, causes my northbridge fan to go to audible levels, since the south- and northbridge aren't watercooled and easily heat up above 80 C.

the new x58 chipset (for i7 Nehalem processors, socket LGA1366) will not have that problem. nevertheless, I'm waiting for updated waterblock mounts (some are available already) and a larger number of PCIe lanes on those new architectures. the PCIe bandwidth seems to be a big problem with X58/i7. for all I know, they don't yet offer a 3 x16 PCIe configuration like my evga mobo does, which degrades just one of the three slots to the 1.0 standard, equivalent to an x8 slot.
the X58 manufacturers are apparently under no pressure to widen PCIe, since SLI/CrossFire takes over from the PCI Express bus the duty of gluing the cards together. but why would they go back and reduce the PCIe throughput compared to the 790i? beats me, unless this is a limitation of the QPI sections on the cpu right now.

you may notice that I did not mention SLI. Unfortunately, SLI and CUDA aren't friends yet :-( But somebody wanting to use 3-way SLI could do it in the Zalman LQ1000 box. despite occasional thermal crash problems I'm rather happy with the system tests so far. those problems arise only when the machine is overloaded with computations (multiple large simulations per gpu), and are likely due to very busy north/southbridge on a 790i motherboard (this should not be a factor on newer i7/socket LGA1366 boards, as mentioned - you won't have a fan on the northbridge).

the ZMachine is certainly pushing the limits, trying to be small, powerful and quiet at the same time. it seems that it will handle scientific simulations very well, as they tend to be similar to the simple fluid and particle codes included as examples in the sdk. by the way, those examples are really useful. if I wanted to run the kind of test I described on my clusters I'd need dozens to hundreds of cpus. I can't give you an estimate of speedup yet; I only know
that running the sdk's examples in emulated mode, where you compile with flags forcing execution on the cpu, is not a fair comparison, because the programs would in reality be run very differently on cpus.

* * *

of course, since nvidia is doubling the graphics cards' transistors and performance so rapidly, and the first to appear are the air-cooled cards, there is a valid question as to water vs. air cooling. in a server room, Tesla rackmounted machines may be the best choice (not necessarily the cheapest). I myself may soon build an air-cooled production-run machine to be kept away from people's desks, based on the upcoming gtx 295 gpu. it will certainly be much cheaper, about half the price of the existing gtx 280 cards. [about 1.66 of my cards are needed to match the planned performance of one new dual 295, so I had to pay $850 (CAD) * 1.66 ~ $1400 (CAD) for performance I will be able to buy for under $500 (USD) or $600 (CAD).] but two 295s will be much louder than my current setup! moreover, gpus are only 40% or so of my system's price..

if the situation with mobos continues for more than a year, CUDA will make sense on those fast x16 slots only, each of which will have to be shared by two gpus (decreasing the communication bandwidth we have now?). we will have at most 4 gpus; power consumption and air cooling will restrict the clock speeds (e.g. ~580 MHz on the upcoming gtx295 vs. 680 MHz on my 280s). until something changes radically regarding PCIe bus designs, or the cooling/overclocking of gpus, in the near future we're going to see slow progress in CUDA hardware, as the 4 new gpus on 2 cards will not be much faster than the current 3 water-cooled gpus.

so enjoy the moment and, in theory, get onto the list of the 500 fastest supercomputers (whose tail end sits at some 10 TFLOPS) for just $20k, right now!

testing, testing, 123...

The CUDA SDK provides a nice set of 20+ examples, some of which can also serve as benchmarks or as constant heavy loads (run in many copies on a given gpu).

useful options: add --device=0 to the command line in order to run most example programs on gpu 0 (the main one), --device=1 on gpu no. 1 (slot no. 1, the furthest from the cpu on my mobo), and so on.
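
for my own programs (rather than the SDK examples), the equivalent of the --device flag is a couple of runtime calls made before any allocation; a minimal sketch, with the command-line parsing being just my illustration:

// pick a gpu by index before doing anything else on it
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int count = 0;
    cudaGetDeviceCount(&count);                  // how many CUDA gpus are visible
    int dev = (argc > 1) ? atoi(argv[1]) : 0;    // e.g. ./a.out 1  -> use gpu 1
    if (dev < 0 || dev >= count) dev = 0;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("using device %d of %d: %s, %d multiprocessors\n",
           dev, count, prop.name, prop.multiProcessorCount);

    cudaSetDevice(dev);                          // all later allocations & kernels go here
    // ... cudaMalloc, kernel launches, etc. ...
    return 0;
}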

bandwidth tests


~/cuda/bin/linux/release$ bandwidthTest --device={0,1,2,all} --memory={pageable,pinned}

very curiously, I once got this funny result from a misconfigured slot 3:
---------------
the size of the data packet transferred is 33 MB, about the minimum for best transfer efficiency.
Host to Device Bandwidth for pinned memory
gpu 0: from cpu 5.2 GB/s PCIe: 1x16 (2.0) theor: <8GB/s
gpu 1: from cpu 5.7 GB/s PCIe: 1x16 (2.0) theor: <8GB/s
gpu 2: from cpu 0.79 GB/s PCIe: 1x4 (1.0)* theor: <1GB/s
--
* = as shown by nvidia-settings utility
--------------
inside the cards, the bandwidth is always as it should be: 128 GB/s or so!

the bandwidth between the cpu and the gpu in slot 3 is intriguing. I have yet to understand why my third slot was configured as PCIe x4, not x16, during this test!

anyway, normal readings are now better. to achieve them, I have overclocked the PCIe buses: slots 1, 2 and 3, from the automatic setting of 100 MHz to 115, 115, and 120 MHz respectively, and the SPP-MCP communication speed from the automatic 200 MHz to 240 MHz. I also changed the PCI latency timer from 128 to 100 CLK.

-------------
the size of the data packet transferred is 33 MB, about the minimum for best transfer efficiency.
Host to Device Bandwidth for pinned memory
gpu 0: from cpu 5.874 GB/s, to cpu 5.875; PCIe: 1x16 (2.0) theor: <8GB/s
gpu 1: from cpu 5.875 GB/s, to cpu 5.876; PCIe: 1x16 (2.0) theor: <8GB/s
gpu 2: from cpu 2.075 GB/s, to cpu 2.017; PCIe: 1x16 (1.0)* theor: <4GB/s
______________
gpu 0-2 cumulatively: from cpu 13.8 GB/s, to cpu 13.8 GB/s, internally 383 GB/s.
--
* = as shown by nvidia-settings utility
--------------
I must stress that, unless the benchmark is cheating, the throughput of the 3 cards is additive, i.e., they transfer data w/o mutual interference. the bandwidthTest shows a cumulative throughput of 13.8 GB/s each way. [edit: yes, I think the benchmark is cheating.. but I wasn't able to run a concurrent bandwidth test with 3 cards.. :-( maybe the cpu can't handle 14 GB/s concurrently.]
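
for reference, the essence of such a measurement fits in a few lines; the sketch below is not the SDK's bandwidthTest, just my stripped-down equivalent for the pinned-memory, host-to-device case with the same 33 MB packet size.

// crude host-to-device bandwidth measurement with pinned memory
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 33 * (1 << 20);         // 33 MB packets, as above
    const int reps = 20;
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);       // pinned (page-locked) host memory
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);  cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);           // total time in milliseconds
    printf("host->device, pinned: %.2f GB/s\n",
           (double)bytes * reps / (ms * 1.0e-3) / 1.0e9);

    cudaEventDestroy(t0);  cudaEventDestroy(t1);
    cudaFreeHost(h_buf);   cudaFree(d_buf);
    return 0;
}

(replacing cudaMallocHost with plain malloc gives the pageable-memory figures, which are noticeably lower.)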

it is revealing to compare the EVGA 790i nForce SLI FTW with the X58 motherboards from ASUS (P6T X58 deluxe) and Gigabyte (GA-EX58-extreme or UD5). The latter have 3 physical x16 slots just like the EVGA but, unlike it, do not provide enough bandwidth to use them simultaneously: you can only get the full (PCIe 2.0) x16 throughput on the first two slots, and if you want to have a 3rd card, the second slot drops to x8. Thus, according to their documentation, they are WORSE than the EVGA board/chipset/cpu. well, it's sad. The ASUS Striker II Extreme is a socket 775 board similar to my EVGA, and from the documentation it seems to have very similar PCIe capabilities: two slots at full (2.0) x16 speed, one middle slot at (1.0) x16 speed.

thermal tests


the tests described above run ok. however, these workloads and PCI clock settings produce conditions close to a thermal instability of the motherboard.



temperatures depend a lot on the application. I ran a fluidsGL simulation of 'stable fluids'.
It's a Fourier-based, incompressible, implicit hydrocode, in which I modified the source code to use a much larger array of 2048^2 cells and a much lower viscosity coefficient (which doesn't affect the speed). iterations on this enlarged grid, using calls to the cufft FFT library, yielded the following stabilized temps: T ~69/58 C (gpu0, chip/card), 66/55 C (gpu1), 64/54 C (gpu2).

Zalman system was showing 43/35 C (coolant/box temperature).
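
to give an idea of the spectral part of that workload, here is a minimal sketch of the kind of 2-D cufft calls such a solver makes each step; this is not the fluidsGL source, just my stand-in at the 2048^2 size quoted above (build with nvcc ... -lcufft).

// one forward + inverse 2-D complex-to-complex FFT on a 2048^2 grid
#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 2048;                          // grid resolution used above
    cufftComplex *d_field;
    cudaMalloc((void**)&d_field, sizeof(cufftComplex) * N * N);

    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_C2C);         // single-precision 2-D plan

    // to spectral space, (diffusion/projection would act here), and back --
    // in place; note that cufft does not normalize the inverse transform
    cufftExecC2C(plan, d_field, d_field, CUFFT_FORWARD);
    cufftExecC2C(plan, d_field, d_field, CUFFT_INVERSE);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_field);
    return 0;
}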

it is quite easy to thermally destabilize the motherboard by increasing the number of tasks run on each gpu from one to a few. the temperatures cited above are close to the maximum for long-term runs. everything depends on the type of application, of course; for instance, running many small fluid grids instead of one equivalent large grid tends to increase the demand on the SPP/MCP and raise temps. ideally, I should be running just one big simulation per gpu, or even one simulation on 3 gpus, with no intensive output to the monitor via gpu0, so I should(?) be fine.

I guess the thermal issues will stay with us for the foreseeable future, whether we have 65nm, 55nm or, one day, 25nm technology, since we're always going to push the performance to the limit.

computing power tests


the SDK 2.1 tests were done successfully.

the fluidsGL test was running at 17 fps on one gpu, or ~10 fps in 3 copies on 3 gpus, at resolution (2K)^2. at the standard resolution of 512^2, the frame rate was 266 fps on one gpu, or 180+180+90 (=450 fps combined) on 3 gpus. let me comment on this. the algorithm, in addition to lots of interpolation (to perform the advection step), also does at least two fft transforms of the 2-D, multi-variable data. 200 timesteps per second is thus equivalent to almost 10^3 fft transforms per second on a 512^2 array. in several seconds, the simulated fluid can travel across the computational grid. a CFD hydrocode running on a cpu, admittedly different but not necessarily more computationally intensive, would easily take a full coffee break to accomplish this.

the nbody test showed that there is no problem with sustaining a combined processing power (e.g., the sum of FLOPs in many concurrently running nbody calculations) equal to 1.53 TFLOPS. the theoretical sum is closer to 3 TFLOPS. a 30000-particle N-body system oscillates, turns and evolves on the screen in a matter of seconds. (rendered by openGL in a game-like way, somewhat nicer than what we normally do in science :-). btw, somewhere here I said that the Zalman cooling system never goes into high gear. well, that's no longer true: in this 3 x nbody test it did go to 1400 rpm fans and a 1.5 l/min flow rate. but check out the frame rate of those N-body simulations: about 20 fps each, which means smooth real-time video during which each simulation computes N^2 ~ 10^9 gravitational interactions per frame. to add to the insanity of this calculation, by dragging
the mouse over the screen you can turn the simulated 3D objects in space to get a different view, whether the simulation is running or paused.
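
the arithmetic behind those numbers is just the all-pairs force loop; the sketch below is my bare-bones version of it, not the SDK's nbody code, which additionally streams the particle list through shared memory to reach its high FLOP rates. positions are float4 (x, y, z, mass), single precision, with a softening length eps2 as an assumed parameter.

// brute-force O(N^2) gravitational acceleration, G = 1 units
#include <cuda_runtime.h>

__global__ void accel(const float4 *pos, float3 *acc, int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {                // all pairs: the N^2 ~ 1e9 part
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;   // softened distance^2
        float inv_r = rsqrtf(r2);
        float s = pj.w * inv_r * inv_r * inv_r;  // pj.w holds the mass
        a.x += s * dx;  a.y += s * dy;  a.z += s * dz;
    }
    acc[i] = a;
}

int main()
{
    const int n = 30000;                         // as in the 3 x nbody test above
    float4 *d_pos;  float3 *d_acc;
    cudaMalloc((void**)&d_pos, n * sizeof(float4));
    cudaMalloc((void**)&d_acc, n * sizeof(float3));
    // ... upload initial conditions with cudaMemcpy ...
    accel<<<(n + 255) / 256, 256>>>(d_pos, d_acc, n, 1.0e-4f);
    cudaDeviceSynchronize();
    cudaFree(d_pos);  cudaFree(d_acc);
    return 0;
}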



I have to find the actual top FLOPs in my own applications this winter or spring (2009). I believe they may be higher than 1.53 TFLOPS, because I will try not to give device 0 as much graphics to display as the examples do, forcing high frame rates. this will also help the Zalman remove heat from the motherboard chips such as the northbridge/southbridge.

it's alive!





perhaps the most important issue with a 3-SLI-like configuration is thermal stability, not of the cards, which, as both the Zalman and the nvidia utility nvidia-settings show, are relatively cool (under maximum strain they heat up to the mid-70s C, where air-cooled versions would go to 100 C).
likewise, there is never a danger of overheating the cpu, if watercooled.
rather, the issue is that radiators supposedly work better when drawing air into the box.
in addition, that's a natural direction to support the power unit and the 12 cm box fan, both trying to blow air out the back side of the box. so the radiator-WARMED air is blown into the box, not outside. this of course heats and cools the motherboard at the same time: it warms the board as a whole but efficiently cools its small chips, which heat up during any intensive use. unfortunately but predictably, at maximum computational load, especially if the system is burdened with running multiple programs per physical card, something has to give. most probably the southbridge or northbridge overheats and the computer hangs. I wish there were some solution that would allow the big fan to cool the radiator to the outside while still providing a good flow of air near the motherboard..


however, many combinations of workload that fully utilize all the resources (gpu, cpu, memory) are stable. let's take a closer look...