Pawel Artymowicz lab: GPU supercomputing & CUDA: 2008

Sunday, December 21, 2008

the road ahead

my plans for using the ZMachine for scientific work include:

porting some serious hydrodynamics to cuda. maybe PPM or symmetric high-order Kurganov schemes. it's not clear if I should use fortran bindings or switch to c/c++. (fortran would be soo cool and easy :-)

run 3D, variable resolution simulations of disk-planet interaction. study formation of extrasolar planets.

learn openGL and use it for visualization, especially of 3D flow lines.

create particle codes to study dust disks in extrasolar planetary systems

etc.

some of the challenges:

porting some serious hydrodynamics to cuda. :-)

using those 16kB of shared mem skillfully

using single precision as much as possible: double precision isn't a strong suit of cuda.

multi-gpu computing

clustering with MPI (no urge to do it right now! actually it may be a bad idea altogether)

prevailing over the ruthless Amdahl's law
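to see why Amdahl's law is called ruthless, here is a minimal python sketch of the textbook formula, speedup = 1 / ((1 - p) + p/n), where p is the fraction of the runtime that parallelizes and n is the number of cores. the function name and the sample values of p are made up for illustration and are not measurements from any real code; 720 is simply the combined core count of three gtx280 cards.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime scales with n_workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    # hypothetical parallel fractions, chosen only to show the trend
    for p in (0.90, 0.99, 0.999):
        s = amdahl_speedup(p, 720)  # 720 = 3 x 240 cores
        print(f"parallel fraction {p}: speedup on 720 cores = {s:.1f}x")
```

even with 99% of the work parallelized, the serial 1% keeps the speedup below 100x no matter how many cores are added.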

cudak_1 (as I now tend to call the machine, cudak = eccentric/geek in Polish) may soon have siblings cudak_2 & cudak_3, using i7 nehalems, and x58 mo-bo's asus P6T6. I'm currently (Jan 09) waiting for parts, incl. three evga gtx 295.
watch this space, it should be interesting & v. challenging - 6 gpus in one cudak (read: heterogeneous computing).

so far so good

the ZMachine is a hardware success! (excuse my enthusiasm - this is the first computer I've built completely from scratch.) In a midtower with 790i mini-atx format motherboard, I have 3 powerful, o.c'd gtx280 GPUs (680 MHz base clock > 602 MHz stock). they pack 720 cores, waiting to be used in parallel. the quad-core cpu keeps the gpus busy and isn't choking, handling the workload ok.

hydraulic assembly was fun, and it turns out you CAN achieve zero leaks ;-) Zalman is a bit cramped inside, but after several hours you can figure out how to route all those power connectors and tubes. it's really a high-end box made of 4-5 mm aluminum plates and doors. it looks and sounds good.

Zalman's 3 l/min pump (max rating, in practice half that value with all those waterblocks) is doing fine. in fact, when run on automatic setting, the cooling system never goes beyond the minimum fan rpm (1000) and flow rate (~1 l/min), since the cpu and gpus aren't able to heat the coolant to more than about 46 C (I think Zalman alarms and does strange things like shutting off above 60 C). This is one of the reasons the system (if not running cuda) is very quiet. Max load, producing up to ~1000 W of heat, causes my northbridge fan to go to audible levels, since the south- and northbridge aren't watercooled and easily heat up above 80 C.
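a rough feel for these numbers comes from the steady-state heat balance, delta_T = P / (m_dot * c), i.e. how much the coolant warms up in one pass through the waterblocks. the sketch below assumes the coolant behaves like plain water (density ~1 kg/l, specific heat ~4186 J/(kg K)); the heat loads in the loop are hypothetical, since only part of the ~1000 W actually ends up in the coolant.

```python
WATER_DENSITY_KG_PER_L = 1.0   # assumption: coolant is mostly water
WATER_SPECIFIC_HEAT = 4186.0   # J/(kg K), approximate value for water


def coolant_temp_rise(heat_w: float, flow_l_per_min: float) -> float:
    """Temperature rise (K) of the coolant in one pass, for a given heat load and flow rate."""
    mass_flow_kg_per_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return heat_w / (mass_flow_kg_per_s * WATER_SPECIFIC_HEAT)


if __name__ == "__main__":
    for heat_w in (250.0, 500.0, 1000.0):   # hypothetical heat loads into the coolant
        for flow in (1.0, 1.5):             # l/min, the two flow rates seen on the gauge
            rise = coolant_temp_rise(heat_w, flow)
            print(f"{heat_w:6.0f} W at {flow} l/min -> coolant warms by {rise:.1f} K per pass")
```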

the new x58 chipset (for i7 nehalem processors, socket LGA1366) will not have that problem. nevertheless, I'm waiting for updated waterblock mounts (some are available already) and a larger number of PCIe lanes on those new architectures. the PCIe bandwidth seems to be a big problem with X58/i7. for all I know, they don't yet offer a 3 x16 PCIe configuration like my evga mobo does, which degrades just one of the three slots to the 1.0 standard, equivalent to a 2.0 x8 slot.
the X58 manufacturers are apparently under no pressure to widen PCIe, since the SLI/crossfire takes over from the pci express bus the duty of gluing the cards together. but why would they go back and reduce the PCI throughput as compared to 790i? beats me, unless this is a limitation of the QPI sections on the cpu right now.

you may notice that I did not mention SLI. Unfortunately, SLI and CUDA aren't friends yet :-( But somebody wanting to use 3-way SLI could do it in the Zalman LQ1000 box. despite occasional thermal crash problems I'm rather happy with the system tests so far. those problems arise only when the machine is overloaded with computations (multiple large simulations per gpu), and are likely due to very busy north/southbridge on a 790i motherboard (this should not be a factor on newer i7/socket LGA1366 boards, as mentioned - you won't have a fan on the northbridge).

the ZMachine is certainly pushing the limits, trying to be small, powerful and quiet at the same time. it seems that it will handle scientific simulations very well, as they tend to be similar to the simple fluid and particle codes included as examples in the sdk. by the way, those examples are really useful. if I wanted to run the kind of test I described on my clusters I'd need dozens to hundreds of cpus. I can't give you an estimate of speedup yet, I only know that running the sdk's examples in emulated mode, where you compile with flags forcing execution on cpu, is not a fair comparison, because the programs would in reality be run very differently on cpus.

* * *

of course, since nvidia is doubling the graphics cards' transistors and performance so rapidly, and the first to appear are the air cooled cards, there is a valid question as to water vs. air cooling. in a server room, Tesla rackmounted machines may be the best choice (not necessarily the cheapest). I myself may soon build an air cooled production-run machine to be kept away from people's desks, based on the upcoming gtx 295 gpu. it will certainly be much cheaper, about half-priced compared with the existing gtx 280 cards. [about 1.66 of my cards are needed to match the planned performance of one new dual 295, so I had to pay $850 (CAD) * 1.66 ~ $1400 (CAD) for the performance I will be able to buy for under $500 (US) or $600 (CAD).] but two 295s will be much louder than my current setup! moreover, gpus are only 40% or so of my system's price..

if the situation with mobo's continues for more than a year, CUDA will make sense on those fast x16 slots only, each of which will have to be shared by two gpus (decreasing the communication bandwidth we have now?). we will have max 4 gpus; power consumption and air cooling will restrict the clock speeds (e.g. ~580 MHz on the upcoming gtx295 vs. 680 MHz on my 280s). until something changes radically regarding the PCIe bus designs, or cooling/overclocking of gpus, in the near future we're going to have slow progress in CUDA hardware, as the 4 new gpus on 2 cards will not be much faster than the current 3 water-cooled gpus.

so enjoy the moment and, in theory, get onto the list of 500 fastest supercomputers (ending at some 10 TFLOPs performance) for just $20k, right now!

testing, testing, 123...

CUDA SDK provides a nice set of 20+ examples, some of which can also serve as benchmarks or as constant heavy loads (run in many copies on a given gpu).

useful options: add --device=0 to command line in order to run most example programs on the gpu 0 (main one), --device=1 on gpu nr. 1 (slot nr 1, the furthest from cpu on my mobo), and so on.

bandwidth tests

~/cuda/bin/linux/release$ bandwidthTest --device={0,1,2,all} --memory={pageable,pinned}

very curiously, I once got this funny, misconfigured slot 3:
---------------
the size of data packet transferred is 33MB, about a minimum for best efficiency of transfers.
Host to Device Bandwidth for pinned memory
gpu 0: from cpu 5.2 GB/s PCIe: 1x16 (2.0) theor: <8GB/s
gpu 1: from cpu 5.7 GB/s PCIe: 1x16 (2.0) theor: <8GB/s
gpu 2: from cpu 0.79 GB/s PCIe: 1x4 (1.0)* theor: <1GB/s
--
* = as shown by nvidia-settings utility
--------------
inside the cards, the bandwidth is always as it should be: 128 GB/s or so!

the bandwidth between the cpu and the gpu in slot 3 is intriguing. I have yet to understand why my third slot was configured as PCIe x4, not x16 during this test!

anyway, normal readings are now better. to achieve them, I have overclocked the PCI buses: slot 1,2 and 3, from automatic setting of 100MHz to, correspondingly, 115, 115, and 120 MHz, and the SPP-MCI comm speed from automatic 200 MHz to 240 MHz. I changed the latency timer of PCI to 100 from 128 CLK.

-------------
the size of data packet transferred is 33MB, about a minimum for best efficiency of transfers.
Host to Device Bandwidth for pinned memory
gpu 0: from cpu 5.874 GB/s, to cpu 5.875; PCIe: 1x16 (2.0) theor: <8GB/s
gpu 1: from cpu 5.875 GB/s, to cpu 5.876; PCIe: 1x16 (2.0) theor: <8GB/s
gpu 2: from cpu 2.075 GB/s, to cpu 2.017; PCIe: 1x16 (1.0)* theor: <4GB/s
______________
gpu 0-2 cumulatively: from cpu 13.8 GB/s, to cpu 13.8 GB/s, internally 383 GB/s.
--
* = as shown by nvidia-settings utility
--------------
I must stress that, unless the benchmark is cheating, the throughput of the 3 cards is additive, i.e., they transfer data w/o mutual interference. the bandwidthTest shows cumulative throughput of 13.8 GB/s each way. [edit: yes, I think the benchmark is cheating.. but I wasn't able to run a concurrent bandwidth test with 3 cards.. :-( maybe the cpu can't handle 14 GB/s concurrently.]
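the "theor" column and the cumulative figure above can be re-derived with a few lines of python. this is only a sketch: it assumes the usual per-lane, per-direction payload rates of 0.25 GB/s for PCIe 1.0 and 0.5 GB/s for PCIe 2.0 (after 8b/10b encoding), and the names PCIE_LANE_GBPS, pcie_peak and MEASURED are invented here. the measured values are copied from the second table.

```python
# per-lane, per-direction payload rate in GB/s (assumed standard figures)
PCIE_LANE_GBPS = {"1.0": 0.25, "2.0": 0.5}


def pcie_peak(generation: str, lanes: int) -> float:
    """Theoretical one-way bandwidth of a PCIe link in GB/s."""
    return PCIE_LANE_GBPS[generation] * lanes


# (label, PCIe generation, lanes, measured host-to-device GB/s), from the table above
MEASURED = [
    ("gpu 0", "2.0", 16, 5.874),
    ("gpu 1", "2.0", 16, 5.875),
    ("gpu 2", "1.0", 16, 2.075),
]


def cumulative_measured() -> float:
    """Sum of the per-card readings, i.e. the 'additive' throughput."""
    return sum(row[3] for row in MEASURED)


if __name__ == "__main__":
    for label, gen, lanes, gbps in MEASURED:
        peak = pcie_peak(gen, lanes)
        print(f"{label}: {gbps:.3f} of {peak:.1f} GB/s ({100 * gbps / peak:.0f}% of peak)")
    print(f"sum over cards: {cumulative_measured():.1f} GB/s")
```

note that the sum is only meaningful if the three transfers really can run at the same time, which is exactly what the benchmark does not prove.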

it is revealing to compare EVGA 790i nForce SLI FTW with the X58 motherboards from ASUS (P6T X58 deluxe) and Gigabyte (GA-EX58-extreme or UD5). The latter have 3 physical x16 slots just like the EVGA but unlike it, do not provide enough bandwidth to use them simultaneously: you can only use full (PCIe 2.0) x16 throughput on two first slots, and if you want to have a 3rd card, the second slot goes to x8. Thus, according to their documentation, they are WORSE than the EVGA board/chipset/cpu. well, it's sad. The ASUS Striker II Extreme board is a socket 775 board similar to my EVGA and from the documentation it seems to have very similar PCIe capabilities: two slots at full (2.0) x16 speed, one middle slot at (1.0) x16 speed.

thermal tests

the tests described above run ok. however, these workloads and PCI clock settings produce conditions close to a thermal instability of the motherboard.

temperatures depend a lot on applications. I ran a fluidsGL simulation of 'stable fluids'.
It's a Fourier-based, incompressible, implicit hydrocode, in which I modified the source code to have a much larger array of 2048^2 cells and a much lower viscosity coefficient (which doesn't affect the speed). iterations on this enlarged grid using calls to cufft FFT library yielded the following stabilized temp's: T ~69/58 C (gpu0, chip/card), 66/55 C (gpu1), 64/54 C (gpu2).

Zalman system was showing 43/35 C (coolant/box temperature).

it is quite easy to thermally destabilize the motherboard by increasing the number of tasks run on each gpu from one to a few. the temperatures cited above are close to a maximum for long-term runs. everything depends on the type of application, of course, for instance running many small fluid grids instead of an equivalent one large grid tends to increase the demand on SPP/MCP and raise temps. ideally, I should be running just one big simulation per gpu, or even one simulation on 3 gpus, no intensive output to monitor via gpu0, so I should(?) be fine.

I guess the thermal issues will stay with us for the foreseeable future, whether we have 65nm, 55nm or, one day, 25nm technology, since we're always going to push the performance to the limit.

computing power tests

the SDK 2.1 tests were done successfully.

fluidsGL test was running at 17 fps on one gpu, or ~10 fps in 3 copies on 3 gpus, at resolution (2K)^2. at the standard resolution of 512^2, the frame rate was 266fps on one gpu, or 180+180+90 fps (= 450 fps combined) on 3 gpus. let me comment on this. the algorithm, in addition to lots of interpolation (to perform advection step) also does at least two fft transforms of the 2-D, multi-variable data. 200 timesteps per second is thus equivalent to almost 10^3 fft transforms per second on a 512^2 array. in several seconds, the simulated fluid can travel across the computational grid. admittedly different, but not necessarily more computationally intensive, CFD hydrocodes run on a cpu would easily take a full coffee break to accomplish this.
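the "almost 10^3 ffts per second" estimate can be spelled out as a small python sketch. it assumes two fft passes per timestep (the text says "at least two") applied to two velocity components; the count of two fields is an assumption about what "multi-variable" means here, and the function names are invented.

```python
def ffts_per_second(fps: float, ffts_per_step: int = 2, fields: int = 2) -> float:
    """Number of 2-D FFTs per second, assuming one timestep per rendered frame."""
    return fps * ffts_per_step * fields


def cell_updates_per_second(fps: float, n: int) -> float:
    """Grid cells advanced per second on an n x n grid."""
    return fps * n * n


if __name__ == "__main__":
    print(f"200 steps/s -> {ffts_per_second(200):.0f} ffts/s")
    print(f"266 fps     -> {ffts_per_second(266):.0f} ffts/s")
    print(f"512^2  at 266 fps: {cell_updates_per_second(266, 512):.2e} cells/s")
    print(f"2048^2 at  17 fps: {cell_updates_per_second(17, 2048):.2e} cells/s")
```

with the single-gpu frame rates quoted above, both resolutions come out near 7 x 10^7 cell updates per second, so the throughput per cell is about the same on the small and the large grid.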

nbody test showed that there is no problem with having a sustained combined processing power (e.g., sum of FLOPs in many concurrently running nbody calculations) equal to 1.53 TFLOP. Theoretical sum is closer to 3 TFLOPs. 30000 particle N-body system oscillates, turns and evolves on the screen in a matter of seconds. (rendered by openGL in a game-like way, somewhat nicer than what we normally do in science :-). btw, somewhere here I said that Zalman cooling system never goes into high gear. well, that's no longer true, in this 3 x nbody test it did go to 1400rpm fans and 1.5 l/min flow rate. but check out the frame rate of those N-body simulations: about 20fps each, which means smooth real-time video during which each simulation computes N^2 ~ 10^9 gravitational interactions per frame. to add to the insanity of this calculation, by dragging the mouse over the screen you can turn the simulated 3D objects in space to get a different view, whether the simulation is running or paused.
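the interaction count is easy to check. the python sketch below uses the numbers quoted above (30000 particles, about 20 fps, three simulations at once) and the all-pairs count N^2; the function names are invented, and no FLOP count per interaction is assumed, so this says nothing directly about the 1.53 TFLOP figure.

```python
def interactions_per_frame(n_particles: int) -> int:
    """All-pairs gravitational interactions evaluated in one step (N^2)."""
    return n_particles * n_particles


def interactions_per_second(n_particles: int, fps: float, n_simulations: int = 1) -> float:
    """Interaction rate for one or more concurrent simulations, one step per frame."""
    return interactions_per_frame(n_particles) * fps * n_simulations


if __name__ == "__main__":
    n, fps = 30000, 20  # values quoted in the text; fps is approximate
    print(f"per frame:      {interactions_per_frame(n):.1e}")
    print(f"per second:     {interactions_per_second(n, fps):.1e}")
    print(f"3 simulations:  {interactions_per_second(n, fps, 3):.1e}")
```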

I have to find the actual top FLOPs in my own applications this winter or spring (2009). I believe they may be higher than 1.53TFLOP, because I will try not to give device=0 so much graphics to display as do the examples, forcing high frame rates. this will help Zalman remove heat from the motherboard chips such as northbridge/southbridge.

it's alive!

perhaps the most important issue with a 3-SLI-like configuration is the thermal stability, not of the cards which, as both the Zalman and the nvidia utility nvidia-settings show, are relatively cool (under maximum strain, they heat up to mid-70 C, when air-cooled versions would go to 100 C).
likewise, there is never a danger of overheating the cpu, if watercooled.
rather, the issue is that radiators are supposedly working better when drawing air into the box.
in addition, that's a natural direction to support the power unit and the 12cm box fan both trying to blow the air out the back side of the box. so the radiator-WARMED air is blown into the box, not outside. this of course heats and cools the whole motherboard simultaneously. it heats the whole motherboard but efficiently cools its small chips, which are heating up during any intensive use. unfortunately but predictably, at maximum computational load, especially if the system is burdened with running multiple programs per physical card, something has to give. most probably the southbridge or northbridge overheats and the computer hangs. I wish there was some solution to this that allows the big fan to cool the radiator to the outside, while still providing a good flow of air near the motherboard..

however, many combinations of the workload, fully utilizing all the resources (gpu, cpu, memory) are stable. let's take a closer look...

the three-card game

already on the 2-card machine, I played with the os installation. initially, I installed a 32-bit RHEL 5.2 from media, only to change my mind and go for the 64-bit fedora10, which should give me a slight performance edge.

I had a nasty day or two without the X windows/desktop, since the nvidia driver version 180.06 and then 180.16 did NOT install ok. whoever wrote that script seems not to have known a magic spell I finally found on nvidia forum, you know, something like:
"/usr/bin/nvidia-xconfig -a". a spell without which your computer only outputs 24 lines of text ;-). half a day, and a bunch of rpm's later I started feeling at home in my tcsh and gnome desktop. I installed the newest sdk for cuda without incident.

then the real fun began. the physical re-installation of the water-cooled cards.
my first leak... :-( well, all those of you who used the extremely short barbs in the BFG gtx280 H2OC kit, meant for SLI installation, know what I mean. the supplied short pieces of clear elastic vinyl tubing don't tightly fit around the barbs, the white plastic clamps aren't really working and, besides, the short barbs are a bit too wide and catch on the aluminum card backplate painted in black, when you try to mount them on the card. all those mechanical/hydraulic issues can be solved by abandoning those too-short and too-wide barb fittings and using the 3/8 inch fittings, which are smoother and better quality anyway. just one thing: you have to shorten them so that the cards are close together. I put the cards in, measured the distance and cut off pieces of four 3/8" barbs with a carbon (diamond?) disk attached to an electric drill.
I smoothed the edges so they won't cut the hoses and everything started looking good again (although I'm missing the home depot's garden hose a lot! :-)

the LQ1000 case is very compact, like all mid-towers. for instance, I must warn those of you who would like to use EVGA water-cooled 280s that this box DOES NOT WORK with them. they're too high (from PCIe bracket on mobo to the top of the card). you have to use BFG gtx280 H2OC cards. I was just lucky to have gotten them in my first iteration. they almost touch the big fan casing, so you have to route the cooling hoses over the low sectors of the cards. there is enough tubing in the Zalman kit for several waterblocks.

those hoses will likely touch the 24cm fan casing, which is ok. the small door covering the disks will touch their sata cabling. it's all barely doable. if you install the lowest/furthest x16 card, you'll always worry a lot about the sharply bent hose coming down toward the bottom plate of the case. but it will all work! I bent and even clamped the hose in a working system with my fingers and the flow rate diminished noticeably only when I used so much force that I thought I'd stop the flow completely. but it kept going and Zalman's flow rate alarm wasn't even triggered.

in this picture you can see that I removed the white plastic clamps between the graphics cards (cf. previous pix), and installed the home-depot clamps on the shortened 3/8" barbs. no leaks now.

the two-card game

Let's take a look at the first configuration I built, with two gtx280 gpus. I actually bought three cards, but got scared about the thermal limitations of my Zalman cooler (which are like those of Zalman XT external cooler): nominally only 500 W heat removed.
well, theoretically I had 3 x 236W of heat just from the 3 cards! so my system, on paper, was limited to 2 cards... but I figured that all the heat doesn't go into the coolant.
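the on-paper heat budget is simple arithmetic, sketched here in python. the 236 W per card and the 500 W nominal cooler rating are the figures quoted above; the function names are invented, and the "fraction into the coolant" is just the ratio that would make three cards fit the rating, not a measured value.

```python
CARD_HEAT_W = 236          # max heat per gtx280, as quoted above
COOLER_NOMINAL_W = 500     # nominal heat removal of the Zalman cooler, as quoted above


def total_card_heat(n_cards: int, per_card_w: float = CARD_HEAT_W) -> float:
    """Worst-case heat output of n cards in watts."""
    return n_cards * per_card_w


def max_fraction_into_coolant(n_cards: int) -> float:
    """Largest share of the cards' heat the coolant may absorb without exceeding the rating."""
    return COOLER_NOMINAL_W / total_card_heat(n_cards)


if __name__ == "__main__":
    for n in (2, 3):
        heat = total_card_heat(n)
        verdict = "within" if heat <= COOLER_NOMINAL_W else "over"
        print(f"{n} cards: {heat:.0f} W, {verdict} the {COOLER_NOMINAL_W} W rating")
    print(f"3 cards fit if at most {100 * max_fraction_into_coolant(3):.0f}% of the heat reaches the coolant")
```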

the de-gassing of the cooling system has to be done with all the waterblocks below the pump.
no leaks were found. if you use Zalman, remember not to give up (like someone on some forum did) after a few beeps and disconnections of the pump. this is normal behavior. before most air is gone you'll have to restart the system up to ten times. I recommend turning the waterblocks as much as possible during this; it helps the air escape.

I like this shot, it shows perhaps the first supercomputer built from pieces of nylon-reinforced garden hose from home depot :-) the wide spacing between the cards is due to the location of the 2 fastest PCIe x16 slots.

intro & specs

the system I'm going to describe is my first step on a new path into shared-memory parallel computing. in other words, here you won't find anything about how to improve your highscores and frame rates in Crysis 3.14 or Left 4 Dead 7.0. :-(

my Z-Machine (when I find a good name for it, I'll edit out ZMachine :-) was designed with these objectives in mind:

1. max power: max number of cards in one box,

2. well interconnected: cards sitting on pci express bus (x16 if possible) and not dependent on the relatively very slow gigabit ethernet switches (which aren't very broadband; in PCIe terminology they are x1 or x2 devices!)

3. quiet operation for office, not server room setting

* * *

points 1 and 3 suggested water cooling, and when I started reading up on that subject, I was amazed that a little-known box called LQ1000 from a respected manufacturer Zalman has a nice cooling system integrated inside the box. although a bit expensive, it looks great (wine-colored gauges remind one of a bmw dashboard :-)

btw, the box looks like so

and not like this

prototype from a 2007 trade show.

* * *

next: which cards? nvidia geForce gtx280 was my choice (240 cores!). what's interesting, water cooled cards by BFG and EVGA are factory overclocked. great!
I considered the newer versions of gtx260 with 216 cores but the price-performance calculus preferred gtx280. [I looked at the price and performance of the whole computer, not just one card!]

next: the motherboard and cpu. well.. that was kind of unimportant if my hopes as to the gpus were right (-: so i settled on a run-of-the-mill quad-core intel processor...

* * *

it took me the last week of Nov 2008 to (over)design my machine while scanning the world for the following components (prices are approximate, in CAD):

ZMachine:

box and cooling: Zalman "ZMachine" LQ1000, with included cpu waterblock & whole cooling system in a midtower. $800

cards: 3 x BFG GeForce gtx 280 H2OC, factory-overclocked setting, 680 MHz main clock - $686 each at bestdirect.ca

motherboard: EVGA nForce 790i SLI FTW - $350 [FSB clock 1350 MHz, +15% overclocked PCIe.
Good mobo, except for a tiny northbridge radiator fan, which becomes loud when nb is getting a workout by cuda applications. however, at the end of 2008 there simply were no better boards. I could (and maybe would) have opted for ASUS Striker II Extreme or a Gigabyte board with i7 nehalem cpu (socket LGA1366), but then I would have a wrong Zalman cpu waterblock bracket, and the really insufficient PCIe throughput, about which later..]

CPU: intel quad-core at 2.83GHz (Q9550) - $300(?)

RAM: 2x2GB SLI-ready DDR3 1800 - $440

PSU: Toughpower Thermaltake 1200W [has the required two modular +12V connectors to each of the three gtx280 cards, and is very quiet. pay close attention to the number of available power connectors if you construct a 3-way SLI!] - $430

2 x 1TB Spinpoint harddisks from Samsung (quiet) - $240 (both) [I have a backup partition 250GB on the second drive, still don't know what I gain :-) since if the 1st disk crashes, the second is not automatically bootable... well I'll sort it out later]

1 dvd-rom $29 [nice, quiet], kbd/mouse $10 ea.[spent too little? both already failing :-]

Samsung SyncMaster 2443BW, 24" 1920x1200 monitor. $350 [I like it, pivots around 2 axes, adjustable height. great contrast etc]

OS: Fedora 10 , x86_64, driver: nvidia 180.16 - $0 [I downloaded and installed 6-7 GB over the net without any physical media in one night]

CUDA v. 2.1 beta. [installs & works fine; I skipped compilation of those few examples that require some extra libraries]
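adding up the list gives the total and the share spent on the gpus. the python sketch below just copies the approximate CAD prices from the list above (the dictionary keys are shorthand invented here; the cpu price carries the same question mark as in the list).

```python
# approximate prices in CAD, copied from the component list above
PRICES_CAD = {
    "case + cooling (Zalman LQ1000)": 800,
    "3 x gtx 280 H2OC": 3 * 686,
    "motherboard (EVGA 790i)": 350,
    "cpu (Q9550)": 300,           # marked $300(?) in the list
    "ram 2x2GB DDR3": 440,
    "psu 1200W": 430,
    "2 x 1TB disks": 240,
    "dvd-rom": 29,
    "kbd + mouse": 2 * 10,
    "monitor": 350,
    "os + drivers + cuda": 0,
}


def total_cost(prices: dict) -> int:
    """Sum of all component prices."""
    return sum(prices.values())


def share(prices: dict, key: str) -> float:
    """Fraction of the total spent on one item."""
    return prices[key] / total_cost(prices)


if __name__ == "__main__":
    print(f"total: ${total_cost(PRICES_CAD)} CAD")
    print(f"gpu share: {100 * share(PRICES_CAD, '3 x gtx 280 H2OC'):.0f}%")
```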

GPU >> CPU

in early 2007, nVidia opened up the gates to a paradise. a free if not entirely open-source project called CUDA made its debut. it's a platform for general-purpose computing on graphics devices, utilizing the massively parallel architecture of today's GPUs, or graphics processing units. in effect, all the recent nvidia cards became capable of carrying out parallel computational tasks. their raw power exceeds that of a CPU by a factor now typically ~10^2.

CUDA is best explained in this wiki page.
It is nicely illustrated using real-life applications in the nVidia CUDA Zone. Typical speedups w.r.t. cpu are 5-100.

mythbusters were hired to illustrate the power of parallel processing and constructed a cute, monstrous parallel paint gun to paint mona lisa in less than 80 ms. it's fun to watch the monster and its 1024 paintballs flying slow-mo to meet their final destination: canvas.

without much exaggeration one can say that gpus could only be ignored so long - as soon as there's a solution that speeds up your program 100 times, you have no choice but to change to that track, no matter how comfortable your old one was.

for me, the old path was clusters and MPI. I started 10 yrs ago with a cluster of sun ultra5 workstations, and later continued with a cluster of custom built pc's in rack mounts.
hydra cluster, 5 GFLOPs
sample application of hydra
ANTARES cluster, 61-144 GFLOPs

MPI is a language (more precisely, a protocol and the libraries that implement it, for exchanging data between the nodes of a cluster). physical exchange was facilitated by commodity gigabit ethernet switches that became affordable about that time.

clusters were great, and essentially most of today's supercomputers are built like that: farms of dozens to tens of thousands of machines hooked up by relatively slow interconnects. distributed memory and distributed processing power. which is ok for some problems, like hydrodynamics w/o radiation transfer or self-gravity in astrophysics, or frame-by-frame movie rendering and postproduction in a studio.

so clusters were great, but not trouble-free. you had to wait (hours or days, depending on how ambitious your computation was!) for the requested number of processors on some big national supercomputer to be allocated to your simulation. or you could decide to build your own little cluster, if you had money, place and time for it. that made more sense to many, and could cost your grant agency 'only' $20k or so, unless you really needed smp (a shared memory machine); then you had to multiply the cost 5-10 times.

you and your associates could only run a system of a few dozen nodes at best. beyond that magical number, frequent individual component breakdowns, software upgrades, and so on, needed to be taken care of by a professional sysadmin or technician (which you could not afford; so you used your nodes praying they don't fail, and did not repair those that eventually did.)

on a small cluster, scientific long-term simulations could in practice be done in 2D but rarely in 3D, unless you were very lucky with your problem and/or very patient..

* * *

let's skip to 2007 then. why is a gpu a hundred times more powerful than a cpu?
today, both are capable of parallel computation, since they have multi-core structure.
each gpu core is far less advanced on the control/vectorization side and a bit slower.
but your ~$350 4-core intel processor is no match for 216-240 cores of a ~$350 nvidia gpu, on the newest G200 card series (GeForce gtx260, gtx280, and from january 2009 also a 2-gpu card gtx295 with 480 cores). ASSUMING YOU CAN harness the combined power of those gpu cores...

so here's the challenge: to build and program a massively parallel system (with hundreds of computing nodes or cores) that is a bit more environmentally friendly than the old clusters: much less noise, much less total electrical power used, and finally much much more bang for the buck.
and that means: a supercomputer in a single computer case, performing thousands of GFLOPs!
perhaps a thousand times the number of operations you could perform 10 yrs ago.