Once we had assembled the system, we installed Windows XP Professional (64-bit edition). We then had to make the system recognize all four graphics cards, which turned out to be a serious undertaking. At first, only two of the graphics cards were recognized. We spent hours searching for a magic sequence of installing the drivers, uninstalling them again, and rebooting the system. We can only begin to describe our joy when we finally faced the screen shown below.
For our computations on the graphics cards, we use NVIDIA CUDA. Since CUDA is also supported by the latest GeForce drivers, installing special CUDA drivers proved to be unnecessary.
Even though FASTRA contains a total of 8 GPU cores, it is not a gaming PC. The reason is a lack of interoperability between the GPUs. Any hard-core gamer would choose a motherboard that supports SLI, so that the cards can work together while rendering a scene. This is probably also why, as far as we know, no one else has started the foolish undertaking of cramming four 9800GX2 cards into a single PC. Fortunately, the lack of SLI was no problem for us, because we don't need it during a reconstruction: every graphics card communicates directly with the CPU, and no inter-GPU communication is required. Our biggest problem was finding both a motherboard and a case able to hold four 9800GX2s.
This turned out to be a bit more problematic than anticipated. Most "normal" large cases offer at most 7 expansion slots. Since every GX2 requires two of those, four cards need 8 slots and we would have been one short. Our search eventually ended in Taiwan with a case from LIAN-LI. The motherboard needed at least 4 PCI-Express expansion slots, spaced so that every graphics card could occupy two slot positions. No SLI-supporting motherboard could offer us this. That's why we chose the MSI-K9A2 motherboard, which has the required number of slots in the right configuration.
As said before, FASTRA is not a gaming system, and we don't use SLI. However, it may still be interesting to investigate the gaming performance of our system. Our first tests were performed with FutureMark's 3DMark06. On-board SLI (that is, between the two GPUs within a single GX2 card) is supported. We used the default settings (1280x1024 and “Optimal” filtering). See the results:
As you can see, the scores aren't particularly high. This was expected, mainly because SLI between the cards is missing. Of course, we would still have hoped for better gaming performance, for the times our boss isn't looking... We did not run 3DMark06 at higher resolutions, which would probably lead to better SLI results.
The real reason FASTRA was built was to do tomographic reconstructions, and we expected really good results for these types of calculations. For gaming, clear benchmarks are available to measure system performance. For tomography, this is much more difficult. We decided to compare the FASTRA results with CPU-based calculations performed on a modern supercomputer cluster.
This we found in “CalcUA”, the supercomputer of the University of Antwerp, which cost 3.5 million euro in March 2005. With 256 nodes, each containing two Opteron 250 processors, it has about the same strength as 512 regular desktop PCs. We used tomography code optimized for the CPU and ran it on one core of the CalcUA. We then divided the measured running time by 512. This assumes perfect scaling across all processors, so it gives a theoretical upper bound for the total performance of this supercomputer.
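A minimal sketch of this estimate, assuming perfect linear scaling over all processors. The function name and the example running time are made up for illustration; they are not measurements from the comparison.

```python
# Sketch of the upper-bound estimate: a single-core run time divided by the
# number of processors, i.e. the best case with no communication overhead.
CALCUA_PROCESSORS = 256 * 2  # 256 nodes, two Opteron 250 processors each


def ideal_cluster_seconds(single_core_seconds, processors=CALCUA_PROCESSORS):
    """Best-case cluster run time if the work scaled perfectly (hypothetical helper)."""
    if single_core_seconds < 0 or processors <= 0:
        raise ValueError("time must be non-negative and processors positive")
    return single_core_seconds / processors


# Hypothetical single-core measurement, chosen only to keep the arithmetic simple.
measured_seconds = 1024.0
print(ideal_cluster_seconds(measured_seconds))  # 1024 / 512 = 2.0
```

A real run on all 512 processors would take at least this long, so the estimate favours the cluster.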
We compared the results with our code written for the GPU. First, we measured the time required to project a large number of slices along a range of angles. This can be considered a basic operation in iterative tomographic reconstruction methods. We performed this operation on all eight GPUs simultaneously and measured the running time. Next, we repeated the experiment for a full iteration of an iterative reconstruction algorithm.
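Because the slices are independent, the work can be divided over the eight GPUs without any communication between them. The sketch below shows one possible way to do that split; it is an assumption, not a description of the actual FASTRA code. `project_chunk` is a hypothetical stand-in for the real GPU projection, and host threads stand in for one worker per GPU.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_GPUS = 8  # four 9800GX2 cards, two GPUs each


def split_slices(num_slices, num_workers):
    """Split range(num_slices) into contiguous chunks of near-equal size."""
    base, extra = divmod(num_slices, num_workers)
    chunks, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def project_chunk(gpu_id, slice_ids):
    # Hypothetical placeholder: a real version would select the device and
    # launch the projection kernel for each slice. Here we only record
    # which GPU would have handled which slice.
    return [(slice_id, gpu_id) for slice_id in slice_ids]


def project_all(num_slices, num_workers=NUM_GPUS):
    """Run every chunk on its own worker and gather the results in slice order."""
    chunks = split_slices(num_slices, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(project_chunk, gpu, chunk)
                   for gpu, chunk in enumerate(chunks)]
        parts = [future.result() for future in futures]
    return [item for part in parts for item in part]
```

Each worker only talks to the host, which matches the point made earlier that no inter-GPU communication is needed.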
The performance of the CalcUA and that of FASTRA are then compared. See the following figure for the results for a forward projection. The image size is 1024x1024 pixels, with 1024 projections and 1536 detector pixels.
The employed CPU algorithm has been in use at our lab for more than two years. It makes use of CPU-specific optimizations, so we are confident that the comparison is meaningful.
It turned out to be easy to overclock the shader frequency of the GPUs by 20%, so that each core runs at 720 MHz. The performance of our overclocked system is also displayed.
As can be seen from the picture, CalcUA is faster than FASTRA for projections, but FASTRA is not far behind. Not bad for a <4000 euro machine!
The numbers become better once we consider complete reconstructions (so not just a forward projection). The algorithm used, both on FASTRA and on CalcUA, is SIRT, in which one iteration uses all 1024 projections. This algorithm is computationally very intensive compared to other reconstruction algorithms, yet we use it as a building block for many of our more advanced algorithms, as it has favourable reconstruction properties in the presence of noise. The results show that the fast on-board GPU memory really gives us an edge in SIRT:
Here FASTRA really shines! It was able to beat the “official” supercomputer. This opens up a great opportunity: for certain specific applications, a research group is able to access full-time supercomputer computation power for a very modest price.
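For readers unfamiliar with SIRT, the sketch below shows one common textbook formulation: every iteration forward-projects the current image, weights the residual by the inverse row sums, back-projects it, and weights the update by the inverse column sums. The text above does not say which SIRT variant or weighting was used, so treat this as an illustration under that assumption. The tiny 2x2 example is made up, and the code assumes a projection matrix with non-negative entries.

```python
def sirt(A, b, iterations=50):
    """SIRT sketch: x <- x + C * A^T * R * (b - A x), with R and C the
    inverse row and column sums of A (assumes non-negative entries)."""
    m, n = len(A), len(A[0])
    row_sums = [sum(row) for row in A]
    col_sums = [sum(A[i][j] for i in range(m)) for j in range(n)]
    x = [0.0] * n
    for _ in range(iterations):
        # Forward projection of the current estimate (uses all projections).
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        # Residual per ray, normalized by the ray's total weight.
        r = [(b[i] - Ax[i]) / row_sums[i] if row_sums[i] > 0 else 0.0
             for i in range(m)]
        # Back projection, normalized per pixel; all pixels updated together.
        for j in range(n):
            if col_sums[j] > 0:
                x[j] += sum(A[i][j] * r[i] for i in range(m)) / col_sums[j]
    return x


# Made-up example: a 2x2 image [1, 2, 3, 4] measured by its two row sums
# and its two column sums (two projection angles, two detector pixels each).
A = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]
b = [3.0, 7.0, 4.0, 6.0]
print([round(v, 3) for v in sirt(A, b)])  # approaches [1.0, 2.0, 3.0, 4.0]
```

The forward and back projections dominate the cost of each iteration, which is why the projection benchmark above is a useful building block to measure.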
Because of the large number of GPU cores packed into a rather small space, we expected big temperature problems. As you can see from the pictures, there is very little room between the graphics cards. As a result, we were unable to establish an airflow through the case that would cool the GPUs sufficiently.
And indeed, during our first tests, we noticed the temperatures going up very fast. We decided to set a limit of 100 degrees Celsius, above which we stop the calculation. During those tests, we often had to stop the calculations prematurely. Eventually we found an easy solution: keep the side panel open. When FASTRA is open, the GPUs are able to dissipate their heat. In time we will place an extra fan in the side panel, so that we can close the machine. Until then, we will enjoy the view.
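The temperature cut-off described above can be sketched as a simple guard around the calculation. This is only an illustration: `read_temps` is a hypothetical callable that returns the current GPU temperatures in degrees Celsius (how they are read from the driver is not shown), and stopping at exactly 100 degrees rather than just above it is an assumption.

```python
TEMP_LIMIT_C = 100.0  # limit mentioned in the text


def run_with_thermal_guard(steps, read_temps, limit_c=TEMP_LIMIT_C):
    """Run the work steps in order, stopping early once any GPU reaches the limit.

    Returns (number of completed steps, True if all steps finished).
    """
    completed = 0
    for step in steps:
        # Check before each step so an overheated GPU gets no new work.
        if max(read_temps()) >= limit_c:
            return completed, False
        step()
        completed += 1
    return completed, True


# Usage with made-up temperature readings for two GPUs.
fake_readings = iter([[70.0, 72.0], [90.0, 95.0], [101.0, 99.0]])
result = run_with_thermal_guard(
    steps=[lambda: None] * 5,
    read_temps=lambda: next(fake_readings),
)
print(result)  # (2, False): stopped before the third step
```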
Using the NVIDIA driver, we were able to measure the GPU temperatures and try to find a safe overclock level. Plotting the temperatures shows the dramatic temperature increase once a calculation is started (the reconstruction begins at t=0):
As we can see, the temperature is highest when the case is closed and the GPU is overclocked. Interestingly, closing the case causes a larger temperature increase than overclocking does. This shows that the tight placement of the graphics cards doesn't allow for good airflow.