GPU memory error correction lagging

Published in Graphics


by Nick Farrell on 03 September 2009


Customers clamour for better protection

Customers are apparently shouting at graphics chip makers to do more about error correction in GPU memory systems.

One of the downsides of the current generation of GPUs is a lack of error correction, which causes problems for high performance computing users. According to HPC Wire, graphics chip vendors are aware of the problem and it appears to be only a matter of time before GPUs get a memory makeover.

In the good old days graphics processors didn't really need to be concerned with error-prone memory. No one really cared if a pixel's colour was off by a bit or two. (Or if the pixel showed up at all. sub.ed.) GPU makers did not bother with error corrected memory. However, with general-purpose computing on graphics processing units, otherwise known as GPGPU, it started to become crucial.

Once you start to use the GPU as a math accelerator and a memory bit flips in a data value, the computation becomes unreliable. The only reason that general-purpose computing can be done on GPUs at all is that errors on standard graphics hardware are still rare. From a programming point of view the safest way to tackle the problem is to run the code twice and compare the results, which is unfortunately a bit slow.
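To make the "run it twice" approach concrete, here is a minimal sketch in Python. It is an illustration only, not something from HPC Wire or either vendor: the helper `run_twice`, the `max_attempts` parameter and the toy `sum_of_squares` kernel are all hypothetical names, and the kernel runs on the CPU as a stand-in for real GPU work.

```python
from typing import Callable, Sequence


def run_twice(kernel: Callable[[Sequence[float]], float],
              data: Sequence[float],
              max_attempts: int = 3) -> float:
    """Run `kernel` twice on the same input and accept the answer only
    if both runs agree. `max_attempts` is an assumed retry budget."""
    for _ in range(max_attempts):
        first = kernel(data)
        second = kernel(data)
        # A transient bit flip is unlikely to corrupt both runs identically,
        # so a mismatch is treated as a sign that one run was corrupted.
        if first == second:
            return first
    raise RuntimeError("results never agreed; possible persistent fault")


def sum_of_squares(values: Sequence[float]) -> float:
    """Toy stand-in for a GPU kernel (hypothetical, runs on the CPU)."""
    return sum(v * v for v in values)


result = run_twice(sum_of_squares, [1.0, 2.0, 3.0])
print(result)
```

Every accepted result costs at least two full runs, which is the slowdown the article complains about. Comparing two runs also only detects a transient error; it cannot tell which run was right, hence the retry.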

Patricia Harrell, AMD's director of Stream Computing, said that there was a need for more robust data protection in GPUs. Error corrected memory is a requirement for a number of customers, especially those looking to deploy GPUs at scale. She pointed out that although individual memory error rates are low, as you add more GPUs to the system and run applications for longer periods of time, the chances of hitting a flipped memory bit increase proportionally. The AMD FireStream 9270 board uses GDDR5 memory, so data protection is already in place at the memory interface in this product. The memory controller sends and receives data to and from the DRAM and buffers the data locally while the DRAM calculates the integrity check. If there is a problem, the memory controller retries the transfer automatically.
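Harrell's scaling argument can be put into rough numbers. The sketch below assumes bit flips arrive independently at a constant rate (a simple Poisson model, which is an assumption made here and not something AMD stated), and the rate of 1e-5 flips per GPU-hour is a made-up illustrative figure, not a measured one. The function name `p_at_least_one_flip` is likewise hypothetical.

```python
import math


def p_at_least_one_flip(rate_per_gpu_hour: float,
                        num_gpus: int,
                        hours: float) -> float:
    """Probability of at least one bit flip somewhere in the system,
    assuming independent flips at a constant rate (Poisson model)."""
    expected_flips = rate_per_gpu_hour * num_gpus * hours
    # 1 - exp(-x), computed via expm1 for accuracy when x is tiny.
    return -math.expm1(-expected_flips)


RATE = 1e-5  # hypothetical flips per GPU-hour, for illustration only

for gpus, hours in [(1, 1), (2, 1), (100, 24), (1000, 1000)]:
    p = p_at_least_one_flip(RATE, gpus, hours)
    print(f"{gpus:5d} GPUs x {hours:5d} h -> P(at least one flip) = {p:.6f}")
```

While the expected number of flips is small, the probability grows almost exactly in proportion to GPU count and run time, which matches Harrell's point. Once the expected count approaches one or more, the probability levels off towards certainty.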

Harrell said that AMD was taking a cautious approach to error correcting GPUs because you could end up with kit that is too big and hot. You could also lose all the performance advantages GPGPU was originally intended for.

Andy Keane, general manager of the GPU computing business unit at Nvidia, said that his outfit would be doing something about the problem soon. ECC memory is a hard requirement in datacentres and so Nvidia has to build that kind of support into its roadmap. He would not say how long it would take, but Nvidia already has a pretty good idea of the timeline. A pretty good guess would be one to two years.
