Customers are apparently shouting at graphics chip makers
to do more about error correction in their memory systems.
One of the downsides of the current generation of GPUs is
a lack of error correction, which causes problems for high-performance computing
users. According to HPC Wire, graphics chip vendors are aware of the problem and it
appears to be only a matter of time before GPUs get a memory makeover.
In the good old days graphics processors didn't really
need to be concerned with error-prone memory. No one really cared if a pixel's
colour was off by a bit or two. (Or if the pixel showed up at all. sub.ed.) GPU
makers did not bother with error-corrected memory. However, with general-purpose
computing on graphics processing units, otherwise known as GPGPU, memory
reliability started to become crucial.
Once you start to use the GPU as a math accelerator, a single
memory bit flip in a data value makes the computation unreliable. The reason that
general-purpose computing can be done on GPUs at all is that errors on standard
graphics hardware are still rare. From a programming point of view the safest way
to tackle the problem is to run the code twice and compare the results, which
is unfortunately a bit slow.
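The run-it-twice approach can be sketched in a few lines of Python. This is a minimal illustration rather than anything a vendor ships: `run_twice` and `sum_of_squares` are hypothetical names, and the sketch assumes the computation is deterministic, so any disagreement between the two runs points to corrupted data.

```python
def run_twice(compute, data):
    """Run `compute` twice on fresh copies of the input and compare the results.

    Assumes `compute` is deterministic; a mismatch is treated as a sign
    that something (for example a flipped memory bit) corrupted one run.
    """
    first = compute(list(data))
    second = compute(list(data))
    if first != second:
        raise RuntimeError("runs disagree: possible memory error, rerun needed")
    return first


def sum_of_squares(values):
    # Stand-in for a real GPU kernel (hypothetical workload).
    return sum(v * v for v in values)


result = run_twice(sum_of_squares, [1.0, 2.0, 3.0])
```

The cost is plain to see: every result needs two full runs, so throughput is roughly halved.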
Patricia Harrell, AMD's director of Stream Computing,
said that there was a need for more robust data protection in GPUs.
Error-corrected memory is a requirement for a number of
customers, especially those looking to deploy GPUs at scale. She
pointed out that although individual memory error rates are low, as you add
more GPUs to the system and run applications for longer periods of time, the
chance of hitting a flipped memory bit increases proportionally. The AMD
FireStream 9270 board uses GDDR5 memory, so data protection is already in
place at the memory interface in this product.
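Harrell's point about scale can be put into rough numbers. The sketch below assumes each GPU has the same small, independent chance of a bit flip in any given hour. The rate used is made up for illustration and is not a measured figure, and `chance_of_bit_flip` is a hypothetical helper.

```python
def chance_of_bit_flip(p_per_gpu_hour, gpus, hours):
    """Probability of at least one bit flip across all GPUs over the whole run.

    Assumes flips are independent and equally likely on every GPU in every
    hour. For small rates this is close to p * gpus * hours, i.e. it grows
    roughly in proportion to system size and run time.
    """
    return 1.0 - (1.0 - p_per_gpu_hour) ** (gpus * hours)


ASSUMED_RATE = 1e-6  # illustrative value only, not a vendor or measured number

single_card = chance_of_bit_flip(ASSUMED_RATE, gpus=1, hours=1)
big_cluster = chance_of_bit_flip(ASSUMED_RATE, gpus=1000, hours=24)
```

Under these assumptions a risk that is negligible on one card for an hour becomes a few percent on a thousand cards running for a day.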
The memory controller sends and receives data to and from
the DRAM and buffers the data locally while the DRAM calculates an integrity
check. If there is a problem, the memory controller retries the transfer
automatically.
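The check-and-retry behaviour described above can be modelled in software. This is only a sketch under stated assumptions: it uses Python's `zlib.crc32` as a stand-in checksum (the real hardware uses its own scheme), and `read_with_retry` and `make_flaky_link` are hypothetical names for the controller logic and a simulated noisy link.

```python
import zlib


def read_with_retry(read_burst, max_retries=3):
    """Read a burst, compare checksums, and retry on mismatch.

    `read_burst()` must return (data, checksum_from_dram). Returns the
    good data and the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        data, dram_checksum = read_burst()
        # Controller side: recompute the checksum over what actually arrived.
        if zlib.crc32(data) == dram_checksum:
            return data, attempt
    raise IOError("burst still corrupt after retries")


def make_flaky_link(payload, corrupt_first_n):
    """Simulated link that flips one bit in the first N transfers."""
    state = {"reads": 0}

    def read_burst():
        state["reads"] += 1
        checksum = zlib.crc32(payload)  # DRAM side: checksum of the true data
        if state["reads"] <= corrupt_first_n:
            damaged = bytearray(payload)
            damaged[0] ^= 0x01  # single-bit error in transit
            return bytes(damaged), checksum
        return payload, checksum

    return read_burst


data, retries = read_with_retry(make_flaky_link(b"\x10\x20\x30\x40", corrupt_first_n=1))
```

Note that a scheme like this catches errors on the wire between the controller and the DRAM and fixes them by resending. It does not by itself repair a bit that has flipped while sitting in a memory cell.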
Harrell said that AMD was taking a cautious approach to
error-correcting GPUs because you could end up with kit that is too big and hot.
You could also lose the performance advantages GPGPU was intended to deliver.
Andy Keane, general manager of the GPU computing business
unit at Nvidia, said that his outfit would be doing something about the problem
soon. ECC memory is a hard requirement in datacentres and so
Nvidia has to build that kind of support into its roadmap. He was not sure exactly
how long it will take, but Nvidia already has a pretty good idea of the timeline. A
pretty good guess would be one to two years.