@psf I followed that through to the Google and FB papers; they're quite good reads; the FB one especially - that must have been a *fun* debug session. I'm suspecting most of these are silicon level problems rather than microarchitectural.
I remember we had a problem like this when running a few dozen Pentium 933 MHz machines: one CPU on one machine would give incorrect results for one particular run (out of thousands). That would have been in 2001 or so.
And earlier than that, I remember people working at Sun said they'd re-run any failing job (out of tens or hundreds of thousands) and only mark it as failed if the re-run (or possibly even the re-re-run) failed.
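A minimal sketch of that re-run policy (the `job` callable and `attempts` count are hypothetical, just for illustration): any single pass counts as success, so only deterministic failures survive the retries.

```python
def run_with_retries(job, attempts=3):
    """Run `job` up to `attempts` times; report failure only if every run fails.

    The idea: a transient failure (flaky hardware, cosmic ray) passes on
    re-run, while a real, deterministic bug fails every time.
    `job` is a callable returning True on success.
    """
    for _ in range(attempts):
        if job():
            return True   # any successful run counts as a pass
    return False          # failed every attempt: treat as a real failure

# A job that fails deterministically is still flagged:
assert run_with_retries(lambda: False) is False
# A job that always passes is unaffected:
assert run_with_retries(lambda: True) is True
```

The obvious cost is that a genuinely intermittent *software* bug gets masked along with the hardware flake, which is presumably why it only makes sense at the scale of tens or hundreds of thousands of jobs.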
Having now read the fine article...
Our problem with the Pentium was probably a test escape: some flaw in the circuit, rarely but reliably triggered, and not covered by the production test (or by other workloads).
Whereas this is less repeatable. Some machines fail, some of the time, while most do not. And the failures react to the environment.
Maybe today's very complex CPUs have more holes in test coverage. Tiny transistors and wires can be flawed in subtle ways.
Could be some ageing effect. Clearly not leading to an easily reproducible failure, though: leading to something unlikely but possible.
I did read that Intel is cutting down on validation effort, in which case they are designing-in more bugs:
"CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort"
Search within for "sheer panic"