@psf I followed that through to the Google and FB papers; they're quite good reads, the FB one especially - that must have been a *fun* debug session. I suspect most of these are silicon-level problems rather than microarchitectural.


I remember we had a problem like this when running a few dozen Pentium 933MHz machines: one CPU on one machine would give incorrect results for one particular run (out of thousands).

That would be in 2001 or so.

And earlier than that, I remember people working at Sun said they'd re-run any failing job (out of tens or hundreds of thousands) and only mark it as failed if the re-run (or possibly even the re-re-run) failed.
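
A minimal sketch of that kind of re-run policy, assuming a job is just a callable that returns True on success. The names `run_with_reruns` and `max_reruns` are hypothetical, made up for illustration; this is not what Sun actually ran.

```python
from typing import Callable


def run_with_reruns(job: Callable[[], bool], max_reruns: int = 2) -> bool:
    """Run a job; mark it failed only if the first run and every re-run fail.

    `job` and `max_reruns` are illustrative assumptions: max_reruns=2 models
    "the re-run, or possibly even the re-re-run".
    """
    for _attempt in range(1 + max_reruns):
        if job():
            return True  # any passing attempt counts as a pass
    return False  # failed on the first run and on every re-run


# Usage: a job that fails once (a transient fault) and then passes.
calls = {"n": 0}


def flaky_job() -> bool:
    calls["n"] += 1
    return calls["n"] > 1


print(run_with_reruns(flaky_job))  # passes on the first re-run
```

The catch, of course, is that a policy like this hides exactly the rare, non-repeatable failures being discussed here, so it would make sense to log every re-run rather than silently pass it.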

@penguin42 @psf@oldbytes.space


@EdS @psf First time I've heard of that on Pentiums; around the same time Sun did have a known problem with cache on one series of SPARCs; theregister.com/2001/03/07/sun

Having now read the fine article...

Our problem with the Pentium was probably a test escape: some flaw in the circuit, rarely but reliably triggered, and not covered in production test (or by other workloads).

Whereas this is less repeatable. Some machines fail, some of the time, while most do not. And the failures react to the environment.

Maybe today's very complex CPUs have more holes in test coverage. Tiny transistors and wires can be flawed in subtle ways.

@penguin42 @psf@oldbytes.space

My suspicion is it's not test cases missing, but either degradation with age (I remember reading about electromigration: en.wikipedia.org/wiki/Electrom ), or timing/voltage dependencies, e.g. not being quite fast enough/strong enough at low voltages.
@EdS @psf

Could be some ageing effect. Clearly not leading to an easily reproducible failure, though: leading to something unlikely but possible.

I did read that Intel is cutting down on validation effort, in which case they are designing-in more bugs:
"CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort"


Search within for "sheer panic"

@penguin42 @psf@oldbytes.space

@EdS @psf Yep, but actual bugs worry me in a different way - something that reliably does the wrong thing is a bit easier to think about; core/speed specific ones are nasty though and ones specific to ageing/failing are even worse.
