Re: [LUG] Folding@home 'errors'

 

On Tue, 16 Mar 2010, tom wrote:

> I run folding@home on 4 machines here and one quite often finishes early:
> " Simulation instability has been encountered. The run has entered a
> [23:12:46]   state from which no further progress can be made.
> [23:12:46] This may be the correct result of the simulation, however if you
> [23:12:46]   often see other project units terminating early like this
> [23:12:46] too, you may wish to check the stability of your computer (issues
> [23:12:46]   such as high temperature, overclocking, etc.)."
>
> there is no overclocking, cpu is ~29c and memtest can run for days without finding a problem...
> Any clues/tips?

Once upon a time I worked in the R&D department of an old British
supercomputer company... I mainly wrote test & diagnostics, and low-level
driver code - worked with the hardware & chip designers, did some design &
integration, system building, etc, etc...

And even then, I could get a system to run all my diagnostics for days on
end in & out of the burn-in ovens, then they would fail miserably when
subjected to application code )-:

And even more years ago - I looked after a PDP11 running Unix v6 - every
quarter we'd get the DEC engineer in as part of the maintenance contract -
he'd hoover the core memory, etc... run all his diagnostics, but I
remember him saying that running Unix on them was a much better test than
any of his diagnostics ever were!

So you need to think bigger than just memtest - have you tried cpuburn?
However that's just a set of CPU tests. There is a user-land memory tester
too - it's 'memtester' under Debian. Potentially not as thorough as
memtest86+, but you can run it in conjunction with other things.
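
To show the basic idea behind a user-land memory check, here is a minimal Python sketch of a pattern test: fill a buffer, read it back, count the bytes that changed. It is not how memtester itself works internally, and it is far weaker, since a script like this cannot lock pages in RAM or choose which physical memory it gets. The function name, the pattern set and the buffer size are all made up for illustration.

```python
# Hypothetical pattern test: illustration only, not a replacement for memtester.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def pattern_pass(size_bytes):
    """Fill a buffer with each pattern in turn and count bytes that read back wrong."""
    buf = bytearray(size_bytes)
    errors = 0
    for pattern in PATTERNS:
        buf[:] = bytes([pattern]) * size_bytes   # write the pattern everywhere
        errors += size_bytes - buf.count(pattern)  # bytes that no longer match
    return errors


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # 16 MiB, an arbitrary size for the sketch
    print("bytes in error:", pattern_pass(size))
```

Because it is an ordinary process, something like this can sit in the background while CPU or disk tests run alongside it.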

But who knows where the issue is - nine times out of ten we never found bad
memory on those old boards - it was more usually bad PCBs/memory
controllers (all custom designed).

As part of a soaktest/burn-in for new servers, I try to get them to run as
many different things as possible - so get the PCI(e) bus(es) exercised
(disk IO - run bonnie or some custom scripts - dd'ing /dev/urandom to a
file for example) and some network activity - FTP big files (and small
files) to/from another PC.
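
The "random data to a file" idea can be taken one step further by reading the file back and checking it. The following is a rough Python sketch, not one of the custom scripts mentioned above: it uses os.urandom in place of dd from /dev/urandom, and the function name, file name and sizes are assumptions chosen for the example.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, total_bytes, chunk_size=1 << 20):
    """Write random data to `path`, read it back, and compare SHA-256 digests."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            written.update(chunk)
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)
    return written.hexdigest() == read_back.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ok = write_and_verify(os.path.join(tmp, "soak.bin"), 16 * 1024 * 1024)
        print("disk pass OK" if ok else "MISMATCH: data read back differs")
```

One caveat: the read-back may well be served from the page cache rather than the disk, so for a real soak test the file should be much larger than RAM, or the pass repeated many times.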

I try to get my systems doing as much as possible - so as well as
individual sub-system tests, run a disk test and a network test and a cpu
test all at the same time...
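
A minimal sketch of the "everything at once" approach, assuming Python is to hand: a couple of CPU workers and a disk worker run side by side, and each counts results that come back wrong. The network part is left out because it needs a second machine. All the names and sizes here are hypothetical, and a real soak test would run for hours rather than a few rounds.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def cpu_worker(rounds, size=1 << 20):
    """Hash one fixed buffer repeatedly; every digest should be identical."""
    data = os.urandom(size)
    expected = hashlib.sha256(data).digest()
    mismatches = 0
    for _ in range(rounds):
        if hashlib.sha256(data).digest() != expected:
            mismatches += 1
    return mismatches


def disk_worker(directory, rounds, size=1 << 20):
    """Write a random block to a file, read it back, count differences."""
    path = os.path.join(directory, "combined-soak.bin")
    mismatches = 0
    for _ in range(rounds):
        block = os.urandom(size)
        with open(path, "wb") as f:
            f.write(block)
            f.flush()
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            if f.read() != block:
                mismatches += 1
    return mismatches


def run_combined(directory, rounds=20, cpu_threads=2):
    """Run the CPU and disk workers at the same time; return mismatch counts."""
    with ThreadPoolExecutor(max_workers=cpu_threads + 1) as pool:
        cpu_jobs = [pool.submit(cpu_worker, rounds) for _ in range(cpu_threads)]
        disk_job = pool.submit(disk_worker, directory, rounds)
        return {
            "cpu": sum(job.result() for job in cpu_jobs),
            "disk": disk_job.result(),
        }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_combined(tmp))
```

Threads are enough for this sketch because CPython's hashlib and file I/O release the interpreter lock on large buffers, so the workers do overlap. Any non-zero count on an otherwise idle machine would be a reason to look harder at the hardware.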

And good luck...

Gordon

--
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html