When being Smart isn’t Enough… (Hard drive failure)

We grow up being taught that right is right and wrong is wrong. That there is only the correct answer and the wrong answer, with nothing in between.

So, is it a surprise that our testing tools are designed to indicate that something is “healthy” or “failing”? Or “Good” or “Bad”?

History

S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) testing for hard drives was first introduced by IBM in 1992 with their IBM 9337 Disk Arrays for AS/400 servers. This version measured several key device parameters and evaluated them. The results would indicate either “Device is OK” or “drive likely to fail soon”.

Compaq, together with Seagate, Quantum and Conner, devised IntelliSafe, which was effectively a variant of PFA (“Predictive Failure Analysis”), IBM’s first version of SMART. The major difference was that the IntelliSafe system communicated the data back to the operating system, which then evaluated the drive’s status, whereas PFA did its evaluation in the drive’s firmware.

The IntelliSafe version was submitted for standardization in 1995, and was supported by IBM, Seagate, Quantum, Conner and Western Digital. The revised standard was then named SMART.

The Assumptions that lead to Problems

A SMART test does not guarantee a working drive. All it does is help detect a failing or failed hard drive. In many cases a SMART test will help reveal a failing drive or make it easier to troubleshoot, but it only acts as a potential early warning detector.

Why? A SMART test provides only two values: “threshold not exceeded” and “threshold exceeded”. Often these are represented as “drive OK” or “drive fail” respectively. A “threshold exceeded” is intended to indicate that there is a relatively high probability that the drive will not be able to honor its specification in the future: that is, the drive is “about to fail”. The predicted failure may be catastrophic, or it may be something as subtle as the inability to write to certain sectors, or slower performance than the manufacturer’s declared minimum.
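To make that two-value answer concrete, here is a minimal Python sketch of how a set of per-attribute readings might collapse into a single status. It assumes the usual convention that an attribute counts as failed when its normalized value is at or below the vendor's threshold. The Attribute class, the overall_status function, the attribute names and all the numbers are made up for illustration; they are not taken from any real drive or tool.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str       # illustrative label, not an official attribute name
    value: int      # normalized value from the drive (higher is healthier)
    threshold: int  # vendor-set failure threshold


def overall_status(attributes):
    """Collapse every attribute into the single two-value answer."""
    # Assumed convention: an attribute has failed once value <= threshold.
    exceeded = any(a.value <= a.threshold for a in attributes)
    return "threshold exceeded" if exceeded else "threshold not exceeded"


# Hypothetical readings: the first attribute is one point away from failing.
worn_drive = [
    Attribute("reallocated_sectors", value=37, threshold=36),
    Attribute("seek_error_rate", value=60, threshold=30),
]

print(overall_status(worn_drive))  # prints "threshold not exceeded"
```

A drive that is one point from its threshold gets exactly the same answer as a brand-new one, which is the heart of the problem.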

The status also does not necessarily gauge the drive’s past or present usability. If the drive has already failed, then the SMART status may be inaccessible. In addition, if a drive has experienced issues in the past but is not experiencing them at that moment, the SMART test may indicate that the drive is sound.

The SMART standard also only covers the signaling method between the drive and the drive controller. So a drive that supports SMART testing may or may not include a temperature sensor. The specific sensors that should be tested are not specified in the SMART standard. Most external enclosures do not report SMART results over USB or FireWire.
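In practice this means any tool reading SMART data has to treat every attribute as optional. A small Python sketch of that, assuming the attribute table has already been read into a dictionary keyed by attribute ID: ID 194 is commonly used for temperature, but that is vendor convention rather than a requirement, and the table layout, read_temperature and both sample drives are hypothetical.

```python
TEMPERATURE_ID = 194  # commonly used for temperature; convention, not a mandate


def read_temperature(attribute_table):
    """Return the raw temperature reading, or None if the drive has no such attribute."""
    entry = attribute_table.get(TEMPERATURE_ID)
    return None if entry is None else entry["raw"]


# Two hypothetical drives: both "support SMART", only one reports temperature.
drive_a = {5: {"raw": 0}, 194: {"raw": 38}}
drive_b = {5: {"raw": 0}}

print(read_temperature(drive_a))  # prints 38
print(read_temperature(drive_b))  # prints None
```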

So why do I mention this? My wife’s system was working fine in Mac OS X 10.6, but when she booted into Windows XP through Boot Camp, she complained about “stuttering” or momentary freezes. After investigating, I discovered that the hard drive was reporting write failures and read issues with the pagefile. In other words, the hard drive, which had been installed just 4 months earlier, was bad.

The SMART tests did not reveal this on the Macintosh side, since the trouble was on a different partition that wasn’t being used on the Mac side. Windows XP just doesn’t have a built-in SMART test. So the OS was silently logging the errors, and not reporting this critical issue to the user.

I find it slightly funny, though: Windows will display pop-up messages from the System Tray for sometimes the silliest things, but can’t give me a pop-up indicating that there was a failure to write to the pagefile or to a drive.

The good news was that we were able to detect this before it became critical, and we had recent backups (Yeah! Time Machine!).