Solid-state drives have made listening out for a hard drive’s “click of death” an impossible task, but magnetic disks are still in heavy use, especially in servers. Knowing exactly when a drive is going to fail can’t be left up to the sharpness of one’s hearing. That’s what SMART, or Self-Monitoring, Analysis and Reporting Technology, is designed for, but getting any actionable information from a drive’s SMART data is difficult at best. Through trial and error, one company has figured out which SMART stats you should be paying attention to.
Image: Kenny Louie / Flickr, licensed under Creative Commons 2.0
As Computerworld’s Lucas Mearian reports, Backblaze, an online backup provider, has been keeping an eye on SMART data from its servers and identifying which of the many values provide a reliable indication of impending failure. This isn’t a straightforward task as the data is not consistent across hard drive models and manufacturers.
40,000 drives later, the results are in:
SMART 5 – Reallocated_Sector_Count
SMART 187 – Reported_Uncorrectable_Errors
SMART 188 – Command_Timeout
SMART 197 – Current_Pending_Sector_Count
SMART 198 – Offline_Uncorrectable
In Backblaze’s experience, the above five metrics were the most reliable and consistent when it came to predicting when a drive would give up the ghost. To see the numbers for your own hardware, grab a tool such as CrystalDiskInfo (it’s free; just watch out for the browser toolbar in the installer) and poll your drives. Hopefully they’re all in decent nick.
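If you would rather script the check than eyeball a GUI, here is a minimal Python sketch that picks the five attributes listed above out of a SMART attribute table. It assumes the column layout printed by `smartctl -A` from smartmontools (ID first, raw value in the tenth column). The sample text is made up for illustration and is not real drive data, and flagging any non-zero raw value is an assumption for the example, not a threshold given in the article. The function names are hypothetical.

```python
import re
import subprocess

# The five attribute IDs from the list above.
WATCHED_IDS = {5, 187, 188, 197, 198}

# Illustrative sample only (not real drive data), in the column layout
# that `smartctl -A` prints for ATA drives.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""


def parse_watched_attributes(smartctl_output):
    """Return {id: (name, raw_value)} for the watched attributes present."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header and any other text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        # Raw values can carry vendor-specific extras, so keep only the
        # leading integer of the tenth column.
        match = re.match(r"\d+", fields[9])
        if match:
            found[attr_id] = (fields[1], int(match.group()))
    return found


def nonzero_attributes(found):
    """Keep only attributes whose raw value is above zero (assumed rule)."""
    return {attr_id: item for attr_id, item in found.items() if item[1] > 0}


def read_drive(device):
    """Run smartctl on a device such as /dev/sda; usually needs root."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_watched_attributes(result.stdout)


if __name__ == "__main__":
    for attr_id, (name, raw) in sorted(
        nonzero_attributes(parse_watched_attributes(SAMPLE_OUTPUT)).items()
    ):
        print(f"SMART {attr_id} ({name}) raw value: {raw}")
```

Swap `parse_watched_attributes(SAMPLE_OUTPUT)` for `read_drive("/dev/sda")` to try it on a real disk. Attribute names and raw-value formats vary between models and manufacturers, so treat the output as a prompt to investigate rather than a verdict.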
The 5 SMART stats that actually predict hard drive failure [Computerworld]
Comments
2 responses to “Your Hard Drive’s Time Of Death Is Based On These Numbers”
Neat, I shall get smartmontools on my NAS to monitor these counters and alert on anything out of the ordinary. 🙂
All my machines get Hard Disk Sentinel by default 🙂
I’d be interested in any recommendations of alternatives ppl have to offer too.
This seems really self-evident...
“When the stats that monitor your drive’s health show a lot of problems, it’s more likely to fail”
Have you taken a look at your SMART data recently? Some drives will display a lot of different numbers for a lot of different events, some of which are more useful as indicators of drive health/failure than others.
Having a top five to watch out for makes it far simpler to deduce the health of the drive than trying to work out which metrics reflect which types of error, and how severe those errors are in your situation. Is it not better to have a catch-all five than to guess based on potentially hundreds of data points?
I like to keep a copy of the table from Wikipedia printed out and posted up near my NAS. They have highlighted the top 9 and provided a short description of each item, which is great when your SMART monitor only lists the errata, or when viewing SMART logs in the CLI.