Many computer users take hard drive reliability for granted and never consider even the small possibility of a drive crash. They assume that hard disk manufacturers have greatly improved the reliability of their products. They have, but the reality is that all disks die eventually. Even if you have a recent backup, a sudden disk failure is a minor catastrophe. How can we protect ourselves from a sudden hard drive crash? One way is SMART (Self-Monitoring, Analysis and Reporting Technology), which tries to predict failures before they happen.

The essential point is that the user should understand how and why drives fail. A hard disk can suffer two classes of failure: unpredictable and predictable.

Unpredictable failures happen suddenly and without warning. They can be caused by catastrophic events, handling damage, static electricity or an electronic component burning out, and nothing can be done to foresee or avoid them.

About 60% of predictable failures are mechanical, and they develop gradually over time. Causes of this gradual degradation include head crashes, head contamination, bad solder joints, bad circuit board connections, motor breakdown, worn bearings, inability to spin up, excessive run-out and bad servo positioning.

Most hard drives lose their performance slowly, and through SMART the disk can monitor and diagnose the condition of many of its components, providing an early warning for many types of problems. When a potential problem is detected, the drive can be repaired or replaced before any data is lost.

This technology has become an industry standard for drive manufacturers. It allows checking and reporting hard drive status and gives some estimate of a future failure date. SMART is designed to detect the gradual degradation of the disk. The original SMART spec (SFF-8035i) was written by a group of disk drive manufacturers. In 1995, parts of SFF-8035i were merged into the ATA-3 standard. Starting with the ATA-4 standard, the requirement for disks to maintain an internal Attribute table was dropped. Instead, disks are only required to return an OK or NOT OK response to an inquiry about their health. A negative response indicates that the disk firmware has determined that the disk is likely to fail. The ATA-5 standard added an ATA error log and commands to run disk self-tests to the SMART command set.

Self-Monitoring, Analysis and Reporting Technology (SMART) systems are built into most modern ATA and SCSI hard disks. SMART disk drives internally monitor their own health and performance. SMART features include a set of attributes, which describe reliability-prediction parameters of the drive and whose limits should not be exceeded under normal operation. Each attribute has an identification number (ID).

Some types of reliability parameters are:

- Distance between the heads and the disk platters;
- Faulty sectors;
- Recalibration;
- Drive spin-up time;
- Drive temperature;
- Characteristics of the media;
- Motor and servomechanisms.

An attribute value is a positive integer, usually in the range from 1 to 253. Initially, all attributes have their maximum values; a value of 100 or 200 is often chosen as the "normal" value. Some attributes are considered life-critical, while others are merely "informative". As the drive wears or some of its components approach failure, the attribute values decrease. Consequently, high values indicate high reliability, and low values indicate low reliability or a high probability of drive failure. A specific threshold is assigned to each attribute. Once the value drops below this threshold, SMART considers the disk to be faulty, and storing data on the drive becomes very risky.
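The comparison of attribute values against thresholds can be sketched in a few lines of Python. This is a minimal illustration, not how any particular drive firmware works: the `SmartAttribute` class, the `assess` function and the three result labels are hypothetical names, and the numbers are made up. The sketch treats a value at or below the threshold as failed, which is the convention described in the smartmontools documentation; a strict "below" comparison would match the wording above literally. Reporting informative attributes as a warning rather than a failure is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int       # normalized value reported by the drive (higher is healthier)
    threshold: int   # vendor-assigned failure threshold
    critical: bool   # True for life-critical, False for informative attributes


def has_failed(attr: SmartAttribute) -> bool:
    # Assumption: a value at or below the threshold counts as failed.
    return attr.value <= attr.threshold


def assess(attrs):
    """Return 'FAILING', 'WARNING' or 'OK' for a list of attributes."""
    failed = [a for a in attrs if has_failed(a)]
    if any(a.critical for a in failed):
        return "FAILING"   # a life-critical attribute crossed its threshold
    if failed:
        return "WARNING"   # only informative attributes crossed theirs
    return "OK"


# Illustrative numbers only, not taken from a real drive.
healthy = [
    SmartAttribute(5, "Reallocated Sectors Count", 100, 36, True),
    SmartAttribute(194, "Temperature", 62, 0, False),
]
worn = [
    SmartAttribute(5, "Reallocated Sectors Count", 30, 36, True),
    SmartAttribute(194, "Temperature", 62, 0, False),
]

print(assess(healthy))
print(assess(worn))
```

With these sample values the first drive is reported as healthy and the second as failing, because its reallocated-sector value has dropped under the threshold.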

The following list describes some critical hard drive attributes.

ID	Attribute name	Description
01	Read Error Rate	Indicates the rate of hardware read errors that occurred when reading data from a disk surface. Lower values indicate a problem with either disk surface or read/write heads.
05	Reallocated Sectors Count	Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks this sector as "reallocated" and transfers data to a special reserved area. The more sectors that are reallocated, the more read/write speed will decrease.
11	Recalibration Retries	This attribute indicates the number of times recalibration was requested (under the condition that the first attempt was unsuccessful). A decrease of this attribute value is a sign of problems in the hard disk mechanical subsystem.
194	Temperature	Current internal temperature.
196	Reallocation Event Count	Count of reallocation operations. The raw value of this attribute shows the total number of attempts to transfer data from reallocated sectors to a spare area.
197	Current Pending Sector Count	Number of "unstable" sectors. When unstable sectors are read successfully, the value is decreased. If errors occur when reading a sector, the drive will attempt to recover the data, transfer it to the reserved area and mark the sector as remapped.
198	Uncorrectable Sector Count	The total number of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
220	Disk Shift	Distance the disk has shifted relative to the spindle (usually due to shock). Unit of measure is unknown.
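As a rough sketch of how such a table can be inspected programmatically, the Python below parses text laid out in the columns that the smartmontools command `smartctl -A` prints and flags the sector-related counters (IDs 5, 196, 197 and 198). The sample text is made up for illustration and does not come from a real drive; the function names are hypothetical, and treating any non-zero raw count as suspicious is a common rule of thumb assumed here, not something the drive itself reports.

```python
# Sector-related counters from the table above; the assumption here is that
# any non-zero raw count for these IDs deserves a closer look.
WATCHED_RAW_COUNTS = {5, 196, 197, 198}

# Illustrative text in the column layout of `smartctl -A`; numbers are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       12
194 Temperature_Celsius     0x0022   066   052   000    Old_age   Always       -       34 (Min/Max 20/48)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_attributes(text):
    """Yield one dict per attribute row; header and blank lines are skipped."""
    for line in text.splitlines():
        fields = line.split(None, 9)  # the last column (raw value) may contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        yield {
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "threshold": int(fields[5]),
            "raw": fields[9].strip(),
        }


def suspicious(attrs):
    """Return the IDs of watched attributes whose raw count is non-zero."""
    flagged = []
    for a in attrs:
        raw_count = a["raw"].split()[0]  # leading number of the raw column
        if a["id"] in WATCHED_RAW_COUNTS and raw_count.isdigit() and int(raw_count) > 0:
            flagged.append(a["id"])
    return flagged


rows = list(parse_attributes(SAMPLE))
print(suspicious(rows))
```

For the sample text this flags attributes 5 and 197, whose raw counts are 12 and 3.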

Currently, the SMART system can detect about 70% of all hard drive errors. Its main shortcoming is that it doesn't provide a direct mechanism for informing the OS or the user if problems are found. In fact, because disk SMART status is frequently not monitored, many disk problems go undetected until they lead to a catastrophic failure.
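Because the drive does not push warnings to the user, something has to ask it periodically. The sketch below shows one possible way to do that in Python, assuming the smartmontools utility `smartctl` is installed and that the script has permission to query the drives. It relies on the documented exit status of `smartctl`, where bit 3 means the SMART status check returned "DISK FAILING". The function names, the device paths in the usage note and the notification message are hypothetical.

```python
import subprocess

# Bit 3 of smartctl's exit status is documented as
# "SMART status check returned DISK FAILING".
DISK_FAILING_BIT = 1 << 3


def disk_is_failing(exit_status: int) -> bool:
    return bool(exit_status & DISK_FAILING_BIT)


def check_drive(device, run=subprocess.run):
    """Ask one drive for its overall health; `run` can be replaced in tests."""
    result = run(["smartctl", "-H", device], capture_output=True, text=True)
    return disk_is_failing(result.returncode)


def report(devices, notify=print, run=subprocess.run):
    """Call `notify` once for every device whose health check reports failure."""
    failing = [d for d in devices if check_drive(d, run=run)]
    for d in failing:
        notify(f"SMART reports {d} as failing: back up and replace it")
    return failing
```

A scheduled job could call something like `report(["/dev/sda", "/dev/sdb"])` once a day and route `notify` to e-mail or a desktop alert. Ready-made monitoring programs do the same job with far more care, so this is only meant to show the idea.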

By monitoring a drive's behavior, SMART aims to warn the user about the threat of drive failure while there is still time to take preventive action, such as backing up the data to a replacement device. So why not use the SMART monitoring programs freely available on the Internet to head these problems off at the pass?

