
S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a technology built into storage drives such as hard disks and SSDs. Its main task is to monitor the condition of the device.

S.M.A.R.T. tracks several parameters during normal operation of the disk: the number of read errors, the spin-up time, and even environmental conditions such as temperature. In addition, S.M.A.R.T. can run self-tests on the drive.
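As a rough illustration of what these parameters look like in practice, here is a minimal Python sketch that parses an attribute table in the layout printed by `smartctl -A` (from the smartmontools package). The sample text is made up for the example and does not come from a real drive, and the `parse_attributes` helper is a hypothetical name, not part of any tool.

```python
# Illustrative sample only: the layout follows the attribute table printed by
# `smartctl -A`, but the values are invented and do not come from a real drive.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   050    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   097   097   021    Pre-fail  Always       -       1425
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   064   052   000    Old_age   Always       -       36
"""


def parse_attributes(text):
    """Return {attribute_name: {"id", "value", "worst", "thresh", "raw"}}."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID;
        # this also skips the header row, whose first field is "ID#".
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attributes[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],  # kept as text; raw formats differ between vendors
        }
    return attributes


attrs = parse_attributes(SAMPLE_OUTPUT)
for name, a in attrs.items():
    print(f"{name}: normalized={a['value']} threshold={a['thresh']} raw={a['raw']}")
```

The raw value is deliberately left as a string, because its meaning and format vary from one manufacturer to another.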

Ideally, S.M.A.R.T. anticipates predictable failures, such as those caused by mechanical wear or deterioration of the disk surface. It cannot anticipate unpredictable failures caused by a sudden, unexpected defect. Since drives usually do not fail abruptly, S.M.A.R.T. helps the operating system or the system administrator identify disks that are likely to fail soon, so that they can be replaced before data is lost.

What can’t S.M.A.R.T. do?

All of this is useful. However, S.M.A.R.T. is not a crystal ball. It cannot predict failure with absolute certainty, and it cannot guarantee that a drive will not fail without warning. At best, S.M.A.R.T. should be used to estimate the likelihood of failure.

Because failure prediction is statistical in nature, S.M.A.R.T. is especially interesting for companies that run a large number of storage devices. Dedicated studies have even been conducted to find out how accurately S.M.A.R.T. can predict failures and signal the need to replace disks in data centers and server farms.

In 2016, Microsoft and Pennsylvania State University conducted a study focusing on SSDs.

According to this study, some S.M.A.R.T. attributes are good indicators of imminent failure. In particular:

Reallocated sectors count:

Although the underlying technologies are radically different, this indicator matters for both SSDs and hard drives. Because of the wear-leveling algorithms used in SSDs, when several sectors fail, there is a high probability that more will fail soon.

Program/erase (P/E) cycle errors:

This is a sign of a problem with the underlying flash hardware: the drive was unable to erase a block or to write data to it. Because the manufacturing process is imperfect, a few such errors can be expected. However, flash memory endures only a limited number of program/erase cycles. For this reason, a sudden increase in the number of these events may indicate that the drive is reaching its end of life, and more memory cells can be expected to fail.

CRC and uncorrectable errors (“Data Error”):

Events of this type can result from storage errors or from problems with the drive’s internal communication channel. This indicator counts both corrected errors (handled by the drive, so nothing is reported to the host system) and uncorrected errors (blocks that the drive has reported to the host system as unreadable). In other words, corrected errors are invisible to the operating system, but they still affect the drive’s performance and increase the likelihood that a sector will be reallocated.

SATA downshift count:

Because of temporary interference, problems with the communication channel between the drive and the host, or internal problems in the drive, the SATA interface may switch to a lower signaling rate. Dropping the link speed below its nominal level has an obvious effect on disk performance. This indicator is therefore significant, especially when it coincides with one or more of the previous indicators.
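The four indicators above can be combined into a simple check. The following is a minimal Python sketch, assuming the counters have already been collected into a dictionary. The key names and the `active_symptoms` helper are hypothetical, since real attribute names and IDs differ between vendors.

```python
# Hypothetical keys: real attribute names and IDs differ between vendors.
INDICATORS = (
    "reallocated_sectors",
    "program_erase_failures",
    "data_errors",
    "sata_downshifts",
)


def active_symptoms(counters):
    """Return the indicators whose raw counter is above zero."""
    return [name for name in INDICATORS if counters.get(name, 0) > 0]


# Made-up counters for a single drive, for illustration only.
drive = {
    "reallocated_sectors": 3,
    "program_erase_failures": 0,
    "data_errors": 12,
    "sata_downshifts": 0,
}
print(active_symptoms(drive))
```

A drive that shows several of these symptoms at once deserves closer attention than one that shows a single non-zero counter.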

According to the study, 62% of failed SSDs showed at least one of the above symptoms. Put the other way around, 38% of the failed drives broke down without showing any of them. The study does not say whether S.M.A.R.T. reported any other kind of “symptom” for those drives. For this reason, this figure cannot be compared directly with the 36% of failures without warning reported in a Google study.

The study does not name the drive models tested; however, according to the authors, most of the drives came from the same vendor and spanned several generations.

The study also noted significant differences in reliability between models. For example, the “worst” model studied showed a 20% failure rate within 9 months of the first reallocation error, and up to a 36% failure rate within 9 months of the first data error. This “worst” model was also the oldest generation of drives considered in the article.

On the other hand, for the same two symptoms, drives of the newest generation failed in only 3% and 20% of cases, respectively. It is difficult to say whether these numbers reflect improvements in drive design and manufacturing, or whether the aging of the older drives also plays a role.

The most interesting point made in the article is that a growing number of reported errors can serve as a warning sign:

“There is a huge possibility of appearance of symptoms preceding the failure of the SSD, which actively manifest themselves and progress rapidly, greatly reducing the drive’s lifespan to several months.”

In other words, one random error reported by S.M.A.R.T. should definitely not be considered a signal of imminent failure.

However, when a previously healthy SSD starts reporting more and more errors, you should expect a failure in the short to medium term. To avoid the consequences of such a situation, order fault tolerance services from System admins PRO right now!
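The difference between a single isolated error and a steadily growing error count can be sketched in a few lines of Python. This is only an illustration: the `error_trend` function, the three-interval window, and the sample series are assumptions made for the example, not values taken from the study.

```python
def error_trend(samples, window=3):
    """Classify a series of cumulative error counts, oldest first.

    Returns "healthy" when no error has been recorded, "growing" when the
    count rose in each of the last `window` intervals, and "isolated"
    otherwise. The window size is an arbitrary choice for this sketch.
    """
    if not samples or samples[-1] == 0:
        return "healthy"
    recent = samples[-(window + 1):]
    # Differences between consecutive readings in the recent window.
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    if len(deltas) == window and all(d > 0 for d in deltas):
        return "growing"
    return "isolated"


print(error_trend([0, 0, 0, 0, 0]))   # healthy
print(error_trend([0, 0, 1, 1, 1]))   # isolated
print(error_trend([0, 1, 3, 8, 20]))  # growing
```

In practice the readings would come from periodic S.M.A.R.T. polling, and the window would be tuned to how often the drive is sampled.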