Notice that automatic offline testing is enabled and runs every 4 hours. There is some debate about whether offline tests affect performance. According to the smartctl man page:
This type of test can, in principle, degrade the device performance. The ‘-o on’ option causes this offline testing to be carried out, automatically, on a regular scheduled basis. Normally, the disk will suspend offline testing while disk accesses are taking place, and then automatically resume it when the disk would otherwise be idle, so in practice it has little effect.
So it’s reasonably safe to turn on this option, but it’s up to you to decide, based on your workload, how it will impact your performance. For my desktop I don’t worry about the performance impact, but for a server, particularly a storage server where the storage could be under constant load, I might think twice about turning this on. Instead I might rely on a cron job to run offline tests (or perhaps only run them during a maintenance period). Regardless, take the time to test your situation and decide how often you will run an offline test (but you should run one periodically).
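As a rough illustration of the "only during a maintenance period" idea, here is a minimal Python sketch that a cron job could call. The 02:00 to 05:00 window and the function names are my own assumptions for illustration; the smartctl path is the one used in this article, and "-t offline" is the smartctl test type that starts an immediate offline test.

```python
import datetime
import subprocess

# Path used in this article; adjust for your system.
SMARTCTL = "/usr/local/sbin/smartctl"


def in_maintenance_window(now, start_hour=2, end_hour=5):
    """Return True if `now` falls inside the (hypothetical) quiet period."""
    if start_hour <= end_hour:
        return start_hour <= now.hour < end_hour
    # Window wraps past midnight, e.g. 22:00 to 04:00.
    return now.hour >= start_hour or now.hour < end_hour


def offline_test_command(device):
    """Build the smartctl command that starts an immediate offline test."""
    return [SMARTCTL, "-t", "offline", device]


def run_offline_test_if_quiet(device, now=None):
    """Start the test only inside the window; return True if it was started."""
    now = now or datetime.datetime.now()
    if not in_maintenance_window(now):
        return False
    subprocess.run(offline_test_command(device), check=True)
    return True
```

Calling run_offline_test_if_quiet("/dev/sdb") from an hourly cron entry would then start a test only during the quiet hours. It needs root privileges, just like the smartctl commands in this article.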
The next smartmontools option to try is the “-c” option. This option prints out the generic SMART “capabilities” of the drive. In this case, capabilities refers to the ability to run tests and store the results in a log. An example of the output from smartctl using the -c option is shown below for /dev/sdb.
[root@test64 laytonjb]# /usr/local/sbin/smartctl -c /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 642) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 119) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103b) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
There are some interesting bits in this output. You can read through the various entries, but one part that stands out is that the drive is capable of running self-tests: the recommended polling time is 1 minute for the “short” self-test and 119 minutes for the “extended” self-test. (The offline data collection discussed earlier is a separate operation, which takes 642 seconds on this drive.)
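If you want to pull those polling times out of the “-c” output in a script, something like the following Python sketch might do. The function name and the regular expression are my own; it assumes the two-line layout shown in the listing above.

```python
import re

# Matches e.g. "Short self-test routine" followed on the next line by
# "recommended polling time: ( 1) minutes."
POLLING_RE = re.compile(
    r"(Short|Extended|Conveyance) self-test routine\s+"
    r"recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)


def polling_times(capabilities_text):
    """Map each self-test type to its recommended polling time in minutes."""
    return {
        name.lower(): int(minutes)
        for name, minutes in POLLING_RE.findall(capabilities_text)
    }


sample = """\
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 119) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
"""
print(polling_times(sample))
# prints {'short': 1, 'extended': 119, 'conveyance': 2}
```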
Since this is the first time the drives have been examined using smartmontools, both the short and extended self tests should be run. The output below is for the short self-test.
[root@test64 laytonjb]# /usr/local/sbin/smartctl -t short /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sat Apr 10 22:15:37 2010
Use smartctl -X to abort test.
This command starts the self-test. To see the results once it has finished, use the “-l selftest” option with smartctl.
[root@test64 laytonjb]# /usr/local/sbin/smartctl -l selftest /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 1432 -
# 2 Short offline Completed without error 00% 1432 -
# 3 Short offline Completed without error 00% 1432 -
# 4 Short offline Completed without error 00% 1432 -
You can see that I ran the test 4 times (just to be sure), and all four tests completed without error.
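Reading the log by eye is fine for one drive, but for several drives you may want to parse it. Here is a minimal Python sketch that turns the rows of the self-test log into records. The names SelfTestEntry and parse_selftest_log are hypothetical, and the pattern assumes the row layout shown above, with a two-word test description such as “Short offline”.

```python
import re
from typing import NamedTuple


class SelfTestEntry(NamedTuple):
    num: int
    description: str
    status: str
    remaining_percent: int
    lifetime_hours: int
    lba_of_first_error: str


# One log row: "# 1 Short offline Completed without error 00% 1432 -"
ROW_RE = re.compile(
    r"^#\s*(\d+)\s+(\S+ \S+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$",
    re.MULTILINE,
)


def parse_selftest_log(text):
    """Return the log rows in the order smartctl prints them (newest first)."""
    return [
        SelfTestEntry(int(num), desc, status, int(rem), int(hours), lba)
        for num, desc, status, rem, hours, lba in ROW_RE.findall(text)
    ]


def latest_test_passed(entries):
    """True if the most recent entry reports completion without error."""
    return bool(entries) and entries[0].status == "Completed without error"
```

Feeding it the text printed by “smartctl -l selftest” gives a list whose first element is the most recent test.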
We can also invoke the extended self-test (listed as “Extended offline” in the log) in a similar way.
[root@test64 laytonjb]# /usr/local/sbin/smartctl -t long /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 119 minutes for test to complete.
Test will complete after Sat Apr 10 22:24:41 2010
Use smartctl -X to abort test.
Just as with the short self-test, the result shows up in the log once the test is done, and you list it using the smartctl option “-l selftest”.
[root@test64 laytonjb]# /usr/local/sbin/smartctl -l selftest /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1433 -
# 2 Short offline Completed without error 00% 1432 -
# 3 Short offline Completed without error 00% 1432 -
# 4 Short offline Completed without error 00% 1432 -
# 5 Short offline Completed without error 00% 1432 -
The extended test took a while to finish but as you can see it completed without error.
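Rather than guessing when a two-hour test is finished, a script can poll the “Self-test execution status” value that the “-c” output prints in parentheses. The sketch below is my own and rests on one assumption you should check against your drive: that, as I understand the ATA self-test status byte, a high nibble of 15 means a test is in progress and the low nibble is the remaining work in tens of percent. The two callables passed to wait_for_selftest are hypothetical hooks so the loop can be tried without touching a real drive.

```python
import re

STATUS_RE = re.compile(r"Self-test execution status:\s*\(\s*(\d+)\)")


def selftest_status(capabilities_text):
    """Return (in_progress, percent_remaining) from `smartctl -c` output."""
    match = STATUS_RE.search(capabilities_text)
    if match is None:
        raise ValueError("no self-test execution status line found")
    status_byte = int(match.group(1))
    in_progress = (status_byte >> 4) == 0xF  # assumed: high nibble 15 = running
    percent_remaining = (status_byte & 0xF) * 10 if in_progress else 0
    return in_progress, percent_remaining


def wait_for_selftest(read_capabilities, sleep, poll_seconds=60, max_polls=240):
    """Poll until the drive reports that no self-test is running.

    read_capabilities: callable returning the text of `smartctl -c`.
    sleep: callable taking seconds, e.g. time.sleep.
    Returns True when the test has finished, False if max_polls ran out.
    """
    for _ in range(max_polls):
        in_progress, _remaining = selftest_status(read_capabilities())
        if not in_progress:
            return True
        sleep(poll_seconds)
    return False
```

In real use, read_capabilities could wrap a subprocess call to “smartctl -c /dev/sdb” and sleep could simply be time.sleep. Once the loop returns, the self-test log still holds the actual result.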
We can also search the SMART logs for “errors” with a simple command:
[root@test64 laytonjb]# /usr/local/sbin/smartctl -l error -d sat /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
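The option “-d sat” hasn’t been used before. The “-d” option specifies the device type, and “sat” stands for SCSI to ATA Translation, which tells smartctl to treat the device as an ATA drive sitting behind a SCSI layer. Specifying it just saves smartctl from having to determine the type of drive.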
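For a script that checks many drives, the error log output can be reduced to a single number. The following Python sketch is only a starting point: “No Errors Logged” is the text shown above, while the “ATA Error Count:” line is what I recall smartctl printing for ATA drives that do have logged errors, so treat that part as an assumption and verify it against your own output.

```python
import re


def logged_error_count(error_log_text):
    """Return the number of errors reported by `smartctl -l error`."""
    if "No Errors Logged" in error_log_text:
        return 0
    # Assumed format for drives with errors: "ATA Error Count: 5 ..."
    match = re.search(r"ATA Error Count:\s*(\d+)", error_log_text)
    if match:
        return int(match.group(1))
    raise ValueError("unrecognized error log output")
```

Raising an exception on unrecognized output is deliberate: silently returning zero for a drive the script does not understand would hide exactly the problems you are looking for.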
Now that it looks like the drive is good (no errors and SMART is enabled), we can start to probe the drive a little further. Earlier we used the “-c” option to list the test and reporting capabilities of the drive. The “-a” option prints all of the SMART information for the drive, including the vendor-specific SMART attributes:
[root@test64 SMARTMONTOOLS]# /usr/local/sbin/smartctl -a /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.11 family
Device Model: ST3500320AS
Serial Number: 9QM5WJ21
Firmware Version: SD15
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sun Apr 11 17:32:31 2010 EDT
==> WARNING: There are known problems with these drives,
AND THIS FIRMWARE VERSION IS AFFECTED,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 642) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 119) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103b) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   100   006    Pre-fail  Always       -       86741246
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       82
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       170269847
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1435
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       83
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   063   045    Old_age   Always       -       26 (Lifetime Min/Max 18/26)
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   023   023   000    Old_age   Always       -       86741246
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 1435 -
# 2 Extended offline Completed without error 00% 1433 -
# 3 Short offline Completed without error 00% 1432 -
# 4 Short offline Completed without error 00% 1432 -
# 5 Short offline Completed without error 00% 1432 -
# 6 Short offline Completed without error 00% 1432 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
This listing is rather long but has a great deal of information buried in it. We’ve seen the first part of the output before using the “-c” option; it’s the capabilities section. But we have not seen the second part yet. The table in the second part has labels starting with “ID# ATTRIBUTE_NAME FLAG … ” and contains the vendor-specific SMART attributes. The first attribute to examine is the first row, “Raw_Read_Error_Rate.”
The Raw Read Error Rate attribute is the rate of hardware read errors that occurred when reading data from the drive. The value of the attribute is 115, its worst value is 100, and the threshold is 006. Does this mean the read error rate is 115 when the threshold is 6? Not necessarily. The VALUE, WORST, and THRESH columns are normalized numbers defined by the manufacturer, where higher is better and the attribute is considered failed once the value drops to or below the threshold, so the absolute values are meaningless without knowing their definitions. What you should do is track that attribute and see when/if it changes.
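To make the relationship between the value and the threshold concrete, here is a small Python sketch that parses rows of the attribute table and reports how many normalized points remain before an attribute reaches its threshold. The Attribute record and both function names are my own inventions, and the parser assumes the ten-column layout shown in the listing above.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    attr_type: str
    when_failed: str
    raw: str


def parse_attributes(text):
    """Parse rows of the 'Vendor Specific SMART Attributes' table."""
    attributes = []
    for line in text.splitlines():
        fields = line.split(None, 9)  # the raw value may contain spaces
        if (len(fields) < 10 or not fields[0].isdigit()
                or not fields[2].startswith("0x")):
            continue  # skip the header and any unrelated lines
        attributes.append(Attribute(
            attr_id=int(fields[0]), name=fields[1],
            value=int(fields[3]), worst=int(fields[4]), thresh=int(fields[5]),
            attr_type=fields[6], when_failed=fields[8], raw=fields[9].strip(),
        ))
    return attributes


def headroom(attr):
    """Normalized points left before the value reaches the threshold."""
    return attr.value - attr.thresh
```

For the Raw_Read_Error_Rate row above, headroom gives 115 - 6 = 109. A headroom of zero or less means the attribute has reached its threshold.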
There are other attributes that are useful to monitor as well. Here is a sample of the attributes reported by the drive in this article.
- Reallocated_Sector_Ct: This is the number of reallocated sectors on the drive. Basically this means that there has been a verification error on a specific sector on the drive and that sector is remapped to an area that has spare sectors. Typically the “raw” value is the number of sectors that have been remapped.
- Seek_Error_Rate: This is the rate of seek errors of the drive heads.
- End-to-End_Error: This is the number of errors when the data transferred through the drive cache does not match the data at the host. Typically this is measured by a parity calculation.
- Command_Timeout: The number of aborted drive operations due to a drive timeout.
There are other attributes as well. Typically Google will turn up a discussion about them (don’t forget that they vary from manufacturer to manufacturer and drive to drive).
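Since the advice is to track attributes and watch for changes, here is one way that might look in Python. It compares two snapshots of raw values, each a plain dictionary keyed by attribute name. The function name and the list of watched counters are my own choices for illustration, not anything smartmontools prescribes.

```python
# Raw counters from the table above that are zero on this drive and are
# worth a second look if they start to climb (an illustrative selection).
WATCHED = (
    "Reallocated_Sector_Ct",
    "End-to-End_Error",
    "Command_Timeout",
    "Current_Pending_Sector",
)


def changed_counters(previous, current, watched=WATCHED):
    """Return {name: (old, new)} for watched raw counters that changed."""
    changes = {}
    for name in watched:
        if name in previous and name in current:
            if previous[name] != current[name]:
                changes[name] = (previous[name], current[name])
    return changes


yesterday = {"Reallocated_Sector_Ct": 0, "Command_Timeout": 0}
today = {"Reallocated_Sector_Ct": 3, "Command_Timeout": 0}
print(changed_counters(yesterday, today))
# prints {'Reallocated_Sector_Ct': (0, 3)}
```

I left Seek_Error_Rate and Raw_Read_Error_Rate out of the watched list because, on this drive at least, their raw values are large numbers whose meaning is not obvious.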
Summary
This article is just a quick introduction to smartmontools which allows Linux users to work with the SMART attributes and capabilities of storage devices. The tool is easy to configure and works quite well for most common drives. However, remember that the SMART attributes are not standard so smartmontools may not know about your particular drive (or RAID card). It may take some work to get it to understand the attributes of your particular drive (don’t hesitate to use the smartmontools mailing list) but when it is included in the smartmontools database, life is a bit easier for querying SMART attributes and capabilities.
SMART can be a great asset for administrators and even home users. It has a great deal of capability and can be used to watch the history of your storage devices. The capability is quite broad and we haven’t even gotten into the smartmontools daemon, smartd. That’s the subject for future articles. In the meantime, take a look at the smartmontools webpage and look at the man pages. Take some time to read through the documentation and then start checking your own storage devices.