PROBLEM Failing Hard Drive? How do I know which one?

Discussion in 'Hardware' started by TheBlueDalek, Jun 28, 2013.

  1. Offline

    TheBlueDalek

    Member Since:
    Jun 28, 2013
    Messages:
    8
    Message Count:
    8
    Likes Received:
    0
    Trophy Points:
    1
    TheBlueDalek, Jun 28, 2013

    Hi all

    I've been happily using FreeNAS for about 2 years now and have never had a problem.
    Last weekend, however, I logged into the admin GUI and it was complaining about an unknown error. Google returned a couple of results, and I ended up with this:

    Code (text):
    [root@freenas] ~# zpool status -x
      pool: storage
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:

            NAME                                            STATE     READ WRITE CKSUM
            storage                                         ONLINE       0     0     0
              raidz1                                        ONLINE       0     0     0
                gptid/242960b3-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/2488f665-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/24e5ce4f-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/256c0c11-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0    96

    errors: No known data errors

    I spoke to one of my co-workers, who has been using FreeNAS for years, and he suggested scrubbing the array.
    After the scrub, I ended up with the following:

    Code (text):
    [root@freenas] ~# zpool status storage
      pool: storage
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: scrub completed after 6h22m with 0 errors on Fri Jun 28 02:16:46 2013
    config:

            NAME                                            STATE     READ WRITE CKSUM
            storage                                         ONLINE       0     0     0
              raidz1                                        ONLINE       0     0     0
                gptid/242960b3-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/2488f665-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/24e5ce4f-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/256c0c11-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0  180K  7.61G repaired

    errors: No known data errors

    It looks like one of my drives is failing. The question is, how do I know which one? I'm Linux competent, but have very limited BSD knowledge. In Linux, there is a command, lshw, that shows all the drives including serial and model numbers.

    Is there a similar command or utility that will show me what gptid/256c0c11-4b27-11e2-b9e0-50e54952401a is in terms of make / model / SN?

    Many thanks in advance!
  2. Offline

    titan_rw NAS-ty with the FreeNAS

    Member Since:
    Sep 1, 2012
    Messages:
    390
    Message Count:
    390
    Likes Received:
    25
    Trophy Points:
    28
    Location:
    Canada
    titan_rw, Jun 28, 2013

    Have you checked "volume status" under "active volumes" in the 'storage' section? It should list which /dev/[a]daX device it is.

    Then 'view disks' will let you match that to a disk serial number.

    How often is your scheduled scrub set up? 180,000 checksum errors is a lot. Either the drive is returning huge amounts of bad data, or something else weird is going on.

    Do all the drives pass 'long' SMART tests?

    Can you paste a "smartctl -a -q noserial /dev/adaX" for whatever the 'problem' drive is?
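
    For anyone who would rather script the check than read the table by eye, here is a minimal Python sketch that picks out the devices with non-zero error counters from saved `zpool status` text. The function names are made up for illustration, and treating a suffix such as "K" as a decimal multiplier is an assumption (zpool may round in binary units), so abbreviated counts should be read as approximate.

```python
import re

# Assumed multipliers for abbreviated counters such as "180K" (approximate).
SUFFIXES = {"K": 1_000, "M": 1_000_000, "G": 1_000_000_000}

# Device states that can appear in the STATE column of the config table.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def parse_count(text):
    """Convert a counter such as '96' or '180K' to an int."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMG]?)", text)
    if not match:
        raise ValueError(f"unrecognised counter: {text!r}")
    return int(float(match.group(1)) * SUFFIXES.get(match.group(2), 1))


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    bad = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM [notes...]
        if len(fields) >= 5 and fields[1] in STATES:
            try:
                counts = tuple(parse_count(f) for f in fields[2:5])
            except ValueError:
                continue  # not a counter row after all
            if any(counts):
                bad[fields[0]] = counts
    return bad
```

    Given the post-scrub output from the first post (as plain text, without the forum's line numbers), this should report only the gptid/256c0c11-... member.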
  3. Offline

    cyberjock Forum Guard Dog/Admin

    Member Since:
    Mar 25, 2012
    Messages:
    13,651
    Message Count:
    13,651
    Likes Received:
    704
    Trophy Points:
    113
    cyberjock, Jun 28, 2013

    gpart list will match your gptid to a device. Then look at the device in the FreeNAS GUI to get the serial number.
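
    If the `gpart list` output is long, the lookup can be scripted. The following is only a sketch in Python: it assumes the output has been saved as text, that each partition entry has a "Name:" line followed by a "rawuuid:" line, and that partitions are named like ada4p2. The helper names are hypothetical.

```python
import re


def gptid_to_partition(gpart_text):
    """Map each rawuuid in `gpart list` text to its partition name."""
    mapping = {}
    current = None
    for line in gpart_text.splitlines():
        line = line.strip()
        # Entries start with an index, e.g. "2. Name: ada4p2"
        if "Name:" in line:
            current = line.split("Name:", 1)[1].strip()
        elif line.startswith("rawuuid:") and current:
            mapping[line.split(":", 1)[1].strip()] = current
    return mapping


def whole_disk(partition):
    """Strip a trailing GPT partition suffix: 'ada4p2' -> 'ada4'."""
    return re.sub(r"p\d+$", "", partition)
```

    Looking up the part of the gptid after "gptid/" in the returned mapping gives the partition, and `whole_disk` gives the /dev/adaX name to use with smartctl.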
  4. Offline

    paleoN FreeNAS Guru

    Member Since:
    Apr 22, 2012
    Messages:
    1,403
    Message Count:
    1,403
    Likes Received:
    15
    Trophy Points:
    38
    paleoN, Jun 28, 2013

    I find glabel status is typically "better" for this.
  5. Offline

    TheBlueDalek

    TheBlueDalek, Jun 29, 2013

    Thanks for the help guys.

    gpart list gives me:

    Code (text):
    2. Name: ada4p2
       Mediasize: 2998445415936 (2.7T)
       Sectorsize: 512
       Stripesize: 4096
       Stripeoffset: 0
       Mode: r1w1e2
       rawuuid: 256c0c11-4b27-11e2-b9e0-50e54952401a
       rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
       label: (null)
       length: 2998445415936
       offset: 2147549184
       type: freebsd-zfs
       index: 2
       end: 5860533134
       start: 4194432

    This is a brand new (4 months old) Seagate Barracuda.

    Code (text):
    [root@freenas] ~# smartctl -a -q noserial /dev/ada4
    smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Device Model:     ST3000DM001-9YN166
    Firmware Version: CC4H
    User Capacity:    3,000,592,982,016 bytes [3.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Sat Jun 29 14:28:51 2013 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                 ( 600) seconds.
    Offline data collection
    capabilities:                    (0x73) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            No Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 255) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x3085) SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       127495160
      3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       398
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   038   036   030    Pre-fail  Always       -       9938568106768
      9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3662
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       16
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   067   056   045    Old_age   Always       -       33 (Min/Max 29/44)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       7
    193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4585
    194 Temperature_Celsius     0x0022   033   044   000    Old_age   Always       -       33 (0 12 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   123   000    Old_age   Always       -       2268097
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       227972569105024
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       5891963953900
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4241206544186

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]


    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

    I've told it to do a full SMART test and can post the results later if needed.
    If this is not a drive problem, the rest of my hardware is as follows:

    Code (text):
    Hostname             freenas.local
    Build                FreeNAS-8.2.0-RELEASE-p1-x64 (r11950)
    Platform             AMD A4-3400 APU with Radeon(tm) HD Graphics
    Memory               7663MB
    System Time          Sat Jun 29 14:30:19 EDT 2013
    Uptime               2:30PM up 16:06, 1 user
    Load Average         0.08, 0.03, 0.00
    Connected through    192.168.1.122

    I don't have the specific motherboard model number, but it is a Gigabyte full ATX.
  6. Offline

    titan_rw NAS-ty with the FreeNAS

    titan_rw, Jun 29, 2013

    Code (text):
    199 UDMA_CRC_Error_Count    0x003e   200   123   000    Old_age   Always       -       2268097

    You're sure it's not a bad cable or something?

    I thought UDMA CRC error counts usually pointed to the controller, cable, etc. rather than the drive itself?

    That's very possibly where your checksum errors are coming from.

    Have you tried bypassing the 5 bay enclosure thing? I had a 3 bay enclosure that would occasionally give me UDMA errors until I moved it to SATA2 ports. When it was connected to SATA3 ports, I got intermittent errors on the drives. I wrote it off as the enclosure not being rated for SATA3.
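
    To compare this attribute across all the drives without pasting every full report, something like the Python sketch below could pull the raw values out of saved `smartctl -a` text. It assumes the usual ten-column attribute table shown earlier in the thread, and the function names are invented for the example. A non-zero UDMA_CRC_Error_Count only says link errors happened at some point; the counter is generally cumulative, so what matters is whether it keeps climbing.

```python
def smart_raw_values(smartctl_text):
    """Return {attribute_name: raw_value} from a `smartctl -a` attribute table."""
    values = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE...
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            values[fields[1]] = fields[9]
    return values


def crc_error_count(smartctl_text):
    """Return the raw UDMA CRC error count, or 0 if the attribute is absent."""
    return int(smart_raw_values(smartctl_text).get("UDMA_CRC_Error_Count", "0"))
```

    Running it on reports taken a few days apart would show whether the count is still increasing after a cable or port change.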
  7. Offline

    TheBlueDalek

    TheBlueDalek, Jun 29, 2013


    It's not an 'enclosure', but rather a purpose built PC.
    It could be a cable I guess. I'll try swapping it and see if that clears the issue.

    I picked up a couple WD Red drives just in case it was a bad drive.. maybe this means I can take 'em back for a refund! :)
  8. Offline

    titan_rw NAS-ty with the FreeNAS

    titan_rw, Jun 29, 2013

    Never mind. Must have been thinking of a different post.

    You should really have scheduled scrubs. It will 'prove' that everything is good every time it does a scrub instead of relying on whenever you happen to read through the old data.

    Definitely swap everything. Try a different cable, different sata port, different disk. I know you've done some of this switching around of things already.

    Do any of the other drives show UDMA CRC errors?
  9. Offline

    TheBlueDalek

    TheBlueDalek, Jun 29, 2013

    How odd..

    I simply restarted the system...

    Code (text):
    [root@freenas] ~# zpool status storage
      pool: storage
     state: ONLINE
     scrub: scrub completed after 3h3m with 0 errors on Sat Jun 29 01:29:13 2013
    config:

            NAME                                            STATE     READ WRITE CKSUM
            storage                                         ONLINE       0     0     0
              raidz1                                        ONLINE       0     0     0
                gptid/242960b3-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/2488f665-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/24e5ce4f-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/256c0c11-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0

    errors: No known data errors

    I picked up the WD Red drives at a very good price, so I may just hold on to them for now... y'know... just in case.
    I'll also be keeping an eye on if any other errors show up.

    Thanks again all!
  10. Offline

    cyberjock Forum Guard Dog/Admin

    cyberjock, Jun 29, 2013

    I believe a reboot resets the error counters shown by zpool status to zero.

    I'd do a RAM test if I were you. Not that there is ever much evidence that RAM is bad, but it's an easy and cheap test to do.
  11. Offline

    TheBlueDalek

    TheBlueDalek, Jun 29, 2013

    Alright, so I happened to log into the GUI and the green indicator was now flashing yellow once again...

    Code (text):
    [root@freenas] ~# zpool status storage
      pool: storage
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: scrub completed after 3h3m with 0 errors on Sat Jun 29 01:29:13 2013
    config:

            NAME                                            STATE     READ WRITE CKSUM
            storage                                         ONLINE       0     0     0
              raidz1                                        ONLINE       0     0     0
                gptid/242960b3-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/2488f665-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/24e5ce4f-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
                gptid/256c0c11-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     2

    errors: No known data errors

    I'll open the case tomorrow and swap out the cable. I don't have another SATA port available. Is FreeNAS smart enough to detect a drive change if I move drives around, i.e. change the ports that they are connected to?

    I don't believe it is a RAM issue as there are no errors being reported on the other drives. Does FreeNAS have a way to test the RAM? Normally I'd put an Ubuntu install disc in the optical drive and run the memory checker that is included, but this system does not have an optical drive.

    Thanks
  12. Offline

    cyberjock Forum Guard Dog/Admin

    cyberjock, Jun 29, 2013

    I agree that I don't "believe" it is a RAM issue either, but with how easy it is to do the test, it's something that you can't go wrong with.

    One of the most frustrating things I have ever had to do is troubleshoot a computer with bad RAM. You'll get very oddball errors and messages that don't make any sense. You'll rack your brain for days trying to figure out what is going on until you make the silly choice to run a RAM test. Each time you think you've "narrowed down" the problem something will point you in a different direction.

    www.memtest.org can make a bootable USB stick to test RAM. 3 full passes (typically just leave it running overnight) will generally prove your RAM is good.
  13. Offline

    titan_rw NAS-ty with the FreeNAS

    titan_rw, Jun 29, 2013

    You need to determine what's causing the problem: port, cable, or drive.

    As a first step, switch just one of them. For example, swap drive A with drive B, but keep the ports and cables where they are. So you'd have port A and cable A with drive B, and port B and cable B with drive A. If the checksum errors continue to be reported on the same drive, you know it's the drive.

    If not, do similar tests with the other components. Switch just the cable, or just the port.
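
    The elimination logic can be written down in a few lines. This is just an illustrative Python sketch of the reasoning above, with made-up names; it assumes exactly one component is at fault and that only one component is swapped per test.

```python
COMPONENTS = {"drive", "cable", "port"}


def narrow_suspects(suspects, swapped, errors_followed):
    """Return the remaining suspects after one swap test.

    suspects        -- set of components still under suspicion
    swapped         -- the single component that was exchanged
    errors_followed -- True if the errors moved with the swapped component
    """
    if swapped not in suspects:
        raise ValueError(f"{swapped!r} is not a current suspect")
    if errors_followed:
        # The errors travel with this part, so it is the likely culprit.
        return {swapped}
    # The errors stayed put, so the swapped part is cleared.
    return suspects - {swapped}


# Example: swap the drive first, then the cable.
remaining = narrow_suspects(COMPONENTS, "drive", errors_followed=False)
remaining = narrow_suspects(remaining, "cable", errors_followed=True)
```

    In that example the drive is cleared by the first swap and the second swap points at the cable.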

    FreeNAS doesn't care which port the drive(s) are connected to. As long as it sees the drive(s) on any port, it'll 'do the right thing'.

    As cyberjock says, memtest is pretty much the standard in proving memory. Maybe the Ubuntu memory checker is memtest. I wouldn't use anything else. All you need is a spare USB flash drive to boot memtest. It's good peace of mind to know the RAM passes memtest overnight.
