Fix bad blocks

Discussion in 'Hardware' started by budmannxx, Nov 3, 2011.

Thread Status:
Not open for further replies.
  1. budmannxx Member

    Member Since:
    Sep 7, 2011
    Message Count:
    83
    Likes Received:
    2
    Trophy Points:
    8
    budmannxx, Nov 3, 2011

    I have some bad blocks on 2 of 6 drives in my ZFS raidz2. Is there a way to fix them without formatting the drives? If not, will formatting fix them? Some details:


    • I'm running FreeNAS-8.0.2-RELEASE-amd64 (8288)
    • The drives are Samsung HD204UI (the drives were manufactured after the firmware issue that is well documented on the forum, and I applied the firmware patch anyway, just to be sure)

    Here is the smartctl output for 1 of the drives (output for the other drive is pretty much the same, just different LBA_of_first_error):
    Code (text):
    smartctl -l selftest /dev/ada0
    smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p3 amd64] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed: read failure       90%      3741         1192616706
    # 2  Short offline       Completed: read failure       90%      3732         1192616706
    # 3  Short offline       Completed: read failure       90%      3407         1192616706
    # 4  Extended offline    Completed: read failure       70%       299         1192616706
    # 5  Short offline       Completed: read failure       90%       297         1192616706
    # 6  Short offline       Completed without error       00%        33         -

    I've read through this tutorial but it's geared towards Linux, and the sg3_utils mentioned are not available in FreeNAS. Any help here would be greatly appreciated.
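
A minimal sketch, assuming a POSIX shell and awk are available on the box: the helper below pulls the distinct LBA_of_first_error values out of a self-test log like the one above, since that number is what any per-sector repair needs. The function name `selftest_error_lbas` is made up for illustration and is not part of smartmontools.

```shell
# Print each distinct LBA_of_first_error from `smartctl -l selftest`
# output read on stdin. Log rows start with "#"; rows without an error
# show "-" in the last column and are skipped.
selftest_error_lbas() {
    awk '/^#/ && $NF ~ /^[0-9]+$/ && !seen[$NF]++ { print $NF }'
}

# Possible usage (device name taken from the post):
#   smartctl -l selftest /dev/ada0 | selftest_error_lbas
```

For the log above, every failed test points at the same sector, so only one LBA would be listed.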
  2. Dmitry Nosachev New Member

    Member Since:
    Aug 26, 2011
    Message Count:
    5
    Likes Received:
    0
    Trophy Points:
    0
    Location:
    Moscow
    Dmitry Nosachev, Nov 4, 2011

    Please show the output from smartctl -A /dev/ada0
  3. budmannxx Member

    budmannxx, Nov 4, 2011

    Here you go:

    Code (text):
    smartctl -A /dev/ada0
    smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p3 amd64] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   099   051    Pre-fail  Always       -       936
      2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
      3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always       -       10185
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       77
      5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
      8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       4078
     10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       1
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       46
    181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       9672078
    191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       69
    192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0002   064   063   000    Old_age   Always       -       33 (Min/Max 22/37)
    195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
    196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       2
    198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       73
    223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1
    225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       3486
  4. budmannxx Member

    budmannxx, Nov 4, 2011

    I guess the only concern there is the 2 in Current_Pending_Sector--is this something I can fix myself or do I have to RMA the drive?
  5. Dmitry Nosachev New Member

    Dmitry Nosachev, Nov 4, 2011

    You need to make the HDD remap these sectors.
    The "sg3_utils" package can run on FreeBSD and contains useful tools such as sg_format, sg_verify and sg_reassign, but they are designed for SCSI/SAS disks. Some SATA disks may work with a limited number of SCSI commands (e.g. sg_verify works on the Hitachi A7K2000), but in your case you need to fill the HDD with zeroes: dd if=/dev/zero of=/dev/ada0 bs=1M (remove the HDD from the zpool first).
    Then look at the attributes "Reallocated_Sector_Ct" (it should increase) and "Current_Pending_Sector" (it should be 0) again. Finally, run the SMART self-test: smartctl --test=long /dev/ada0
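
The before/after comparison of those two attributes can be sketched with a small awk filter. This is only an illustration, assuming the ten-column `smartctl -A` layout shown earlier in the thread; the helper name `smart_raw` is hypothetical.

```shell
# Print the RAW_VALUE column (10th field) of one named attribute from
# `smartctl -A` output read on stdin. Prints nothing if the attribute
# is not listed.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Possible usage before and after the rewrite:
#   smartctl -A /dev/ada0 | smart_raw Current_Pending_Sector
#   smartctl -A /dev/ada0 | smart_raw Reallocated_Sector_Ct
```

Run against the output posted above, the first call would print 2 and the second 0.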
  6. budmannxx Member

    budmannxx, Nov 5, 2011

    I tried this, but couldn't get the dd command to work (logged in as root).

    Tried to remove the drive from the pool (can't, I think because it's a raidz2, not a mirror):

    Code (text):
    /mnt/# zpool remove freenas gpt/ada0
    cannot remove gpt/ada0: only inactive hot spares or cache devices can be removed

    Offlined the drive (success) and checked pool status (degraded, as expected):
    Code (text):
    /mnt/# zpool offline freenas gpt/ada0
    /mnt/# zpool status
      pool: freenas
     state: DEGRADED
    status: One or more devices has been taken offline by the administrator.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
    action: Online the device using 'zpool online' or replace the device with
            'zpool replace'.
     scrub: none requested
    config:

            NAME          STATE     READ WRITE CKSUM
            freenas       DEGRADED     0     0     0
              raidz2      DEGRADED     0     0     0
                gpt/ada0  OFFLINE      0     0     0
                gpt/ada1  ONLINE       0     0     0
                gpt/ada2  ONLINE       0     0     0
                gpt/ada3  ONLINE       0     0     0
                gpt/ada4  ONLINE       0     0     0
                gpt/ada5  ONLINE       0     0     0

    errors: No known data errors

    Attempt the dd command:
    Code (text):
    /mnt/# dd if=/dev/zero of=/dev/ada0 bs=1M
    dd: /dev/ada0: Operation not permitted

    I think the dd didn't work because the drive is only OFFLINE, and somehow still "locked" by the pool. Should I be removing the drive in a different way? I'd prefer not to have to physically pull the drive.
  7. Dmitry Nosachev New Member

    Dmitry Nosachev, Nov 9, 2011

    Try to zero the problem block with hdparm: hdparm --write-sector 1192616706 /dev/ada0
    Then check the "Current_Pending_Sector" value and start a long self-test. Finally, you will need to scrub your zpool.
  8. budmannxx Member

    budmannxx, Nov 9, 2011

    I must be missing something obvious, but I'm getting a "command not found" when trying to run hdparm:

    Code (text):
    /mnt/# hdparm --write-sector 1192616706 /dev/ada0
    hdparm: Command not found.
    /mnt/# hdparm
    hdparm: Command not found.
  9. bsalinux New Member

    Member Since:
    Nov 18, 2011
    Message Count:
    12
    Likes Received:
    0
    Trophy Points:
    0
    bsalinux, Dec 17, 2011

    It would be easier if you had a spare drive. Replace the bad drive with the spare, move the defective drive to another system, and run SeaTools or another manufacturer's low-level tool to format it. If you RMA the drive, they will send you a re-certified drive. SMART won't trip until Reallocated_Sector_Ct <= 36.
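
The failure threshold differs per drive and per attribute (the `smartctl -A` output earlier in the thread lists 010 for Reallocated_Sector_Ct on the HD204UI), so it is safer to read it from the drive than to rely on a fixed number. A rough sketch, assuming the same column layout as that output and using a made-up helper name, `smart_failing_attrs`:

```shell
# List attributes whose normalized VALUE (field 4) is at or below
# THRESH (field 6). Attributes with a zero threshold are skipped,
# since a VALUE can never drop to or below them in practice.
smart_failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $2 }'
}

# Possible usage:
#   smartctl -A /dev/ada0 | smart_failing_attrs
```

An empty result means no attribute has reached its threshold, which matches the situation described in this thread: pending sectors exist, but SMART has not tripped.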
  10. sunflashx New Member

    Member Since:
    Sep 13, 2011
    Message Count:
    26
    Likes Received:
    0
    Trophy Points:
    1
    sunflashx, Jan 6, 2012

    For the record, I think it's asinine you can't fix something stupid like this easily on a dedicated storage appliance.

    I gave up and RMA'd the drive that had bad blocks. Samsung's advance RMA system is great: they ship you a new drive, so you can yank the bad one and immediately start your rebuild. It cost me $12 or so to ship the old drive back.
  11. budmannxx Member

    budmannxx, Jan 6, 2012

    And now that they're Seagate, it appears that the advanced RMA option isn't available, at least for the Samsung HD204UI. Anyone know of a way to do this post merger?
  12. deajan New Member

    Member Since:
    Jul 21, 2012
    Message Count:
    12
    Likes Received:
    0
    Trophy Points:
    0
    Occupation:
    IT Admin
    Location:
    South France
    Home page:
    deajan, Sep 10, 2012

    Hello,

    Sorry to dig up this old topic, but here's what I tried in order to deal with bad blocks:

    The sg3_utils package isn't included in FreeNAS by default (which is strange, since it's quite useful on a storage server), so I installed it manually:

    Code (text):
    # mount -uw /
    # mkdir /root/sg3_utils
    # cd /root/sg3_utils
    # wget http://ftp2.freebsd.org/pub/FreeBSD/ports/amd64/packages-8.2-release/sysutils/sg3_utils-1.28.tbz
    # pkg_add sg3_utils-1.28.tbz
    # mount -ur /

    Then I tried to verify the bad block listed in the self-test log:
    Code (text):
    # smartctl -l selftest /dev/ada1

    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed: read failure       30%     10685         2903215976

    # /usr/local/bin/sg_verify --lba=2903215976 /dev/ada1

    verify (10): transport: (pass2:ahcich2:0:0:0): VERIFY(10). CDB: 2f 0 ad b 8f 68 0 0 1 0
    (pass2:ahcich2:0:0:0): CAM status: CCB request was invalid

    Verify(10) failed near lba=2903215976 [0xad0b8f68]

    Now I'm stuck here... It seems that either my RAID controller (IBM ServeRAID C100) can't talk to the CAM framework, or my drive (WD2003FYYS-02W0B0) can't speak SCSI.
    Does anyone have a clue?

    Thanks.
  13. deajan New Member

    deajan, Sep 20, 2012

    In the end, Western Digital RE3/RE4 drives simply do not speak SCSI.

    I tried on my home NAS, which has an RE3 drive, and got the same result:
    Code (text):
    [root@freenas] ~# sg_verify --lba=10340032 /dev/ada0
    verify (10): transport: (pass0:ahcich0:0:0:0): VERIFY(10). CDB: 2f 0 0 9d c6 c0
    (pass0:ahcich0:0:0:0): CAM status: CCB request was invalid

    Verify(10) failed near lba=10340032 [0x9dc6c0]

    So what is the only solution? Remove the disk from the zpool, fill it with zeros until the HDD firmware finds that the sector is not writable and remaps it, then attach the disk to the zpool again and resilver?
  14. deajan New Member

    Member Since:
    Jul 21, 2012
    Message Count:
    12
    Likes Received:
    0
    Trophy Points:
    0
    Occupation:
    IT Admin
    Location:
    South France
    Home page:
    deajan, Sep 20, 2012

    Okay... new round.

    I've played around with dd and I think I've succeeded.

    These are some lines from my smartctl -a /dev/ada1 output beforehand:

    Code (text):
    ...
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
    ...
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       1

    ...
    # 2  Extended offline    Completed: read failure       30%     10899         2903215976
    ...

    I actually had to turn off GEOM's write protection for in-use disks (kern.geom.debugflags) and then zero-filled the sector with dd:
    Code (text):
    # sysctl kern.geom.debugflags=0x10
    # dd bs=512 seek=2903215976 if=/dev/zero of=/dev/ada1 count=1
    # sysctl kern.geom.debugflags=0x0

    Now my smartctl output says
    Code (text):
    ...
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
    ...
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    ...

    I'm running a long self-test right now to be sure.
    Hopefully this helps someone.

    If you have SATA drives, use the dd technique.
    If you have SAS drives, install the sg3_utils package I mentioned in this topic and follow this guide: http://smartmontools.sourceforge.net/badblockhowto.html#bb

    Cheers.
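
The dd technique from this post can be wrapped in a small dry-run helper so the exact commands can be reviewed before anything is written. This is only a sketch: the function name `print_zero_sector_cmds` is invented here, it prints the commands rather than running them, and it assumes the LBA reported by smartctl counts 512-byte sectors from the start of the whole-disk device, as in the post. Drives with 4 KiB physical sectors may need the whole aligned group of eight sectors rewritten, which is not covered.

```shell
# Print (do not run) the commands for zero-filling one 512-byte sector.
# Arguments: whole-disk device node and the LBA from the self-test log.
print_zero_sector_cmds() {
    dev=$1
    lba=$2
    # Refuse anything that is not a plain non-negative integer.
    case $lba in
        ''|*[!0-9]*) echo "LBA must be a non-negative integer" >&2; return 1 ;;
    esac
    echo "sysctl kern.geom.debugflags=0x10"
    echo "dd if=/dev/zero of=$dev bs=512 seek=$lba count=1"
    echo "sysctl kern.geom.debugflags=0x0"
}

# Possible usage with the values from this post:
#   print_zero_sector_cmds /dev/ada1 2903215976
```

Writing zeros destroys whatever was stored in that sector, so the pool still needs a scrub or resilver afterwards to rebuild it from parity.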
  15. paleoN Active Member

    Member Since:
    Apr 22, 2012
    Message Count:
    1,403
    Likes Received:
    15
    Trophy Points:
    38
    paleoN, Sep 20, 2012

    Correct me if I'm wrong, but I don't believe it's necessary to disable geom unless the bad block was in one of the GPT labels. It's not clear what your steps were, but you would want to offline the disk before dd'ing it and online it afterwards.
  16. deajan New Member

    deajan, Sep 20, 2012

    I actually tried taking the disk offline, and even exporting the zpool didn't do the trick.
    As long as I did not change the sysctl parameter I suggested, every time I tried dd I ended up with:
    Code (text):
    dd: /dev/ada1: Operation not permitted

    I might be wrong too (I'm not a BSD expert at all), but I think kern.geom.debugflags provides protection against "raw" writes to the disk by tools like fdisk / gdisk or, in my case, dd.
  17. paleoN Active Member

    paleoN, Sep 20, 2012

    You would offline the disk when you were working on it. Then when you online the disk it would resilver if needed. Unless you destroyed the partitions, geom would still protect the disk.

    Ah, I see now. Thanks.
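
The pool-side order suggested in the thread (offline the disk, repair it, online it again, then scrub) could be written down as a dry-run plan. This is a sketch only; `print_repair_plan` is a hypothetical name, and the pool and vdev names are whatever `zpool status` shows on the system in question.

```shell
# Print (do not run) the pool-side steps around a sector repair, in order.
# Arguments: pool name and the vdev label as shown by `zpool status`.
print_repair_plan() {
    pool=$1
    vdev=$2
    echo "zpool offline $pool $vdev"
    echo "# ...rewrite the bad sector here..."
    echo "zpool online $pool $vdev"
    echo "zpool scrub $pool"
    echo "zpool status $pool"
}

# Possible usage with the names from the first poster's pool:
#   print_repair_plan freenas gpt/ada0
```

The final `zpool status` is there to confirm that the resilver or scrub finished without data errors.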
  18. Visseroth Member

    Member Since:
    Nov 4, 2011
    Message Count:
    86
    Likes Received:
    0
    Trophy Points:
    6
    Visseroth, Oct 19, 2012

    This was a good post and very helpful. I have 12 of these drives: one suddenly developed the click of death one day, and the other 11 are showing pending sector counts. I doubt they are all bad, so I'm trying to repair them one at a time by taking each one offline. I'm currently running the long test while the drives are still in the server; since the server is always on, I don't have to keep another machine running to repair them.
    So far I've started "smartctl --test=long /dev/ada0", which reported "Please wait 347 minutes for test to complete".
    Thank you again for this post and the information in it. Newbs like me appreciate it.
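
Once the long test has had time to finish, its outcome shows up as the newest row of the self-test log. A small sketch for reading it, assuming the log layout posted earlier in the thread; the helper name `latest_selftest_status` and the three labels it prints are invented here, and the "in progress" wording is an assumption about how smartctl labels a running test.

```shell
# Report the outcome of the most recent self-test (the row numbered "# 1")
# from `smartctl -l selftest` output read on stdin.
latest_selftest_status() {
    awk '/^# *1 / {
        if ($0 ~ /Completed without error/) print "passed"
        else if ($0 ~ /in progress/) print "running"
        else print "failed-or-aborted"
        exit
    }'
}

# Possible usage:
#   smartctl -l selftest /dev/ada0 | latest_selftest_status
```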