Linux hints: 3ware 9500S-8: replacing a failing drive in a RAID 5 array

I have a failing drive on port 3 of my 3ware SATA controller. From /var/log/messages:

Dec 15 00:01:25 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=3.
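
To find such warnings without reading the whole log, the port number can be pulled out of the AEN lines. This is only a minimal sketch: it assumes the message format shown above, and the function name smart_aen_ports is my own.

```shell
#!/bin/sh
# Print the port number of every "SMART threshold exceeded" AEN line on stdin.
# Assumes the 3w-9xxx message format quoted above; other driver versions may differ.
smart_aen_ports() {
    sed -n 's/.*SMART threshold exceeded:port=\([0-9][0-9]*\).*/\1/p' | sort -n -u
}

# Demo on the line from the log above; on a real system: smart_aen_ports < /var/log/messages
echo 'Dec 15 00:01:25 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=3.' | smart_aen_ports
```

Repeated warnings for the same port collapse into one line because of sort -u.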

What the controller reports:

]# tw_cli  show

Ctl   Model        Ports   Drives   Units   NotOpt   RRate   VRate   BBU
------------------------------------------------------------------------
c2    9500S-8      8       8        2       0        5       5       -

and its configuration:

]# tw_cli /c2 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    VERIFY-PAUSED  -       22      64K     11175.8   ON     ON     
u1    SPARE     VERIFY-PAUSED  -       98      -       1863.01   -      ON     

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     1.82 TB     3907029168    Z1E0AH83            
p1     OK               u0     1.82 TB     3907029168    Z1E5AL3Q            
p2     OK               u0     1.82 TB     3907029168    Z1E0BBE6            
p3     SMART-FAILURE    u0     1.82 TB     3907029168    5XW28M3A            
p4     OK               u0     1.82 TB     3907029168    W1H2T46V            
p5     OK               u0     1.82 TB     3907029168    W1E1SL7X            
p6     VERIFYING        u1     1.82 TB     3907029168    W2411LK9            
p7     OK               u0     1.82 TB     3907029168    W2411SSY
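
For scripted checks, the port table can be filtered by its Status column. A minimal sketch, assuming the column layout of the listing above (Port, Status, Unit, ...); the helper name ports_with_status is hypothetical.

```shell
#!/bin/sh
# List the ports whose Status column equals the given value.
# Reads `tw_cli /cX show` output on stdin; assumes Status is the second column.
ports_with_status() {
    awk -v want="$1" '$1 ~ /^p[0-9]+$/ && $2 == want { print $1 }'
}

# Demo on a few rows copied from the listing above.
ports_with_status SMART-FAILURE <<'EOF'
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p2     OK               u0     1.82 TB     3907029168    Z1E0BBE6
p3     SMART-FAILURE    u0     1.82 TB     3907029168    5XW28M3A
p6     VERIFYING        u1     1.82 TB     3907029168    W2411LK9
EOF
```

On the live system the same function would be fed from tw_cli /c2 show through a pipe.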

Despite the SMART failure, the array seems to be working fine (it is not degraded). Nevertheless, I will remove the drive and let the spare take its place. Then I can test the failing drive and/or replace it.

]# tw_cli /c2 remove p3

Removing port /c2/p3 ... Done.

]# tw_cli /c2 rescan

Rescanning controller /c2 for units and drives ...Done.
Found the following unit(s): [none].
Found the following drive(s): [/c2/p3].

]# tw_cli /c2/u0 show

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-5    REBUILD-PAUSED 0       -       -     64K     11175.8   
u0-0     DISK      OK             -       -       p0    -       1862.63   
u0-1     DISK      OK             -       -       p2    -       1862.63   
u0-2     DISK      DEGRADED       -       -       p6    -       1862.63   
u0-3     DISK      OK             -       -       p1    -       1862.63   
u0-4     DISK      OK             -       -       p5    -       1862.63   
u0-5     DISK      OK             -       -       p4    -       1862.63   
u0-6     DISK      OK             -       -       p7    -       1862.63

u0 should be rebuilding now, but it is only REBUILD-PAUSED. To start the rebuild immediately, the scheduled actions (rebuild and verify) have to be disabled; here, the rebuild schedule:

]# tw_cli sched rebuild c2 disable

Disabling scheduled rebuilds on controller /c2 ...Done.

]# tw_cli /c2 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    REBUILDING     0       -       64K     11175.8   ON     ON
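
The rebuild takes a long time, so it may be handy to read the status and the %RCmpl column from a script. A rough sketch under the assumption that the unit table keeps the layout shown above (Unit, UnitType, Status, %RCmpl); unit_status is a made-up name.

```shell
#!/bin/sh
# Print "STATUS %RCmpl" for one unit from `tw_cli /cX show` output on stdin.
# Assumes Status is the third column and %RCmpl the fourth.
unit_status() {
    awk -v unit="$1" '$1 == unit { print $3, $4 }'
}

# Demo on the row from the listing above.
echo 'u0    RAID-5    REBUILDING     0       -       64K     11175.8   ON     ON' | unit_status u0

# Possible polling loop on a real system (not run here):
#   while tw_cli /c2 show | unit_status u0 | grep -q '^REBUILD'; do sleep 600; done
```

The loop in the comment would also keep waiting while the unit is REBUILD-PAUSED, which is probably what one wants.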

Now we can export the failing disk for testing:

]# tw_cli /c2 add type=single disk=3 name=drive-test

Creating new unit on controller /c2 ...  Done. The new unit is /c2/u1.
Naming unit /c2/u1 to [drive-test] ... Done.
Setting write cache=ON for the new unit ... Done.

The failing disk now shows up as a separate device (/dev/sdd):

]# tw_cli /c2 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    REBUILDING     1       -       64K     11175.8   ON     ON     
u1    SINGLE    OK             -       -       -       1862.63   ON     OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     1.82 TB     3907029168    Z1E0AH83            
p1     OK               u0     1.82 TB     3907029168    Z1E5AL3Q            
p2     OK               u0     1.82 TB     3907029168    Z1E0BBE6            
p3     SMART-FAILURE    u1     1.82 TB     3907029168    5XW28M3A            
p4     OK               u0     1.82 TB     3907029168    W1H2T46V            
p5     OK               u0     1.82 TB     3907029168    W1E1SL7X            
p6     DEGRADED         u0     1.82 TB     3907029168    W2411LK9            
p7     OK               u0     1.82 TB     3907029168    W2411SSY

What smartctl says:

]# smartctl -a -d 3ware,3 /dev/twa0

Model Family:     Seagate Barracuda LP
Device Model:     ST32000542AS
Serial Number:    5XW28M3A
...
  5 Reallocated_Sector_Ct   0x0033   014   014   036    Pre-fail  Always   FAILING_NOW 3556
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       29287
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
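
The failing attribute can also be picked out of the table automatically. A small sketch, assuming the usual ten-column smartctl attribute table as printed above, with WHEN_FAILED in the ninth column; failing_attributes is a hypothetical helper name.

```shell
#!/bin/sh
# Print every SMART attribute whose WHEN_FAILED column says FAILING_NOW.
# Reads the attribute table of `smartctl -a` on stdin.
# Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $9 == "FAILING_NOW" {
        print $2, "value=" $4, "thresh=" $6, "raw=" $10
    }'
}

# Demo on two rows copied from the output above.
failing_attributes <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   014   014   036    Pre-fail  Always   FAILING_NOW 3556
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       29287
EOF
```

Here the normalized value (014) has dropped below the threshold (036), which is why the controller reports SMART-FAILURE for the port.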

This Seagate Barracuda drive is intended for NAS use, so it should be capable of 24/7 load. It is not too old (29287/24/365 = 3.34 years), but it is definitely past its 3-year warranty. However, the absence of pending and offline-uncorrectable sectors gives some hope. First I try a long (full-surface) self-test, then repeated overwriting with badblocks.
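
The age estimate is just the raw Power_On_Hours value divided by 24 and by 365. A tiny sketch of that arithmetic (the function name hours_to_years is mine, and leap years are ignored):

```shell
#!/bin/sh
# Convert SMART Power_On_Hours to years of continuous operation,
# assuming 24 hours per day and 365 days per year.
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.2f\n", h / 24 / 365 }'
}

hours_to_years 29287   # prints 3.34
```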

]# time smartctl -C -t select,0-max -d 3ware,3 /dev/twa0

In this case the test fails immediately (I have not seen this behaviour before):

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective captive   Completed: unknown failure    90%     29287         0

and badblocks:

]# time badblocks -svw -o bbl-sdd.out  /dev/sdd

Checking for bad blocks in read-write mode
From block 0 to 1953114111
...                                
Reading and comparing: done                                
Pass completed, 1184 bad blocks found.

real    6803m14.142s
user    234m16.798s
sys     120m22.010s

The drive really is ready for replacement. I will use a WD RE4-GP drive instead (I have one on the shelf), even though I have had some bad experiences with WD drives on the 3ware 9500S controller.

]# tw_cli /c2 remove u1
]# tw_cli /c2 rescan

Rescanning controller /c2 for units and drives ...Done.
Found the following unit(s): [/c2/u1].
Found the following drive(s): [none].

]# tw_cli  /c2 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    VERIFY-PAUSED  -       51      64K     11175.8   ON     ON     
u1    JBOD      OK             -       -       -       1863.02   OFF    OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
...
p2     OK               u0     1.82 TB     3907029168    Z1E0BBE6            
p3     OK               u1     1.82 TB     3907029168    WD-WCAVY5487741     
p4     OK               u0     1.82 TB     3907029168    W1H2T46V            
...

Use u1 as a spare:

]# tw_cli maint deleteunit c2 u1
]# tw_cli /c2 add type=spare disk=3

Creating new unit on controller /c2 ...  Done. The new unit is /c2/u1.

]# tw_cli  /c2 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    VERIFYING      -       53      64K     11175.8   ON     ON     
u1    SPARE     OK             -       -       -       1863.01   -      OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
...          
p3     OK               u1     1.82 TB     3907029168    WD-WCAVY5487741     
...

]# tw_cli /c2 set verify=enable
]# tw_cli /c2/u1 set autoverify=on

smartctl shows fairly good values:

]# smartctl -a -d 3ware,3 /dev/twa0

Model Family:     Western Digital RE4-GP
Device Model:     WDC WD2002FYPS-02W3B0
Firmware Version: 04.01G01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]

  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5829
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5431         -

Enable the write-back cache, if necessary:

]# tw_cli /c2/u0 set cache=on

Long live the array!

Links:
http://wiki.hetzner.de/index.php/3Ware_RAID_Controller/en

Update 2016/07/01: the 3ware controller is stuck in REBUILD-PAUSED status.

To force the rebuild to start, toggle the setting for the rebuild schedule. The rebuild should then start.

]# tw_cli sched rebuild c2 enable
]# tw_cli sched rebuild c2 disable
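
If this happens more than once, the toggle can be wrapped in a small function. This is only a sketch built from the two commands above; the name kick_rebuild and the TW_CLI override (handy for a dry run with echo) are my own additions, not part of tw_cli.

```shell
#!/bin/sh
# Toggle the rebuild schedule of a controller (enable, then disable)
# to nudge a unit out of REBUILD-PAUSED.
# TW_CLI is a hypothetical override; it defaults to the real tw_cli binary.
kick_rebuild() {
    ctl="$1"
    "${TW_CLI:-tw_cli}" sched rebuild "$ctl" enable &&
    "${TW_CLI:-tw_cli}" sched rebuild "$ctl" disable
}

# Dry run: print the commands instead of executing them.
TW_CLI=echo
kick_rebuild c2
```

With TW_CLI unset, kick_rebuild c2 runs the two real commands in order and stops if the first one fails.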


Monday, 29 December 2014
