Dec 15 00:01:25 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=3.
What we actually have:
]# tw_cli show
Ctl   Model        Ports   Drives   Units   NotOpt   RRate   VRate   BBU
------------------------------------------------------------------------
c2    9500S-8      8       8        2       0        5       5       -
and configuration:
]# tw_cli /c2 show
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    VERIFY-PAUSED  -       22      64K     11175.8   ON     ON
u1    SPARE     VERIFY-PAUSED  -       98      -       1863.01   -      ON

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     1.82 TB     3907029168    Z1E0AH83
p1     OK               u0     1.82 TB     3907029168    Z1E5AL3Q
p2     OK               u0     1.82 TB     3907029168    Z1E0BBE6
p3     SMART-FAILURE    u0     1.82 TB     3907029168    5XW28M3A
p4     OK               u0     1.82 TB     3907029168    W1H2T46V
p5     OK               u0     1.82 TB     3907029168    W1E1SL7X
p6     VERIFYING        u1     1.82 TB     3907029168    W2411LK9
p7     OK               u0     1.82 TB     3907029168    W2411SSY
Despite the SMART failure, the array seems to be working fine (it is not degraded). However, I'll remove the drive and use the spare instead. Then I can test the failing drive and/or replace it.
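Before pulling a drive it can help to list every port that is not in plain OK state. The following is only a sketch: the function name list_bad_ports is my own, and it assumes the port table arrives one row per line, as tw_cli prints it on a terminal. Note that it also reports harmless states such as VERIFYING, so the result still needs a human look.

```shell
#!/bin/sh
# Print "port status unit" for every port row whose status is not OK.
# Reads `tw_cli /cX show` output on stdin, one table row per line.
# (list_bad_ports is a hypothetical helper, not part of tw_cli.)
list_bad_ports() {
    awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1, $2, $3 }'
}

# Sample rows copied from the port listing, so this runs without a controller.
sample='p2 OK u0 1.82 TB 3907029168 Z1E0BBE6
p3 SMART-FAILURE u0 1.82 TB 3907029168 5XW28M3A
p6 VERIFYING u1 1.82 TB 3907029168 W2411LK9'

printf '%s\n' "$sample" | list_bad_ports
```

On the real machine the input would come from the controller instead: tw_cli /c2 show | list_bad_ports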
]# tw_cli /c2 remove p3
Removing port /c2/p3 ... Done.
]# tw_cli /c2 rescan
Rescanning controller /c2 for units and drives ...Done.
Found the following unit(s): [none].
Found the following drive(s): [/c2/p3].
]# tw_cli /c2/u0 show
Unit     UnitType  Status          %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-5    REBUILD-PAUSED  0       -       -     64K     11175.8
u0-0     DISK      OK              -       -       p0    -       1862.63
u0-1     DISK      OK              -       -       p2    -       1862.63
u0-2     DISK      DEGRADED        -       -       p6    -       1862.63
u0-3     DISK      OK              -       -       p1    -       1862.63
u0-4     DISK      OK              -       -       p5    -       1862.63
u0-5     DISK      OK              -       -       p4    -       1862.63
u0-6     DISK      OK              -       -       p7    -       1862.63
u0 should be rebuilding now, but it is not. To start the rebuild, the scheduled actions (rebuild and verify) have to be disabled:
]# tw_cli sched rebuild c2 disable
Disabling scheduled rebuilds on controller /c2 ...Done.
]# tw_cli /c2 show
Unit  UnitType  Status      %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    REBUILDING  0       -       64K     11175.8   ON     ON
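A rebuild of an array this size takes a long time, so I find it handy to pull just the status and the %RCmpl column out of the listing. This is a rough sketch, not a 3ware tool: unit_progress is a name I made up, and it assumes the unit table is printed one row per line with the columns in the order shown here.

```shell
#!/bin/sh
# Print "<status> <%RCmpl>" for one unit.
# Reads `tw_cli /cX show` output on stdin; $1 is the unit name, e.g. u0.
# (unit_progress is a hypothetical helper, not part of tw_cli.)
unit_progress() {
    awk -v unit="$1" '$1 == unit { print $3, $4 }'
}

# Sample row copied from the listing, so this runs without a controller.
sample='Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
u0 RAID-5 REBUILDING 0 - 64K 11175.8 ON ON'

printf '%s\n' "$sample" | unit_progress u0
```

With the controller at hand, the same function can be fed directly: tw_cli /c2 show | unit_progress u0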
Now we can export the failing disk for testing:
]# tw_cli /c2 add type=single disk=3 name=drive-test
Creating new unit on controller /c2 ... Done. The new unit is /c2/u1.
Naming unit /c2/u1 to [drive-test] ... Done.
Setting write cache=ON for the new unit ... Done.
The failing disk now shows up as a separate device (/dev/sdd):
]# tw_cli /c2 show
Unit  UnitType  Status      %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    REBUILDING  1       -       64K     11175.8   ON     ON
u1    SINGLE    OK          -       -       -       1862.63   ON     OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     1.82 TB     3907029168    Z1E0AH83
p1     OK               u0     1.82 TB     3907029168    Z1E5AL3Q
p2     OK               u0     1.82 TB     3907029168    Z1E0BBE6
p3     SMART-FAILURE    u1     1.82 TB     3907029168    5XW28M3A
p4     OK               u0     1.82 TB     3907029168    W1H2T46V
p5     OK               u0     1.82 TB     3907029168    W1E1SL7X
p6     DEGRADED         u0     1.82 TB     3907029168    W2411LK9
p7     OK               u0     1.82 TB     3907029168    W2411SSY
What does smartctl say:
]# smartctl -a -d 3ware,3 /dev/twa0
Model Family:     Seagate Barracuda LP
Device Model:     ST32000542AS
Serial Number:    5XW28M3A
...
  5 Reallocated_Sector_Ct   0x0033   014   014   036    Pre-fail  Always   FAILING_NOW 3556
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       29287
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
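Two things are worth reading off this table: which attributes have a normalized value at or below their threshold, and how many years of operation the raw Power_On_Hours count amounts to. Here is a small sketch of both checks. The function names failing_attrs and hours_to_years are my own, and the parsing assumes the usual ten-column attribute rows, one per line.

```shell
#!/bin/sh
# An attribute counts as failed when its normalized VALUE (column 4)
# is at or below THRESH (column 6). Prints ID, name, VALUE, THRESH.
# Reads the attribute rows of `smartctl -a` on stdin.
# (failing_attrs and hours_to_years are hypothetical helpers.)
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $4 + 0 <= $6 + 0 { print $1, $2, $4, $6 }'
}

# Convert a raw Power_On_Hours count to years of continuous operation.
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.2f\n", h / 24 / 365 }'
}

# Rows copied from the smartctl listing of the failing drive.
sample='  5 Reallocated_Sector_Ct   0x0033   014   014   036    Pre-fail  Always   FAILING_NOW 3556
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       29287
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0'

printf '%s\n' "$sample" | failing_attrs   # prints: 5 Reallocated_Sector_Ct 014 036
hours_to_years 29287                      # prints: 3.34
```

On the real system the attribute rows would come from smartctl -a -d 3ware,3 /dev/twa0 piped into failing_attrs.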
This Seagate Barracuda drive is intended for NAS use, so it should cope with 24/7 load. It is not too old (29287/24/365 = 3.34 years of power-on time), but that is already past its 3-year warranty. However, the absence of pending and uncorrectable sectors gives some hope. First I try the long self-test, then repeated overwriting with badblocks.
]# time smartctl -C -t select,0-max -d 3ware,3 /dev/twa0
In this case the test fails immediately (I have not seen such behaviour before):
Num  Test_Description    Status                       Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective captive   Completed: unknown failure       90%        29287         0
and badblocks:
]# time badblocks -svw -o bbl-sdd.out /dev/sdd
Checking for bad blocks in read-write mode
From block 0 to 1953114111
...
Reading and comparing: done
Pass completed, 1184 bad blocks found.

real    6803m14.142s
user    234m16.798s
sys     120m22.010s
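The -o option leaves the bad block numbers in bbl-sdd.out, one per line, so the file can be summarized after the run. A minimal sketch, assuming that one-number-per-line format. The helper name summarize_badblocks is mine, and the block numbers in the demo are made up purely to exercise the function.

```shell
#!/bin/sh
# Summarize a badblocks output file: number of entries plus the first
# and last entry as listed in the file.
# (summarize_badblocks is a hypothetical helper, not part of e2fsprogs.)
summarize_badblocks() {
    awk 'NF { n++; if (n == 1) first = $1; last = $1 }
         END {
             if (n) { printf "%d bad blocks, first %s, last %s\n", n, first, last }
             else   { print "0 bad blocks" }
         }' "$1"
}

# Demo on a temporary file with invented block numbers (not real results).
demo=$(mktemp)
printf '%s\n' 1000 1001 1002 5000 > "$demo"
summarize_badblocks "$demo"
rm -f "$demo"
```

For the run above the call would be: summarize_badblocks bbl-sdd.out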
The drive really is due for replacement. I will use a WD RE4-GP drive instead (I have one on the shelf), even though I have had some bad experiences with WD drives on the 3ware 9500S controller.
]# tw_cli /c2 remove u1
]# tw_cli /c2 rescan
Rescanning controller /c2 for units and drives ...Done.
Found the following unit(s): [/c2/u1].
Found the following drive(s): [none].
]# tw_cli /c2 show
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    VERIFY-PAUSED  -       51      64K     11175.8   ON     ON
u1    JBOD      OK             -       -       -       1863.02   OFF    OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
...
p2     OK               u0     1.82 TB     3907029168    Z1E0BBE6
p3     OK               u1     1.82 TB     3907029168    WD-WCAVY5487741
p4     OK               u0     1.82 TB     3907029168    W1H2T46V
...
Use u1 as a spare:
]# tw_cli maint deleteunit c2 u1
]# tw_cli /c2 add type=spare disk=3
Creating new unit on controller /c2 ... Done. The new unit is /c2/u1.
]# tw_cli /c2 show
Unit  UnitType  Status     %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    VERIFYING  -       53      64K     11175.8   ON     ON
u1    SPARE     OK         -       -       -       1863.01   -      OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
...
p3     OK               u1     1.82 TB     3907029168    WD-WCAVY5487741
...
]# tw_cli /c2 set verify=enable
]# tw_cli /c2/u1 set autoverify=on
smartctl shows fairly good values:
]# smartctl -a -d 3ware,3 /dev/twa0
Model Family:     Western Digital RE4-GP
Device Model:     WDC WD2002FYPS-02W3B0
Firmware Version: 04.01G01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]

  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5829
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        5431          -
Enable the write-back cache, if necessary:
]# tw_cli /c2/u0 set cache=on
Long live array!
Links:
http://wiki.hetzner.de/index.php/3Ware_RAID_Controller/en
update 2016/07/01: 3Ware controller is stuck in REBUILD-PAUSED status.
To force the rebuild to start, toggle the setting for the rebuild schedule. The rebuild should then start.
]# tw_cli sched rebuild c2 enable
]# tw_cli sched rebuild c2 disable
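Since I tend to forget the order, the toggle can be wrapped in a small function. This is only a sketch: kick_rebuild and the RUN switch are my own invention, and by default the function just prints the two commands so they can be checked before anything touches the controller.

```shell
#!/bin/sh
# Toggle the rebuild schedule on a controller: enable, then disable.
# $1 is the controller name as used above, e.g. c2.
# The commands are only printed unless RUN=1 is set in the environment.
# (kick_rebuild and RUN are hypothetical, not part of tw_cli.)
kick_rebuild() {
    ctl="$1"
    for action in enable disable; do
        if [ "${RUN:-0}" = "1" ]; then
            tw_cli sched rebuild "$ctl" "$action"
        else
            echo "tw_cli sched rebuild $ctl $action"
        fi
    done
}

kick_rebuild c2
```

Running it as RUN=1 sh script.sh (with the script saved under any name) would execute the commands instead of printing them.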