I came across an issue where one of our drives failed in a RAID setup. The server was still running, but with degraded performance, so I had to replace that drive. Here I cover the steps needed to replace a failed disk in a RAID array.
In my scenario we were using a hardware RAID setup with hot-swap support enabled. The server uses a MegaRAID controller; you can check the RAID controller using the command below.
lspci -nn | grep RAID
(The command lspci stands for "list PCI". Think of this command as "ls" + "pci". It displays information about all the PCI buses in your server.)
My output was:
05:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
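If you want to script this check, here is a minimal Python sketch that picks the RAID controller lines out of saved `lspci -nn` output. The function name `find_raid_controllers` and the regular expression are my own illustration, written against the single output line shown above; other controllers may format their line differently.

```python
import re

# Matches one `lspci -nn` line for a RAID-class device, for example:
# "05:00.0 RAID bus controller [0104]: <name> [1000:005b] (rev 05)"
LSPCI_RAID = re.compile(
    r"^(?P<slot>\S+) RAID bus controller \[[0-9a-f]{4}\]: "
    r"(?P<name>.+) \[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def find_raid_controllers(lspci_output):
    """Return one dict (slot, name, vendor, device) per RAID controller line."""
    controllers = []
    for line in lspci_output.splitlines():
        match = LSPCI_RAID.match(line.strip())
        if match:
            controllers.append(match.groupdict())
    return controllers
```

A name containing "MegaRAID" is a reasonable hint that MegaCLI is the right tool for the controller.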
Now the very first thing we have to do is check the status of the drive. For this, LSI Corporation has created a command-line utility called MegaCLI.
MegaCLI
MegaCLI is the command line interface (CLI) binary used to communicate with the full LSI family of RAID controllers found in Supermicro, DELL (PERC), ESXi and Intel servers.
To install MegaCLI, go to the LSI Downloads page and search for “MegaCli Linux”.
Drive Status Evaluation
Now check the status of the drive. The command below reports the status of all the drives, but I am showing only the output for the failed drive.
root@supportsages [/usr/local/src]# MegaCli -PDList -aALL
Adapter #0
Enclosure Device ID: 252
Slot Number: 0
Drive’s postion: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A
Device Id: 23
WWN: 5000C5003BA9E5F8
Sequence Number: 2
Media Error Count: 174
Other Error Count: 3
Predictive Failure Count: 1
Last Predictive Failure Event Seq Number: 42539
PD Type: SAS
Raw Size: 419.186 GB [0x3465f870 Sectors]
Non Coerced Size: 418.686 GB [0x3455f870 Sectors]
Coerced Size: 418.656 GB [0x34550000 Sectors]
Emulated Drive: No
Firmware state: Online, Spun Up
Commissioned Spare : Yes
Emergency Spare : Yes
Device Firmware Level: HPD5
Shield Counter: 0
Successful diagnostics completion on : N/A
=========================================================
As you can see, it shows ‘Media Error Count: 174’, along with a non-zero ‘Predictive Failure Count’. The output of that command is quite long. In my example the failed disk is shown as ‘Enclosure Device ID: 252’ and ‘Slot Number: 0’, so my drive can be referenced as ‘[252:0]’. The device ID of the failed drive is 23.
After this, I analyzed the faulty drive using smartctl. Under Linux, you can read the SMART (Self-Monitoring, Analysis and Reporting Technology) information from a hard disk using smartctl. I am not going to go deep into smartctl and its command-line options, but for SCSI/SAS disks behind LSI MegaRAID controllers, the smartctl syntax is given below.
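Since the full listing is long, picking out suspect drives can be scripted. The following is a rough Python sketch that parses saved `MegaCli -PDList -aALL` output and reports the `[E:S]` reference and device ID of every drive with a non-zero error counter. The function names are hypothetical, and treating any non-zero counter as suspect is my own assumption, not a rule from LSI.

```python
def parse_pd_list(text):
    """Split `MegaCli -PDList -aALL` output into one dict per physical drive."""
    drives, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line, e.g. "Adapter #0"
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            current = {}  # each drive record starts with this field
            drives.append(current)
        if current is not None:
            current[key] = value
    return drives


def suspect_drives(drives):
    """Return (reference, device_id) for drives with any non-zero error counter."""
    counters = ("Media Error Count", "Other Error Count", "Predictive Failure Count")
    found = []
    for drive in drives:
        if any(drive.get(c, "").isdigit() and int(drive[c]) > 0 for c in counters):
            ref = "[{}:{}]".format(drive["Enclosure Device ID"], drive["Slot Number"])
            found.append((ref, drive.get("Device Id")))
    return found
```

For the listing above, `suspect_drives(parse_pd_list(output))` would include `("[252:0]", "23")`.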
smartctl -a -d megaraid,N /dev/sda
Here, in the argument megaraid,N, the integer N is the physical disk number within the MegaRAID controller. In my case the physical disk number is 23 (the Device Id shown in the MegaCli output above).
For SATA disks, use the following syntax:
smartctl -a -d sat+megaraid,N /dev/sda
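The two syntaxes differ only in the `-d` argument, which a small helper can make explicit. This is a sketch in Python; `smartctl_command` is a name I made up, and `/dev/sda` is simply the block device used in this post, so adjust it for your system.

```python
def smartctl_command(device_id, disk_type="SAS", block_device="/dev/sda"):
    """Build the smartctl argument list for a disk behind a MegaRAID controller."""
    kind = disk_type.strip().upper()
    if kind in ("SAS", "SCSI"):
        device_arg = "megaraid,{}".format(device_id)
    elif kind == "SATA":
        device_arg = "sat+megaraid,{}".format(device_id)
    else:
        raise ValueError("unsupported disk type: {!r}".format(disk_type))
    return ["smartctl", "-a", "-d", device_arg, block_device]
```

The list form can be passed directly to `subprocess.run` without going through a shell. The `PD Type` field in the MegaCli output tells you which variant applies.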
So the command to check the status in my case is:
root@supportsages [~]# smartctl -a -d megaraid,23 /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-408.el5.lve0.8.58] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
===
Vendor: HP
Product: EF0450FARMV
Revision: HPD5
User Capacity: 450,098,159,616 bytes [450 GB]
Logical block size: 512 bytes
Logical Unit id: 0x5000c5003ba9e5fb
Serial number: 6SK0T91N0000N150EEPX
Device type: disk
Transport protocol: SAS
Local Time is: Sun Sep 20 19:17:26 2015 PDT
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
===
As you can see from the report
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
the drive is experiencing hardware issues and is likely to fail in the near future. So I escalated and discussed the issue with the data center (DC), and they suggested a hot swap.
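The health line from the smartctl report can also be checked from a script. Here is a minimal sketch, assuming the report has been captured as text; the function names are mine, and treating only an exact `OK` as healthy is a deliberately strict assumption.

```python
def smart_health(smartctl_output):
    """Return the text after 'SMART Health Status:', or None if the line is absent."""
    prefix = "SMART Health Status:"
    for line in smartctl_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def is_healthy(status):
    """Treat only an exact 'OK' as healthy; anything else needs a closer look."""
    return status == "OK"
```

With the report above, `smart_health` returns the "HARDWARE IMPENDING FAILURE ..." text and `is_healthy` returns False.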
What is Hot Swapping?
Wikipedia defines hot swapping as “Hot swapping and hot plugging are terms used to describe the functions of replacing computer system components without shutting down the system.”
To perform the process, we need to identify the faulty drive and take it offline in the RAID array, then replace it with a healthy one and wait until the rebuild completes. Luckily our RAID array had hot-swap support, so I was in a good position to identify the faulty drive and take it offline using MegaCli. Physically identifying the faulty drive in a RAID array is a bit difficult; the DC technician identifies an offline drive by checking the LED attached to the drive, which is usually red.
After I discussed this with the DC, their technician asked me to take the drive offline using MegaCli so that they could perform the hot swap. Below is the command to take a drive offline.
MegaCli -PDOffline -PhysDrv [E:S] -aN
Let us look at the arguments in the above command:
E is the ‘Enclosure Device ID’.
S is the ‘Slot Number’.
aN represents the adapter ID (where N is a number starting from zero, or the string ALL).
In my example the Enclosure Device ID is 252 and the Slot Number is 0. The adapter is ‘Adapter #0’, so the adapter ID is 0. Hence the command to take the drive offline is:
root@supportsages [~]# MegaCli -PDOffline -PhysDrv [252:0] -a0
Running this command shows the following output:
Adapter: 0: EnclId-252 SlotId-0 state changed to OffLine. Exit Code: 0x00
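Because taking the wrong drive offline can be costly, it may help to build the command from the three values rather than typing it by hand. This is a small Python sketch; `pd_offline_command` is a hypothetical helper, and it only assembles the arguments described above without running anything.

```python
def pd_offline_command(enclosure, slot, adapter=0):
    """Build the MegaCli argument list that takes drive [E:S] offline on adapter N."""
    for name, value in (("enclosure", enclosure), ("slot", slot), ("adapter", adapter)):
        if not isinstance(value, int) or value < 0:
            raise ValueError("{} must be a non-negative integer".format(name))
    return [
        "MegaCli",
        "-PDOffline",
        "-PhysDrv",
        "[{}:{}]".format(enclosure, slot),
        "-a{}".format(adapter),
    ]
```

Passing the result as a list to `subprocess.run` keeps the shell from treating the square brackets in `[252:0]` as a glob pattern.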
Now if we check the status of the drive, it shows as Offline.
root@supportsages [/home/secure]# MegaCli -PDList -aALL
Adapter #0
Enclosure Device ID: 252
Slot Number: 0
Drive’s postion: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A
Device Id: 23
WWN: 5000C5003BA9E5F8
Sequence Number: 3
Media Error Count: 174
Other Error Count: 3
Predictive Failure Count: 1
Last Predictive Failure Event Seq Number: 42539
PD Type: SAS
Raw Size: 419.186 GB [0x3465f870 Sectors]
Non Coerced Size: 418.686 GB [0x3455f870 Sectors]
Coerced Size: 418.656 GB [0x34550000 Sectors]
Emulated Drive: No
Firmware state: Offline
Commissioned Spare : Yes
Emergency Spare : Yes
Device Firmware Level: HPD5
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c5003ba9e5f9
SAS Address(1): 0x0
===
Once the drive was confirmed to be offline by checking the parameter
Firmware state: Offline
I updated the DC and asked them to proceed with the hot swap. During the process they simply remove the faulty drive and put a new drive in its place.
It takes approximately 30 seconds for the server to identify the new disk. I checked the RAID info using the command below, and the state showed as ‘Degraded’. The adapter may show the array as degraded during a rebuild, and it will probably stay that way until the rebuild finishes.
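The offline check before handing over to the DC can be scripted as well. The sketch below, with a function name of my own choosing, reads saved `MegaCli -PDList -aALL` output and returns the firmware state for one enclosure and slot; it assumes the fields appear in the order shown in the listings in this post.

```python
def firmware_state(pd_list_output, enclosure, slot):
    """Return the 'Firmware state' of drive [enclosure:slot], or None if not listed."""
    cur_enclosure = cur_slot = None
    for line in pd_list_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            cur_enclosure, cur_slot = value, None  # a new drive record begins
        elif key == "Slot Number":
            cur_slot = value
        elif key == "Firmware state":
            if (cur_enclosure, cur_slot) == (str(enclosure), str(slot)):
                return value
    return None
```

Only when `firmware_state(output, 252, 0)` returns `"Offline"` would I tell the DC to pull the drive.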
[~]# MegaCli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name               :
RAID Level         : Primary-1, Secondary-0, RAID Level Qualifier-0
Size               : 836.326 GB
Mirror Data       : 836.326 GB
State             : Degraded
Strip Size         : 64 KB
Number Of Drives per span:2
Span Depth         : 2
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk’s Default
Encryption Type   : None
Is VD Cached: No
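To watch the array state without reading the whole listing, the `State` field can be pulled out per virtual drive. This is a minimal sketch with a hypothetical function name, based only on the output format shown above.

```python
def virtual_drive_states(ld_info_output):
    """Map each virtual drive number to its State from `MegaCli -LDInfo` output."""
    states, current = {}, None
    for line in ld_info_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Virtual Drive":
            current = int(value.split()[0])  # "0 (Target Id: 0)" -> 0
        elif key == "State" and current is not None:
            states[current] = value
    return states
```

For the output above this gives `{0: "Degraded"}`. On controllers I have seen, a healthy array reports `Optimal`, but check what your own firmware prints before relying on that string.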
So I checked the physical drive status using the command below, and the new drive was showing ‘Firmware state: Rebuild’. There you have it: the rebuild is under way.
root@supportsages [~]# MegaCli -PDList -aALL
Adapter #0
Enclosure Device ID: 252
Slot Number: 0
Drive’s postion: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A
Device Id: 50
WWN:
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 419.186 GB [0x3465f870 Sectors]
Non Coerced Size: 418.686 GB [0x3455f870 Sectors]
Coerced Size: 418.163 GB [0x34453800 Sectors]
Firmware state: Rebuild
Device Firmware Level: HPD4
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c50042aea19d
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: HP EF0450FARMV HPD46SK0H5FD
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device
Drive Temperature :45C (113.00 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port’s Linkspeed: Unknown
Port-1 :
Port status: Active
Port’s Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No
====
Once the rebuild has finished and everything is OK, the firmware state will show as
Firmware state: Online, Spun Up
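Waiting for the rebuild can be automated with a simple polling loop. The sketch below is only an outline: `wait_for_online` and its parameters are my own invention, and `read_state` stands for any function you supply that returns the current firmware state string of the new drive (for example by running `MegaCli -PDList -aALL` and extracting the field).

```python
import time


def wait_for_online(read_state, interval=60.0, max_checks=1440, sleep=time.sleep):
    """Poll read_state() until it reports 'Online...'; return every state seen."""
    seen = []
    for _ in range(max_checks):
        state = read_state()
        seen.append(state)
        if state is not None and state.startswith("Online"):
            return seen
        sleep(interval)  # wait before asking the controller again
    raise TimeoutError("drive did not come online after {} checks".format(max_checks))
```

The defaults (one check per minute, for up to a day) are arbitrary; rebuild time depends on the drive size and the load on the server.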
Thanks for your time.