What happens when a drive or node fails?

Chris Nelson

Node and drive failures are rarely emergencies when using Swift. In a standard 3-replica policy with 3 or more nodes, losing an entire node means you've lost at most one of your replicas. Generally, when these failures occur, it's simply a matter of replacing the failed hardware. Below are the steps for replacing failed drives and failed nodes.
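If you want to confirm what a single node failure costs you, you can inspect the ring directly. The following is a minimal sketch, assuming the swift Python package is installed on a node and the object ring lives in the conventional /etc/swift directory; it prints the replica count and how many devices each node contributes.

    # Inspect the object ring: replica count and devices per node.
    # Assumes the swift package is installed and rings live in /etc/swift.
    from collections import Counter

    from swift.common.ring import Ring

    ring = Ring('/etc/swift', ring_name='object')
    print('replicas: %s' % ring.replica_count)

    # Removed devices appear as None entries in ring.devs, so skip them.
    devices_per_node = Counter(d['ip'] for d in ring.devs if d is not None)
    for ip, count in sorted(devices_per_node.items()):
        print('%s: %d devices' % (ip, count))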

In the event that multiple disks across multiple nodes, or multiple nodes themselves, fail at the same time, you should contact Support.

Failed Drives

In the event of a failed drive, follow these steps:

  1. Acknowledge the alert in the controller / monitoring system you have in place. 
  2. Remove the failed drive from the ring using the controller. First, log in to the controller and select the node that has the failed drive. Then, on the "Manage Swift Drives" page, select the failed drive and click the "Add or Remove Policies" button to remove it. 
  3. If the drive is showing signs of failing but is still somewhat operational, select the "Remove Gradually" option in the "Add or Remove Policies" window. This option allows the failing drive to "drain" its data to other nodes and drives in the cluster. If the drive has completely failed, remove it immediately instead. (A sketch of the equivalent ring-level operations follows this list.) 
  4. Replace the failed drive in the physical node.
  5. Format and add the drive back to the cluster via the controller.
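On a SwiftStack cluster the controller performs the ring changes above for you, so you should not normally touch the builder files it manages. As an illustration only, here is a rough sketch of what "Remove Gradually" versus immediate removal roughly correspond to at the ring level, using Swift's ring-builder Python API; the paths and device id are assumptions for the example.

    # Illustration only: what the two removal options roughly correspond to.
    # Do not hand-edit builder files that the SwiftStack controller manages.
    from swift.common.ring import RingBuilder

    builder = RingBuilder.load('/etc/swift/object.builder')

    FAILING_DEV_ID = 7  # hypothetical id of the failing drive in the builder

    # "Remove Gradually": set the weight to 0 so the drive's partitions drain
    # to other devices over subsequent rebalances, then remove it once empty.
    builder.set_dev_weight(FAILING_DEV_ID, 0)

    # Completely dead drive: remove it from the builder immediately instead.
    # builder.remove_dev(FAILING_DEV_ID)

    builder.rebalance()
    builder.save('/etc/swift/object.builder')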

Failed Nodes

In the event that a node has failed, follow these steps:

  1. Acknowledge the alert in the controller / monitoring system you have in place.
  2. You'll want to proceed differently depending on the type of failure:
    -If the failed node requires OS reinstallation, disable the node from the controller, re-install the OS and the SwiftStack Swift software, and re-introduce it to the cluster. If the failed node requires hardware replacement as well as OS reinstallation, the process is roughly the same; the only difference is that when you claim the node, use the drop-down menu to select the option indicating that the new node is replacing the one that had the hardware failure. 
    -If the failed node does not require an OS reinstallation (for example, if it was a hardware failure but all the disks are still in good working order), DO NOT DISABLE THE NODE. Disabling the node will remove its disks from the ring and kick off a rebalance. If the hardware can be replaced within a week, leave the node as is, and Swift will work around it. If the hardware will take more than a week to replace, you can disable the node; just be aware that a rebalance will take place and will need to complete before the node can be added back into the cluster. (A sketch for verifying cluster health afterward follows this list.)
  3. Mount the drives on the "Manage Swift Drives" page.
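
Once a node is back in service (or while the cluster is working around a down node), it is worth confirming that every node agrees on the rings and that replication is progressing. A minimal sketch, assuming swift-recon is installed on the node you run it from:

    # Run a few swift-recon health checks after node maintenance.
    # Assumes swift-recon is available on this node.
    import subprocess

    CHECKS = [
        ['swift-recon', 'object', '--md5'],  # ring/config md5 consistency
        ['swift-recon', 'object', '-r'],     # replication stats and times
        ['swift-recon', 'object', '-u'],     # unmounted drives, if any
    ]

    for cmd in CHECKS:
        print('$ ' + ' '.join(cmd))
        subprocess.run(cmd, check=False)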