Working with SelfHeal

While a cluster is running, entire nodes or individual block devices that YDB runs on can fail.

SelfHeal keeps the cluster available and fault-tolerant when malfunctioning nodes or devices cannot be repaired quickly.

SelfHeal can:

  • Detect faulty system elements.
  • Move faulty elements safely, without losing data or breaking up storage groups.

SelfHeal is enabled by default.

Enabling and disabling SelfHeal

To enable SelfHeal:

  1. To enable fault detection, go to http://localhost:8765/cms#show=config-items-25.

  2. Go to any node.

  3. Create an updated configuration file with the parameter SentinelConfig { Enable: true }.

    Sample config.txt file:

    Actions {
        AddConfigItem {
            ConfigItem {
                Config {
                    CmsConfig {
                        SentinelConfig {
                            Enable: true
                        }
                    }
                }
            }
        }
    }
    
  4. Run the command:

    kikimr admin console configs update config.txt
    
  5. To enable data transfer, run the command:

    kikimr -s <endpoint> admin bs config invoke --proto 'Command{EnableSelfHeal{Enable: true}}'
    
To disable SelfHeal:

  1. To disable fault detection, go to http://localhost:8765/cms#show=config-items-25.

  2. Go to any node.

  3. Create an updated configuration file with the parameter SentinelConfig { Enable: false }.

    Sample config.txt file:

    Actions {
        AddConfigItem {
            ConfigItem {
                Config {
                    CmsConfig {
                        SentinelConfig {
                            Enable: false
                        }
                    }
                }
            }
        }
    }
    
  4. Run the command:

    kikimr admin console configs update config.txt
    
  5. To disable data transfer, run the command:

    kikimr -s <endpoint> admin bs config invoke --proto 'Command{EnableSelfHeal{Enable: false}}'
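
The enable and disable procedures above differ only in one boolean value, so the config file and both commands can be produced from a single flag. The following Python sketch only builds strings and does not run kikimr. The helper names render_sentinel_config and selfheal_commands are hypothetical and not part of YDB, and the endpoint is whatever value you would pass as <endpoint>.

```python
from typing import List

# Same structure as the sample config.txt; %s is replaced with "true" or "false".
CONFIG_TEMPLATE = """Actions {
    AddConfigItem {
        ConfigItem {
            Config {
                CmsConfig {
                    SentinelConfig {
                        Enable: %s
                    }
                }
            }
        }
    }
}
"""


def render_sentinel_config(enable: bool) -> str:
    """Return the text of config.txt for enabling or disabling fault detection."""
    return CONFIG_TEMPLATE % ("true" if enable else "false")


def selfheal_commands(enable: bool, endpoint: str,
                      config_path: str = "config.txt") -> List[List[str]]:
    """Return the two command lines from the procedure as argument lists."""
    value = "true" if enable else "false"
    proto = "Command{EnableSelfHeal{Enable: " + value + "}}"
    return [
        # Step 4: apply the updated configuration file (fault detection).
        ["kikimr", "admin", "console", "configs", "update", config_path],
        # Step 5: enable or disable data transfer.
        ["kikimr", "-s", endpoint, "admin", "bs", "config", "invoke",
         "--proto", proto],
    ]
```

To use it, write the result of render_sentinel_config to config.txt on any node and then run the two returned commands in order, for example with subprocess.run.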
    

SelfHeal settings

You can configure SelfHeal in Viewer → Cluster Management System → CmsConfigItems.

To create the initial settings, click Create. If you want to update the current settings, click .

You can use the following settings:

| Parameter | Description |
| --- | --- |
| Status | Enables or disables SelfHeal in CMS. |
| Dry run | Enables or disables the mode in which the CMS doesn't change the BSC setting. |
| Config update interval (sec.) | BSC configuration update interval. |
| Retry interval (sec.) | Interval between configuration update attempts. |
| State update interval (sec.) | PDisk state update interval. |
| Timeout (sec.) | PDisk state update timeout. |
| Change status retries | Number of retries to change the PDisk status in BSC (ACTIVE, FAULTY, BROKEN, and so on). |
| Change status retry interval (sec.) | Delay between retries to update the PDisk status in BSC. The CMS checks the disk state once per State update interval. If the disk stays in the same state for several consecutive State update interval cycles, the CMS changes its status in BSC. The per-state settings below set the number of update cycles after which the CMS changes the disk status: if the disk state is Normal, the status changes to ACTIVE; in any other state, it changes to FAULTY. A value of 0 disables status changes for that state (this is the default for Unknown). For example, with the default settings, if the CMS detects the Initial state for five State update interval cycles of 60 seconds each, the disk status changes to FAULTY. |
| Default state limit | Number of cycles used for states that have no setting of their own, including unknown PDisk states. It applies when no value is set for states such as Initial, InitialFormatRead, InitialSysLogRead, InitialCommonLogRead, and Normal. |
| Initial | PDisk starts initializing. Transition to FAULTY. |
| InitialFormatRead | PDisk is reading its format. Transition to FAULTY. |
| InitialFormatReadError | PDisk received an error when reading its format. Transition to FAULTY. |
| InitialSysLogRead | PDisk is reading the system log. Transition to FAULTY. |
| InitialSysLogReadError | PDisk received an error when reading the system log. Transition to FAULTY. |
| InitialSysLogParseError | PDisk received an error when parsing and checking the consistency of the system log. Transition to FAULTY. |
| InitialCommonLogRead | PDisk is reading the common VDisk log. Transition to FAULTY. |
| InitialCommonLogReadError | PDisk received an error when reading the common VDisk log. Transition to FAULTY. |
| InitialCommonLogParseError | PDisk received an error when parsing and checking the consistency of the common log. Transition to FAULTY. |
| CommonLoggerInitError | PDisk received an error when initializing internal structures to be logged to the common log. Transition to FAULTY. |
| Normal | PDisk completed initialization and is running normally. Transition to ACTIVE occurs after the specified number of cycles (for example, if the disk is Normal for 5 minutes, it switches to ACTIVE). |
| OpenFileError | PDisk received an error when opening a disk file. Transition to FAULTY. |
| Missing | The node responds, but this PDisk is missing from its list. Transition to FAULTY. |
| Timeout | The node didn't respond within the specified timeout. Transition to FAULTY. |
| NodeDisconnected | The node has disconnected. Transition to FAULTY. |
| Unknown | Unexpected response, for example, TEvUndelivered to the state request. Transition to FAULTY. |
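
The cycle-counting rule described in these settings can be illustrated with a small Python sketch. It is a simplified model, not the CMS implementation: the function names decide_status and replay are hypothetical, and the limits passed in (5 cycles by default, 0 for Unknown) are taken from the example in the text rather than read from a real cluster.

```python
from typing import Dict, List, Optional


def decide_status(state: str, cycles_in_state: int,
                  state_limits: Dict[str, int],
                  default_limit: int) -> Optional[str]:
    """Return the status to set in BSC, or None if no change is due yet."""
    # States without their own setting fall back to the default state limit.
    limit = state_limits.get(state, default_limit)
    # A limit of 0 disables status changes for this state.
    if limit == 0 or cycles_in_state < limit:
        return None
    # Normal leads to ACTIVE; every other state leads to FAULTY.
    return "ACTIVE" if state == "Normal" else "FAULTY"


def replay(observed_states: List[str], state_limits: Dict[str, int],
           default_limit: int) -> List[Optional[str]]:
    """Replay one observed state per update cycle and list the decisions."""
    decisions: List[Optional[str]] = []
    previous: Optional[str] = None
    count = 0
    for state in observed_states:
        # Count consecutive cycles spent in the same state.
        count = count + 1 if state == previous else 1
        previous = state
        decisions.append(decide_status(state, count, state_limits, default_limit))
    return decisions
```

With these assumptions, replay(["Initial"] * 5, {"Unknown": 0}, 5) yields no decision for the first four cycles and FAULTY on the fifth, which matches the five-cycle example above. With a 60-second State update interval that corresponds to roughly five minutes in the same state.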

Working with donor disks

To prevent data loss when moving a VDisk, enable donor disks:

kikimr admin bs config invoke --proto 'Command { UpdateSettings { EnableDonorMode: true } }'

To disable donor disks, set EnableDonorMode to false in the same command:

kikimr admin bs config invoke --proto 'Command { UpdateSettings { EnableDonorMode: false } }'

A donor disk is the previous VDisk that, after the move, continues to store its data and responds only to read requests from the new VDisk. When data is transferred with donor disks enabled, the previous VDisks keep functioning until all data has been moved to the new disks.
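
The donor behavior can be pictured with a toy Python model. This is only an illustration of the idea described above, not YDB code: the class ToyVDisk and its methods are hypothetical, and real VDisks replicate data in a far more involved way.

```python
from typing import Dict, Optional


class ToyVDisk:
    """Toy stand-in for a VDisk that may have a donor (the previous VDisk)."""

    def __init__(self, data: Optional[Dict[str, int]] = None,
                 donor: Optional["ToyVDisk"] = None) -> None:
        self.data: Dict[str, int] = dict(data or {})
        # The donor keeps its data and is only read from by this new disk.
        self.donor = donor

    def read(self, key: str) -> Optional[int]:
        """Serve a read locally, falling back to the donor for missing data."""
        if key in self.data:
            return self.data[key]
        if self.donor is not None:
            return self.donor.data.get(key)
        return None

    def replicate_step(self) -> None:
        """Copy one missing item from the donor; release it when done."""
        if self.donor is None:
            return
        missing = [k for k in self.donor.data if k not in self.data]
        if missing:
            key = missing[0]
            self.data[key] = self.donor.data[key]
        # Once everything has been copied, the donor is no longer needed.
        if all(k in self.data for k in self.donor.data):
            self.donor = None
```

In this model the new disk can answer reads for data it has not copied yet, and the donor is dropped only after the last item has been moved, which is the guarantee donor mode is meant to provide.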