Working with SelfHeal
While a cluster is running, entire nodes or individual block devices that YDB runs on can fail.
SelfHeal keeps a cluster available and fault-tolerant when malfunctioning nodes or devices cannot be repaired quickly.
SelfHeal can:
- Detect faulty system elements.
- Move faulty elements safely, without data loss and without breaking up storage groups.
SelfHeal is enabled by default.
The YDB component responsible for SelfHeal is called Sentinel.
Enabling and disabling SelfHeal
You can enable and disable SelfHeal using YDB DSTool.
To enable SelfHeal, run the command:
ydb-dstool -e <bs_endpoint> cluster set --enable-self-heal
To disable SelfHeal, run the command:
ydb-dstool -e <bs_endpoint> cluster set --disable-self-heal
SelfHeal settings
You can configure SelfHeal in Viewer → Cluster Management System → CmsConfigItems.
To create the initial settings, click Create. If you want to update the current settings, click
You can use the following settings:
Parameter | Description |
---|---|
Status | Enables or disables SelfHeal in the CMS. |
Dry run | Enables or disables the mode in which the CMS doesn't change BSC settings. |
Config update interval (sec.) | BSC configuration update interval. |
Retry interval (sec.) | Interval of configuration update attempts. |
State update interval (sec.) | PDisk state update interval. |
Timeout (sec.) | PDisk state update timeout. |
Change status retries | Number of retries to change the PDisk status for BSC (ACTIVE, FAULTY, BROKEN, and so on). |
Change status retry interval (sec.) | Delay between retries to update the PDisk status in BSC. The CMS polls the disk state at the State update interval. If a disk stays in the same state for several consecutive State update interval cycles, the CMS changes its status in BSC. The settings that follow set, per state, the number of cycles after which the CMS changes the disk status. If the disk state is Normal, the status changes to ACTIVE; in any other state, it changes to FAULTY. A value of 0 disables status changes for that state (by default, this is set for Unknown). For example, with the default settings, if the CMS detects the Initial state for five State update interval cycles of 60 seconds each, the disk status changes to FAULTY. |
Default state limit | Number of cycles used for states that have no setting of their own, including unknown PDisk states. It applies when no value is set for states such as Initial, InitialFormatRead, InitialSysLogRead, InitialCommonLogRead, and Normal. |
Initial | PDisk starts initializing. Transition to FAULTY. |
InitialFormatRead | PDisk is reading its format. Transition to FAULTY. |
InitialFormatReadError | PDisk received an error when reading its format. Transition to FAULTY. |
InitialSysLogRead | PDisk is reading the system log. Transition to FAULTY. |
InitialSysLogReadError | PDisk received an error when reading the system log. Transition to FAULTY. |
InitialSysLogParseError | PDisk received an error when parsing and checking the consistency of the system log. Transition to FAULTY. |
InitialCommonLogRead | PDisk is reading the common VDisk log. Transition to FAULTY. |
InitialCommonLogReadError | PDisk received an error when reading the common VDisk log. Transition to FAULTY. |
InitialCommonLogParseError | PDisk received an error when parsing and checking the consistency of the common log. Transition to FAULTY. |
CommonLoggerInitError | PDisk received an error when initializing internal structures to be logged to the common log. Transition to FAULTY. |
Normal | PDisk completed initialization and is running normally. Transition to ACTIVE will occur after a specified number of cycles (for example, if the disk is Normal for 5 minutes, it switches to ACTIVE). |
OpenFileError | PDisk received an error when opening a disk file. Transition to FAULTY. |
Missing | The node responds, but this PDisk is missing from its list. Transition to FAULTY. |
Timeout | The node didn't respond within the specified timeout. Transition to FAULTY. |
NodeDisconnected | The node has disconnected. Transition to FAULTY. |
Unknown | Unexpected response, for example, TEvUndelivered to the state request. Transition to FAULTY. |
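The per-state cycle limits described in the table can be illustrated with a short Python model. This is a minimal sketch, not the CMS implementation: the class and method names (StateLimits, PDiskTracker, observe) are hypothetical, the default limit of 5 is taken from the Initial example above and assumed to apply to every state without its own setting, and the real Sentinel logic may differ in details such as how the counter is reset.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StateLimits:
    """Cycle limits per PDisk state (hypothetical container, not a YDB type)."""
    # Assumed default, based on the "five cycles" example for Initial.
    default_limit: int = 5
    # 0 disables status changes for a state; the page says Unknown is 0 by default.
    per_state: Dict[str, int] = field(default_factory=lambda: {"Unknown": 0})

    def limit_for(self, state: str) -> int:
        # States without an explicit setting fall back to the default limit.
        return self.per_state.get(state, self.default_limit)


class PDiskTracker:
    """Counts consecutive State update interval cycles spent in one state."""

    def __init__(self, limits: StateLimits) -> None:
        self.limits = limits
        self.state: Optional[str] = None
        self.cycles = 0

    def observe(self, state: str) -> Optional[str]:
        """Record one polling cycle; return the status to set, or None."""
        if state == self.state:
            self.cycles += 1
        else:
            # A different state restarts the count.
            self.state = state
            self.cycles = 1
        limit = self.limits.limit_for(state)
        if limit == 0 or self.cycles < limit:
            return None
        # Normal leads to ACTIVE; every other state leads to FAULTY.
        return "ACTIVE" if state == "Normal" else "FAULTY"
```

With a 60-second State update interval, five consecutive observe("Initial") calls correspond to five minutes in the Initial state, and only the fifth call returns FAULTY.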
Working with donor disks
A donor disk is the previous VDisk that, after a data transfer begins, keeps its data and responds only to read requests from the new VDisk. When data is transferred with donor disks enabled, the previous VDisks continue to function until the data is fully moved to the new disks. To prevent data loss when moving a VDisk, enable donor disks:
ydb-dstool -e <bs_endpoint> cluster set --enable-donor-mode
To disable donor disks, run the command:
ydb-dstool -e <bs_endpoint> cluster set --disable-donor-mode
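The four toggle commands on this page share one shape, so they are easy to drive from a script. The following Python sketch builds the argument list for each of them. The helper names (build_cluster_set_command, run_cluster_set) and the feature labels are hypothetical; only the ydb-dstool flags themselves come from the commands above, and the endpoint is whatever you would pass as <bs_endpoint>.

```python
import subprocess
from typing import List

# Flags copied verbatim from the commands on this page.
TOGGLE_FLAGS = {
    ("self-heal", True): "--enable-self-heal",
    ("self-heal", False): "--disable-self-heal",
    ("donor-mode", True): "--enable-donor-mode",
    ("donor-mode", False): "--disable-donor-mode",
}


def build_cluster_set_command(bs_endpoint: str, feature: str, enable: bool) -> List[str]:
    """Return the argv list for `ydb-dstool -e <bs_endpoint> cluster set <flag>`."""
    try:
        flag = TOGGLE_FLAGS[(feature, enable)]
    except KeyError:
        raise ValueError(f"unknown feature: {feature!r}") from None
    return ["ydb-dstool", "-e", bs_endpoint, "cluster", "set", flag]


def run_cluster_set(bs_endpoint: str, feature: str, enable: bool) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_cluster_set_command(bs_endpoint, feature, enable), check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting issues when the endpoint contains special characters.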