Working with SelfHeal

While a cluster is running, entire nodes or individual block devices that YDB runs on can fail.

SelfHeal ensures a cluster's continuous performance and fault tolerance if malfunctioning nodes or devices cannot be repaired quickly.

SelfHeal can:

Detect faulty system elements.
Transfer faulty elements carefully, without data loss or disintegration of storage groups.

SelfHeal is enabled by default.

The YDB component responsible for SelfHeal is called Sentinel.

Enabling and disabling SelfHeal

You can enable and disable SelfHeal using YDB DSTool.

To enable SelfHeal, run the command:

ydb-dstool -e <bs_endpoint> cluster set --enable-self-heal

To disable SelfHeal, run the command:

ydb-dstool -e <bs_endpoint> cluster set --disable-self-heal
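
For scripted operations, the two commands above can be assembled programmatically. The following is a minimal sketch in Python, assuming `ydb-dstool` is available on `PATH`. The helper name `self_heal_command` is hypothetical and not part of YDB; only the command-line flags come from this article. The function builds the argument list and does not execute anything.

```python
from typing import List


def self_heal_command(bs_endpoint: str, enable: bool) -> List[str]:
    """Build the ydb-dstool argument list that turns SelfHeal on or off.

    Hypothetical helper: it only mirrors the two commands shown above.
    """
    if not bs_endpoint:
        raise ValueError("bs_endpoint must not be empty")
    flag = "--enable-self-heal" if enable else "--disable-self-heal"
    return ["ydb-dstool", "-e", bs_endpoint, "cluster", "set", flag]


if __name__ == "__main__":
    # "<bs_endpoint>" is a placeholder; substitute the real endpoint.
    print(" ".join(self_heal_command("<bs_endpoint>", enable=False)))
```

To run the command, pass the returned list to `subprocess.run(cmd, check=True)`.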

SelfHeal settings

You can configure SelfHeal in Viewer → Cluster Management System → CmsConfigItems.

To create the initial settings, click Create. If you want to update the current settings, click .

You can use the following settings:

Parameter	Description
Status	Enables or disables SelfHeal in CMS.
Dry run	Enables/disables the mode in which the CMS doesn't change the BSC setting.
Config update interval (sec.)	BSC configuration update interval.
Retry interval (sec.)	Interval of configuration update attempts.
State update interval (sec.)	PDisk state update interval.
Timeout (sec.)	PDisk state update timeout.
Change status retries	Number of retries to change the PDisk status in BSC (`ACTIVE`, `FAULTY`, `BROKEN`, and so on).
Change status retry interval (sec.)	Delay between retries to update the PDisk status in BSC. CMS polls the disk state at the State update interval. If the disk remains in the same state for several consecutive State update interval cycles, the CMS changes its status in BSC. The settings that follow specify, for each state, the number of update cycles after which the CMS changes the disk status. If the disk state is `Normal`, the disk status changes to `ACTIVE`. In other states, the disk switches to `FAULTY`. The `0` value disables status changes for the state (by default, this is set for `Unknown`). For example, with the default settings, if the CMS detects the `Initial` disk state for five State update interval cycles of 60 seconds each, the disk status changes to `FAULTY`.
Default state limit	Number of cycles used for any state that has no explicit setting, including unknown PDisk states. It applies when no value is set for states such as `Initial`, `InitialFormatRead`, `InitialSysLogRead`, `InitialCommonLogRead`, and `Normal`.
Initial	PDisk starts initializing. Transition to `FAULTY`.
InitialFormatRead	PDisk is reading its format. Transition to `FAULTY`.
InitialFormatReadError	PDisk received an error when reading its format. Transition to `FAULTY`.
InitialSysLogRead	PDisk is reading the system log. Transition to `FAULTY`.
InitialSysLogReadError	PDisk received an error when reading the system log. Transition to `FAULTY`.
InitialSysLogParseError	PDisk received an error when parsing and checking the consistency of the system log. Transition to `FAULTY`.
InitialCommonLogRead	PDisk is reading the common VDisk log. Transition to `FAULTY`.
InitialCommonLogReadError	PDisk received an error when reading the common VDisk log. Transition to `FAULTY`.
InitialCommonLogParseError	PDisk received an error when parsing and checking the consistency of the common log. Transition to `FAULTY`.
CommonLoggerInitError	PDisk received an error when initializing internal structures to be logged to the common log. Transition to `FAULTY`.
Normal	PDisk completed initialization and is running normally. Transition to `ACTIVE` will occur after a specified number of cycles (for example, if the disk is `Normal` for 5 minutes, it switches to `ACTIVE`).
OpenFileError	PDisk received an error when opening a disk file. Transition to `FAULTY`.
Missing	The node responds, but this PDisk is missing from its list. Transition to `FAULTY`.
Timeout	The node didn't respond within the specified timeout. Transition to `FAULTY`.
NodeDisconnected	The node has disconnected. Transition to `FAULTY`.
Stopped	PDisk has been stopped. Transition to `FAULTY`.
Unknown	Unexpected response, for example, `TEvUndelivered` to the state request. Transition to `FAULTY`.
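
The cycle-counting rule described in the table can be illustrated with a short simulation. This is a sketch under stated assumptions, not Sentinel's actual implementation: the function names `target_status` and `status_changes` are hypothetical, and the limits are illustrative (a limit of 5 mirrors the `Initial` example above, and 0 for `Unknown` mirrors the stated default). Each list element stands for one State update interval poll.

```python
from typing import Dict, Iterable, List, Optional


def target_status(state: str) -> str:
    """Status requested for a state: ACTIVE for Normal, FAULTY otherwise."""
    return "ACTIVE" if state == "Normal" else "FAULTY"


def status_changes(
    observations: Iterable[str],
    state_limits: Dict[str, int],
    default_limit: int,
) -> List[Optional[str]]:
    """For each polling cycle, return the status to request in BSC, or None."""
    previous: Optional[str] = None
    streak = 0  # consecutive cycles spent in the current state
    result: List[Optional[str]] = []
    for state in observations:
        streak = streak + 1 if state == previous else 1
        previous = state
        # States without an explicit setting fall back to the default limit.
        limit = state_limits.get(state, default_limit)
        # A limit of 0 disables status changes for this state.
        if limit > 0 and streak == limit:
            result.append(target_status(state))
        else:
            result.append(None)
    return result


if __name__ == "__main__":
    limits = {"Unknown": 0}  # illustrative values only
    polls = ["Initial"] * 5 + ["Normal"] * 5 + ["Unknown"] * 5
    print(status_changes(polls, limits, default_limit=5))
```

In this model a change is requested once, on the cycle where the streak reaches the limit, and any change of observed state resets the streak.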

Working with donor disks

A donor disk is the previous VDisk involved in a data transfer: it continues to store its data and responds only to read requests from the new VDisk. When data is transferred with donor disks enabled, previous VDisks continue to function until the data is fully moved to the new disks. To prevent data loss when moving a VDisk, enable donor disks:

ydb-dstool -e <bs_endpoint> cluster set --enable-donor-mode

To disable donor disks, run the command:

ydb-dstool -e <bs_endpoint> cluster set --disable-donor-mode
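
The donor relationship described above can be pictured with a toy model. This is a conceptual sketch only and says nothing about how YDB implements VDisks internally: the classes `DonorVDisk` and `NewVDisk`, the blob dictionary, and the one-blob-per-step copy are all assumptions made for illustration. It shows the previous VDisk serving reads to the new VDisk until everything has been copied, after which the donor is released.

```python
from typing import Dict, List, Optional


class DonorVDisk:
    """Toy model of the previous VDisk: keeps its data and serves reads only."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self._blobs = dict(blobs)

    def blob_ids(self) -> List[str]:
        return list(self._blobs)

    def read(self, blob_id: str) -> Optional[bytes]:
        return self._blobs.get(blob_id)


class NewVDisk:
    """Toy model of the new VDisk that pulls data from its donor."""

    def __init__(self, donor: Optional[DonorVDisk]) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.donor = donor

    def copy_next(self) -> bool:
        """Copy one missing blob from the donor; return False when none remain."""
        if self.donor is None:
            return False
        for blob_id in self.donor.blob_ids():
            if blob_id not in self.blobs:
                data = self.donor.read(blob_id)
                if data is not None:
                    self.blobs[blob_id] = data
                    return True
        # Everything has been moved, so the donor is no longer needed.
        self.donor = None
        return False


if __name__ == "__main__":
    new_disk = NewVDisk(DonorVDisk({"a": b"1", "b": b"2"}))
    while new_disk.copy_next():
        pass
    print(sorted(new_disk.blobs), new_disk.donor)
```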
