Health Check API
YDB has a built-in self-diagnostic system, which can be used to get a brief report on the database status and information about existing problems.
To initiate the check, call the SelfCheck method from Ydb.Monitoring. As with other requests, you must also pass the name of the database to check.
Calling the method will return the following structure:
message SelfCheckResult {
    SelfCheck.Result self_check_result = 1;
    repeated IssueLog issue_log = 2;
}
The self_check_result field is an enum that contains the database check result:
GOOD: No problems were detected.
DEGRADED: Degradation of one of the database systems was detected, but the database is still functioning (for example, an allowable loss of a disk).
MAINTENANCE_REQUIRED: Significant degradation was detected, there is a risk of availability loss, and human intervention is required.
EMERGENCY: A serious problem was detected in the database, with complete or partial loss of availability.
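As an illustration, a monitoring script could translate the check result into a process exit code. The sketch below is not part of the YDB API: it assumes the result has already been obtained as its enum name (a string), and the ExitCode values and the mapping itself are hypothetical choices modeled on common monitoring-plugin conventions.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Hypothetical exit codes in the style of common monitoring plugins."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# Assumed mapping from the self_check_result enum name to an exit code.
# The result names come from the list above; the severity assignment is
# an illustrative choice, not something YDB prescribes.
RESULT_TO_EXIT = {
    "GOOD": ExitCode.OK,
    "DEGRADED": ExitCode.WARNING,
    "MAINTENANCE_REQUIRED": ExitCode.CRITICAL,
    "EMERGENCY": ExitCode.CRITICAL,
}


def exit_code_for(result_name: str) -> ExitCode:
    """Return the exit code for a result name, or UNKNOWN if unrecognized."""
    return RESULT_TO_EXIT.get(result_name, ExitCode.UNKNOWN)
```

Treating any unrecognized value as UNKNOWN rather than OK keeps the script from silently reporting a healthy database if new result values are added later.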
If problems are detected, the issue_log field will contain problem descriptions with the following structure:
message IssueLog {
    string id = 1;
    StatusFlag.Status status = 2;
    string message = 3;
    Location location = 4;
    repeated string reason = 5;
    string type = 6;
    uint32 level = 7;
}
id: A unique problem ID within this response.
status: Status (severity) of the current problem. It can take one of the following values:
  RED: A component is faulty or unavailable.
  ORANGE: A serious problem, the database is one step away from losing availability. Intervention may be required.
  YELLOW: A minor problem with no risk to availability. We recommend that you continue monitoring the problem.
  BLUE: Temporary minor degradation that does not affect database availability.
  GREEN: No problems were detected.
  GREY: Failed to determine the status (a problem with the self-diagnostic mechanism).
message: Text that describes the problem.
location: Location of the problem.
reason: IDs of the nested problems that may have led to the current problem.
type: Problem category (by subsystem).
level: Depth of the problem nesting.
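Because reason holds the IDs of nested problems, the issue log can be read as a tree. The sketch below shows one way to render it, assuming the response has already been converted into plain Python objects. The Issue dataclass is a hypothetical stand-in that mirrors a subset of the IssueLog fields, and the sample IDs are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Issue:
    """Hypothetical plain-Python mirror of a subset of the IssueLog fields."""
    id: str
    status: str
    message: str
    reason: List[str] = field(default_factory=list)


def render_issue_tree(issues: List[Issue]) -> List[str]:
    """Return indented lines, nesting each issue under the issue that cites it."""
    by_id = {issue.id: issue for issue in issues}
    # An issue is a root if no other issue lists it as a reason.
    referenced = {rid for issue in issues for rid in issue.reason}
    roots = [issue for issue in issues if issue.id not in referenced]

    lines: List[str] = []

    def visit(issue: Issue, depth: int, seen: frozenset) -> None:
        lines.append(f"{'  ' * depth}[{issue.status}] {issue.message}")
        for rid in issue.reason:
            child = by_id.get(rid)
            # Skip unknown IDs and guard against accidental cycles.
            if child is not None and rid not in seen:
                visit(child, depth + 1, seen | {rid})

    for root in roots:
        visit(root, 0, frozenset({root.id}))
    return lines


# Sample data with made-up IDs; the messages are taken from the list of
# possible problems in this document.
sample = [
    Issue("1", "YELLOW", "Group degraded", reason=["2"]),
    Issue("2", "RED", "VDisk is not available", reason=["3"]),
    Issue("3", "RED", "PDisk is not available"),
]

for line in render_issue_tree(sample):
    print(line)
```

The roots are found by checking which issues are never cited as a reason, so the sketch does not depend on the starting value of level.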
Possible problems
Pool usage over 90/95/99%: One of the CPU pools is overloaded.
System tablet is unresponsive / response time over 1000ms/5000ms: The system tablet is not responding or takes too long to respond.
Tablets are restarting too often: Tablets are restarting too often.
Tablets are dead: Tablets are not started (or cannot be started).
LoadAverage above 100%: A physical host is overloaded.
There are no compute nodes: The database has no nodes to start the tablets on.
PDisk state is ...: Indicates problems with a physical disk.
PDisk is not available: A physical disk is not available.
Available size is less than 12%/9%/6%: Free space on the physical disk is running out.
VDisk is not available: A virtual disk is not available.
VDisk state is ...: Indicates problems with a virtual disk.
DiskSpace is ...: Indicates problems with virtual disk space.
Storage node is not available: A node with disks is not available.
Replication in progress: Disk replication is in progress.
Group has no redundancy: A storage group has lost its redundancy.
Group failed: A storage group has lost its integrity.
Group degraded: An allowable number of disks in the group are unavailable.