Chaos Testing
Chaos testing is a methodology for verifying the resilience of YDB by deliberately injecting failures into a running cluster. The goal is to ensure that YDB correctly survives real-world failures: node losses, network partitions, disk problems, and other abnormal situations. The Nemesis tool is responsible for injecting this chaos.
What Is Tested
Chaos testing verifies cluster behavior under the following types of failures:
Network Failures
- Node network isolation (blocking incoming and outgoing traffic)
- System time skew on nodes
Node Failures
- Forceful termination of cluster node processes
- Stopping and restarting nodes
- Suspending node processes
Tablet Failures
Tablets are the primary computational units of YDB. Resilience is verified by forcefully terminating various tablet types:
- Coordinator — distributed transaction coordinator
- Hive — tablet placement manager
- BSController — distributed storage controller
- SchemeShard — schema manager
- DataShard — data storage tablets
- Mediator — transaction mediator
- PersQueue — message queue tablets
- Other system tablets
Tablet rebalancing between nodes via Hive is also tested.
Disk Failures
- Safely taking a disk out of service on a node
- Cleaning up disks on nodes
Multi-Datacenter Cluster Scenarios
- Stopping all nodes in a single datacenter
- Network isolation of a datacenter
Bridge Mode Cluster Scenarios
Integration with Stress Testing
Chaos testing is typically run in conjunction with stress testing workloads from the ydb/tests/stress directory. This combination ensures that the cluster is tested under load conditions while experiencing various failure scenarios.
How Verification Works
While failures are being injected, the system checks two properties on demand:
- Liveness — the cluster remains available and continues to process requests
- Safety — no signs of data correctness violations or internal system invariant breaches appear in cluster logs and metrics
Failures are injected automatically on a schedule, and check results are aggregated and available for analysis.
The Nemesis Tool
YDB uses the Nemesis tool for chaos testing — a fault injection application located in the YDB repository on GitHub. It is deployed directly on the nodes of the cluster under test and manages fault injection according to a configured schedule.
Warning
Nemesis only works with YDB clusters that were deployed using the ydbd_slice utility.
Installation
Deploy Nemesis to your cluster:
# Single-file config (cluster.yaml contains both hosts and database template)
nemesis install --yaml-config-location /path/to/cluster.yaml
# Two-file config (separate cluster.yaml and databases.yaml)
nemesis install \
--yaml-config-location /path/to/cluster.yaml \
--database-config-location /path/to/databases.yaml
The first host in the cluster configuration becomes the orchestrator; all other hosts become agents. Services are deployed as systemd units and started automatically.
The primary way to observe a Nemesis run and inspect its results is the web UI served by the orchestrator. Open the URL printed at the end of nemesis install in a browser; by default it is available at:
http://<orchestrator_host>:31434/static/index.html
The UI displays:
- Active faults — currently injected faults and their execution logs, grouped by category (network, node, tablet, disk, datacenter, pile)
- Schedules — fault types registered for automatic injection, their configured intervals, and the next scheduled run
- Manual controls — buttons to inject individual faults on demand
- Execution history — past fault injections with timestamps and target hosts
- Liveness checks — cluster-wide health checks run by the orchestrator
- Safety checks — violation detectors over the local logs for the last 24 hours
The same data is available via the HTTP API on the orchestrator if you need to scrape it programmatically.
Interpreting Results
A Nemesis run is judged against two properties: liveness and safety. The pass/fail criteria are:
- Pass — throughout the run, all liveness checks report the cluster as healthy and all safety wardens return empty violation lists
- Fail — at least one liveness check reported the cluster as unavailable (cluster did not recover after a fault within the expected window), or at least one safety warden returned a non-empty list of violations (for example, errors or assertions in cluster logs, schemeshard inconsistencies, datashard invariant breaches)
In the UI, healthy checks appear in the violations list; failed checks display the violation messages produced by the corresponding warden.
To investigate a problem, open the failing check in the UI and read the listed violation messages.
For the meaning of individual checks and the exact patterns each warden looks for, see the Nemesis README.
Stopping Services
Stop all Nemesis services on the cluster:
nemesis stop --yaml-config-location /path/to/cluster.yaml
Extending Nemesis
To add a new chaos type (fault) or a new safety/liveness check, see the README. It describes how to implement and register new fault runners, planners, and warden classes.