CPU bottleneck

High CPU usage can lead to slow query processing and increased response times. When CPU resources are constrained, the database may have difficulty handling complex queries or large transaction volumes.

YDB nodes primarily consume CPU resources for running actors. On each node, actors are executed using multiple actor system pools. The resource consumption of each pool is measured separately, which makes it possible to identify which kind of activity changed its behavior.

Diagnostics

  1. Use Diagnostics in the Embedded UI to analyze CPU utilization in all pools:

    1. In the Embedded UI, go to the Databases tab and click on the database.

    2. On the Navigation tab, ensure the required database is selected.

    3. Open the Diagnostics tab.

    4. On the Info tab, click the CPU button and see if any pools show high CPU usage.

  2. Use Grafana charts to analyze CPU utilization in all pools:

    1. Open the CPU dashboard in Grafana.

    2. See if the following charts show any spikes:

      • CPU by execution pool chart

      • User pool - CPU by host chart

      • System pool - CPU by host chart

      • Batch pool - CPU by host chart

      • IC pool - CPU by host chart

      • IO pool - CPU by host chart

  3. If the spike is in the user pool, analyze changes in the user load that might have caused the CPU bottleneck. See the following charts on the DB overview dashboard in Grafana:

    • Requests chart

    • Request size chart

    • Response size chart

    Also, see all of the charts in the Operations section of the DataShard dashboard.

  4. If the spike is in the batch pool, check if there are any backups running.
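
The triage logic in the steps above can be sketched in code. This is a minimal illustration, not part of YDB: the `PoolSample` type, the `HIGH_UTILIZATION` threshold of 0.8, and the idea of feeding in per-pool numbers read from the Embedded UI or Grafana are all assumptions made for the example. The pool names and the follow-up checks are the ones listed above.

```python
from dataclasses import dataclass

# Hypothetical threshold: the fraction of a pool's allocated cores above which
# the pool is treated as a suspect. Pick a value that fits your cluster.
HIGH_UTILIZATION = 0.8

# Follow-up checks that the diagnostic steps describe for specific pools.
NEXT_STEP = {
    "User": "review the Requests, Request size, and Response size charts",
    "Batch": "check whether any backups are running",
}


@dataclass
class PoolSample:
    """CPU figures for one pool, copied by hand from the UI or a dashboard."""
    name: str
    cores_used: float
    cores_allocated: float

    @property
    def utilization(self) -> float:
        # Guard against a pool with no cores allocated.
        if self.cores_allocated <= 0:
            return 0.0
        return self.cores_used / self.cores_allocated


def find_hot_pools(samples, threshold=HIGH_UTILIZATION):
    """Return pools at or above the threshold, busiest first."""
    hot = [s for s in samples if s.utilization >= threshold]
    return sorted(hot, key=lambda s: s.utilization, reverse=True)


def triage(samples, threshold=HIGH_UTILIZATION):
    """Pair each hot pool with the follow-up check, if the steps name one."""
    return [
        (s.name, NEXT_STEP.get(s.name, "no pool-specific check in this guide"))
        for s in find_hot_pools(samples, threshold)
    ]


if __name__ == "__main__":
    samples = [
        PoolSample("User", 9.0, 10.0),
        PoolSample("System", 1.0, 4.0),
        PoolSample("Batch", 1.7, 2.0),
    ]
    for name, step in triage(samples):
        print(f"{name} pool: {step}")
```

Sorting by utilization puts the busiest pool first, so the most likely source of the bottleneck is investigated before the others.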

Recommendation

Add database nodes to the cluster or allocate more CPU cores to the existing nodes. If that's not possible, consider redistributing CPU cores between the pools.
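
To get a rough idea of how many cores to add, a back-of-the-envelope estimate can help. The sketch below is an assumption-laden illustration rather than a YDB sizing rule: the `additional_cores` function and the default target utilization of 0.5 are hypothetical, and the calculation assumes that CPU demand stays at the observed peak and scales linearly with the number of cores.

```python
import math


def additional_cores(peak_cores_used: float,
                     cores_allocated: int,
                     target_utilization: float = 0.5) -> int:
    """Estimate how many cores to add so that the observed peak load
    fits within the target utilization (hypothetical default: 0.5)."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in the range (0, 1]")
    # Round before taking the ceiling so that floating-point noise
    # (for example 15.000000000000002) does not add a spurious core.
    required = math.ceil(round(peak_cores_used / target_utilization, 9))
    return max(0, required - cores_allocated)


if __name__ == "__main__":
    # A pool peaking at 9 cores with 10 allocated, aiming for 50% utilization.
    print(additional_cores(9.0, 10, 0.5))
```

The same estimate can be applied per pool when deciding how to redistribute cores, or to the node total when deciding whether to add nodes.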