Bridge mode

Feature of Yandex Enterprise Database

This functionality is available only in the Yandex Enterprise Database. In the open-source version of YDB it is absent.

For a general description of the mode and use cases, see Bridge cluster operation mode. This article is intended for YDB developers and contributors and describes the technical implementation details of bridge mode.

YDB supports an arbitrary number of piles; for simplicity, the following discussion considers the case of two piles, referred to as pile A and pile B.

Configuration storage

Distconf stores the configuration of two static groups, two sets of state storage, as well as scheme board and board. It also stores the cluster operating mode.

The configuration of static groups, state storage, and boards uses a storage scheme with two quorum sets:

  • Quorum A — includes only nodes from pile A. A write is considered successful if it has been applied on a majority of nodes in pile A and on the quorum of disks of the static group of pile A;
  • Quorum B — includes only nodes from pile B. A write is considered successful if it has been applied on a majority of nodes in pile B and on the quorum of disks of the static group of pile B.

If one of the piles fails, writing to the corresponding quorum becomes impossible.

Distconf uses both quorums (A and B) to store configuration:

  • Configuration with mode PRIMARY/DISCONNECTED is written only to quorum A;
  • Configuration with mode DISCONNECTED/PRIMARY is written only to quorum B;
  • All other modes are written to both quorums — A and B.

When performing failover, the administrator puts the cluster into mode PRIMARY/DISCONNECTED or DISCONNECTED/PRIMARY. In this mode, the configuration of shared components is stored only on the quorum of the PRIMARY pile.

To start recovery, the administrator puts the pair of the live pile and the failed pile into mode PRIMARY/NOT_SYNCHRONIZED. This configuration is stored on all quorums.

Writing configuration is forbidden if the new configuration is incompatible with the last one stored on the node. For example, a transition from PRIMARY/DISCONNECTED to DISCONNECTED/PRIMARY is not allowed. This prevents a situation where pile A and pile B simultaneously switch to modes that both claim the PRIMARY role.

To prevent pile A and pile B from interacting when configurations conflict, interconnect session establishment is blocked when such inconsistencies are detected on nodes.

When operating in bridge mode, Distconf creates a subtree in each pile, and the roots of these subtrees are attached to the Distconf leader. This minimizes the reconfiguration of Distconf node links when a pile is lost.

Starting static groups

The configuration of all static groups is stored in Distconf. The root of the Distconf subtree in each pile can obtain the configuration and decide which quorum is required to apply it based on that configuration. A DISCONNECTED pile does not participate in the quorum; therefore, applying configuration without a DISCONNECTED pile requires quorums A and B; applying configuration PRIMARY/DISCONNECTED requires only quorum A; applying configuration DISCONNECTED/PRIMARY requires only quorum B.

When the required quorum is formed, Distconf applies the configuration, and the Node Warden starts the VDisk and PDisk of the static groups.

Access to the static group is provided by the dsproxy-proxy service, which exposes the dsproxy interface to tablets and performs operations synchronously with both groups in the pair.

The dsproxy-proxy of a static group changes its operating mode according to the configuration received from Distconf. The dsproxy-proxy configuration is stored in the configuration of the shared components.

dsproxy-proxy behavior

In mode PRIMARY/SYNCHRONIZED, writes are performed synchronously to both groups; reads are performed from the PRIMARY, and each read issues a ping request to the second group to ensure timely detection of link failure or a change in that group's operating mode.

In mode PRIMARY/DISCONNECTED, reads and writes are performed only in the PRIMARY group.

In mode PRIMARY/NOT_SYNCHRONIZED, dsproxy-proxy supports several submodes that determine which kinds of writes are performed in the NOT_SYNCHRONIZED group:

  • WRITE_KEEP — only KEEP flags are written;
  • WRITE_KEEP_BARRIER_DONOTKEEP — barriers and KEEP and DO_NOT_KEEP flags are written;
  • WRITE_KEEP_BARRIER_DONOTKEEP_DATA — data, barriers, and KEEP and DO_NOT_KEEP flags are written.

The dsproxy-proxy configuration includes:

  • mode;
  • submode;
  • PPGeneration — the generation number of the dsproxy-proxy configuration;
  • the generation number of the dsproxy group configuration from pile A;
  • the generation number of the dsproxy group configuration from pile B.

When the dsproxy-proxy configuration changes, the configuration generation number and the group configuration generation are always updated for the PRIMARY group; for the others, they are updated except when in the DISCONNECTED state.

The group configuration includes:

  • mode;
  • submode;
  • PPGeneration — the generation number of the controlling dsproxy-proxy configuration.

On each operation, the group checks the mode, submode, and PPGeneration. This ensures that operations from stale dsproxy-proxy instances that missed a mode or submode change are not executed, and avoids cross-interaction when two PRIMARY instances appear.

To transition from mode PRIMARY/NOT_SYNCHRONIZED to PRIMARY/SYNCHRONIZED, synchronization is performed. During synchronization, missing data is copied from the PRIMARY group to the NOT_SYNCHRONIZED group, and lock information is copied in both directions.

Bootstrap of BSController and other system tablets

The node list for starting a tablet need only specify one set of nodes that includes a sufficient number of nodes from each pile. The Bootstrapper must analyze the state storage proxy-proxy configuration and use it to determine whether the tablet can be started on a given node. If the tablet cannot be started on the node (because the node is not in the PRIMARY pile), the Bootstrapper must walk the tree of Bootstrappers connected to it and assign as the launcher a Bootstrapper from the PRIMARY pile. If no such Bootstrapper exists, the current Bootstrapper must not start the tablet.

Configuration and startup of other tablets

Dynamic storage groups are managed by the BS Controller. Instead of a single regular storage pool each time, three storage pools are created: in pool A — groups from pile A; in pool B — groups from pile B; in pool C — proxy-proxy groups formed from pairs of groups from pools A and B. Tablets managed by Hive use only the proxy-proxy groups from pool C.

Tablets are started only in the PRIMARY or PROMOTED pile. Moving tablets from PRIMARY to PROMOTED is done as smoothly as possible; moving tablets from DISCONNECTED to PRIMARY is done as quickly as possible.

Interconnect behavior

In the interconnect, metadata exchange is separated from session establishment. Metadata exchange occurs before session establishment and allows a pair of nodes to determine the set of interconnect features in use and to exchange configuration. The interconnect with nodes from the disconnected half of the cluster is torn down, and no connection attempts are made. Connection establishment is only possible after the pile is moved to state PRIMARY/NOT_SYNCHRONIZED.

Want to join the YDB development team?