Federated query processing system architecture

External data sources and external tables

A key element of the federated query processing system in YDB is the concept of an external data source. Such sources can include relational DBMS, object storage, and other data storage systems. When processing a federated query, YDB streams data from external systems and allows performing the same range of operations on them as on local data.

To work with data located in external systems, YDB must have information about the internal structure of this data (e.g., the number, names, and types of columns in tables). Some sources provide such metadata along with the data itself, whereas for other unschematized sources, this metadata must be provided externally. This latter purpose is served by external tables.

Once external data sources and (if necessary) external tables are registered in YDB, the client can proceed to describe federated queries.

Connectors

While executing federated queries, YDB needs to access external data storage systems over the network, for which it uses their client libraries. Including such dependencies negatively affects the codebase size, compilation time, and binary file size of YDB, as well as the product's overall stability.

The list of supported data sources for federated queries is constantly expanding. The most popular sources, such as S3, are natively supported by YDB. However, not all users require support for all sources simultaneously. Support can be optionally enabled using connectors - special microservices implementing a unified interface for accessing external data sources.

The functions of connectors include:

  • Translating YQL queries into queries in the language specific to the external source (e.g., into another SQL dialect or HTTP API calls).
  • Establishing network connections with data sources.
  • Converting data retrieved from external sources into a columnar format in Arrow IPC Stream format, supported by YDB.

YDB Federated Query Architecture

Thus, connectors form an abstraction layer that hides the specifics of external data sources from YDB. The concise connector interface makes it easy to expand the list of supported sources with minimal changes to YDB's code.

Users can deploy one of the ready-made connectors or write their own implementation in any programming language according to the gRPC specification.

List of supported external data sources

Source Support
S3 Built into ydbd
ClickHouse Via connector fq-connector-go
PostgreSQL Via connector fq-connector-go
YDB Via connector fq-connector-go
Greenplum Via connector fq-connector-go