Overview

YDB is a distributed, fault-tolerant database that provides components for building a data warehouse (DWH) on a unified platform.

Using YDB allows you to consolidate the functionality of several technologies (for example, separate systems for streaming, storage, and analytics) into a single solution. You can use familiar tools and approaches while gaining the benefits of a distributed system.

Data Ingestion

The platform is designed to handle streaming and batch data ingestion at large scale.

Streaming processing: Built-in topic system with Kafka API support for integration with existing systems. Plugins for Fluent Bit and Logstash are available for log collection.
Batch loading: BulkUpsert API for efficiently loading large data sets and a connector for Apache Spark to integrate with data processing platforms.
Connection via standard interfaces: a JDBC driver and native SDKs.
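
Batch loading usually means sending rows in bounded chunks rather than one request per row. The following is a minimal sketch of that chunking pattern only. The names `batched`, `load_in_batches`, and `send_batch` are hypothetical; `send_batch` stands in for whatever call actually performs the bulk upsert in your SDK, which is not shown here.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def load_in_batches(
    rows: Iterable[Row],
    batch_size: int,
    send_batch: Callable[[List[Row]], None],
) -> int:
    """Pass rows to send_batch chunk by chunk and return the number of rows sent.

    send_batch is a placeholder (an assumption of this sketch) for the
    real bulk-load call of the client library in use.
    """
    sent = 0
    for batch in batched(rows, batch_size):
        send_batch(batch)
        sent += len(batch)
    return sent
```

Because `rows` can be any iterable, including a generator, the source data never has to be fully materialized in memory before loading starts.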

Data Storage

The core of the storage system is column-oriented tables with built-in compression, optimized for analytical workloads.

Separation of storage and compute: A key feature of YDB, enabling independent scaling of disk space and computing resources.
Minimized administration: Background processes for compaction and TTL-based data removal reduce the need for manual operations.
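
To see why a column-oriented layout tends to compress well, consider the toy sketch below. It pivots rows into columns and applies run-length encoding to a single column. Run-length encoding is chosen here purely for illustration; this text does not state which compression methods YDB uses, and all function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row dictionaries into one list per column.

    Assumes every row has the same set of keys.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def rle_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    runs: List[Tuple[Any, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def rle_decode(runs: List[Tuple[Any, int]]) -> List[Any]:
    """Expand (value, count) pairs back into the original list."""
    values: List[Any] = []
    for value, count in runs:
        values.extend([value] * count)
    return values
```

Values within one column share a type and often repeat, so storing them together exposes redundancy that a row-by-row layout would interleave with unrelated fields.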

Query Execution

YDB is a massively parallel processing (MPP) DBMS with no dedicated master node. All nodes perform the same roles, and the system scales horizontally by dynamically adding or removing compute resources.

Cost-Based Optimizer (CBO): Selects the optimal query execution plan by analyzing data statistics.
Spilling mechanism: Enables execution of queries whose intermediate results do not fit in RAM by offloading them to disk.
Workload Manager: Manages resource allocation among queries, isolating different types of workloads.
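
The general idea behind spilling can be sketched as an aggregation that writes sorted partial results to temporary files once an in-memory limit is reached, then merges them. This is a generic illustration of the technique and not a description of YDB internals; the function names and the `max_keys_in_memory` limit are assumptions made for the example.

```python
import heapq
import json
import tempfile
from typing import IO, Dict, Iterable, Iterator, List, Tuple


def _spill(partial: Dict[str, int]) -> IO[str]:
    """Write partial sums to a temporary file, sorted by key."""
    f = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    for key in sorted(partial):
        f.write(json.dumps([key, partial[key]]) + "\n")
    f.seek(0)
    return f


def _read_run(f: IO[str]) -> Iterator[Tuple[str, int]]:
    """Stream (key, value) pairs back from a spilled file."""
    for line in f:
        key, value = json.loads(line)
        yield key, value


def sum_by_key(
    pairs: Iterable[Tuple[str, int]], max_keys_in_memory: int
) -> Iterator[Tuple[str, int]]:
    """Yield (key, total) in key order, holding a bounded number of keys in memory."""
    if max_keys_in_memory < 1:
        raise ValueError("max_keys_in_memory must be at least 1")
    partial: Dict[str, int] = {}
    runs: List[IO[str]] = []
    try:
        for key, value in pairs:
            # Spill when a new key would exceed the in-memory limit.
            if key not in partial and len(partial) >= max_keys_in_memory:
                runs.append(_spill(partial))
                partial = {}
            partial[key] = partial.get(key, 0) + value
        # Merge all sorted runs plus the remaining in-memory data.
        streams = [_read_run(f) for f in runs]
        streams.append(iter(sorted(partial.items())))
        current_key = None
        total = 0
        for key, value in heapq.merge(*streams):
            if key != current_key:
                if current_key is not None:
                    yield current_key, total
                current_key, total = key, 0
            total += value
        if current_key is not None:
            yield current_key, total
    finally:
        for f in runs:
            f.close()
```

The merge step reads each spilled run sequentially, so the final pass needs memory proportional to the number of runs rather than to the number of distinct keys.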

Data Transformation

Data transformation is supported using standard approaches and tools.

ELT with SQL: Use INSERT INTO ... SELECT to build data marts. For managing complex SQL pipelines, integration with dbt is available.
ETL with Apache Spark: Run ETL jobs on Apache Spark using the parallel connector.
Orchestration: Automate pipelines with Apache Airflow.
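
The INSERT INTO ... SELECT pattern for building a data mart can be tried locally. The sketch below uses Python's built-in sqlite3 module as a stand-in engine, not YDB, so the SQL dialect and connection code differ from a real YDB deployment. The table names `orders` and `customer_totals` are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        customer     TEXT NOT NULL,
        amount_cents INTEGER NOT NULL
    );
    CREATE TABLE customer_totals (
        customer    TEXT PRIMARY KEY,
        order_count INTEGER NOT NULL,
        total_cents INTEGER NOT NULL
    );
    """
)

# Raw data that has already been loaded (the "EL" part of ELT).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 1000), (2, "bob", 550), (3, "alice", 250)],
)

# The transformation step: build the mart inside the database with SQL.
conn.execute(
    """
    INSERT INTO customer_totals (customer, order_count, total_cents)
    SELECT customer, COUNT(*), SUM(amount_cents)
    FROM orders
    GROUP BY customer
    """
)
conn.commit()

mart = conn.execute(
    "SELECT customer, order_count, total_cents "
    "FROM customer_totals ORDER BY customer"
).fetchall()
```

The transformation runs entirely inside the database, so no rows pass through the client; this is the property that distinguishes ELT from an external ETL job.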

Federated Queries

YDB allows you to run queries on data stored in S3-compatible storage without preloading it. This simplifies working with data stored in a data lake.

Data Analysis and Visualization (BI and ML)

You can use industry-standard tools for data analysis:

BI tools: Yandex DataLens, Apache Superset, Grafana.
ML tools: Use Jupyter Notebook and Apache Spark for data preparation and machine learning model training.
