Import and export for column tables

YDB column tables currently do not have a built-in backup and restore mechanism (it is in development). For data migration or recovery after failures, use export and import operations.

Two main approaches are available:

Export and import via federated queries to object storage (for example, Object Storage or any other S3-compatible storage).
Advantages: Uses built-in YDB functionality; no third-party services or tools required.
Limitations: This method only supports exporting data to object storage.
Export and import via Apache Spark™ — a flexible option for large volumes of data.
Advantages: Supports a wide range of target storage systems.
Limitations: Requires installing and configuring additional software (Apache Spark™).

Export and import via federated queries to Object Storage

Federated queries let YDB read and write data in Parquet or CSV files directly. This approach is convenient for export and import using SQL only, without external tools.

Prerequisites

Object storage (Object Storage) and a static access key; for example, Yandex Object Storage with a pre-created bucket (e.g. your-bucket).
Network access from the YDB cluster nodes to the object storage. The examples use endpoint storage.yandexcloud.net — ensure access on port 443.
The examples use TPC-H benchmark data. Instructions for loading the sample data are in the relevant section of the guide.

Create a secret for access to Object Storage

To connect to a private bucket, use static access key authentication. In YDB, these keys are stored as secrets.

CREATE SECRET aws_access_id WITH (value='<access_key_id>');
CREATE SECRET aws_access_key WITH (value='<secret_access_key>');

Where:

aws_access_id — secret containing the access key ID.
<access_key_id> — static access key identifier.
aws_access_key — secret containing the secret key.
<secret_access_key> — secret part of the access key.

Configuring the connection

Next, configure the connection to the bucket by creating an external data source and an external table with a schema that matches lineitem.

-- Create a data source pointing to the bucket and using the secret
CREATE EXTERNAL DATA SOURCE `external/backup_datasource` WITH (
    SOURCE_TYPE="ObjectStorage",
    LOCATION="https://storage.yandexcloud.net/<bucket_name>/",
    AUTH_METHOD="AWS",
    AWS_ACCESS_KEY_ID_SECRET_PATH="aws_access_id",
    AWS_SECRET_ACCESS_KEY_SECRET_PATH="aws_access_key",
    AWS_REGION="ru-central1"
);

Where:

external/backup_datasource — name of the external data source being created.
LOCATION — bucket URL, including the bucket name <bucket_name>.
AUTH_METHOD="AWS" — authentication method compatible with the S3 API.
AWS_ACCESS_KEY_ID_SECRET_PATH, AWS_SECRET_ACCESS_KEY_SECRET_PATH — secrets used for authentication with Object Storage.

-- Create an external table with the lineitem schema
CREATE EXTERNAL TABLE `external/backup/lineitem_sql` (
    l_orderkey Int64 NOT NULL,
    l_partkey Int32 NOT NULL,
    l_suppkey Int32 NOT NULL,
    l_linenumber Int32 NOT NULL,
    l_quantity Double NOT NULL,
    l_extendedprice Double NOT NULL,
    l_discount Double NOT NULL,
    l_tax Double NOT NULL,
    l_returnflag String NOT NULL,
    l_linestatus String NOT NULL,
    l_shipdate Date NOT NULL,
    l_commitdate Date NOT NULL,
    l_receiptdate Date NOT NULL,
    l_shipinstruct String NOT NULL,
    l_shipmode String NOT NULL,
    l_comment String NOT NULL
) WITH (
    DATA_SOURCE="external/backup_datasource",
    LOCATION="/ydb-dumps-sql/lineitem/",
    FORMAT="parquet"
);

Where:

LOCATION — path to the directory with data inside the bucket.
DATA_SOURCE — name of the EXTERNAL DATA SOURCE object that holds the connection parameters.
external/backup/lineitem_sql — full name of the external table being created.

Exporting data from YDB

To export data from table tpch/s10/lineitem to Object Storage, use INSERT INTO ... SELECT into the external table.

INSERT INTO `external/backup/lineitem_sql`
SELECT * FROM `tpch/s10/lineitem`;

After this query runs, Parquet files with the data will appear in bucket your-bucket at path /ydb-dumps-sql/lineitem/.

Importing data into YDB

Note

The INSERT command may fail if the table you are restoring into already has rows. In that case, clear the target table and run the INSERT again.

To import data from Object Storage back into table tpch/s10/lineitem, use INSERT INTO ... SELECT from the external table.

INSERT INTO `tpch/s10/lineitem`
SELECT * FROM `external/backup/lineitem_sql`;

Here tpch/s10/lineitem is the target table in YDB into which data will be loaded.

Export and import with Apache Spark™

Using the connector for YDB and Apache Spark™ is a flexible and scalable way to export and import large volumes of data.

Prerequisites

PySpark version 4.0.1 installed; see the installation guide.
A gRPC endpoint for connecting to the YDB database.
Connection credentials for YDB with read/write permissions.
Network access from the YDB cluster nodes to the object storage. The examples use endpoint storage.yandexcloud.net — ensure access on port 443.
The examples use TPC-H benchmark data. Instructions for loading the sample data are in the relevant section of the guide.

Exporting data from YDB to Parquet

Parameters used:

spark.jars.packages — Maven configuration that loads the YDB Spark connector and other required components.
S3_ENDPOINT — endpoint of the S3-compatible storage (for Yandex Object Storage use https://storage.yandexcloud.net).
S3_ACCESS_KEY — static access key ID for S3.
S3_SECRET_KEY — secret part of the S3 access key.
YDB_HOSTNAME — gRPC endpoint host (e.g. ydb.serverless.yandexcloud.net).
YDB_PORT — gRPC endpoint port (e.g. 2135).
YDB_DATABASE_NAME — path to your database (e.g. /ru-central1/b1g.../etn...).
YDB_AUTH_TYPE — authentication parameters for YDB, as supported by the Apache Spark driver.
YDB_SOURCE_TABLE — path to the source table (e.g. tpch/s1/lineitem).

from pyspark.sql import SparkSession

# Source settings
YDB_HOSTNAME = ""
YDB_PORT = ""
YDB_DATABASE_NAME = ""
YDB_AUTH_TYPE = ""
YDB_SOURCE_TABLE = ""

# Destination settings
S3_ENDPOINT = ""
S3_ACCESS_KEY = ""
S3_SECRET_KEY = ""
S3_BUCKET_NAME = ""

spark = (SparkSession.builder
    .appName("ydb-export-lineitem-to-parquet")
    .config("spark.jars.packages", "tech.ydb.spark:ydb-spark-connector-shaded:2.0.1,org.apache.hadoop:hadoop-aws:3.3.6,com.amazonaws:aws-java-sdk-bundle:1.12.662")
    # S3 connector configuration
    .config("spark.hadoop.fs.s3a.endpoint", S3_ENDPOINT)
    .config("spark.hadoop.fs.s3a.access.key", S3_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", S3_SECRET_KEY)
    .config("spark.hadoop.fs.s3a.threads.keepalivetime", "60000")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.connection.establish.timeout", "30000")   # 30s
    .config("spark.hadoop.fs.s3a.connection.timeout", "200000")            # 200s
    .config("spark.hadoop.fs.s3a.threads.keepalivetime", "60000")          # 60s
    .config("spark.hadoop.fs.s3a.connection.ttl", "300000")                # 5m
    .config("spark.hadoop.fs.s3a.assumed.role.session.duration", "1800000")# 30m
    .config("spark.hadoop.fs.s3a.multipart.purge.age", "86400000")         # 24h
    .config("spark.hadoop.fs.s3a.retry.interval", "500")                   # 500ms
    .config("spark.hadoop.fs.s3a.retry.throttle.interval", "100")          # 100ms
    .getOrCreate())

# Read data from the lineitem table
df = (spark.read.format("ydb")
    .option("url", f"grpcs://{YDB_HOSTNAME}:{YDB_PORT}{YDB_DATABASE_NAME}?{YDB_AUTH_TYPE}")
    .load(YDB_SOURCE_TABLE))

# Write data to Parquet files in S3
(df.repartition(64)
    .write.mode("overwrite")
    .option("compression", "snappy")
    .parquet(f"s3a://{S3_BUCKET_NAME}/ydb-dumps-spark/lineitem/"))

spark.stop()

Importing data from Parquet into YDB

spark.jars.packages — Maven configuration that loads the YDB Spark connector and other required components.
S3_ENDPOINT — endpoint of the S3-compatible storage (for Yandex Object Storage use https://storage.yandexcloud.net).
S3_ACCESS_KEY — static access key ID for S3.
S3_SECRET_KEY — secret part of the S3 access key.
YDB_HOSTNAME — gRPC endpoint host (e.g. ydb.serverless.yandexcloud.net).
YDB_PORT — gRPC endpoint port (e.g. 2135).
YDB_DATABASE_NAME — path to your database (e.g. /ru-central1/b1g.../etn...).
YDB_AUTH_TYPE — authentication parameters for YDB, as supported by the Apache Spark driver.
YDB_TARGET_TABLE — path to the target table (e.g. tpch/s1/lineitem).

from pyspark.sql import SparkSession

# Source settings
S3_ENDPOINT = "https://storage.yandexcloud.net"
S3_ACCESS_KEY = ""
S3_SECRET_KEY = ""
S3_BUCKET_NAME = ""
S3_FOLDER_PATH = ""

# Destination settings
YDB_HOSTNAME = ""
YDB_PORT = ""
YDB_DATABASE_NAME = ""
YDB_AUTH_TYPE = ""
YDB_TARGET_TABLE = ""

spark = (SparkSession.builder
    .appName("ydb-import-lineitem-from-parquet")
    .config("spark.jars.packages", "tech.ydb.spark:ydb-spark-connector-shaded:2.0.1,org.apache.hadoop:hadoop-aws:3.3.6,com.amazonaws:aws-java-sdk-bundle:1.12.662")
    # S3 connector configuration (same as for export)
    .config("spark.hadoop.fs.s3a.endpoint", S3_ENDPOINT)
    .config("spark.hadoop.fs.s3a.access.key", S3_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", S3_SECRET_KEY)
    .config("spark.hadoop.fs.s3a.threads.keepalivetime", "60000")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.connection.establish.timeout", "30000")   # 30s
    .config("spark.hadoop.fs.s3a.connection.timeout", "200000")            # 200s
    .config("spark.hadoop.fs.s3a.threads.keepalivetime", "60000")          # 60s
    .config("spark.hadoop.fs.s3a.connection.ttl", "300000")                # 5m
    .config("spark.hadoop.fs.s3a.assumed.role.session.duration", "1800000")# 30m
    .config("spark.hadoop.fs.s3a.multipart.purge.age", "86400000")         # 24h
    .config("spark.hadoop.fs.s3a.retry.interval", "500")                   # 500ms
    .config("spark.hadoop.fs.s3a.retry.throttle.interval", "100")          # 100ms
    .getOrCreate())

# Read data from Parquet files in S3 created during export
df = spark.read.parquet(f"s3a://{S3_BUCKET_NAME}/{S3_FOLDER_PATH}")

# Write data to the target table
(df.write.format("ydb")
    .option("url", f"grpcs://{YDB_HOSTNAME}:{YDB_PORT}{YDB_DATABASE_NAME}?{YDB_AUTH_TYPE}")
    .mode("append")
    .save(YDB_TARGET_TABLE))

spark.stop()

Was the article helpful?

Transfer — streaming NGINX access logs to a table

Streaming queries