Working with S3 buckets (Yandex Object Storage)

When working with Yandex Object Storage using external data sources, it is convenient to perform prototyping and initial data connection setup.

An example query to read data:

SELECT
    *
FROM
 object_storage.`*.tsv`
WITH
(
 FORMAT = "tsv_with_names",
    SCHEMA =
 (
 ts Uint32,
        action Utf8
 )
);

The list of supported formats and data compression algorithms for reading data in S3 (Yandex Object Storage) is provided in the section Data formats and compression algorithms.

Data model

In Yandex Object Storage, data is stored in files. To read data, you need to specify the data format in the files, compression, and lists of fields. This is done using the following SQL expression:

SELECT
  <expression>
FROM
  <object_storage_connection_name>.`<file_path>`
WITH(
 FORMAT = "<file_format>",
  SCHEMA = (<schema_definition>),
  COMPRESSION = "<compression>")
WHERE
  <filter>;

Where:

  • object_storage_connection_name — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
  • file_path — the path to the file or files inside the bucket. Wildcards * are supported; more details in the section.
  • file_format — the data format in the files.
  • schema_definition — the schema definition of the data stored in the files.
  • compression — the compression format of the files.

Data schema description

The data schema description consists of a set of fields:

  • Field name.
  • Field type.
  • Required data flag.

For example, the data schema below describes a schema field named Year of type Int32 with the requirement that this field must be present in the data:

Year Int32 NOT NULL

If a data field is marked as required (NOT NULL) but this field is missing in the processed file, the processing of such a file will result in an error. If a field is marked as optional (NULL), no error will occur in the absence of the field in the processed file, but the field will take the value NULL.

Data path formats

In YDB, the following data paths are supported:

Path format Description Example
Path ends with a / Path to a directory The path /a addresses all contents of the directory:
/a/b/c/d/1.txt
/a/b/2.csv
Path contains a wildcard character * Any files nested in the path The path /a/*.csv addresses files in directories:
/a/b/c/1.csv
/a/2.csv
/a/b/c/d/e/f/g/2.csv
Path does not end with / and does not contain wildcard characters Path to a single file The path /a/b.csv addresses the specific file /a/b.csv

Example

Example query to read data from S3 (Yandex Object Storage):

SELECT
    *
FROM
    connection.`folder/filename.csv`
WITH(
 FORMAT = "csv_with_names",
    SCHEMA =
 (
        Year Int32,
 Manufacturer Utf8,
 Model Utf8,
 Price Double
 )
);

Where:

  • connection — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
  • folder/filename.csv — the path to the file in the S3 bucket (Yandex Object Storage).
  • SCHEMA — the data schema description in the file.