Working with S3 buckets (Yandex Object Storage)

When working with Yandex Object Storage using external data sources, it is convenient to perform prototyping and initial data connection setup.

An example query to read data:

SELECT
  *
FROM
 object_storage.`*.tsv`
WITH
(
  FORMAT = "tsv_with_names",
  SCHEMA =
  (
    ts Uint32,
    action Utf8
  )
);

The list of supported formats and data compression algorithms for reading data in S3 (Yandex Object Storage) is provided in the section Data formats and compression algorithms.

Data model

In Yandex Object Storage, data is stored in files. To read data, you need to specify the data format in the files, compression, and lists of fields. This is done using the following SQL expression:

SELECT
  <expression>
FROM
  <object_storage_connection_name>.`<file_path>`
WITH(
  FORMAT = "<file_format>",
  COMPRESSION = "<compression>",
  SCHEMA = (<schema_definition>))
WHERE
  <filter>;

Where:

object_storage_connection_name — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
file_path — the path to the file or files inside the bucket. Wildcards * are supported; more details in the section.
file_format — the data format in the files.
compression — the compression format of the files.
schema_definition — the schema definition of the data stored in the files.

Data schema description

The data schema description consists of a set of fields:

Field name.
Field type.
Required data flag.

For example, the data schema below describes a schema field named Year of type Int32 with the requirement that this field must be present in the data:

Year Int32 NOT NULL

If a data field is marked as required (NOT NULL) but is missing in the processed file, processing such a file will result in an error. If a field is marked as optional (NULL), no error will occur the field is absent in the processed file, but the field will take the value NULL. The keyword NULL is optional in this context.

Schema inference

YDB can determine the data schema of the files inside the bucket so that you do not have to specify these fields manually.

Note

Schema inference is available for all data formats except raw and json_as_string. For these formats you must describe the schema manually.

To enable schema inference, use the WITH_INFER parameter:

SELECT
  <expression>
FROM
  <object_storage_connection_name>.`<file_path>`
WITH(
  FORMAT = "<file_format>",
  COMPRESSION = "<compression>",
  WITH_INFER = "true")
WHERE
  <filter>;

Where:

object_storage_connection_name — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
file_path — the path to the file or files inside the bucket. Wildcards * are supported; For more information, see Data path formats.
file_format — the data format in the files. All formats except raw and json_as_string are supported.
compression — the compression format of the files.

As a result of executing such a query, the names and types of fields will be inferred.

Data path formats

In YDB, the following data paths are supported:

Path format	Description	Example
Path ends with a `/`	Path to a directory	The path `/a` addresses all contents of the directory: `/a/b/c/d/1.txt` `/a/b/2.csv`
Path contains a wildcard character `*`	Any files nested in the path	The path `/a/*.csv` addresses files in directories: `/a/b/c/1.csv` `/a/2.csv` `/a/b/c/d/e/f/g/2.csv`
Path does not end with `/` and does not contain wildcard characters	Path to a single file	The path `/a/b.csv` addresses the specific file `/a/b.csv`

Example

Example query to read data from S3 (Yandex Object Storage):

SELECT
  *
FROM
  connection.`folder/filename.csv`
WITH(
  FORMAT = "csv_with_names",
  SCHEMA =
  (
    Year Int32,
    Manufacturer Utf8,
    Model Utf8,
    Price Double
  )
);

Where:

connection — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
folder/filename.csv — the path to the file in the S3 bucket (Yandex Object Storage).
SCHEMA — the data schema description in the file.

Was the article helpful?

Working with YDB databases

Reading data through external tables