Working with S3 Buckets (Yandex Object Storage)

To work with S3, you need to set up a data storage connection. There is a DDL for configuring such connections. Next, let's look at the SQL syntax and the management of these settings.

There are two types of buckets in S3: public and private. To connect to a public bucket, use AUTH_METHOD="NONE". To connect to a private bucket, use AUTH_METHOD="AWS". A detailed description of AWS can be found here. AUTH_METHOD="NONE" means that no authentication is used. If AUTH_METHOD="AWS" is specified, several additional parameters are required:

  • AWS_ACCESS_KEY_ID_SECRET_NAME – reference to the name of the secret where AWS_ACCESS_KEY_ID is stored.
  • AWS_SECRET_ACCESS_KEY_SECRET_NAME – reference to the name of the secret where AWS_SECRET_ACCESS_KEY is stored.
  • AWS_REGION – region from which reading is performed, for example, ru-central-1.

To set up a connection to a public bucket, execute the following SQL query. The query creates an external connection named object_storage, which points to a specific S3 bucket named bucket.

CREATE EXTERNAL DATA SOURCE object_storage WITH (
  SOURCE_TYPE="ObjectStorage",
  LOCATION="https://object_storage_domain/bucket/",
  AUTH_METHOD="NONE"
);

To set up a connection to a private bucket, you need to run a few SQL queries. First, create secrets containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

CREATE OBJECT aws_access_id (TYPE SECRET) WITH (value=`<id>`);
CREATE OBJECT aws_access_key (TYPE SECRET) WITH (value=`<key>`);

The next step is to create an external connection named object_storage, which points to a specific S3 bucket named bucket and uses AUTH_METHOD="AWS". The parameters AWS_ACCESS_KEY_ID_SECRET_NAME, AWS_SECRET_ACCESS_KEY_SECRET_NAME, and AWS_REGION are filled in for AWS. The values of these parameters are described above.

CREATE EXTERNAL DATA SOURCE object_storage WITH (
  SOURCE_TYPE="ObjectStorage",
  LOCATION="https://object_storage_domain/bucket/",
  AUTH_METHOD="AWS",
  AWS_ACCESS_KEY_ID_SECRET_NAME="aws_access_id",
  AWS_SECRET_ACCESS_KEY_SECRET_NAME="aws_access_key",
  AWS_REGION="ru-central-1"
);

Using an External Connection to an S3 Bucket

When working with Yandex Object Storage using external data sources, it is convenient to perform prototyping and initial data connection setup.

An example query to read data:

SELECT
  *
FROM
 object_storage.`*.tsv`
WITH
(
  FORMAT = "tsv_with_names",
  SCHEMA =
  (
    ts Uint32,
    action Utf8
  )
);

The list of supported formats and data compression algorithms for reading data in S3 (Yandex Object Storage) is provided in the section Data Formats and Compression Algorithms.

Data Model

In Yandex Object Storage, data is stored in files. To read data, you need to specify the data format in the files, compression, and lists of fields. This is done using the following SQL expression:

SELECT
  <expression>
FROM
  <object_storage_connection_name>.`<file_path>`
WITH(
  FORMAT = "<file_format>",
  COMPRESSION = "<compression>",
  SCHEMA = (<schema_definition>),
  <format_settings>)
WHERE
  <filter>;

Where:

  • object_storage_connection_name — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
  • file_path — the path to the file or files inside the bucket. Wildcards * are supported; more details in the section.
  • file_format — the data format in the files.
  • compression — the compression format of the files.
  • schema_definition — the schema definition of the data stored in the files.
  • format_settings — optional format settings

Data Schema Description

The data schema description consists of a set of fields:

  • Field name.
  • Field type.
  • Required data flag.

For example, the data schema below describes a schema field named Year of type Int32 with the requirement that this field must be present in the data:

Year Int32 NOT NULL

If a data field is marked as required (NOT NULL) but is missing in the processed file, processing such a file will result in an error. If a field is marked as optional (NULL), no error will occur if the field is absent in the processed file, but the field will take the value NULL. The keyword NULL is optional in this context.

Schema Inference

YDB can determine the data schema of the files inside the bucket so that you do not have to specify these fields manually.

Note

Schema inference is available for all data formats except raw and json_as_string. For these formats you must describe the schema manually.

To enable schema inference, use the WITH_INFER parameter:

SELECT
  <expression>
FROM
  <object_storage_connection_name>.`<file_path>`
WITH(
  FORMAT = "<file_format>",
  COMPRESSION = "<compression>",
  WITH_INFER = "true")
WHERE
  <filter>;

Where:

  • object_storage_connection_name — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
  • file_path — the path to the file or files inside the bucket. Wildcards * are supported. For more information, see Data Path Formats Specified in `file_path`.
  • file_format — the data format in the files. All formats except raw and json_as_string are supported.
  • compression — the compression format of the files.

As a result of executing such a query, the names and types of fields will be inferred.

Data Path Formats Specified in file_path

In YDB, the followingdata paths are supported:

Path Format Description Example
Path ends with a / Path to a directory The path /a/ addresses all contents of the directory:
/a/b/c/d/1.txt
/a/b/2.csv
Path contains a wildcard character * Any files nested in the path The path /a/*.csv addresses files in directories:
/a/b/c/1.csv
/a/2.csv
/a/b/c/d/e/f/g/2.csv
Path does not end with / and does not contain wildcard characters Path to a single file The path /a/b.csv addresses the specific file /a/b.csv

Format Settings

In YDB, the following format settings are supported:

Setting name Description Possible values
file_pattern File name template File name template string. Wildcards * are supported.
data.interval.unit Unit for parsing Interval type MICROSECONDS, MILLISECONDS, SECONDS, MINUTES, HOURS, DAYS, WEEKS
data.datetime.format_name Predefined format in which Datetime data is stored POSIX, ISO
data.datetime.format Strftime-like template which defines how Datetime data is stored Formatting string, for example: %Y-%m-%dT%H-%M
date.timestamp.format_name Predefined format in which Timestamp data is stored POSIX, ISO, UNIX_TIME_SECONDS, UNIX_TIME_MILLISECONDS, UNIX_TIME_MICROSECONDS
data.timestamp.format Strftime-like template which defines how Timestamp data is stored Formatting string, for example: %Y-%m-%dT%H-%M-%S
data.date.format The format in which Date data is stored Formatting string, for example: %Y-%m-%d
csv_delimiter Delimeter for csv_with_names format Any character (UTF-8)

You can only specify file_pattern setting if file_path is a path to a directory. Any conversion specifiers supported by strftime(C99) function can be used in formatting strings. In YDB, the following Datetime and Timestamp formats are supported:

Name Description Example
POSIX String in %Y-%m-%d %H:%M:%S format 2001-03-26 16:10:00
ISO Format, corresponding to the ISO 8601 standard 2001-03-26 16:10:00Z
UNIX_TIME_SECONDS Number of seconds that have elapsed since the 1st of january 1970 (00:00:00 UTC) 985623000
UNIX_TIME_MILLISECONDS Number of milliseconds that have elapsed since the 1st of january 1970 (00:00:00 UTC) 985623000000
UNIX_TIME_MICROSECONDS Number of microseconds that have elapsed since the 1st of january 1970 (00:00:00 UTC) 985623000000000

Example

Example query to read data from S3 (Yandex Object Storage):

SELECT
  *
FROM
  connection.`folder/`
WITH(
  FORMAT = "csv_with_names",
  COMPRESSION="gzip"
  SCHEMA =
  (
    Id Int32 NOT NULL,
    UserId Int32 NOT NULL,
    TripDate Date NOT NULL,
    TripDistance Double NOT NULL,
    UserComment Utf8
  ),
  FILE_PATTERN="*.csv.gz",
  `DATA.DATE.FORMAT`="%Y-%m-%d",
  CSV_DELIMITER='/'
);

Where:

  • connection — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
  • folder/filename.csv — the path to the directory in the S3 bucket (Yandex Object Storage).
  • SCHEMA — the data schema description in the file.
  • *.csv.gz — file name template.
  • %Y-%m-%d — format in which Date type is stored in S3.