Working with S3 buckets (Yandex Object Storage)
To work with S3, you first need to set up a connection to the data storage; such connections are configured with DDL statements. The sections below cover the SQL syntax for creating these connections and for managing their settings.
There are two types of buckets in S3: public and private. To connect to a public bucket, use `AUTH_METHOD="NONE"`. To connect to a private bucket, use `AUTH_METHOD="AWS"`. A detailed description of `AWS` can be found here. `AUTH_METHOD="NONE"` means that no authentication is used. If `AUTH_METHOD="AWS"` is specified, several additional parameters are required:
- `AWS_ACCESS_KEY_ID_SECRET_NAME` – reference to the name of the secret where `AWS_ACCESS_KEY_ID` is stored.
- `AWS_SECRET_ACCESS_KEY_SECRET_NAME` – reference to the name of the secret where `AWS_SECRET_ACCESS_KEY` is stored.
- `AWS_REGION` – region from which reading is performed, for example, `ru-central-1`.
To set up a connection to a public bucket, execute the following SQL query. It creates an external connection named `object_storage` that points to an S3 bucket named `bucket`.
CREATE EXTERNAL DATA SOURCE object_storage WITH (
    SOURCE_TYPE="ObjectStorage",
    LOCATION="https://object_storage_domain/bucket/",
    AUTH_METHOD="NONE"
);
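For Yandex Object Storage specifically, the public endpoint is storage.yandexcloud.net, so a concrete `LOCATION` might look like the sketch below (the bucket name `bucket` is a placeholder):

CREATE EXTERNAL DATA SOURCE object_storage WITH (
    SOURCE_TYPE="ObjectStorage",
    -- Yandex Object Storage public endpoint; replace `bucket` with your bucket name
    LOCATION="https://storage.yandexcloud.net/bucket/",
    AUTH_METHOD="NONE"
);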
To set up a connection to a private bucket, you need to run several SQL queries. First, create secrets containing `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
CREATE OBJECT aws_access_id (TYPE SECRET) WITH (value=`<id>`);
CREATE OBJECT aws_access_key (TYPE SECRET) WITH (value=`<key>`);
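If the access keys are later rotated, the stored secrets can be updated in place. The sketch below assumes YDB's `ALTER OBJECT ... (TYPE SECRET) SET value` syntax; consult the secret management reference for the exact form:

-- Assumed syntax for updating existing secrets after key rotation
ALTER OBJECT aws_access_id (TYPE SECRET) SET value = `<new_id>`;
ALTER OBJECT aws_access_key (TYPE SECRET) SET value = `<new_key>`;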
The next step is to create an external connection named `object_storage` that points to an S3 bucket named `bucket` and uses `AUTH_METHOD="AWS"`. The `AWS` parameters `AWS_ACCESS_KEY_ID_SECRET_NAME`, `AWS_SECRET_ACCESS_KEY_SECRET_NAME`, and `AWS_REGION` described above are filled in as well.
CREATE EXTERNAL DATA SOURCE object_storage WITH (
    SOURCE_TYPE="ObjectStorage",
    LOCATION="https://object_storage_domain/bucket/",
    AUTH_METHOD="AWS",
    AWS_ACCESS_KEY_ID_SECRET_NAME="aws_access_id",
    AWS_SECRET_ACCESS_KEY_SECRET_NAME="aws_access_key",
    AWS_REGION="ru-central-1"
);
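If a connection needs to be recreated, for example with different parameters, the existing one can first be removed. A short sketch, assuming the standard DROP statement for external data sources:

-- Removes the external connection; the bucket and its data are not affected
DROP EXTERNAL DATA SOURCE object_storage;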
Using an external connection to an S3 bucket
External data sources are convenient for prototyping and for the initial setup of connections to data in Yandex Object Storage.
An example query to read data:
SELECT
    *
FROM
    object_storage.`*.tsv`
WITH
(
    FORMAT = "tsv_with_names",
    SCHEMA =
    (
        ts Uint32,
        action Utf8
    )
);
The list of supported formats and data compression algorithms for reading data in S3 (Yandex Object Storage) is provided in the section Data formats and compression algorithms.
Data model
In Yandex Object Storage, data is stored in files. To read the data, you need to specify the format of the data in the files, the compression, and the list of fields. This is done with the following SQL expression:
SELECT
    <expression>
FROM
    <object_storage_connection_name>.`<file_path>`
WITH(
    FORMAT = "<file_format>",
    COMPRESSION = "<compression>",
    SCHEMA = (<schema_definition>))
WHERE
    <filter>;
Where:
- `object_storage_connection_name` — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
- `file_path` — the path to the file or files inside the bucket. Wildcards `*` are supported; for more information, see Data path formats.
- `file_format` — the data format in the files.
- `compression` — the compression format of the files.
- `schema_definition` — the schema definition of the data stored in the files.
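To make the template concrete, the hedged example below reads gzip-compressed TSV files and filters the rows; the path, schema, and filter value are illustrative:

SELECT
    ts, action
FROM
    object_storage.`logs/*.tsv`
WITH(
    FORMAT = "tsv_with_names",
    COMPRESSION = "gzip",  -- one of the supported compression algorithms
    SCHEMA = (ts Uint32, action Utf8))
WHERE
    action = "login";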
Data schema description
The data schema description consists of a set of fields, each described by:
- Field name.
- Field type.
- A flag indicating whether the field is required.
For example, the data schema below describes a field named `Year` of type `Int32` with the requirement that this field must be present in the data:
Year Int32 NOT NULL
If a data field is marked as required (`NOT NULL`) but is missing from the file being processed, processing that file results in an error. If a field is marked as optional (`NULL`) and is absent from the file, no error occurs and the field takes the value `NULL`. The keyword `NULL` is optional in this context.
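For instance, a schema mixing required and optional fields might look like this (the field names are illustrative; the `Model` and `Comment` declarations are equivalent, since `NULL` may be omitted):

SCHEMA =
(
    Year Int32 NOT NULL,  -- required: a file without this field fails with an error
    Model Utf8 NULL,      -- optional: absent values become NULL
    Comment Utf8          -- optional: the NULL keyword is omitted
)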
Schema inference
YDB can infer the data schema of the files inside the bucket so that you do not have to describe the fields manually.
Note

Schema inference is available for all data formats except `raw` and `json_as_string`. For these formats, you must describe the schema manually.
To enable schema inference, use the `WITH_INFER` parameter:
SELECT
    <expression>
FROM
    <object_storage_connection_name>.`<file_path>`
WITH(
    FORMAT = "<file_format>",
    COMPRESSION = "<compression>",
    WITH_INFER = "true")
WHERE
    <filter>;
Where:
- `object_storage_connection_name` — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
- `file_path` — the path to the file or files inside the bucket. Wildcards `*` are supported; for more information, see Data path formats.
- `file_format` — the data format in the files. All formats except `raw` and `json_as_string` are supported.
- `compression` — the compression format of the files.
As a result of executing such a query, the names and types of fields will be inferred.
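A minimal sketch of such a query, with an illustrative file mask:

SELECT
    *
FROM
    object_storage.`*.csv`
WITH(
    FORMAT = "csv_with_names",
    WITH_INFER = "true");  -- column names and types are inferred from the files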
Data path formats
In YDB, the following data paths are supported:
| Path format | Description | Example |
|---|---|---|
| Path ends with a `/` | Path to a directory | The path `/a/` addresses all contents of the directory: `/a/b/c/d/1.txt`, `/a/b/2.csv` |
| Path contains a wildcard character `*` | Any files nested in the path | The path `/a/*.csv` addresses files in directories: `/a/b/c/1.csv`, `/a/2.csv`, `/a/b/c/d/e/f/g/2.csv` |
| Path does not end with a `/` and does not contain wildcard characters | Path to a single file | The path `/a/b.csv` addresses the specific file `/a/b.csv` |
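For example, with the `object_storage` connection from the sections above, the three path forms could be queried as follows (the paths are illustrative; `WITH_INFER` is used so that schemas can be omitted):

SELECT * FROM object_storage.`a/`      WITH(FORMAT = "csv_with_names", WITH_INFER = "true");  -- whole directory
SELECT * FROM object_storage.`a/*.csv` WITH(FORMAT = "csv_with_names", WITH_INFER = "true");  -- wildcard match
SELECT * FROM object_storage.`a/b.csv` WITH(FORMAT = "csv_with_names", WITH_INFER = "true");  -- single file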
Example
Example query to read data from S3 (Yandex Object Storage):
SELECT
    *
FROM
    connection.`folder/filename.csv`
WITH(
    FORMAT = "csv_with_names",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
);
Where:
- `connection` — the name of the external data source leading to the S3 bucket (Yandex Object Storage).
- `folder/filename.csv` — the path to the file in the S3 bucket (Yandex Object Storage).
- `SCHEMA` — the description of the data schema in the file.