Data formats and compression algorithms
This section describes the data formats supported in YDB that are stored in S3, the supported compression algorithms, and the list of supported YQL types for each data format.
Supported data formats
The list of data formats supported in YDB is provided in the table below.
| Format | Reading | Writing |
|---|---|---|
csv |
✓ | |
csv_with_names |
✓ | ✓ |
tsv_with_names |
✓ | ✓ |
json_list |
✓ | ✓ |
json_each_row |
✓ | ✓ |
json_as_string |
✓ | |
parquet |
✓ | ✓ |
raw |
✓ | ✓ |
CSV format
This format is based on the CSV format and differs from csv_with_names in that it does not have a header with column names. All lines in the file are treated as data, and column names are taken from the schema of the query in the order they are declared in SCHEMA.
Note
The csv format is available only for reading data from S3. Writing in this format is not supported.
Example of data (without a header row):
1997,Man_1,Model_1,3000.00
1999,Man_2,Model_2,4900.00
Example query
SELECT
*
FROM external_source.path
WITH
(
FORMAT = "csv",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
The result of the query execution:
| # | Manufacturer | Model | Price | Year |
|---|---|---|---|---|
| 1 | Man_1 | Model_1 | 3000 | 1997 |
| 2 | Man_2 | Model_2 | 4900 | 1999 |
Csv_with_names format
This format is based on the CSV format. The data is arranged in columns, separated by commas, and the first line of the file contains the column names. Unlike the csv format, where column names are taken from the query schema and matched positionally, in the csv_with_names format, column names are read directly from the header line of the file and matched to the schema by name.
Example data:
Year,Manufacturer,Model,Price
1997,Man_1,Model_1,3000.00
1999,Man_2,Model_2,4900.00
Example query
SELECT
*
FROM external_source.path
WITH
(
FORMAT = "csv_with_names",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
The result of the query execution:
| # | Manufacturer | Model | Price | Year |
|---|---|---|---|---|
| 1 | Man_1 | Model_1 | 3000 | 1997 |
| 2 | Man_2 | Model_2 | 4900 | 1999 |
tsv_with_names format
This format is based on the TSV format. The data is arranged in columns and separated by tab characters (code 0x9). The first line of the file contains the column names.
Example of data:
Year Manufacturer Model Price
1997 Man_1 Model_1 3000.00
1999 Man_2 Model_2 4900.00
Example query
SELECT
*
FROM external_source.path
WITH
(
FORMAT = "tsv_with_names",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
The result of the query execution:
| # | Manufacturer | Model | Price | Year |
|---|---|---|---|---|
| 1 | Man_1 | Model_1 | 3000 | 1997 |
| 2 | Man_2 | Model_2 | 4900 | 1999 |
json_list format
This format is based on JSON representation of data. In this format, each file must contain an object in a valid JSON representation.
Example of valid data (data is presented as a list of JSON objects):
[
{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 },
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
]
Example of INcorrect data (each separate line contains a separate JSON object, but these objects are not combined into a list):
{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
Example query
SELECT
*
FROM external_source.path
WITH
(
FORMAT = "json_list",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
Result of query execution:
| # | Manufacturer | Model | Price | Year |
|---|---|---|---|---|
| 1 | Man_1 | Model_1 | 3000 | 1997 |
| 2 | Man_2 | Model_2 | 4900 | 1999 |
json_each_row format
This format is based on JSON representation of data. In this format, each line of the file must contain a separate object in a valid JSON representation, but these objects are not combined into a JSON list. This format is used when transferring data through streaming systems, such as Apache Kafka or YDB topics (../../../datamodel/topic.md).
Example of correct data (each line contains a separate JSON object, but these objects are not combined into a list):
{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 },
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
Example query
SELECT
*
FROM external_source.path
WITH
(
FORMAT = "json_each_row",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
The result of the query execution:
| # | Manufacturer | Model | Price | Year |
|---|---|---|---|---|
| 1 | Man_1 | Model_1 | 3000 | 1997 |
| 2 | Man_2 | Model_2 | 4900 | 1999 |
json_as_string format
This format is based on JSON representation. The json_as_string format does not split the input JSON document into fields, but represents each line of the file as a single JSON object (or a single string). This format is convenient if the list of fields is not the same in all lines and can vary.
In this format, each file should contain:
- a JSON object in a valid JSON representation on each separate line of the file;
- JSON objects in a valid JSON representation, combined into a list.
Example of valid data (data is presented as a list of JSON objects):
{ "Year": 1997, "Attrs": { "Manufacturer": "Man_1", "Model": "Model_1" }, "Price": 3000.0 }
{ "Year": 1999, "Attrs": { "Manufacturer": "Man_2", "Model": "Model_2" }, "Price": 4900.00 }
In this format, the schema of the data being read must consist of only one column with one of the allowed data types; see below for details.
Example query
SELECT
CAST(JSON_VALUE(Data, "$.Year") AS Int32) AS Year,
JSON_VALUE(Data, "$.Attrs.Manufacturer") AS Manufacturer,
JSON_VALUE(Data, "$.Attrs.Model") AS Model,
CAST(JSON_VALUE(Data, "$.Price") AS Double) AS Price
FROM external_source.path
WITH
(
FORMAT = "json_as_string",
SCHEMA =
(
Data Json
)
)
The result of the query execution:
| # | Manufacturer | Model | Price | Year |
|---|---|---|---|---|
| 1 | Man_1 | Model_1 | 3000 | 1997 |
| 2 | Man_2 | Model_2 | 4900 | 1999 |
Parquet format
This format allows reading the contents of files in the Apache Parquet format.
Supported data compression algorithms inside Parquet files for reading from S3:
- No compression
- SNAPPY
- GZIP
- BROTLI
- LZ4
- ZSTD
- LZ4_RAW
Note
Parquet format records will be written using the Snappy compression algorithm.
Example query
SELECT
*
FROM external_source.path
WITH
(
FORMAT = "parquet",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
The result of the query execution:
| # | Manufacturer | Model | Price | Year |
|---|---|---|---|---|
| 1 | Man_1 | Model_1 | 3000 | 1997 |
| 2 | Man_2 | Model_2 | 4900 | 1999 |
Raw format
This format allows you to read the contents of files as is, in "raw" form. Data read in this way can be processed using YQL by dividing it into rows and columns.
This format allows reading the contents of files as is, in raw form. The data read in this way can be processed using YQL tools, splitting into rows and columns.
The size of each of the files read in raw format cannot exceed the overall memory consumption limit for a single query in YDB.
Example query
To parse the following data, where rows are separated by a semicolon and values by a comma:
1997,Man_1,Model_1,3000.00;
1999,Man_2,Model_2,4900.00;
You can use the following query:
$input = SELECT
String::SplitToList( -- splitting each row by ','
String::Strip(RowData), -- removing all whitespace characters from the beginning and end of the rows
","
) AS Row
FROM external_source.path
WITH
(
FORMAT = "raw",
SCHEMA =
(
FileData Utf8
)
)
FLATTEN LIST BY (String::SplitToList(FileData, ";", TRUE AS SkipEmpty) AS RowData); -- splitting the file by ';'
SELECT -- Getting the required columns
CAST(Row[0] AS Int32) AS Year,
Row[1] AS Manufacturer,
Row[2] AS Model,
CAST(Row[3] AS Double) AS Price
FROM $input
The result of the query execution:
| # | Manufacturer | Model | Price | Year |
|---|---|---|---|---|
| 1 | Man_1 | Model_1 | 3000 | 1997 |
| 2 | Man_2 | Model_2 | 4900 | 1999 |
Supported compression algorithms
The use of compression algorithms depends on the file formats. For all file formats except Parquet, the following compression algorithms can be used:
| Algorithm | Name in YDB | Reading | Writing |
|---|---|---|---|
| Gzip | gzip | ✓ | ✓ |
| Zstd | zstd | ✓ | ✓ |
| LZ4 | lz4 | ✓ | ✓ |
| Brotli | brotli | ✓ | ✓ |
| Bzip2 | bzip2 | ✓ | ✓ |
| Xz | xz | ✓ | ✓ |
For Parquet file format, native internal compression algorithms are supported:
| Compression format | Read | Write |
|---|---|---|
| None | ✓ | |
| Snappy | ✓ | ✓ |
| Gzip | ✓ | |
| Brotli | ✓ | |
| LZ4 | ✓ | |
| Zstd | ✓ | |
| LZ4_RAW | ✓ |
In YDB, working with externally compressed Parquet files, such as files of the form <myfile>.parquet.gz or similar, is not supported. All Parquet files must be without external compression.
Supported data types
A table of all supported types when reading from S3 in the query schema:
| Type | csv | csv_with_names | tsv_with_names | json_list | json_each_row | json_as_string | parquet | raw |
|---|---|---|---|---|---|---|---|---|
Bool,Int8, Int16, Int32, Int64,Uint8, Uint16, Uint32, Uint64,Float, Double |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
DyNumber |
✓ | |||||||
String, Utf8, Json |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
JsonDocument |
✓ | |||||||
Yson |
✓ | ✓ | ✓ | |||||
Uuid |
✓ | ✓ | ✓ | ✓ | ||||
Date, Datetime, Timestamp,TzDate, TzDateTime, TzTimestamp |
✓ | ✓ | ✓ | ✓ | ✓ | |||
Interval |
✓ | ✓ | ✓ | ✓ | ✓ | |||
Date32, Datetime64, Timestamp64,Interval64,TzDate32, TzDateTime64, TzTimestamp64 |
✓ | |||||||
Optional<T> |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
The table of all supported types for writing to S3:
| Type | csv_with_names | tsv_with_names | json_list | json_each_row | parquet | raw |
|---|---|---|---|---|---|---|
Bool,Int8, Int16, Int32, Int64,Uint8, Uint16, Uint32, Uint64,Float, Double |
✓ | ✓ | ✓ | ✓ | ✓ | |
DyNumber |
✓ | |||||
String, Utf8, Json |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
JsonDocument |
||||||
Yson |
✓ | ✓ | ||||
Uuid |
✓ | ✓ | ✓ | |||
Date, Datetime, Timestamp,TzDate, TzDateTime, TzTimestamp |
✓ | ✓ | ✓ | ✓ | ||
Interval |
||||||
Date32, Datetime64, Timestamp64,Interval64,TzDate32, TzDateTime64, TzTimestamp64 |
||||||
Optional<T> |
✓ | ✓ | ✓ | ✓ | ✓ |
For all S3 read and write formats, except json_list, the Optional<T> type can be used only if T is a primitive YQL type. For more information about optional types, see the article Data types accepting NULL.