Data formats and compression algorithms
This section describes the data formats supported in YDB for storage in S3 and the supported compression algorithms.
Supported data formats
The table below lists the data formats supported in YDB.
Format | Read | Write |
---|---|---|
csv_with_names |
✓ | ✓ |
tsv_with_names |
✓ | |
json_list |
✓ | |
json_each_row |
✓ | |
json_as_string |
✓ | |
parquet |
✓ | ✓ |
raw |
✓ |
Format csv_with_names
This format is based on the CSV format. Data is placed in columns separated by commas, with column names in the file's first row.
Example data:
Year,Manufacturer,Model,Price
1997,Man_1,Model_1,3000.00
1999,Man_2,Model_2,4900.00
Example query
SELECT
AVG(Price)
FROM `connection`.`path`
WITH
(
FORMAT = "csv_with_names",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
Query result:
# | Manufacturer | Model | Price | Year |
---|---|---|---|---|
1 | Man_1 | Model_1 | 3000 | 1997 |
2 | Man_2 | Model_2 | 4900 | 1999 |
Format tsv_with_names
This format is based on the TSV
format. Data is placed in columns separated by tab characters (code 0x9
), with column names in the file's first row.
Example data:
Year Manufacturer Model Price
1997 Man_1 Model_1 3000.00
1999 Man_2 Model_2 4900.00
Example query
SELECT
AVG(Price)
FROM `connection`.`path`
WITH
(
FORMAT = "tsv_with_names",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
Query result:
# | Manufacturer | Model | Price | Year |
---|---|---|---|---|
1 | Man_1 | Model_1 | 3000 | 1997 |
2 | Man_2 | Model_2 | 4900 | 1999 |
Format json_list
This format is based on the JSON representation of data. Each file must contain an array of JSON objects.
Example of valid data (data presented as a list of JSON objects):
[
{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 },
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
]
Example of INVALID data (each line contains a separate JSON object, but they are not combined into a list):
{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
Format json_each_row
This format is based on the JSON representation of data. Each file must contain a JSON object on each line without combining them into a JSON array. This format is used for data streaming systems like Apache Kafka or YDB Topics.
Example of valid data (each line contains a separate JSON object):
{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
Example query
SELECT
AVG(Price)
FROM `connection`.`path`
WITH
(
FORMAT = "json_each_row",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
Query result:
# | Manufacturer | Model | Price | Year |
---|---|---|---|---|
1 | Man_1 | Model_1 | 3000 | 1997 |
2 | Man_2 | Model_2 | 4900 | 1999 |
Format json_as_string
This format is based on the JSON representation of data. The json_as_string
format does not split the input JSON document into fields but represents each file line as a single JSON object (or string). This format is helpful when the list of fields is not the same in all rows and may vary.
Each file must contain:
- a JSON object on each line, or
- JSON objects combined into an array.
Example of valid data (data presented as a list of JSON objects):
{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
Example query
SELECT
*
FROM `connection`.`path`
WITH
(
FORMAT = "json_as_string",
SCHEMA =
(
Data Json
)
)
Query result:
# | Data |
---|---|
1 | {"Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000, "Year": 1997} |
2 | {"Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900, "Year": 1999} |
Format parquet
This format allows reading the contents of files in Apache Parquet format.
Supported data compression algorithms for Parquet files:
- Uncompressed
- SNAPPY
- GZIP
- LZO
- BROTLI
- LZ4
- ZSTD
- LZ4_RAW
Example query
SELECT
AVG(Price)
FROM `connection`.`path`
WITH
(
FORMAT = "parquet",
SCHEMA =
(
Year Int32,
Manufacturer Utf8,
Model Utf8,
Price Double
)
)
Query result:
# | Manufacturer | Model | Price | Year |
---|---|---|---|---|
1 | Man_1 | Model_1 | 3000 | 1997 |
2 | Man_2 | Model_2 | 4900 | 1999 |
Format raw
This format allows reading the contents of files as is, in raw form. The data read in this way can be processed using YQL tools, splitting into rows and columns.
This format should be used if the built-in parsing capabilities in YDB are insufficient.
Example query
SELECT
*
FROM `connection`.`path`
WITH
(
FORMAT = "raw",
SCHEMA =
(
Data String
)
)
Query result:
Year,Manufacturer,Model,Price
1997,Man_1,Model_1,3000.00
1999,Man_2,Model_2,4900.00
Supported compression algorithms
The use of compression algorithms depends on the file formats. For all file formats except Parquet, the following compression algorithms can be used:
Algorithm | Name in YDB | Read | Write |
---|---|---|---|
Gzip | gzip | ✓ | ✓ |
Zstd | zstd | ✓ | |
LZ4 | lz4 | ✓ | ✓ |
Brotli | brotli | ✓ | |
Bzip2 | bzip2 | ✓ | |
Xz | xz | ✓ |
For Parquet file format, the following internal compression algorithms are supported:
Compression format | Name in YDB | Read | Write |
---|---|---|---|
Raw | raw | ✓ | |
Snappy | snappy | ✓ | ✓ |
YDB does not support working with externally compressed Parquet files, such as files named <myfile>.parquet.gz
or similar. All files in Parquet format must be without external compression.