Data formats and compression algorithms

This section describes the data formats supported in YDB for storage in S3 and the supported compression algorithms.

Supported data formats

The table below lists the data formats supported in YDB.

Format	Read	Write
`csv_with_names`	✓	✓
`tsv_with_names`	✓
`json_list`	✓
`json_each_row`	✓
`json_as_string`	✓
`parquet`	✓	✓
`raw`	✓

Format csv_with_names

This format is based on the CSV format. Data is placed in columns separated by commas, with column names in the file's first row.

Example data:

Year,Manufacturer,Model,Price
1997,Man_1,Model_1,3000.00
1999,Man_2,Model_2,4900.00

Example query

SELECT
    AVG(Price)
FROM `connection`.`path`
WITH
(
    FORMAT = "csv_with_names",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
)

Query result:

#	Manufacturer	Model	Price	Year
1	Man_1	Model_1	3000	1997
2	Man_2	Model_2	4900	1999

Format tsv_with_names

This format is based on the TSV format. Data is placed in columns separated by tab characters (code 0x9), with column names in the file's first row.

Example data:

Year    Manufacturer    Model   Price
1997    Man_1   Model_1    3000.00
1999    Man_2   Model_2    4900.00

Example query

SELECT
    AVG(Price)
FROM `connection`.`path`
WITH
(
    FORMAT = "tsv_with_names",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
)

Query result:

#	Manufacturer	Model	Price	Year
1	Man_1	Model_1	3000	1997
2	Man_2	Model_2	4900	1999

Format json_list

This format is based on the JSON representation of data. Each file must contain an array of JSON objects.

Example of valid data (data presented as a list of JSON objects):

[
    { "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 },
    { "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
]

Example of INVALID data (each line contains a separate JSON object, but they are not combined into a list):

{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }

Format json_each_row

This format is based on the JSON representation of data. Each file must contain a JSON object on each line without combining them into a JSON array. This format is used for data streaming systems like Apache Kafka or YDB Topics.

Example of valid data (each line contains a separate JSON object):

{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }

Example query

SELECT
    AVG(Price)
FROM `connection`.`path`
WITH
(
    FORMAT = "json_each_row",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
)

Query result:

#	Manufacturer	Model	Price	Year
1	Man_1	Model_1	3000	1997
2	Man_2	Model_2	4900	1999

Format json_as_string

This format is based on the JSON representation of data. The json_as_string format does not split the input JSON document into fields but represents each file line as a single JSON object (or string). This format is helpful when the list of fields is not the same in all rows and may vary.

Each file must contain:

a JSON object on each line, or
JSON objects combined into an array.

Example of valid data (data presented as a list of JSON objects):

{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }

Example query

SELECT
    *
FROM `connection`.`path`
WITH
(
    FORMAT = "json_as_string",
    SCHEMA =
    (
        Data Json
    )
)

Query result:

#	Data
1	`{"Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000, "Year": 1997}`
2	`{"Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900, "Year": 1999}`

Format parquet

This format allows reading the contents of files in Apache Parquet format.

Supported data compression algorithms for Parquet files:

Uncompressed
SNAPPY
GZIP
LZO
BROTLI
LZ4
ZSTD
LZ4_RAW

Example query

SELECT
    AVG(Price)
FROM `connection`.`path`
WITH
(
    FORMAT = "parquet",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
)

Query result:

#	Manufacturer	Model	Price	Year
1	Man_1	Model_1	3000	1997
2	Man_2	Model_2	4900	1999

Format raw

This format allows reading the contents of files as is, in raw form. The data read in this way can be processed using YQL tools, splitting into rows and columns.

This format should be used if the built-in parsing capabilities in YDB are insufficient.

Example query

SELECT
    *
FROM `connection`.`path`
WITH
(
    FORMAT = "raw",
    SCHEMA =
    (
        Data String
    )
)

Query result:

Year,Manufacturer,Model,Price
1997,Man_1,Model_1,3000.00
1999,Man_2,Model_2,4900.00

Supported compression algorithms

The use of compression algorithms depends on the file formats. For all file formats except Parquet, the following compression algorithms can be used:

Algorithm	Name in YDB	Read	Write
Gzip	gzip	✓	✓
Zstd	zstd	✓
LZ4	lz4	✓	✓
Brotli	brotli	✓
Bzip2	bzip2	✓
Xz	xz	✓

For Parquet file format, the following internal compression algorithms are supported:

Compression format	Name in YDB	Read	Write
Raw	raw	✓
Snappy	snappy	✓	✓

YDB does not support working with externally compressed Parquet files, such as files named <myfile>.parquet.gz or similar. All files in Parquet format must be without external compression.

Was the article helpful?

Writing data to S3 buckets

Data partitioning