TPC-H workload

The workload is based on the TPC-H specification, with the queries and table schemas adapted for YDB.

The benchmark generates a workload typical of decision support systems.

Common command options

All commands support the common --path option, which specifies the path to the directory containing tables in the database:

ydb workload tpch --path tpch/s1 ...

Available options

| Name | Description | Default value |
| --- | --- | --- |
| --path or -p | Path to the directory with tables. | / |

Initializing a load test

Before running the benchmark, create the tables:

ydb workload tpch --path tpch/s1 init

See the command description:

ydb workload tpch init --help

Available parameters

| Name | Description | Default value |
| --- | --- | --- |
| --store <value> | Table storage type. Possible values: row, column, external-s3. | column |
| --external-s3-prefix <value> | Relevant only for external tables. Root path to the dataset in S3 storage. | |
| --external-s3-endpoint <value> or -e <value> | Relevant only for external tables. Link to the S3 bucket with data. | |
| --string | Use the String type for text fields. | Utf8 |
| --datetime | Use the Date, Datetime, and Timestamp types for time-related fields. | Date32, Datetime64, Timestamp64 |
| --partition-size | Maximum partition size in megabytes (AUTO_PARTITIONING_PARTITION_SIZE_MB) for row tables. | 2000 |
| --float-mode <value> | Data type to use for fractional fields. Possible values: double and decimal. double uses the Double type; decimal uses Decimal with the precision and scale specified by the benchmark standard. | double |
| --scale | Sets the percentage of the benchmark's data size and workload to use, relative to full scale. | 1 |
| --clear | If tables already exist at the specified path, they are deleted. | |
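
As an illustration of how these parameters combine, the sketch below builds an init command for row tables with Decimal fractional fields and deletes any tables that already exist at the path. The YDB_CMD variable and the tpch_init function are hypothetical helpers, not part of the CLI. By default the command is only printed; setting YDB_CMD=ydb would execute it, assuming connection settings come from a configured profile as in the examples above.

```shell
#!/bin/sh
# Dry run by default: print the command instead of executing it.
# Set YDB_CMD=ydb to run it for real (assumes a configured connection profile).
YDB_CMD="${YDB_CMD:-echo ydb}"

# tpch_init PATH STORE FLOAT_MODE
# Hypothetical helper: (re)creates the TPC-H tables at PATH using only
# options listed in the table above.
tpch_init() {
    # $YDB_CMD is intentionally unquoted so that "echo ydb" splits into two words.
    $YDB_CMD workload tpch --path "$1" init \
        --store "$2" \
        --float-mode "$3" \
        --clear
}

tpch_init tpch/s1 row decimal
```

Because --clear deletes existing tables, printing the command first gives a chance to check the path before anything is removed.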

Loading data into the tables

The ydb CLI generates the data and loads it directly into the tables:

ydb workload tpch --path tpch/s1 import generator --scale 1

See the command description:

ydb workload tpch import --help

Available options

| Name | Description | Default value |
| --- | --- | --- |
| --scale <value> | Data scale. Powers of ten are usually used. | |
| --tables <value> | Comma-separated list of tables to generate. Available tables: customer, nation, order_line, part_psupp, region, supplier. | All tables |
| --proccess-count <value> or -C <value> | Data generation can be split across several processes; this parameter specifies the number of processes. | 1 |
| --proccess-index <value> or -i <value> | Data generation can be split across several processes; this parameter specifies the index of the current process. | 0 |
| --state <path> | Path to the generation state file. If generation is interrupted, loading resumes from the same point when the command is started again. | |
| --clear-state | Relevant only if the --state parameter is specified. Clears the state file and starts loading from the beginning. | |
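
The process-splitting options can be combined as in the following sketch, which starts one generator per process index and waits for all of them. It is a minimal illustration under stated assumptions: indexes are taken to run from 0 to count minus 1 (the default index is 0), each process is given its own state file, and the names YDB_CMD, tpch_generate_part, and tpch_gen_<index>.state are hypothetical. By default the commands are only printed; setting YDB_CMD=ydb would execute them.

```shell
#!/bin/sh
# Dry run by default: print the commands instead of executing them.
# Set YDB_CMD=ydb to run them for real (assumes a configured connection profile).
YDB_CMD="${YDB_CMD:-echo ydb}"

# tpch_generate_part PATH SCALE COUNT INDEX
# Hypothetical helper: runs one slice of the data generation.
# Assumption: every process keeps its own state file so an interrupted
# slice can be resumed independently of the others.
tpch_generate_part() {
    # $YDB_CMD is intentionally unquoted so that "echo ydb" splits into two words.
    $YDB_CMD workload tpch --path "$1" import generator \
        --scale "$2" \
        --proccess-count "$3" \
        --proccess-index "$4" \
        --state "tpch_gen_$4.state"
}

# Start COUNT generator processes in the background, then wait for all of them.
COUNT=4
i=0
while [ "$i" -lt "$COUNT" ]; do
    tpch_generate_part tpch/s1 1 "$COUNT" "$i" &
    i=$((i + 1))
done
wait
```

The same helper could be invoked on separate client machines, one index per machine, as long as every invocation uses the same process count.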

Common parameters of the import command

| Name | Description | Default value |
| --- | --- | --- |
| --upload-threads <value> or -t <value> | The number of execution threads for data preparation. | The number of available cores on the client. |
| --bulk-size <value> | The size of the chunk for sending data, in rows. | 10000 |
| --max-in-flight <value> | The maximum number of data chunks that can be processed simultaneously. | 128 |

Run the load test

Run the load:

ydb workload tpch --path tpch/s1 run

During the test, load statistics are displayed for each query.

See the command description:

ydb workload tpch run --help

Common parameters for all load types

| Name | Description | Default value |
| --- | --- | --- |
| --output <value> | The name of the file where the query execution results will be saved. | results.out |
| --iterations <value> | The number of times each load query will be executed. | 1 |
| --json <name> | The name of the file where query execution statistics will be saved in json format. | Not saved by default |
| --ministat <name> | The name of the file where query execution statistics will be saved in ministat format. | Not saved by default |
| --plan <name> | The name of the file to save the query plan. Files like <name>.<query number>.explain and <name>.<query number>.<iteration number> will be saved in formats: ast, json, svg. | Not saved by default |
| --query-settings <setting> | Query execution settings. Each setting is added as a separate line at the beginning of each query. Use multiple times for multiple settings. | Not specified by default |
| --include | Query numbers or ranges to be executed as part of the load. | All queries executed |
| --exclude | Query numbers or ranges to be excluded from the load. | None excluded by default |
| --executer | Query execution engine. Available values: scan, generic. | generic |
| --verbose or -v | Print additional information to the screen during query execution. | |
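
A run that repeats each query several times and keeps results, statistics, and plans under a common label might look like the sketch below. The YDB_CMD variable, the tpch_run function, and the "baseline" label are hypothetical; only options from the table above are used. By default the command is only printed; setting YDB_CMD=ydb would execute it.

```shell
#!/bin/sh
# Dry run by default: print the command instead of executing it.
# Set YDB_CMD=ydb to run it for real (assumes a configured connection profile).
YDB_CMD="${YDB_CMD:-echo ydb}"

# tpch_run PATH ITERATIONS LABEL
# Hypothetical helper: runs the load and writes all artifacts with a
# shared LABEL prefix so that separate runs do not overwrite each other.
tpch_run() {
    # $YDB_CMD is intentionally unquoted so that "echo ydb" splits into two words.
    $YDB_CMD workload tpch --path "$1" run \
        --iterations "$2" \
        --output "$3.out" \
        --json "$3.json" \
        --plan "$3"
}

tpch_run tpch/s1 3 baseline
```

With this naming, the plan files would follow the pattern described for --plan, for example baseline.<query number>.explain.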

TPC-H-specific options

| Name | Description | Default value |
| --- | --- | --- |
| --ext-query-dir <name> | Directory with external queries for load execution. Queries should be in files named q[1-23].sql. | |

Cleaning up test data

Remove the test data:

ydb workload tpch --path tpch/s1 clean

The command has no parameters.
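
The four steps described in this document (init, import, run, clean) can be chained so that each step starts only if the previous one succeeded. The following is a minimal sketch; the YDB_CMD variable and the tpch_cycle function are hypothetical helpers, and default values are used for every option not shown. By default the commands are only printed; setting YDB_CMD=ydb would execute them.

```shell
#!/bin/sh
# Dry run by default: print the commands instead of executing them.
# Set YDB_CMD=ydb to run them for real (assumes a configured connection profile).
YDB_CMD="${YDB_CMD:-echo ydb}"

# tpch_cycle PATH SCALE
# Hypothetical helper: full benchmark cycle. The && chain stops at the
# first failing step, so clean is skipped if an earlier step fails.
tpch_cycle() {
    tpch_path=$1
    tpch_scale=$2
    # $YDB_CMD is intentionally unquoted so that "echo ydb" splits into two words.
    $YDB_CMD workload tpch --path "$tpch_path" init &&
    $YDB_CMD workload tpch --path "$tpch_path" import generator --scale "$tpch_scale" &&
    $YDB_CMD workload tpch --path "$tpch_path" run &&
    $YDB_CMD workload tpch --path "$tpch_path" clean
}

tpch_cycle tpch/s1 1
```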