Vector load
Allows you to test YDB vector search performance and recall using exact and approximate search. Supports both global and filtered vector indexes.
The workload supports importing vectors from a real dataset (e.g., Wikipedia embeddings) or generating synthetic random vectors. After loading data, you can build a vector index, run search queries to measure recall and performance, and clean up the workload tables.
Command structure
ydb [global options...] workload vector [options...] <subcommand>
Subcommands:
vector YDB vector workload.
├─ init Create and initialize tables for the workload
├─ import Fill workload tables with data
│ ├─ files Import vectors from files
│ └─ generator Generate random vectors
├─ build-index Create and initialize a vector index for the workload
├─ drop-index Drop the vector index created for the workload
├─ run Run YDB vector workload
│ ├─ select Search vectors and measure performance/recall
│ └─ upsert Insert or update vector rows in the table
└─ clean Drop tables created for load testing
Initializing the workload
Create the table for the workload:
ydb workload vector init
Available options
| Name | Description | Default value |
|---|---|---|
| `--table <name>` | Name of the main table that stores vectors. | vector_index_workload |
| `--min-partitions <value>` | Minimum number of table partitions. | 40 |
| `--partition-size <value>` | Target partition size, in MB. | 2000 |
| `--auto-partition <value>` | Enable auto-partitioning by load (1 = enabled, 0 = disabled). | 1 |
| `--prefixed` | Add a prefix column to the table for use with a filtered (prefixed) vector index. | |
| `--clear` | Drop and recreate the table if it already exists. | |
| `--dry-run` | Print the DDL query instead of executing it. | |

The created table has the following schema:
- `id Uint64 NOT NULL`: primary key;
- `embedding String`: serialized embedding vector;
- `prefix Uint64 NOT NULL` (only when `--prefixed` is specified): filter column for prefixed indexes.
Loading data
After initialization, load data into the table. There are two subcommands: files to import vectors from an existing dataset and generator to generate synthetic random vectors.
After import is complete, a vector index named index is automatically built on the embedding column (unless --index-type None is specified).
Importing from files
Import vectors from files (CSV, TSV, or Parquet, optionally gzip-compressed). The dataset must contain id and embedding columns. Any additional columns are ignored.
For CSV/TSV files, embeddings must be encoded as a list of floats, e.g., "[ 1.0, 2.0, 3.0 ]". For Parquet files, embeddings can be a list of float32 values or already serialized YDB binary embeddings.
Example:
ydb workload vector import files --input <path>
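As an illustration of the CSV encoding described above, the following Python sketch writes a small dataset file with `id` and `embedding` columns. The file name `vectors.csv`, the presence of a header row, and the helper names `format_embedding` and `write_dataset` are assumptions made for this example, not part of the CLI.

```python
import csv

def format_embedding(values):
    """Encode a vector as a bracketed list of floats, e.g. "[ 1.0, 2.0, 3.0 ]"."""
    return "[ " + ", ".join(repr(float(v)) for v in values) + " ]"

def write_dataset(path, rows):
    """Write (id, vector) pairs to a CSV file with id and embedding columns."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "embedding"])  # assumed header row
        for row_id, vector in rows:
            # csv.writer quotes the embedding field because it contains commas
            writer.writerow([row_id, format_embedding(vector)])

if __name__ == "__main__":
    write_dataset("vectors.csv", [(1, [1.0, 2.0, 3.0]), (2, [0.5, -0.25, 4.0])])
```

A file produced this way could then be passed via `--input vectors.csv`, with `--vector-dimension` set to match the vector length (3 in this toy example).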
Available options
| Name | Description | Default value |
|---|---|---|
| `--input <path>` or `-i <path>` | Path to the dataset file or directory. Supported formats: CSV/TSV (optionally gzip-compressed), Parquet. Only the id and embedding columns are imported. | Required |
| `--format <format>` | Source file format. One of csv, tsv, parquet. When set, only files matching the specified format are imported from a directory. When not set, the format is auto-detected from file extensions. | |
| `--embedding-column-name <name>` | Alternative source column name for the embedding field in input files. | embedding |
| `--table <name>` | Name of the table to load data into. | vector_index_workload |
| `--index <name>` | Name of the vector index to build after import. | index |
| `--index-type <type>` | Index type to build after import. Possible values: None, KmeansTree. Set to None to skip index building. | KmeansTree |
| `--vector-type <type>` | Type of vectors. One of float, int8, uint8, bit. | float |
| `--vector-dimension <value>` | Vector dimension (size of embedding vectors). | 1024 |
| `--distance <value>` | Distance or similarity function. One of inner_product, cosine, euclidean, manhattan. | inner_product |
| `--kmeans-tree-levels <value>` | Number of levels in the kmeans tree. If not set, auto-detected by the server. See kmeans-tree type. | Auto-detected |
| `--kmeans-tree-clusters <value>` | Number of clusters in kmeans. If not set, auto-detected by the server. See kmeans-tree type. | Auto-detected |
| `--kmeans-tree-covering <value>` | Build a covering index (1 = enabled, 0 = disabled). | 0 |
| `--kmeans-tree-prefixed <value>` | Build a prefixed (filtered) index (1 = enabled, 0 = disabled). The table must have been created with the `--prefixed` option. | 0 |

Note
For more details on --kmeans-tree-* index building parameters, see kmeans-tree type.
Common parameters of the import command
| Name | Description | Default value |
|---|---|---|
| `--upload-threads <value>` or `-t <value>` | The number of execution threads for data preparation. | The number of available cores on the client. |
| `--bulk-size <value>` | The size of the chunk for sending data, in rows. | 10000 |
| `--max-in-flight <value>` | The maximum number of data chunks that can be processed simultaneously. | 128 |
| `--file-output-path <path>` or `-f <path>` | If this option is set, the data is not loaded into the database but is saved to the specified directory instead. | |

Generating synthetic data
Generate random vectors and load them into the table. Vector components are sampled from a uniform distribution and serialized into YDB binary embedding format.
ydb workload vector import generator
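The Python sketch below illustrates what generating one such vector might look like. It assumes that the float variant of the YDB binary embedding format is the little-endian float32 components followed by a single type-tag byte `0x01`, and it assumes components are drawn between -1 and 1. Neither detail is stated on this page, so verify both against the YDB vector serialization documentation before relying on them. The function name `random_float_embedding` is hypothetical, and the sketch does not reproduce the exact vectors the CLI generates for a given `--seed`.

```python
import random
import struct

FLOAT_VECTOR_TAG = b"\x01"  # assumed type tag for float vectors

def random_float_embedding(dimension, rng):
    """Sample components uniformly and pack them as float32 plus a tag byte."""
    components = [rng.uniform(-1.0, 1.0) for _ in range(dimension)]  # assumed range
    return struct.pack(f"<{dimension}f", *components) + FLOAT_VECTOR_TAG

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same bytes
    blob = random_float_embedding(1024, rng)
    print(len(blob))  # 1024 * 4 bytes + 1 tag byte
```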
Available options
| Name | Description | Default value |
|---|---|---|
| `--rows <value>` | Number of rows to generate. | 10000 |
| `--prefix-count <value>` | Number of distinct prefix values for a prefixed index. | 100 |
| `--seed <value>` | Seed for the random number generator. | 42 |
| `--table <name>` | Name of the table to load data into. | vector_index_workload |
| `--index <name>` | Name of the vector index to build after import. | index |
| `--index-type <type>` | Index type to build after import. Possible values: None, KmeansTree. | KmeansTree |
| `--vector-type <type>` | Type of vectors. One of float, int8, uint8, bit. | float |
| `--vector-dimension <value>` | Vector dimension. | 1024 |
| `--distance <value>` | Distance or similarity function. | inner_product |
| `--kmeans-tree-levels <value>` | Number of levels in the kmeans tree. If not set, auto-detected by the server. | Auto-detected |
| `--kmeans-tree-clusters <value>` | Number of clusters in kmeans. If not set, auto-detected by the server. | Auto-detected |
| `--kmeans-tree-covering <value>` | Build a covering index. | 0 |
| `--kmeans-tree-prefixed <value>` | Build a prefixed (filtered) index. | 0 |

Note
For more details on --kmeans-tree-* index building parameters, see kmeans-tree type.
Common parameters of the import command
| Name | Description | Default value |
|---|---|---|
| `--upload-threads <value>` or `-t <value>` | The number of execution threads for data preparation. | The number of available cores on the client. |
| `--bulk-size <value>` | The size of the chunk for sending data, in rows. | 10000 |
| `--max-in-flight <value>` | The maximum number of data chunks that can be processed simultaneously. | 128 |
| `--file-output-path <path>` or `-f <path>` | If this option is set, the data is not loaded into the database but is saved to the specified directory instead. | |

Building a vector index
If the table was loaded with --index-type None, or if you want to build an additional index with different parameters, you can build a vector index on an existing table using the build-index command.
ydb workload vector build-index --distance cosine
Available options
| Name | Description | Default value |
|---|---|---|
| `--table <name>` | Name of the table to build the index on. | vector_index_workload |
| `--index <name>` | Name of the index to create. | index |
| `--vector-type <type>` | Type of vectors. One of float, int8, uint8, bit. | float |
| `--vector-dimension <value>` | Vector dimension. | 1024 |
| `--distance <value>` | Distance or similarity function. | inner_product |
| `--kmeans-tree-levels <value>` | Number of levels in the kmeans tree. If not set, auto-detected by the server. | Auto-detected |
| `--kmeans-tree-clusters <value>` | Number of clusters in kmeans. If not set, auto-detected by the server. | Auto-detected |
| `--dry-run` | Print the DDL query instead of executing it. | |

Note
For more details on --kmeans-tree-* index building parameters, see kmeans-tree type.
Dropping a vector index
Drop a previously built vector index.
ydb workload vector drop-index
Available options
| Name | Description | Default value |
|---|---|---|
| `--table <name>` | Name of the table that holds the index. | vector_index_workload |
| `--index <name>` | Name of the index to drop. | index |
| `--dry-run` | Print the DDL query instead of executing it. | |

Running the workload
Run load testing using one of two modes: select (vector search queries) or upsert (inserting new vector rows).
Search workload
Executes vector search queries against the indexed table. Optionally measures recall by comparing approximate search results with exact (full-scan) results, then runs a performance benchmark.
ydb workload vector run select --recall
Available options
| Name | Description | Default value |
|---|---|---|
| `--table <name>` | Name of the main table with vectors. | vector_index_workload |
| `--index <name>` | Name of the vector index to use. | index |
| `--query-table <name>` | Name of the table with predefined search vectors. If not specified, random rows from the main table are used. | |
| `--targets <value>` | Number of vectors to use as test targets. | 100 |
| `--limit <value>` | Maximum number of nearest vectors to return per query. | 5 |
| `--kmeans-tree-clusters <value>` | Maximum number of clusters to inspect during search (KMeansTreeSearchTopSize). | 1 |
| `--recall` | Measure recall of the approximate search compared to exact search. | |
| `--recall-threads <value>` | Number of concurrent queries during recall measurement. | 10 |
| `--non-indexed` | Take vector settings from the index, but search without the index (full scan). | |
| `--stale-ro` | Read with StaleRO consistency mode. | |

Warning
Pay attention to the --kmeans-tree-clusters parameter — increasing it significantly improves search recall at the expense of speed. Try values from 1 to the number of clusters specified when creating the index.
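One way to follow this advice is to sweep the parameter and compare the recall reported for each value. The Python sketch below only builds and prints the command lines for such a sweep; it uses the `run select` options documented above, while the helper name `sweep_commands` and the choice of doubling steps are assumptions for illustration. Global connection options are omitted and would need to be added for a real cluster.

```python
import shlex

def sweep_commands(max_clusters, table="vector_index_workload", index="index"):
    """Build `run select --recall` commands for cluster counts 1, 2, 4, ... up to max_clusters."""
    commands = []
    k = 1
    while k <= max_clusters:
        commands.append([
            "ydb", "workload", "vector", "run", "select",
            "--table", table, "--index", index,
            "--recall", "--kmeans-tree-clusters", str(k),
        ])
        k *= 2  # assumed step pattern; any increasing sequence works
    return commands

if __name__ == "__main__":
    for argv in sweep_commands(16):
        print(shlex.join(argv))
```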
Common parameters for all load types
| Name | Description | Default value |
|---|---|---|
| `--dry-run` | Do not execute initialization queries, but only display their text. | |
| `--check-canonical` or `-c` | Use a special version of the queries (they have deterministic answers) and compare results with canonical ones. | |
| `--output <value>` | The name of the file where the query execution results will be saved. | results.out |
| `--iterations <value>` | The number of times each load query will be executed. | 1 |
| `--json <name>` | The name of the file where query execution statistics will be saved in JSON format. | Not saved by default |
| `--ministat <name>` | The name of the file where query execution statistics will be saved in ministat format. | Not saved by default |
| `--csv <name>` | The name of the file to save the CSV version of the result table. | Not saved by default |
| `--plan <name>` | The name of the file to save the query plan. Files like `<name>.<query number>.explain` and `<name>.<query number>.<iteration number>` will be saved in formats: ast, json, svg, and table. | Not saved by default |
| `--query-prefix <setting>` | Query prefix. Every prefix is a line that will be added to the beginning of each query. For multiple prefix lines, use this option several times. | Not specified by default |
| `--retries` | Maximum retry count for every request. | 0 |
| `--include` | Names, numbers, or ranges of query numbers to be executed as part of the load. Specified as a comma-separated list, e.g.: 1,2,4-6. | All queries executed |
| `--exclude` | Names, numbers, or ranges of query numbers to be excluded from the load. Specified as a comma-separated list, e.g.: 1,2,4-6. | None excluded by default |
| `--verbose` or `-v` | Print additional information to the screen during query execution. | |
| `--global-timeout <value>` | Global timeout for all queries. Supports time units (e.g., '5s', '1m'). A plain number is interpreted as milliseconds. | Not specified by default. The time is unlimited. |
| `--request-timeout <value>` | Timeout for each iteration of each query. Supports time units (e.g., '5s', '1m'). A plain number is interpreted as milliseconds. | Not specified by default. The time is unlimited. |
| `--threads <value>` or `-t <value>` | The number of parallel threads generating the load. Zero means that queries will be executed in the main thread; otherwise, queries will be mixed. | 0 |
| `--stats <value>` | Extended execution statistics collection mode. Available values: full, profile. | full |

Upsert workload
Continuously inserts new vector rows into the table, generating random embeddings on the fly.
ydb workload vector run upsert
Available options
| Name | Description | Default value |
|---|---|---|
| `--table <name>` | Name of the table to insert into. | vector_index_workload |
| `--index <name>` | Name of the vector index. | index |
| `--bulk-size <value>` | Number of rows per upsert batch. | 100 |
| `--prefixed` | Generate upserts with a prefix column (for prefixed indexes). | |
| `--prefix-count <value>` | Number of distinct prefix values. Used only when `--prefixed` is set. | 1000 |

Common parameters for all load types
| Name | Description | Default value |
|---|---|---|
| `--dry-run` | Do not execute initialization queries, but only display their text. | |
| `--check-canonical` or `-c` | Use a special version of the queries (they have deterministic answers) and compare results with canonical ones. | |
| `--output <value>` | The name of the file where the query execution results will be saved. | results.out |
| `--iterations <value>` | The number of times each load query will be executed. | 1 |
| `--json <name>` | The name of the file where query execution statistics will be saved in JSON format. | Not saved by default |
| `--ministat <name>` | The name of the file where query execution statistics will be saved in ministat format. | Not saved by default |
| `--csv <name>` | The name of the file to save the CSV version of the result table. | Not saved by default |
| `--plan <name>` | The name of the file to save the query plan. Files like `<name>.<query number>.explain` and `<name>.<query number>.<iteration number>` will be saved in formats: ast, json, svg, and table. | Not saved by default |
| `--query-prefix <setting>` | Query prefix. Every prefix is a line that will be added to the beginning of each query. For multiple prefix lines, use this option several times. | Not specified by default |
| `--retries` | Maximum retry count for every request. | 0 |
| `--include` | Names, numbers, or ranges of query numbers to be executed as part of the load. Specified as a comma-separated list, e.g.: 1,2,4-6. | All queries executed |
| `--exclude` | Names, numbers, or ranges of query numbers to be excluded from the load. Specified as a comma-separated list, e.g.: 1,2,4-6. | None excluded by default |
| `--verbose` or `-v` | Print additional information to the screen during query execution. | |
| `--global-timeout <value>` | Global timeout for all queries. Supports time units (e.g., '5s', '1m'). A plain number is interpreted as milliseconds. | Not specified by default. The time is unlimited. |
| `--request-timeout <value>` | Timeout for each iteration of each query. Supports time units (e.g., '5s', '1m'). A plain number is interpreted as milliseconds. | Not specified by default. The time is unlimited. |
| `--threads <value>` or `-t <value>` | The number of parallel threads generating the load. Zero means that queries will be executed in the main thread; otherwise, queries will be mixed. | 0 |
| `--stats <value>` | Extended execution statistics collection mode. Available values: full, profile. | full |

Cleaning up
Drop tables created for load testing:
ydb workload vector clean
Test algorithm
The `run select` mode performs the following stages:
- A test set is generated:
  - If a table with a test set is specified via `--query-table`, the first `--targets` rows are selected from it in primary key order.
  - Otherwise, `--targets` random rows are selected from the main table.
- Recall measurement is performed (if `--recall` is specified):
  - Two queries are executed for each item in the test set.
  - The first query performs an exact vector search (full scan) based on vector distance, forming the `R_exact` result set.
  - The second query performs an approximate vector search using the vector index, forming the `R_approx` result set.
  - If a filtered vector index is selected, both queries scan only the entries with matching filter column values.
  - Recall of the approximate search is calculated using the following formula, where `|A|` is the number of elements in the set A and `A ∩ B` is the intersection of sets A and B:

    $$\mathrm{Recall} = \frac{|R_{\mathrm{approx}} \cap R_{\mathrm{exact}}|}{|R_{\mathrm{exact}}|}$$

- Performance measurement is performed:
  - Indexed search queries are run for a specified duration and with a specified number of parallel threads.
  - Each query is executed for a random entry from the test set.
  - The average number of requests per second (RPS) and response time percentiles are calculated.
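The recall stage can be sketched in a few lines of Python. This is a minimal illustration of the set-overlap calculation described above, not the CLI's implementation; the function names and the sample row IDs are made up for the example, and averaging the per-query values follows the description of `--recall` as an average overlap ratio.

```python
def recall(exact_ids, approx_ids):
    """Return |R_approx ∩ R_exact| / |R_exact| for one test vector."""
    exact = set(exact_ids)
    if not exact:
        return 0.0  # convention chosen here for an empty exact result
    return len(exact & set(approx_ids)) / len(exact)

def average_recall(pairs):
    """Average the per-query recall over (exact_ids, approx_ids) pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(recall(e, a) for e, a in pairs) / len(pairs)

if __name__ == "__main__":
    results = [
        ([1, 2, 3, 4, 5], [1, 2, 3, 9, 8]),            # 3 of 5 exact neighbours found
        ([10, 11, 12, 13, 14], [10, 11, 12, 13, 14]),  # all 5 found
    ]
    print(average_recall(results))
```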
Usage examples
Example with generated data
- Initialize the workload table:

    ydb workload vector init

- Generate and load synthetic vectors:

    ydb workload vector import generator --rows 100000 --distance cosine

- Run the search workload:

    ydb workload vector run select --recall

  Output example:

    Recall: 0.8950
    Window  Txs  Txs/Sec  Retries  Errors  p50(ms)  p95(ms)  p99(ms)  pMax(ms)
    1       100  100      0        0       5        8        12       15
    2       98   98       0        0       5        9        13       16
    ...
    Total   Txs  Txs/Sec  Retries  Errors  p50(ms)  p95(ms)  p99(ms)  pMax(ms)
    10      980  98.0     0        0       5        9        14       18

  Column descriptions:
  - `Window`: sequential number of the time window (e.g., each second or fixed interval).
  - `Txs`: number of transactions successfully completed in this window (or the total across all windows in the bottom section).
  - `Txs/Sec`: transaction rate per second for the given window (or the average for the total).
  - `Retries`: number of automatic retries performed due to temporary errors (e.g., conflicts or throttling).
  - `Errors`: number of unrecoverable errors encountered.
  - `p50(ms)`: median (50th percentile) transaction latency in milliseconds.
  - `p95(ms)`: 95th percentile latency in milliseconds.
  - `p99(ms)`: 99th percentile latency in milliseconds.
  - `pMax(ms)`: maximum observed latency within the window (or the overall maximum for the total).

- Run the upsert workload:

    ydb workload vector run upsert

- Clean up:

    ydb workload vector clean
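To make the output columns from the search step more concrete, the Python sketch below derives the same kinds of figures from a list of per-transaction latencies. It assumes the nearest-rank percentile method and a one-second window; the page does not say which percentile method or window length the CLI uses, and the function names are hypothetical.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile of a non-empty ascending list; p is in (0, 100]."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def window_stats(latencies_ms, window_seconds=1.0):
    """Summarize one window of successful transaction latencies (milliseconds)."""
    if not latencies_ms:
        raise ValueError("at least one latency is required")
    values = sorted(latencies_ms)
    return {
        "Txs": len(values),
        "Txs/Sec": len(values) / window_seconds,
        "p50(ms)": percentile(values, 50),
        "p95(ms)": percentile(values, 95),
        "p99(ms)": percentile(values, 99),
        "pMax(ms)": values[-1],
    }

if __name__ == "__main__":
    print(window_stats([5, 4, 6, 5, 8, 12, 5, 15, 5, 4]))
```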
Example with an external dataset
- Prepare tables and load data. See the recipe Vector index with external dataset loading.
  Examples of creating vector indexes are available on the Vector Indexes documentation page.

- Run the search workload:

    ydb -e grpc://hostname:2135 -d /Root/testdb workload vector run select \
      --table wikipedia --index idx_vector \
      --query-table wikipedia_sample --recall

  Output example:

    Recall: 0.8950
    Window  Txs  Txs/Sec  Retries  Errors  p50(ms)  p95(ms)  p99(ms)  pMax(ms)
    1       100  100      0        0       5        8        12       15
    2       98   98       0        0       5        9        13       16
    ...
    Total   Txs  Txs/Sec  Retries  Errors  p50(ms)  p95(ms)  p99(ms)  pMax(ms)
    10      980  98.0     0        0       5        9        14       18
Remarks
- The search workload generates SQL queries that select nearest rows by vector distance from table `--table` using index `--index`.
- You don't need to specify column names (embedding, filter, primary key columns) or the distance function: they are automatically extracted from the index definition.
- Random selection of test vectors from `--table` works only if its primary key is numeric.
- If `--query-table` is specified, it must have the same vector and filter column names as `--table`.
- Recall measurement (`--recall`) is performed as a separate stage before the main performance test and shows the average overlap ratio (from 0 to 1) of indexed search results with full-scan search results across all selected test vectors.
Preparing a test sample from a large table
Warning
This step is only required if you don't already have a test dataset and if the primary key of the main table isn't numeric, because auto-generation of the test set doesn't work for non-numeric keys.
In this case, you can prepare a test set by selecting random rows from the main table manually. Example query to create the table for test samples:
CREATE TABLE vector_index_sample (
id Uint64 NOT NULL,
prefix Uint64 NOT NULL,
embedding String NOT NULL,
PRIMARY KEY (id)
);
A query to fill it with approximately 1000 rows from a large table:
INSERT INTO vector_index_sample
SELECT id, prefix, embedding FROM large_table
WHERE RandomNumber(id) < 0xFFFFFFFFFFFFFFFF / <number_of_rows_in_table> * 1000;
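The arithmetic behind the `WHERE` condition can be checked with a short Python sketch. It assumes that `RandomNumber` yields a uniformly distributed unsigned 64-bit value and that the division in the query is integer division; the row count used below is hypothetical.

```python
UINT64_MAX = 0xFFFFFFFFFFFFFFFF

def sampling_threshold(total_rows, sample_rows):
    """Mirror the query's expression: UINT64_MAX / total_rows * sample_rows (integer division)."""
    return UINT64_MAX // total_rows * sample_rows

def selection_probability(total_rows, sample_rows):
    """Probability that a uniform 64-bit value falls below the threshold."""
    return sampling_threshold(total_rows, sample_rows) / (UINT64_MAX + 1)

if __name__ == "__main__":
    n = 50_000_000  # hypothetical number of rows in the large table
    p = selection_probability(n, 1000)
    print(p, p * n)  # per-row probability and expected sample size
```

Each row is kept independently with probability of roughly 1000 divided by the row count, which is why the sample contains approximately, not exactly, 1000 rows.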
Tip
You can find the approximate number of rows in the table in table statistics without running SELECT COUNT(*).