Secondary indexes

Indexes are auxiliary structures within databases that help find data by certain criteria without having to search an entire database, and retrieve sorted samples without actually sorting, which would require processing the entire dataset.

Data in a YDB dataset is always sorted by the primary key. That means that retrieving any entry from the table with specified field values comprising the primary key always takes the minimum fixed time, regardless of the total number of table entries. Indexing by the primary key makes it possible to retrieve any consecutive range of entries in ascending or descending order of the primary key. Execution time for this operation depends only on the number of retrieved entries rather than on the total number of table values.

To use a similar feature with any field or combination of fields, additional indexes called secondary indexes can be created for them

In transactional systems, indexes are used to limit or avoid performance degradation and increase of query cost as your data grows.

This article describes the main operations with secondary indexes and gives references to detailed information on each operation. For more information about various types of secondary indexes and their specifics, see Secondary indexes in the Concepts section.

Creating secondary indexes

A secondary index is a data schema object that can be set when creating a table with the CREATE TABLE YQL command or added to it later with the ALTER TABLE YQL command.

The table index add command is supported in the YDB CLI.

Since an index contains its own data derived from table data, when creating an index on an existing table with data, an operation is performed to initially build an index. This may take a long time. This operation is executed in the background and you can keep working with the table while it's in progress. However, you can't use the new index until it's created.

An index can only be used in the order of the fields included in it. If an index contains two fields, such as a and b, you can effectively use it for queries such as:

  • WHERE a = $var1 AND b = $var2;
  • WHERE a = $var1;
  • WHERE a > $var1 and other comparison operators.
  • WHERE a = $var1 AND b > $var2 and any other comparison operators in which the first field must be checked for equality.

This index can't be used in the following queries:

  • WHERE b = $var1;
  • WHERE a > $var1 AND b > $var2, which is equivalent to WHERE a > $var1 in terms of applying the index.
  • WHERE b > $var1.

Considering the above, there's no use in pre-indexing all possible combinations of table columns to speed up the execution of any query. An index is always a compromise between the lookup and write speed and the storage space occupied by the data. Indexes are created for specific search queries and criteria made by an app in the database.

Using secondary indexes when selecting data

For a table to be accessed by a secondary index, its name must be explicitly specified in the VIEW section after the table name as described in the article about the YQL SELECT statement. For example, a query to retrieve orders from the orders table by the specified customer ID (id_customer) looks like this:

DECLARE $customer_id AS Uint64;
FROM   orders VIEW idx_customer AS o
WHERE  o.id_customer = $customer_id

where idx_customer is the name of the secondary index on the orders table with the id_customer field specified first.

If no VIEW section is specified, making a query like this requires a full scan of the orders table .

In transactional applications, such information queries are executed with paginated data results. This eliminates an increase in the cost and time of query execution if the number of entries that meet the filtering conditions grows. The described approach to writing paginated queries using the primary key can also be applied to columns that are part of a secondary index.

Checking the cost of queries

Any query made in a transactional application should be checked in terms of the number of I/O operations it performed in the database and how much CPU was used to run it. You should also make sure these indicators don't continuously grow as the database volume grows. YDB returns statistics required for the analysis after running each query.

If you use the YDB CLI, select the --stats option to enable printing statistics after executing the yql command. All YDB SDKs also contain structures with statistics returned after running a query. If you make a query in the UI, you'll see a tab with statistics next to the results tab.

Updating data using a secondary index

The UPDATE, UPSERT, and REPLACE YQL statements don't permit indicating the use of a secondary index to perform a search for data, so an attempt to make an UPDATE ... WHERE indexed_field = $value will result in a full scan of the table. To avoid this, you can first run SELECT by index to get the primary key value and then UPDATE by the primary key. You can also use UPDATE ON.

To update data in the table1 table, run the query:

$to_update = (
    SELECT pk_field, $f1 AS field1, $f2 AS field2, ...
    FROM   table1 VIEW idx_field3
    WHERE  field3 = $f3)

UPDATE table1 ON SELECT * FROM $to_update

Deleting data using a secondary index

To delete data by secondary index, use SELECT with a predicate by secondary index and then call DELETE ON.

To delete all data about series with zero views from the series table, run the query:

SELECT series_id
FROM series VIEW views_index
WHERE views = 0;

Performance of data writes to tables with secondary indexes

You need additional data structures to enable secondary indexes. Support for these structures makes table data update operations more costly.

During synchronous index updates, a transaction is only committed after all the necessary data is written in both a table and synchronous indexes. As a result, it takes longer to execute it and makes it necessary to use distributed transactions even if adding or updating entries in a single partition.

Indexes that are updated asynchronously let you use single-shard transactions. However, they only guarantee eventual consistency and still put a load on the database.