Bloom skip indexes

Bloom skip indexes are local auxiliary structures based on a Bloom filter that speed up selective queries by skipping data fragments that cannot contain the requested value. Unlike global secondary indexes, they act as read-time filters on the main table and reduce the amount of data that must be actually read.

Types

YDB supports two types:

  • bloom_filter: a filter over exact values of the indexed column; use for equality and IN (see when to use which).
  • bloom_ngram_filter: a filter over n-grams of a string column (String, Utf8); use for substring and LIKE pattern search on column-oriented tables.

Parameters and defaults

Full list of WITH (...) parameters and defaults:

  • bloom_filter
    • false_positive_probability: Target false-positive rate of the filter: the fraction of fragments that are not skipped even though the requested value is not there (range (0, 1)). A lower value reduces extra reads but increases index size.
      • If omitted, the default is 0.0001 for row-oriented tables and 0.1 for column-oriented tables. The stricter default on row-oriented tables matches OLTP point lookups; on column-oriented tables it reflects analytical scans, where a larger trade-off between index size and fragment skipping is acceptable.
    • Indexed columns: YQL types that support equality comparison, except Yson, Json, and JsonDocument (see comparison operators). One column in a column-oriented table; multiple columns in a row-oriented table.
  • bloom_ngram_filter (String and Utf8 columns; column-oriented tables only)
    • ngram_size: N-gram length, an integer from 3 to 8 (default 3).
    • false_positive_probability: Target false-positive rate (range (0, 1); default 0.1).
    • case_sensitive: Whether n-grams respect character case: true or false (default true).

Creation syntax: CREATE TABLE, ALTER TABLE ADD INDEX.

To change parameters after creation, use ALTER INDEX.

Examples

Create a table with a bloom_filter index:

CREATE TABLE events (
    id Uint64,
    resource_id Utf8,
    message Utf8,
    PRIMARY KEY (id),
    INDEX idx_bloom LOCAL USING bloom_filter
        ON (resource_id)
        WITH (false_positive_probability = 0.01)
);

Add a bloom_ngram_filter index to an existing table:

ALTER TABLE events
  ADD INDEX idx_ngram LOCAL USING bloom_ngram_filter
  ON (message)
  WITH (
    ngram_size = 3,
    false_positive_probability = 0.01,
    case_sensitive = true
  );

Alter index parameters:

ALTER TABLE events ALTER INDEX idx_ngram SET (
    ngram_size = 4,
    false_positive_probability = 0.005,
    case_sensitive = false
);

Parameter tuning

If queries still read too much data after deploying the index, try lowering false_positive_probability. If the index uses too much space, try raising it.

Use ALTER INDEX to change values, as in the example above.

Tips:

  • Start with false_positive_probability = 0.01, then adjust based on real read metrics and index size.
  • For bloom_ngram_filter, ngram_size typically starts at 3; increasing it can make filtering stricter for longer substrings.
  • Change one parameter at a time and compare results under the same workload.

Behavior and limitations

  • The index is always local (LOCAL); there is no global variant.
  • Queries do not use the VIEW <index> syntax (unlike, for example, fulltext indexes).
  • The filter is applied on read only to data fragments where an index block was already stored during a write or portion merge, other fragments are not skipped by this index until merge.

Limitations

  • COVER (...) and extra index columns are not supported.
  • Column-oriented tables allow only one indexed column. Row-oriented tables allow multiple indexed columns.
  • bloom_ngram_filter is not supported on row-oriented tables.

Additional Materials