Apache Arrow format
Results are returned in the columnar Apache Arrow format (IPC standard, version 5.0) without conversion inside the SDK, which makes large result sets efficient to process.
Prefer this format for:
- Analytical (OLAP) workloads with columnar processing — aggregations, filters, scans over multiple columns on large samples.
- Systems that integrate natively with Apache Arrow.
- Scenarios where throughput for large data transfers matters most.
YQL type mapping
YQL data types are mapped to Apache Arrow types as follows.
Numeric types
| YQL type | Arrow type | Notes |
|---|---|---|
Bool |
uint8 |
true/false encoded as 1/0 |
Int8 |
int8 |
|
Int16 |
int16 |
|
Int32 |
int32 |
|
Int64 |
int64 |
|
Uint8 |
uint8 |
|
Uint16 |
uint16 |
|
Uint32 |
uint32 |
|
Uint64 |
uint64 |
|
Float |
float |
|
Double |
double |
|
Decimal(p, s) |
fixed_size_binary(16) |
120 bits used; ±inf and NaN markers supported |
DyNumber |
string |
String representation of the number |
String types
| YQL type | Arrow type | Notes |
|---|---|---|
String |
binary |
|
Utf8 |
string |
|
Json |
string |
|
JsonDocument |
string |
String form of binary JSON |
Yson |
binary |
|
Uuid |
fixed_size_binary(16) |
16 bytes in mixed-endian order |
Date and time types
| YQL type | Arrow type | Notes |
|---|---|---|
Date |
uint16 |
Day precision |
Date32 |
int32 |
Day precision |
Datetime |
uint32 |
Second precision |
Datetime64 |
int64 |
Second precision |
Timestamp |
uint64 |
Microsecond precision |
Timestamp64 |
int64 |
Microsecond precision |
Interval |
int64 |
Microseconds |
Interval64 |
int64 |
Microseconds |
TzDate |
struct<datetime: uint16, timezone: string> |
Time zone name as a string |
TzDate32 |
struct<datetime: int32, timezone: string> |
Time zone name as a string |
TzDatetime |
struct<datetime: uint32, timezone: string> |
Time zone name as a string |
TzDatetime64 |
struct<datetime: int64, timezone: string> |
Time zone name as a string |
TzTimestamp |
struct<datetime: uint64, timezone: string> |
Time zone name as a string |
TzTimestamp64 |
struct<datetime: int64, timezone: string> |
Time zone name as a string |
Note
In Arrow, basic date/time types are represented as unsigned integers, while extended date/time types types use signed integers.
Container types
| YQL type | Arrow type | Notes |
|---|---|---|
List<T> |
list<item: T> |
|
Tuple<T1, T2, ...> |
struct<field0: T1, field1: T2, ...> |
|
Struct<name: T, ...> |
struct<name: T, ...> |
|
Dict<K, V> |
list<struct<key: K, payload: V>> |
|
Set<T> |
list<struct<key: T, payload: struct<>>> |
Same as Dict<T, Void> |
Variant<T1, ..., Tn> |
dense_union<field0: T1, ...> |
For n <= 128 |
Variant<T1, ..., Tn> |
dense_union<dense_union<field0: T1, ...>, ...> |
For 128 < n <= 16384 |
Variant<name1: T1, ..., nameN: Tn> |
dense_union<name1: T1, ...> |
For n <= 128 |
Variant<name1: T1, ..., nameN: Tn> |
dense_union<dense_union<name1: T1, ...>, ...> |
For 128 < n <= 16384 |
Warning
Variant over tuple or struct cannot be represented in Apache Arrow if the number of child types exceeds 16384 (128 * 128).
Optional and special types
| YQL type | Arrow type | Notes |
|---|---|---|
Null |
null |
Singular type |
Void |
struct<> |
Singular type |
EmptyList |
struct<> |
Singular type |
EmptyDict |
struct<> |
Singular type |
Tagged<T> |
T |
Tag stripped |
Optional<T> |
struct<opt: T> |
When T is Variant, Optional, Pg, or singular |
Optional<T> |
T |
For all other T |
Types in the pg family
All pg types are represented as Arrow string values containing the textual form.
Data compression
You can enable compression for Arrow payloads. Supported codecs:
| Codec | Description |
|---|---|
| None (default) | |
ZSTD |
Zstandard. Compression level is configurable |
LZ4_FRAME |
LZ4. Compression level is not configurable |
Returned schema
The schema has two parts:
- Column list with YQL types — describes the logical types in YQL terms so your application can interpret semantics regardless of Arrow representation details.
- Serialized Arrow RecordBatch schema — describes the binary layout in Apache Arrow terms. Required to deserialize RecordBatch on the client.
Both are needed because YQL and Arrow types do not always map one-to-one.
Apache Arrow examples in the SDK
import pyarrow
pool = ydb.QuerySessionPool(driver)
query = """
SELECT * FROM example ORDER BY Key LIMIT 100;
"""
format_settings = ydb.ArrowFormatSettings(
compression_codec=ydb.ArrowCompressionCodec(ydb.ArrowCompressionCodecType.ZSTD, 10)
)
result = pool.execute_with_retries(
query,
result_set_format=ydb.QueryResultSetFormat.ARROW,
schema_inclusion_mode=ydb.QuerySchemaInclusionMode.FIRST_ONLY,
arrow_format_settings=format_settings,
)
for result_set in result:
schema = pyarrow.ipc.read_schema(pyarrow.py_buffer(result_set.arrow_format_meta.schema))
batch = pyarrow.ipc.read_record_batch(pyarrow.py_buffer(result_set.data), schema)
print(f"Record batch with {batch.num_rows} rows and {batch.num_columns} columns")
String query = "SELECT * FROM example ORDER BY Key LIMIT 100;";
ExecuteQuerySettings settings = ExecuteQuerySettings.newBuilder()
.useApacheArrowFormat(ApacheArrowFormat.zstd())
.build();
try (RootAllocator allocator = new RootAllocator()) {
retryCtx.supplyResult(session -> session
.createQuery(query, TxMode.SERIALIZABLE_RW, Params.empty(), settings)
.execute(new ApacheArrowCompressedPartsHandler(allocator) {
@Override
public void onNextPart(QueryResultPart part) {
ResultSetReader rs = part.getResultSetReader();
while (rs.next()) {
String key = rs.getColumn("Key").getText();
System.out.println("Read row with key " + key);
}
System.out.printf("Record batch with %d rows and %d columns%n",
rs.getRowCount(), rs.getColumnCount());
}
})
).join().getStatus().expectSuccess("execute query problem");
}