ArrowStream
| Input | Output | Alias |
|---|---|---|
| ✔ | ✔ |
Description
ArrowStream is Apache Arrow's "stream mode" format. It is designed for in-memory stream processing.
Example usage
In the example below we use the forex dataset which is available in the
ClickHouse SQL playground. You can connect to it
remotely with clickhouse-client using the host sql-clickhouse.clickhouse.com
and the user demo (which has no password). The forex table lives in the
forex database, so we select it as the default database:
The forex table stores currency exchange rates. We can inspect its size and
how well it compresses on disk by querying system.columns:
Unlike the Arrow "file mode" format, which
requires the whole result before it can be read, ArrowStream is delivered as a
sequence of record batches that a consumer can read incrementally as they
arrive. This makes it well suited to streaming a query result straight into a
visualization or analytics tool without first materializing the entire dataset.
To stream the result, send the query over ClickHouse's HTTP interface with a
POST request and read the response as an Arrow stream. We disable compression
of the Arrow output via the
output_format_arrow_compression_method
setting so that consumers can decode batches directly as they are received.
The ArrowStream output is raw binary, so rather than printing it to the
terminal we pipe it into a consumer. The stream is self-describing (it carries
its own schema), so here we pipe it straight into
clickhouse-local, which reads the
incoming batches with --input-format ArrowStream and queries them as a table.
The forex table is large, so we bound the remote query with a WHERE
predicate and a LIMIT to keep this example small:
The same stream can be consumed incrementally by any Arrow-aware client, which
reads it batch-by-batch rather than buffering the result in full. For example,
using the Apache Arrow JavaScript library, a
RecordBatchReader yields each record batch as soon as it is streamed from the
server:
For a full walkthrough of streaming ArrowStream data from ClickHouse into a
real-time visualization with Perspective, see
the blog post
Streaming real-time visualizations with ClickHouse, Apache Arrow and Perspective.
Format settings
ArrowStream shares the same format settings as the Arrow format.
| Setting | Description | Default |
|---|---|---|
input_format_arrow_allow_missing_columns | Allow missing columns while reading Arrow input formats | 1 |
input_format_arrow_case_insensitive_column_matching | Ignore case when matching Arrow columns with CH columns. | 0 |
input_format_arrow_import_nested | Obsolete setting, does nothing. | 0 |
input_format_arrow_skip_columns_with_unsupported_types_in_schema_inference | Skip columns with unsupported types while schema inference for format Arrow | 0 |
output_format_arrow_compression_method | Compression method for Arrow output format. Supported codecs: lz4_frame, zstd, none (uncompressed) | lz4_frame |
output_format_arrow_date_as_uint16 | Write Date values as plain 16-bit numbers (read back as UInt16), instead of converting to a 32-bit Arrow DATE32 type (read back as Date32). | 0 |
output_format_arrow_fixed_string_as_fixed_byte_array | Use Arrow FIXED_SIZE_BINARY type instead of Binary for FixedString columns. | 1 |
output_format_arrow_low_cardinality_as_dictionary | Enable output LowCardinality type as Dictionary Arrow type | 0 |
output_format_arrow_string_as_string | Use Arrow String type instead of Binary for String columns | 1 |
output_format_arrow_unsupported_types_as_binary | Output types having no conversion as raw binary data. If false - such types would raise UNKNOWN_TYPE exception. | 1 |
output_format_arrow_use_64_bit_indexes_for_dictionary | Always use 64 bit integers for dictionary indexes in Arrow format | 0 |
output_format_arrow_use_signed_indexes_for_dictionary | Use signed integers for dictionary indexes in Arrow format | 1 |