ArroyoSystems/arroyo v0.12.0

These release notes are also available on the Arroyo blog

The Arroyo team is thrilled to announce that Arroyo 0.12.0 is now available! This release introduces Python UDFs, which allow Python developers to extend the engine with custom functions, callable from SQL. We've also added support for Protobuf as an ingestion format, new JSON syntax, custom state TTLs for updating SQL queries, among many other features, improvements, and fixes.

Excited to try things out? Getting started is easier than ever with new native packages for Linux and macOS, complementing our existing Docker images and Helm chart.

Arroyo is a community project, and we're very grateful to all of our contributors. We are particularly excited to welcome four new contributors to the project in this release:

Thanks to all of our contributors for this release:

And now, all of the details on what's new in Arroyo 0.12!

Features

Python UDFs

Arroyo has long supported user-defined functions (UDFs), allowing users to extend the engine by writing new scalar, aggregate, and async functions. We've been amazed by the diversity of UDFs that our users have come up with, including

  • Parsers for custom formats
  • Ad-hoc joins with other databases
  • Calling AI inference APIs
  • Sinks to other data systems
  • Integrating specialized financial libraries

Among many other use cases. But until now, Arroyo only supported UDFs written in Rust. We love Rust, but we know it's not the most popular (or second, or third, or...) language for data users.

So in 0.12, we're thrilled to support UDFs written in Python.

It looks like this:

from arroyo_udf import udf

@udf
def levenshtein(s1: str, s2: str) -> int:
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)

    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

which can then be used in SQL:

SELECT levenshtein(username, email) as distance
from events;

Python UDFs take a series of arguments, each of which can be a SQL column or literal. The argument types and return type are determined by the function signature and type hints, including support for Optional to indicate how nullability should interact with the UDF.
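For instance, here's a minimal sketch of a UDF with a nullable input and output (the function itself is illustrative, and we're assuming that SQL NULLs arrive as None when a parameter is typed Optional):

from typing import Optional

from arroyo_udf import udf

@udf
def email_domain(email: Optional[str]) -> Optional[str]:
    # Optional[str] signals that this UDF handles NULLs itself (an assumption
    # for this sketch); returning None produces a SQL NULL in the output column.
    if email is None or '@' not in email:
        return None
    return email.split('@')[-1].lower()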

We've also updated the Web UI to add a Python UDF editor.

What we're releasing in 0.12 is just the start. In our next release, we will add support for Python UDAFs, as well as direct PyArrow support for high-performance Python integrations without deserialization or copying overhead.

For more details on Python UDFs, see the documentation.

We're excited to see what you build with Python UDFs!

Protobuf

Protocol buffers—better known as protobuf—is a fast, space-efficient binary data format that's commonly used in event pipelines. And as of Arroyo 0.12, it's now natively supported as an ingestion format, along with support for reading protobuf schemas from Confluent Schema Registry.

This expands on our existing set of formats, including JSON, Avro, and raw string and bytes.

All protobuf features are supported, including optionals, lists, structs, enums, and imports.

See the full format documentation here.

  • Protobuf deserialization support by @mwylde in #715
  • Add confluent schema registry support for protobuf by @mwylde in #724
  • Add protoc to docker image for protobuf support by @mwylde in #725

JSON syntax

Sometimes you don't have a nice, proper schema for the JSON flowing through your data pipelines (it's ok, we've all been there). Arroyo still has you covered, with unstructured JSON fields (type JSON). And now the experience is even better, thanks to a suite of new JSON functions and integration of Postgres-style JSON syntax.

It looks like this:

CREATE TABLE events (
  value JSON
) WITH (
  connector = 'kafka',
  bootstrap_servers = 'kafka:9092',
  topic = 'events',
  format = 'json',
  type = 'source',
  'json.unstructured' = 'true'
);

SELECT
  -- using the json_get function
  json_get(value, 'user', 'name')::TEXT as name,
  -- or using the -> operator
  value->'user'->'email' as email,
  -- field presence check can be done with the ? operator
  value ? 'id' as has_id
FROM events;

There are several ways to access JSON fields:

  • json_get(json: str, *keys: str | int) takes a JSON-encoded string and a series of keys to traverse,
    returning a partially-parsed JSON value that can be further processed without needing to be re-parsed
  • The Postgres-style -> operator is a synonym for json_get, and can be efficiently chained
  • json_get_{str|int|bool|float|json}(json: str, *keys: str | int) is a set of convenience functions
    that return the JSON value as a specific type
  • The ->> operator is a synonym for json_get_str
  • SQL type casts can also be used with json_get to get an output of the desired type, like
    json_get(value, 'a')::INT

We've also added a couple other handy functions for working with JSON:

  • json_contains(json: str, *keys: str | int) (aliased to the ? operator)
  • json_length(json: str, *keys: str | int) -> int
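Putting a few of these together against the unstructured events table above (the field names are illustrative), a quick sketch:

SELECT
  -- ->> extracts the value as text, like json_get_str
  value->'user'->>'name' as name,
  -- a cast converts the json_get result to the desired SQL type
  json_get(value, 'user', 'age')::INT as age,
  -- json_length counts the elements at the given path
  json_length(value, 'tags') as tag_count,
  -- json_contains is the function form of the ? operator
  json_contains(value, 'user', 'email') as has_email
FROM events;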

Under the hood, these new functions use the ultra-fast JSON parser jiter and deserialize data into an efficient parsed format, avoiding the need to repeatedly re-parse data to access multiple fields.

See the json function docs for more detail.

  • Add new json functions from datafusion-functions-json by @mwylde in #640

Custom State TTLs

Arroyo has two intersecting streaming SQL semantics, which we call dataflow SQL and updating SQL. Dataflow SQL is based around time-oriented windows, which encode a notion of completeness via watermark-based processing. In other words, for a particular window, the watermark tells us that we can process it and then drop the data for that window from our state.
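For example, a windowed aggregation like the following (a sketch assuming Arroyo's tumble window function) only needs to hold state until the watermark passes the end of each one-minute window:

SELECT user, count(*)
FROM events
GROUP BY user, tumble(interval '1 minute');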

But updating semantics have no in-built notion of completeness. These are queries like

SELECT user, count(*)
FROM events
GROUP BY user;

The semantics of the query are that, for every user, we should be able to output the complete count of their events going back to the beginning of time. But it's generally intractable in a streaming system to actually keep all of the data for all time without blowing up our state.

To make these queries tractable, Arroyo supports a TTL (time-to-live) for updating state, which controls how long we will keep data around after seeing a particular key (like the user/count pair in that example). In previous Arroyo releases this was configurable only at the cluster level, but now it can be modified at a per-query level with SET updating_ttl.

So if we want to keep the state around for longer, we can write:

SET updating_ttl = '7 days';

SELECT user, count(*)
FROM events
GROUP BY user;

IRSA support

AWS has a powerful (and achingly complex) system for managing permissions across services called IAM. Kubernetes has a completely different access-control system based on roles and service accounts.

So...how do you manage permissions when running a Kubernetes cluster on AWS? For example, if you want to run a certain stream processing engine that, perhaps, needs to access an S3 bucket?

The answer is IAM Roles for Service Accounts (IRSA), a predictable mouthful of an acronym from the marketing folks who brought you the streaming service called Kinesis Data Analytics for Apache Flink.

But naming aside, IRSA lets you attach an IAM role to a Kubernetes service account. And in 0.12, it's now fully supported in Arroyo. This provides a secure, granular way to control Arroyo's access to your AWS resources.

Setting this up is relatively simple on the AWS side. On the Arroyo side, you just need to set a custom service account in your Helm configuration, for example:

serviceAccount:
  create: false
  name: arroyo-pod-sa

role:
  create: false

See the docs for a full walk-through.

  • Add support for IRSA authentication for S3 by @mwylde in #694

Kafka producer metadata

Messages produced to Kafka include data—a sequence of bytes typically encoded in a format like JSON, Avro, or Protobuf—as well as metadata, including a timestamp and key. The Arroyo Kafka connector now supports setting these metadata fields, enabling more powerful integrations. For example:

create table sink (
    timestamp TIMESTAMP NOT NULL,
    user TEXT,
    event TEXT
) with (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',
    format = 'json',
    'sink.timestamp_field' = 'timestamp',
    'sink.key_field' = 'user',
    type = 'sink',
    topic = 'events'
);


insert into sink
select timestamp, user, event
from source;

  • Add support for setting the key and timestamp for messages written to kafka by @mwylde in #698

SQL improvements

This release brings various SQL improvements, including:

  • Support for escaped-string literals, like E'line1\nline2'
  • Adding the array_resize function
  • Supporting stride in array_slice, and changing indices to be 1-based
  • Adding ends_with and instr string functions
  • Adding the array_reverse function
  • Adding the position function for strings
  • Adding nvl2 function, which returns the second argument if the first is not null
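As a quick illustration of a few of these (reusing the events source from earlier; the values are arbitrary, and we're assuming DataFusion's E'...' escape-string syntax and make_array constructor):

SELECT
  -- escape-string literal containing a newline
  E'line1\nline2' as two_lines,
  -- 1-based slice from position 1 to 5 with a stride of 2
  array_slice(make_array(1, 2, 3, 4, 5), 1, 5, 2) as every_other,
  array_reverse(make_array(1, 2, 3)) as reversed,
  ends_with('arroyo.dev', '.dev') as is_dev,
  -- position of the substring, or 0 if not found
  instr('hello world', 'world') as pos,
  -- nvl2 returns the second argument when the first is not null, else the third
  nvl2(NULL, 'present', 'missing') as fallback
FROM events;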

Other improvements

  • Allow cluster-domain for services to be configured in values.yaml by @jr200 in #677
  • Statically link libsasl for kafka by @mwylde in #680
  • Add option to set raw_datums in avro schema editor by @mwylde in #685
  • UI: open docs in a new tab by @haoxins in #683
  • Upgrade to DF 40 and Arrow 52 by @mwylde in #702
  • Improve robustness of SSE source to server EOFs by @mwylde in #711
  • Sync logging by @mwylde in #718
  • Switch to jemalloc by @mwylde in #719
  • Allow setting worker environment variables in kubernetes by @mwylde in #721
  • Validate the regex pattern field of file system source by @haoxins in #722
  • Enhance the field description of the file system connector by @haoxins in #726
  • Add validation when json format has unstructured flag enabled by @tiagogcampos in #735
  • Add line number and file info in log message by @zhuliquan in #709
  • Upgrade kinesis dependency to avoid multiple AWS client versions by @mwylde in #747
  • Show the finished jobs in the dashboard by @haoxins in #727

Fixes

  • Fix helm chart deploys when pullsecrets is empty by @mwylde in #676
  • Use an inner join to prevent controller from loading partially-constructed jobs by @mwylde in #681
  • Fix pending requests in preview that could stall out the webui by @mwylde in #684
  • Fix regression in updating equi-join optimization by @mwylde in #686
  • Fix checkpoint cleanup failure (#688) by @mwylde in #689
  • Use named ports in Helm configuration (#682) by @mwylde in #690
  • Use correct (relative) delta paths when writing to object stores by @mwylde in #693
  • Clean up local backup files for pipeline clusters by @mwylde in #706
  • Fix race condition that could stall scheduling by @mwylde in #712
  • Enable crypto_expressions feature for datafusion-functions by @mwylde in #713
  • Update DF fork to f2792f6d for metrics memory leak fix by @mwylde in #717
  • Fix recursion limit reached decode error by @zhuliquan in #716
  • Fix state cleanup when checkpoint location is a non-root directory in object store by @mwylde in #731
  • Upgrade parquet version to fix checkpoint failure regression by @mwylde in #739

Project infra and tooling

  • Update pnpm action to fix CI failure by @mwylde in #675
  • Add experimental_allow_proto3_optional compile option by @zhuliquan in #704
  • GitHub actions binaries by @MarcoLugo in #720
  • Move binary builds to buildjet and add arm build by @mwylde in #732
  • Cargo clippy and allow lint clippy:to_string_trait_impl by @zhuliquan in #737
  • Run clippy after build to speed up CI by @mwylde in #740

Full Changelog: v0.11.0...v0.12.0
