apollographql/router v1.16.0 on GitHub

🚀 Features

Add ability to transmit un-redacted errors from federated traces to Apollo Studio

When using subgraphs which are enabled with Apollo Federated Tracing, the error messages within those traces will be redacted by default.

New configuration (telemetry.apollo.errors.subgraph.all.redact, which defaults to true) enables or disables the redaction mechanism. A similar option (telemetry.apollo.errors.subgraph.all.send, which also defaults to true) enables or disables transmission of the error to Studio entirely.

The error messages returned to the clients are not changed or redacted from their previous behavior.

To send subgraphs' federated trace error messages to Studio without redaction, you can set the following configuration:

telemetry:
  apollo:
    errors:
      subgraph:
        all:
          send: true # (true = Send to Studio, false = Do not send; default: true)
          redact: false # (true = Redact full error message, false = Do not redact; default: true)

It is also possible to configure this per-subgraph using a subgraphs map at the same level as all in the configuration, much like other sections of the configuration which have subgraph-specific capabilities:

telemetry:
  apollo:
    errors:
      subgraph:
        all:
          send: true
          redact: false # Disable redaction as a default.  The `accounts` service enables it below.
        subgraphs:
          accounts: # Applies to the `accounts` subgraph, overriding the `all` global setting.
            redact: true # Redact messages from the `accounts` service.
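
As an illustration of how a per-subgraph entry might combine with `all`, here is a minimal Python sketch. It is not the Router's implementation. It assumes that a key set under `subgraphs.<name>` replaces only that key, and that unset keys fall back to `all` and then to the documented defaults. The helper name `effective_error_settings` and the `products` subgraph are hypothetical.

```python
# Documented defaults for both options.
DEFAULTS = {"send": True, "redact": True}


def effective_error_settings(subgraph_config, subgraph_name):
    """Resolve `send`/`redact` for one subgraph (hypothetical helper).

    Assumption: per-subgraph keys override `all` key by key.
    """
    settings = dict(DEFAULTS)
    settings.update(subgraph_config.get("all") or {})
    per_subgraph = subgraph_config.get("subgraphs") or {}
    settings.update(per_subgraph.get(subgraph_name) or {})
    return settings


# The `telemetry.apollo.errors.subgraph` section from the example above.
config = {
    "all": {"send": True, "redact": False},
    "subgraphs": {"accounts": {"redact": True}},
}

accounts = effective_error_settings(config, "accounts")  # override applies
products = effective_error_settings(config, "products")  # falls back to `all`
```

Under these assumptions, `accounts` keeps `send: true` from `all` but has redaction re-enabled, while any subgraph without its own entry uses the `all` values unchanged.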

By @bnjjj in #3011

Introduce `response.is_primary` Rhai helper for working with deferred responses (Issue #2935) (Issue #2936)

A new Rhai response.is_primary() helper has been introduced that returns false when the current chunk being processed is a deferred response chunk. Put another way, it returns false if the chunk is a follow-up to the initial primary response, sent while fulfilling a @defer'd fragment in a larger operation. For the initial response, is_primary() returns true. This aims to provide the right primitives so users can write more defensive error checking. This helper relates to a bug fix noted in the Fixes section below.

By @garypen in #2945

Time-based forced hot-reload for "chaos" testing

For testing purposes, the Router can now artificially be forced to hot-reload (as if the configuration or schema had changed) at a configured time interval. This can help reproduce issues like reload-related memory leaks. We don't recommend using this in any production environment. (If you are compelled to use it in production, please let us know about your use case!)

The new configuration section for this "chaos" testing is (and will likely remain) marked as "experimental":

experimental_chaos:
  force_hot_reload: 1m

By @SimonSapin in #2988

Provide helpful console output when using "preview" features, just like "experimental" features

This expands on the existing mechanism originally introduced in #2242, which supports the notion of an "experimental" feature, and makes it compatible with the notion of "preview" features.

When preview or experimental features are used, an INFO-level log is emitted during startup that lists which features are in use and shows URLs to their GitHub discussions, for feedback. Additionally, the router config experimental and router config preview CLI sub-commands list all such features in the current Router version, regardless of which are used in a given configuration file.

For more information about launch stages, please see the documentation here: https://www.apollographql.com/docs/resources/product-launch-stages/

By @o0Ignition0o, @abernix, and @SimonSapin in #2960

Report `operationCountByType` counts to Apollo Studio (PR #2979)

This adds the ability for Studio to track operation counts broken down by type of operations (e.g., query vs mutation). Previously, we only reported total operation count.

By @bnjjj in #2979

🐛 Fixes

Update to Federation v2.4.2

This update to Federation v2.4.2 fixes a potential bug when an @interfaceObject type has a @requires. This might be encountered when an @interfaceObject type has a field with a @requires and the query requests that field only for some specific implementations of the corresponding interface. In this case, the generated query plan was sometimes invalid and could result in an invalid query to a subgraph. If the subgraph was an Apollo Server implementation, this led to the subgraph producing the error "The _entities resolver tried to load an entity for type X, but no object or interface type of that name was found in the schema".

By @abernix in #2910

Fix handling of deferred response errors from Rhai scripts (Issue #2935) (Issue #2936)

If a Rhai script raised an error while processing a deferred response (i.e., an operation which uses @defer), the Router ignored the error and returned None in the stream of results. This had two unfortunate consequences:

the error was not propagated to the client
the stream was terminated (silently)

With this fix we now capture the error and still propagate the response to the client. This fix also adds support for the is_primary() method which may be invoked on both supergraph_service() and execution_service() responses. It may be used to avoid implementing exception handling for header interactions and to determine if a response is_primary() (i.e., first) or not.

e.g.:

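    // Option 1: check is_primary() before touching headers.
    if response.is_primary() {
        print(`all response headers: ${response.headers}`);
    } else {
        print(`don't try to access headers`);
    }

    // Option 2: catch the error raised when headers are accessed
    // on a deferred response.
    try {
        print(`all response headers: ${response.headers}`);
    }
    catch(err) {
        if err == "cannot access headers on a deferred response" {
            print(`don't try to access headers`);
        }
    }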

Note
This is a minimal example for purposes of illustration which doesn't exhaustively check all error conditions. An exception handler should always handle all error conditions.

By @garypen in #2945

Fix incorrectly placed "message" in Rhai JSON-formatted logging (Issue #2777)

This fixes a bug where Rhai logging incorrectly put the message of the log into the out attribute when serialized as JSON. Previously, the message field showed rhai_{{level}} (e.g., rhai_info), despite there being a separate level field in the JSON structure.

The impact of this fix can be seen in this example where we call log_info() in a Rhai script:

  log_info("this is info");

Previously, this would result in a log as follows, with the text of the message set within out, rather than message.

{"timestamp":"2023-04-19T07:46:15.483358Z","level":"INFO","message":"rhai_info","out":"this is info"}

After the change, the message is correctly within message. The level continues to be available at level. We've also added a target property which shows the file that produced the log entry:

{"timestamp":"2023-04-19T07:46:15.483358Z","level":"INFO","message":"this is info","target":"src/rhai_logging.rhai"}

By @garypen in #2975

Deferred responses now utilize compression, when requested (Issue #1572)

We previously had to disable compression on deferred responses due to an upstream library bug. To fix this, we've replaced tower-http's CompressionLayer with a custom stream transformation. This is necessary because tower-http uses async-compression under the hood, which buffers data until the end of the stream, analyzes it, then writes it, which yields better compression. However, this is wholly incompatible with a core concept of the multipart protocol for @defer, which requires chunks to be sent as soon as possible. To support that, we need to compress chunks independently.

This extracts parts of the codec module of async-compression, which so far is not public, and makes a streaming wrapper above it that flushes the compressed data on every response within the stream.
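
The idea of flushing compressed data after every response in the stream can be illustrated with Python's standard `zlib` module. This is not the Router's Rust implementation. It is a sketch that assumes gzip encoding and uses made-up payloads in place of real multipart chunks.

```python
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # selects the gzip container


def compress_stream(chunks):
    """Yield compressed bytes for each chunk, flushing after every chunk."""
    compressor = zlib.compressobj(wbits=GZIP_WBITS)
    for chunk in chunks:
        # Z_SYNC_FLUSH forces out everything buffered so far, so the
        # receiver can decode this chunk without waiting for the next one.
        yield compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
    yield compressor.flush()  # finish the stream and write the gzip trailer


# Made-up payloads standing in for a primary and a deferred response.
chunks = [b'{"data":{"me":{"id":"1"}},"hasNext":true}', b'{"hasNext":false}']

# Decode piece by piece, as a client reading the stream would.
decompressor = zlib.decompressobj(GZIP_WBITS)
decoded = [decompressor.decompress(piece) for piece in compress_stream(chunks)]
```

Each of the first two decoded pieces is already a complete chunk, which is the property the multipart protocol for @defer needs. The trade-off is a somewhat lower compression ratio than compressing the whole stream at once.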

By @Geal in #2986

Update the `h2` dependency to fix a potential Denial-of-Service (DoS) vulnerability

Proactively addresses the advisory in https://rustsec.org/advisories/RUSTSEC-2023-0034, though we have no evidence that suggests it has been exploited on any Router deployment.

By @Geal in #2982

Rate limit errors emitted from OpenTelemetry (Issue #2953)

When a batch span exporter is unable to accept a span because its buffer is full, it emits an error. These errors can be very frequent and could potentially impact performance. To mitigate this, OpenTelemetry errors are now rate limited to one every ten seconds, per error type.
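
A per-type limit of one error every ten seconds can be sketched as follows. This is an illustration in Python, not the Router's code. The class name `ErrorRateLimiter` and the error type strings are hypothetical, and the clock is injectable only so the example is deterministic.

```python
import time


class ErrorRateLimiter:
    """Allow at most one report per error type per interval."""

    def __init__(self, interval_seconds=10.0, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock
        self.last_emitted = {}  # error type -> time of last emitted report

    def should_emit(self, error_type):
        now = self.clock()
        last = self.last_emitted.get(error_type)
        if last is not None and now - last < self.interval:
            return False  # still inside the window: suppress
        self.last_emitted[error_type] = now
        return True


# Fake clock so the example does not depend on real time.
fake_now = [0.0]
limiter = ErrorRateLimiter(clock=lambda: fake_now[0])

first = limiter.should_emit("span_buffer_full")   # emitted
second = limiter.should_emit("span_buffer_full")  # suppressed, same window
other = limiter.should_emit("export_timeout")     # different type, emitted
fake_now[0] = 10.0
later = limiter.should_emit("span_buffer_full")   # window elapsed, emitted
```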

By @BrynCooke in #2954

Improved messaging when a request is received without an operation (Issue #2941)

The message displayed when a request is sent to the Router without an operation has been improved. This is a developer experience improvement, since users (especially those using GraphQL for the first time) might send a request to the Router using a tool that isn't GraphQL-aware, or might just have their API tool of choice misconfigured.

Previously, the message stated "missing query string", but now more helpfully suggests sending either a POST or GET request and specifying the desired operation as the query parameter (i.e., either in the POST data or in the query string parameters for GET queries).
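
To make the two request shapes concrete, the sketch below builds a GET URL and a POST body that each carry an operation in `query`. It only constructs the values and sends nothing. The endpoint is an assumption (a Router listening locally on `http://127.0.0.1:4000/`), the helper names are hypothetical, and a real POST would normally also set a `Content-Type: application/json` header.

```python
import json
from urllib.parse import urlencode

# Assumption: a Router listening on a local address.
ROUTER_URL = "http://127.0.0.1:4000/"
OPERATION = "query { __typename }"


def as_get_url(endpoint, operation):
    """GET: the operation goes in the `query` query-string parameter."""
    return endpoint + "?" + urlencode({"query": operation})


def as_post_body(operation):
    """POST: the operation goes in the `query` field of a JSON body."""
    return json.dumps({"query": operation}).encode("utf-8")


get_url = as_get_url(ROUTER_URL, OPERATION)
post_body = as_post_body(OPERATION)
```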

By @kushal-93 in #2955

Traffic shaping configuration fix for global `experimental_enable_http2`

We've resolved a case where the experimental_enable_http2 feature wouldn't properly apply when configured with a global configuration.

Huge thanks to @westhechiang, @leggomuhgreggo, @vecchp and @davidvasandani for discovering the issue and finding a reproducible test case!

By @o0Ignition0o in #2976

Limit the memory usage of the `apollo` OpenTelemetry exporter (PR #3006)

We've added a new LRU cache in place of a Vec for sub-span data to avoid keeping all events for a span in memory, since we don't need it for our computations.
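
The memory bound that an LRU cache provides, compared with an ever-growing list, can be sketched as follows. This is a generic Python illustration rather than the exporter's Rust code. The class name `LruCache`, the capacity, and the span keys are all hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Bounded mapping that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)  # mark as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop least recently used

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]


cache = LruCache(capacity=2)
cache.put("span-1", "data-1")
cache.put("span-2", "data-2")
cache.get("span-1")            # touch span-1 so span-2 becomes the oldest
cache.put("span-3", "data-3")  # exceeds capacity: span-2 is evicted
```

However many entries are inserted, the cache never holds more than `capacity` items, which is what keeps memory usage bounded.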

By @bnjjj in #3006