In this release, we are excited to make three new features generally available to GraphOS Enterprise customers running self-hosted routers: JWT Authentication, Distributed APQ Caching, and External Coprocessor support. Read more about these features below, and see our documentation for additional information.
Features
GraphOS Enterprise: JWT Authentication
JWT Authentication is now generally available to GraphOS Enterprise customers running self-hosted routers. To fully account for the changes between the initial experimental release and the final generally available implementation, we recommend removing the experimental configuration and re-implementing it by following the documentation below to ensure proper configuration and that all security requirements are met.
Router v1.12 adds support for JWT validation, claim extraction, and custom security policies in Rhai scripting to reject bad traffic at the edge of the graph, for enhanced zero-trust and defense-in-depth. Extracting claims once in the router and securely forwarding them to subgraphs can reduce the operational burden on backend API teams, reduce JWT processing, and speed up response times with improved header matching for increased query deduplication.
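For illustration, a minimal configuration sketch in the generally available shape (the URL and issuer values are placeholders; see the documentation below for all available options) might look like this:

authentication:
  jwt:
    jwks:
      # One or more JWKS sources; local files and HTTP(S) URLs are supported.
      - url: https://idp.example.com/.well-known/jwks.json
        issuer: https://idp.example.com # optional issuer check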
See the JWT Authentication documentation for information on setting up this GraphOS Enterprise feature.
GraphOS Enterprise: Distributed APQ Caching
Distributed APQ Caching is now generally available to GraphOS Enterprise customers running self-hosted routers. To fully account for the changes between the initial experimental releases and the final generally available implementation, we recommend removing the experimental configuration and re-implementing it by following the documentation below to ensure proper configuration.
With Router v1.12, you can now use distributed APQ caching to improve p99 latencies during peak times. A shared Redis instance can now be used by the entire router fleet to build the APQ cache faster and share the existing APQ cache with new router instances that are spun up during scaling events, when they need it most. This ensures the fast path to query execution is consistently available to all users, even during peak load.
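As a rough sketch, pointing the router fleet at a shared Redis instance could look something like the following; the key names used here are an assumption on our part, so consult the linked documentation for the authoritative configuration:

apq:
  router:
    cache:
      redis:
        # Shared Redis instance used by every router in the fleet (placeholder URL);
        # the surrounding key names are assumed, not authoritative.
        urls: ["redis://redis.example.com:6379"]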
See the distributed APQ caching documentation for information on setting up this GraphOS Enterprise feature.
GraphOS Enterprise: External Coprocessor support
External Coprocessor support is now generally available to GraphOS Enterprise customers running self-hosted routers. To fully account for the changes between the initial experimental releases and the final generally available implementation, we recommend removing the experimental configuration and re-implementing it by following the documentation below to ensure proper configuration.
Router now supports external coprocessors written in your programming language of choice. Coprocessors run with full isolation and a clean separation of concerns that decouples delivery and provides fault isolation. Low overhead can be achieved by running coprocessors alongside the router on the same host or in the same Kubernetes Pod as a sidecar. Coprocessors can be used to speed Gateway migrations, support bespoke use cases, or integrate the router with existing network services for custom auth (JWT mapping, claim enrichment), service discovery integration, and more!
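As an illustrative sketch only (the URL is a placeholder and the stage options shown are assumptions; see the documentation below for the full configuration), wiring up a sidecar coprocessor might look along these lines:

coprocessor:
  # Address of the coprocessor, e.g. a sidecar in the same Pod (placeholder URL).
  url: http://127.0.0.1:8081
  router:
    request:
      # Forward client request headers to the coprocessor at the router request
      # stage (assumed option name).
      headers: true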
See the external coprocessor documentation for information on setting up this GraphOS Enterprise feature.
TLS termination (Issue #2615)
If there is no intermediary proxy or load balancer capable of doing it, the router ends up responsible for terminating TLS. This is relevant, for example, when you need to support HTTP/2, which requires TLS in most implementations. We've introduced TLS termination support for the router using the rustls implementation, limited to one server certificate and using safe default ciphers. We do not support TLS versions prior to v1.2.
If you require more advanced TLS termination than this implementation offers, we recommend using a proxy which supports this (as is the case with most cloud-based proxies today).
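A minimal sketch of what such a configuration might look like follows; the paths are placeholders and the key names are our assumption, so refer to the TLS documentation for the authoritative options:

tls:
  supergraph:
    # PEM-encoded server certificate, chain, and private key (placeholder paths).
    certificate: ${file./path/to/server.crt}
    certificate_chain: ${file./path/to/chain.crt}
    key: ${file./path/to/server.key}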
Make initialDelaySeconds configurable for health check probes in Helm chart
Currently, initialDelaySeconds uses the default of 0. This means that Kubernetes gives the router no additional time before it performs the first probe.
This can be configured as follows:
probes:
  readiness:
    initialDelaySeconds: 1
  liveness:
    initialDelaySeconds: 5
GraphQL errors can be thrown within Rhai (PR #2677)
Until now, a Rhai script throw would yield an HTTP status code and a message string, which would end up as a GraphQL error.
This change allows users to throw with a valid GraphQL response body, which may include data, as well as errors and extensions.
Refer to the Terminating client requests section of the Rhai API documentation to learn how to throw GraphQL payloads.
By @o0Ignition0o in #2677
Fixes
In-flight requests will terminate before shutdown is completed (Issue #2539)
In-flight client requests will now be completed when the router is asked to shutdown gracefully.
State machine will retain most recent valid config (Issue #2752)
The state machine will retain current state until new state has gone into service. Previously, if the router failed to reload either the configuration or the supergraph, it would discard the incoming state change even if that state change turned out to be invalid. It is important to avoid reloading inconsistent state because a new supergraph may, for example, directly rely on changes in config to work correctly.
Changing this behaviour means that the router must enter a "good" configuration state before it will reload, rather than reloading with potentially inconsistent state.
For example, previously:
- Router starts with valid supergraph and config.
- Router config is set to something invalid and restart doesn't happen.
- Router receives a new schema, the router restarts with the new supergraph and the original valid config.
Now, the latest information is used to restart the router:
- Router starts with valid schema and config.
- Router config is set to something invalid and restart doesn't happen.
- Router receives a new schema, but the router fails to restart because the config is still invalid.
By @BrynCooke in #2753
Ability to disable HTTP/2 for subgraphs (Issue #2063)
There are cases where balancing HTTP/2 connections to subgraphs behaves erratically. While we consider this a bug, users may disable HTTP/2 support to subgraphs in the short term while we work to find the root cause.
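As a sketch of what this could look like under the traffic shaping configuration (the option name here is an assumption on our part; check the documentation for the exact key):

traffic_shaping:
  all:
    # Assumed option name; disables HTTP/2 for every subgraph connection.
    experimental_http2: disable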
Tracing default service name restored (Issue #2641)
With this fix, the default tracing service name is restored to router.
By @BrynCooke in #2642
Header plugin now has a static plugin priority (Issue #2559)
Execution order of the headers plugin, which handles header forwarding, is now enforced. This ensures reliable behavior with other built-in plugins.
It is now possible to use custom attributes derived from headers within the telemetry plugin, in addition to using the headers plugin to propagate/insert headers for subgraphs.
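For instance, header propagation for subgraphs is configured through the headers plugin along these lines (the header name is purely illustrative):

headers:
  all:
    request:
      # Forward the client-supplied header (illustrative name) to every subgraph.
      - propagate:
          named: "x-client-name"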
Add content-type header when publishing Datadog metrics (Issue #2697)
Add the required content-type header for publishing Datadog metrics from Prometheus:
content-type: text/plain; version=0.0.4
By @ShaunPhillips in #2698
Sandbox Explorer endpoint URL is no longer editable (PR #2729)
The "Endpoint" in the Sandbox Explorer (Which is served by default when running in development mode) is no longer editable, to prevent inadvertent changes. Sandbox is not generally useful with other endpoints as CORS must be configured on the other host.
A hosted version of Sandbox Explorer without this restriction is still available if you need a version which allows editing.
By @mayakoneval in #2729
Argument parsing is now optional in the Executable builder (PR #2666)
The Executable builder was parsing command-line arguments, which was causing issues when used as part of a larger application with its own set of command-line flags, leading to those arguments not being recognized by the router. This change allows parsing the arguments separately, then passing the required ones to the Executable builder directly. The default behaviour is still to parse from inside the builder.
Unnecessary space has been removed from the log formatter (PR #2755)
Indentation was being introduced after the log-level annotations in router logs. We've removed the offending spaces!
FTV1 trace sampling is now applied per supergraph request rather than subgraph request (Issue #2655)
Because tracing can be costly, it is only enabled for a configurable fraction of requests. Each request is selected for tracing or not with a corresponding probability. This used to be done as part of the subgraph service, meaning that when a single supergraph request handled by the Router involves making multiple subgraph requests, it would be possible (and likely) that tracing would only be enabled for some of those sub-requests. If this same supergraph request is repeated enough times, the aggregated metrics should be fine, but for smaller sample sizes this risks giving an unexpectedly partial view of what's happening.
As of this change, each supergraph request received by the Router is either decided to be sampled or not sampled and all corresponding subgraph requests use that same decision.
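The fraction of sampled requests remains configurable; as a sketch (the key name is our assumption, so consult the telemetry documentation), it looks roughly like this:

telemetry:
  apollo:
    # Trace roughly half of supergraph requests; all subgraph requests made for
    # a sampled supergraph request now share that same decision. Assumed key name.
    field_level_instrumentation_sampler: 0.5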
By @SimonSapin in #2656
JWKS download failure no longer results in JWT plugin init failure (Issue #2747)
This feature was previously experimental and is now generally available as a GraphOS Enterprise feature. See the "Features" section above for more detail, and consult the feature's documentation for more information.
JWKS download can temporarily fail for the same reasons that any network request fails. Such an intermittent failure no longer fails plugin initialization, which previously prevented the router from loading or hot-reloading. We now keep trying to download the failed asset during initialization, making a reasonable effort to start the router with all JWKS. In the event that one of the configured JWKS does not download, the router will still start with the remaining sets.
JWKS is now downloaded out of band (Issue #2647)
This feature was previously experimental and is now generally available as a GraphOS Enterprise feature. See the "Features" section above for more detail, and consult the feature's documentation for more information.
The JWKS download in the JWT authentication plugin now lives in a separate task which polls the JWKS URLs asynchronously, rather than downloading them on demand when a JWT is verified. This should reduce the latency for the initial requests received by the router and increase reliability by removing (internal) tower Buffer usage.
Add an issuer check after JWT signature verification (Issue #2647)
This feature was previously experimental and is now generally available as a GraphOS Enterprise feature. See the "Features" section above for more detail, and consult the feature's documentation for more information.
This is a notable change if you're coming from the experimental implementation. Experimental features offer no breaking change policy while they are in experimental state.
A JWKS URL can now be associated with an issuer in the YAML configuration. After verifying the JWT signature, if the issuer is configured in YAML and there is a corresponding iss claim in the JWT, the router will check that they match, and reject the request if not.
For those coming from experimental, the configuration now takes a list of objects, each including a url and an optional issuer property:
< authentication:
<   experimental:
<     jwt:
<       jwks_urls:
<         - file:///path/to/jwks.json
<         - http:///idp.dev/jwks.json
---
> authentication:
>   jwt:
>     jwks:
>       - url: file:///path/to/jwks.json
>         issuer: "http://idp.local" # optional field
>       - url: http:///idp.dev/jwks.json
>         issuer: http://idp.dev # optional field
Configuration
Configuration changes are non-breaking in the current minor version, but we recommend making these changes as soon as possible since they will become breaking changes in a future major version.
apq has been moved to the top level of configuration (Issue #2744)
For improved usability, we will be moving items out of supergraph in the router configuration file. This is because various plugins use router pipeline stages as part of their YAML config, of which supergraph is one.
You may not have this option in your configuration since APQ is on by default, but if you're using this option, the appropriate change will look like this:
< supergraph:
<   apq:
<     enabled: true
---
> apq:
>   enabled: true
By @BrynCooke in #2745
Maintenance
Correct visibility of telemetry plugin (Issue #2739)
The telemetry plugin code itself was previously marked pub. However, due to the recent refactor of the telemetry plugin and its associated tests, this is no longer the case. This does not manifest as a breaking change since the plugin was exported under the _private module, which itself was marked as internal.
By @BrynCooke in #2740
Jaeger integration tests have been improved in CircleCI (Issue #2675)
We now use a Jaeger Docker image rather than downloading the binaries directly, improving the overall reliability since the artifacts themselves were previously being pulled from GitHub artifacts and failed regularly.
By @BrynCooke in #2673
Clean up trace_providers on a thread rather than in a Tokio blocking_task (Issue #2668)
OpenTelemetry shutdown occasionally hangs due to Telemetry::Drop using a tokio::spawn_blocking to flush the trace_provider. However, Tokio doesn't finish executing tasks before termination (tokio-rs/tokio#1156). This means that if the Tokio runtime itself is shut down, there is a potential race where the trace_provider may not be flushed.
We can mitigate this by using a thread so that task flush will always be completed even if the Tokio runtime is shut down. Hangs were most likely to happen in tests due to the Tokio runtime being destroyed when the test method exits.
By @BrynCooke in #2757
Tweak the rate limit test times to prevent sporadic CI failures (Issue #2667)
A slight adjustment to the timing should make this less likely to cause flakes.
By @BrynCooke in #2758
Remove "dead" parsing code still using the apollo-parser
AST (Issue #2636)
Now that the apollo-compiler HIR has been used for long enough, the now-unused version of the parsing code that was still based on apollo-parser's AST has been removed. We had previously left this code in place intentionally to make it easy to roll back to.
This removal will unlock further refactoring in the upcoming #2483.
By @SimonSapin in #2637
Use the fred Redis client (Issue #2623)
We now use the fred Redis client instead of the redis and redis-cluster-async crates. This adds necessary support for TLS in Redis "cluster" mode, removes OpenSSL usage entirely (this was our only dependency which used OpenSSL, so the router can once again install without friction on the newest Ubuntu version), and generally cleans up the code.
Update local development docker-compose configuration (Issue #2680)
The federation-demo was used for testing in early versions of the Router but is no longer used, and we removed most references to it some time ago. The docker-compose.yml (used primarily in the development of this repository) has been updated to reflect this, and now also includes Redis, which is required for some tests.
By @BrynCooke in #2681
Improve CI time by removing test-binaries from build (Issue #2625)
We now have an experimental plugin called broken that is included in the router. This removes the need to use test-binaries and avoids a full recompile of the router during integration testing.
By @BrynCooke in #2650
Ban openssl-sys using cargo-deny (PR #2510)
We avoid depending on OpenSSL in the router, instead opting to use rustls for various reasons. This change introduces a tooling "ban" of the openssl-sys crate, which signals early in our pull requests if OpenSSL is inadvertently introduced again, helping us avoid mistakenly reintroducing it in the future.
By @o0Ignition0o in #2638
Documentation
Context::get has been corrected (Issue #2580)
If we get an error, it doesn't mean the context entry didn't exist; it generally means there was a deserialization error. We've updated the Context::get documentation to reflect this.
Remove "embedded" example (Issue #2737)
The "embedded" example in our documentation was a throwback to early days of the Router where "distribution as middleware" was considered more viable. As development has progressed, this approach has become obsolete, particularly as we have baked some of our functionality into the webserver layer. In addition, the entire example was still using the TestHarness
which is designed for testing rather than production traffic. Overall, we think the rest of our documentation properly represents modern days way of doing this work.
By @BrynCooke in #2738