Features
Uplink metrics and improved logging (Issue #2769, Issue #2815, Issue #2816)
To support monitoring, observability, and debugging of Uplink-related behaviors (those that occur as part of Managed Federation), the router now emits clearer log messages and new metrics. The new metrics are:
- apollo_router_uplink_duration_seconds_bucket: A histogram of durations with the following attributes:
  - url: The URL that was polled
  - query: SupergraphSdl or Entitlement
  - type: new, unchanged, http_error, uplink_error, or ignored
  - code: The error code, depending on type
  - error: The error message
- apollo_router_uplink_fetch_count_total: A gauge that counts the overall success (status="success") or failure (status="failure") counts that occur when communicating to Uplink without taking into account fallback.
⚠️ The very first poll to Uplink is unable to capture metrics since it's so early in the router's lifecycle that telemetry hasn't yet been set up. We consider this a suitable trade-off and don't want to allow perfect to be the enemy of good.
Here's an example of what these new metrics look like from the Prometheus scraping endpoint:
# HELP apollo_router_uplink_fetch_count_total apollo_router_uplink_fetch_count_total
# TYPE apollo_router_uplink_fetch_count_total gauge
apollo_router_uplink_fetch_count_total{query="SupergraphSdl",service_name="apollo-router",status="success"} 1
# HELP apollo_router_uplink_fetch_duration_seconds apollo_router_uplink_fetch_duration_seconds
# TYPE apollo_router_uplink_fetch_duration_seconds histogram
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.001"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.005"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.015"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.05"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.1"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.2"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.3"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.4"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.5"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="1"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="5"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="10"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="+Inf"} 1
apollo_router_uplink_fetch_duration_seconds_sum{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/"} 0.465257131
apollo_router_uplink_fetch_duration_seconds_count{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/"} 1
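As a rough illustration of how this output could be consumed, the sketch below derives the mean fetch duration from the `_sum` and `_count` samples. The parser is deliberately minimal and ignores labels, so it assumes a single label set per metric, as in the sample above; the helper names are hypothetical and not part of the router.

```python
import re

# Two sample lines copied from the scrape output above.
SCRAPE = '''\
apollo_router_uplink_fetch_duration_seconds_sum{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/"} 0.465257131
apollo_router_uplink_fetch_duration_seconds_count{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/"} 1
'''

# Metric name, optional {labels}, then the sample value.
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)


def parse_samples(text):
    """Return {metric_name: value} for non-comment lines (labels are ignored)."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match:
            samples[match.group("name")] = float(match.group("value"))
    return samples


def mean_fetch_duration(text, base="apollo_router_uplink_fetch_duration_seconds"):
    """Mean duration in seconds, or None when no fetch has been recorded."""
    samples = parse_samples(text)
    count = samples[base + "_count"]
    return samples[base + "_sum"] / count if count else None


print(mean_fetch_duration(SCRAPE))
```

In a real deployment this calculation would more likely be done in the monitoring system itself; the script only shows how the `_sum` and `_count` series relate.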
By @BrynCooke in #2779, #2817, #2819, #2826
Fixes
Only process Uplink messages that are deemed to be newer (Issue #2794)
Uplink is backed by multiple cloud providers to ensure high availability. However, this means that there will be periods of time where Uplink endpoints do not agree on what the latest data is. They are eventually consistent.
This has not been a problem for most users, as the default mode of operation for the router is to fall back to the secondary Uplink endpoint if the first fails.
The other mode of operation is round-robin, which is triggered only when setting the APOLLO_UPLINK_ENDPOINTS environment variable. In this mode there is a much higher chance that the router will go back and forth between schema versions due to disagreement between the Apollo Uplink servers or any user-provided proxies set into this variable.
This change introduces two fixes:
- The Router will only use the fallback strategy. Uplink endpoints are not strongly consistent, and therefore it is better to always poll a primary source of information if available.
- Uplink already handled freshness of schema but now also handles entitlement freshness.
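The two fixes above can be sketched roughly as follows. This is an illustrative model only, not the router's implementation: the `UplinkResponse` type, the `ordering_id` freshness marker, and the fetcher callables are all hypothetical stand-ins, since these notes do not describe how freshness is actually compared.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class UplinkResponse:
    ordering_id: int  # hypothetical freshness marker; larger means newer
    payload: str


# Each fetcher stands in for one Uplink endpoint: it returns a response or raises.
Fetcher = Callable[[], UplinkResponse]


def poll_with_fallback(fetchers: Sequence[Fetcher]) -> Optional[UplinkResponse]:
    """Always try the primary first; move on only when an endpoint fails."""
    for fetch in fetchers:
        try:
            return fetch()
        except Exception:
            continue  # this endpoint failed, fall back to the next one
    return None  # every endpoint failed


def accept_if_newer(
    current: Optional[UplinkResponse], incoming: Optional[UplinkResponse]
) -> Optional[UplinkResponse]:
    """Keep the current state unless the incoming response is strictly newer."""
    if incoming is None:
        return current
    if current is None or incoming.ordering_id > current.ordering_id:
        return incoming
    return current
```

With this shape, a stale answer from a lagging endpoint is simply ignored rather than causing the schema or entitlement to flip back to an older version.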
Note: We advise against using APOLLO_UPLINK_ENDPOINTS to try to cache Uplink responses for high availability purposes. Each request to Uplink currently sends state which limits the usefulness of such a cache.
By @BrynCooke in #2803, #2826, #2846
Distributed caching: Don't send Redis' CLIENT SETNAME (PR #2825)
We won't send the CLIENT SETNAME command to connected Redis servers. This resolves an incompatibility with some Redis-compatible servers, since not all "Redis-compatible" offerings (like Google Memorystore) actually support every Redis command. We did not actually need this feature; it was just one that could optionally be enabled on our Redis client. No Router functionality is impacted.
Support bare top-level __typename when aliased (Issue #2792)
PR #1762 implemented support for the query { __typename } but it did not work properly if the top-level standalone __typename field was aliased. This now works properly.
Maintain errors set on _entities (Issue #2731)
In their responses, some subgraph implementations do not return errors per entity but instead on the entire path. We now transmit those errors regardless.
Configuration
Custom OpenTelemetry Datadog exporter mapping (Issue #2228)
This PR fixes the issue with the Datadog exporter not providing meaningful contextual data in the Datadog traces.
There is a known issue where OpenTelemetry is not fully compatible with Datadog.
To fix this, the opentelemetry-datadog crate added custom mapping functions.
Now, when enable_span_mapping is set to true, the Apollo Router will perform the following mapping:
- Use the OpenTelemetry span name to set the Datadog span operation name.
- Use the OpenTelemetry span attributes to set the Datadog span resource name.
For example:
Suppose we send a query named MyQuery to the Apollo Router. Following the operation's query plan, the Router then sends a query to my-subgraph-name, producing the following trace:
| apollo_router request |
    | apollo_router router |
        | apollo_router supergraph |
        | apollo_router query_planning | apollo_router execution |
                                       | apollo_router fetch |
                                           | apollo_router subgraph |
                                               | apollo_router subgraph_request |
As you can see, there is no clear information about the name of the query, the name of the subgraph, or the name of the query sent to the subgraph.
Instead, with this new enable_span_mapping setting set to true, the following trace will be created:
| request /graphql |
    | router |
        | supergraph MyQuery |
        | query_planning MyQuery | execution |
                                 | fetch fetch |
                                     | subgraph my-subgraph-name |
                                         | subgraph_request MyQuery__my-subgraph-name__0 |
All this logic is gated behind the enable_span_mapping configuration option which, if set to true, will take the values from the span attributes.
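The mapping can be pictured with the small sketch below. It reads each entry in the two traces above as an operation name followed by a resource name, which is an interpretation rather than something stated here. The attribute keys (`operation_name`, `subgraph_name`) are hypothetical placeholders, because these notes do not say which span attributes the router reads, and only two span kinds are modelled.

```python
# Hypothetical attribute keys, chosen only for illustration; the real keys
# used by the router are not given in these notes.
RESOURCE_ATTRIBUTE = {
    "supergraph": "operation_name",
    "subgraph": "subgraph_name",
}


def map_span(span_name, attributes, enable_span_mapping):
    """Return (operation_name, resource_name) for one span."""
    if not enable_span_mapping:
        # Reading of the first trace: a fixed operation name, with the
        # OpenTelemetry span name used as the resource.
        return ("apollo_router", span_name)
    # With mapping on: the span name becomes the operation name, and the
    # resource name is taken from a span attribute when one is configured.
    key = RESOURCE_ATTRIBUTE.get(span_name)
    resource = attributes.get(key, "") if key else ""
    return (span_name, resource)


print(map_span("supergraph", {"operation_name": "MyQuery"}, True))
```

The point of the design is that the operation name stays low-cardinality (one value per span kind) while the resource name carries the request-specific detail.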
By @samuelAndalon in #2790
Maintenance
Migrate xtask CLI parsing from StructOpt to Clap (Issue #2807)
As an internal improvement to our tooling, we've migrated our xtask toolset from StructOpt to Clap, since StructOpt is in maintenance mode.
By @BrynCooke in #2808
Subgraph configuration override (Issue #2426)
We've introduced a new generic wrapper type for subgraph-level configuration, with the following behaviour:
- If there's a config in all, it applies to all subgraphs. If it is not there, the default values apply.
- If there's a config in subgraphs for a specific named subgraph:
  - the fields it specifies override the fields specified in all
  - the fields it does not specify use the values provided by all, or default values, if applicable
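The override rules above amount to a layered, field-by-field merge. The sketch below models them with plain dictionaries; the function name and the field names used in the usage example (`timeout_secs`, `compression`) are hypothetical and do not correspond to real router settings.

```python
def effective_subgraph_config(defaults, all_config, subgraphs, name):
    """Resolve one subgraph's settings: defaults, then `all`, then its own entry."""
    merged = dict(defaults)                          # start from default values
    merged.update(all_config or {})                  # `all` applies to every subgraph
    merged.update((subgraphs or {}).get(name, {}))   # per-subgraph fields win
    return merged


# Usage with made-up fields: `products` overrides only `compression`
# and inherits `timeout_secs` from `all`.
defaults = {"timeout_secs": 30, "compression": None}
all_config = {"timeout_secs": 10}
subgraphs = {"products": {"compression": "gzip"}}
print(effective_subgraph_config(defaults, all_config, subgraphs, "products"))
```

A subgraph with no entry of its own, such as a hypothetical `reviews`, simply gets the values from `all` on top of the defaults.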
Add integration tests for Uplink URLs (Issue #2827)
We've added integration tests to ensure that all Uplink URLs can be contacted and data can be retrieved in an expected format.
We've also changed our URLs to align exactly with Gateway, to simplify our own documentation. Existing Router users do not need to take any action as we support both on our infrastructure.
By @BrynCooke in #2830, #2834
Improve integration test harness (Issue #2809)
Our internal integration test harness has been simplified.
By @BrynCooke in #2810
Use kubeconform to validate the Router's Helm manifest (Issue #1914)
We've had a couple of cases where errors have been inadvertently introduced to our Helm charts, and these have required follow-up fixes. So far, we've been relying on manual testing and inspection, but we've reached the point where automation is desired. This change uses kubeconform to ensure that the YAML generated by our Helm manifest is indeed valid. Errors may still be possible, but this should at least prevent basic errors from occurring. This information will be surfaced in our CI checks.
Documentation
Re-point links going via redirect to their true sources
Some of our documentation links pointed to pages that were renamed during routine documentation updates. While the links were not broken (the former links redirected to the new URLs), we've updated them to avoid the extra hop.
By @o0Ignition0o in #2780
Fix coprocessor docs about subgraph URI mutability
The subgraph uri is (and always has been) mutable when responding to the SubgraphRequest stage in a coprocessor.
By @lennyburdette in #2801