🔍 Debuggability
Add compute job pool spans (PR #7236)
The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection.
This PR adds spans to jobs that are on this pool to allow users to see when latency is introduced due to
resource contention within the compute job pool.
compute_job:job.type: (query_parsing|query_planning|introspection)
compute_job.executionjob.age:P1-P8job.type: (query_parsing|query_planning|introspection)
Jobs are executed highest priority (P8) first. Jobs that are low priority (P1) age over time, eventually executing
at highest priority. The age of a job is can be used to diagnose if a job was waiting in the queue due to other higher
priority jobs also in the queue.
By @BrynCooke in #7236
Add compute job pool metrics (PR #7184)
The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection.
When this pool becomes saturated it is difficult for users to see why so that they can take action.
This change adds new metrics to help users understand how long jobs are waiting to be processed.
New metrics:
apollo.router.compute_jobs.queue_is_full- A counter of requests rejected because the queue was full.apollo.router.compute_jobs.duration- A histogram of time spent in the compute pipeline by the job, including the queue and query planning.job.type: (query_planning,query_parsing,introspection)job.outcome: (executed_ok,executed_error,channel_error,rejected_queue_full,abandoned)
apollo.router.compute_jobs.queue.wait.duration- A histogram of time spent in the compute queue by the job.job.type: (query_planning,query_parsing,introspection)
apollo.router.compute_jobs.execution.duration- A histogram of time spent to execute job (excludes time spent in the queue).job.type: (query_planning,query_parsing,introspection)
apollo.router.compute_jobs.active_jobs- A gauge of the number of compute jobs being processed in parallel.job.type: (query_planning,query_parsing,introspection)
By @carodewig in #7184
🐛 Fixes
Fix hanging requests when compute job queue is full (PR #7273)
The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection. When the pool is busy, jobs enter a queue.
When the compute job queue was full, requests could hang until timeout. Now, the router immediately returns a SERVICE_UNAVAILABLE response to the user.
By @BrynCooke in #7273
Increase compute job pool queue size (PR #7205)
The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection. When the pool is busy, jobs enter a queue.
We previously set this queue size to 20 (per thread). However, this may be too small on resource constrained environments.
This patch increases the queue size to 1,000 jobs per thread. For reference, in older router versions before the introduction of the compute job worker pool, the equivalent queue size was 1,000.
By @goto-bus-stop in #7205