github runs-on/runs-on v2.9.0

latest release: v2.9.1
pre-release, 2 days ago

This is a large release, with many internal and external changes. Please review the first section below carefully.

Note 2025-10-17: please use v2.9.1 instead, since it includes important fixes.

Potentially breaking changes

  • Update default linux image to ubuntu24. Set image=ubuntu22-full-x64 or image=ubuntu22-full-arm64 if you want to keep using Ubuntu22.
  • When a job is canceled due to a spot interruption, all the failed jobs from that failed workflow will get retried, instead of only the first job interrupted. Fixes #192.
  • Prometheus metrics endpoint removed, along with the ServerPassword stack parameter. RunsOn now ships with OTEL integration. Fixes #322. The /metrics endpoint had a long-standing issue anyway: some Prometheus scrapers were unable to reach the AppRunner endpoint because of how the AWS Envoy proxy handles requests.
  • disk=large and disk=default labels are deprecated. If present, they will be automatically translated into the new volume label, but once you have adopted v2.9.0, you should update your workflows for future upgrades.
  • RunnerLargeDiskDeviceName and RunnerDefaultDiskDeviceName are removed (now always use the AMI root volume device name).
  • Add stack parameter RunnerConfigAutoExtendsFrom to always force a specific value for the repository configuration _extends directive (even if no local config file exists). Fixes #366. Note that it defaults to .github-private, meaning that if you leave that default, RunsOn will always attempt to load the config file from that repo as a base configuration. Set it to . (only extend from the current repo's _extends directive) to keep the previous behavior. This has been a long-requested feature (and a source of confusion) and is what new users expect, which is why the breaking setting is enabled by default.
  • Custom tag precedence is now: stack custom runner tags < custom runner tags < repository custom runner tags. This lets you set default tags at the stack level, which can be overridden by runner-level tags, while repo-level tags (if set) always take precedence, so that repo admins can control the final tag value when needed.
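
The precedence order for custom tags can be pictured as a layered merge, where each later layer overrides the one before it. This is only a sketch of the ordering rule, not RunsOn's implementation; the function name and the sample tag values are hypothetical.

```python
def resolve_tags(stack_tags, runner_tags, repo_tags):
    """Merge tag maps from lowest to highest precedence (hypothetical helper)."""
    merged = {}
    # Order matters: stack < runner < repository, so later layers win.
    for layer in (stack_tags, runner_tags, repo_tags):
        merged.update(layer)
    return merged


# Sample values are made up for illustration.
stack_tags = {"team": "platform", "env": "prod"}
runner_tags = {"team": "ci", "cost-center": "1234"}
repo_tags = {"team": "payments"}

final_tags = resolve_tags(stack_tags, runner_tags, repo_tags)
print(final_tags)
```

Here the repo-level value for team wins, while env and cost-center fall through from the lower layers.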

Deprecations (please send feedback!)

  • disk label support is going to be removed in the next minor version (replaced by volume label, which is much more flexible).
  • RunnerLargeVolumeThroughput and RunnerLargeDiskSize are deprecated stack parameters.

Also, now that external networking is supported, next version will just set sane defaults for some VPC features when using embedded networking. As such, those parameters will be removed:

  • VpcFlowLogFormat: [DEPRECATED, use external networking if you need to fine-tune this].
  • VpcFlowLogS3BucketArn: [DEPRECATED, use external networking if you need to fine-tune this]
  • VpcFlowLogRetentionInDays: [DEPRECATED, use external networking if you need to fine-tune this]
  • VpcCidrSubnetBits: [DEPRECATED, use external networking if you need to fine-tune this]

Then, DefaultAdmins will be removed, since it's just better for admin-level people to use SSM to log into the runners if needed:

  • DefaultAdmins: [DEPRECATED, prefer to use SSM for admin access].

Finally, I don't think ECInstanceDetailedMonitoring is useful, since the default CloudWatch metrics are useless anyway, and you're better off using the new runs-on/action metrics:

  • ECInstanceDetailedMonitoring: [DEPRECATED. See https://runs-on.com/monitoring/job-metrics/#performance-metrics for better performance metrics].

Now that native Slack webhook integration is supported, I believe we can also remove the AlertTopicSubscriptionHttpsEndpoint parameter, which was originally introduced for that use case (but required an adapter in between). Please reach out if you think we should keep it.

Warm pools (BETA)

RunsOn can now operate pools of stopped or hot instances, which means pick-up times will be improved.

See https://github.com/runs-on/runs-on/blob/main/adrs/20250727-warm-pools.md for all details.

Volume overrides

Runners can now override the default volume settings directly within labels or through the configuration file.

Examples:

# full
runs-on: runner=2cpu-linux-x64/volume=gp3:80g:125mbps:3000iops
# partial
runs-on: runner=2cpu-linux-x64/volume=80g:250mbps
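
To make the shape of the volume label concrete, here is a minimal sketch that splits a volume spec into its parts based on the unit suffixes seen in the two examples above. It is not RunsOn's parser: the function name, the returned field names, and the rule that any unrecognized part is the volume type are all assumptions made for illustration.

```python
import re


def parse_volume(spec):
    """Split a volume spec such as 'gp3:80g:125mbps:3000iops' into fields.

    Hypothetical helper: parts are recognized by suffix, so they may be
    omitted, as in the partial example.
    """
    fields = {}
    for part in spec.split(":"):
        if m := re.fullmatch(r"(\d+)iops", part):
            fields["iops"] = int(m.group(1))
        elif m := re.fullmatch(r"(\d+)mbps", part):
            fields["throughput_mbps"] = int(m.group(1))
        elif m := re.fullmatch(r"(\d+)g", part):
            fields["size_g"] = int(m.group(1))
        else:
            # Assumption: anything else is the volume type (e.g. gp3).
            fields["type"] = part
    return fields


full = parse_volume("gp3:80g:125mbps:3000iops")
partial = parse_volume("80g:250mbps")
print(full)
print(partial)
```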

GitHub webhook redelivery on failures

RunsOn now ships with a background job that checks every 5 minutes for failed webhook deliveries on the GitHub side. If it finds any that match the current stack labels, it attempts to redeliver them once. This is especially useful under very high load, as AppRunner can sometimes rate-limit incoming webhooks when GitHub sends a burst of them all at once.

You'll get alerted (over SNS, Slack, etc.) if failed webhooks have been redelivered. The CloudWatch dashboard also has a widget showing recent runs and redeliveries (if any).
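
The "redeliver once" rule can be sketched as a selection step over delivery records. The sketch below assumes records shaped like the entries GitHub's REST API returns when listing app webhook deliveries (id, guid, status_code, redelivery); the function name and sample data are hypothetical, the filtering by stack labels is not modeled, and this is not RunsOn's actual code.

```python
from collections import defaultdict


def select_for_redelivery(deliveries):
    """Return ids of failed deliveries that have not been retried yet."""
    # All attempts for the same event share one guid.
    by_guid = defaultdict(list)
    for delivery in deliveries:
        by_guid[delivery["guid"]].append(delivery)

    to_redeliver = []
    for attempts in by_guid.values():
        succeeded = any(200 <= a["status_code"] < 300 for a in attempts)
        already_retried = any(a["redelivery"] for a in attempts)
        if not succeeded and not already_retried:
            to_redeliver.append(attempts[0]["id"])
    return to_redeliver


# Made-up sample: one success, one fresh failure, one failure already retried.
deliveries = [
    {"id": 1, "guid": "a", "status_code": 200, "redelivery": False},
    {"id": 2, "guid": "b", "status_code": 429, "redelivery": False},
    {"id": 3, "guid": "c", "status_code": 502, "redelivery": False},
    {"id": 4, "guid": "c", "status_code": 502, "redelivery": True},
]

pending = select_for_redelivery(deliveries)
print(pending)
```

Grouping by guid is what keeps the retry to a single attempt: once any redelivery exists for an event, it is skipped on later passes.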


Runner details

  • Add original labels
  • Add pool details (if any)

Slack integration

You can now define a Slack webhook URL (AlertTopicSlackWebhookUrl stack parameter), so that alerts also get sent there.

OTEL integration

You can now pass an OTEL endpoint and headers (OtelExporterOtlpEndpoint, OtelExporterOtlpHeader). Only HTTP transport is enabled for now. Metrics will be shipped there. Example dashboard below using SigNoz:

OTEL dashboard
Details of all new metrics and logs

Job Metrics

runs_on_jobs_total (Counter)

Total number of jobs by status.

Attributes:

  • status: Job status (queued, scheduled, in_progress, completed)
  • repo_full_name: Repository full name (e.g., owner/repo)
  • workflow_name: GitHub workflow name
  • instance_type: EC2 instance type (e.g., t3.medium) (optional, only for scheduled status)
  • instance_lifecycle: Instance lifecycle (spot or on-demand) (optional, only for scheduled status)
  • pool_name: Pool name if scheduled from a pool (optional, only when scheduled via pool)
  • interrupted: Whether the job was interrupted (bool) (optional, only when true)
  • org: GitHub organization name
  • installation_id: GitHub App installation ID
  • stack_name: Stack name (optional, when provided in JobEvent)
  • region: AWS region (optional, when provided in JobEvent)
  • conclusion: Job conclusion for completed status (success, failure, cancelled, skipped)

Examples:

# Scheduled status (has instance_type and instance_lifecycle)
runs_on_jobs_total{status="scheduled",repo_full_name="acme/api",workflow_name="CI",instance_type="t3.medium",instance_lifecycle="spot",pool_name="default",org="acme",installation_id=12345,stack_name="runs-on-prod",region="us-east-1"} 42

# Completed status (no instance_type or instance_lifecycle)
runs_on_jobs_total{status="completed",repo_full_name="acme/api",workflow_name="CI",pool_name="default",conclusion="success",org="acme",installation_id=12345,stack_name="runs-on-prod",region="us-east-1"} 42

runs_on_internal_queue_duration_seconds (Histogram)

Time from job queued in RunsOn to scheduled (internal queue time). This measures how long the job spends in RunsOn's internal queue before an instance is scheduled.

Attributes: Same as runs_on_jobs_total

Buckets: Default OTEL histogram buckets

runs_on_overall_queue_duration_seconds (Histogram)

Time from job queued by GitHub to started (overall queue time). This measures the total time from when GitHub queues the job to when it actually starts running, including instance launch and runner bootstrap.

Attributes: Same as runs_on_jobs_total

Buckets: Default OTEL histogram buckets

runs_on_job_duration_seconds (Histogram)

Time from job started to completed.

Attributes: Same as runs_on_jobs_total

Buckets: Default OTEL histogram buckets
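
The three duration histograms measure different spans of a job's lifecycle. The sketch below derives them from five timestamps, following the definitions given for each metric. The timestamp parameter names and sample values are hypothetical and only illustrate which interval each histogram covers.

```python
from datetime import datetime


def job_durations(github_queued_at, runs_on_queued_at, scheduled_at,
                  started_at, completed_at):
    """Compute the three durations (in seconds) from ISO 8601 timestamps."""
    t = {
        name: datetime.fromisoformat(value)
        for name, value in {
            "github_queued": github_queued_at,
            "runs_on_queued": runs_on_queued_at,
            "scheduled": scheduled_at,
            "started": started_at,
            "completed": completed_at,
        }.items()
    }
    return {
        # Queued in RunsOn -> instance scheduled.
        "internal_queue_duration_seconds":
            (t["scheduled"] - t["runs_on_queued"]).total_seconds(),
        # Queued by GitHub -> job started (includes launch and bootstrap).
        "overall_queue_duration_seconds":
            (t["started"] - t["github_queued"]).total_seconds(),
        # Job started -> job completed.
        "job_duration_seconds":
            (t["completed"] - t["started"]).total_seconds(),
    }


result = job_durations(
    github_queued_at="2025-10-10T14:00:00+00:00",
    runs_on_queued_at="2025-10-10T14:00:03+00:00",
    scheduled_at="2025-10-10T14:00:15+00:00",
    started_at="2025-10-10T14:00:45+00:00",
    completed_at="2025-10-10T14:03:45+00:00",
)
print(result)
```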

Pool Metrics

runs_on_pool_instances_total (Observable Gauge)

Number of pool instances by state. This is a pull-based metric that reports current state.

Attributes:

  • pool_name: Pool name
  • state: Instance state (running, stopped, pending, terminating)
  • installation_id: GitHub App installation ID
  • org: GitHub organization name

Example:

runs_on_pool_instances_total{pool_name="default",state="running",installation_id=12345,org="acme"} 5
runs_on_pool_instances_total{pool_name="default",state="stopped",installation_id=12345,org="acme"} 10

Rate Limiter Metrics

runs_on_rate_limiter_tokens (Observable Gauge)

Available tokens in rate limiter. This is a pull-based metric that reports current state.

Attributes:

  • limiter: Rate limiter name (e.g., github_api, ec2_api)

Example:

runs_on_rate_limiter_tokens{limiter="github_api"} 4500.5

runs_on_rate_limiter_burst (Observable Gauge)

Burst capacity of rate limiter. This is a pull-based metric that reports current state.

Attributes:

  • limiter: Rate limiter name

Example:

runs_on_rate_limiter_burst{limiter="github_api"} 5000
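
Taken together, the two rate limiter gauges can be read as a fill ratio: available tokens divided by burst capacity. The helper below is a sketch of how a consumer of these metrics might flag a limiter that is running low. The function name and the 10% threshold are assumptions, not part of RunsOn.

```python
def fill_ratio(tokens, burst):
    """Fraction of burst capacity currently available, clamped to [0, 1]."""
    if burst <= 0:
        return 0.0
    return max(0.0, min(1.0, tokens / burst))


# Values taken from the github_api examples above.
ratio = fill_ratio(4500.5, 5000)
low_on_tokens = ratio < 0.1  # hypothetical alerting threshold
print(ratio, low_on_tokens)
```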

Spot Circuit Breaker Metrics

runs_on_spot_circuit_breaker_active (Observable Gauge)

Whether spot circuit breaker is currently active. This is a pull-based metric that reports current state.

Values:

  • 1: Circuit breaker is active (spot instances disabled)
  • 0: Circuit breaker is inactive (spot instances enabled)

Example:

runs_on_spot_circuit_breaker_active{} 0

Go Runtime Metrics

The metrics package automatically instruments Go runtime metrics via go.opentelemetry.io/contrib/instrumentation/runtime:

  • process.runtime.go.mem.heap_alloc
  • process.runtime.go.mem.heap_idle
  • process.runtime.go.mem.heap_inuse
  • process.runtime.go.gc.count
  • process.runtime.go.goroutines.count
  • And more...

These metrics include the standard service.name="runs-on-server" attribute.

Resource Attributes

All metrics include these resource attributes:

Attribute         Description                             Example
service.name      Service name (always runs-on-server)    runs-on-server
app.version       Application version (if configured)     v2.9.0
app.environment   Environment name (if configured)        production
stack_name        Stack name (if configured)              runs-on-prod
region            AWS region (if configured)              us-east-1

Structured Logs

The metrics package emits periodic structured logs (JSON) containing snapshots of all metrics.

Log Types

Job Summary (metric_type=jobs_summary)

Cumulative job counts since server start.

{
  "metric_type": "jobs_summary",
  "queued": 1234,
  "scheduled": 1200,
  "in_progress": 34,
  "completed": 1150,
  "interrupted": 16
}

Note: The interrupted counter tracks jobs that were interrupted (e.g., by spot interruptions). Such jobs are still recorded with their final status (e.g., completed), with the interrupted attribute set to true.

Job Event (metric_type=job_event)

Individual job lifecycle events (emitted immediately, not periodic).

{
  "metric_type": "job_event",
  "status": "completed",
  "conclusion": "success",
  "repo_full_name": "acme/api",
  "workflow_name": "CI",
  "instance_type": "t3.medium",
  "instance_lifecycle": "spot",
  "pool_name": "default",
  "interrupted": true,
  "internal_queue_duration_seconds": 12.5,
  "overall_queue_duration_seconds": 45.2,
  "job_duration_seconds": 180.3
}

Note: instance_type, instance_lifecycle, pool_name, and interrupted fields are only included when available/applicable.

Pool Instances (metric_type=pool_instances)

Current pool instance counts by state.

{
  "metric_type": "pool_instances",
  "installation_id": 12345,
  "org": "acme",
  "pool_name": "default",
  "running": 5,
  "stopped": 10,
  "pending": 2
}

Rate Limiter (metric_type=rate_limiter)

Current rate limiter state.

{
  "metric_type": "rate_limiter",
  "limiter": "github_api",
  "tokens": 4500.5,
  "burst": 5000
}

Spot Circuit Breaker (metric_type=spot_circuit_breaker)

Current circuit breaker state.

{
  "metric_type": "spot_circuit_breaker",
  "active": false,
  "interruption_count": 42
}

Spot Interruption (metric_type=spot_interruption)

Individual spot interruption events (emitted immediately, not periodic).

{
  "metric_type": "spot_interruption",
  "interruption_time": "2025-10-10T14:30:00Z",
  "trip_count": 3,
  "recovery_minutes": 15,
  "circuit_breaker_active": false,
  "active_until": "2025-10-10T14:45:00Z",
  "instance_id": "i-1234567890abcdef0",
  "job_id": "987654321",
  "job_name": "build",
  "job_url": "https://github.com/owner/repo/actions/runs/123456789/job/987654321",
  "repo_full_name": "owner/repo"
}

Note: active_until is only included when the circuit breaker is active. Job details (instance_id, job_id, job_name, job_url, repo_full_name) are only included when available.
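
Because every structured log record carries a metric_type field, the logs can be filtered by type with very little code. The sketch below assumes each record arrives as a single JSON object per line, which may differ from how your log pipeline delivers them. The function name and sample lines are hypothetical.

```python
import json


def records_of_type(lines, metric_type):
    """Yield parsed log records whose metric_type matches."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if isinstance(record, dict) and record.get("metric_type") == metric_type:
            yield record


# Made-up sample lines using fields from the log types above.
log_lines = [
    '{"metric_type": "job_event", "status": "completed", "conclusion": "success", "interrupted": true, "job_duration_seconds": 180.3}',
    '{"metric_type": "rate_limiter", "limiter": "github_api", "tokens": 4500.5, "burst": 5000}',
    'plain text line that is not JSON',
    '{"metric_type": "job_event", "status": "completed", "conclusion": "failure", "job_duration_seconds": 60.0}',
]

job_events = list(records_of_type(log_lines, "job_event"))
# interrupted is only present when applicable, so use .get().
interrupted = [e for e in job_events if e.get("interrupted")]
total_job_seconds = sum(e["job_duration_seconds"] for e in job_events)
print(len(job_events), len(interrupted), total_job_seconds)
```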

Pre/Post custom job hooks

You can now launch custom scripts within the "Set up runner" and "Complete runner" sections of a workflow. If /runs-on/pre.custom.sh or /runs-on/post.custom.sh scripts are found, the RunsOn agent will execute them in their respective job section. They are executed after the RunsOn-specific scripts, and RunsOn will fail the step if those custom scripts fail. See https://docs.github.com/en/actions/how-tos/manage-runners/self-hosted-runners/run-scripts for more details.

Misc

  • Improve failure message when the runner spec is invalid (missing family). Fixes #343.
  • Fix permission issue with Docker and ECR login in preinstall scripts on instances with local disks. Fixes #362.
  • Auto-resize Windows disks. Fixes #369.
  • Properly disable all IPv4 public addresses whenever launching in private subnets. Previously this was only done when the Private=only stack parameter was set, leading to increased costs for stacks running in mixed networking mode (public + private runners allowed).
  • When an instance receives a spot interruption warning, let AWS perform the termination so that we don't get billed if runtime was <1h. Fixes #365.
  • Surface job error after all schedule attempts have been exhausted. Fixes #357.
  • Fix SSH setup issues on AlmaLinux images. #330.

Upgrade
