Buildkite Agent Scaler

An AWS Lambda function that handles the scaling of an Amazon Auto Scaling group (ASG) based on metrics provided by the Buildkite Agent Metrics API.

In practice, we've seen 300% faster initial scale-ups with this lambda vs native AutoScaling rules. 🚀

Why?

The Elastic CI Stack depends on being able to scale up quickly from zero instances in response to scheduled Buildkite jobs. Amazon's AutoScaling primitives have a number of limitations that we wanted more granular control over:

The median time for a scaling event to be triggered was 2 minutes, due to needing two samples with a minimum period of 60 seconds between them.
Scaling can either be by a fixed rate, a fixed step size or tracking, but tracking doesn't work well with custom metrics like we use.

How does it work?

The Lambda (or CLI version) polls the Buildkite Metrics API every 10 seconds and, based on the results, sets the DesiredCount to exactly what is needed. This allows much faster scale-up.
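As a rough illustration of "exactly what is needed", the sketch below turns job counts into an instance count. It is not the scaler's actual code: the function names, the choice to count running jobs alongside scheduled ones, and the min/max clamp are all assumptions made for the example.

```go
package main

import "fmt"

// desiredInstances is a hypothetical helper: it returns how many instances
// are needed to run the given jobs when each instance hosts
// agentsPerInstance agents. It rounds up, so a partial instance's worth of
// jobs still gets a whole instance.
func desiredInstances(scheduledJobs, runningJobs, agentsPerInstance int64) int64 {
	if agentsPerInstance <= 0 {
		agentsPerInstance = 1 // assumption: treat invalid values as one agent per instance
	}
	jobs := scheduledJobs + runningJobs
	if jobs <= 0 {
		return 0
	}
	return (jobs + agentsPerInstance - 1) / agentsPerInstance
}

// clamp keeps n within the ASG's [min, max] size bounds.
func clamp(n, min, max int64) int64 {
	if n < min {
		return min
	}
	if n > max {
		return max
	}
	return n
}

func main() {
	// 5 scheduled + 2 running jobs, 2 agents per instance, ASG bounds 0..10.
	fmt.Println(clamp(desiredInstances(5, 2, 2), 0, 10)) // prints 4
}
```

Computing the target directly, rather than stepping towards it, is what lets a single poll move the group from zero to the required size.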

Gracefully scaling in

Whilst the lambda does support scaling in via setting DesiredCount, Amazon ASGs appear to not send Lifecycle Hooks before terminating instances, so jobs in progress are interrupted.

Instead, in the Elastic CI Stack we run the scaler with scale-in disabled (DISABLE_SCALE_IN) and rely on the --disconnect-after-idle-timeout flag added in buildkite-agent v3.10.0, combined with a systemd PostStop script that terminates the instance and atomically decreases the DesiredCount once the agent has been idle for a set period. We've found this works really well and is less complicated than relying on lifecycled and Lifecycle Hooks.

See the forum post for more details.

Publishing Cloudwatch Metrics

The scaler collects its own metrics and doesn't require buildkite-agent-metrics. It can optionally publish the metrics it collects back to Cloudwatch, although it only supports a subset of the metrics that the buildkite-agent-metrics binary collects:

Buildkite > (Org, Queue) > ScheduledJobsCount
Buildkite > (Org, Queue) > RunningJobCount

Running as an AWS Lambda

An AWS Lambda bundle is created and published as part of the build process. The lambda will require the following IAM permissions:

cloudwatch:PutMetricData
autoscaling:DescribeAutoScalingGroups
autoscaling:DescribeScalingActivities
autoscaling:SetDesiredCapacity

Its entrypoint is handler. It requires a go1.x environment and the following env vars:

BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY
BUILDKITE_QUEUE
AGENTS_PER_INSTANCE
ASG_NAME

If BUILDKITE_AGENT_TOKEN_SSM_KEY is set, the token will be read from AWS Systems Manager Parameter Store using GetParameter, which can also read from AWS Secrets Manager.

aws lambda create-function \
  --function-name buildkite-agent-scaler \
  --memory-size 128 \
  --role arn:aws:iam::account-id:role/execution_role \
  --runtime provided.al2 \
  --zip-file fileb://handler.zip \
  --handler handler
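The required environment variables listed above could be validated at startup along these lines. This is a minimal sketch, not the scaler's real configuration code: the config struct, the loadConfig function, and the rule that AGENTS_PER_INSTANCE must be a positive integer are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// config is a hypothetical container for the settings the Lambda needs.
type config struct {
	AgentToken        string
	AgentTokenSSMKey  string
	Queue             string
	AgentsPerInstance int
	ASGName           string
}

// loadConfig reads settings through getenv (pass os.Getenv in real use)
// and reports the first missing or invalid value.
func loadConfig(getenv func(string) string) (config, error) {
	cfg := config{
		AgentToken:       getenv("BUILDKITE_AGENT_TOKEN"),
		AgentTokenSSMKey: getenv("BUILDKITE_AGENT_TOKEN_SSM_KEY"),
		Queue:            getenv("BUILDKITE_QUEUE"),
		ASGName:          getenv("ASG_NAME"),
	}
	// Either the token itself or the SSM key pointing at it must be present.
	if cfg.AgentToken == "" && cfg.AgentTokenSSMKey == "" {
		return config{}, errors.New("set BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY")
	}
	if cfg.Queue == "" {
		return config{}, errors.New("BUILDKITE_QUEUE is required")
	}
	if cfg.ASGName == "" {
		return config{}, errors.New("ASG_NAME is required")
	}
	n, err := strconv.Atoi(getenv("AGENTS_PER_INSTANCE"))
	if err != nil || n < 1 {
		return config{}, fmt.Errorf("AGENTS_PER_INSTANCE must be a positive integer, got %q", getenv("AGENTS_PER_INSTANCE"))
	}
	cfg.AgentsPerInstance = n
	return cfg, nil
}

func main() {
	// Illustrative values only; a map stands in for the process environment.
	env := map[string]string{
		"BUILDKITE_AGENT_TOKEN_SSM_KEY": "/example/agent-token",
		"BUILDKITE_QUEUE":               "default",
		"AGENTS_PER_INSTANCE":           "2",
		"ASG_NAME":                      "example-asg",
	}
	cfg, err := loadConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("queue=%s asg=%s agents/instance=%d\n", cfg.Queue, cfg.ASGName, cfg.AgentsPerInstance)
}
```

Taking getenv as a parameter keeps the validation testable without touching the real process environment.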

Running locally for development

$ aws-vault exec my-profile -- go run . \
  --asg-name elastic-runners-AgentAutoScaleGroup-XXXXX \
  --agent-token "$BUILDKITE_AGENT_TOKEN"

Using Clusters

The BUILDKITE_AGENT_TOKEN is scoped to a specific cluster. It's best to create a unique token for the cluster being targeted by the scaler.

The scaler is set up automatically by the Elastic CI Stack's CloudFormation templates, which reference the agent token and a queue name. A Lambda function running the scaler is then generated using these references (e.g., BUILDKITE_AGENT_TOKEN_SSM_KEY and BUILDKITE_QUEUE).
