(Hetzner) Sometimes the server is not destroyed #129

Open

HenkVanMaanen opened this issue Feb 2, 2023 · 1 comment

@HenkVanMaanen

Once in a while the created instance on Hetzner does not get destroyed. This results in a big bill from Hetzner because the server keeps running.

Could it be that the autoscaler forgets that some servers exist when the autoscaler or the git server gets restarted?

This is our config:

drone-autoscaler:
    image: drone/autoscaler:1.8.2
    restart: unless-stopped
    volumes:
      - drone_autoscaler_data:/data
    environment:
      - DRONE_POOL_MIN=0
      - DRONE_POOL_MAX=4
      - DRONE_POOL_MIN_AGE=1h
      - DRONE_CAPACITY_BUFFER=0
      - DRONE_AGENT_CONCURRENCY=5
      - DRONE_SERVER_PROTO=https
      - DRONE_SERVER_HOST=REDACTED
      - DRONE_SERVER_TOKEN=${DRONE_SERVER_TOKEN}
      - DRONE_AGENT_TOKEN=${DRONE_RPC_SECRET}
      - DRONE_HETZNERCLOUD_DATACENTER=${DRONE_HETZNERCLOUD_DATACENTER}
      - DRONE_HETZNERCLOUD_IMAGE=ubuntu-20.04
      - DRONE_HETZNERCLOUD_TYPE=cx51
      - DRONE_HETZNERCLOUD_SSHKEY=${DRONE_HETZNERCLOUD_SSHKEY}
      - DRONE_HETZNERCLOUD_TOKEN=${DRONE_HETZNERCLOUD_TOKEN}
      - DRONE_INTERVAL=10s
      - DRONE_LOGS_DEBUG=false
@sdarwin commented Jul 28, 2024

Here is one common cause of "server is not destroyed". It is not necessarily the only one.

Check the logs of the autoscaler container. Example:
{"id":"seUyIeJvPpjurPuw","level":"debug","max-pool":300,"min-pool":0,"msg":"check capacity","pending-builds":4,"running-builds":36,"server-buffer":0,"server-capacity":42,"server-count":42,"time":"2023-06-20T15:42:30Z"}
There are 4 undead agents that correspond to the "pending-builds":4.

I have a hypothesis (not certain) that this is caused by the auto-cancel features: "Auto cancel pull requests" (automatically cancel pending pull request builds), "Auto cancel pushes" (automatically cancel pending push builds), and "Auto cancel running" (automatically cancel running builds if a newer commit is pushed). Maybe multiple builds were cancelled in quick succession and the autoscaler got confused; some sort of race condition.

"Solution":

Run this database query against the Drone database:

update stages set stage_status='killed'
where stage_id in
(select stage_id from stages s
 join builds b on s.stage_build_id = b.build_id
 where b.build_status not in ('running') and s.stage_status='pending');
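Before running the update, a read-only check can confirm which stages would be affected. This is a minimal sketch assuming the same stages/builds schema used in the query above:

-- Sketch: list pending stages whose parent build is no longer running.
-- Uses the same stages/builds tables referenced in the update above.
select s.stage_id, s.stage_status, b.build_id, b.build_status
from stages s
join builds b on s.stage_build_id = b.build_id
where b.build_status not in ('running')
  and s.stage_status = 'pending';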

Ideally, the real source of the bug could be found. Otherwise, Drone itself could run this cleanup on a schedule. This is giving me the idea to set up a Postgres cron task on the server.
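For the cron idea, here is a minimal sketch using the pg_cron extension, assuming Drone runs on Postgres and pg_cron (1.3 or newer, for named jobs) is installed in the Drone database; the job name and the 15-minute schedule are arbitrary choices:

-- Sketch, assuming the pg_cron extension is available in the Drone database.
-- Schedules the cleanup query every 15 minutes; the job name is arbitrary.
CREATE EXTENSION IF NOT EXISTS pg_cron;

SELECT cron.schedule(
  'drone-kill-orphaned-stages',   -- job name (pg_cron >= 1.3)
  '*/15 * * * *',                 -- every 15 minutes
  $$update stages set stage_status='killed'
    where stage_id in
    (select stage_id from stages s
     join builds b on s.stage_build_id = b.build_id
     where b.build_status not in ('running') and s.stage_status='pending')$$
);

By default pg_cron runs jobs in the database named by cron.database_name, so this assumes that setting points at the Drone database (or that the job is created there).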
