Automatically remove dead AutoScale nodes from Chef Server

So you’ve created an autoscaling group in AWS. When it scales up, your nodes automatically authenticate with Chef, configure themselves and deploy successfully. That’s great! At least until you need to scale down again. Chef nodes are good at telling the server they exist, but not so great at telling it they’ve checked out. This becomes a real problem if, for example, you try to use knife to issue commands to all servers of a particular role. You’ll end up with an ever-growing pile of stale data about nodes that no longer exist. So what can you do to fix this?

Luckily, AWS provides most of the tools for you; you just need to stitch them together.

What you will need

  • An AWS autoscale group
  • An AWS Simple Notification Service (SNS) topic
  • An AWS Simple Queue Service (SQS) queue
  • A script to consume the queue

Topics and Queues

First of all, create your notification topic in SNS, giving it a title (I used AutoScaleDown) and a display name. You can do this in either the web console or the command line tools. Next, create a queue in SQS, giving it a name (I used DeregQueue) and leaving all other options at their defaults.


Now that you have the notification topic and the queue, you need to “subscribe” your queue to the topic, so that any notification that comes into the topic is passed on to the message queue. The easiest way to do this is to note the queue’s ARN (Amazon Resource Name) and then “Create Subscription” for the topic in the SNS section of the console.
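One thing that may trip you up here: the queue only accepts messages from the topic if its access policy allows it. Depending on how the subscription is created, that policy may or may not be added for you. As a rough sketch of what such a policy looks like (the account number and region in the ARNs are made up for illustration; the topic and queue names are the ones used above):

```ruby
require 'json'

# Build an SQS access policy that lets one SNS topic deliver to one queue.
# Both ARNs are supplied by the caller; the ones used below are placeholders.
def sns_to_sqs_policy(queue_arn, topic_arn)
  {
    'Version' => '2012-10-17',
    'Statement' => [
      {
        'Sid' => 'AllowSnsTopicToSendMessage',
        'Effect' => 'Allow',
        'Principal' => { 'Service' => 'sns.amazonaws.com' },
        'Action' => 'sqs:SendMessage',
        'Resource' => queue_arn,
        # Only accept messages that originate from this particular topic.
        'Condition' => { 'ArnEquals' => { 'aws:SourceArn' => topic_arn } }
      }
    ]
  }
end

# Hypothetical ARNs, for illustration only.
puts JSON.pretty_generate(
  sns_to_sqs_policy(
    'arn:aws:sqs:us-east-1:123456789012:DeregQueue',
    'arn:aws:sns:us-east-1:123456789012:AutoScaleDown'
  )
)
```

The printed JSON can be pasted into the queue’s access policy in the SQS console if the test message described below never turns up.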

You can test that the subscription is functioning by sending a test message to the topic, and checking that your message gets onto the queue. If you do this, do not forget to delete the message, as it will interfere later on.

AutoScaling Notifications

Now that you have a notification topic, you should be able to create a notification for your autoscaling group to send to this topic. This can be done in the EC2 section of the AWS Console, or with the command line tools. Either way, ensure that the notification happens only when an instance terminates.
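If you would rather script this step, the setting that matters is the notification type: limiting it to `autoscaling:EC2_INSTANCE_TERMINATE` is what keeps launch events off the queue. Here is a minimal sketch of the parameters involved. The group name and topic ARN are placeholders, and the helper name is my own invention:

```ruby
# Build the parameters for an Auto Scaling notification configuration that
# fires on instance termination only. The helper name is hypothetical.
def termination_notification_params(group_name, topic_arn)
  {
    auto_scaling_group_name: group_name,
    topic_arn: topic_arn,
    notification_types: ['autoscaling:EC2_INSTANCE_TERMINATE']
  }
end

# Placeholder values, for illustration only.
params = termination_notification_params(
  'example-group',
  'arn:aws:sns:us-east-1:123456789012:AutoScaleDown'
)
p params

# Assuming the aws-sdk-autoscaling gem is installed and a region and
# credentials are configured, the hash could then be passed on like this:
#   require 'aws-sdk-autoscaling'
#   Aws::AutoScaling::Client.new.put_notification_configuration(**params)
```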

Consuming the queue

So at this point, when your autoscale group scales down and terminates an instance it should trigger a message to your topic, which should then end up on the queue endpoint. This is great, but there needs to be something consuming these messages. You could use any viable scripting language you like, although I found that Ruby was the easiest tool for the job as it has decent AWS libraries and execution of shell commands is a breeze.

The messages are in JSON format, and contain various details about the scale down event that just occurred. The only important bit of information for this example is the AWS instance ID (which doubles up as the chef node / client ID).
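To make that concrete, here is a minimal sketch of pulling the instance ID out of a message body. It assumes the usual SNS-to-SQS delivery, where the body is an SNS envelope whose `Message` field is itself a JSON string, and it falls back to treating the body as the notification if there is no such field. The function name and the sample values are mine:

```ruby
require 'json'

TERMINATE_EVENT = 'autoscaling:EC2_INSTANCE_TERMINATE'

# Return the instance ID from an SQS message body, or nil if the body is
# not a termination notification (for example, a leftover test message).
def terminated_instance_id(sqs_body)
  envelope = JSON.parse(sqs_body)
  return nil unless envelope.is_a?(Hash)

  # SNS wraps the Auto Scaling notification as a JSON string in "Message".
  inner = envelope['Message']
  notification = inner.is_a?(String) ? JSON.parse(inner) : envelope
  return nil unless notification.is_a?(Hash)
  return nil unless notification['Event'] == TERMINATE_EVENT

  notification['EC2InstanceId']
rescue JSON::ParserError
  nil
end

# A cut-down, made-up message in the assumed envelope format.
sample = JSON.generate({
  'Type' => 'Notification',
  'Message' => JSON.generate({
    'Event' => TERMINATE_EVENT,
    'AutoScalingGroupName' => 'example-group',
    'EC2InstanceId' => 'i-0123456789abcdef0'
  })
})
puts terminated_instance_id(sample)
```

Returning nil for anything unexpected means a stray test message is skipped rather than crashing the script.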

The script needs to do three basic tasks: check for messages on the queue, extract the instance ID from each message it finds, and then run the necessary shell commands to remove the nodes from the chef server.
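The last of those tasks might look something like the sketch below. It is an illustration, not the script from this post. It assumes, as described above, that the node and client names are the instance ID, and it takes the command runner as an argument so you can try it without knife installed. The helper names are made up:

```ruby
# Instance IDs are "i-" followed by 8 or 17 hex digits; anything else is
# rejected before it gets anywhere near a shell command.
INSTANCE_ID = /\Ai-(?:[0-9a-f]{8}|[0-9a-f]{17})\z/

# The two knife invocations needed to forget a node; -y skips the prompt.
def knife_commands(instance_id)
  [
    ['knife', 'node', 'delete', instance_id, '-y'],
    ['knife', 'client', 'delete', instance_id, '-y']
  ]
end

# Run both deletes for every valid ID and return the IDs for which both
# commands succeeded, so the caller knows which queue messages are done.
def deregister(instance_ids, runner: ->(cmd) { system(*cmd) })
  instance_ids.select do |id|
    next false unless id.is_a?(String) && id.match?(INSTANCE_ID)

    knife_commands(id).map { |cmd| runner.call(cmd) }.all?
  end
end

# Dry run: print the commands instead of executing them.
dry_run = lambda do |cmd|
  puts cmd.join(' ')
  true
end
deregister(['i-0123456789abcdef0', 'not-an-instance'], runner: dry_run)
```

Passing the command to `system` as separate arguments avoids shell interpolation, and only deleting a message from the queue once its ID comes back from `deregister` means a failed knife call gets retried on the next run.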

I installed my script on the chef server itself, although any system that has knife configured correctly would suffice. I then created a cron job to run the script once every 30 seconds. Works like a charm!

Here is my script in full:


About Matt

Cloud Systems Engineer at Reed Elsevier; cloud computing advocate, rock climber, swing dancer, amateur photographer, professional idiot....

4 Responses to Automatically remove dead AutoScale nodes from Chef Server

  1. Jason Floyd says:

    Line 57 is missing a double quote. I also switched to ‘knife node bulk delete’ and ‘knife client bulk delete’, as my instanceId is part of the hostname, so a regex match worked well.

  2. Matt says:

    Fixed! Thanks for pointing that out

  3. Ishu Gupta says:

    Hi Matt,

    I created an alarm configured with a metric that looks for TerminateInstances in the CloudTrail logs, and I was able to successfully send the messages to SNS, which then passes them on to SQS for the terminated nodes. The problem is that when I read from SQS, the message only has the alarm information instead of the log event info.

    Please look at my complete log here and suggest:

    • Matt says:

      Apologies that I did not see this, it got lost in the piles of spam!

      It seems that you got a solution out of Stack Overflow anyway. For reference, though: John Rotenstein was quite correct. The reason you did not see the information I described in this post is the origin of the event. You were passing through CloudWatch, whereas I configured the AutoScale group to send notifications directly to the SNS topic.
