Automating Snapshot Deletion with Lambda for Cloud Cost Optimization


Overview

Snapshots are an essential part of AWS infrastructure, allowing users to create backups of volumes. However, over time, these snapshots can pile up and result in unnecessary cloud costs. In this blog, we will walk through the process of creating an AWS Lambda function that automatically deletes stale snapshots—those that are no longer associated with any EC2 instance or volume. This will help automate the cleanup process and optimize cloud spending.

Objective

The objective of this blog is to guide you through setting up a Lambda function that detects and deletes snapshots that are not associated with any volume or EC2 instance. By doing so, we can automate the snapshot lifecycle management process and keep AWS costs under control.

Why Cloud Cost Optimization Matters:

  1. Cost Efficiency: Startups and mid-scale organizations move to the cloud to reduce the overhead of managing physical infrastructure. Cloud platforms like AWS allow them to avoid the costs associated with maintaining a data center, such as hardware, servers, and a dedicated team.

  2. Efficient Resource Management: After migrating to the cloud, organizations often face high cloud costs if resources are not used efficiently. A DevOps engineer's role becomes crucial in monitoring and managing resources to avoid unnecessary spending.

Key Cost Optimization Example:

The following examples show how mismanaged resources can lead to unnecessary costs:

  • Unused Resources: Developers or engineers might leave resources like EC2 instances, EBS volumes, S3 buckets, or EKS clusters running without deleting them. These "stale resources" can accumulate and significantly increase cloud costs.

  • Snapshots and Volumes: A developer might create snapshots or volumes without deleting them once the associated resources are no longer in use, leading to continuous charges.

  • EKS Clusters and S3 Buckets: Similarly, forgotten EKS clusters or S3 buckets, if not cleaned up, can result in excess charges.

The Role of DevOps in Cost Optimization:

  • Monitoring Stale Resources: DevOps engineers need to identify and manage stale resources to avoid cloud cost overruns. This can be done by automating the deletion of unused resources.

  • Notifications: DevOps engineers can send notifications (e.g., using SNS) to alert team members about unused resources, encouraging them to clean up.

  • Automatic Deletion: DevOps engineers can automate the deletion of stale resources, which is the focus of this post.

Solution Demonstration:

The demonstration will involve creating an architecture that uses AWS Lambda functions to automate the detection and deletion of stale resources. The Lambda functions will be written in Python using the Boto3 library, which is the AWS SDK for Python. This allows DevOps engineers to interact with AWS services like EC2, S3, and EBS, and automate resource management.

Prerequisites

Before you begin, make sure you have the following:

  • AWS Account with access to EC2 and Lambda.

  • IAM Role with necessary permissions to interact with EC2, snapshots, and Lambda.

  • Basic understanding of AWS Lambda and EC2.

Step-by-Step Guide

Step 1: Launch an EC2 Instance

  1. Navigate to the EC2 Dashboard on the AWS Management Console.

  2. Create a new EC2 instance:

    • Name: Test EC2 Instance.

    • Instance type: t2.micro (free tier eligible).

    • OS: Ubuntu (default settings are sufficient).

    • No changes to additional volumes; the default volume will suffice.

  3. Launch the instance and wait for it to be in the Running state.

Step 2: Verify the Attached Volume

  • As part of the EC2 instance setup, a default EBS volume is attached.

  • This volume will later be used to create a snapshot.

Creating a Snapshot

  1. From the EC2 dashboard, locate the EBS volume attached to your new EC2 instance.

  2. Create a snapshot of this volume:

    • Navigate to the Volumes section.

    • Select the attached volume and choose the Create Snapshot option.

    • Provide a name or description for the snapshot.
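The same snapshot can also be created from code. Here is a minimal Boto3 sketch of the console steps above. The helper name snapshot_volume and the volume ID in the usage comment are placeholders for illustration, not part of the original walkthrough.

```python
def snapshot_volume(ec2, volume_id, description="Test snapshot"):
    """Create a snapshot of one EBS volume and return the new snapshot ID.

    `ec2` is a Boto3 EC2 client, e.g. boto3.client("ec2").
    """
    response = ec2.create_snapshot(VolumeId=volume_id, Description=description)
    return response["SnapshotId"]


# Usage (needs AWS credentials; the volume ID is a placeholder):
#   import boto3
#   snapshot_id = snapshot_volume(boto3.client("ec2"), "vol-0123456789abcdef0")
```

Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at a real account.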


Scenario

  • A developer creates daily snapshots of an EBS volume attached to an EC2 instance.

  • The snapshots are backups or "images" of the volume data, which can be restored later.

  • The developer deletes the EC2 instance and the associated volume, but the snapshots remain, leading to unnecessary costs.

  • The goal is to automate the cleanup of unused snapshots using a Lambda function. Let's create a Lambda function to:

  • Step 1: Fetch all EBS snapshots using AWS API.

  • Step 2: Filter out snapshots that are still attached to active volumes.

  • Step 3: Delete any snapshots that are not attached to active resources.
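Before touching AWS, it can help to see the three steps as plain data handling. The sketch below is a simplified illustration only: the function name select_stale_snapshots and the sample records are made up, and the dictionaries only mimic the shape of a describe_snapshots() response.

```python
def select_stale_snapshots(snapshots, attached_volume_ids):
    """Return the IDs of snapshots whose source volume is not currently attached."""
    return [
        snap["SnapshotId"]
        for snap in snapshots
        if snap.get("VolumeId") not in attached_volume_ids
    ]


# Step 1: pretend these came back from the AWS API (sample data, not real IDs).
sample_snapshots = [
    {"SnapshotId": "snap-a", "VolumeId": "vol-1"},  # volume still attached
    {"SnapshotId": "snap-b", "VolumeId": "vol-2"},  # volume gone or detached
    {"SnapshotId": "snap-c"},                       # no source volume recorded
]

# Step 2: filter out snapshots that belong to attached volumes.
stale = select_stale_snapshots(sample_snapshots, attached_volume_ids={"vol-1"})

# Step 3: these are the IDs the Lambda function would delete.
print(stale)  # ['snap-b', 'snap-c']
```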

Creating and Testing the Lambda Function

You can either assign permissions while creating the Lambda function, or create a role and attach it later.

Scroll down to the code editor, paste the Python code below, click Deploy, and then click Test.

import boto3
from botocore.exceptions import ClientError

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')

    # Collect the IDs of all running EC2 instances (paginated)
    active_instance_ids = set()
    instance_pages = ec2.get_paginator('describe_instances').paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )
    for page in instance_pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                active_instance_ids.add(instance['InstanceId'])

    # Walk through every EBS snapshot owned by this account (paginated)
    snapshot_pages = ec2.get_paginator('describe_snapshots').paginate(OwnerIds=['self'])
    for page in snapshot_pages:
        for snapshot in page['Snapshots']:
            snapshot_id = snapshot['SnapshotId']
            volume_id = snapshot.get('VolumeId')

            if not volume_id:
                # Delete the snapshot if no source volume is recorded for it
                ec2.delete_snapshot(SnapshotId=snapshot_id)
                print(f"Deleted EBS snapshot {snapshot_id} as it was not attached to any volume.")
                continue

            # Check if the volume still exists
            try:
                volume_response = ec2.describe_volumes(VolumeIds=[volume_id])
            except ClientError as e:
                if e.response['Error']['Code'] != 'InvalidVolume.NotFound':
                    raise  # Unexpected error: surface it instead of hiding it
                # The volume associated with the snapshot is not found (it might have been deleted)
                ec2.delete_snapshot(SnapshotId=snapshot_id)
                print(f"Deleted EBS snapshot {snapshot_id} as its associated volume was not found.")
                continue

            # Keep the snapshot only if its volume is attached to a running instance
            attachments = volume_response['Volumes'][0]['Attachments']
            if not any(a.get('InstanceId') in active_instance_ids for a in attachments):
                ec2.delete_snapshot(SnapshotId=snapshot_id)
                print(f"Deleted EBS snapshot {snapshot_id} as it was taken from a volume not attached to any running instance.")

  • The function uses Boto3 to interact with the EC2 service.

  • It first collects the IDs of all running instances using describe_instances(), then lists all snapshots owned by your account using describe_snapshots().

  • Then, for each snapshot, it looks up the source volume using describe_volumes().

  • If the volume no longer exists, or is not attached to a running instance, it deletes the snapshot using delete_snapshot().

Here we trigger the Lambda function manually by creating a test event.

Now, run the test, and you'll see it fails due to permission issues. The Lambda function doesn't have permission to describe the snapshots.

Open the Configuration tab, click Edit, increase the timeout to 10 seconds, and save.

It's best to keep the timeout as low as is practical. Lambda bills for the time your code actually runs (together with the configured memory), so the timeout setting itself is not what you pay for, but a tight limit stops a stuck invocation from running up charges.
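If you prefer to change the timeout from code rather than the console, a sketch like the following should work. It assumes a Boto3 Lambda client; the helper name set_timeout and the function name in the usage comment are hypothetical.

```python
def set_timeout(lambda_client, function_name, seconds=10):
    """Set the timeout (in seconds) of a Lambda function and return the stored value.

    `lambda_client` is a Boto3 Lambda client, e.g. boto3.client("lambda").
    """
    response = lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=seconds,
    )
    return response["Timeout"]


# Usage (needs AWS credentials; "snapshot-cleanup" is a made-up function name):
#   import boto3
#   set_timeout(boto3.client("lambda"), "snapshot-cleanup", seconds=10)
```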

Now, let's give Lambda permission to describe snapshots. Go to Configuration → Permissions.

Click the execution role name, and you will be taken to the IAM role page.

You will see a default policy attached to the role. Click Add permissions and choose Attach policies.

If you cannot find an existing policy that covers describing and deleting snapshots, create one with just those actions.

Click Create policy, select EC2 as the service, and use the filter to choose the actions to allow from the List, Read, and Write groups (here, DescribeSnapshots and DeleteSnapshot).

Click next, provide the name of the policy, and create it.

Now, return to the default Lambda execution role and attach the policy you created earlier.

Now let us run the execution code

The run now reports an authorization error when calling describe_instances, which is expected: the function also lists EC2 instances. To verify that each snapshot belongs to a volume attached to an EC2 instance, it needs permission to describe both instances and volumes.

The policy created above grants ec2:DeleteSnapshot and ec2:DescribeSnapshots, but not ec2:DescribeInstances or ec2:DescribeVolumes, which is why the call to describe_instances() fails.

You can either attach the AWS managed policy AmazonEC2FullAccess or create a custom policy. It is always best to grant only the least privileges necessary.

Create a custom policy with ec2:DescribeInstances and ec2:DescribeVolumes.
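For reference, here is a sketch of what a least-privilege policy document for this function could look like, built in Python so it can be printed as JSON and pasted into the IAM policy editor. It combines the four actions from this walkthrough into a single statement, which is my own simplification; the walkthrough creates them as two separate policies. Resource is left as "*" because the EC2 Describe actions do not support resource-level restrictions.

```python
import json


def build_snapshot_cleanup_policy():
    """Return an IAM policy document covering the four EC2 calls the function makes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeSnapshots",
                    "ec2:DescribeVolumes",
                    "ec2:DescribeInstances",
                    "ec2:DeleteSnapshot",
                ],
                "Resource": "*",
            }
        ],
    }


policy_json = json.dumps(build_snapshot_cleanup_policy(), indent=2)
print(policy_json)
```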

Now, return to the Lambda execution role and attach this new policy as well.

This time, execution has succeeded.

Your snapshot will still be there, because its volume is still attached to the running instance.

Now terminate the EC2 instance. With the default settings, its root volume is deleted automatically along with it.

You can also add one more condition to the code. Instead of deleting a snapshot immediately, you can check its age or send a notification asking the team, "Hey, can you confirm if I can delete this snapshot?" Suppose the team agrees on a 30-day threshold: a snapshot may be deleted once it is 30 days old, or once it has gone unused for 30 days. That rule becomes one extra if condition placed before the delete_snapshot() call.
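A minimal sketch of such an age check is shown here. It relies on the StartTime field that describe_snapshots() returns for each snapshot (a timezone-aware datetime in Boto3). As far as I know, the response has no "last used" timestamp, so this version measures age since creation only. The helper name is_older_than is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_older_than(snapshot, days=30, now=None):
    """Return True if the snapshot was started at least `days` days ago.

    `snapshot` is one entry from describe_snapshots()["Snapshots"].
    `now` can be passed in for testing; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    return now - snapshot["StartTime"] >= timedelta(days=days)


# Inside the Lambda loop, the guard could then look like:
#   if is_older_than(snapshot, days=30):
#       ec2.delete_snapshot(SnapshotId=snapshot["SnapshotId"])
```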

Now, let's run the Lambda function again. This time, the snapshot should be deleted, because its volume no longer exists after we terminated the instance.

Great, now the snapshot is deleted. This is how you can manage cloud cost optimization in your AWS accounts as a DevOps and Cloud Engineer.

Another example

Create a volume and take a snapshot of it. Then, run the code to see what happens.

The snapshot has been deleted because the code deletes any snapshot whose source volume is not attached to a running instance.

Schedule the Lambda Function Using CloudWatch

You can set up an automatic schedule for the Lambda function to run periodically, ensuring that stale snapshots are cleaned up regularly.

  1. Create a CloudWatch Rule:

    • Navigate to the CloudWatch console (in current consoles, scheduled rules are managed under Amazon EventBridge).

    • Under Rules, click Create rule.

  2. Schedule the Rule:

    • Select Event Source as Schedule and define the schedule (e.g., run every day at midnight UTC using a cron expression: cron(0 0 * * ? *)).

  3. Set the Target:

    • Choose the Lambda function as the target.

    • Save the rule to ensure it runs as scheduled.
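The same schedule can be set up from code. The sketch below assumes a Boto3 "events" client and a Boto3 "lambda" client. The helper name schedule_daily, the default rule name, and the names in the usage comment are placeholders. Besides creating the rule and the target, it grants the rule permission to invoke the function, which the console normally does for you.

```python
def schedule_daily(events, lambda_client, function_name, function_arn,
                   rule_name="daily-snapshot-cleanup"):
    """Create a daily schedule rule that invokes the given Lambda function."""
    # 1. Create (or update) the rule; the cron expression is evaluated in UTC.
    rule = events.put_rule(
        Name=rule_name,
        ScheduleExpression="cron(0 0 * * ? *)",
        State="ENABLED",
    )

    # 2. Allow the rule to invoke the function.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )

    # 3. Point the rule at the function.
    events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])
    return rule["RuleArn"]


# Usage (needs AWS credentials; the function name is a placeholder):
#   import boto3
#   fn = boto3.client("lambda").get_function(FunctionName="snapshot-cleanup")
#   schedule_daily(boto3.client("events"), boto3.client("lambda"),
#                  "snapshot-cleanup", fn["Configuration"]["FunctionArn"])
```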

Final Verification

Once everything is set up:

  • Check the EC2 console for any snapshots that should have been deleted.

  • Verify that the Lambda function runs as scheduled.

  • Optionally, you can review the CloudWatch Logs for further confirmation that the snapshots are being deleted as expected.
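To check the logs without opening the console, a small sketch along these lines could be used. It assumes a Boto3 "logs" client and the default Lambda log group name /aws/lambda/<function name>, and it searches for the "Deleted EBS snapshot" text that the function prints. The helper name is hypothetical, and only the first page of results is read.

```python
def deleted_snapshot_messages(logs, function_name):
    """Return log messages in which the function reported a deleted snapshot."""
    response = logs.filter_log_events(
        logGroupName=f"/aws/lambda/{function_name}",
        filterPattern='"Deleted EBS snapshot"',
    )
    return [event["message"] for event in response["events"]]


# Usage (needs AWS credentials; the function name is a placeholder):
#   import boto3
#   for line in deleted_snapshot_messages(boto3.client("logs"), "snapshot-cleanup"):
#       print(line)
```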

Conclusion

By automating snapshot deletion with Lambda, you can efficiently manage your AWS resources and reduce unnecessary cloud costs. This solution ensures that only the snapshots that are actively used by volumes or EC2 instances remain, while stale snapshots are automatically cleaned up.

For more complex use cases, you can extend this solution to include more conditions, such as deleting snapshots older than a certain threshold or those associated with specific tags. Stay tuned for more advanced AWS automation practices in future posts!
