Speeding up Python S3 Copy/Delete function

I wrote a function to clean up old files in one of my S3 buckets. The function grabs a list of objects under a prefix, checks whether they’re more than a week old, and if so copies them to another folder and deletes them from the original location.

The issue is that I have millions of older files, and this lambda only processes about 6 thousand every 15 minutes. I need to dramatically speed up the runtime of the lambda to process everything in a reasonable timeframe.

Here’s the function code:

# This function cycles through an S3 bucket and moves any legacy files to a
# processing folder. It also allows us to exclude certain filetypes; if you have
# any other filetypes you'd like to exclude, add an exclusion alongside the
# "SPECIAL" check in moveAllFiles().

import json
import boto3
import botocore
import random
import time
import os
from datetime import datetime, timedelta, timezone


s3 = boto3.resource('s3')

# Top-level key prefix; must match the "Prefix/<year>/..." layout used for the destination keys below
prefix = "Prefix/"


def moveAllFiles(bucket, year):
    count = 0
    dayBuffer = 7
    # Anything last modified before this cutoff is old enough to move
    cutoff = datetime.now(timezone.utc) - timedelta(days=dayBuffer)
    # For every object in the bucket under the prefix
    for obj in bucket.objects.filter(Prefix=prefix + year + "/"):
        filepath = obj.key
        # If the item isn't a special file
        if "SPECIAL" not in filepath:
            filepathlist = filepath.split("/")
            # if the file's year matches the year we're processing, just a safety check
            if filepathlist[1] == year:
                folder = (filepathlist[2])
                # If the file isn't already processed
                if folder != "Processed":
                    copy_source = {
                        'Bucket': bucket.name,
                        'Key': filepath
                    }
                    # Compare against the cutoff computed above; last_modified is timezone-aware UTC
                    if obj.last_modified < cutoff:
                        try:
                            s3.meta.client.copy(copy_source, bucket.name,
                                        prefix + year + "/Processed/" + filepathlist[2] + "/" + filepathlist[3])
                            print("file " + filepath + " moved to processed folder")
                        except Exception as e:
                            print("error while copying")
                            print(e)
                            raise
                        s3.Object(bucket.name, filepath).delete()
                        print("file " + filepath + " deleted")
                        count = count + 1
                        print(count)
    return count


def lambda_handler(event, context):
    # Change the year variable to handle different subdirectories
    year = "2008"
    count = 0
    try:
        bucket = s3.Bucket(os.environ['BUCKET_NAME'])
        count = moveAllFiles(bucket, year)
        print("removed " + str(count) + " files from " + year)
    except Exception as e:
        # Any failure in the copy/delete loop lands here; per-file progress is printed inside moveAllFiles
        print("error while processing " + year)
        print(e)
        print("removed " + str(count) + " files from " + year)
        raise

For some clarification on why this runs the way it does: our folders also contain filetypes that should not be moved, and the function double-checks that the file in question isn’t one of those before copying/deleting. We also need to keep relatively new files, so I added a check to only move objects older than one week. I think I could meet all of my requirements with lifecycle rules, but my partner would prefer to run the lifecycle rule on only the processing folder, not the main folders.
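For reference, scoping a lifecycle rule to just the processing folder would look roughly like this. This is only a sketch: the rule ID and the 30-day expiration are placeholders, and it reuses the s3 resource and bucket name from the code above.

# Sketch: expire objects only under the year's Processed/ prefix.
# 'expire-processed' and the 30-day window are placeholders, not settings we actually use.
s3.meta.client.put_bucket_lifecycle_configuration(
    Bucket=os.environ['BUCKET_NAME'],
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-processed',
            'Filter': {'Prefix': 'Prefix/2008/Processed/'},
            'Status': 'Enabled',
            'Expiration': {'Days': 30},
        }]
    }
)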

So it sounds to me like the main issue here is all of the network calls. What is the correct way to do this? Build a list of everything that needs to be moved and then use some kind of promise structure to batch them all out? Boto3 doesn’t allow for bulk upload/delete as far as I can tell. I’m also not sure that an entire list can even be built within the 15-minute timeframe.
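Here is roughly what I’m picturing (untested; it reuses the module-level s3 and prefix from the code above, and should_move() is a stand-in for the same SPECIAL/Processed/age checks used in moveAllFiles):

from concurrent.futures import ThreadPoolExecutor

def move_one(bucket_name, key, dest_key):
    # Copy then delete a single object; errors surface via the returned future
    s3.meta.client.copy({'Bucket': bucket_name, 'Key': key}, bucket_name, dest_key)
    s3.meta.client.delete_object(Bucket=bucket_name, Key=key)

def move_batch(bucket, year, workers=20):
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for obj in bucket.objects.filter(Prefix=prefix + year + "/"):
            if should_move(obj):  # stand-in for the SPECIAL/Processed/age checks above
                parts = obj.key.split("/")
                dest = prefix + year + "/Processed/" + parts[2] + "/" + parts[3]
                futures.append(pool.submit(move_one, bucket.name, obj.key, dest))
    moved = 0
    for f in futures:
        f.result()  # re-raise any copy/delete error
        moved += 1
    return moved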

Basically, I’m looking for ways to speed up this process. I’m fine with babysitting it a bit, but the number of files processed per 15 minutes really needs to be 10x’d for this to be a reasonable solution.

Thanks for your time!

Answer

From the boto3 docs, both copy() and meta.client.copy() result in “a managed transfer which will perform a multipart copy in multiple threads if necessary”. That sounds like it is downloading each object to the computer issuing the copy command, and then uploading the object back to S3.

If your objects are less than 5GB in size, then you can use the copy_from() method to copy objects from one S3 location to another without having to download and re-upload the object contents, which should dramatically increase the speed of your application.
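As a rough sketch (untested, reusing the bucket/key naming from your function), the copy-and-delete step would become something like:

import boto3

s3 = boto3.resource('s3')

def move_object(bucket_name, source_key, dest_key):
    # copy_from issues a server-side copy: S3 duplicates the object internally,
    # so the contents never pass through the Lambda
    s3.Object(bucket_name, dest_key).copy_from(
        CopySource={'Bucket': bucket_name, 'Key': source_key}
    )
    s3.Object(bucket_name, source_key).delete()

Inside moveAllFiles() you would call move_object(bucket.name, filepath, ...) with the same destination key you already build from filepathlist, in place of the meta.client.copy()/delete() pair.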