How to Partition CloudFront Logs for Athena!!!

When you enable CloudFront logging to S3, it creates files named like this:

DISTRIBUTION_ID.Year-Month-Day-Hour.Random_id.gz

Problem: If you create an Athena table pointing at a single folder that holds all the logs, every query scans all of the data before giving you specific results. So even if you only want a single day's logs, Athena reads every record to find them, which means a huge bill once you have terabytes of data, plus longer processing time.

Solution: Partition the data. Partitioning in this case creates a hierarchy like this:

structured > DISTRIBUTION_ID > year=YYYY > month=MM > day=DD > hour=HH

in the S3 bucket. With this layout, if you only need to scan a single day's logs you can, it is fast, and you pay only for processing that day's data. The same applies per hour, per month, per year, or for a whole distribution.
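For example, once you have an Athena table partitioned by year, month, day and hour on top of the structured/ prefix, a query that filters on those partition columns only reads that day's objects. Here is a minimal boto3 sketch of such a query; the database, table and results-bucket names are placeholders, and the table itself is assumed to already exist with those partition columns:

import boto3

athena = boto3.client('athena', region_name='ap-south-1')

# Placeholder names -- replace with your own database, table and results location
query = """
SELECT *
FROM cloudfront_logs
WHERE year = '2021' AND month = '01' AND day = '01'
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'logs_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/'},
)
print(response['QueryExecutionId'])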

To achieve this we need to divide our approach into two segments:

  1. Migrate New Logs Automatically into the partition pattern

We can achieve this using a Lambda function, and that Lambda function requires:

a. An IAM role with a full S3 access policy

Select the policy AmazonS3FullAccess; it is predefined (managed) by AWS.
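If you prefer attaching the policy from code rather than the console, a boto3 sketch like this should work (the role name here is only a placeholder for the role you create for the Lambda):

import boto3

iam = boto3.client('iam')

# Placeholder role name -- use the role you created for the Lambda function
iam.attach_role_policy(
    RoleName='cloudfront-log-partition-role',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
)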

b. Lambda Code (Tested with Python 3.8)

import boto3


def lambda_handler(event, context):
    s3 = boto3.client('s3', region_name='ap-south-1')
    print("I'm trying to work with your logs")
    # Iterate over all records in the event
    for record in event['Records']:
        # Get the S3 bucket name
        bucket = record['s3']['bucket']['name']
        # Get the source S3 object key
        key = record['s3']['object']['key']

        # Skip objects that are already in the partitioned layout
        if key.startswith('structured/'):
            continue

        # Use the last path component so keys with or without a prefix both work
        filename = key.split('/')[-1]
        distro = filename.split('.')[0]
        # Get the yyyy-mm-dd-hh part from the source object name
        dateAndHour = filename.split('.')[1]
        year, month, day, hour = dateAndHour.split('-')

        # Create the destination key in the partitioned layout
        dest = 'structured/{}/year={}/month={}/day={}/hour={}/{}'.format(
            distro, year, month, day, hour, filename
        )
        # Display source/destination in the Lambda output log
        print("- src: s3://%s/%s" % (bucket, key))
        print("- dst: s3://%s/%s" % (bucket, dest))
        # Copy the S3 object into the partitioned location
        s3.copy_object(Bucket=bucket, Key=dest, CopySource=bucket + '/' + key)
        # Delete the source S3 object
        # Leave this commented out if a copy is sufficient
        # s3.delete_object(Bucket=bucket, Key=key)
    return {
        'statusCode': 200,
        'body': "",
        'headers': {'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*'}
    }

Test JSON:

{
  "Records": [{
    "s3": {
      "bucket": { "name": "BUCKET_NAME" },
      "object": {
        "key": "CLOUDFRONT_ID.2021-01-01-00.RANDOM_ID.gz"
      }
    }
  }]
}
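Besides the console Test tab, you can also push this test event to the function with boto3, roughly like this (the function name is a placeholder for whatever you named your Lambda):

import json
import boto3

lambda_client = boto3.client('lambda', region_name='ap-south-1')

test_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "BUCKET_NAME"},
            "object": {"key": "CLOUDFRONT_ID.2021-01-01-00.RANDOM_ID.gz"}
        }
    }]
}

# Placeholder function name -- use the name you gave the Lambda
response = lambda_client.invoke(
    FunctionName='partition-cloudfront-logs',
    Payload=json.dumps(test_event),
)
print(response['Payload'].read().decode())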

c. Trigger Lambda On S3 Bucket Object Creation

Click the Add trigger button, select S3, then select your bucket.

Event Type: All Object create events

Prefix: set this only if your logs are under a path; leave it blank if your logs are in the root of the bucket

Suffix: .gz

Note: you can use multiple triggers if you have logs from more than one CloudFront distribution in separate folders or separate buckets.
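The same trigger can also be configured from code if you prefer. A rough boto3 sketch follows; the bucket name and Lambda ARN are placeholders, and the function must already have a permission that allows S3 to invoke it (the console adds this for you):

import boto3

s3 = boto3.client('s3', region_name='ap-south-1')

# Placeholder bucket and function ARN -- replace with your own values
s3.put_bucket_notification_configuration(
    Bucket='BUCKET_NAME',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:ap-south-1:123456789012:function:partition-cloudfront-logs',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                # Add a 'prefix' rule as well if your logs live under a path
                {'Name': 'suffix', 'Value': '.gz'},
            ]}},
        }]
    },
)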

2. Migrate Old Logs into the partition pattern

The previous section set up automatic partitioning of future logs, but we also need to migrate the old logs. For that we have a Python script, and these are the steps to use it:

a. Create an EC2 instance with a new IAM role for EC2 using the same policy.

b. SSH into the instance, install python3 and pip3, then run pip3 install boto3.

c. Check that you can access S3, for example by listing the bucket.

d. Create a file partition-old-s3-logs.py

import boto3
import sys

bucket = ''           # Bucket name, same as set on the trigger
path = ''             # Same as the prefix set on the trigger (leave blank for the bucket root)
id = 'E1O********'    # CloudFront distribution ID, used as a prefix to identify the logs

day = sys.argv[1]
if len(sys.argv) == 3:
    hour = sys.argv[2]
    prefix = path + '/' + id + "." + day + "-" + str(hour)
else:
    prefix = path + '/' + id + "." + day
# Drop the leading '/' when no path is set so the prefix matches objects in the bucket root
prefix = prefix.lstrip('/')

s3 = boto3.client('s3', region_name='ap-south-1')
response = s3.list_objects(Bucket=bucket, Prefix=prefix)


def copyToNew(key):
    # Use the last path component so keys with or without a prefix both work
    filename = key.split('/')[-1]
    distro = filename.split('.')[0]
    dateAndHour = filename.split('.')[1]
    year, month, day, hour = dateAndHour.split('-')
    dest = 'structured/{}/year={}/month={}/day={}/hour={}/{}'.format(
        distro, year, month, day, hour, filename
    )
    print("- src: s3://%s/%s" % (bucket, key))
    print("- dst: s3://%s/%s" % (bucket, dest))
    s3.copy_object(Bucket=bucket, Key=dest, CopySource=bucket + '/' + key)


for r in response['Contents']:
    copyToNew(r['Key'])

Usage:

# Partition a whole day's records

python3 partition-old-s3-logs.py 2021-01-01

# Partition a specific hour of the day

python3 partition-old-s3-logs.py 2021-01-01 00

Since the script does not paginate the S3 listing, it cannot partition a full month or year in one run. You can still achieve that by looping over days in a separate file and running partition-old-s3-logs.py for each day via

 os.system("python3 partition-old-s3-logs.py  "+day)
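A rough sketch of such a driver script, assuming you want to cover January 2021 (adjust the date range to your own logs):

import os
from datetime import date, timedelta

# Walk every day of the range and run the partition script once per day
start = date(2021, 1, 1)
end = date(2021, 1, 31)

current = start
while current <= end:
    day = current.strftime('%Y-%m-%d')
    os.system("python3 partition-old-s3-logs.py " + day)
    current += timedelta(days=1)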

Thank you for reading this article. If you have suggestions to improve it, please leave a comment.

Have a great day.

#DevOps