AWS EKS Fargate in the Video Conversion Project
Updated: Oct 4, 2020
One of our long term customers has raised a requirement for a simple, robust, scalable, and cost-effective video transcoding solution to complement their existing services. With the customer’s R&D team, we’ve designed and built a serverless video transcoding platform that utilizes Fargate for EKS, SQS, and Lambda to deliver this service to the end-users.
All phases of the project were delivered to the customer as IaC (Infrastructure as Code, Terraform) and Jenkins pipelines code. All infrastructure-related artifacts are managed in Git.
The video transcoding project opened a new line of business for our customers, allowing them to provide a better, faster, and cheaper service to their platform’s end user.
Since the Lambda function has a maximum timeout of 15 minutes and the client had its own docker image for this service already, our choice fell on the EKS Fargate, a serverless solution based on Kubernetes.
Also, we have chosen the AWS EKS Fargate to implement the transcoding platform because with EKS Fargate it is possible to easily build a serverless environment that does not need to be maintained or monitored, everything is managed by the AWS cloud. By utilizing this AWS service we do not need to keep constantly running instances, Fargate instances will be connected automatically exactly when they are needed.
There are also some specifics when working with AWS EKS Fargate that we have had to take into account during this project:
Separate EKS NodeGroup for the operation of service pods is required;
Additional steps are required to use IAM roles;
There is no standard out of the box solution for logging;
Here is the additional list of AWS EKS Fargate limitations from the AWS website:
There is a maximum of 4 vCPU and 30Gb memory per pod;
Currently, there is no support for stateful workloads that require persistent volumes or file systems;
you cannot run DaemonSets, Privileged pods, or pods that use HostNetwork or HostPort;
the only load balancer you can use is an Application Load Balancer
In the end, our final high-level design scheme was as follows:
The client calls the Gateway API and uploads the source video to the S3 Bucket
The Gateway API creates a message in SQS with a parameter that contains the path to the downloaded video file.
SQS is triggering a Lambda Function, which creates a Kubernetes job and creates a message in the second SQS.
The application takes the message from SQS, receives the path to the file from the parameter, transcodes it, loads the already converted video file into the S3 Bucket, and removes the message from the queue.
If the job result is unsuccessful or execution timeout has happened, a message from the "Running" queue enters the dead letter queue for further analysis.
Picture 1 - Solution Design Schematics
Python was chosen as the programming language for the Lambda function.
As we said earlier, the following tasks were assigned to our lambda function:
read the parameter from the SQS message that triggered it;
create a message in the “Running” queue with the parameter that was received from the message trigger;
create Kubernetes batch-job;
We planned to make a kubectl analogy from the Lambda function, for this we decided to use Kubernetes Python Client. For work with SQS we used the PIP package “boto3”.
First we need get parameter from SQS message, so we written following function:
# Get input value from SQS message
message = (event["Records"]["messageAttributes"]["input"])
key = message.get('stringValue')
Where “input” is the name of a parameter created by API Gateway and the content path to the video file in S3 Bucket.
Second, we need to create a message in the “Running” queue with the same parameter.
# Create message in SQS Running
sqs = boto3.client('sqs', region_name=awsRegion)
message = sqs.send_message(
print("Sent message to Running SQS")
key = (message["MessageId"])
print("Message ID: " + key)
Values “awsRegion” and “sqsRunningUrl” were stored in the Lambda environment.
After we read the parameter from the SQS message and created the message in the “Running” queue, we are ready to create a Kubernetes job.
jobName = "transcoder-" + randomString()
namespace = "default"
inputFile = get_input_value(event)
messageId = sqs_running_create_message(inputFile)
# Configureate Pod template container
container = client.V1Container(
image=ecrURL + ":" + ecrTag,
# Create and configure a spec section
template = client.V1PodTemplateSpec(
# Create the specification of deployment
spec = client.V1JobSpec(
# Instantiate the job object
job = client.V1Job(
print("Creating job: "+ str(jobName))
print("Image Tag: " + str(ecrTag))
print("Input: " + str(inputFile))
Values “ecrURL” and “ecrTag” we stored in Lambda environment.
As you can see, we add variables to the pod environment variables “'SQS_QUEUE_URL” and “AWS_REGION” so that after successful conversion the application can remove the message from the queue, also we add the “AWS_S3_BUCKET” variable.
We tested the python script locally and it worked perfectly, but after loading the code into the Lambda function we encountered the following problem - the Python package “Kubernetes” was missing in the Lambda Python. To avoid this problem, we created a virtual environment, installed the “Kubernetes” package, and uploaded the contents of the virtual environment into the Lambda function. After that, the Python script successfully started working in the Lambda function.
And so, the Lambda function works, it reads the parameter from the SQS message, creates a message in the “Running” queue and creates a Kubernetes job, so we went on to test Kubernetes job. And we saw that the server is constantly restarting, we looked at the logs and saw that the application does not have access to the SQS, although the IAM role that the Fargate uses has an attached policy with permission to this SQS. We began to examine in more detail and noticed that the application is trying to access the address "169.254.169.254", which is the address for receiving metadata for EC2 instances, and since we have Fargate, this did not work for us. We looked at all the possible endpoints to get metadata, there are 3 of them in all, this is for EC2 Instance, Lambda and ECS Fargate (unfortunately this does not work for EKS Fargate). We could use “AWS_ACCESS_KEY_ID” and “AWS_SECRET_ACCESS_KEY” inside the container, but it is unsafe and we do not recommend anyone to do this. We decided to use AWS OpenID Connect and Kubernetes service account.
In the AWS console, go to our EKS Cluster and copy the value of the variable "OpenID Connect provider URL"
Go to the “AIM” -> “Identity providers” and create a new provider.
Provider Type - “OpenID Connect”
Provider URL - Value from “OpenID Connect provider URL”
Audience - “sts.amazonaws.com”
And lastly, it remains to create IAM Role with the following contents in the Trust Policy:
Where you need to replace the following:
AWS_REGION - your AWS Region
AWS_ACCOUNT_ID - your AWS Account ID
XXXXXXXXXXXXXXXX - OpenID for EKS
In the future, you can attach all the necessary policies to this AIM Role.
Kubernetes Service Account
Create a service account using the following manifest:
Where you need to replace the following:
AWS_ACCOUNT_ID - your AWS Account ID
IAM_ROLE - name of AIM Role created in the previous step
Do not forget to set the job to use this service account in the Lambda function.
After that, the application successfully got access to the queue, downloaded the RAW video from S3 bucket, converted, uploaded the converted video to S3 Bucket and deleted the message from the SQS.
Kubernetes Job Cleanup
After all the tests, we noticed that after the successful completion of the Kubernetes job, it goes into the completed state, and therefore the Fargate instance remains in the list of nodes, although it has not been charged for the Fargate instance since the server goes into the state, I would like to clean them up. Kubernetes has the “TTLAfterFinished” option, which clears the pods after they are completed, but it is still in the alpha version, and EKS does not allow the use of alpha futures. Therefore, we decided to use a Cron Job that will run the python script and build a Docker image. You can download it from our repository in the Docker hub. This image can clean up both jobs and the associated pods, and separately the pods. You can also set the timeout time that must elapse after the pod/job has completed before deleting it, as well as specify a specific namespace.