Implementing a “sleep” in a CloudFormation stack to work around IAM eventual consistency

Problem statement

Our customer uses Customizations for AWS Control Tower for account vending. Each new account in a specific organizational unit gets a baseline set of resources, for example IAM roles, a VPC with all networking components, and an ECS cluster for later application deployment. Creating an ECS cluster requires a service-linked role, which must be created explicitly when the cluster is deployed through CloudFormation. We therefore used the native CloudFormation DependsOn attribute to enforce a strict order of resource creation.

This is the initial CloudFormation stack:

 

AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS ECS Fargate cluster'
Parameters:
  CapacityProviderTypes:
    Type: CommaDelimitedList
    AllowedValues:
      - FARGATE
      - FARGATE_SPOT
  EnvironmentTag:
    Type: String
Conditions:
  IsProd: !Equals
    - !Ref EnvironmentTag
    - prod

Resources:
  FargateClusterRole:
    Type: AWS::IAM::ServiceLinkedRole
    Properties:
      AWSServiceName: ecs.amazonaws.com

  FargateCluster:
    Type: AWS::ECS::Cluster
    DependsOn: 
      - FargateClusterRole
    Properties:
      ClusterName: FargeetClusterPal
      CapacityProviders: !Ref CapacityProviderTypes
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !If [IsProd, FARGATE, FARGATE_SPOT]

If the service-linked role did not already exist, the stack sometimes failed. The root cause is as follows. CloudFormation sends an API call to create the service-linked role and receives a successful response. However, if we look for the role in the IAM console at that same moment, it is not guaranteed to be there yet, and the ECS cluster creation that starts right afterwards can fail for the same reason. This is not obvious and not widely known, but changes to IAM configuration can take some time to become visible. The AWS IAM documentation explains why:

As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any change that you make in IAM (or other AWS services), including tags used in attribute-based access control (ABAC), takes time to become visible from all possible endpoints. Some of the delay results from the time it takes to send the data from server to server, from replication zone to replication zone, and from Region to Region around the world. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out.

So, as a workaround, we had to add a “sleep” step between the creation of the service-linked role and the creation of the ECS cluster, giving the change time to propagate so that the stack deploys reliably.
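
For comparison, code that calls the AWS APIs directly would usually deal with eventual consistency by retrying a read until the change becomes visible, which a CloudFormation template cannot express on its own. The sketch below illustrates that idea with exponential backoff. The function name wait_until_visible, its parameters, and the simulated role_is_visible check are all hypothetical; in real code the check would be an actual IAM lookup.

```python
import time
from typing import Callable


def wait_until_visible(
    check: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` with exponential backoff until it returns True or attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False


# Simulated eventually consistent read (an assumption for illustration):
# the "role" only becomes visible on the third read.
reads = {"count": 0}


def role_is_visible() -> bool:
    reads["count"] += 1
    return reads["count"] >= 3


waits = []  # record the requested delays instead of really sleeping
found = wait_until_visible(role_is_visible, sleep=waits.append)
print(found, waits)
```

Even when such a check succeeds, another service may still be reading stale data for a short while, which is one reason a fixed pause can remain a pragmatic choice inside a template.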

Proposed solution

Unfortunately, CloudFormation has no native “sleep” or delay feature as of this writing, so we considered a couple of options.

The first idea was to create the service-linked role in an earlier step of account vending, for example during VPC creation. This is not a logically clean solution: the service-linked role belongs to the ECS stack, so ideally it should be created within it.

The second idea was to use a CloudFormation custom resource backed by a Lambda function, where we can implement whatever logic we need, including a “sleep” timeout.
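
For context, a Lambda-backed custom resource receives an event describing the request and has to answer by sending a JSON document to the pre-signed URL given in the event's ResponseURL field; the cfnresponse helper takes care of this inside the function. The sketch below only builds such a response body and does not send it. The function name build_response_body and every value in sample_event are made up for illustration.

```python
import json


def build_response_body(event: dict, status: str, data: dict,
                        physical_id: str, reason: str = "") -> str:
    """Build the JSON document a custom resource sends back to CloudFormation."""
    return json.dumps({
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or "See CloudWatch Logs for details",
        "PhysicalResourceId": physical_id,
        # The next three fields are echoed back from the incoming event.
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data,
    })


# Hypothetical event, shaped like the one CloudFormation passes to the function.
sample_event = {
    "RequestType": "Create",
    "ResponseURL": "https://example.invalid/presigned-url",
    "StackId": "arn:aws:cloudformation:eu-west-1:111122223333:stack/demo/example-id",
    "RequestId": "example-request-id",
    "LogicalResourceId": "Delay",
    "ResourceProperties": {"TimeToWait": "20"},
}

body = build_response_body(
    sample_event, "SUCCESS", {"Data": "wait complete"}, physical_id="delay-example"
)
print(body)
# A real handler would now send `body` with an HTTP PUT to sample_event["ResponseURL"].
```

If the function never sends this response, CloudFormation keeps waiting for it until the operation times out, so the handler should report back on every code path.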

This is the new CloudFormation stack:

 

AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS ECS Fargate cluster'
Parameters:
  CapacityProviderTypes:
    Type: CommaDelimitedList
    AllowedValues:
      - FARGATE
      - FARGATE_SPOT
  EnvironmentTag:
    Type: String
Conditions:
  IsProd: !Equals
    - !Ref EnvironmentTag
    - prod

Resources:
  FargateClusterRole:
    Type: AWS::IAM::ServiceLinkedRole
    Properties:
      AWSServiceName: ecs.amazonaws.com

  FargateCluster:
    Type: AWS::ECS::Cluster
    DependsOn: 
      - Delay
    Properties:
      ClusterName: FargeetClusterPal
      CapacityProviders: !Ref CapacityProviderTypes
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !If [IsProd, FARGATE, FARGATE_SPOT]

  Delay:
    Type: 'Custom::Delay'
    DependsOn: 
      - FargateClusterRole
    Properties:
      ServiceToken: !GetAtt DelayFunction.Arn
      TimeToWait: 20

### Custom resource for Delay (sleep), that is natively absent in CloudFormation
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: "lambda-logs"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource:
                  - "arn:aws:logs:*:*:*"

  DelayFunction:
    Type: 'AWS::Lambda::Function'
    Properties:
      Handler: "index.handler"
      Timeout: 120
      Role: !GetAtt 'LambdaRole.Arn'
      Runtime: python3.10
      Code:
        ZipFile: |
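          import time
          import cfnresponse
          def handler(event, context):
              # Always report back to CloudFormation; otherwise the stack
              # operation hangs until it times out.
              try:
                  # Sleep only on Create/Update so that stack deletion is not slowed down.
                  if event['RequestType'] != 'Delete':
                      time_to_wait = int(event['ResourceProperties']['TimeToWait'])
                      print('wait started')
                      time.sleep(time_to_wait)
                      print('wait completed')
                  responseData = {'Data': 'wait complete'}
                  cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData)
              except Exception as exc:
                  print(f'delay failed: {exc}')
                  cfnresponse.send(event, context, cfnresponse.FAILED, {})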
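
The delay logic can be checked locally before deploying the stack. The sketch below is a variant of the handler above, not the deployed code: the response sender and the sleep function are passed in as parameters so that no AWS call and no real waiting is needed, and it skips the wait on Delete requests. The names delay_handler and fake_send and the sample events are hypothetical.

```python
import time


def delay_handler(event, context, send, sleep=time.sleep):
    """Variant of the Delay handler with the sender and sleep injected for testing."""
    try:
        # Custom resource property values arrive as strings, hence int().
        if event["RequestType"] != "Delete":
            sleep(int(event["ResourceProperties"]["TimeToWait"]))
        send(event, context, "SUCCESS", {"Data": "wait complete"})
    except Exception as exc:
        # Report the failure instead of leaving CloudFormation waiting.
        print(f"delay failed: {exc}")
        send(event, context, "FAILED", {})


calls = []  # (request type, status) pairs seen by the fake sender
slept = []  # delays that would have been slept


def fake_send(event, context, status, data):
    calls.append((event["RequestType"], status))


props = {"TimeToWait": "20"}
delay_handler({"RequestType": "Create", "ResourceProperties": props}, None, fake_send, slept.append)
delay_handler({"RequestType": "Delete", "ResourceProperties": props}, None, fake_send, slept.append)
# A missing TimeToWait property should produce a FAILED response, not a crash.
delay_handler({"RequestType": "Create", "ResourceProperties": {}}, None, fake_send, slept.append)
print(calls, slept)
```

Injecting the sender keeps the example independent of the cfnresponse module, which is only provided inside the Lambda environment for inline code.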

As a result, the template has a couple of new blocks that a single native parameter could replace. Such a feature has been requested since 2020, but it is still not available natively in CloudFormation. Until it is, we can work around the limitation with Lambda-backed custom resources.

Conclusion

In this post, we looked at a CloudFormation custom resource as a way to implement a “sleep” delay between the creation of dependent resources within a stack. Custom resources are a powerful feature that can also be used for many other kinds of logic and for interactions with third parties.