AWS CloudFormation: Infrastructure as Code for Microservices

If you have been building microservices on AWS for a few years, you know the drill: you provision an S3 bucket through the console for the first time, then do it again for staging, then again for prod, and six months later nobody remembers exactly what settings were applied where, or why. CloudFormation is the answer to that. It lets you describe your AWS infrastructure as code — version-controlled, repeatable, and reviewable.

This post walks through writing and deploying CloudFormation templates for the resources a typical microservice needs: S3, SQS, IAM, and CloudWatch. It assumes you are comfortable with AWS concepts and have used the CLI before, but have not written CloudFormation seriously. By the end you will have a working template structure you can adapt to your own service.

What CloudFormation Actually Does #

CloudFormation takes a template — a YAML or JSON file describing AWS resources — and turns it into a stack: a managed collection of those resources. When you update the template and redeploy, CloudFormation computes a changeset showing exactly what will be created, modified, or deleted before anything happens. When you delete the stack, all the resources in it are cleaned up (unless you tell it otherwise).

The core value is not just automation. It is that your infrastructure lives in your repository alongside your application code. A pull request that adds an SQS queue also adds the CloudFormation template that provisions it. New engineers can read the template to understand what the service depends on. Drift between environments becomes visible.

Template Anatomy #

Every CloudFormation template follows the same structure:

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Brief description of what this template provisions'

Parameters:
  # Inputs that let you reuse the template across environments

Conditions:
  # Boolean expressions derived from parameters

Resources:
  # The actual AWS resources — this section is mandatory

Outputs:
  # Values to export for use by other stacks or scripts

Only Resources is required. Everything else is optional, but Parameters and Conditions are what make a single template work across integration, staging, and production.

Parameters: One Template, Many Environments #

Parameters are how you inject environment-specific values without duplicating the template. You define them in the template and supply the actual values in a separate parameter file at deploy time.

Parameters:
  Environment:
    Type: String
    Description: Deployment environment
    AllowedValues:
      - integration
      - staging
      - production

  ServiceName:
    Type: String
    Description: Name of the microservice, used as a prefix for resource names

  RetentionDays:
    Type: Number
    Default: 30
    Description: Log retention period in days

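The corresponding parameter file, one per environment, looks like this (AWS CLI v2 can read this format directly through --parameter-overrides file://...; on v1 you would pass Key=Value pairs instead):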

[
  { "ParameterKey": "Environment",   "ParameterValue": "integration" },
  { "ParameterKey": "ServiceName",   "ParameterValue": "my-service"  },
  { "ParameterKey": "RetentionDays", "ParameterValue": "7"           }
]

Keep your parameter files in version control alongside the templates, in a structure like:

cloudformation/
├── s3/
│   ├── template.yml
│   ├── params-integration.json
│   └── params-production.json
├── sqs/
│   ├── template.yml
│   ├── params-integration.json
│   └── params-production.json
├── iam/
│   ├── template.yml
│   ├── params-integration.json
│   └── params-production.json
└── cloudwatch/
    ├── template.yml
    ├── params-integration.json
    └── params-production.json

One template per resource type. Each template is small, focused, and easy to reason about. This is far more manageable than one giant template that provisions everything.
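
Because every template directory is expected to carry one parameter file per environment, a missing file tends to surface only at deploy time. A small pre-flight check can catch it earlier. This is a sketch, assuming the cloudformation/<resource>/params-<env>.json layout shown above; check_params is a hypothetical helper name, not part of any AWS tooling:

```shell
# check_params ENV [ROOT]
# Reports every template directory under ROOT that lacks params-ENV.json.
# Returns 0 when all files are present, 1 otherwise.
check_params() {
  local env=$1
  local root=${2:-cloudformation}
  local missing=0
  local dir

  for dir in "${root}"/*/; do
    # Skip the unexpanded glob when ROOT has no subdirectories
    [ -d "${dir}" ] || continue
    if [ ! -f "${dir}params-${env}.json" ]; then
      echo "missing: ${dir}params-${env}.json"
      missing=1
    fi
  done
  return "${missing}"
}
```

Running check_params production before a production deploy would list any directory that still lacks its file.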

Conditions: Environment-Specific Behaviour #

Conditions let you create or configure resources differently depending on parameter values, without duplicating the template.

Conditions:
  IsProduction: !Equals [ !Ref Environment, production ]
  IsIntegration: !Equals [ !Ref Environment, integration ]

You can then use these in resource definitions:

Resources:
  MyBucket:
    Type: AWS::S3::Bucket
    DeletionPolicy: !If [ IsProduction, Retain, Delete ]
    Properties:
      VersioningConfiguration:
        Status: !If [ IsProduction, Enabled, Suspended ]

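In production, the bucket is retained if the stack is deleted and versioning is on. In integration, the bucket is deleted with the stack and versioning is suspended. Same template, different behaviour. One caveat: CloudFormation only accepts intrinsic functions such as !If in DeletionPolicy when the template declares Transform: AWS::LanguageExtensions. Without that transform, DeletionPolicy must be a literal string.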

S3 #

A basic S3 bucket template with encryption, versioning, and a lifecycle policy:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::LanguageExtensions  # needed for !If in DeletionPolicy
Description: 'S3 bucket for my-service data storage'

Parameters:
  Environment:
    Type: String
    AllowedValues: [ integration, staging, production ]

  ServiceName:
    Type: String

Conditions:
  IsProduction: !Equals [ !Ref Environment, production ]

Resources:
  DataBucket:
    Type: AWS::S3::Bucket
    DeletionPolicy: !If [ IsProduction, Retain, Delete ]
    UpdateReplacePolicy: Retain
    Properties:
      BucketName: !Sub '${ServiceName}-data-${Environment}'
      VersioningConfiguration:
        Status: !If [ IsProduction, Enabled, Suspended ]
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      LifecycleConfiguration:
        Rules:
          - Id: ExpireOldVersions
            Status: Enabled
            NoncurrentVersionExpiration:
              NoncurrentDays: 90
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        - Key: Environment
          Value: !Ref Environment
        - Key: Service
          Value: !Ref ServiceName

  DataBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref DataBucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: DenyInsecureConnections
            Effect: Deny
            Principal: '*'
            Action: 's3:*'
            Resource:
              - !GetAtt DataBucket.Arn
              - !Sub '${DataBucket.Arn}/*'
            Condition:
              Bool:
                'aws:SecureTransport': false

Outputs:
  BucketName:
    Value: !Ref DataBucket
    Description: Name of the data bucket

  BucketArn:
    Value: !GetAtt DataBucket.Arn
    Description: ARN of the data bucket

A few things worth noting:

!Sub substitutes variable references into a string. ${ServiceName}-data-${Environment} becomes my-service-data-integration when those parameters are supplied.

!Ref and !GetAtt reference resources within the same template. !Ref DataBucket returns the bucket name; !GetAtt DataBucket.Arn returns its ARN.

DeletionPolicy: Retain means if someone deletes the stack, the bucket is not deleted. Always use this for production buckets with real data. Without it, aws cloudformation delete-stack will attempt to delete the bucket and fail if it still has contents, leaving the stack in DELETE_FAILED.

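PublicAccessBlockConfiguration: set all four to true unless you have a specific reason not to. Buckets created since April 2023 have Block Public Access enabled by default, but declaring it explicitly keeps the intent visible in review and means any later loosening shows up as a change to the template.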

SQS #

A queue template with a dead-letter queue (DLQ) — any message that fails processing a set number of times gets moved to the DLQ rather than being lost or causing infinite retries:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::LanguageExtensions  # needed for !If in DeletionPolicy
Description: 'SQS queues for my-service event processing'

Parameters:
  Environment:
    Type: String
    AllowedValues: [ integration, staging, production ]

  ServiceName:
    Type: String

  MessageRetentionSeconds:
    Type: Number
    Default: 345600  # 4 days
    Description: How long messages are retained if not consumed

  MaxReceiveCount:
    Type: Number
    Default: 3
    Description: Number of times a message is delivered before moving to DLQ

Conditions:
  IsProduction: !Equals [ !Ref Environment, production ]

Resources:
  EventDeadLetterQueue:
    Type: AWS::SQS::Queue
    DeletionPolicy: !If [ IsProduction, Retain, Delete ]
    Properties:
      QueueName: !Sub '${ServiceName}-events-dlq-${Environment}'
      MessageRetentionPeriod: 1209600  # 14 days for DLQ — longer retention for investigation
      Tags:
        - Key: Environment
          Value: !Ref Environment
        - Key: Service
          Value: !Ref ServiceName

  EventQueue:
    Type: AWS::SQS::Queue
    DeletionPolicy: !If [ IsProduction, Retain, Delete ]
    Properties:
      QueueName: !Sub '${ServiceName}-events-${Environment}'
      MessageRetentionPeriod: !Ref MessageRetentionSeconds
      VisibilityTimeout: 300
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt EventDeadLetterQueue.Arn
        maxReceiveCount: !Ref MaxReceiveCount
      Tags:
        - Key: Environment
          Value: !Ref Environment
        - Key: Service
          Value: !Ref ServiceName

Outputs:
  EventQueueUrl:
    Value: !Ref EventQueue
    Description: URL of the event queue

  EventQueueArn:
    Value: !GetAtt EventQueue.Arn
    Description: ARN of the event queue

  DeadLetterQueueUrl:
    Value: !Ref EventDeadLetterQueue
    Description: URL of the dead-letter queue

VisibilityTimeout is how long a message is hidden from other consumers after one consumer picks it up. Set this longer than your maximum processing time, or you will get duplicate deliveries when processing is slow.

RedrivePolicy wires up the DLQ. When a message has been received maxReceiveCount times without being deleted (i.e., acknowledged), SQS moves it to the dead-letter queue automatically. Always set up a DLQ in production. Without one, a poison message keeps being redelivered until its retention period expires, and then it is silently dropped.
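
Once messages land in the DLQ, the usual follow-up is to see how many are parked there and, after the consumer bug is fixed, move them back. A minimal sketch using the AWS CLI, assuming a CLI version recent enough to include sqs start-message-move-task; dlq_depth and redrive_dlq are hypothetical helper names:

```shell
# dlq_depth QUEUE_URL
# Prints the approximate number of messages currently in the queue.
dlq_depth() {
  local queue_url=$1
  aws sqs get-queue-attributes \
    --queue-url "${queue_url}" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# redrive_dlq QUEUE_URL
# Starts a task that moves DLQ messages back to the queue(s) they came from.
# With no --destination-arn, SQS returns each message to its source queue.
redrive_dlq() {
  local queue_url=$1
  local dlq_arn
  dlq_arn=$(aws sqs get-queue-attributes \
    --queue-url "${queue_url}" \
    --attribute-names QueueArn \
    --query 'Attributes.QueueArn' \
    --output text)
  aws sqs start-message-move-task --source-arn "${dlq_arn}"
}
```

The queue URL is the DeadLetterQueueUrl output of the SQS stack. Redriving before the underlying fault is fixed only sends the same messages around the loop again.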

IAM #

Your application needs a role to access the S3 bucket and SQS queue you just provisioned. Apply the principle of least privilege: grant only what the service actually needs.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'IAM role for my-service application'

Parameters:
  Environment:
    Type: String
    AllowedValues: [ integration, staging, production ]

  ServiceName:
    Type: String

  DataBucketArn:
    Type: String
    Description: ARN of the S3 data bucket (from S3 stack output)

  EventQueueArn:
    Type: String
    Description: ARN of the SQS event queue (from SQS stack output)

Resources:
  ApplicationRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${ServiceName}-app-role-${Environment}'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com  # or ecs-tasks.amazonaws.com for ECS
            Action: sts:AssumeRole
      Policies:
        - PolicyName: S3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:PutObject
                  - s3:DeleteObject
                Resource: !Sub '${DataBucketArn}/*'
              - Effect: Allow
                Action:
                  - s3:ListBucket
                Resource: !Ref DataBucketArn
        - PolicyName: SQSAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - sqs:ReceiveMessage
                  - sqs:DeleteMessage
                  - sqs:GetQueueAttributes
                  - sqs:SendMessage
                Resource: !Ref EventQueueArn
      Tags:
        - Key: Environment
          Value: !Ref Environment
        - Key: Service
          Value: !Ref ServiceName

Outputs:
  ApplicationRoleArn:
    Value: !GetAtt ApplicationRole.Arn
    Description: ARN of the application role

Cross-stack references: Notice that DataBucketArn and EventQueueArn are parameters here, not hardcoded. They come from the Outputs of the S3 and SQS stacks. This keeps the IAM template decoupled — you can update a bucket ARN without touching the IAM template structure.
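
One way to feed those outputs into the IAM stack is to read them with describe-stacks at deploy time instead of copying ARNs into a parameter file by hand. A minimal sketch, assuming the stack naming used elsewhere in this post (my-service-s3-integration and so on); stack_output and deploy_iam are hypothetical helper names:

```shell
# stack_output STACK_NAME OUTPUT_KEY
# Prints the value of a single output from a deployed stack.
stack_output() {
  local stack_name=$1
  local output_key=$2
  aws cloudformation describe-stacks \
    --stack-name "${stack_name}" \
    --query "Stacks[0].Outputs[?OutputKey=='${output_key}'].OutputValue" \
    --output text
}

# deploy_iam SERVICE ENV
# Looks up the bucket and queue ARNs, then passes them as Key=Value overrides.
deploy_iam() {
  local service=$1
  local env=$2
  local bucket_arn
  local queue_arn
  bucket_arn=$(stack_output "${service}-s3-${env}" BucketArn)
  queue_arn=$(stack_output "${service}-sqs-${env}" EventQueueArn)

  aws cloudformation deploy \
    --template-file cloudformation/iam/template.yml \
    --stack-name "${service}-iam-${env}" \
    --parameter-overrides \
      "Environment=${env}" \
      "ServiceName=${service}" \
      "DataBucketArn=${bucket_arn}" \
      "EventQueueArn=${queue_arn}" \
    --capabilities CAPABILITY_NAMED_IAM
}
```

Called as deploy_iam my-service integration, this keeps the IAM stack in step with whatever the S3 and SQS stacks actually created.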

Never use Resource: '*' in a production policy unless the action genuinely requires it (a small number of S3 actions, like s3:ListAllMyBuckets, have no resource-level scope). Wildcard resources are the most common IAM mistake and the one that causes the most damage when something goes wrong.

CloudWatch Alarms #

Alarms are often left until something breaks in production. Don’t do that. Define them in the same deployment pipeline as the resources they monitor.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudWatch alarms for my-service'

Parameters:
  Environment:
    Type: String
    AllowedValues: [ integration, staging, production ]

  ServiceName:
    Type: String

  EventQueueName:
    Type: String
    Description: Name of the SQS event queue to monitor

  DeadLetterQueueName:
    Type: String
    Description: Name of the dead-letter queue to monitor

  AlarmEmail:
    Type: String
    Description: Email address for alarm notifications

  DLQThreshold:
    Type: Number
    Default: 1
    Description: Number of messages in DLQ before alarming

Conditions:
  IsProduction: !Equals [ !Ref Environment, production ]

Resources:
  AlarmTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub '${ServiceName}-alarms-${Environment}'
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlarmEmail

  DLQMessageAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${ServiceName}-dlq-messages-${Environment}'
      AlarmDescription: 'Messages appearing in dead-letter queue indicate processing failures'
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: !Ref DeadLetterQueueName
      Statistic: Maximum  # gauge metric: Sum would add up every datapoint in the period
      Period: 300
      EvaluationPeriods: 1
      Threshold: !Ref DLQThreshold
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlarmTopic
      OKActions:
        - !Ref AlarmTopic

  QueueDepthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${ServiceName}-queue-depth-${Environment}'
      AlarmDescription: 'Queue depth is high — consumers may be falling behind'
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: !Ref EventQueueName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !If [ IsProduction, !Ref AlarmTopic, !Ref AWS::NoValue ]

  OldestMessageAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${ServiceName}-message-age-${Environment}'
      AlarmDescription: 'Messages are not being consumed — oldest message age is too high'
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref EventQueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 3600  # 1 hour in seconds
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlarmTopic

TreatMissingData: notBreaching is there because SQS only publishes metrics for active queues. A queue that has been empty and untouched for roughly six hours stops emitting ApproximateNumberOfMessagesVisible. With the default setting (missing), an idle queue would push the alarm into INSUFFICIENT_DATA rather than leaving it in OK. For queue metrics, missing data means the queue is idle, which is fine.

AWS::NoValue is a special CloudFormation value that removes a property entirely when used with !If. In the QueueDepthAlarm, the AlarmActions list is populated in production but omitted in integration — the alarm still triggers and turns red, but no notification is sent in non-production environments.
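
An alarm whose notification path has never been exercised is easy to get wrong, so it can be worth forcing one into ALARM after the first deploy. A sketch, assuming the alarm names from the template above and an AlarmEmail subscription that has already been confirmed; list_alarm_states and trigger_test_alarm are hypothetical helper names:

```shell
# list_alarm_states SERVICE
# Shows the current state of every alarm whose name starts with "SERVICE-".
list_alarm_states() {
  local service=$1
  aws cloudwatch describe-alarms \
    --alarm-name-prefix "${service}-" \
    --query 'MetricAlarms[].[AlarmName, StateValue]' \
    --output table
}

# trigger_test_alarm ALARM_NAME
# Temporarily forces the alarm into ALARM so the SNS path can be verified.
# The alarm returns to its real state at the next metric evaluation.
trigger_test_alarm() {
  local alarm_name=$1
  aws cloudwatch set-alarm-state \
    --alarm-name "${alarm_name}" \
    --state-value ALARM \
    --state-reason "Manual test of notification wiring"
}
```

For example, trigger_test_alarm my-service-dlq-messages-integration should produce an email if the topic and subscription are wired correctly.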

Deploying with the AWS CLI #

With templates and parameter files in place, deployment is straightforward. The key concept is the changeset: you create it, review it, then execute it.

# Create or update a stack using a changeset
aws cloudformation deploy \
  --template-file cloudformation/s3/template.yml \
  --stack-name my-service-s3-integration \
  --parameter-overrides file://cloudformation/s3/params-integration.json \
  --capabilities CAPABILITY_NAMED_IAM \
  --no-execute-changeset

# Review the changeset before executing
aws cloudformation describe-change-set \
  --stack-name my-service-s3-integration \
  --change-set-name <changeset-name-from-above>

# Execute when satisfied
aws cloudformation execute-change-set \
  --stack-name my-service-s3-integration \
  --change-set-name <changeset-name-from-above>
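
The raw describe-change-set output is verbose JSON. For review, a condensed table of what will happen to each resource is often enough. A sketch, with review_change_set as a hypothetical helper name:

```shell
# review_change_set STACK_NAME CHANGE_SET_NAME
# Prints one row per resource: action, logical ID, type, and whether the
# change forces a replacement.
review_change_set() {
  local stack_name=$1
  local change_set_name=$2
  aws cloudformation describe-change-set \
    --stack-name "${stack_name}" \
    --change-set-name "${change_set_name}" \
    --query 'Changes[].ResourceChange.[Action, LogicalResourceId, ResourceType, Replacement]' \
    --output table
}
```

The Replacement column is the one to read carefully: a value of True on a bucket or queue means the existing resource will be replaced by a new one.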

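--capabilities CAPABILITY_NAMED_IAM is required whenever your template creates IAM resources with custom names, such as the RoleName in the IAM template above; CAPABILITY_IAM is enough when CloudFormation generates the names. CloudFormation requires you to explicitly acknowledge this because IAM changes can affect access across your entire account. The flag is harmless on stacks that contain no IAM resources.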

For a simpler deploy-and-execute in one step (appropriate for non-production environments or when you trust the changeset):

aws cloudformation deploy \
  --template-file cloudformation/s3/template.yml \
  --stack-name my-service-s3-integration \
  --parameter-overrides file://cloudformation/s3/params-integration.json \
  --capabilities CAPABILITY_NAMED_IAM

A Deployment Script #

Once you have multiple templates, a small shell script that deploys them in the right order (IAM after S3 and SQS, since it references their ARNs) is worth having:

#!/bin/bash
set -euo pipefail

ENV=${1:-integration}
SERVICE=my-service

echo "Deploying CloudFormation stacks for ${SERVICE} in ${ENV}"

deploy_stack() {
  local template=$1
  local stack_name=$2

  echo "→ Deploying stack: ${stack_name}"
  aws cloudformation deploy \
    --template-file "cloudformation/${template}/template.yml" \
    --stack-name "${stack_name}" \
    --parameter-overrides "file://cloudformation/${template}/params-${ENV}.json" \
    --capabilities CAPABILITY_NAMED_IAM \
    --tags Environment="${ENV}" Service="${SERVICE}"
}

# Deploy in dependency order: IAM needs the S3 and SQS ARNs to exist first
deploy_stack s3         "${SERVICE}-s3-${ENV}"
deploy_stack sqs        "${SERVICE}-sqs-${ENV}"
deploy_stack iam        "${SERVICE}-iam-${ENV}"
deploy_stack cloudwatch "${SERVICE}-cloudwatch-${ENV}"

echo "All stacks deployed."

Run it as:

./scripts/deploy-cfn.sh integration
./scripts/deploy-cfn.sh production
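
A cheap pre-flight step is to run every template through validate-template before any stack is touched. It only checks template syntax, not whether the resources can actually be created, but it catches YAML mistakes early. A sketch, assuming the directory layout used in this post; validate_templates is a hypothetical helper name:

```shell
# validate_templates [ROOT]
# Validates ROOT/*/template.yml and stops at the first template that fails.
validate_templates() {
  local root=${1:-cloudformation}
  local template

  for template in "${root}"/*/template.yml; do
    # Skip the unexpanded glob when no templates exist
    [ -f "${template}" ] || continue
    echo "Validating ${template}"
    aws cloudformation validate-template \
      --template-body "file://${template}" > /dev/null || return 1
  done
}
```

Calling validate_templates at the top of the deployment script would make a broken template fail fast, before the first stack is deployed.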

Tagging #

Every resource should have consistent tags. Tags are how you find resources, track costs, and understand what belongs to what when something goes wrong at 2am.

A standard tag set for a microservice:

Tags:
  - Key: Service
    Value: !Ref ServiceName
  - Key: Environment
    Value: !Ref Environment
  - Key: Team
    Value: platform-engineering
  - Key: Repository
    Value: my-service

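Drive the Service and Environment values from parameters in each template, as shown, and apply the full tag set to every resource that supports tags. Stack-level tags passed with --tags, as the deployment script does, are also propagated by CloudFormation to taggable resources in the stack, which covers anything a template misses. The discipline feels tedious until you are looking at an unfamiliar resource in the console three years later and the tags tell you exactly what it belongs to.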
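
Consistent tags also make resources queryable. A sketch that lists everything tagged for one service and environment through the Resource Groups Tagging API, assuming the Service and Environment tag keys used in this post; resources_for is a hypothetical helper name, and the call only covers the current region and the services that API supports:

```shell
# resources_for SERVICE ENV
# Prints the ARNs of all resources tagged with both Service and Environment.
resources_for() {
  local service=$1
  local env=$2
  aws resourcegroupstaggingapi get-resources \
    --tag-filters "Key=Service,Values=${service}" "Key=Environment,Values=${env}" \
    --query 'ResourceTagMappingList[].ResourceARN' \
    --output text
}
```

Because the lookup is by tag rather than by stack, it also finds resources that were retained after their stack was deleted.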

Things That Will Catch You Out #

Stack drift: if someone modifies a resource directly in the console after it was created by CloudFormation, the stack drifts out of sync with the template. aws cloudformation detect-stack-drift --stack-name <name> starts a drift detection run, and aws cloudformation describe-stack-resource-drifts --stack-name <name> shows what changed once it completes. The fix is usually to update the template to match reality and redeploy, or to manually revert the console change.
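
Drift detection in the CLI is asynchronous: one call starts the run, a second reports its status, and a third returns the per-resource results. A sketch that chains the three, with check_drift as a hypothetical helper name and a five-second polling interval chosen arbitrarily:

```shell
# check_drift STACK_NAME
# Starts drift detection, waits for it to finish, then lists drifted resources.
check_drift() {
  local stack_name=$1
  local detection_id
  local status

  detection_id=$(aws cloudformation detect-stack-drift \
    --stack-name "${stack_name}" \
    --query StackDriftDetectionId --output text)

  # Poll until the run leaves DETECTION_IN_PROGRESS
  while true; do
    status=$(aws cloudformation describe-stack-drift-detection-status \
      --stack-drift-detection-id "${detection_id}" \
      --query DetectionStatus --output text)
    if [ "${status}" != "DETECTION_IN_PROGRESS" ]; then
      break
    fi
    sleep 5
  done

  # Show only resources that were changed or removed outside CloudFormation
  aws cloudformation describe-stack-resource-drifts \
    --stack-name "${stack_name}" \
    --stack-resource-drift-status-filters MODIFIED DELETED \
    --query 'StackResourceDrifts[].[LogicalResourceId, ResourceType, StackResourceDriftStatus]' \
    --output table
}
```

Drift detection only compares properties the template sets explicitly, so a resource can differ from your expectations and still report as in sync.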

Circular dependencies — if Stack A exports a value that Stack B imports, and Stack B exports a value that Stack A imports, you have a circular dependency and neither can deploy. Break it by passing values as parameters rather than using Fn::ImportValue.

Stack deletion with retained resources — if your stack has DeletionPolicy: Retain on some resources, deleting the stack leaves those resources orphaned. They will not appear in any stack but will continue to exist and incur costs. Keep a record of retained resources.

Update rollback: if a stack update fails partway through, CloudFormation rolls back to the previous state. Usually this works cleanly. Occasionally it does not and the stack ends up in UPDATE_ROLLBACK_FAILED, a state where you cannot update the stack (you can still delete it). To recover, fix whatever blocked the rollback and call ContinueUpdateRollback, skipping any resources that cannot be rolled back, which is not fun. Avoid this by testing in integration first.
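
Recovering from UPDATE_ROLLBACK_FAILED with the CLI usually means finding the resources that failed and then resuming the rollback. A sketch, with failed_resources and recover_rollback as hypothetical helper names:

```shell
# failed_resources STACK_NAME
# Lists resources that reported UPDATE_FAILED, with the reason for each.
failed_resources() {
  local stack_name=$1
  aws cloudformation describe-stack-events \
    --stack-name "${stack_name}" \
    --query "StackEvents[?ResourceStatus=='UPDATE_FAILED'].[LogicalResourceId, ResourceStatusReason]" \
    --output table
}

# recover_rollback STACK_NAME [LOGICAL_ID...]
# Resumes the rollback, optionally skipping the named resources, then waits
# for the stack to reach UPDATE_ROLLBACK_COMPLETE.
recover_rollback() {
  local stack_name=$1
  shift
  if [ "$#" -gt 0 ]; then
    aws cloudformation continue-update-rollback \
      --stack-name "${stack_name}" \
      --resources-to-skip "$@"
  else
    aws cloudformation continue-update-rollback --stack-name "${stack_name}"
  fi
  aws cloudformation wait stack-rollback-complete --stack-name "${stack_name}"
}
```

Skipping a resource tells CloudFormation to mark it as rolled back without touching it, so the real resource may no longer match the template. Treat skipped resources as candidates for a drift check afterwards.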


CloudFormation has rough edges and the YAML can become verbose, but the discipline of treating infrastructure as code pays off at every stage of a service’s life — during development when you need to spin up a clean environment, during incidents when you need to understand what exists, and during off-boarding when you need to know what to clean up. The templates above give you a working foundation. Start with one resource type, get the deployment working end to end in integration, then build out from there.