Building a Step Function with Error Handling in AWS with Terraform
The automation of cloud infrastructure through code (also called Infrastructure as Code, or IaC) can be a real game-changer for your use cases, however small and simple they might be. IaC lets you write code to define and manage your infrastructure, version control it, and deploy it as needed in a robust and repeatable way. One popular IaC framework is Terraform, which is cloud-agnostic and allows developers to define infrastructure for different cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
In this blog post, I will show you how to create AWS resources using Terraform and the HashiCorp Configuration Language (HCL). I will define a simple workflow that demonstrates the creation of several AWS resources:
- An S3 bucket with an event trigger
- A Step Function that will be triggered by the S3 bucket
- An SNS topic for error notifications
- A Lambda function that will fail (for the sake of the example)
Together, these resources implement the following workflow: when a file with a .txt extension is uploaded to the S3 bucket, a Step Function execution is triggered, which invokes a Lambda function. This Lambda function simply throws an error to demonstrate error handling. When the Lambda function fails, the Step Function catches the error and publishes it to the SNS topic, which is subscribed to our email address, so we receive the error message in our inbox.
This kind of setup is a great starting point for building more complex use cases, and it has already proven useful and effective for me several times. Maybe it helps you too!
Getting Started
To follow along with this tutorial, you will need to have the following:
- An AWS account
- AWS CLI installed and configured
- Terraform installed
This tutorial also assumes basic familiarity with AWS and Terraform. If you need to refresh your memory or want to start from the very basics, the official AWS and Terraform documentation are good places to begin.
If you want to follow along with the exact same example, you can check out my GitHub repository.
Let’s get started.
Project Overview
In this tutorial, we define our AWS resources in separate modules, each containing Terraform files written in HCL. The project structure looks like this:
.
├── README.md
├── config.tf
├── main.tf
├── modules
│ ├── cloudwatch
│ │ └── cloudwatch.tf
│ ├── lambda
│ │ ├── lambda.tf
│ │ └── scripts
│ │ └── test_failure.py
│ ├── s3
│ │ └── s3.tf
│ ├── sns
│ │ └── sns.tf
│ └── stepfunction
│ └── stepfunction.tf
├── (terraform.tfvars)
├── (variables.tf)
└── (outputs.tf)
A main.tf file serves as the entry point for our Terraform project; in it we declare the variables, instantiate the modules, and define the outputs.
variable "aws_region" {
description = "aws region to deploy resources in"
type = string
default = "eu-central-1"
sensitive = true
}
variable "aws_profile" {
description = "the aws profile to use for the credentials"
type = string
sensitive = true
}
variable "email_address" {
description = "the email address where sns sends failure messages to"
type = string
sensitive = true
}
# Modules
module "lambda" {
source = "./modules/lambda"
}
module "stepfunction" {
source = "./modules/stepfunction"
lambda_arn = module.lambda.lambda_arn
topic_arn = module.sns.topic_arn
}
module "cloudwatch" {
source = "./modules/cloudwatch"
stepfunction_arn = module.stepfunction.stepfunction_arn
bucket_arn = module.s3.bucket_arn
}
module "sns" {
source = "./modules/sns"
email_address = var.email_address
}
module "s3" {
source = "./modules/s3"
}
# Outputs
output "lambda_arn" {
value = module.lambda.lambda_arn
}
output "topic_arn" {
value = module.sns.topic_arn
}
output "stepfunction_arn" {
value = module.stepfunction.stepfunction_arn
}
output "bucket_arn" {
value = module.s3.bucket_arn
}
The file and folder names could of course be different in your project. For example, you could store the variables in a file called variables.tf and the outputs in a file called outputs.tf. This separation can certainly be useful for larger projects.
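The config.tf file from the project tree is not shown here; it holds the provider setup. A minimal sketch of what it could contain, assuming the official hashicorp/aws provider and the two credential-related variables declared above (your actual config.tf may look different):
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
# The provider picks up the region and credentials profile from the variables
# declared in main.tf (or supplied via terraform.tfvars).
provider "aws" {
  region  = var.aws_region
  profile = var.aws_profile
}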
Defining the AWS Resources with Terraform
The Step Function and Its Definition
Let’s start off this tutorial by defining the integral part of our simple workflow - the actual Step Function and its definition. It starts by invoking a Lambda function and, in case of failure, publishes to an SNS topic. Keep in mind that with Terraform we also have to define the AWS IAM roles and policies that grant our resources their permissions (here: invoking a specific Lambda function and publishing to a certain SNS topic).
# variables
variable "lambda_arn" {}
variable "topic_arn" {}
resource "aws_iam_role" "role_for_sfn" {
name = "role_for_sfn"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "states.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_policy" "policy_for_sfn" {
name = "policy_for_sfn"
path = "/"
description = "My policy for sfn"
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "lambda:InvokeFunction",
"Resource": [
"${var.lambda_arn}",
"${var.lambda_arn}:*"
],
"Effect": "Allow"
},
{
"Action": "sns:Publish",
"Resource": "${var.topic_arn}",
"Effect": "Allow"
}
]
}
EOF
}
resource "aws_iam_role_policy_attachment" "attach_iam_policy_to_iam_role" {
role = aws_iam_role.role_for_sfn.name
policy_arn = aws_iam_policy.policy_for_sfn.arn
}
resource "aws_sfn_state_machine" "sfn_state_machine" {
name = "TfStateMachine"
role_arn = aws_iam_role.role_for_sfn.arn
definition = <<EOF
{
"StartAt": "TfSubmitJob",
"States": {
"TfSubmitJob": {
"End": true,
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Catch": [
{
"ErrorEquals": [
"States.ALL"
],
"Next": "TfPublishMessage"
}
],
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${var.lambda_arn}",
"Payload.$": "$"
}
},
"TfPublishMessage": {
"Next": "TaskFailed",
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "${var.topic_arn}",
"Message.$": "$.Cause"
}
},
"TaskFailed": {
"Type": "Fail"
}
}
}
EOF
}
# outputs
output "stepfunction_arn" {
value = aws_sfn_state_machine.sfn_state_machine.arn
}
Yeah, I know, these inline JSON definitions can look quite confusing. But the end result is nothing more than a simple, straightforward Step Function.
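If the heredoc JSON bothers you, one alternative (not what the repository uses) is Terraform’s built-in jsonencode() function, which lets you write the definition in native HCL and catches JSON syntax errors at plan time. A sketch of the same state machine expressed that way:
resource "aws_sfn_state_machine" "sfn_state_machine" {
  name     = "TfStateMachine"
  role_arn = aws_iam_role.role_for_sfn.arn
  # jsonencode() builds the Amazon States Language definition from native HCL.
  definition = jsonencode({
    StartAt = "TfSubmitJob"
    States = {
      TfSubmitJob = {
        Type     = "Task"
        Resource = "arn:aws:states:::lambda:invoke"
        Parameters = {
          FunctionName = var.lambda_arn
          "Payload.$"  = "$"
        }
        Retry = [{
          ErrorEquals     = ["Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException"]
          IntervalSeconds = 2
          MaxAttempts     = 6
          BackoffRate     = 2
        }]
        Catch = [{
          ErrorEquals = ["States.ALL"]
          Next        = "TfPublishMessage"
        }]
        End = true
      }
      TfPublishMessage = {
        Type     = "Task"
        Resource = "arn:aws:states:::sns:publish"
        Parameters = {
          TopicArn    = var.topic_arn
          "Message.$" = "$.Cause"
        }
        Next = "TaskFailed"
      }
      TaskFailed = { Type = "Fail" }
    }
  })
}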
Configuring the S3 Bucket with Trigger
As mentioned, we want this Step Function to be triggered by the upload of a file to an S3 bucket. Let’s define this bucket in the s3.tf file:
resource "aws_s3_bucket" "bucket" {
bucket = "tf-simple-workflow-bucket"
force_destroy = true
}
resource "aws_s3_bucket_public_access_block" "example" {
bucket = aws_s3_bucket.bucket.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_notification" "bucket_notification" {
bucket = aws_s3_bucket.bucket.id
eventbridge = true
}
# outputs
output "bucket_arn" {
value = aws_s3_bucket.bucket.arn
}
Here, we first create an S3 bucket named tf-simple-workflow-bucket with force_destroy = true, which allows Terraform to delete the bucket even if it still contains objects, for example when we remove the stack with terraform destroy or rename the bucket. We also set eventbridge = true so that the bucket sends object-level events to Amazon EventBridge, which is what we will use to trigger the Step Function.
Next, we create this event trigger in the cloudwatch.tf file; it starts a Step Function execution whenever a file with the .txt extension is uploaded to the bucket.
variable "stepfunction_arn" {}
variable "bucket_arn" {}
data "aws_iam_policy_document" "allow_cloudwatch_to_execute_policy" {
statement {
actions = [
"sts:AssumeRole"
]
principals {
type = "Service"
identifiers = [
"states.amazonaws.com",
"events.amazonaws.com"
]
}
}
}
resource "aws_iam_role" "allow_cloudwatch_to_execute_role" {
name = "aws-events-invoke-StepFunction"
assume_role_policy = data.aws_iam_policy_document.allow_cloudwatch_to_execute_policy.json
}
resource "aws_iam_role_policy" "state_execution" {
name = "state_execution_policy"
role = aws_iam_role.allow_cloudwatch_to_execute_role.id
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1647307985962",
"Action": [
"states:StartExecution"
],
"Effect": "Allow",
"Resource": "${var.stepfunction_arn}"
}
]
}
EOF
}
resource "aws_cloudwatch_event_rule" "stf_trigger_rule" {
name = "stf_trigger_rule"
event_pattern = <<EOF
{
"detail-type": ["Object Created"],
"resources": ["${var.bucket_arn}"],
"detail": {
"object": {
"key": [{
"suffix": ".txt"
}]
}
},
"source": ["aws.s3"]
}
EOF
}
resource "aws_cloudwatch_event_target" "cloudwatch_event_target" {
rule = aws_cloudwatch_event_rule.stf_trigger_rule.name
arn = var.stepfunction_arn
role_arn = aws_iam_role.allow_cloudwatch_to_execute_role.arn
}
A Dummy Lambda Function
In this example, a Lambda function that simply raises an error serves as a placeholder for whatever actual processing your particular use case requires. With Terraform, we define the following resources in a file called lambda.tf, which resides in the lambda module.
# for zipping the lambda
data "archive_file" "python_lambda_package" {
type = "zip"
source_file = "${path.module}/scripts/test_failure.py"
output_path = "${path.module}/scripts/test_failure.zip"
}
resource "aws_iam_role" "iam_for_lambda" {
name = "iam_for_lambda"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "lambda.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_lambda_function" "test_failure_lambda" {
# If the file is not in the current working directory you will need to include a path.module in the filename.
filename = "${path.module}/scripts/test_failure.zip"
function_name = "TfFailureLambda"
role = aws_iam_role.iam_for_lambda.arn
handler = "test_failure.lambda_handler"
source_code_hash = data.archive_file.python_lambda_package.output_base64sha256
runtime = "python3.9"
}
# outputs
output "lambda_arn" {
value = aws_lambda_function.test_failure_lambda.arn
}
The code for the Lambda function is written in Python in this example and resides in a file called test_failure.py in the ./modules/lambda/scripts folder.
#!/usr/bin/python3.9
def lambda_handler(event, context):
# here only an error is raised, normally a processing workflow could take place here
# e.g. the file is downloaded from s3, then processed and a return value is given based on the content
region = event["region"]
time = event["time"]
bucket = event["detail"]["bucket"]["name"]
file = event["detail"]["object"]["key"]
raise RuntimeError(f"Hey, something went wrong with your step function... Please check the latest execution! Details: File {file} was uploaded to bucket {bucket} in region {region} at {time}.")
Configuring SNS for Error Notification
Great, we are almost done. Before we deploy, we also want a mechanism in place for handling unforeseen errors that might occur during the lifetime of our Step Function, without having to stare at the AWS console all the time. It is much more convenient to receive a message (e.g. an email) if something goes wrong. For this, we create an SNS topic in a file called sns.tf and add an email subscription with our own address, which will receive the notifications.
variable "email_address" {}
resource "aws_sns_topic" "failure_topic" {
name = "TfFailureTopic"
}
resource "aws_sns_topic_subscription" "failure_topic_target" {
topic_arn = aws_sns_topic.failure_topic.arn
protocol = "email"
endpoint = var.email_address
}
# outputs
output "topic_arn" {
value = aws_sns_topic.failure_topic.arn
}
Deployment
Now, if you have your AWS account, credentials, and Terraform all set up correctly, you should be ready to deploy the resources to your account using the usual terraform init, terraform plan and finally terraform apply commands. If you haven’t put the values for the variables aws_region, aws_profile and email_address into a file called terraform.tfvars, you will be prompted for them, and Terraform will then deploy the resources to AWS. Note that SNS sends a confirmation email to the subscribed address; you have to confirm the subscription before any failure notifications are delivered.
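If you do use a terraform.tfvars file, it could look something like this (the values below are placeholders, not the ones from the repository):
# terraform.tfvars - example values only, replace them with your own
aws_region    = "eu-central-1"
aws_profile   = "my-aws-profile"
email_address = "you@example.com"
Since these variables are marked as sensitive, keep this file out of version control.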
Testing the Infrastructure
To test our newly created infrastructure, we can upload a file with the .txt extension to our S3 bucket using the AWS Management Console or the AWS CLI:
touch test.txt
aws s3 cp test.txt s3://tf-simple-workflow-bucket
This should trigger our Step Function, which invokes our Lambda function. Since the Lambda function simply throws an error in this example, our SNS topic should receive a notification with the error message, which is then sent to the email address we specified during deployment.
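If you prefer the terminal over the console, you can also check the most recent executions with the AWS CLI. A small sketch, assuming you run it from the project root so that terraform output can resolve the state machine ARN:
# read the state machine ARN from the Terraform outputs and list recent executions
aws stepfunctions list-executions \
  --state-machine-arn "$(terraform output -raw stepfunction_arn)" \
  --max-results 5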
Conclusion
In this blog post, we have seen how to use Terraform and HCL to create AWS resources as code. We have created a simple but effective workflow that uses an S3 bucket trigger to start a Step Function, which in turn invokes a Lambda function. We have also seen how to use SNS to receive notifications in case of Lambda function failures.
Using Terraform allows us to define our infrastructure as code, which brings benefits such as version control, automated testing, and repeatable deployments. Getting familiar with Terraform involves a learning curve, but the investment can pay off, since the same skills apply to other cloud providers like Microsoft Azure or Google Cloud Platform as well.