Bootstrapping AWS CloudFormation Stacks with Puppet and Structured EC2 User Data


This post offers practical advice on bootstrapping applications on AWS using CloudFormation, Puppet, Hiera and Serf. To get the most out of it, you should already be familiar with these four technologies.

The high-level tasks are:

  1. Create your CloudFormation template
  2. Install and configure software on the instances
  3. Connect each instance to the other instances in the stack

Creating the CloudFormation template

Writing CloudFormation templates directly in JSON can be painful. Short, simple JSON structures are fine to write by hand, but describing an entire AWS infrastructure this way is impractical for the following reasons:

– Developer workflow problems – keeping large JSON files in source control means that everyone has to use exactly the same parser and output formatting; otherwise diffs become unusable, which prevents code review.

– Very limited code modularity and re-usability.

– Depending on your parser of choice, syntax errors can be hard to find, especially in large documents.

– JSON has no comment syntax, and code without comments is never a good idea.

– Syntactically correct JSON isn’t necessarily semantically correct CloudFormation. Using a full programming language allows for testing at an earlier stage.

Instead of writing large JSON files directly, it’s much easier to use a full-featured programming language to generate the templates.

We have selected Python and Troposphere for our CloudFormation generation.
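To make the idea concrete without pulling in a dependency, here is a minimal sketch that builds a template from plain Python dictionaries rather than with Troposphere itself. The helper names (`launch_configuration`, `autoscaling_group`, `build_template`, `dangling_refs`), the `WebLaunchConfig` resource name and the placeholder AMI ID are all hypothetical. The sketch only shows the kind of reuse and early checking that a programming language makes possible.

```python
import json


def launch_configuration(image_id, instance_type):
    """Return a launch configuration resource as a plain dict."""
    return {
        "Type": "AWS::AutoScaling::LaunchConfiguration",
        "Properties": {"ImageId": image_id, "InstanceType": instance_type},
    }


def autoscaling_group(launch_config_name, min_size, max_size):
    """Return an Auto Scaling group that references a launch configuration."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "LaunchConfigurationName": {"Ref": launch_config_name},
            "MinSize": str(min_size),
            "MaxSize": str(max_size),
            "AvailabilityZones": {"Fn::GetAZs": ""},
        },
    }


def build_template():
    # Comments like this one live in the generator, which raw JSON cannot offer.
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # "ami-00000000" is a placeholder, not a real image ID.
            "WebLaunchConfig": launch_configuration("ami-00000000", "t2.micro"),
            "WebAutoscalingGroup": autoscaling_group("WebLaunchConfig", 1, 3),
        },
    }


def dangling_refs(template):
    """Return Ref targets that are neither resources nor parameters."""
    known = set(template.get("Resources", {})) | set(template.get("Parameters", {}))
    found = set()

    def walk(node):
        if isinstance(node, dict):
            ref = node.get("Ref")
            if isinstance(ref, str):
                found.add(ref)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(template)
    # Pseudo parameters such as AWS::Region are always available.
    return sorted(r for r in found - known if not r.startswith("AWS::"))


template = build_template()
# Fixed indentation and sorted keys keep diffs stable across machines.
rendered = json.dumps(template, indent=2, sort_keys=True)
```

A check such as `dangling_refs(template) == []` can run in a unit test long before the template reaches CloudFormation, and rendering with a fixed indent and sorted keys sidesteps the formatting problem described in the list above.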

In addition to describing your AWS resources, CloudFormation has the task of providing the instances’ bootstrapping logic. This is done in Instance UserData.

Let’s have a look at some of the possibilities for UserData:

  1. Single shell script as UserData – Requires CloudInit, encoding shell script in JSON template
  2. YAML encoded CloudInit configuration – Requires CloudInit, encoding YAML in JSON template
  3. CloudFormation helper scripts – Generally requires Amazon Linux, may require encoding of shell scripts in the CloudFormation metadata resource
  4. JSON encoded UserData (our preferred option) – Requires custom AMI (Amazon machine image), since the logic which interprets the custom encoded JSON userdata must already exist on the AMI

Exploring Option 4

This option requires custom AMIs, but for us that is not a new requirement. We already use custom AMIs because we don’t want to install software during instance boot, which can cause failures and/or a general slowdown during autoscaling.

Since we are already building custom AMIs, why not install a start-up script that reads the structured UserData and hands the bootstrapping task to a configuration management tool? Configuration management tools are much better equipped for this task than shell scripts or cloud-init helper scripts.

Install and configure the application on the instance

Using custom AMIs means that installation happens during AMI creation, while configuration happens during instance boot.

Since we are taking the JSON-encoded UserData approach, we need something on the instances that understands this UserData and translates it into application configuration.

Take, for example, the following UserData:

{
  "platform": {
    "branch": "releases",
    "dc": "aws-us-east-1",
    "environment": "development",
    "profile": "tipaas",
    "release": "101",
    "role": "webapp",
    "stack": "development-testing"
  },
  "cloudformation": {
    "resource_name": "WebAutoscalingGroup",
    "stack_name": "rnd-IntegrationCloud-1VTIEDLMDO8YW-Web1a-1IY3WUHI0XCNN"
  },
  "webapp_config": {
    "elastcache_endpoint": ""
  }
}

Now, back to the Python troposphere library. It’s very easy to extend the troposphere library to provide a custom UserData Python class, which returns the above JSON when the final CloudFormation template is rendered.
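As a rough illustration of such a class, the sketch below renders structured UserData as a JSON string wrapped in CloudFormation's `Fn::Base64` intrinsic function. It is written in plain Python and does not use Troposphere's own base classes, so the class name `StructuredUserData` and its methods are hypothetical, and the wiring into the library is not shown. Values that are only known at stack creation time (such as the generated stack name) would need intrinsic functions and are left out here.

```python
import json


class StructuredUserData:
    """Hypothetical helper that renders platform variables as JSON UserData."""

    def __init__(self, platform, cloudformation=None, **extra_sections):
        self.sections = {"platform": dict(platform)}
        if cloudformation is not None:
            self.sections["cloudformation"] = dict(cloudformation)
        # Any further keyword arguments become top-level sections.
        self.sections.update(extra_sections)

    def to_json(self):
        # Sorted keys keep the rendered template stable between runs.
        return json.dumps(self.sections, sort_keys=True)

    def to_dict(self):
        # UserData has to be Base64 encoded; Fn::Base64 lets CloudFormation
        # do the encoding when the stack is created.
        return {"Fn::Base64": self.to_json()}


user_data = StructuredUserData(
    platform={"profile": "tipaas", "role": "webapp", "environment": "development"},
    cloudformation={"resource_name": "WebAutoscalingGroup"},
    webapp_config={"elastcache_endpoint": ""},
)
launch_config_properties = {"UserData": user_data.to_dict()}
```

Keeping the platform variables in one object means the same values can be reused elsewhere in the generator, for example to tag resources.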

What is left to do is translate the above JSON to a concrete Puppet catalog (remember – Puppet, Hiera and the Puppet modules are already installed on the AMI).

Next steps are:

1) “facter-ize” all the platform variables

2) execute `puppet apply site.pp`, where site.pp is a manifest containing only an empty default node, and let Hiera provide all the classes and variables for the Puppet catalog compilation.

For example, “/etc/rc.local” looks like this:

/usr/loca/sbin/ #reads ec2_userdata and creates facter facts for each variable in platform
/usr/bin/puppet apply site.pp
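The first step can be sketched as follows. This is not the script referenced above, only a minimal illustration that assumes the facts are exposed as Facter external facts in the `key=value` text format and that each platform variable gets a `t_` prefix, as the hierarchy below suggests. The function names and the target file name are hypothetical, and the external facts directory depends on the Facter version.

```python
import json


def platform_facts(userdata_text, prefix="t_"):
    """Map each key in the "platform" section to a prefixed fact name."""
    userdata = json.loads(userdata_text)
    platform = userdata.get("platform", {})
    return {prefix + key: str(value) for key, value in platform.items()}


def render_facts_file(facts):
    """Render facts as key=value lines, one fact per line."""
    return "".join(f"{name}={value}\n" for name, value in sorted(facts.items()))


# Stand-in for the UserData that the instance would read at boot.
userdata_text = json.dumps({
    "platform": {"profile": "tipaas", "role": "webapp", "environment": "development"}
})
facts = platform_facts(userdata_text)
facts_file = render_facts_file(facts)
# On an instance this text could be written to a file such as
# /etc/facter/facts.d/platform.txt (assumed path) before Puppet runs.
```

Once the file is in place, `puppet apply` sees `t_profile`, `t_role` and the other variables as ordinary facts.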

Snippet of hiera.yaml content (this is a very short snippet of our actual hierarchy):

- "%{::t_profile}/role/%{::t_role}/dc/%{::t_dc}/env/%{::t_environment}/stack/%{::t_stack}"
- "%{::t_profile}/role/%{::t_role}/env/%{::t_environment}/stack/%{::t_stack}"
- "%{::t_profile}/role/%{::t_role}/env/%{::t_environment}/release/%{::t_release}"
- "%{::t_profile}/role/%{::t_role}/dc/%{::t_dc}/env/%{::t_environment}"
- "%{::t_profile}/role/%{::t_role}/env/%{::t_environment}"
- "%{::t_profile}/env/%{::t_environment}/stack/%{::t_stack}"
- "%{::t_profile}/env/%{::t_environment}/release/%{::t_release}"
- "%{::t_profile}/env/%{::t_environment}/variables"
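To see what these levels turn into on a given instance, the following sketch interpolates two of the entries above with a set of example facts. It only imitates the lookup for illustration and is not how Hiera is implemented. The `resolve` helper is hypothetical, and we assume that an unset fact interpolates to an empty string.

```python
import re

# Two levels copied from the hierarchy above, most specific first.
HIERARCHY = [
    "%{::t_profile}/role/%{::t_role}/env/%{::t_environment}",
    "%{::t_profile}/env/%{::t_environment}/variables",
]

_TOKEN = re.compile(r"%\{::(\w+)\}")


def resolve(level, facts):
    """Replace each %{::fact} token with its value ('' if the fact is unset)."""
    return _TOKEN.sub(lambda match: facts.get(match.group(1), ""), level)


facts = {"t_profile": "tipaas", "t_role": "webapp", "t_environment": "development"}
paths = [resolve(level, facts) for level in HIERARCHY]
# paths now holds the data source names that would be searched in order.
```

For the example facts, the first level resolves to `tipaas/role/webapp/env/development`, so a value set there overrides the same key in the broader `tipaas/env/development/variables` level.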

Now we have successfully separated the CloudFormation development from the configuration development. We are also using the full potential of Hiera and Puppet to separate code from configuration variables.
But there’s more. We use the platform variables/facts and Serf to connect to other instances of the same stack.

Connecting to other instances in the stack

(Note: This approach is only suitable for development environments or testing phases of CI pipelines.)
Now that we have our facter facts in place, we use them to configure a Serf agent on each instance of the stack. Serf is an agent for decentralized cluster membership.
The agent is configured with a set of tags corresponding to the set of platform variables on our UserData. After the serf agent is configured and running we can use it to obtain information about other nodes in the stack.
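A minimal sketch of generating such an agent configuration from the facts is shown below. It assumes the agent is started with a JSON configuration file that uses the `node_name` and `tags` keys. The `serf_agent_config` helper is hypothetical, and how the agents discover and join each other is not covered here.

```python
import json


def serf_agent_config(node_name, facts, prefix="t_"):
    """Build a Serf agent config dict whose tags mirror the platform facts."""
    # Only platform facts (those with the prefix) become tags.
    tags = {name: value for name, value in facts.items() if name.startswith(prefix)}
    return {"node_name": node_name, "tags": tags}


facts = {
    "t_profile": "tipaas",
    "t_role": "webapp",
    "t_stack": "development-testing",
    "kernel": "Linux",  # unrelated fact, filtered out below
}
config = serf_agent_config("ip-10-100-9-130", facts)
config_text = json.dumps(config, indent=2, sort_keys=True)
```

In our setup this kind of file would be managed by Puppet. The Python version is only meant to show the mapping from facts to tags.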
Here is an example of output obtained by running serf:

#/usr/local/bin/serf members -status alive -format json

{
  "name": "ip-10-100-9-130",
  "addr": "",
  "port": 7946,
  "tags": {
    "t_branch": "releases",
    "t_dc": "aws-us-east-1",
    "t_environment": "development",
    "t_profile": "tipaas",
    "t_release": "101",
    "t_role": "webapp",
    "t_stack": "development-testing"
  }
}

The output of the above command contains one such member definition for each instance in the stack. Now we have to make this information available to Puppet in an easy way. That’s done again with Hiera and facter.

First we create a set of custom facts, one for each profile+role combination. Each fact points to the node with that profile and role whose remaining platform variables (all but profile and role) match those of the node on which the facts are generated.

#facter | grep serf_my_
serf_my_activemq_broker => ip-10-100-2-21
serf_my_activemq_re => ip-10-100-49-114
serf_my_elk_elasticsearch => ip-10-100-41-79
serf_my_idm_syncope => ip-10-100-9-130
serf_my_mongo_repl_set_instance => ip-10-100-62-139
serf_my_repomgr_nexus => ip-10-100-51-250
serf_my_postgres_db => ip-10-100-20-245
serf_my_tipaas_rt_flow => ip-10-100-105-201
serf_my_tipaas_rt_infra => ip-10-100-36-145
serf_my_tipaas_webapp => ip-10-100-47-174
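The matching rule can be sketched like this. The code is an illustration rather than our actual custom fact, and it makes two assumptions: the member definitions have already been parsed into a list of dictionaries, and several nodes with the same profile and role are joined with commas. The function name and the `ip-10-200-1-1` node are hypothetical.

```python
def serf_my_facts(members, my_tags, prefix="t_"):
    """Build serf_my_<profile>_<role> facts from members of the same stack."""
    identity_keys = {prefix + "profile", prefix + "role"}
    # Everything except profile and role must match the local node.
    mine = {k: v for k, v in my_tags.items() if k not in identity_keys}
    grouped = {}
    for member in members:
        tags = member.get("tags", {})
        others = {k: v for k, v in tags.items() if k not in identity_keys}
        if others != mine:
            continue
        profile = tags.get(prefix + "profile")
        role = tags.get(prefix + "role")
        if not profile or not role:
            continue
        grouped.setdefault(f"serf_my_{profile}_{role}", []).append(member["name"])
    return {fact: ",".join(sorted(names)) for fact, names in grouped.items()}


common = {"t_environment": "development", "t_stack": "development-testing"}
members = [
    {"name": "ip-10-100-47-174", "tags": {**common, "t_profile": "tipaas", "t_role": "webapp"}},
    {"name": "ip-10-100-2-21", "tags": {**common, "t_profile": "activemq", "t_role": "broker"}},
    # Same role but a different stack, so it must be ignored.
    {"name": "ip-10-200-1-1", "tags": {"t_environment": "development", "t_stack": "other-stack",
                                       "t_profile": "activemq", "t_role": "broker"}},
]
my_tags = {**common, "t_profile": "tipaas", "t_role": "webapp"}
facts = serf_my_facts(members, my_tags)
```

Because the comparison covers every tag except profile and role, two stacks in the same environment never see each other's nodes.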

Now that we have those custom facts, we can reference them in Hiera at the appropriate levels of the hierarchy.

Example in a Hiera file:

tipaas::activemq_nodes: "%{::serf_my_activemq_broker}"
tipaas::mongo_nodes: "%{::serf_my_mongo_repl_set_instance}"

A Few Conclusions

Encoding shell scripts in CloudFormation templates is a valid approach, but using structured UserData provides better separation of concerns between the infrastructure code and configuration management.

Using Troposphere and Python to develop the CloudFormation templates allows for common developer workflows such as code reviews, local testing and inline documentation as part of the code.

Combining master-less Puppet and Hiera with Serf works really well for orchestrating development and integration environments.
