Stark & Wayne

Quake Speedrun: Recap Level 1, Terraform and Kops

Last time, we successfully finished deploying the first part of our platform. We bootstrapped an AWS environment via Terraform and deployed a Kubernetes cluster via Kops. Now that we have the infrastructure in place, let’s go through what we did.

After we provided our AWS credentials and installed the CLIs, the command we ran was:

quake --bootstrap

Let’s look at the script to understand what exactly it does.

pushd ${REPO_ROOT}/terraform
  # Initialize with a local state file and create the AWS resources
  terraform init --backend-config="path=${TF_STATE_PATH}" &> /dev/null
  terraform apply -auto-approve
  # Flatten the Terraform outputs with jq and convert them to YAML with yq
  yq r \
    <( terraform output -json | \
      jq 'to_entries | map( .value = .value.value ) | from_entries'
    ) \
  > ${REPO_ROOT}/state/kops/vars-${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}.yml
  # Append all QUAKE_* environment variables as YAML keys
  for VAR in $(env | grep QUAKE); do
    echo ${VAR} | sed 's#QUAKE_\(.*\)=#QUAKE_\1: #' \
    >> ${REPO_ROOT}/state/kops/vars-${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}.yml
  done
popd

The bootstrap is pretty straightforward. After switching the working directory (pushd/popd) to the Terraform folder, the script initializes the TF project. By default, Terraform will use the "local backend." This means that we're keeping the Terraform state locally and only our machine is aware of it. If you are working in a team and want to avoid concurrent deployments, you can utilize one of the available "Remote Backends" and "State Locking" mechanisms. If you want more information about available backends, check out the Terraform docs to learn about your options.

Next, we apply our Terraform config by running terraform apply -auto-approve. This will create our TF-defined resources in our AWS account.
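For teams, the remote backend and state locking mentioned above could look roughly like this with Terraform's S3 backend. This is a minimal sketch and not what quake does: it assumes the Terraform config declares an otherwise empty backend "s3" {} block, tf_remote_backend_args is a hypothetical helper, and the bucket, key, region, and lock table names are placeholders. The locking option depends on your Terraform version, so check the backend docs before relying on dynamodb_table.

```shell
# Sketch: print the `terraform init` arguments for an S3 remote backend with locking.
# Arguments: <bucket> <state key> <region> <DynamoDB lock table> (all placeholders).
tf_remote_backend_args() {
  printf '%s\n' \
    "-backend-config=bucket=$1" \
    "-backend-config=key=$2" \
    "-backend-config=region=$3" \
    "-backend-config=dynamodb_table=$4"
}

# Usage (values must not contain whitespace, since the output is word-split):
#   terraform init $(tf_remote_backend_args my-tf-state quake/terraform.tfstate eu-central-1 my-tf-locks)
```

Printing the arguments instead of running Terraform directly makes it easy to review what would be passed before touching any state.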

Terraform groups resources by provider. Providers are Terraform plugins that expose specific functionality (remote API calls, local binary calls, script execution, etc.) as resources you declare in the HashiCorp Configuration Language (HCL). There is a plethora of available Terraform providers that implement CRUD (Create, Read, Update, Delete) operations for most of your infrastructure and config requirements.

Once Terraform has created our resources, we need to know their IDs and data. Parts of our stack down the road will reference these resources, e.g. ExternalDNS needs to know the Route53 zone. Terraform kindly lets us output such values, in our case using the -json flag to change the output format. We're utilizing JQ (a JSON processor) and YQ (a YAML processor inspired by JQ) to reformat the output for our needs. Initially, the output is a JSON object that maps each output name to a nested JSON object:

{
  "QUAKE_CLUSTER_CERT_ARN": {
    "sensitive": false,
    "type": "string",
    "value": "arn:....-af3f-831143c6d1c0"
  },
  "QUAKE_CLUSTER_GITOPS_SSH_PUB": {
    "sensitive": false,
    "type": "string",
    "value": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDL/fnGyc5RFfgnjuzTMvxBM0eyhG+KdK6MhBDKdYWGfmW0e+qHcR9kbvlOx+oFc22b40m9tCU8AJeO248cpcwQacjwyXgnUXJgnpu+DoRf8d5znUSVW6cSm1fhTLXFPHnD9xb6cAw24oxP571KoygH9X7/YXj24c3TmhKBz2u7SvWFkyeYGD8FZmCGKblGbQhriqf/Sn09TSQEAtCrK6hLAJfkzdYwEQsUHD3vWGYY2bCIuRNDXUpXR6U036Mx+WHw/+ZPU59J6BY712Wi1ZBSCz2kwhW730w2Qnj+JyPn7tkF86rMblIOf6zF/ra/Jgv3/Bh+yun4ga6YXN1Opu5L\n"
  }
}

Since we do not need the extra info fields "type" and "sensitive," let's remove them. We're going to map the contents of the field "value" to the parent key, which drops the values we don't need and simplifies the structure. The output first gets piped into JQ, resulting in:

{
  "QUAKE_CLUSTER_CERT_ARN": "arn:....-af3f-831143c6d1c0",
  "QUAKE_CLUSTER_GITOPS_SSH_PUB": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1vMubU/mZTpNI2BYbC+jG6I1eLerwtPSIZ00E0KokzfLOOjqmxqVwg2qVFhRQ4beAj4Mpg1/F7FO4rOZs0weStWt0xxHPqN81MiPKF0CZZYWG3lnLOsw+ivfJ45wrZutVCE71bVfonqrITKVYY6S2y7K5ic8JIOFMc1JGLweiKPoEfHH74VoG3x9ffIo+CXr06wZTzWePU39PdRzfi42xXyw9e3A2L7bQ9/2VpFylkUvNbiSxAKfU+RiBtZZsBhG/aV5a1GtTo2wnaYfZ3ty/GEwitR9IpfwsUNr1l/2aaRHaCVqACoXGThhhtwPlBL3Rnvl9Ivf1vOIhM6r1r7+l\n"
}

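Then we use process substitution, <( ... ), which hands the JQ output to YQ as if it were a file (the shell exposes it through a file descriptor). This is just a convenient way to reformat the JSON into YAML:

QUAKE_CLUSTER_CERT_ARN: "arn:....-af3f-831143c6d1c0"
QUAKE_CLUSTER_GITOPS_SSH_PUB: |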
  ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1vMubU/mZTpNI2BYbC+jG6I1eLerwtPSIZ00E0KokzfLOOjqmxqVwg2qVFhRQ4beAj4Mpg1/F7FO4rOZs0weStWt0xxHPqN81MiPKF0CZZYWG3lnLOsw+ivfJ45wrZutVCE71bVfonqrITKVYY6S2y7K5ic8JIOFMc1JGLweiKPoEfHH74VoG3x9ffIo+CXr06wZTzWePU39PdRzfi42xXyw9e3A2L7bQ9/2VpFylkUvNbiSxAKfU+RiBtZZsBhG/aV5a1GtTo2wnaYfZ3y/GEwitR9IpfwsUNr1l/2aaRHaCVqACoXGThhhtwPlBL3Rnvl9Ivf1vOIhM6r1r7+l
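The loop at the end of the bootstrap script then appends our QUAKE_ environment variables to the same vars file. A minimal sketch of that conversion in isolation follows. quake_env_to_yaml is a hypothetical helper, not part of the quake CLI, and unlike the original sed expression it stops at the first "=" so that values containing "=" stay intact. Values are not YAML-quoted, so this only suits simple single-line values.

```shell
# Sketch: turn QUAKE_* lines of the form NAME=value into YAML-style "NAME: value" lines.
# Reads `env`-style input on stdin and writes the converted lines to stdout.
quake_env_to_yaml() {
  # Keep only lines starting with QUAKE_, then rewrite the first "=" into ": ".
  grep '^QUAKE_' | sed 's#^\(QUAKE_[^=]*\)=#\1: #'
}

# Usage (the target is the vars file written by the bootstrap script):
#   env | quake_env_to_yaml >> "${REPO_ROOT}/state/kops/vars-${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}.yml"
```

Reading line by line from a pipe also avoids the word splitting that a for loop over $(env | grep QUAKE) applies to values with spaces.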

Finally, the output gets written into the state directory. The vars file is then used to provide a KOPS cluster template with our values by our next command:

quake --deploy

Under the hood, this runs the following commands:

kops toolbox template \
  --template ${REPO_ROOT}/kops-templates/cluster_tpl.yml \
  --values ${REPO_ROOT}/kops-templates/cluster_defaults.yml \
  --values ${REPO_ROOT}/state/kops/vars-${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}.yml \
> ${REPO_ROOT}/state/kops/full-manifest-${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}.yml

kops replace -f ${REPO_ROOT}/state/kops/full-manifest-${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}.yml --force

kops update cluster ${QUAKE_CLUSTER_NAME}.${QUAKE_TLD} --yes

As you can see, we only need three commands to get a Kubernetes cluster up and running. We simply interpolate our cluster template with the defaults and our generated config file using the KOPS template subcommand. Kops does not actually create VMs directly; it creates AutoscalingGroups and LaunchTemplates for our instances. These are then picked up by AWS to create the actual EC2 instances.

While we don't use it in this project for the actual deployment, you might be interested to know that KOPS does not have to manage the AWS resources "directly." You can hand the creation of AWS objects over to Terraform. If you're already experienced with Terraform, this might simplify your operations. Try creating the files with:

quake -d -o tf

But back to the KOPS CLI flow. To be able to use the same sequence of commands to create and update the cluster config idempotently, we're utilizing a combination of KOPS replace (with --force, which also creates resources that do not exist yet) & KOPS update. Since we used an existing template (cluster/instancegroups), we did not have to run any of the "kops create cluster/instancegroup" commands.

Templates are useful once you want to stage your environment, or if you want to quickly add or remove functionality/config to a cluster without relying on manually editing cluster config files.

While the above script will deploy a cluster, to manipulate an existing cluster we need to understand a bit about how AWS handles updates to ASGs. Changing or updating an ASG resource might not automatically redeploy the underlying instances. Essentially, AWS would only roll out new instances if the accompanying LaunchTemplate changes.

Most of the time, our LaunchTemplates won't change; an exception is, for example, switching to a newer base image for our clusters. If you want to trigger the deployment process after you have updated your cluster config, you will need to use "kops rolling-update cluster". The rolling-update subcommand is somewhat "blocking." While KOPS will stream info about the current deployment process, it "only" polls Kube/AWS for the status of the requested work.

One of the biggest issues I found with this is that it can quickly eat your AWS API request quota if you're deploying multiple environments into one account. KOPS will validate draining the current/old node, as well as validate the new replacement once it comes up. You can switch to "asynchronous" behaviour by adding the "--cloudonly" flag, or just skip validation to speed up the process.
Unfortunately, in some situations, this ends up being quite confusing, as some upgrade cases require more in-depth control than KOPS "shelling out" to upstream functionality offers. Luckily, the KOPS community provides extensive docs with workarounds and mitigations for such cases.
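To make the two rolling-update variants concrete, here is a small sketch that only assembles the command line so it can be reviewed before running it. rolling_update_cmd is a hypothetical helper and the cluster name is a placeholder; the --yes and --cloudonly flags are the ones discussed above.

```shell
# Sketch: print a `kops rolling-update cluster` command for review.
# Arguments: <cluster name> [cloudonly]
rolling_update_cmd() {
  cmd="kops rolling-update cluster $1 --yes"
  # --cloudonly performs the rolling update without confirming progress with Kubernetes.
  if [ "${2:-}" = "cloudonly" ]; then
    cmd="${cmd} --cloudonly"
  fi
  printf '%s\n' "$cmd"
}

# Usage:
#   rolling_update_cmd "${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}"             # validated rollout
#   rolling_update_cmd "${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}" cloudonly   # skip the Kubernetes checks
```

The validated variant is slower and makes more API calls, but it is the one that notices when a replacement node never becomes healthy.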

One of the resources our TF config created was the KOPS State Store S3 bucket. KOPS utilizes this bucket to keep the "live" cluster manifests, Kubernetes ETCD state backups, certificates, cluster secrets, etc. Simply put: "It contains everything KOPS needs to operate the clusters defined in that bucket." Since it also holds your regular ETCD backups, treat it with the respect it deserves.
While the docs say otherwise, currently there is no other backing mechanism than buckets. Thus, this bucket should never be public and should also be encrypted, as it contains sensitive data.
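If you want to double-check those two properties on an existing state store, a read-only sketch with the AWS CLI could look like this. check_state_store_bucket is a hypothetical helper, and the AWS_CLI override is an assumption added here purely so the calls can be previewed without touching AWS.

```shell
# Sketch: read-only checks on the kops state store bucket.
# Argument: <bucket name> without the s3:// prefix.
# Set AWS_CLI=echo to print the calls instead of executing them.
check_state_store_bucket() {
  bucket="$1"
  aws_cli="${AWS_CLI:-aws}"
  # Shows the public access block settings of the bucket.
  "$aws_cli" s3api get-public-access-block --bucket "$bucket" || return 1
  # Shows the default server-side encryption configuration of the bucket.
  "$aws_cli" s3api get-bucket-encryption --bucket "$bucket" || return 1
}

# Usage:
#   check_state_store_bucket "${KOPS_STATE_STORE#s3://}"
```

Both calls fail if the respective configuration is missing, which is a hint in itself that the bucket needs attention.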

KOPS deploys the Kubernetes Control Plane (ETCD, Kube-API & the (Calico-)CNI-Pods) on the master nodes. These Pods are running as privileged containers.
For ETCD, it utilizes ETCD-MANAGER to deploy the two ETCD clusters (etcd-main, etcd-events) required for your Kubernetes. The state for ETCD is persisted on EBS volumes that get mounted on the master nodes and made available to the respective Pods. Furthermore, etcd-manager will create regular backups of your ETCD state in the $KOPS_STATE_STORE bucket. This is quite useful, as it already covers most of Kubernetes' backup needs from a Control Plane perspective.

Combine this with a proper reclaim policy for the Persistent Volumes behind the Persistent Volume Claims of your StatefulSets (meaning volumes should not be instantly deleted, but rather kept around once their claim is gone) and you should be able to recover even your workloads from most failures.
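In Kubernetes terms, this is the reclaim policy of the PersistentVolume behind a claim. A minimal sketch for switching an existing volume to Retain follows; retain_pv is a hypothetical helper, the PV name is a placeholder, and the KUBECTL override is only there to preview the call.

```shell
# Sketch: set the reclaim policy of an existing PersistentVolume to "Retain".
# Argument: <PersistentVolume name>
# Set KUBECTL=echo to print the call instead of executing it.
retain_pv() {
  pv_name="$1"
  kubectl_bin="${KUBECTL:-kubectl}"
  # With Retain, the backing volume is kept when its claim is deleted.
  "$kubectl_bin" patch pv "$pv_name" \
    -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Usage:
#   retain_pv pvc-0123abcd
```

For dynamically provisioned volumes, the same policy can be set once on the StorageClass so that new volumes inherit it.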

Another interesting, though somewhat limited, feature of KOPS is addons. While they are easy to apply and some are integrated out of the box, the choice of available addons is quite small and, unfortunately, KOPS uses the channels tool to deploy them.

This happens in the cloud-init lifecycle phase of the deployed EC2 machine and before the Kube-API is available. Thus, you cannot stream logs from inside those containers via KUBECTL and you will need to SSH onto the master nodes if something isn't going as it's supposed to.

Thus, debugging broken/misbehaving addons can be quite a pain depending on the topology of your deployment, especially if the addon in question can lead to cascading failures down the road (I'm looking at you, kube-metrics). While I initially used some of the addons for convenience, I eventually moved to HELM/ARGO to manage additional platform components. Upstream Helm charts in combination with ArgoCD provide more control, more features, more components, and a more widely used toolset.

Last notes on KOPS:

I've hit several issues relating to tags with KOPS on AWS. The first outage was caused by the ETCD instances not being able to build/join a cluster on create/update of an environment. While our investigation into the matter surfaced an issue with our own environment deletion/cleanup process, the information is good to have nonetheless.

The root cause of our deployment issues was that the EBS volumes previously attached to the master nodes were not properly removed/cleaned up. This caused multiple volumes with the same tags to be discovered and mounted by KOPS. Unfortunately, the behaviour was also somewhat flaky: depending on which volume got discovered and mounted first, the deployment would either work or fail. Since this happens in the cloud-init stage, there were no warnings in the KOPS CLI output or our logs.
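A quick way to spot this situation is to list the volumes of a cluster together with one identifying tag and look for tag values that show up more than once. The sketch below only does the duplicate detection; find_duplicate_volume_tags is a hypothetical helper, and the tag keys in the usage comment (KubernetesCluster, Name) are assumptions you should verify against the tags on your own volumes.

```shell
# Sketch: report tag values that are carried by more than one volume.
# Input (stdin): one "VOLUME_ID TAG_VALUE" pair per line, whitespace separated.
# Output: one line per duplicated tag value, followed by the volumes sharing it.
find_duplicate_volume_tags() {
  awk '
    NF >= 2 { count[$2]++; vols[$2] = vols[$2] " " $1 }
    END { for (tag in count) if (count[tag] > 1) print tag ":" vols[tag] }
  ' | sort
}

# Usage (assumed tag keys, adjust to what your volumes actually carry):
#   aws ec2 describe-volumes \
#     --filters "Name=tag:KubernetesCluster,Values=${QUAKE_CLUSTER_NAME}.${QUAKE_TLD}" \
#     --query 'Volumes[].[VolumeId, Tags[?Key==`Name`]|[0].Value]' \
#     --output text | find_duplicate_volume_tags
```

Any output at all means leftover volumes from a previous environment are still around and should be cleaned up before the next deployment.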

The second issue appeared when we updated environments from 1.14 to 1.15. The master nodes took an unreasonable amount of time to come up and most of the time required a manual restart of the kubelet service on the master nodes. While researching this, I found reports from other users that looked similar:

We have seen intermittent issues (less than 1%) with the canal CNI not loading (empty /etc/cni/net.d) on startup but have never gotten to a root cause/fix beyond "kick it until it starts".

The CNI config (created by an init container of Calico) was nowhere to be found and we were not able to find the reason for a few days.

We started going through old issues on GitHub that looked similar. That research uncovered a workaround posted in an issue that had the same symptoms as our deployments (it turns out that Spot Instances may get their tags propagated somewhat later than normal EC2 instances). While we did not use Spot Instances, it perfectly described our experience at the time.

Thus, I tried to run the mentioned workaround and noticed that we were getting 401s on requests trying to list the tags. Doing further research, we found that KOPS' default policies only contain the "autoscaling:DescribeTags" permission but not "ec2:DescribeTags". After we added the EC2 permission to our IAM roles, everything started to deploy faster and more reliably, and the workaround was no longer necessary. This is something I'm still planning to follow up on, as the KOPS code, at least in some places, suggests using EC2 DescribeTags.
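For reference, the extra permission boils down to a single IAM statement. The sketch below prints a policy document containing it; describe_tags_policy is a hypothetical helper, and the role and policy names in the usage comment are placeholders rather than the names used in our environments.

```shell
# Sketch: print an IAM policy document that allows ec2:DescribeTags.
describe_tags_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:DescribeTags"],
      "Resource": ["*"]
    }
  ]
}
EOF
}

# Usage (role and policy names are placeholders):
#   aws iam put-role-policy \
#     --role-name <your-master-role> \
#     --policy-name allow-ec2-describe-tags \
#     --policy-document "$(describe_tags_policy)"
```

If the IAM roles are managed through Terraform or the KOPS cluster spec, the same statement belongs there instead, so that it survives the next update.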

To sum it up: I hope my failures will prevent yours :)

Thanks for reading and stay tuned for the next part of the QUAKE series about deploying and using Argo to implement continuous delivery and a few extra features for our platform.