This is another interesting day in the life of modern platforms (BOSH and Cloud Foundry) and automation (Concourse).
The Problem
Recently we ran into an issue with Concourse. After building a seemingly successful pipeline and using it to deploy a MicroBOSH to AWS, we hit a snag: the deploys always failed for the same reason, because the deployment couldn't find the AWS instance with the ID recorded in the manifest. But why?
Symptoms
Initially diagnosing the behavior was reasonably straightforward: when running the pipeline, an error like the following would appear:
Started deploy micro bosh > Mount disk. Done (00:00:01)
instance i-22222222 has invalid disk: Agent reports while deployer's record shows vol-11111111
Fix attempt #1
Going into AWS -> EC2 -> Volumes and searching for vol-11111111 easily pulled up the volume, but it was attached to a different instance, i-33333333. In fact, going into Instances and searching for i-22222222 showed that there were no instances with that ID!
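The same check can be scripted with the AWS CLI; the IDs below are the placeholders used throughout this post, and the commands assume credentials for the account are already configured:
# Which instance actually has the volume right now?
aws ec2 describe-volumes --volume-ids vol-11111111 \
  --query 'Volumes[].Attachments[].InstanceId'
# Does the instance recorded in bosh-deployments.yml even exist?
aws ec2 describe-instances --instance-ids i-22222222 \
  --query 'Reservations[].Instances[].State.Name'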
This means that for some reason the bosh-deployments.yml file is wrong. This is the "database" for the bosh micro deploy state. At this point, I wasn't yet sure why it had the incorrect state, so I fixed it to match reality according to the AWS Console:
---
instances:
- :id: 1
  :name: microbosh
  :uuid: (( some UUID ))
  :stemcell_cid: (( AMI ))
  :stemcell_sha1: (( some SHA ))
  :stemcell_name: (( some stemcell ))
  :config_sha1: (( some SHA ))
  :vm_cid: i-33333333
  :disk_cid: vol-11111111
disks: []
registry_instances:
- :id: 14
  :instance_id: i-33333333
  :settings: (( bunch of stuff ))
Great! Everything is kosher. Trigger the pipeline aaaaaaannnnnnndddddd…
Started deploy micro bosh > Mount disk. Done (00:00:01)
instance i-33333333 has invalid disk: Agent reports while deployer's record shows vol-11111111
Going back into AWS shows that i-33333333 has been terminated, and inspecting the volume vol-11111111 shows it is now attached to a new instance, i-44444444; however, the bosh-deployments.yml file still has i-33333333.
Hmmm.
Fix Attempt #2
Using one of our earlier blog posts as a guide, I cleaned out all the "dynamic bits" and tried triggering the pipeline again. Unfortunately, this did not resolve the issue: even though neither the instance_id nor the vm_cid field was present when I started the pipeline, the wrong instance ID was populated in both places when it ran, and the pipeline terminated with the same error.
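For context, the cleaned-up file looked roughly like the sketch below. Exactly which fields count as "dynamic bits" is covered in the earlier post, so treat this as an approximation (only the two fields mentioned above are shown removed) rather than the literal file:
---
instances:
- :id: 1
  :name: microbosh
  :uuid: (( some UUID ))
  :stemcell_cid: (( AMI ))
  :stemcell_sha1: (( some SHA ))
  :stemcell_name: (( some stemcell ))
  :config_sha1: (( some SHA ))
  # :vm_cid removed -- the deployer should rediscover or recreate the VM
  :disk_cid: vol-11111111
disks: []
registry_instances:
- :id: 14
  # :instance_id removed
  :settings: (( bunch of stuff ))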
Fix Attempt #3
At this point I deleted the EC2 instance that was supposed to be attached to the persistent disk. (It is probably obvious that the volume is not set to delete on instance termination, or it would have been disappearing as well, but you know I double-checked that anyway. Because human error, and what not.) Then I created a NEW instance manually using the criteria in the manifest, updated the bosh-deployments.yml file, and did a manual bosh deploy (sketched just below). SUCCESS! I triggered the pipeline to run – SUCCESS!
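The manual deploy itself is the usual micro deployer two-step; something along these lines, where the manifest path and stemcell filename are stand-ins for whatever your environment actually uses:
# Point the micro deployer at the deployment manifest, then deploy the stemcell
bosh micro deployment environments/aws/micro_bosh.yml
bosh micro deploy light-bosh-stemcell-XXXX-aws-xen-hvm-ubuntu-trusty-go_agent.tgz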
BUT because of the change I made to the pipeline, the pipeline was triggered to run a second time after the successful completion of my manual run. This time it FAILED.
And the instance ID was wrong again.
Deeper Troubleshooting
Clearly, something a bit deeper is going on in the pipeline itself. Since this particular pipeline pushes its changes to GitHub as a sort of audit trail, to track down where the problem was I looked at all of its git commits. This is where the problem was made a little more obvious.
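Assuming the state file lives somewhere like the path below (the exact location depends on the repo layout), walking its history is a one-liner:
# Hypothetical path -- show every commit that touched the micro BOSH state file
git log -p --follow -- environments/aws/bosh-deployments.yml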
Looking at the commits, the problem was rooted in the window between when our pipeline was triggered and when it grabbed the deployment state. Basically, the pipeline grabbed the "state of the universe" at the beginning and used it to populate bosh-deployments.yml, then started to run and changed that state of the universe, but still used the now-outdated bosh-deployments.yml file to try and deploy. This, of course, caused the failure.
To prevent the pipeline from triggering prematurely and running with out-of-date information, I updated the resources in the pipeline.yml file to ignore our pipeline-inputs.yml:
resources:
- name: aws-pipeline-changes
  type: git
  source:
    uri: {{pipeline-git-repo}}
    branch: {{pipeline-branch}}
    paths:
    - environments/aws/templates/pipeline
    - environments/aws/templates/bin
    - environments/aws/templates/releases
    ignore_paths:
    - environments/aws/templates/pipeline/pipeline-inputs.yml
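With the git resource, ignore_paths is the inverse of paths: commits whose only changes fall under an ignored path do not produce a new version of the resource. In other words, a pipeline run that merely rewrites pipeline-inputs.yml no longer counts as a change worth re-triggering on.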
With some cautious optimism I ran the pipeline again. The good news: the original issue was fixed. The bad (ish?) news: it failed with a new error:
unexpected end of JSON input
Welp, at least our bosh-deployments.yml file was fixed. Huzzah.
Fix One Bug, Find Another: The JSON Error
The JSON error appeared right at the build stage – before the pipeline would grab anything and do its magic. In the UI, both the stemcell-aws asset and the environment were showing orange. When I clicked on stemcell-aws, I saw that it wasn't able to grab the stemcell – it was just dying.
Looking through the resources in pipeline.yml, the stemcell-aws resource was using bosh-io-stemcell. In Concourse itself, that resource is located at bosh-io-stemcell-resource. The assets/check file is where the curl command runs to grab the stemcell:
curl --retry 5 -s -f http://bosh.io/api/v1/stemcells/$name -o $stemcells
So I ran this command on the jump box that hosts our pipeline, and it failed. As an important aside, the reason it failed is restrictions on our client's network: only HTTPS connections are allowed out, and plain HTTP requests are redirected before they ever leave the company intranet. Because the check runs curl with -s and -f, a blocked request leaves nothing to parse, which is presumably where the "unexpected end of JSON input" comes from. The fix was as simple as switching the curl command over to HTTPS.
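Assuming nothing else about the request needed to change, that looks like:
curl --retry 5 -s -f https://bosh.io/api/v1/stemcells/$name -o $stemcells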
After making the pull request, someone pointed out that the bosh-io-release resource had a similar line of code, so it would probably hit the same problem eventually. To avoid this, we submitted pull requests with the same fix for that resource as well.
Resolved!
After the Concourse team merged our pull request to fix the JSON error, we were able to definitively verify that our initial issue was resolved with a series of successful pipeline deployments. ✌.ʕʘ‿ʘʔ.✌