What happens when bosh reports that a disk doesn't exist when it clearly does? We ran into this issue on vSphere, and the troubleshooting turned out to be a little interesting.
The symptoms
We have a client with a rather large deployment in vSphere. While building the pipeline, we noticed that occasionally a microbosh deployment would fail and we'd have to perform "surgery" on the bosh-deployments.yml file. What we'd see was bosh micro deploy failing like this:
...
Stemcell info
-------------
Name: bosh-vsphere-esxi-ubuntu-trusty-go_agent
Version: 3087
Will deploy due to configuration changes
Started prepare for update
Started prepare for update > Stopping agent services. Done (00:00:01)
Started prepare for update > Unmount disk. Done (00:00:01)
Started prepare for update > Detach disk
at depth 0 - 20: unable to get local issuer certificate
/var/lib/gems/1.9.1/gems/bosh_vsphere_cpi-2.0.0/lib/cloud/vsphere/disk_provider.rb:52:in `find': Could not find disk with id 'disk-e5b812c1-6e20-4365-aba1-99e24f82b889' (Bosh::Clouds::DiskNotFound)
from /var/lib/gems/1.9.1/gems/bosh_vsphere_cpi-2.0.0/lib/cloud/vsphere/cloud.rb:321:in `block in detach_disk'
from /var/lib/gems/1.9.1/gems/bosh_common-1.3072.0/lib/common/thread_formatter.rb:49:in `with_thread_name'
from /var/lib/gems/1.9.1/gems/bosh_vsphere_cpi-2.0.0/lib/cloud/vsphere/cloud.rb:319:in `detach_disk'
from /var/lib/gems/1.9.1/gems/bosh_cpi-1.3072.0/lib/cloud/internal_cpi.rb:22:in `invoke_cpi_method'
from /var/lib/gems/1.9.1/gems/bosh_cpi-1.3072.0/lib/cloud/internal_cpi.rb:10:in `method_missing'
from /var/lib/gems/1.9.1/gems/bosh_cli_plugin_micro-1.3072.0/lib/bosh/deployer/instance_manager.rb:327:in `block in detach_disk'
from /var/lib/gems/1.9.1/gems/bosh_cli_plugin_micro-1.3072.0/lib/bosh/deployer/instance_manager.rb:85:in `step'
from /var/lib/gems/1.9.1/gems/bosh_cli_plugin_micro-1.3072.0/lib/bosh/deployer/instance_manager.rb:326:in `detach_disk'
from /var/lib/gems/1.9.1/gems/bosh_cli_plugin_micro-1.3072.0/lib/bosh/deployer/instance_manager.rb:190:in `update'
from /var/lib/gems/1.9.1/gems/bosh_cli_plugin_micro-1.3072.0/lib/bosh/deployer/instance_manager.rb:102:in `block in update_deployment'
from /var/lib/gems/1.9.1/gems/bosh_cli_plugin_micro-1.3072.0/lib/bosh/deployer/instance_manager.rb:92:in `with_lifecycle'
from /var/lib/gems/1.9.1/gems/bosh_cli_plugin_micro-1.3072.0/lib/bosh/deployer/instance_manager.rb:102:in `update_deployment'
from /var/lib/gems/1.9.1/gems/bosh_cli_plugin_micro-1.3072.0/lib/bosh/cli/commands/micro.rb:179:in `perform'
from /var/lib/gems/1.9.1/gems/bosh_cli-1.3072.0/lib/cli/command_handler.rb:57:in `run'
from /var/lib/gems/1.9.1/gems/bosh_cli-1.3072.0/lib/cli/runner.rb:56:in `run'
from /var/lib/gems/1.9.1/gems/bosh_cli-1.3072.0/bin/bosh:19:in `<top (required)>'
from /usr/local/bin/bosh:23:in `load'
from /usr/local/bin/bosh:23:in `<main>'
Looking at vSphere, the disk did in fact exist. Not only did it exist, it belonged to the VM CID recorded in the bosh-deployments.yml file:
---
instances:
- :id: 1
  :name: microbosh
  :uuid: {{UUID}}
  :stemcell_cid: {{SCID}}
  :stemcell_sha1: {{SSHA}}
  :stemcell_name: bosh-stemcell-3087-vsphere-esxi-ubuntu-trusty-go_agent
  :config_sha1: {{CONF}}
  :vm_cid: vm-36a656b0-b1b3-4fd4-9cf7-63fa95a5bc63
  :disk_cid: disk-e5b812c1-6e20-4365-aba1-99e24f82b889
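If you'd rather script that check than click through the vSphere UI, a minimal sketch along these lines (using pyvmomi) reads the CIDs out of bosh-deployments.yml and asks vCenter whether the VM exists and which disk backings it has attached. The vCenter host and credentials below are placeholders, and it assumes the CPI names the backing VMDK after the disk CID; none of that is taken from the original output:

#!/usr/bin/env python
# Sketch: cross-check bosh-deployments.yml against what vSphere actually has.
import ssl
import yaml
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

with open("bosh-deployments.yml") as f:
    instance = yaml.safe_load(f)["instances"][0]
vm_cid = instance[":vm_cid"]
disk_cid = instance[":disk_cid"]

ctx = ssl._create_unverified_context()  # lab only: skips certificate verification
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
try:
    # Walk every VM in the inventory and look for one named after the vm_cid.
    view = si.content.viewManager.CreateContainerView(
        si.content.rootFolder, [vim.VirtualMachine], True)
    vm = next((v for v in view.view if v.name == vm_cid), None)
    if vm is None:
        print("VM %s not found in vSphere" % vm_cid)
    else:
        backings = [d.backing.fileName for d in vm.config.hardware.device
                    if isinstance(d, vim.vm.device.VirtualDisk)]
        print("VM %s exists; attached disk backings:" % vm_cid)
        for path in backings:
            # Assumption: the persistent disk's VMDK path contains the disk CID.
            marker = "   <-- matches disk_cid" if disk_cid in path else ""
            print("  %s%s" % (path, marker))
finally:
    Disconnect(si)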
What did the log say?
The log was a little less than helpful: the deployment seemed to go through the normal motions until encountering an error such as this one:
...
I, [2015-09-28T14:16:35.558692 #28566] [attach_disk(vm-36a656b0-b1b3-4fd4-9cf7-63fa95a5bc63, disk-e5b812c1-6e20-4365-aba1-99e24f82b889)] INFO -- : Attaching disk
I, [2015-09-28T14:16:37.683131 #28566] [attach_disk(vm-36a656b0-b1b3-4fd4-9cf7-63fa95a5bc63, disk-e5b812c1-6e20-4365-aba1-99e24f82b889)] INFO -- : Finished attaching disk
I, [2015-09-28T14:17:10.390309 #28566] [0x783ffc] INFO -- : Director is ready:
...
E, [2015-09-28T15:16:26.775114 #29900] [0xcf5800] ERROR -- : not unmounting disk-e5b812c1-6e20-4365-aba1-99e24f82b889 as it doesn't belong to me: []
I, [2015-09-28T15:16:28.919513 #29900] [detach_disk(vm-36a656b0-b1b3-4fd4-9cf7-63fa95a5bc63, disk-e5b812c1-6e20-4365-aba1-99e24f82b889)] INFO -- : Detaching disk: disk-e5b812c1-6e20-4365-aba1-99e24f82b889 from vm: vm-36a656b0-b1b3-4fd4-9cf7-63fa95a5bc63
...
So basically the disk would attach to the VM successfully, but when it came time to detach it, the VM would respond with "it isn't mine!" (the empty [] in the error above suggests the agent knew of no persistent disks at all). I feel like there's a bad joke in here, but I digress.
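As an aside, you can ask the agent directly which disks it thinks it owns. This is only a sketch, run on the microbosh VM itself, and it assumes the agent keeps its settings in /var/vcap/bosh/settings.json with persistent disks listed under disks.persistent (its usual layout, not something confirmed by the log above):

#!/usr/bin/env python
# Sketch: print the persistent disk CIDs the BOSH agent believes belong to this VM.
import json

SETTINGS = "/var/vcap/bosh/settings.json"  # assumed agent settings location

with open(SETTINGS) as f:
    settings = json.load(f)

persistent = settings.get("disks", {}).get("persistent", {})
if persistent:
    for cid, hint in persistent.items():
        print("agent claims disk %s (%s)" % (cid, hint))
else:
    # Would line up with the empty [] in the log: the agent knows of no persistent disks.
    print("agent claims no persistent disks")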
So what happened?
As we continued to investigate, we noticed that some of the INFO statuses did not make sense. For example, we saw this VM being created:
...
INFO -- : Setting VM env: {"vm"=>{"name"=>"vm-7c516a12-8ad5-4033-8a92-ff35030e3a2d" ...
while the VM CID in the bosh-deployments.yml file was vm-700f482f-36cf-4c86-95c5-c86a18499c4a. This caused us to investigate a little further.
What happened
Essentially, what appears to have happened is that the original deployment somehow wasn't saved correctly, or at all, in bosh-deployments.yml (at this point it is hard to say which). Every subsequent redeploy would then try to create a new microbosh with the same specifications, including the same IP address. Since vSphere allows multiple VMs to claim the same IP without warning, by the time we caught the issue there were over a dozen VMs in vSphere with the same IP address. That collision is what caused the confusion when creating and destroying VMs as part of the normal deployment process, which would then fail.
Since we deploy the microboshes via a pipeline, and this issue reappears whenever a microbosh deploy fails for any reason, we are adding a script to our vSphere pipeline that checks whether the microbosh IP is already in use and, if it is, whether it is in use only by the VM specified in bosh-deployments.yml.
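The script itself is specific to our pipeline, but a rough sketch of the idea looks like the following. It uses pyvmomi's FindAllByIp (which relies on VMware Tools reporting guest IPs back to vCenter), and the vCenter host, credentials and IP are placeholders rather than values from this deployment:

#!/usr/bin/env python
# Sketch: fail the pipeline if the microbosh IP is claimed by any VM other than
# the one recorded in bosh-deployments.yml.
import ssl
import sys
import yaml
from pyVim.connect import SmartConnect, Disconnect

MICROBOSH_IP = "10.0.0.6"  # the static IP the microbosh is deployed with (placeholder)

with open("bosh-deployments.yml") as f:
    expected_vm_cid = yaml.safe_load(f)["instances"][0][":vm_cid"]

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
try:
    # FindAllByIp returns every VM whose guest tools report this IP.
    vms = si.content.searchIndex.FindAllByIp(ip=MICROBOSH_IP, vmSearch=True)
    names = [vm.name for vm in vms]
finally:
    Disconnect(si)

if names and names != [expected_vm_cid]:
    print("IP %s is in use by unexpected VMs: %s" % (MICROBOSH_IP, names))
    sys.exit(1)
print("IP %s is free or owned only by %s" % (MICROBOSH_IP, expected_vm_cid))

If the check finds other VMs claiming the IP, the pipeline can fail early so the strays get cleaned up (or bosh-deployments.yml repaired) before bosh micro deploy runs again.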