Jul 09, 2014 Highly available BOSH with Binary BOSHes
[updated with Lessons from the Real World section]
One great feature of BOSH is that it can resurrect dead servers. AWS kills a VM and then sends you the warning email afterwards? BOSH will spot the missing VM and recreate it. Awesome.
But who is resurrecting BOSH?
Well, another BOSH or Micro BOSH. But who resurrects that one?
During the 2013 BOSH Summit, the idea was proposed to have two BOSHes each deploying and resurrecting the other. Same cost as before - replace the Micro BOSH with a single VM "Meta BOSH".
I call them "Binary BOSHes" and this week I sat down to figure out how to do it (coincidentally some excitement for the idea was building on the mailing list this week).
There are quite a few steps to be honest:
- Bootstrap Micro BOSH
- Micro BOSH deploys Primary BOSH
- Primary BOSH deploys Meta BOSH
- Stop all BOSHes
- Sync Micro BOSH data to Meta BOSH
- Start Meta BOSH; verify it thinks it is Micro BOSH
- Recreate Primary BOSH from Meta BOSH
- Terminate a BOSH; the other BOSH will resurrect it
- Destroy Micro BOSH
- Finally target Primary BOSH for general use
I'm going to assume you already know how to bootstrap Micro BOSH and deploy a single VM BOSH with it. I do work for a fancy consultancy who can also help with anything BOSH related if you'd like us to help.
After bootstrapping, there will be 3 VMs:
The examples in this post assume deployment files are in
Deploy a Micro BOSH (for example with bosh-bootstrap) with
Then use it to deploy Primary BOSH (deployment file boshes/bosh-primary.yml). Finally, use Primary BOSH to deploy Meta BOSH (deployment file
This post assumes that Primary BOSH and Meta BOSH are both simple VMs, such as this example solo deployment manifest.
At this point you will have the following files:
$ cd ~/workspace/deployments
$ tree
.
├── boshes
│   ├── bosh-meta.yml
│   └── bosh-primary.yml
├── microbosh
│   └── micro_bosh.yml
In this post I'll assume the BOSHes have the following static IP addresses and BOSH CLI aliases:
$ bosh target 10.10.0.3 micro
$ bosh target 10.10.0.4 primary
$ bosh target 10.10.0.5 meta
FYI when you target a BOSH you can give it an alias. Now you can switch between BOSHes with their aliases, rather than remembering their IP addresses:
$ bosh target meta
Stop all BOSHes
To ensure data integrity (avoid anyone trying to use the BOSHes during migration) stop all the BOSHes:
ssh email@example.com -i ~/.ssh/id_rsa_bosh
sudo su -
monit stop all
watch monit summary
When all the processes on each VM have stopped, we are ready to sync the data from Micro BOSH to Meta BOSH.
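If you would rather script the wait than watch the screen, here is a minimal sketch. The function name `all_processes_stopped` is hypothetical, and it assumes that `monit summary` lists each stopped process on a line starting with `Process` and ending with `not monitored`; check that against the output on your own VMs before relying on it.

```shell
#!/bin/sh
# Sketch: decide from `monit summary` output whether every process is stopped.
# Assumption: a stopped process is listed like
#   Process 'director'                  not monitored
all_processes_stopped() {
  # Reads `monit summary` output on stdin. Fails if any "Process" line
  # ends with a status other than "not monitored" (this includes
  # transitional states such as a pending stop).
  if grep '^Process' | grep -v 'not monitored[[:space:]]*$' >/dev/null; then
    return 1
  fi
  return 0
}

# Hypothetical usage on a BOSH VM, as root, after `monit stop all`:
#   until monit summary | all_processes_stopped; do sleep 5; done
```

The check is deliberately strict: anything that is not plainly `not monitored` counts as still running, so the loop errs on the side of waiting longer.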
Repeat for Primary BOSH (10.10.0.4) and Meta BOSH VMs (
Migrate Micro BOSH to Meta BOSH
Meta BOSH will be used solely to maintain, upgrade and resurrect the Primary BOSH. Primary BOSH will be used to operate normal BOSH deployments, such as Cloud Foundry.
This stage has three steps:
- Configure Micro BOSH to allow rsync via root user
- Set up Meta BOSH with SSH access to Micro BOSH
- Rsync Micro BOSH data to Meta BOSH
Within Micro BOSH permit SSH for the root login:
- Restart SSH service:
sudo service ssh restart
Within Meta BOSH, set up the private SSH key to allow rsync to copy files from Micro BOSH into this Meta BOSH.
sudo su -
mkdir ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/id_rsa_bosh
chmod 600 ~/.ssh/id_rsa_bosh
vi ~/.ssh/id_rsa_bosh
Paste in the private key (probably also ~/.ssh/id_rsa_bosh) used to SSH into Micro BOSH.
Add an SSH configuration host for easy reference by rsync later. In
Host micro
  User root
  Hostname 10.10.0.3
  IdentityFile ~/.ssh/id_rsa_bosh
Again, within Meta BOSH, rsync over the Micro BOSH's data folder:
cd /var/vcap/store
rm -rf *
rsync micro:/var/vcap/store/* . --progress -r
/var/vcap/store/director folders are synced over.
(Thanks to Alan Moran & Leandro Cacciagioni from Altoros for some suggestions on the rsync instructions above)
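Before restarting anything, it can be reassuring to confirm that the copy is complete. The following is a sketch of one way to do that, not part of the original procedure: the helper name `store_manifest` and the `/tmp/*.manifest` file names are made up for illustration. It prints a sorted checksum listing of a directory, which you can generate on both VMs and compare.

```shell
#!/bin/sh
# Sketch: print a sorted checksum listing for every regular file under a directory.
store_manifest() {
  # Output lines look like "<crc> <bytes> ./relative/path", sorted by path.
  # The subshell keeps the caller's working directory unchanged.
  (cd "$1" && find . -type f -exec cksum {} + | sort -k 3)
}

# Hypothetical usage, as root, while all BOSH processes are still stopped:
#   on Micro BOSH:  store_manifest /var/vcap/store > /tmp/micro.manifest
#   on Meta BOSH:   store_manifest /var/vcap/store > /tmp/meta.manifest
# Copy one manifest to the other VM and compare them:
#   diff /tmp/micro.manifest /tmp/meta.manifest
```

An empty `diff` means both stores hold the same files with the same contents. Note that the listing covers file contents only; it says nothing about ownership or permissions.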
Restart Meta BOSH
Now restart Meta BOSH processes.
monit start all
watch monit summary
Wait until all processes are up and running.
Now exit from the Meta BOSH VM and run some sanity tests from your own machine:
$ bosh target meta
$ bosh status
$ bosh stemcells
$ bosh releases
$ bosh deployments
+--------------+------------+--------------------------------+
| Name         | Release(s) | Stemcell(s)                    |
+--------------+------------+--------------------------------+
| bosh-primary | bosh/89    | bosh-aws-xen-ubuntu-lucid/2624 |
+--------------+------------+--------------------------------+
The results should match those of the Micro BOSH: it should think that it was the BOSH that deployed Primary BOSH.
Recreate Primary BOSH
At this point Primary BOSH VM is still running but the BOSH processes within it are not. We cannot just restart the processes and commence happy times.
This VM's BOSH agent still thinks there is a Micro BOSH. It still thinks it should be able to receive requests from it and transmit health information to it. But it's dead, and later it will be destroyed.
The simplest solution is to manually destroy - via your IaaS' console - the Primary BOSH VM and ask Meta BOSH to recreate it.
Assuming you do not already have resurrection enabled in your Meta BOSH, you can request that the Primary BOSH VM be recreated:
$ bosh cck
Answer that you want the VM recreated and wait patiently.
The new VM will have a BOSH agent that belongs to Meta BOSH, rather than Micro BOSH. Success!
Meta BOSH is now able to re-deploy configuration changes to Primary BOSH. And vice versa.
In the deployment manifests of both BOSHes, add some resurrection configuration for the Health Manager.
properties:
  ...
  hm:
    resurrector_enabled: true
    resurrector:
      minimum_down_jobs: 5
      percent_threshold: 0.2
      time_threshold: 600
The values above are the defaults at the time of writing.
See the documentation for more information on BOSH resurrection.
In the IaaS console, manually terminate the Primary BOSH. Do it, be bold!
Wait a minute or two and you'll see Meta BOSH start to resurrect it.
$ bosh target meta
$ bosh tasks recent --no-filter
+----+------------+-------------------------+-------+--------------------------------+
| #  | State      | Timestamp               | User  | Description                    |
+----+------------+-------------------------+-------+--------------------------------+
| 32 | processing | 2014-07-09 02:20:00 UTC | admin | scan and fix                   |
...
$ bosh task 32
Director task 32
Started scanning 1 vms
Started scanning 1 vms > Checking VM states. Done (00:00:33)
Started scanning 1 vms > 0 OK, 0 unresponsive, 1 missing, 0 unbound, 0 out of sync. Done (00:00:00)
Done scanning 1 vms (00:00:33)
Started applying problem resolutions > missing_vm 10: Recreate VM using last known apply spec
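To avoid scanning the task table by eye, here is a small sketch that pulls out the IDs of "scan and fix" tasks. The function name `scan_and_fix_ids` is hypothetical, and it assumes the five-column table layout shown above; a different CLI version may print a different layout.

```shell
#!/bin/sh
# Sketch: extract the task numbers of "scan and fix" rows from the
# table printed by `bosh tasks recent --no-filter`.
scan_and_fix_ids() {
  # Splitting on "|" makes field 1 the empty text before the first pipe,
  # field 2 the task number and field 6 the description.
  awk -F'|' '$6 ~ /scan and fix/ { gsub(/ /, "", $2); print $2 }'
}

# Hypothetical usage:
#   bosh tasks recent --no-filter | scan_and_fix_ids
```

Each printed number can then be passed to `bosh task` to follow the resurrection, as in the output above.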
Hurray! Each BOSH now resurrects the other!
When you're comfortable, you can now delete your Micro BOSH VM.
You can delete it via the BOSH CLI, which will also delete the persistent disk:
$ cd ~/workspace/deployments
$ bosh micro delete
Alternatively, delete it manually via the IaaS console, where you can choose whether to also delete the persistent disk.
$ bosh target primary
The Primary BOSH is where you will now upload releases and deploy them.
You only need to target Meta BOSH if you are upgrading, scaling or maintaining Primary BOSH.
Lessons from the Real World
In this article I didn't discuss what the BOSH manifest for deploying BOSH itself looks like. It was implicit that you'd take a solo manifest example and adapt it to your needs. E.g. aws or openstack.
There is one important change - do not set a DNS recursor to a Micro BOSH. Why? Because we killed it.
If you forget to do this, like I did, then all your DNS resolutions, whether to external nameservers or internal BOSH DNS, will be slow.
I was clocking host api.my-cloudfoundry.com at 3 seconds from within a BOSH VM, and a few milliseconds from any other VM.
So the following snippet from a manifest is bad:
properties:
  dns:
    address: 18.104.22.168 # CHANGE: Elastic IP 1
    db: *bosh_db
    user: powerdns
    password: powerdns
    database:
      name: powerdns
    webserver:
      password: powerdns
    recursor: 22.214.171.124 # CHANGE: microBOSH IP address
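A quick way to catch a leftover recursor like the one in the snippet above is to search your manifests for it before deploying. This is a minimal sketch: the function name `find_recursors` is hypothetical, and it only matches a `recursor:` key at the start of a line (ignoring indentation), so commented-out settings are skipped.

```shell
#!/bin/sh
# Sketch: list every uncommented "recursor:" setting in the given manifests.
find_recursors() {
  # Prints "file:line:content" for each match and returns non-zero when
  # nothing is found. /dev/null is added so grep always prints the file
  # name, even when only one manifest is given.
  grep -nE '^[[:space:]]*recursor:' "$@" /dev/null
}

# Hypothetical usage from the deployments folder:
#   find_recursors boshes/*.yml
```

Any line it prints is worth checking against the IP address of the Micro BOSH you are about to destroy.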