You know when Spock does that thing on Star Trek and he just grabs the guy by the neck in the right place and blamo, he’s out cold? It’s called the Vulcan Nerve Pinch. The right move at the right time is all it takes to go from chaos to calm.
Well, we want to present two "Killer Moves" that can help restore sanity to your bosh-lite environment.
Let’s say you’ve previously deployed Cloud Foundry locally with bosh-lite on Mac OS X (Late 2015), following our earlier blog post. What if you shut down your machine? Well, it appears that all the Warden instances are foobar…
Killer Moves
Here’s what we’ll be talking about:
./bin/add-route
bosh cloudcheck
Vagrant Up?
The very first check is just to make sure the vagrant machine is running. Get it to a running state first, then begin the next steps.
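For example, assuming your bosh-lite clone lives at ~/workspace/bosh-lite (adjust the path for your own setup):
$ cd ~/workspace/bosh-lite
$ vagrant status
$ vagrant up
If vagrant status reports anything other than "running" (for example "poweroff" after a shutdown), vagrant up will boot the machine again.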
Enable Network
Once you change directory to where you’re running bosh-lite, run the ./bin/add-route script.
Why do we need to run ./bin/add-route? To bridge the two networks together: the VirtualBox machine is the gateway at 192.168.50.4 for the Warden instances running on the 10.244.0.0/16 range.
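If you’re curious what the script does on Mac OS X, it boils down to adding a static route by hand, roughly like this (a sketch, not the script verbatim; see bin/add-route in the bosh-lite repo for the real thing):
$ sudo route add -net 10.244.0.0/16 192.168.50.4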
For example, on Mac OS X the route command reports the following information after running the script:
$ route -n get 192.168.50.4
route to: 192.168.50.4
destination: 192.168.50.4
interface: vboxnet1
flags: <UP,HOST,DONE,LLINFO,WASCLONED,IFSCOPE,IFREF>
recvpipe sendpipe ssthresh rtt,msec rttvar hopcount mtu expire
0 0 0 0 0 0 1500 -438
$ route -n get 10.244.0.0/16
route to: 10.244.0.0
destination: 10.244.0.0
mask: 255.255.0.0
gateway: 192.168.50.4
interface: vboxnet1
flags: <UP,GATEWAY,DONE,STATIC,PRCLONING>
recvpipe sendpipe ssthresh rtt,msec rttvar hopcount mtu expire
0 0 0 0 0 0 1500 0
Also, the routes do not persist between reboots, so it’s good to remember that if you’re having problems running bosh ssh to your local bosh-lite instances, re-run the ./bin/add-route script. This is crucial for the bosh cloudcheck step to work.
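So a typical post-reboot sequence looks something like this (nats_z1 is just an example job taken from the bosh vms listing below):
$ ./bin/add-route
$ bosh ssh nats_z1 0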
bosh cloudcheck
Gather Information
You might have gathered information first by running bosh vms, whose output reports the status of your Cloud Foundry deployment:
$ bosh vms
Acting as user 'admin' on 'Bosh Lite Director'
Deployment 'cf-warden'
Director task 242
Task 242 done
+---------------------------------------------------------------------------+--------------------+-----+-----------+--------------+
| VM                                                                        | State              | AZ  | VM Type   | IPs          |
+---------------------------------------------------------------------------+--------------------+-----+-----------+--------------+
| api_z1/0 (f9b7bb05-7f9b-4758-9775-9d9c817f09b2)                           | unresponsive agent | n/a | large_z1  | 10.244.0.138 |
| blobstore_z1/0 (c772d8be-52d2-4e9e-b4f6-4e25b4dd84d6)                     | unresponsive agent | n/a | medium_z1 | 10.244.0.130 |
| consul_z1/0 (62bfb613-0ef1-42ad-8ffb-96d718633484)                        | unresponsive agent | n/a | small_z1  | 10.244.0.54  |
| doppler_z1/0 (8891552e-57b4-4466-8708-1dcb979d5c3b)                       | unresponsive agent | n/a | medium_z1 | 10.244.0.146 |
| etcd_z1/0 (ec4520d6-12c8-4b9f-90a1-b434c0271e22)                          | unresponsive agent | n/a | medium_z1 | 10.244.0.42  |
| ha_proxy_z1/0 (5bfd75ea-fc4b-4be1-a5cd-e9a89bb6fa10)                      | unresponsive agent | n/a | router_z1 | 10.244.0.34  |
| hm9000_z1/0 (e117d6a2-802f-40f8-b945-a28b0d069faa)                        | unresponsive agent | n/a | medium_z1 | 10.244.0.142 |
| loggregator_trafficcontroller_z1/0 (13aebd87-fac2-489c-8b6e-68e1145fa0e9) | unresponsive agent | n/a | small_z1  | 10.244.0.150 |
| nats_z1/0 (c6f680b5-f51c-448c-b62a-d997f274c62b)                          | unresponsive agent | n/a | medium_z1 | 10.244.0.6   |
| postgres_z1/0 (5a157c5a-3a06-46e6-9350-07703dd2f68f)                      | unresponsive agent | n/a | medium_z1 | 10.244.0.30  |
| router_z1/0 (456ee602-d347-4cea-9964-8376859084fb)                        | unresponsive agent | n/a | router_z1 | 10.244.0.22  |
| runner_z1/0 (6cbed5a2-88b7-4e85-9d22-9c4de4c64e24)                        | unresponsive agent | n/a | runner_z1 | 10.244.0.26  |
| uaa_z1/0 (614eefa9-dc34-4521-a5e6-d07ee9b26f9f)                           | unresponsive agent | n/a | medium_z1 | 10.244.0.134 |
+---------------------------------------------------------------------------+--------------------+-----+-----------+--------------+
VMs total: 13
Restore Jobs
If the state of any of the Warden instances is anything other than "running", BOSH has a command called cloudcheck (with an alias cck) that will scan your deployment, find all the jobs that are not running, and prompt you for action.
bosh cck
Check Yourself
For each job in the deploy, cloudcheck prompts you for an action to take; we recommend choosing to recreate the instance.
Once you’ve selected what to do about each job, it confirms the actions and gets to work.
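If you’d rather not confirm every job interactively, cloudcheck also accepts an --auto flag that applies the default resolution to each problem it finds; check bosh help cloudcheck on your CLI version before leaning on it:
$ bosh cck --auto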
Coffee Break
Depending on the number of jobs it needs to restore, bosh cloudcheck may take a while. When I’ve restored an entire 13-instance Cloud Foundry, it’s taken at least thirty minutes.
Much of the output will also show errors as each of the CF services is restored. This is usually because of the dependencies between the jobs, and it sorts itself out once all of the instances are running again.
Confirm Integration
Once all the instances are back up and running, it’s usually good to do a test deploy to confirm that they can all connect to each other, have the correct IP addresses for one another, and so on.
bosh deploy -n
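Once that deploy completes cleanly, re-run bosh vms and confirm that every job now reports "running" instead of "unresponsive agent":
$ bosh vms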
Share Your Tips
Do you have any other tips that keep your development environment running smoothly? Feel free to share them in the comments below.