Stark & Wayne
  • by Chris Weibel

You are never gonna keep it down

Purple Rain got you down? Monit thrashing etcd? Just want to know if etcd is healthy in your Cloud Foundry deployment?

Checking Health

Start by getting the list of etcd servers in your CF deployment:

bosh vms <your deployment> | grep etcd

Adjust the following script for your etcd hosts: change the values in {} to match the job/index values of your VMs. Run it on any hm9000 server, since hm9000 already has the necessary certs if you are using self-signed certs for etcd:

for etcd in etcd-{z1-0,z1-1,z2-0}; do
  for role in self leader; do
    echo -n "${etcd} ${role}: "
    # The host below is the bare node name; append your etcd DNS suffix and
    # client port if your deployment needs them.
    curl -k -s \
      --cacert /var/vcap/jobs/hm9000/config/certs/etcd_ca.crt \
      --cert /var/vcap/jobs/hm9000/config/certs/etcd_client.crt \
      --key /var/vcap/jobs/hm9000/config/certs/etcd_client.key \
      "https://${etcd}/v2/stats/${role}" | jq .
  done
done

The script runs two curls against each etcd node: one for /self and one for /leader.

This will let you know the following:

etcd-z1-0 self: {                   
  "name": "etcd-z1-0",
  "id": "11e9f50c565d5b40",
  "state": "StateFollower",         #<< etcd_z1/0 says it is a follower
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.438615083s",
    "startTime": "2017-01-27T11:07:14.524076843Z"
etcd-z1-0 leader: {
  "message": "not current leader"   #<< etcd_z1/0 says it isn't leader

etcd-z1-1 self: {                   
  "name": "etcd-z1-1",
  "id": "795ba739b14eb9f4",
  "state": "StateFollower",         #<< etcd_z1/1 says it is a follower
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.474123185s",
    "startTime": "2017-01-27T11:07:14.526643444Z"
etcd-z1-1 leader: {         
  "message": "not current leader"   #<< etcd_z1/1 says it isn't leader

etcd-z2-0 self: {
  "name": "etcd-z2-0",
  "state": "StateLeader",           #<< etcd_z2/0 says it is leader
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.520462761s",
    "startTime": "2017-01-27T11:07:14.529530067Z"
etcd-z2-0 leader: {
  "leader": "ef0d6a8fb314ed3a",
  "followers": {
    "11e9f50c565d5b40": {           #<< etcd_z2/0 says it has a follower
      "latency": {                  #   corresponds to id of etcd_z1/0
      "counts": {
        "fail": 0,
        "success": 8111881
    "795ba739b14eb9f4": {           #<< etcd_z2/0 says it has a follower
      "latency": {                  #   corresponds to id of etcd_z1/1
      "counts": {
        "fail": 33,
        "success": 7876536

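In this 3 node cluster, etcd_z2/0 is the leader and etcd_z1/0 and etcd_z1/1 are its followers. All three nodes report the same leader id (ef0d6a8fb314ed3a), and only the leader answers /leader with a follower list, so the cluster is healthy.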

If more than one node reports that it is the leader, or one of your nodes has a blank leader field in its self output, you have a split brain or out of sync etcd cluster:

  • Split brain: 2 leaders
  • Out of sync: one or more nodes that are not the leader but don't know who the leader is
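The two failure modes above can be checked mechanically once you have the /self output of every node. Here is a minimal sketch, not part of the original runbook: `classify_cluster` is a hypothetical helper name, and it assumes that a node which has lost track of its leader reports an empty string in the leader field.

```shell
#!/usr/bin/env bash
# classify_cluster: reads the /self JSON of every etcd node on stdin and
# prints one of: healthy, split-brain, out-of-sync.
classify_cluster() {
  local input leaders blank
  input=$(cat)
  # Count the nodes that claim to be the leader.
  leaders=$(printf '%s\n' "$input" | grep -o '"state": *"StateLeader"' | wc -l | tr -d ' ')
  # Assumption: a node that does not know its leader reports "leader": "".
  blank=$(printf '%s\n' "$input" | grep -o '"leader": *""' | wc -l | tr -d ' ')
  if [ "$leaders" -gt 1 ]; then
    echo "split-brain"
  elif [ "$leaders" -eq 0 ] || [ "$blank" -gt 0 ]; then
    # No leader at all is also treated as out of sync (an assumption).
    echo "out-of-sync"
  else
    echo "healthy"
  fi
}
```

Feed it only the self responses (the leader responses also contain a "leader" key), for example by running the curl loop with the role fixed to self and piping the result into `classify_cluster`.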

You have 450+ runners and etcd runs out of file descriptors


This does happen on large deployments, since the default ulimit is 1024 on every stemcell we've looked at so far. CF v243 uses etcd release v66, which doesn't handle ulimits correctly. This looks like it may be addressed in the newer etcd releases used by newer CF versions.
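Before changing anything, it can help to confirm that etcd really is close to its limit. The sketch below is one way to do that on Linux by reading /proc; `fd_usage` is a hypothetical helper name, not something shipped with etcd or BOSH.

```shell
#!/usr/bin/env bash
# fd_usage PID: print how many file descriptors the process has open and
# the soft "Max open files" limit it is running under (Linux /proc only).
fd_usage() {
  local pid=$1 open limit
  # Each entry in /proc/PID/fd is one open descriptor.
  open=$(ls "/proc/${pid}/fd" | wc -l | tr -d ' ')
  # In /proc/PID/limits the fourth field of this row is the soft limit.
  limit=$(awk '/^Max open files/ {print $4}' "/proc/${pid}/limits")
  echo "pid ${pid}: ${open} open of ${limit} allowed"
}
```

Run it as root against the etcd process, for example `fd_usage "$(pgrep -o -x etcd)"`, assuming the process is named etcd on your VM. An open count near 1024 matches the symptom described here.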

To work around the current problem, start by stopping etcd on every etcd node:

sudo -i
monit stop etcd

Verify etcd is down (it is fine for the etcd metrics server to remain up):

ps -ef | grep etcd

Once etcd is stopped on all nodes, discard the etcd cluster db files by deleting the member directory and all of its subdirectories and files:

rm -rf /var/vcap/store/etcd/member

Modify limits.conf:

vim /etc/security/limits.conf

Add in the following:

* soft nofile 4096
* hard nofile 4096
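If you have several etcd VMs to fix, editing each file by hand gets tedious. The sketch below appends the same two lines, and skips any that are already present so it is safe to run twice. `add_nofile_limits` is a hypothetical helper name; the 4096 default comes from the values above.

```shell
#!/usr/bin/env bash
# add_nofile_limits FILE [LIMIT]: append wildcard soft/hard nofile entries
# to FILE unless an identical entry already exists.
add_nofile_limits() {
  local file=$1 limit=${2:-4096} kind
  for kind in soft hard; do
    if ! grep -Eq "^\*[[:space:]]+${kind}[[:space:]]+nofile[[:space:]]+${limit}\$" "$file"; then
      printf '* %s nofile %s\n' "$kind" "$limit" >> "$file"
    fi
  done
}
```

As root, `add_nofile_limits /etc/security/limits.conf` would apply it to the real file. Try it on a copy first.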

Modify /var/vcap/jobs/etcd/bin/etcd_ctl around line 82 to add the ulimit just before calling the etcd executable:

ulimit -n 4096  # <=== Add this just before \/ existing line below

/var/vcap/packages/etcd/etcd ${etcd_sec_flags}  
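The same edit can be scripted. This is a rough sketch that assumes the etcd invocation starts a line (optionally indented) with /var/vcap/packages/etcd/etcd, as in the snippet above. `add_ulimit_before_etcd` is a hypothetical helper name, and the line number in your copy of etcd_ctl may differ.

```shell
#!/usr/bin/env bash
# add_ulimit_before_etcd FILE [LIMIT]: insert "ulimit -n LIMIT" just before
# the first line that launches the etcd executable.
add_ulimit_before_etcd() {
  local file=$1 limit=${2:-4096} tmp
  # Do nothing if a ulimit -n line is already present.
  if grep -Eq '^[[:space:]]*ulimit -n ' "$file"; then
    return 0
  fi
  tmp=$(mktemp)
  awk -v limit="$limit" '
    !patched && /^[ \t]*\/var\/vcap\/packages\/etcd\/etcd / {
      print "ulimit -n " limit
      patched = 1
    }
    { print }
  ' "$file" > "$tmp"
  # Write back through cat so the original file keeps its mode and owner.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Keep a backup of etcd_ctl before running `add_ulimit_before_etcd /var/vcap/jobs/etcd/bin/etcd_ctl`, and check the result by eye.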

Reboot the VM. If it does not come up clean, attempt these steps:

su vcap
ulimit -n 4096
sudo monit start etcd

Log in as root and wait for it to come up clean:

tail -f /var/vcap/sys/log/etcd/*.log
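If you would rather not stare at the log, a small polling loop can do the waiting for you. This is a generic sketch: `wait_until` is a hypothetical helper that re-runs whatever command you give it until the command succeeds or the timeout (in whole seconds) runs out.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT INTERVAL COMMAND...: retry COMMAND every INTERVAL
# seconds; return 0 once it succeeds, 1 after roughly TIMEOUT seconds.
wait_until() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, wrap the /self curl from the health check in a function that exits non-zero on failure (curl's -f flag helps here) and call `wait_until 300 5 that_function` before moving on to the next node.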

Rinse and repeat with the remaining etcd nodes.

Etcd is a great tool, just needs a kick once in a while!

Lastly, this is a repost of documentation Chris McGowan created for one of our clients. It was full of goodies and needed to be shared. Everyone should have nice things!
