You are never gonna keep it down
Purple Rain got you down? Monit thrashing etcd? Just want to know if etcd is healthy in your Cloud Foundry deployment?
Checking Health
Start by getting the list of etcd servers in your CF deployment:
bosh vms <your deployment> | grep etc
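The output will look something like the following (the job names, resource pools, and IPs here are illustrative, not from a real deployment):
| etcd_z1/0 | running | medium_z1 | 10.0.16.20 |
| etcd_z1/1 | running | medium_z1 | 10.0.16.21 |
| etcd_z2/0 | running | medium_z2 | 10.0.32.20 |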
Adjust the following script for your etcd hosts: change the values in {} to match the job/index values of your VMs, and run it on any hm9000 server, since it will have the necessary certs if you are using self-signed certs for etcd:
for etcd in etcd-{z1-0,z1-1,z2-0}; do
  for role in self leader; do
    echo -n "${etcd} ${role}: "
    curl -k -s \
      --cacert /var/vcap/jobs/hm9000/config/certs/etcd_ca.crt \
      --cert /var/vcap/jobs/hm9000/config/certs/etcd_client.crt \
      --key /var/vcap/jobs/hm9000/config/certs/etcd_client.key \
      https://${etcd}.cf-etcd.service.cf.internal:4001/v2/stats/${role} | jq .
  done
  echo
done
The script runs two curls against each of the etcd nodes: one for /self and one for /leader. The output will let you know the following:
etcd-z1-0 self: {
  "name": "etcd-z1-0",
  "id": "11e9f50c565d5b40",
  "state": "StateFollower",            #<< etcd_z1/0 says it is a follower
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.438615083s",
    "startTime": "2017-01-27T11:07:14.524076843Z"
  }...
}
etcd-z1-0 leader: {
  "message": "not current leader"      #<< etcd_z1/0 says it isn't leader
}
etcd-z1-1 self: {
  "name": "etcd-z1-1",
  "id": "795ba739b14eb9f4",
  "state": "StateFollower",            #<< etcd_z1/1 says it is a follower
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.474123185s",
    "startTime": "2017-01-27T11:07:14.526643444Z"
  }...
}
etcd-z1-1 leader: {
  "message": "not current leader"      #<< etcd_z1/1 says it isn't leader
}
etcd-z2-0 self: {
  "name": "etcd-z2-0",
  "state": "StateLeader",              #<< etcd_z2/0 says it is leader
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.520462761s",
    "startTime": "2017-01-27T11:07:14.529530067Z"
  }...
}
etcd-z2-0 leader: {
  "leader": "ef0d6a8fb314ed3a",
  "followers": {
    "11e9f50c565d5b40": {              #<< etcd_z2/0 says it has a follower
      "latency": {                     #   corresponds to the id of etcd_z1/0
        ...
      },
      "counts": {
        "fail": 0,
        "success": 8111881
      }
    },
    "795ba739b14eb9f4": {              #<< etcd_z2/0 says it has a follower
      "latency": {                     #   corresponds to the id of etcd_z1/1
        ...
      },
      "counts": {
        "fail": 33,
        "success": 7876536
      }
    }
  }
}
In this 3-node cluster:
- etcd-z2-0 is the leader.
- etcd-z1-0 and etcd-z1-1 both report that they are not the leader.
- etcd-z2-0, under its leader output, shows the ids of its followers. You can compare these against the id in the self calls of etcd-z1-0 and etcd-z1-1 to make sure they match.
If you have more than one node reporting that it is the leader, or one of your nodes has a blank leader field under self, you have a split-brain or out-of-sync etcd cluster.
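A quick way to spot this is to pull just the state field from each node's /self stats and count how many report StateLeader. This is a minimal sketch built on the same curl and jq calls as the script above; the hostnames, port, and cert paths are the same assumptions as before:
# Count how many nodes think they are the leader (a healthy cluster has exactly 1).
leaders=0
for etcd in etcd-{z1-0,z1-1,z2-0}; do
  state=$(curl -k -s \
    --cacert /var/vcap/jobs/hm9000/config/certs/etcd_ca.crt \
    --cert /var/vcap/jobs/hm9000/config/certs/etcd_client.crt \
    --key /var/vcap/jobs/hm9000/config/certs/etcd_client.key \
    https://${etcd}.cf-etcd.service.cf.internal:4001/v2/stats/self | jq -r .state)
  echo "${etcd}: ${state}"
  [ "${state}" = "StateLeader" ] && leaders=$((leaders + 1))
done
echo "leaders: ${leaders}"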
For split brain (two leaders):
- bosh ssh into each etcd vm, sudo -i, and monit stop etcd
- Verify etcd is down via ps -ef | grep etcd (the etcd metrics server is fine to remain up)
- Once all nodes have etcd stopped, reset the etcd cluster db files by deleting the /var/vcap/store/etcd/member directory and all sub-directories and files
- monit start etcd on the first node and wait for it to come up clean with tail -f /var/vcap/sys/log/etcd/*.log, then start the remaining nodes one at a time (see the per-node sketch after this list)
- Re-run the script to validate
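The per-node reset boils down to a handful of commands. A sketch of what to run on each etcd vm (as root, one vm at a time, and only once you have confirmed you really are in one of the situations above):
monit stop etcd                        # stop the etcd job under monit
ps -ef | grep etcd                     # confirm the etcd process itself is gone
rm -rf /var/vcap/store/etcd/member     # wipe the local cluster db files
monit start etcd                       # start etcd again
tail -f /var/vcap/sys/log/etcd/*.log   # watch until it comes up clean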
For one or more nodes that are not the leader but don't know who the leader is:
- bosh ssh into the leaderless vm, sudo -i, and monit stop etcd
- Verify etcd is down via ps -ef | grep etcd (the etcd metrics server is fine to remain up)
- Delete the /var/vcap/store/etcd/member directory and all sub-directories and files
- monit start etcd and wait for it to come up clean with tail -f /var/vcap/sys/log/etcd/*.log (the same per-node commands as in the sketch above)
- Re-run the script to validate
You have 450+ runners and etcd runs out of file descriptors
This does happen on large deployments, since the default ulimit is 1024 on every stemcell we've looked at so far. CF v243 uses etcd-release v66, which doesn't handle ulimits correctly. This looks like it may be addressed in the newer releases of etcd used in newer CF versions.
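To confirm you are actually hitting this, you can check the open-files limit of the running etcd process on one of the etcd vms. A minimal sketch, assuming the process is the one started from /var/vcap/packages/etcd/etcd (as in the etcd_ctl snippet further down):
# Find the etcd process and inspect its file-descriptor limit and current usage.
pid=$(pgrep -f /var/vcap/packages/etcd/etcd | head -1)
grep 'open files' /proc/${pid}/limits   # shows the soft/hard nofile limits
ls /proc/${pid}/fd | wc -l              # number of file descriptors currently open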
To work around the current problem:
bosh ssh
into each etcd vm then:
sudo -i
monit stop etcd
Verify etcd is down (the etcd metrics server is fine to remain up):
ps -ef | grep etcd
Once all nodes have etcd stopped, dump the etcd cluster db files by deleting the member directory and all sub-directories and files:
rm -rf /var/vcap/store/etcd/member
Modify limits.conf:
vim /etc/security/limits.conf
Add in the following:
* soft nofile 4096
* hard nofile 4096
Modify /var/vcap/jobs/etcd/bin/etcd_ctl around line 82 to add the ulimit just before calling the etcd executable:
...
ulimit -n 4096 # <=== Add this line just before the existing etcd line below
/var/vcap/packages/etcd/etcd ${etcd_sec_flags}
...
Reboot the vm. If it does not come up clean, attempt these steps:
su vcap
ulimit -n 4096
sudo monit start etcd
Log in as root and wait for it to come up clean:
tail -f /var/vcap/sys/log/etcd/*.log
Rinse and repeat with remaining etcd nodes
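Once each node is back up, it's worth confirming the new limit actually took effect for the running process. The same /proc check as above works here; a sketch:
pid=$(pgrep -f /var/vcap/packages/etcd/etcd | head -1)
grep 'open files' /proc/${pid}/limits   # should now report 4096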
Etcd is a great tool, just needs a kick once in a while!
Lastly, this is a repost of documentation Chris McGowan created for one of our clients. It was full of goodies and needed to be shared. Everyone should have nice things!