Cloud Foundry Loggregator likes to break on me. This is going to my collection of how I (try to) fix loggregator.
Problem 1
Background
I’ve just updated CF from v190 -> v191
CF Version: 191
Date of Problem: 10/30/2014
➜  ~  cf logs galaxy
FAILED
Error dialing loggregator server: websocket: bad handshake.
Please ask your Cloud Foundry Operator to check the platform configuration (loggregator endpoint is wss://loggregator.10.244.0.34.xip.io:4443).
Bad handshake?! Wtf? Let’s see what CF_TRACE gives me.
WEBSOCKET REQUEST: [2014-10-30T17:27:48Z]
GET /tail/?app=1fd65ad8-4324-4fcd-9b2e-e8b0e0c6d172 HTTP/1.1
Host: wss://loggregator.10.244.0.34.xip.io:4443
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: [HIDDEN]
Authorization: [PRIVATE DATA HIDDEN]
Origin: http://localhost
WEBSOCKET RESPONSE: [2014-10-30T17:27:48Z]
HTTP/1.1 404 Not Found
Date: Thu, 30 Oct 2014 17:27:07 GMT
Content-Length: 81
Content-Type: text/plain; charset=utf-8
X-Cf-Requestid: 0fe6dece-f166-4329-4074-dce4d793c116
X-Cf-Routererror: unknown_route
FAILED
Error dialing loggregator server: websocket: bad handshake.
Please ask your Cloud Foundry Operator to check the platform configuration (loggregator endpoint is wss://loggregator.10.244.0.34.xip.io:4443).
FAILED
My app is not found 🙁
Investigation
List of jobs upgraded
 Started updating job ha_proxy_z1 > ha_proxy_z1/0
  Started updating job router_z1 > router_z1/0
  Started updating job api_z1 > api_z1/0
  Started updating job nats_z1 > nats_z1/0
  Started updating job api_worker_z1 > api_worker_z1/0
  Started updating job runner_z1
  Started updating job runner_z1 > runner_z1/0
  Started updating job clock_global > clock_global/0
     Done updating job nats_z1 > nats_z1/0 (00:00:23)
     Done updating job api_worker_z1 > api_worker_z1/0 (00:00:29)
     Done updating job ha_proxy_z1 > ha_proxy_z1/0 (00:01:21)
     Done updating job router_z1 > router_z1/0 (00:01:25)
     Done updating job clock_global > clock_global/0 (00:01:31)
     Done updating job api_z1 > api_z1/0 (00:01:57)
     Done updating job runner_z1 > runner_z1/0 (00:03:53)
  Started updating job runner_z1 > runner_z1/1. Done (00:03:18)
  Started updating job runner_z1 > runner_z1/2. Done (00:02:53)
     Done updating job runner_z1 (00:10:04)
Nothing about loggregator…
I got lucky this time and looked at right machine first time. Loggregator Traffic controller logs seem to be culprit.
{"timestamp":1414694116.569226742,"process_id":1497,"source":"loggregator trafficcontroller","log_level":"error","message":"Publishing router.register failed: nats: Connection Closed","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/github.com/cloudfoundry/loggregatorlib/cfcomponent/registrars/routerregistrar/router_registrar.go","line":118,"method":"github.com/cloudfoundry/loggregatorlib/cfcomponent/registrars/routerregistrar.func·004"}
Solution
# on loggregator traffic controller box
monit stop all
monit start all
So what happen here, it seem that nats updated and loggregator traffic controller was still connected. It could no longer talk to nats to register to router. A simple restart of traffic controller fix it.