What can you do when your app times out connecting to the log server in your CF deployment?
The following error occurred when I pushed my app to CF in AWS.
timeout connecting to log server, no log will be shown
Starting app cf-env in org codex / space cf-app-testing as admin...
FAILED
Error restarting application: StagerError
To get more information I ran CF_TRACE=true cf push, and got the following request hanging there for what felt like forever.
WEBSOCKET REQUEST: [2016-08-17T19:45:38Z]
GET /apps/e189be2e-770f-4d1c-94e2-d2168f2d292d/stream HTTP/1.1
Host: wss://doppler.system.staging.xiujiaogao.com:4443
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: [HIDDEN]
Origin: http://localhost
Authorization: [PRIVATE DATA HIDDEN]
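Before digging into the platform, a quick way to check whether that doppler endpoint is reachable at all is to open a connection to it directly. The hostname below is simply the one from the trace above; a hang or timeout here points at the network rather than at CF itself.
# Probe the loggregator websocket endpoint on port 4443.
# A hang or timeout here suggests a firewall or security-group problem.
openssl s_client -connect doppler.system.staging.xiujiaogao.com:4443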
Since the request to the doppler server was the one that hung, I ran bosh vms
to check whether the doppler VMs were running. I next logged into the doppler server and ran monit summary
to check whether all of its processes were running.
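The commands looked roughly like this; I was on BOSH CLI v1, and the doppler job name (doppler_z1 here) is an assumption that depends on your deployment manifest.
# List the VMs in the deployment and confirm the doppler VMs report "running".
bosh vms
# SSH into a doppler VM (job name and index depend on your manifest)
# and check its monit-managed processes as root.
bosh ssh doppler_z1 0
sudo monit summary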
The output from running monit summary is as follows:
Process 'doppler' running
Process 'syslog_drain_binder' running
Process 'metron_agent' running
Process 'toolbelt' running
System 'system_localhost' running
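If you want to be sure those processes are not just running but actually listening, you can also check their open ports on the VM; the exact port numbers depend on your loggregator configuration, so none are assumed here.
# On the doppler VM: show TCP/UDP listeners owned by the loggregator processes.
sudo netstat -lnptu | grep -E 'doppler|metron'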
Everything looked good, so I then dug through the logs on the doppler server and saw the following panic in the /var/vcap/sys/log/doppler/doppler.stderr.log file.
panic: sync cluster failed
goroutine 1 [running]:
panic(0xb0d3c0, 0xc8201460f0)
/var/vcap/data/packages/golang1.6/85a489b7c0c2584aa9e0a6dd83666db31c6fc8e8.1-0ebd71019c0365d2608a6ec83f61e3bbee68493c/src/runtime/panic.go:464 +0x3e6
main.NewStoreAdapter(0xc82004bb00, 0x3, 0x4, 0xa, 0x0, 0x0)
/var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:58 +0x185
main.main()
/var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:92 +0x4f9
For some reason, the log cluster could not be synchronized. That panic comes from doppler's store adapter, which talks to the etcd cluster loggregator uses, so etcd was the next thing to look at. As recommended by one of my super coworkers, Geoff, I then tried the HM9000 disaster-recovery method, whose summarized steps are (a bosh ssh sketch of the same procedure follows the list):
monit stop etcd (on all nodes in etcd cluster)
rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)
monit start etcd (one-by-one on each node in etcd cluster)
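A sketch of that procedure wrapped in bosh ssh, assuming BOSH CLI v1 and an etcd job called etcd_z1; your job names, indexes, and node count will differ.
# Steps 1 and 2, on EVERY node in the etcd cluster:
bosh ssh etcd_z1 0
sudo monit stop etcd
sudo rm -rf /var/vcap/store/etcd/*
exit
# (repeat for the remaining etcd nodes, e.g. etcd_z1 1, etcd_z2 0)

# Step 3, one node at a time: start etcd and wait for monit to
# report it "running" before moving on to the next node.
bosh ssh etcd_z1 0
sudo monit start etcd
sudo monit summary
exit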
It did not solve my problem this time, but it is a good method to know since it may come to the rescue when you are dealing with other logging problems.
Since everything on the VM was running and listening properly and the HM9000 reset did not fix the problem, I went back to check the Security Group settings and Route Tables in my AWS console; both are listed in the left column of the VPC dashboard. I found that port 4443, which the websocket connections use, was not allowed in the Inbound Rules! So I allowed inbound traffic on port 4443 in my Security Group settings. As soon as I did this, cf push worked and the app is now running.
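For reference, the equivalent change with the AWS CLI looks roughly like this; the security group ID and the CIDR are placeholders you would replace with the group attached to your loggregator/doppler endpoint and the source range you actually want to allow.
# Allow inbound TCP 4443 (the loggregator websocket port) on the
# security group in front of the doppler endpoint.
# sg-0123456789abcdef0 and 0.0.0.0/0 are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 4443 \
  --cidr 0.0.0.0/0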