Aug 23, 2016 What You Should Do When Your App Cannot Connect to Log Servers in CF
What can you do when your app times out connecting to the log server in your CF deployment?
The following error occurred when I pushed my app to CF on AWS.
timeout connecting to log server, no log will be shown
Starting app cf-env in org codex / space cf-app-testing as admin...
FAILED
Error restarting application: StagerError
To get more information, I ran CF_TRACE=true cf push and got the following message, which hung there for what felt like forever.
WEBSOCKET REQUEST: [2016-08-17T19:45:38Z]
GET /apps/e189be2e-770f-4d1c-94e2-d2168f2d292d/stream HTTP/1.1
Host: wss://doppler.system.staging.xiujiaogao.com:4443
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: [HIDDEN]
Origin: http://localhost
Authorization: [PRIVATE DATA HIDDEN]
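The trace shows the CLI hanging while opening a websocket to doppler on port 4443. A quick way to tell whether a hang like this is network-level is a raw TCP connect. Here is a minimal sketch in bash (the hostname is taken from the trace above; substitute your own doppler endpoint):

```shell
# Print "open" if a TCP connection to host ($1) and port ($2) succeeds
# within 5 seconds, "closed" otherwise. Uses bash's /dev/tcp redirection.
check_port() {
  if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

# Hostname from the trace output; replace with your deployment's endpoint.
check_port doppler.system.staging.xiujiaogao.com 4443
```

If this prints "closed" from the machine running cf push, the problem is between you and the doppler endpoint (firewall, security group, routing), not inside the platform.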
Since the request to the doppler server was the one that failed, I ran bosh vms to check whether the doppler VMs were running. I then logged into the doppler server and ran monit summary to check whether all the processes were running.
The output from running monit summary is as follows:
Process 'doppler'               running
Process 'syslog_drain_binder'   running
Process 'metron_agent'          running
Process 'toolbelt'              running
System 'system_localhost'       running
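Rather than eyeballing that output, a small shell helper can flag anything monit does not report as running. This is a sketch of my own (the unhealthy_processes name is not a monit feature); it reads monit summary output on stdin:

```shell
# Print the name of every monit-managed process whose last status field
# is not "running" (e.g. "not monitored", "initializing").
unhealthy_processes() {
  awk '$1 == "Process" && $NF != "running" { print $2 }' | tr -d "'"
}

# Usage, guarded so the snippet is harmless on machines without monit:
command -v monit >/dev/null 2>&1 && monit summary | unhealthy_processes || true
```

An empty result means every process is healthy, which is exactly what I saw here.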
Everything looked good, so I then dug through the logs on the doppler server, where I saw the following messages:
panic: sync cluster failed

goroutine 1 [running]:
panic(0xb0d3c0, 0xc8201460f0)
    /var/vcap/data/packages/golang1.6/85a489b7c0c2584aa9e0a6dd83666db31c6fc8e8.1-0ebd71019c0365d2608a6ec83f61e3bbee68493c/src/runtime/panic.go:464 +0x3e6
main.NewStoreAdapter(0xc82004bb00, 0x3, 0x4, 0xa, 0x0, 0x0)
    /var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:58 +0x185
main.main()
    /var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:92 +0x4f9
For some reason, the log cluster could not be synchronized. As recommended by one of my super coworkers, Geoff, I then tried the HM9000 disaster-recovery method, whose summarized steps are:
1. monit stop etcd (on all nodes in the etcd cluster)
2. rm -rf /var/vcap/store/etcd/* (on all nodes in the etcd cluster)
3. monit start etcd (one by one on each node in the etcd cluster)
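Scripted with bosh ssh, those steps might look like the sketch below. The instance names in ETCD_NODES are placeholders; take the real ones from bosh vms. Because step 2 wipes the etcd store, the function supports a DRY_RUN mode that prints each command instead of running it, so you can preview before executing for real:

```shell
# Hypothetical etcd instance names; replace with yours from `bosh vms`.
ETCD_NODES="etcd_z1/0 etcd_z1/1 etcd_z2/0"

reset_etcd_cluster() {
  # With DRY_RUN set (e.g. DRY_RUN=1), print each command instead of running it.
  local run=${DRY_RUN:+echo}
  local node
  # 1. Stop etcd on every node so nothing keeps writing to the store.
  for node in $ETCD_NODES; do
    $run bosh ssh "$node" "sudo monit stop etcd"
  done
  # 2. Wipe the on-disk store on every node.
  for node in $ETCD_NODES; do
    $run bosh ssh "$node" "sudo rm -rf /var/vcap/store/etcd/*"
  done
  # 3. Restart one node at a time so the cluster can re-form cleanly.
  for node in $ETCD_NODES; do
    $run bosh ssh "$node" "sudo monit start etcd"
  done
}
```

Starting the nodes one by one in the last loop matters: it gives the cluster a chance to elect a leader and re-form instead of every node racing to bootstrap at once.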
It did not solve my problem this time, but it is a good method to know, since it may come to the rescue when you deal with some other logging problem.
Since everything was running and listening properly, and the HM9000 reset did not fix the problem, I went back to check my Security Group settings and Routing Tables in the AWS Console (both are listed in the left column of the VPC dashboard). I found out that port 4443, used for the websocket connections, was not allowed in the Inbound Rules! So I enabled port 4443 for inbound traffic in my Security Group settings. As soon as I did this, cf push succeeded and the app was up and running.
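The same fix can be applied from the AWS CLI instead of the console. This is a sketch: the security group id below is a placeholder (look yours up with aws ec2 describe-security-groups), and you will likely want a narrower CIDR than 0.0.0.0/0 for a real deployment:

```shell
# Allow inbound TCP on port 4443 (loggregator websocket traffic) for a
# security group. With DRY_RUN set, print the command instead of running it.
open_loggregator_port() {
  local run=${DRY_RUN:+echo}
  $run aws ec2 authorize-security-group-ingress \
    --group-id "${1:-sg-0123456789abcdef0}" \
    --protocol tcp --port 4443 --cidr 0.0.0.0/0
}

# Preview first, then run for real once the group id and CIDR look right:
# DRY_RUN=1 open_loggregator_port sg-yourgroupid
```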