What can you do when your app times out connecting to the log server in your CF deployment?
The following error occurred when I pushed my app to CF in AWS.
timeout connecting to log server, no log will be shown
Starting app cf-env in org codex / space cf-app-testing as admin...
FAILED
Error restarting application: StagerError
To get more information I ran CF_TRACE=true cf push, and got the following request hanging there for what felt like forever.
WEBSOCKET REQUEST: [2016-08-17T19:45:38Z]
GET /apps/e189be2e-770f-4d1c-94e2-d2168f2d292d/stream HTTP/1.1
Host: wss://doppler.system.staging.xiujiaogao.com:4443
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: [HIDDEN]
Origin: http://localhost
Authorization: [PRIVATE DATA HIDDEN]
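Before digging into the platform, a quick way to check whether that doppler endpoint is reachable at all is to open a connection to it directly. The hostname below is simply the one from the trace above; a hang or timeout here points at the network rather than at CF itself.
# Probe the loggregator websocket endpoint on port 4443.
# A hang or timeout here suggests a firewall or security-group problem.
openssl s_client -connect doppler.system.staging.xiujiaogao.com:4443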
Since the request to the doppler server was the one that hung, I ran bosh vms
to check whether the doppler VMs were running. I next logged into the doppler server and ran monit summary
to check whether all of its processes were running.
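The commands looked roughly like this; I was on BOSH CLI v1, and the doppler job name (doppler_z1 here) is an assumption that depends on your deployment manifest.
# List the VMs in the deployment and confirm the doppler VMs report "running".
bosh vms
# SSH into a doppler VM (job name and index depend on your manifest)
# and check its monit-managed processes as root.
bosh ssh doppler_z1 0
sudo monit summary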
The output from running monit summary is as follows:
Process 'doppler' running
Process 'syslog_drain_binder' running
Process 'metron_agent' running
Process 'toolbelt' running
System 'system_localhost' running
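If you want to be sure those processes are not just running but actually listening, you can also check their open ports on the VM; the exact port numbers depend on your loggregator configuration, so none are assumed here.
# On the doppler VM: show TCP/UDP listeners owned by the loggregator processes.
sudo netstat -lnptu | grep -E 'doppler|metron'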
Everything looked good, so I then dug through the logs on the doppler server and saw the following panic in the /var/vcap/sys/log/doppler/doppler.stderr.log file.
panic: sync cluster failed
goroutine 1 [running]:
panic(0xb0d3c0, 0xc8201460f0)
/var/vcap/data/packages/golang1.6/85a489b7c0c2584aa9e0a6dd83666db31c6fc8e8.1-0ebd71019c0365d2608a6ec83f61e3bbee68493c/src/runtime/panic.go:464 +0x3e6
main.NewStoreAdapter(0xc82004bb00, 0x3, 0x4, 0xa, 0x0, 0x0)
/var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:58 +0x185
main.main()
/var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:92 +0x4f9
For some reason, the log cluster could not be synchronized. That panic comes from doppler's store adapter, which talks to the etcd cluster loggregator uses, so etcd was the next thing to look at. As recommended by one of my super coworkers, Geoff, I then tried the HM9000 disaster-recovery method, whose summarized steps are (a bosh ssh sketch of the same procedure follows the list):
monit stop etcd (on all nodes in etcd cluster)
rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)
monit start etcd (one-by-one on each node in etcd cluster)
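A sketch of that procedure wrapped in bosh ssh, assuming BOSH CLI v1 and an etcd job called etcd_z1; your job names, indexes, and node count will differ.
# Steps 1 and 2, on EVERY node in the etcd cluster:
bosh ssh etcd_z1 0
sudo monit stop etcd
sudo rm -rf /var/vcap/store/etcd/*
exit
# (repeat for the remaining etcd nodes, e.g. etcd_z1 1, etcd_z2 0)

# Step 3, one node at a time: start etcd and wait for monit to
# report it "running" before moving on to the next node.
bosh ssh etcd_z1 0
sudo monit start etcd
sudo monit summary
exit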
It did not solve my problem this time, but it is a good method to know since it may come to the rescue when you are dealing with other logging problems.
Since everything on the VM was running and listening properly and the HM9000 reset did not fix the problem, I went back to check the Security Group settings and Route Tables in my AWS console; both are listed in the left column of the VPC dashboard. I found that port 4443, which the websocket connections use, was not allowed in the Inbound Rules! So I allowed inbound traffic on port 4443 in my Security Group settings. As soon as I did this, cf push worked and the app is now running.
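For reference, the equivalent change with the AWS CLI looks roughly like this; the security group ID and the CIDR are placeholders you would replace with the group attached to your loggregator/doppler endpoint and the source range you actually want to allow.
# Allow inbound TCP 4443 (the loggregator websocket port) on the
# security group in front of the doppler endpoint.
# sg-0123456789abcdef0 and 0.0.0.0/0 are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 4443 \
  --cidr 0.0.0.0/0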