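What Should You Do When Your App Times Out Connecting to the Cloud Foundry (CF) Log Server?
When I pushed my app to a CF deployment running on AWS, I got the following error: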
timeout connecting to log server, no log will be shown
Starting app cf-env in org codex / space cf-app-testing as admin...
FAILED
Error restarting application: StagerError
To get a more detailed error log, I ran CF_TRACE=true cf push and saw the output hang indefinitely at the following request:
WEBSOCKET REQUEST: [2016-08-17T19:45:38Z]
GET /apps/e189be2e-770f-4d1c-94e2-d2168f2d292d/stream HTTP/1.1
Host: wss://doppler.system.staging.xiujiaogao.com:4443
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: [HIDDEN]
Origin: http://localhost
Authorization: [PRIVATE DATA HIDDEN]
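Because the error occurred while sending a request to the doppler server, I ran bosh vms to check whether all doppler servers were up. I then logged in to a doppler server remotely and ran monit summary to check whether all jobs were running normally.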
The output of monit summary was:
Process 'doppler' running
Process 'syslog_drain_binder' running
Process 'metron_agent' running
Process 'toolbelt' running
System 'system_localhost' running
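With more jobs on a VM, scanning this output by eye gets error-prone. The following is a minimal sketch of a filter that prints only the entries whose status is not "running". The function name not_running is hypothetical, and it assumes the output has the "Process 'name' status" shape shown above.

```shell
#!/usr/bin/env bash
# Print every monit entry whose status is not exactly "running".
# Reads `monit summary` output on stdin; prints nothing when all is healthy.
# Assumption: lines look like  Process 'name' status  as in the output above.
not_running() {
  awk '
    /^(Process|System) / {
      name = $2
      gsub(/\047/, "", name)                  # drop the quotes around the name
      status = $0
      sub(/^[A-Za-z]+ +[^ ]+ +/, "", status)  # keep only the status text
      sub(/[ \t]+$/, "", status)              # trim trailing whitespace
      if (status != "running") print name ": " status
    }'
}

# Usage on the doppler VM (not run here):
#   monit summary | not_running
```

An empty result means monit considers every job healthy, which, as in this case, does not rule out errors inside the job's own logs.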
Everything looked fine, so I went through the log files. In /var/vcap/sys/log/doppler/doppler.stderr.log I found the following error:
panic: sync cluster failed
goroutine 1 [running]:
panic(0xb0d3c0, 0xc8201460f0)
/var/vcap/data/packages/golang1.6/85a489b7c0c2584aa9e0a6dd83666db31c6fc8e8.1-0ebd71019c0365d2608a6ec83f61e3bbee68493c/src/runtime/panic.go:464 +0x3e6
main.NewStoreAdapter(0xc82004bb00, 0x3, 0x4, 0xa, 0x0, 0x0)
/var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:58 +0x185
main.main()
/var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:92 +0x4f9
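A panic like this can be located without opening each file by hand. Here is a small sketch, assuming the logs follow the /var/vcap/sys/log/<job>/<name>.stderr.log layout seen above; the function name find_panics is hypothetical.

```shell
#!/usr/bin/env bash
# Print every line starting with "panic:" from the *.stderr.log files one
# directory below the log root, prefixed with file name and line number.
# Returns non-zero when no panic line is found.
find_panics() {
  local log_root="${1:-/var/vcap/sys/log}"
  grep -Hn '^panic:' "$log_root"/*/*.stderr.log 2>/dev/null
}

# Usage on a BOSH-managed VM (not run here):
#   find_panics
```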
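For some reason, the log server cluster could not sync. My excellent colleague Geoff suggested I try the HM-9000 disaster recovery procedure below, which boils down to these steps: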
monit stop etcd (on all nodes in etcd cluster)
rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)
monit start etcd (one-by-one on each node in etcd cluster)
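The order matters here: etcd has to be stopped on every node before any data is wiped, and the nodes are started again one at a time. As a rough sketch, the three steps can be wrapped in a per-node helper that defaults to a dry run, so the destructive rm is never executed by accident. The function name etcd_reset_phase and the DRY_RUN variable are hypothetical; the commands are the ones listed above.

```shell
#!/usr/bin/env bash
# Run one phase of the etcd reset on the current node.
# Defaults to a dry run that only prints the command; set DRY_RUN=0 to execute.
etcd_reset_phase() {
  local cmd
  case "$1" in
    stop)  cmd='monit stop etcd' ;;
    wipe)  cmd='rm -rf /var/vcap/store/etcd/*' ;;
    start) cmd='monit start etcd' ;;
    *)     echo "usage: etcd_reset_phase stop|wipe|start" >&2; return 2 ;;
  esac
  if [ "${DRY_RUN:-1}" = "0" ]; then
    sh -c "$cmd"                 # really run it (needs root on the etcd VM)
  else
    echo "would run: $cmd"       # dry run: show the command only
  fi
}

# Intended sequence (not run here):
#   1. on every etcd node:        DRY_RUN=0 etcd_reset_phase stop
#   2. on every etcd node:        DRY_RUN=0 etcd_reset_phase wipe
#   3. one node after another:    DRY_RUN=0 etcd_reset_phase start
```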
Unfortunately, this did not fix my problem. I still think it is worth sharing, because it may well resolve other, similar logging issues.
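Since everything appeared to be running and resetting HM9000 had not helped, I decided to check my Security Group settings and route tables. I logged in to the AWS Console; both are listed in the left-hand navigation of the VPC service. I found that in the Security Group associated with the log servers, port 4443, which the WebSocket connection uses, was blocked. Once I allowed inbound traffic on port 4443, my app pushed successfully!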
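A blocked port like this can be confirmed from the machine running cf push before digging into the servers. Below is a minimal sketch that relies on bash's /dev/tcp redirection and the coreutils timeout command; the function name check_port is hypothetical, and the security group ID and CIDR in the trailing comment are placeholders, not values from this deployment.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# /dev/tcp is a bash feature (not POSIX sh); `timeout` comes from coreutils.
check_port() {
  local host="$1" port="$2" secs="${3:-5}"
  # Inside the child shell, $0 is the host and $1 is the port.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (not run here):
#   check_port doppler.system.staging.xiujiaogao.com 4443 && echo open || echo blocked
#
# Opening the port with the AWS CLI instead of the console
# (placeholder group ID and CIDR, adjust both to your setup):
#   aws ec2 authorize-security-group-ingress \
#     --group-id sg-0123456789abcdef0 --protocol tcp --port 4443 --cidr 203.0.113.0/24
```

A timeout rather than an immediate refusal usually points to a firewall or security group silently dropping the traffic, which matches the hang seen in the trace output.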