Chef Push Client: No messages being received on command port
I am working on a very large project around CHEF and CHEF Automate. As part of this project, I built a full lab in AWS to mirror my client’s environment. Yesterday, I ran into an interesting issue and decided to share what I found. The solution is a highly-available CHEF server environment with a single CHEF Automate server and multiple build nodes. I noticed that my build nodes failed to register with the CHEF push-jobs-server, and it was a real mystery. I have never had an issue with registration in a standalone environment, so I wondered if it had something to do with the load-balanced front-end CHEF servers. The behavior I observed pointed to a communication problem:
- Could not start phase run. No build nodes available – The CHEF Automate pipeline reported that no build nodes could be found.
- knife node status is empty – Since this is a new environment, I have only bootstrapped my build nodes. I use the knife-pushy gem to run commands against opscode-push-jobs-server, and this command should have reported my systems. The fact that I didn’t get a 404 message was a good sign, as it told me that push-jobs-server was registered and working.
- ERROR: [buildnode01.exospheredata.lab] No messages being received on command port in 5s. Possible encryption problem? – The push-client logs on the build server were full of messages about failed communication on ports 10000 (heartbeat) and 10002 (command). I checked and verified that my firewalls were clean. Since this is AWS, I double-checked my Security Group to confirm that the LAN allowed all protocols.
- Restarting the push-client had no effect and the logs didn’t update on the push-jobs-server – I decided to tail the push-jobs-server log while I restarted the client. Nothing new appeared in the log, although I did see an earlier entry showing that the server was listening on port 10000.
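The firewall and Security Group checks in the list above can be approximated from the build node with a plain TCP connection test. The following is a minimal sketch, not part of the original troubleshooting: the function name `port_reachable` is hypothetical, and it only tells you whether a TCP connection can be opened. It says nothing about ZeroMQ or the encryption that the push client layers on top.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it does not speak the
    push-jobs protocol, it only checks basic reachability.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers failed name resolution (socket.gaierror),
        # refused connections, and timeouts.
        return False
```

Calling `port_reachable("cheffrontend01", 10000)` and `port_reachable("cheffrontend01", 10002)` from a build node would cover the two ports mentioned above. Note that a `False` result does not distinguish a blocked port from a hostname that fails to resolve.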
It was at this point that I decided that something was wrong with the client. In an AWS environment, my first reaction is to redeploy the server, but in this case it made no difference. This told me that something was going on in the environment itself. I dove deeper into the client logs and decided to remove the current log, restart the service, and tail the client log. This seemed like the best way to figure out what was happening and why three systems couldn’t communicate with the push-jobs-server.
cat /var/log/push-jobs-client/current
INFO: [buildnode01.exospheredata.lab] Starting client ...
INFO: [buildnode01.exospheredata.lab] Retrieving configuration from https://chefserver.exospheredata.lab/organizations/demolab//pushy/config/buildnode01.exospheredata.lab: ...
INFO: [buildnode01.exospheredata.lab] Resolved cheffrontend01 to '' and -1 others
INFO: [buildnode01.exospheredata.lab] Starting ZMQ version {:major=>4, :minor=>0, :patch=>5}
INFO: [buildnode01.exospheredata.lab] Listening for server heartbeat at tcp://cheffrontend01:10000
INFO: [buildnode01.exospheredata.lab] Connecting to command channel at tcp://cheffrontend01:10002
INFO: [buildnode01.exospheredata.lab] Considering server online, and starting to heartbeat
INFO: [buildnode01.exospheredata.lab] Started client.
INFO: [buildnode01.exospheredata.lab] Starting heartbeat / offline detection thread on interval 10.0 ...
INFO: [buildnode01.exospheredata.lab] Starting reconfigure thread. Will reconfigure / reload keys after 3600 seconds, less up to splay 0.1.
INFO: [buildnode01.exospheredata.lab] Starting command / server heartbeat receive thread ...
ERROR: [buildnode01.exospheredata.lab] No messages being received on command port in 4s. Possible encryption problem?
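When a client log like this gets long, it can help to pull out just the name-resolution results and the endpoints the client is trying to reach. Here is a rough sketch, assuming the log lines look like the ones quoted above; the regular expressions and the function name `summarize` are my own illustration, and other push-client versions may word these messages differently.

```python
import re

# Patterns modelled on the log lines quoted above (an assumption, not a
# documented log format).
RESOLVED = re.compile(r"Resolved (\S+) to '([^']*)'")
ENDPOINT = re.compile(r"tcp://([^:\s]+):(\d+)")


def summarize(log_text):
    """Return (unresolved_hosts, endpoints) found in push-client log text.

    unresolved_hosts: hosts whose "Resolved ... to ''" entry has an empty address.
    endpoints: (host, port) pairs taken from tcp:// URLs in the log.
    """
    unresolved = sorted(
        {host for host, addr in RESOLVED.findall(log_text) if not addr}
    )
    endpoints = sorted(
        {(host, int(port)) for host, port in ENDPOINT.findall(log_text)}
    )
    return unresolved, endpoints
```

Feeding it the contents of /var/log/push-jobs-client/current gives a short list of hosts that resolved to an empty address, alongside the heartbeat and command endpoints the client is using.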
Wait a minute! I just noticed that the push-client is trying to connect to a single node in my HA CHEF Server environment. I have an HAProxy server in front of these servers, and the hostname chefserver.exospheredata.lab resolves to the IP of that proxy. This is the only host in that group with a DNS entry and look, the push-client is trying… and failing… to resolve one of the front-end nodes (note the empty address in the "Resolved cheffrontend01 to ''" line). Now, I know that SSL communication with CHEF is very specific about having the right hostname for the cert and the access URL. It seems that in an HA environment, the front-end nodes themselves must be resolvable. I added a new A-record in DNS for each of the front-end servers and then restarted the clients. Immediately, the environment was working and the nodes were registered in push-jobs. A quick refresh of my Automate job later, and I was off to the races.
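A quick way to confirm this kind of problem before touching DNS is to check which hostnames actually resolve from the build node. The sketch below is one possible approach and not something taken from the CHEF tooling: `resolve` and `unresolvable` are hypothetical helper names, and the `resolver` parameter exists only so that the lookup can be swapped out for testing.

```python
import socket


def resolve(host, resolver=socket.getaddrinfo):
    """Return the sorted unique addresses for host, or [] if it does not resolve."""
    try:
        infos = resolver(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


def unresolvable(hosts, resolver=socket.getaddrinfo):
    """Return the subset of hosts that fail to resolve."""
    return [host for host in hosts if not resolve(host, resolver)]
```

In my lab, running `unresolvable(["chefserver.exospheredata.lab", "cheffrontend01"])` on a build node before adding the A-records should have flagged only cheffrontend01. The list of front-end hostnames is something you have to supply yourself.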
TL;DR – When using Chef Push Jobs, each front-end node in a CHEF HA cluster must have a hostname that the push clients can resolve.