This post describes how we’ve managed to eliminate a part of the toil that consumes our time as the daily operations team of SMOK, in the best spirit of Google’s SRE approach to the problem.
For those who don’t already know – SMOK, which stands for System Monitoringu Odległych Kotłowni (trans. System for Monitoring Remote Boiler-rooms; smok also happens to be the Polish word for a dragon), is a system that lets you connect your boiler to the Internet, change its temperature and alert the service company when it breaks down.
Introduction to the problem
Our devices can be online at two levels. The first is the TCP connection to the server. The second is the logical level of online, reached when the TCP-connected device has been configured to operate properly (it has had the proper protocols loaded and its baud rate adjusted) and, in some cases, has reported back to another service to confirm that it is indeed online.
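The two levels above can be sketched as a small state model. All names here are hypothetical illustrations, not SMOK’s actual API:

```python
from enum import Enum, auto

class OnlineLevel(Enum):
    """The two levels of 'online' a SMOK device can reach (names hypothetical)."""
    OFFLINE = auto()
    TCP_CONNECTED = auto()      # TCP connection to the server established
    LOGICALLY_ONLINE = auto()   # protocols loaded, baud rate adjusted,
                                # and (where required) confirmed to another service

def online_level(has_tcp: bool, configured: bool) -> OnlineLevel:
    """Classify a device given its connection and configuration state."""
    if not has_tcp:
        return OnlineLevel.OFFLINE
    return OnlineLevel.LOGICALLY_ONLINE if configured else OnlineLevel.TCP_CONNECTED
```

The gap between `TCP_CONNECTED` and `LOGICALLY_ONLINE` is exactly where the second problem described below lives.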
We’ve been facing two kinds of problems. The first was that “the circles are spinning”. The spinning circles are the major front-end symptom our end users face (we show either a green tick to mark that data is recent, or a spinning circle if it isn’t); they stem from orders not reaching the device, or in fact not being processed at all! The other problem was that a device which had established a TCP connection to the server was not reported as online, and it evaded us as to why. Any SRE of SMOK is reluctant to dig into code written about seven years ago, in the era of Python 2.7.
We already had a detector of the spinning circles. It worked as a service that picked at random one online device out of every device node (a logical grouping of SMOK devices administered by the same set of services), but it still required an SRE to be on call and reset the faulty service manually.
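The detector’s probing logic can be sketched roughly as follows. The node and device interfaces here are hypothetical stand-ins for SMOK’s internals:

```python
import random

def find_stuck_nodes(get_device_nodes):
    """Probe one random online device per device node; return the nodes whose
    orders never round-trip (the 'spinning circles' symptom).

    get_device_nodes: callable returning the device nodes; each node exposes
    `online_devices`, and each device an `order_round_trips()` check.
    These names are illustrative, not SMOK's real API.
    """
    stuck = []
    for node in get_device_nodes():
        online = list(node.online_devices)
        if not online:
            continue  # nothing to probe on this node
        probe = random.choice(online)
        if not probe.order_round_trips():
            stuck.append(node)
    return stuck
```

In the original setup, a non-empty result meant paging an SRE; the change described next turns that result into an automatic restart instead.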
The solution proved to be quite simple. Instead of merely reporting the failures (which we still do), we came up with an MHC-based killer. The MHC is a service that gathers reports about the large-scale health of the SMOK service and presents them to the op team. We outfitted it with the ability to kill a service that behaved incorrectly. It was fitted with some safety measures, such as not killing a particular service more often than 5 times over the last 20 minutes, to prevent it from rampantly killing everything in its way.
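The safety measure amounts to a sliding-window rate limit on kills. A minimal sketch, assuming the numbers above (5 kills per 20 minutes); the function and variable names are ours, not the MHC’s:

```python
import time
from collections import defaultdict, deque

MAX_KILLS = 5          # at most this many kills per service...
WINDOW = 20 * 60.0     # ...within this many seconds (20 minutes)

_kill_log = defaultdict(deque)  # service name -> timestamps of recent kills

def may_kill(service, now=None):
    """Return True (and record the kill) if killing `service` now stays
    within the rate limit; False if the limit would be exceeded."""
    now = time.monotonic() if now is None else now
    log = _kill_log[service]
    # Drop kill records that have aged out of the window.
    while log and now - log[0] > WINDOW:
        log.popleft()
    if len(log) >= MAX_KILLS:
        return False
    log.append(now)
    return True
```

Gating every kill through a check like this is what keeps a flapping health check from restarting a service in a tight loop.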
We’ve also added a detector that kills the responsible service when it sees that a device has established a TCP connection but has not been declared online by the respective SMOK services.
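The core of such a detector is a set difference with a grace period, since protocol negotiation takes some time after the TCP connect. A sketch under assumed names (the grace period value is illustrative):

```python
GRACE_PERIOD = 120.0  # seconds to allow for protocol loading after TCP connect

def find_wedged_devices(tcp_connected, logically_online, connected_since, now):
    """Return devices that have held a TCP connection for longer than the
    grace period without being declared logically online.

    tcp_connected / logically_online: sets of device ids;
    connected_since: device id -> timestamp of the TCP connect.
    """
    return {
        dev
        for dev in tcp_connected - logically_online
        if now - connected_since[dev] > GRACE_PERIOD
    }
```

A non-empty result is the trigger for the MHC-based killer to restart the service that should have brought those devices online.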
Doing this allowed us to note a historically high level of online devices.
Of course the proper solution would be to fix the services themselves, but since they were written so long ago (about six years ago, when SMOK was a one-person project) and consist of a substantial amount of code, rewriting them is much more problematic than it seems at first.
Before that, our op team had to handle at least 5 restarts a day. Now, with the automated MHC-based service killer, that number has dropped to zero, and our SREs can focus on more interesting engineering tasks.
The best part is that it actually took us 30 minutes to type up the code, and then an hour to debug it.