SMOK – a case study

What happens if you jack the operator's price so high that you can't afford to have him on call? Hell yeah, let's jump into a case study of my favourite system, SMOK.

Actually, a big shout-out to the friend who inspired this blog post and who (in)famously told me: "well, with real cash flows you'll never need to overdesign your system to survive a single server failing, because an operator's on-call price will never be higher than what you can reasonably afford".

And then I shot his assertion down with a system of mine that did exactly that, with wonderful results.

Introduction

Rewind to 2009. SMOK is a system that I quickly prototyped after my father (who also owns the best and the most expensive service company in Poland) brought home a PLC that you could connect to via TCP and change its registers (or, in that case, control its screen). Note that my secondary education was only about 50% complete at the time, I still had 2 more years to go, and IT was insufficiently advanced back then to tackle the problem of NAT traversal directly. I thought: "well, shite. With the WAN providers serving the local schools that were the only possible points of installation, this is never going to work. Let's design a solution where the device connects to a central server instead of port-punching our way to the local PLC, and let the user connect to that server instead." And boom, 4 months later 12 local schools near Rzeszów were connected to our network. In the meantime a certain housing association from a city to be (never) revealed (later) came to us with a problem: they had published their critical heating infrastructure endpoints on the open WAN and wanted to aggregate data from them. A practice as unsafe as it sounds, but note that it was still 2009, a far cry from the wild-west security practices in use nowadays, so I wrote a simple Python script to do exactly that. The first version was Spartan to say the least, but it got the job done. The next year we gave it a makeover and added a few features, but it was still jQuery-powered.
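For the curious, here's a minimal sketch of that "dial out, don't port-punch" pattern in Python. The hostname, port, framing and `handle()` helper are all made up for illustration (the real thing spoke P24 over MODBUS); this only shows the shape of the idea, which is that an outbound TCP connection from behind NAT just works:

```python
import socket
import time

SERVER = 'gateway.example.com'   # hypothetical name; re-resolved on every connection attempt
PORT = 2404                      # hypothetical port

def handle(cmd: bytes) -> bytes:
    """Placeholder for the real work: poke the local PLC's registers and build a reply."""
    return b'OK\n'

def run_device() -> None:
    # The device behind NAT opens the TCP connection *outwards* to the central
    # server; the user then talks to the server, never to the PLC directly.
    while True:
        try:
            with socket.create_connection((SERVER, PORT), timeout=30) as conn:
                conn.sendall(b'HELLO device-001\n')   # identify yourself to the server
                while True:
                    cmd = conn.recv(1024)             # the server pushes commands down the pipe
                    if not cmd:
                        break                         # server hung up
                    conn.sendall(handle(cmd))
        except OSError:
            pass                                      # link died; fall through and retry
        time.sleep(10)

if __name__ == '__main__':
    run_device()
```

The reconnect loop also happens to be why devices dialling us by DNS name, rather than by IP, worked out so well later on.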

Back then, four assumptions about the system were made:

  1. The system had to be understandable by a second-year technikum (Polish technical secondary school) student; nothing too complex.
  2. The system had to fail in ways that could be waited out through the night and then fixed at 4 pm, when the only available sysop came back home.
  3. The operator's hourly price was infinitely high: there was only a single operator, and if he was done for, the system was toast. He still had to be able to study for school, for fuck's sake!
  4. Power consumption and network usage were at a premium.

Facing these considerations, the following approaches were taken:

  1. MySQL to store the data. And MySQL is a particularly nasty way of storing time series, leading to indices the size of the data itself. Apparently a certain PLC company in Poland still hasn't learned to store their time series right.
  2. Handling each incoming connection with two threads: one wrote, the other read (because I had no idea how to do it right with select, let alone epoll); see the sketch after this list. It's just like the Polish joke about the former state police, the "Milicja": why do Milicja officers always walk in pairs? Because one can read and the other can write.
  3. My just-awful frontend skills would have to do, since my design skills suck (although I can paint pretty if circumstances force me to).
  4. The system worked off an Intel Atom 330 board.
  5. The protocol was called P24; it was a thin layer on top of MODBUS and supported no encryption or authentication at all (because who cares).
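Here's the two-thread trick from point 2 in miniature: a Python sketch with made-up framing, one thread that only reads and one that only writes, glued together with a queue. Not the actual SMOK code, just the pattern:

```python
import queue
import socket
import threading

def reader(conn: socket.socket, outbox: queue.Queue) -> None:
    """The thread that can read: pull frames off the socket and queue up replies."""
    while True:
        data = conn.recv(1024)
        if not data:
            break
        # ... parse the frame here; for the sketch we just acknowledge it ...
        outbox.put(b'ACK\n')
    outbox.put(None)               # poison pill: tell the writer to go home

def writer(conn: socket.socket, outbox: queue.Queue) -> None:
    """The thread that can write: send whatever anybody queued up."""
    while True:
        frame = outbox.get()
        if frame is None:
            break
        conn.sendall(frame)
    conn.close()

def serve(host: str = '0.0.0.0', port: int = 2404) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    while True:
        conn, _ = srv.accept()     # two threads per connection, Milicja style
        outbox = queue.Queue()
        threading.Thread(target=reader, args=(conn, outbox), daemon=True).start()
        threading.Thread(target=writer, args=(conn, outbox), daemon=True).start()

if __name__ == '__main__':
    serve()
```

select/epoll or asyncio would do this in one thread nowadays, but two threads per connection is genuinely understandable by a second-year technikum student, which was the whole point.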

Fast forward to 2015

Fast forward to 2015: this is the first date in the smok4/core repository, and it marks where the real SMOK history starts. Back then I designed the system to work off a 3-replica Cassandra quorum, because you see, the system had only a single programmer (me), a single sysop (me) and a single architect (me as well). The architect was infinitely smart (because I said so), since there was no one to question whether I was doing it right. And since I did my research right, I modeled a schema that has carried us through all seven years of our Cassandra exploitation (if you ever need a bunch of Cassandra sysops and schema-design smart-asses, just dial us up at https://smok.co); a sketch of what such a schema can look like follows the list below. Our prices will certainly be lower than DataStax's, and can be paid in a depreciating foreign currency (PLN/USD stood at 0.22 at the time of writing). The considerations at that point were:

  1. No expensive hardware; we can't afford that. Say yes to ECC memory though: it's mandatory for running modern server systems. The senility of high-uptime non-ECC Linux machines always struck me really hard, with certain processes that were unkillable or file descriptors that never managed to close. I tend to attribute these problems to bits flipped by random cosmic rays. Even now I like to keep my servers regularly restarted, though that can also be attributed to keeping our kernels well-updated and secure; I still find Linux servers with 100+ days of uptime notably untrustworthy. Radiolab did an episode on that, have you watched it yet? SMOK was becoming a really profitable branch of my father's business, so we could get our hands on better and better things, particularly thanks to the efforts of my fellow co-founder and co-conspirator, who by accident happens to be the son of the co-shareholder in my father's enterprise. This guy can really get his hands on shiny 2-year-used hardware dirt cheap, most notably a batch of shiny 160-core 256 GB RAM machines he got us in March 2021, which (thanks to our current load average predictions, and thanks to my optimizations) will last us a lifetime.
  2. Say hi to new employees. Nothing too crazy such as an on-call engineer well versed enough to bring the system up from its knees; just general help on a contract, with minor tasks. The core engineering problems still ended up being faced by me, and me alone; I'm the only senior on this project so far. So far there are six employees on the payroll, including two engineers (one part-time and the other full-time, one "managing" the other), one part-time administrative worker and two marketing employees (one part-time and the other full-time). Please note that according to Polish law, board members are (arguably) not employees, so they have been excluded from this count.
  3. No cloud; we can't afford that. No, really: ever since we decided to run our own machines, we've had my previously mentioned co-founder working the networking miracles that power SMOK, and a sysop for the miraculous price of zero. We did once consider the cloud, and the price that came up was three times the amount we presently pay (mostly for power, though). We even once considered getting an AS, but once our CEO took a look at the price of that thing, he was like "hell no, I'm not going to pay for any of your shenanigans, go to hell". It's infinitely simpler to just operate two redundant WAN uplinks and switch the hell out of our DNS entries, which propagate in up to an hour. In the unlikely event that our network fails, I just hook up with my networking co-founder for him to update the routers while I update the DNS entries, and we're back up after an hour of downtime. All thanks to our efforts to keep our costs low. The conscious decision to have devices dial up to our servers using DNS names was made because our 2009-ish ISP had the nasty habit of resetting the allocated WAN IP address daily to prevent people from setting up servers on it (which we cleverly circumvented through (ab)use of No-IP).
  4. Data storage was to be done using Cassandra. SMOK had minor run-ins with Vanad (around 2011) and Anabel (around 2013) as its data storage facilities, but in 2015 the decision was made to switch to a 3-machine Cassandra cluster, with splendid results.
  5. We'll need to facilitate encryption and secure authentication. I quickly whipped up an RFC 4503 (Rabbit) stream cipher with a SHA-1-based authentication mechanism built on a pre-shared secret and nonces generated by both the client and the server (a sketch of what such a handshake can look like follows this list). By the way, how do you generate entropy when you're a programmable BASIC module? I won't tell you, because it's a trade secret. The reason I used Rabbit was that AES-128 encrypted at 256 bytes per second on Tibbo's DS1206 devices we used back then to deliver our Ethernet services. Right now they support AES-128 in hardware, which is faster, but at the time of writing the software they did not. I even wrote an AES-128 implementation in their Tibbo BASIC language and sent it to them, which, arguably, must have inspired them. Our GSM line of business was quickly dispatched to the nearest design bureau, which whipped up the AERIA GSM module, which is in use today and which has arguably broken just once (the other time, the whole boiler room being flooded with water helped). This is because we didn't constrain the design bureau into stupid cost-saving choices, as the modules weren't produced at volumes that would justify such optimizations. The TCP protocol is optimized to an absurd level; the whole thing, if only server-to-device traffic were concerned, could run off a 14.4k dial-up modem.
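Point 5 in miniature: one plausible shape of a pre-shared-secret, two-nonce, SHA-1 handshake, in Python. This is not the P24 wire format, and the secret and field sizes are invented; it only shows why neither side ever has to transmit the secret itself:

```python
import hashlib
import hmac
import os

PRE_SHARED_SECRET = b'per-device-secret'     # hypothetical; provisioned into the device beforehand

def make_nonce() -> bytes:
    """16 random bytes; the embedded side has to scrape its entropy together somehow."""
    return os.urandom(16)

def auth_token(secret: bytes, client_nonce: bytes, server_nonce: bytes) -> bytes:
    """Both ends derive the same token from the secret and both nonces."""
    return hashlib.sha1(secret + client_nonce + server_nonce).digest()

def server_accepts(client_nonce: bytes, server_nonce: bytes, token_from_client: bytes) -> bool:
    expected = auth_token(PRE_SHARED_SECRET, client_nonce, server_nonce)
    return hmac.compare_digest(expected, token_from_client)

# One round trip, with the network hand-waved away:
client_nonce, server_nonce = make_nonce(), make_nonce()
token = auth_token(PRE_SHARED_SECRET, client_nonce, server_nonce)   # computed on the device
assert server_accepts(client_nonce, server_nonce, token)
```

The fresh nonces contributed by both sides are what stops a recorded handshake from being replayed later.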

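And the schema sketch promised above, for point 4: below is the general shape of a time-bucketed time-series table on a 3-node cluster, read and written at QUORUM, using the DataStax Python driver. The hostnames, keyspace and column names are invented for the sketch; the actual SMOK schema stays our business:

```python
import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Three nodes; with replication factor 3 and QUORUM on both reads and writes,
# any single node can burn down over the weekend and nobody has to be called.
cluster = Cluster(['cass1.example.com', 'cass2.example.com', 'cass3.example.com'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS telemetry.samples (
        sensor_id text,
        day       date,        -- time bucket, so a partition never grows without bound
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

insert = SimpleStatement(
    "INSERT INTO telemetry.samples (sensor_id, day, ts, value) VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
now = datetime.datetime.now(datetime.timezone.utc)
session.execute(insert, ('boiler-17/flow-temp', now.date(), now, 71.5))
```

Unlike the MySQL era with indices the size of the data, a read here is a single partition scanned in time order, which is exactly what a chart of one sensor over one day needs.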
In May 2018 we founded SMOK LLC, a product company that took over that line of profit from our fathers' business, hooking them into a customer-provider relationship with us instead (they still make an awful lot of money off it, just not as much). This relationship arguably helped them ward off the drop in service turnover during COVID (and it helped a lot of our customers survive it), since all of them had a subscription-like arrangement with their customers in turn to service their boilers when they went bad (because who needs a system that tells you your boiler is down if there's no one to help you). We also took on the aforementioned CEO, who, being the owner of our biggest customer, couldn't afford not to have a piece of that cake as well. And still, the only engineer to be had on call to solve most of the mysteries is me. I once had an employee give me a lift to our office to sort out the mystery of our Cassandra cluster not being able to bootstrap itself out of a mysterious failure that took all three of our servers down. Our power grid is as secure as it can be, thanks to over 10 kW of solar power installed and our power usage standing at 1 kW the last time I read it (and that was before my smart gRPC optimization that took our load average from 20 down to 3). I once had a situation where I was coming back from Kraków (where I was studying for a semester) and the whole city district was dark and powered off. I quickly whipped out my smartphone to check whether our service was still up, and yes, it was. I have never been more proud of all the work that all of us have put in to get this far. It allows us to look with confidence towards the future.

It all comes down to a talk I gave one of my employees today: "If a server fails, and Grafana fails to report it to us, we'll have to wait it out until the next day without the customer ever noticing (bar that, without us even noticing), and when we find the smouldering ruins of a single server, what do we do then? Disaster recovery." This has happened more than once. The customer won't even notice a single server of ours failing. Hell, sometimes a server was down for a week and nobody noticed; this happened more often before we set up a Grafana-based monitoring solution that pages us when a fault occurs. That has certainly saved us, more than once, from dragging the only applicable operator into an on-call when he might be out drinking with friends. Trust me, I once tried to bring a system up while totally wasted. My efforts did not help at all; I had to sober up first.
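Our actual paging lives in Grafana, but the idea fits in a few lines. A purely illustrative dead-man's-switch sketch in Python, with a hypothetical webhook URL and in-memory state, to show what "pages us when a fault occurs" boils down to:

```python
import time
import urllib.request

HEARTBEATS = {}                                   # server name -> last time it checked in (epoch seconds)
STALE_AFTER = 300                                 # seconds of silence before we page someone
PAGER_WEBHOOK = 'https://pager.example.com/hook'  # hypothetical paging endpoint

def record_heartbeat(server: str) -> None:
    """Called whenever a server reports in with its metrics."""
    HEARTBEATS[server] = time.time()

def check_and_page() -> None:
    """Run periodically: a server that went quiet gets reported before a customer notices."""
    now = time.time()
    for server, seen in HEARTBEATS.items():
        if now - seen > STALE_AFTER:
            msg = f'{server} has not reported for {int(now - seen)} s'.encode()
            urllib.request.urlopen(PAGER_WEBHOOK, data=msg, timeout=10)
```

The trick is that the absence of data is itself the alert; a dead server cannot be relied upon to report its own death.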

The ability to have your servers (any single one of them) survive a catastrophic failure is vital to modern corporate resiliency, at least the way I see it. The real defense is defense in depth, and if you build your corporate software the way I built it, you're safe, at least for the times your own servers fail. This, in turn, allows your customers to look up to the awesome uptime of your servers, which we assure with roughly 1/8 FTE of on-call engineers, and which we achieve with awesome cost results. Particularly thanks to the numerous wheel-of-misfortune games we've been playing, a next generation of SREs capable of scaling this system beyond our previous constraints is being produced. Today marks one year since we last had an issue with a major impact on our customers. This really allows us to look far ahead and plan our next major moves, with our bases covered well enough.

Published by Piotr Maślanka

Programmer, paramedic, entrepreneur, biotechnologist, expert witness. Your favourite renaissance man.
