A lot of hot tech media likes to talk about "scale" as if it's a new problem. Like most things in tech, it is not new; we don't need young, hot things to achieve it, and we probably shouldn't use them.
We believe there's a place for tech like NoSQL databases, asynchronous programming, deployment containers, etc. We believe we all know what that place is after we have enough experience to understand the tradeoffs involved in such technologies. This article is not going to illuminate or diminish those places or technologies. This is about how we can all scale without new things, and without a lot of extra expense in hardware or administration.
A word of caution: this is not likely to be approachable for those of us new to this kind of infrastructure architecture. That is ok; we were there too, and we learned over time. These are practices that require us to know what we are interfacing with. Like anything, there are tradeoffs there. We will discuss some of them.
Let's dive in. A lot of my own frustration early on in development was in one arena that seemed very much to be putting up walls between me and my data: the dreaded SQL. It was finicky and exacting, and it was really hard both to develop with and to understand the results of. Joins are a particular mess for most of us, sometimes even after we've been doing SQL work for a long time.
Ultimately this pain lessened, but it is important to note that it still exists. What I have learned is that when I need something stable and predictable every time it runs, SQL has huge advantages. As tooling has improved (and as I have learned of tools like plan evaluation in Postgres), that predictability has become critical to scaling in ways that matter to us at NSC.
Specifically, we gain the ability to make our queries incredibly efficient on fairly large datasets (not "big data" size; it's the wrong tech for that), and the ability to know exactly when an aberration occurs. This means development effort is slightly longer up front, but the result takes less effort to maintain, and all the monitoring (and most of the response to problems) can be automated. Queries take a predictable range of time. If a query is taking too long, we can fire off a monitoring alert. A response script can then analyze the common causes (which in practice usually represent hardware failure or similar critical events); there is a lot it can handle on its own, sending a followup alert describing what it did.
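To make that concrete, here is a minimal sketch of the "predictable range of time" idea in Python, assuming a standard DB-API cursor and a hypothetical one-second threshold. The threshold, function name, and alert hook are all assumptions; the alert callback stands in for whatever paging or monitoring system is actually in use.

```python
import time

# Hypothetical threshold: queries in a system like this normally finish
# well under a second, so anything slower is treated as an aberration.
SLOW_QUERY_SECONDS = 1.0

def timed_query(cursor, sql, params=(), alert=print):
    """Run a query and fire an alert callback if it exceeds the
    expected time range. `cursor` is any DB-API cursor (psycopg2,
    sqlite3, etc.)."""
    start = time.monotonic()
    cursor.execute(sql, params)
    rows = cursor.fetchall()
    elapsed = time.monotonic() - start
    if elapsed > SLOW_QUERY_SECONDS:
        # In production this would page someone or trigger the
        # automated response script described above.
        alert(f"slow query ({elapsed:.2f}s): {sql}")
    return rows
```

The point is not the wrapper itself but that a stable baseline makes the alert condition trivial: there is no heuristic tuning, just "this query left its known range."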
The same can be said of many of our other data stores. File I/O (and network I/O) is one of those places where programming paradigms typically break down. When we decided, for Phone Janitor, to store our voicemails as files on a filesystem rather than in the database, it raised some eyebrows. But filesystem failure is a known problem with 50 years of mitigation work from the community, and we can build on that. It greatly simplifies how we handle operations, and most of our mitigations can run without human intervention (unless a drive dies, of course). That means we don't have to break our programming patterns as much, because the failure mechanisms are isolated and we can rely on a lot of history.
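One example of leaning on that history is the classic write-temp-then-rename pattern for durable file writes. This is a minimal sketch under POSIX rename semantics, not Phone Janitor's actual code; the path layout and function name are hypothetical.

```python
import os
import tempfile

def store_voicemail(root, call_id, audio_bytes):
    """Write a voicemail durably: write to a temp file in the same
    directory, fsync it, then atomically rename it into place. A crash
    mid-write leaves either no file or a complete one, never a torn
    one. `call_id` is a hypothetical unique identifier."""
    os.makedirs(root, exist_ok=True)
    final_path = os.path.join(root, f"{call_id}.wav")
    fd, tmp_path = tempfile.mkstemp(dir=root)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
            f.flush()
            os.fsync(f.fileno())       # force data to disk
        os.replace(tmp_path, final_path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)
        raise
    return final_path
```

Decades of databases and mail servers have used exactly this pattern, which is the point: the failure modes are well documented, so the mitigation is boring.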
Bifurcation of responsibilities seems to help in other areas too. The voicemail storage is its own system. The call handling is done on its own system. The web API is on its own system. All of these systems can exist on their own server, or together, and they just get pointed at the right place to communicate with each other. Plus, our choice to use plain HTTP between them means we can load balance for free and handle entire node failures automatically.
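The "point it at the right place and survive node failures" idea can be sketched roughly as follows. The node names and the injected `send` callable are assumptions; in production `send` would be a real HTTP client call (urllib, requests, etc.), and the node list would come from configuration.

```python
import random

def call_service(nodes, request, send):
    """Try each node in random order (a crude load balance) until one
    answers; raise the last error if all fail. `send(node, request)`
    performs the actual HTTP call and raises OSError on network
    failure."""
    if not nodes:
        raise ValueError("no nodes configured")
    last_error = None
    for node in random.sample(nodes, len(nodes)):
        try:
            return send(node, request)
        except OSError as exc:  # connection refused, timeout, etc.
            last_error = exc    # fall through to the next node
    raise last_error
```

Because every service speaks the same dumb protocol, this one retry loop covers all of them; a dead node just means the next one in the shuffled list picks up the request.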
Some would call this microservices or something; the marketers can debate that all they want. This is not a new concept, and a long time ago it did not require convincing management of its efficiency. When we all used timeshares on a mainframe from our terminals, this was just a normal requirement: this command has to run in an isolated way and feed into this other subsystem. Piping data around was just how we did things for a long time. That means we know not only how it can be done, but how it can go wrong, which means we can plan for it appropriately. It just takes more reading, because I never used a mainframe, so I have to learn from those who came before through their stories.
I might be making this sound simple and straightforward, but it is not; it is simply my responsibility as a dev. In tech we often argue with sales. Sales wants everything to be as easy to sell as possible, and that means features and buzzword compatibility all over. That just translates to work for devs. Devs want things to be as simple as possible to write. That means fewer features and edge cases. Reality does not care about any of our jobs, so it should be assumed that the work will need to be done by someone. It might as well be done by everyone together.
My personal philosophy is to do a little more work up front to save a lot of work later. This stands in stark contrast to a lot of tech business philosophy, like "move fast and break things." In my career, I've worked on projects where speed is absolutely of the essence, and I have agreed with that real business need. Like most things, this means "moving fast and breaking things" is true sometimes, but not others. We currently deal with people's phone calls, and they want them to perform and be available. That means we can neither break things nor move fast.
Our solution is to plan ahead and do the work up front. Our first UI for Phone Janitor was not pretty. A lot of people gave us grief when its availability leaked and got people talking. But people signed up, and they used that interface happily for two months, because it did what it needed to. Would that work for a different service or company? Maybe, maybe not, but it turns out reliability mattered more than shiny things (those have come since).
I say this so we can understand, together, that business needs sometimes differ. That said, we have other problems to think about besides our own. Twitter is a well-known example: they wrote it quickly once and grew, but then had to rewrite it all to address growing concerns about stability (of which scale was only a part).
It seems we will all have to do the work at some point, and it will likely still be us in the end. The best argument I've seen for not doing that work now is the "that's a problem for the next guy" axiom. I say best because this is actually true. It's accurate, and it works for a lot of people's careers. The next guy gets some good PR by pushing the last guy under the bus (also accurately) and saves the day, too. It just is not a positive cycle, and it doesn't scale socially.
We meandered into the business realm for a reason: if we aren't on the same team as the rest of the company, we aren't working for the company, and we don't believe in our product enough. Even if it's just a personal project, we should admit to ourselves whether this is throwaway tinkering or something we really want to do right.
The old ways teach us that doing it right isn't really about the tech we choose. The old ways say the why of that choice is the real deciding factor. We used SQL, Python, and Java to build Phone Janitor. That absolutely does not matter, because we can be just as ignorant of the needs of the project while using any other tech. Scalability the tried and true way is about planning ahead and knowing not just what your choice of tech brings to the table, but what it costs.
For us, the costs are not being pretty, taking more time to develop up front, and requiring a much more deliberate deployment plan. Putting voicemails on a filesystem means we have to reinvent indexing for that. Have we thought about inodes or filesystem degradation? Well, yes, because of experience (I've only been at this for 20 years, but I've made a lot of those mistakes before). Would we have saved ourselves some of these headaches if we had chosen differently? Yes. But this is the set of headaches we wanted to deal with, and trading them for different headaches just wouldn't scale for us. How do others scale? I hope they do it differently, but ultimately it's less about how they do it and more about why they chose it. Laziness? That's bad. They knew what they were getting into with that choice? That's a lot better.
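The inode and directory-degradation concern is usually handled by sharding: hashing each ID into a couple of subdirectory levels so no single directory grows unboundedly (very large directories degrade lookups on many filesystems). This is a sketch of the general technique, not Phone Janitor's actual layout; the helper name and two-level depth are assumptions.

```python
import hashlib
import os

def shard_path(root, voicemail_id, depth=2):
    """Map an ID to a sharded path like root/ab/cd/<id>.wav, using the
    first hex bytes of a hash as subdirectory names. With two levels of
    two hex characters, files spread across 65,536 directories."""
    digest = hashlib.sha256(voicemail_id.encode()).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(depth)]
    return os.path.join(root, *parts, f"{voicemail_id}.wav")
```

This is the sort of "reinvented indexing" that filesystem storage demands: cheap to write, but it only works because the failure modes of big directories are a known, well-documented headache.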
Choose to know how you're going to fail, and plan for it. Scale around that.