Infrastructure upgrade. A common task, especially for a SaaS company like ours, right?
Well, our latest one wasn’t the typical software upgrade as we added a little twist to it.
The main goal was to upgrade our software to the latest version, but this time we added two more steps to the project: reducing the number of IPs our customers have to allowlist and achieving database redundancy.
And while the majority of the update went smoothly, we also experienced some service interruptions along the way.
In the following lines, we will share behind-the-scenes information about why we performed the upgrade, how it went, and what our biggest learnings are moving forward.
Let’s get right into it!
We understand the responsibility we have and the role NitroPack plays in the success of our clients’ businesses. That's why ensuring optimal performance, the highest security standards, and 24/7 service stability are crucial.
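Upgrading our infrastructure regularly is one of the many ways to ensure all of that. And as mentioned earlier, there were three parts to this particular upgrade: updating our server software, reducing the number of IPs to allowlist, and achieving database redundancy.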
As a cloud-based solution, all optimizations that NitroPack does for our 100,000+ client sites are performed on our infrastructure. Currently, we use more than 100 servers to run the service. With this update, we had to upgrade the software that orchestrates our fleet of servers to the latest stable version.
Put simply, our IP allowlisting process wasn’t user-friendly.
Before the update, our clients had to allowlist more than 40 IP addresses that served NitroPack’s outgoing traffic (requests) to clients’ sites. On top of that, these IPs weren’t fixed: whenever one of our servers was retired, a new one came up with a different IP address.
This process of generating new IPs required our customers to regularly allowlist dozens of new addresses in order for NitroPack to successfully optimize their sites.
After the update, our customers need to allowlist only three fixed, never-changing IPs.
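For clients who enforce their allowlist in application code rather than in a firewall, the check now reduces to a membership test against a short, fixed set. The sketch below is only an illustration under assumptions: the three addresses are placeholders from the reserved documentation range, not NitroPack's real IPs, and `is_allowed` is a hypothetical helper name.

```python
import ipaddress

# Placeholder addresses from the reserved documentation range (RFC 5737).
# Replace them with the three fixed IPs that NitroPack publishes.
ALLOWLIST = frozenset(
    ipaddress.ip_address(ip)
    for ip in ("192.0.2.10", "192.0.2.11", "192.0.2.12")
)


def is_allowed(remote_addr: str) -> bool:
    """Return True if remote_addr is one of the allowlisted IPs."""
    try:
        return ipaddress.ip_address(remote_addr) in ALLOWLIST
    except ValueError:
        # Malformed addresses are never allowed.
        return False
```

Because the set is fixed, it can live in configuration and no longer needs updating each time a server is replaced.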
For a long time, we wanted to achieve database redundancy as it would improve the overall performance of NitroPack’s website and dashboard, and we would be able to enhance service security. Furthermore, this update would allow us to perform future database upgrades with zero downtime.
With these goals in mind, we broke down the process into two steps, leaving a day between them as breathing room:
Step 1 (November 4th): Reducing the number of IPs
Step 2 (November 6th): Database redundancy & server software update
But not everything went according to our expectations.
Despite our thorough preparation, we faced some unexpected issues in all three updates. Here’s what happened:
On November 4th, we were scheduled to perform a preliminary service update, aiming to reduce the number of IP addresses for NitroPack's outgoing traffic.
The initial release had an internal software bug that prevented our service from creating outgoing connections in some instances. Unfortunately, this problem only appeared when our infrastructure was under peak traffic. That’s why we did not detect the issue in our staging environment during the initial tests. The connectivity issue meant that NitroPack could not reliably send its outgoing optimization requests, and some of our clients experienced service instability for a couple of hours.
The good news is that our dev team managed to mitigate the issue promptly by updating the software of our HTTP client.
When we started working on the database redundancy on November 6th, we quickly found out that it would take much longer than planned. This forced us to push the server software upgrade back by a day.
The delay was deliberate, though. We wanted to be extremely cautious about backing up the database first so that we could set up the redundancy without any issues.
When we started deploying the server software upgrade on November 7th, a small number (fewer than 1%) of the servers started throwing unexpected errors that ultimately prevented the update from being deployed and slowed down the entire process. There was no way to fix the issues ourselves, and we were forced to escalate them to our Server Provider.
Although it seems like a tiny hiccup considering the magnitude of the task, this unexpected error caused a small portion (fewer than 2%) of NitroPack’s clients to experience brief, intermittent service downtime.
Just when we thought we were done with the upgrade, our monitoring system began registering frequent occurrences of CDN errors with HTTP status code 502.
Unfortunately, the error affected all of our clients, causing instability of CDN resource delivery for a couple of days. After examining the problem, we issued a software update that fixed the service instability permanently.
Once everything with NitroPack was working properly, we held a retrospective meeting to reflect on what we could improve in the future based on our learnings from this infrastructure upgrade.
Overall, we’re proud of how the whole upgrade went. We experienced occasional service instability, but it was in a controlled manner, and we successfully avoided downtime of the entire service for the entire client base.
However, we know that patting ourselves on the back and focusing entirely on what went right won’t move us forward as a company and service.
That’s why we want to communicate the improvements we will implement in order to perform future upgrades in an even more efficient way. Here are the main points:
Testing our staging environment under production-like load will allow us to detect errors that only happen in high-volume situations with tens of thousands of requests. For instance, that would have helped us identify the issues with the HTTP client and reusable connections in advance, and we would have been able to execute the IP update without interruptions.
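To make the idea of a high-volume test concrete, here is a minimal sketch of a concurrent load test in Python. It is not the tooling we use: `run_load_test` and `flaky_send` are hypothetical names, and `flaky_send` is a stand-in that simulates failures instead of making real HTTP requests. In a real test, `send` would issue a request through the HTTP client under test.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_load_test(send: Callable[[int], None], total_requests: int,
                  concurrency: int) -> dict:
    """Call send(i) from `concurrency` worker threads and count failures."""
    failures = 0
    lock = threading.Lock()

    def attempt(i: int) -> None:
        nonlocal failures
        try:
            send(i)
        except Exception:
            # Any exception counts as a failed request.
            with lock:
                failures += 1

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() forces every task to finish before the counter is read.
        list(pool.map(attempt, range(total_requests)))

    rate = failures / total_requests if total_requests else 0.0
    return {"total": total_requests, "failures": failures, "failure_rate": rate}


def flaky_send(i: int) -> None:
    """Stand-in for a real request: every tenth call fails (assumption)."""
    if i % 10 == 0:
        raise ConnectionError("simulated outgoing connection failure")


if __name__ == "__main__":
    print(run_load_test(flaky_send, total_requests=10_000, concurrency=200))
```

Running the same test at low and high concurrency and comparing the failure rates is one way to surface bugs that only appear under peak traffic.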
Based on the issues that occurred during this upgrade, we were able to identify areas where we lacked both monitoring and alerting. The great news is that we have already configured both for these areas.
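As an illustration of the kind of alert that could catch a spike like the CDN 502 errors early, the sketch below tracks the share of 502 responses over a sliding window of recent requests. It is a simplified example rather than our actual monitoring setup: the class name `ErrorRateMonitor`, the window size, and the threshold are all assumptions chosen for demonstration.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of 502 responses over the most recent `window` requests."""

    def __init__(self, window: int = 1000, threshold: float = 0.05):
        self.threshold = threshold
        # Holds True for each 502 response; old entries drop off automatically.
        self._recent = deque(maxlen=window)

    def record(self, status_code: int) -> bool:
        """Record one response; return True if the 502 rate exceeds the threshold."""
        self._recent.append(status_code == 502)
        return self.error_rate() > self.threshold

    def error_rate(self) -> float:
        """Return the fraction of 502 responses in the current window."""
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)
```

A caller would feed each CDN response status into `record` and trigger a notification whenever it returns `True`.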
Because the various teams involved with the upgrade were not fully in sync with each other, we feel we didn’t communicate the progress of the upgrade to our existing customers proactively or quickly enough. We had our status.nitropack.io page so everyone could follow what was happening with the upgrade, but that wasn’t enough.
Next time, we will communicate the progress of the upgrade across as many channels as possible, including social media, email, dashboard, and website!
For future updates, we will strive for better coordination with our Service Provider to ensure timely solutions to unexpected issues. This will help us handle unforeseen server errors more efficiently.
Finally, we would like to thank all of our clients for their patience and understanding. Your trust is our driving force to improve our service constantly.
Niko is one of the NitroPack storytellers. He is passionate about writing quality content and turning complex optimization concepts into engaging articles.