Recent TADAAM TV outages & how our start-up responded
Nothing is more frustrating than technology that fails. As a customer, you may have noticed that from Saturday 15th until Tuesday 18th of August our TV service was interrupted during evening hours. This is not the level of service we expect to deliver, and for this we'd like to apologize to you. If you’re wondering what exactly went wrong, we would like to clarify the situation in this blog.
Here’s what happened
On Saturday 15th August, we found out that our TADAAM customers weren't able to watch TV anymore due to a technical error in one of our systems. It's not unusual to encounter technical challenges, but this one was difficult to trace. It therefore took longer to solve than anticipated, and we apologize for that. From the beginning, we were busy checking for internal problems. However, we soon understood that the issue had to be caused by external factors. The problem was found in the traffic management of TV streams from the TADAAM app and TADAAM TV box.
If you only read one thing
We apologize that the issue took a few days to fix. Since it was a problem we don’t typically see, there was no precedent for solving it, hence the delay and lack of more information. It’s very unfortunate, but sometimes systems fail and it’s out of our control. We hope you know we put everything on the table to find the source and ultimately fix the issue.
Some context on finding the solution
Here’s the long version of the story. Please note, this was a technical issue, so explaining it can get pretty complex. Therefore, we’ll try and explain terms along the way.
Day 1: Saturday - finding the problem
On Saturday August 15, we discovered that our TV service wasn’t working as it should. We immediately started investigating the issue, and discovered that our front services were under heavy load. Following this, we invested in resource upgrades as quickly as possible. Next, we shut down our internal services to free those resources for our customers. At first glance, it looked like an attack on our services. We had to briefly take down our systems in order to reload them, which solved the issue around 22h.
Day 2: Sunday - deeper into the issue
The day started with us contacting our infrastructure provider to inform them that we had been under unexpected heavy load, which could cause some downtime. Following this, our developers spent the whole day refactoring our architecture and code bases, looking for a service that was leaking connections. Unfortunately, our technical systems encountered the same issue as before, around 19h. Once more, we were experiencing an outage caused by the same unidentified external problem.
Our engineers immediately took action by shutting down several secondary services, in order to keep the TADAAM TV service up and running for our customers. A number of veteran TADAAMers spontaneously came out of retirement to help tackle the issue. Again, we checked for internal issues, but didn’t find any. The source of the problem could only have been with the IT infrastructure, since we only noticed a glitch when the load increased. This is also why, unfortunately, it happened several times. We couldn’t easily detect the source, because all our tests passed. We managed to restore services more than once, but in the evening, our TADAAM TV service encountered the same issue.
Day 3: Monday - identifying the external source
Our engineers didn’t stop. They investigated our systems aggressively, rearchitected our base functions, and tested and deployed a number of functionalities for analyzing our systems. They created hourly reports for each connection and monitored them one by one.
Then, around 22h, we found it. DigitalOcean reached out to us, informing us they had discovered an issue in their infrastructure. Just for the avoidance of doubt: this error was not caused by an attack or breach of any kind. The issue was caused by the performance of the Load Balancer*.
* A Load Balancer is a mechanism that efficiently distributes network traffic for high-traffic websites. It acts as the “traffic cop” sitting in front of your servers and routing client requests across all servers. If the capacity of the Load Balancer is limited, the traffic can’t be distributed across the online servers, resulting in an overworked system and therefore significantly degraded performance.
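For technically minded readers, the idea can be sketched in a few lines of Python. This is a simplified illustration only, not DigitalOcean's or TADAAM's actual software: the class name, the server names and the cap of 4 connections are all made up for the example. It shows how a limit on the Load Balancer itself turns new viewers away, even when the servers behind it still have room.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin load balancer with a hard cap on open connections.

    Hypothetical example: names and limits are illustrative assumptions.
    """

    def __init__(self, servers, max_connections):
        self._servers = cycle(servers)  # endless round-robin over the servers
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self):
        """Return the server chosen for a new connection, or None if the cap is reached."""
        if self.open_connections >= self.max_connections:
            return None  # refused, regardless of spare capacity on the servers
        self.open_connections += 1
        return next(self._servers)

    def disconnect(self):
        """Release one open connection."""
        if self.open_connections > 0:
            self.open_connections -= 1


lb = LoadBalancer(["server-a", "server-b", "server-c"], max_connections=4)
results = [lb.connect() for _ in range(6)]
print(results)
```

The first four requests are spread over the three servers in turn; the last two are refused because the cap has been reached, which is roughly what a viewer experiences as a stream that will not start.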
After the periodic renewal of the security certificate*, a bug was found that set a connection limit for our Load Balancer. This didn’t only impact TADAAM, but also many other services worldwide.
* What kind of certificate is this? It is a bit of code on your web server that provides security for online communications. It allows secure connections for instance for credit card transactions, data transfer and logins.
Day 4: Tuesday - a complex problem, a complex solution
A bug in a software update of DigitalOcean’s systems had impacted several of their load balancers and limited them to only 2000 connections. Such things shouldn’t happen, but in a complex IT environment we understand that it is inevitable that an error occurs once in a while. Once the problem was identified, DigitalOcean was very responsive in working out a solution together with our engineers. We solved it by creating new load balancers and started upgrading the infrastructure. We also migrated all our servers to different locations and put in place new architectural implementations to avoid the issue. In other words: the moment we could all finally catch our breath.
How we will prevent this from happening in the future
But we know it doesn’t end there. We have since been working hard to prevent this from happening in the future, in order to provide the best service to our customers. Therefore, we have added different monitoring tools and have put in place a more efficient, direct way to communicate with our infrastructure partner. Now, we are aware of any kind of load coming onto our servers and can take immediate action to avoid any type of outage in the future.
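To give an idea of what such monitoring can look like, here is a minimal sketch in Python. It is an assumption-laden illustration rather than a description of the tools actually in place: the function name and the 80% warning threshold are hypothetical, and the limit of 2000 is simply the figure from the incident described above.

```python
def connection_alert(open_connections, limit, warn_ratio=0.8):
    """Classify the current connection count against a known limit.

    Hypothetical helper: the name and the default warn_ratio are illustrative.
    Returns "ok", "warning" or "critical".
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = open_connections / limit
    if usage >= 1.0:
        return "critical"  # limit reached: new connections would be refused
    if usage >= warn_ratio:
        return "warning"   # close to the limit: time to act before viewers notice
    return "ok"


# Example readings against the 2000-connection limit seen during the outage.
for reading in (1200, 1700, 2000):
    print(reading, connection_alert(reading, limit=2000))
```

The point of a check like this is the "warning" state: it flags rising load while there is still time to add capacity or contact the infrastructure partner.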
Here’s what we want you to know
Clearly, we are disappointed that we weren’t able to help you faster. Unfortunately, TADAAM is a growing start-up that doesn’t always have control over what happens externally. But we bring our product to you with pride and dedication, and we are extremely grateful for your trust in us. As you can read in our Story, our engineering teams have been literally working day and night to fix this as quickly as possible.
Even though we took this outage as an opportunity to learn, any interruption to your service is unacceptable, and we wholeheartedly apologize for that. We hope that you found this debrief informative, and we will continue to be of service to you. We have since been working day and night to prevent such situations in the future, so that you can keep enjoying TADAAM TV without any worries.